Data Mining Survivor: Imputation

DATA MINING
Desktop Survival Guide
by Graham Williams

Nearest Neighbours

A more reasonable, and more sophisticated, approach is to use the average value of the nearest neighbours, where the neighbours are determined by looking at the other variables (not yet implemented in Rattle).

That is, to fill in a missing value we look at the observations that are closest to the observation with the missing value, and use the values these nearby neighbours have for the missing variable. Refer to Data Mining With R, page 48 and following, for example R code to do this.
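The idea described above can be illustrated with a minimal sketch. It is written in Python rather than R, and it is not the code from Data Mining With R nor anything provided by Rattle. The function name knn_impute and its parameters are hypothetical, and the sketch assumes that the other variables are numeric, complete, and on comparable scales, and that at least one observation has a known value for the variable being filled in.

```python
import math

def knn_impute(rows, target, k=3):
    """Fill missing (None) values in column `target` with the mean of that
    column over the k nearest observations. Distance is computed on the
    other columns only. Hypothetical helper, for illustration."""
    others = [j for j in range(len(rows[0])) if j != target]
    # Observations with a known target value are the candidate neighbours.
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        new = list(r)
        if new[target] is None:
            point = [r[j] for j in others]
            # Euclidean distance on the other variables (assumed complete).
            nearest = sorted(
                donors,
                key=lambda d: math.dist(point, [d[j] for j in others]),
            )[:k]
            new[target] = sum(d[target] for d in nearest) / len(nearest)
        filled.append(new)
    return filled

# The last observation is missing its third variable.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 50.0],
    [5.2, 6.1, 54.0],
    [1.05, 2.05, None],
]
result = knn_impute(rows, target=2, k=2)
print(result[4])  # the two closest observations are the first two rows
```

In practice the variables would usually be standardised first, so that no single variable dominates the distance, and the choice of k is a tuning decision.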

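Nearest neighbour models tend to be at the opposite end of the scale of bias and variance to linear regression. Nearest neighbour models have a low bias but a high variance, particularly when only a few neighbours are used, whereas linear regression tends to have a high bias but a low variance.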

Brought to you by Togaware. This page generated: Sunday, 22 August 2010