DATA MINING
Desktop Survival Guide
by Graham Williams

Missing Values

Missing data can affect modelling, particularly if the data is not missing at random but is missing for some underlying systematic reason (e.g., censoring). If data is missing at random (often abbreviated as MAR), the missing values are more likely to have little effect on the modelling.
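The difference between random and systematic missingness can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data is synthetic, the variable names (x.random, x.censored) are made up for the illustration, the random case removes values independently of everything else, and the censoring threshold of 60 is arbitrary.

```r
# Hypothetical illustration: values removed at random versus censoring.
set.seed(42)                       # fixed seed so the sketch is repeatable
x <- rnorm(10000, mean = 50, sd = 10)

# Case 1: roughly 30% of the values are removed, independently of x.
x.random <- x
x.random[runif(length(x)) < 0.3] <- NA

# Case 2: censoring, where every value above 60 is recorded as missing.
x.censored <- x
x.censored[x > 60] <- NA

# Compare the mean of the observed values with the mean of the full data.
true.mean     <- mean(x)
random.mean   <- mean(x.random, na.rm = TRUE)
censored.mean <- mean(x.censored, na.rm = TRUE)
print(c(true = true.mean, random = random.mean, censored = censored.mean))
```

Under these assumptions the mean of the randomly thinned data should stay close to the full-data mean, while the censored data should give a noticeably lower mean, since only the large values were lost.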

An excellent reference on dealing with missing data is ().

Missing values are recorded in R using the special value NA. Various functions can be used to check for a missing value (is.na), to remove any entities with missing values (na.omit), and to identify those entities that are complete (complete.cases). The apply function also comes in handy here.

> ds <- ds[!apply(is.na(ds),1,all),] # Remove all rows of all NA's.
> ds <- na.omit(ds)                  # Remove all rows that have any NA's.
> ds <- ds[complete.cases(ds),]      # Remove all rows that have any NA's.
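To see what each of these does, it can help to try them on a tiny data frame. The following is a minimal sketch: the data frame and its column names (age, income) are hypothetical, and each result is kept in its own variable so that the three approaches can be compared side by side.

```r
# Hypothetical toy data frame; the column names are illustrative only.
ds <- data.frame(age    = c(25, NA, 40, NA, 31),
                 income = c(50000, 42000, NA, NA, 38000))

is.na(ds)                   # logical matrix marking each missing cell
na.per.col <- colSums(is.na(ds))   # number of missing values per column

# Drop only the rows in which every value is missing (row 4 here).
not.all.na <- ds[!apply(is.na(ds), 1, all), ]

# Drop rows with any missing value; both approaches keep rows 1 and 5.
via.na.omit  <- na.omit(ds)
via.complete <- ds[complete.cases(ds), ]
```

Note that na.omit also attaches an "na.action" attribute recording which rows were dropped, so via.na.omit and via.complete hold the same rows but are not identical objects.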

In some very simple (i.e., not rigorous) timing experiments, the second of the two approaches that remove rows with any NAs, the one using complete.cases, is faster.
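A comparison of this kind might be sketched with system.time as follows. This is not the original experiment: the data is synthetic, the sizes (200,000 rows, 5 columns, 50,000 missing cells) are arbitrary assumptions, and the timings will vary with the machine, the R version, and the shape of the data.

```r
# Hypothetical benchmark data: 200,000 rows, 5 numeric columns, 5% NA cells.
set.seed(1)
n <- 200000
m <- matrix(rnorm(n * 5), ncol = 5)
n.missing <- 50000                          # 5% of the 1,000,000 cells
m[sample(length(m), n.missing)] <- NA
big <- as.data.frame(m)

# Time the two ways of removing rows that contain any NA.
t.omit     <- system.time(r1 <- na.omit(big))
t.complete <- system.time(r2 <- big[complete.cases(big), ])

# Show user, system, and elapsed times for each approach.
print(rbind(na.omit = t.omit, complete.cases = t.complete)[, 1:3])
```

Both results contain the same rows, so any difference in the timings reflects only the work done by each approach. A single run like this is no more rigorous than the experiments mentioned above; repeating it several times gives a steadier picture.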

Brought to you by Togaware. This page generated: Sunday, 22 August 2010