DATA MINING
Desktop Survival Guide by Graham Williams

Remove Non-Numeric Columns
When only the numeric data is of interest, we can remove all non-numeric columns from a dataset. The survey dataset illustrates this. First load the dataset and look at the column names and their types, using the lapply function to apply the class function to each column of the data frame.
> load("survey.RData")
> colnames(survey)
[1] "Age" "Workclass" "fnlwgt" "Education"
[5] "Education.Num" "Marital.Status" "Occupation" "Relationship"
[9] "Race" "Sex" "Capital.Gain" "Capital.Loss"
[13] "Hours.Per.Week" "Native.Country" "Salary.Group"
> lapply(survey, class)
$Age
[1] "integer"
$Workclass
[1] "factor"
$fnlwgt
[1] "integer"
$Education
[1] "factor"
$Education.Num
[1] "integer"
$Marital.Status
[1] "factor"
$Occupation
[1] "factor"
$Relationship
[1] "factor"
$Race
[1] "factor"
$Sex
[1] "factor"
$Capital.Gain
[1] "integer"
$Capital.Loss
[1] "integer"
$Hours.Per.Week
[1] "integer"
$Native.Country
[1] "factor"
$Salary.Group
[1] "factor"
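
For a more compact view of the same information, sapply can be used in place of lapply. The following is a minimal sketch on a small made-up data frame called toy (a hypothetical stand-in, not the real survey data). It assumes every column has a single class; a column with several classes, such as an ordered factor, would make sapply return a list rather than a character vector.

```r
# Hypothetical stand-in for the survey dataset (values are made up).
toy <- data.frame(
  Age = c(39L, 50L, 38L),
  Workclass = factor(c("Private", "State-gov", "Private")),
  Hours.Per.Week = c(40L, 13L, 40L)
)

# sapply simplifies the per-column results to a named character vector.
classes <- sapply(toy, class)
print(classes)

# Count how many columns there are of each class.
print(table(classes))
```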
We can now use is.numeric to select the numeric
columns and store the result in a new dataset. Here
sapply applies is.numeric to each column and returns a
logical vector that flags the numeric ones:
> survey.numeric <- survey[,sapply(survey, is.numeric)]
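
One point worth checking with this form of indexing is what happens when only a single column is numeric. The sketch below uses a made-up data frame toy (hypothetical, not the survey data) to show that drop = FALSE keeps the result a data frame, and that Filter from base R is an alternative way to write the same selection.

```r
# Hypothetical data frame with exactly one numeric column.
toy <- data.frame(
  Age = c(39L, 50L, 38L),
  Workclass = factor(c("Private", "State-gov", "Private")),
  Sex = factor(c("Male", "Male", "Female"))
)

# Logical vector flagging the numeric columns.
is.num <- sapply(toy, is.numeric)

# With a single selected column, matrix-style indexing returns a plain vector.
toy.vector <- toy[, is.num]

# drop = FALSE keeps the result as a one-column data frame.
toy.numeric <- toy[, is.num, drop = FALSE]

# Filter applies the predicate to each column and keeps those that pass.
toy.filtered <- Filter(is.numeric, toy)

print(class(toy.vector))
print(colnames(toy.numeric))
print(colnames(toy.filtered))
```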
Alternatively, you could build a list of the columns to remove and then explicitly remove them from the dataset itself, which avoids storing a second copy of the data.
First build a vector of the indices of the columns to remove, and reverse it: each time we remove a column, all the columns to its right shift left and their indices decrease by one, so removing from the highest index down keeps the remaining indices valid. We use sapply to find the numeric columns (those for which is.numeric is true) and negate the result to pick out the non-numeric ones.
> rmcols <- rev(seq(1,ncol(survey))[!as.logical(sapply(survey, is.numeric))])
> rmcols
[1] 15 14 10 9 8 7 6 4 2
Now remove each of these columns from the dataset by setting it to NULL.
> for (i in rmcols) survey[[i]] <- NULL
> colnames(survey)
[1] "Age" "fnlwgt" "Education.Num" "Capital.Gain"
[5] "Capital.Loss" "Hours.Per.Week"
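
The loop above removes one column at a time, which is why the indices need reversing. As a possible variation, the columns can be set to NULL in a single assignment, in which case the order of the indices should not matter. This is a sketch on a made-up data frame toy (hypothetical, not the survey data), using which to obtain the indices.

```r
# Hypothetical data frame mixing numeric and factor columns.
toy <- data.frame(
  Age = c(39L, 50L),
  Workclass = factor(c("Private", "State-gov")),
  fnlwgt = c(100L, 200L),
  Sex = factor(c("Male", "Female"))
)

# Indices of the non-numeric columns, in ascending order.
rmcols <- which(!sapply(toy, is.numeric))
print(rmcols)

# Assigning NULL to all of them at once removes them together,
# so no index shifts part-way through and no reversal is needed.
toy[rmcols] <- NULL
print(colnames(toy))
```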
This same process can be used to remove or retain columns of any type, simply by using the appropriate R function: e.g., is.factor, is.logical, is.integer, or is.numeric.
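
To illustrate, the selection can be wrapped in a small function that takes the type test as an argument. The helper pick.columns and the data frame toy below are hypothetical names introduced only for this sketch; they are not part of R or of the survey dataset.

```r
# Hypothetical helper: keep (or remove) the columns for which pred is TRUE.
pick.columns <- function(df, pred, keep = TRUE) {
  hit <- vapply(df, pred, logical(1))  # one TRUE/FALSE per column
  df[, if (keep) hit else !hit, drop = FALSE]
}

# Made-up data frame with one column of each of several types.
toy <- data.frame(
  Age = c(39L, 50L),
  Score = c(1.5, 2.5),
  Workclass = factor(c("Private", "State-gov")),
  Married = c(TRUE, FALSE)
)

# Retain only the factor columns.
factors <- pick.columns(toy, is.factor)

# Retain the numeric columns (integer and double).
nums <- pick.columns(toy, is.numeric)

# Retain the columns for which is.integer is TRUE.
ints <- pick.columns(toy, is.integer)

# Remove the logical columns, keeping everything else.
no.logical <- pick.columns(toy, is.logical, keep = FALSE)

print(colnames(factors))
print(colnames(nums))
print(colnames(no.logical))
```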
Copyright © Togaware Pty Ltd