Data Mining Survivor: Examples - Survey Data: Data Preparation

DATA MINING
Desktop Survival Guide
by Graham Williams

Survey Data: Data Preparation

For this example we will use the survey dataset (see Section 30.3.4). This dataset is a reasonable size and has some common real-world issues. The vignette for arules, by the authors of the package, also uses a similar dataset, available within the package through data(Survey). We borrow some of their data transformations here.

We first review the dataset: there are 32,561 entities and 15 variables.

> load("survey.RData")
> dim(survey)
[1] 32561    15
> summary(survey)
      Age                   Workclass         fnlwgt
 Min.   :17.00   Private         :22696   Min.   :  12285
 1st Qu.:28.00   Self-emp-not-inc: 2541   1st Qu.: 117827
 Median :37.00   Local-gov       : 2093   Median : 178356
 Mean   :38.58   State-gov       : 1298   Mean   : 189778
 3rd Qu.:48.00   Self-emp-inc    : 1116   3rd Qu.: 237051
 Max.   :90.00   (Other)         :  981   Max.   :1484705
                 NA's            : 1836
        Education     Education.Num               Marital.Status
 HS-grad     :10501   Min.   : 1.00   Divorced             : 4443
 Some-college: 7291   1st Qu.: 9.00   Married-AF-spouse    :   23
 Bachelors   : 5355   Median :10.00   Married-civ-spouse   :14976
 Masters     : 1723   Mean   :10.08   Married-spouse-absent:  418
 Assoc-voc   : 1382   3rd Qu.:12.00   Never-married        :10683
 11th        : 1175   Max.   :16.00   Separated            : 1025
 (Other)     : 5134                   Widowed              :  993
          Occupation            Relationship                   Race
 Prof-specialty : 4140   Husband       :13193   Amer-Indian-Eskimo:  311
 Craft-repair   : 4099   Not-in-family : 8305   Asian-Pac-Islander: 1039
 Exec-managerial: 4066   Other-relative:  981   Black             : 3124
 Adm-clerical   : 3770   Own-child     : 5068   Other             :  271
 Sales          : 3650   Unmarried     : 3446   White             :27816
 (Other)        :10993   Wife          : 1568
 NA's           : 1843
     Sex         Capital.Gain    Capital.Loss    Hours.Per.Week
 Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00
 Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00
                Median :    0   Median :   0.0   Median :40.00
                Mean   : 1078   Mean   :  87.3   Mean   :40.44
                3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00
                Max.   :99999   Max.   :4356.0   Max.   :99.00
       Native.Country  Salary.Group
 United-States:29170   <=50K:24720
 Mexico       :  643   >50K : 7841
 Philippines  :  198
 Germany      :  137
 Canada       :  121
 (Other)      : 1709
 NA's         :  583

The first 5 rows of the dataset give some idea of the type of data:

> survey[1:5,]
  Age        Workclass fnlwgt Education Education.Num     Marital.Status
1  39        State-gov  77516 Bachelors            13      Never-married
2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
3  38          Private 215646   HS-grad             9           Divorced
4  53          Private 234721      11th             7 Married-civ-spouse
5  28          Private 338409 Bachelors            13 Married-civ-spouse
         Occupation  Relationship  Race    Sex Capital.Gain Capital.Loss
1      Adm-clerical Not-in-family White   Male         2174            0
2   Exec-managerial       Husband White   Male            0            0
3 Handlers-cleaners Not-in-family White   Male            0            0
4 Handlers-cleaners       Husband Black   Male            0            0
5    Prof-specialty          Wife Black Female            0            0
  Hours.Per.Week Native.Country Salary.Group
1             40  United-States        <=50K
2             13  United-States        <=50K
3             40  United-States        <=50K
4             40  United-States        <=50K
5             40           Cuba        <=50K

The dataset contains a mixture of categoric and numeric variables, while the apriori algorithm works only with categoric variables (or factors). The variable fnlwgt is a calculated value and not of interest to us, so we can remove it from the dataset. The variable Education.Num is redundant since it is simply a numeric mapping of Education. We can remove both from the data frame by assigning NULL to them:

> survey$fnlwgt <- NULL
> survey$Education.Num <- NULL

This still leaves the numeric variables Age, Capital.Gain, Capital.Loss, and Hours.Per.Week. Following the arules vignette, we partition Age and Hours.Per.Week into four segments each:

> survey$Age <- ordered(cut(survey$Age, c(15, 25, 45, 65, 100)),
                        labels = c("Young", "Middle-aged", "Senior", "Old"))
> survey$Hours.Per.Week <- ordered(cut(survey$Hours.Per.Week, c(0, 25, 40, 60, 168)),
                                   labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
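As a quick check on where the boundaries fall, the following sketch bins a few made-up ages (they are not taken from the survey data). It is a variation on the commands above rather than the same code: it passes the labels directly to cut and asks for an ordered result, which keeps all four labels even if some interval happens to be empty.

```r
# Toy ages chosen to sit on and around the break points (invented values).
age <- c(17, 25, 26, 45, 46, 65, 66, 90)

# cut() builds right-closed intervals by default:
# (15,25], (25,45], (45,65], (65,100].
age.band <- cut(age,
                breaks = c(15, 25, 45, 65, 100),
                labels = c("Young", "Middle-aged", "Senior", "Old"),
                ordered_result = TRUE)

# Show each age next to the band it was assigned to.
print(data.frame(age, age.band))
```

Because the intervals are closed on the right, an age of exactly 25 is placed in Young and an age of exactly 45 in Middle-aged.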

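Again following the arules vignette, we map Capital.Gain and Capital.Loss to None for zero values, and to Low or High for positive values, with the boundary between Low and High placed at the median of the positive values: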

> survey$Capital.Gain <- ordered(cut(survey$Capital.Gain,
                                     c(-Inf, 0,
                                       median(survey$Capital.Gain[survey$Capital.Gain > 0]),
                                       1e+06)),
                                 labels = c("None", "Low", "High"))
> survey$Capital.Loss <- ordered(cut(survey$Capital.Loss,
                                     c(-Inf, 0,
                                       median(survey$Capital.Loss[survey$Capital.Loss > 0]),
                                       1e+06)),
                                 labels = c("None", "Low", "High"))
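To see how the median split behaves, here is a minimal sketch on invented capital gain values. It uses Inf rather than 1e+06 as the upper break, which is an assumption of ours, and it computes the cutoff from the positive values only, as in the commands above.

```r
# Invented capital gains: mostly zero, as in the survey data.
gain <- c(0, 0, 0, 1000, 2000, 5000, 0, 9000)

# Median of the positive values only (1000, 2000, 5000, 9000).
cutoff <- median(gain[gain > 0])

# Zero maps to None, (0, cutoff] to Low, and anything above to High.
gain.band <- cut(gain,
                 breaks = c(-Inf, 0, cutoff, Inf),
                 labels = c("None", "Low", "High"),
                 ordered_result = TRUE)

print(cutoff)
print(data.frame(gain, gain.band))
```

Taking the median over the positive values matters here: with most values equal to zero, the median of the whole variable would itself be zero and the Low band would be empty.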

That is pretty much it in terms of preparing the data for apriori:

> survey[1:5,]
          Age        Workclass Education     Marital.Status        Occupation
1 Middle-aged        State-gov Bachelors      Never-married      Adm-clerical
2      Senior Self-emp-not-inc Bachelors Married-civ-spouse   Exec-managerial
3 Middle-aged          Private   HS-grad           Divorced Handlers-cleaners
4      Senior          Private      11th Married-civ-spouse Handlers-cleaners
5 Middle-aged          Private Bachelors Married-civ-spouse    Prof-specialty
   Relationship  Race    Sex Capital.Gain Capital.Loss Hours.Per.Week
1 Not-in-family White   Male          Low         None      Full-time
2       Husband White   Male         None         None      Part-time
3 Not-in-family White   Male         None         None      Full-time
4       Husband Black   Male         None         None      Full-time
5          Wife Black Female         None         None      Full-time
  Native.Country Salary.Group
1  United-States        <=50K
2  United-States        <=50K
3  United-States        <=50K
4  United-States        <=50K
5           Cuba        <=50K

The apriori function will coerce the data frame into the transactions data type itself, but we can also do the coercion before calling apriori, using the as function, so that we can view the data as a transaction dataset:

> library(arules)
> survey.transactions <- as(survey, "transactions")
> survey.transactions
transactions in sparse format with
 32561 transactions (rows) and
 115 items (columns)

This illustrates how the transactions data type represents variables in a binary form, one binary variable for each level of each categoric variable. There are 115 distinct levels (values for the categoric variables) across all 13 of the categoric variables.
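The binary representation can be imitated in base R. The sketch below is only an illustration on an invented two-variable data frame and is not how arules stores transactions internally (arules uses a sparse matrix); the names toy, indicator.columns, and indicators are ours.

```r
# Invented data frame with two categoric variables.
toy <- data.frame(
  Sex = factor(c("Male", "Female", "Male")),
  Age = factor(c("Young", "Senior", "Middle-aged"))
)

# One binary column per level of each factor, named "variable = level".
indicator.columns <- lapply(names(toy), function(variable) {
  lv <- levels(toy[[variable]])
  m <- sapply(lv, function(level) toy[[variable]] == level)
  colnames(m) <- paste(variable, lv, sep = " = ")
  m
})
indicators <- do.call(cbind, indicator.columns)
print(indicators)

# The number of items is the total number of levels across the variables.
n.items <- sum(sapply(toy, nlevels))
print(n.items)
```

Applied to the prepared survey data frame, the same level count is what yields the 115 items reported above.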

The summary function provides more details:

> summary(survey.transactions)
transactions as itemMatrix in sparse format with
 32561 rows (elements/itemsets/transactions) and
 115 columns (items)

most frequent items:
           Capital.Loss = None            Capital.Gain = None
                         31042                          29849
Native.Country = United-States                   Race = White
                         29170                          27816
          Salary.Group = <=50K                        (Other)
                         24720                         276434

element (itemset/transaction) length distribution:
   10    11    12    13
   27  1809   563 30162

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.00   13.00   13.00   12.87   13.00   13.00

includes extended item information - examples:
             labels variables      levels
1       Age = Young       Age       Young
2 Age = Middle-aged       Age Middle-aged

The summary begins with a description of the dataset sizes. This is followed by a list of the most frequent items occurring in the dataset. A Capital.Loss of None is the single most frequent item, occurring 31,042 times (i.e., about 95% of the transactions have no capital loss recorded). The length distribution of the transactions is then given, indicating that some transactions have NA's for some of the variables. Looking at the summary of the original dataset you'll see that the variables Workclass, Occupation, and Native.Country have NA's. A transaction missing all three has 10 items, and a complete one has 13, so the distribution ranges from 10 to 13 items in a transaction.

The final piece of information in the summary output indicates the mapping that has been used to map the categoric variables to the binary variables, so that Age = Young is one binary variable, and Age = Middle-aged is another.
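The link between NA's and transaction length noted above can be checked in base R: a row contributes one item per non-missing value. The following is a minimal sketch on an invented data frame (the variable names are borrowed from the survey data, the values are not).

```r
# Invented rows: NA marks a missing value, which produces no item.
toy <- data.frame(
  Workclass  = factor(c("Private", NA, "State-gov", "Private")),
  Occupation = factor(c("Sales", NA, NA, "Sales")),
  Sex        = factor(c("Male", "Female", "Male", "Female"))
)

# Number of items each row would contribute to a transaction.
items.per.row <- rowSums(!is.na(toy))

# Tabulating the lengths mirrors the "length distribution" in the summary.
print(table(items.per.row))
```

Running rowSums(!is.na(survey)) on the prepared data frame should reproduce the 10 to 13 range in the same way.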

Now it is time to find all association rules using apriori. After a little experimenting we have chosen a support of 0.05 and a confidence of 0.95. This gives us 4,236 rules.

> survey.rules <- apriori(survey.transactions,
                          parameter = list(support=0.05, confidence=0.95))

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target
       0.95    0.1    1 none FALSE            TRUE    0.05      1      5  rules
   ext
 FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[115 item(s), 32561 transaction(s)] done [0.07s].
sorting and recoding items ... [36 item(s)] done [0.01s].
creating transaction tree ... done [0.08s].
checking subsets of size 1 2 3 4 5 done [0.23s].
writing ... [4236 rule(s)] done [0.00s].
creating S4 object  ... done [0.04s].

> survey.rules
set of 4236 rules

> summary(survey.rules)
set of 4236 rules

rule length distribution (lhs + rhs):
   1    2    3    4    5
   1   34  328 1282 2591

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   4.000   5.000   4.517   5.000   5.000

summary of quality measures:
    support          confidence          lift
 Min.   :0.05003   Min.   :0.9500   Min.   :0.9965
 1st Qu.:0.06469   1st Qu.:0.9617   1st Qu.:1.0186
 Median :0.08435   Median :0.9715   Median :1.0505
 Mean   :0.11418   Mean   :0.9745   Mean   :1.2701
 3rd Qu.:0.13267   3rd Qu.:0.9883   3rd Qu.:1.3098
 Max.   :0.95335   Max.   :1.0000   Max.   :2.9725

We can inspect the first 5 rules (slightly edited to suit publication):

> inspect(survey.rules[1:5])
  lhs                                 rhs                     support conf  lift
1 {}                               => {Capital.Loss = None}     0.953 0.953 1.00
2 {Occupation = Machine-op-inspct} => {Workclass = Private}     0.058 0.955 1.37
3 {Occupation = Machine-op-inspct} => {Capital.Loss = None}     0.059 0.966 1.01
4 {Race = Black}                   => {Capital.Loss = None}     0.093 0.967 1.01
5 {Occupation = Other-service}     => {Salary.Group = <=50K}    0.097 0.958 1.26

Or we can list the first 5 rules that have a lift greater than 2.5:

> subset(survey.rules, subset=lift>2.5)
set of 40 rules
> inspect(subset(survey.rules, subset=lift>2.5)[1:5])
  lhs                                 rhs                              support conf lift
1 {Age = Young,
   Hours.Per.Week = Part-time}     => {Marital.Status = Never-married}    0.06 0.95  2.9
2 {Age = Young,
   Relationship = Own-child}       => {Marital.Status = Never-married}    0.10 0.97  2.9
3 {Age = Young,
   Hours.Per.Week = Part-time,
   Salary.Group = <=50K}           => {Marital.Status = Never-married}    0.06 0.96  2.9
4 {Age = Young,
   Hours.Per.Week = Part-time,
   Native.Country = United-States} => {Marital.Status = Never-married}    0.05 0.95  2.9
5 {Age = Young,
   Capital.Gain = None,
   Hours.Per.Week = Part-time}     => {Marital.Status = Never-married}    0.05 0.96  2.9
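The support, confidence, and lift columns can be related back to the item frequencies reported by summary(survey.transactions). The sketch below redoes the arithmetic for two of the rules printed by inspect(survey.rules[1:5]), using the usual definitions (support as a relative frequency, lift as confidence divided by the relative frequency of the right-hand side). The counts are copied from the output in this section; the variable names are ours.

```r
# Counts copied from the summaries shown earlier in this section.
n.transactions <- 32561   # rows in survey.transactions
n.loss.none    <- 31042   # frequency of the item Capital.Loss = None
n.salary.low   <- 24720   # frequency of the item Salary.Group = <=50K

# Rule {} => {Capital.Loss = None}: with an empty left-hand side, support and
# confidence are both the item's relative frequency, so the lift is 1.
support.loss.none <- n.loss.none / n.transactions
lift.loss.none    <- support.loss.none / support.loss.none

# Rule {Occupation = Other-service} => {Salary.Group = <=50K}:
# lift = confidence / relative frequency of the right-hand side.
confidence.rule5 <- 0.958   # as printed by inspect()
lift.rule5 <- confidence.rule5 / (n.salary.low / n.transactions)

print(round(c(support = support.loss.none, lift = lift.rule5), 3))
```

The first value agrees with the maximum support of 0.95335 in summary(survey.rules), and the second rounds to the lift of 1.26 printed for that rule.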

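Here we lower the confidence threshold to 0.8, which builds quite a few more rules (13,344), and then view the three rules with Salary.Group on the right-hand side that have the highest lift: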

> survey.rules <- apriori(survey.transactions,
                          parameter = list(support = 0.05, confidence = 0.8))

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target
        0.8    0.1    1 none FALSE            TRUE    0.05      1      5  rules
   ext
 FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[115 item(s), 32561 transaction(s)] done [0.09s].
sorting and recoding items ... [36 item(s)] done [0.02s].
creating transaction tree ... done [0.10s].
checking subsets of size 1 2 3 4 5 done [0.35s].
writing ... [13344 rule(s)] done [0.00s].
creating S4 object  ... done [0.08s].
> inspect(SORT(subset(survey.rules, subset=rhs %in% "Salary.Group"), by="lift")[1:3])
  lhs                                 rhs                     support conf lift
1 {Occupation = Exec-managerial,
   Relationship = Husband,
   Capital.Gain = High}            => {Salary.Group = >50K}     0.007    1 4.15
2 {Age = Middle-aged,
   Occupation = Exec-managerial,
   Capital.Gain = High}            => {Salary.Group = >50K}     0.005    1 4.15
3 {Age = Middle-aged,
   Education = Bachelors,
   Capital.Gain = High}            => {Salary.Group = >50K}     0.006    1 4.15
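The same filter-then-sort pattern can be tried on a small scale. The sketch below is not the survey analysis: it uses an invented six-row data frame, and it assumes a recent release of arules, in which the sorting function is spelled sort and partial matching on item labels is done with %pin%. The thresholds are chosen only so that the toy data yields a few rules.

```r
library(arules)

# Toy data (invented for illustration, not the survey dataset).
toy <- data.frame(
  Age    = factor(c("Young", "Young", "Young", "Senior", "Senior", "Senior")),
  Status = factor(c("Never-married", "Never-married", "Never-married",
                    "Married", "Married", "Divorced"))
)
toy.transactions <- as(toy, "transactions")

# Mine rules quietly; support and confidence are arbitrary toy thresholds.
toy.rules <- apriori(toy.transactions,
                     parameter = list(support = 0.3, confidence = 0.9),
                     control = list(verbose = FALSE))

# Keep rules whose right-hand side mentions the Status variable, matching
# item labels partially with %pin%, then order them by decreasing lift.
status.rules <- subset(toy.rules, subset = rhs %pin% "Status")
inspect(sort(status.rules, by = "lift"))
```

Sorting after subsetting keeps the sort cheap, since only the rules that survive the filter need to be ordered.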

Brought to you by Togaware. This page generated: Sunday, 22 August 2010