Data Mining Survivor: Tutorial_Example

DATA MINING
Desktop Survival Guide
by Graham Williams

Using audit

The output from the decision tree building process includes a good deal of information. It begins with the number of entities used to build the model:

Summary of the rpart model: n= 1400

Next, the structure of the tree is presented, preceded by a legend explaining how each node is listed. The legend tells us that a node number will be provided (node), followed by a split or test (split), the number of entities at that node (n), how many entities are incorrectly classified (the loss), the default classification for the node (yval), and then the distribution of classes in that node (yprob). The distribution is listed in the same order as the classes, and that order is the same for all nodes.

node), split, n, loss, yval, (yprob) * denotes terminal node
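The field order in this legend can be made concrete with a small parser. The following is a minimal sketch in Python, used here only for illustration and not part of the R session that produced the output. The names `NODE_LINE` and `parse_node_line` are hypothetical, and the pattern assumes the layout shown in this listing (a single split field, integer counts, and a parenthesised distribution).

```python
import re

# One line of the listing: node), split, n, loss, yval, (yprob), optional '*'.
NODE_LINE = re.compile(
    r"^\s*(?P<node>\d+)\)\s+"   # node number followed by ')'
    r"(?P<split>.+?)\s+"        # split or test ('root' for node 1)
    r"(?P<n>\d+)\s+"            # number of entities at the node
    r"(?P<loss>\d+)\s+"         # entities incorrectly classified
    r"(?P<yval>\S+)\s+"         # default classification for the node
    r"\((?P<yprob>[^)]*)\)"     # class distribution
    r"\s*(?P<leaf>\*)?\s*$"     # '*' denotes a terminal node
)


def parse_node_line(line):
    """Split one line of the tree listing into its named fields."""
    match = NODE_LINE.match(line)
    if match is None:
        raise ValueError(f"not a node line: {line!r}")
    return {
        "node": int(match.group("node")),
        "split": match.group("split"),
        "n": int(match.group("n")),
        "loss": int(match.group("loss")),
        "yval": match.group("yval"),
        "yprob": [float(p) for p in match.group("yprob").split()],
        "terminal": match.group("leaf") is not None,
    }


# Two lines copied from the listing in this section.
root = parse_node_line("1) root 1400 335 0 (0.76071429 0.23928571)")
leaf = parse_node_line("13) Deductions>=1679.667 8 0 1 (0.00000000 1.00000000) *")
print(root)
print(leaf)
```

Reading the fields by name rather than by position makes it harder to confuse n with loss when scanning a long listing.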

The root node, then, contains all 1400 entities. The default class is 0, so the 335 entities that actually belong to class 1 are counted as incorrectly classified. For this node 76.07% of the entities have a 0 classification and 23.93% have 1.

1) root 1400 335 0 (0.76071429 0.23928571)
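The percentages quoted for the root node follow directly from n and loss. As a quick check, here is a short Python sketch (Python is used only as a calculator here; the helper name `class_shares` is hypothetical) that recomputes the two proportions from the counts.

```python
def class_shares(n, loss):
    """Return (share in the default class, share incorrectly classified)."""
    return (n - loss) / n, loss / n


# Root node from the listing: 1400 entities, 335 incorrectly classified.
default_share, other_share = class_shares(1400, 335)
print(f"{default_share:.8f} {other_share:.8f}")
print(f"{default_share:.2%} {other_share:.2%}")
```

For a node whose default class is 0 the pair is in the same order as yprob; for a node whose default class is 1 the two values appear in the opposite order in the listing, since yprob always follows the class order.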

The root node is then split into two groups on the Marital variable. Node 2 identifies those entities with specific values for Marital (the full list is replaced with [...] here). There are 764 entities in this group, of which 47 are incorrectly classified when we take the default class as 0. The class distribution is 93.85% 0 and 6.15% 1. The '*' indicates that this node is not split any further; that is, it is a terminal node.

2) Marital=Absent, [...] 764 47 0 (0.93848168 0.06151832) *

The other side of the Marital split is then split further. We can see that node 13, for example, results from a split on the variable Deductions, with a test of Deductions>=1679.667. There are only 8 entities in this node, with none incorrectly classified, the classification being 1, and the class distribution being 0% 0 and 100% 1. This is also a terminal node.

  3) Marital=Married 636 288 0 (0.54716981 0.45283019)
    6) Occupation=Cleaner, [...] 282 65 0 (0.76950355 0.23049645)
      12) Deductions< 1679.667 274 57 0 (0.79197080 0.20802920) *
      13) Deductions>=1679.667 8 0 1 (0.00000000 1.00000000) *

The rest of the tree is:

    7) Occupation=Clerical,[...] 354 131 1 (0.37005650 0.62994350)
      14) Education=Associate,[...] 165 81 1 (0.49090909 0.50909091)
        28) Age< 33.5 36 9 0 (0.75000000 0.25000000) *
        29) Age>=33.5 129 54 1 (0.41860465 0.58139535)
          58) Age>=62 14 3 0 (0.78571429 0.21428571) *
          59) Age< 62 115 43 1 (0.37391304 0.62608696) *
      15) Education=Bachelor,[...] 189 50 1 (0.26455026 0.73544974) *
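The complete listing can be checked for internal consistency. The sketch below, again in Python purely as a checking aid, records n, loss and the terminal marker for every node shown above. It assumes rpart's usual numbering convention, with which this listing is consistent, that the children of node k are numbered 2k and 2k+1. The names `nodes`, `leaf_loss` and `rel_error` are hypothetical.

```python
# node number -> (n, loss, terminal?), copied from the listing above.
nodes = {
    1: (1400, 335, False),
    2: (764, 47, True),
    3: (636, 288, False),
    6: (282, 65, False),
    7: (354, 131, False),
    12: (274, 57, True),
    13: (8, 0, True),
    14: (165, 81, False),
    15: (189, 50, True),
    28: (36, 9, True),
    29: (129, 54, False),
    58: (14, 3, True),
    59: (115, 43, True),
}

# Every split must divide a node's entities between its two children.
for k, (n, _loss, terminal) in nodes.items():
    if not terminal:
        left_n = nodes[2 * k][0]
        right_n = nodes[2 * k + 1][0]
        assert left_n + right_n == n, f"node {k} does not add up"

# Entities misclassified by the whole tree: sum the loss over terminal nodes.
leaf_loss = sum(loss for _n, loss, terminal in nodes.values() if terminal)
root_n, root_loss, _ = nodes[1]
rel_error = leaf_loss / root_loss

print("splits:", sum(1 for v in nodes.values() if not v[2]))
print("misclassified by the tree:", leaf_loss)
print("error rate:", round(leaf_loss / root_n, 5))
print("relative to root node error:", round(rel_error, 5))
```

The tree has 6 splits and misclassifies 209 of the 1400 entities; relative to the 335 misclassified at the root this is 0.62388, which agrees with the rel error reported for 6 splits in the complexity table.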

Next, the call to the rpart function is listed, followed by the variables actually used in the tree and the root node error:

Classification tree:
rpart(formula = Adjusted ~ ., data = crs$dataset[crs$sample, c(2:10, 13)], method = "class")

Variables actually used in tree construction:
[1] Age Deductions Education Marital Occupation

Root node error: 335/1400 = 0.23929

n= 1400

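The complexity table is useful for judging how large the tree should be. For each candidate tree size it lists the complexity parameter (CP), the number of splits (nsplit), the error on the training data relative to the root node error (rel error), the cross-validated error on the same relative scale (xerror), and the standard error of the cross-validated error (xstd):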

        CP nsplit rel error  xerror     xstd
1 0.137313      0   1.00000 1.00000 0.047653
2 0.026866      2   0.72537 0.75821 0.043043
3 0.023881      4   0.67164 0.79403 0.043817
4 0.010000      6   0.62388 0.77910 0.043498
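The columns of this table are related by simple arithmetic. The Python sketch below (a checking aid only; the names `cp_table` and `root_error` are hypothetical) illustrates three things with the figures printed above: each rel error equals the previous one minus CP times the number of splits added, a relationship the printed numbers are consistent with to rounding; multiplying by the root node error converts the relative errors to plain error rates; and the row with the lowest cross-validated error can be picked out directly.

```python
# (CP, nsplit, rel error, xerror, xstd), copied from the table above.
cp_table = [
    (0.137313, 0, 1.00000, 1.00000, 0.047653),
    (0.026866, 2, 0.72537, 0.75821, 0.043043),
    (0.023881, 4, 0.67164, 0.79403, 0.043817),
    (0.010000, 6, 0.62388, 0.77910, 0.043498),
]
root_error = 335 / 1400  # root node error from the output

# 1. Each step's drop in rel error is CP times the number of splits added.
for prev, curr in zip(cp_table, cp_table[1:]):
    expected = prev[2] - prev[0] * (curr[1] - prev[1])
    assert abs(expected - curr[2]) < 1e-5

# 2. Convert the relative errors to absolute error rates.
for cp, nsplit, rel, xerr, xstd in cp_table:
    print(f"nsplit={nsplit}: training error {rel * root_error:.4f}, "
          f"cross-validated error {xerr * root_error:.4f}")

# 3. Find the row with the lowest cross-validated error.
best = min(cp_table, key=lambda row: row[3])
print("lowest xerror at nsplit =", best[1], "with CP =", best[0])
```

In this run the lowest cross-validated error is at 2 splits rather than at the full 6, even though the training error keeps falling as splits are added. The differences between the xerror values for 2, 4 and 6 splits are smaller than xstd, so they should not be over-interpreted.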

Finally, we get to see how long it took to build the tree:

Time taken: 0.14 secs

Brought to you by Togaware. This page generated: Saturday, 16 January 2010