Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Using audit

[Image: rattle-audit-model-rpart]

The output from the decision tree building process contains a good deal of information, beginning with the number of entities used to build the model (n):



Summary of the rpart model:

n= 1400

Next, the structure of the tree is presented, preceded by a legend explaining how each node is listed. Each line gives the node number, the split or test that defines the node (var op value), the number of entities at that node (n), the number of entities incorrectly classified (the loss), the default classification for the node (yval), and the distribution of classes in that node (yprob). The class distribution is listed in the same class order for every node.

node), split, n, loss, yval, (yprob)
      * denotes terminal node

The root node, then, contains all 1400 entities. The default class is 0, so the 335 entities whose class is 1 rather than 0 count as the loss. For this node 76.07% of the entities have a 0 classification and 23.93% have 1.

 1) root 1400 335 0 (0.76071429 0.23928571)
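The percentages quoted for the root node follow directly from n and the loss. A minimal sketch in R, using only the figures from the root node line (the variable names are illustrative):

```r
# Values copied from the root node line of the listing.
n    <- 1400  # entities at the root node
loss <- 335   # entities not in the default class (0)

# Class proportions in the order (0, 1), matching the yprob column.
yprob <- c((n - loss) / n, loss / n)
print(round(yprob, 8))
```

The same arithmetic applies to any other node: the proportion of the non-default class is loss / n.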

The first split divides the root entities into two groups according to the value of the Marital variable (the full list of values is replaced with [...] here). Node 2 contains 764 entities, of which 47 are misclassified when we take the default class as 0. The class distribution is 93.85% 0 and 6.15% 1. The '*' indicates that this node is not split any further; that is, it is a terminal node.

   2) Marital=Absent, [...] 764  47 0 (0.93848168 0.06151832) *

The other side of the Marital split (node 3) is then split further. We can see that node 13, for example, results from a split on the variable Deductions with the test >= 1679.667. There are only 8 entities in this node, none of them incorrectly classified; the classification is 1, and the class distribution is 0% 0 and 100% 1. This is also a terminal node.

   3) Marital=Married 636 288 0 (0.54716981 0.45283019)  
     6) Occupation=Cleaner, [...] 282  65 0 (0.76950355 0.23049645)  
      12) Deductions< 1679.667 274  57 0 (0.79197080 0.20802920) *
      13) Deductions>=1679.667 8   0 1 (0.00000000 1.00000000) *

The rest of the tree is:

     7) Occupation=Clerical,[...] 354 131 1 (0.37005650 0.62994350)  
      14) Education=Associate,[...] 165  81 1 (0.49090909 0.50909091)  
        28) Age< 33.5 36   9 0 (0.75000000 0.25000000) *
        29) Age>=33.5 129  54 1 (0.41860465 0.58139535)  
          58) Age>=62 14   3 0 (0.78571429 0.21428571) *
          59) Age< 62 115  43 1 (0.37391304 0.62608696) *
      15) Education=Bachelor,[...] 189  50 1 (0.26455026 0.73544974) *
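One way to check a reading of the listing is to confirm that the entities at each split node are fully divided between its two children. The sketch below relies on rpart's numbering convention, in which the children of node k are numbered 2k and 2k+1; the counts are copied from the listing, and the variable names are illustrative.

```r
# Node counts (n) copied from the listing, keyed by node number.
n <- c("1" = 1400, "2" = 764, "3" = 636, "6" = 282, "7" = 354,
       "12" = 274, "13" = 8, "14" = 165, "15" = 189,
       "28" = 36, "29" = 129, "58" = 14, "59" = 115)

# Terminal nodes (marked '*') never have their children listed, so skip them.
for (id in as.integer(names(n))) {
  kids <- as.character(c(2 * id, 2 * id + 1))
  if (all(kids %in% names(n))) {
    stopifnot(sum(n[kids]) == n[[as.character(id)]])
  }
}

# The terminal nodes together should account for every entity.
leaves <- c("2", "12", "13", "28", "58", "59", "15")
print(sum(n[leaves]))
```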

Next, the call to the rpart function that built the tree is listed:

Classification tree:
rpart(formula = Adjusted ~ ., data = crs$dataset[crs$sample, 
    c(2:10, 13)], method = "class")
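A call of the same shape can be tried outside Rattle. The sketch below is only an illustration: it assumes the rpart package is installed and uses the kyphosis data set bundled with rpart as a stand-in, not the audit data, so the resulting tree will differ from the one shown here.

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart; it is NOT the audit data set.
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

print(fit)    # the node), split, n, loss, yval, (yprob) listing
printcp(fit)  # call, variables used, root node error and complexity table
```

As in the call above, method = "class" requests a classification tree, and the formula with a dot on the right-hand side uses all remaining columns as candidate input variables.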



Variables actually used in tree construction:
[1] Age        Deductions Education  Marital    Occupation



Root node error: 335/1400 = 0.23929

n= 1400

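The complexity table reports, for each value of the complexity parameter (CP), the number of splits (nsplit), the error on the training data relative to the root node error (rel error), and the cross-validated error (xerror) together with an estimate of its variability (xstd). Here the cross-validated error is lowest (0.75821) for the tree with 2 splits: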

        CP nsplit rel error  xerror     xstd
1 0.137313      0   1.00000 1.00000 0.047653
2 0.026866      2   0.72537 0.75821 0.043043
3 0.023881      4   0.67164 0.79403 0.043817
4 0.010000      6   0.62388 0.77910 0.043498
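The errors in the table are relative to the root node error, so multiplying by 335/1400 gives absolute error rates. The sketch below re-enters the table by hand (the data frame and column names are illustrative) and picks the row with the smallest cross-validated error, which is one common heuristic for choosing a tree size rather than the only one.

```r
# Complexity table copied from the output above.
cp <- data.frame(
  CP        = c(0.137313, 0.026866, 0.023881, 0.010000),
  nsplit    = c(0, 2, 4, 6),
  rel.error = c(1.00000, 0.72537, 0.67164, 0.62388),
  xerror    = c(1.00000, 0.75821, 0.79403, 0.77910),
  xstd      = c(0.047653, 0.043043, 0.043817, 0.043498)
)
root_error <- 335 / 1400  # root node error reported above

# Convert relative errors into absolute error rates.
cp$abs.error  <- cp$rel.error * root_error
cp$abs.xerror <- cp$xerror * root_error

# Row with the smallest cross-validated error.
best <- which.min(cp$xerror)
print(cp[best, ])
```

For the full tree with 6 splits, the absolute training error works out to about 0.149, which matches the 209 misclassified entities obtained by summing the loss over the terminal nodes of the listing (209/1400).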

Finally, we get to see how long it took to build the tree:

Time taken: 0.14 secs

Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Saturday, 16 January 2010