DATA MINING
Desktop Survival Guide by Graham Williams

Using audit
The output from the decision tree building process includes much information:
Summary of the rpart model:

n= 1400
Next, the structure of the tree is presented, preceded by a legend that
explains how each node is listed. Each line gives the node number,
followed by the split or test that defines the node (split), the number
of entities at that node (n), how many entities are incorrectly
classified (the loss), the default classification for the node (yval),
and then the distribution of classes in that node (yprob). The classes
in the distribution are listed in the same order for every node.
node), split, n, loss, yval, (yprob)
* denotes terminal node
The root node, then, contains all 1400 entities, and 335 of them are
classified as 1 rather than 0. The default class is 0, and for this
node 76.07% of the entities have a 0 classification and 23.93% have
1.
1) root 1400 335 0 (0.76071429 0.23928571)
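
The percentages quoted for the root node follow directly from the two counts on that line. A minimal check, assuming only the values of n and loss printed in the listing:

```r
# Counts copied from the root node line of the listing.
n    <- 1400   # entities at the node
loss <- 335    # entities that are not in the default class (0)

# Class distribution in class order (0, 1), as reported in yprob.
yprob <- c("0" = (n - loss) / n, "1" = loss / n)
print(round(yprob, 8))
```

The same arithmetic applies to every other node: loss divided by n gives the proportion of entities outside the node's default class.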
The first split divides the root entities into two groups according to
the value of the Marital variable (the full list of values is replaced
with [...] here). There are 764 entities in this group, of which 47 will
be incorrectly classified when we take the default class as 0. The class
distribution is 93.85% 0 and 6.15% 1. The '*' indicates that this node
is not split any further; that is, it is a terminal node.
  2) Marital=Absent, [...] 764 47 0 (0.93848168 0.06151832) *
The other side of the Marital split is then split further. We can see
that node 13, for example, has split on the variable Deductions with a
test of Deductions>=1679.667. There are only 8 entities in this node,
with none incorrectly classified, the classification being 1, and the
class distribution being 0% 0 and 100% 1. This is also a terminal node.
  3) Marital=Married 636 288 0 (0.54716981 0.45283019)
    6) Occupation=Cleaner, [...] 282 65 0 (0.76950355 0.23049645)
      12) Deductions< 1679.667 274 57 0 (0.79197080 0.20802920) *
      13) Deductions>=1679.667 8 0 1 (0.00000000 1.00000000) *
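
Because each split partitions its parent, the counts in the listing can be cross-checked against each other. The sketch below uses only numbers copied from the node lines shown so far. Note that the loss is counted relative to each node's own yval, so node 13 (yval 1, loss 0) contributes all 8 of its entities to the class 1 count of its parent.

```r
# n and loss copied from the node lines in the listing.
n    <- c("1" = 1400, "2" = 764, "3" = 636, "6" = 282, "12" = 274, "13" = 8)
loss <- c("1" = 335,  "2" = 47,  "3" = 288, "6" = 65,  "12" = 57,  "13" = 0)

# The two children of a node account for all of its entities.
stopifnot(n[["2"]]  + n[["3"]]  == n[["1"]])
stopifnot(n[["12"]] + n[["13"]] == n[["6"]])

# Node 6 predicts class 0, so its loss is its number of class 1 entities.
# Node 12 also predicts 0 (57 entities are class 1); node 13 predicts 1
# with no loss, so all 8 of its entities are class 1.
class1_in_6 <- loss[["12"]] + (n[["13"]] - loss[["13"]])
print(class1_in_6)
```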
The rest of the tree is:
    7) Occupation=Clerical,[...] 354 131 1 (0.37005650 0.62994350)
      14) Education=Associate,[...] 165 81 1 (0.49090909 0.50909091)
        28) Age< 33.5 36 9 0 (0.75000000 0.25000000) *
        29) Age>=33.5 129 54 1 (0.41860465 0.58139535)
          58) Age>=62 14 3 0 (0.78571429 0.21428571) *
          59) Age< 62 115 43 1 (0.37391304 0.62608696) *
      15) Education=Bachelor,[...] 189 50 1 (0.26455026 0.73544974) *
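
The node numbers themselves encode the shape of the tree: rpart numbers the children of node k as 2k and 2k + 1. A small sketch, using only the node numbers that appear in the listing, recovers each node's parent and depth from its number:

```r
# Node numbers in the order they appear in the listing.
nodes <- c(1, 2, 3, 6, 12, 13, 7, 14, 28, 29, 58, 59, 15)

# Children of node k are numbered 2k and 2k + 1, so integer division
# by 2 gives the parent and the base-2 logarithm gives the depth.
parent <- nodes %/% 2          # the root (node 1) gets 0, meaning no parent
depth  <- floor(log2(nodes))   # the root is at depth 0

print(data.frame(node = nodes, parent = parent, depth = depth))
```

For example, node 59 has parent 29, which in turn has parent 14, matching the sequence of Age and Education tests in the listing.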
Next is listed the command line call to the rpart function:
Classification tree:
rpart(formula = Adjusted ~ ., data = crs$dataset[crs$sample,
c(2:10, 13)], method = "class")
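
The crs object in this call belongs to the session that produced the output and is not available here. As a rough sketch of the same kind of call, the example below fits a classification tree to the kyphosis data that ships with the rpart package. That dataset is a stand-in chosen for illustration, not the audit data, so the resulting tree will differ from the one discussed on this page.

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart; it is NOT the audit dataset.
# As in the call above, method = "class" requests a classification tree.
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

print(fit)     # node listing: node), split, n, loss, yval, (yprob)
printcp(fit)   # call, variables used, root node error, complexity table
```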
Variables actually used in tree construction:
[1] Age Deductions Education Marital Occupation

Root node error: 335/1400 = 0.23929

n= 1400
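The complexity table is useful. For each value of the complexity
parameter (CP) it lists the number of splits in the corresponding tree,
the error relative to the root node error (rel error), and the
cross-validated error (xerror) with its standard error (xstd):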
        CP nsplit rel error  xerror     xstd
1 0.137313      0   1.00000 1.00000 0.047653
2 0.026866      2   0.72537 0.75821 0.043043
3 0.023881      4   0.67164 0.79403 0.043817
4 0.010000      6   0.62388 0.77910 0.043498
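
One way to read the table is sketched below, using only the values printed above. The errors are relative to the root node error, so multiplying by 335/1400 converts them to absolute rates. The one-standard-error rule used at the end is a commonly applied heuristic for choosing a tree size and is an assumption of this sketch, not something the output itself prescribes.

```r
# Complexity table copied from the output.
cp <- data.frame(
  CP        = c(0.137313, 0.026866, 0.023881, 0.010000),
  nsplit    = c(0, 2, 4, 6),
  rel_error = c(1.00000, 0.72537, 0.67164, 0.62388),
  xerror    = c(1.00000, 0.75821, 0.79403, 0.77910),
  xstd      = c(0.047653, 0.043043, 0.043817, 0.043498)
)

# Scale the relative cross-validated error to an absolute error rate.
root_error    <- 335 / 1400
cp$abs_xerror <- cp$xerror * root_error

# Row with the lowest cross-validated error.
best <- which.min(cp$xerror)

# One-standard-error rule (assumed heuristic): the smallest tree whose
# xerror is within one xstd of the minimum.
threshold <- cp$xerror[best] + cp$xstd[best]
one_se    <- min(which(cp$xerror <= threshold))

print(cp[best, ])
print(cp$CP[one_se])
```

Here both criteria point to the same row, the tree with 2 splits: the training error (rel error) keeps falling as splits are added, but the cross-validated error does not improve beyond that point.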
Finally, we get to see how long it took to build the tree:
Time taken: 0.14 secs
Copyright © Togaware Pty Ltd