Data Mining Survivor: Tuning_Parameters

DATA MINING
Desktop Survival Guide
by Graham Williams

Complexity (cp)

Note that pruning is a mechanism for reducing the variance of the resulting models. For large datasets, however, this reduction in variance is usually of little benefit, so unpruned trees may actually perform better.

The cp argument governs the minimum complexity benefit that must be gained at each step in order to make a split worthwhile. The default is 0.01.

The complexity parameter (cp) is used to control the size of the decision tree and to select the optimal tree size. If the cost of adding another variable to the decision tree from the current node is above the value of cp, then tree building does not continue. We could also say that tree construction does not continue unless it would decrease the overall lack of fit by a factor of cp.
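As a rough illustration of how cp limits tree growth, the sketch below fits the iris data at a few cp values and counts the splits in the resulting tree. The helper name count_splits and the particular cp values are assumptions chosen for illustration; they do not come from the text above.

```r
library(rpart)

# Hypothetical helper: fit a classification tree at a given cp and
# return the number of splits in the largest tree rpart retained.
count_splits <- function(cp) {
  fit <- rpart(Species ~ ., data = iris, method = "class",
               control = rpart.control(cp = cp))
  max(fit$cptable[, "nsplit"])
}

# Illustrative cp values, from strict to permissive.
cps <- c(0.45, 0.01, 0)
splits <- sapply(cps, count_splits)

print(data.frame(cp = cps, nsplit = splits))
```

Smaller values of cp admit splits that bring smaller improvements, so the split count should not shrink as cp decreases.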

Setting this to zero will build a tree to its maximum depth (and perhaps will build a very, very large tree). This is useful if you want to look at the values of CP for various tree sizes. This information will be in the text view window. You will look for the number of splits where the sum of xerror (the cross validation error, relative to the root node error) and xstd is at its minimum. This is usually early in the list.
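The selection rule just described can be sketched in R as follows. This is a minimal sketch, assuming the iris data and an arbitrary seed; the names full, tab, best, best_cp and pruned are illustrative only. It reads the complexity table from the fitted object, picks the row where xerror + xstd is smallest, and prunes back to that row's CP.

```r
library(rpart)

set.seed(42)  # arbitrary seed so the cross-validation folds are repeatable

# Grow the full tree by setting the complexity parameter to zero.
full <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(cp = 0))

# cptable holds the same columns that printcp() reports.
tab <- full$cptable

# Rule from the text: pick the row where xerror + xstd is smallest.
best <- which.min(tab[, "xerror"] + tab[, "xstd"])
best_cp <- tab[best, "CP"]

# Prune the full tree back to the chosen complexity.
pruned <- prune(full, cp = best_cp)
printcp(pruned)
```

Because xerror and xstd come from cross validation, the chosen row can change with the seed.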

The "Root node error: 67/105 = 0.6381" in R is a baseline error rate (i.e., the error we get if we classified everything as setosa).

The table following this message then expresses the decrease in error relative to this baseline error:

       CP nsplit rel error   xerror     xstd
1 0.50746      0  1.000000 1.104478 0.069763
2 0.43284      1  0.492537 0.731343 0.076300
3 0.01000      2  0.059701 0.089552 0.035500

Here we see that for the first row (i.e., if we had built only a root node tree) the cross validation error is 1.104 (i.e., worse than the baseline). Then, as we split once and then split again, we end up with a relative xerror of 0.089552 or, in the same terms as SPSS, I think this would be 0.6381*0.089552, or 0.057. The reported xstd of 0.0355 is on the same relative scale as xerror.
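The conversion from relative to absolute error is a single multiplication, sketched below with the values copied from the table above. Rescaling xstd by the same factor is an assumption made here for illustration, on the basis that xstd is reported on the same relative scale as xerror.

```r
# Values copied from the printcp() output quoted above.
root_error <- 67 / 105   # baseline (root node) error rate
rel_xerror <- 0.089552   # cross-validated error, relative to the root
rel_xstd   <- 0.035500   # its standard error, on the same relative scale

# Multiply by the root node error to obtain absolute error rates.
abs_xerror <- root_error * rel_xerror
abs_xstd   <- root_error * rel_xstd   # assumption: xstd rescales the same way

print(round(c(abs_xerror = abs_xerror, abs_xstd = abs_xstd), 4))
```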

Randomness is used in generating the cross validation errors. On separate runs of rpart with exactly the same settings the relative errors are consistent, because there is no random sampling in that calculation. However, xerror and xstd will vary unless we set the random number seed to the same value each time.

library(rpart)
data(iris)
set.seed(123)
iris.rp <- rpart(Species ~ ., method="class", data=iris,
                 control=rpart.control(minsplit=4, cp=0.00001))
printcp(iris.rp)

       CP nsplit rel error xerror     xstd
1 0.50000      0      1.00   1.21 0.048367
2 0.44000      1      0.50   0.76 0.061232
3 0.02000      2      0.06   0.10 0.030551
4 0.01000      3      0.04   0.11 0.031927
5 0.00001      5      0.02   0.10 0.030551

set.seed(1234)
iris.rp <- rpart(Species ~ ., method="class", data=iris,
                 control=rpart.control(minsplit=4, cp=0.00001))
printcp(iris.rp)

       CP nsplit rel error xerror     xstd
1 0.50000      0      1.00   1.19 0.049592
2 0.44000      1      0.50   0.70 0.061101
3 0.02000      2      0.06   0.09 0.029086
4 0.01000      3      0.04   0.09 0.029086
5 0.00001      5      0.02   0.09 0.029086

So we can only accurately compare when we are sure we are using the same random number sequence. Rattle specifically sets the seed each time so that a user will not be puzzled about slightly different results every time they build the tree.
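One way to check this behaviour programmatically is sketched below. The helper cp_table is hypothetical; it refits the tree with the same settings as the runs above under a given seed and returns the complexity table, so that repeated runs can be compared directly.

```r
library(rpart)

# Hypothetical helper: fit the tree under a given seed and return
# the complexity table that printcp() would display.
cp_table <- function(seed) {
  set.seed(seed)
  fit <- rpart(Species ~ ., method = "class", data = iris,
               control = rpart.control(minsplit = 4, cp = 0.00001))
  fit$cptable
}

same_a <- cp_table(123)
same_b <- cp_table(123)
other  <- cp_table(1234)

# Same seed: the whole table, including xerror and xstd, should match.
print(isTRUE(all.equal(same_a, same_b)))

# Different seed: the rel error column should still match, since it
# involves no random sampling.
print(isTRUE(all.equal(same_a[, "rel error"], other[, "rel error"])))
```

Only the xerror and xstd columns are expected to differ between seeds, which is why fixing the seed is enough to make two runs comparable.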

Brought to you by Togaware. This page generated: Sunday, 22 August 2010