Desktop Survival Guide
by Graham Williams
The minbucket argument is the minimum number of observations allowed in any terminal (leaf) node.
The two arguments minbucket and minsplit are closely related. In rpart, if only one of them is specified then by default the other is calculated from it: minbucket as round(minsplit/3), or minsplit as 3 * minbucket.
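The relationship can be checked by calling rpart.control on its own and inspecting the list it returns. This is a minimal sketch; the variable names are only illustrative.

```r
library(rpart)

# Neither argument given: the defaults apply.
ctl_default <- rpart.control()
ctl_default$minsplit    # 20
ctl_default$minbucket   # 7, that is round(20/3)

# Only minbucket given: minsplit is derived as 3 * minbucket.
ctl_bucket <- rpart.control(minbucket = 100)
ctl_bucket$minsplit     # 300

# Only minsplit given: minbucket is derived as round(minsplit/3).
ctl_split <- rpart.control(minsplit = 30)
ctl_split$minbucket     # 10
```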
Using rpart directly, we specify minbucket within an argument called control, which takes the result of a call to the function rpart.control. In this example we set minbucket to 100:
> library(rpart)
> audit <- read.csv(url("http://rattle.togaware.com/audit.csv"))
> audit.rpart <- rpart(TARGET_Adjusted ~ Age + Marital + Occupation + Deductions,
+                      data=audit, method="class",
+                      control=rpart.control(minbucket=100))
> audit.rpart
Changing minbucket can result in different variables being chosen at different nodes. Compare the tree obtained with the command above (with minbucket set to 100) to the result when minbucket is set to 10. Note how node 7 was originally split using Age, but with the minimum bucket size set to 10 the node is split on Deductions. We can see why: the resulting node 15 has only 30 entities, fewer than the 100 required in the first tree, so that split was not available there.
[...] control=rpart.control(minbucket=100))
[...]
 7) Occupation=Clerical [...] 516 207 1 (0.40116279 0.59883721)
  14) Age< 36.5 151 72 0 (0.52317881 0.47682119) *
  15) Age>=36.5 365 128 1 (0.35068493 0.64931507) *

[...] control=rpart.control(minbucket=10))
[...]
 7) Occupation=Clerical [...] 516 207 1 (0.40116279 0.59883721)
  14) Deductions< 1299.833 486 207 1 (0.42592593 0.57407407)
  [...]
  15) Deductions>=1299.833 30 0 1 (0.00000000 1.00000000) *
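The same kind of comparison can be tried without downloading anything. The sketch below uses the kyphosis data that ships with rpart in place of the audit data, so this is an assumption of the example rather than part of the original analysis. The helper node_summary is hypothetical; it simply lists each node's number, split variable and size so that two fits can be read side by side.

```r
library(rpart)

# Two fits that differ only in minbucket (kyphosis has 81 rows).
fit_large <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minbucket = 20))
fit_small <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minbucket = 5))

# Hypothetical helper: one row per node, taken from the fitted frame.
# Leaves are marked "<leaf>" in the var column.
node_summary <- function(fit) {
  data.frame(node = as.integer(rownames(fit$frame)),
             var  = as.character(fit$frame$var),
             n    = fit$frame$n)
}

print(node_summary(fit_large))
print(node_summary(fit_small))
```

Reading the two summaries together shows which nodes, if any, change their split variable once smaller buckets are permitted.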
Whilst the default is to set minbucket to one third of minsplit, there is no requirement for minbucket to be less than minsplit. A node will always have at least minbucket entities, and it will be considered for splitting if it has at least minsplit entities and, on splitting, each of its children has at least minbucket entities.
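As a rough check of this rule, the following sketch sets minbucket larger than minsplit and then inspects the node sizes of the fitted tree. It again assumes the kyphosis data bundled with rpart (which has no missing values and is fitted here without case weights); the variable names are illustrative only.

```r
library(rpart)

# minbucket (20) is deliberately larger than minsplit (10).
ctl <- rpart.control(minsplit = 10, minbucket = 20)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = ctl)

is_leaf        <- fit$frame$var == "<leaf>"
leaf_sizes     <- fit$frame$n[is_leaf]
internal_sizes <- fit$frame$n[!is_leaf]

# Every leaf should hold at least minbucket entities.
all(leaf_sizes >= ctl$minbucket)

# A node that was split has two children of at least minbucket each,
# so it must itself hold at least 2 * minbucket entities.
all(internal_sizes >= 2 * ctl$minbucket)
```

Both expressions should evaluate to TRUE. In this setting it is effectively 2 * minbucket, rather than minsplit, that decides whether a node can be split.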
Copyright © Togaware Pty Ltd