Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Model Tuning

What is the right value to use for each of the variables of the model building algorithms that we use in data mining? The variable settings can make the difference between a good and a poor model.

The package caret, as well as providing a unified interface to many of the model builders we have covered in this book, also supports parameter tuning. Here are a couple of examples:



> library(rattle)
> library(caret)
> data(audit)
> mysample <- sample(nrow(audit), 1400)
> myrpart <- train(audit[mysample, c(2,4:5,7:10)], 
                   as.factor(audit[mysample, c(13)]), "rpart")
Model 1: maxdepth=6
 collapsing over other values of maxdepth
> myrpart
Call:
train.default(x = audit[mysample, c(2, 4:5, 7:10)], y = as.factor(audit[mysample, 
    c(13)]), method = "rpart")

1400 samples, 7 predictors

largest class: 77.71% (0)

summary of bootstrap (25 reps) sample sizes:
    1400, 1400, 1400, 1400, 1400, 1400, ... 

boot resampled training results across tuning parameters:

  maxdepth  Accuracy  Kappa  Accuracy SD  Kappa SD  Optimal
  2         0.817     0.423  0.0142       0.0386           
  3         0.818     0.413  0.0171       0.0617    *      
  6         0.814     0.412  0.019        0.0488           

Accuracy was used to select the optimal model
> myrpart$finalModel
n= 1400 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 1400 312 0 (0.77714286 0.22285714)  
   2) Marital=Absent,Divorced,Married-spouse-absent,Unmarried,Widowed 773  38 0 (0.95084088 0.04915912) *
   3) Marital=Married 627 274 0 (0.56299841 0.43700159)  
     6) Education=College,HSgrad,Preschool,Vocational,Yr10,Yr11,Yr12,Yr1t4,Yr5t6,Yr7t8,Yr9 409 129 0 (0.68459658 0.31540342)  
      12) Deductions< 1708 400 120 0 (0.70000000 0.30000000) *
      13) Deductions>=1708 9   0 1 (0.00000000 1.00000000) *
     7) Education=Associate,Bachelor,Doctorate,Master,Professional 218  73 1 (0.33486239 0.66513761) *

Similarly, we can replace rpart with rf to have train build and tune a random forest.
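A minimal sketch follows (output omitted; train will then search over mtry, the number of variables considered at each split, and note that random forests do not accept missing values, so the chosen predictor columns may first need to be passed through na.omit):

> library(randomForest)
> myrf <- train(audit[mysample, c(2,4:5,7:10)], 
                as.factor(audit[mysample, c(13)]), "rf")
> myrf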

The tune function from the e1071 package provides a simple, if sometimes computationally expensive, approach to finding good values for a collection of tuning variables. We explore the use of this function here.

The tune function provides a number of control variables (set through tune.control) that affect how the tuning proceeds. The nrepeat variable (number of repeats) specifies how often the training should be repeated. The repeat.aggregate variable identifies a function that combines the training results over the repeated runs. The sampling variable identifies the sampling scheme to use, allowing for cross-validation, bootstrapping, or a simple train/test split. For each sampling scheme further variables are supplied, including, for example, cross = 10 to make the cross-validation 10-fold. The sampling.aggregate variable specifies a function to combine the training results over the various training samples. A good default (provided by tune) is to train once with 10-fold cross-validation.
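As a minimal sketch of tune in action, the following tunes a support vector machine over a small grid of cost and gamma values with 10-fold cross-validation. The grid, the renaming of the target column, and the use of na.omit are illustrative choices only:

> library(e1071)
> myaudit <- audit[mysample, c(2,4:5,7:10,13)]
> names(myaudit)[8] <- "Target"   # avoid relying on the target column's actual name
> myaudit$Target <- as.factor(myaudit$Target)
> mytune <- tune(svm, Target ~ ., data=na.omit(myaudit),
                 ranges=list(cost=2^(0:2), gamma=2^(-2:0)),
                 tunecontrol=tune.control(sampling="cross", cross=10))
> summary(mytune)
> mytune$best.parameters

The summary reports the cross-validated error for each combination of cost and gamma, and best.parameters identifies the combination with the lowest error.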


