Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Training and Test Datasets

In modelling we often build a model on a training set and then evaluate its performance on a separate test set. The simplest way to partition a dataset into training and test sets is with the sample function. Here 80% of the rows of iris are chosen at random for training and the remaining rows are kept for testing:

> sub <- sample(nrow(iris), floor(nrow(iris) * 0.8))
> iris.train <- iris[sub, ]
> iris.test <- iris[-sub, ]

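The first argument to sample is the top of the range of integers to choose from (here the number of rows, so the integers 1 to 150), and the second is the number to choose. By default sample draws without replacement, so no row is chosen twice. The negative index -sub then selects every row that was not chosen for training.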
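Because sample draws at random, the partition differs on every run. A minimal sketch of a repeatable version follows, assuming a fixed seed is acceptable for your purposes. The seed value 42 and the names n and train.size are arbitrary choices made for illustration.

```r
# Fix the random seed so the same partition is produced on each run.
set.seed(42)  # arbitrary value, chosen for illustration

n <- nrow(iris)                  # number of observations
train.size <- floor(n * 0.8)     # 80% of the rows go to training
sub <- sample(n, train.size)     # row indices, drawn without replacement

iris.train <- iris[sub, ]
iris.test  <- iris[-sub, ]       # every row not selected for training

# Sanity checks: the subsets are disjoint and together cover the dataset.
nrow(iris.train)
nrow(iris.test)
length(intersect(rownames(iris.train), rownames(iris.test)))
```

Calling set.seed with the same value before sampling reproduces the same sub vector, which is useful when results need to be compared across runs.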

The sample.split function of the caTools package also comes in handy here. It returns a logical mask that splits a vector into two subsets, by default two thirds in one and one third in the other (the SplitRatio argument controls this fraction), maintaining the relative ratio of the different categoric values represented in the vector:

> library(caTools)
> mask <- sample.split(iris$Species)
> mask
  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[...]
[145]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
> table(iris$Species)

    setosa versicolor  virginica
        50         50         50
> table(iris$Species[mask])

    setosa versicolor  virginica
        33         33         33
> table(iris$Species[!mask])

    setosa versicolor  virginica
        17         17         17
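To make the idea of a stratified split concrete, the following base R sketch builds a similar logical mask by hand and uses it to partition the data frame itself. It is an illustration only and is not the caTools implementation. The helper name stratified.mask, its ratio argument, and the seed value are hypothetical choices made for this example.

```r
set.seed(42)  # arbitrary seed, so the split is repeatable

# Hypothetical helper (not part of caTools): return a logical mask that
# selects roughly `ratio` of the elements within each distinct value of y.
stratified.mask <- function(y, ratio = 2/3) {
  mask <- logical(length(y))                 # all FALSE to begin with
  for (level in unique(as.character(y))) {
    idx <- which(y == level)                 # positions of this class
    n <- round(length(idx) * ratio)          # number to select from it
    chosen <- idx[sample.int(length(idx), n)]
    mask[chosen] <- TRUE
  }
  mask
}

mask <- stratified.mask(iris$Species)

# The mask indexes rows of the data frame, not just the Species vector.
iris.train <- iris[mask, ]
iris.test  <- iris[!mask, ]

table(iris.train$Species)
table(iris.test$Species)
```

Sampling within each class separately is what keeps the class proportions the same in both subsets. A plain call to sample over all rows only approximates those proportions.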



Copyright © 2004-2010 Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010