Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google

Sampling

A traditional approach to solving class imbalance, and one that works well in many modelling situations, but not all, is to sample your data. Sampling aims to remove or at least redress the balance. It is a data preprocessing step whereby the algorithm used by the model builder does not generally need to be modified. Because of this, the approach is readily applicable (but not necessarily appropriate) to any model builder.

MENTION APPROACHES AND FOR EACH ILLUSTRATE HOW TO DO IT IN R.

Random undersampling will randomly choose a subset of the over represented class (or classes) to approach the same number as the underpresented class (or classes) for inclusion in the training dataset.

Random oversampling will randomly duplicate records from the under-represented class (or classes) for inclusion in the training dataset.

The synthetic minority oversampling technique (SMOTE).

Cluster-based oversampling.

One-sided selection.

Wilson's editing.



Copyright © 2004-2010 Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010