DATA MINING
Desktop Survival Guide by Graham Williams
The task of classification is at the heart of data mining! Most of what we learn from a traditional data mining course focuses on the algorithms from machine learning and statistics that build classification models. These models can then be used to classify new entities. The actual structure of the model also gives us insight into the relationships between the variables that are important in differentiating the classes.
This chapter focuses on this common data mining task of classification and prediction. We consider binary (or two class) classification, but the concepts also apply to multi-class classification.
The chapter begins by introducing a framework within which we understand model building. We then review risk charts as a mechanism for evaluating two-class models. Whilst a separate chapter (see Chapter ) covers evaluation in detail, we present the concept of risk charts here so that we can explore and compare the performance of the models we build as we introduce the different model builders.
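To make the idea of a risk chart a little more concrete before the detailed treatment, the following is a minimal sketch in Python rather than Rattle's own implementation. It assumes that a risk chart ranks cases from the highest to the lowest model score and reports, for a given caseload (the fraction of cases reviewed), the proportion of all positive cases captured. The function name `cumulative_recall`, the caseload levels, and the scores and labels are all hypothetical and chosen only for illustration.

```python
def cumulative_recall(scores, labels, caseloads=(0.1, 0.2, 0.5, 1.0)):
    """Return (caseload, recall) pairs for a scored two-class problem.

    scores: model scores, higher meaning more likely to be positive.
    labels: 1 for a positive case, 0 for a negative case.
    """
    # Rank the cases from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(label for _, label in ranked)
    if total_positives == 0:
        raise ValueError("need at least one positive case")

    points = []
    for fraction in caseloads:
        # Number of top-ranked cases reviewed at this caseload.
        n = round(fraction * len(ranked))
        captured = sum(label for _, label in ranked[:n])
        points.append((fraction, captured / total_positives))
    return points


# Hypothetical scores and outcomes for ten cases (illustration only).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for caseload, recall in cumulative_recall(scores, labels):
    print(f"caseload {caseload:.0%}: {recall:.0%} of positives captured")
```

Plotting the pairs with caseload on the horizontal axis gives a curve of the kind used to compare models: a model whose curve rises more steeply captures more of the positive cases for the same amount of work.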
Each of the model builders supported by Rattle is then introduced. The model builders focus on binary (two-class) classification, where the aim is to distinguish between two classes of entities. Such problems abound, and the two classes might, for example, distinguish high risk and low risk insurance clients, productive and unproductive taxation audits, responsive and non-responsive customers, successful and unsuccessful security breaches, and many other similar examples.
Rattle provides a straightforward interface to the collection of model builders commonly used in data mining for binary classification. For each, a basic collection of tuning parameters is exposed through the interface for fine tuning the model building process. Where possible, Rattle attempts to present good default values so that the user can build a model with little or no tuning. This may not always be the right approach, but it is certainly a good place to start.
The two class model builders provided by Rattle are: Decision Trees, Boosted Decision Trees, Random Forests, Support Vector Machines, and Logistic Regression.
We will consider each of the model builders deployed in Rattle and characterise them through the types of models they generate and how the model building algorithms search for the best model that captures or summarises what the data is indicating.
Whilst a model is being built you will see the cursor image change to indicate the system is busy, and the status bar will report that a model is being built.