DATA MINING
Desktop Survival Guide
by Graham Williams

A Framework for Modelling

Architects build models. Why? To see how things fit together, to make sure they do fit together, to see how things will work in the real world, and even to sell the idea behind the model they build! Data mining is about building models that give us insights into the world and how the world works. But even more than that, our models are often useful to give us guidance in how to deal with and interact with the real world!

Building models is fundamental to understanding our world. When we build a model, whether it be with Lego bricks or computer software, we get a new perspective on how things fit together or interact. Once we have some basic models we can start to get ideas about more complex models, building on what has come before.

In understanding a new complex idea we often begin by trying to map the idea onto concepts or constructs that we already know, bringing those constructs together in different ways that reflect how we understand the new idea. As we learn more about the idea we change our model to better reflect it, until eventually we have a model that matches the idea closely enough for us to make good use of our understanding.

And so it is with model building in computer science. Indeed, writing a computer program is essentially about building a model.

There are three components to building a model: how do we represent the knowledge (the language for building models); how do we search through all the possible ways of building the model (sentences in the language); and how do we know when we have a good model (measurement). In all of the model building that we are going to talk about in this book, we will use this framework to present each approach and to contrast it with the alternatives.
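The three components can be sketched in a few lines of Python. This is only an illustration under assumptions of our own, not one of the model builders described in this book: the language is restricted to sentences of the form "predict 1 when x is greater than a threshold", the search simply tries a threshold between each pair of neighbouring observed values, and the measure is plain accuracy. The names ThresholdRule, accuracy, build_model and toy_data are hypothetical, and the data is made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Representation: the "language" is every sentence of the form
# "predict 1 when x > threshold, otherwise predict 0" (hypothetical).
@dataclass(frozen=True)
class ThresholdRule:
    threshold: float

    def predict(self, x: float) -> int:
        return 1 if x > self.threshold else 0


# Measurement: the fraction of observations the rule classifies correctly.
def accuracy(rule: ThresholdRule, data: List[Tuple[float, int]]) -> float:
    correct = sum(1 for x, y in data if rule.predict(x) == y)
    return correct / len(data)


# Search: try one candidate threshold midway between each pair of
# neighbouring observed values and keep the best-scoring sentence.
def build_model(data: List[Tuple[float, int]]) -> ThresholdRule:
    xs = sorted({x for x, _ in data})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    if not candidates:
        raise ValueError("need at least two distinct input values")
    return max((ThresholdRule(t) for t in candidates),
               key=lambda rule: accuracy(rule, data))


# Made-up observations: (input value, class label).
toy_data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
model = build_model(toy_data)
print(model, accuracy(model, toy_data))
```

Changing any one of the three pieces (a richer language, a smarter search, or a different measure) gives a different model builder, which is the sense in which the framework lets us contrast approaches.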

In this section we present a framework within which we cast the task of data mining, the task being model building. We refer to an algorithm for building a model as a model builder. Rattle supports a number of model builders, including decision tree induction, boosted decision trees, random forests, support vector machines, logistic regression, k-means, and association rules. In essence, the model builders differ in how they represent the models they build (i.e., the discovered knowledge) and how they find (or search for) the best model within this representation.

We can think of the discovered knowledge, or the model, as being expressed as sentences in a language. We are familiar with the fact that we express ourselves using sentences in our own specific human languages (whether that be English, French, or Chinese, for example). As we know, there is an infinite number of sentences that we can construct in our human languages.

The situation is similar for the "sentences" we construct using model builders: there is generally an infinite number of possible sentences. In human language we are generally very skilled at choosing, from this infinite number of possibilities, the sentences that best represent what we would like to communicate. And so it is with model building. The skill is to express, within the chosen language, the sentences that best capture what it is we are attempting to model.

We now formally present this general framework. The following sections then present model builders for various tasks in the context of this framework.

Todo: Framework goes here.

Brought to you by Togaware. This page generated: Sunday, 22 August 2010