Desktop Survival Guide
by Graham Williams
Todo: This chapter is under development. It will discuss the business problem, and how we might go about solving it and where we might get some data to help us out.
The first task for any data mining project is to identify the business problem, making sure it is a real problem requiring a solution, and that it is feasible to tackle the problem with data mining.
Chapter 2 has given a hint as to the kind of problem we might be tackling here. Chances are, of course, that it has something to do with the weather. In Chapter 2 we used the sample weather dataset that is provided with Rattle to build our first data mining model. We also briefly explored the data.
The "business" problem we talked about in Chapter 2 was a rather simplistic one: helping us decide whether to take an umbrella with us tomorrow. We would classify this as a "toy" problem, particularly noting that we had quite a tiny dataset to build the model from.
[Historic Note:] It is rather interesting to question what constitutes tiny, small, large, huge, and enormous datasets. More importantly, how much data do we need for data mining? When I was doing my PhD back in the 1980s, I was building decision trees from just 106 observations. It was enough for the theoretical explorations which led to the core idea of the thesis, that of building multiple decision trees, today referred to as ensemble learning. Today, a data mining researcher talks of millions of observations and hundreds of variables!
Note that we are not proposing that data mining is the best approach, nor that our approach is a valid analysis of weather data. Indeed, there are many statistical approaches to analysing time-series weather data that may give better analytical results.
We use the weather dataset for this book primarily because it is a readily available, public dataset of a reasonable size. Most large datasets which are used for real data mining are carefully guarded for privacy reasons, or for commercial reasons. Whilst this is unfortunate from an educational perspective, the weather data has been found to exhibit many aspects we find in other datasets used for data mining.
Weather provides a good generic dataset. Everyone knows about the weather, but perhaps we might think it has little real bearing on ourselves. Of course, though, weather has great impact on humanity. The weather impacts on our transport systems, building and construction industries, agriculture, and even entertainment. Even a few days of light rain may ruin your day and, if not planned for, could cost millions of dollars in damage and lost productivity.
In that context, this chapter presents potential business problems that we may consider for the weather dataset. We discuss a number of issues relating to the business problems, particularly surrounding the data we might use for the data mining task.