DATA MINING
Desktop Survival Guide
by Graham Williams

Loading the File

As we saw in Chapter 2, Rattle loads the supplied sample data file (weather.csv) if no other data file is specified through the Filename button. This is the simplest way to load some data into Rattle, at least while learning the Rattle interface.

After identifying the file to load we need to remember to click the Execute button to actually load the dataset into Rattle. The main text panel of the Data tab changes to list the variables, together with their types and roles, and some other useful information (Figure 5.1).

Once the data has been loaded from the file into Rattle, where it becomes a dataset, we can begin to explore it. The top of the dataset can be viewed in the R Console, as we also saw in Chapter 2. Here we limit the display to the first five columns and request just six observations.

> head(crs$dataset[1:5], 6)

        Date Location MinTemp MaxTemp Rainfall
1 2007-11-01 Canberra     8.0    24.3      0.0
2 2007-11-02 Canberra    14.0    26.9      3.6
3 2007-11-03 Canberra    13.7    23.4      3.6
4 2007-11-04 Canberra    13.3    15.5     39.8
5 2007-11-05 Canberra     7.6    16.1      2.8
6 2007-11-06 Canberra     6.2    16.9      0.0
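
The indexing used in the head command above can be tried on any data frame. The following is a minimal sketch, assuming a small made-up data frame (the name toy and all of its values are hypothetical and are not the weather data):

```r
# A hypothetical data frame standing in for crs$dataset.
toy <- data.frame(id    = 1:10,
                  city  = rep("Canberra", 10),
                  temp  = seq(10, 28, by = 2),
                  rain  = rep(c(0, 3.6), 5),
                  wind  = 11:20,
                  cloud = 0:9)

# With a single index, [ ] selects columns of a data frame, so toy[1:5]
# keeps the first five columns; head(..., 6) then keeps the first six rows.
top <- head(toy[1:5], 6)

dim(top)    # 6 rows and 5 columns
names(top)  # "id" "city" "temp" "rain" "wind"
```

Limiting the columns before calling head keeps the console output narrow enough to read when a dataset has many variables.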

The first thing to note here is the rather stylised name for the dataset, crs$dataset. To understand it we need to look at how Rattle handles its internal variables. Rattle uses a special type of variable, called an environment, to store its internal state. An environment can be thought of as a container of other variables. The crs$dataset notation refers to a variable called dataset within the crs environment.
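
The idea of an environment as a container can be illustrated directly in R. This is only a sketch: the environment name store and its contents are hypothetical, and it does not reproduce what Rattle itself keeps in crs.

```r
# Create an empty environment; 'store' is a hypothetical name.
store <- new.env()

# Variables inside an environment are created and read with the $ notation,
# in the same way that crs$dataset refers to dataset within crs.
store$dataset <- data.frame(x = 1:3, y = c("a", "b", "c"))
store$note    <- "toy example"

ls(store)                              # "dataset" "note"
nrow(store$dataset)                    # 3
exists("dataset", envir = store)       # TRUE
```

Grouping related variables in one environment keeps them out of the user's workspace while still making them easy to inspect from the R Console.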

Loading data into Rattle from a CSV file uses the read.csv function underneath. We can confirm this by reviewing the contents of the Log tab, where we will see something like:

> crs$dataset <- read.csv("file:.../weather.csv", na.strings=c(".", "NA", "", "?"))

The full path to the weather.csv file is truncated here for brevity.

There are two things to note. The first is that the dataset is loaded into the variable we mentioned above, called crs$dataset. The second is that, by default, Rattle treats any one of four strings as representing missing values (NAs in R). This captures the most common ways of representing missing values. SAS, for example, uses the dot (".") to denote missing values, and R uses the special string "NA". Other applications simply use the empty string, whilst yet others (including machine learning applications like C4.5) use the question mark ("?").
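
The effect of the na.strings argument can be seen on a small example. The sketch below assumes a made-up CSV held in a string (csv_text and its contents are hypothetical), with each row using a different convention for a missing value:

```r
# A small hypothetical CSV; rows 2 to 5 each mark a missing value differently.
csv_text <- "id,value
1,10
2,.
3,NA
4,
5,?"

# The same na.strings setting as in the Rattle log entry.
ds <- read.csv(text = csv_text, na.strings = c(".", "NA", "", "?"))

is.na(ds$value)       # FALSE TRUE TRUE TRUE TRUE
is.numeric(ds$value)  # TRUE: the column is still read as a number

# With the default setting only "NA" is recognised, so "." and "?" are
# kept as text and the column is no longer numeric.
ds_default <- read.csv(text = csv_text)
is.numeric(ds_default$value)  # FALSE
```

Listing all four strings up front means a numeric column is not silently turned into text just because one file marks its gaps with a dot or a question mark.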

A call to read.csv does not usually need to be so complex. If we have a CSV file to load into R, perhaps called mydata.csv, we can usually just type the command:

> ds <- read.csv("mydata.csv")
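
To try the simple form without an existing file, we can first write a small CSV to a temporary location. This is a sketch under that assumption: the file is created by the example itself, and its three lines are illustrative only.

```r
# Write a small hypothetical CSV to a temporary file so the example
# does not depend on any file already being on disk.
path <- tempfile(fileext = ".csv")
writeLines(c("Date,Location,MinTemp",
             "2007-11-01,Canberra,8.0",
             "2007-11-02,Canberra,14.0"),
           path)

# The simple call: the first line is taken as the header and the
# comma as the field separator.
ds <- read.csv(path)

str(ds)     # column names and the types R inferred
nrow(ds)    # 2

unlink(path)  # remove the temporary file
```

Checking the result with str straight after loading is a quick way to confirm that each column was read with the expected type.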

We can also load data directly from the Internet. For example, the weather dataset, as a CSV file, is available from http://rattle.togaware.com/weather.csv:

> ds <- read.csv("http://rattle.togaware.com/weather.csv")
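
A remote file may be unavailable, so it can be useful to guard the call. The helper below is a hypothetical sketch (read_csv_or_null is not part of R or Rattle) that returns NULL instead of stopping when the source cannot be read. It is demonstrated here on a local path that does not exist, so that the example runs without a network connection.

```r
# Hypothetical helper: returns the data frame, or NULL if the source
# (a file path or a URL) cannot be read.
read_csv_or_null <- function(source) {
  tryCatch(read.csv(source),
           error = function(e) {
             message("Could not read ", source, ": ", conditionMessage(e))
             NULL
           })
}

# A path that does not exist stands in for an unreachable source.
# R may also print its own warning about the file before the error is caught.
missing_path <- tempfile(fileext = ".csv")
ds <- read_csv_or_null(missing_path)

if (is.null(ds)) {
  message("No data loaded.")
} else {
  print(dim(ds))
}
```

The same helper can be given the URL above in place of the local path, in which case it should return the weather dataset when the site is reachable and NULL otherwise.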

Brought to you by Togaware. This page generated: Sunday, 22 August 2010