Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

R Dataset

Loading data from a data file (such as a .csv or .txt file) or directly from a database (through ODBC) is a convenience provided by Rattle. However, R supports many more options for importing data from a variety of sources.

Rattle can use any data frame loaded into R as a dataset to be mined. When we choose the R Dataset option of the Data tab, the Data Name box lists each of the available data frames that can be brought into Rattle as a dataset.
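To see from the R Console which objects would qualify, we can list the data frames in the current session ourselves. The following is a minimal sketch of that idea, not Rattle's own implementation: the helper name list_data_frames and the two sample objects are assumptions made for illustration.

```r
# Two illustrative objects: one data frame and one plain vector.
sample_frame  <- data.frame(Expense = c(19.5, -15.0), Total = c(19.5, 4.5))
sample_vector <- c(1, 2, 3)

# Return the names of all objects in an environment that are data frames.
# (Hypothetical helper; by default it inspects the global environment.)
list_data_frames <- function(envir = globalenv()) {
  object_names <- ls(envir = envir)
  is_df <- vapply(object_names,
                  function(name) is.data.frame(get(name, envir = envir)),
                  logical(1))
  object_names[is_df]
}

# Only data frames are reported; the plain vector is skipped.
print(list_data_frames())
```

Any object that does not appear in this list is not a data frame, and would need to be converted (for example with as.data.frame) before it could be offered as an R Dataset.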

R is very flexible about where it obtains its data, and data from almost any source can be loaded. We have covered the more common sources, which are directly supported by the Rattle interface.

As we note above, we can access within Rattle any dataset (technically, any data frame) that has been loaded into R. Consequently, Rattle is able to access the same variety of sources as are available through R.

Using the foreign package, R can read SPSS datasets (read.spss), SAS XPORT format datasets (read.xport), and DBF database files (read.dbf).
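As a small illustration of one of these readers, the sketch below writes a data frame to a temporary DBF file and reads it back with read.dbf. The sample data and the temporary file are assumptions made only so that the example is self-contained; in practice the file would come from another application.

```r
library(foreign)

# Hypothetical sample data, written to a temporary DBF file so that the
# example has something to read back.
sample_data <- data.frame(id = 1:3, amount = c(19.5, -15.0, 30.0))
dbf_path <- tempfile(fileext = ".dbf")
write.dbf(sample_data, dbf_path)

# read.dbf returns a data frame, so the result can be used as a dataset.
from_dbf <- read.dbf(dbf_path)
str(from_dbf)

# Remove the temporary file again.
unlink(dbf_path)
```

Because the result is already a data frame, it needs no further conversion before being selected as an R Dataset.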

As an example, suppose we have an SPSS data file and read that into R through the following commands typed into the R Console:



> library(foreign)
> mydataset <- read.spss(file="mydataset.sav", to.data.frame=TRUE)

Then, as in Figure 5.7, we can find the data frame, mydataset, listed as an available R Dataset.

Figure 5.7: Loading an already defined R data frame as a dataset for use in Rattle.

The datasets that we wish to use with Rattle need to be constructed in the same R session that is running Rattle (i.e., the same R Console in which we loaded the Rattle package).

An interesting variation, which can be quite convenient at times, is the ability to copy and paste a selection directly via the system clipboard. Through this mechanism we can highlight a collection of data in a spreadsheet and copy it to the clipboard. Then, within R, we can "paste" the data into a data frame using the read.table function.

Suppose we have a spreadsheet open with the data we see in Figure 5.8. If we select the 16 rows, including the header, in the usual way, we can very simply load the data using R:



> expenses <- read.table(file("clipboard"), header=TRUE)
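The clipboard connection depends on the platform and on what is currently on the clipboard, so it is hard to reproduce exactly. The sketch below therefore uses the text argument of read.table as a stand-in for the clipboard contents. The three data rows are reconstructed from the expenses data of this section, and the names clipboard_text and pasted are assumptions made for illustration.

```r
# Stand-in for the clipboard contents: whitespace-separated columns with a
# header row, as a copied spreadsheet selection would typically arrive.
clipboard_text <- "Date Expense Total
17-Nov-2005 19.5 19.5
23-Nov-2005 -15.0 4.5
10-Dec-2005 30.0 34.5"

# read.table parses this text in the same way as it parses a connection.
# stringsAsFactors = FALSE keeps Date as plain character strings.
pasted <- read.table(text = clipboard_text, header = TRUE,
                     stringsAsFactors = FALSE)
str(pasted)
```

If cells in the spreadsheet contain spaces, passing sep="\t" to read.table is likely to be needed, since spreadsheets usually place tab-separated text on the clipboard. On macOS the clipboard is commonly read with pipe("pbpaste") rather than file("clipboard").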

By default the Date variable is loaded as a categoric (a factor in versions of R before 4.0.0, and a character string in later versions), so we convert it into a date type and then list the data:



> expenses$Date <- as.Date(expenses$Date, format="%d-%b-%Y")
> head(expenses)



        Date Expense  Total
1 2005-11-17    19.5   19.5
2 2005-11-23   -15.0    4.5
3 2005-12-10    30.0   34.5
4 2006-01-23  -110.0  -75.5
5 2006-01-28   -20.0  -95.5
6 2006-02-14   -10.0 -105.5
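The conversion can be tried without a spreadsheet. The following sketch rebuilds the first three rows by hand, with the Date column deliberately created as a factor to mimic the categoric variable described above. The name expense_rows is an assumption for illustration. Note that the %b format matches abbreviated month names in the current locale, so the sketch switches the time locale to "C" (English month names) first.

```r
# Month abbreviations such as "Nov" are locale dependent, so use the C
# locale for date parsing in this example.
Sys.setlocale("LC_TIME", "C")

# The first three rows of the expenses data, with Date stored as a factor.
expense_rows <- data.frame(
  Date    = factor(c("17-Nov-2005", "23-Nov-2005", "10-Dec-2005")),
  Expense = c(19.5, -15.0, 30.0),
  Total   = c(19.5, 4.5, 34.5)
)

# as.Date accepts factors as well as character strings.
expense_rows$Date <- as.Date(expense_rows$Date, format = "%d-%b-%Y")

# Values that do not match the format become NA, so check for them.
stopifnot(!anyNA(expense_rows$Date))
print(class(expense_rows$Date))
print(expense_rows)
```

Checking for NA values after the conversion is a cheap way to catch a format string that does not match the spreadsheet's date layout.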

Figure 5.8: Selected region of a spreadsheet copied to the clipboard.

We can then load this into Rattle directly, as in Figure 5.9.

Figure 5.9: Loading an R data frame which was obtained from a copy-and-paste, via the clipboard, from a spreadsheet.

Copyright © 2004-2010 Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010