Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


ARFF Data

The Attribute-Relation File Format (ARFF) is an ASCII text file format that is essentially a CSV file with a header that describes the meta-data. ARFF was developed for use in the Weka machine learning software, and many datasets are now available in this format. We can load an ARFF dataset into Rattle through the ARFF option (Figure 5.4), specifying the filename to load the data from.

Figure 5.4: Choosing the ARFF radio button to load an ARFF file.

The key difference between CSV and ARFF is the header at the top of the file, which contains information about each of the variables in the data. An example of the ARFF format for our audit dataset is shown below. Note that ARFF refers to variables as attributes.



@relation audit
@attribute ID numeric
@attribute Age numeric
@attribute Employment {Consultant, PSFederal, PSLocal, ...}
@attribute Education {Associate, Bachelor, College, ...}
@attribute Marital {Absent, Civil, Divorced, Married, ...}
@attribute Occupation {Cleaner, Clerical, Executive, ...}
@attribute Income numeric
@attribute Gender {Female, Male}
@attribute Deductions numeric
@attribute Hours numeric
@attribute Accounts {Canada, China, Columbia, Cuba, ...}
@attribute Adjustment numeric
@attribute Adjusted {0, 1}
@data
1004641,38,Private,College,Separated,Service,71511.95,...
1010229,35,Private,Associate,Unmarried,Transport,,...
1024587,32,Private,HSgrad,Divorced,Clerical,82365.86,...
1038288,45,Private,?,Civil,Repair,27332.32,Male,0,55,...
1044221,60,Private,College,Civil,Executive,21048.33,...
...

The data description section is straightforward, beginning with the name of the dataset (or the relation in ARFF terminology). Each of the variables (or attributes in ARFF terminology) used to describe each observation is then identified, together with its data type. Each variable definition appears on a single line (we have truncated the lines in the above example). Numeric variables are identified as numeric, real, or integer. For categoric variables we list the possible values within braces.
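To make the structure of the header concrete, the following is a minimal Python sketch that reads the relation name and the attribute definitions. It is not how Rattle or Weka parse the file; the function name parse_header is our own, the value lists are shortened for illustration, and the sketch assumes simple unquoted attribute names with no embedded spaces.

```python
# A small ARFF header adapted from the audit example above.
# Only a few attributes are kept, purely for illustration.
HEADER = """\
@relation audit
@attribute ID numeric
@attribute Age numeric
@attribute Gender {Female, Male}
@attribute Adjusted {0, 1}
@data
"""


def parse_header(text):
    """Return (relation, attributes) where attributes is a list of
    (name, type) pairs; a categoric type is a list of its values."""
    relation = None
    attributes = []
    for line in text.splitlines():
        line = line.strip()
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            # Split into the keyword, the name, and the type specification.
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Categoric attribute: collect the listed values.
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            break  # the header ends where the data section begins
    return relation, attributes


relation, attributes = parse_header(HEADER)
print(relation)
for name, spec in attributes:
    print(name, spec)
```

Numeric attributes come back with their type name as a string, whilst categoric attributes come back with the list of values declared in the header.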

Two other data types recognised by ARFF are string and date. A string data type simply indicates that the variable can have any string as its value. A date data type can optionally specify the format in which the date is presented. The default for dates is the ISO-8601 format, written as the pattern yyyy-MM-dd'T'HH:mm:ss.
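As a rough illustration of the default date format, the sketch below parses one timestamp in Python. The mapping from the ARFF pattern to Python's strptime directives is our own assumption and covers only the default format, and the timestamp itself is a made-up value rather than one taken from the audit dataset.

```python
from datetime import datetime

# The default ARFF date pattern and an assumed strptime equivalent.
ARFF_DEFAULT = "yyyy-MM-dd'T'HH:mm:ss"
STRPTIME_DEFAULT = "%Y-%m-%dT%H:%M:%S"

# A hypothetical value as it might appear in a date column.
stamp = datetime.strptime("2010-08-22T14:30:00", STRPTIME_DEFAULT)
print(stamp.year, stamp.month, stamp.day, stamp.hour, stamp.minute)
```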

Following the meta-data specification, and introduced by the @data line, the actual observations are then listed, each on a single line, with fields separated by commas as with a CSV file.
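Because the data section has the same layout as a CSV file, it can be split with any CSV reader. The following Python sketch pairs the first three fields of two observations from the audit example with their attribute names. It assumes the names are already known from the header, and it leaves every value as a string rather than converting by type.

```python
import csv
import io

# The first three attribute names and fields from the audit example.
NAMES = ["ID", "Age", "Employment"]
DATA = """\
1004641,38,Private
1010229,35,Private
"""

# Each observation is one comma-separated line, so the csv module can
# split the fields just as it would for an ordinary CSV file.
rows = [dict(zip(NAMES, fields)) for fields in csv.reader(io.StringIO(DATA))]
print(rows[0])
```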

A significant advantage of the ARFF data file over the CSV data file is the meta-data it carries. This is particularly useful in Rattle: with a CSV file the possible values of a categoric variable are determined from the data (which may not include every possible value), whereas an ARFF file supplies the full list of possible values. We will come across this as an issue when we build and deploy models.
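The difference between declared and observed values can be sketched in a few lines of Python. The declared list below holds only the three Education values visible in the truncated example, and the observed column is a hypothetical small extract, so both are assumptions made for illustration.

```python
# Values declared for one categoric attribute, as a header might list them.
# The full list is elided in the example, so only three are used here.
declared = ["Associate", "Bachelor", "College"]

# Values that happen to occur in a hypothetical small extract of the data.
observed_column = ["College", "Associate", "College"]

# Inferring the values from the data alone misses any value that is absent.
observed = sorted(set(observed_column))
never_seen = [v for v in declared if v not in observed]

print(observed)
print(never_seen)
```

Here a reader that infers the values from the data would know nothing of Bachelor, whilst a reader that uses the header would.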

Comments can also be included in an ARFF file with a '%' at the beginning of the comment line. Including comments in the data file allows us to record extra information about the data set, including how it was derived, where it came from, and how it might be cited.
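A reader of the file needs to set comment lines aside before interpreting the rest. The Python sketch below separates the two, assuming that a comment is any line whose first character is '%'. The comment text and the single-attribute file are hypothetical.

```python
TEXT = """\
% Hypothetical provenance note for illustration.
% Source: (not specified here)
@relation audit
@attribute Age numeric
@data
38
"""

lines = TEXT.splitlines()

# Comment lines start with '%'; keep their text without the marker.
comments = [ln[1:].strip() for ln in lines if ln.startswith("%")]

# Everything else that is not blank is header or data.
content = [ln for ln in lines if ln.strip() and not ln.startswith("%")]

print(comments)
print(content)
```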

Missing values in an ARFF data file are identified using the question mark '?'. These are recognised by the underlying read.arff function, and we see them as the usual NAs in Rattle.
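The same idea can be sketched in Python, where None plays the role that NA plays in R. The line below is the start of the fourth observation in the audit example, whose Education field is missing. This is only an illustration of the convention, not a description of what read.arff does internally.

```python
# The first seven fields of an observation from the audit example.
line = "1038288,45,Private,?,Civil,Repair,27332.32"

# Replace the ARFF missing-value marker with None; keep other fields as-is.
fields = [None if f.strip() == "?" else f.strip() for f in line.split(",")]

print(fields)
```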

Overall, the ARFF format, whilst simple, is quite an advance over a CSV file. Nonetheless, CSV remains the more common data file.

Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010