Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google


ODBC Sourced Data

Much data is stored within databases and data warehouses. The Open Database Connectivity (ODBC) standard has been developed as a common approach for accessing data from databases (and hence, data warehouses). The technology is based on the Structured Query Language (SQL) used to query relational databases. We discuss here how to access data directly from the database.

Rattle can obtain a dataset from any database accessible through ODBC using Rattle's ODBC Option (Figure 5.5). Underneath, the RODBC package provides the actual interface to the ODBC data source.

Figure 5.5: Loading data through an ODBC database connection.
Image load:rattle_odbc_annotate

The key to accessing data via ODBC is to identify the data source through a so-called data source name (or DSN). Different operating systems provide different mechanisms for setting up DSNs. The setting up of DSNs is outside the scope of Rattle. It is, instead, a configuration task through your operating system. Under GNU/Linux, for example, using the unixodbc package, the system DSNs are often defined in the file /etc/odbcinst.ini and in /etc/odbc.ini. Under MS/Windows the control panel provides access to a DSN tool to identify data sources.

Using Rattle we specify a known DSN by typing the name into the text entry field we can see in Figure 5.5.

Once a DSN is specified Rattle will attempt to make a connection. This may require a username and password to be supplied. Many ODBC drivers will thus prompt us for this information before establishing the connection. Figure 5.6 illustrates a typical popup for entering such data, in this case, for connecting to a Netezza data warehouse.

To establish a connection using R directly we make use of the odbcConnect function from the RODBC package:



> library(RODBC)
> channel <- odbcConnect("myDWH", uid="kayon", pwd="toga")

Figure 5.6: Netezza ODBC connection

Once the connection is established Rattle queries the database for the names of the available tables, and provides access to that list through the Table combobox in Figure 5.5. We need to make a selection of just one of the tables listed.

The list of available tables is obtained using the RODBC function sqlTables:



> tables <- sqlTables(channel)

If there is a table in the connected database called, for example, clients, we can obtain a list of columns using the sqlColumns function:



> columns <- sqlColumns(channel, "clients")

Some basic options are available through Rattle for fine tuning the ODBC connection. One option allows us to limit the number of rows retrieved from the chosen table. If the Row Limit is set to 0 then all of the rows from the table are retrieved. Unfortunately, there is no SQL standard for limiting the number of rows returned from a query. For some database systems (e.g., Teradata and Netezza) the SQL keyword is LIMIT, and this is what is used by Rattle.

A further option indicates whether to believe the results of a query that reports on the number of rows in a table (some ODBC drivers behave differently in this respect). This needs to be set for some ODBC drivers. The Believe Num Rows option is an oddity required for some ODBC drivers and appears to be associated with the pre-fetch behaviour of these drivers. The default is to activate the check box (i.e., Believe Num Rows is True). However, if you find that you are not retrieveing all rows from the source table, the the ODBC driver may be using a pre-fetch mechanism that does not ``correctly'' report the number of rows (it is probably only reporting the number of rows limited to the size of the pre-fetch). In these cases deactivate the Believe Num Rows check box.

In summary, we establish a connection to a data source through the ODBC interface. We need to identify the table we would like to draw data from. A click of the Execute button will then load the data into Rattle.

The data loading step can be accomplished directly in R using the sqlQuery function:



> query <- "SELECT * FROM clients WHERE cost > 2500"
> myds <- sqlQuery(channel, query)

Using R directly provides a lot more scope for carefully identifying the data we wish to load. Any SQL query can be substituted for the simple SELECT statement used above. For those of us with skills in writing SQL queries, this provides quite a powerful mechanism for refining the data to be loaded.

Copyright © Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010