Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google

Using SQLite

SQLite (from www.sqlite.org) is an open source database package that is well supported in R. It has the advantage that it requires no setup or administration (other than installing the package) and is an embedded system so that there is less of a connection overhead. You are able to manage very large datasets in SQLite without needing to load all the data into memory, unlike R itself, so that you are able to manipulate the data using SQL then load in just the data you need.

For small dataset SQLite is a good choice, but for very large datasets, MySQL still performs very well.

A Google Summer of Code project, created a package that stores data frames and matrices into sqlite databases, called sqlite data frames (sdf). These sdf's behave like ordinary data frames so that existing R functions will work. This enables R users to work with very large datasets much more readily, with no user effort.

As an example, first create an empty SQLite database (outside of R) and import a CSV (comma separated value) file, telling sqlite to use commas, not '|':

$ sqlite3 -separator , audit.db
sqlite> create table audit(ID INTEGER, Age INTEGER, Employment TEXT,
                           Education TEXT, Marital TEXT, Occupation TEXT,
                           Income REAL, Sex TEXT, Deductions REAL,
                           Hours INTEGER, Accounts TEXT, 
                           Adjustment REAL, Adjusted INTEGER);
sqlite> .tables
audit
sqlite> .import audit.csv audit
sqlite> select count(*) from audit;
2001
sqlite> delete from audit where ID='"ID"';
sqlite> select count(*) from audit;
2000

sqlite> .quit

Now, using SQLiteDF we can treat the data as a Data Frame, avoiding loading it into R (except when it is copied internally in an R command):

> library(SQLiteDF)
> audit <- sdfImportSQLite("audit.db", "audit")
> audit
SQLite data frame "audit" (2000 row(s) by 13 column(s)) stored in file ".SQLiteDF/audit.db"

First 6 rows:
       ID Age Employment   Education     Marital  Occupation    Income      Sex
1 1004641  38  "Private"   "College" "Unmarried"   "Service"  81838.00 "Female"
2 1010229  35  "Private" "Associate"    "Absent" "Transport"  72099.00   "Male"
3 1024587  32  "Private"    "HSgrad"  "Divorced"  "Clerical" 154676.74   "Male"
4 1038288  45  "Private"  "Bachelor"   "Married"    "Repair"  27743.82   "Male"
5 1044221  60  "Private"   "College"   "Married" "Executive"   7568.23   "Male"
6 1047095  74  "Private"    "HSgrad"   "Married"   "Service"  33144.40   "Male"
  Deductions Hours       Accounts Adjustment Adjusted
1          0    72 "UnitedStates"          0        0
2          0    30      "Jamaica"          0        0
3          0    40 "UnitedStates"          0        0
4          0    55 "UnitedStates"       7298        1
5          0    40 "UnitedStates"      15024        1
6          0    30 "UnitedStates"          0        0
 ...
> sum(audit$Income)
[1] 169376920

The SQLite package, as an alternative, allows the easy import and export of data to text files.



library(RSQLite)
con <- dbConnect(SQLite(), "foo3.db")
dbGetQuery(con, "pragma cache_size")
    cache_size
  1 2000
dbGetQuery(con, "pragma cache_size=2500")
   NULL
dbGetQuery(con, "pragma cache_size")
   cache_size 
 1 2500

Now in R in the same directory:

library(DBI)
library(RSQLite)

driver<-dbDriver("SQLite")
connect<-dbConnect(driver, dbname = "audit.db")
dbWriteTable(connect, "audit", audit, overwrite = T, row.names = F)

dbListTables(connect)
[1] "audit"

query01 <- dbSendQuery(connect, statement = "select * from audit");
data01 <- fetch(query01, n = 10)
contents(data01)

sqliteCloseResult(query01)
sqliteCloseConnection(connect)
sqliteCloseDriver(driver)

Copyright © 2004-2010 Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010