Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Application to Text

Correlations are also particularly interesting when mining a collection of text documents for information, a task referred to as text mining. A simple example illustrates.

The tm package provides functionality for text mining in R. It also includes a sample collection of documents for mining, called the crude corpus. Without covering the details, we load the 20 documents contained in the corpus and transform them into a dataset. The variables of the dataset are the words (or key words) found in the documents, and each observation records, for one document, the frequency of each word within that document:



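> library(tm)
> data(crude)
> # Build a document-term matrix: one row per document, one column per term.
> # The weighting option is left commented out, so raw term frequencies are used.
> crude.dtm <- DocumentTermMatrix(crude, control=list(
                                      # weighting=weightTfIdf,
                                      stopwords=TRUE,      # drop common stop words
                                      removeNumbers=TRUE,  # drop numbers
                                      stemming=TRUE,       # reduce words to their stems
                                      minDocFreq=2))       # minimum frequency for a term to be kept
> # Convert to an ordinary data frame for use in Rattle.
> crude.dtm.df <- as.data.frame(as.matrix(crude.dtm))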
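
To make the layout of such a dataset concrete, the following is a minimal base R sketch that builds a small document-term frequency table by hand. The three documents and the names docs, words, vocab, toy.dtm and toy.dtm.df are hypothetical and chosen only for illustration. The sketch does not use tm and omits stop word removal and stemming.

```r
# Three hypothetical toy documents (not taken from the crude corpus).
docs <- c(doc1 = "oil prices rise as oil demand grows",
          doc2 = "opec cuts oil output",
          doc3 = "prices fall as output grows")

# Split each document into lower-case words.
words <- strsplit(tolower(docs), " ", fixed = TRUE)

# The vocabulary becomes the set of variables (columns).
vocab <- sort(unique(unlist(words)))

# Count how often each vocabulary word occurs in each document.
# vapply returns one column per document, so transpose to one row per document.
counts <- vapply(words,
                 function(w) tabulate(match(w, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
toy.dtm <- t(counts)
dimnames(toy.dtm) <- list(names(docs), vocab)

# Convert to a data frame: documents are observations, words are variables.
toy.dtm.df <- as.data.frame(toy.dtm)
print(toy.dtm.df)
```

The crude.dtm.df data frame constructed above has the same layout, with one row for each of the 20 documents.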

We can now load this data frame into Rattle to produce a correlation plot as in Figure [*].
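
For readers who want to see the numbers behind such a plot, the sketch below computes pairwise correlations between term columns directly in R with cor(). It assumes that ordinary Pearson correlations between columns are what is of interest, which is an assumption made here for illustration. The data frame toy.df and its values are hypothetical and are not drawn from the crude corpus.

```r
# Hypothetical term frequencies for five documents (illustrative values only).
toy.df <- data.frame(oil   = c(2, 1, 0, 3, 1),
                     price = c(1, 1, 0, 2, 0),
                     opec  = c(0, 2, 1, 0, 3),
                     row.names = paste0("doc", 1:5))

# Pearson correlation between every pair of term columns.
# The result is a symmetric matrix with ones on the diagonal.
toy.cor <- cor(toy.df)
print(round(toy.cor, 2))
```

A positive entry indicates two terms that tend to be frequent in the same documents, and a negative entry indicates terms that tend not to occur together. On a real document-term data frame, a column with the same count in every document has zero variance, so cor() returns NA for it (with a warning); such columns may need to be dropped first.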



Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010