DATA MINING
Desktop Survival Guide by Graham Williams

Application to Text
Correlations are also particularly interesting when mining for information in a collection of text documents (referred to as text mining). A simple example illustrates this.
The tm package provides functionality for text mining. The package also has a sample collection of data for mining, called the crude corpus. Without covering the details, we will load the 20 documents contained in the corpus and transform the collection of documents into a dataset. The dataset contains, as variables, the words (or key words) found in the documents. The observations record, for each document, the frequency of the words contained within that document:
> library(tm)
> data(crude)
> crude.dtm <- DocumentTermMatrix(crude,
                                  control=list(#weighting=weightTfIdf,
                                               stopwords=TRUE,
                                               removeNumbers=TRUE,
                                               stemming=TRUE,
                                               minDocFreq=2))
> crude.dtm.df <- as.data.frame(as.matrix(crude.dtm))
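To make the shape of such a dataset concrete, the following is a minimal sketch in base R that builds a small document-term frequency table by hand. It does not use tm, and the three documents and all variable names (docs, vocab, toy.dtm.df) are made up for illustration; they are not part of the crude corpus.

```r
# Three made-up documents (hypothetical, not taken from the crude corpus).
docs <- c(d1 = "oil prices rise as opec cuts oil output",
          d2 = "opec meets to discuss prices",
          d3 = "crude oil output falls")

# Split each document into lower-case words on whitespace.
words <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of distinct words across all documents.
vocab <- sort(unique(unlist(words)))

# Count, for each document, how often each vocabulary word occurs.
# The result of vapply has one column per document, so transpose it.
counts <- vapply(words,
                 function(w) tabulate(match(w, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
toy.dtm <- t(counts)
rownames(toy.dtm) <- names(docs)
colnames(toy.dtm) <- vocab

# One observation per document, one variable per word.
toy.dtm.df <- as.data.frame(toy.dtm)
print(toy.dtm.df[, c("oil", "opec", "prices")])
```

Each row is a document and each column a word, which is the same layout as crude.dtm.df, only without the stop word removal, number removal, and stemming that the tm call requests.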
We can now load this data frame into Rattle to produce a correlation plot (see the accompanying figure).
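The correlations behind such a plot can also be inspected directly at the R prompt. The sketch below is one possible approach using base R's cor(), not a description of what Rattle does internally. The term.freq data frame and its values are hypothetical stand-ins; with the real data, crude.dtm.df would take its place.

```r
# Hypothetical term-frequency data: rows are documents, columns are terms.
# With the real data, crude.dtm.df would be used instead of term.freq.
term.freq <- data.frame(oil    = c(2, 0, 1, 3, 0),
                        opec   = c(1, 1, 0, 2, 0),
                        prices = c(1, 1, 0, 2, 0),
                        crude  = c(0, 0, 1, 0, 2))

# Drop terms with zero variance, for which correlation is undefined.
term.freq <- term.freq[, sapply(term.freq, sd) > 0, drop = FALSE]

# Pearson correlation between every pair of term-frequency columns.
term.cor <- cor(term.freq)

# List each pair of terms once (upper triangle of the matrix),
# ordered by the absolute strength of the correlation.
idx <- which(upper.tri(term.cor), arr.ind = TRUE)
term.pairs <- data.frame(term1 = rownames(term.cor)[idx[, "row"]],
                         term2 = colnames(term.cor)[idx[, "col"]],
                         cor   = term.cor[idx])
term.pairs <- term.pairs[order(-abs(term.pairs$cor)), ]
print(term.pairs)
```

Pairs of words at the top of the list tend to appear together (positive correlation) or to appear in different documents (negative correlation), which is the kind of pattern the correlation plot displays graphically.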