Text Mining with R

See the ttda and tm packages.

Text mining begins with feature extraction. Techniques include:

Using tm, here is a simple example. The crude dataset contains 20 news articles about crude oil. Its class is TextDocCol, a text document collection. The tm package also provides functions for building our own collections: they read the source documents from a specified directory and process them into a TextDocCol. Given a TextDocCol, TermDocMatrix generates a weighted count of the terms in each document (omit the weighting argument for plain term counts).
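
As a rough illustration of the "read a directory of documents" step, here is a minimal base R sketch. It does not use tm's own reader, and the directory name, file names and file contents are all made up for the example. It only shows the idea of turning a folder of text files into one string per document.

```r
# Create a throwaway directory with two small text files (hypothetical content)
dir_path <- file.path(tempdir(), "toy_corpus")
dir.create(dir_path, showWarnings = FALSE)
writeLines("Crude oil prices rose.", file.path(dir_path, "doc1.txt"))
writeLines(c("OPEC may meet early.", "Output could be cut."),
           file.path(dir_path, "doc2.txt"))

# Read every .txt file and collapse its lines into one string per document
files <- list.files(dir_path, pattern = "\\.txt$", full.names = TRUE)
docs <- vapply(files,
               function(f) paste(readLines(f), collapse = "\n"),
               character(1))
names(docs) <- basename(files)

print(docs)
```

A named character vector like docs is only a stand-in for a real document collection, which also carries metadata for each document.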

The data looks like this:

> library(tm)
> vignette("tm")
> data(crude)
> class(crude)
[1] "TextDocCol"
attr(,"package")
[1] "tm"
> crude
A text document collection with 20 text documents
> crude@.Data
[1] "Diamond Shamrock Corp said that \neffective [...]"

[1] "OPEC may be forced to meet before a \nscheduled [...]"


[1] "Argentine crude oil production was \ndown 10.8 pct [...]"

> tdm <- TermDocMatrix(crude, weighting = "tf-idf", stopwords = TRUE)
> tdm
An object of class "TermDocMatrix"
Slot "Data":
20 x 859 sparse Matrix of class "dgCMatrix"
   [[ suppressing 859 column names 'barrel', 'brings', 'citing' ... ]]

127 2 2.321928 4.321928 2.736966 2 4.643856 4.321928 2.736966 
144 . .        .        2.736966 . .        .        .        


> tdm <- TermDocMatrix(crude, stopwords = TRUE)

> tdm
An object of class "TermDocMatrix"
Slot "Data":
20 x 859 sparse Matrix of class "dgCMatrix"
   [[ suppressing 859 column names 'barrel', 'brings', 'citing' ... ]]

127 2 1 1 1 1 2 1 1 2 2 1 2 2 1 1 1 1 1 1 1  5 2 2 3 1 2
144 . . . 1 . . . . . . . . . . . . . . 4 1 12 . 1 5 . .
191 1 1 . . 1 1 . . 2 . . . 1 1 . . 1 . . .  2 1 2 . . .
194 1 1 . . 1 1 . . 3 . . . 2 1 . 1 . . . .  1 1 2 . . .
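
To make the difference between the two matrices concrete, here is a minimal base R sketch of term counting and tf-idf weighting on a toy corpus. The three documents are invented, and the formula tf * log2(N / df) is an assumption: it is consistent with the weights printed above (for instance 4.321928 = log2(20) for a term that occurs once in one of the 20 documents), but it has not been checked against the tm source.

```r
# Toy corpus (hypothetical documents, not taken from crude)
docs <- c(d1 = "crude oil prices rose",
          d2 = "opec cut crude output",
          d3 = "oil output fell")

# Tokenize on whitespace after lower-casing
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary and document-by-term count matrix (rows = documents)
vocab <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Inverse document frequency: log2(N / number of documents containing the term)
df  <- colSums(counts > 0)
idf <- log2(nrow(counts) / df)

# tf-idf: multiply each term column by its idf (assumed weighting, see above)
tfidf <- sweep(counts, 2, idf, `*`)
print(round(tfidf, 3))
```

Terms that appear in many documents, such as crude here, get a small weight, while terms confined to a single document get the largest.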


To save the word counts or to compute measures such as the Euclidean distance between documents, convert tdm into an ordinary matrix:

> x <- as.matrix(tdm@Data)
> write.csv(x, "crude_words.csv")
> dist(x, method = "euclidean")
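
For reference, here is a small base R sketch of what dist computes on such a matrix. The counts and the row and column names are hypothetical. Each row is treated as one document, and the distance between two documents is the square root of the summed squared differences of their term counts.

```r
# Small document-by-term count matrix (hypothetical values)
x <- matrix(c(2, 0, 1,
              0, 1, 1,
              2, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("barrel", "opec", "price")))

# Pairwise Euclidean distances between rows (documents)
d <- dist(x, method = "euclidean")

# The same quantity computed by hand for documents d1 and d2
manual <- sqrt(sum((x["d1", ] - x["d2", ])^2))

print(d)
print(all.equal(as.matrix(d)["d1", "d2"], manual))
```

Because dist works on rows, a matrix with terms as rows would have to be transposed with t() first to compare documents rather than terms.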

