DATA MINING
Desktop Survival Guide by Graham Williams
Text Mining with R
See the ttda and tm packages.
Text mining begins with feature extraction. Techniques include:
Here is a simple example using the tm package. The crude dataset contains 20 news articles dealing with crude oil. Its class identifies it as a text document collection (TextDocCol). We can create our own text document collections using functions provided by tm, which read a collection of source documents from a specified directory and process them into a TextDocCol. Given a TextDocCol, the TermDocMatrix function generates a weighted count of the terms in the documents (remove the weighting argument if you just want plain term counts).
The actual data, and the matrices built from it, look like this:
> library(tm)
> vignette("tm")
> data(crude)
> class(crude)
[1] "TextDocCol"
attr(,"package")
[1] "tm"
> crude
A text document collection with 20 text documents
> crude@.Data
[[1]]
[1] "Diamond Shamrock Corp said that \neffective [...]"
[[2]]
[1] "OPEC may be forced to meet before a \nscheduled [...]"
[...]
[[20]]
[1] "Argentine crude oil production was \ndown 10.8 pct [...]"
> tdm <- TermDocMatrix(crude, weighting = "tf-idf", stopwords = TRUE)
> tdm
An object of class "TermDocMatrix"
Slot "Data":
20 x 859 sparse Matrix of class "dgCMatrix"
[[ suppressing 859 column names 'barrel', 'brings', 'citing' ... ]]
127 2 2.321928 4.321928 2.736966 2 4.643856 4.321928 2.736966
144 . . . 2.736966 . . . .
[...]
> tdm <- TermDocMatrix(crude, stopwords = TRUE)
> tdm
An object of class "TermDocMatrix"
Slot "Data":
20 x 859 sparse Matrix of class "dgCMatrix"
[[ suppressing 859 column names 'barrel', 'brings', 'citing' ... ]]
127 2 1 1 1 1 2 1 1 2 2 1 2 2 1 1 1 1 1 1 1 5 2 2 3 1 2
144 . . . 1 . . . . . . . . . . . . . . 4 1 12 . 1 5 . .
191 1 1 . . 1 1 . . 2 . . . 1 1 . . 1 . . . 2 1 2 . . .
194 1 1 . . 1 1 . . 3 . . . 2 1 . 1 . . . . 1 1 2 . . .
[...]
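To make the weighting step concrete, here is a minimal base R sketch of term counting followed by tf-idf weighting. It does not use tm: the three toy documents, the hand-picked stopword list, and the formula tf * log2(N / df) are assumptions made for illustration. That formula does appear consistent with the weighted output above, where with N = 20 documents we see values such as 4.321928 = log2(20) and 2.321928 = log2(20/4), but the package's own computation may differ in detail.

```r
# Toy corpus: three short "documents" (made up for illustration).
docs <- c(d1 = "oil prices rise as opec meets",
          d2 = "opec may cut oil output",
          d3 = "crude output fell in argentina")

# Tokenise on whitespace and drop a tiny, hand-picked stopword list.
stopwords <- c("as", "in", "may")
tokens <- lapply(strsplit(docs, "\\s+"),
                 function(w) w[!w %in% stopwords])

# Term counts: one row per document, one column per term.
terms <- sort(unique(unlist(tokens)))
tf <- matrix(0, nrow = length(docs), ncol = length(terms),
             dimnames = list(names(docs), terms))
for (i in seq_along(tokens)) {
  counts <- table(tokens[[i]])
  tf[i, names(counts)] <- as.vector(counts)
}

# Document frequency: in how many documents does each term occur?
df <- colSums(tf > 0)

# tf-idf: scale each term column by log2(N / df).
tfidf <- sweep(tf, 2, log2(nrow(tf) / df), `*`)
print(round(tfidf, 3))
```

Terms that occur in only one document (such as "crude" here) receive the largest weight, while terms shared by several documents (such as "oil") are down-weighted.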
We can transform tdm into a simple matrix, for example to save the
word counts or to compute various measures, such as the Euclidean
distance between documents:
> x <- as.matrix(tdm@Data)
> write.csv(x, "crude_words.csv")
> dist(x, method = "euclidean")
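As a small self-contained illustration of what these three steps do, the sketch below applies them to a toy document-by-term matrix in base R. The counts, the row and column names, and the temporary file are made up for illustration and are not taken from the crude data.

```r
# Toy document-by-term count matrix (values made up for illustration).
x <- matrix(c(2, 0, 1,
              0, 3, 1,
              2, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("oil", "opec", "crude")))

# dist() compares rows, so each entry is a document-to-document distance.
d <- dist(x, method = "euclidean")
print(d)

# The same d1-d2 distance computed by hand.
manual <- sqrt(sum((x["d1", ] - x["d2", ])^2))

# Round trip through CSV: row names are written as the first column.
path <- tempfile(fileext = ".csv")
write.csv(x, path)
y <- as.matrix(read.csv(path, row.names = 1))
```

Because dist() works on rows, a matrix with documents as rows (as in the 20 x 859 matrix above) yields distances between documents; transposing it with t(x) would give distances between terms instead.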