Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google

Hot Spots

Cluster analysis can be used to find clusters that are most interesting according to some criteria. For example, we might cluster the spam7 data of the DAAG package (without using yesno in the clustering) and then score the clusters depending on the proportion of yes cases within the cluster. The following R code will build K clusters (user specified) and return a score for each cluster.

# Some ideas here from Felix Andrews
kmeans.scores <- function(x, centers, cases) 
{
  clust <- kmeans(x, centers)
  # Iterate over each cluster to generate the scores
  scores <- c()
  for (i in 1:centers) 
  {
    # Count number of TRUE cases in the cluster
    # as the proportion of the cluster size
    scores[i] <- sum( cases[clust$cluster == i] == TRUE ) / clust$size[i]
  }
  # Add the scores as another element to the kmeans list
  clust$scores <- scores
  return(clust)
}

We can now run this on our data with:

> library(DAAG)
> data(spam7)
> clust <- kmeans.scores(spam7[,1:6], centers=10, spam7["yesno"]=="y")
> clust[c("scores","size")]
$scores
 [1] 0.7037037 0.1970109 0.5995763 0.7656250 0.8043478 1.0000000 0.4911628
 [8] 0.7446809 0.6086957 0.6043956

$size
 [1]  162 2208  472  128   46    5 1075   47  276  182

Thus, cluster 5 with 46 members has a high proportion of positive cases and may be a cluster we are interested in exploring further. Clusters 4, 8, and 1 are also probably worth exploring.

Now that we have built some clusters we can generate some rules that describe the clusters:

hotspots <- function(x, cluster, cases)
{
  require(rpart)
  overall = sum(cases) / nrow(cases)
  x.clusters <- cbind(x, cluster)
  tree = rpart(cluster ~ ., data = x.clusters, method = "class")
  # tree = prune(tree, cp = 0.06)
  nodes <- rownames(tree$frame)
  paths = path.rpart(tree, nodes = nodes)

TO BE CONTINUED

  return(tree)
}

And to use it:

> h <- hotspots(spam7[,1:6], clust$cluster, spam7["yesno"]=="y")



Copyright © 2004-2010 Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010