DATA MINING
Desktop Survival Guide
by Graham Williams

### Hot Spots

Cluster analysis can be used to find clusters that are most interesting according to some criteria. For example, we might cluster the spam7 data of the DAAG package (without using yesno in the clustering) and then score the clusters depending on the proportion of yes cases within the cluster. The following R code will build K clusters (user specified) and return a score for each cluster.

 ```# Some ideas here from Felix Andrews kmeans.scores <- function(x, centers, cases) { clust <- kmeans(x, centers) # Iterate over each cluster to generate the scores scores <- c() for (i in 1:centers) { # Count number of TRUE cases in the cluster # as the proportion of the cluster size scores[i] <- sum( cases[clust\$cluster == i] == TRUE ) / clust\$size[i] } # Add the scores as another element to the kmeans list clust\$scores <- scores return(clust) } ```

We can now run this on our data with:

 ```> library(DAAG) > data(spam7) > clust <- kmeans.scores(spam7[,1:6], centers=10, spam7["yesno"]=="y") > clust[c("scores","size")] \$scores [1] 0.7037037 0.1970109 0.5995763 0.7656250 0.8043478 1.0000000 0.4911628 [8] 0.7446809 0.6086957 0.6043956 \$size [1] 162 2208 472 128 46 5 1075 47 276 182 ```

Thus, cluster 5 with 46 members has a high proportion of positive cases and may be a cluster we are interested in exploring further. Clusters 4, 8, and 1 are also probably worth exploring.

Now that we have built some clusters we can generate some rules that describe the clusters:

 ```hotspots <- function(x, cluster, cases) { require(rpart) overall = sum(cases) / nrow(cases) x.clusters <- cbind(x, cluster) tree = rpart(cluster ~ ., data = x.clusters, method = "class") # tree = prune(tree, cp = 0.06) nodes <- rownames(tree\$frame) paths = path.rpart(tree, nodes = nodes) TO BE CONTINUED return(tree) } ```

And to use it:

 ```> h <- hotspots(spam7[,1:6], clust\$cluster, spam7["yesno"]=="y") ```