Other Cluster Algorithms
For building churn prediction models in retail banking, the k-means algorithm has been reported to provide poor results (see http://videolectures.net/sikdd08_popovic_cpm/). The reported results, on data with millions of observations and thousands of variables, indicate that c-means (a fuzzy clustering algorithm) works better than k-means for churn and is more robust to outliers.
You can run a c-means clustering by mimicking the k-means approach: perform a k-means clustering in Rattle, copy the corresponding command from the log, paste it into the R console, and change kmeans to cmeans (from the e1071 package).
> library(e1071)
> crs$cmeans <- cmeans(na.omit(crs$dataset[crs$sample, c(2:6,8,11:21)]), 10)
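The returned object includes the cluster centres and the fuzzy membership matrix, which can be inspected directly (a brief sketch; the component names below are those documented for cmeans() in e1071):

> crs$cmeans$centers           # one row of variable means per cluster
> head(crs$cmeans$membership)  # degree of membership of each observation in each cluster
> table(crs$cmeans$cluster)    # counts for the crisp assignment to the closest cluster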
In fuzzy c-means, new data is compared to the cluster centres in order to assign cluster membership values to the new observations.
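A minimal sketch of that calculation, assuming Euclidean distances and the default fuzzifier m = 2 used by cmeans(); newdata is a hypothetical data frame holding new observations with the same columns that were clustered:

fcm.membership <- function(newdata, centers, m=2)
{
  # Euclidean distance from each new observation to each cluster centre.
  d <- apply(centers, 1, function(ctr)
             sqrt(rowSums(sweep(as.matrix(newdata), 2, ctr)^2)))
  # Standard fuzzy c-means membership: proportional to d^(-2/(m-1)).
  u <- d^(-2/(m-1))
  u / rowSums(u)
}

memb <- fcm.membership(newdata, crs$cmeans$centers)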
FANNY, available as fanny() in the cluster package, is another fuzzy clustering function in R (Kaufman and Rousseeuw, 1990). fanny() works with arbitrary dissimilarities d[i,j], whereas all versions of k-means assume a Euclidean measurement space. Kaufman and Rousseeuw (1990) show that when you start from a data matrix X and use squared Euclidean distances as the dissimilarities, fanny() does the same as fuzzy k-means. They prefer the unsquared distances that fanny() uses by default, for robustness reasons.
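A minimal sketch of fanny() on the same numeric columns used in the cmeans example above (fanny() also accepts a dissimilarity object, for example one produced by daisy()); the choice of 10 clusters simply mirrors the earlier call:

> library(cluster)
> crs$fanny <- fanny(na.omit(crs$dataset[crs$sample, c(2:6,8,11:21)]), k=10)
> head(crs$fanny$membership)   # fuzzy membership of each observation in each cluster
> crs$fanny$clustering         # the nearest crisp clustering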
Using FANNY to assign cluster membership values to new data is trickier. In FANNY, cluster centres don't make sense, since the sample space may contain unordered categorical variables mixed with continuous ones (for such data we compute dissimilarities with daisy() rather than dist()).
If all variables are continuous and we use simple Euclidean distances, we can compute cluster centres and then determine cluster membership for new observations by minimising the distance to those centres.
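A minimal sketch of that approach, assuming a numeric data matrix x and the fuzzy membership matrix memb returned by the clustering (both names are illustrative): the centres are taken as membership-weighted column means, and a new observation is assigned to the nearest centre.

# Membership-weighted cluster centres: one row per cluster, one column per variable.
centres <- t(apply(memb, 2, function(w) colSums(x * w) / sum(w)))

# Assign a new observation (here illustrated by the first row of x) to the
# cluster whose centre minimises the Euclidean distance.
newobs <- as.numeric(x[1, ])
d <- sqrt(rowSums(sweep(centres, 2, newobs)^2))
which.min(d)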