DATA MINING
Desktop Survival Guide by Graham Williams
See http://www.jstatsoft.org/v30/i08/paper
Archetypal analysis as used here can only handle numeric data. We use the weather dataset as the example, extracting just the numeric columns and making sure they really are numeric. Todo: Fix generation of the weather data, as these columns come out as character. For the first part we mimic the paper with a two-variable dataset.
> vars <- c("MinTemp", "MaxTemp")
> ds <- na.omit(apply(weather[vars], 2, as.numeric))
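The conversion step can be tried on its own without the weather data. The following is a minimal sketch in base R; the data frame `toy` and its values are made up for illustration and are not part of the weather dataset.

```r
# Hypothetical stand-in for the weather data: numbers stored as character.
toy <- data.frame(MinTemp = c("8.0", "14.0", NA, "13.3"),
                  MaxTemp = c("24.3", "26.9", "23.4", "oops"),
                  stringsAsFactors = FALSE)

# apply() returns a matrix; as.numeric() turns strings that cannot be
# parsed into NA (with a warning, suppressed here).
m <- suppressWarnings(apply(toy, 2, as.numeric))

# na.omit() drops every row containing an NA and records the dropped
# row indices in the "na.action" attribute.
ds_toy <- na.omit(m)
omitted_toy <- attr(ds_toy, "na.action")

print(ds_toy)
print(as.integer(omitted_toy))
```

The recorded indices are what allow the same rows to be dropped later from another column of the original data, so that the two stay aligned.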
Now build the archetypes. We don't know how many we might want, but start with 4 to illustrate.
> library(archetypes)
> set.seed(42)
> a <- archetypes(ds, 4)
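As background, archetypal analysis approximates each observation by a convex combination of the archetypes, that is, by weights that are non-negative and sum to one. The following base R sketch only illustrates that idea for a single point and three hand-picked archetypes; the matrix `Z`, the point `x` and the direct linear solve are assumptions for illustration and are not how the package fits its model.

```r
# Three hypothetical archetypes (one per row) in two dimensions.
Z <- rbind(c(0, 0), c(10, 0), c(0, 10))

# A hypothetical observation lying inside the triangle they span.
x <- c(2, 3)

# Find weights alpha with t(Z) %*% alpha = x and sum(alpha) = 1.
# With three archetypes in two dimensions this is a 3 x 3 linear system.
A <- rbind(t(Z), 1)
alpha <- solve(A, c(x, 1))

# The weights form a convex combination when all are >= 0 and sum to 1.
print(alpha)
print(all(alpha >= 0))

# Reconstruct the observation from the archetypes and its weights.
x_hat <- drop(alpha %*% Z)
print(x_hat)
```

A point outside the triangle would produce at least one negative weight in this solve. In that case the fitted model can only approximate it from the boundary, and the gap contributes to the residual sum of squares.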
Now let's explore them with two plots. Todo: Get the two plots and display in a figure.
> atypes(a)
> par(mfrow=c(1,1))
> plot(a, ds, chull=chull(ds), cex=0.6)
> plot(a, ds, adata.show=TRUE, cex=0.6)
Todo: Split out and comment on the following
> ahistory(a, step=0)
> movieplot(a, ds)

> # Avoid local minima by repeating the fit from several random starts.
> set.seed(1960)
> a4 <- stepArchetypes(data=ds, k=4, verbose=FALSE, nrep=4)
> summary(a4)
> plot(a4, ds)
> bestModel(a4)

> # What is the best number of archetypes? Iterate over k.
> set.seed(1960)
> as <- stepArchetypes(data=ds, k=1:10, verbose=FALSE, nrep=4)

> # Have a look at the residual sum of squares (could be used below
> # with the whole data, where there are some warnings).
> rss(as)

> # Look at the iterations. For any that are 1 we might expect warnings
> # and an rss of NA, indicating problems with the initial random starts.
> # We don't have any here.
> iters(as)

> # Now look at the "elbow criterion" for the best number of archetypes: 4 or 7.
> screeplot(as)

> # We plotted 4 above, so let's look at 7.
> a7 <- bestModel(as[[7]])
> plot(a7, ds, chull=chull(ds))

> # Now do it with multiple numeric columns.
> numcol <- c(2:6,8,11:20)
> ds <- na.omit(apply(weather[numcol], 2, as.numeric))
> omitted <- attr(ds, "na.action")

> # Let's have a look at parallel coordinates: no obvious number of prototypes.
> pcplot(ds)

> # Experiment.
> set.seed(1960)
> as <- stepArchetypes(ds, k=1:15, verbose=FALSE, nrep=3)

> # Now look for elbows: maybe 4 or 8. Let's go with 4, the simpler number.
> screeplot(as)
> a4 <- bestModel(as[[4]])

> # Display (transposed to look better).
> t(atypes(a4))
> barplot(a4, ds, percentage=TRUE) # Fails

> # rainbow_hcl() comes from the colorspace package.
> library(colorspace)
> pcplot(a4, ds, data.col=rainbow_hcl(2)[as.numeric(weather$RainTomorrow[-omitted])])
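The two ideas used above, repeating each fit from several random starts and then reading an elbow off the error curve, can be sketched in base R alone. Note the substitution: this sketch uses `kmeans()` as a stand-in for `stepArchetypes()`, with its within-cluster sum of squares playing the role of the rss. It therefore illustrates the selection procedure, not archetypal analysis itself. The data are synthetic and made up for illustration.

```r
# Synthetic data: four well-separated groups in two dimensions.
set.seed(1960)
centres <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
toy <- do.call(rbind, lapply(1:4, function(i)
  cbind(rnorm(50, centres[i, 1]), rnorm(50, centres[i, 2]))))

# Stand-in for stepArchetypes(): for each k, kmeans() is run from
# nstart random starts and keeps the best fit, so that a single poor
# start does not decide the result.
wss <- sapply(1:8, function(k)
  kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Reduction in error gained by each extra component. The elbow is the
# k after which this gain becomes small.
gain <- -diff(wss)

print(round(wss, 1))
print(round(gain, 1))
```

Here the gain from the fourth component should be large and the gain from the fifth small, which is the pattern looked for in a scree plot. On real data, such as the weather columns above, the elbow is often less clear and more than one value of k may be defensible.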