Data Mining Resources

“So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.” Professor Hal Varian, Chief Economist at Google, speaking to the New York Times in February 2009.

Togaware Resources

Books on my Bookshelf Include:

A literate, agile, approach to data mining projects means the data miner's toolbox will include R and LaTeX (for using Sweave).

Other Resources

Using R for Data Mining

The open source statistical programming language R (based on S) is in daily use in academia and in business and government. We use R for data mining within the Australian Taxation Office. Rattle is used by those wishing to interact with R through a GUI.

R is memory based so that on 32bit CPUs you are limited to smaller datasets (perhaps 50,000 up to 100,000, depending on what you are doing). Deploying R on 64bit multiple CPU (AMD64) servers running GNU/Linux with 32GB of main memory provides a powerful platform for data mining.

R is open source, thus providing assurance that there will always be the opportunity to fix and tune things that suit our specific needs, rather than rely on having to convince a vendor to fix or tune their product to suit our needs.

Also, by being open source, we can be sure that the code will always be available, unlike some of the data mining products that have disappearded (e.g., IBM's Intelligent Miner).

Open standards are important for users, but vendors resist them for obvious reasons, and would prefer to lock you in to their products. A number of commercial tools claim support of, for example, the open standard PMML for interoperability (sharing models between applications). But the support is patchy and not worth the effort. We have started a PMML effort in R to attempt to address the desire for interoperability.

Specific commercial statistical products are excellent in handling very large datasets. But they are limited in the analytic algorithms they provide. Commercial vendors, naturally, need to be convinced of the usefulness of implementing new algorithms. On the other hand, a vast selection has been available for deployment in R for a long time.




Ads Follow - These are Not Endorsed by Togaware
Shop at Amazon