DATA MINING
Desktop Survival Guide
by
Graham Williams
Exercises
Using the help facilities of R, find help on the following topics and identify the relevant R function:
Regression trees
Generalised Linear Models
ROC
Mean and Median and Standard Deviation
Principal Components
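As a starting point for this exercise, the following is a minimal sketch of R's built-in help facilities. The search terms are only illustrations chosen here, not the expected answers, and the variable names (median_matches, pca_hits, sd_help) are hypothetical.

```r
# apropos() lists objects on the search path whose names match a pattern.
median_matches <- apropos("median")
print(median_matches)

# help.search() (shorthand: ??"some topic") searches the installed help
# pages by keyword and records which package each match comes from.
pca_hits <- help.search("principal components")
print(unique(pca_hits$matches$Package))

# Once a candidate function name is known, help() (shorthand: ?name)
# retrieves its documentation; printing the result displays the page.
sd_help <- help("sd")
```

At the interactive prompt the shorthand forms are usually more convenient: typing ?sd displays the same page as help("sd").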
Explore the scale function, run the examples, and describe in a couple of sentences what it does.
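One way to approach this exercise is sketched below, using a small matrix made up for illustration (the names m and s are hypothetical). It runs the documented examples and then inspects what scale returns with its default arguments.

```r
# Run the examples from the scale help page.
example("scale")

# A small made-up matrix: column 1 is 1, 2, 3 and column 2 is 10, 20, 30.
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# With default arguments, scale() works column by column.
s <- scale(m)
print(s)

# Inspect the column means and standard deviations of the result.
print(colMeans(s))
print(apply(s, 2, sd))

# The values used for each column are stored as attributes of the result.
print(attr(s, "scaled:center"))
print(attr(s, "scaled:scale"))
```

Comparing the attributes with colMeans(m) and apply(m, 2, sd) is a quick way to check any description of what the function does.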
Calculate the following in R:
Copyright © 2004-2010 Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010