Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


Normalise

Different model builders require different characteristics of the data from which they build their models. For example, when building a clustering using any kind of distance measure, we may need to normalise the data. Otherwise, a variable like Income will overwhelm a variable like Age when calculating distances. A difference of 10 "years" may be more significant than a difference of $10,000, yet 10,000 swamps 10 when the two are added together, as happens when calculating distances.
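The swamping effect can be illustrated with a small sketch. The three records and the `euclidean` helper below are hypothetical, chosen only to show how a plain Euclidean distance treats a 10-year gap in Age and a $10,000 gap in Income on their raw scales.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical records: (Age in years, Income in dollars).
p = (30, 50_000)
q = (40, 50_000)  # 10 years older, same income
r = (30, 60_000)  # same age, $10,000 more income

d_age = euclidean(p, q)     # driven only by the 10-year age gap
d_income = euclidean(p, r)  # driven only by the $10,000 income gap

# On the raw scales the income gap looks 1000 times larger than the age gap.
print(d_age, d_income, d_income / d_age)
```

On these unnormalised values a distance-based clustering would group the records almost entirely by Income, whatever their ages.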

The Normalise option of the Transform tab can perform a number of normalisations: re-centering and rescaling our data to be around zero (Recenter), rescaling our data to be in the range from 0 to 1 (Scale [0,1]), converting the numbers into a rank ordering (Rank), and performing a robust rescaling around zero using the median (-Median/MAD). Figure 6.2 displays the interface.
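The four options can be sketched in plain Python to make the arithmetic concrete. This is an illustration under stated assumptions, not Rattle's own implementation: it assumes Recenter subtracts the mean and divides by the sample standard deviation, that Rank gives tied values their average rank, and that the robust option subtracts the median and divides by the unscaled median absolute deviation. The function names and the sample ages are hypothetical, and Rattle may differ in details such as scaling constants or tie handling.

```python
import statistics


def recenter(xs):
    """Assumed Recenter: subtract the mean, divide by the sample std dev."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)  # zero for constant data, which would fail below
    return [(x - m) / s for x in xs]


def scale01(xs):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def rank(xs):
    """Rank from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def median_mad(xs):
    """Robust rescaling: subtract the median, divide by the unscaled MAD."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    return [(x - med) / mad for x in xs]


ages = [20, 30, 40, 50, 60]  # hypothetical Age values
print(recenter(ages))
print(scale01(ages))
print(rank(ages))
print(median_mad(ages))
```

Each function fails with a division by zero on constant data; a fuller version would need to decide how to treat that case.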

Figure 6.2: Selection of normalisations.
Image rattle-audit-transform-normalise

We can see in Figure 6.2 the approach we take to normalising (and to transforming) our data. The original data is not modified. Instead, a new variable is created, with a prefix added to the variable's name that indicates the kind of transformation. As we can see in the figure, the prefixes are NORM_RECENTER_, NORM_SCALE01_, NORM_RANK_, and NORM_MEDIANAD_.
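The naming convention can be mimicked in a few lines. The sketch below is hypothetical: the `add_normalised` helper and the dictionary-of-lists dataset are assumptions made for illustration, and only the prefix NORM_SCALE01_ comes from the interface described above. It leaves the original variable in place and adds the transformed copy under the prefixed name.

```python
def scale01(xs):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def add_normalised(data, variable, prefix, transform):
    """Return a copy of data with one extra column named prefix + variable.

    The input dictionary and the original column are left unmodified.
    """
    result = dict(data)
    result[prefix + variable] = transform(data[variable])
    return result


# Hypothetical dataset holding a single variable.
data = {"Age": [20, 30, 40, 50, 60]}
out = add_normalised(data, "Age", "NORM_SCALE01_", scale01)

print(sorted(out))  # both Age and NORM_SCALE01_Age are present
```

Keeping the original column alongside the new one makes it straightforward to compare distributions before and after the transformation.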

We can see the effect of the four normalisations by comparing the histogram of the variable Age in Figure 6.3 with the four plots in Figure 6.4, one for each normalisation.

Figure 6.3: Distribution of Age.
Image rattle-audit-explore-distribution-age

Figure 6.4: Normalisations of Age.
Image rattle-audit-transform-normalise-age



Copyright © Graham.Williams@togaware.com