DATA MINING
Desktop Survival Guide by Graham Williams

Normalising Data
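R's scale function re-centers and re-scales the columns of a numeric matrix. Re-centering subtracts each column's mean from every value in that column. Re-scaling then divides each centered value by the column's standard deviation; the root-mean-square is used as the divisor only when centering is turned off. In the output below, the scaled:center attribute holds the column means and scaled:scale holds the divisors.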
> ds <- wine[1:20,c(2,9,14)]
> summary(ds)
Alcohol Nonflavanoids Proline
Min. :13.16 Min. :0.1700 Min. : 735
1st Qu.:13.72 1st Qu.:0.2600 1st Qu.:1061
Median :14.11 Median :0.2950 Median :1280
Mean :14.01 Mean :0.2970 Mean :1235
3rd Qu.:14.32 3rd Qu.:0.3225 3rd Qu.:1352
Max. :14.83 Max. :0.4300 Max. :1680
> ds
Alcohol Nonflavanoids Proline
1 14.23 0.28 1065
2 13.20 0.26 1050
3 13.16 0.30 1185
4 14.37 0.24 1480
5 13.24 0.39 735
6 14.20 0.34 1450
7 14.39 0.30 1290
8 14.06 0.31 1295
9 14.83 0.29 1045
10 13.86 0.22 1045
11 14.10 0.22 1510
12 14.12 0.26 1280
13 13.75 0.29 1320
14 14.75 0.43 1150
15 14.38 0.29 1547
16 13.63 0.30 1310
17 14.30 0.33 1280
18 13.83 0.40 1130
19 14.19 0.32 1680
20 13.64 0.17 845
> scale(ds)
Alcohol Nonflavanoids Proline
1 0.4630901 -0.27054355 -0.7184008
2 -1.7198976 -0.58883009 -0.7819386
3 -1.8046738 0.04774298 -0.2100983
4 0.7598069 -0.90711662 1.0394785
5 -1.6351214 1.48003239 -2.1162325
6 0.3995079 0.68431605 0.9124029
7 0.8021950 0.04774298 0.2346663
8 0.1027912 0.20688625 0.2558456
9 1.7347334 -0.11140029 -0.8031179
10 -0.3210899 -1.22540316 -0.8031179
11 0.1875674 -1.22540316 1.1665541
12 0.2299555 -0.58883009 0.1923078
13 -0.5542245 -0.11140029 0.3617419
14 1.5651810 2.11660546 -0.3583532
15 0.7810009 -0.11140029 1.3232807
16 -0.8085532 0.04774298 0.3193834
17 0.6114485 0.52517278 0.1923078
18 -0.3846721 1.63917565 -0.4430703
19 0.3783139 0.36602952 1.8866493
20 -0.7873591 -2.02111950 -1.6502886
attr(,"scaled:center")
Alcohol Nonflavanoids Proline
14.0115 0.2970 1234.6000
attr(,"scaled:scale")
Alcohol Nonflavanoids Proline
0.47183042 0.06283646 236.07991510
> ds
Alcohol Nonflavanoids Proline
1 14.23 0.28 1065
2 13.20 0.26 1050
3 13.16 0.30 1185
4 14.37 0.24 1480
5 13.24 0.39 735
6 14.20 0.34 1450
7 14.39 0.30 1290
8 14.06 0.31 1295
9 14.83 0.29 1045
10 13.86 0.22 1045
11 14.10 0.22 1510
12 14.12 0.26 1280
13 13.75 0.29 1320
14 14.75 0.43 1150
15 14.38 0.29 1547
16 13.63 0.30 1310
17 14.30 0.33 1280
18 13.83 0.40 1130
19 14.19 0.32 1680
20 13.64 0.17 845
> summary(scale(ds))
Alcohol Nonflavanoids Proline
Min. :-1.805e+00 Min. :-2.021e+00 Min. :-2.116e+00
1st Qu.:-6.125e-01 1st Qu.:-5.888e-01 1st Qu.:-7.343e-01
Median : 2.088e-01 Median :-3.183e-02 Median : 1.923e-01
Mean :-3.381e-15 Mean :-6.217e-16 Mean : 3.886e-16
3rd Qu.: 6.485e-01 3rd Qu.: 4.058e-01 3rd Qu.: 4.994e-01
Max. : 1.735e+00 Max. : 2.117e+00 Max. : 1.887e+00
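As a rough check on what scale computes, the sketch below reproduces the default behaviour by hand for two of the columns printed above. The values are typed in directly so that the wine dataset is not needed, and the names m, centers, sds, manual, rms and uncentered are introduced here purely for illustration.

```r
# Two of the columns printed above, entered by hand (20 rows each).
m <- cbind(
  Alcohol = c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 14.06, 14.83, 13.86,
              14.10, 14.12, 13.75, 14.75, 14.38, 13.63, 14.30, 13.83, 14.19, 13.64),
  Nonflavanoids = c(0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0.29, 0.22,
                    0.22, 0.26, 0.29, 0.43, 0.29, 0.30, 0.33, 0.40, 0.32, 0.17)
)

# Default behaviour: subtract the column mean, then divide by the column
# standard deviation.
centers <- colMeans(m)
sds     <- apply(m, 2, sd)
manual  <- sweep(sweep(m, 2, centers, "-"), 2, sds, "/")
same_as_scale <- isTRUE(all.equal(manual, scale(m), check.attributes = FALSE))

# With center = FALSE the divisor is the root-mean-square, which R defines
# as sqrt(sum(x^2) / (n - 1)).
rms        <- sqrt(colSums(m^2) / (nrow(m) - 1))
uncentered <- scale(m, center = FALSE)
same_rms   <- isTRUE(all.equal(unname(rms),
                               unname(attr(uncentered, "scaled:scale"))))

print(c(same_as_scale, same_rms))
```

Comparing centers and sds with the scaled:center and scaled:scale attributes in the transcript is a quick way to confirm that the divisor in the default case is the standard deviation.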
The function rescaler from Hadley Wickham's reshape package supports five methods for rescaling or standardising data: rescale to the range [0, 1]; subtract the mean and divide by the standard deviation; subtract the median and divide by the median absolute deviation; convert values to ranks; and do nothing.
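The five methods listed above can be sketched in base R as follows. This is not the reshape package's code: rescale_sketch and its type labels are hypothetical names chosen for this illustration, and the input is the first five Proline values from the transcript.

```r
# Hypothetical helper illustrating the five rescaling methods on a numeric
# vector. It assumes x has no missing values and is not constant.
rescale_sketch <- function(x, type = c("range", "sd", "robust", "rank", "none")) {
  type <- match.arg(type)
  switch(type,
    range  = (x - min(x)) / (max(x) - min(x)),  # rescale to [0, 1]
    sd     = (x - mean(x)) / sd(x),             # mean 0, standard deviation 1
    robust = (x - median(x)) / mad(x),          # mad() applies R's default constant 1.4826
    rank   = rank(x),                           # replace values by their ranks
    none   = x                                  # do nothing
  )
}

proline <- c(1065, 1050, 1185, 1480, 735)

by_range  <- rescale_sketch(proline, "range")
by_sd     <- rescale_sketch(proline, "sd")
by_robust <- rescale_sketch(proline, "robust")
by_rank   <- rescale_sketch(proline, "rank")
unchanged <- rescale_sketch(proline, "none")

print(by_range)
print(by_rank)
```

The robust variant is less sensitive to outliers than the standard-deviation variant because the median and the median absolute deviation are barely affected by a few extreme values.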