DATA MINING
Desktop Survival Guide
by Graham Williams

Boosting employs a weak learning algorithm (which we identify as the learner L). Suppose the dataset (data) consists of n entities described using m variables (lines 1 and 2 of the meta-code below). The m-th variable (i.e., the last variable of each observation) is assumed to be the classification of the observation. In the algorithm presented here we denote the training data (an n by m matrix) as X (line 3) and the class associated with each observation in the training data (a vector of length n) as Y (line 4). Without loss of generality we can restrict the class to be either -1 (perhaps representing no) or 1 (representing yes). This will simplify the mathematics. Each observation in the training data is initially assigned the same weight: w_i = 1/n (line 5).

The weak learner will need to use the weights associated with each observation. This may be handled directly by the learner (e.g., rpart takes a weights option) or else by generating a modified dataset by sampling the original dataset based on the weights.
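When the learner cannot accept weights directly, the resampling alternative can be sketched as follows. The weight vector and the seed here are invented for illustration, and the commented rpart call simply indicates where the resampled data would be used:

```r
## Approximate per-observation weights by sampling row indices with
## replacement, with probability proportional to the weights: heavily
## weighted observations appear more often in the generated dataset.
set.seed(42)                               # illustrative seed
n <- 10
w <- c(rep(0.1, 6), rep(0.15, 4))          # invented weight vector
idx <- sample(1:n, size=n, replace=TRUE, prob=w)
## The weak learner is then built on the resampled rows, e.g.
##   model <- rpart(class ~ ., data=train[idx, ])
```

Entities with weight 0.15 are half as likely again to be drawn as those with weight 0.1, so on average they occupy more of the resampled dataset.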

The first model, M_1, is built by applying the weak learner L to the data X with weights w (line 7). M_1, predicting either -1 or 1, is then used to identify the set of indices of misclassified entities (i.e., those i for which M_1(X_i) ≠ Y_i), denoted as ms (line 8). For a completely accurate model we would have ms = ∅. Of course the model is expected to be only slightly better than random, so ms is unlikely to be empty.

A relative error ε_1 for M_1 is calculated as the relative sum of the weights of the misclassified entities (line 9): ε_1 = Σ_{i∈ms} w_i / Σ_{i=1}^{n} w_i. This is used to calculate α_1 = log((1 - ε_1)/ε_1) (line 10), used, in turn, to adjust the weights (line 11): each w_i with i ∈ ms is multiplied by exp(α_1). All weights could be either decreased or increased depending on whether the model correctly classifies the corresponding observation, as in the original formulation of the algorithm. However, this can be simplified to only increasing the weights of the misclassified entities. These entities thus become more important.

The learning algorithm is then applied to the newly weighted data, with the learner expected to give more focus to the difficult entities whilst building the next model, M_2. The weights are then modified again using the errors from M_2. The model building and weight modification are then repeated until the new model performs no better than random (i.e., the error is 50% or more: ε_t ≥ 0.5), or is perfect (i.e., the error rate is 0% and ms is empty), or perhaps after a fixed number of iterations.

The final model M (line 14) combines the other models using a weighted sum of the outputs of these other models: M(x) = sign(Σ_t α_t M_t(x)). The weights, α_t, reflect the accuracy of each of the constituent models.
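The whole procedure can be collected into a runnable sketch. The decision-stump weak learner and the function names (stump, predict.stump, adaboost) are illustrative assumptions, not the book's own meta-code listing; the line numbers in the comments refer to the meta-code lines cited in the text:

```r
stump <- function(x, y, w)
{
  # Weighted decision stump on one numeric variable: choose the
  # threshold s and direction d minimising the weighted error.
  best <- list(err=Inf)
  for (s in sort(unique(x)))
    for (d in c(-1, 1))
    {
      pred <- ifelse(x <= s, d, -d)
      err <- sum(w[pred != y])/sum(w)
      if (err < best$err) best <- list(s=s, d=d, err=err)
    }
  best
}

predict.stump <- function(m, x) ifelse(x <= m$s, m$d, -m$d)

adaboost <- function(x, y, T=10)
{
  n <- length(y)                      # lines 1-4: data and class in {-1, 1}
  w <- rep(1/n, n)                    # line 5: equal initial weights
  alpha <- numeric(0); models <- list()
  for (t in 1:T)
  {
    m <- stump(x, y, w)               # line 7: weak learner with weights w
    ms <- which(predict.stump(m, x) != y)   # line 8: misclassified entities
    eps <- sum(w[ms])/sum(w)          # line 9: relative error
    if (eps >= 0.5) break             # no better than random: stop
    a <- if (length(ms)) log((1 - eps)/eps) else log(1e6)  # line 10 (capped when perfect)
    alpha <- c(alpha, a); models <- c(models, list(m))
    if (length(ms) == 0) break        # perfect model: stop
    w[ms] <- w[ms]*exp(a)             # line 11: boost misclassified weights
  }
  list(alpha=alpha, models=models)
}

predict.adaboost <- function(fit, x)
{
  # Line 14: sign of the alpha-weighted sum of the constituent models.
  score <- rep(0, length(x))
  for (t in seq_along(fit$alpha))
    score <- score + fit$alpha[t]*predict.stump(fit$models[[t]], x)
  sign(score)
}
```

For example, on a trivially separable dataset a single stump suffices:

```r
fit <- adaboost(c(1, 2, 3, 4, 5, 6), c(1, 1, 1, -1, -1, -1))
predict.adaboost(fit, c(2.5, 5.5))
```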

A simple example can illustrate the process. Suppose the number of training entities, n, is 10. Each weight, w_i, is thus initially 0.1 (line 5). Imagine the first model, M_1, correctly classifies the first 6 of the 10 entities (e.g., ms = {7, 8, 9, 10}), so that ε_1 = 0.4. Then α_1 = log(0.6/0.4) ≈ 0.41, and this is the weight that will be used to multiply the results from this model to be added into the overall model score. The weights of the misclassified entities are multiplied by exp(α_1) = 1.5, giving w = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.15, 0.15, 0.15). That is, they now have more importance for the next model build. Suppose now that M_2 correctly classifies 8 of the entities (with ms = {1, 8}), so that ε_2 = 0.25/1.2 ≈ 0.21 and α_2 = log(3.8) ≈ 1.34. Thus exp(α_2) = 3.8 and w = (0.38, 0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.57, 0.15, 0.15). Note how record 8 is proving particularly troublesome and so its weight is now the highest.
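The arithmetic of such an example can be checked directly in R. Here the first model is taken to misclassify entities 7 to 10 and the second to misclassify entities 1 and 8 (one concrete choice consistent with six and eight correct classifications respectively):

```r
n <- 10
w <- rep(1/n, n)                  # line 5: every weight starts at 0.1

ms <- 7:10                        # entities misclassified by the first model
eps1 <- sum(w[ms])/sum(w)         # relative error: 0.4
alpha1 <- log((1 - eps1)/eps1)    # log(1.5), about 0.41
w[ms] <- w[ms]*exp(alpha1)        # those weights become 0.15

ms <- c(1, 8)                     # entities misclassified by the second model
eps2 <- sum(w[ms])/sum(w)         # 0.25/1.2, about 0.21
alpha2 <- log((1 - eps2)/eps2)    # log(3.8), about 1.34
w[ms] <- w[ms]*exp(alpha2)        # w[1] becomes 0.38, w[8] becomes 0.57
which.max(w)                      # record 8 now carries the largest weight
```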

We can understand the behaviour of the function used to weight the models (α = log((1 - ε)/ε)) and to adjust the observation weights (exp(α)) by plotting both of these for the range of errors we expect (from 0 to 0.5, or 50%).

plot(function(x) exp(log((1-x)/x)), xlim=c(0.01, 0.5), ylim=c(0.01, 200),
     xlab=expression(Error: epsilon), ylab="", log="y", lty=1, col=2, yaxt="n")
plot(function(x) log((1-x)/x), add=TRUE, lty=2, col=3)
axis(side=2, at=c(0.01, 0.1, 1, 10, 100), c(0.01, 0.1, 1, 10, 100))
grid()
exp.leg <- expression(alpha == log(over(1-epsilon, epsilon)), exp(alpha))
legend("topright", exp.leg, lty=2:1, col=3:2, inset=c(0.04, 0.1))