The nature of boosting
Sunday, January 24, 2010 at 9:07PM Most of the people involved into computer vision field are somehow familiar with boosting. They know that boosting is a powerful ML algorithm and should be used in cases when there are many weak learners available, but only few of them are discriminative enough. They also know that boosting achieves its goals by managing weight distribution on training samples: when new weak classifier is added to the linear combination, samples in the training set are reweighed in a way that samples misclassified by the new classifier will have their weights increased and sample classified correctly - decreased. Those sample weights are used when error of the weak classifier over the training set is calculated. So each new weak classifier is focused mostly on the samples which were misclassified by its colleagues.
That’s the very good explanation of what AdaBoost algorithm does, but to understand boosting better you should get deeper. There are another interpretations of boosting algorithm available, and some of them are really interesting.
Boosting as a forward stagewise additive modelling
Boosting can be viewed as a forward stagewise additive modelling process. Let’s assume we have some loss function L(y,y’) and want to find additive combination of weak classifiers that minimizes it. One way to do it is to use greedy approach. First, let’s select single weak classifier h1(x) that minimizes loss function L(y, h1(x)) over the training set. Then, let’s select weak classifier h2(x) that minimizes L(y, h1(x) + h2(x)). The next classifier, h3(x), should minimize L(y, h1(x) + h2(x) + h3(x)) and so on. This method has a great connection to boosting (explored, for example, in this paper or in this great book): if you choose exponential loss function L(y, y’) = exp(-y*y’), you’ll get exactly the AdaBoost algorithm.
Boosting as a gradient descent
At each step of boosting we’re selecting weak classifier that maximally reduces loss over the training set. So, we’re in fact moving in the strong classifier space in the direction of the loss function decrease. Each weak classifier selection here can be viewed as a gradient descent step. Approaches like gradient boosting are entirely based on this observation. In gradient boosting, gradient descent is performed semi-explicitly: strong classifier is represented by a vector of its values on the training set samples. Then, at each boosting step partial derivatives of the loss function (with respect to classifier already trained) are calculated, and regression tree is fitted to the gradient value in a least-squares manner. That regression tree is than selected as a weak classifier.
Boosting and game theory
Boosting has several connections with the game theory. Consider matrix Mij with number of rows equal to the number of samples in the training set and number of columns equal to the number of weak classifiers. Mij will be equal to 1 if j-th weak classifier classifies i-th sample correctly (otherwise it will have value of 0). Then it can be shown using von Neumann’s minimax theorem (applied to the game in mixed strategies with payoff matrix M) that if for every possible weight distribution there exists weak learner with error less than 1/2-eps, than there exists a convex combination of weak learners with a margin of at least 2*eps for every training sample. It means that AdaBoost has at least a potential for success. One can go further and discover that boosting can be viewed as a special case of an algorithm for approximately solving matrix games. Both topics are covered in this paper very well.
Connection to game theory also leads us to idea of using linear or convex programming techniques in boosting. Some approaches are based on it, but I’m not familiar with them very well.
hr0nix |
Post a Comment |
boosting,
classification,
regression 

