Powered by Squarespace
This form does not yet contain any fields.

    Entries in boosting (5)

    Sunday
    Mar072010

    Misclassification loss and AdaBoost

    I’ve met some people working with AdaBoost who was looking at the misclassification error vs. boosting iteration number plot and wondering: “Why misclassification error rate is increasing on some iterations? Is there something wrong with my AdaBoost implementation? Shouldn’t AdaBoost always decrease the error?” Relax, guys, it’s all right. Instead of optimizing misclassification loss L1(y, y’) = I[y != sign(y’)], AdaBoost optimizes exponential loss L2(y, y’) = exp(-y*y’). Decrease in exponential loss does not always lead to decrease in misclassification loss and vice versa. Nevertheless, small value of the total loss of one type give you some hope that the total loss of another type will be small too.

    Here is plot from the “Elements of Statistical Learning” book comparing these two kinds of loss function on some synthetic data.

    Sunday
    Jan242010

    The nature of boosting

    Most of the people involved into computer vision field are somehow familiar with boosting. They know that boosting is a powerful ML algorithm and should be used in cases when there are many weak learners available, but only few of them are discriminative enough. They also know that boosting achieves its goals by managing weight distribution on training samples: when new weak classifier is added to the linear combination, samples in the training set are reweighed in a way that samples misclassified by the new classifier will have their weights increased and sample classified correctly - decreased. Those sample weights are used when error of the weak classifier over the training set is calculated. So each new weak classifier is focused mostly on the samples which were misclassified by its colleagues.

    That’s the very good explanation of what AdaBoost algorithm does, but to understand boosting better you should get deeper. There are another interpretations of boosting algorithm available, and some of them are really interesting.

    Boosting as a forward stagewise additive modelling

    Boosting can be viewed as a forward stagewise additive modelling process. Let’s assume we have some loss function L(y,y’) and want to find additive combination of weak classifiers that minimizes it. One way to do it is to use greedy approach. First, let’s select single weak classifier h1(x) that minimizes loss function L(y, h1(x)) over the training set. Then, let’s select weak classifier h2(x) that minimizes L(y, h1(x) + h2(x)). The next classifier, h3(x), should minimize L(y, h1(x) + h2(x) + h3(x)) and so on. This method has a great connection to boosting (explored, for example, in this paper or in this great book): if you choose exponential loss function L(y, y’) = exp(-y*y’), you’ll get exactly the AdaBoost algorithm.

    Boosting as a gradient descent

    At each step of boosting we’re selecting weak classifier that maximally reduces loss over the training set. So, we’re in fact moving in the strong classifier space in the direction of the loss function decrease. Each weak classifier selection here can be viewed as a gradient descent step. Approaches like gradient boosting are entirely based on this observation. In gradient boosting, gradient descent is performed semi-explicitly: strong classifier is represented by a vector of its values on the training set samples. Then, at each boosting step partial derivatives of the loss function (with respect to classifier already trained) are calculated, and regression tree is fitted to the gradient value in a least-squares manner. That regression tree is than selected as a weak classifier.

    Boosting and game theory

    Boosting has several connections with the game theory. Consider matrix Mij with number of rows equal to the number of samples in the training set and number of columns equal to the number of weak classifiers. Mij will be equal to 1 if j-th weak classifier classifies i-th sample correctly (otherwise it will have value of 0). Then it can be shown using von Neumann’s minimax theorem (applied to the game in mixed strategies with payoff matrix M) that if for every possible weight distribution there exists weak learner with error less than 1/2-eps, than there exists a convex combination of weak learners with a margin of at least 2*eps for every training sample. It means that AdaBoost has at least a potential for success. One can go further and discover that boosting can be viewed as a special case of an algorithm for approximately solving matrix games. Both topics are covered in this paper very well.

    Connection to game theory also leads us to idea of using linear or convex programming techniques in boosting. Some approaches are based on it, but I’m not familiar with them very well.

    Thursday
    Aug062009

    Genetic tree regression experimental results

    About a month ago I have posted about my experiments with genetic algorithm for boosting trees. Since than, I have not much time to implement something new, but I have found a lot of bugs in my previous implementation. Due to those bugs, on every boosting iteration my algorithm returned regression tree which was quite far from best possible. Actually, it worked only because of amazing ability of boosting to build rather good classifier from huge amount of really bad.

    Click to read more ...

    Thursday
    Jul092009

    Genetic tree boosting advances

    I’ve recently posted about some ideas in additive tree modelling that arise from my previous work. Last few weeks I was implementing genetic tree boosting approach and now I can share with you some results.

    Algorithm itself

    Tree boosting algorithm is currently implemented only for regression. It works as following: tree structure and features for split are selected using genetic algorithm; split thresholds for inner nodes and outcomes for terminals are selected using effective optimization algorithms. Fitness function depends on loss value of the current additive model (previously boosted additive model plus population member).

    Additional features

    I’ve also used shrinkage idea proposed by Friedman, an author of the stochastic gradient boostingapproach. Each tree in the additive model is scaled, but that scale is not taken into account for the tree currently being fitted (while previously boosted model is still scaled). So you fit tree without scale and then add it to the model with it. It makes boosting algorithm not so greedy and allows to fight overfitting.

    Experiment description

    I’ve used my cross-validation tool to compare my approach (I call it EvoBoost) with other algorithms on two UCI datasets: forest fires and concrete. Unfortunately, the only gradient boosting implementation I’ve found for MATLAB was this, and it does not look good. Even more, it does not crash only with max_tree_depth=2 and shrinkage=0.05, so I was forced to set the same parameters for my algorithm for comparison. I’ve also applied y => ln(y + 1) transform to the outcome in the forest fires dataset because the original data does was too skewed towards zero. For comparison, 4-fold cross-validation was performed.

    Experiment results

    Concrete:

    MethodMSE
    GLM(normal, identity) 111.126639
    GLM(poisson, log) 123.476678
    Decision tree 48.828096
    robustfit(bisquare) 183.808541
    robustfit(andrews) 185.738258
    SGB(200,2,0.05) 56.346052
    My approach(200,2,0.05) 57.177905
    My approach(200,3,0.5) 36.590385

    Forest fires:

    MethodMSE
    GLM(normal, identity) 2.075994
    GLM(poisson, log) 1.994226
    Decision tree 3.406037
    robustfit(bisquare) 2.127075
    robustfit(andrews) 2.100954
    SGB(200,2,0.05) 1.971933
    My approach(200,2,0.05) 1.987961
    My approach(200,3,0.5) 1.987159

     

    So, my approach performs approximately as well as the gradient boosting (stochastic part was disabled during experiments). It performs much better (at least on Concrete dataset) with shrinkage=0.5, but, unfortunately, I can’t compare it with the SGB implementation I have cause it crashes. Anyway, I need to install R.

    Future steps

    Currently I want to add support for classification and then perform two experiments: compare SGB with my approach on some classification tasks from UCI (1) and boost Viola-Jones classifier and, then, compare it with the results achieved in my previous work (2). I think I can do it till the end of the month. Good luck to me, yeah.

    Thursday
    Jun182009

    Genetic weak learner and classifier trees

    I have some free time now (most of my friends went to the student military camp), so I decided to continue my last research work.

    If you have read my paper about accelerating boosting with genetic algorithm, you have an idea of what the genetic weak learner is. In the original work that learner was only applied to the families of parametric classifiers that can be represented as points in multidimensional space. Proposed approach can not be used to build an additive model consisting of single-feature stump classifiers in cases when features have no good parametric representation. For example, if feature can be represented only by it’s number in the feature pool and there is no corellation between features with close numbers, genetic algorithm used in the weak learner simply turns into random search. Another bad thing about stump classifier is that it can’t represent dependencies between variables which is always an important feature of the learning algorithm.

    Now I’m trying to fix some problems of the original approach. Here are some ideas I’ve recently came to:

    1. Boosting in fact is a stagewise additive modelling algorithm. So, on each iteration it tries to fit the tree that, if added to the current model, will minimize some loss function. It’s an optimization problem with tree as a function parameter. There exist some well-known approaches for solving such optimization problem with genetic algorithm: GP and GEP.
    2. What’s bad for the stump can be not so bad for the tree with more than one split. When mutation changes feature used for the split in some node, all the features in another nodes remain the same and, so, there is a strong corellation between chromosome states before and after the mutation.
    3. If non-parametric feature usage become possible, I will compare my approach with some popular tree boosting approaches such as gradient boosting on different UCI datasets.
    4. As I already have mentioned in the paper, genetic learner computational complexity depends only on genetic algorithm parameters (such as population size or epoch count) and, therefore, genetic tree boosting algorithm also has an opportunity to significantly outperform existing approaches in learning time.

    Also I’m going to remember my API design skills and implement a library for boosting both tree-based classifiers and tree-based regressors in some convinient way. Keep reading the blog to know more =)