Powered by Squarespace
This form does not yet contain any fields.

    Entries in classification (6)

    Sunday
    Mar072010

    Misclassification loss and AdaBoost

    I’ve met some people working with AdaBoost who was looking at the misclassification error vs. boosting iteration number plot and wondering: “Why misclassification error rate is increasing on some iterations? Is there something wrong with my AdaBoost implementation? Shouldn’t AdaBoost always decrease the error?” Relax, guys, it’s all right. Instead of optimizing misclassification loss L1(y, y’) = I[y != sign(y’)], AdaBoost optimizes exponential loss L2(y, y’) = exp(-y*y’). Decrease in exponential loss does not always lead to decrease in misclassification loss and vice versa. Nevertheless, small value of the total loss of one type give you some hope that the total loss of another type will be small too.

    Here is plot from the “Elements of Statistical Learning” book comparing these two kinds of loss function on some synthetic data.

    Sunday
    Jan242010

    The nature of boosting

    Most of the people involved into computer vision field are somehow familiar with boosting. They know that boosting is a powerful ML algorithm and should be used in cases when there are many weak learners available, but only few of them are discriminative enough. They also know that boosting achieves its goals by managing weight distribution on training samples: when new weak classifier is added to the linear combination, samples in the training set are reweighed in a way that samples misclassified by the new classifier will have their weights increased and sample classified correctly - decreased. Those sample weights are used when error of the weak classifier over the training set is calculated. So each new weak classifier is focused mostly on the samples which were misclassified by its colleagues.

    That’s the very good explanation of what AdaBoost algorithm does, but to understand boosting better you should get deeper. There are another interpretations of boosting algorithm available, and some of them are really interesting.

    Boosting as a forward stagewise additive modelling

    Boosting can be viewed as a forward stagewise additive modelling process. Let’s assume we have some loss function L(y,y’) and want to find additive combination of weak classifiers that minimizes it. One way to do it is to use greedy approach. First, let’s select single weak classifier h1(x) that minimizes loss function L(y, h1(x)) over the training set. Then, let’s select weak classifier h2(x) that minimizes L(y, h1(x) + h2(x)). The next classifier, h3(x), should minimize L(y, h1(x) + h2(x) + h3(x)) and so on. This method has a great connection to boosting (explored, for example, in this paper or in this great book): if you choose exponential loss function L(y, y’) = exp(-y*y’), you’ll get exactly the AdaBoost algorithm.

    Boosting as a gradient descent

    At each step of boosting we’re selecting weak classifier that maximally reduces loss over the training set. So, we’re in fact moving in the strong classifier space in the direction of the loss function decrease. Each weak classifier selection here can be viewed as a gradient descent step. Approaches like gradient boosting are entirely based on this observation. In gradient boosting, gradient descent is performed semi-explicitly: strong classifier is represented by a vector of its values on the training set samples. Then, at each boosting step partial derivatives of the loss function (with respect to classifier already trained) are calculated, and regression tree is fitted to the gradient value in a least-squares manner. That regression tree is than selected as a weak classifier.

    Boosting and game theory

    Boosting has several connections with the game theory. Consider matrix Mij with number of rows equal to the number of samples in the training set and number of columns equal to the number of weak classifiers. Mij will be equal to 1 if j-th weak classifier classifies i-th sample correctly (otherwise it will have value of 0). Then it can be shown using von Neumann’s minimax theorem (applied to the game in mixed strategies with payoff matrix M) that if for every possible weight distribution there exists weak learner with error less than 1/2-eps, than there exists a convex combination of weak learners with a margin of at least 2*eps for every training sample. It means that AdaBoost has at least a potential for success. One can go further and discover that boosting can be viewed as a special case of an algorithm for approximately solving matrix games. Both topics are covered in this paper very well.

    Connection to game theory also leads us to idea of using linear or convex programming techniques in boosting. Some approaches are based on it, but I’m not familiar with them very well.

    Thursday
    Jul092009

    Genetic tree boosting advances

    I’ve recently posted about some ideas in additive tree modelling that arise from my previous work. Last few weeks I was implementing genetic tree boosting approach and now I can share with you some results.

    Algorithm itself

    Tree boosting algorithm is currently implemented only for regression. It works as following: tree structure and features for split are selected using genetic algorithm; split thresholds for inner nodes and outcomes for terminals are selected using effective optimization algorithms. Fitness function depends on loss value of the current additive model (previously boosted additive model plus population member).

    Additional features

    I’ve also used shrinkage idea proposed by Friedman, an author of the stochastic gradient boostingapproach. Each tree in the additive model is scaled, but that scale is not taken into account for the tree currently being fitted (while previously boosted model is still scaled). So you fit tree without scale and then add it to the model with it. It makes boosting algorithm not so greedy and allows to fight overfitting.

    Experiment description

    I’ve used my cross-validation tool to compare my approach (I call it EvoBoost) with other algorithms on two UCI datasets: forest fires and concrete. Unfortunately, the only gradient boosting implementation I’ve found for MATLAB was this, and it does not look good. Even more, it does not crash only with max_tree_depth=2 and shrinkage=0.05, so I was forced to set the same parameters for my algorithm for comparison. I’ve also applied y => ln(y + 1) transform to the outcome in the forest fires dataset because the original data does was too skewed towards zero. For comparison, 4-fold cross-validation was performed.

    Experiment results

    Concrete:

    MethodMSE
    GLM(normal, identity) 111.126639
    GLM(poisson, log) 123.476678
    Decision tree 48.828096
    robustfit(bisquare) 183.808541
    robustfit(andrews) 185.738258
    SGB(200,2,0.05) 56.346052
    My approach(200,2,0.05) 57.177905
    My approach(200,3,0.5) 36.590385

    Forest fires:

    MethodMSE
    GLM(normal, identity) 2.075994
    GLM(poisson, log) 1.994226
    Decision tree 3.406037
    robustfit(bisquare) 2.127075
    robustfit(andrews) 2.100954
    SGB(200,2,0.05) 1.971933
    My approach(200,2,0.05) 1.987961
    My approach(200,3,0.5) 1.987159

     

    So, my approach performs approximately as well as the gradient boosting (stochastic part was disabled during experiments). It performs much better (at least on Concrete dataset) with shrinkage=0.5, but, unfortunately, I can’t compare it with the SGB implementation I have cause it crashes. Anyway, I need to install R.

    Future steps

    Currently I want to add support for classification and then perform two experiments: compare SGB with my approach on some classification tasks from UCI (1) and boost Viola-Jones classifier and, then, compare it with the results achieved in my previous work (2). I think I can do it till the end of the month. Good luck to me, yeah.

    Thursday
    Jun182009

    Genetic weak learner and classifier trees

    I have some free time now (most of my friends went to the student military camp), so I decided to continue my last research work.

    If you have read my paper about accelerating boosting with genetic algorithm, you have an idea of what the genetic weak learner is. In the original work that learner was only applied to the families of parametric classifiers that can be represented as points in multidimensional space. Proposed approach can not be used to build an additive model consisting of single-feature stump classifiers in cases when features have no good parametric representation. For example, if feature can be represented only by it’s number in the feature pool and there is no corellation between features with close numbers, genetic algorithm used in the weak learner simply turns into random search. Another bad thing about stump classifier is that it can’t represent dependencies between variables which is always an important feature of the learning algorithm.

    Now I’m trying to fix some problems of the original approach. Here are some ideas I’ve recently came to:

    1. Boosting in fact is a stagewise additive modelling algorithm. So, on each iteration it tries to fit the tree that, if added to the current model, will minimize some loss function. It’s an optimization problem with tree as a function parameter. There exist some well-known approaches for solving such optimization problem with genetic algorithm: GP and GEP.
    2. What’s bad for the stump can be not so bad for the tree with more than one split. When mutation changes feature used for the split in some node, all the features in another nodes remain the same and, so, there is a strong corellation between chromosome states before and after the mutation.
    3. If non-parametric feature usage become possible, I will compare my approach with some popular tree boosting approaches such as gradient boosting on different UCI datasets.
    4. As I already have mentioned in the paper, genetic learner computational complexity depends only on genetic algorithm parameters (such as population size or epoch count) and, therefore, genetic tree boosting algorithm also has an opportunity to significantly outperform existing approaches in learning time.

    Also I’m going to remember my API design skills and implement a library for boosting both tree-based classifiers and tree-based regressors in some convinient way. Keep reading the blog to know more =)

     

    Thursday
    Jun112009

    The story of skin detection

    While working on my current project at R&D department of the BS Graphics company I’ve understand the importance of the skin detection very well. The people tracking system we’re developing uses it for two different reasons:

    1. It allows us to reduce amount of false positives that face detector produces. In our tracking system faces with less than a half of pixels classified as skin are simply rejected. It was the first application of the skin detection in our tracking system and it still serves us well.
    2. Face detection works slowly, especially when video frame is big. So, we did such a thing: when full-scale face search is running over frame, both skin and foreground masks are first calculated. Then only regions built from intersection of those masks are used for face detection. In most of the scenes that approach accelerates face detection 2 or 3 times.

    But the skin detection has it’s drawbacks and limitations. The most important fact about it is the following: skin color in the frame depends on camera quality and lighting conditions a lot. So, skin classifier learned with some specific lighting conditions can perform vary bad when those lighting conditions change.

    During the last few months I’ve tried 3 different approaches to skin detection, and now I’m more or less satisfied. Here’s the list of what I’ve tried. Maybe it will help someone.

    1. Skin detector proposed by Jones and Rehg. It uses Bayes classifier to classify pixels with given color to be skin or not. Conditional distribution learning is performed offline using a set of images with labeled skin pixels. Gaussian mixture models are used to represent conditional distribution of skin color. Mixture component parameters trained by Jones and Rehg are given in the end of the paper, so it is not necessary to train classifier by yourself. The only important thing is to convert GMM back to 3d histogram for better performance. Unfortunately, classifier is very sensitive to camera exposure and lighting changes. It is good for photos from web, but very bad for real-world scenarios.
    2.  Adaptive version of Jones-Rehg skin detector that uses information from face detector. Conditional distribution is learned using pixels lying in the face regions. Unfortunately, other skin-colored parts (such as arms, neck etc) can get into the negative samples, therefore, corrupting color distribution. That fact gave me the following idea: adaptation should not use any negative samples. It should fit some descriptive model to the skin color data we have instead.
    3. Wimmer-Radig descriptive skin model. Simple decision rule is fitted to the skin color distribution (in normalized RGB). I re-learn temporary skin model in some “key” frames and interpolate final skin model between them to make model evolution continuous. That approach is the best one until now, but, probably, some heavy tests will reveal it’s problems too.

    The following video shows the 3rd approach at work. Semi-transparent red mask indicates pixels classified as skin.

     

     

     

    Tuesday
    Feb102009

    Chinese, Japanese or Korean?

    Today I've found an interesting site AllLookSame.com. There is a test on web page that allows you to check your skills in recognizing Chinese, Korean and Japanese people faces. And I've failed that test. I've guessed only 4 of 20 people. Less than a quarter. Maybe computer can do it much more better?


    That would be really interesting to train intensity-based SVM classifier or, probably, boosted classifier based on haar or sparse features that can told us if given face is Korean, Japanese or Chinese. But that will require training set of Asian faces with ground truth. Because of my inability to distinguish such faces I can't do it on my own.

    But if I found such a face database some day, I swear I'll do my little research.