Powered by Squarespace
This form does not yet contain any fields.
    Sunday
    24Jan2010

    The nature of boosting

    Most of the people involved into computer vision field are somehow familiar with boosting. They know that boosting is a powerful ML algorithm and should be used in cases when there are many weak learners available, but only few of them are discriminative enough. They also know that boosting achieves its goals by managing weight distribution on training samples: when new weak classifier is added to the linear combination, samples in the training set are reweighed in a way that samples misclassified by the new classifier will have their weights increased and sample classified correctly - decreased. Those sample weights are used when error of the weak classifier over the training set is calculated. So each new weak classifier is focused mostly on the samples which were misclassified by its colleagues.

    That’s the very good explanation of what AdaBoost algorithm does, but to understand boosting better you should get deeper. There are another interpretations of boosting algorithm available, and some of them are really interesting.

    Boosting as a forward stagewise additive modelling

    Boosting can be viewed as a forward stagewise additive modelling process. Let’s assume we have some loss function L(y,y’) and want to find additive combination of weak classifiers that minimizes it. One way to do it is to use greedy approach. First, let’s select single weak classifier h1(x) that minimizes loss function L(y, h1(x)) over the training set. Then, let’s select weak classifier h2(x) that minimizes L(y, h1(x) + h2(x)). The next classifier, h3(x), should minimize L(y, h1(x) + h2(x) + h3(x)) and so on. This method has a great connection to boosting (explored, for example, in this paper or in this great book): if you choose exponential loss function L(y, y’) = exp(-y*y’), you’ll get exactly the AdaBoost algorithm.

    Boosting as a gradient descent

    At each step of boosting we’re selecting weak classifier that maximally reduces loss over the training set. So, we’re in fact moving in the strong classifier space in the direction of the loss function decrease. Each weak classifier selection here can be viewed as a gradient descent step. Approaches like gradient boosting are entirely based on this observation. In gradient boosting, gradient descent is performed semi-explicitly: strong classifier is represented by a vector of its values on the training set samples. Then, at each boosting step partial derivatives of the loss function (with respect to classifier already trained) are calculated, and regression tree is fitted to the gradient value in a least-squares manner. That regression tree is than selected as a weak classifier.

    Boosting and game theory

    Boosting has several connections with the game theory. Consider matrix Mij with number of rows equal to the number of samples in the training set and number of columns equal to the number of weak classifiers. Mij will be equal to 1 if j-th weak classifier classifies i-th sample correctly (otherwise it will have value of 0). Then it can be shown using von Neumann’s minimax theorem (applied to the game in mixed strategies with payoff matrix M) that if for every possible weight distribution there exists weak learner with error less than 1/2-eps, than there exists a convex combination of weak learners with a margin of at least 2*eps for every training sample. It means that AdaBoost has at least a potential for success. One can go further and discover that boosting can be viewed as a special case of an algorithm for approximately solving matrix games. Both topics are covered in this paper very well.

    Connection to game theory also leads us to idea of using linear or convex programming techniques in boosting. Some approaches are based on it, but I’m not familiar with them very well.

    Wednesday
    20Jan2010

    Localization and mapping

    Hey, have you ever seen George Klein’s SLAM engine in action? I can’t even call it awesome. It’s better. And it’s not just the algorithm itself. Just look into possibilities it opens.

    Friday
    25Dec2009

    An interesting approach to camera calibration

    Camera calibration is a process of determining intrinsic (like principal point or focal length) and extrinsic (position and orientation in space) parameters of a camera, which is often described by the pinhole camera model. In computer vision we usually perform calibration by analyzing images taken from camera. The most widely used approach to camera calibration is based on this paper by Zhengyou Zhang. It involves chessboard (or some other planar calibration pattern) and consists of the following steps:

    1. Find a chessboard (bigger is better). Note that it should have distinct width and height (measured in chessboard squares) which should both be even (otherwise you will be unable to determine chessboard orientation given its picture).
    2. Find a “calibration dude”.
    3. Calibration dude takes a chessboard and waves it in front of the camera attached to a computer with calibration software installed. Calibration software takes about 20 images with distinct chessboard orientations (camera orientation remains the same, of course), finds inner corners of the chessboard on every image and then uses them together with information about real-world size of chessboard to determine intrinsic camera parameters.
    4. Calibration dude puts chessboard on the floor in a way camera still can see it. Calibration software than takes one more image from camera, finds chessboard corners on it (again) and calculates camera position assuming that some predefined chessboard corner is located at the coordinate system origin and chessboard sides are oriented towards coordinate system axes. Of course, any other chessboard orientation can be specified in software, but this one is the most simple.

    What problems do we have there? First of all, camera can see no floor at all, so we can’t just put chessboard on it during step 4. Instead we need to set it up somewhere else, not on the ground level. We should then carefully measure its position and orientation and pass them as an input to the calibration tool.

    What if we have more than one camera seeing no floor, and those cameras are not overlapping? In this case we should repeat process described above for each camera, carefully measuring chessboard position in the world coordinate system every time. In fact, it’s a pain in the ass. Calibrating multiple cameras that way can be really slow and error-prone.

    Much more interesting approach to multiple non-overlapping camera calibration was proposed in this paper. Its key idea is to fix chessboard position (put it at the origin) and move mirror instead. Cameras will see chessboard reflection in that mirror and use reflected image for calibration. Of course, some questions arise.

    1. Is it legal to determine intrinsic camera parameters using reflected chessboard image? Answer is simple: yes. Authors prove that common calibration techniques give same result (except of coordinate system handedness) when applied to mirrored images.
    2. Don’t we need to know position and orientation of the mirror when calibrating extrinsic parameters? No, we don’t. It turns out that every mirrored chessboard image imposes constraint on the position and orientation of the real camera. And if we have five (or more) such images, we can reconstruct position and orientation without any knowledge of mirror position.

    This approach can save a lot of time and help to reduce part of the calibration error that arises from incorrect chessboard position and orientation determination. But it has it’s own drawbacks, of course. First of all, mirror is rather heavy. It’s not easy to manipulate it if your calibration dude is not a beefcake. Next, it’s hard to change orientation of the calibration pattern in frame from one snapshot to another when using mirror. It has to be in the field of view of the camera, and oriented such that the pattern’s image is reflected into the camera. These requirements may result in little variation in the pattern orientations as seen by the camera in the mirror and lead to solution degeneration.

    Despite the drawbacks, this approach has the potential to increase speed of the multiple camera calibration process a lot. We will probably try it in 2010.

    Ram Krishan Kumar, Adrian Ilie, Jan-Michael Frahm, & Marc Pollefeys (2008). Simple calibration of non-overlapping cameras with a mirror 2008 IEEE Conference on Computer Vision and Pattern Recognition
    Wednesday
    23Dec2009

    Some strange art detected

    Few days ago at the office building where our company’s office is located was an exhibition of some guy who draws labyrinths and then sticks pieces of old hardware (like old CPUs or floppy disks) to his paintings. I think his creative work is quite weird, but maybe you’ll like it, who knows :)

    Click to read more ...

    Monday
    07Dec2009

    Bayesian approach: don't hurry when reasoning

    Few days ago one friend of mine (let’s name him A) has mentioned some psychological test. Test consists of a single question, and, according to some statistics, 98 percent of serial killers answer that question right. When another friend of mine (B) gave correct answer to the question, A said that it’s highly probable that B is a serial killer. Was he right? Of course, he wasn’t. A’s problem is that he is not familiar with the Bayesian approach at all. And here is why.

    Let R be the event of giving the right answer to the question and M be the event that guy who answers is a maniac. From gathered statistics we know that P(R | M) = 0.98. Next, from Bayes theorem whe know that P(M | R) = P(R | M)P(M)/P(R). P(R) can be represented as P(R | M)P(M) + P(R | not M)P(not M). Next, assume that about 5 percent of usual people also gave the right answer to the question, so P(R | not M) = 0.05. Then, what’s the prior probability of M? I think we all agree that it’s quite small, about 1e-5 or even less. Now we are ready to calculate posterior probability of M:

    P(M | R) = (0.98 * 1e-5) / (0.98 * 1e-5 + 0.05 * (1 - 1e-5)) = 0.0000098 / (0.0000098 + 0.0499995) ~ 0.0002.

    Probability is very small, but why is that? That’s because my friend A was talking about the likelihood, but he didn’t take prior probabilities into account. And in this case prior probabilities are of great importance.

    Btw, what’s if serial killers always answer right and normal people always give wrong answers? Then P(R | not M) = 0 and P(M | R) = 1, so our model works in extreme cases too.

    Friday
    27Nov2009

    Bayesian approach: Lorenzo von Matterhorn

    Today we will consider a more sophisticated inference example, with both closed-form solution and Infer.NET program. We are going to find answers to some very important questions closely connected to the Lorenzo von Matterhorn trick from Barney Stinson’s Playbook.

    First of all, let’s figure out factors influencing the successful completion of the trick. It looks reasonable that girl should search for your imaginary name in Google. Without it, trick wouldn’t work. It means that girl should have some internet device and area should be covered with internet (WiFi, 3G etc). Also, if your ugliness outweigh your imaginary wealth and fame, girl may not come with you (nevertheless, it’s highly improbable). Such suggestions allows us to build the following probabilistic model with discrete variables: 

    In fact our model is a Bayesian network. It means that we should specify parent-conditional probability distribution for each variable it consists of. Let’s get started.

    P(I) = 0.8 because almost all the modern phones have at least GPRS support.
    P(C) = 0.9 because internet is available practically everywhere nowadays.
    P(G | I, C) = 0.95 (wouldn’t you google for this strange man?)
    P(G | not I, C) = 0.3 cause she can ask someone who has internet to google.
    P(G | I, not C) = P(G | not I, not C) = 0 cause nobody can google without internet.
    P(H) = 0.2 because not much guys are handsome.
    P(S | G, H) = 0.99
    P(S | not G, H) = 0.5 (it’s nice to be attractive, huh)
    P(S | G, not H) = 0.9 cause wealth is more important than handsome look.
    P(S | not G, not H) = 1e-5 (the worst case).

    Computing probability of success is quite straightforward, so we will concentrate on two little more sophisticated questions:

    1. What’s the probability of you being handsome if you’ve succeeded in the trick? 
    2. If you haven’t succeeded in the trick, what’s the probability she had looked for your imaginary name in Google?

    To compute those probabilities, we’ll use Bayes theorem together with the law of alternatives. This blog is not really a comfortable place for a lot of formulas, so the whole inference is available in the separate PDF. As we can see there, it’s not very hard to show that P(H | S) = 0.2449 and P(G | not S) = 0.2042. We can also write a small inference program using Infer.NET library which will give us exactly the same results.

    Calculated probabilities agree with our common sense very well. Value of P(H | S) shows that it’s not necessary to be some handsome guy to perform a trick. And value of P(G | not S) tells us that most of the failures happen when girl refuses to google for some reason.

    That was another example of probabilistic logic in action. Next time we’ll consider some continuos case like linear regression.

    Thursday
    19Nov2009

    Revolt of the C++ haters

    Probably, some of you have heard that GCC and Visual Studio C++ Compiler are written in C++. I guess that most of the C++ compilers are (if not, my plan will fail). So, if all the compiler binaries will somehow disappear someday, all the C++ sources (including sources of the compiler itself) will become useless. But how can we achieve that?

    The most simple (but also the most hard) way is to add some kind of time bomb to the sources of every C++ compiler. That time-bomb should be activated a few years later, when everyone will be using compromised version of the compiler. After the activation of the time-bomb all the compiler executables will be destroyed on the first compilation. Nice scenario, isn’t it? The problem is to add time-bomb in a way that nobody in the team of compiler developers will notice it. An interesting problem for a great hacker, I guess. Man, if you do it, you’ll become a legend!

    There is an alternative option. Not so powerful, but still possible. Every virus maker who cares should add search-for-compilers-and-delete-them-when-time-bomb-is-activated code to her (or his) brainchild. The problem here is that there are not much computer viruses for *nix systems, but most of the C++ compiler instances are concentrated there. Anyway, people, we should try!

    I propose the following date for the time-bomb activation: December 12, 2012. Develop your viruses and keep it quiet.

    Btw, it also can be a plot for a great action movie. Imagine some well-secured subterranean storage for the Last Compiler, a lot of men with big guns, and a bunch of heroes trying to fix the men’s biggest mistake: C++ language invention.

    Tuesday
    10Nov2009

    Bayesian approach: introduction

    As I’ve already said, I’m going to write a few posts about Bayesian approach to probability theory and, especially, to statistical machine learning. Someday I’m going to be an expert in this topic, but, currently, I’m far from it :) So, the following series of posts is my attempt to make things clear for myself and, probably, for someone just starting to learn this amazing topic. In fact, it means that there can be some mistakes. And if you find them, I’ll be glad you reveal that as soon as possible.

    Click to read more ...

    Sunday
    08Nov2009

    Treasury

    I’ve created the Treasury page where I will collect small descriptions (together with links) of different interesting things such as must-read books, helpful web-resources, great educational content and so on. There is not much stuff in the treasury now, but I’ll fill it in time, I swear =) Sensible suggestions are accepted.

    Currently I’m thinking about series of posts related to Bayesian approach to probability theory and, especially, to statistical learning. Infer.NET examples will be involved for sure.

    Friday
    30Oct2009

    True story at work

    Few days ago one really troublesome guy (who, btw, works as general producer in BS Graphics) came to the R&D department and said that our SVN repository (which is located on the one of the HDDs of the same file server where different production documents are stored) makes his work with documents impossible because of network lags, so we should consider moving our version control system to some another server. “Guy seems to be insane” we’ve thought, but then (just in case) we’ve checked out the file server to see if something strange is really happening. Well, it was. Server was performing really slow with some process consuming almost all the CPU time. After investigating what’s going on we’ve drawn some funny conclusions. Process consuming all the CPU time was an antivirus tool which was trying to remove a bunch of infected files from file server, but, unfortunately, those files were re-uploaded to the file server every second. Who do you think has his computer infected with a virus that was uploading those files to the file server? Yeah, it was our guy. The “general producer” guy.

    Moral there is quite simple: it’s better to keep your mouth shut if you aren’t really sure what’s happening and also don’t want to look stupid.

     

    It’s a bit strange, but I apparently have some kind of fetish for named math stuff with two or more last names in it’s caption. When I hear something like “Kullback-Leibler divergence”, “Malhotra-Kumar-Maheshwari algorithm” or “Kuhn-Tucker theorem” I feel really good. More than just good ;)