
    Entries in vision (7)

    Tuesday
    Feb092010

    Camera DEcalibration and REcalibration

    In one of the recent posts I covered the topic of camera calibration a little. But that was not really interesting: tons of very good books have been written about it. Today we’ll look instead into keeping your cameras calibrated after the initial calibration.

    Suppose you have a 3D tracking system that uses about 20 calibrated cameras installed in some crowded place like a bank office or a shopping mall. In such places calibrated cameras tend to become uncalibrated from time to time, mostly due to vibrations caused by things like construction work or cleaning. What should we do about that? Obviously, we should recalibrate the cameras whose position or orientation has changed (or move them back to the original position and orientation). But how can we find out which cameras need to be calibrated again? Should we hire someone to check the image from every camera every day? Fortunately, we don’t have to.

    Camera decalibration can be detected automatically

    What we have to do is save a snapshot from the camera at the moment its extrinsic parameters are estimated. Then the following algorithm can be used to check for decalibration:

    1. Extract features (something like SURF) from the saved snapshot and from the current camera image.
    2. Match the extracted features.
    3. If at least 10-15% of the matched points have the same pixel coordinates in both images, the camera is OK. Otherwise, it seems to be decalibrated.

    As you can see, this algorithm is fully automatic and can be run, for example, every hour for every camera. It is also quite robust to the presence of temporary objects like people in either image, because it requires only a small share of the matches to have the same pixel coordinates. It seems to work badly (or not at all) only when very few good features are available (for example, when the camera looks at a white wall). But in those cases it’s very hard even for a human to detect camera decalibration. One can also make this algorithm robust to illumination changes by using lighting-invariant point descriptors or by preprocessing the images.
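
    The decision in step 3 can be sketched as follows. This is only a minimal illustration: it assumes the features have already been extracted and matched elsewhere (for example with OpenCV), and the function names, the pixel tolerance `tol_px` and the `min_matches` guard are assumptions of this sketch, not part of the original description.

```python
import math

def static_match_fraction(matches, tol_px=1.5):
    """Share of matches whose pixel position is (almost) unchanged.

    matches: list of ((x_old, y_old), (x_new, y_new)) pixel pairs.
    tol_px:  assumed tolerance for "the same pixel coordinates".
    """
    if not matches:
        return 0.0
    static = sum(
        1 for (x0, y0), (x1, y1) in matches
        if math.hypot(x1 - x0, y1 - y0) <= tol_px
    )
    return static / len(matches)

def looks_decalibrated(matches, min_static_fraction=0.10, tol_px=1.5, min_matches=20):
    """True/False, or None when there are too few matches to decide
    (the "white wall" case)."""
    if len(matches) < min_matches:
        return None
    return static_match_fraction(matches, tol_px) < min_static_fraction

# Synthetic example: 10 of 40 matches stay in place, 30 belong to moving people.
steady = [((i, i), (i + 0.3, i)) for i in range(10)] + \
         [((i, 0), (i + 40, 5)) for i in range(30)]
# Every match has shifted by the same offset: the camera itself has moved.
shifted = [((i, i), (i + 12, i + 7)) for i in range(40)]

print(looks_decalibrated(steady))   # False
print(looks_decalibrated(shifted))  # True
```

    Returning None for a feature-poor scene keeps the "cannot tell" case separate from the "camera is OK" case.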

    OK, we have determined that a camera has become decalibrated. We also have the old camera position and orientation, and point matches between the current and old camera images. It seems that we can easily determine the new camera position automatically. Or can’t we?

    Camera can not be recalibrated automatically

    And here is why. Of course, there is a 3D reconstruction algorithm that takes a list of matched pixel coordinates together with the camera intrinsic matrix and returns the transform from the view space of the old camera into the view space of the new one (together with 3D coordinates of the points). Unfortunately, it can calculate the translation and the 3D coordinates only up to scale. Neither the coordinates of the matched pixels nor the intrinsic matrix contain enough information about the scale of your scene. Are you measuring everything in meters? Or in inches? You need more information (like the real 3D coordinates of one of the matched points) to rescale the algorithm results. But you don’t have it. So the reconstruction algorithm gives you pretty much nothing (well, it gives you the correct camera orientation, but you can’t use it without a correctly scaled translation).
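
    The scale ambiguity is easy to check numerically. The sketch below uses a plain pinhole model with made-up intrinsics, camera motion and scene points (all values are illustrative assumptions): multiplying both the scene and the camera translation by the same factor leaves every pixel coordinate unchanged, so the images alone cannot tell the two situations apart.

```python
import math

def project(K, R, t, X):
    """Pinhole projection of 3D point X for a camera with x_cam = R * X + t."""
    xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    return (K[0][0] * xc[0] / xc[2] + K[0][2],
            K[1][1] * xc[1] / xc[2] + K[1][2])

# Illustrative values only.
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
a = math.radians(5.0)  # the new camera is rotated 5 degrees around the Y axis
R = [[math.cos(a), 0.0, math.sin(a)],
     [0.0, 1.0, 0.0],
     [-math.sin(a), 0.0, math.cos(a)]]
t = [0.20, 0.00, 0.05]
points = [[0.5, -0.2, 4.0], [-1.0, 0.3, 6.0], [0.1, 0.8, 3.0]]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO = [0.0, 0.0, 0.0]

def max_pixel_difference(scale):
    """Largest pixel change in either view when scene and translation are scaled."""
    worst = 0.0
    t_scaled = [scale * c for c in t]
    for X in points:
        X_scaled = [scale * c for c in X]
        views = [
            (project(K, IDENTITY, ZERO, X), project(K, IDENTITY, ZERO, X_scaled)),  # old view
            (project(K, R, t, X), project(K, R, t_scaled, X_scaled)),               # new view
        ]
        for (u0, v0), (u1, v1) in views:
            worst = max(worst, abs(u1 - u0), abs(v1 - v0))
    return worst

# Meters versus inches: the pixels do not change (up to rounding error).
print(max_pixel_difference(39.37) < 1e-9)
```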

    It’s interesting that I’ve failed to find any papers covering the topic of decalibration and recalibration although it’s quite important for some practical computer vision applications. Should I write a short paper about it by myself?

    Wednesday
    Jan202010

    Localization and mapping

    Hey, have you ever seen Georg Klein’s SLAM engine in action? I can’t even call it awesome. It’s better. And it’s not just the algorithm itself. Just look at the possibilities it opens.

    Friday
    Dec252009

    An interesting approach to camera calibration

    Camera calibration is the process of determining the intrinsic (like principal point or focal length) and extrinsic (position and orientation in space) parameters of a camera, which is often described by the pinhole camera model. In computer vision we usually perform calibration by analyzing images taken from the camera. The most widely used approach to camera calibration is based on this paper by Zhengyou Zhang. It involves a chessboard (or some other planar calibration pattern) and consists of the following steps:

    1. Find a chessboard (bigger is better). Note that its width and height (measured in chessboard squares) should be distinct, with one of them even and the other odd (otherwise the pattern looks the same after a 180 degree rotation and you will be unable to determine the chessboard orientation from its picture).
    2. Find a “calibration dude”.
    3. The calibration dude takes the chessboard and waves it in front of the camera attached to a computer with calibration software installed. The calibration software takes about 20 images with distinct chessboard orientations (camera orientation remains the same, of course), finds the inner corners of the chessboard in every image and then uses them together with the real-world size of the chessboard to determine the intrinsic camera parameters.
    4. The calibration dude puts the chessboard on the floor so that the camera can still see it. The calibration software then takes one more image from the camera, finds the chessboard corners in it (again) and calculates the camera position, assuming that some predefined chessboard corner is located at the coordinate system origin and the chessboard sides are aligned with the coordinate system axes. Of course, any other chessboard orientation can be specified in the software, but this one is the simplest.
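
    The convention from step 4 (one corner at the origin, board sides along the axes) amounts to a simple list of 3D corner coordinates on the plane Z = 0. A minimal sketch, where the function name, the row-major corner order and the example board size are assumptions chosen for illustration:

```python
def chessboard_object_points(inner_cols, inner_rows, square_size):
    """World coordinates of the inner chessboard corners.

    The first corner sits at the origin, columns run along +X,
    rows run along +Y, and the whole board lies in the plane Z = 0.
    square_size is the side of one square in world units (e.g. meters).
    """
    return [
        (col * square_size, row * square_size, 0.0)
        for row in range(inner_rows)
        for col in range(inner_cols)
    ]

# Example: a board with 8 x 5 inner corners and 3 cm squares.
corners = chessboard_object_points(8, 5, 0.03)
print(len(corners))   # 40
print(corners[0])     # (0.0, 0.0, 0.0)
```

    Calibration software pairs these world points with the detected image corners (in the same order) to solve for the camera position and orientation.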

    What problems do we have here? First of all, the camera may not see the floor at all, so we can’t just put the chessboard on it during step 4. Instead we need to set it up somewhere else, not at ground level. We should then carefully measure its position and orientation and pass them as input to the calibration tool.

    What if we have more than one camera that can’t see the floor, and the fields of view of those cameras do not overlap? In this case we should repeat the process described above for each camera, carefully measuring the chessboard position in the world coordinate system every time. In fact, it’s a pain in the ass. Calibrating multiple cameras that way can be really slow and error-prone.

    A much more interesting approach to calibrating multiple non-overlapping cameras was proposed in this paper. Its key idea is to fix the chessboard position (put it at the origin) and move a mirror instead. The cameras will see the chessboard reflection in that mirror and use the reflected image for calibration. Of course, some questions arise.

    1. Is it legal to determine intrinsic camera parameters using a reflected chessboard image? The answer is simple: yes. The authors prove that common calibration techniques give the same result (except for coordinate system handedness) when applied to mirrored images.
    2. Don’t we need to know the position and orientation of the mirror when calibrating extrinsic parameters? No, we don’t. It turns out that every mirrored chessboard image imposes a constraint on the position and orientation of the real camera. And if we have five (or more) such images, we can reconstruct the position and orientation without any knowledge of the mirror position.
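
    The handedness remark in the first answer follows from basic geometry: a reflection in a plane is a rigid transform whose linear part has determinant -1. The sketch below only illustrates that fact with an arbitrary mirror plane; it does not reproduce the proof or the algorithm from the paper, and the helper names and plane parameters are made up.

```python
def householder(n):
    """3x3 matrix I - 2 * n * n^T of a reflection across a plane with unit normal n."""
    return [[(1.0 if i == j else 0.0) - 2.0 * n[i] * n[j] for j in range(3)]
            for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def reflect(point, n, d):
    """Mirror a point across the plane n . x = d (n must be a unit vector)."""
    signed_distance = sum(n[i] * point[i] for i in range(3)) - d
    return [point[i] - 2.0 * signed_distance * n[i] for i in range(3)]

# Arbitrary mirror plane: unit normal (1/3, 2/3, 2/3), 1.5 units from the origin.
n = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]
d = 1.5

# A determinant of -1 means the reflection swaps right- and left-handed frames.
print(round(det3(householder(n)), 9))  # -1.0

# Reflecting twice returns the original point.
p = [0.4, -1.0, 2.5]
print([round(c, 9) for c in reflect(reflect(p, n, d), n, d)])  # [0.4, -1.0, 2.5]
```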

    This approach can save a lot of time and help reduce the part of the calibration error that comes from measuring the chessboard position and orientation incorrectly. But it has its own drawbacks, of course. First of all, a mirror is rather heavy. It’s not easy to manipulate if your calibration dude is not a beefcake. Next, it’s hard to change the orientation of the calibration pattern in the frame from one snapshot to another when using a mirror. The mirror has to be in the field of view of the camera, and oriented so that the pattern’s image is reflected into the camera. These requirements may result in little variation in the pattern orientations as seen by the camera in the mirror and lead to a degenerate solution.

    Despite the drawbacks, this approach has the potential to speed up the multiple camera calibration process a lot. We will probably try it in 2010.

    Ram Krishan Kumar, Adrian Ilie, Jan-Michael Frahm, and Marc Pollefeys (2008). Simple calibration of non-overlapping cameras with a mirror. 2008 IEEE Conference on Computer Vision and Pattern Recognition.

    Tuesday
    Aug182009

    Face tracker

    Yesterday I integrated a CamShift tracker into our face tracking system, which is based on the Viola-Jones face detector. Now, when the face detector cannot provide any evidence to the tracker for some object, a CamShift tracker is initialized. It is then used to track that object in the video stream for a few seconds, in the hope that face evidence will be provided sooner or later. This change makes the tracking system more robust and stable. As you can see in the video below, it’s quite hard to make the tracker fail while the object is at least partly visible.

    The idea of using CamShift together with a face detector is not mine. I found it in this paper from CLEAR 2007.

    Legend: white rectangle is drawn when Viola-Jones based tracker tracks object, green is drawn when CamShift tracker does.


    Update: It’s not very hard to repeat my success. The implementation consists of 3 major parts: the Viola-Jones face detector (I took the implementation from OpenCV together with the trained classifier cascades for frontal and profile faces), the CamShift algorithm (I used the one from OpenCV again) and the tracking algorithm itself (my own C++ code, mostly based on the Hungarian algorithm with some heuristics for false positive elimination, nothing you can’t find in 2D tracking papers available online).
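
    The hand-over between the two trackers described in this post can be sketched as a small state machine per tracked object. This is a hypothetical illustration in Python, not the original C++ code: the class name, the returned labels and the 3 second limit (standing in for "a few seconds") are assumptions.

```python
class FaceTrackSource:
    """Decides which tracker drives one tracked object at each frame."""

    def __init__(self, max_fallback_seconds=3.0):
        self.max_fallback_seconds = max_fallback_seconds
        self.fallback_started_at = None  # time when detector evidence was lost

    def update(self, timestamp, has_face_evidence):
        """Return "detector", "camshift" or "lost" for the current frame."""
        if has_face_evidence:
            # The face detector confirmed the object: no fallback needed.
            self.fallback_started_at = None
            return "detector"
        if self.fallback_started_at is None:
            # Evidence has just disappeared: CamShift would be initialized here.
            self.fallback_started_at = timestamp
        if timestamp - self.fallback_started_at <= self.max_fallback_seconds:
            return "camshift"
        # No face evidence arrived in time: give up on this object.
        return "lost"

source = FaceTrackSource(max_fallback_seconds=2.0)
frames = [(0.0, True), (0.5, False), (2.0, False), (3.0, False), (3.1, True)]
print([source.update(t, seen) for t, seen in frames])
# ['detector', 'camshift', 'camshift', 'lost', 'detector']
```

    In terms of the legend above, "detector" corresponds to the white rectangle and "camshift" to the green one.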

    Monday
    Jul272009

    I really suck at age estimation

    We have recently performed a simple experiment at work. The purpose of the experiment was to compare the performance of our SVM-based age classifier with the human ability to tell apart faces of people of different ages. As a result, I was the only person beaten by the machine.

    All the faces were divided into 5 classes (age<20, 20<=age<30, 30<=age<40, 40<=age<50, age>=50). Each human participating in the experiment labeled about 100 face pictures manually. Classifier performance was estimated using 10-fold cross-validation on a bigger dataset.

    Who                     Error
    Machine                 0.444
    Me                      0.459
    Sveta (R&D engineer)    0.343
    Irene (assistant)       0.410


    An error value of P means that P x N of N samples were labeled wrong (the label was selected by the decision rule from the set of 5 available labels; the decision rule is either a human brain or the SVM classifier).
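
    For reference, here is a small sketch of the two definitions used in this experiment: the 5 age classes and the error value, taken here as the share of samples that received a wrong label (the reading under which a lower number in the table is better). The function names and the sample ages are made up for illustration.

```python
def age_class(age):
    """Map an age in years to one of the 5 classes:
    0: age<20, 1: 20<=age<30, 2: 30<=age<40, 3: 40<=age<50, 4: age>=50."""
    upper_bounds = [20, 30, 40, 50]
    for index, bound in enumerate(upper_bounds):
        if age < bound:
            return index
    return len(upper_bounds)

def error_rate(true_labels, predicted_labels):
    """Share of samples whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)

true_ages = [17, 25, 38, 44, 61, 29]
guessed_ages = [22, 27, 35, 51, 70, 33]
truth = [age_class(a) for a in true_ages]
guess = [age_class(a) for a in guessed_ages]
print(truth)                     # [0, 1, 2, 3, 4, 1]
print(guess)                     # [1, 1, 2, 4, 4, 2]
print(error_rate(truth, guess))  # 0.5
```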

    I think that rise of the machines will happen very soon (hope some robot from the future will save my ass).

    Monday
    Jun222009

    Good computer vision book is available for free

    Today I found out that Richard Szeliski, a famous researcher from MSR and one of the major developers of Photosynth, is currently writing a computer vision book. It’s based on his CV course at Stanford and the University of Washington. And here is one very good thing: he has made drafts of the book available on his web page. The book already contains tons of useful information and it’s constantly growing and improving. Read it, mail Richard if you find any errors and keep looking for new drafts!

    Thursday
    Jun112009

    The story of skin detection

    While working on my current project at the R&D department of the BS Graphics company I’ve come to understand the importance of skin detection very well. The people tracking system we’re developing uses it for two different reasons:

    1. It allows us to reduce the number of false positives that the face detector produces. In our tracking system, faces with less than half of their pixels classified as skin are simply rejected. It was the first application of skin detection in our tracking system and it still serves us well.
    2. Face detection is slow, especially when the video frame is big. So we did the following: when a full-scale face search runs over a frame, both skin and foreground masks are calculated first. Then only the regions built from the intersection of those masks are used for face detection. In most scenes this approach speeds up face detection 2 or 3 times.
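
    Both uses boil down to a few lines once the masks are available. A minimal sketch, assuming masks are stored as lists of rows of 0/1 values and a face is an (x, y, width, height) box lying inside the frame; the function names are made up, and a real system would rather use array operations from an image library.

```python
def skin_fraction(skin_mask, box):
    """Share of pixels inside box = (x, y, w, h) that are marked as skin."""
    x, y, w, h = box
    skin = sum(1 for row in skin_mask[y:y + h] for value in row[x:x + w] if value)
    return skin / (w * h)

def keep_face(skin_mask, box, min_fraction=0.5):
    """Use 1: reject detections with less than half of their pixels being skin."""
    return skin_fraction(skin_mask, box) >= min_fraction

def search_mask(skin_mask, foreground_mask):
    """Use 2: face search is limited to pixels that are both skin and foreground."""
    return [
        [1 if s and f else 0 for s, f in zip(skin_row, fg_row)]
        for skin_row, fg_row in zip(skin_mask, foreground_mask)
    ]

skin = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
foreground = [[0, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1],
              [0, 0, 0, 0]]

print(keep_face(skin, (1, 1, 2, 2)))  # True: 3 of 4 pixels are skin
print(keep_face(skin, (0, 0, 2, 2)))  # False: 1 of 4 pixels is skin
print(search_mask(skin, foreground))
```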

    But skin detection has its drawbacks and limitations. The most important fact about it is the following: skin color in the frame depends a lot on camera quality and lighting conditions. So a skin classifier trained under some specific lighting conditions can perform very badly when those lighting conditions change.

    During the last few months I’ve tried 3 different approaches to skin detection, and now I’m more or less satisfied. Here’s the list of what I’ve tried. Maybe it will help someone.

    1. Skin detector proposed by Jones and Rehg. It uses a Bayes classifier to decide whether a pixel with a given color is skin or not. The conditional distributions are learned offline from a set of images with labeled skin pixels. Gaussian mixture models are used to represent the conditional distribution of skin color. The mixture component parameters trained by Jones and Rehg are given at the end of the paper, so it is not necessary to train the classifier yourself. The only important thing is to convert the GMM back to a 3D histogram for better performance. Unfortunately, the classifier is very sensitive to camera exposure and lighting changes. It is good for photos from the web, but very bad for real-world scenarios.
    2. Adaptive version of the Jones-Rehg skin detector that uses information from the face detector. The conditional distribution is learned from pixels lying in the face regions. Unfortunately, other skin-colored parts (such as arms, neck etc.) can get into the negative samples and corrupt the color distribution. That fact gave me the following idea: adaptation should not use any negative samples. Instead, it should fit some descriptive model to the skin color data we have.
    3. Wimmer-Radig descriptive skin model. A simple decision rule is fitted to the skin color distribution (in normalized RGB). I re-learn a temporary skin model in some “key” frames and interpolate the final skin model between them to make the model evolution continuous. That approach is the best one so far, but some heavy tests will probably reveal its problems too.
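
    To make the first approach in the list more concrete, here is a rough sketch of a histogram-based Bayes skin classifier: a pixel is called skin when the ratio P(color | skin) / P(color | not skin) reaches a threshold. It is trained on a handful of made-up pixels rather than on the mixture parameters from the paper, and the bin count, the add-one smoothing and the threshold value are assumptions of this sketch.

```python
from collections import Counter

BINS = 32  # bins per color channel (assumed; coarser than the full 256)

def bin_of(rgb):
    """Quantize an (r, g, b) color with 0..255 channels into a histogram bin."""
    step = 256 // BINS
    return tuple(channel // step for channel in rgb)

class HistogramSkinClassifier:
    def __init__(self, threshold=2.0):
        self.threshold = threshold
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    def train(self, pixels, is_skin):
        """Add labeled pixels to the skin or non-skin color histogram."""
        for rgb in pixels:
            self.counts[is_skin][bin_of(rgb)] += 1
            self.totals[is_skin] += 1

    def likelihood_ratio(self, rgb):
        """P(color | skin) / P(color | not skin), with add-one smoothing."""
        b = bin_of(rgb)
        n_bins = BINS ** 3
        p_skin = (self.counts[True][b] + 1) / (self.totals[True] + n_bins)
        p_other = (self.counts[False][b] + 1) / (self.totals[False] + n_bins)
        return p_skin / p_other

    def is_skin(self, rgb):
        return self.likelihood_ratio(rgb) >= self.threshold

classifier = HistogramSkinClassifier()
classifier.train([(200, 150, 130)] * 50, is_skin=True)   # toy skin samples
classifier.train([(30, 60, 200)] * 50, is_skin=False)    # toy background samples

print(classifier.is_skin((205, 148, 133)))  # True: falls into the skin bin
print(classifier.is_skin((28, 62, 202)))    # False: falls into the background bin
```

    Because the histograms are fixed after training, such a classifier inherits the sensitivity to exposure and lighting described above, which is what the adaptive approaches try to address.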

    The following video shows the 3rd approach at work. Semi-transparent red mask indicates pixels classified as skin.