This project provides a generic platform for performing monocular and stereo tracking of 3D objects in video. It provides interfaces to do video/image loading, image detection/classification and model fitting within a single framework.
Currently built in are pixel-wise detection methods and a level-set-based pose estimation method. The system also supports basic temporal tracking. In all cases, the project can be extended by subclassing the base classes with your own pose estimation/detection algorithms. Models are represented as .obj files and can be rigid or articulated via simple tree structures.
In all cases these are not strict minimum versions but merely the oldest version on which they have been tested to work.
Usage 'as is' for stereo tracking
Set the parameter 'root-dir' in the example app.cfg file to the full path of the examples directory. The reason why this is not automatically set from the path when the configuration file is opened is to allow you to keep the config files in a different directory to the data.
Set the 'output-dir' to wherever you want to save your tracking data.
Copy or symlink the video files into the examples directory and set the stereo inputs as 'left-input-video=FILENAME.avi' and 'right-input-video=FILENAME.avi'.
Calibrate your stereo camera and add the parameters to the camera configuration file 'camera/stereo_config.xml' in the examples directory. The software assumes the parameters are the same as those generated by the Bouguet Matlab software (i.e. the stereo camera transform is right w.r.t. left).
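For orientation, a stereo calibration file in OpenCV FileStorage XML format typically looks like the fragment below. The node names and numbers here are purely illustrative placeholders, not the schema the software actually parses; use the sample 'camera/stereo_config.xml' shipped in the examples directory as the authoritative template. The key convention to preserve is the one stated above: the extrinsic transform maps the right camera with respect to the left, as produced by the Bouguet Matlab toolbox.

```xml
<?xml version="1.0"?>
<opencv_storage>
  <!-- Illustrative node names only; match the sample stereo_config.xml. -->
  <!-- Intrinsics: fx 0 cx / 0 fy cy / 0 0 1 -->
  <Left_Camera_Matrix type_id="opencv-matrix">
    <rows>3</rows><cols>3</cols><dt>d</dt>
    <data>1000. 0. 320.  0. 1000. 240.  0. 0. 1.</data>
  </Left_Camera_Matrix>
  <!-- Extrinsics: rotation and translation of right camera w.r.t. left -->
  <Extrinsic_Camera_Rotation type_id="opencv-matrix">
    <rows>3</rows><cols>3</cols><dt>d</dt>
    <data>1. 0. 0.  0. 1. 0.  0. 0. 1.</data>
  </Extrinsic_Camera_Rotation>
  <Extrinsic_Camera_Translation type_id="opencv-matrix">
    <rows>3</rows><cols>1</cols><dt>d</dt>
    <data>-5. 0. 0.</data>
  </Extrinsic_Camera_Translation>
</opencv_storage>
```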
Set the 'localizer-type' in the configuration file. 'ArticulatedCompLS_GradientDescent_FrameToFrameLK' is the method from our TMI '18 paper, 'CompLS' is the component level set and optical flow method in our MICCAI '15 paper and 'pwp3d' is the single level set and SIFT point method closely based on our IPCAI '14 paper (with improved results).
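Putting the settings from the previous steps together, the relevant lines of the application configuration might look like the following sketch. The key names are the ones quoted above, but the paths and filenames are placeholders; check the shipped example app.cfg for the exact syntax.

```ini
; Illustrative app.cfg fragment -- values are placeholders
root-dir=/full/path/to/examples
output-dir=/full/path/to/output
left-input-video=left.avi
right-input-video=right.avi
localizer-type=CompLS
```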
Train an OpenCV classifier to recognise features in the Hue, Saturation, Opponent 1 and Opponent 2 colour channels. The easiest way to do this is to use a small Python script I wrote called trainer as part of a suite of computer vision utilities I use. Add the saved xml classifier to the classifier directory. There is a sample one there now, trained on a basic image set, but it's unlikely to work well on general images. The sample 'config_3class.xml' should be used for classifier-type=MCRF with 'num-labels=3' in the 'app.cfg' file; this setup should be used with the 'CompLS' localizer-type. The sample 'config_2class.xml' should be used for classifier-type=RF with 'num-labels=2' in the 'app.cfg' file; this setup should be used with the 'pwp3d' localizer-type.
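For reference, the per-pixel features named above can be computed as in the sketch below. This is illustrative only: it uses the standard opponent colour space formulas (O1 = (R−G)/√2, O2 = (R+G−2B)/√6) and Python's stdlib colorsys for hue/saturation, not the project's actual C++/OpenCV feature extraction, which may scale channels differently (e.g. OpenCV hue is in [0, 180) for 8-bit images).

```python
import colorsys
import math

def pixel_features(r, g, b):
    """Hue, Saturation, Opponent 1, Opponent 2 for one RGB pixel in [0, 1].

    Illustrative sketch; the tracker's own feature extraction may use
    different channel scalings.
    """
    h, s, _ = colorsys.rgb_to_hsv(r, g, b)
    o1 = (r - g) / math.sqrt(2.0)            # red-green opponent channel
    o2 = (r + g - 2.0 * b) / math.sqrt(6.0)  # yellow-blue opponent channel
    return [h, s, o1, o2]
```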
Add an OBJ file for the object you want to track. An example of the .json configuration to do this is in the model directory in the examples. Set the path for the json file to the directory where it is stored. Again, this isn't computed automatically to allow you to have a different working directory to where you store your model files.
Find the starting pose of the object you want to track. This part is hard and ideally we'd have a more reliable way of automating it. Usually the best bet is to estimate as much as you can manually, set the number of gradient descent steps to be very large, and let the application converge as best it can. Then reset the starting pose from the output pose file. The starting pose parameters are in the order: R11 R12 R13 Tx R21 R22 R23 Ty R31 R32 R33 Tz A1 A2 A3, where RXY is a component of the rotation matrix and T is the translation vector to the origin of the model coordinates. A1, A2, A3 are articulated components which can simply be set to zero if you have a static model.
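As a concrete example, a starting pose with identity rotation, the model origin placed (say) 80 units along the camera's z-axis, and zero articulation would be written out as follows; the translation value is purely illustrative.

```python
# Build a starting-pose line in the order documented above:
# R11 R12 R13 Tx  R21 R22 R23 Ty  R31 R32 R33 Tz  A1 A2 A3
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]           # identity rotation
t = [0.0, 0.0, 80.0]            # illustrative translation to the model origin
articulation = [0.0, 0.0, 0.0]  # zero for a static model

pose = []
for row, t_i in zip(R, t):
    pose.extend(row)  # one row of the rotation matrix...
    pose.append(t_i)  # ...followed by the matching translation component
pose.extend(articulation)

print(" ".join(f"{v:g}" for v in pose))
# -> 1 0 0 0 0 1 0 0 0 0 1 80 0 0 0
```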
Launch the application and drag and drop the configuration file.
Click on the 'Start Tracking' button and watch some nice tracking (hopefully)!
Note: Although the code backend completely supports monocular tracking, the Cinder GUI is only set up for stereo. If you want to do monocular tracking it's possible to 'hack' this using a camera calibration file with identity rotation and null translation and setting the 'left-input-video' and 'right-input-video' files to point to the same file in the application configuration.
Usage as a framework
Right now the core algorithm works as follows: frames are loaded using any class that inherits from the Handler interface and passed (via the single TTrack manager) to the detector. There, the features to be used for pose estimation are extracted; currently this is a pixel-wise classification for region-based pose estimation, although any other type of feature is possible. The TTrack manager then passes this detected frame to the tracker (which inherits from the Tracker interface). This defines a custom initialization function which allows the first pose to be roughly estimated before a more precise pose localizer (which inherits from Localizer) focusses on refinement.
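The flow above can be sketched in miniature as follows. This is Python pseudocode of the data flow only; the real framework is C++, and while the class names mirror the description (Handler, TTrack, Tracker, Localizer), the method names and signatures here are assumptions, not the project's API.

```python
class Handler:
    """Frame source: subclass to load from video files, image directories, etc."""
    def get_frame(self):
        raise NotImplementedError

class Detector:
    """Extracts features for pose estimation, e.g. a pixel-wise classification."""
    def classify(self, frame):
        raise NotImplementedError

class Localizer:
    """Precise pose refinement, e.g. level-set-based gradient descent."""
    def refine(self, detected_frame, initial_pose):
        raise NotImplementedError

class Tracker:
    """Rough per-frame initialization, delegating refinement to a Localizer."""
    def __init__(self, localizer):
        self.localizer = localizer
        self.pose = None

    def track(self, detected_frame):
        if self.pose is None:
            self.pose = 0.0  # crude first guess; real code estimates a pose
        self.pose = self.localizer.refine(detected_frame, self.pose)
        return self.pose

class TTrack:
    """Single manager pushing each frame through detector, then tracker."""
    def __init__(self, handler, detector, tracker):
        self.handler, self.detector, self.tracker = handler, detector, tracker

    def step(self):
        frame = self.handler.get_frame()
        detected = self.detector.classify(frame)
        return self.tracker.track(detected)
```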
Trackable objects are represented by subclassing the Model class and are loaded via a .json file which defines a possibly articulated structure using a standard tree model. Each node of the tree (from root to children to children-of-children) represents a single rigid-body transform from its parent and possibly also some geometry (in mesh format) and texture (mtl or GL textures). Geometry is deliberately not mandatory, as some robotic manipulators define their structure this way. Right now the example file is for a robotic instrument, so the transforms are defined with DH parameters (using the subclass DenavitHartenbergArticulatedModel), although it should be easy to subclass Model to handle SE3 or another parameterization.
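The shape of such a tree might look like the fragment below. The field names and values here are illustrative only, not the schema the loader actually expects; consult the example .json file in the model directory for the real format.

```json
{
  "name": "shaft",
  "mesh": "shaft.obj",
  "dh-params": [0.0, 0.0, 0.0, 0.0],
  "children": [
    {
      "name": "wrist",
      "dh-params": [0.0, 9.1, 1.5707, 0.0],
      "children": []
    }
  ]
}
```

Note that the "wrist" node carries no mesh: as described above, a node may contribute only a transform, which is how some robotic manipulators define their kinematic structure.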
- Max Allan
- Ping-Lin Chang
We would like to acknowledge Saurabh Agarwal of IIT Delhi and AIIMS Hospital, Delhi for supplying the example model OBJ file.