The depth information is then used to calculate a skeletal model of any humans in view. It is run through a Random Forest classifier, which classifies each pixel as either a body part (such as forearm or head) or background. The features used are the differences in depth between pairs of pixels at predetermined offsets from the pixel being classified. RGB camera data is not used for this because of its high variability. Pixels corresponding to each body part are then clustered and fit to a skeletal model on a frame-by-frame basis, with no tracking between frames.
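The depth-difference features described above can be illustrated with a minimal sketch. It assumes the depth image is a list of rows of depth values, and that a probe falling outside the image reads a large constant so it behaves like distant background. The names `probe`, `depth_difference`, `feature_vector` and `BACKGROUND_DEPTH` are hypothetical and chosen only for illustration, and the out-of-bounds handling is an assumption rather than something stated in the text.

```python
from typing import List, Tuple

DepthImage = List[List[float]]
Offset = Tuple[int, int]  # (row offset, column offset) in pixels

# Assumption: probes that fall outside the image read as very far away.
BACKGROUND_DEPTH = 1e6


def probe(depth: DepthImage, row: int, col: int) -> float:
    """Return the depth at (row, col), or BACKGROUND_DEPTH if out of bounds."""
    if 0 <= row < len(depth) and 0 <= col < len(depth[0]):
        return depth[row][col]
    return BACKGROUND_DEPTH


def depth_difference(depth: DepthImage, pixel: Tuple[int, int],
                     offset_u: Offset, offset_v: Offset) -> float:
    """Depth at pixel+offset_u minus depth at pixel+offset_v."""
    row, col = pixel
    depth_u = probe(depth, row + offset_u[0], col + offset_u[1])
    depth_v = probe(depth, row + offset_v[0], col + offset_v[1])
    return depth_u - depth_v


def feature_vector(depth: DepthImage, pixel: Tuple[int, int],
                   offset_pairs: List[Tuple[Offset, Offset]]) -> List[float]:
    """One depth difference per predetermined pair of offsets."""
    return [depth_difference(depth, pixel, u, v) for u, v in offset_pairs]
```

Each pixel's feature vector would then be passed to the classifier to obtain a body-part or background label. The classifier and the subsequent clustering step are omitted here.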
The RF classifier was trained by taking a motion-capture dataset of human movements and augmenting it with artificial distortions to increase the training set to a million examples.
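As a rough illustration of how augmentation enlarges a training set, the sketch below produces distorted copies of labelled depth images. The text does not specify which distortions were applied, so the ones used here (a random global depth scale plus per-pixel Gaussian noise) are assumptions picked only to show the mechanism. The function names and parameters are likewise hypothetical.

```python
import random
from typing import List, Tuple

DepthImage = List[List[float]]
LabelImage = List[List[int]]  # per-pixel body-part labels
Example = Tuple[DepthImage, LabelImage]


def distort(depth: DepthImage, rng: random.Random,
            max_scale: float = 0.1, noise_std: float = 0.01) -> DepthImage:
    """Return a copy with a random global depth scale and per-pixel noise.

    Both distortions are illustrative assumptions, not the documented method.
    """
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    return [[d * scale + rng.gauss(0.0, noise_std) for d in row]
            for row in depth]


def augment(examples: List[Example], copies_per_example: int,
            seed: int = 0) -> List[Example]:
    """Keep every original example and add distorted copies of it."""
    rng = random.Random(seed)
    augmented: List[Example] = []
    for depth, labels in examples:
        augmented.append((depth, labels))
        for _ in range(copies_per_example):
            # These distortions do not move pixels, so labels are reused as-is.
            augmented.append((distort(depth, rng), labels))
    return augmented
```

With `copies_per_example` set to `k`, a base set of `n` examples grows to `n * (k + 1)`, which is how a modest motion-capture dataset can be expanded to a much larger training set.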