When it comes to the Kinect, only one image is captured, by the IR depth sensor; so how does stereo triangulation work? In fact, there are two images, not one. The second image is invisible: it is the pattern of the IR emitter, which is already defined for the IR laser. The IR laser is not modulated; all it does is project a pseudorandom pattern of speckles onto the environment around the Kinect. These two images are not equivalent, as there is some distance between the IR emitter and the IR depth sensor. The two images can therefore be treated as corresponding views from two different cameras, which allows you to use stereo triangulation to calculate depth, as shown in the following figure. It demonstrates how x1 and x2 are measured using stereo triangulation for a point X in the scene:
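The relationship between the two measured projections x1 and x2 and the depth of point X can be sketched with the standard stereo formula Z = f · b / (x1 − x2). The focal length and baseline values below are purely illustrative assumptions, not actual Kinect calibration data:

```python
# A minimal sketch of depth from stereo triangulation, assuming a
# pinhole camera model. f (focal length in pixels) and b (baseline
# in meters) are illustrative values, not real Kinect calibration.

def depth_from_disparity(x1, x2, f=580.0, b=0.075):
    """Depth Z of a point X whose projections fall at pixel
    coordinate x1 in one view and x2 in the other.
    Z = f * b / (x1 - x2), where (x1 - x2) is the disparity."""
    disparity = x1 - x2
    if disparity == 0:
        raise ValueError("zero disparity: point at infinity")
    return f * b / disparity

# Example: with these assumed parameters, a disparity of
# 29 pixels corresponds to a depth of 1.5 m.
print(round(depth_from_disparity(329.0, 300.0), 2))  # → 1.5
```

The larger the disparity between x1 and x2, the closer the point is to the sensor, which is why the offset between the emitter and the depth camera is essential.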