RGB-D cameras [20], [13] are an emerging class of sensing technologies that provide high-quality, synchronized depth and color data. Using active sensing techniques, these cameras achieve robust depth estimation in real time. The Microsoft Kinect [13], a depth camera that has entered the consumer market, has been a huge success with far-reaching implications for real-world visual perception.

One key application of depth cameras is object recognition, a fundamental problem in computer vision and robotics. Traditionally, the success of visual recognition has been limited to specific domains, such as handwritten digits and faces. The most recent trend in computer vision is large-scale recognition (hundreds of thousands of categories, as in ImageNet [7]). For real-world object recognition, recent studies (e.g., [14]) have shown that it is feasible to robustly recognize hundreds of household objects. The Kinect itself adopts a recognition-based approach to estimate human poses from depth images [23].

The core of a robust recognition system is to extract meaningful representations (features) from high-dimensional observations such as images, videos, 3D point clouds, and audio. Given the wide availability of depth cameras, it remains an open question how best to extract features from a depth map. There has been much work in robotics on 3D features computed from point clouds: Spin Images [12] are a classical example of local 3D features analogous to SIFT [19]; the Fast Point Feature Histogram [21] is another example of an efficient 3D feature. These 3D features, developed on point