In this paper, we introduce a novel computer-vision-based attack
that automatically discloses inputs on a touch-enabled device, even
though the attacker cannot see any text or pop-ups in a video of the
victim tapping on the touch screen. We carefully analyze the shadow
formation around the fingertip and apply optical flow, the deformable
part-based model (DPM), k-means clustering, and other computer
vision techniques to automatically locate the touched points. Planar
homography is then applied to map the estimated touched points to
a reference image of the software keyboard. Recognizing passwords
is extremely challenging because no language model can be applied
to correct the estimated touched keys. In our threat model, a webcam,
smartphone, or Google Glass is used to mount a stealthy attack in
scenarios such as conferences and similar gatherings.
We address both tapping with a single finger and typing with multiple
fingers of two hands. Extensive experiments demonstrate the impact
of this attack: the per-character (or per-digit) success rate is over
97%, while the success rate of recognizing 4-character passcodes
exceeds 90%. Our work is the first to automatically and blindly
recognize random passwords (or passcodes) typed on the touch screen
of a mobile device with a very high success rate.
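To illustrate the planar-homography step described above, the sketch below (a minimal NumPy implementation; the corner coordinates and touch point are hypothetical, not taken from the paper) estimates a 3x3 homography from four corner correspondences via the direct linear transform and maps an estimated touched point from the video frame into keyboard reference coordinates.

```python
import numpy as np

def fit_homography(src_pts, dst_pts):
    """Estimate the 3x3 homography H mapping src_pts -> dst_pts
    using the direct linear transform (needs >= 4 correspondences)."""
    rows = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        # Each correspondence contributes two linear constraints on H.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.asarray(rows, dtype=float)
    # H (up to scale) is the null-space vector of A: the right-singular
    # vector associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)

def map_point(H, pt):
    """Apply homography H to a 2-D point, with the homogeneous divide."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)

# Hypothetical correspondences: keyboard corners as detected in a video
# frame -> corners of the reference image of the software keyboard.
frame_corners = [(102, 310), (498, 325), (480, 560), (95, 540)]
ref_corners = [(0, 0), (640, 0), (640, 240), (0, 240)]

H = fit_homography(frame_corners, ref_corners)
touch_in_frame = (300, 430)          # estimated touched point in the frame
print(map_point(H, touch_in_frame))  # same point in keyboard coordinates
```

The resulting reference-image coordinates can then be matched against the known key layout to decide which key was tapped.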