Phase 1 (face feature extraction). The Proteus algorithm applies principal component analysis (PCA) to the face feature set to perform feature selection.

Phase 2 (voice feature extraction). The algorithm extracts a set of MFCCs from each preprocessed audio frame and arranges them in a matrix in which each row corresponds to a frame and each column to an MFCC index. To reduce the dimensionality of the MFCC matrix, it uses the column means of the matrix as its voice feature vector.

Phase 3 (fusion of face and voice features). Because the algorithm measures face and voice features in different units, it standardizes them individually through z-score normalization, as in score-level fusion. It then concatenates the normalized features into a single feature vector: with N face features and M voice features, the concatenated, or fused, set contains N + M features. The algorithm then applies LDA to perform feature selection on the fused feature set, which helps address the curse-of-dimensionality problem by removing irrelevant features from the combined set.

Phase 4 (authentication). The algorithm uses Euclidean distance to measure the similarity between the fused feature sets from the training data and each test sample. If the distance is less than or equal to a predetermined threshold, it accepts the test subject as a legitimate user; otherwise, the subject is declared an impostor.

Implementation

We implemented our quality-based score-level and feature-level fusion approaches on a randomly selected Samsung Galaxy S5 phone. User friendliness and execution speed were our guiding principles.

User interface. Our first priority when designing the interface was to ensure users could seamlessly capture face and voice biometrics simultaneously. We thus adopted a solution that asks users to record a short video of their faces while speaking a simple phrase.
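Phases 1 and 2 can be sketched in a few lines of NumPy. The function names and toy dimensions below are illustrative assumptions, not the Proteus implementation; the sketch assumes face features have already been extracted into a samples-by-features matrix.

```python
import numpy as np

def pca_project(face_features, k):
    """Phase 1 sketch: reduce a (samples x features) face-feature
    matrix to its top-k principal components via SVD."""
    centered = face_features - face_features.mean(axis=0)
    # Rows of vt are principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def voice_feature_vector(mfcc_matrix):
    """Phase 2 sketch: one row per audio frame, one column per
    MFCC index; the column means form the voice feature vector."""
    return mfcc_matrix.mean(axis=0)

# Toy data: 10 face samples with 6 features; 40 frames x 13 MFCCs.
rng = np.random.default_rng(0)
faces = rng.normal(size=(10, 6))
mfccs = rng.normal(size=(40, 13))

print(pca_project(faces, 3).shape)        # (10, 3)
print(voice_feature_vector(mfccs).shape)  # (13,)
```

Averaging the MFCC matrix over frames yields a fixed-length voice vector regardless of utterance duration, which is what makes the later concatenation with face features straightforward.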
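Phases 3 and 4 can likewise be sketched as follows. This is a minimal illustration under stated assumptions: the normalization statistics come from the training set, the LDA feature-selection step is omitted, and the threshold value is arbitrary.

```python
import numpy as np

def zscore(x, mean, std):
    """Standardize features using statistics from the training set."""
    return (x - mean) / std

def fuse(face_vec, voice_vec, face_stats, voice_stats):
    """Phase 3 sketch: z-score-normalize each modality separately,
    then concatenate into one N + M fused feature vector
    (the LDA feature-selection step is omitted here)."""
    f = zscore(face_vec, *face_stats)
    v = zscore(voice_vec, *voice_stats)
    return np.concatenate([f, v])

def authenticate(template, probe, threshold):
    """Phase 4 sketch: accept if the Euclidean distance between
    the enrolled fused template and the probe is within threshold."""
    return np.linalg.norm(template - probe) <= threshold

# Toy enrollment: 2 face features, 3 voice features.
template = fuse(np.array([0.2, -1.0]), np.array([3.0, 5.0, 1.0]),
                (np.array([0.0, 0.0]), np.array([1.0, 2.0])),
                (np.array([2.0, 4.0, 1.0]), np.array([1.0, 1.0, 1.0])))
probe = template + 0.1  # a nearby sample, e.g., from the same user
print(authenticate(template, probe, threshold=0.5))  # True
```

Normalizing each modality before concatenation keeps the face features from dominating the distance computation simply because they are measured on a larger scale.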
The prototype of our graphical user interface (GUI) (see Figure 3) gives users real-time feedback on the quality metrics of their face and voice, guiding them to capture the best-quality samples possible; for example, if the luminosity in the video differs significantly from the average luminosity of images in the training database, the user may get a prompt saying, "Suggestion: Increase lighting." In addition to being user friendly, the video also facilitates integration of other security features (such as liveness checking7) and correlation of lip movement with speech.8

To ensure fast authentication, the Proteus face- and voice-feature-extraction algorithms are executed in parallel on different processor cores; the Galaxy S5 has four cores. Proteus also uses similar parallel programming techniques to help ensure the GUI's responsiveness.

Security of biometric data. The greatest risk of storing biometric data on a mobile device (Proteus stores data from multiple biometrics) is the possibility of attackers stealing it and using it to impersonate a legitimate user. It is thus imperative that Proteus store and process the biometric data securely. The current implementation stores only MFCCs and PCA coefficients, not raw biometric data, in the device's persistent memory; deriving useful biometric data from these is nontrivial.16 Proteus can enhance security significantly by using cancelable biometric templates19 and by encrypting, storing, and processing biometric data in a Trusted Execution Environment: tamper-proof hardware highly isolated from the rest of the device's software and hardware. The Galaxy S5 uses this approach to protect fingerprint data.22

Storing and processing biometric data on the mobile device itself, rather than offloading these tasks to a remote server, eliminates the challenge of securely transmitting biometric data and authentication decisions across potentially insecure networks.
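The luminosity feedback described above might look like the following sketch. The mean-luminance statistic, the reference value, and the tolerance are illustrative assumptions, not the actual Proteus quality metric.

```python
import numpy as np

# Illustrative values; Proteus's actual statistic and tolerance differ.
TRAIN_MEAN_LUMINANCE = 128.0
TOLERANCE = 40.0

def luminosity_feedback(frame):
    """Compare a grayscale frame's mean luminance with the training
    database's average and return a suggestion if they differ
    significantly, or None if the quality is acceptable."""
    lum = float(np.mean(frame))
    if lum < TRAIN_MEAN_LUMINANCE - TOLERANCE:
        return "Suggestion: Increase lighting."
    if lum > TRAIN_MEAN_LUMINANCE + TOLERANCE:
        return "Suggestion: Decrease lighting."
    return None

dark = np.full((4, 4), 30.0)      # underexposed toy frame
print(luminosity_feedback(dark))  # Suggestion: Increase lighting.
```

Running a check like this on each preview frame lets the GUI coach the user toward samples that resemble the enrollment conditions before authentication is even attempted.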
In addition, this approach alleviates consumers' concerns regarding the security, privacy, and misuse of their biometric data in transit to and on remote systems.

Performance Evaluation

We compared Proteus's recognition accuracy with that of unimodal systems based on face and voice biometrics. We measured that accuracy using the standard equal error rate (EER) metric: the value at which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal.

Database. For our experiments, we created CSUF-SG5, a homegrown multimodal database of face and voice samples collected from California State University, Fullerton (CSUF), students, employees, and individuals from outside the university using the Galaxy S5 (hence the name). To incorporate various types and levels of variation and distortion in the samples, we collected them in a variety of real-world settings. Because such a diverse database of multimodal biometrics is not otherwise available, we plan to make ours publicly available. The database today includes video recordi