Visual Passwords Using Automatic Lip Reading

  • Ahmad B A Hassanat, Mutah University
Keywords: speaker recognition, speaker authentication, lip reading, visual speech recognition, speech reading, VSR, visual feature extraction, visual words, behaviometrics, security.


This paper presents a visual passwords system intended to increase security. The system relies mainly on recognizing the speaker from the visual speech signal alone. The proposed scheme works in two stages: setting the visual password, and verification. In the setting stage, the system asks the user to utter a selected password; a video recording of the user's face is captured and processed by a special words-based VSR system, which extracts a sequence of feature vectors. In the verification stage, the same procedure is executed, and the extracted features are compared with the stored visual password. The proposed scheme was evaluated on a video database of 20 speakers (10 females and 10 males) and on a second video database of 15 additional male speakers, using different experiment sets. The evaluation demonstrated the feasibility of the system, with an average error rate in the range of 7.63% to 20.51% at the worst tested scenario, and
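The two-stage scheme described in the abstract (store a feature sequence when the password is set, then compare a fresh sequence against it at verification) can be illustrated with a minimal sketch. The abstract does not state how the sequences are compared, so the dynamic time warping (DTW) distance, the acceptance threshold, and the names `dtw_distance` and `VisualPasswordStore` below are all assumptions made for illustration, not the author's method. Feature extraction from video is out of scope here; the sequences are assumed to be already extracted, one vector per frame.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dtw_distance(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Length-normalised DTW distance between two feature-vector sequences.

    DTW is an assumed comparison method; it tolerates the same word being
    uttered at different speeds.
    """
    if not a or not b:
        raise ValueError("both sequences must contain at least one vector")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment cost of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


class VisualPasswordStore:
    """Hypothetical store covering the setting and verification stages."""

    def __init__(self, threshold: float) -> None:
        # Assumed decision rule: accept when the distance is at or below this.
        self.threshold = threshold
        self._templates: Dict[str, List[Tuple[float, ...]]] = {}

    def enroll(self, user: str, features: Sequence[Vector]) -> None:
        """Setting stage: keep the feature sequence as the user's template."""
        self._templates[user] = [tuple(v) for v in features]

    def verify(self, user: str, features: Sequence[Vector]) -> bool:
        """Verification stage: compare new features with the stored template."""
        template = self._templates.get(user)
        if template is None:
            return False
        return dtw_distance(template, features) <= self.threshold


# Usage with made-up 2-D feature vectors
store = VisualPasswordStore(threshold=0.5)
store.enroll("alice", [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
accepted = store.verify("alice", [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
```

In a sketch like this the threshold would trade false acceptances against false rejections, which is the kind of trade-off that error-rate figures such as those quoted above summarise.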

