Research on Several Problems in Lip-Reading Recognition
Abstract
Automatic speech recognition (ASR) is an important component of future human-computer interfaces; it uses the voice to accomplish goals such as natural-language understanding and identity recognition. Speech recognition has made notable progress and already has some successful applications, such as IBM's ViaVoice system. Systems of this kind perform well when the vocabulary is modest and the noise level is low, but their performance degrades sharply in real-world environments. Future human-computer interaction scenarios, for example in cars, at airports, or during live interviews, demand far greater robustness, so new approaches are needed. Combining the visual features of lip motion (lip reading) with acoustic features has been shown by many researchers to be highly effective: it not only raises the recognition rate of an automatic speech system but also makes the system more robust and better adapted to real environments. This thesis studies two questions in lip-reading recognition: how to make lip feature extraction more effective and how to raise the lip-reading recognition rate. The main work and contributions are as follows:
     (1) A lip feature extraction algorithm based on MPEG-4 parameters. The choice of lip features plays a crucial role in lip-reading research. From MPEG-4 we select 24 facial animation parameters (FAPs) closely related to lip articulation and use them to describe the lip features. To separate the lip region from the rest of the face, the color of the lip region is described with six Gaussian mixture models (GMMs). To describe the lip shape and track the lip contour more accurately, a new search energy function is constructed from the six lip-color GMMs and the lip-contour information and embedded in a deformable template. The GMM parameters of the lip region and of the remaining facial regions are estimated by maximum likelihood, which separates the lip region from the rest of the face effectively and yields the contour distribution of the region of interest (ROI). To remove the influence of global head motion on lip-region tracking, four facial feature points are used to correct the head pose and estimate the facial motion. Finally, the FAP values are computed from the motion of the facial feature points; experiments show good results. A sketch of the color-model idea appears after this paragraph.
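The following Python sketch illustrates only the likelihood-ratio idea behind the GMM color separation described above; it is not the thesis's implementation. The function names are hypothetical, scikit-learn's GaussianMixture stands in for whatever estimator the thesis used, and the deformable-template energy function is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_models(lip_pixels, face_pixels, n_components=6):
    """Fit one 6-component GMM to lip-region colors and one to the
    rest of the face. Inputs are (N, 3) arrays of color values."""
    lip_gmm = GaussianMixture(n_components=n_components).fit(lip_pixels)
    face_gmm = GaussianMixture(n_components=n_components).fit(face_pixels)
    return lip_gmm, face_gmm

def lip_likelihood_map(image, lip_gmm, face_gmm):
    """Per-pixel log-likelihood ratio; positive where 'lip' is more
    likely than 'face'."""
    pixels = image.reshape(-1, 3).astype(float)
    ratio = lip_gmm.score_samples(pixels) - face_gmm.score_samples(pixels)
    return ratio.reshape(image.shape[:2])
```

Thresholding the ratio map at zero gives a rough lip mask; in the thesis, the GMM likelihoods instead feed the search energy of the deformable template.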
     (2) A lip-shape classification method based on Fourier descriptors. After the location and size of the mouth are obtained with the AdaBoost algorithm, the lip edge is first located by edge detection; the key feature values of the lip shape are then extracted with Fourier descriptors; finally, the Fourier descriptors, after normalization, are fed into an artificial neural network for classification. Experiments show that lip-shape classification with Fourier descriptors reaches an accuracy of 85%. A sketch of the descriptor computation follows this paragraph.
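As an illustration (not the thesis's code), here is a minimal Python sketch of the standard complex Fourier descriptor computation, with the usual normalization for translation, scale, rotation, and starting-point invariance; the exact normalization used in the thesis is not specified in the abstract.

```python
import numpy as np

def fourier_descriptors(contour, n_coeffs=16):
    """Normalized Fourier descriptors of a closed lip contour.

    contour: (N, 2) array of (x, y) boundary points ordered along
    the curve. Returns n_coeffs magnitudes."""
    z = contour[:, 0] + 1j * contour[:, 1]  # boundary as a complex signal
    coeffs = np.fft.fft(z)
    coeffs[0] = 0.0              # drop DC term: translation invariance
    mags = np.abs(coeffs)        # drop phase: rotation/start-point invariance
    mags = mags / mags[1]        # divide by first harmonic: scale invariance
    return mags[2:n_coeffs + 2]  # skip a0 (zeroed) and a1 (now 1.0)
```

The resulting descriptor vector is what would be fed to the artificial neural network classifier mentioned above.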
     (3) A lip-reading recognition method based on an improved hidden Markov model (HMM), and a lip-reading recognition system built on it. Owing to its inherent strengths, the HMM has in recent years been increasingly applied to lip-reading research, but the limitations of the traditional HMM keep the recognition rate low. Our analysis shows that the main cause is the Markov assumptions the traditional HMM places on state transitions and output observations, which constrain lip-reading applications. The proposed method improves on these Markov assumptions for state transitions and output observations, derives a learning algorithm for the new model on the basis of the traditional HMM, and builds a lip-reading recognition system on the new algorithm. The system uses the AdaBoost algorithm to detect the face and lips; PCA and LDA to reduce the dimensionality of the lip image pixels and extract lip features; vector quantization (VQ) to process the lip feature vectors; and the improved HMM learning algorithm to perform recognition. The final experiments show that the improved HMM achieves a higher recognition rate on lip reading than the traditional HMM. A baseline sketch of this pipeline appears after this paragraph.
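To make the recognition chain concrete, the following Python sketch strings together the front end described above (PCA, then LDA, then vector quantization) with the classical forward recursion of a discrete HMM. It is a baseline sketch under stated assumptions: the names are hypothetical, scikit-learn components stand in for the thesis's implementations, and the improved learning algorithm that relaxes the Markov assumptions is not reproduced, only the standard model it modifies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

def build_front_end(X, y, n_pca=50, n_vq=64):
    """X: (n_frames, n_pixels) raw lip images; y: per-frame class labels
    (assumed available for LDA). Returns a discrete symbol sequence."""
    feats = PCA(n_components=n_pca).fit_transform(X)
    feats = LinearDiscriminantAnalysis().fit(feats, y).transform(feats)
    codebook = KMeans(n_clusters=n_vq, n_init=10).fit(feats)
    return codebook.predict(feats)  # VQ symbols for the HMM

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood of a symbol sequence under a standard discrete HMM
    (forward algorithm). log_pi: (S,), log_A: (S, S), log_B: (S, K)."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = log_B[:, o] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)
```

Recognition scores each candidate word's HMM on the observed symbol sequence and returns the argmax; the thesis replaces the transition and emission assumptions inside this recursion with its improved model.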