A GMM/SVM-Based Speaker Recognition Algorithm for Noisy Environments
Abstract
Language is the most important tool of human communication, and the speech signal, as its carrier, conveys a great deal of information at different levels. The speaker-related information in the signal can be used to identify who is speaking or to verify whether the speaker is a specific person. Automatic speaker recognition now achieves high recognition rates in low-noise, low-distortion conditions, but the noise that is ubiquitous in real environments causes a mismatch between the trained models and the test speech, and recognition rates drop sharply as a result. Improving performance in noisy environments is therefore key to moving speaker recognition from the laboratory into practical use, and it is an active research topic.
     Speaker recognition consists of two main modules: feature extraction and pattern classification. This paper studies the extraction and noise robustness of feature parameters from the perspectives of the human speech production organs and auditory organs, and examines today's mainstream classifiers in depth. All of the work targets text-independent, open-set speaker identification in noisy environments.
     Information entropy, widely used in coding theory, measures the average uncertainty of a source, so the entropy of speech and the entropy of noise differ considerably. In the preprocessing stage, this paper therefore adopts a speech endpoint detection method based on an entropy function. Experiments show that the spectral entropy method performs well at low SNR and under non-stationary noise, and a dynamic-threshold method for detecting speech endpoints is then proposed.
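To make the entropy argument concrete, the following is a minimal sketch of frame-level spectral entropy in plain Python. It is not the paper's implementation: the naive DFT, the function names, the assumption that the first few frames contain only noise, and the `margin` factor are all illustrative choices. A frame whose energy is concentrated in a few frequency bins (as in voiced speech) has low entropy, while a frame with a flat spectrum has high entropy.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (bins 0 .. N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_entropy(frame, eps=1e-12):
    """Shannon entropy (bits) of the normalized power spectrum of a frame."""
    power = power_spectrum(frame)
    total = sum(power) + eps
    probs = [p / total for p in power]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def detect_speech(frames, noise_frames=5, margin=0.8):
    """Flag frames whose entropy falls below a threshold set from leading frames.

    Assumption: the first `noise_frames` frames contain only noise.
    `margin` is an illustrative scale factor, not a value from the paper.
    """
    entropies = [spectral_entropy(f) for f in frames]
    noise_level = sum(entropies[:noise_frames]) / noise_frames
    threshold = margin * noise_level
    return [h < threshold for h in entropies]
```

Deriving the threshold from the leading frames is only one simple way to make it adapt to the recording; the dynamic-threshold rule proposed in the paper is not specified in the abstract.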
     Because the frequency band of noise generally does not cover the whole speech range, this paper extracts features in multiple subbands and uses Teager-energy-based sub-cepstral features within each subband. It also designs a hybrid system that combines a Support Vector Machine (SVM) and a Gaussian Mixture Model (GMM), optimized with the AdaBoost algorithm. The optimized SVM first makes a decision on each subband separately and screens out speakers who are not in the training set. For in-set speakers, the decision results are then used to weight the features, emphasizing the subband features that most influence the recognition result and thereby reducing the effect of noise. Finally, the optimized GMM performs the recognition. Experimental results show that the system achieves good recognition performance at low SNR.
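The Teager energy mentioned above is commonly computed with the discrete Teager-Kaiser operator, psi[n] = x[n]^2 - x[n-1] * x[n+1]. The sketch below shows that operator and a per-subband log-energy feature. It assumes the subband signals have already been produced by some filter bank, and the helper `log_mean_teager_energy` is a hypothetical name; the paper's exact sub-cepstral feature (for example, any cepstral transform applied afterwards) is not described in the abstract and is not reproduced here.

```python
import math


def teager_energy(signal):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [signal[n] ** 2 - signal[n - 1] * signal[n + 1]
            for n in range(1, len(signal) - 1)]


def log_mean_teager_energy(subbands, eps=1e-12):
    """Log of the mean absolute Teager energy of each subband signal.

    `subbands` is a list of already band-filtered signals (an assumption;
    the filter bank itself is outside this sketch).
    """
    feats = []
    for band in subbands:
        psi = teager_energy(band)
        feats.append(math.log(sum(abs(p) for p in psi) / len(psi) + eps))
    return feats
```

For a pure sinusoid A*cos(W*n + phi), the operator returns the constant A^2 * sin(W)^2, so it reflects both amplitude and frequency rather than amplitude alone.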