Research on Category Attribute Detection of Initials and Finals in Chinese Continuous Speech
Abstract
Speech recognition based on the hidden Markov model (HMM) is the mainstream approach to large-vocabulary speech recognition. However, the method takes no account of human cognitive processes and ignores a great deal of phonetic and linguistic knowledge, and its development has reached a bottleneck. Consequently, a new knowledge-based speech recognition framework combined with statistical models has emerged. How to acquire speech knowledge attributes, and how to apply them, are key problems this framework must solve. This dissertation focuses on detecting the boundary and category knowledge attributes of the initials and finals of Chinese continuous speech, and on applying them to Chinese speech recognition. The work is summarized as follows:
     A boundary detection method for the initials and finals of Chinese continuous speech, based on Seneff auditory spectrum features, is proposed. By studying the differences between initials and finals in energy concentration and formant structure, and exploiting the Seneff auditory spectrum's ability to highlight rapidly changing regions and formant structure in the speech signal, an optimized feature parameter set is constructed, and the candidate boundary points produced by the individual features are fused to detect initial/final boundaries. Compared with model-based approaches, the method requires far less training data and is more robust; compared with frame-based approaches, it avoids the low precision and missed detections caused by framing. Experimental results show that the algorithm achieves high boundary detection accuracy and precision with strong robustness and low computational complexity.
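The abstract gives no implementation detail. As a rough illustration of the boundary-candidate idea (a sharp spectral change at an initial/final transition), the sketch below substitutes a plain STFT spectral-flux measure for the Seneff auditory spectrum; the function name, frame sizes, and the adaptive threshold are all illustrative assumptions, not the thesis's tuned method.

```python
import numpy as np

def candidate_boundaries(signal, sr, frame_len=0.025, hop=0.010, thresh=1.5):
    """Flag candidate segment boundaries at peaks of spectral flux.

    A stand-in for the Seneff-auditory-spectrum features of the thesis:
    a plain magnitude STFT is used, and frames where the spectrum changes
    sharply relative to the mean flux become boundary candidates.
    """
    n = int(frame_len * sr)
    h = int(hop * sr)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n, h)]
    spectra = [np.abs(np.fft.rfft(f * np.hanning(n))) for f in frames]
    # spectral flux: L2 distance between consecutive magnitude spectra
    flux = np.array([np.linalg.norm(spectra[i] - spectra[i - 1])
                     for i in range(1, len(spectra))])
    if flux.size == 0:
        return []
    limit = thresh * flux.mean()
    # local maxima above the adaptive threshold -> candidate times (seconds)
    return [(i + 1) * hop for i in range(1, len(flux) - 1)
            if flux[i] > limit and flux[i] >= flux[i - 1] and flux[i] >= flux[i + 1]]
```

In the thesis, candidates from several features are fused into final boundary decisions; the sketch shows a single feature only.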
     A nasal detection method based on energy and formant structure information is proposed. Speech is first divided into sonorants and obstruents according to differences in energy and formant structure, and nasals are then detected within the sonorants. While preserving the nasal detection rate, insertion errors are removed step by step in a post-processing stage that exploits the energy and formant-structure differences between nasals and the segments easily confused with them. Compared with classical algorithms, the method improves nasal detection accuracy.
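The first stage of this cascade, the sonorant/obstruent split, can be sketched from the cues the abstract names: sonorants (vowels, nasals, liquids) carry high overall energy concentrated in the low frequencies. The band edge and both thresholds below are ad hoc assumptions for illustration, not values from the thesis.

```python
import numpy as np

def sonorant_mask(signal, sr, frame_len=0.025, hop=0.010,
                  rms_ratio=0.1, low_band=1000.0, low_ratio=0.5):
    """Per-frame sonorant/obstruent decision from two simple cues:
    frame RMS relative to the signal peak, and the fraction of spectral
    power below `low_band` Hz. Returns one boolean per frame."""
    n, h = int(frame_len * sr), int(hop * sr)
    peak = np.max(np.abs(signal)) + 1e-12
    mask = []
    for i in range(0, len(signal) - n, h):
        frame = signal[i:i + n] * np.hanning(n)
        spec = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(n, 1.0 / sr)
        low = spec[freqs < low_band].sum()
        total = spec.sum() + 1e-12
        energetic = np.sqrt(np.mean(frame ** 2)) > rms_ratio * peak
        mask.append(bool(energetic and low / total > low_ratio))
    return mask
```

Nasal detection proper would then run only on the frames marked sonorant, using the formant-structure features the thesis describes.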
     A stop detection method based on the energy change rate is proposed. By analyzing the segment duration and energy evolution of stops, an energy-change-rate parameter is extracted and used to detect stops within the obstruents. This overcomes the drawbacks of traditional burst-spectrum features, which are unstable and can also appear in non-stops, and improves stop detection performance. Cross-validation shows that the method has good stability and generalization ability.
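The energy-change-rate idea can be illustrated in a few lines: a stop closure is nearly silent, and then energy jumps by a large factor within one short frame. The `rise` ratio and silence `floor` below are illustrative thresholds, not the thesis's tuned values.

```python
import numpy as np

def burst_onsets(signal, sr, frame_len=0.010, rise=8.0, floor=1e-4):
    """Locate abrupt energy rises such as stop-consonant bursts.

    Short-time energy is computed on non-overlapping frames; a frame
    whose energy exceeds its predecessor's by more than `rise` times
    is reported as a burst onset (time in seconds)."""
    n = int(frame_len * sr)
    e = np.array([np.mean(signal[i:i + n] ** 2)
                  for i in range(0, len(signal) - n, n)]) + floor
    rate = e[1:] / e[:-1]                  # frame-to-frame energy ratio
    return [(i + 1) * frame_len for i in np.nonzero(rate > rise)[0]]
```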
     A classification method for affricates and fricatives based on segment energy distribution characteristics and spectral statistics is proposed. After analyzing the differences between affricates and fricatives, within the non-stops, in articulation process and spectral shape, segment energy distribution features and spectral statistics parameters are extracted to classify the two. Experimental results show that the method is effective.
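One energy-distribution cue consistent with the description above is amplitude rise time: an affricate begins with a burst and reaches its peak energy quickly, while a fricative builds up gradually. The feature and the 0.25 decision threshold below are an illustrative sketch, not the thesis's actual feature set.

```python
import numpy as np

def rise_fraction(segment, sr, frame_len=0.005):
    """Fraction of a segment's duration needed for its short-time energy
    to first reach half of the segment's peak frame energy."""
    n = int(frame_len * sr)
    e = np.array([np.mean(segment[i:i + n] ** 2)
                  for i in range(0, len(segment) - n, n)])
    first = int(np.argmax(e >= 0.5 * e.max()))   # index of first crossing
    return first / len(e)

def classify(segment, sr):
    # short rise -> burst-like onset -> affricate; slow rise -> fricative
    return "affricate" if rise_fraction(segment, sr) < 0.25 else "fricative"
```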
     Finally, the proposed boundary detection and category detection methods are combined in the form of a binary tree to detect the boundary and category knowledge attributes of the initials and finals of Chinese continuous speech. A baseline continuous speech recognition system based on a segmental conditional random field integration model is constructed, and the detected boundary and category attributes are applied to it. Experimental results show that the detected attributes effectively improve the performance of the baseline system.
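The binary-tree combination amounts to routing each segment through the broad-class detectors in a fixed cascade. The sketch below shows that routing with placeholder detector callables standing in for the thesis's individual methods.

```python
def classify_segment(seg, is_sonorant, is_nasal, is_stop, is_affricate):
    """Binary-tree broad-class routing: sonorant vs. obstruent first,
    then nasal within sonorants; stop within obstruents; affricate vs.
    fricative among the remaining non-stops."""
    if is_sonorant(seg):
        return "nasal" if is_nasal(seg) else "other-sonorant"
    if is_stop(seg):
        return "stop"
    return "affricate" if is_affricate(seg) else "fricative"
```

The resulting broad-class labels and boundaries are what the thesis feeds into the segmental-CRF baseline as knowledge attributes.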
