Research on Statistics-Based Post-Processing for Chinese Character Recognition
Abstract
With the rapid development of computer and network technology, large volumes of text on various physical media need to be digitized. OCR (optical character recognition) emerged to raise efficiency and reduce manual effort. In recent years, research on Chinese character OCR has made great progress, and many commercial recognition systems have reached the market successfully. However, the complex structure and high variability of Chinese characters limit single-character recognition accuracy, and further gains are hard to obtain from single-character recognition alone. Post-processing that builds on single-character recognition and draws on linguistic knowledge and the contextual information of the text is therefore needed.
     This thesis describes the significance of post-processing for Chinese character recognition and surveys several post-processing methods, then applies a statistics-based method to the output of a single-character recognizer. Bigram character co-occurrence statistics were collected from the full-year 2000 text of People's Daily (about 19.3 million characters), giving the probabilistic constraints between adjacent characters in Chinese text. Following the Markov language model, these co-occurrence probabilities are applied as contextual information in post-processing. Reprocessing the single-character recognition results in this way improves the recognition accuracy of the whole system to some extent.
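The abstract gives no implementation details, so the following is only an illustrative sketch of how bigram co-occurrence statistics and a first-order Markov model might be used to rescore recognizer output. Everything in it is an assumption made for illustration: the function names (`train_bigram`, `log_transition`, `rescore`), the add-one smoothing, the Viterbi search, the `lm_weight` parameter, and the tiny three-line corpus that stands in for the People's Daily text.

```python
import math
from collections import Counter


def train_bigram(corpus_lines):
    """Collect the vocabulary, context counts, and adjacent-pair counts."""
    vocab, context, bigrams = set(), Counter(), Counter()
    for line in corpus_lines:
        vocab.update(line)
        context.update(line[:-1])            # characters that have a successor
        bigrams.update(zip(line, line[1:]))  # adjacent character pairs
    return vocab, context, bigrams


def log_transition(prev, cur, model):
    """log P(cur | prev) with add-one smoothing (an assumed smoothing choice)."""
    vocab, context, bigrams = model
    return math.log((bigrams[(prev, cur)] + 1) / (context[prev] + len(vocab)))


def rescore(candidates, model, lm_weight=1.0):
    """Viterbi search over per-position candidate lists.

    candidates: non-empty list of positions; each position is a list of
    (character, recognizer_log_score) pairs from a hypothetical recognizer.
    """
    # best[ch] = (total score, best path ending in ch)
    best = {ch: (score, [ch]) for ch, score in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for ch, score in position:
            new_best[ch] = max(
                (total + lm_weight * log_transition(prev, ch, model) + score,
                 path + [ch])
                for prev, (total, path) in best.items()
            )
        best = new_best
    return "".join(max(best.values())[1])


# Toy corpus standing in for the real newspaper text (hypothetical data).
model = train_bigram(["人民日报", "人民银行", "日报记者"])

# The recognizer slightly prefers the wrong, visually similar characters.
candidates = [
    [("人", -0.1)],
    [("良", -0.5), ("民", -0.7)],
    [("曰", -0.4), ("日", -0.6)],
    [("报", -0.2)],
]
result = rescore(candidates, model)
print(result)  # prints 人民日报
```

In this toy run the recognizer's top choices would spell 人良曰报, but the bigram transition scores outweigh the small differences in recognizer score and the search returns 人民日报. A real system would need far more training text and a more careful smoothing scheme than add-one.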
