Research on Layout Analysis of Complex Chinese Document Images
Abstract
Optical character recognition (OCR) is a fast, labor-saving way to enter text automatically, and it is widely used in building online resource databases and digital libraries. Layout analysis is the first step of an automated OCR pipeline, and its correctness directly affects the semantic and logical relationships in the system's output. Among document images, Chinese documents have complex backgrounds and typesetting, which makes their layout analysis harder than that of Western-language documents. Research on Chinese layout analysis therefore has both theoretical significance and practical value.
     Existing algorithms for skew detection, page segmentation, and plain-text layout analysis are easily affected by layout complexity. Based on the characteristics of Chinese layouts, this thesis studies layout analysis algorithms for Chinese document images in depth, supported by extensive experiments, and obtains the following results:
     1. When existing nearest-neighbor methods compute the skew angle of a document image, the selected nearest-neighbor pairs may be wrong, so the computed angle can differ considerably from the true angle. The proposed improved nearest-neighbor chain method determines whether similar connected components lie in the same row or the same column and builds two kinds of similar k-nearest-neighbor chains accordingly. This keeps erroneous nearest-neighbor chains from interfering with the angle computation and improves the accuracy of the estimated skew angle.
     2. The traditional run-length smoothing algorithm (RLSA) is sensitive to the choice of smoothing threshold. To address this, a run-length smoothing algorithm based on selective connected components is proposed. It chooses thresholds from the sizes of, and distances between, connected components within and between regions. This overcomes the traditional algorithm's dependence on font size, character spacing, and image regions, and clearly improves page segmentation for document images with a single background.
     3. Existing segmentation algorithms for color document images with complex backgrounds generally trade running time against segmentation accuracy. This thesis improves the color-to-gray conversion and applies a dynamic clustering segmentation method based on the edge image. The improved conversion avoids losing the color information of text regions, and only the edge image is processed, so page segmentation becomes faster without a loss of accuracy.
     4. Existing layout analysis algorithms for complex plain-text images with unknown reading order are sensitive to parameter choices and have limited applicability. To address this, a layout analysis algorithm based on SVM region construction is proposed. The algorithm selects seed connected components as the first feature and builds regions from them step by step, then uses the projection method to determine the reading order within each region. Experimental results show that the proposed method adapts better and produces satisfactory analysis results on complex Chinese layouts.
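As background for result 2, the following is a minimal sketch of the classic run-length smoothing step, not the selective variant proposed in the thesis. The function names `smooth_row` and `rlsa`, the 0/1 pixel convention (1 = foreground), and the rule that only gaps bounded by foreground on both sides are filled are assumptions made for illustration.

```python
def smooth_row(row, threshold):
    """Fill background runs (0s) of length <= threshold that lie between
    two foreground pixels (1s). Returns a new list; the input is unchanged."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smooth every row and every column separately, then combine the two
    smoothed images with a pixel-wise AND (assumed classic formulation)."""
    horizontal = [smooth_row(r, h_threshold) for r in image]
    columns = [list(c) for c in zip(*image)]
    vertical_t = [smooth_row(c, v_threshold) for c in columns]
    vertical = [list(r) for r in zip(*vertical_t)]
    return [[h & v for h, v in zip(hr, vr)]
            for hr, vr in zip(horizontal, vertical)]


if __name__ == "__main__":
    row = [1, 0, 0, 1, 0, 0, 0, 0, 1]
    print(smooth_row(row, 2))
    print(smooth_row(row, 4))
```

On the sample row, a threshold of 2 fills only the two-pixel gap and leaves the four-pixel gap open, while a threshold of 4 merges the whole row into one block. A single fixed threshold therefore either splits or over-merges content whenever character size or spacing varies, which is the sensitivity that result 2 targets.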
