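Research on Computer-Based Sentence Segmentation, Punctuation, Word Segmentation, and Indexing of Agricultural Ancient Books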
Abstract
China's classical texts are major achievements of civilization created by the Chinese people over thousands of years of history. They embody the nation's distinctive values, ways of thinking, imagination, and creativity; they bear witness to the unbroken continuity of Chinese civilization and are a treasure of human civilization. Collating, preserving, and developing these ancient books is a duty of the Chinese people.
     China has a long tradition of collating ancient books. From the Six Classics edited by Confucius and the Seven Summaries (Qi Lüe) compiled by Liu Xiang and his son, to the Siku Quanshu (Imperial Collection of Four) and the Gujin Tushu Jicheng compiled by Qing-dynasty scholars, large-scale collation has continued without interruption and has had a lasting influence. Since 1949 the field has made achievements that have attracted worldwide attention, and more than 140 agricultural ancient books have been collated and published so far. Even so, the collation and development of agricultural ancient books remain inadequate: the published titles account for only about 15% of all surviving agricultural ancient books, and a large number still await collation.
     Sentence segmentation and punctuation of Chinese ancient books began no later than the Eastern Han dynasty and continued through later dynasties, flourishing or declining with the political circumstances of the time. Every book included in the Ming-dynasty Yongle Encyclopedia (Yongle Dadian) was punctuated with circles and dots, whereas the Qing-dynasty Siku Quanshu contains no punctuation marks at all. Although both were officially compiled collections, the difference in approach is striking. After modern-style punctuation arose in the Republican period (from 1912), punctuating ancient books became common practice, and since the founding of the People's Republic of China the government has actively promoted it, so the number of ancient books collated with modern punctuation has gradually increased. In 1989 China issued the Standard on Word Segmentation for Modern Chinese Text in Information Processing, but it addresses modern text only; a word segmentation standard for ancient Chinese text, for use in collating ancient books, has not yet received attention.
     Against this background, this thesis takes agricultural ancient books as its research object. It reviews the history and current state of their sentence segmentation and punctuation, word segmentation, and index compilation, focuses on the application of computer technology to these tasks, and builds prototype systems for sentence segmentation and punctuation and for word segmentation and indexing. The main contents are as follows:
     1) Drawing on text pattern matching and syntactic analysis, the thesis designs an algorithm for automatic sentence segmentation and punctuation of agricultural ancient books and a prototype system that implements it.
     Based on statistics and analysis of a corpus of about 20 million Chinese characters of ancient text, the thesis summarizes 11 common methods for sentence segmentation and punctuation. Sentences are first segmented using syntactic feature words (such as function words, conjunctions, and modal particles) and synonym marker words. Antonym compounds, cited-book markers, time-sequence words, numerals and quantifiers, reduplicated words, verb-noun structures, and comparative constructions are then used to segment and punctuate the resulting clauses further. Finally, agricultural terms and forbidden patterns are applied to improve the readability and accuracy of the punctuated text.
     Following these methods and rules, two knowledge bases were built by combining automatic construction with manual refinement: a punctuation pattern base and a forbidden-pattern base. Together they support the punctuation function. The pattern base currently contains 1,166 rules and the forbidden-pattern base 184 rules.
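The interplay between a pattern base and a forbidden-pattern base can be sketched roughly as follows. This is only an illustration under stated assumptions: the marker words, the forbidden string, and the function name `segment` are hypothetical and are not taken from the thesis's 1,166 and 184 rules, which are not reproduced in this abstract.

```python
import re

# Hypothetical rules for illustration only; not the thesis's actual knowledge bases.
BREAK_BEFORE = ["故", "然", "若"]   # assumed clause-initial function words
BREAK_AFTER = ["也", "矣", "焉"]    # assumed clause-final particles
FORBIDDEN = ["也者"]                # assumed string that must never be split


def segment(text: str) -> list[str]:
    """Split unpunctuated text at marker words, except inside forbidden patterns."""
    # Offsets that fall strictly inside a forbidden string are blocked.
    blocked = set()
    for pattern in FORBIDDEN:
        for match in re.finditer(re.escape(pattern), text):
            blocked.update(range(match.start() + 1, match.end()))

    # Candidate break offsets proposed by the marker words.
    breaks = set()
    for i, ch in enumerate(text):
        if ch in BREAK_BEFORE and i > 0:
            breaks.add(i)
        if ch in BREAK_AFTER and i + 1 < len(text):
            breaks.add(i + 1)
    breaks -= blocked

    # Cut the text at the surviving offsets.
    pieces, start = [], 0
    for offset in sorted(breaks):
        pieces.append(text[start:offset])
        start = offset
    pieces.append(text[start:])
    return pieces
```

In this sketch the forbidden-pattern base acts purely as a veto over breaks proposed by the pattern base; how the prototype system actually orders and combines its rules is not stated in the abstract.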
     Using these rules, the system was tested on six agricultural ancient books and achieved an accuracy of 60.5% for sentence segmentation and 40.5% for punctuation.
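The abstract does not define how these accuracy figures were computed. One plausible reading, shown here purely as an assumption, is the share of predicted break positions that coincide with the breaks in a manually punctuated reference. The function names below are hypothetical.

```python
def break_positions(segments):
    """Return the set of character offsets at which a segmentation breaks."""
    positions, offset = set(), 0
    for seg in segments[:-1]:      # no break after the final segment
        offset += len(seg)
        positions.add(offset)
    return positions


def break_precision(predicted, gold):
    """Fraction of predicted breaks that also appear in the gold segmentation."""
    pred = break_positions(predicted)
    ref = break_positions(gold)
    if not pred:
        return 0.0
    return len(pred & ref) / len(pred)
```

For example, a prediction that places two breaks, one of which matches the reference, scores 0.5 under this definition.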
     2) Drawing on N-gram word segmentation and dictionary-based word segmentation, the thesis designs a dedicated word segmentation algorithm for agricultural ancient books and a prototype word segmentation system.
     No ready-made segmentation dictionary for ancient books is currently available, so one has to be built. Because a comprehensive and authoritative dictionary cannot be completed in a short time, combining dictionary-based segmentation with N-gram segmentation is at present a comparatively suitable approach for ancient books.
     Accordingly, the experiment built two groups of segmentation dictionaries, a basic group and a stoplist group, comprising more than 10 databases in total. The basic group includes databases of personal names, place names, book titles, official titles, and product names; the stoplist group includes databases of idioms, reign titles, function words, numerals and quantifiers, and time-sequence words. The dictionaries currently contain 200,000 entries, which largely meets the needs of ancient-book word segmentation.
     Ancient texts were segmented using both dictionary-based and N-gram segmentation, and the results were filtered by substring comparison, adjacent-word filtering, high-frequency word filtering, and low-frequency word filtering. Tests were run on two corpora: 12 agricultural ancient books and 379 titles from the Local Chronicles of Guangdong: Products. From the 12 agricultural ancient books, 1,164 dictionary words (about 31% of the vocabulary) and 2,530 out-of-vocabulary words (69%) were identified. From the 379 Guangdong titles, 6,314 dictionary words (8%) and 75,438 out-of-vocabulary words (92%) were identified. Of the latter corpus's vocabulary, 8,044 words (10%) occur more than 10 times and 3,760 words (about 5%) occur more than 20 times.
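A minimal sketch of how N-gram candidates, a basic dictionary, a stoplist, and two of the filters might fit together is given below. It is an assumption-laden illustration, not the thesis's algorithm: the function names, the 2-to-4 character window, the frequency threshold, and the way the stoplist is applied (dropping candidates that begin or end with a stoplist entry) are all hypothetical.

```python
from collections import Counter


def ngram_counts(text, n_min=2, n_max=4):
    """Count every character n-gram of length n_min..n_max in the text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def extract_words(text, dictionary, stoplist, min_freq=2):
    """Return (known, new): candidates found in the dictionary, and the rest."""
    candidates = {}
    for word, count in ngram_counts(text).items():
        if count < min_freq:                      # low-frequency filter
            continue
        if any(word.startswith(s) or word.endswith(s) for s in stoplist):
            continue                              # assumed stoplist filter
        candidates[word] = count

    # Substring filter: drop a candidate when a longer candidate contains it
    # and occurs exactly as often, since it never appears on its own.
    kept = {
        w: c for w, c in candidates.items()
        if not any(w != v and w in v and c == cv for v, cv in candidates.items())
    }

    known = {w: c for w, c in kept.items() if w in dictionary}
    new = {w: c for w, c in kept.items() if w not in dictionary}
    return known, new
```

The split into `known` and `new` mirrors the abstract's distinction between dictionary words and out-of-vocabulary words; the adjacent-word and high-frequency filters mentioned above are omitted because the abstract does not describe them.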
     Analysis of the segmentation results for the 379 titles from the Local Chronicles of Guangdong: Products shows that, for words whose frequency rank lies in the interval (2000, 8000), the product of rank and frequency is roughly constant at 23,000,000. This indicates that Zipf's first law also holds for ancient Chinese text.
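Zipf's first law states that a word's frequency multiplied by its frequency rank is approximately constant. A check of this kind over a chosen rank interval can be sketched as follows; the function name and the toy frequency table in the usage note are hypothetical and are not data from the thesis.

```python
def rank_frequency_products(freqs, lo, hi):
    """Return (rank, frequency, rank * frequency) for ranks in the open interval (lo, hi).

    `freqs` maps each word to its number of occurrences; rank 1 is the most
    frequent word.
    """
    ordered = sorted(freqs.values(), reverse=True)
    return [
        (rank, freq, rank * freq)
        for rank, freq in enumerate(ordered, start=1)
        if lo < rank < hi
    ]
```

With a toy table such as `{"a": 120, "b": 60, "c": 40, "d": 30}`, every rank-frequency product equals 120, which is the idealized pattern the law describes.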
     Implementing sentence segmentation and punctuation and word segmentation and indexing of agricultural ancient texts by computer, and developing the corresponding prototype systems, is a new attempt in China to combine agricultural history, information science, and Chinese information processing. For that reason the work is still somewhat immature and needs further study.
     1) The punctuation pattern base currently holds just over 1,100 rules, which is a limited number, and the rules still need to be sorted out and optimized. Moreover, the current method is pattern recognition that relies mainly on words and makes only limited use of syntactic features. The main reason is the lack of an annotated corpus for agricultural ancient books, and in particular the lack of a database of lexical attributes, which makes effective syntactic analysis difficult. As such a lexical attribute database is built, analysis of the syntactic rules of ancient books can gradually deepen, and segmentation and punctuation based on the lexical attribute database and a syntactic rule base should achieve better results.
     2) The segmentation system combines dictionary-based and N-gram segmentation, but the share of words recognized through the dictionaries is still low: 31% in the agricultural ancient books and 8% in the Local Chronicles of Guangdong: Products. This evidently results from the uneven distribution of dictionary entries across sub-disciplines. Optimizing the segmentation dictionaries is therefore one of the problems for further study.
     Although this project was supported by the National Social Science Fund and the Humanities and Social Sciences Fund of the Ministry of Education, its scope is very broad and time was limited, so a comprehensive and in-depth treatment was not possible and these problems are left for future research.
