Dynamic programming word segmentation algorithm based on domain dictionaries (基于领域词典的动态规划分词算法)

English title: Dynamic programming word segmentation algorithm based on domain dictionaries
Authors: Jiang Weili (蒋卫丽); Chen Zhenhua (陈振华); Shao Dangguo (邵党国); Ma Lei (马磊); Xiang Yan (相艳); Zheng Na (郑娜); Yu Zhengtao (余正涛)
Keywords: dynamic programming; dictionary; domain adaptability; hidden Markov model; recall rate; accuracy rate; Chinese word segmentation
Journal code: NJLG
Journal: Journal of Nanjing University of Science and Technology
Affiliation: School of Information Engineering and Automation, Kunming University of Science and Technology (昆明理工大学信息工程与自动化学院)
Publication date: 2019-03-13 13:23
Publisher: Journal of Nanjing University of Science and Technology (南京理工大学学报)
Year: 2019
Issue: v.43; No.224
Funding: Postdoctoral Fund (2016M592894XB); General Project of the Yunnan Provincial Department of Science and Technology (KKS0201703015); National Natural Science Foundation of China (61741112); Natural Science Foundation of Yunnan Province (2017FB098)
Language: Chinese
Page: NJLG201901009
Page count: 9
CN: 01
ISSN: 32-1397/N
Classification number: 67-75

Abstract

Chinese word segmentation is complex, and different professional domains call for differently constructed dictionaries. This paper first segments the text with a hidden Markov model (HMM) Chinese word segmentation model, then combines the result with the relevant Sogou domain lexicons to build a corresponding domain dictionary. The appearance of new words is monitored, and the dictionary is optimized and updated in real time. On this basis, the paper proposes a dynamic programming word segmentation algorithm based on domain dictionaries. Segmentation experiments on text from specific domains verify that the proposed algorithm achieves high segmentation accuracy and recall. The results show that, compared with the segmentation algorithm based on domain dictionaries alone, the dynamic programming algorithm based on domain dictionaries improves both accuracy and recall. Compared with the traditional smallseg and snailseg segmentation algorithms, it also improves both measures: recall rises by approximately 1% and accuracy by approximately 8%, which further indicates that the proposed algorithm has good domain adaptability.
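
The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general technique it names: dictionary-driven segmentation by dynamic programming, where merging a domain dictionary into a general one changes the best split. It is not the authors' implementation. The function names (`segment`, `precision_recall`), the unigram scoring, the `max_word_len` limit, the smoothing for unknown single characters, and the toy word counts are all assumptions made for illustration. The HMM pre-segmentation and the real-time new-word monitoring described in the abstract are not shown. The evaluation helper treats "accuracy" as span-level precision, which is also an assumption.

```python
import math


def segment(text, word_freq, max_word_len=8):
    """Maximum-probability segmentation over a unigram dictionary (illustrative)."""
    total = sum(word_freq.values())
    # Hypothetical smoothing: a single character missing from the dictionary
    # is still allowed as a word, with a small fixed probability.
    unknown_logp = math.log(1.0 / (total + 1))
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_freq:
                logp = math.log(word_freq[word] / total)
            elif end - start == 1:
                logp = unknown_logp
            else:
                continue  # multi-character strings must be dictionary words
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


def spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def precision_recall(predicted, gold):
    """Span-level precision and recall of a predicted segmentation."""
    p, g = spans(predicted), spans(gold)
    correct = len(p & g)
    return correct / len(p), correct / len(g)


# Toy counts, invented for this example only.
general = {"机器": 4, "学习": 4, "算法": 4}
domain = {"机器学习": 6}
merged = {**general, **domain}

print(segment("机器学习算法", general))  # ['机器', '学习', '算法']
print(segment("机器学习算法", merged))   # ['机器学习', '算法']
```

With the general dictionary alone, the domain term is split into two common words. Once the domain entry is merged in, the dynamic program prefers the single longer word because its unigram probability exceeds the product of the two shorter ones. Comparing either output against a reference segmentation with `precision_recall` gives the kind of precision and recall figures that the abstract reports.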
