Research on a Dynamically Adaptive Topical Crawler (可动态自适应主题爬虫的研究)
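  • English title: Research and Implementation of Dynamic Adaptive Topical Crawler
  • Authors: XIAO Xinfeng (肖新凤); YU Wei (余伟); LI Shijun (李石君); CHEN Yahui (陈亚辉); LIU Beixiong (刘倍雄); LIU Yongming (刘永明)
  • Author affiliations (English): Guangdong Polytechnic of Environmental Protection Engineering; Wuhan University
  • Keywords: topical crawler; dynamic self-adaptation; URL graph structure
  • English keywords: topic crawler; dynamic self-adaption; URL structure
  • Chinese journal title: JSSG
  • English journal title: Computer & Digital Engineering
  • Institutions: Guangdong Polytechnic of Environmental Protection Engineering (广东环境保护工程职业学院); Wuhan University (武汉大学)
  • Publication date: 2019-05-20
  • Publisher: Computer & Digital Engineering (计算机与数字工程)
  • Year: 2019
  • Volume and cumulative issue: v.47; No.355
  • Funding: National Natural Science Foundation of China (No. 61502350); 2017 Guangdong Provincial Key Platforms and Major Research Projects for Universities (No. 2017GKTSCX042)
  • Language: Chinese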
  • Article ID: JSSG201905027
  • Page count: 9
  • Issue: 05
  • CN: 42-1372/TP
  • Pages: 142-150
Abstract
Traditional topical crawlers struggle with a dynamically changing Internet: their topical knowledge is incomplete, domain knowledge goes out of date, and the hubs of topical resources shift over time. This paper proposes a topical crawler that adapts dynamically to Internet information. Its TopicHub algorithm selects seed URLs dynamically; compared with a traditional topical crawler that uses static seed URLs, crawling efficiency improves by more than 7% and recall by more than 5%. To address the incomplete topic coverage and outdated domain knowledge of a static ontology library, the paper also proposes SDTP, a topic algorithm that combines a static ontology library with dynamic semantics and can expand domain semantic information dynamically. Compared with the traditional algorithm based on a static ontology library, SDTP improves precision by 13%; compared with the algorithm based on the vector space model (VSM), it improves precision by 4%.
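For readers unfamiliar with the VSM baseline named in the abstract, the following is a minimal sketch of vector-space topic relevance: a page and a topic description are turned into term-frequency vectors and compared by cosine similarity. It is not the paper's TopicHub or SDTP algorithm, whose details this record does not give. The function names, the whitespace tokenizer, and the 0.3 threshold are all illustrative assumptions; Chinese text would need a proper word segmenter instead of `split()`.

```python
import math
from collections import Counter


def term_vector(text):
    """Map lowercased whitespace tokens to raw term frequencies (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


def is_on_topic(page_text, topic_text, threshold=0.3):
    """Hypothetical relevance test: the threshold value is an assumption."""
    score = cosine_similarity(term_vector(page_text), term_vector(topic_text))
    return score >= threshold
```

A crawler built on this baseline would keep a fetched page, and follow its links, only when `is_on_topic` returns true for the page text against the topic description.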
References
[1] Gupta A, Anand P. Focused web crawlers and its approaches[C]//IEEE, 2015.
[2] Chakrabarti S, Van den Berg M, Dom B. Focused crawling: a new approach to topic-specific Web resource discovery[J]. Computer Networks, 1999, 31(11-16): 1623-1640.
[3] Cho J, Garcia-Molina H, Page L. Efficient crawling through URL ordering[J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 161-172.
[4] Rawat S, Patil D R. Efficient focused crawling based on best first search[C]//IEEE, 2013.
[5] Hersovici M, Jacovi M, Maarek Y S, et al. The shark-search algorithm. An application: tailored Web site mapping[J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 317-326.
[6] Wu M, Wang Y, Jing W, et al. The study of Topic Crawler search strategy based on Adaptive Genetic Algorithm[J]. Journal of Residuals Science & Technology, 2016, 13(7).
[7] LI Donghui, LIAO Xiaolan, FAN Fuqiao, et al. A topical knowledge self-growth focused web crawler[J]. Computer Applications and Software, 2014(5): 29-33. (in Chinese)
[8] FU Xianghua, FENG Boqin, MA Zhaofeng, et al. Focused crawling method capable of online incremental self-learning[J]. Journal of Xi'an Jiaotong University, 2004, 38(6): 599-602. (in Chinese)
[9] Su C, Gao Y, Yang J, et al. An efficient adaptive focused crawler based on ontology learning[C]//IEEE, 2005.
[10] WU Yonghui, WANG Xiaolong, DING Yuxin, et al. Topic-based adaptive online hotspot discovery method and news recommendation system[J]. Acta Electronica Sinica, 2010, 38(11): 2620-2624. (in Chinese)
[11] Priyatam P N, Dubey A, Perumal K, et al. Seed selection for domain-specific search[C]//ACM, 2014.
[12] Zheng Z, Qian D. An improved focused crawler based on text keyword extraction[C]//IEEE, 2016.
[13] Seyfi A, Patel A. A focused crawler combinatory link and content model based on T-Graph principles[J]. Computer Standards & Interfaces, 2016, 43: 1-11.
[14] Kumar Sharma D, Khan M A. SAFSB: A self-adaptive focused crawler[C]//IEEE, 2015.
[15] Dong H, Hussain F K. Self-adaptive semantic focused crawler for mining services information discovery[J]. IEEE Transactions on Industrial Informatics, 2014, 10(2): 1616-1626.
[16] TANG Ming, ZHU Lei, ZOU Xianchun. A Document Vector Representation Based on Word2Vec[J]. Computer Science, 2016, 43(6): 214-217. (in Chinese)
[17] CAO Shanshan, WANG Chong. Research on PageRank algorithm improvement based on webpage link and user feedback[J]. Computer Science, 2014, 41(12): 179-182. (in Chinese)
[18] YU Jinping, ZHU Guixiang, MEI Hongbiao. Research and Improvement of HITS Algorithm Based on Web Link Analysis[J]. Computer Engineering and Applications, 2013, 49(21): 42-45. (in Chinese)
