Short Text Online Clustering Based on Incremental Robust Nonnegative Matrix Factorization
  • English title: Short Text Online Clustering Based on Incremental Robust Nonnegative Matrix Factorization
  • Authors: 贺超波; 汤庸; 张琼; 刘双印; 刘海
  • Authors (English): HE Chao-bo; TANG Yong; ZHANG Qiong; LIU Shuang-yin; LIU Hai
  • Keywords: short text clustering; robust nonnegative matrix factorization; online clustering; l_(2,1) norm; incremental iterative update rules
  • Journal (Chinese abbreviation): DZXU
  • Journal (English): Acta Electronica Sinica
  • Affiliations: School of Information Science and Technology, Zhongkai University of Agriculture and Engineering; School of Computer, South China Normal University; School of Data and Computer Science, Sun Yat-sen University
  • Publication date: 2019-05-15
  • Publisher: 电子学报 (Acta Electronica Sinica)
  • Year: 2019
  • Volume and serial number: v.47; No.435
  • Funding: National Natural Science Foundation of China (No.61772211); Science and Technology Planning Project of Guangdong Province (No.2017A040405057, No.2017A030303074, No.2016A030303058); Science and Technology Planning Project of Guangzhou (No.201807010043)
  • Language: Chinese
  • File ID: DZXU201905016
  • Page count: 8
  • Issue: 05
  • CN: 11-2087/TN
  • Pages: 112-119
Abstract
Clustering the large volume of short texts produced by social media has significant practical value. However, short texts tend to be noisy, fast-growing, and massive in volume, which makes them difficult for existing algorithms to handle effectively. To address this problem, we propose STOCIRNMF, a short text online clustering algorithm based on incremental robust nonnegative matrix factorization. STOCIRNMF builds its short text clustering model on nonnegative matrix factorization (NMF), formulates the objective function of the model with the l_(2,1) norm to improve robustness, and applies incremental iterative update rules to cluster short texts online. Experiments on Sohu news title and Weibo short text datasets show that STOCIRNMF not only achieves better clustering performance than representative existing algorithms, but can also effectively detect microblog topics online.
