Single-Pass Short Text Clustering Based on Context Similarity Matrix



English title: Single-Pass Short Text Clustering Based on Context Similarity Matrix
Authors: HUANG Jian-yi (黄建一); LI Jian-jiang (李建江); WANG Zheng (王铮); FANG Ming-zhe (方明哲)
Affiliation: School of Computer and Communication Engineering, University of Science and Technology Beijing
Keywords: Online social network; Short text sequence; Text clustering; Distributed processing
Journal code: JSJA
Journal: Computer Science
Publication date: 2019-04-15
Publisher: Computer Science (计算机科学)
Year: 2019
Issue: v.46
Funding: National Key Research and Development Program of China (2017YFB0803302); Fundamental Research Funds for the Central Universities (06116104)
Language: Chinese
Pages: JSJA201904008
Page count: 7
CN: 04
ISSN: 50-1075/TP
Classification number: 56-62

Abstract

Online social networks have become an important channel and carrier for information exchange, forming a virtual society that interacts with the real world. Many network events spread rapidly through social networks and can become focal points of public opinion within a short time, while negative events can threaten national security and social stability and trigger a series of social problems. Mining the hotspot information contained in social networks is therefore of great significance for both public opinion supervision and early warning. Text clustering is an important method for mining hotspot information. However, when traditional long-text clustering algorithms are applied to massive collections of short texts, their accuracy drops and their complexity rises sharply, which makes them too slow; existing short-text clustering algorithms also suffer from low accuracy and long running times. Based on text keywords, this paper proposes an association model that combines context with a similarity matrix to determine the relevance between the current text and the previous text. The keyword weights of each text are then adjusted according to the association model to further reduce noise. Finally, a distributed short-text clustering algorithm is implemented on the Hadoop platform. Comparative experiments against the K-MEANS, SP-NN and SP-WC algorithms verify that the proposed algorithm performs better in terms of topic-mining speed, accuracy and recall.
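For readers unfamiliar with the Single-Pass scheme named in the title, the following is a minimal Python sketch of generic single-pass clustering over keyword vectors. It is not the authors' implementation: the abstract does not specify the association model, the similarity matrix, or the weight adjustment, so the `threshold` and `context_boost` parameters and the rule that favours the previous text's cluster are assumptions made purely for illustration. The Hadoop distribution step is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse keyword-weight vectors (dicts)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(texts, threshold=0.3, context_boost=0.1):
    """Assign each text, in arrival order, to its most similar cluster.

    texts: iterable of keyword lists, one list per short text.
    threshold, context_boost: illustrative values, not taken from the paper.
    Returns a list of cluster indices, one per text.
    """
    centroids = []   # one Counter of accumulated keyword counts per cluster
    labels = []
    previous = None  # cluster index of the preceding text (the "context")
    for keywords in texts:
        vec = Counter(keywords)
        best, best_sim = None, 0.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            # Assumed context rule: favour the cluster of the previous text.
            if idx == previous and sim > 0.0:
                sim += context_boost
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None or best_sim < threshold:
            centroids.append(vec)          # no close match: open a new cluster
            best = len(centroids) - 1
        else:
            centroids[best].update(vec)    # fold the text into the cluster
        labels.append(best)
        previous = best
    return labels


stream = [
    ["flood", "rescue"],
    ["flood", "city"],
    ["election", "vote"],
    ["vote", "result"],
]
print(single_pass(stream))
```

Each text is compared only against the current cluster centroids and is never revisited, which is what makes the scheme a single pass and suitable for streams of short texts. The result depends on arrival order and on the chosen threshold.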

