Research on Key Issues in Text Sentiment Classification and Opinion Summarization
Abstract
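Human natural-language text carries two kinds of information: objective facts and information colored by people's subjective feelings. Text carrying such subjective information reflects people's attitudes, positions, and opinions toward a particular object. Text sentiment analysis takes subjective text as its object of study and aims to identify, classify, extract, and annotate the sentiments, opinions, and effects expressed in it.
     With the rapid development of the Internet, subjective reviews on social media such as forums, communities, blogs, and shopping sites keep multiplying, to the point of explosive growth. More and more people and organizations are used to searching online reviews to help them make decisions. However, the sheer volume of information on the Web forces them, after searching, to read, check, and judge a huge number of reviews one by one before they can form an overall judgment. If this mass of reviews could be summarized, the resulting opinion overview would be of great reference value to both consumers and manufacturers. This line of work is opinion-based multi-document summarization. Likewise, if the reviews could be analyzed automatically to determine which ones are positive toward the reviewed object, which are negative, and how strongly, users could obtain review information far more efficiently. This line of work is sentiment classification.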
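     This thesis studies two subtopics of text sentiment analysis, multi-document opinion summarization and sentiment classification. The main work covers the following three aspects:
     (1) An opinion-based multi-document summarization method is proposed.
     Most existing opinion-based multi-document summarization methods summarize by the reviewed feature/aspect and are known as feature/aspect-based opinion summarization. Such summaries depend heavily on accurately identifying opinion features and opinion words, yet in practice sentences often lack an explicitly stated feature or opinion word. Such sentences are easily overlooked in feature-based opinion mining, which degrades the quality of the summary generated afterwards. Accurately mining the features and opinion words in a sentence in turn requires domain knowledge, which creates domain dependence. Moreover, feature/aspect-based opinion summarization concentrates on the evaluation of each individual feature and cannot provide an overview that covers the main topics and basic opinions across all reviews.
     To address these problems, this thesis proposes a general, domain-independent multi-document opinion summarization method. It uses traditional extractive multi-document summarization and combines the probabilistic topic model LDA (Latent Dirichlet Allocation) with semantic orientation. The method first models the set of sentences from the documents with LDA to mine the latent topics in the collection, using Gibbs sampling to obtain the probability distribution of sentences over topics and of topics over words; at the same time it performs part-of-speech analysis on the sentences and computes the semantic orientation value of their words with WordNet and SentiWordNet. It then computes topic importance and word importance in turn and, on that basis, combines them with the words' semantic orientation to compute sentence importance. Finally, it extracts sentences in order of importance and removes redundant sentences according to topic, yielding an extractive summary. The method uses LDA to mine the important topics in the review text and uses semantic orientation to find the strongly subjective opinions on those topics. Experiments show that the summaries it produces are closer to expert summaries.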
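     (2) An ensemble-learning-based sentiment classification method for imbalanced datasets is proposed.
     Current research on binary sentiment classification concentrates on improving classification performance but overlooks a situation common in practice: one class has several times as many samples as the other, that is, the class "imbalance" problem. The vast majority of current sentiment classification research is conducted on balanced datasets, so classifiers that perform well on balanced data struggle to maintain that performance in real applications. Studying how to classify imbalanced sentiment data and improve performance on it is therefore very important, and it is a problem that must be solved before sentiment classification can truly be put into practice.
     To address this problem, this thesis proposes a sentiment classification method that combines techniques from imbalanced-data classification and ensemble learning. As a hybrid method, it works at both the algorithm level and the data level: within an ensemble learning framework, it combines under-sampling, Bootstrap resampling, and random feature selection to process the training set, so as to gain the advantages of all three. This yields several training subsets that differ in both sample space and feature space, hence diverse base classifiers, and ultimately improves the performance of the combined classifier. Experiments on "imbalanced" sentiment datasets show that the method significantly improves classification performance on such data.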
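     (3) Fine-grained sentiment classification is proposed, and the effect of text classification preprocessing techniques on sentiment classification is studied.
     Much sentiment classification research focuses on binary classification, assigning subjective text to a positive or negative class. In reality, subjective text does not always fall into just these two classes; for example, reviews in many online stores carry ratings from 1 to 5 stars, and dividing reviews into only positive and negative does not meet practical needs. This thesis therefore proposes a finer classification of subjective text, called fine-grained sentiment classification, which considers not only the positive or negative polarity of a review but also its strength level. The thesis also analyzes how fine-grained sentiment classification differs in essence from ordinary multi-class classification.
     Because sentiment classification and traditional topic-based classification have different goals, and in order to study fine-grained sentiment classification better, this thesis analyzes the factors that affect sentiment classification under supervised machine learning. It compares how the number of feature words, stop-word lists, text feature selection, feature weighting, and text classification methods perform on the specific problem of sentiment classification, and finds that text classification techniques behave differently with respect to stop words, classification method, and other factors when applied to sentiment classification rather than topic classification. Finally, for fine-grained sentiment classification of Chinese text, the thesis runs machine learning experiments on reviews of Chinese scientific papers. The experiments use the rating attached to each review as its class label, which solves the manual annotation problem. They show that fine-grained sentiment classification not only differs in essence from topic-based multi-class classification but is also harder than both traditional multi-class classification and binary sentiment classification.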
Human natural-language text contains two kinds of information: objective and subjective. Subjective information represents one's attitude, standpoint, and opinion toward a specific object. Text sentiment analysis focuses on subjective information and aims to recognize, classify, extract, and annotate the sentiment, opinion, and effect expressed in the content.
     With the rapid growth of Internet use, more and more subjective information appears on social media such as forums, communities, blogs, and shopping websites. Both individuals and organizations have come to rely heavily on review information from the Internet to make their decisions. However, because of the huge amount of information available, one has to search, check, and judge the reviews one by one before reaching a final decision. In this situation, a summary of the relevant information would be valuable to both customers and manufacturers. This kind of work is called opinion-based multi-document summarization. Furthermore, automatically analyzing the reviews, for example to determine which express a positive attitude, which a negative one, and to what extent, would greatly improve the efficiency with which customers obtain information. This is called sentiment classification.
     This thesis focuses on opinion-based multi-document summarization and sentiment classification, two fields in text sentiment analysis. It contains the following three parts:
     1) Developed a new method for opinion-based multi-document summarization
     Current opinion-based multi-document summarization is mainly based on the features or aspects of the review and is called feature/aspect-based opinion summarization. It depends largely on accurate recognition of opinion features and opinion words; in reality, however, the opinion feature or opinion word often does not appear explicitly in the sentence. Feature/aspect-based opinion mining therefore misses opinions that are only implied, which affects the performance of the subsequent summarization. Accurately recognizing features/aspects also requires domain knowledge, which makes the approach domain dependent. Furthermore, the feature/aspect-based method focuses mainly on recognizing and evaluating each feature, so it cannot provide summary information about the main topics and basic ideas that covers all the opinions.
     To overcome these problems, this thesis proposes a general, domain-independent multi-document opinion summarization method. The method uses traditional extractive summarization, combining Latent Dirichlet Allocation (LDA) and semantic orientation for multi-document summarization. First, it models the set of sentences from the documents with LDA to explore the latent topics, obtains the sentence-topic and topic-word distributions through Gibbs sampling, performs part-of-speech analysis, and computes the semantic orientation of words with WordNet and SentiWordNet. Second, it evaluates the importance of topics and then of words and, based on these results and the semantic orientation of words, evaluates the importance of each sentence. Finally, it ranks the sentences by importance and obtains the extractive summary after removing redundancy according to the topics. The method thus identifies the important topics in the opinion text with the LDA model and the strongly subjective opinions on those topics with semantic orientation. Experimental results indicate that summaries produced by the new method are closer to expert summaries.
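The ranking stage described above can be illustrated with a small sketch. The abstract does not give the scoring formulas, so everything here is an assumption made for illustration: topic importance is taken as the mean topic probability over sentences, word importance as the topic-importance-weighted topic-word probability, sentence importance as the mean word importance boosted by the absolute semantic orientation, and redundancy removal as keeping at most one sentence per dominant topic. The inputs `theta` (sentence-topic), `phi` (topic-word), and `orientation` are hypothetical stand-ins for the output of an LDA Gibbs sampler and a SentiWordNet lookup.

```python
from collections import defaultdict

def topic_importance(theta):
    """Assumed formula: average each topic's probability over all sentences."""
    n_topics = len(theta[0])
    return [sum(row[k] for row in theta) / len(theta) for k in range(n_topics)]

def word_importance(phi, topic_imp):
    """Assumed formula: weight each word's per-topic probability by topic importance."""
    scores = defaultdict(float)
    for k, word_probs in enumerate(phi):
        for word, p in word_probs.items():
            scores[word] += topic_imp[k] * p
    return dict(scores)

def sentence_importance(tokens, word_imp, orientation):
    """Assumed formula: mean word importance, boosted by |semantic orientation|."""
    if not tokens:
        return 0.0
    total = sum(word_imp.get(w, 0.0) * (1.0 + abs(orientation.get(w, 0.0)))
                for w in tokens)
    return total / len(tokens)

def summarize(sentences, theta, phi, orientation, max_sentences=2):
    """Rank sentences, then keep at most one sentence per dominant topic."""
    topic_imp = topic_importance(theta)
    word_imp = word_importance(phi, topic_imp)
    scored = []
    for i, tokens in enumerate(sentences):
        dominant = max(range(len(theta[i])), key=lambda k: theta[i][k])
        scored.append((sentence_importance(tokens, word_imp, orientation), i, dominant))
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest score first
    chosen, used_topics = [], set()
    for _, i, dominant in scored:
        if dominant in used_topics:
            continue  # treated as redundant: its topic is already covered
        chosen.append(i)
        used_topics.add(dominant)
        if len(chosen) == max_sentences:
            break
    return sorted(chosen)

# Toy, hand-made inputs (not real LDA or SentiWordNet output).
sentences = [["battery", "great"], ["battery", "lasts"],
             ["screen", "terrible"], ["screen", "size"]]
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
phi = [{"battery": 0.5, "great": 0.3, "lasts": 0.2},
       {"screen": 0.5, "terrible": 0.3, "size": 0.2}]
orientation = {"great": 0.8, "terrible": -0.9}

picked = summarize(sentences, theta, phi, orientation)
print(picked)  # [0, 2]: the opinionated sentence from each topic
```

In this toy run the two sentences carrying sentiment words outrank the neutral sentences on the same topics, which is the behaviour the abstract attributes to combining topic importance with semantic orientation.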
     2) Developed a new ensemble-learning-based method for sentiment classification of unbalanced data
     Current research on binary sentiment classification focuses on improving classification performance, while unbalanced data, in which one category has several times as many samples as the other, is neglected. Most studies of sentiment classification use balanced data, so the resulting methods perform well on balanced data but cannot maintain the same performance in practical applications. It is therefore imperative to develop methods that handle unbalanced data and improve the performance of sentiment classification in practice.
     To this end, this thesis proposes a new sentiment classification method that combines unbalanced-data classification with ensemble learning. As a hybrid method, it works at both the algorithm level and the data level. Within the ensemble learning framework, it integrates three methods, under-sampling, Bootstrap re-sampling, and random feature selection, to process the training set. It thereby combines the advantages of the three methods to obtain training subsets that are diverse in both sample space and feature space, which leads to more diverse base classifiers and, in the end, a better ensemble classifier. Experiments on unbalanced sentiment data show that the approach can significantly improve classification performance.
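As a rough illustration of how the three techniques can be chained inside one ensemble, the sketch below builds each training subset by under-sampling every class to the minority size, Bootstrap-resampling the balanced set, and drawing a random feature subset, then combines the base classifiers by majority vote. The base learner (a tiny multinomial Naive Bayes), the voting rule, and the parameters `n_models` and `feature_frac` are assumptions chosen for the example; the abstract does not specify them.

```python
import math
import random
from collections import Counter

def train_nb(docs, labels, features):
    """Multinomial Naive Bayes counts, restricted to the chosen feature subset."""
    counts = {c: Counter() for c in set(labels)}
    priors = Counter(labels)
    for tokens, c in zip(docs, labels):
        counts[c].update(t for t in tokens if t in features)
    return counts, priors, features

def predict_nb(model, tokens):
    """Pick the class with the highest add-one-smoothed log posterior."""
    counts, priors, features = model
    total_docs = sum(priors.values())
    best, best_score = None, -math.inf
    for c in sorted(counts):
        denom = sum(counts[c].values()) + len(features)
        score = math.log(priors[c] / total_docs)
        score += sum(math.log((counts[c][t] + 1) / denom)
                     for t in tokens if t in features)
        if score > best_score:
            best, best_score = c, score
    return best

def train_ensemble(docs, labels, n_models=15, feature_frac=0.7, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    minority_size = min(len(idx) for idx in by_class.values())
    vocab = sorted({t for tokens in docs for t in tokens})
    n_feat = max(1, int(feature_frac * len(vocab)))
    models = []
    for _ in range(n_models):
        # 1) Under-sampling: cut every class down to the minority-class size.
        balanced = [i for c in sorted(by_class)
                    for i in rng.sample(by_class[c], minority_size)]
        # 2) Bootstrap: resample the balanced set with replacement.
        boot = [rng.choice(balanced) for _ in balanced]
        # 3) Random feature selection: each model sees its own feature subset.
        features = set(rng.sample(vocab, n_feat))
        models.append(train_nb([docs[i] for i in boot],
                               [labels[i] for i in boot], features))
    return models

def predict_ensemble(models, tokens):
    """Majority vote over the base classifiers."""
    votes = Counter(predict_nb(m, tokens) for m in models)
    return votes.most_common(1)[0][0]

# Toy unbalanced data: 9 positive reviews against 3 negative ones.
docs = [["good", "great", "love", "excellent"]] * 9 + [["bad", "awful", "hate", "poor"]] * 3
labels = ["pos"] * 9 + ["neg"] * 3
models = train_ensemble(docs, labels)
prediction = predict_ensemble(models, ["bad", "awful", "hate", "poor"])
```

Because every base classifier is trained on a different sample and a different feature subset, the members disagree on borderline inputs, and that diversity is what the vote exploits.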
     3) Developed fine-grained sentiment classification and analyzed the effect of text preprocessing on sentiment classification
     Most studies of sentiment classification focus on binary classification, which categorizes subjective text as positive or negative. In reality, however, text with subjective information cannot always be classified simply as positive or negative. For example, reviews on many shopping websites carry ratings from 1 star to 5 stars. In this case, classifying them only as positive or negative cannot meet practical needs. To solve this problem, this thesis proposes fine-grained sentiment classification, which considers not only the positive or negative polarity of the review text but also its rating strength. The thesis further analyzes the essential difference between fine-grained sentiment classification and traditional multi-class categorization.
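The abstract does not spell out what the essential difference is. One common reading, assumed here rather than taken from the thesis, is that star ratings are ordered, so a prediction that is off by one star is a smaller error than one that is off by four, whereas ordinary multi-class accuracy treats all errors alike. The sketch below contrasts accuracy with mean absolute error on made-up ratings.

```python
def accuracy(y_true, y_pred):
    """Share of exact matches; ignores how far off a wrong prediction is."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average distance in stars; only meaningful because the classes are ordered."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up star ratings and two hypothetical classifiers' predictions.
y_true = [1, 2, 3, 4, 5, 5]
near_miss = [2, 2, 3, 5, 5, 4]   # wrong answers are one star off
far_miss = [5, 2, 3, 1, 5, 1]    # wrong answers flip the polarity

print(accuracy(y_true, near_miss), accuracy(y_true, far_miss))  # 0.5 0.5
print(mean_absolute_error(y_true, near_miss))                   # 0.5
print(mean_absolute_error(y_true, far_miss) > 1.8)              # True
```

Both classifiers have the same accuracy, yet only the second confuses strongly negative reviews with strongly positive ones, a distinction that an unordered multi-class view cannot express.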
     Considering the difference between sentiment classification and traditional topic-based categorization, and to better study fine-grained sentiment classification, this thesis uses supervised machine learning to analyze the factors that affect sentiment classification. Specifically, it compares the performance of combinations of the number of features, stop-word lists, text feature selection, feature weight computation, and text categorization methods on sentiment classification. These studies indicate that stop-word lists and feature selection behave differently in sentiment classification than in topic-based classification. Finally, to study fine-grained sentiment classification of Chinese text, the thesis runs machine learning experiments on reviews of Chinese scientific literature. In the experiments, using the rating attached to each review as its category label solves the problem of manual annotation. The experiments show that fine-grained sentiment classification is not only different from topic-based multi-class categorization but also more difficult than traditional multi-class categorization and binary sentiment classification.
