Abstract
[Objective] This paper examines how effective features extracted from query expressions are for temporal query intent identification and compares the accuracy of two types of classification algorithms, as a reference for follow-up research. [Methods] Query-based features are grouped by their relevance to time into three types: atemporal, implicit temporal, and explicit temporal. On this basis, a supervised classifier and a semi-supervised classifier are each used to test the effectiveness of different feature combinations and the accuracy of the two algorithms. [Results] Among the three types of query-based features, using only explicit temporal features gives the highest average classification accuracy, and "whether the query contains a year" is a strong feature. Accuracy differs little across classifiers. The best macro-average accuracy, 81.14%, is higher than that of previous participants in the Temporal Query Intent Classification (TQIC) subtask evaluation under the same experimental setup. [Limitations] Because of limited access to datasets, the approach was validated on only 300 queries. Only existing query-based features were studied, and no new feature was proposed or tested. [Conclusions] Query-based features that are highly relevant to time improve the accuracy of temporal query intent classification, whereas statistical features such as query length bring no obvious improvement.
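To make the three feature types concrete, the following is a minimal sketch of how one feature of each type might be extracted from a query string. It is not the authors' implementation: the function name `extract_features`, the year regular expression, the cue-word list standing in for implicit temporal features, and the assignment of query length to the atemporal group are all assumptions made for illustration.

```python
import re

# Assumption: a "year" is a standalone four-digit number from 1000 to 2999.
YEAR_PATTERN = re.compile(r"\b[12]\d{3}\b")

# Assumption: a tiny hypothetical cue-word list stands in for implicit temporal features.
TEMPORAL_CUES = {"today", "tomorrow", "yesterday", "now", "latest", "history", "forecast"}


def extract_features(query: str) -> dict:
    """Return one illustrative feature per feature type for a single query."""
    tokens = query.lower().split()
    return {
        # Atemporal (statistical) feature: number of whitespace-separated terms.
        "query_length": len(tokens),
        # Implicit temporal feature: whether any cue word appears in the query.
        "has_temporal_cue": any(token in TEMPORAL_CUES for token in tokens),
        # Explicit temporal feature: whether the query contains a year.
        "contains_year": bool(YEAR_PATTERN.search(query)),
    }


if __name__ == "__main__":
    for q in ["olympics 2012 results", "weather forecast tomorrow"]:
        print(q, extract_features(q))
```

Feature dictionaries of this shape could then be vectorized and passed to a supervised or semi-supervised classifier; which features help most is the question the experiments summarized above address.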