Quality Control Agent Based on Probability Inference
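  • English title: Quality Control Agent Based on Probability Inference
  • Authors: XU Yao-li (徐耀丽); LI Zhan-huai (李战怀)
  • Affiliations: School of Computer Science and Engineering, Northwestern Polytechnical University; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology
  • Keywords: Quality control; Entity resolution; Inconsistency reconciliation; Agent; Precision
  • Chinese journal name: JSJA
  • English journal name: Computer Science
  • Publication date: 2019-04-15
  • Publisher: Computer Science (计算机科学)
  • Year: 2019
  • Issue: v.46
  • Funding: National Key Research and Development Program of China, Ministry of Science and Technology (2016YFB1000703); Key Program of the National Natural Science Foundation of China (61732014, 61332006); General Program of the National Natural Science Foundation of China (61472321, 61672432); Young Scientists Fund of the National Natural Science Foundation of China (61502390); Natural Science Basic Research Plan of Shaanxi Province (2018JM6086); Fundamental Research Funds for the Central Universities, Northwestern Polytechnical University (3102017jg02002)
  • Language: Chinese
  • Page: JSJA201904002
  • Page count: 6
  • CN: 04
  • ISSN: 50-1075/TP
  • Classification number: 14-19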
Abstract
Entity resolution (ER) is a fundamental problem in data integration and data cleaning. Inconsistency reconciliation (IR) further improves resolution quality by reconciling the record pairs on which existing ER algorithms disagree. Existing IR methods, however, share one limitation: they give no quality guarantee for the reconciliation result. To address this, the paper proposes QCAgent, which it describes as the first quality control agent based on probability inference. QCAgent needs no training data (no manually labeled pairs) and outputs the reconciliation result with the highest recall among those that satisfy a given precision threshold. Its core idea is an iterative loop. First, an outlier detection model estimates the match probability of each inconsistent pair; precision and recall are estimated from these probabilities and serve as the feedback from the environment. Second, a binary search selects the flipping solution that meets the precision requirement with the highest recall, and this becomes QCAgent's next action. Third, the outlier detection model is retrained on the updated consistent pairs, and precision and recall are estimated again. The loop stops once the newly estimated precision satisfies the constraint. Experiments on real datasets show that QCAgent effectively solves the quality control problem for reconciliation results.
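The precision-constrained selection step described in the abstract can be illustrated with a minimal sketch. It assumes every candidate pair already carries an estimated match probability (in the paper these come from the outlier detection model; here they are plain numbers), and it simplifies the "flipping solution" to a cutoff: the k most probable pairs are labeled as matches and the rest as non-matches. The function name `best_cutoff` and this top-k formulation are hypothetical and are not taken from the paper. Expected precision is the mean probability of the selected pairs, and expected recall is their probability mass divided by the total mass. Because the pairs are sorted by descending probability, expected precision never increases with k while expected recall never decreases, so a binary search over k finds the highest-recall cutoff that still meets the threshold.

```python
from itertools import accumulate
from typing import List, Tuple


def best_cutoff(probs: List[float], min_precision: float) -> Tuple[int, float, float]:
    """Return (k, expected_precision, expected_recall) for the largest k such that
    labeling the k most probable pairs as matches keeps expected precision
    at or above min_precision.

    Illustrative assumption: probs[i] is the estimated probability that pair i
    is a true match. An empty selection is given precision 1.0 and recall 0.0.
    """
    probs_desc = sorted(probs, reverse=True)
    # prefix[k] = expected number of true matches among the top-k pairs
    prefix = [0.0] + list(accumulate(probs_desc))
    total = prefix[-1]

    # The mean of a descending-sorted prefix is non-increasing in k, so the
    # predicate "expected precision at k >= min_precision" is monotone in k.
    lo, hi = 0, len(probs_desc)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # upper middle, always >= 1 inside the loop
        if prefix[mid] / mid >= min_precision:
            lo = mid       # top-mid pairs still meet the constraint
        else:
            hi = mid - 1   # too many low-probability pairs included

    k = lo
    precision = prefix[k] / k if k else 1.0
    recall = prefix[k] / total if total > 0 else 0.0
    return k, precision, recall


if __name__ == "__main__":
    # Hypothetical match probabilities for six inconsistent pairs
    example = [0.9, 0.2, 0.8, 0.6, 0.95, 0.4]
    print(best_cutoff(example, min_precision=0.8))
```

In the iterative scheme the abstract outlines, a step like this would be repeated after each retraining of the probability model. The sketch covers a single selection only and does not model the retraining or the stopping test.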
