A probabilistic ranking framework for web-based relational data imputation

Authors: Zhaoqiang Chen^a (chenzhaoqiang@mail.nwpu.edu.cn) ; Qun Chen^a (chenbenben@nwpu.edu.cn) ; Jiajun Li^a ; Zhanhuai Li^a ; Lei Chen^b
Keywords: Web-based relational data imputation ; Missing attribute values ; Probabilistic ranking
Journal: Information Sciences
Publication year: 2016
Publication date: 10 August 2016
Volume: 355-356
Issue: Complete
Pages: 152-168
Full-text size: 1575 K

Abstract

Due to the richness of information on the web, there is increasing interest in searching the web for missing attribute values in relational data. Web-based relational data imputation must first extract multiple candidate values from the web and then rank them by their matching probabilities. However, effective candidate ranking remains challenging because web documents are unstructured and popular search engines return information that is relevant but not necessarily a semantic match.

In this paper, we propose a novel probabilistic approach for ranking web-retrieved candidate values. It integrates various influence factors, e.g., snippet rank order, occurrence frequency, occurrence pattern, and keyword proximity, in a single framework by semantic reasoning. The proposed framework consists of a snippet influence model and a semantic matching model. The snippet influence model measures the influence of a snippet, and the semantic matching model measures the semantic similarity between a candidate value in a snippet and a missing relational value in a tuple. We also present effective probabilistic estimation solutions for both models. Finally, we empirically evaluate the performance of the proposed framework on real datasets. Our extensive experiments demonstrate that it outperforms state-of-the-art techniques by considerable margins in imputation accuracy.
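
The two-model structure described in the abstract can be illustrated with a small sketch. It is not the estimation procedure from the paper, which the abstract does not specify. Everything here is an assumption made for illustration: the `Occurrence` record, the rank-decay form of `snippet_influence`, the distance-decay form of `semantic_match`, and the noisy-OR rule that combines evidence across snippets. Occurrence pattern is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Occurrence:
    """One appearance of a candidate value in a search-result snippet (hypothetical record)."""
    candidate: str         # candidate value extracted from the snippet
    snippet_rank: int      # 1-based position of the snippet in the result list
    keyword_distance: int  # tokens between the candidate and the nearest query keyword


def snippet_influence(rank: int) -> float:
    # Assumed stand-in for a snippet influence model:
    # higher-ranked snippets carry more weight.
    return 1.0 / (1.0 + rank)


def semantic_match(distance: int) -> float:
    # Assumed stand-in for a semantic matching model:
    # candidates closer to the query keywords are more likely to match.
    return 1.0 / (1.0 + distance)


def rank_candidates(occurrences):
    """Return (candidate, score) pairs sorted by descending score.

    Evidence from several snippets is combined with a noisy-OR, so a
    candidate that occurs more often accumulates a higher score.
    """
    miss = defaultdict(lambda: 1.0)  # probability that no occurrence supports the candidate
    for occ in occurrences:
        p = snippet_influence(occ.snippet_rank) * semantic_match(occ.keyword_distance)
        miss[occ.candidate] *= 1.0 - p
    scores = {cand: 1.0 - m for cand, m in miss.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


example = [
    Occurrence("A", snippet_rank=1, keyword_distance=1),
    Occurrence("A", snippet_rank=3, keyword_distance=0),
    Occurrence("B", snippet_rank=2, keyword_distance=0),
]
ranked = rank_candidates(example)
```

In this toy input, candidate "A" appears in two snippets and "B" in one, so the noisy-OR places "A" first even though neither occurrence of "A" is individually stronger than the single occurrence of "B". This shows how rank order, frequency, and proximity can interact in a single score.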
