用户名: 密码: 验证码:
一种带稀疏间隙约束的并行模式匹配算法
详细信息    查看全文 | 推荐本文 |
  • 英文篇名:Parallel Pattern Matching Algorithm with Sparse Gap Constraint
  • 作者:周开来 ; 陈红 ; 熊子绎 ; 李翠平 ; 孙辉
  • 英文作者:ZHOU Kai-Lai;CHEN Hong;XIONG Zi-Yi;LI Cui-Ping;SUN Hui;Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education (Renmin University of China);School of Information, Renmin University of China;School of Software, Zhengzhou University of Light Industry;
  • 关键词:模式匹配 ; 稀疏间隙约束 ; 后缀自动机 ; 并行算法 ; 通配符
  • 英文关键词:pattern matching;;sparse gaps constraint;;suffix automaton;;parallel algorithm;;wildcards
  • 中文刊名:RJXB
  • 英文刊名:Journal of Software
  • 机构:数据工程与知识工程国家教育部重点实验室(中国人民大学);中国人民大学信息学院;郑州轻工业大学软件学院;
  • 出版日期:2018-12-15
  • 出版单位:软件学报
  • 年:2018
  • 期:v.29
  • 基金:国家重点研发计划(2016YFB1000702);; 国家重点基础研究发展计划(973)(2014CB340402);; 国家高技术研究发展计划(863)(2014AA015204);; 国家自然科学基金(61272137,61202114,61532021);; 国家社会科学基金(12&ZD220);; 中国人民大学科学研究基金(中央高校基本科研业务费专项资金资助)项目成果(15XNLQ06)~~
  • 语种:中文;
  • 页:RJXB201812013
  • 页数:21
  • CN:12
  • ISSN:11-2560/TP
  • 分类号:209-229
摘要
带通配符的模式匹配是一个经典的研究问题,带有可变间隙约束的模式匹配是近年来比较热门的研究方向.为适应某些查询精度要求较高的应用领域,提出一种在稀疏间隙约束条件下求解模式匹配完备解的算法SGPM-SAI(pattern matching with sparse gaps constraint based on suffix automaton index).SGPM-SAI通过对文本串预处理,建立一种称为W-SAM的图索引结构,然后对模式串分段查找End Pos集合,最后以集合归并求交的方法得到模式匹配的完备解.实验结果表明:在不考虑预处理时间的情况下,相比几种最典型的模式匹配算法(KMP,BM,AC,suffixarray),SGPM-SAI算法性能优势显著,至少高出3~5倍.通过与SAIL算法的最新优化版本(SAIL-Gen)进行比较,在稀疏间隙约束条件下,SGPM-SAI的性能要显著优于SAIL-Gen算法.此外,为有效利用现代处理器的大规模并行处理单元,提出了并行优化后的算法Parallel SGPM-SAI.实验结果表明:Parallel SGPM-SAI算法的加速效果显著,且具有良好的并行可扩展性,能够充分利用现代众核处理器的高并行计算优势.
        Pattern matching with wildcards is a classic problem, and matching with variable gap constraints is a popular direction in this field in recent years. In order to meet the requirement of high accuracy in some query applications, this paper proposes an algorithm(referred to as SGPM-SAI) to obtain a complete solution of pattern matching under the condition of sparse gaps constraint. SGPM-SAI firstly creates an index structure called W-SAM(wildcard suffix automation) for the preprocessed text, and then get EndPos collection for each pattern segmentation by searching string from W-SAM, and finally get the complete solution of pattern matching by means of EndPos sets intersection. Experimental results show that, regardless of pretreatment time, the performance of SGPM-SAI algorithm is at least 3~5 times higher than other competitive algorithms, such as KMP, BM, AC, suffix array. Compared with the latest version(SAIL-Gen) of SAIL algorithm, the performance of SGPM-SAI is significantly better under the condition of sparse gaps constraint. In addition, this paper introduces parallel process methods for SGPM-SAI algorithm so as to effectively utilize the massive parallel processing units of modern processors. Experimental results show that the acceleration of Parallel SGPM-SAI algorithm has significant effect, as well as good parallel scalability. This indicates that the presented method can take full advantage of the high parallel computation capability of modern many-core processors.
引文
[1]Fischer MJ,Paterson MS.String-Matching and other products.In:Proc.of the 7th SIAM AMS Complexity of Computation.Cambridge,1974.113-125.
    [2]Qiang JP,Xie F,Gao J,et al.Pattern matching with arbitrary-length wildcards.Acta Automatica Sinica,2014,40(11):2499-2511(in Chinese with English abstract).[doi:10.3724/SP.J.1004.2014.02499]
    [3]Clifford P,Clifford R.Simple deterministic wildcard matching.Information Processing Letters,2007,101(2):53-54.[doi:10.1016/j.ipl.2006.08.002]
    [4]Cole R,Hariharan R.Verifying candidate matches in sparse and wildcard matching.In:Proc.of the Annual ACM Symp.on Theory of Computing.2002.592-601.[doi:10.1145/509907.509992]
    [5]Kalai A.Efficient pattern-matching with don’t cares.In:Proc.of the 13th Annual ACM-SIAM Symp.on Discrete Algorithms.Philadelphia,2002.655-656.
    [6]Wu XD,Zhu XQ,He Y,Arslan AN.PMBC:Pattern mining from biological sequences with wildcard constraints.Computers in Biology and Medicine,2013,43(5):481-492.[doi:10.1016/j.compbiomed.2013.02.006]
    [7]Sitaridi EA,Ross KA.GPU-Accelerated string matching for database applications.VLDB Journal,2015.1-22.[doi:10.1007/s00778-015-0409-y]
    [8]Xin C,Jia XF,Wu YX,et al.Strict pattern matching with general gaps and one-off condition.Ruan Jian Xue Bao/Journal of Software,2015,26(5):1096-1112(in Chinese with English abstract).http://www.jos.org.cn/1000-9825/4707.htm[doi:10.13328/j.cnki.jos.004707]
    [9]Chen G,Wu XD,Zhu XQ,Arslan AN,He Y.Efficient string matching with wildcards and length constraints.Knowledge and Information Systems,2006,10(4):399-419.[doi:10.1007/s10115-006-0016-8]
    [10]Wu YX,Liu YW,Guo L,Wu XD.Subnettrees for strict pattern matching with general gaps and length constraints.Ruan Jian Xue Bao/Journal of Software,2013,24(5):915-932(in Chinese with English abstract).http://www.jos.org.cn/1000-9825/4381.htm[doi:10.3724/SP.J.1001.2013.04381]
    [11]Wu XD,Xie F,Qiang JP.Pattern Matching with Wildcards and Length Constraints.Beijing:Science Press,2016.
    [12]Bille P,G¨ortz IL,Vildh′oj HW,Wind DK.String matching with variable length gaps.Theoretical Computer Science,2012,443(1):25-34.[doi:10.1016/j.tcs.2012.03.029]
    [13]Wu YX,Shen C,Jiang H.Strict pattern matching under non-overlapping condition.Science China Information Sciences,2017,60(1):1-16.[doi:10.1007/s11432-015-0935-3]
    [14]Ding BL,Lo D,Han JW,Khoo SC.Efficient mining of closed repetitive gapped subsequences from a sequence database.In:Proc.of the 25th Int’l Conf.on Data Engineering.Shanghai:IEEE,2009.1024-1035.[doi:10.1109/ICDE.2009.104]
    [15]Wang H,Wang HP,Wu XD.Models for pattern matching with wildcards and length constraints.Computer Science,2016,43(4):279-283(in Chinese with English abstract).
    [16]Manber U,Baeza-Yates R.An algorithm for string matching with a sequence of dont cares.Information Processing Letters,1991,37(3):133-136.[doi:10.1016/0020-0190(91)90032-D]
    [17]Navarro G,Raffinot M.Fast and simple character classes and bounded gaps pattern matching,with applications to protein searching.Journal of Computational Biology,2003,10(6):903-923.[doi:10.1089/106652703322756140]
    [18]Knuth DE,Jr Morris JH,Pratt VR.Fast pattern matching in strings.SIAM Journal on Computing,1977,6(1):323-350.[doi:10.1137/0206024]
    [19]Boyer RS,Moore JS.A fast string searching algorithm.Communications of the ACM,1977,20(10):762-772.[doi:10.1145/359842.359859]
    [20]Karp RM,Rabin MO.Efficient randomized pattern-matching algorithms.IBM Journal of Research and Development,1987,31(2):249-260.[doi:10.1147/rd.312.0249]
    [21]Aho AV,Corasick MJ.Efficient string matching:An aid to bibliographic search.Communications of the ACM,1975,18(6):333-340.[doi:10.1145/360825.360855]
    [22]Wu S.A fast algorithm for multi-pattern searching.Technical Report,Report TR-94-17,Tucson:Department of Computer Science,University of Arizona,1994.
    [23]Bellekens X,Atkinson R,Andonovic I,et al.Investigation of GPU-based pattern matching.In:Proc.of the Post Graduate Symp.on the Convergence of Telecommunications,Networking and Broadcasting.2013.[doi:10.6084/m9.figshare.3821922.v1]
    [24]Lin CH,Tsai SY,Liu CH,et al.Accelerating string matching using multi-threaded algorithm on GPU.In:Proc.of the IEEE Global Telecommunications Conf.IEEE,2010.1-5.[doi:10.1109/GLOCOM.2010.5683320]
    [25]Xu D,Zhang H,Fan Y.The GPU based high-performance pattern-matching algorithm for intrusion detection.Journal of Computational Information Systems,2013.3791-3800.[doi:10.12733/jcis5781]
    [26]Cole R,Gottlieb LA,Lewenstein M.Dictionary matching and indexing with errors and dont cares.In:Proc.of the 36th Annual ACM Symp.on Theory of Computing.New York:ACM Press,2004.91-100.[doi:10.1145/1007352.1007374]
    [27]Karkkainen J,Sanders P.Simple linear work suffix array construction.In:Proc.of the Int’l Colloquium on Automata Languages and Programming.2003.943-955.[doi:10.1007/3-540-45061-0_73]
    [28]Ko P,Aluru S.Space efficient linear time construction of suffix arrays.In:Proc.of the Combinatorial Pattern Matching.2003.200-210.[doi:10.1007/3-540-44888-8_15]
    [29]Khancome C,Boonjing V.New Hashing based multiple string pattern matching algorithms.In:Proc.of the Int’l Conf.on Information Technology:New Generations.2012.195-200.[doi:10.1109/ITNG.2012.34]
    [30]Blumer A,Blumer J,Haussler D,et al.The smallest automaton recognizing the subwords of a text.Theoretical Computer Science,1985,40(1):31-55.[doi:10.1016/0304-3975(85)90157-4]
    [31]Crochemore M.Transducers and repetitions.Theoretical Computer Science,1986,45(1):63-86.[doi:10.1016/0304-3975(86)90041-1]
    [32]Inenaga S,Takeda M,Shinohara A,et al.The minimum DAWG for all suffixes of a string and its applications.LNCS,2002.153-167.[doi:10.1007/3-540-45452-7_14]
    [33]Inenaga S,Bannai H,Shinohara A,et al.Discovering best variable-length-don’t-care patterns.In:Proc.of the Discovery Science.2002.86-97.[doi:10.1007/3-540-36182-0_10]
    [34]Lothaire M.Algebraic Combinatorics on Words.Cambridge University Press,2002.
    [2]强继朋,谢飞,高隽,等.带任意长度通配符的模式匹配.自动化学报,2014,40(11):2499-2511.[doi:10.3724/SP.J.1004.2014.02499]
    [8]柴欣,贾晓菲,武优西,等.一般间隙及一次性条件的严格模式匹配.软件学报,2015,26(5):1096-1112.http://www.jos.org.cn/1000-9825/4707.htm[doi:10.13328/j.cnki.jos.004707]
    [10]武优西,刘亚伟,郭磊,吴信东.子网树求解一般间隙和长度约束严格模式匹配.软件学报,2013,24(5):915-932.http://www.jos.org.cn/1000-9825/4381.htm[doi:10.3724/SP.J.1001.2013.04381]
    [15]汪浩,王海平,吴信东.带有通配符和长度约束的模式匹配问题求解模型.计算机科学,2016,43(4):279-283.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700