摘要
带通配符的模式匹配是一个经典的研究问题,带有可变间隙约束的模式匹配是近年来比较热门的研究方向.为适应某些查询精度要求较高的应用领域,提出一种在稀疏间隙约束条件下求解模式匹配完备解的算法SGPM-SAI(pattern matching with sparse gaps constraint based on suffix automaton index).SGPM-SAI通过对文本串预处理,建立一种称为W-SAM的图索引结构,然后对模式串分段查找End Pos集合,最后以集合归并求交的方法得到模式匹配的完备解.实验结果表明:在不考虑预处理时间的情况下,相比几种最典型的模式匹配算法(KMP,BM,AC,suffixarray),SGPM-SAI算法性能优势显著,至少高出3~5倍.通过与SAIL算法的最新优化版本(SAIL-Gen)进行比较,在稀疏间隙约束条件下,SGPM-SAI的性能要显著优于SAIL-Gen算法.此外,为有效利用现代处理器的大规模并行处理单元,提出了并行优化后的算法Parallel SGPM-SAI.实验结果表明:Parallel SGPM-SAI算法的加速效果显著,且具有良好的并行可扩展性,能够充分利用现代众核处理器的高并行计算优势.
Pattern matching with wildcards is a classic problem, and matching with variable gap constraints is a popular direction in this field in recent years. In order to meet the requirement of high accuracy in some query applications, this paper proposes an algorithm(referred to as SGPM-SAI) to obtain a complete solution of pattern matching under the condition of sparse gaps constraint. SGPM-SAI firstly creates an index structure called W-SAM(wildcard suffix automation) for the preprocessed text, and then get EndPos collection for each pattern segmentation by searching string from W-SAM, and finally get the complete solution of pattern matching by means of EndPos sets intersection. Experimental results show that, regardless of pretreatment time, the performance of SGPM-SAI algorithm is at least 3~5 times higher than other competitive algorithms, such as KMP, BM, AC, suffix array. Compared with the latest version(SAIL-Gen) of SAIL algorithm, the performance of SGPM-SAI is significantly better under the condition of sparse gaps constraint. In addition, this paper introduces parallel process methods for SGPM-SAI algorithm so as to effectively utilize the massive parallel processing units of modern processors. Experimental results show that the acceleration of Parallel SGPM-SAI algorithm has significant effect, as well as good parallel scalability. This indicates that the presented method can take full advantage of the high parallel computation capability of modern many-core processors.
引文
[1]Fischer MJ,Paterson MS.String-Matching and other products.In:Proc.of the 7th SIAM AMS Complexity of Computation.Cambridge,1974.113-125.
[2]Qiang JP,Xie F,Gao J,et al.Pattern matching with arbitrary-length wildcards.Acta Automatica Sinica,2014,40(11):2499-2511(in Chinese with English abstract).[doi:10.3724/SP.J.1004.2014.02499]
[3]Clifford P,Clifford R.Simple deterministic wildcard matching.Information Processing Letters,2007,101(2):53-54.[doi:10.1016/j.ipl.2006.08.002]
[4]Cole R,Hariharan R.Verifying candidate matches in sparse and wildcard matching.In:Proc.of the Annual ACM Symp.on Theory of Computing.2002.592-601.[doi:10.1145/509907.509992]
[5]Kalai A.Efficient pattern-matching with don’t cares.In:Proc.of the 13th Annual ACM-SIAM Symp.on Discrete Algorithms.Philadelphia,2002.655-656.
[6]Wu XD,Zhu XQ,He Y,Arslan AN.PMBC:Pattern mining from biological sequences with wildcard constraints.Computers in Biology and Medicine,2013,43(5):481-492.[doi:10.1016/j.compbiomed.2013.02.006]
[7]Sitaridi EA,Ross KA.GPU-Accelerated string matching for database applications.VLDB Journal,2015.1-22.[doi:10.1007/s00778-015-0409-y]
[8]Xin C,Jia XF,Wu YX,et al.Strict pattern matching with general gaps and one-off condition.Ruan Jian Xue Bao/Journal of Software,2015,26(5):1096-1112(in Chinese with English abstract).http://www.jos.org.cn/1000-9825/4707.htm[doi:10.13328/j.cnki.jos.004707]
[9]Chen G,Wu XD,Zhu XQ,Arslan AN,He Y.Efficient string matching with wildcards and length constraints.Knowledge and Information Systems,2006,10(4):399-419.[doi:10.1007/s10115-006-0016-8]
[10]Wu YX,Liu YW,Guo L,Wu XD.Subnettrees for strict pattern matching with general gaps and length constraints.Ruan Jian Xue Bao/Journal of Software,2013,24(5):915-932(in Chinese with English abstract).http://www.jos.org.cn/1000-9825/4381.htm[doi:10.3724/SP.J.1001.2013.04381]
[11]Wu XD,Xie F,Qiang JP.Pattern Matching with Wildcards and Length Constraints.Beijing:Science Press,2016.
[12]Bille P,G¨ortz IL,Vildh′oj HW,Wind DK.String matching with variable length gaps.Theoretical Computer Science,2012,443(1):25-34.[doi:10.1016/j.tcs.2012.03.029]
[13]Wu YX,Shen C,Jiang H.Strict pattern matching under non-overlapping condition.Science China Information Sciences,2017,60(1):1-16.[doi:10.1007/s11432-015-0935-3]
[14]Ding BL,Lo D,Han JW,Khoo SC.Efficient mining of closed repetitive gapped subsequences from a sequence database.In:Proc.of the 25th Int’l Conf.on Data Engineering.Shanghai:IEEE,2009.1024-1035.[doi:10.1109/ICDE.2009.104]
[15]Wang H,Wang HP,Wu XD.Models for pattern matching with wildcards and length constraints.Computer Science,2016,43(4):279-283(in Chinese with English abstract).
[16]Manber U,Baeza-Yates R.An algorithm for string matching with a sequence of dont cares.Information Processing Letters,1991,37(3):133-136.[doi:10.1016/0020-0190(91)90032-D]
[17]Navarro G,Raffinot M.Fast and simple character classes and bounded gaps pattern matching,with applications to protein searching.Journal of Computational Biology,2003,10(6):903-923.[doi:10.1089/106652703322756140]
[18]Knuth DE,Jr Morris JH,Pratt VR.Fast pattern matching in strings.SIAM Journal on Computing,1977,6(1):323-350.[doi:10.1137/0206024]
[19]Boyer RS,Moore JS.A fast string searching algorithm.Communications of the ACM,1977,20(10):762-772.[doi:10.1145/359842.359859]
[20]Karp RM,Rabin MO.Efficient randomized pattern-matching algorithms.IBM Journal of Research and Development,1987,31(2):249-260.[doi:10.1147/rd.312.0249]
[21]Aho AV,Corasick MJ.Efficient string matching:An aid to bibliographic search.Communications of the ACM,1975,18(6):333-340.[doi:10.1145/360825.360855]
[22]Wu S.A fast algorithm for multi-pattern searching.Technical Report,Report TR-94-17,Tucson:Department of Computer Science,University of Arizona,1994.
[23]Bellekens X,Atkinson R,Andonovic I,et al.Investigation of GPU-based pattern matching.In:Proc.of the Post Graduate Symp.on the Convergence of Telecommunications,Networking and Broadcasting.2013.[doi:10.6084/m9.figshare.3821922.v1]
[24]Lin CH,Tsai SY,Liu CH,et al.Accelerating string matching using multi-threaded algorithm on GPU.In:Proc.of the IEEE Global Telecommunications Conf.IEEE,2010.1-5.[doi:10.1109/GLOCOM.2010.5683320]
[25]Xu D,Zhang H,Fan Y.The GPU based high-performance pattern-matching algorithm for intrusion detection.Journal of Computational Information Systems,2013.3791-3800.[doi:10.12733/jcis5781]
[26]Cole R,Gottlieb LA,Lewenstein M.Dictionary matching and indexing with errors and dont cares.In:Proc.of the 36th Annual ACM Symp.on Theory of Computing.New York:ACM Press,2004.91-100.[doi:10.1145/1007352.1007374]
[27]Karkkainen J,Sanders P.Simple linear work suffix array construction.In:Proc.of the Int’l Colloquium on Automata Languages and Programming.2003.943-955.[doi:10.1007/3-540-45061-0_73]
[28]Ko P,Aluru S.Space efficient linear time construction of suffix arrays.In:Proc.of the Combinatorial Pattern Matching.2003.200-210.[doi:10.1007/3-540-44888-8_15]
[29]Khancome C,Boonjing V.New Hashing based multiple string pattern matching algorithms.In:Proc.of the Int’l Conf.on Information Technology:New Generations.2012.195-200.[doi:10.1109/ITNG.2012.34]
[30]Blumer A,Blumer J,Haussler D,et al.The smallest automaton recognizing the subwords of a text.Theoretical Computer Science,1985,40(1):31-55.[doi:10.1016/0304-3975(85)90157-4]
[31]Crochemore M.Transducers and repetitions.Theoretical Computer Science,1986,45(1):63-86.[doi:10.1016/0304-3975(86)90041-1]
[32]Inenaga S,Takeda M,Shinohara A,et al.The minimum DAWG for all suffixes of a string and its applications.LNCS,2002.153-167.[doi:10.1007/3-540-45452-7_14]
[33]Inenaga S,Bannai H,Shinohara A,et al.Discovering best variable-length-don’t-care patterns.In:Proc.of the Discovery Science.2002.86-97.[doi:10.1007/3-540-36182-0_10]
[34]Lothaire M.Algebraic Combinatorics on Words.Cambridge University Press,2002.
[2]强继朋,谢飞,高隽,等.带任意长度通配符的模式匹配.自动化学报,2014,40(11):2499-2511.[doi:10.3724/SP.J.1004.2014.02499]
[8]柴欣,贾晓菲,武优西,等.一般间隙及一次性条件的严格模式匹配.软件学报,2015,26(5):1096-1112.http://www.jos.org.cn/1000-9825/4707.htm[doi:10.13328/j.cnki.jos.004707]
[10]武优西,刘亚伟,郭磊,吴信东.子网树求解一般间隙和长度约束严格模式匹配.软件学报,2013,24(5):915-932.http://www.jos.org.cn/1000-9825/4381.htm[doi:10.3724/SP.J.1001.2013.04381]
[15]汪浩,王海平,吴信东.带有通配符和长度约束的模式匹配问题求解模型.计算机科学,2016,43(4):279-283.