高通量测序中序列拼接算法的研究进展

设为首页

收藏本站

网站地图 | English | 公务邮箱

远程访问

NSTL服务站

高通量测序中序列拼接算法的研究进展

详细信息查看全文 | 推荐本文 |

英文篇名：Survey on Sequence Assembly Algorithms in High-throughput Sequencing
作者：周卫星 ; 石海鹤
英文作者：ZHOU Wei-xing;SHI Hai-he;College of Computer Information and Engineering,Jiangxi Normal University;
关键词：高通量测序技术 ; 序列拼接算法 ; greedy ; Overlap-Layout-Consensus ; De ; Bruijn ; Graph
英文关键词：High-throughput sequencing;;Sequence assembly algorithms;;Greedy;;Overlap-layout-consensus;;De bruijn graph
中文刊名：JSJA
英文刊名：Computer Science
机构：江西师范大学计算机信息工程学院;
出版日期：2019-05-15
出版单位：计算机科学
年：2019
期：v.46
基金：国家自然科学基金项目(61662035,61762049,61862033);; 江西省自然科学基金项目(20171BAB202013)资助
语种：中文;
页：JSJA201905007
页数：8
CN：05
ISSN：50-1075/TP
分类号：43-50

摘要

高通量测序(High-throughput Sequencing,HTS)技术是继第一代测序技术之后发展起来的一种新型测序方式,又被称为下一代测序技术。与第一代测序技术中采用基于Sanger方法的自动、半自动毛细管测序方法不同,高通量测序技术采用了基于焦磷酸测序的并行测序技术,是对传统测序技术的一项重要技术突破,它不仅克服了第一代测序技术高成本、低通量、低速度的缺点,而且能满足现代分子生物学和基因组学快速发展的需求,达到低成本、高通量以及快速的目的。相较于第一代测序数据,高通量测序数据具有典型的长度短、覆盖度不均匀以及准确率低的特点,同时第三代测序技术虽保持了高通量测序技术边测序边合成的思想,但采用了更为高效的单分子实时测序技术和纳米孔测序技术,具有高通量、低成本和测序数据长的优势。因此,要获得完整的全基因组基因序列,生物学家就需要使用一种技术将短测序reads拼装成一条完整的基因单链序列。在这种情况下,序列拼接算法应运而生。首先,介绍了序列拼接算法的发展背景以及高通量测序技术的相关概念,分析了高通量测序技术在序列拼接算法中所具有的优势;其次,通过总结序列拼接算法的发展成果,按基于greedy策略、基于Overlap-Layout-Consensus (OLC)策略和基于De Bruijn Graph (DBG)策略的分类对序列拼接算法进行阐述;最后,探讨了序列拼接算法的相关研究方向和发展趋势。
High-throughput sequencing technology is a new sequencing method developed after the first generation sequencing technology,also known as next-generation sequencing technology.Different from the automatic and semi-automatic capillary sequencing method based on Sanger,the high-throughput sequencing technology adopts the parallel sequencing technology based on pyrosequencing.It not only conquers the shortcomings of high cost,low throughput and low speed of the first generation sequencing technology,but also meets the demands of the rapid development of modern molecular biology and genomics with low cost,high throughput and fast speed.Compared with the first generation sequencing data,high-throughput sequencing data are characterized by short lengths,uneven coverage and low accuracy,and the third-generation sequencing technology adopts more efficient single molecular real-time sequencing and Nanopore sequencing technology as well as the principle of sequencing and synthesis,which has the advantages of high throughput,low cost and long sequencing data.Therefore,in order to obtain complete genome sequence,a technique is needed to assemble short sequencing reads into a complte single-stranded sequence of genes.In this case, the sequence assembly algorithm was proposed.Firstly,the development background of sequence assembly algorithms and the related concepts of high-throughput sequencing technology were introduced,and the advantages of high-throughput sequencing technology on sequence assembly were analyzed.Secondly,by summarizing the development of sequence assembly algorithms.The sequence assembly algorithms were illustrated,according to the algorithm classifications,respectively,by greedy strategy,Overlap-Layout-Consensus(OLC) strategy and De Bruijn Graph(DBG) strategy.Finally,the research direction and development trend of sequence assembly algorithms were discussed.

引文

[1] WARD R M,SCHMIEDER R,HIGHNAM G,et al.Big data challenges and opportunities in high-throughput sequencing[J].Systems Biomedicine,2013,1(1):29-34.
    [2] PEARSON W R Y,LIPMAN D J.Improved tools for biological sequence analysis[J].Proceedings of the National Academy of Sciences of the United Sates of America,1988,85(46):16138-16143.
    [3] ALTSCHUL S F,MADDEN T L,SCH?FFER A A,et al.Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J].Nucleic Acids Research,1997,25(17):3389-3402.
    [4] LARKIN M A,BLACKSHIELDS G,BROWN N P,et al.Clus- tal W and Clustal X version 2.0[J].Bioinformatics,2007,23(21):2947-2948.
    [5] DARLING A E,MAU B,PERNA N T.progressive Mauve:Mul- tiple Genome Alignment with Gene Gain,Loss and Rearrangement[J].PLOS One,2010,5(6):e11147.
    [6] BLANCHETTE M,KENT W J,RIEMER C,et al.Aligning multiple genomic sequences with the threaded blockset aligner[J].Genome Research,2004,14(4):708-715.
    [7] CANTOR C R.Orchestrating the Human Genome Project[J].Science,1990,248(4951):49-51.
    [8] CONSORTIUM T G P.A global reference for human genetic variation,the 1000 Genomes Project Consortium[J].Nature,2015,526:68-74.
    [9] CONSORTIUM T U.The UK10K project identifies rare va- riants in health and disease[J].Nature,2015,526(7571):82-90.
    [10] WATSON M.Illuminating the future of DNA sequencing[J].Genome Biology,2014,15(2):108.
    [11] Sanger F,Nicklen S,Coulson A R.DNA sequencing with chain-terminating inhibitors[J].Proceedings of the National Academy of Sciences of the United States of America,1977,74(12):5463-5467.
    [12] MOROZOVA O.Applications of next-generation sequencing technologies in functional genomics[J].Genomics,2008,92(5):255-264.
    [13] WOOLEY J C,GODZIK A,FRIEDBERG I.A Primer on Meta- genomics[J].PLOS Computational Biology,2010,6(2):e1000667.
    [14] WU X,ZHU X,WU G Q,et al.Data Mining with Big Data[J].IEEE Transactions on Knowledge & Data Engineering,2014,26(1):97-107.
    [15] FONSECA N A,RUNG J,BRAZMA A,et al.Tools for mapping high-throughput sequencing data[J].Bioinformatics,2012,28(24):3169-3177.
    [16] NIEDRINGHAUS T P,MILANOVA D,KERBY M B,et al.Landscape of Next-Generation Sequencing Technologies[J].Analytical Chemistry,2011,83(12):4327-4341.
    [17] HARRIS T D,BUZBY P R,BABCOCK H,et al.Single-molecule DNA sequencing of a viral genome[J].Science,2008,320(5872):106-109.
    [18] FLUSBERG B A,WEBSTER D,LEE J,et al.Direct detection of DNA methylation during single-molecule,real-time sequencing[J].Nature Methods,2010,7(6):461-465.
    [19] RUSK N.Cheap third-generation sequencing[J].Nature Me- thods,2009,6(4):244-244.
    [20] CHIN C S,PELUSO P,SEDLAZECK F J,et al.Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing[J].Nature Methods,2016,13(12):1050-1054.
    [21] HEEREMA S J,DEKKER C.Graphene nanodevices for DNA sequencing[J].Nature Nanotechnology,2016,11(2):127-136.
    [22] HEATHER J M,CHAIN B.The sequence of sequencers:The history of sequencing DNA[J].Genomics,2016,107(1):1-8.
    [23] SHENDURE J,JI H.Next-generation DNA sequencing[J].Nature Biotechnology,2008,26(10):1135-1145.
    [24] JACKMAN S D,VANDERVALK B P,MOHAMADI H,et al.ABySS 2.0:resource-efficient assembly of large genomes using a Bloom filter[J].Genome Research,2017,27(5):768-777.
    [25] ZIMIN A V,STEVENS K A,CREPEAU M W,et al.An improved assembly of the loblolly pine mega-genome using long-read single-molecule sequencing[J].GIGA Science,2017,6(1):1-4.
    [26] MOSTOVOY Y,LEVYSAKIN M,LAM J,et al.A hybrid approach for de novo human genome sequence assembly and phasing[J].Nature Methods,2016,13(7):587-590.
    [27] LIU Y,BERTIL S,MASKELL D L.Parallelized short read assembly of large genomes using de Bruijn graphs[J].BMC Bioinformatics,2011,12(1):354.
    [28] WARREN R L,SUTTON G G,JONES S J M,et al.Assembling millions of short DNA sequences using SSAKE[J].Bioinforma-tics,2007,23(4):500-501.
    [29] JECK W R,REINHARDT J A,BALTRUS D A,et al.Extending assembly of short DNA sequences to handle error[J].Bioinformatics,2007,23(21):2942-2944.
    [30] DOHM J C,LOTTAZ C,BORODINA T,et al.SHARCGS,a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing[J].Genome Research,2007,17(11):1697-1706.
    [31] FARRANT G K,HOEBEKE M,PARTENSKY F,et al.WiseScaffolder:an algorithm for the semi-automatic scaffolding of Next Generation Sequencing data[J].BMC Bioinformatics,2015,16(1):281.
    [32] CAO M D,NGUYEN S H,GANESAMOORTHY D,et al.Scaffolding and completing genome assemblies in real-time with na-nopore sequencing[J].Nature Communications,2017,8:14515.
    [33] HIEU T N,ZIAUR R M,HE L,et al.Complete De NovoAssembly of Monoclonal Antibody Sequences[J].Scientific Reports,2016,6:31730.
    [34] MIN L,LIAO Z,HE Y,et al.ISEA:Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2017,14(4):916-925.
    [35] MYERS E W,SUTTON G G,DELCHER A L,et al.A Whole-Genome Assembly of Drosophila[J].Science,2000,287(5461):2196-2204.
    [36] DE L B M,MCCOMBIE W R.Assembling genomic DNA se- quences with PHRAP[J].Current Protocols in Bioinformatics,2007,17(1):11.4.1-11.4.15.
    [37] MARGULIES M,EGHOLM M,ALTMAN W E,et al.Genome sequencing in microfabricated high-density picolitre reactors[J].Nature,2005,437(7057):376-380.
    [38] KAMATH G M,SHOMORONY I,XIA F,et al.HINGE:long-read assembly achieves optimal repeat resolution[J].Genome Research,2017,27(5):747-756.
    [39] MYERS G.Efficient Local Alignment Discovery amongst Noisy Long Reads[M]//Algorithms in Bioinformatics.Berlin:Sprin-ger,2014:52-67.
    [40] KOREN S,WALENZ B P,BERLIN K,et al.Canu:scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation[J].Genome Research,2017,27(5):722-736.
    [41] YE C,HILL C M,WU S,et al.DBG2OLC:Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies[J].Scientific Reports,2016,6(X):31900.
    [42] YE C,MA Z S,CANNON C H,et al.Exploiting sparseness in de novo genome assembly[J].BMC Bioinformatics,2012,13(S6):S1.
    [43] LI H.Minimap and miniasm:fast mapping and de novo assembly for noisy long sequences[J].Bioinformatics,2015,32(14):2103-2110.
    [44] VASER R,SOVI,et al.Fast and accurate de novo genome assembly from long uncorrected reads[J].Genome Research,2017,27(5):737-746.
    [45] JANSEN H J,LIEM M,JONGRAADSEN S A,et al.Rapid de novo assembly of the European eel genome from nanopore sequencing reads[J].Scientific Reports,2017,7(1):7213.
    [46] BAAIJENS J A,AZE A,RIVALS E,et al.De novo assembly of viral quasispecies using overlap graphs[J].Genome Research,2017,27(5):835-848.
    [47] PENG G,JI P,ZHAO F.A novel codon-based de Bruijn graph algorithm for gene construction from unassembled transcriptomes[J].Genome Biology,2016,17(1):232.
    [48] CAMERON D L,SCHROEDER J,PENINGTON J S,et al.GRIDSS:sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly[J].Genome Research,2017,27(12):1-11.
    [49] WEISENFELD N I,KUMAR V,SHAH P,et al.Direct determination of diploid genome sequences[J].Genome Research,2017,27(5):757-767.
    [50] BUTLER J,MACCALLUM I,KLEBER M,et al.ALLPATHS:de novo assembly of whole-genome shotgun microreads[J].Genome Research,2008,18(5):810-820.
    [51] PEVZNER P A,TANG H,WATERMAN M S.An Eulerian path approach to DNA fragment assembly[J].Proceedings of the National Academy of Sciences of the United Sates of America,2001,98(17):9748-9753.
    [52] GNERRE S,JAFFE D B.High-quality draft assemblies of mammalian genomes from massively parallel sequence data[J].Proceedings of the National Academy of Sciences of the United Sates of America,2011,108(4):1513-1518.
    [53] LI R,ZHU H,RUAN J,et al.De novo assembly of human genomes with massively parallel short read sequencing[J].Genome Research,2010,20(2):265-272.
    [54] LUO R,LIU B,XIE Y,et al.SOAPdenovo2:an empirically improved memory-efficient short-read de novo assembler[J].GIGA Science,2012,1(1):18.
    [55] XIE Y,WU G,TANG J,et al.SOAPdenovo-Trans:de novo transcriptome assembly with short RNA-Seq reads[J].Bioinformatics,2014,30(12):1660-1666.
    [56] PENG Y,LEUNG H C M,YIU S M,et al.IDBA-UD:a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth[J].Bioinformatics,2012,28(11):1420-1428.
    [57] CHITSAZ H,YEE-GREENBAUM J L,TESLER G,et al.Efficient de novo assembly of single-cell bacterial genomes from short-read data sets[J].Nature Biotechnology,2011,29(10):915-921.
    [58] PENG Y,LEUNG H C M,YIU S M,et al.IDBA-tran:a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels[J].Bioinformatics,2013,29(13):326-334.
    [59] NURK S,MELESHKO D,KOROBEYNIKOV A,et al.metaSPAdes:a new versatile metagenomics assembler[J].Genome Research,2017,27(5):824-834
    [60] BANKEVICH A,NURK S,ANTIPOV D,et al.SPAdes:A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing[J].Journal of Computational Biology,2012,19(5):455-477.
    [61] PRJIBELSKI A D,VASILINETC I,BANKEVICH A,et al.ExSPAnder:a universal repeat resolver for DNA fragment assembly[J].Bioinformatics,2014,30(12):293-301.
    [62] SAFONOVA Y,BANKEVICH A,PEVZNER P A.dipSPAdes:Assembler for Highly Polymorphic Diploid Genomes[C]//International Conference on Research in Computational Molecular Biology.New York:Springer-Verlag,2014:265-279.
    [63] ANTIPOV D,KOROBEYNIKOV A,MCLEAN J S,et al.hybridSPAdes:an algorithm for hybrid assembly of short and long reads[J].Bioinformatics,2015,32(7):1009-1015.
    [64] VASILINETC I,PRJIBELSKI A D,GUREVICH A,et al.Assembling short reads from jumping libraries with large insert sizes[J].Bioinformatics,2015,31(20):3262-3268.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700