ViVaMBC: estimating viral sequence variation in complex populations from Illumina deep-sequencing data using model-based clustering
  • Authors: Bie Verbist (1)
    Lieven Clement (2)
    Joke Reumers (3)
    Kim Thys (3)
    Alexander Vapirev (3) (4)
    Willem Talloen (3)
    Yves Wetzels (3)
    Joris Meys (1)
    Jeroen Aerssens (3)
    Luc Bijnens (3)
    Olivier Thas (1) (5)

    1. Department of Mathematical Modeling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, 9000 Gent, Belgium
    2. Department of Applied Mathematics, Informatics and Statistics, Ghent University, Krijgslaan 281 S9, 9000 Gent, Belgium
    3. Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, 2340 Beerse, Belgium
    4. ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium
    5. University of Wollongong, National Institute for Applied Statistics Research Australia (NIASRA), School of Mathematics and Applied Statistics, NSW 2522, Australia
  • Keywords: Illumina sequencing; Codon; Second best base call; Model-based clustering; Viral quasispecies
  • Journal: BMC Bioinformatics
  • Publication year: 2015
  • Publication date: December 2015
  • Volume: 16
  • Issue: 1
  • Full-text size: 888 KB
  • Journal subjects: Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
  • Publisher: BioMed Central
  • ISSN: 1471-2105
Abstract
Background: Deep sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology-associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores, which for Illumina sequencing are derived from a quadruplet of intensities, one channel for each nucleotide type. The highest intensity of the four channels determines the base that is called. Mismatched bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets), which enables immediate biological interpretation of the variants with respect to their antiviral drug responses.

Results: Using mixtures of HCV plasmids, we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike its competitors, however, ViVaMBC reports a higher number of false-positive findings at frequencies below 0.4%, which might partially originate from artificial variants introduced by errors in the sample and library preparation step.

Conclusions: ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which does not warrant the additional computational cost of running the offline base caller. Apparently, much of the information is already contained in the quality scores, enabling the model-based clustering procedure to correct for the majority of the sequencing errors. Overall, the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low-frequency variant detection.
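
The ingredients named in the abstract (a quadruplet of channel intensities, a called base and a second best base, and quality scores that translate into error probabilities) can be illustrated with a small sketch. It is not the ViVaMBC model. The function names, the example intensities, and the uniform "error splits equally over the three other bases" assumption are hypothetical choices made for illustration only. The one standard fact it relies on is the Phred convention, P(error) = 10^(-Q/10).

```python
BASES = "ACGT"


def phred_to_error_prob(q):
    """Convert a Phred quality score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def best_and_second_best(intensities):
    """Rank the four channel intensities of one sequencing cycle.

    `intensities` maps each base in "ACGT" to its channel intensity.
    Returns (called_base, second_best_base).
    """
    ranked = sorted(BASES, key=lambda b: intensities[b], reverse=True)
    return ranked[0], ranked[1]


def codon_likelihood(observed, quals, candidate):
    """Toy likelihood of an observed codon given a candidate true codon.

    Illustrative assumption (not taken from the paper): positions are
    independent, a base is read correctly with probability 1 - e, and an
    error lands on each of the three other bases with probability e / 3.
    """
    likelihood = 1.0
    for obs, q, true in zip(observed, quals, candidate):
        e = phred_to_error_prob(q)
        likelihood *= (1.0 - e) if obs == true else e / 3.0
    return likelihood


if __name__ == "__main__":
    # Hypothetical intensity quadruplet for a single cycle.
    cycle = {"A": 120.0, "C": 860.0, "G": 45.0, "T": 310.0}
    print(best_and_second_best(cycle))
    # Compare two candidate codons for the same observed read triplet.
    print(codon_likelihood("CTA", [30, 30, 20], "CTA"))
    print(codon_likelihood("CTA", [30, 30, 20], "CTG"))
```

In this toy setting, a low quality score at the third position makes a candidate codon that differs there comparatively less implausible. This is the kind of information a quality-aware clustering procedure could use to separate sequencing errors from true low-frequency variants.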
