Processing, Analysis, and Modeling of High-Throughput Genomic Data
Abstract
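As high-throughput sequencing technology continues to advance, the volume of biological data keeps growing, and mining valuable knowledge and patterns from high-throughput experimental data has become a central topic in bioinformatics and computational biology. This thesis presents a series of studies on methods for processing and analyzing high-throughput genomic data, with the following results.
     1. With the development of second-generation DNA sequencing, reference genome sequences for more and more species, as well as genome sequences of individual organisms, have been determined. Storing and managing the huge volume of individual genome data has therefore become a major challenge for biologists. This thesis proposes a novel compression tool, GRS (Genome ReSequencing), for storing and analyzing genome resequencing data for which a reference genome sequence exists. Unlike earlier methods, GRS can process genome sequence data without a reference map of single nucleotide polymorphisms or other variation information, and it automatically rebuilds the individual genome sequence from the reference genome sequence. Tested on the first Korean personal genome sequence, GRS achieved roughly 159-fold compression, reducing the data from 2986.8 MB to 18.8 MB. Tested on rice and Arabidopsis thaliana sequencing data, it compressed the rice genome data from 361.0 MB to 4.4 MB and the Arabidopsis genome data from 115.1 MB to 6.5 KB. The tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.
     2. Chromatin immunoprecipitation followed by massively parallel high-throughput sequencing (ChIP-Seq) is an important technique for studying interactions between proteins and genomic DNA. This thesis designs a new algorithm for analyzing Illumina paired-end ChIP-Seq data and implements it in the analysis tool SIPeS (Site Identification from Paired-end Sequencing). We obtained ChIP-Seq data for the Arabidopsis transcription factor AMS (a gene involved in Arabidopsis pollen development); compared with the existing analysis methods CisGenome and MACS, SIPeS identified binding sites at higher resolution. Using paired-end data, SIPeS can accurately calculate the effective genome length (mappable genome length), and its dynamic baseline approach effectively separates closely adjacent binding sites, which is especially useful for gene-dense genomes such as Arabidopsis. The tool is available at http://gmdd.shgmo.org/Computational-Biology/ChIP-Seq/download/SIPeS; the current version is 2.0.
     3. Protein interactions take part in every aspect of an organism's life processes, and more than 10 public Arabidopsis protein interaction databases are currently available. These databases have shortcomings, however, including interaction evidence types that follow no unified standard, the lack of unified protein or gene identifiers, and the use of other information without standard definitions. To integrate data from the different interaction databases effectively and make the fullest use of them, this thesis presents an interactive bioinformatics web tool, ANAP (Arabidopsis Network Analysis Pipeline). ANAP was developed for integrating Arabidopsis protein interaction data and studying interaction networks, and it makes protein interaction network analysis convenient. ANAP integrates 11 Arabidopsis protein interaction databases, comprising 201,699 unique protein interaction pairs, 15,208 identifiers (including 11,931 TAIR AGI codes), 89 interaction detection methods, 73 species involved in Arabidopsis protein interactions, and 6,161 references. ANAP can serve as a knowledge base for constructing protein interaction networks and, based on user input, supports analysis of both direct and indirect protein interactions. It has an intuitive graphical interface for network visualization and provides detailed evidence for each interaction pair. In addition, through links to the TAIR database, ANAP makes it easy to browse the functional annotations of genes and proteins in the generated network, and it links each gene or protein to the AtGenExpress Visualization Tool (AVT), the Arabidopsis 1001 Genomes GBrowse (1001 Genomes), the Protein Knowledgebase (UniProtKB), the Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Ensembl Genome Browser (EnsemblGenomes) to support interaction network analysis. The tool is available at http://gmdd.shgmo.org/Computational-Biology/ANAP/ANAP_V1.0.
     4. Safety assessment is a key step between research on a genetically modified (GM) crop and its commercialization. Molecular characterization, which covers the transgene insertion site, the flanking sequences, and the insertion copy number, is the most basic and most important part of that assessment. Compared with conventional detection methods such as Southern blotting, polymerase chain reaction (PCR), in situ hybridization, and genome walking, establishing new high-throughput methods for the molecular characterization of GM crops is both useful and necessary. Here, we developed an accurate high-throughput method based on paired-end sequencing to assess the molecular characteristics of GM rice at the whole-genome level. For the transgenic rice event T1C-19, the method clearly revealed transgene insertion sites on chromosomes 4 and 11, and the result was well supported by conventional PCR and Sanger sequencing.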
With the rapid development of the biological sciences, a large amount of data has been generated. How to extract valuable knowledge from these data has become a major topic in bioinformatics and computational biology research. This thesis focuses on the processing, analysis, and modeling of high-throughput genomic data.
     The following important findings have been made.
     1. With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools for compressing genome sequencing data have limitations, such as requiring a reference single nucleotide polymorphism (SNP) map and information on deletions and insertions. Here, we present GRS, a novel compression tool for storing and analyzing Genome ReSequencing data. GRS is able to process genome sequencing data without reference SNPs or other sequence variation information, and it automatically rebuilds the individual genome sequencing data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequencing data set, GRS achieved 159-fold compression, reducing the size of the data from 2986.8 MB to 18.8 MB. When tested on sequencing data from rice and Arabidopsis thaliana, GRS compressed the rice genome data from 361.0 MB to 4.4 MB and the A. thaliana genome data from 115.1 MB to 6.5 KB. The tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.
     2. ChIP-Seq, which combines chromatin immunoprecipitation (ChIP) with high-throughput massively parallel sequencing, is increasingly used to identify protein-DNA interactions in vivo across the genome. However, to maximize the effectiveness of the analysis of such data, new algorithms that accurately predict DNA-protein binding sites need to be developed. Here, we present SIPeS (Site Identification from Paired-end Sequencing), a novel algorithm for precise identification of binding sites from short reads generated by paired-end Solexa ChIP-Seq technology. We applied this method to ChIP-Seq data from the Arabidopsis basic helix-loop-helix transcription factor ABORTED MICROSPORES (AMS), which is expressed in the anther during pollen development. Our results show that SIPeS has better resolution for binding site identification than two existing ChIP-Seq peak detection algorithms, CisGenome and MACS. Moreover, SIPeS is designed to accurately calculate the mappable genome length using the fragment length obtained from the paired-end reads. It also employs dynamic baselines to discriminate closely adjacent binding sites, which is of particular value when working on genomes with high gene density. The tool is available at http://gmdd.shgmo.org/Computational-Biology/ChIP-Seq/download/SIPeS, and the current version is 2.0.
     3. Protein interactions are essential in the molecular processes occurring within an organism and are utilised in network biology to help organise and understand biological complexity. Currently, there are more than 10 publicly available Arabidopsis protein interaction databases. However, these databases have limitations, including differing types of interaction evidence, a lack of defined standards for protein identifiers, and the use of other non-standard information. To effectively integrate the different datasets and maximise access to available data, this paper presents an interactive bioinformatics web tool, ANAP (Arabidopsis Network Analysis Pipeline). ANAP has been developed for Arabidopsis protein interaction integration and network-based study, to facilitate functional protein network analysis. ANAP integrates 11 Arabidopsis protein interaction databases, comprising a total of 201,699 unique protein interaction pairs, 15,208 identifiers (including 11,931 TAIR AGI codes), 89 interaction detection methods, 73 species interacting with Arabidopsis, and 6,161 references. ANAP can be used as a knowledge base for constructing protein interaction networks based on user input and supports both direct and indirect interaction analysis. It has an intuitive graphical interface allowing easy network visualisation and provides detailed evidence for each interaction. In addition, ANAP displays gene and protein annotation in the generated interactive network, with links to TAIR, the AtGenExpress Visualization Tool (AVT), the Arabidopsis 1001 Genomes GBrowse (1001 Genomes), the Protein Knowledgebase (UniProtKB), the Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Ensembl Genome Browser (EnsemblGenomes) to aid functional network analysis. The tool is openly available at http://gmdd.shgmo.org/Computational-Biology/ANAP/ANAP_V1.0.
     4. Safety assessment of genetically modified (GM) crops is a key step between research on transgenic crops and their commercialization. Molecular characterization, including analysis of the integration site, flanking sequences, and insertion copy number, provides the most basic and important data for safety assessment. High-throughput methods for molecular characterization of GM crops offer advantages over conventional methods such as Southern blotting, polymerase chain reaction (PCR), fluorescence in situ hybridization (FISH), and genome walking. In this work, we developed an accurate high-throughput method based on paired-end sequencing to reveal the molecular features of GM rice at the genome-wide level. One transgenic rice event, T1C-19, was selected to test the applicability of the method. The integration sites on Chr04 and Chr11 were clearly revealed for two transgenes, and the sequences surrounding the integration sites were readily identified using conventional PCR and Sanger sequencing.
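Finding 1 describes storing an individual genome relative to a reference and rebuilding it on demand. The following Python sketch illustrates that general idea only; it is not the GRS implementation. It uses the standard-library `difflib.SequenceMatcher` as a stand-in differencing step, and the function names `encode` and `decode` are hypothetical.

```python
from difflib import SequenceMatcher


def encode(reference: str, individual: str):
    """Keep only the spans where `individual` differs from `reference`.

    Each delta entry (i1, i2, bases) means: replace reference[i1:i2] with `bases`.
    Assumption: a generic diff stands in for whatever GRS actually does.
    """
    matcher = SequenceMatcher(None, reference, individual, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            delta.append((i1, i2, individual[j1:j2]))
    return delta


def decode(reference: str, delta):
    """Rebuild the individual sequence from the reference plus the delta."""
    out, pos = [], 0
    for i1, i2, bases in delta:
        out.append(reference[pos:i1])  # unchanged reference stretch
        out.append(bases)              # substituted or inserted bases
        pos = i2                       # skip replaced or deleted reference bases
    out.append(reference[pos:])
    return "".join(out)


# Toy sequences invented for illustration.
reference = "ACGTACGTTAGCCGATAG"
individual = "ACGTACCTTAGCGATAGT"
delta = encode(reference, individual)
print(delta, decode(reference, delta) == individual)
```

Because two individuals of one species differ at few positions, the delta is small compared with the full sequence, which is consistent with the ratio reported above (2986.8 MB / 18.8 MB ≈ 159).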
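Finding 2 relies on paired-end reads, which give both ends of each immunoprecipitated fragment, and on a baseline that separates closely adjacent binding sites. The Python sketch below shows, under simplifying assumptions, how whole-fragment coverage can be piled up and why raising the baseline splits a merged region in two. It uses a fixed baseline chosen by hand; the dynamic baseline of SIPeS is not reproduced here, and the function names and toy fragments are hypothetical.

```python
def fragment_pileup(fragments, length):
    """Per-base coverage from fragments given as (start, end), end exclusive."""
    diff = [0] * (length + 1)
    for start, end in fragments:
        diff[start] += 1   # coverage rises where a fragment begins
        diff[end] -= 1     # and falls where it ends
    coverage, depth = [], 0
    for step in diff[:length]:
        depth += step
        coverage.append(depth)
    return coverage


def regions_above(coverage, baseline):
    """Contiguous half-open intervals whose coverage exceeds `baseline`."""
    regions, start = [], None
    for pos, depth in enumerate(coverage):
        if depth > baseline and start is None:
            start = pos
        elif depth <= baseline and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions


# Toy fragments around two nearby sites (invented coordinates).
fragments = [(0, 8), (2, 10), (4, 12), (10, 18), (12, 20), (14, 22)]
coverage = fragment_pileup(fragments, 24)
print(regions_above(coverage, 0))  # one merged region
print(regions_above(coverage, 2))  # two separate regions
```

With a baseline of 0 the two sites merge into a single enriched region, whereas a baseline of 2 resolves them as separate intervals.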
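Finding 4 does not spell out how paired-end reads locate a transgene. One common approach, assumed here purely for illustration and not confirmed as the thesis method, is to collect read pairs in which one mate aligns to the transgene construct and the other to a host chromosome, then cluster the host-side positions. The Python sketch below uses hypothetical names (`candidate_insertions`, the contig label `T-DNA`, the `window` and `min_pairs` parameters) and invented toy alignments.

```python
from collections import defaultdict


def candidate_insertions(pairs, transgene="T-DNA", window=500, min_pairs=2):
    """Cluster host positions of pairs with exactly one mate on the transgene.

    pairs: iterable of ((contig1, pos1), (contig2, pos2)) mate alignments.
    Returns sorted (chromosome, first_pos, last_pos, supporting_pairs) tuples.
    """
    host_hits = defaultdict(list)
    for mate1, mate2 in pairs:
        on_transgene = (mate1[0] == transgene, mate2[0] == transgene)
        if sum(on_transgene) != 1:
            continue  # both mates on the host, or both on the transgene
        chrom, pos = mate2 if on_transgene[0] else mate1
        host_hits[chrom].append(pos)

    clusters = []
    for chrom, positions in host_hits.items():
        positions.sort()
        group = [positions[0]]
        for pos in positions[1:]:
            if pos - group[-1] <= window:
                group.append(pos)
            else:
                if len(group) >= min_pairs:
                    clusters.append((chrom, group[0], group[-1], len(group)))
                group = [pos]
        if len(group) >= min_pairs:
            clusters.append((chrom, group[0], group[-1], len(group)))
    return sorted(clusters)


# Toy alignments; contig names and coordinates are invented.
pairs = [
    (("T-DNA", 10), ("chrA", 1000)),
    (("chrA", 1100), ("T-DNA", 50)),
    (("chrA", 5000), ("chrA", 5200)),
    (("T-DNA", 5), ("T-DNA", 300)),
    (("chrB", 42), ("T-DNA", 900)),
]
print(candidate_insertions(pairs))
```

Requiring at least `min_pairs` supporting pairs filters out isolated chimeric pairs; candidate sites found this way would still need confirmation, for example by PCR and Sanger sequencing as described above.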
