Research on Biomedical Text Mining Problems Based on Discriminative Models
Abstract
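With advances in computing and biotechnology, the biomedical literature is growing at an unprecedented rate. This literature holds the latest research progress and a wealth of biomedical knowledge, and is of great importance to biomedical researchers. However, with tens of millions of documents, it is becoming harder and harder for researchers to track and organize the knowledge and information they need. Text mining can address this problem by helping biomedical researchers obtain knowledge and information from the literature more efficiently, so text mining research on biomedical literature has significant practical value. Discriminative models are a class of machine learning models that use features directly to predict the probability of a target variable; the discriminative models mainly used in this thesis are the maximum entropy model and the conditional random fields model. Compared with generative models, discriminative models relax the requirement that features be independent of one another and fit the needs of many text mining tasks, so they are more likely to achieve good results.
     This thesis studies how to use discriminative models to solve problems in biomedical literature mining. Specifically, we study three tasks in biomedical text mining: biomedical named entity recognition, biomedical entity normalization, and biomedical semantic relation extraction. Of these three tasks, the second extends the first in semantic processing, and the first two are the foundation of the third. The main content of the thesis covers the following four aspects.
     The goal of biomedical named entity recognition is to find all instances of the names of entities of a given type in a given collection of text; it is a necessary step for deep text mining. After examining the characteristics and difficulties of entity recognition in the biomedical domain and analyzing the strengths and weaknesses of existing biomedical entity recognition methods, this thesis proposes a method that combines the conditional random fields model with a rich feature set. The features include word-formation features, context features, and syntactic features. Among them, shallow syntactic features are introduced into the conditional random fields model for the first time and are used for both boundary detection and category assignment of entities. Experiments show that these features effectively improve named entity recognition performance.
     Supervised machine learning methods require large annotated corpora. The large volume of electronic literature makes unlabeled biomedical corpora easy to obtain, but annotating a corpus is still expensive. To address the difficulty of obtaining the large training corpora that supervised learning needs for biomedical named entity recognition, this thesis proposes a semi-supervised co-training method based on the maximum entropy model. The method uses a large amount of unlabeled data to improve the named entity recognition performance of a classifier learned from a small labeled corpus. To further improve semi-supervised learning, the thesis introduces active learning into the semi-supervised process. Experiments show that co-training based on the maximum entropy model effectively improves the recognition performance of the initial classifier.
     Flexible naming practices make biomedical entities severely ambiguous, which has become one of the main obstacles to deep automatic text mining of biomedical literature. Biomedical entity normalization was proposed to solve this problem: it maps the different variants that express the same concept in biomedical literature to a single concept identifier. This thesis proposes a multi-level disambiguation framework for biomedical entity normalization. Different stages of the normalization process face different kinds of ambiguity, and the framework applies a targeted strategy to each: dictionary-based entity name detection, machine-learning-based candidate selection, and knowledge-based disambiguation. Experiments on the test set of the BioCreAtIvE 2006 gene name normalization task show that the proposed framework effectively resolves the various ambiguities that arise during normalization.
     Biomedical semantic relation extraction is one of the main research topics in biomedical text mining and an important means of extracting biomedical knowledge from unstructured biomedical literature. In practice, biomedical semantic relations may be defined broadly or specifically. This thesis treats extraction of broadly and specifically defined relations as binary and multi-class classification problems, respectively, and proposes extraction methods based on the maximum entropy model. For the broadly defined case, protein-protein interaction without distinguishing interaction types, it proposes a two-stage extraction method based on maximum entropy. For the specifically defined case, multi-class protein-protein interaction, it proposes an extraction method that combines the maximum entropy model with word features; on a data set with 10 interaction classes, this method achieves an overall accuracy of 73.4%. The same method also achieves good results on the task of extracting relations between diseases and treatments. In addition, through theoretical analysis and experimental comparison, the thesis shows from both theory and practice that discriminative models are better suited than generative models to biomedical semantic relation extraction.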
With advances in computing and biotechnology, the biomedical literature is growing at an unprecedented rate. This literature contains the latest research progress and rich biomedical knowledge, both of which are vital to biomedical researchers. However, with tens of millions of publications, tracking and collating the necessary knowledge and information is becoming increasingly difficult. Text mining technology can address this problem and improve the efficiency with which researchers use the biomedical literature, so research on text mining for biomedical literature has considerable practical value. Discriminative models are a class of machine learning models that use features directly to predict the probability of target variables. This thesis uses two of them: the conditional random fields model and the maximum entropy model. Compared with generative models, discriminative models relax the assumption that features are independent and match the requirements of many text mining tasks, so they are more likely to achieve good results.
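For reference, the two discriminative models named above are usually written in the following standard textbook form. The notation ($x$ for the observation, $y$ for the label, $f_i$ for feature functions, $\lambda_i$ for their weights) is assumed here and is not taken from the thesis itself.

```latex
% Maximum entropy model: conditional distribution over a single label y
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)

% Linear-chain conditional random field: distribution over a label sequence
% y = (y_1, ..., y_T), with Z_\lambda(\mathbf{x}) summing over all label sequences
p_\lambda(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_\lambda(\mathbf{x})}
  \exp\Big( \sum_{t=1}^{T} \sum_i \lambda_i f_i(y_{t-1}, y_t, \mathbf{x}, t) \Big)
```

Both models condition on the observation and normalize only over labels, which is why overlapping, correlated features can be added without modeling how the features themselves are distributed.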
     This thesis studies how to use discriminative models to solve problems in biomedical text mining. Specifically, we study three tasks: biomedical named entity recognition, biomedical named entity normalization, and biomedical semantic relation extraction. Of the three, the second extends the first in semantic processing, and the first two form the basis for the third. The main content of the thesis falls into the following four parts.
     The goal of biomedical named entity recognition is to identify all instances of named entities of specified categories in a set of documents. It is a necessary step for deep text mining. After investigating the characteristics and difficulties of biomedical named entity recognition and analyzing the advantages and disadvantages of current methods, we propose using a conditional random fields model with rich feature sets to identify biomedical named entities. The feature sets include word-formation features, context features, and syntactic features. Among these, shallow syntactic features are introduced into the conditional random fields model for the first time to perform boundary detection and semantic labeling simultaneously. Experiments show that they effectively improve the model's performance.
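The three feature families can be illustrated with a minimal sketch of per-token feature extraction for a sequence labeler. The feature names, the example sentence, and the chunk tags below are hypothetical; the abstract does not list the thesis's actual feature templates.

```python
def token_features(tokens, chunk_tags, i):
    """Build a feature dict for token i (all feature names are illustrative)."""
    word = tokens[i]
    return {
        # Word-formation features: surface shape of the token itself.
        "lower": word.lower(),
        "has_digit": any(c.isdigit() for c in word),
        "has_upper": any(c.isupper() for c in word),
        "has_hyphen": "-" in word,
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        # Context features: neighbouring words, with sentence-boundary markers.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
        # Shallow syntactic feature: the chunk tag from a shallow parser.
        "chunk": chunk_tags[i],
    }


# Hypothetical sentence with hand-assigned chunk tags.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
chunk_tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-NP", "I-NP"]
features = [token_features(tokens, chunk_tags, i) for i in range(len(tokens))]
```

A list of such dicts, one per token, is the input shape that common CRF toolkits accept. The chunk tag is what lets the model tie entity boundaries to noun-phrase boundaries.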
     Supervised machine learning methods need large annotated corpora. Unannotated data is now easy to obtain in the biomedical domain because of the huge amount of electronic literature, but corpus annotation remains expensive. To deal with the lack of large-scale annotated corpora for biomedical named entity recognition, this thesis proposes a co-training method based on the maximum entropy model. The method uses unannotated data to improve the performance of classifiers trained on a small annotated corpus. An active learning strategy is also integrated to further improve the results of co-training. Experiments show that the proposed method effectively improves the recognition performance of the initial classifier.
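As a rough illustration of the co-training idea, the sketch below trains one classifier per feature view and lets each add its most confident predictions on unlabeled examples to the shared training set. It is a generic loop in the style of standard co-training, not the thesis's procedure. The two-view split, the toy data, and the selection rule are assumptions; the classifier is a two-class maximum entropy model (equivalent to logistic regression); and the active learning step is omitted.

```python
import math


def predict_proba(w, x):
    """P(y = 1 | x) under a two-class maximum entropy model; w[-1] is the bias."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(X, y, epochs=200, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)
            for j, v in enumerate(xi):
                w[j] += lr * err * v
            w[-1] += lr * err
    return w


def co_train(labeled, pool, rounds=3, per_round=1):
    """labeled: [((view_a, view_b), label)], pool: [(view_a, view_b)]."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            w = train_maxent([x[view] for x, _ in labeled],
                             [y for _, y in labeled])
            # Most confident predictions (farthest from 0.5) come first.
            pool.sort(key=lambda x: abs(predict_proba(w, x[view]) - 0.5),
                      reverse=True)
            for x in pool[:per_round]:
                labeled.append((x, int(predict_proba(w, x[view]) >= 0.5)))
            del pool[:per_round]
    return labeled, pool


# Toy data: view A = word-shape flags, view B = context flags (hypothetical).
seed = [(([1, 1], [1, 0]), 1), (([1, 0], [1, 1]), 1),
        (([0, 0], [0, 0]), 0), (([0, 1], [0, 1]), 0)]
unlabeled = [([1, 1], [1, 1]), ([0, 0], [0, 1]),
             ([1, 0], [1, 0]), ([0, 1], [0, 0])]
grown, remaining = co_train(seed, unlabeled)
```

With these settings all four unlabeled examples are consumed, so `grown` holds eight labeled examples. Whether the automatically assigned labels are correct depends on how informative each view is.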
     The flexible nomenclature of biomedical named entities results in severe semantic ambiguity, which is an obstacle to deep biomedical text mining. Biomedical named entity normalization is an effective way to resolve this problem. Its goal is to correctly associate the named entities in documents with standard identifiers. In this thesis, a multi-level disambiguation framework is proposed for the biomedical named entity normalization task. To handle the different kinds of ambiguity that arise during normalization, the framework includes three strategies: dictionary-based named entity detection, machine-learning-based candidate selection, and knowledge-based disambiguation. Experimental results on the test data of the BioCreAtIvE 2006 gene name normalization task show that the proposed framework effectively resolves the various ambiguities that arise during normalization.
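The three stages can be pictured with the toy sketch below. Everything in it is hypothetical: the lexicon, the `TOY:` identifiers, and the context profiles are invented for illustration. The second stage, which the thesis implements with machine learning, is represented only by a pluggable `keep_candidate` callable that accepts every mention by default.

```python
def normalize(text, lexicon, profiles, keep_candidate=lambda mention, context: True):
    """Map mentions in text to identifiers via a three-stage pipeline."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    context = set(tokens)
    results = {}
    for tok in tokens:
        # Stage 1: dictionary-based detection of candidate mentions.
        candidate_ids = lexicon.get(tok)
        if not candidate_ids:
            continue
        # Stage 2: candidate selection (stand-in for a learned filter).
        if not keep_candidate(tok, context):
            continue
        # Stage 3: knowledge-based disambiguation, choosing the identifier
        # whose profile shares the most words with the surrounding text.
        results[tok] = max(
            candidate_ids,
            key=lambda gid: len(profiles.get(gid, set()) & context),
        )
    return results


# Hypothetical resources: one ambiguous symbol with two candidate identifiers.
lexicon = {"cat": ["TOY:catalase", "TOY:acetyltransferase"]}
profiles = {
    "TOY:catalase": {"hydrogen", "peroxide", "oxidative"},
    "TOY:acetyltransferase": {"chloramphenicol", "reporter"},
}
mapping = normalize(
    "CAT activity protects cells from hydrogen peroxide.", lexicon, profiles
)
```

Here `mapping` resolves the ambiguous symbol to `TOY:catalase`, because that profile shares two words with the sentence and the other shares none. A real system would draw the lexicon and profiles from curated gene databases.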
     Biomedical semantic relation extraction is one of the main research topics in biomedical text mining and an important means of extracting biomedical knowledge from biomedical literature. In practice, relations are defined in two ways: generally and concretely. We treat extraction under general and concrete definitions as binary and multi-class classification problems, respectively, and propose maximum entropy models to solve them. For a general relation definition, the protein-protein interaction (PPI) relation, we propose a two-phase PPI relation extraction method based on the maximum entropy model. For a concrete relation definition, the multi-class PPI relation, we propose a method that uses the maximum entropy model with word features. On a 10-class PPI relation test set, the method achieves 73.4% accuracy. The same method is also applied to disease-treatment relation extraction and achieves good results. In addition, we show in both theory and practice that discriminative models are more suitable than generative models for biomedical semantic relation extraction.
