Machine Learning for Information Extraction from XML marked-up text

Machine Learning for Information Extraction from XML marked-up text on the Semantic Web
Nigel Collier
National Institute of Informatics, Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan
May 1st 2001, Semantic Web Workshop 2001 at WWW10

Talk summary
•Introduction
•Motivation
•System model
•Tagged texts as the key to learning
•Test collections
•Method
•Results and Conclusion

Introduction and motivation
•Final goal:
 - Smart documents and smart applications based on standardised content annotation schemes (XML, RDF etc.)
•Why is this a good thing?
 - Information access, building natural interfaces etc.
•The bottleneck:
 - Entering expert knowledge into (textual) documents
•Proposed solution:
 - Learning to annotate domain-based texts using examples

System model: PIA project at NII
[Figure: system architecture diagram. Labelled components: question searcher, smart (IE) engine, local search, annotation engine (learner, tagger), XML editor, searcher/submitter, tagged document collection, indexed document collection. Labelled documents: Document.xml and Answer-Document.xml, each shown with <x> Y </x> markup.]

System model
•Initial goals:
 - A pilot study to test machine learning technology in a technical domain as well as news
 - Explore the problems of tagging from a linguistic perspective
 - Concentrate on terminology, i.e. identification and classification of terms, using examples to learn
•Next step goals:
 - Make use of higher-level information contained in the DTD schema, attribute information etc.
 - Define and use ontologies etc.

Tagged texts as the key to learning
•Example marked-up sentence for molecular biology:
No <PROTEIN>STAT</PROTEIN> activity was detected in <SOURCE subtype="ct">TCR-stimulated lymphocytes</SOURCE>, indicating that the <PROTEIN>JAK</PROTEIN>/<PROTEIN>STAT</PROTEIN> pathway defined in this study constitutes an <PROTEIN>IL-2R</PROTEIN>-mediated signaling event which is not shared by the <PROTEIN>TCR</PROTEIN>.
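As a rough illustration of how such a marked-up sentence can become training examples, the sketch below turns it into (token, class) pairs. The `<s>` wrapper element, the whitespace tokenisation and the function name `labelled_tokens` are assumptions made for this example; the slides do not describe the actual preprocessing.

```python
import xml.etree.ElementTree as ET

# The example sentence from the slide, wrapped in a hypothetical <s> root
# element so that it parses as a single XML document.
SENTENCE = (
    '<s>No <PROTEIN>STAT</PROTEIN> activity was detected in '
    '<SOURCE subtype="ct">TCR-stimulated lymphocytes</SOURCE>, indicating '
    'that the <PROTEIN>JAK</PROTEIN>/<PROTEIN>STAT</PROTEIN> pathway defined '
    'in this study constitutes an <PROTEIN>IL-2R</PROTEIN>-mediated signaling '
    'event which is not shared by the <PROTEIN>TCR</PROTEIN>.</s>'
)


def labelled_tokens(xml_sentence):
    """Return (token, class) pairs; text outside any element is labelled UNK."""
    root = ET.fromstring(xml_sentence)
    pairs = [(tok, "UNK") for tok in (root.text or "").split()]
    for elem in root:
        label = elem.tag
        subtype = elem.get("subtype")
        if subtype:
            # Mirror the SOURCE.ct style of class name used in the tag set.
            label = f"{label}.{subtype}"
        pairs.extend((tok, label) for tok in (elem.text or "").split())
        # Text following the closing tag belongs to the background class.
        pairs.extend((tok, "UNK") for tok in (elem.tail or "").split())
    return pairs


pairs = labelled_tokens(SENTENCE)
```

Splitting on whitespace is deliberately naive (punctuation stays attached to words); a real tokeniser would be needed before training.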

Challenges of name-finding in a technical domain
•Inconsistent naming conventions, e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2
•Widespread synonymy: many synonyms in wide usage, e.g. PKB and Akt
•Open, growing vocabulary for many classes
•Cross-over of names between classes depending on context
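To make the first challenge concrete, the sketch below applies a simple surface normalisation (lower-casing, dropping hyphens and spaces) to the variants listed above. The normalisation rule and function names are illustrative assumptions, not part of the presented system.

```python
import re
from collections import defaultdict


def surface_key(name):
    """Lower-case the name and drop hyphens and whitespace."""
    return re.sub(r"[-\s]+", "", name.lower())


def group_variants(names):
    """Group names that share the same normalised surface form."""
    groups = defaultdict(list)
    for name in names:
        groups[surface_key(name)].append(name)
    return dict(groups)


variants = ["IL-2", "IL2", "Interleukin 2", "Interleukin-2", "Il-2"]
groups = group_variants(variants)
```

The abbreviated and spelled-out forms still land in two separate groups, and true synonyms such as PKB and Akt cannot be linked by any string rule of this kind, which is part of the motivation for learning from tagged examples.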

HMM models
•Advantages
 - can consider language modeling within a well-known and understood mathematical framework
 - although the n-1 assumption is naïve, it works well in practice
•Disadvantages
 - the model ignores long-distance and structural dependencies
 - the model suffers from fragmentation of the probability distribution (i.e. data sparseness)

Model specification
•Formal generative model:
$$\Pr(NC \mid W) = \frac{\Pr(W, NC)}{\Pr(W)}$$
where $NC$ is a sequence of name classes and $W$ is a given sequence of words.
•Since $\Pr(W)$ can be considered constant, we aim to maximize $\Pr(W, NC)$.

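Model's intuition
[Figure: state diagram. Class states (protein, DNA, Source.ct, UNK) lie between a start-of-sentence state and an end-of-sentence state.]
Example: Activation of JAK kinases and STAT proteins in human T lymphocytes .
Underlying process (one class state per token): Activation/UNK of/UNK JAK/PROTEIN kinases/PROTEIN and/UNK STAT/PROTEIN proteins/PROTEIN in/UNK human/SOURCE.ct T/SOURCE.ct lymphocytes/SOURCE.ct ./UNK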

Interpolating HMM model specification
•We need two probability distributions:
 (1) for the first word and name class in a sequence
 (2) for all other words and name classes
•Let (1) be
$$\lambda_0 \Pr(NC_{first} \mid W_{first}, F_{first}) + \lambda_1 \Pr(NC_{first} \mid \_, F_{first}) + \lambda_2 \Pr(NC_{first})$$
for $\sum_i \lambda_i = 1.0$ and $\lambda_0 \ge \lambda_1 \ge \lambda_2$, where
 - $\lambda_x$: empirically determined constant
 - $NC_{first}$: first name class (state) in the sequence
 - $W_{first}$: first word in observed emission
 - $F_{first}$: feature belonging to the first word
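A minimal sketch of the first-word interpolation follows. It assumes each component probability is a relative frequency estimated from training counts, which the slides do not state; the class name `FirstWordModel`, the lambda values (0.6, 0.3, 0.1) and the toy observations are all hypothetical.

```python
from collections import Counter


class FirstWordModel:
    """Linear interpolation of three estimates of the first name class."""

    def __init__(self, lambdas=(0.6, 0.3, 0.1)):
        # Hypothetical weights; the slides only say they are set empirically.
        if abs(sum(lambdas) - 1.0) > 1e-9:
            raise ValueError("lambdas must sum to 1.0")
        self.lambdas = lambdas
        self.joint = Counter()    # counts of (name class, context)
        self.context = Counter()  # counts of context alone

    @staticmethod
    def _contexts(word, feat):
        # Full context, context with the word backed off, and empty context.
        return ((word, feat), (None, feat), ())

    def observe(self, nc, word, feat):
        """Record one sentence-initial training example."""
        for ctx in self._contexts(word, feat):
            self.joint[(nc, ctx)] += 1
            self.context[ctx] += 1

    def prob(self, nc, word, feat):
        """Interpolated estimate; contexts never seen contribute zero."""
        total = 0.0
        for lam, ctx in zip(self.lambdas, self._contexts(word, feat)):
            seen = self.context[ctx]
            if seen:
                total += lam * self.joint[(nc, ctx)] / seen
        return total


model = FirstWordModel()
model.observe("PROTEIN", "JAK", "cap")
model.observe("UNK", "Activation", "ini")
model.observe("PROTEIN", "Interleukin", "ini")
p_seen = model.prob("PROTEIN", "JAK", "cap")
p_unseen_word = model.prob("PROTEIN", "Unseen", "ini")
```

For an unseen word the first term vanishes and the estimate falls back on the feature-only and class-prior terms, which is how the interpolation softens data sparseness.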

Interpolating HMM model specification
•Let (2) be
$$\begin{aligned}
&\lambda_0 \Pr(NC_t \mid W_t, F_t, W_{t-1}, F_{t-1}, NC_{t-1}) \\
+\; &\lambda_1 \Pr(NC_t \mid \_, F_t, W_{t-1}, F_{t-1}, NC_{t-1}) \\
+\; &\lambda_2 \Pr(NC_t \mid W_t, F_t, \_, F_{t-1}, NC_{t-1}) \\
+\; &\lambda_3 \Pr(NC_t \mid \_, F_t, \_, F_{t-1}, NC_{t-1}) \\
+\; &\lambda_4 \Pr(NC_t \mid NC_{t-1}) \\
+\; &\lambda_5 \Pr(NC_t)
\end{aligned}$$
for $\sum_i \lambda_i = 1.0$ and $\lambda_0 \ge \lambda_1 \ge \dots \ge \lambda_5$, where
 - $\lambda_x$: empirically determined constant
 - $NC_t$: next name class (state) in the sequence
 - $W_t$: next word in observed emission
 - $F_t$: feature belonging to the next word
•The optimal path is recovered using the Viterbi algorithm
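The Viterbi step can be sketched generically as below. The two scoring callbacks stand in for distributions (1) and (2); their names, the two-state inventory and the toy probabilities are invented for illustration and are not values from the presented system.

```python
import math


def viterbi(n_steps, states, first_logp, step_logp):
    """Return the highest-scoring state sequence for n_steps >= 1 positions.

    first_logp(state) scores the first position (distribution 1);
    step_logp(prev, state, t) scores position t given the previous state
    (distribution 2).
    """
    best = {s: first_logp(s) for s in states}
    backpointers = []
    for t in range(1, n_steps):
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + step_logp(p, s, t))
            scores[s] = best[prev] + step_logp(prev, s, t)
            pointers[s] = prev
        best = scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy setup with made-up probabilities, for illustration only.
WORDS = ["of", "JAK", "kinases"]
STATES = ["PROTEIN", "UNK"]


def toy_first(state):
    return math.log({"PROTEIN": 0.1, "UNK": 0.9}[state])


def toy_step(prev, state, t):
    if WORDS[t] == "JAK":
        p = {"PROTEIN": 0.8, "UNK": 0.2}[state]
    elif prev == "PROTEIN":
        p = {"PROTEIN": 0.7, "UNK": 0.3}[state]
    else:
        p = {"PROTEIN": 0.1, "UNK": 0.9}[state]
    return math.log(p)


path = viterbi(len(WORDS), STATES, toy_first, toy_step)
```

Working in log space avoids underflow on long sentences; a callback that can return a zero probability would need smoothing before taking the logarithm.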

Interpolating HMM model specification
Character features:
Code  Feature             Example
dig   DigitNumber         15
sin   SingleCapital       M
grk   GreekLetter         alpha
cad   CapsAndDigits       I2
cap   AtLeastTwoCaps      RalGDS
lad   LettersAndDigits    il2
fst   FirstWord           (first word in sentence)
ini   InitCap             Interleukin
lcp   LowerCaps           kappaB
low   LowerCase           kinases
hyp   Hyphen              -
opp   OpenParenthesis     (
clp   CloseParenthesis    )
fsp   FullStop            .
cma   Comma               ,
pct   Percent             %
osq   OpenSquareBracket   [
csq   CloseSquareBracket  ]
cln   Colon               :
scn   SemiColon           ;
det   Determiner          the
con   Conjunction         and
oth   Other               *, +, #, @
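One way such a feature table might be applied is sketched below. The precedence of the tests, the word lists for Greek letters, determiners and conjunctions, and the function name `char_feature` are assumptions made for this example; the slides give only the codes and one example each.

```python
# Illustrative word lists; the real system's lists are not given in the slides.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
DETERMINERS = {"the", "a", "an"}
CONJUNCTIONS = {"and", "or", "but"}
PUNCTUATION = {
    "-": "hyp", "(": "opp", ")": "clp", ".": "fsp", ",": "cma",
    "%": "pct", "[": "osq", "]": "csq", ":": "cln", ";": "scn",
}


def char_feature(token, is_first=False):
    """Map a token to one feature code, testing in an assumed precedence."""
    if token in PUNCTUATION:
        return PUNCTUATION[token]
    if token.isdigit():
        return "dig"
    if len(token) == 1 and token.isupper():
        return "sin"
    if token.lower() in GREEK:
        return "grk"
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    if has_digit and has_alpha:
        # All cased letters upper-case: CapsAndDigits, else LettersAndDigits.
        return "cad" if token.isupper() else "lad"
    n_caps = sum(c.isupper() for c in token)
    if n_caps >= 2:
        return "cap"
    if is_first:
        return "fst"
    if token[:1].isupper() and token[1:].islower():
        return "ini"
    if n_caps == 1 and token[:1].islower():
        return "lcp"
    if token in DETERMINERS:
        return "det"
    if token in CONJUNCTIONS:
        return "con"
    if token.islower():
        return "low"
    return "oth"


codes = [char_feature(t) for t in ["15", "M", "alpha", "I2", "RalGDS", "il2"]]
```

Because each token receives exactly one code, the order of the tests matters: determiners and conjunctions are checked before the general lower-case test so that "the" is not labelled `low`.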

Experiments (molecular biology)
•Interpolating HMM (NEHMM)
•Domain of biochemistry: human + blood cell + transcription factor
•Corpus: 100 MEDLINE abstracts
 - 80 for training, 20 for testing, with 5-fold cross-validation
 - Tagged by a domain expert
 - Developed at the Tsujii laboratory (U. Tokyo)
•Ontology: a simple taxonomy that forbids term class overlap, based on substance characteristics (rather than e.g. role)

Tag set for molecular biology
Class      #     Example                      Description
PROTEIN    2125  JAK kinase                   proteins, protein groups, families, complexes and substructures
DNA        358   IL-2 promoter                DNAs, DNA groups, regions and genes
RNA        30    TAR                          RNAs, RNA groups, regions and genes
SOURCE.cl  93    leukemic T cell line Kit225  cell line
SOURCE.ct  417   human T lymphocytes          cell type
SOURCE.mo  21    Schizosaccharomyces pombe    mono-organism
SOURCE.mu  64    mice                         multi-organism
SOURCE.vi  90    HIV-1                        viruses
SOURCE.sl  77    membrane                     sub-location
SOURCE.ti  37    central nervous system       tissue
UNK        -     tyrosine phosphorylation     background words

Experiments (news)
•Interpolating HMM (NEHMM)
•Domain of news: MUC-6 dry run and formal run test set
•Corpus: 60 news texts
 - 50 for training, 10 for testing, with 6-fold cross-validation
•Ontology: no explicit ontology; MUC-6 tagging guidelines
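Both experimental setups (100 abstracts with 5 folds, 60 news texts with 6 folds) follow the same k-fold pattern, sketched below. The round-robin assignment of documents to folds and the function name `k_fold_splits` are assumptions; the slides do not say how the folds were drawn.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs, using each of the k folds once as test set."""
    # Round-robin assignment keeps the fold sizes as equal as possible.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Document identifiers stand in for the actual texts.
biology_splits = list(k_fold_splits(list(range(100)), 5))
news_splits = list(k_fold_splits(list(range(60)), 6))
```

Each biology split has 80 training and 20 test documents, and each news split has 50 and 10, matching the figures on the slides; every document is tested exactly once.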

Tag set for news
Class         #     Example             Description
ORGANISATION  1783  Harvard Law School  names of organisations
PERSON        838   Washington          names of people
LOCATION      390   Houston             names of places, countries etc.
DATE          542   1970s               date expressions
TIME          3     midnight            time expressions
MONEY         423   $ 10 million        money expressions
PERCENT       108   2.5%                percentage expressions
UNK           -     start-up costs      background words

Results for news tests - comparison with molecular biology tests
System           News  Biology
HMM (w/ Unity)   78.4  75.0
HMM (w/o Unity)  74.2  73.1
Table 2: F-score all-class averages for news and molecular biology test sets
F-score = (2 x Precision x Recall) / (Precision + Recall)
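The F-score formula can be computed as in the sketch below. Scoring by exact match of (start, end, class) spans is an assumption for illustration, as are the function names and the toy spans; the slides do not specify the matching criterion.

```python
def precision_recall(gold, predicted):
    """Precision and recall over exactly matching (start, end, class) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f_score(precision, recall):
    """F-score = (2 x Precision x Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy spans, invented for the example.
gold_spans = [(0, 1, "PROTEIN"), (3, 4, "DNA")]
predicted_spans = [(0, 1, "PROTEIN"), (5, 6, "RNA")]
p, r = precision_recall(gold_spans, predicted_spans)
f = f_score(p, r)
```

The harmonic mean penalises a large gap between precision and recall, so a tagger cannot score well by over-generating or under-generating names.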

Analysis
•Identification was far harder than classification, owing to linguistic structures such as:
 - Coordination, e.g. c-rel and v-rel (proto) oncogenes
 - Apposition, e.g. The transcription factor NF-Kappa B ...
 - Abbreviation, e.g. the Interleukin-2 (IL-2) promoter ...

Analysis
•Ways forward:
 1. Use some identification method other than HMM?
 2. We estimate that the training texts are no more than 95% consistent between human taggers; improve the consistency of tagging with better guidelines?
 3. Incorporate nested tagging to model term-internal dependencies, or use a domain-independent dependency analyser?

Conclusion
 1. The HMM performed quite well overall considering the training data size.
 2. The local context and small feature set limitations of the HMM need to be overcome in future models for complex local linguistic structures.
 3. The model needs to make use of element type name relations, such as combination relations and element attributes held inside the DTD, as well as integrating ontological knowledge held e.g. in RDF(S).
