Machine Learning for Information Extraction from XML marked-up text on the Semantic Web
Nigel Collier
National Institute of Informatics, Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan
May 1st 2001
Semantic Web Workshop 2001 at WWW10
[Diagram: system overview. Labels: Indexed document collection; Local search engine; Smart (IE) engine; Tagged document collection; Document.xml; Question; Answer-Document.xml; Annotation Learner; Annotation Tagger; XML editor; Smart XML editor. Tagged text is shown in the form <x> Y </x>.]
No <PROTEIN>STAT</PROTEIN> activity was detected in <SOURCE subtype="ct">TCR-stimulated lymphocytes</SOURCE>, indicating that the <PROTEIN>JAK</PROTEIN>/<PROTEIN>STAT</PROTEIN> pathway defined in this study constitutes an <PROTEIN>IL-2R</PROTEIN>-mediated signaling event which is not shared by the <PROTEIN>TCR</PROTEIN>.
UNK UNK PROTEIN PROTEIN UNK PROTEIN PROTEIN UNK SOURCE.ct SOURCE.ct SOURCE.ct UNK
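The step from an annotated sentence to a class sequence like the one above can be sketched in a few lines of Python. This is a minimal illustration, not the tagger used in the talk: the function name `label_tokens` is hypothetical, tokenisation is plain whitespace splitting, and the input is assumed to be well-formed XML with quoted attribute values and no nested annotations.

```python
import xml.etree.ElementTree as ET


def label_tokens(annotated, background="UNK"):
    """Return (token, class) pairs for a sentence with inline XML annotation.

    Tokens inside an element get the element name as class, extended with
    ".subtype" when a subtype attribute is present; all other tokens get
    the background class.
    """
    # Wrap the sentence in a dummy root so mixed content parses as one element.
    root = ET.fromstring(f"<s>{annotated}</s>")
    pairs = []

    def emit(text, label):
        # Whitespace tokenisation is an assumption made for this sketch.
        for token in (text or "").split():
            pairs.append((token, label))

    emit(root.text, background)           # text before the first annotation
    for elem in root:
        label = elem.tag
        subtype = elem.get("subtype")
        if subtype:
            label = f"{label}.{subtype}"  # e.g. SOURCE.ct
        emit("".join(elem.itertext()), label)
        emit(elem.tail, background)       # text up to the next annotation
    return pairs


sentence = ('No <PROTEIN>STAT</PROTEIN> activity was detected in '
            '<SOURCE subtype="ct">TCR-stimulated lymphocytes</SOURCE>.')
for token, label in label_tokens(sentence):
    print(token, label)
```

Each token is printed with its class, so `STAT` is paired with `PROTEIN`, both words of the cell-type phrase with `SOURCE.ct`, and the remaining words with `UNK`.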
Code  Feature             Example
dig   DigitNumber         15
sin   SingleCapital       M
grk   GreekLetter         alpha
cad   CapsAndDigits       I2
cap   AtLeastTwoCaps      RalGDS
lad   LettersAndDigits    il2
fst   FirstWord           (first word in sentence)
ini   InitCap             Interleukin
lcp   LowerCaps           kappaB
l…    LowerCase           kinases
hyp   Hyphen              -
…     OpenParenthese      (
clp   CloseParenthese     )
fsp   FullStop            .
cma   Comma               ,
pct   Percent             %
…     OpenSquareBracket   [
csq   CloseSquareBracket  ]
cln   Colon               :
scn   SemiColon           ;
det   Determiner          the
con   Conjunction         and
…h    Other               *, +, #, @
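A rough Python sketch of how one character feature from the table above might be assigned to each word follows. The slides give only feature names and one example each, so the order in which the tests are applied, the closed-class word lists, and the function name `char_feature` are all assumptions made for illustration; feature names are returned in full rather than as the short codes.

```python
import re

# Single-character punctuation features named in the table.
PUNCTUATION = {
    "-": "Hyphen", "(": "OpenParenthese", ")": "CloseParenthese",
    ".": "FullStop", ",": "Comma", "%": "Percent",
    "[": "OpenSquareBracket", "]": "CloseSquareBracket",
    ":": "Colon", ";": "SemiColon",
}
# Illustrative word lists (assumed; the table shows one example per feature).
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "lambda"}
DETERMINERS = {"the", "a", "an"}
CONJUNCTIONS = {"and", "or", "but"}


def char_feature(word, first_in_sentence=False):
    """Return one character-feature name for a word (assumed precedence)."""
    if word in PUNCTUATION:
        return PUNCTUATION[word]
    if word in DETERMINERS:
        return "Determiner"
    if word in CONJUNCTIONS:
        return "Conjunction"
    if word in GREEK:
        return "GreekLetter"
    if word.isdigit():
        return "DigitNumber"
    if re.fullmatch(r"[A-Z]", word):
        return "SingleCapital"
    has_digit = any(c.isdigit() for c in word)
    has_alpha = any(c.isalpha() for c in word)
    if has_digit and has_alpha and word.isalnum():
        # Only capitals and digits, e.g. I2; otherwise mixed case, e.g. il2.
        if re.fullmatch(r"[A-Z0-9]+", word):
            return "CapsAndDigits"
        return "LettersAndDigits"
    if sum(c.isupper() for c in word) >= 2:
        return "AtLeastTwoCaps"
    if first_in_sentence:
        return "FirstWord"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return "InitCap"
    if re.fullmatch(r"[a-z]+[A-Z][A-Za-z]*", word):
        return "LowerCaps"
    if re.fullmatch(r"[a-z]+", word):
        return "LowerCase"
    return "Other"


for w in ["15", "M", "alpha", "I2", "RalGDS", "il2", "Interleukin", "kappaB", "kinases"]:
    print(w, char_feature(w))
```

With this ordering each example word from the table receives the feature listed beside it. A different precedence would change the outcome for words that satisfy several tests, such as a capitalised word at the start of a sentence.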
Class      #     Example                      Description
PROTEIN    2125  JAK kinase                   proteins, protein groups, families, complexes and substructures
DNA        358   IL-2 promoter                DNAs, DNA groups, regions and genes
RNA        30    TAR                          RNAs, RNA groups, regions and genes
SOURCE.cl  93    leukemic T cell line Kit225  cell line
SOURCE.ct  417   human T lymphocytes          cell type
SOURCE.mo  21    Schizosacchar-               mono-organism
SOURCE.mu  64    mice                         multi-organism
SOURCE.vi  90    HIV-1                        viruses
SOURCE.sl  77    membrane                     sub-location
SOURCE.ti  37    central nervous system       tissue
UNK              phosphorylation              background words
Class         #     Example             Description
ORGANISATION  1783  Harvard Law School  names of organisations
PERSON        838   Washington          names of people
LOCATION      390   Houston             names of places, countries etc.
DATE          542   1970s               date expressions
TIME          3     midnight            time expressions
MONEY         423   $10 million         money expressions
PERCENT       108   2.5%                percentage expressions
UNK                                     background words