BioNLP for NLPeople
CS5832/HLT-NAACL/RANLP
The weirdest job in the world
How I got here
Interactive Voice Response (yuck)
software engineer
What is BioNLP?
to biomedical language
– Publications
– Medical records
– Ontologies
Part 0
Why a field called BioNLP?
There is little reason for the data on which a linguist works to have the right to name that work.
Shuy 2002:8
(One lab’s) funding for NLP in computational biology
Alcoholism) $5M, 5 years
years)
3 years)
years)
Why biologists care
But, I’m an NLPerson (computer scientist, mathematician, engineer…)
than in newswire text
domain than in newswire text
Resources
The big drawing point for NLPeople
– Lexical resources
– 500 * 16M words of text
– Labelled training data
– NER, POS taggers, parsers, semantic normalizers....
Job market
– US, Europe
specific right now
Surely Shuy jests...
There is little reason for the data on which a linguist works to have the right to name that work.
It really is different on every level
NLP actually could make the world a better place....
An embarrassing truth about BioNLP...
www.chilibot.net
Part 1: Just enough biology
Cells and proteins
<illustration: cell, structures, proteins>
How biologists see the world
Wattarujeekrit et al. (2004)
The Central Dogma: from genes to proteins
http://www.swbic.org/products/clipart/images/dogmag.jpg
The Central Dogma: from genes to proteins
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/central_dogma.gif
Higher-level structures
Biological structures are complex
V-SNARE: Vesicle SNARE
SNARE: SNAP Receptor
SNAP: Soluble NSF Attachment Protein
NSF: N-Ethylmaleimide-Sensitive Fusion Protein
N-Ethylmaleimide: Maleic acid N-ethylimide
Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)
Part 2: Why bioscientists fund and publish research in BioNLP
Two basic markets, multiple user types
– Clinicians
– Consumers
– “Informationists”
– Administrators (billing, quality assurance, ...)

– High-throughput experimentalists
– “Bench scientists”
– Model organism database curators
Structured vocabulary Free text (phenotypes)
122 references...
Medical
<scanned picture of business card>
<happy-face photo>
One year later…
A sad story: physicians don’t buy a lot of NLP software
Another sad story: trying to sell “gisting” to physicians
Sold for $400K: 14.5 or 2.9¢ on the dollar…
Salesperson’s thought process
Physician’s thought process
Genomics
Why biologists care
10 years ago...
Today....
Double exponential growth in the literature
New entries in Medline with publication date in Jan-Aug 2005: 431,478 (avg. 1,775/day)
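The per-day average can be checked with a few lines of Python. This is only a sanity check, and it assumes the period runs from 1 January through 31 August 2005 inclusive, which the slide does not state explicitly.

```python
from datetime import date

# Figure quoted on the slide: new Medline entries dated Jan-Aug 2005.
new_entries = 431_478

# Assumption: the period is 1 Jan 2005 through 31 Aug 2005, inclusive.
days = (date(2005, 8, 31) - date(2005, 1, 1)).days + 1

per_day = new_entries / days
print(days, int(per_day))
```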
Biological Nomenclature: “V-SNARE”
V-SNARE: Vesicle SNARE
SNARE: SNAP Receptor
SNAP: Soluble NSF Attachment Protein
NSF: N-Ethylmaleimide-Sensitive Fusion Protein
N-Ethylmaleimide: Maleic acid N-ethylimide
Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)
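The nesting in the V-SNARE name can be reproduced mechanically. The sketch below is a minimal illustration in Python: the table holds only the expansions shown on the slide, and keying the last entry on the hyphenated token "N-Ethylmaleimide-Sensitive" is a simplification made here so that plain whitespace splitting is enough.

```python
# Abbreviation table taken from the expansion chain on the slide.
# The last key is a simplification: it lets whitespace splitting find
# "N-Ethylmaleimide" even though it is glued to "-Sensitive".
EXPANSIONS = {
    "V-SNARE": "Vesicle SNARE",
    "SNARE": "SNAP Receptor",
    "SNAP": "Soluble NSF Attachment Protein",
    "NSF": "N-Ethylmaleimide-Sensitive Fusion Protein",
    "N-Ethylmaleimide-Sensitive": "Maleic acid N-ethylimide Sensitive",
}


def expand(term: str) -> str:
    """Recursively replace every known abbreviation in `term`."""
    expanded = []
    for word in term.split():
        if word in EXPANSIONS:
            expanded.append(expand(EXPANSIONS[word]))
        else:
            expanded.append(word)
    return " ".join(expanded)


print(expand("V-SNARE"))
```

A seven-character symbol unpacks into a thirteen-word name, which is one reason dictionary lookup alone does not solve gene and protein name recognition.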
Part 3: Some things that make BioNLP different
Named Entity Recognition
Genes have names??
Suzanna Lewis
It is the middle of the night (2:38 to be precise), I am away from friends and family, It has been this way for over 2 years, I can't sleep because of all the work there is yet to do, and there is no end in sight. So when do the magic little elves appear out of nowhere and get everything done? p.s. I am serious.
pray for elves
abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae. (FlyBase report FBal0138651)
Named entity recognition
problem:
– large list of classes
– some of them much harder
– as posed, not useful
white white
"wild-type" (not mutated)
white
"mutant"
white
white
Case is meaningful
white (Symbol: w)
White (Symbol: W)
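A small Python sketch of why case folding, a routine normalization step elsewhere in NLP, is harmful here. The two-entry table contains only the symbols from the slide; the function names are illustrative.

```python
# The two Drosophila symbols from the slide, which differ only in case.
GENE_SYMBOLS = {
    "w": "white",
    "W": "White",
}


def lookup(symbol: str):
    """Case-sensitive lookup; returns None for unknown symbols."""
    return GENE_SYMBOLS.get(symbol)


def lookup_casefolded(symbol: str):
    """Lowercase first, as a generic pipeline might; returns all matches."""
    folded = symbol.lower()
    return sorted(
        name for sym, name in GENE_SYMBOLS.items() if sym.lower() == folded
    )


print(lookup("w"), lookup("W"))   # two different genes
print(lookup_casefolded("W"))     # ambiguous once case is discarded
```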
Yes, there are genes with the symbols I, a, R, p....
Case is meaningful
Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. (Ruan et al. 2002)
…even sentence-initially.
sunday driver (syd) was identified in a screen for novel axonal transport mutants in Drosophila. Syd is a ~137kDa protein that is broadly conserved in evolution with homologous proteins identified in C. elegans, mouse and human. (Bowman 2000)
Case is meaningful
Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.
Surely you could determine on a document-by-document basis…
Axonal traffic jams with a sunday driver: Identification of a broadly conserved transmembrane protein required for axonal transport in Drosophila. (Bowman 2000)
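The document-by-document idea fails because both casings of the same name occur inside a single abstract. A minimal Python sketch that lists tokens seen with more than one casing, run on two sentences from the Misshapen/Bifocal abstract quoted earlier (the function name is illustrative):

```python
import re
from collections import defaultdict


def case_variants(text: str) -> dict:
    """Map each lowercased token to the casings seen in `text`,
    keeping only tokens that occur with more than one casing."""
    seen = defaultdict(set)
    for token in re.findall(r"[A-Za-z]+", text):
        seen[token.lower()].add(token)
    return {key: forms for key, forms in seen.items() if len(forms) > 1}


abstract = (
    "Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, "
    "is a component of the Msn pathway for regulating R cell growth targeting. "
    "bif displays strong genetic interaction with msn."
)
print(case_variants(abstract))
```

Here "Bif"/"bif" and "Msn"/"msn" both turn up, and "bif" is lowercase even at the start of a sentence, so no single per-document casing rule would be safe.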
Evolution
Looks like…
Acts like…
Metaphor/metonymy
whimsy
moonshine (16 zebrafish genes)
(32 Drosophila genes)
But, that’s not the only way of naming genes....
kinase kinase 5
Worst gene names
repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
epidermal growth factor homology domains
“Gene mention” (NER)
Yeh et al. (2005)
Good systems?
protein 60) (base NP chunking, abbreviation definitions, post-processing)
(MaxEnt, HMM, CRF, SVM) (or a clever hack)
The Jim Martin technique really works
Kinoshita et al. (2005)
...which isn’t to say that external knowledge is bad
Poesio’s use of Google
Most feature sets include...
– Patterns like \w+-?\d+
– Contains Greek letters
– Next word is “protein”
– Followed by “protein” somewhere else in document
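The four features listed above could be computed roughly as follows. This is a hedged sketch, not any particular system's feature extractor: the Greek-name list is a small illustrative subset, the function and feature names are made up here, and the document is assumed to be tokenized on whitespace already.

```python
import re

SYMBOL_PATTERN = re.compile(r"\w+-?\d+")  # the pattern quoted on the slide
# Assumption: a small illustrative subset of spelled-out Greek letter names.
GREEK_NAMES = {"alpha", "beta", "gamma", "delta", "kappa"}


def contains_greek(token):
    """True if the token has a Greek character or a spelled-out Greek letter."""
    lowered = token.lower()
    if any("\u03b1" <= ch <= "\u03c9" for ch in lowered):
        return True
    parts = re.split(r"[\W\d_]+", lowered)
    return any(part in GREEK_NAMES for part in parts)


def followed_by_protein(tokens, i):
    """True if the token after position i is the word "protein"."""
    return i + 1 < len(tokens) and tokens[i + 1].lower() == "protein"


def token_features(tokens, i):
    """Features for tokens[i]; `tokens` is the whole tokenized document."""
    token = tokens[i]
    return {
        "symbol_pattern": SYMBOL_PATTERN.fullmatch(token) is not None,
        "contains_greek": contains_greek(token),
        "next_is_protein": followed_by_protein(tokens, i),
        # Same string followed by "protein" at some other position.
        "protein_elsewhere": any(
            j != i and tokens[j] == token and followed_by_protein(tokens, j)
            for j in range(len(tokens))
        ),
    }


# Invented example text, pre-tokenized on whitespace.
doc = ("The Hsp-60 protein binds TGF-beta . Later , p53 was studied ; "
       "p53 protein levels rose .").split()
for i in (1, 4, 8):
    print(doc[i], token_features(doc, i))
```

The last feature is the document-level one: the first mention of p53 gets no local cue, but a later "p53 protein" elsewhere in the text still marks it as a likely gene or protein name.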
Why not better?
– Inconsistency
– Multiple correct answers
– Inter-corpus differences in definition
Yeh et al. (2005)
Length effect (and why the Jim Martin technique works so well for this)
Kinoshita et al. (2005)
A great research project
– Species
– Laboratory techniques
– Cell types
– Cell lines
– Tissues
– ....
...and, NER isn’t what you need anyways
Tokenization
phrase chunker
– Start from right, move left
comma, period, or right parenthesis, add it
Tokenization
– 2,6-diaminohexanoic acid
– tricyclo(3.3.1.13,7)decanone
Four kinds of hyphens
– Calcium-dependent
– Hsp-60
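To see what goes wrong, here is a deliberately naive tokenizer of the kind that works tolerably on newswire, applied to the examples above. The regular expression is an illustration chosen for this sketch, not a tokenizer from any named toolkit.

```python
import re


def naive_tokenize(text):
    """Split off every punctuation character as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


for name in ["2,6-diaminohexanoic acid",
             "tricyclo(3.3.1.13,7)decanone",
             "Hsp-60"]:
    print(name, "->", naive_tokenize(name))
```

The single chemical name tricyclo(3.3.1.13,7)decanone comes out as thirteen tokens, and the hyphen in Hsp-60 is treated exactly like the one in Calcium-dependent, although only the second joins two ordinary words.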
B-cell-CD4(+)-T-cell interactions
Special challenges in biomedical corpus construction
rat epithelial growth factor receptor 2 ?
named entities
tricyclo(3.3.1.13,7)decanone
named entities
linguistics graduate student to tag rat epithelial growth factor receptor 2?
tagging when you don’t have syntactically tagged text?
Some specific cases of word sense disambiguation
Abbreviation disambiguation
(Jeff Chang’s paper)
– Chang
– Schwartz and Hearst
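For orientation, here is a simplified Python sketch in the spirit of the Schwartz and Hearst approach to finding "long form (SF)" definitions: take the text before a parenthesized short form and match the short form's characters right to left, requiring the first character to start a word. The function names, the validity check, and the example sentence are illustrative; details such as the candidate window size are written from memory of the published description and should be checked against it. The reverse pattern, "SF (long form)", is not handled.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against `candidate`.
    Returns a matching long form, or None if the characters cannot be found."""
    si = len(short_form) - 1
    li = len(candidate) - 1
    while si >= 0:
        ch = short_form[si].lower()
        if not ch.isalnum():
            si -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (li >= 0 and candidate[li].lower() != ch) or (
            si == 0 and li > 0 and candidate[li - 1].isalnum()
        ):
            li -= 1
        if li < 0:
            return None
        li -= 1
        si -= 1
    start = candidate.rfind(" ", 0, li + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        # Simplified validity check: 2-10 characters, at least one letter.
        if not (2 <= len(short_form) <= 10) or not any(c.isalpha() for c in short_form):
            continue
        words = sentence[:match.start()].split()
        # Assumed window: at most min(|SF| + 5, 2 * |SF|) preceding words.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


print(extract_pairs(
    "Mutations in the epidermal growth factor receptor (EGFR) are common."
))
```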
Part 4: getting up to speed
(about) 10 papers and resources that will let you read most other papers in BioNLP
Named entity recognition 1: rule-based
– Find something that looks like a symbol for a yeast gene (ABC1)
– Extend name to the left (yeast ABC1)
– Extend name to the right (ABC1 protein)
– Never replicated
– Yeast is easy
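The three steps above can be sketched in a few lines of Python. This is not the original system: the symbol pattern (three capital letters plus digits) and the cue-word lists are assumptions made for illustration.

```python
import re

# Assumption: a yeast gene symbol is three capital letters plus digits (ABC1).
SYMBOL = re.compile(r"^[A-Z]{3}\d+$")
LEFT_CUES = {"yeast"}             # hypothetical cue words for extending left
RIGHT_CUES = {"protein", "gene"}  # hypothetical cue words for extending right
PUNCT = ".,;:()"


def find_gene_names(sentence):
    """Find symbol-like tokens, then extend each name left and right."""
    tokens = sentence.split()
    names = []
    for i, token in enumerate(tokens):
        # Step 1: something that looks like a yeast gene symbol.
        if not SYMBOL.match(token.strip(PUNCT)):
            continue
        start, end = i, i
        # Step 2: extend to the left ("yeast ABC1").
        if i > 0 and tokens[i - 1].strip(PUNCT).lower() in LEFT_CUES:
            start = i - 1
        # Step 3: extend to the right ("ABC1 protein").
        if i + 1 < len(tokens) and tokens[i + 1].strip(PUNCT).lower() in RIGHT_CUES:
            end = i + 1
        names.append(" ".join(t.strip(PUNCT) for t in tokens[start:end + 1]))
    return names


print(find_gene_names("We studied yeast ABC1 protein and CDC28."))
```

The approach leans entirely on how regular yeast nomenclature is, which is the point of "Yeast is easy": the same pattern would miss Drosophila names such as sunday driver completely.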
Named entity recognition 2: machine learning
NER 3: state of the art
Information extraction 1: rule-based
Information extraction 2: machine learning
– Protein/protein
– Protein/disease
– Protein/?
as +/- positing a relation
– Features: bag-of-words
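A toy version of this setup, treating each sentence as positive or negative for positing a relation and using bag-of-words features. The perceptron is one arbitrary choice of learner picked for brevity, not the method of any paper discussed here, and the training sentences are invented, with protein mentions already masked as PROT1 and PROT2.

```python
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts; word order is discarded."""
    return Counter(sentence.lower().split())


def score(weights, bias, feats):
    return bias + sum(weights[word] * count for word, count in feats.items())


def train_perceptron(examples, epochs=10):
    """Plain perceptron; labels are +1 (relation posited) or -1 (not)."""
    weights, bias = Counter(), 0
    for _ in range(epochs):
        for sentence, label in examples:
            feats = bag_of_words(sentence)
            if label * score(weights, bias, feats) <= 0:  # misclassified
                for word, count in feats.items():
                    weights[word] += label * count
                bias += label
    return weights, bias


def predict(model, sentence):
    weights, bias = model
    return 1 if score(weights, bias, bag_of_words(sentence)) > 0 else -1


# Invented toy data, for illustration only.
TRAIN = [
    ("PROT1 binds PROT2 in vitro", 1),
    ("PROT1 interacts with PROT2", 1),
    ("PROT1 phosphorylates PROT2", 1),
    ("PROT1 and PROT2 were purified", -1),
    ("PROT1 and PROT2 are expressed in liver", -1),
    ("we sequenced PROT1 and PROT2", -1),
]

model = train_perceptron(TRAIN)
print(predict(model, "PROT1 binds PROT2"))
print(predict(model, "PROT1 and PROT2 were sequenced"))
```

With only word counts as features, the classifier learns that verbs like "binds" signal a relation and that coordination ("and") does not; it has no way to tell which protein does what to which.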
Information extraction 3: rules, linguistics, knowledge
Corpora: 1
– MEDLINE: database of 16M+ abstracts
– PubMed: interface for searching MEDLINE
– ASCII and free
NOT a corpus—not really even a “text collection”
Corpora: 2
– Fully annotated corpus
– 2,000 abstracts
– X00,000 words
– Now: POS, named entities, 25% treebanked
– Coming: anaphora; events?; PAS?; dependency parses?
Lexical resources: 1
– Biological functions
– Molecular processes
– Cell components
– Terms + definitions
– Is-a, part-of
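An ontology with is-a and part-of links is a directed graph, and the usual operation on it is collecting everything a term inherits from. A minimal Python sketch over a toy graph follows; the terms and links are invented for illustration and are not copied from any real ontology release.

```python
# Toy ontology: term -> list of (relation, parent). Illustrative only.
ONTOLOGY = {
    "nuclear membrane": [("part-of", "nucleus")],
    "nucleus": [("is-a", "organelle"), ("part-of", "cell")],
    "organelle": [("is-a", "cell component")],
    "cell": [("is-a", "cell component")],
    "cell component": [],
}


def ancestors(term, relations=("is-a", "part-of")):
    """All terms reachable from `term` by following the given relations."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in ONTOLOGY.get(current, []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nuclear membrane")))
print(sorted(ancestors("nucleus", relations=("is-a",))))
```

Restricting the traversal to is-a links gives a different answer from following both relation types, so any use of the hierarchy (for example, backing off from a specific term to a more general one) has to decide which links count.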
Lexical resources: 2
– Names
– Symbols
– Synonyms
– Protein products
– “Summary”
– Gene References Into Function
Lexical resources: 3
System)
– MetaThesaurus
– Semantic Network
Tools overview
specific task
Tools: 1
– GENIA
– MEDPOST
– LingPipe?
Tools: 2
– ABNER (Settles 200x)
– KeX
– AbGene
the world will beat a path to your door
Part 6: Current hot topics
What’s the right model for semantic representation?
enough
– Rzhetsky/GeneWays paper
– Penn folks/IE paper
– Native speaker intuitions (Juliane, etc.)
– Differentiating binary relations
– PAS
Karin: how do these representational choices affect what a biologist would get out
The ontology wars
– Hunter; PASBio; Barry Smith; L&C....
– GOA; MGI; EBI; ...
– Tsujii/Ananiadou; Pedersen/Pakhomov; Markert/Nissim...
True integration of NLP into laboratory data interpretation
book>
The embarrassing truth about BioNLP (take 2)...
References
in trademark disputes. Palgrave.
Marc Colosimo; and Lynette Hirschman (2005) BioCreative Task 1A: gene mention finding evaluation. BMC Bioinformatics 6(Suppl. 1):S2.