BioNLP for NLPeople


  1. BioNLP for NLPeople (CS5832/HLT-NAACL/RANLP): The weirdest job in the world

  2. The weirdest job in the world

  3. The weirdest job in the world

  4. How I got here

  5. How I got here. Employers: • Voice Input Technologies • Linguistix • Nationwide Insurance • MapQuest • Berdy Medical Systems • OneRealm [sic]. Roles: • Perl hacker, SLM data preprocessing • Linguist, corpus construction • Senior Programmer/Analyst, Interactive Voice Response (yuck) • Software test dept. manager; senior software engineer • Consultant/Perl hacker • Senior software engineer

  6. What is BioNLP? • Natural language processing applied to biomedical language – Publications – Medical records – Ontologies. Part 0

  7. Why a field called BioNLP? “There is little reason for the data on which a linguist works to have the right to name that work.” (Shuy 2002:8) (One lab’s) funding for NLP in computational biology: • INIA (Neuroinformatics of Alcoholism): $5M, 5 years • Wyeth Genomics Institute: $200K, 2 years • National Library of Medicine: $4.2M, 3 years • National Library of Medicine: $XM, 3 years

  8. Why biologists care • High-throughput data interpretation • Literature search • Annotation • Database construction. But I’m an NLPerson (computer scientist, mathematician, engineer…) • Hard, but might be possible • Might be harder in the biomedical domain than in newswire text • Might be more possible in the biomedical domain than in newswire text

  9. Resources. The big drawing point for NLPeople • Data – Lexical resources – 500 * 16M words of text – Labelled training data • Tools – NER, POS taggers, parsers, semantic normalizers… $$$

  10. Job market • Academia: great – US, Europe • Industry: not bad, but genomics-specific right now. Surely Shuy jests... “There is little reason for the data on which a linguist works to have the right to name that work.”

  11. It really is different on every level • Tokenization • Named entity recognition • Corpus construction • Semantic representation. NLP actually could make the world a better place…

  12. An embarrassing truth about BioNLP... www.chilibot.net

  13. Part 1: Just enough biology. Cells and proteins <illustration: cell, structures, proteins>

  14. How biologists see the world: Wattarujeekrit et al. (2004). The Central Dogma: from genes to proteins http://www.swbic.org/products/clipart/images/dogmag.jpg

  15. The Central Dogma: from genes to proteins http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/central_dogma.gif Higher-level structures • Genotype, phenotype • Tissue, organ, organism

  16. Biological structures are complex • V-SNARE = Vesicle SNARE • SNARE = SNAP Receptor • SNAP = Soluble NSF Attachment Protein • NSF = N-Ethylmaleimide-Sensitive Fusion Protein • N-Ethylmaleimide = Maleic acid N-ethylimide • Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor (Alex Morgan, MITRE). Part 2: Why bioscientists fund and publish research in BioNLP

  17. Two basic markets, multiple user types • Medical – Clinicians – Consumers – “Informationists” – Administrators (billing, quality assurance, ...) • “MolBio” (genomic) – High-throughput experimentalists – “Bench scientists” – Model organism database curators

  18. 18

  19. Structured vocabulary. Free text (phenotypes)

  20. 122 references... Medical

  21. 1997 <scanned picture of business card>

  22. <happy-face photo> One year later…

  23. A sad story: physicians don’t buy a lot of NLP software. Another sad story: trying to sell “gisting” to physicians

  24. Sold for $400K: 14.5 or 2.9¢ on the dollar… Salesperson’s thought process

  25. Physician’s thought process. Genomics

  26. Why biologists care • High-throughput data interpretation • Literature search • Annotation • Database construction. Why biologists care: 10 years ago...

  27. Why biologists care: today… Double exponential growth in the literature. New entries in Medline with publication date in Jan-Aug 2005: 431,478 (avg. 1775/day)
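The per-day figure on this slide can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming "Jan-Aug 2005" means 1 January through 31 August inclusive; the slide does not state the exact window, and the variable names are illustrative.

```python
from datetime import date

# Count taken from the slide: new Medline entries dated Jan-Aug 2005.
NEW_ENTRIES = 431_478

# Assumption: the window runs from 1 Jan 2005 through 31 Aug 2005 inclusive,
# so we measure up to (but not including) 1 Sep 2005.
days = (date(2005, 9, 1) - date(2005, 1, 1)).days

entries_per_day = NEW_ENTRIES / days
print(days, round(entries_per_day, 1))
```

Under that assumption the window is 243 days and the rate comes to about 1,775.6 entries per day, which agrees with the slide's "avg. 1775/day".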

  28. Biological Nomenclature: “V-SNARE” • V-SNARE = Vesicle SNARE • SNARE = SNAP Receptor • SNAP = Soluble NSF Attachment Protein • NSF = N-Ethylmaleimide-Sensitive Fusion Protein • N-Ethylmaleimide = Maleic acid N-ethylimide • Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor (Alex Morgan, MITRE). Part 3: Some things that make BioNLP different
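The nesting in the V-SNARE name can be made concrete with a small recursive expansion. This is only a sketch: the `EXPANSIONS` table and the `expand` function are hypothetical names, the entries are copied from the slide, and the hyphen in "N-Ethylmaleimide-Sensitive" is written as a space so that simple whitespace splitting works.

```python
# Abbreviation table copied from the slide (hypothetical data structure).
EXPANSIONS = {
    "V-SNARE": "Vesicle SNARE",
    "SNARE": "SNAP Receptor",
    "SNAP": "Soluble NSF Attachment Protein",
    # Hyphen before "Sensitive" replaced by a space for whitespace splitting.
    "NSF": "N-Ethylmaleimide Sensitive Fusion Protein",
    "N-Ethylmaleimide": "Maleic acid N-ethylimide",
}


def expand(term: str) -> str:
    """Recursively replace every known abbreviation with its long form."""
    if term not in EXPANSIONS:
        return term  # ordinary word: nothing to expand
    return " ".join(expand(word) for word in EXPANSIONS[term].split())


print(expand("V-SNARE"))
```

Five levels of lookup are needed before the short symbol turns into the eleven-word name shown on the slide.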

  29. Named Entity Recognition. Genes have names??

  30. Suzanna Lewis • Fruit fly geneticist • 5 kids • Latte + 3 shots. Suzanna Lewis: “It is the middle of the night (2:38 to be precise), I am away from friends and family, It has been this way for over 2 years, I can't sleep because of all the work there is yet to do, and there is no end in sight. So when do the magic little elves appear out of nowhere and get everything done? p.s. I am serious.”

  31. Suzanna Lewis: pray for elves. “D. melanogaster gene Pray For Elves, abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae.” (FlyBase report FBal0138651)

  32. “D. melanogaster gene Pray For Elves, abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae.” (FlyBase report FBal0138651) Named entity recognition • Molecular biology entity identification problem: – large list of classes – some of them much harder • Usual case-related cues don't help • More variability of content • Huge lexical ambiguity problem • Common English – as posed, not useful

  33. white: "wild-type" (not mutated)

  34. white: "mutant"

  35. Case is meaningful • White (symbol: W) • white (symbol: w)

  36. Yes, there are genes with the symbols I, a, R, p… Case is meaningful: Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock.

  37. Case is meaningful: Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. (Ruan et al. 2002) …even sentence-initially: sunday driver (syd) was identified in a screen for novel axonal transport mutants in Drosophila. Syd is a ~137kDa protein that is broadly conserved in evolution with homologous proteins identified in C. elegans, mouse and human. (Bowman 2000)

  38. Case is meaningful. Surely you could determine on a document-by-document basis… Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.
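The passage on this slide shows why a per-document casing decision does not work: both casings of each symbol occur in the same text. A minimal sketch that collects the case variants follows; the `case_variants` helper is hypothetical and the tokenizer is a deliberately crude regular expression.

```python
import re
from collections import defaultdict

# Text copied from the slide.
ABSTRACT = (
    "Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor "
    "(R cell) growth cone motility in response to targeting signals linked by "
    "the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a "
    "putative cytoskeletal regulator, is a component of the Msn pathway for "
    "regulating R cell growth targeting. bif displays strong genetic "
    "interaction with msn."
)


def case_variants(text: str, symbols: list[str]) -> dict[str, list[str]]:
    """Group every surface form of the given symbols by its lowercased form."""
    wanted = {s.lower() for s in symbols}
    found: dict[str, set[str]] = defaultdict(set)
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if token.lower() in wanted:
            found[token.lower()].add(token)
    return {key: sorted(forms) for key, forms in found.items()}


print(case_variants(ABSTRACT, ["msn", "bif"]))
```

Both "Msn"/"msn" and "Bif"/"bif" turn up within three sentences, and "bif" is lowercase even at the start of a sentence, so neither a document-level rule nor sentence-initial capitalization can be relied on.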

  39. Surely you could determine on a document-by-document basis… Axonal traffic jams with a sunday driver: Identification of a broadly conserved transmembrane protein required for axonal transport in Drosophila. (Bowman 2000) Evolution • What it looks like • What it acts like • Metaphor • …

  40. Looks like… • white • swiss cheese • clown • dachshund • dreadlocks. Acts like… • ether a go-go • lush • agnostic • amontillado

  41. Metaphor/metonymy • lot • maggie • scott of the antarctic • always early -> british rail • asp -> cleopatra • tudor -> vasa -> gustavus • nanos -> smaug. Whimsy • chablis, merlot, zinfandel, retsina, moonshine (16 zebrafish genes) • milkah, murashka, zolotistyuy, zloday (32 Drosophila genes)

  42. But, that’s not the only way of naming genes… • Breast cancer 1 (BRCA1) • p53 • Ribosomal protein S27 • Heat shock protein 110 • Mitogen activated protein kinase 15 • Mitogen activated protein kinase kinase kinase 5 • fuculokinase • cheap date • GABA • lush • Heat shock protein 60 • ken and barbie • calmodulin • ring • dHAND • to • suppressor of p53 • the • there • a
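Because this list includes names such as "to", "the", "there" and "a", plain dictionary lookup produces false positives on almost any sentence. The sketch below shows the effect; the `GENE_NAMES` set is a small subset of the slide's list, while the example sentence and the `naive_gene_matches` function are invented for illustration.

```python
import re

# A few single-token names from the slide (subset chosen for illustration).
GENE_NAMES = {"BRCA1", "p53", "lush", "ring", "to", "the", "there", "a"}


def naive_gene_matches(sentence: str) -> list[str]:
    """Return every token that exactly matches an entry in the lexicon."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [token for token in tokens if token in GENE_NAMES]


# Invented example sentence: only "p53" is a real gene mention here.
print(naive_gene_matches("There is a mutation in the p53 gene."))
```

The matcher returns "a" and "the" alongside "p53", so two of its three hits are ordinary English words. A lexicon alone cannot resolve this kind of ambiguity; some use of context is needed.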
