BioNLP for NLPeople CS5832/HLT-NAACL/RANLP The weirdest job in the - - PDF document



slide-1
SLIDE 1

1

BioNLP for NLPeople

CS5832/HLT-NAACL/RANLP

The weirdest job in the world

slide-2
SLIDE 2

2

The weirdest job in the world

slide-3
SLIDE 3

3

The weirdest job in the world

slide-4
SLIDE 4

4

How I got here

slide-5
SLIDE 5

5

How I got here

  • Voice Input Technologies
  • Linguistix
  • Nationwide Insurance
  • MapQuest
  • Berdy Medical Systems
  • OneRealm [sic]

How I got here

  • Perl hacker, SLM data preprocessing
  • Linguist, Corpus construction
  • Senior Programmer/Analyst, Interactive Voice Response (yuck)
  • Software test dept. manager; senior software engineer

  • Consultant/Perl hacker
  • Senior software engineer
slide-6
SLIDE 6

6

What is BioNLP?

  • Natural language processing applied to biomedical language
    – Publications
    – Medical records
    – Ontologies

Part 0

slide-7
SLIDE 7

7

Why a field called BioNLP?

There is little reason for the data on which a linguist works to have the right to name that work.

Shuy 2002:8

(One lab’s) funding for NLP in computational biology

  • INIA (Neuroinformatics of Alcoholism) ($5M, 5 years)
  • Wyeth Genomics Institute ($200K, 2 years)
  • National Library of Medicine ($4.2M, 3 years)
  • National Library of Medicine ($XM, 3 years)

slide-8
SLIDE 8

8

Why biologists care

  • High-throughput data interpretation
  • Literature search
  • Annotation
  • Database construction

But, I’m an NLPerson (computer scientist, mathematician, engineer…)

  • Hard, but might be possible
  • Might be harder in biomedical domain than in newswire text
  • Might be more possible in biomedical domain than in newswire text

slide-9
SLIDE 9

9

Resources

The big drawing point for NLPeople

  • Data
    – Lexical resources
    – 500 * 16M words of text
    – Labelled training data
  • Tools
    – NER, POS taggers, parsers, semantic normalizers....

$$$

slide-10
SLIDE 10

10

Job market

  • Academia: great
    – US, Europe
  • Industry: not bad, but genomics-specific right now

Surely Shuy jests...

There is little reason for the data on which a linguist works to have the right to name that work.

slide-11
SLIDE 11

11

It really is different on every level

  • Tokenization
  • Named entity recognition
  • Corpus construction
  • Semantic representation

NLP actually could make the world a better place....

slide-12
SLIDE 12

12

An embarrassing truth about BioNLP...

www.chilibot.net


slide-13
SLIDE 13

13

Part 1: Just enough biology

Cells and proteins

<illustration: cell, structures, proteins>

slide-14
SLIDE 14

14

How biologists see the world

Wattarujeekrit et al. (2004)

The Central Dogma: from genes to proteins

http://www.swbic.org/products/clipart/images/dogmag.jpg

slide-15
SLIDE 15

15

The Central Dogma: from genes to proteins

http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/central_dogma.gif

Higher-level structures

  • Genotype, phenotype
  • Tissue, organ, organism
slide-16
SLIDE 16

16

Biological structures are complex

<diagram: nested expansion of “V-SNARE” (Alex Morgan, MITRE)>

  • V-SNARE = Vesicle SNARE
  • SNARE = SNAP Receptor
  • SNAP = Soluble NSF Attachment Protein
  • NSF = N-Ethylmaleimide-Sensitive Fusion Protein
  • N-Ethylmaleimide = Maleic acid N-ethylimide
  • Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

Part 2: Why bioscientists fund and publish research in BioNLP

slide-17
SLIDE 17

17

Two basic markets, multiple user types

  • Medical
    – Clinicians
    – Consumers
    – “Informationists”
    – Administrators (billing, quality assurance, ...)
  • “MolBio” (genomic)
    – High-throughput experimentalists
    – “Bench scientists”
    – Model organism database curators

slide-18
SLIDE 18

18

slide-19
SLIDE 19

19

Structured vocabulary Free text (phenotypes)

slide-20
SLIDE 20

20

122 references...

Medical

slide-21
SLIDE 21

21

1997

<scanned picture of business card>

slide-22
SLIDE 22

22

<happy-face photo>

One year later…

slide-23
SLIDE 23

23

A sad story: physicians don’t buy a lot of NLP software

Another sad story: trying to sell “gisting” to physicians

slide-24
SLIDE 24

24

Sold for $400K: 14.5 or 2.9¢ on the dollar…

Salesperson’s thought process

slide-25
SLIDE 25

25

Physician’s thought process

Genomics

slide-26
SLIDE 26

26

Why biologists care

  • High-throughput data interpretation
  • Literature search
  • Annotation
  • Database construction

Why biologists care

10 years ago...

slide-27
SLIDE 27

27

Why biologists care

Today....

Double exponential growth in the literature

New entries in Medline with publication date in Jan-Aug 2005: 431,478 (avg. 1775/day)


slide-28
SLIDE 28

28

Biological Nomenclature: “V-SNARE”

<diagram: nested expansion of “V-SNARE” (Alex Morgan, MITRE)>

  • V-SNARE = Vesicle SNARE
  • SNARE = SNAP Receptor
  • SNAP = Soluble NSF Attachment Protein
  • NSF = N-Ethylmaleimide-Sensitive Fusion Protein
  • N-Ethylmaleimide = Maleic acid N-ethylimide
  • Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

Part 3

Some things that make BioNLP different

slide-29
SLIDE 29

29

Named Entity Recognition

Genes have names??

slide-30
SLIDE 30

30

Suzanna Lewis

  • Fruitfly geneticist
  • 5 kids
  • Latte + 3 shots

Suzanna Lewis

It is the middle of the night (2:38 to be precise), I am away from friends and family, It has been this way for over 2 years, I can't sleep because of all the work there is yet to do, and there is no end in sight. So when do the magic little elves appear out of nowhere and get everything done? p.s. I am serious.

slide-31
SLIDE 31

31

Suzanna Lewis

pray for elves

  • D. melanogaster gene Pray For Elves, abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae. (FlyBase report FBal0138651)

slide-32
SLIDE 32

32

  • D. melanogaster gene Pray For Elves, abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae. (FlyBase report FBal0138651)

Named entity recognition

  • Molecular biology entity identification problem:
    – large list of classes
    – some of them much harder

  • Usual case-related cues don't help
  • More variability of content
  • Huge lexical ambiguity problem
  • Common English

– as posed, not useful

slide-33
SLIDE 33

33

white white

"wild-type" (not mutated)

slide-34
SLIDE 34

34

white

"mutant"

white

white

slide-35
SLIDE 35

35

Case is meaningful

white White

Case is meaningful

white Symbol: w
White Symbol: W

slide-36
SLIDE 36

36

Yes, there are genes with the symbols I, a, R, p....

Case is meaningful

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock.

slide-37
SLIDE 37

37

Case is meaningful

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. (Ruan et al. 2002)

…even sentence-initially.

sunday driver (syd) was identified in a screen for novel axonal transport mutants in Drosophila. Syd is a ~137kDa protein that is broadly conserved in evolution with homologous proteins identified in C. elegans, mouse and human. (Bowman 2000)

slide-38
SLIDE 38

38

Case is meaningful

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.

Surely you could determine on a document-by-document basis…

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.

slide-39
SLIDE 39

39

Surely you could determine on a document-by-document basis…

Axonal traffic jams with a sunday driver: Identification of a broadly conserved transmembrane protein required for axonal transport in Drosophila. (Bowman 2000)

Evolution

  • What it looks like
  • What it acts like
  • Metaphor
slide-40
SLIDE 40

40

Looks like…

  • white
  • swiss cheese
  • clown
  • dachshund
  • dreadlocks

Acts like…

  • ether a go-go
  • lush
  • agnostic
  • amontillado
slide-41
SLIDE 41

41

Metaphor/metonymy

  • lot
  • maggie
  • scott of the antarctic
  • always early -> british rail
  • asp -> cleopatra
  • tudor -> vasa -> gustavus
  • nanos -> smaug

whimsy

  • chablis, merlot, zinfandel, retsina, moonshine (16 zebrafish genes)
  • milkah, murashka, zolotistyuy, zloday (32 Drosophila genes)

slide-42
SLIDE 42

42

But, that’s not the only way of naming genes....

  • Breast cancer 1 (BRCA1)
  • p53
  • Ribosomal protein S27
  • Heat shock protein 110
  • Mitogen activated protein kinase 15
  • Mitogen activated protein kinase kinase kinase 5

  • fuculokinase
  • GABA
  • Heat shock protein 60
  • calmodulin
  • dHAND
  • suppressor of p53
  • cheap date
  • lush
  • ken and barbie
  • ring
  • to
  • the
  • there
  • a
slide-43
SLIDE 43

43

Worst gene names

  • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A

Worst gene names

  • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A

slide-44
SLIDE 44

44

Worst gene names

  • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
  • SEMA5A

Worst gene names

  • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
  • SEMA5A
  • Tyrosine kinase with immunoglobulin and epidermal growth factor homology domains
  • tie
slide-45
SLIDE 45

45

  • What doesn’t work
  • What does (as of 2004)

“Gene mention” (NER)

Yeh et al. (2005)

slide-46
SLIDE 46

46

Gene mention (NER)

Yeh et al. (2005)

Good systems?

  • Handle multi-word names (heat shock protein 60) (base NP chunking, abbreviation definitions, post-processing)
  • Use some form of machine learning (MaxEnt, HMM, CRF, SVM) (or a clever hack)

  • Do some rule-based post-processing
  • Don’t rely on dictionaries
slide-47
SLIDE 47

47

The Jim Martin technique really works

Kinoshita et al. (2005)

...which isn’t to say that external knowledge is bad

  • Markert/Nissim’s extensions of Poesio’s use of Google

slide-48
SLIDE 48

48

Most feature sets include...

  • Typographic/orthographic features
    – Patterns like \w+-?\d+
    – Contains Greek letters
  • Local/distant context
    – Next word is “protein”
    – Followed by “protein” somewhere else in document
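The feature types listed above can be sketched as a small extractor. This is a minimal illustration, not the feature set of any particular system: the function names, the short list of spelled-out Greek letter names, and the reading of "followed by protein somewhere else" as "the same token is followed by protein at another position" are all assumptions.

```python
import re

# Pattern quoted on the slide: word characters, optional hyphen, trailing digits.
SYMBOL_PATTERN = re.compile(r"\w+-?\d+")

# Illustrative subset of spelled-out Greek letter names (an assumption).
GREEK_NAMES = ("alpha", "beta", "gamma", "delta", "kappa")


def has_greek(token):
    """True if the token contains a spelled-out or literal lowercase Greek letter."""
    lower = token.lower()
    spelled_out = any(name in lower for name in GREEK_NAMES)
    literal = any("\u03b1" <= ch <= "\u03c9" for ch in lower)
    return spelled_out or literal


def token_features(tokens, i):
    """Return orthographic and context features for tokens[i]."""
    token = tokens[i]
    # Local context: is the very next word "protein"?
    next_is_protein = i + 1 < len(tokens) and tokens[i + 1].lower() == "protein"
    # Distant context: is the same token followed by "protein" at another position?
    followed_elsewhere = any(
        j != i and tokens[j] == token and tokens[j + 1].lower() == "protein"
        for j in range(len(tokens) - 1)
    )
    return {
        "matches_symbol_pattern": SYMBOL_PATTERN.fullmatch(token) is not None,
        "contains_greek": has_greek(token),
        "next_is_protein": next_is_protein,
        "protein_follows_elsewhere": followed_elsewhere,
    }
```

Note that the substring test for Greek names is deliberately crude (it would also fire on a word like "alphabet"), which is the kind of noise a learner is expected to weigh against the other features.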

Why not better?

  • Length
  • Case
  • Tokenization
  • Annotation issues
    – Inconsistency
    – Multiple correct answers
    – Inter-corpus differences in definition

Yeh et al. (2005)

slide-49
SLIDE 49

49

Length effect (and why the Jim Martin technique works so well for this)

Kinoshita et al. (2005)

A great research project

  • Build an NER system for...
    – Species
    – Laboratory techniques
    – Cell types
    – Cell lines
    – Tissues
    – ....

slide-50
SLIDE 50

50

...and, NER isn’t what you need anyways

  • GN task and results

Tokenization

  • How to build a cheap base noun phrase chunker
    – Start from right, move left
      • If next token is not conjunction, preposition, comma, period, or right parenthesis, add it
      • Else start a new chunk
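The right-to-left recipe above can be sketched in a few lines. The closed-class word lists here are tiny illustrative samples (an assumption), and the sketch treats every boundary token as a separator that belongs to no chunk.

```python
# Tokens that end a chunk. The conjunction and preposition lists are
# small illustrative samples, not a complete inventory (assumption).
PUNCTUATION = {",", ".", ")"}
CONJUNCTIONS = {"and", "or", "but"}
PREPOSITIONS = {"of", "in", "by", "with", "to", "for", "on", "from", "as"}


def is_boundary(token):
    """True for conjunctions, prepositions, commas, periods, right parentheses."""
    t = token.lower()
    return t in PUNCTUATION or t in CONJUNCTIONS or t in PREPOSITIONS


def chunk_right_to_left(tokens):
    """Group tokens into chunks, scanning from the last token to the first."""
    chunks = []
    current = []
    for token in reversed(tokens):
        if is_boundary(token):
            # Boundary token: close the chunk being built and start a new one.
            if current:
                chunks.append(current[::-1])
            current = []
        else:
            current.append(token)
    if current:
        chunks.append(current[::-1])
    # Chunks were collected right to left; restore reading order.
    return chunks[::-1]
```

For example, `chunk_right_to_left("interaction of heat shock protein 60 with calmodulin .".split())` keeps `heat shock protein 60` together as one chunk. Verbs and other open-class words are not boundaries in this sketch, so they end up inside chunks; that is part of what makes it cheap.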
slide-51
SLIDE 51

51

Tokenization

  • Commas
    – 2,6-diaminohexanoic acid
    – tricyclo(3.3.1.13,7)decanone

Four kinds of hyphens

  • “Syntactic:”
    – Calcium-dependent
    – Hsp-60

  • Knocked-out gene: lush-- flies
  • Negation: -fever
  • Electric charge: Cl-
slide-52
SLIDE 52

52

B-cell-CD4(+)-T-cell interactions

  • PMID: 10516078

Special challenges in biomedical corpus construction

slide-53
SLIDE 53

53

  • How do you parse rat epithelial growth factor receptor 2?
  • Don’t: pretag all named entities
  • How do you tokenize tricyclo(3.3.1.13,7)decanone?
  • Don’t: pretag all named entities
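One way to read "pretag all named entities" for tokenization is to protect already-identified entity strings and run an ordinary tokenizer only on the text between them. The sketch below assumes the entity strings are supplied by some earlier step; the function names and the simple regex tokenizer are illustrative choices, not anything prescribed by the slides.

```python
import re

# A deliberately naive tokenizer: runs of word characters, or single
# punctuation marks. It shreds chemical names on commas and parentheses.
NAIVE_TOKEN = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text):
    return NAIVE_TOKEN.findall(text)


def tokenize_with_pretagged(text, entities):
    """Tokenize text, keeping each pretagged entity string as a single token.

    `entities` is a list of entity strings assumed to be already identified.
    """
    if not entities:
        return naive_tokenize(text)
    # Try longer names first so a long name wins over a name it contains.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    tokens = []
    pos = 0
    for match in pattern.finditer(text):
        tokens.extend(naive_tokenize(text[pos:match.start()]))
        tokens.append(match.group(0))  # the entity survives intact
        pos = match.end()
    tokens.extend(naive_tokenize(text[pos:]))
    return tokens
```

On `"We added tricyclo(3.3.1.13,7)decanone to the cells."` the naive tokenizer breaks the chemical name into more than a dozen fragments, while the pretagged version returns it as one token.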

slide-54
SLIDE 54

54

  • How do you hire a linguistics graduate student to tag rat epithelial growth factor receptor 2?
  • You can’t...
  • How do you do PAS (predicate-argument structure) tagging when you don’t have syntactically tagged text?
  • Sigh...
slide-55
SLIDE 55

55

Some specific cases of word sense disambiguation

Abbreviation disambiguation

  • Incidence of ambiguous abbreviations (Jeff Chang’s paper)
  • Statistical approaches
    – Chang
  • Rule-based
    – Schwartz and Hearst
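The rule-based approach credited above to Schwartz and Hearst is commonly described as matching the characters of a short form, right to left, against the text that precedes it, with the first character required to start a word. The sketch below follows that description from memory; it covers only the matching step, and details may differ from the authors' implementation. The function name and examples are illustrative.

```python
def find_best_long_form(short_form, candidate):
    """Return the shortest suffix of `candidate` that can define `short_form`,
    or None if the characters cannot be matched in order."""
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr_char = short_form[s_index].lower()
        # Ignore punctuation inside the short form.
        if not curr_char.isalnum():
            continue
        # Move left until the character matches. The first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != curr_char) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend left to the beginning of the word containing the last match.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]
```

Given the text before a parenthesized abbreviation, for example `find_best_long_form("HSP", "the heat shock protein")`, the function recovers `heat shock protein`. Pairing each abbreviation with a definition found in the same document is one simple way to resolve it locally.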

slide-56
SLIDE 56

56

Part 4: getting up to speed

(about) 10 papers and resources that will let you read most other papers in BioNLP

Named entity recognition 1: rule-based

  • Fukuda et al. (1998): first NER paper
    – Find something that looks like a symbol for a yeast gene (ABC1)
    – Extend name to the left (yeast ABC1)
    – Extend name to the right (ABC1 protein)
  • Results in 90s
    – Never replicated
    – Yeast is easy
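The three steps summarized above (find a symbol-like core, extend left, extend right) can be sketched as follows. This is not the rule set from Fukuda et al.; the symbol pattern and the two word lists are hypothetical stand-ins chosen to reproduce the slide's ABC1 examples.

```python
import re

# Hypothetical shape of a yeast gene symbol: three capitals plus digits (ABC1).
SYMBOL = re.compile(r"[A-Z]{3}\d+")

# Hypothetical word lists for the extension steps.
LEFT_MODIFIERS = {"yeast"}
RIGHT_HEADS = {"protein", "gene"}


def find_names(tokens):
    """Return candidate names built around symbol-like core tokens."""
    names = []
    for i, token in enumerate(tokens):
        if not SYMBOL.fullmatch(token):
            continue
        start, end = i, i + 1
        # Extend the name to the left over known modifiers.
        while start > 0 and tokens[start - 1].lower() in LEFT_MODIFIERS:
            start -= 1
        # Extend the name to the right over known head nouns.
        while end < len(tokens) and tokens[end].lower() in RIGHT_HEADS:
            end += 1
        names.append(" ".join(tokens[start:end]))
    return names
```

For instance, `find_names("The yeast ABC1 protein is essential".split())` yields `["yeast ABC1 protein"]`. The approach depends entirely on the core pattern, which fits the slide's remark that yeast is easy: a regular symbol nomenclature does most of the work.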

slide-57
SLIDE 57

57

Named entity recognition 2: machine learning

  • Collier et al. (XXX)

NER 3: state of the art

slide-58
SLIDE 58

58

Information extraction 1: rule-based

  • Blaschke 1998

Information extraction 2: machine learning

  • Craven and Kumlien 199X
  • Identify entity pairs
    – Protein/protein
    – Protein/disease
    – Protein/?
  • Use naïve Bayes to classify sentences as +/- positing a relation
    – Features: bag-of-words
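A bag-of-words naïve Bayes sentence classifier of the kind described above can be sketched in plain Python. This is a generic multinomial model with add-one smoothing, not a reconstruction of the cited system; the training sentences and the placeholder protein names (protA, protB, ...) are invented for illustration, and the sketch assumes both labels occur in the training data.

```python
import math
from collections import Counter


def train(sentences):
    """sentences: list of (text, label) pairs, label True if a relation is posited."""
    word_counts = {True: Counter(), False: Counter()}
    label_counts = Counter()
    for text, label in sentences:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return True if the sentence is more likely to posit a relation."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in (True, False):
        # Log prior plus add-one smoothed log likelihood of each known word.
        score = math.log(label_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return scores[True] > scores[False]


# Invented toy data; real training sentences would come from annotated text.
TRAINING = [
    ("protA binds protB in the nucleus", True),
    ("protC interacts with protD", True),
    ("protA and protB were purified separately", False),
    ("samples of protC and protD were stored", False),
]
MODEL = train(TRAINING)
```

With this toy model, `classify(MODEL, "protE binds protF")` comes out positive because "binds" appears only in the positive sentences, while words unseen in training are simply skipped.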

slide-59
SLIDE 59

59

Information extraction 3: rules, linguistics, knowledge

  • Friedman: MedLEE, BioMedLEE
  • NER
  • Syntax

Corpora: 1

  • PubMed/MEDLINE
    – MEDLINE: database of 16M+ abstracts
    – PubMed: interface for searching MEDLINE
    – ASCII and free

NOT a corpus—not really even a “text collection”

slide-60
SLIDE 60

60

Corpora: 2

  • GENIA
    – Fully annotated corpus
    – 2,000 abstracts
    – X00,000 words
    – Now: POS, named entities, 25% treebanked
    – Coming: anaphora; events?; PAS?; dependency parses?

Lexical resources: 1

  • Gene Ontology
    – Molecular functions
    – Biological processes
    – Cellular components
  • Building blocks
    – Terms + definitions
    – Is-a, part-of

slide-61
SLIDE 61

61

Lexical resources: 2

  • Entrez Gene (formerly LocusLink)
    – Names
    – Symbols
    – Synonyms
    – Protein products
    – “Summary”
    – Gene References Into Function (GeneRIFs)

Lexical resources: 3

  • UMLS (Unified Medical Language System)
    – Metathesaurus
    – Semantic Network

slide-62
SLIDE 62

62

Tools overview

  • Probably something available
  • Might work decently
  • Definitely improvable for your specific task

Tools: 1

  • POS tagging:
    – GENIA
    – MEDPOST
    – LingPipe?

slide-63
SLIDE 63

63

Tools: 2

  • Named entity recognition
    – ABNER (Settles 200x)
    – KeX
    – AbGene
  • LESSON: distribute a .jar file and the world will beat a path to your door

Part 6: Current hot topics

slide-64
SLIDE 64

64

What’s the right model for semantic representation?

  • So far: binary relations
  • Arguments that that’s not good enough
    – Rzhetsky/GeneWays paper
    – Penn folks/IE paper
    – Native speaker intuitions (Juliane, etc.)

What’s the right model for semantic representation?

  • Two ways forward
    – Differentiating binary relations
      • Marti HLT/EMNLP; Tsujii
    – PAS
      • PASBio/Wattarujeekrit et al.
      • Kogan et al.

Karin: how do these representational choices affect what a biologist would get out of the text?
slide-65
SLIDE 65

65

The ontology wars

  • Point:
    – Hunter; PASBio; Barry Smith; L&C....
    – GOA; MGI; EBI; ...
  • Counterpoint:
    – Tsujii/Ananiadou; Pedersen/Pakhomov; Markert/Nissim...

True integration of NLP into laboratory data interpretation

  • <Last chapter of Sophia and John’s book>

slide-66
SLIDE 66

66

The embarrassing truth about BioNLP (take 2)...

References

  • Shuy, Roger (2002) Linguistic battles in trademark disputes. Palgrave.
  • Yeh, Alexander; Alexander Morgan; Marc Colosimo; and Lynette Hirschman (2005) BioCreative Task 1A: gene mention finding evaluation. BMC Bioinformatics 6(Suppl. 1):S2.