Assisted Curation: Does Text Mining Really Help? (Alex et al. 2008)


SLIDE 1

23.02.2012

Assisted Curation: Does Text Mining Really Help?

(Alex et al. 2008)
by Benedict Fehringer
Seminar: "Unlocking the Secrets of the Past: Text Mining for Historical Documents"
Supervisor: Dr. Caroline Sporleder (and Martin Schreiber)

Thursday, 23 February 2012

SLIDE 2

Outline

• Introduction
• Related Work
• Assisted Curation
• Text Mining Pipeline
• Curation Experiments
• Discussion and Conclusion
• References

SLIDE 4

Basic study elements

• Content:
  • Curation of biomedical literature
  • For example, protein-protein interaction recognition:
    1. Which proteins are there?
    2. If two proteins are named, are they interacting?

SLIDE 5

Example for protein-protein interaction recognition

Source: Schwikowski, Uetz, & Fields (2000, p. 1259)

[...] An example is YHR105W, which interacts with one protein involved in vesicular transport, Akr2, and with YGL161C, an uncharacterized protein that interacts with two transport proteins, Yip1 and Pep12. YHR105W also interacts with YPL246C, another uncharacterized protein that interacts with Ypt1 and Vam7, proteins implicated in vesicular transport and membrane fusion, respectively. [...]

1. Which proteins are there?
2. If two proteins are named, are they interacting?
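The two curation questions above can be sketched as a toy pipeline. This is an illustration only, not the system from the paper: the regular expressions for systematic yeast ORF names (e.g. YHR105W) and short gene symbols (e.g. Akr2) are simplified assumptions, and treating every co-occurring pair as a candidate interaction is the naive baseline a real relation extractor would improve on.

```python
import re
from itertools import combinations

# Hypothetical, simplified patterns (not from Alex et al.):
# systematic yeast ORF names such as YHR105W, and short gene
# symbols such as Akr2 or Vam7.
ORF = r"Y[A-P][LR]\d{3}[WC]"
SYMBOL = r"[A-Z][a-z]{2}\d{1,2}"
PROTEIN = re.compile(rf"\b(?:{ORF}|{SYMBOL})\b")

def find_proteins(sentence):
    """Question 1: which proteins are mentioned?"""
    return PROTEIN.findall(sentence)

def candidate_pairs(sentence):
    """Question 2 (naive baseline): every pair of proteins that
    co-occurs in a sentence is proposed as a candidate interaction."""
    return list(combinations(find_proteins(sentence), 2))

text = ("An example is YHR105W, which interacts with one protein "
        "involved in vesicular transport, Akr2.")
print(find_proteins(text))    # ['YHR105W', 'Akr2']
print(candidate_pairs(text))  # [('YHR105W', 'Akr2')]
```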

SLIDE 6

Basic study elements

• Research Question:
  • Curation of biomedical literature
  • For example, protein-protein interaction recognition:
    1. Which proteins are there?
    2. If two proteins are named, are they interacting?
  • The task should be supported by text mining

SLIDE 7

Related Work

• Increasing development of information extraction systems (spurred on by the BioCreAtIvE II competition; Krallinger, Leitner, & Valencia, 2007)
• Studies suggest a reduction of curation time
• But: lack of user studies for extrinsic evaluation
• No validation by curator feedback on how the systems affect their work, or on their usefulness

SLIDE 8

Basic study elements

• Evaluation:
  • Curation of biomedical literature
  • For example, protein-protein interaction recognition:
    1. Which proteins are there?
    2. If two proteins are named, are they interacting?
  • The task should be supported by text mining
  • Evaluation by:
    • objective performance metrics (e.g. speed improvement, number of records)
    • focusing on user feedback, too

SLIDE 9

Outline

• Introduction
• Related Work
• Assisted Curation
• Text Mining Pipeline
• Curation Experiments
• Discussion and Conclusion
• References

SLIDE 11

Curation Scenario

• General:
  • Goal: curators should identify protein-protein interactions (PPIs)
  • Initial step: providing a set of matching papers
  • Middle step: filtering papers into candidates

How can NLP help the curators' work?

SLIDE 12

Curation Scenario

• General:
  • Goal: curators should identify protein-protein interactions (PPIs)
  • Initial step: providing a set of matching papers
  • Middle step: filtering papers into candidates
  • Basic assumption: Information Extraction (IE) techniques are likely effective in identifying entities and relations
→ More specifically: NLP can propose candidate PPIs

SLIDE 14

Curation Scenario

• Concrete:

[Figure: Information Flow in the Curation Process. Source: Alex et al. (2008, p. 558)]

SLIDE 20

NLP Engine

• Main Components:

Concrete subtasks:
1. Does a protein's name occur in the sentence?
2. Which protein does it name?
3. If two proteins are named, are they interacting?

NLP components:
1. Named Entity Recognition
2. Term Identification
3. Relation Extraction

SLIDE 22

NLP Engine

• Creation details:
  • What should the interface design look like?
  • How should the labour be divided between the human and the software?

For example: deciding which species is associated with which protein should be quite simple for an expert, but not necessarily for the software.

SLIDE 23

NLP Engine

• Creation details:
  • What should the interface design look like?
  • How should the labour be divided between the human and the software?
  • Which functional characteristics of the NLP engine would be optimal?

For example: should recall or precision be improved?

SLIDE 24

NLP Engine

• Creation details:
  • What should the interface design look like?
  • How should the labour be divided between the human and the software?
  • Which functional characteristics of the NLP engine would be optimal?

The focus will be on the third question.

SLIDE 25

Outline

• Introduction
• Related Work
• Assisted Curation
• Text Mining Pipeline
• Curation Experiments
• Discussion and Conclusion
• References

SLIDE 26

Pipeline-Components

[Pipeline diagram: Corpus, Pre-processing, Named Entity Recognition, Term Identification, Relation Extraction, Component Performance]

SLIDE 27

Pipeline-Components

Corpus:
• 217 papers, annotated with 9 entity types, PPI relations, FRAG* relations, attributes and normalized properties
• Inter-annotator agreement: 84.9, 88.4, 64.8, 59.6, 87.1
*FRAG links fragments and mutants to their parents

SLIDE 28

Pipeline-Components

Corpus:
• 217 papers, annotated with 9 entity types, PPI relations, FRAG* relations, attributes and normalized properties (inter-annotator agreement: 84.9, 88.4, 64.8, 59.6, 87.1)
• The corpus consists of 2 million tokens:
  • TRAIN (66%)
  • DEVTEST (17%)
  • TEST (17%)
*FRAG links fragments and mutants to their parents
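The 66/17/17 split could be reproduced with a sketch like the following. The proportions come from the slide; the shuffling and fixed seed are added assumptions, since the slide does not say how documents were assigned.

```python
import random

def split_corpus(docs, seed=0):
    """Split documents 66% / 17% / 17% into TRAIN, DEVTEST and TEST.
    Shuffling with a fixed seed is an assumption for reproducibility."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n = len(docs)
    a, b = int(0.66 * n), int(0.83 * n)
    return docs[:a], docs[a:b], docs[b:]

# With the 217 papers of the corpus:
train, devtest, test = split_corpus(range(217))
print(len(train), len(devtest), len(test))  # 143 37 37
```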

SLIDE 30

Pipeline-Components

Pre-processing:
• Sentence boundary detection
• Tokenization
• Adding useful linguistic markup
• Attaching NCBI* taxonomy identifiers
*National Center for Biotechnology Information

SLIDE 35

Pipeline-Components

Named Entity Recognition: worked example (entity vs. no entity)

                 entity pred.   no-entity pred.   Sum
entity real           9               3            12
no-entity real        1              11            12
Sum                  10              14            24

Recall: 9/12 = 0.75
Precision: 9/10 = 0.9
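The two figures on this slide follow directly from the confusion matrix; a minimal check:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# From the matrix above: 9 true positives, 1 false positive
# (entity predicted where none is), 3 false negatives (missed entities).
p, r = precision_recall(tp=9, fp=1, fn=3)
print(p, r)  # 0.9 0.75
```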

SLIDE 39

Pipeline-Components

Named Entity Recognition: second example

                 entity pred.   no-entity pred.   Sum
entity real          12               0            12
no-entity real        5               7            12
Sum                  17               7            24

Recall: 12/12 = 1
Precision: 12/17 = 0.71

SLIDE 41

Pipeline-Components

Term Identification:
• Produces a set of candidate identifiers for each protein
• Species assigned via heuristics
• Bag accuracy as evaluation metric

SLIDE 43

Pipeline-Components

Relation Extraction:
• Intra-sentential PPI and FRAG relations
• Inter-sentential FRAG relations
• Enriched with attributes and properties

SLIDE 45

Pipeline-Components

Component Performance:
• Evaluated on DEVTEST, trained on TRAIN
• F1 = 2 · (precision · recall) / (precision + recall)
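The F1 formula on the slide, applied to the earlier worked NER example (precision 0.9, recall 0.75):

```python
def f1(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall),
    the harmonic mean of precision and recall used on the slide."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9, 0.75), 3))  # 0.818
```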

SLIDE 47

Outline

• Introduction
• Related Work
• Assisted Curation
• Text Mining Pipeline
• Curation Experiments
• Discussion and Conclusion
• References

SLIDE 48

Experiment 1: Manual vs. Assisted Curation

• 4 curators
• 4 papers
• 3 conditions:
  • Manual: without assistance
  • GSA-assisted: with integrated gold-standard annotation
  • NLP-assisted: with integrated NLP pipeline output

SLIDE 49

Experiment 1: Results

[Table: total number of records and average curation speed per record. Questionnaire scores range from (1) "strongly agree" to (5) "strongly disagree".]

SLIDE 51

Experiment 2: NLP Consistency

• 1 curator
• 10 papers
• 2 conditions:
  • Consistency 1: all recognized named entities (NEs) were propagated (5 papers)
  • Consistency 2: only the most frequently recognized NEs were propagated (5 papers)

SLIDE 52

Experiment 2: Results I

[Table: total number of records and average curation speed per record]

SLIDE 53

Experiment 2: Results II

[Questionnaire results. Scores range from (1) "strongly agree" to (5) "strongly disagree". A: consistent NLP output (Consistency 1/2); B: baseline NLP.]

SLIDE 55

Experiment 3: Optimizing for Precision or Recall

• 1 curator
• 10 papers
• 3 conditions:
  • High R: NLP output with high recall (5 papers)
  • High P: NLP output with high precision (5 papers)
  • High F1: NLP output with high F1 score (subsequently all papers; viewing only)

F1 = 2 · (precision · recall) / (precision + recall)

SLIDE 56

Experiment 3: Results I

[Table: comparison between High F1, High P and High R. TP: true positives; FP: false positives; FN: false negatives.]

SLIDE 57

Experiment 3: Results II

[Questionnaire results. Scores range from (1) "strongly agree" to (5) "strongly disagree". A: High P / High R; B: High F1.]

SLIDE 59

Outline

• Introduction
• Related Work
• Assisted Curation
• Text Mining Pipeline
• Curation Experiments
• Discussion and Conclusion
• References

SLIDE 60

Discussion I

Experiment 1:
• Maximum time reduction of one third if the NLP output is perfectly accurate
• NLP assistance leads to more records (but their validity has to be verified)
• In the questionnaire, all conditions score roughly equally

SLIDE 61

Discussion II

Experiment 2:
• The curator prefers consistency with all NEs
• But: the objective metrics suggest the other condition is preferable

Experiment 3:
• The curator prefers high recall
→ Must be repeated with other curators (different curation styles)

SLIDE 62

Conclusion

• Curation time alone is not a sufficient measure of NLP's usefulness
• Working closely with users is necessary → identifying helpful and hindering aspects
• Future work:
  • Further research on the merits of high recall vs. high precision
  • Attaching confidence values to extracted information
  • ... with more curators

SLIDE 63

Outline

• Introduction
• Related Work
• Assisted Curation
• Text Mining Pipeline
• Curation Experiments
• Discussion and Conclusion
• References

SLIDE 64

References

• Alex, B., Grover, C., Haddow, B., Kabadjov, M., Klein, E., Matthews, M., Roebuck, S., Tobin, R., & Wang, X. (2008). Assisted curation: does text mining really help? In Pacific Symposium on Biocomputing, pp. 556-567.

• Krallinger, M., Leitner, F., & Valencia, A. (2007). Assessment of the second BioCreative PPI task: Automatic extraction of protein-protein interactions. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, pp. 41-54, Madrid, Spain.

• Schwikowski, B., Uetz, P., & Fields, S. (2000). A network of protein-protein interactions in yeast. Nature Biotechnology, 18, pp. 1257-1261.
