Cancer Panomics, Hoifung Poon



slide-1
SLIDE 1

Semantic Parsing for Cancer Panomics

Hoifung Poon

1

slide-2
SLIDE 2

Overview

2

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets ……

KB

High-Throughput Data

slide-3
SLIDE 3

Overview

3

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets ……

KB

Infer cancer driver mutations

High-Throughput Data

slide-4
SLIDE 4

Overview

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets ……

KB

Extract Pathways from PubMed

High-Throughput Data

Grounded Unsupervised Semantic Parsing

slide-5
SLIDE 5

Collaborators

5

David Heckerman, Tony Gitter, Lucy Vanderwende, Kristina Toutanova, Chris Quirk, Ankur Parikh

slide-6
SLIDE 6

Precision Medicine

slide-7
SLIDE 7

Vemurafenib on BRAF-V600 Melanoma

[Images: before treatment; 15 weeks]

slide-8
SLIDE 8

Vemurafenib on BRAF-V600 Melanoma

[Images: before treatment; 15 weeks; 23 weeks]

slide-9
SLIDE 9

9

slide-10
SLIDE 10

Traditional Biology

10

Targeted Experiments Discovery

One hypothesis

slide-11
SLIDE 11

Genomics

11

High-Throughput Experiments Discovery

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

Many hypotheses

?

slide-12
SLIDE 12

Genome-Wide Association Studies (GWAS)

Healthy: … ATTCGGATATTTAAGGC …
Disease (e.g., Alzheimer's, cancer): … ATTCGGGTATTTAAGCC …

2000: “Genetic diagnosis of diseases would be accomplished in 10 years and that treatments would start to roll out perhaps five years after that.”

2010: “A Decade Later, Genetic Maps Yield Few New Cures.” New York Times, June 2010.

slide-13
SLIDE 13

Key Challenges

• Human genome: 3 billion base pairs
• Potential variations: > 10 million mutations
• Combinations: > 10^1000000 (a 1 followed by 1 million zeros)
• Machine learning problem
  • Atomic features: > 10 million
  • Feature combinations: too many to enumerate
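The combination count on this slide can be checked with a short calculation. It assumes, purely for illustration, that each of the roughly 10 million variable sites is independently either present or absent, so the number of combinations is 2 raised to 10 million:

```latex
\[
2^{10^{7}} \;=\; 10^{\,10^{7}\log_{10} 2} \;\approx\; 10^{3.01 \times 10^{6}} \;>\; 10^{10^{6}}
\]
```

Under that assumption the bound of 10^1000000 is comfortably exceeded, which is why the feature combinations cannot be enumerated.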

slide-14
SLIDE 14

Genomics

14

Discovery

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

How to Scale Discovery?

High-Throughput Experiments

slide-15
SLIDE 15

Cancer

Normal cells: … ATTCGGATATTTAAGGC …
Tumor cells: … ATTCGGGTATTTAAGCC …

• Hundreds of mutations
• Most are “passengers”, not drivers
• Can we identify likely drivers?

slide-16
SLIDE 16

Panomics

16

… ATTCGGATATTTAAGGC …

Genome Transcriptome Epigenome ……

slide-17
SLIDE 17

Pathway Knowledge

Genes work synergistically in pathways

17

slide-18
SLIDE 18

Why Is It Hard to Identify Drivers?

• Complex diseases
• Synergistic perturbation of multiple pathways
• Cancer: 6 to 8 “hallmarks”
  • Promote growth
  • Avoid suicide
  • Evade immune attack
  • Induce blood vessels
  • Invade neighboring tissues
  • …

slide-19
SLIDE 19

19

Hanahan & Weinberg [Cell 2011]

slide-20
SLIDE 20

Why Does Cancer Come Back?

• Subtypes with alternative pathway profiles
• Compensatory pathways can be activated

[Figure: EphA2 and EphB2 in ovarian cancer]

slide-21
SLIDE 21

Why Does Cancer Come Back?

• Subtypes with alternative pathway profiles
• Compensatory pathways can be activated

[Figure: EphA2 and EphB2 in ovarian cancer, with an X marking a blocked pathway]

slide-22
SLIDE 22

A Grammar of Cancer?

Cancer → Anti-Apoptosis & ProGrowth & …
Anti-Apoptosis → Deactivate TP53
Anti-Apoptosis → Activate BCL-2
…
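The grammar analogy can be made concrete with a small AND/OR rule table. This is a minimal sketch, not the speaker's model: only the three rules shown on the slide come from the source, and the `ProGrowth` rule with its `GROWTH_EVENT_PLACEHOLDER` terminal is a hypothetical stand-in for the rules the slide elides.

```python
# Each nonterminal maps to a list of alternatives; each alternative is a list
# of symbols that must ALL be derivable (AND), and any alternative suffices (OR).
RULES = {
    "Cancer": [["Anti-Apoptosis", "ProGrowth"]],
    "Anti-Apoptosis": [["Deactivate TP53"], ["Activate BCL-2"]],
    # Hypothetical placeholder: the slide does not list the ProGrowth rules.
    "ProGrowth": [["GROWTH_EVENT_PLACEHOLDER"]],
}


def derives(symbol, observed):
    """Return True if `symbol` can be derived from the set of observed events."""
    if symbol in observed:
        return True
    alternatives = RULES.get(symbol, [])
    return any(all(derives(part, observed) for part in alt) for alt in alternatives)


print(derives("Cancer", {"Deactivate TP53", "GROWTH_EVENT_PLACEHOLDER"}))
print(derives("Cancer", {"Activate BCL-2"}))
```

The second call fails because an anti-apoptosis event alone does not satisfy the conjunction in the top rule, which mirrors the point that several pathways must be perturbed together.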

slide-23
SLIDE 23

Infer Cancer Driver Mutations

Gene A: DNA → (Transcription) → mRNA → (Translation) → Protein → (Activation) → Protein Active

… ATTCGGATATTTAAGGC …

What’s the level of activity? Is the change caused by a mutation?

slide-24
SLIDE 24

Pathway Knowledge

Gene A: DNA → mRNA → Protein → Protein Active
Gene B: DNA → mRNA → Protein → Protein Active
Gene C: DNA → mRNA → Protein → Protein Active

Links between genes: Transcription Factor, Protein Kinase

slide-25
SLIDE 25

Pathway Knowledge

Gene A: DNA → mRNA → Protein → Protein Active
Gene B: DNA → mRNA → Protein → Protein Active
Gene C: DNA → mRNA → Protein → Protein Active

Links between genes: Transcription Factor, Protein Kinase

?

slide-26
SLIDE 26

Pathway Knowledge

Gene A: DNA → mRNA → Protein → Protein Active
Gene B: DNA → mRNA → Protein → Protein Active
Gene C: DNA → mRNA → Protein → Protein Active

Links between genes: Transcription Factor, Protein Kinase

?

slide-27
SLIDE 27

Pathway Knowledge

Gene A: DNA → mRNA → Protein → Protein Active
Gene B: DNA → mRNA → Protein → Protein Active
Gene C: DNA → mRNA → Protein → Protein Active

Links between genes: Transcription Factor, Protein Kinase

!

slide-28
SLIDE 28

Approach: Graph HMM

Gene A: DNA → mRNA → Protein → Protein Active
Gene B: DNA → mRNA → Protein → Protein Active
Gene C: DNA → mRNA → Protein → Protein Active

Links between genes: Transcription Factor, Protein Kinase

slide-29
SLIDE 29

Extract Pathways from PubMed

29

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets ……

KB

High-Throughput Data

slide-30
SLIDE 30

PubMed

• 22 million abstracts
• Two new abstracts every minute
• Adds 2,000 to 4,000 every day

slide-31
SLIDE 31

Extract Pathways from PubMed

PMID: 123  … VDR+ binds to SMAD3 to form …
PMID: 456  … JUN expression is induced by SMAD3/4 …
……

slide-32
SLIDE 32

Extract Complex Knowledge

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

Nodes: Involvement, up-regulation, IL-10, human monocyte, gp41, p70(S6)-kinase, activation

slide-33
SLIDE 33

Extract Complex Knowledge

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

Nodes: Involvement, up-regulation, IL-10, human monocyte, gp41, p70(S6)-kinase, activation

Node types: REGULATION (×3), PROTEIN (×3), CELL

slide-34
SLIDE 34

Extract Complex Knowledge

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

Nodes: Involvement, up-regulation, IL-10, human monocyte, gp41, p70(S6)-kinase, activation

Node types: REGULATION (×3), PROTEIN (×3), CELL

Edge labels: Theme (×3), Cause (×2), Site

slide-35
SLIDE 35

Extract Complex Knowledge

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

Nodes: Involvement, up-regulation, IL-10, human monocyte, gp41, p70(S6)-kinase, activation

Node types: REGULATION (×3), PROTEIN (×3), CELL

Edge labels: Theme (×3), Cause (×2), Site

Semantic Parsing
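The nested structure that semantic parsing should recover from this sentence can be written down as a small tree of typed nodes. This is a sketch under assumptions: the node and edge labels come from the slide, but the extracted text does not preserve which edge connects which nodes, so the attachment below is one plausible reading, and the `Node` class and helper names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    type: str
    args: dict = field(default_factory=dict)  # role label -> child Node


# Leaves: three proteins and one cell type.
il10 = Node("IL-10", "PROTEIN")
gp41 = Node("gp41", "PROTEIN")
kinase = Node("p70(S6)-kinase", "PROTEIN")
monocyte = Node("human monocyte", "CELL")

# Assumed attachment of the six edges (Theme x3, Cause x2, Site x1).
activation = Node("activation", "REGULATION", {"Theme": kinase})
up_regulation = Node(
    "up-regulation", "REGULATION",
    {"Theme": il10, "Cause": gp41, "Site": monocyte},
)
involvement = Node(
    "Involvement", "REGULATION",
    {"Theme": up_regulation, "Cause": activation},
)


def render(node):
    """Render a node and its arguments as a nested, bracketed string."""
    if not node.args:
        return f"{node.type}({node.text})"
    inner = ", ".join(f"{role}={render(child)}" for role, child in node.args.items())
    return f"{node.type}[{node.text}]({inner})"


def count_edges(node):
    """Count the labeled edges in the subtree rooted at `node`."""
    return sum(1 + count_edges(child) for child in node.args.values())


print(render(involvement))
print(count_edges(involvement))
```

The point of the nesting is that an event (here a regulation) can itself be the Theme or Cause of another event, which a flat binary-relation extractor cannot represent.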

slide-36
SLIDE 36

Bottleneck: Annotated Examples

• GENIA (BioNLP Shared Task 2009-2013)
  • 1999 abstracts
  • MeSH: human, blood cell, transcription factor
• Can we breach the annotation bottleneck?

slide-37
SLIDE 37

Free Lunch #1: Distributional Similarity

• Similar context → Probably similar meaning
• Annotation as latent variables
  Textual expression → Recursive clusters
• Unsupervised semantic parsing

Poon & Domingos, “Unsupervised Semantic Parsing”. EMNLP-2009 (Best Paper Award).
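The intuition that similar contexts suggest similar meanings can be illustrated with bag-of-context-words vectors and cosine similarity. This is only a toy sketch of distributional similarity, not the unsupervised semantic parsing model itself; the sentences, the window size, and the function names are made up for illustration.

```python
from collections import Counter
from math import sqrt


def context_vector(target, sentences, window=2):
    """Count the words appearing within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus.
corpus = [
    "smad3 induces jun expression",
    "smad4 induces jun expression",
    "patients received the drug daily",
]

smad3 = context_vector("smad3", corpus)
smad4 = context_vector("smad4", corpus)
drug = context_vector("drug", corpus)

print(cosine(smad3, smad4))  # identical contexts, so close to 1
print(cosine(smad3, drug))   # no shared context words, so 0
```

Expressions whose context vectors are close would be candidates for the same cluster; the latent annotation is then the cluster assignment rather than a hand-written label.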

slide-38
SLIDE 38

Problem Formulation

• Dependency tree
• Semantic parse
• Probability
• Parsing
• Learning
• Prior: Favor fewer parameters

slide-39
SLIDE 39

Free Lunch #2: Existing KBs

• Many KBs available
  • Gene/Protein: GenBank, UniProt, …
  • Pathways: NCI, Reactome, KEGG, BioCarta, …
• Annotation as latent variables
  Textual expression → Table, column, join, …
• Grounded unsupervised semantic parsing

Poon, “Grounded Unsupervised Semantic Parsing”. ACL-13.

slide-40
SLIDE 40

Natural-Language Interface to Database

Get flight from Toronto to San Diego stopping at DTW

SELECT flight.flight_id
FROM flight, city, city city2, flight_stop, airport_service, airport_service as2
WHERE flight.from_airport = airport_service.airport_code
  AND flight.to_airport = as2.airport_code
  AND airport_service.city_code = city.city_code
  AND as2.city_code = city2.city_code
  AND city.city_name = 'toronto'
  AND city2.city_name = 'san diego'
  AND flight_stop.flight_id = flight.flight_id
  AND flight_stop.stop_airport = 'dtw'

Answers
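To see how a query of this shape produces answers, it can be run against a tiny in-memory database. This is a sketch under assumptions: the table and column names follow the query on the slide, but the schema is cut down to the columns the query touches, the rows and airport codes are invented, and the second `city` alias is written `city2` so that it matches the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy schema: only the columns the query needs (an assumption, not the full ATIS schema).
conn.executescript("""
CREATE TABLE city (city_code TEXT, city_name TEXT);
CREATE TABLE airport_service (city_code TEXT, airport_code TEXT);
CREATE TABLE flight (flight_id INTEGER, from_airport TEXT, to_airport TEXT);
CREATE TABLE flight_stop (flight_id INTEGER, stop_airport TEXT);
""")

# Invented rows: two Toronto-to-San-Diego flights, only one stopping at DTW.
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [("TOR", "toronto"), ("SDG", "san diego")])
conn.executemany("INSERT INTO airport_service VALUES (?, ?)",
                 [("TOR", "YYZ"), ("SDG", "SAN")])
conn.executemany("INSERT INTO flight VALUES (?, ?, ?)",
                 [(1, "YYZ", "SAN"), (2, "YYZ", "SAN")])
conn.executemany("INSERT INTO flight_stop VALUES (?, ?)",
                 [(1, "dtw"), (2, "ord")])

QUERY = """
SELECT flight.flight_id
FROM flight, city, city city2, flight_stop, airport_service, airport_service as2
WHERE flight.from_airport = airport_service.airport_code
  AND flight.to_airport = as2.airport_code
  AND airport_service.city_code = city.city_code
  AND as2.city_code = city2.city_code
  AND city.city_name = 'toronto'
  AND city2.city_name = 'san diego'
  AND flight_stop.flight_id = flight.flight_id
  AND flight_stop.stop_airport = 'dtw'
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

A nine-word question needs six table references and eight join or filter conditions, which is the gap a natural-language interface has to bridge.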

slide-41
SLIDE 41

Clusters → KB Elements

• Entity: Table, Column, Cell
• Relation: Relational join
• Priors:
  • Favor lexical similarity
  • Favor short relational joins

slide-42
SLIDE 42

GUSP: Key Ideas

• Leverage target database

JOB
Job ID | Company   | System
001    | IBM       | Unix
002    | Roche     | IBM
003    | Microsoft | Windows
……

Prior: Favor Unix → System

Bootstrap learning with lexical prior
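The lexical prior can be sketched as a lookup from a word to the database columns whose name or cell values match it. This is a hedged illustration, not GUSP's actual prior: the `JOB` dictionary copies the three rows shown on the slide, and `lexical_candidates` is a hypothetical helper using exact case-insensitive matching.

```python
# The JOB table from the slide, stored column-wise.
JOB = {
    "Company": ["IBM", "Roche", "Microsoft"],
    "System": ["Unix", "IBM", "Windows"],
}


def lexical_candidates(word, table):
    """Return the columns whose name or cell values match `word`."""
    w = word.lower()
    hits = []
    for column, cells in table.items():
        if w == column.lower() or w in (cell.lower() for cell in cells):
            hits.append(column)
    return hits


print(lexical_candidates("Unix", JOB))  # unambiguous: a System value
print(lexical_candidates("IBM", JOB))   # ambiguous: Company or System
```

Unambiguous matches such as "Unix" give learning a starting point; ambiguous ones such as "IBM" are left for the model to resolve from context.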

slide-43
SLIDE 43

GUSP: Key Ideas

• Leverage target database

Flight:  Flight ID | From Airport | ……
Airport: Airport Code | Airport Name | ……

Foreign key: links the Flight table to the Airport table

slide-44
SLIDE 44

GUSP: Key Ideas

• Leverage target database

[Schema graph: Flight, Airport]

slide-45
SLIDE 45

GUSP: Key Ideas

• Leverage target database

[Schema graph: Flight, Days, Fare, Airline, Airport]

slide-46
SLIDE 46

GUSP: Key Ideas

• Leverage target database

[Schema graph: Flight, Days, Fare, Airline, Airport]

Words “flight” and “BWI”: which join connects them?

slide-47
SLIDE 47

GUSP: Key Ideas

• Leverage target database

[Schema graph: Flight, Days, Fare, Airline, Airport; words “flight” and “BWI”]

Prior: Favor shorter join

Leverage schema to guide learning
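The preference for shorter joins can be illustrated by treating the schema as a graph and searching for the shortest path between two tables. This is a minimal sketch, assuming a hypothetical set of foreign-key links among the five tables named on the slide; the real ATIS schema and the way GUSP scores joins are not reproduced here.

```python
from collections import deque

# Hypothetical foreign-key adjacency among the tables named on the slide.
SCHEMA = {
    "Flight": ["Airport", "Airline", "Days", "Fare"],
    "Airport": ["Flight", "Fare"],
    "Airline": ["Flight"],
    "Days": ["Flight"],
    "Fare": ["Flight", "Airport"],
}


def shortest_join(start, goal, schema):
    """Breadth-first search for the shortest chain of tables from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in schema[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_join("Flight", "Airport", SCHEMA))
print(shortest_join("Days", "Airport", SCHEMA))
```

With these assumed links, "flight" and "BWI" are joined directly through Flight and Airport rather than through the longer Flight, Fare, Airport route.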

slide-48
SLIDE 48

Free Lunch #3: Dependency Parses

• Start from syntactic parse
• Rich resources and available parsers
• Intractable structure learning → Tree HMM
• Exact inference is linear-time
• Need to handle syntax-semantics mismatch
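Linear-time exact inference in a tree-structured HMM can be sketched with a max-product (Viterbi) pass over the dependency tree: one bottom-up sweep to score states, one top-down sweep to read off the best assignment. This is a generic sketch, not GUSP's model: the three-word tree fragment, the state names, and all probabilities are invented for illustration, and the complex states are not modeled.

```python
import math


def tree_viterbi(tree, root, states, emit, trans):
    """Most likely state per node; cost is linear in the number of nodes."""
    best, back = {}, {}

    def up(node):
        # Bottom-up: score each state of `node` given its best-scoring subtrees.
        for child in tree.get(node, []):
            up(child)
        best[node] = {}
        for s in states:
            score = emit[node][s]
            for child in tree.get(node, []):
                t_best = max(states, key=lambda t: trans[s][t] + best[child][t])
                back[(child, s)] = t_best
                score += trans[s][t_best] + best[child][t_best]
            best[node][s] = score

    def down(node, s, out):
        # Top-down: follow the back-pointers from the chosen root state.
        out[node] = s
        for child in tree.get(node, []):
            down(child, back[(child, s)], out)

    up(root)
    root_state = max(states, key=lambda s: best[root][s])
    assignment = {}
    down(root, root_state, assignment)
    return assignment, best[root][root_state]


# Hypothetical fragment of a dependency tree: get -> flight -> toronto.
tree = {"get": ["flight"], "flight": ["toronto"]}
states = ["E:flight", "V:city.name", "NULL"]
lp = math.log
emit = {
    "get": {"E:flight": lp(0.1), "V:city.name": lp(0.1), "NULL": lp(0.8)},
    "flight": {"E:flight": lp(0.8), "V:city.name": lp(0.1), "NULL": lp(0.1)},
    "toronto": {"E:flight": lp(0.1), "V:city.name": lp(0.8), "NULL": lp(0.1)},
}
# Uniform parent-to-child transition log-probabilities (an assumption).
trans = {s: {t: lp(1 / 3) for t in states} for s in states}

assignment, score = tree_viterbi(tree, "get", states, emit, trans)
print(assignment)
```

Each node is visited once in each sweep, so the cost grows linearly with sentence length (times the square of the number of states), which is what makes starting from a fixed syntactic parse attractive.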

slide-49
SLIDE 49

Syntax-Semantics Mismatch

[Dependency tree: get flight from toronto to san diego stopping at dtw]

slide-50
SLIDE 50

Syntax-Semantics Mismatch

[Dependency tree: get flight from toronto to san diego stopping at dtw]

slide-51
SLIDE 51

Syntax-Semantics Mismatch

[Dependency tree: get flight from toronto to san diego stopping at dtw]

slide-52
SLIDE 52

Syntax-Semantics Mismatch

[Dependency tree: get flight from toronto to san diego stopping at dtw]

slide-53
SLIDE 53

Introduce Complex States

• Raising
• Sinking
• Implicit

slide-54
SLIDE 54

Raising

[Dependency tree: get flight from toronto to san diego stopping at dtw]

States shown: E:flight, E:flight:R

slide-55
SLIDE 55

Sinking

[Dependency tree: get flight from toronto to san diego stopping at dtw]

States shown: E:flight:R, V:city.name + E:flight

slide-56
SLIDE 56

Implicit

Give me the fare (of the flight) from Seattle to Boston

fare: E:fare

fare: E:fare + E:flight

slide-57
SLIDE 57

Experiment: Dataset

• ATIS
  • Questions and ATIS database
  • Dev. / Test: Follow ZC07 [Zettlemoyer & Collins 2007]
  • Gold SQLs: Use at evaluation only
  • Gold logical forms in ZC07: Not used
• Evaluate on question-answering accuracy

slide-58
SLIDE 58

Experiment: Systems

• LEXICAL: Lexical-trigger prior only
• Supervised learning
  • ZC07: Zettlemoyer & Collins [2007]
  • FUBL: Kwiatkowski et al. [2011]
• GUSP-SIMPLE: Simple states only
• GUSP++: All states

slide-59
SLIDE 59

Results

System | Accuracy
ZC07   | 84.6
FUBL   | 82.8
GUSP++ | 83.5

slide-60
SLIDE 60

Ablation

System Variant | Accuracy
LEXICAL        | 33.9
GUSP-SIMPLE    | 66.5
GUSP++         | 83.5
  − Raising    | 75.7
  − Sinking    | 77.5
  − Implicit   | 76.2

slide-61
SLIDE 61

Pathway Extraction

• More to leverage from KB:
  Semantic relations in KB likely occur in semantic parse of some sentence
• Priors:
  • Favor a parse w. relations in KB
  • Penalize a parse w. relations not in KB
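The two priors can be illustrated as a simple score over the relations a candidate parse asserts: add a reward for each relation found in the KB and subtract a penalty for each one that is not. This is a toy sketch only; the KB contents, the relation triples, and the equal weights are assumptions, and the actual model is described on the slides only at this level of detail.

```python
# Hypothetical KB of (agent, relation, target) triples.
KB = {
    ("SMAD3", "induces", "JUN"),
    ("VDR", "binds", "SMAD3"),
}


def prior_score(parse_relations, kb, reward=1.0, penalty=1.0):
    """Reward relations present in the KB; penalize relations absent from it."""
    score = 0.0
    for relation in parse_relations:
        score += reward if relation in kb else -penalty
    return score


# Two hypothetical candidate parses of the same sentence.
parse_a = [("SMAD3", "induces", "JUN")]
parse_b = [("JUN", "induces", "SMAD3")]

best = max([parse_a, parse_b], key=lambda p: prior_score(p, KB))
print(prior_score(parse_a, KB), prior_score(parse_b, KB))
print(best)
```

In this reading the KB never labels a sentence directly; it only tilts the choice among latent parses toward those consistent with known pathway relations.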

slide-62
SLIDE 62

Distant Supervision

• Existing work: Binary relation, classification
  Mintz et al. [2009]
  Riedel et al. [2010]
  Hoffmann et al. [2011]
  Krishnamurthy & Mitchell [2012]
  Etc.
• Our approach: Generalize distant supervision to semantic parsing

Parikh, Poon, Toutanova. In progress.

slide-63
SLIDE 63

Literome

http://literome.azurewebsites.net

Poon et al., “Literome: PubMed-Scale Genomic Knowledge Base in the Cloud”, Bioinformatics 2014.

slide-64
SLIDE 64

PubMed-Scale Extraction

• Preliminary pass:
  • 2 million instances
  • 13,000 genes, 870,000 unique interactions
• Applications:
  • UCSC Genome Browser, MSR Interactions Track
  • Cancer expression profile modeling
  • Validate de novo pathway prediction
  • Etc.

slide-65
SLIDE 65

Big Mechanism

• $42-million program for 12 teams
  • Reading, Assembly, Explanation
  • Domain: Cancer signaling pathways
• We are funded
  • PI: Andrey Rzhetsky
  • Co-PI w. James Evans, Ross King

slide-66
SLIDE 66

We Have Digitized Life

66

slide-67
SLIDE 67

Next: Digitize Medicine

67

Knock down genes A, B, C → Cure

slide-68
SLIDE 68

Summary

• Precision medicine is the future
• Infer cancer driver mutations
  Graphical model: Pathways + Panomics data
• Extract pathways from PubMed
  Semantic parsing grounded in KBs
• Literome: KB for genomic medicine

slide-69
SLIDE 69

Summary

69

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets ……

KB

High-Throughput Data