Cancer Panomics, Hoifung Poon (PowerPoint presentation)



slide-1
SLIDE 1

Machine Reading for Cancer Panomics

Hoifung Poon

1

slide-2
SLIDE 2

Overview

2

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets ……

KB Cancer Systems Modeling

High-Throughput Data

slide-3
SLIDE 3

3

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets …

KB Extract Pathways from PubMed

Overview

High-Throughput Data Grounded Semantic Parsing

slide-4
SLIDE 4

Precision Medicine

slide-5
SLIDE 5

5

Before Treatment 15 Weeks

Vemurafenib on BRAF-V600 Melanoma

slide-6
SLIDE 6

Vemurafenib on BRAF-V600 Melanoma

6

Before Treatment 15 Weeks 23 Weeks

slide-7
SLIDE 7

7

slide-8
SLIDE 8

Traditional Biology

8

Targeted Experiments Discovery

One hypothesis

slide-9
SLIDE 9

Genomics

9

High-Throughput Experiments Discovery

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

Many hypotheses

?

slide-10
SLIDE 10

… ATTCGGATATTTAAGGC … … ATTCGGGTATTTAAGCC …

Healthy Disease

(e.g., Alzheimer, Cancer)

Genome-Wide Association Studies (GWAS)

2000: “Genetic diagnosis of diseases would be accomplished in 10 years and that treatments would start to roll out perhaps five years after that.”

2010: “A Decade Later, Genetic Maps Yield Few New Cures” (New York Times, June 2010)

10

slide-11
SLIDE 11

Key Challenges

- Human genome: 3 billion base pairs
- Potential variations: > 10 million variants
- Combinations: > 10^1,000,000 (a 1 followed by 1 million zeros)
- Machine learning problem
  - Atomic features: > 10 million
  - Feature combinations: too many to enumerate
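The combination count on this slide can be checked with a little arithmetic. The sketch below assumes, as a simplification not stated on the slide, that each of the 10 million variants is either present or absent, so the number of possible combinations is 2^10,000,000. The function name is illustrative.

```python
import math

# Assumption: every variant is a binary feature (present / absent).
NUM_VARIANTS = 10_000_000


def decimal_digits_of_power_of_two(exponent: int) -> int:
    """Return the number of decimal digits of 2**exponent using logarithms."""
    return math.floor(exponent * math.log10(2)) + 1


digits = decimal_digits_of_power_of_two(NUM_VARIANTS)
print(f"2**{NUM_VARIANTS:,} has about {digits:,} decimal digits")
```

Under that assumption the count has roughly three million digits, which is consistent with the slide's lower bound of 10^1,000,000.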

11

slide-12
SLIDE 12

Genomics

12

Discovery

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

How to Scale Discovery?

High-Throughput Experiments

slide-13
SLIDE 13

Cancer

- Hundreds of mutations
- Most are “passenger”, not driver
- Can we identify likely drivers?

13

… ATTCGGATATTTAAGGC … … ATTCGGGTATTTAAGCC …

Normal cells Tumor cells

slide-14
SLIDE 14

Panomics

14

… ATTCGGATATTTAAGGC …

Genome Transcriptome Epigenome ……

slide-15
SLIDE 15

Pathway Knowledge

Genes work synergistically in pathways

15

slide-16
SLIDE 16

Why Hard to Identify Drivers?

Complex diseases → Perturb multiple pathways

16

Hanahan & Weinberg [Cell 2011]

slide-17
SLIDE 17

Why Does Cancer Come Back?

- Subtypes with alternative pathway profile
- Compensatory pathways can be activated

17

EphA2 EphB2 Ovarian Cancer

slide-18
SLIDE 18

Why Does Cancer Come Back?

- Subtypes with alternative pathway profile
- Compensatory pathways can be activated

18

EphA2 EphB2 Ovarian Cancer

X

slide-19
SLIDE 19

Cancer Systems Modeling

19

Gene A: DNA → [Transcription] → mRNA → [Translation] → Protein → [Activation] → Protein Active

… ATTCGGATATTTAAGGC …

Functional activity Mutation effect Drug Target ……

slide-20
SLIDE 20

20

[Figure: Genes A, B, and C, each drawn as a chain DNA → mRNA → Protein → Protein Active, with labels Transcription Factor and Protein Kinase]

Knowledge → Model

slide-21
SLIDE 21

21

[Figure: Genes A, B, and C, each drawn as a chain DNA → mRNA → Protein → Protein Active, with labels Transcription Factor and Protein Kinase]

?

Knowledge → Model

slide-22
SLIDE 22

22

[Figure: Genes A, B, and C, each drawn as a chain DNA → mRNA → Protein → Protein Active, with labels Transcription Factor and Protein Kinase]

?

Knowledge → Model

slide-23
SLIDE 23

23

[Figure: Genes A, B, and C, each drawn as a chain DNA → mRNA → Protein → Protein Active, with labels Transcription Factor and Protein Kinase]

!

Knowledge → Model

slide-24
SLIDE 24

Approach: Graph HMM

24

[Figure: Genes A, B, and C, each drawn as a chain DNA → mRNA → Protein → Protein Active, with labels Transcription Factor and Protein Kinase]

slide-25
SLIDE 25

Extract Pathways from PubMed

25

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets ……

KB

High-Throughput Data

slide-26
SLIDE 26

PubMed

- 24 million abstracts
- Two new abstracts every minute
- Over one million added every year

26

slide-27
SLIDE 27

… VDR+ binds to SMAD3 to form … (PMID: 123)
… JUN expression is induced by SMAD3/4 … (PMID: 456)
……

27

Machine Reading

slide-28
SLIDE 28

28

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

Machine Reading

slide-29
SLIDE 29

29

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

Entities: p70(S6)-kinase [PROTEIN], IL-10 [PROTEIN], human monocyte [CELL], gp41 [PROTEIN]

Machine Reading

slide-30
SLIDE 30

30

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

Involvement [REGULATION]
  Theme: up-regulation [REGULATION]
    Theme: IL-10 [PROTEIN]
    Cause: gp41 [PROTEIN]
    Site: human monocyte [CELL]
  Cause: activation [REGULATION]
    Theme: p70(S6)-kinase [PROTEIN]

Machine Reading
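One way to hold the nested event structure from this slide in code is sketched below. The class and field names are illustrative assumptions, and the assignment of each Theme, Cause, and Site to a trigger is a reading of the slide's diagram rather than something the transcript states outright.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Entity:
    text: str
    label: str  # e.g. "PROTEIN" or "CELL"


@dataclass
class Event:
    trigger: str
    label: str  # e.g. "REGULATION"
    # Argument role (Theme, Cause, Site) -> entity or nested event.
    args: Dict[str, Union["Event", Entity]] = field(default_factory=dict)


def depth(node: Union[Event, Entity]) -> int:
    """Nesting depth: entities are 0, an event is 1 + its deepest argument."""
    if isinstance(node, Entity):
        return 0
    return 1 + max((depth(arg) for arg in node.args.values()), default=0)


activation = Event("activation", "REGULATION",
                   {"Theme": Entity("p70(S6)-kinase", "PROTEIN")})
up_regulation = Event("up-regulation", "REGULATION", {
    "Theme": Entity("IL-10", "PROTEIN"),
    "Cause": Entity("gp41", "PROTEIN"),
    "Site": Entity("human monocyte", "CELL"),
})
involvement = Event("Involvement", "REGULATION",
                    {"Theme": up_regulation, "Cause": activation})

print(depth(involvement))
```

Events can take other events as arguments, which is why the structure is a tree rather than a flat list of relations.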

slide-31
SLIDE 31

31

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

Involvement [REGULATION]
  Theme: up-regulation [REGULATION]
    Theme: IL-10 [PROTEIN]
    Cause: gp41 [PROTEIN]
    Site: human monocyte [CELL]
  Cause: activation [REGULATION]
    Theme: p70(S6)-kinase [PROTEIN]

Machine Reading

Semantic Parsing

slide-32
SLIDE 32

Long Tail of Variations

32

TP53 inhibits BCL2.
Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.
BCL2 transcription is suppressed by P53 expression.
The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …
……

slide-33
SLIDE 33

Bottleneck: Annotated Examples

- GENIA (BioNLP Shared Task 2009-2013)
  - 1,999 abstracts
  - MeSH: human, blood cell, transcription factor
- Challenge for “supervised” machine learning
- Can we breach this bottleneck?

33

slide-34
SLIDE 34

Free Lunch #1: Distributional Similarity

- Similar context → Probably similar meaning
- Annotation as latent variables
  Textual expression → Recursive clusters
- Unsupervised semantic parsing

34

Poon & Domingos, “Unsupervised Semantic Parsing”. EMNLP 2009. Best Paper Award.
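The "similar context, probably similar meaning" idea can be illustrated with a toy example. This is not the unsupervised semantic parsing model from the cited paper. It is a minimal sketch that swaps in plain Jaccard overlap of argument contexts, and the triples are hypothetical inputs assembled from example sentences and KB rows shown elsewhere in the deck.

```python
from collections import defaultdict

# Hypothetical (cause, predicate, theme) triples, assumed already extracted.
TRIPLES = [
    ("TP53", "inhibits", "BCL2"),
    ("TP53", "down-regulates", "BCL2"),
    ("TP53", "suppresses", "BCL2"),
    ("FOXO1", "induces", "A2M"),
    ("TP53", "induces", "ABCB1"),
]


def contexts_by_predicate(triples):
    """Map each predicate to the set of (cause, theme) pairs it appears with."""
    contexts = defaultdict(set)
    for cause, predicate, theme in triples:
        contexts[predicate].add((cause, theme))
    return contexts


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


contexts = contexts_by_predicate(TRIPLES)
sim_same = jaccard(contexts["inhibits"], contexts["down-regulates"])
sim_diff = jaccard(contexts["inhibits"], contexts["induces"])
print(sim_same, sim_diff)
```

Predicates that share argument contexts score high and would be candidates for the same cluster, while predicates with disjoint contexts score low.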

slide-35
SLIDE 35

Recursive Clustering

35

TP53 inhibits BCL2.
Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.
BCL2 transcription is suppressed by P53 expression.
The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …
……

slide-36
SLIDE 36

Recursive Clustering

36

TP53 inhibits BCL2.
Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.
BCL2 transcription is suppressed by P53 expression.
The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …
……

slide-37
SLIDE 37

Recursive Clustering

37

TP53 inhibits BCL2.
Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.
BCL2 transcription is suppressed by P53 expression.
The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …
……

slide-38
SLIDE 38

Recursive Clustering

38

TP53 inhibits BCL2.
Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.
BCL2 transcription is suppressed by P53 expression.
The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …
……

Clusters:
- inhibits, down-regulates, suppresses, inhibition, …
  - Theme: BCL2, BCL-2 proteins, B-cell CLL/Lymphoma 2, ……
  - Cause: TP53, Tumor suppressor P53, ……

slide-39
SLIDE 39

Free Lunch #2: Existing KBs

- Many KBs available
  - Gene/Protein: GenBank, UniProt, …
  - Pathways: NCI, Reactome, KEGG, BioCarta, …
- Annotation as latent variables
  Textual expression → Table, column, join, …
- Grounded semantic parsing

39

slide-40
SLIDE 40

Entity Extraction

40

HGNC

ID      Symbol   Alias
990     BCL2     B-cell CLL/Lymphoma 2, …
11998   TP53     Tumor suppressor P53, …
…       …        …

slide-41
SLIDE 41

Entity Extraction

41

HGNC

ID      Symbol   Alias
990     BCL2     B-cell CLL/Lymphoma 2, …
11998   TP53     Tumor suppressor P53, …
…       …        …

TP53 inhibits BCL2.
Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.
BCL2 transcription is suppressed by P53 expression.
The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …
……
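Matching text mentions against the HGNC alias table can be sketched as a dictionary lookup. This is a minimal illustration, not the system described in the talk. It contains only the two rows shown on the slide, and the function names and the normalization rule (lowercase, drop punctuation and spaces) are assumptions.

```python
import re

# The two HGNC rows shown on the slide: (ID, symbol, aliases).
HGNC_ROWS = [
    (990, "BCL2", ["B-cell CLL/Lymphoma 2"]),
    (11998, "TP53", ["Tumor suppressor P53"]),
]


def normalize(name: str) -> str:
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_alias_index(rows):
    """Map every normalized symbol or alias to its (ID, symbol) pair."""
    index = {}
    for hgnc_id, symbol, aliases in rows:
        for name in [symbol, *aliases]:
            index[normalize(name)] = (hgnc_id, symbol)
    return index


def link_mentions(sentence: str, index, max_len: int = 4):
    """Greedy longest-match lookup over token n-grams."""
    tokens = re.findall(r"[A-Za-z0-9/-]+", sentence)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if normalize(span) in index:
                found.append((span, index[normalize(span)]))
                i += n
                break
        else:  # no n-gram starting at token i matched
            i += 1
    return found


index = build_alias_index(HGNC_ROWS)
links = link_mentions(
    "Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.", index)
print(links)
```

With only these two rows, a bare "P53" would not be linked, which hints at why a fixed alias table alone does not cover the long tail of variations.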

slide-42
SLIDE 42

Relation Extraction

42

NCI-PID Pathway KB

Regulation   Theme   Cause
Positive     A2M     FOXO1
Positive     ABCB1   TP53
Negative     BCL2    TP53
…            …       …

TP53 inhibits BCL2.
Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.
BCL2 transcription is suppressed by P53 expression.
The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …
……

slide-43
SLIDE 43

Relation Extraction

43

NCI-PID Pathway KB

Regulation   Theme   Cause
Positive     A2M     FOXO1
Positive     ABCB1   TP53
Negative     BCL2    TP53
…            …       …

TP53 inhibits BCL2.
Tumor suppressor P53 down-regulates the activity of BCL-2 proteins.
BCL2 transcription is suppressed by P53 expression.
The inhibition of B-cell CLL/Lymphoma 2 expression by TP53 …
……

Grounded Learning
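The basic distant-supervision heuristic that grounded learning builds on can be sketched as follows: a sentence mentioning a gene pair found in the KB becomes a noisy training example carrying the KB's label. This shows only the textbook heuristic, not the model from the talk. The KB rows come from the slide; the third sentence and the assumption that gene mentions are already linked to symbols are hypothetical.

```python
from itertools import permutations

# (cause, theme) -> regulation type, from the NCI-PID rows on the slide.
KB = {
    ("FOXO1", "A2M"): "Positive",
    ("TP53", "ABCB1"): "Positive",
    ("TP53", "BCL2"): "Negative",
}

# Sentences paired with gene symbols assumed to be already linked.
# The third sentence is a made-up negative example.
SENTENCES = [
    ("TP53 inhibits BCL2.", ["TP53", "BCL2"]),
    ("BCL2 transcription is suppressed by P53 expression.", ["BCL2", "TP53"]),
    ("TP53 and A2M were both measured.", ["TP53", "A2M"]),
]


def distant_labels(sentences, kb):
    """Label every ordered gene pair in a sentence by KB lookup."""
    examples = []
    for text, genes in sentences:
        unique_genes = dict.fromkeys(genes)  # de-duplicate, keep order
        for cause, theme in permutations(unique_genes, 2):
            label = kb.get((cause, theme), "None")
            examples.append((text, cause, theme, label))
    return examples


examples = distant_labels(SENTENCES, KB)
for example in examples:
    print(example)
```

The labels are noisy by construction, since a sentence can mention a KB pair without asserting the relation.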

slide-44
SLIDE 44

Question Answering w.r.t. KB

44

Poon, “Grounded Unsupervised Semantic Parsing”. ACL 2013.

System   Accuracy   Supervision
ZC07     84.6       Supervised
FUBL     82.8       Supervised
GUSP     83.5       Unsupervised

slide-45
SLIDE 45

Pathway Extraction

- Generalize distant supervision:
  Nested events in the KB likely occur in the semantic parse of some sentence
- Prior: Favor semantic parses grounded in the KB
- Outperformed the majority of participants in the original GENIA Event Shared Task

45

Parikh, Poon, Toutanova. In Progress.

slide-46
SLIDE 46

http://literome.azurewebsites.net

46

Literome

Poon et al., “Literome: PubMed-Scale Genomic Knowledge Base in the Cloud”, Bioinformatics 2014.

slide-47
SLIDE 47

PubMed-Scale Extraction

- Preliminary pass:
  - 2 million instances
  - 13,000 genes, 870,000 unique regulations
- Applications:
  - UCSC Genome Browser, MSR Interactions Track
  - Expression profile modeling
  - Validate de novo pathway prediction
  - Etc.

47

Poon, Toutanova, Quirk, “Distant Supervision for Cancer Pathway Extraction from Text”. PSB 2015. To appear.

slide-48
SLIDE 48

Machine Science

48

Evans & Rzhetsky, “Machine Science”. Science, Vol. 329, 2010.

slide-49
SLIDE 49

Machine Science

49

Big Data

slide-50
SLIDE 50

Machine Science

50

Big Data Rich Knowledge

KB

slide-51
SLIDE 51

Machine Science

51

Deep Model Big Data Rich Knowledge

KB

slide-52
SLIDE 52

Machine Science

52

Deep Model Big Data Rich Knowledge Hypotheses

KB

slide-53
SLIDE 53

Machine Science

53

Deep Model Big Data Rich Knowledge Hypotheses Experiments

KB

slide-54
SLIDE 54

Machine Science

54

Deep Model Big Data Rich Knowledge Hypotheses Experiments

KB

slide-55
SLIDE 55

Roadmap

- Extract richer knowledge:
  - Cell type, experimental condition, …
  - Hedging, negation, …
- Formulate coherent models:
  - Supporting evidence, contradiction, …
  - Intellectual gaps, hypotheses, …
- Integrate with data & experiments:
  - Cancer panomics → Driver genes / pathways
  - Single-drug response → Drug combo prioritization

55

slide-56
SLIDE 56

Big Mechanism

- $42-million program
  - Reading, Assembly, Explanation
  - Domain: Cancer signaling pathways
- We are in
  - PI: Andrey Rzhetsky
  - Co-PI with James Evans, Ross King

56

slide-57
SLIDE 57

57

Berkeley AMP Lab OHSU Microsoft Research

slide-58
SLIDE 58

We Have Digitized Life

58

slide-59
SLIDE 59

Next: Digitize Medicine

59

Knock down genes A, B, C → Cure

slide-60
SLIDE 60

Summary

- Precision medicine is the future
- Cancer systems modeling
  Graphical model: Pathways + Panomics data
- Extract pathways from PubMed
  Machine reading by grounded semantic parsing
- Literome: KB for genomic medicine

60

slide-61
SLIDE 61

Acknowledgments

61

- U. Chicago: Andrey Rzhetsky, Kevin White
- OHSU: Brian Druker, Jeff Tyner
- Berkeley AMP Lab: David Patterson
- U. Wisconsin: Anthony Gitter
- Microsoft Research: Chris Quirk, Kristina Toutanova, David Heckerman, Ankur Parikh, Lucy Vanderwende, Bill Bolosky, Ravi Pandya

slide-62
SLIDE 62

Summary

62

… ATTCGGATATTTAAGGC …

… ATTCGGGTATTTAAGCC …

…… …… Disease Genes Drug Targets ……

KB

High-Throughput Data