Semantic Parsing for Cancer Panomics
Hoifung Poon
1
2
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
…… …… Disease Genes Drug Targets ……
High-Throughput Data
3
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
…… …… Disease Genes Drug Targets ……
High-Throughput Data
4
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
…… …… Disease Genes Drug Targets …
High-Throughput Data Grounded Unsupervised Semantic Parsing
5
David Heckerman, Tony Gitter, Lucy Vanderwende, Kristina Toutanova, Chris Quirk, Ankur Parikh
7
Before Treatment 15 Weeks
8
Before Treatment 15 Weeks 23 Weeks
9
10
Targeted Experiments Discovery
11
High-Throughput Experiments Discovery
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
… ATTCGGATATTTAAGGC … … ATTCGGGTATTTAAGCC …
Healthy Disease
(e.g., Alzheimer, Cancer)
“A Decade Later, Genetic Maps Yield Few New Cures” New York Times, June 2010.
12
Human genome: 3 billion base pairs
Potential variations: > 10 million mutations
Combination: > 10^1000000 (1 million zeros)
Machine learning problem
Atomic features: > 10 million
Feature combination: Too many to enumerate
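The size of the combination space can be checked with a short calculation. This is a minimal sketch, assuming each of the 10 million candidate mutations is simply present or absent (an illustrative simplification, not something the slides state):

```python
import math

num_variants = 10_000_000  # "> 10 million" potential variations, from the slide

# Assumption: each variant is either present or absent, giving
# 2 ** num_variants combinations. Work in log10 so we never build a
# number with millions of digits.
log10_combinations = num_variants * math.log10(2)

print(f"combinations ~ 10^{log10_combinations:,.0f}")
print(log10_combinations > 1_000_000)  # checks the "> 10^1000000" bound
```

Under this assumption the exponent is itself in the millions, which is why feature combinations cannot be enumerated.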
13
14
Discovery
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
High-Throughput Experiments
Hundreds of mutations
Most are “passenger”, not driver
Can we identify likely drivers?
15
… ATTCGGATATTTAAGGC … … ATTCGGGTATTTAAGCC …
Normal cells Tumor cells
16
… ATTCGGATATTTAAGGC …
17
Complex diseases
Synergistic perturbation
Cancer: 6 to 8 “hallmarks”
Promote growth
Avoid suicide
Evade immune attack
Induce blood vessels
Invade neighboring tissues
…
18
19
Subtypes with alternative pathway profile
Compensatory pathways can be activated
20
21
22
23
Gene A: DNA → mRNA → Protein → Protein (Active)
Steps: Transcription, Translation, Activation
… ATTCGGATATTTAAGGC …
24
Gene A: DNA → mRNA → Protein → Protein (Active)
Gene B: DNA → mRNA → Protein → Protein (Active)
Gene C: DNA → mRNA → Protein → Protein (Active)
Labels: Transcription Factor, Protein Kinase
25
26
27
28
Gene A: DNA → mRNA → Protein → Protein (Active)
Labels: Transcription Factor, Protein Kinase
Gene B: DNA → mRNA → Protein → Protein (Active)
Gene C: DNA → mRNA → Protein → Protein (Active)
29
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
…… …… Disease Genes Drug Targets ……
High-Throughput Data
22 million abstracts
Two new abstracts every minute
Adds 2000-4000 every day
30
… VDR+ binds to SMAD3 to form … (PMID: 123)
… JUN expression is induced by SMAD3/4 … (PMID: 456)
……
31
32
33
[Figure: annotated sentence. Labels: REGULATION (×3), PROTEIN (×3), CELL.]
34
[Figure: nested event annotation. Node labels: REGULATION (×3), PROTEIN (×3), CELL. Edge labels: Theme (×3), Cause (×2), Site.]
35
[Figure: nested event annotation. Node labels: REGULATION (×3), PROTEIN (×3), CELL. Edge labels: Theme (×3), Cause (×2), Site.]
GENIA (BioNLP Shared Task 2009-2013)
1999 abstracts
MeSH: human, blood cell, transcription factor
Can we breach the annotation bottleneck?
36
Similar context → Probably similar meaning
Annotation as latent variables
Unsupervised semantic parsing
37
Poon & Domingos, “Unsupervised Semantic Parsing”. EMNLP-2009 (Best Paper Award).
38
Many KBs available
Gene/Protein: GenBank, UniProt, …
Pathways: NCI, Reactome, KEGG, BioCarta, …
Annotation as latent variables
Grounded unsupervised semantic parsing
39
Poon, “Grounded Unsupervised Semantic Parsing”. ACL-13.
SELECT flight.flight_id
FROM flight, city, city city2, flight_stop, airport_service, airport_service as2
WHERE flight.from_airport = airport_service.airport_code
  AND flight.to_airport = as2.airport_code
  AND airport_service.city_code = city.city_code
  AND as2.city_code = city2.city_code
  AND city.city_name = 'toronto'
  AND city2.city_name = 'san diego'
  AND flight_stop.flight_id = flight.flight_id
  AND flight_stop.stop_airport = 'dtw'
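To make the target representation concrete, the query can be run against a toy database. This is a minimal sketch: the four-table schema and every row below are hypothetical stand-ins for the real ATIS database, and the second `city` alias is written `city2` so that it matches the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the ATIS schema; real ATIS tables have more columns.
conn.executescript("""
CREATE TABLE city (city_code TEXT, city_name TEXT);
CREATE TABLE airport_service (city_code TEXT, airport_code TEXT);
CREATE TABLE flight (flight_id INTEGER, from_airport TEXT, to_airport TEXT);
CREATE TABLE flight_stop (flight_id INTEGER, stop_airport TEXT);

INSERT INTO city VALUES ('tor', 'toronto'), ('sdg', 'san diego');
INSERT INTO airport_service VALUES ('tor', 'yyz'), ('sdg', 'san');
INSERT INTO flight VALUES (1, 'yyz', 'san'), (2, 'yyz', 'san'), (3, 'san', 'yyz');
INSERT INTO flight_stop VALUES (1, 'dtw'), (2, 'ord'), (3, 'dtw');
""")

query = """
SELECT flight.flight_id
FROM flight, city, city city2, flight_stop, airport_service, airport_service as2
WHERE flight.from_airport = airport_service.airport_code
  AND flight.to_airport = as2.airport_code
  AND airport_service.city_code = city.city_code
  AND as2.city_code = city2.city_code
  AND city.city_name = 'toronto'
  AND city2.city_name = 'san diego'
  AND flight_stop.flight_id = flight.flight_id
  AND flight_stop.stop_airport = 'dtw'
"""

# Only flight 1 leaves Toronto, arrives in San Diego, and stops at dtw.
rows = conn.execute(query).fetchall()
print(rows)
```

The six-way join for a short question illustrates why the parser has to reason about relational joins and not only about individual words.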
40
Entity: Table, Column, Cell
Relation: Relational join
Priors:
Favor lexical similarity
Favor short relational joins
41
Leverage target database
42
JOB
Job ID | Company   | System
001    | IBM       | Unix
002    | Roche     | IBM
003    | Microsoft | Windows
……
Prior: Favor Unix → System
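The lexical prior on this slide can be illustrated by looking up which columns contain a question word as a cell value. This is only a sketch of the idea, using exact case-insensitive matching; the function name `column_candidates` is hypothetical and the slides do not specify how similarity is computed.

```python
from collections import Counter

# The three JOB rows shown on the slide.
JOB_ROWS = [
    {"Job ID": "001", "Company": "IBM", "System": "Unix"},
    {"Job ID": "002", "Company": "Roche", "System": "IBM"},
    {"Job ID": "003", "Company": "Microsoft", "System": "Windows"},
]

def column_candidates(word, rows):
    """Count, per column, how many cells equal the word (case-insensitive)."""
    counts = Counter()
    for row in rows:
        for column, cell in row.items():
            if cell.lower() == word.lower():
                counts[column] += 1
    return counts

print(column_candidates("Unix", JOB_ROWS))  # only the System column matches
print(column_candidates("IBM", JOB_ROWS))   # matches in Company and in System
```

"Unix" grounds unambiguously to the System column, while "IBM" appears under both Company and System, so a lexical prior alone cannot resolve it and the rest of the parse has to.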
Leverage target database
43
Flight: Flight ID, From Airport, ……
Airport: Airport Code, Airport Name, ……
Foreign key links Flight to Airport
Leverage target database
44
Flight Airport
Leverage target database
45
Flight Days Fare Airline Airport
Leverage target database
46
Flight Airport flight BWI Days Fare Airline
Flight Days Fare Airline Airport
Leverage target database
47
Prior: Favor shorter join
Flight Days Fare Airline Airport flight BWI
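One way to picture the "favor shorter join" prior is as a shortest-path search over the schema graph, where tables are nodes and foreign keys are edges. The sketch below is an assumption-laden illustration: the edge list is hypothetical (only the Flight to Airport link is shown on the slides), and breadth-first search stands in for whatever scoring the system actually uses.

```python
from collections import deque

# Hypothetical foreign-key links between the five tables on the slide.
SCHEMA_EDGES = {
    "Flight": ["Airport", "Days", "Fare", "Airline"],
    "Airport": ["Flight", "Fare"],
    "Days": ["Flight"],
    "Fare": ["Flight", "Airport"],
    "Airline": ["Flight"],
}

def shortest_join_path(start, goal, edges):
    """Breadth-first search for the join path with the fewest tables."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in edges.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the two tables are not connected

print(shortest_join_path("Flight", "Airport", SCHEMA_EDGES))
print(shortest_join_path("Days", "Airport", SCHEMA_EDGES))
```

Under this toy schema, "flight" and "BWI" connect through the direct Flight to Airport join, which a shorter-join prior would prefer over a detour through Fare.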
Start from syntactic parse
Rich resources and available parsers
Intractable structure learning
Tree HMM
Exact inference is linear-time
Need to handle syntax-semantics mismatch
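The claim that exact inference in a tree HMM is linear-time can be seen from a max-product upward pass, which visits each node of the parse tree once. The sketch below is a generic illustration, not the model from the talk: the node names, the two states "A" and "B", and all scores are hypothetical.

```python
def best_tree_score(tree, root, states, emit, trans):
    """Max-product upward pass on a tree-structured HMM (log domain).

    tree:  dict mapping a node to the list of its children
    emit:  dict mapping (node, state) to the log score of that assignment
    trans: dict mapping (parent_state, child_state) to the edge log score
    Each node is processed once, so the cost grows linearly with tree size
    (times the square of the number of states).
    """
    def upward(node):
        child_tables = [upward(child) for child in tree.get(node, [])]
        table = {}
        for state in states:
            score = emit[(node, state)]
            for child_table in child_tables:
                # Best child state given this parent state.
                score += max(trans[(state, t)] + child_table[t] for t in states)
            table[state] = score
        return table

    return max(upward(root).values())

# Hypothetical toy tree: one root with two children.
tree = {"root": ["left", "right"]}
states = ["A", "B"]
emit = {("root", "A"): 0, ("root", "B"): -1,
        ("left", "A"): -2, ("left", "B"): 0,
        ("right", "A"): 0, ("right", "B"): -3}
trans = {("A", "A"): 0, ("A", "B"): -1, ("B", "A"): -1, ("B", "B"): 0}

best = best_tree_score(tree, "root", states, emit, trans)
print(best)
```

Because the dependency tree fixes the structure, only the state assignment has to be searched, which is what keeps inference tractable.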
48
49
50
51
52
Raising Sinking Implicit
53
54
E:flight E:flight:R
55
E:flight:R V:city.name + E:flight
56
ATIS
Questions and ATIS database
Dev. / Test: Follow ZC07 [Zettlemoyer & Collins 2007]
Gold SQLs: Use at evaluation only
Gold logical forms in ZC07: Not used
Evaluate on question-answering accuracy
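Question-answering accuracy can be computed by executing the predicted and gold SQL and comparing the returned answers. The sketch below is one plausible reading of that protocol, not the evaluation script from the talk: the function name `answer_accuracy`, the toy table, and the choice to compare answers as sets are all assumptions.

```python
import sqlite3

def answer_accuracy(conn, sql_pairs):
    """Fraction of (predicted_sql, gold_sql) pairs that return the same answers."""
    correct = 0
    for predicted_sql, gold_sql in sql_pairs:
        gold = set(conn.execute(gold_sql).fetchall())
        try:
            predicted = set(conn.execute(predicted_sql).fetchall())
        except sqlite3.Error:
            predicted = None  # a query that fails to execute counts as wrong
        correct += predicted == gold
    return correct / len(sql_pairs)

# Hypothetical toy data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flight (flight_id INTEGER, to_airport TEXT);
INSERT INTO flight VALUES (1, 'bwi'), (2, 'bwi'), (3, 'dtw');
""")
pairs = [
    # Same answers, different SQL text: counted as correct.
    ("SELECT flight_id FROM flight WHERE to_airport = 'bwi'",
     "SELECT flight_id FROM flight WHERE to_airport = 'bwi' ORDER BY flight_id DESC"),
    # Predicted query returns too many flights: counted as wrong.
    ("SELECT flight_id FROM flight",
     "SELECT flight_id FROM flight WHERE to_airport = 'dtw'"),
]
accuracy = answer_accuracy(conn, pairs)
print(accuracy)
```

Comparing answers instead of query strings means a system is not penalized for producing a different but equivalent SQL query.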
57
LEXICAL: Lexical-trigger prior only
Supervised learning:
ZC07: Zettlemoyer & Collins [2007]
FUBL: Kwiatkowski et al. [2011]
GUSP-SIMPLE: Simple states only
GUSP++: All states
58
59
60
More to leverage from KB:
Priors:
Favor a parse w. relations in KB
Penalize a parse w. relations not in KB
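The two priors can be sketched as a simple score over the relations a candidate parse asserts. Everything here is illustrative: the KB triples are toy entries modeled on the example sentences earlier in the deck, and the function name `kb_prior` and the unit reward and penalty weights are hypothetical.

```python
# Toy KB of (relation, argument 1, argument 2) triples; hypothetical content.
KB = {
    ("bind", "VDR", "SMAD3"),
    ("induce", "SMAD3", "JUN"),
}

def kb_prior(parse_relations, kb, reward=1.0, penalty=1.0):
    """Reward relations found in the KB and penalize relations that are not."""
    return sum(reward if relation in kb else -penalty
               for relation in parse_relations)

supported = [("bind", "VDR", "SMAD3"), ("induce", "SMAD3", "JUN")]
mixed = [("bind", "VDR", "SMAD3"), ("induce", "JUN", "SMAD3")]

print(kb_prior(supported, KB))  # both relations are in the KB
print(kb_prior(mixed, KB))      # one relation has its arguments reversed
```

A parse that reverses the arguments of a known relation scores lower than one that agrees with the KB, so the KB steers the parser without any annotated sentences.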
61
Existing work: Binary relation, classification
Mintz et al. [2009]
Riedel et al. [2010]
Hoffmann et al. [2011]
Krishnamurthy & Mitchell [2012]
Etc.
Our approach: Generalize distant supervision
62
Parikh, Poon, Toutanova. In progress.
63
Poon et al., “Literome: PubMed-Scale Genomic Knowledge Base in the Cloud”, Bioinformatics 2014.
Preliminary pass:
2 million instances
13,000 genes, 870,000 unique interactions
Applications:
UCSC Genome Browser, MSR Interactions Track
Cancer expression profile modeling
Validate de novo pathway prediction
Etc.
64
$42-million program for 12 teams
Reading, Assembly, Explanation
Domain: Cancer signaling pathways
We are funded
PI: Andrey Rzhetsky
Co-PI w. James Evans, Ross King
65
66
67
Precision medicine is the future
Infer cancer driver mutations
Extract pathways from PubMed
Literome: KB for genomic medicine
68
69
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
…… …… Disease Genes Drug Targets ……
High-Throughput Data