Genomic and epigenomic signatures for interpreting complex disease
Manolis Kellis
MIT Computer Science & Artificial Intelligence Laboratory Broad Institute of MIT and Harvard
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG 
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT
Encode proteins
Control gene expression
Goal: A systems-level understanding of genomes and gene regulation:
The parts list = Building blocks of gene regulatory networks
Our tools: Comparative genomics & large-scale experimental datasets.
Integrative models = Define roles in development, health, disease
CATGACTG CATGCCTG
Disease-associated variant (SNP/CNV/…)
Gene annotation (Coding, 5’/3’UTR, RNAs)
Evolutionary signatures
Non-coding annotation
Chromatin signatures
Roles in gene/chromatin regulation
Activator/repressor signatures
Other evidence of function
Signatures of selection (sp/pop)
Family Inheritance Me vs. my brother
My dad Dad’s mom Mom’s dad
Human ancestry
Disease risk
Genomics: Regions → mechanisms → drugs
Systems: genes → combinations → pathways
29 mammals, 17 fungi, 12 flies
8 Candida, 9 Yeasts (post-duplication, pre-duplication; diploid, haploid)
Kellis Nature 2003 Nature 2004; Stark Nature 2007; Clark Nature 2007; Butler Nature 2009; Lindblad-Toh Nature 2011
– For example: exons are deeply conserved to mouse, chicken, fish – Many other elements are also strongly conserved: exons / regulatory?
– Patterns of change distinguish different types of functional elements
– Specific function → Selective pressures → Patterns of mutation/insertion/deletion
Protein-coding genes
RNA structures
microRNAs
Regulatory motifs
Stark et al, Nature 2007
Novel protein-coding genes Revised gene annotations Unusual gene structures Novel structural families Targeting, editing, stability Riboswitches in mammals Novel/expanded miR families miR/miR* arm cooperation Sense/anti-sense miR switches Novel regulatory motifs Regulatory motif instances TF/miRNA regulatory networks Single binding site resolution
Translational read-through in human & fly
Protein-coding conservation
Continued protein-coding conservation
No more conservation
Stop codon read-through
2nd stop codon
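The read-through picture above (a first stop codon, a conserved stretch, then a second stop codon) can be sketched on a single sequence. This is only a minimal illustration: the function names and the toy sequence are hypothetical, and the cited analysis relies on multi-species protein-coding conservation rather than a single-sequence scan.

```python
# Hypothetical sketch: locate the candidate read-through region between the
# first and second in-frame stop codons of one coding sequence.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(seq, start=0):
    """Return 0-based positions of in-frame stop codons, scanning from `start`."""
    seq = seq.upper()
    return [i for i in range(start, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]


def readthrough_region(seq, start=0):
    """Return the sequence between the first and second in-frame stops, or None."""
    seq = seq.upper()
    stops = in_frame_stops(seq, start)
    if len(stops) < 2:
        return None  # no second stop codon, so no bounded read-through region
    first, second = stops[0], stops[1]
    return seq[first + 3:second]


# Toy sequence (made up for illustration): ATG GCT TAA GGT CCA TGA
toy = "ATGGCTTAAGGTCCATGA"
region = readthrough_region(toy)
```

In a real setting one would then ask whether the returned region shows protein-coding conservation across species, which is the signal the slide describes.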
Jungreis, Genome Research 2011
Overlapping selection in human exons
Reveal splicing signals, RNA structures, enhancer motifs, dual-coding genes
Synonymous substitution rate
Lin, Genome Research 2011
RNA structure families: ortholog/paralog conservation
Ex: MAT2A S-adenosylmethionine level detection
Structure/loop sequence deep conservation
Parker, Genome Research 2011
Regions of codon-level positive selection
Distributed vs. localized positive selection
Immunity/taste vs. retinal/bone/secretion (distributed vs. localized)
Lindblad-Toh, Nature 2011
NRSF motif
Transcription initiation from Pol2 promoter
Transcription coactivator activity
Transcription factor binding
Chromatin binding
Negative regulation of transcription, DNA-dependent
Transcription factor complex
Protein complex
Protein kinase activity
Nerve growth factor receptor signaling pathway
Signal transducer activity
Protein serine/threonine kinase activity
Negative regulation of transcription from Pol2 promoter
Protein tyrosine kinase activity
In utero embryonic development
Ernst et al Nature Biotech 2010 See also: Amos Tanay, Bill Noble.
modifications
Epigenomic maps
9 human cell types 9 marks
H3K4me1 H3K4me2 H3K4me3 H3K27ac H3K9ac H3K27me3 H4K20me1 H3K36me3 CTCF +WCE +RNA
HUVEC: Umbilical vein endothelial
NHEK: Keratinocytes
GM12878: Lymphoblastoid
K562: Myelogenous leukemia
HepG2: Liver carcinoma
NHLF: Normal human lung fibroblast
HMEC: Mammary epithelial cell
HSMM: Skeletal muscle myoblasts
H1: Embryonic
81 Chromatin Mark Tracks (2^81 combinations)
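The track count follows from the two lists above: 9 cell types times 9 marks gives 81 tracks. The sketch below works through that arithmetic and, assuming each track is simply treated as present or absent at a position, counts the possible presence/absence combinations as 2 raised to the 81st power. The variable names are illustrative only, and the +WCE and +RNA controls are left out.

```python
from itertools import product

# Cell types and marks as listed on the slide (controls +WCE/+RNA omitted).
CELL_TYPES = ["HUVEC", "NHEK", "GM12878", "K562", "HepG2",
              "NHLF", "HMEC", "HSMM", "H1"]
MARKS = ["H3K4me1", "H3K4me2", "H3K4me3", "H3K27ac", "H3K9ac",
         "H3K27me3", "H4K20me1", "H3K36me3", "CTCF"]

# One track per (cell type, mark) pair.
tracks = [f"{cell}:{mark}" for cell, mark in product(CELL_TYPES, MARKS)]
n_tracks = len(tracks)

# Assumption: each track is binarized (present/absent), so the number of
# possible joint patterns is 2 ** n_tracks.
n_combinations = 2 ** n_tracks
```

The size of this space is one way to see why a model that summarizes combinations into a small number of chromatin states is useful.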
Ernst et al, Nature 2011
across cell types (virtual concatenation)
are common
are dynamic
Brad Bernstein ENCODE Chromatin Group
Predicted linking
Correlated activity
[Figure: predicted regulators across HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1. Panels: gene expression (ON/OFF); chromatin states (active enhancer/repressed); active TF motif enrichment (motif enrichment/motif depletion); TF regulator expression (TF on/TF off); dip-aligned motif biases (motif aligned/flat profile)]
Sequence variant at distal position
A A A C A A A C C … Example: Lymphoblastoid (GM) cells study
(Montgomery et al, Nature 2010)
linking based on our datasets
4-fold enrichment (10% exp. by chance)
Individuals
… …
Expression level of gene 15kb
Validation rationale:
provide independent SNP-to-gene links
eQTL study
Activity signatures for each TF Enhancer activity
Tarjei Mikkelsen
Predicted causal HNF motifs (that also showed dips) in HepG2 enhancers
GWAS
CATGACTG CATGCCTG
Disrupt activator Ets-1 motif Loss of GM-specific activation Loss of enhancer function Loss of HLA-DRB1 expression
Erythrocyte phenotypes in K562 leukemia cells
Lupus erythematosus in GM lymphoblastoid
Creation of repressor Gfi1 motif Gain K562-specific repression Loss of enhancer function Loss of CCDC162 expression
GM12878 Lymphoblastoid K562 Myelogenous leukemia
GM12878 enhancer enrichment now seen Cell type specific: GM and K562 enhancers Chromatin state specific: Enhancers/promoters
Could bias in array design contribute to these enrichments? Evaluate all 1000 genomes SNPs by imputing those in LD
Enhancers: 2049 (excess 392), 1940 distinct loci (R^2<.8)
Promoters: 462 (excess 81)
Transcribed: 4740 (excess 522)
Repressed: 1351 (excess 76)
Insulator: 240 (excess 23)
Other: 21k (deplete 1093)
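The counts above can be turned into fold enrichments with a little arithmetic. The sketch below assumes that "excess" means observed minus expected, which the slide does not state explicitly, so the expected count is recovered as observed minus excess. The "Other" category is left out because its observed count is only given as a rounded 21k.

```python
# Observed SNP counts and reported excess per chromatin state, from the slide.
# Assumption (not stated in the source): excess = observed - expected.
COUNTS = {
    "Enhancers": (2049, 392),
    "Promoters": (462, 81),
    "Transcribed": (4740, 522),
    "Repressed": (1351, 76),
    "Insulator": (240, 23),
}


def fold_enrichment(observed, excess):
    """Observed / expected, with expected recovered as observed - excess."""
    expected = observed - excess
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


folds = {state: fold_enrichment(obs, exc) for state, (obs, exc) in COUNTS.items()}

# Rank states from most to least enriched under this assumption.
ranked_states = sorted(folds, key=folds.get, reverse=True)
```

Under this reading, enhancers and promoters show the largest relative excess, which matches the slide's emphasis on regulatory states.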
Melnikov, Nature Biotech 2012
54000+ measurements (x2 cells, 2x repl)
Repressor disruption → aberrant expression in opposite cell types
GWAS
CATGACTG CATGCCTG
mQTLs MWAS
500,000 methylation probes 750 individuals
Phil de Jager, Roadmap disease epigenomics Brad Bernstein REMC mapping
Genome Epigenome meQTL Phenotype Epigenome Classification MWAS
[Figure: meQTL associations. Axes: chromosome and genomic position vs. P-value exponent (-log10 P); distance from CpG (Mb)]
Probe intensity distribution Inter-individual variability
138,731 184 2,647
Multimodal probes (~3K)
SNP-associated probes (29% of all)
1 Active promoter 2 Promoter flanking 3 Active enhancer 4 Weak enhancer 5 Gene bodies 6 Active gene bodies 7 Repetitive 8 Heterochromatin 9 Low signal
% of CpG probes
(driven epigenetically>genetically, open chrom)
SNP-associated All probes
are SNP-associated
contribution of genotype to disease associations
Alzheimer’s Normal Alzheimer’s Normal
Hypomethylated probes (active) Hypermethylated probes (repressed) Alzheimer’s-associated probes are hypermethylated 480,000 probes, ranked by Alzheimer’s association P-value Methylation
Top 7000 probes
– Rank all probes by Alzheimer’s association – Observe functional changes down ranklist – 7000 probes show shift in methylation
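The rank-list procedure in the bullets above can be sketched as follows. The probe IDs, P-values, and methylation differences are made-up toy values, and the helper names are hypothetical; the point is only to show ranking by association P-value and comparing the direction of methylation change in the top of the list against the rest.

```python
# Toy probes: (probe_id, association P-value, methylation difference case - control).
# All values are invented for illustration.
PROBES = [
    ("cg_a", 1e-8, 0.05),
    ("cg_b", 0.2, -0.01),
    ("cg_c", 1e-5, 0.03),
    ("cg_d", 0.6, 0.00),
    ("cg_e", 1e-3, -0.02),
    ("cg_f", 0.04, 0.01),
]


def fraction_hypermethylated(probes):
    """Fraction of probes with a positive methylation difference."""
    if not probes:
        return 0.0
    return sum(1 for _, _, delta in probes if delta > 0) / len(probes)


def top_vs_rest(probes, top_n):
    """Rank by P-value (most significant first) and compare top_n to the rest."""
    ranked = sorted(probes, key=lambda p: p[1])
    top, rest = ranked[:top_n], ranked[top_n:]
    return fraction_hypermethylated(top), fraction_hypermethylated(rest)


top_frac, rest_frac = top_vs_rest(PROBES, top_n=3)
```

With real data, `top_n` would be swept down the rank list to see where the shift in methylation fades, which is how a cutoff such as the top 7000 probes could be chosen.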
* => Fisher's exact test, p-value <= 0.001
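For readers unfamiliar with the test behind the asterisk, here is a minimal one-sided Fisher's exact test written with the standard library only. The 2x2 counts are hypothetical (top-ranked probes versus background, inside versus outside a chromatin state), and the slide does not say whether a one-sided or two-sided test was used, so the one-sided form is an assumption.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability of
    seeing at least `a` in the top-left cell given fixed row and column totals.
    """
    row1 = a + b          # e.g. number of top-ranked probes
    col1 = a + c          # e.g. number of probes in the chromatin state
    n = a + b + c + d     # all probes
    total = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / total


# Hypothetical counts: 30 of 100 top probes fall in enhancers,
# versus 100 of 900 background probes.
p_value = fisher_exact_greater(30, 70, 100, 800)
significant = p_value <= 0.001  # threshold taken from the slide
```

In practice a library routine such as SciPy's `scipy.stats.fisher_exact` would normally be used; the hand-written version is shown only to make the hypergeometric tail explicit.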
Red: More methylated in Alzheimer’s
Blue: Less methylated in Alzheimer’s
Significant probes are in enhancers, not promoters
Predictive power: 6k probes + APOE
Regulatory motifs associated with Alzheimer-associated probes suggest potential pathways
CTCF NRSF ELK1
All probes, ranked by AD assoc. P-value
Goal: A systems-level understanding of genomes and gene regulation:
The parts list = Building blocks of gene regulatory networks
CATGACTG CATGCCTG
Disease-associated variant (SNP/CNV/…)
Gene annotation (Coding, 5’/3’UTR, RNAs)
Evolutionary signatures
Non-coding annotation
Chromatin signatures
Roles in gene/chromatin regulation
Activator/repressor signatures
Other evidence of function
Signatures of selection (sp/pop)
– Brad Bernstein, Tarjei Mikkelsen, Noam Shoresh, David Epstein
– Tarjei Mikkelsen, Broad Institute
– Bing Ren, Brad Bernstein, John Stam, Joe Costello
– Kerstin Lindblad-Toh, Eric Lander, Manuel Garber, Or Zuk
– NHGRI, NIH, NSF, Sloan Foundation
Daniel Marbach Mike Lin Jason Ernst Jessica Wu Rachel Sealfon Pouya Kheradpour (#187) Manolis Kellis Chris Bristow Loyal Goff Irwin Jungreis
Sushmita Roy #331: Luke Ward
Louisa DiStefano Dave Hendrix Angela Yen Ben Holmes Soheil Feizi Mukul Bansal #19:Bob Altshuler Stefan Washietl Matt Eaton