Genomic and epigenomic signatures for interpreting complex disease

Genomic and epigenomic signatures for interpreting complex disease Manolis Kellis Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory


slide-1
SLIDE 1

Genomic and epigenomic signatures for interpreting complex disease

Manolis Kellis

MIT Computer Science & Artificial Intelligence Laboratory Broad Institute of MIT and Harvard

slide-2
SLIDE 2

ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG 
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT

slide-3
SLIDE 3

[Genomic sequence repeated from slide 2, now annotated with the labels below.]

Genes

Encode proteins

Regulatory motifs

Control gene expression

slide-4
SLIDE 4

Building systems-level views of genomes and disease

Goal: A systems-level understanding of genomes and gene regulation:

  • The regulators: Transcription factors, microRNAs, sequence specificities
  • The regions: enhancers, promoters, and their tissue-specificity
  • The targets: TFs→targets, regulators→enhancers, enhancers→genes
  • The grammars: Interplay of multiple TFs → prediction of gene expression

→ The parts list = Building blocks of gene regulatory networks

Our tools: Comparative genomics & large-scale experimental datasets.

  • Evolutionary signatures for coding/non-coding genes, microRNAs, motifs
  • Chromatin signatures for regulatory regions and their tissue specificity
  • Activity signatures for linking regulators → enhancers → target genes
  • Predictive models for gene function, gene expression, chromatin state

→ Integrative models = Define roles in development, health, disease

slide-5
SLIDE 5

Challenge: interpreting disease-associated variants

  • GWAS, case-control,… reveal disease-associated variants

→ Molecular mechanism, cell-type specificity, drug targets

  • Challenges towards interpreting disease variants

– Find ‘true’ causative SNP among many candidates in LD
– Use ‘causal’ variant: predict function, pathway, drug targets
– Non-coding variant: type of function, cell type of activity
– Regulatory variant: upstream regulators, downstream targets

  • This talk: genomics tools for addressing these challenges

Disease-associated variant (SNP/CNV/…): CATGACTG vs. CATGCCTG

  • Gene annotation (Coding, 5’/3’UTR, RNAs) → Evolutionary signatures
  • Non-coding annotation → Chromatin signatures
  • Roles in gene/chromatin regulation → Activator/repressor signatures
  • Other evidence of function → Signatures of selection (sp/pop)

slide-6
SLIDE 6

Recombination breakpoints

Family Inheritance Me vs. my brother

My dad Dad’s mom Mom’s dad

Human ancestry

Disease risk

Genomics: Regions → mechanisms → drugs

Systems: genes → combinations → pathways

Goal: Towards personal systems genomics

slide-7
SLIDE 7

Systems-level views of disease epigenomics

  • Evolutionary signatures → gene/genome annotation

– High-resolution annotation: genes, RNAs, motif instances
– Measuring selection within the human population

  • Chromatin states for interpreting disease association

– Annotate dynamic regulatory elements in multiple cell types
– Activity-based linking of regulators → enhancers → targets

  • Interpreting disease-associated sequence variants

– Mechanistic predictions for individual top-scoring SNPs
– Functional roles of 1000s of disease-associated SNPs

  • Systematic manipulation of 2000+ human enhancers

– Test effect of single-motif and single-nucleotide disruptions
– Role of activator/repressor motifs, disease-associated SNPs

  • Personal genomes/epigenomes in health and disease

– Allele-specific activity. Alzheimer’s brain methylation ↔ SNP
– Global repression of distal enhancers. NRSF, ELK1, CTCF

slide-8
SLIDE 8

Large-scale comparative genomics datasets

29 mammals, 17 fungi, 12 flies

8 Candida, 9 yeasts

[Figure: species trees, labeled post-duplication vs. pre-duplication and diploid vs. haploid, with P/N tip labels]

Kellis Nature 2003, Nature 2004; Stark Nature 2007; Clark Nature 2007; Butler Nature 2009; Lindblad-Toh Nature 2011

slide-9
SLIDE 9

Comparative genomics and evolutionary signatures

  • Comparative genomics can reveal functional elements

– For example: exons are deeply conserved to mouse, chicken, fish
– Many other elements are also strongly conserved: exons / regulatory?

  • Can we also pinpoint specific functions of each region? Yes!

– Patterns of change distinguish different types of functional elements
– Specific function → Selective pressures → Patterns of mutation/insertion/deletion

  • Develop evolutionary signatures characteristic of each function

Kellis Nature 2003 Nature 2004; Stark Nature 2007; Clark Nature 2007; Butler Nature 2009; Lindblad-Toh Nature 2011

slide-10
SLIDE 10

Evolutionary signatures for diverse functions

Protein-coding genes

  • Codon Substitution Frequencies
  • Reading Frame Conservation

RNA structures

  • Compensatory changes
  • Silent G-U substitutions

microRNAs

  • Shape of conservation profile
  • Structural features: loops, pairs
  • Relationship with 3’UTR motifs

Regulatory motifs

  • Mutations preserve consensus
  • Increased Branch Length Score
  • Genome-wide conservation

Stark et al, Nature 2007
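
As a rough illustration of the Reading Frame Conservation signature listed under protein-coding genes, the sketch below scores a pairwise alignment by the fraction of aligned bases at which both sequences are in the same codon phase. It is a simplified toy under stated assumptions (one gapped pairwise alignment, no windowing, no choice among frame offsets), not the published scoring procedure; the function name and the example alignments are hypothetical.

```python
def reading_frame_conservation(aligned_a: str, aligned_b: str) -> float:
    """Fraction of aligned (non-gap) columns where both sequences share codon phase."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pos_a = pos_b = 0          # ungapped bases consumed so far in each sequence
    in_frame = aligned = 0
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a != "-" and base_b != "-":
            aligned += 1
            # Same phase when the consumed-base counts differ by a multiple of 3.
            if (pos_a - pos_b) % 3 == 0:
                in_frame += 1
        if base_a != "-":
            pos_a += 1
        if base_b != "-":
            pos_b += 1
    return in_frame / aligned if aligned else 0.0


# Hypothetical alignments: a 3-base gap keeps the frame,
# a 1-base gap shifts every downstream base out of frame.
frame_kept = reading_frame_conservation("ATGAAA---CCC", "ATGAAATTTCCC")
frame_lost = reading_frame_conservation("ATGA-AACCC", "ATGATAACCC")
print(frame_kept, frame_lost)
```

Gaps whose length is a multiple of three leave the score high, which is the pattern expected of coding regions; frame-shifting gaps pull it down.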

slide-11
SLIDE 11

Implications for genome annotation / regulation

  • Novel protein-coding genes
  • Revised gene annotations
  • Unusual gene structures
  • Novel structural families
  • Targeting, editing, stability
  • Riboswitches in mammals
  • Novel/expanded miR families
  • miR/miR* arm cooperation
  • Sense/anti-sense miR switches
  • Novel regulatory motifs
  • Regulatory motif instances
  • TF/miRNA regulatory networks
  • Single binding site resolution

Stark et al, Nature 2007

slide-12
SLIDE 12

Translational read-through in human & fly

[Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation]

Jungreis, Genome Research 2011

Overlapping selection in human exons (figure axis: synonymous substitution rate)

  • Reveal splicing signals, RNA structures, enhancer motifs, dual-coding genes

Lin, Genome Research 2011

RNA structure families: ortholog/paralog conservation

  • Ex: MAT2A, S-adenosyl-methionine level detection
  • Structure/loop sequence deep conservation

Parker, Genome Research 2011

Regions of codon-level positive selection

  • Distributed vs. localized positive selection
  • Immunity/taste vs. retinal/bone/secretion (figure labels: distributed, localized)

Lindblad-Toh, Nature 2011

slide-13
SLIDE 13

Measuring constraint at individual nucleotides

  • Reveal individual transcription factor binding sites
  • Within motif instances reveal position-specific bias
  • More species: motif consensus directly revealed

NRSF motif

slide-14
SLIDE 14

Detect SNPs that disrupt conserved regulatory motifs

  • Functionally-associated SNPs enriched in states, constraint
  • Prioritize candidates, increase resolution, disrupted motifs
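
One way to picture "SNPs that disrupt motifs" is to score both alleles against a position weight matrix and compare the best match. The sketch below does this for the CATGACTG / CATGCCTG allele pair shown elsewhere in this deck. The four-position matrix, the uniform background, and the function names are assumptions invented for illustration; they are not a real TF motif or the scoring used in the talk.

```python
import math

# Hypothetical 4-position motif (consensus GACT): per-position base probabilities.
PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds(site: str) -> float:
    """Log2 odds of a motif-length site under the PWM vs. the background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, site))


def best_score(seq: str) -> float:
    """Best log-odds score over all motif-length windows (forward strand only)."""
    width = len(PWM)
    return max(log_odds(seq[i:i + width]) for i in range(len(seq) - width + 1))


ref_allele = "CATGACTG"
alt_allele = "CATGCCTG"
delta = best_score(alt_allele) - best_score(ref_allele)
print(f"ref {best_score(ref_allele):.2f}, alt {best_score(alt_allele):.2f}, delta {delta:.2f}")
```

A strongly negative delta at a conserved motif instance is the kind of evidence that could be used to prioritize a candidate SNP; a real analysis would also scan the reverse strand and use measured motifs.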
slide-15
SLIDE 15

Measuring selection within the human lineage

slide-16
SLIDE 16

Human constraint outside conserved regions

  • Non-conserved regions:

– ENCODE-active regions show reduced diversity
→ Lineage-specific constraint in biochemically-active regions

  • Conserved regions:

– Non-ENCODE regions show increased diversity
→ Loss of constraint in human when biochemically-inactive

[Figure: average diversity (heterozygosity), aggregated over the genome, for active regions]
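
The "average diversity (heterozygosity)" comparison can be sketched as follows, assuming biallelic sites and the standard expected heterozygosity 2p(1-p) under Hardy-Weinberg proportions. The allele frequencies below are hypothetical values invented for illustration, not data from the talk.

```python
from statistics import mean


def heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1-p) of a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


# Hypothetical allele frequencies at sites inside and outside active regions.
active_sites = [0.01, 0.02, 0.05, 0.10]
inactive_sites = [0.05, 0.20, 0.30, 0.50]

active_diversity = mean(heterozygosity(p) for p in active_sites)
inactive_diversity = mean(heterozygosity(p) for p in inactive_sites)
print(f"active {active_diversity:.3f} vs. inactive {inactive_diversity:.3f}")
```

Lower average heterozygosity in one class of regions, aggregated over many sites, is the signal read as constraint on this slide.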

slide-17
SLIDE 17

Strongest: motifs, short RNA, DNase, ChIP, lncRNA

  • Significant derived allele depletion in active features
slide-18
SLIDE 18

Bound motifs show increased human constraint

  • Position-specific reduction in bound motif heterozygosity
  • Aggregate across thousands of CTCF motif instances

slide-19
SLIDE 19

Most constrained human-specific enhancer functions

Regulatory genes: Transcription, Chromatin, Signaling.
Developmental enhancers: embryo, nerve growth

  • Transcription initiation from Pol2 promoter
  • Transcription coactivator activity
  • Transcription factor binding
  • Chromatin binding
  • Negative regulation of transcription, DNA-dependent
  • Transcription factor complex
  • Protein complex
  • Protein kinase activity
  • Nerve growth factor receptor signaling pathway
  • Signal transducer activity
  • Protein serine/threonine kinase activity
  • Negative regulation of transcription from Pol2 promoter
  • Protein tyrosine kinase activity
  • In utero embryonic development

slide-20
SLIDE 20

Systems-level views of disease epigenomics

  • Evolutionary signatures → gene/genome annotation

– High-resolution annotation: genes, RNAs, motif instances
– Measuring selection within the human population

  • Chromatin states for interpreting disease association

– Annotate dynamic regulatory elements in multiple cell types
– Activity-based linking of regulators → enhancers → targets

  • Interpreting disease-associated sequence variants

– Mechanistic predictions for individual top-scoring SNPs
– Functional roles of 1000s of disease-associated SNPs

  • Systematic manipulation of 2000+ human enhancers

– Test effect of single-motif and single-nucleotide disruptions
– Role of activator/repressor motifs, disease-associated SNPs

  • Personal genomes/epigenomes in health and disease

– Allele-specific activity. Alzheimer’s brain methylation ↔ SNP
– Global repression of distal enhancers. NRSF, ELK1, CTCF

slide-21
SLIDE 21

Chromatin signatures for genome annotation

Epigenomic maps

  • 1. DNA methylation
  • 2. Histone modifications
  • 3. DNA accessibility

Ernst et al, Nature Biotech 2010. See also: Amos Tanay, Bill Noble.

slide-22
SLIDE 22

ENCODE: Study nine marks in nine human cell lines

9 marks: H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, H3K27me3, H4K20me1, H3K36me3, CTCF (+WCE, +RNA)

9 human cell types:

  • HUVEC: Umbilical vein endothelial
  • NHEK: Keratinocytes
  • GM12878: Lymphoblastoid
  • K562: Myelogenous leukemia
  • HepG2: Liver carcinoma
  • NHLF: Normal human lung fibroblast
  • HMEC: Mammary epithelial cell
  • HSMM: Skeletal muscle myoblasts
  • H1: Embryonic

9 cell types x 9 marks = 81 Chromatin Mark Tracks (2^81 combinations)

Ernst et al, Nature 2011

  • Learned jointly across cell types (virtual concatenation)
  • State definitions are common
  • State locations are dynamic

Brad Bernstein ENCODE Chromatin Group

slide-23
SLIDE 23

Chromatin state dynamics across nine cell types

  • Single annotation track for each cell type
  • Summarize cell-type activity at a glance
  • Can study 9-cell activity pattern across

Predicted linking

Correlated activity

slide-24
SLIDE 24

Link enhancers to target genes

Introducing multi-cell activity profiles

[Figure: multi-cell activity profiles across HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1]

  • Gene expression: ON / OFF
  • Chromatin states: Active enhancer / Repressed
  • Active TF motif enrichment: Motif enrichment / Motif depletion
  • TF regulator expression: TF On / TF Off
  • Dip-aligned motif biases: Motif aligned / Flat profile
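
Linking enhancers to target genes through correlated multi-cell activity can be sketched as a correlation between an enhancer's activity profile and each candidate gene's expression profile across the nine cell types. The Pearson correlation used here is an assumed stand-in for whatever scoring the talk's method applies, and the gene names and all values are hypothetical.

```python
import math

CELL_TYPES = ["HUVEC", "NHEK", "GM12878", "K562", "HepG2", "NHLF", "HMEC", "HSMM", "H1"]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical enhancer state (1 = active enhancer) in each cell type, in CELL_TYPES order.
enhancer_activity = [0, 0, 1, 1, 0, 0, 0, 0, 0]

# Hypothetical expression of two nearby genes in the same cell types.
gene_expression = {
    "GENE_A": [0.2, 0.1, 5.0, 4.5, 0.3, 0.2, 0.1, 0.2, 0.1],
    "GENE_B": [3.0, 2.8, 0.2, 0.1, 3.1, 2.9, 3.2, 3.0, 2.7],
}

links = {gene: pearson(enhancer_activity, expr) for gene, expr in gene_expression.items()}
best_gene = max(links, key=links.get)
print(links, best_gene)
```

In this toy case the enhancer would be linked to GENE_A, whose expression tracks the enhancer's activity, and not to GENE_B, which is anti-correlated.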

slide-25
SLIDE 25

Enhancer-gene links supported by eQTL-gene links

Validation rationale:

  • Expression Quantitative Trait Loci (eQTLs) provide independent SNP-to-gene links
  • Do they agree with activity-based links?

[Figure: eQTL study. Expression level of a gene across individuals (Indiv. 1–9, …) vs. the sequence variant (A or C) at a distal position, 15 kb away]

Example: Lymphoblastoid (GM) cells study

  • Expression/genotype across 60 individuals (Montgomery et al, Nature 2010)
  • 120 eQTLs are eligible for enhancer-gene linking based on our datasets
  • 51 actually linked (43%) using predictions

→ 4-fold enrichment (10% exp. by chance)

  • Independent validation of links.
  • Relevance to disease datasets.
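
The enrichment quoted on this slide (51 of 120 eligible eQTLs linked, against 10% expected by chance) can be checked with a few lines of arithmetic. The binomial tail probability is an added illustration that assumes each eligible eQTL is an independent trial; the slide itself reports only the fold enrichment.

```python
from math import comb

eligible = 120       # eQTLs eligible for enhancer-gene linking (from the slide)
linked = 51          # eQTLs supported by a predicted link (from the slide)
chance_rate = 0.10   # fraction expected by chance (from the slide)

observed_rate = linked / eligible            # 0.425, quoted as 43% on the slide
fold_enrichment = observed_rate / chance_rate

# Assumption: independent trials, so the chance of >= 51 links is a binomial tail.
p_tail = sum(
    comb(eligible, k) * chance_rate**k * (1 - chance_rate) ** (eligible - k)
    for k in range(linked, eligible + 1)
)
print(f"observed {observed_rate:.3f}, fold {fold_enrichment:.2f}, binomial tail {p_tail:.2e}")
```

The fold enrichment comes out at 4.25, consistent with the "4-fold" figure on the slide.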

slide-26
SLIDE 26

Visualizing 10,000s of predicted enhancer-gene links

  • Overlapping regulatory units, both few and many
  • Both upstream and downstream elements linked
  • Enhancers correlate with sequence constraint

slide-27
SLIDE 27

Link TFs to target enhancers Predict activators vs. repressors

Introducing multi-cell activity profiles

[Figure: multi-cell activity profiles across HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1]

  • Gene expression: ON / OFF
  • Chromatin states: Active enhancer / Repressed
  • Active TF motif enrichment: Motif enrichment / Motif depletion
  • TF regulator expression: TF On / TF Off
  • Dip-aligned motif biases: Motif aligned / Flat profile

slide-28
SLIDE 28

Coordinated activity reveals activators/repressors

  • Ex1: Oct4 predicted activator of embryonic stem (ES) cells
  • Ex2: Gfi1 repressor of K562/GM cells
  • Enhancer networks: Regulator → enhancer → target gene

[Figure panels: Activity signatures for each TF; Enhancer activity]

slide-29
SLIDE 29

Causal motifs supported by dips & enhancer assays

  • Dip evidence of TF binding (nucleosome displacement)
  • Enhancer activity halved by single-motif disruption

→ Motifs bound by TF, contribute to enhancers

Tarjei Mikkelsen

Predicted causal HNF motifs (that also showed dips) in HepG2 enhancers

slide-30
SLIDE 30

Systems-level views of disease epigenomics

  • Evolutionary signatures → gene/genome annotation

– High-resolution annotation: genes, RNAs, motif instances
– Measuring selection within the human population

  • Chromatin states for interpreting disease association

– Annotate dynamic regulatory elements in multiple cell types
– Activity-based linking of regulators → enhancers → targets

  • Interpreting disease-associated sequence variants

– Mechanistic predictions for individual top-scoring SNPs
– Functional roles of 1000s of disease-associated SNPs

  • Systematic manipulation of 2000+ human enhancers

– Test effect of single-motif and single-nucleotide disruptions
– Role of activator/repressor motifs, disease-associated SNPs

  • Personal genomes/epigenomes in health and disease

– Allele-specific activity. Alzheimer’s brain methylation ↔ SNP
– Global repression of distal enhancers. NRSF, ELK1, CTCF

slide-31
SLIDE 31

Genotype Disease

GWAS

Interpret variants using Epigenomics

  • Chromatin states: Enhancers, promoters, motifs
  • Enrichment in individual loci, across 1000s of SNPs in T1D

Interpreting disease-association signals

CATGACTG CATGCCTG

Epigenome changes in disease

slide-32
SLIDE 32


  • Disease-associated SNPs enriched for enhancers in relevant cell types
  • E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator

Revisiting disease-associated variants

slide-33
SLIDE 33

Mechanistic predictions for top disease-associated SNPs

Lupus erythematosus in GM lymphoblastoid

Disrupt activator Ets-1 motif → Loss of GM-specific activation → Loss of enhancer function → Loss of HLA-DRB1 expression

Erythrocyte phenotypes in K562 leukemia cells

Creation of repressor Gfi1 motif → Gain K562-specific repression → Loss of enhancer function → Loss of CCDC162 expression

slide-34
SLIDE 34

Allele-specific chromatin marks: cis-vs-trans effects

  • Maternal and paternal GM12878 genomes sequenced
  • Map reads to phased genome, handle SNPs, indels
  • Correlate activity changes with sequence differences
slide-35
SLIDE 35

HaploReg: systematic ENCODE mining of variants (compbio.mit.edu/HaploReg)

  • Start with any list of SNPs or select a GWA study

– Mine publicly available ENCODE data for significant hits
– Hundreds of assays, dozens of cells, conservation, motifs
– Report significant overlaps and link to info/browser

slide-36
SLIDE 36

Functional enrichment for 1000s of SNPs

slide-37
SLIDE 37

Full T1D association spectrum → 1000s of causal SNPs

GM12878 Lymphoblastoid K562 Myelogenous leukemia

  • Rank all SNPs by P-value
  • Find chromatin states with enrichment in high ranks
  • Signal spans 1000s of SNPs
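
The rank-based procedure in the bullets above can be sketched as: sort SNPs by association P-value, then compare how often the top-ranked SNPs fall in a given chromatin state with how often all SNPs do. The function name, the toy SNP list, and the choice of a simple top-k fold enrichment are assumptions for illustration; the talk does not specify the exact statistic.

```python
def top_rank_enrichment(snps, top_k):
    """Fold enrichment of a chromatin state among the top_k most significant SNPs.

    snps: list of (p_value, in_state) pairs, where in_state is True when the SNP
    overlaps the chromatin state of interest.
    """
    ranked = sorted(snps, key=lambda snp: snp[0])          # most significant first
    overall_rate = sum(in_state for _, in_state in ranked) / len(ranked)
    top_rate = sum(in_state for _, in_state in ranked[:top_k]) / top_k
    return top_rate / overall_rate


# Hypothetical SNPs: (association P-value, overlaps an enhancer state?)
toy_snps = [
    (1e-9, True), (3e-8, True), (2e-6, False), (5e-5, True),
    (0.01, False), (0.04, False), (0.20, True), (0.35, False),
    (0.60, False), (0.90, False),
]

enrichment_top4 = top_rank_enrichment(toy_snps, 4)
print(enrichment_top4)
```

Repeating this for increasing top_k shows how far down the ranked list the enrichment persists, which is how a signal spanning thousands of SNPs would become visible.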

  • GM12878 enhancer enrichment now seen
  • Cell type specific: GM and K562 enhancers
  • Chromatin state specific: Enhancers/promoters

Could bias in array design contribute to these enrichments?
→ Evaluate all 1000 Genomes SNPs by imputing those in LD

slide-38
SLIDE 38

Imputing SNPs in LD → stronger cell/state separation

  • Excess of 30,000 SNPs → 2049 enhancers (excess 392)
  • Mostly found in independent loci (1730 with R^2 < 0.2)

→ Systematically measure their regulatory contributions

[Figure panels: Enhancers across cell types; Chromatin states in GM12878]

Chromatin states in GM12878:

  • Enhancers: 2049 (excess 392), 1940 distinct loci (R^2 < 0.8)
  • Promoters: 462 (excess 81)
  • Transcribed: 4740 (excess 522)
  • Repressed: 1351 (excess 76)
  • Insulator: 240 (excess 23)
  • Other: 21k (deplete 1093)
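
The per-state counts on this slide can be turned into fold enrichments with simple arithmetic. This sketch assumes that "excess" means observed minus expected, which the slide does not state explicitly; the counts themselves are copied from the slide.

```python
# state: (observed SNP count, excess over expectation), as listed on the slide
state_counts = {
    "Enhancers": (2049, 392),
    "Promoters": (462, 81),
    "Transcribed": (4740, 522),
    "Repressed": (1351, 76),
    "Insulator": (240, 23),
}

fold = {}
for state, (observed, excess) in state_counts.items():
    expected = observed - excess          # assumption: excess = observed - expected
    fold[state] = observed / expected

total_excess = sum(excess for _, excess in state_counts.values())
most_enriched = max(fold, key=fold.get)
print(fold, total_excess, most_enriched)
```

Under that assumption the excesses sum to 1094, within one of the 1093 depletion listed for "Other", and enhancers show the largest fold enrichment.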

slide-39
SLIDE 39

Systems-level views of disease epigenomics

  • Evolutionary signatures → gene/genome annotation

– High-resolution annotation: genes, RNAs, motif instances
– Measuring selection within the human population

  • Chromatin states for interpreting disease association

– Annotate dynamic regulatory elements in multiple cell types
– Activity-based linking of regulators → enhancers → targets

  • Interpreting disease-associated sequence variants

– Mechanistic predictions for individual top-scoring SNPs
– Functional roles of 1000s of disease-associated SNPs

  • Systematic manipulation of 2000+ human enhancers

– Test effect of single-motif and single-nucleotide disruptions
– Role of activator/repressor motifs, disease-associated SNPs

  • Personal genomes/epigenomes in health and disease

– Allele-specific activity. Alzheimer’s brain methylation ↔ SNP
– Global repression of distal enhancers. NRSF, ELK1, CTCF

slide-40
SLIDE 40

High-throughput experiments: 10,000s enhancers

  • Experiment features:

– Multiplexed enhancer assays
– 10,000s of elements
– Each w/ unique barcode
– Multiple human cell types
– Repeat experiments on same array / diff barcodes

  • Applied to:

– Test enhancer offsets
– Test causal motifs

  • With: Tarjei Mikkelsen

– Broad Institute, ARRA funds
– See also: Barak Cohen, Jay Shendure, Eran Segal

Melnikov, Nature Biotech 2012

slide-41
SLIDE 41

Systematic motif disruption for 5 activators and 2 repressors in 2 human cell lines

54000+ measurements (x2 cells, 2x repl)

slide-42
SLIDE 42

Example activator: conserved HNF4 motif match

  • WT expression specific to HepG2
  • Non-disruptive changes maintain expression
  • Motif match disruptions reduce expression to background
  • Random changes depend on effect on motif match

slide-43
SLIDE 43

Results hold across 2000+ enhancers

  • Scramble abolishes reporter expression
  • Neutral mutations show no change
  • Increasing mutations show more expression
  • However, only 40% show wild-type expression: context?

slide-44
SLIDE 44

Features of functional wildtype enhancers

  • Nucleosome exclusion, motif conservation, other TFs
  • Each of these features is encoded in primary sequence
slide-45
SLIDE 45

Repressors of HepG2 enhancer act in K562

Repressor disruption → aberrant expression in opposite cell types

slide-46
SLIDE 46

Testing effect of SNP change in enhancer constructs

  • SNPs in enhancer regions can lead to expression changes in downstream reporter genes

  • Currently testing all T1D-associated enhancer SNPs
slide-47
SLIDE 47

Systems-level views of disease epigenomics

  • Evolutionary signatures → gene/genome annotation

– High-resolution annotation: genes, RNAs, motif instances
– Measuring selection within the human population

  • Chromatin states for interpreting disease association

– Annotate dynamic regulatory elements in multiple cell types
– Activity-based linking of regulators → enhancers → targets

  • Interpreting disease-associated sequence variants

– Mechanistic predictions for individual top-scoring SNPs
– Functional roles of 1000s of disease-associated SNPs

  • Systematic manipulation of 2000+ human enhancers

– Test effect of single-motif and single-nucleotide disruptions
– Role of activator/repressor motifs, disease-associated SNPs

  • Personal genomes/epigenomes in health and disease

– Allele-specific activity. Alzheimer’s brain methylation ↔ SNP
– Global repression of distal enhancers. NRSF, ELK1, CTCF

slide-48
SLIDE 48

Genotype Disease

GWAS

(1) Interpret variants using Epigenomics

  • Chromatin states: Enhancers, promoters, motifs
  • Enrichment in individual loci, across 1000s of SNPs in T1D

Interpreting disease-association signals

CATGACTG CATGCCTG

(2) Epigenome changes in disease

  • Intermediate molecular phenotypes associated with disease
  • Variation in brain methylomes of Alzheimer’s patients

mQTLs MWAS

Epigenome

slide-49
SLIDE 49

Phil de Jager: Methylation in 750 Alzheimer patients

500,000 methylation probes 750 individuals

  • Patients followed for 10+ years with cognitive evaluations
  • Brain samples donated post-mortem → methylation/genotype
  • Seek predictive features: SNPs, QTLs, mQTLs, regulation

Phil de Jager, Roadmap disease epigenomics Brad Bernstein REMC mapping

[Diagram: (1) Genome → Epigenome: meQTL; (2) Epigenome → Phenotype: classification, MWAS]

slide-50
SLIDE 50

2,500 mQTLs for neighboring SNPs at 10^-14

  • Overlay Manhattan plots of 450,000 methylation probes
  • Cutoff of 10^-14 (10^-8 after Bonferroni correction)
  • Use to pinpoint disrupted motifs, predict epigenome
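
The relationship between the two cutoffs quoted on this slide can be worked through with the Bonferroni rule (corrected level = per-test cutoff × number of tests). The slide does not say how tests were counted, so the sketch below shows both the number of tests implied by its round figures and, as a labeled assumption, what the corrected level would be if only the 450,000 probes were counted.

```python
nominal_cutoff = 1e-14    # per-test cutoff quoted on the slide
corrected_level = 1e-8    # level quoted "after Bonferroni correction"
n_probes = 450_000        # methylation probes quoted on the slide

# Bonferroni: corrected level = per-test cutoff * number of tests.
implied_tests = corrected_level / nominal_cutoff

# Assumption: if only the probes were counted as tests, the corrected level would be:
corrected_if_probes_only = nominal_cutoff * n_probes

print(f"implied tests: {implied_tests:.0f}")
print(f"corrected level with probes only: {corrected_if_probes_only:.1e}")
```

The round numbers on the slide imply about a million tests; counting probes alone gives 4.5e-9, the same order of magnitude as the quoted 10^-8.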

[Figure axes: chromosome and genomic position; P-value exponent (-log10 P); distance from CpG (Mb)]

slide-51
SLIDE 51

Focusing on 2831 most variable probes

[Figure panels: probe intensity distribution; inter-individual variability]

  • Hemi-methylated probes are also the most variable

  • Tiny fraction (0.6%) of all probes
  • Promoters: Stable low (active)
  • Gene bodies: Stable high (active)
  • Enhancers/poised: Most variable
slide-52
SLIDE 52

[Venn diagram: 138,731 SNP-associated only; 2,647 both; 184 multimodal only. Multimodal probes (~3K); SNP-associated probes (29% of all)]

Multimodal → SNP-associated → Promoter-depleted

  • 93.5% of multimodal probes are SNP-associated
  • SNP-associated probes depleted in promoters (driven epigenetically > genetically, open chrom)
  • Importance of distinguishing contribution of genotype to disease associations

[Figure: % of CpG probes, SNP-associated vs. all probes, in each chromatin state: 1 Active promoter, 2 Promoter flanking, 3 Active enhancer, 4 Weak enhancer, 5 Gene bodies, 6 Active gene bodies, 7 Repetitive, 8 Heterochromatin, 9 Low signal]
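
The percentages quoted for multimodal and SNP-associated probes can be reproduced from the three counts on this slide. This sketch assumes the counts are the three regions of a two-set Venn diagram (SNP-associated only, multimodal only, both) and takes the total of 480,000 probes from another slide in the deck; both are assumptions, since the extracted text does not label them.

```python
snp_only = 138_731        # assumed: SNP-associated but not multimodal
multimodal_only = 184     # assumed: multimodal but not SNP-associated
both = 2_647              # assumed: multimodal and SNP-associated
total_probes = 480_000    # assumption: total quoted elsewhere in the deck

multimodal = multimodal_only + both
snp_associated = snp_only + both

pct_multimodal_snp = 100 * both / multimodal             # share of multimodal probes that are SNP-associated
pct_multimodal_of_all = 100 * multimodal / total_probes  # multimodal probes as a share of all probes
pct_snp_of_all = 100 * snp_associated / total_probes     # SNP-associated probes as a share of all probes

print(multimodal, round(pct_multimodal_snp, 1), round(pct_multimodal_of_all, 1), round(pct_snp_of_all, 1))
```

Under these assumptions the multimodal set has 2,831 probes, matching the "2831 most variable probes" of the preceding slide, and the three percentages come out near the quoted 93.5%, 0.6%, and 29%.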

slide-53
SLIDE 53

Phil de Jager: Methylation in 750 Alzheimer patients

500,000 methylation probes 750 individuals

  • Patients followed for 10+ years with cognitive evaluations
  • Brain samples donated post-mortem → methylation/genotype
  • Seek predictive features: SNPs, QTLs, mQTLs, regulation

Phil de Jager, Roadmap disease epigenomics Brad Bernstein REMC mapping

[Diagram: (1) Genome → Epigenome: meQTL; (2) Epigenome → Phenotype: classification, MWAS]

slide-54
SLIDE 54

Global hyper-methylation trend in AD-associated probes

Alzheimer’s-associated probes are hypermethylated

[Figure: methylation in Alzheimer’s vs. normal for 480,000 probes, ranked by Alzheimer’s association P-value; panels for hypomethylated probes (active) and hypermethylated probes (repressed); top 7000 probes highlighted]

  • Global effect across 1000s of probes

– Rank all probes by Alzheimer’s association
– Observe functional changes down ranklist
– 7000 probes show shift in methylation

→ Complex disease: genome-wide effects, 1000s of loci

slide-55
SLIDE 55

Chromatin state breakdown reveals ↓ activity

* = Fisher exact test, p-value <= 0.001

[Figure: % of probes in each chromatin state: 1 Active promoter, 2 Promoter flanking, 3 Active enhancer, 4 Weak enhancer, 5 Gene bodies, 6 Active gene bodies, 7 Repetitive, 8 Heterochromatin, 9 Low signal]

  • Red: More methylated in Alzheimer’s
  • Blue: Less methylated in Alzheimer’s
  • Significant probes are in enhancers, not promoters
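
The starred comparisons on this slide rely on a Fisher exact test. A minimal standard-library sketch of the two-sided test for a 2x2 table is given below. The table values are hypothetical and how probes were actually tabulated per chromatin state is not stated on the slide; the function name is also an assumption.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def ways(x: int) -> int:
        # Number of tables with the same margins whose top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = ways(a)
    # Sum every table that is no more likely than the observed one (exact integer comparison).
    extreme = sum(ways(x) for x in range(lowest, highest + 1) if ways(x) <= observed)
    return extreme / comb(row1 + row2, col1)


# Hypothetical counts, e.g. probes inside vs. outside one chromatin state,
# split by whether they are more or less methylated in Alzheimer's.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
starred = p_value <= 0.001   # the slide's threshold for marking a state with *
print(p_value, starred)
```

With these toy counts the p-value is about 0.035, so the state would not be starred at the slide's 0.001 threshold.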

slide-56
SLIDE 56

Alzheimer’s prediction vs. likely biological pathways

Predictive power: 6k probes + APOE

Regulatory motifs associated with Alzheimer-associated probes suggest potential pathways

CTCF NRSF ELK1

We have not solved Alzheimer’s, but new insights gained

[Figure axes: all probes, ranked by AD association P-value]

slide-57
SLIDE 57

Systems-level views of disease epigenomics

  • Evolutionary signatures → gene/genome annotation

– High-resolution annotation: genes, RNAs, motif instances
– Measuring selection within the human population

  • Chromatin states for interpreting disease association

– Annotate dynamic regulatory elements in multiple cell types
– Activity-based linking of regulators → enhancers → targets

  • Interpreting disease-associated sequence variants

– Mechanistic predictions for individual top-scoring SNPs
– Functional roles of 1000s of disease-associated SNPs

  • Systematic manipulation of 2000+ human enhancers

– Test effect of single-motif and single-nucleotide disruptions
– Role of activator/repressor motifs, disease-associated SNPs

  • Personal genomes/epigenomes in health and disease

– Allele-specific activity. Alzheimer’s brain methylation ↔ SNP
– Global repression of distal enhancers. NRSF, ELK1, CTCF

slide-58
SLIDE 58

Goal: A systems-level understanding of genomes and gene regulation:

  • The regulators: Transcription factors, microRNAs, sequence specificities
  • The regions: enhancers, promoters, and their tissue-specificity
  • The targets: TFs→targets, regulators→enhancers, enhancers→genes
  • The grammars: Interplay of multiple TFs → prediction of gene expression

→ The parts list = Building blocks of gene regulatory networks

Disease-associated variant (SNP/CNV/…): CATGACTG vs. CATGCCTG

  • Gene annotation (Coding, 5’/3’UTR, RNAs) → Evolutionary signatures
  • Non-coding annotation → Chromatin signatures
  • Roles in gene/chromatin regulation → Activator/repressor signatures
  • Other evidence of function → Signatures of selection (sp/pop)

Understanding human variation and human disease

  • Challenge: from loci to mechanism, pathways, drug targets
slide-59
SLIDE 59

Collaborators and Acknowledgements

  • ENCODE

– Brad Bernstein, Tarjei Mikkelsen, Noam Shoresh, David Epstein

  • Massively parallel enhancer reporter assays

– Tarjei Mikkelsen, Broad Institute

  • Epigenome Roadmap

– Bing Ren, Brad Bernstein, John Stam, Joe Costello

  • 2X mammals

– Kerstin Lindblad-Toh, Eric Lander, Manuel Garber, Or Zuk

  • Funding

– NHGRI, NIH, NSF, Sloan Foundation

slide-60
SLIDE 60

Daniel Marbach Mike Lin Jason Ernst Jessica Wu Rachel Sealfon Pouya Kheradpour (#187) Manolis Kellis Chris Bristow Loyal Goff Irwin Jungreis

MIT Computational Biology group Compbio.mit.edu

Sushmita Roy #331: Luke Ward

Stata4 Stata3

Louisa DiStefano Dave Hendrix Angela Yen Ben Holmes Soheil Feizi Mukul Bansal #19:Bob Altshuler Stefan Washietl Matt Eaton