Lecture 6: Regulatory genomics (Gene regulation, chromatin accessibility, DNA regulatory code)


slide-1
SLIDE 1

6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences

Lecture 6: Regulatory genomics

Gene regulation, chromatin accessibility, DNA regulatory code

  • Prof. Manolis Kellis

http://mit6874.github.io

Slides credit: 6.047, Anshul Kundaje, David Gifford

slide-2
SLIDE 2

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis

slide-3
SLIDE 3
  • 1a. Basics of gene regulation
slide-4
SLIDE 4

One Genome – Many Cell Types


ACCAGTTACGACGGTCA GGGTACTGATACCCCAA ACCGTTGACCGCATTTA CAGACGGGGTTTGGGTT TTGCCCCACACAGGTAC GTTAGCTACTGGTTTAG CAATTTACCGTTACAAC GTTTACAGGGTTACGGT TGGGATTTGAAAAAAAG TTTGAGTTGGTTTTTTC ACGGTAGAACGTACCGT TACCAGTA

Image Source wikipedia

slide-5
SLIDE 5

DNA packaging

  • Why packaging

– DNA is very long – Cell is very small

  • Compression

– Chromosome is 50,000 times shorter than extended DNA

  • Using the DNA

– Before a piece of DNA is used for anything, this compact structure must open locally

  • Now emerging:

– Role of accessibility – State in chromatin itself – Role of 3D interactions

slide-6
SLIDE 6

Combinations of marks encode epigenomic state

  • 100s of known modifications, many new still emerging
  • Systematic mapping using ChIP-, Bisulfite-, DNase-Seq
(Figure: genome browser tracks of marks grouped by element type)

Promoters: H3K4me3, H3K9ac, DNase | Enhancers: H3K4me1, H3K27ac, DNase | Transcribed: H3K36me3, H3K79me2, H4K20me1 | Repressed: H3K9me3, H3K27me3, DNA methylation

Enhancers Promoters Transcribed Repressed

slide-7
SLIDE 7

Summarize multiple marks into chromatin states

ChromHMM: multi-variate hidden Markov model

WashU Epigenome Browser

30+ epigenomic marks → chromatin state track summary

slide-8
SLIDE 8

Promoter region Enhancer region Protein-coding sequence

Transcription factors control activation of cell-type-specific promoters and enhancers

slide-9
SLIDE 9

TFs use DNA-binding domains to recognize specific DNA sequences in the genome

DNA-binding domain of Engrailed. “Logo” or “motif”: TAATTA, CACGTG, AGATAAGA, TCATTA

slide-10
SLIDE 10

Regulator structure → recognized motifs

  • Proteins ‘feel’ DNA
  • Read chemical properties of bases
  • Do NOT open DNA (no base complementarity)
  • 3D topology dictates specificity
  • Fully constrained positions: every atom matters
  • “Ambiguous / degenerate” positions: loosely contacted
  • Other types of recognition:
  • MicroRNAs: complementarity
  • Nucleosomes: GC content
  • RNAs: structure / sequence combination
slide-11
SLIDE 11

Motifs summarize TF sequence specificity

  • Summarize information
  • Integrate many positions
  • Measure of information
  • Distinguish motif vs. motif instance
  • Assumptions: Independence, Fixed spacing

slide-12
SLIDE 12

Regulatory motifs at all levels of pre-/post-transcriptional regulation

  • The parts list: ~20-30k genes
  • Protein-coding genes, RNA genes (tRNA, microRNA, snRNA)
  • The circuitry: constructs controlling gene usage
  • Enhancers, promoters, splicing, post-transcriptional motifs
  • The regulatory code, complications:
  • Combinatorial coding of ‘unique tags’
  • Data-centric encoding of addresses
  • Overlaid with ‘memory’ marks
  • Large-scale on/off states
  • Modulation of the large-scale coding
  • Post-transcriptional and post-translational information
  • Today: discovering motifs in co-regulated promoters and de novo motif discovery & target identification

Enhancer regions (Where in the body? When in time? Which variants?) · Promoter motifs · Splicing signals (Which subsets?) · Motifs at RNA level

slide-13
SLIDE 13

Disrupted motif at the heart of FTO obesity locus

Obese vs. Lean. Strongest association with obesity: C-to-T disruption of AT-rich regulatory motif. Restoring motif restores thermogenesis.

slide-14
SLIDE 14
  • 1b. Technologies for probing gene regulation
slide-15
SLIDE 15

Bar-coded multiplexed sequencing

Mapping regulator binding: ChIP-seq

(Chromatin immunoprecipitation followed by sequencing) TF=transcription factor

antibody

slide-16
SLIDE 16

ChIP-chip and ChIP-Seq technology overview

Image adapted from Wikipedia

Modification-specific antibodies → Chromatin Immuno-Precipitation followed by: ChIP-chip (array hybridization) or ChIP-Seq (massively parallel next-gen sequencing)

slide-17
SLIDE 17

ChIP-Seq Histone Modifications: What the raw data looks like

  • Each sequence tag is 30 base pairs long
  • Tags are mapped to unique positions in the ~3 billion

base reference genome

  • Number of reads depends on sequencing depth.

Typically on the order of 10 million mapped reads.


slide-18
SLIDE 18

Chromatin accessibility can reveal TF binding

Sherwood RI, et al. “Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape.” Nat. Biotech 2014.

slide-19
SLIDE 19

DNase-seq reveals genome protection profiles
slide-20
SLIDE 20

ATAC-seq

slide-21
SLIDE 21

ATAC-seq and DNase-seq are not identical

GM12878, Chr. 14; each point is accessibility in a 2 kb window

Hashimoto TB, et al. “A Synergistic DNA Logic Predicts Genome-wide Chromatin Accessibility.” Genome Research 2016.

slide-22
SLIDE 22

DNase-seq is less defined evidence than ChIP-seq

ChIP-seq reports TF-binding regions (specifically); DNase-seq reports proximal, not necessarily bound, locations (noisily)

slide-23
SLIDE 23

Bound factors leave distinct DNase-seq profiles

CTCF, Brg, Oct4, Zfx, Esrrb motif profiles; aggregate CTCF vs. individual CTCF

Individual binding site prediction is difficult

slide-24
SLIDE 24

Motifs can predict TF binding

~650,000 TF motifs; ~50,000 binding sites for a typical TF

Binding sites change across time
slide-25
SLIDE 25

Chromatin accessibility influences transcription factor binding

  • Modeling accessibility profiles yields binding predictions and pioneer factor discovery
  • Asymmetric accessibility is induced by directional pioneers
  • The binding of settler factors can be enabled by proximal pioneer factor binding

Sherwood RI, et al. “Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape.” Nat. Biotech 2014.

slide-26
SLIDE 26

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis

slide-27
SLIDE 27
  • 2. Classical regulatory genomics

(before Deep Learning)

slide-28
SLIDE 28

Enrichment-based discovery methods

Given a set of co-regulated/functionally related genes, find common motifs in their promoter regions

  • Align the promoters to each other using local alignment
  • Use expert knowledge for what motifs should look like
  • Find ‘median’ string by enumeration (motif/sample driven)
  • Start with conserved blocks in the upstream regions
slide-29
SLIDE 29

Starting positions → Motif matrix

Sequence positions 1-8 (example profile matrix; each column sums to 1):

Pos:  1    2    3    4    5    6    7    8
A    0.1  0.1  0.3  0.2  0.1  0.3  0.1  0.3
C    0.1  0.5  0.2  0.1  0.1  0.2  0.1  0.2
G    0.6  0.2  0.2  0.5  0.6  0.1  0.7  0.2
T    0.2  0.2  0.3  0.2  0.2  0.4  0.1  0.3

  • given aligned sequences → easy to compute profile matrix (shared motif)
  • given profile matrix → easy to find starting position probabilities

Key idea: iterative procedure for estimating both, given uncertainty (a learning problem with hidden variables, the starting positions): expectation maximization
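The iterative estimation the slide sketches can be written directly in numpy. This is a toy "one occurrence per sequence" EM, not MEME itself; the pseudocounts and random initialization are illustrative choices:

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    X = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        X[ALPHABET.index(b), j] = 1.0
    return X

def em_motif(seqs, k, n_iter=50, seed=0):
    """Alternate between (E) start-position posteriors given the profile
    and (M) profile re-estimation from the soft alignments."""
    rng = np.random.default_rng(seed)
    profile = rng.dirichlet(np.ones(4), size=k).T   # 4 x k, columns sum to 1
    Xs = [one_hot(s) for s in seqs]
    for _ in range(n_iter):
        counts = np.full((4, k), 0.1)               # pseudocounts
        for X in Xs:
            n_starts = X.shape[1] - k + 1
            # E-step: likelihood of the motif at each start position
            lik = np.array([np.prod(np.sum(profile * X[:, i:i+k], axis=0))
                            for i in range(n_starts)])
            post = lik / lik.sum()
            # M-step contribution: soft counts weighted by posteriors
            for i, w in enumerate(post):
                counts += w * X[:, i:i+k]
        profile = counts / counts.sum(axis=0)
    return profile

profile = em_motif(["ACGTGCAGTT", "TTGCAGACGT", "GCAGTTTTAC"], k=4, n_iter=20)
```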

slide-30
SLIDE 30

Experimental factor-centric discovery of motifs

SELEX (Systematic Evolution of Ligands by EXponential enrichment; Klug & Famulok, 1994); DIP-chip (DNA immunoprecipitation with microarray detection; Liu et al., 2005); PBMs (protein-binding microarrays; Mukherjee et al., 2004): double-stranded DNA arrays

slide-31
SLIDE 31

Approaches to regulatory motif discovery

  • Expectation Maximization (e.g. MEME)

– Iteratively refine positions / motif profile

  • Gibbs Sampling (e.g. AlignACE)

– Iteratively sample positions / motif profile

  • Enumeration with wildcards (e.g. Weeder)

– Allows global enrichment/background score

  • Peak-height correlation (e.g. MatrixREDUCE)

– Alternative to cutoff-based approach

  • Conservation-based discovery (e.g. MCS)

– Genome-wide score, up-/down-stream bias

  • Protein Domains (e.g. PBMs, SELEX)

– In vitro motif identification, seq-/array-based

Region-based motif discovery Genome-wide In vitro / trans
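For contrast with EM's soft assignments, a minimal Gibbs sampler (in the spirit of AlignACE-style methods, not its exact algorithm) resamples one sequence's motif position at a time. The sketch assumes ACGT-only sequences and illustrative pseudocounts:

```python
import numpy as np

ALPHABET = "ACGT"

def gibbs_motif(seqs, k, n_iter=100, seed=0):
    """Hold out one sequence, build a profile from the current motif
    positions in the others, then resample the held-out sequence's
    start position in proportion to the profile likelihoods."""
    rng = np.random.default_rng(seed)
    pos = [rng.integers(0, len(s) - k + 1) for s in seqs]
    for _ in range(n_iter):
        for h in range(len(seqs)):
            # profile from all sequences except the held-out one
            counts = np.full((4, k), 0.5)           # pseudocounts
            for i, s in enumerate(seqs):
                if i == h:
                    continue
                for j, b in enumerate(s[pos[i]:pos[i] + k]):
                    counts[ALPHABET.index(b), j] += 1
            profile = counts / counts.sum(axis=0)
            # sample a new start for the held-out sequence
            s = seqs[h]
            lik = np.array([np.prod([profile[ALPHABET.index(b), j]
                                     for j, b in enumerate(s[i:i + k])])
                            for i in range(len(s) - k + 1)])
            pos[h] = rng.choice(len(lik), p=lik / lik.sum())
    return pos

pos = gibbs_motif(["ACGTGCAGTT", "TTGCAGACGT", "GCAGTTAATC"], k=4, n_iter=20)
```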

slide-32
SLIDE 32

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis

slide-33
SLIDE 33

Deep convolutional neural network

Input sequence: G C A T T A C C G A T A A

Convolutional layer (same color = shared weights). Conv Layer 1: kernel width = 4, stride = 2*, num filters / num channels = 3, total neurons = 15. Later conv layers operate on outputs of previous conv layers.

Maxpooling layers take the max over sets of conv layer outputs. Maxpooling layer: pool width = 2, stride = 1. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6.

Typically followed by one or more fully connected layers. Sigmoid activations give P(TF = bound | X).

*for genomics, a stride of 1 for conv layers is recommended

slide-34
SLIDE 34
  • 3a. CNNs for Regulatory Genomics Foundations

(Low-level features)

slide-35
SLIDE 35

An example of using CNN to model DNA sequence

NNNATGCAGCANNN

Rows A, T, G, C: matrix representation of DNA sequence (darker = stronger). Representing DNA sequence as a 2D matrix:
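The matrix representation above is just one-hot encoding; a minimal sketch (rows ordered A, C, G, T here, with ambiguous 'N' bases left as all-zero columns):

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a 4 x L binary matrix (rows A, C, G, T)."""
    alphabet = "ACGT"
    mat = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        if base in alphabet:          # 'N' columns stay all-zero
            mat[alphabet.index(base), j] = 1.0
    return mat

X = one_hot("ATGCAGCA")
```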

slide-36
SLIDE 36

Convolution – extracting invariant feature

Applying a 4 bp sequence filter along the DNA matrix:

Filter activity on ATGCAGCA, shown at the 1st position, then the 3rd position (yellow = high activity; blue = low activity)

slide-37
SLIDE 37

Convolution – extracting invariant feature

Convolution module

NNNATGCAGCANNN

Convolution filters applied to the matrix representation of the DNA sequence (darker = stronger) → filtered signal for ATGCAGCA → rectification (denoising) → max pooling

Rectification = ignore signals below some threshold. Pooling = summary of each channel by max or average.

slide-38
SLIDE 38

Prediction using extracted features map

Convolution module → Prediction module. Trained against ChIP-seq, PBM, SELEX experiments on DNA sequence.

For each individual motif (e.g. GCRC, TGRT, ATRc): match filter → max. Higher-level combinations (e.g. GCRC|ATRc) → affinity.

[Park and Kellis, 2015]

slide-39
SLIDE 39

Key properties of regulatory sequence

TRANSCRIPTION FACTOR BINDING Regulatory proteins called transcription factors (TFs) bind to high affinity sequence patterns (motifs) in regulatory DNA Transcription factor Regulatory DNA sequences Motif

slide-40
SLIDE 40

Sequence motifs: PWM

Set of aligned sequences bound by the TF: GGATAA, CGATAA, CGATAT, GGATAT

Position weight matrix (PWM; blank cells in the slide = 0):

Pos:  1    2    3    4    5    6
A     0    0    1    0    1   0.5
C    0.5   0    0    0    0    0
G    0.5   1    0    0    0    0
T     0    0    0    1    0   0.5

PWM logo (y-axis in bits)

https://en.wikipedia.org/wiki/Sequence_logo
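The PWM is simply the column-wise base frequencies of the aligned bound sequences; a minimal sketch reproducing the table above:

```python
import numpy as np

def pwm_from_alignment(seqs):
    """Column-wise base frequencies (rows A, C, G, T) from aligned bound sequences."""
    alphabet = "ACGT"
    L = len(seqs[0])
    counts = np.zeros((4, L))
    for s in seqs:
        for j, base in enumerate(s):
            counts[alphabet.index(base), j] += 1
    return counts / len(seqs)

pwm = pwm_from_alignment(["GGATAA", "CGATAA", "CGATAT", "GGATAT"])
```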

slide-41
SLIDE 41

Sequence motifs: PSSM

Position-specific scoring matrix (PSSM) and PSSM logo:

Pos:   1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5

Accounting for genomic background nucleotide distribution
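A PSSM converts PWM probabilities into log-odds scores against the background distribution. A sketch assuming a uniform 0.25 background; the slide's exact values depend on its own pseudocount and background choices:

```python
import numpy as np

def pssm_from_pwm(pwm, background=0.25, pseudocount=0.01):
    """Log-odds score of each base vs. the genomic background.
    The pseudocount keeps log(0) out of the matrix."""
    p = (pwm + pseudocount) / (1.0 + 4 * pseudocount)
    return np.log2(p / background)

# toy 3-position PWM (rows A, C, G, T)
pwm = np.array([[0.0, 0.0, 1.0],
                [0.5, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 0.0]])
pssm = pssm_from_pwm(pwm)
```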

slide-42
SLIDE 42

Scoring a sequence with a motif PSSM

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X)

Scoring weights W (the PSSM parameters):

Pos:   1     2     3     4     5     6
A    -5.7  -3.2   3.7  -3.2   3.7   0.6
C     0.5  -3.2  -3.2  -3.2  -3.2  -5.7
G     0.5   3.7  -3.2  -3.2  -3.2  -5.7
T    -5.7  -3.2  -3.2   3.7  -3.2   0.5

slide-43
SLIDE 43

Convolution: Scoring a sequence with a PSSM

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X), scoring weights W (PSSM above)

Motif match scores sum(W * x), computed window by window: -5.4, …

slide-44
SLIDE 44

Convolution (continued)

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X), scoring weights W (PSSM above)

Motif match scores sum(W * x) so far: -5.4, 2.0, …

slide-45
SLIDE 45

Convolution (continued)

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X), scoring weights W (PSSM above)

Motif match scores sum(W * x): -2.2, -5.4, 2.0, -4.3, -24, -17, -18, -11, -12, 16, -5.5, -8.5, -5.2
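The window-by-window scores above are exactly a 1-D convolution of the one-hot sequence with the PSSM. Scanning GCATTACCGATAA with the slide's W recovers the strong hit of about 16 at the CGATAA window (the unpadded scan yields 8 windows rather than the slide's padded 13):

```python
import numpy as np

W = np.array([  # PSSM rows A, C, G, T; columns = motif positions 1-6
    [-5.7, -3.2,  3.7, -3.2,  3.7,  0.6],
    [ 0.5, -3.2, -3.2, -3.2, -3.2, -5.7],
    [ 0.5,  3.7, -3.2, -3.2, -3.2, -5.7],
    [-5.7, -3.2, -3.2,  3.7, -3.2,  0.5],
])

def one_hot(seq):
    """4 x L one-hot matrix, rows A, C, G, T."""
    alphabet = "ACGT"
    mat = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        mat[alphabet.index(base), j] = 1.0
    return mat

def pssm_scan(X, W):
    """Slide the motif over every window: score_i = sum(W * X[:, i:i+k])."""
    k = W.shape[1]
    L = X.shape[1]
    return np.array([np.sum(W * X[:, i:i+k]) for i in range(L - k + 1)])

scores = pssm_scan(one_hot("GCATTACCGATAA"), W)
```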

slide-46
SLIDE 46

Thresholding scores

Input sequence: G C A T T A C C G A T A A → one-hot encoding (X), scoring weights W (PSSM above)

Motif match scores W*x: -2.2, -5.4, 2.0, -4.3, -24, -17, -18, -11, -12, 16, -5.5, -8.5, -5.2

Thresholded motif scores max(0, W*x): all zero except 2.0 and 16
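Rectification and pooling on these scores are one-liners; using the slide's score vector:

```python
import numpy as np

def relu(x):
    """Rectification: zero out scores below the threshold (here 0)."""
    return np.maximum(0.0, x)

def max_pool(x, width, stride):
    """Summarize each window of conv outputs by its maximum."""
    return np.array([x[i:i+width].max()
                     for i in range(0, len(x) - width + 1, stride)])

scores = np.array([-2.2, -5.4, 2.0, -4.3, -24, -17, -18, -11, -12, 16, -5.5, -8.5, -5.2])
thresholded = relu(scores)   # only 2.0 and 16 survive
```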

slide-47
SLIDE 47
  • 3b. CNNs for Regulatory Genomics Foundations

(Higher-level learning)

slide-48
SLIDE 48

Learning patterns in regulatory DNA sequence

  • Positive class of genomic sequences: bound by a transcription factor of interest
  • Negative class of genomic sequences: not bound by the transcription factor of interest

Can we learn patterns in the DNA sequence that distinguish these 2 classes of genomic sequences?

slide-49
SLIDE 49

HOMOTYPIC MOTIF DENSITY Regulatory sequences often contain more than one binding instance of a TF resulting in homotypic clusters of motifs of the same TF

Key properties of regulatory sequence

slide-50
SLIDE 50

Key properties of regulatory sequence

HETEROTYPIC MOTIF COMBINATIONS Regulatory sequences often bound by combinations of TFs resulting in heterotypic clusters of motifs of different TFs

slide-51
SLIDE 51

Key properties of regulatory sequence

SPATIAL GRAMMARS OF HETEROTYPIC MOTIF COMBINATIONS Regulatory sequences are often bound by combinations of TFs with specific spatial and positional constraints resulting in distinct motif grammars

slide-52
SLIDE 52

A simple classifier (An artificial neuron)

Linear function: z = w · x + b (parameters: the w’s and b)

Training the neuron means learning the optimal w’s and b

slide-53
SLIDE 53

A simple classifier (An artificial neuron) Y

Non-linear function: Logistic / Sigmoid. Useful for predicting probabilities. Training the neuron means learning the optimal w’s and b (the parameters).

slide-54
SLIDE 54

A simple classifier (An artificial neuron) Y

Training the neuron means learning the optimal w’s and b

ReLU (Rectified Linear Unit)

Non-linear function, useful for thresholding (the parameters are again the w’s and b)
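The three slides above describe one computation, y = f(w·x + b), with different choices of the nonlinearity f; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Squashes z into (0, 1): useful for predicting probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: useful for thresholding."""
    return np.maximum(0.0, z)

def neuron(x, w, b, activation):
    """A single artificial neuron: y = activation(w . x + b)."""
    return activation(np.dot(w, x) + b)
```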

slide-55
SLIDE 55

Artificial neuron can represent a motif

Y

parameters

slide-56
SLIDE 56

Convolutional filters learn motifs (PSSM)

Biological motivation of Deep CNN

Max pool thresholded scores over windows Threshold scores using ReLU Scan sequence using filters Predict probabilities using logistic neuron

slide-57
SLIDE 57

Deep convolutional neural network

Input sequence: G C A T T A C C G A T A A

Convolutional layer (same color = shared weights). Conv Layer 1: kernel width = 4, stride = 2*, num filters / num channels = 3, total neurons = 15. Later conv layers operate on outputs of previous conv layers.

Maxpooling layers take the max over sets of conv layer outputs. Maxpooling layer: pool width = 2, stride = 1. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6.

Typically followed by one or more fully connected layers. Sigmoid activations give P(TF = bound | X).

*for genomics, a stride of 1 for conv layers is recommended

slide-58
SLIDE 58

Multi-task CNN

Input sequence: G C A T T A C C G A T A A

Convolutional layer (same color = shared weights). Conv Layer 1: kernel width = 4, stride = 2, num filters / num channels = 3, total neurons = 15. Later conv layers operate on outputs of previous conv layers.

Maxpooling layers take the max over sets of conv layer outputs. Maxpooling layer: pool width = 2, stride = 1. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6.

Typically followed by one or more fully connected layers.

Multi-task output (sigmoid activations here): P(TF1 = bound | X), P(TF2 = bound | X)

slide-59
SLIDE 59

Multi-task CNN (architecture recap)

Input sequence: G C A T T A C C G A T A A

Convolutional layer (same color = shared weights). Conv Layer 1: kernel width = 4, stride = 2*, num filters / num channels = 3, total neurons = 15. Later conv layers operate on outputs of previous conv layers.

Maxpooling layers take the max over sets of conv layer outputs. Maxpooling layer: pool width = 2, stride = 1. Conv Layer 2: kernel width = 3, stride = 1, num filters / num channels = 2, total neurons = 6.

Typically followed by one or more fully connected layers.

slide-60
SLIDE 60

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis

slide-61
SLIDE 61
  • 4. Regulatory Genomics CNNs in Practice:

(a) DeepBind

slide-62
SLIDE 62

DeepBind

[Alipanahi et al., 2015]

slide-63
SLIDE 63

http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html

slide-64
SLIDE 64
slide-65
SLIDE 65

Constructing mutation map

Ref: NNNATGCAGCANNN → Alt: NNNATGTAGCANNN

The DeepBind model computes p(s_ref|w) and p(s_alt|w) for each single-base substitution j:

∆s_j = (p(s_alt|w) - p(s_ref|w)) · max(0, p(s_alt|w), p(s_ref|w))
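In-silico mutagenesis as in the mutation map can be sketched generically; here `score_fn` stands in for the trained model's p(s|w), and the toy substring-counting scorer in the usage example is purely illustrative:

```python
import numpy as np

ALPHABET = "ACGT"

def mutation_map(seq, score_fn):
    """DeepBind-style in-silico mutagenesis: score every single-base
    substitution and record its effect relative to the reference."""
    s_ref = score_fn(seq)
    delta = np.zeros((4, len(seq)))
    for j in range(len(seq)):
        for i, b in enumerate(ALPHABET):
            if b == seq[j]:
                continue  # reference base: entry stays 0
            alt = seq[:j] + b + seq[j+1:]
            s_alt = score_fn(alt)
            # the slide's scaling: difference weighted by the stronger score
            delta[i, j] = (s_alt - s_ref) * max(0.0, s_alt, s_ref)
    return delta

# toy scorer: count occurrences of "GC" (a stand-in for the real model)
dm = mutation_map("AGCA", lambda s: float(s.count("GC")))
```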

slide-66
SLIDE 66

Constructing sequence logo

For a test sequence NNNATGCAGCANNN, the convolution module produces a filtered signal which is rectified (denoised). Each filter then collects the subsequence windows that activate it (Motif 1: GCAG, CAGC, ACGA, ...; Motif 2: CAGC, GGTC, AGTC, AGGC, GGTG, ...). Aligning each filter’s collected windows and counting bases gives a position frequency matrix (PFM), which is drawn as that filter’s sequence logo.

slide-67
SLIDE 67

Predicting disease mutations

[Alipanahi et al., 2015]

slide-68
SLIDE 68

DeepBind summary

The key deep learning techniques:

  • Convolutional learning
  • Representational learning
  • Back-propagation and stochastic gradient
  • Regularization and dropout
  • Parallel GPU computing, especially useful for hyperparameter search

Limitations in DeepBind:

  • Requires defining negative training examples, which is often arbitrary

  • Using observed mutation data only as post-hoc evaluation
  • Modeling each regulatory dataset separately
slide-69
SLIDE 69

Regulatory Genomics CNNs in Practice: (b) DeepSEA

slide-70
SLIDE 70

DeepSea

DeepSea:

  • Similar to DeepBind, but trained on the ENCODE/Roadmap Epigenomics chromatin profiles: 919 chromatin features (125 DNase features, 690 TF features, 104 histone features)
  • It uses the ∆s mutation score as input to train a linear logistic regression to predict GWAS and eQTL SNPs defined from the GRASP database with a P-value cutoff of 1E-10, and GWAS SNPs from the NHGRI GWAS Catalog

[Zhou and Troyanskaya, 2015]

slide-71
SLIDE 71

Regulatory Genomics CNNs in Practice: (c) Basset

slide-72
SLIDE 72

Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks.

David R. Kelley, Jasper Snoek, John L. Rinn. Genome Research, March 2016

slide-73
SLIDE 73

Basset

Simultaneously predicting DNase sites in 164 cell types with 300 convolution filters. CNN-based Basset outperforms gkm-SVM. Convolutional filters connected to the input sequence recapitulate some known TF motifs.

[Kelley et al., 2016]

slide-74
SLIDE 74

Basset architecture for accessibility prediction

300 filters; 3 conv layers; 3 fully connected layers; 164 outputs (1 per cell type). Input: 600 bp; output: 164 bits; 1.9 million training examples

slide-75
SLIDE 75

Basset AUC performance vs. gkm-SVM

slide-76
SLIDE 76

45% of filter-derived motifs are found in the CIS-BP database

Motifs created by clustering matching input sequences and computing a PWM

slide-77
SLIDE 77

Motifs derived from filters with more information tend to be annotated

slide-78
SLIDE 78

Computational saturation mutagenesis of an AP-1 site reveals loss of accessibility

slide-79
SLIDE 79

Regulatory Genomics CNNs in Practice: (d) Chromputer

slide-80
SLIDE 80

ChromPuter

E2F6 Other TFs

Class Probabilities

CTCF MYC GATA1 SOX2 OCT4 NANOG

2nd FC Layer 1st FC Layer

Multi- task learning

2nd set of Convolutional Maps

1D DNase-seq/ATAC-seq profile DNA sequence

(Anshul Kundaje’s group from Stanford)

slide-81
SLIDE 81

How does a deep conv. neural network transform the raw V-plot input at each layer?

Promoter Enhancer

Initial Smoothing 1st set of Convolutional Maps 2nd Smoothing 2nd set of Convolutional Maps 3rd Smoothing 1st Fully Connected Layer 2nd Fully Connected Layer Class Probabilities V-Plot Input (300 x 2001) Chromatin State

(V-plot axes span -1 kb to +1 kb; example classes: Pure CTCF, Promoter, Enhancer)

slide-82
SLIDE 82

After initial pooling (smoothing)

Pure CTCF Promoter Enhancer

Initial Smoothing 1st set of Convolutional Maps 2nd Smoothing 2nd set of Convolutional Maps 3rd Smoothing 1st Fully Connected Layer 2nd Fully Connected Layer Class Probabilities V-Plot Input (300 x 2001) Chromatin State

slide-83
SLIDE 83

Second set of convolutional maps

Pure CTCF Promoter Enhancer

Initial Smoothing 1st set of Convolutional Maps 2nd Smoothing 2nd set of Convolutional Maps 3rd Smoothing 1st Fully Connected Layer 2nd Fully Connected Layer Class Probabilities V-Plot Input (300 x 2001) Chromatin State

slide-84
SLIDE 84

Learning from multiple 1D functional data (e.g. DNase, MNase)

1st Convolution Layer 2nd Convolution Layer

2nd FC Layer

1D MNase signal (1 x 2001)

Class Probabilities

3rd Convolution Layer

1st FC Layer

1st Convolution Layer 2nd Convolution Layer 3rd Convolution Layer

1D DNase signal (1 x 2001) Chromatin State

Scan DNase profile using filter

slide-85
SLIDE 85

Learning from raw DNA sequence

Higher layers learn motif combinations

Class Probabilities

Score sequence using filters Convolutional layers learn motif (PWM) like filters

slide-86
SLIDE 86

The ChrompuTer

Integrating multiple inputs (1D, 2D signals, sequence) to simultaneously predict multiple outputs

Class probabilities: Chromatin State, TF Binding, and histone marks H3K4me3, H3K9me3, H3K27me3, H3K4me1, H2A.Z, H3K36me3

2nd FC Layer 1st FC Layer

Per-input pipeline (repeated for each of the three input streams): V-Plot Input (300 x 2001) → Initial Smoothing → 1st set of Convolutional Maps → 2nd Smoothing → 2nd set of Convolutional Maps → 3rd Smoothing → 1st Combined FC Layer → 2nd Combined FC Layer

Multi- task learning

slide-87
SLIDE 87

Chromatin architecture can predict

chromatin state in held out chromosome

(same cell type)

Model + input data types                         | 8-class chromatin state accuracy (%)
Majority class (baseline)                        | 42
Gene proximity                                   | 59
Random Forest: ATAC-seq (150M reads)             | 61
Chromputer: DNase (60M reads)                    | 68.1
Chromputer: MNase (1.5B reads)                   | 69.3
Chromputer: ATAC-seq (150M reads)                | 75.9
Chromputer: DNase + MNase                        | 81.6
Chromputer: ATAC-seq + sequence                  | 83.5
Chromputer: DNase + MNase + sequence             | 86.2
Label accuracy across replicates (upper bound)   | 88

slide-88
SLIDE 88

High cross cell-type chromatin state prediction

  • Learn model on DNase and MNase only
  • Learn on GM12878, predict on K562 (and vice versa)
  • Requires local normalization to make signal comparable

8-class chromatin state accuracy:

Train ↓ / Test →   GM12878   K562
GM12878             0.816    0.818
K562                0.769    0.844

slide-89
SLIDE 89

Predicting individual histone marks from ATAC/DNase/MNase/Sequence

(Bar chart: area under precision-recall curve, roughly 0.25-0.75, for CTCF, H3K27ac, H3K4me3, H3K4me1, H3K9ac, H2A.Z, H3K36me3, H3K27me3, H3K9me3)

slide-90
SLIDE 90

Chromputer trained on TF ChIP-seq predicts cross cell-type in-vivo TF binding with high accuracy


Chromputer

Area under Precision Recall (PR) curve

c-MYC YY1 CTCF

Inputs: Seq + DNA shape + DNase profile Positives: Reproducible ChIP-seq peaks Negatives: All other DNase peaks + flanks + matched random sites Test sets: Held out chromosomes in held out cell types

slide-91
SLIDE 91

DeepLIFT reveals feature importance at the input layer

G C A T T A C C G A T A A

Nanog Gata1

Which neurons/filters are predictive? Which nucleotides in the input sequence are contributing to binding?

Key idea:

  • ReLU is piece-wise linear
  • Backpropagate differences of outputs using observed and reference inputs (e.g., inputs of all zeros) to obtain the gradient w.r.t. the input
  • Importance of any input to any output is the gradient weighted by the input itself

(Anshul Kundaje’s group from Stanford)
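The "gradients weighted by the input" idea can be shown on a toy one-hidden-layer ReLU net. DeepLIFT proper propagates reference-based contribution scores; this is the simpler gradient times (input minus reference) variant the bullets describe, with all weights illustrative:

```python
import numpy as np

def input_importance(x, x_ref, w1, b1, w2, b2):
    """Importance scores for a toy one-hidden-layer ReLU net
    f(x) = w2 . relu(w1 @ x + b1) + b2."""
    z = w1 @ x + b1
    active = (z > 0).astype(float)   # ReLU is piece-wise linear: slope 1 where z > 0
    grad_x = w1.T @ (w2 * active)    # df/dx via the chain rule
    return grad_x * (x - x_ref)      # gradient weighted by the input change

# inputs feeding inactive ReLUs receive zero importance
imp = input_importance(np.array([1.0, -1.0]), np.zeros(2),
                       np.eye(2), np.zeros(2), np.array([1.0, 2.0]), 0.0)
```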

slide-92
SLIDE 92

Deep Learning for Regulatory Genomics

  • 1. Biological foundations: Building blocks of Gene Regulation

– Gene regulation: Cell diversity, Epigenomics, Regulators (TFs), Motifs, Disease role – Probing gene regulation: TFs/histones: ChIP-seq, Accessibility: DNase/ATAC-seq

  • 2. Classical methods for Regulatory Genomics and Motif Discovery

– Enrichment-based motif discovery: Expectation Maximization, Gibbs Sampling – Experimental: PBMs, SELEX. Comparative genomics: Evolutionary conservation.

  • 3. Regulatory Genomics CNNs (Convolutional Neural Networks): Foundations

– Key idea: pixels  DNA letters. Patches/filters  Motifs. Higher  combinations – Learning convolutional filters  Motif discovery. Applying them  Motif matches

  • 4. Regulatory Genomics CNNs/RNNs in Practice: Diverse Architectures

– DeepBind: Learn motifs, use in (shallow) fully-connected layer, mutation impact – DeepSea: Train model directly on mutational impact prediction – Basset: Multi-task DNase prediction in 164 cell types, reuse/learn motifs – ChromPuter: Multi-task prediction of different TFs, reuse partner motifs – DeepLIFT: Model interpretation based on neuron activation properties – DanQ: Recurrent Neural Network for sequential data analysis