Toward a More Accurate Genome: Algorithms for the Analysis of High- Throughput Sequencing Data
Dissertation Defense
- W. Jacob B. Biesinger
Tuesday, May 27th
Toward a More Accurate Genome: Algorithms for the Analysis of High- - - PowerPoint PPT Presentation
Toward a More Accurate Genome: Algorithms for the Analysis of High- Throughput Sequencing Data Dissertation Defense W. Jacob B. Biesinger Tuesday, May 27th AREM TreeHMM Applied statistics Genomix Applied computer science ChIP-seq
Dissertation Defense
Tuesday, May 27th
Applied statistics Applied computer science
and 10 years of labor and planning
sequence of a large genome New biological insight Seeded technological revolution: high- throughput sequencing
http://www.genome.gov/sequencingcosts/
XPRIZE: $10 million for 100 genomes @ $1,000 each XPRIZE cancelled: “Outpaced by Innovation”
Kercher et al., Bioessays (2010).
low-throughput high cost Now 250nt, $.05/MB
Targeted resequencing Whole-genome sequencing Exome sequencing ChIP-seq RNA-seq MeDIP-seq CLIP-seq ChIRP-seq Hi-C ChIA-PET
New experimental methods New (or rebranded) fields
Genomics Transcriptomics Metabolomics Microbiomics Toxicogenomics Epigenomics Interactonomics Circadiomics
○ Discovery of inheritable genetic disorders ○ Cancer biology (identify cancer subtypes) ○ Evolution and spread of infectious diseases ○ Prenatal diagnostics ○ Now transitioning into clinical laboratory ○ Lead to personalized therapies
○ Gene expression levels ○ Identify regulatory network structure ○ Elucidate fundamental biological processes ■ find promoter TATA binding, splicing mechanisms, the drivers of cellular state/stem cell “stemness”
You can’t trust 1/100 bases We all wish the error rate were uniform All kinds of hidden biases based on the sequence composition (GC-content, strand-bias, positional bias, But we have much more
it all?
Detect and correct errors and biases See the biology beyond the letters
*Koning et al. PLOS Genetics 7 (12): e1002384
**Langmead et. al. Genome Biology 10 (2009) R25 ACGTGATATAAACTGCGTCGGATATAAACTACTCTAGG GATATAAACT
wikipedia.org/wiki/ChIP-sequencing
Qing Zhou, PNAS 16438–16443 Qing Zhou, PNAS 16438–16443
Treat as Nx1 dataset (N is in 10’s of milions) Smooth via kernel density ChIP reads Control reads Non-uniform control… “Strand” bias MACS: Zhang et al, 2008
AAAGTCTATCCCAGGCTC
Consider all possible peak sources and all possible alignments Overall problem: find best peak configuration Expectation (With regions fixed, update alignments) Maximization (With alignments fixed, find best regions)
r1 r2 r3 Expectation Maximization
ChIP reads Control reads
Srebp-1 binding motif
* Zeng et. al. PLoS Genetics, 5(7) 2009
CTCF binding motif
Method Alignments Peaks New FDR Motif Repeat MACS
18,556
81.67% 56.55% SICER
17,092
82.55% 70.42% AREM 1 2,368,229 19,012
81.32% 55.30% AREM 10 7,616,647 19,881 1,404 3.80% 81.04% 58.88% AREM 20 12,312,878 19,935 1,517 3.70% 80.88% 59.66% AREM 40 20,527,010 19,863 1,546 3.20% 80.93% 60.34% AREM 80 34,537,311 19,820 1,538 2.90% 80.73% 60.91%
2 1. Allow for sequences with one alignment. 2. Allow for sequences with up to 10-80 possible alignments. 1
8% more peaks, similar FDR, many peaks in repeats!
Method Alignments Peaks New FDR Motif Repeat MACS
721
46.60% 53.95% SICER
622
59.00% 77.33% AREM 1 10,482,005 1,438
39.08% 53.47% AREM 10 28,347,869 1,815 262 10.5% 39.22% 56.04% AREM 20 44,493,532 1,748 227 8.0% 39.95% 55.97% AREM 40 72,453,642 1,685 248 8.2% 40.34% 56.46% AREM 80 118,744,757 1,695 272 7.3% 40.66% 56.73%
2 1. Allow for sequences with one alignment. 2. Allow for sequences with up to 10-80 possible alignments. 1
5% more peaks called at lower FDR
Align reads Identify K regions enriched with alignments E Step: Update alignment probabilities from enrichment M Step: Update peak enrichment from alignment probabilities Check convergence Call treatment peaks Call control peaks Calculate FDR
Determine binding site dynamics by performing the same ChIP experiment at different timepoints/cell stages Integrate multiple datasets for biological insight
Histone modifications (not transcription factors)
wikipedia.org/wiki/Epigenetics
Nine ChIP-seq experiments
Nine ChIP-seq experiments
Nine human cell types
Zhou et al, Nature Rev. Gen., 2011
“Active Promoter” “Active transcription” “Active Enhancer”
“Repressed Gene”
Active Promoter Repressed Promoter
Polm: DNA polymerase (gene needed in all cell types)
Mikkelsen et al., Nature 2007
Olig1: Neural transcription factor Active Promoter Neural genes repressed in muscle cells
Nine ChIP-seq experiments
Nine human cell types
Histone mark co-occurrence and spatial transitions Hidden Markov Model can capture spatial transitions We build on a previous “ChromHMM” model
j: observed histone marks (position i, species j)
j: hidden chromatin state (to be inferred)
ES cells NPCs
Connect multiple cell types in a tree Connect adjacent regions
Mean field: Optimize single nodes separately Structured mean field: Optimize complete HMM chains separately Loopy belief propagation
Structured mean field approximates the exact model very well K=5, very small dataset
Strong Enhancer (state 5) Active Promoter (state 3) Insulator (CTCF) (state 14)
Strong Enhancer (state 5) Active Promoter (state 3) Insulator (CTCF) (state 14)
Strong vertical information for some states Enhancer state is rarely inherited from ES cells
We predict fewer sites with better accuracy Taf1 is part of all cells’ promoter machinery
We predict more sites at better or similar accuracy
A C T G C A T T C A A C A ACTGCA CCAAAC A C T G C A T T C A A C A ACTGCA CCAAAC A C T G C A T T C A A C A ACTGCA CCAAAC A C T G C A T T C A A C A ACTGCA CCAAAC A C T G C A T T C A A C A ACTGCA CCAAAC
Extraction, sonication High-Throughput Sequencing Assembly hundreds of millions
...ACTGAATCTAG GTCTGCGT GTCTGCGTACCCTACGTCTGACTGC CGCGGCAA CGCGGCAAACGGCTAGCTGTGTTTTTACT TCTTTGA TCTTTGACCA... TCTTTGATTC...
...ACTGAATCTAGGC
...CCTCTAGGGTGC
Repeated kmers are collapsed into a single node
* except for errors...
Algorithm Genome Size Small Medium Large Velvet ✓✓ ✓ ✗ ABySS ✓ ✓ ✓ Ray ✓ ✓ ✓ Contrail ✗ ✓ ✓ Genomix ✓ ✓✓ ✓✓ If you have enough memory
In 2013, 50% of Fortune 50
http://www.cs.washington.edu/mssi/2010/MikeCarey.pdf
K-means clustering in the cloud
Hyracks Filesystem abstraction: Shared-nothing, HDFS, NFS, etc.
Build graph Graph algorithms
De Bruijn graph of the small genome of E. coli after error correction
Chaisson et al., Genome Research 2009
H
T T H H T T
single error read connects the walk incorrectly “high-confidence” region Check all candidate paths out to a certain distance D candidate path reads have mutual
One set of edges must dominate the other for the walk to proceed
Hadoop-based Contrail lags behind Velvet is super fast Ray is another parallel framework for de Bruijn graph assembly
same figure… with a larger genome Ray requires ~450GB RAM across machines Velvet is a no-show… Some groups reported assembling a human genome with 2TB RAM... Genomix has no memory constraints, though having it will speed up computation
TreeHMM AREM Genomix
Apply tree-HMM to the Broad dataset (9 chromatin modification in 9 cell types):
Model selection: AIC (Akaike information criterion), BIC (Bayesian information criterion)
The optimal number of states is 18 by BIC.
erythrocytic leukaemia skeletal muscle myoblasts mammary epithelial cells normal epidermal keratinocytes umbilical vein endothelial cells