Building the human pangenome
Benedict Paten - UC Santa Cruz Genomics Institute bpaten@ucsc.edu
Now the $1,000 individual genome is here but
Sources: NIH: www.genome.gov/sequencingcosts; UC San Diego, 1/14/14: Illumina breaks genome cost barrier
[Chart: genome sequencing cost by year, 2002 to 2015. Cost axis: $1B, $100M, $10M, $1M, $100K, $10K, $1K. Labeled points: $300M, $10M, $50K, $5K, $3K, $1K]
A typical person has
DNA base variations different from the reference (out of 3 billion)
large segments of DNA that are not present in the same form in the reference genome
currently assayed accurately: reference allele bias
Instead, imagine mapping to a reference structure that contains all common variation: a pangenome graph
routine, so that we can create the genomes for the human pangenome
Short Read Eliminator Kit (https://www.circulomics.com)
Read N50 (kb) N50s: 42kb
https://github.com/human-pangenomics/hpgp-data Individual genomes
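Read N50 is the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal Python sketch of that calculation follows; the function name and the read lengths are made up for illustration and are not taken from the hpgp-data repository.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical read lengths in bases, for illustration only.
example_reads = [60_000, 45_000, 42_000, 30_000, 12_000, 8_000, 3_000]
print(n50(example_reads))  # 45000
```

Because the statistic is weighted by bases rather than by read count, a few very long reads can lift the N50 well above the median read length.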
Total throughput (Gb) Individual genomes
Alignment identity = matches / (matches + mismatches + insertions + deletions)
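The identity formula can be written directly as a small Python function. This is only a sketch: the function name is an assumption, and the counts in the example are invented rather than measured from any of the runs.

```python
def alignment_identity(matches, mismatches, insertions, deletions):
    """Identity = matches / (matches + mismatches + insertions + deletions)."""
    aligned_columns = matches + mismatches + insertions + deletions
    if aligned_columns == 0:
        raise ValueError("alignment has no columns")
    return matches / aligned_columns


# Hypothetical per-read counts, for illustration only.
print(alignment_identity(matches=9_000, mismatches=300, insertions=250, deletions=450))  # 0.9
```

Counting inserted and deleted bases in the denominator means indel errors lower the identity just as mismatches do.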
Mode: 93% Median: 90%
[Plot: alignment identity against GRCh38 (axis 0.4 to 1.0) for individual genomes 24143, 24149, 24385, 00733, 01109, 01243, 02055, 02080, 02723, 03098]
Guppy 2.3.5 flip flop basecaller
https://upload.wikimedia.org/wikipedia/commons/2/22/MtShasta_aerial.JPG
https://github.com/chanzuckerberg/shasta
developed by Paolo Carnevali at CZI
○ Uses run-length encoding (RLE) throughout to compress homopolymer confusion, the dominant source of error in ONT reads
○ Uses a novel high-cardinality marker space representation for super efficient overlap alignment
○ Does everything in memory (requires 1.5TB of memory for 60x human)
○ Outputs GFA; the intent is for the whole pipeline to use GFA to represent ambiguities
Run Length Encoding (RLE)
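Run-length encoding stores each homopolymer run as one base plus a count, so two reads that disagree only on the length of a run share the same base sequence. The Python sketch below illustrates the idea; it is not Shasta's implementation, and the function names and example reads are invented.

```python
from itertools import groupby


def rle_encode(sequence):
    """Split a DNA string into collapsed bases and per-base run lengths."""
    bases = []
    run_lengths = []
    for base, run in groupby(sequence):
        bases.append(base)
        run_lengths.append(len(list(run)))
    return "".join(bases), run_lengths


def rle_decode(bases, run_lengths):
    """Rebuild the original string from collapsed bases and run lengths."""
    return "".join(base * n for base, n in zip(bases, run_lengths))


# Two hypothetical reads of the same locus; read_b is missing one T
# from the homopolymer run.
read_a = "GATTTTACA"
read_b = "GATTTACA"

print(rle_encode(read_a))  # ('GATACA', [1, 1, 4, 1, 1, 1])
print(rle_encode(read_b))  # ('GATACA', [1, 1, 3, 1, 1, 1])
```

Both reads collapse to the same base string, so the homopolymer length error only shows up in the run-length list and does not disturb comparison of the bases.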
Marker Representation
Shasta GPU Acceleration
Number of misassemblies
shasta: 1160
flye: 5580
canu + 10X: 6093
wtdbg2: 4164
Median contig NG50 = 23 Mb
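NG50 is computed like N50, except that the halfway point is measured against an estimated genome size instead of the total assembly length, which makes assemblies of different total size comparable. A minimal Python sketch, with a hypothetical function name and invented contig lengths:

```python
def ng50(contig_lengths, genome_size):
    """Return the contig NG50, or None if contigs cover under half the genome."""
    running = 0
    # Accumulate contigs from longest to shortest until half of the
    # estimated genome size (not the assembly size) is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:
            return length
    return None


# Hypothetical contig lengths and genome size in Mb, for illustration only.
contigs_mb = [50, 20, 15, 10, 5]
print(ng50(contigs_mb, genome_size=160))  # 15
print(ng50(contigs_mb, genome_size=100))  # 50, same as N50 when sizes match
```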
MarginPolish: a graph-based alignment polisher
https://github.com/UCSC-nanopore-cgl/marginPolish
HELEN: a DNN-based consensus sequence polisher
https://github.com/kishwarshafin/helen
Guppy basecaller; Shasta; Shasta + MarginPolish; Shasta + MarginPolish + HELEN
Without HiC With HiC
Adam Novak Glenn Hickey Jordan Eizenga Erik Garrison Jean Monlong Xian Chang Daniel Garalde Rosemary Dokos Simon Mayes Chris Seymour Chris Wright David Stoddart Dan Turner Kelvin Liu Duncan Kilburn Adam Phillippy (NHGRI) Fritz Sedlazeck (Baylor) Sidney Bell Charlotte Weaver Michael Barrientos Ryan King Bruce Martin Phil Smoot Cori Bargmann David Haussler Ed Green Sofie Salama Mark Akeson Kristof Tigyi Nicholas Maurer Yatish Turakhia Kishwar Shafin Marina Haukness Trevor Pesout Colleen Bosworth Karen Miga Ryan Lorig-Roach Miten Jain Hugh Olsen
Danish reference genome project
Sequencing and de novo assembly of 150 genomes from Denmark as a population reference
Lasse Maretty et al. 2017
Korean reference genome project
De novo assembly and phasing of a Korean human genome
Jeong-Sun Seo et al. 2016
...
variation
hard to map
inaccurately represented
biomedicine
Goals:
variation from all human ethnic populations
genetic reference
CREDIT: Kiran Garimella and Benedict Paten
The major histocompatibility complex: Kiran Garimella and Benedict Paten
Joins can connect either side of a sequence (bidirected edges)
Walks encode DNA strings, with side of entry determining strand
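One way to picture bidirected joins and strand-aware walks is the small Python sketch below. It is an illustration under assumed names (`SequenceGraph`, sides labeled "L" and "R", walk steps as `(node_id, forward)` pairs), not the data model used by vg. Entering a node at its left side reads the sequence forward; entering at its right side reads the reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


class SequenceGraph:
    def __init__(self):
        self.sequences = {}  # node_id -> DNA string
        self.joins = set()   # each join is a frozenset of two (node_id, side)

    def add_node(self, node_id, sequence):
        self.sequences[node_id] = sequence

    def add_join(self, side_a, side_b):
        # A join is unordered: it can be crossed in either direction.
        self.joins.add(frozenset([side_a, side_b]))

    def spell(self, walk):
        """Spell the DNA string of a walk given as (node_id, forward) steps."""
        parts = []
        previous_exit = None
        for node_id, forward in walk:
            entry, exit_side = ("L", "R") if forward else ("R", "L")
            if previous_exit is not None:
                join = frozenset([previous_exit, (node_id, entry)])
                if join not in self.joins:
                    raise ValueError(f"no join into {node_id} at side {entry}")
            seq = self.sequences[node_id]
            parts.append(seq if forward else reverse_complement(seq))
            previous_exit = (node_id, exit_side)
        return "".join(parts)


# Toy graph, invented for illustration.
graph = SequenceGraph()
graph.add_node(1, "GAT")
graph.add_node(2, "TACA")
graph.add_node(3, "CC")
graph.add_join((1, "R"), (2, "L"))
graph.add_join((2, "R"), (3, "L"))
graph.add_join((1, "R"), (2, "R"))  # lets a walk enter node 2 from its right side
graph.add_join((2, "L"), (3, "L"))

print(graph.spell([(1, True), (2, True), (3, True)]))     # GATTACACC
print(graph.spell([(1, True), (2, False), (3, True)]))    # GATTGTACC
print(graph.spell([(3, False), (2, False), (1, False)]))  # GGTGTAATC
```

The last walk traverses the first one backwards and spells its reverse complement, which is why a single bidirected graph can represent both strands.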
variation graph
another variation graph
genome graphs
https://github.com/vgteam/vg
doi.org/10.1101/234856
View of genomes (gray to black) in an actual genome map, and DNA sequencing reads (colored worms) from a newly sequenced individual mapped to it
Substitution
Insertion or deletion
Duplication (top path traverses same nodes multiple times)
Inversion (red path traverses reverse complement)
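Each of these variant types corresponds to a different path through the same set of nodes. The Python sketch below uses a toy graph and a signed-integer step convention (positive for forward, negative for reverse complement); both are assumptions of this example and are not taken from the vg paper.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def spell(nodes, walk):
    """Concatenate node sequences along a walk of signed node ids."""
    out = []
    for step in walk:
        seq = nodes[abs(step)]
        # A negative step traverses the node as its reverse complement.
        out.append(seq if step > 0 else seq.translate(_COMPLEMENT)[::-1])
    return "".join(out)


# Toy nodes, invented for illustration. Nodes 2 and 3 are alternative alleles.
nodes = {1: "GATT", 2: "A", 3: "G", 4: "CAT", 5: "TTC"}

walks = {
    "reference":    [1, 2, 4, 5],
    "substitution": [1, 3, 4, 5],     # takes node 3 instead of node 2
    "deletion":     [1, 2, 5],        # skips node 4
    "duplication":  [1, 2, 4, 4, 5],  # traverses node 4 twice
    "inversion":    [1, 2, -4, 5],    # traverses node 4 in reverse complement
}

for name, walk in walks.items():
    print(f"{name:12s} {spell(nodes, walk)}")
```

Relative to the deletion path, the reference path is an insertion of node 4, which is why insertions and deletions are drawn the same way.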
Garrison et al. bioRxiv: doi.org/10.1101/234856
Sample genome; Pangenome; Reference genome
UC Santa Cruz: Adam Novak, Wolfgang Beyer, Glenn Hickey, Karen Miga, Yohei Rosen, Jouni Siren, Jordan Eizenga, Charles Markello, David Haussler, Xian Chang, Yatish Turakhia
The Rest of Team VG: Erik Garrison, Richard Durbin, Eric Dawson, Mike Lin (& many more)
GA4GH collaborators: Andres Kahles, Heng Li, Ben Murray, Stephen Keenan, Goran Rakocevic, Gil McVean, Alex Dilthey (& many more)
Simons Foundation
Join us: https://cgl.genomics.ucsc.edu/opportunities/
important data structure in genomics
and practically map to a population cohort instead, alleviating bias
Figure borrowed from “Richard Durbin, Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT), Bioinformatics, 2014”
PBWT[:k]
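The core of the PBWT is an ordering of haplotypes at each site k, sorted by their reversed prefixes (the alleles before site k, read right to left), so that haplotypes sharing a long match ending at k sit next to each other. A small Python sketch of that construction for binary alleles follows, modeled on the prefix-array update described in Durbin's paper. The function name and the toy haplotypes are invented, and this is not the gPBWT or any production implementation.

```python
def pbwt(haplotypes):
    """Return (prefix_arrays, columns) for equal-length binary haplotypes.

    prefix_arrays[k] lists haplotype indices sorted by reversed prefix
    before site k; columns[k] holds the alleles at site k in that order.
    """
    order = list(range(len(haplotypes)))
    prefix_arrays = [order]
    columns = []
    n_sites = len(haplotypes[0]) if haplotypes else 0
    for k in range(n_sites):
        columns.append([haplotypes[i][k] for i in order])
        # Stable partition: haplotypes with allele 0 first, then allele 1,
        # each group keeping its previous relative order.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        prefix_arrays.append(order)
    return prefix_arrays, columns


# Toy haplotypes (rows) over three biallelic sites, for illustration only.
toy_haplotypes = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
]
prefix_arrays, columns = pbwt(toy_haplotypes)
print(prefix_arrays)  # [[0, 1, 2, 3], [0, 2, 3, 1], [2, 0, 3, 1], [0, 1, 2, 3]]
print(columns)        # [[0, 1, 0, 0], [1, 0, 1, 1], [1, 0, 1, 0]]
```

Each update is a single linear pass, so the whole structure is built in time proportional to the number of haplotypes times the number of sites.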
From “Novak et al, A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications (gPBWT), WABI 2016”
gPBWTk[]
○ Whole 1000 Genomes Graph construction on one
○ Half the space of gPBWT (14 GB for the entire index for 1000G)
Siren et al. https://arxiv.org/pdf/1805.03834.pdf
True Graph AF > 0.03 Filtered 1KG Graph 1KG Graph - True Variants GRCh38
Workflow
Shasta MarginPolish HELEN Product Version
Device: PromethION Alpha-Beta, Flongle
Flow Cells: FLO-PRO002, FLO-FLG106
Kits: Ligation Sequencing Kit, Circulomics SRE, Puregene
Data analysis: Shasta, MarginPolish, HELEN, minimap2