SLIDE 1 USING GPU AND POWER8 TO EXPLORE HOW GENOMES FOLD
Ido Machol, Aiden Lab, Baylor College of Medicine, Rice University, GTC 2015
SLIDE 2
SLIDE 3 THE HUMAN GENOME IS LONG!
…CGTTTACGAAAATCGCAAAACTTTCGATACCCATAGGCTACTGATCATACGACCGTTTACGAAAATCGAAACCTTTCCGATCTAGGCTAC…
3 BILLION Letters 2 METERS
Nucleus
Cell
6 μm
SLIDE 4 10 bp 100 bp 1 Kb 10 Kb 100 Kb 1 Mb 10 Mb 100 Mb
SLIDE 5
SAME GENOME, DIFFERENT FUNCTIONS
SLIDE 6
PART I: TECHNOLOGY
SLIDE 7
MICROSCOPY & FLUORESCENT IN SITU HYBRIDIZATION
FISH
SLIDE 8
CONTACT MAPPING
Exploring structure via proximity
SLIDE 9
SLIDE 10
SLIDE 11 FACEBOOK CONTACT MAP
Times in the Same Photo
- 4-11 (lives nearby)
- 0-3 (lives far away)
- Always (same person)
Homer
SLIDE 12 Simpsons' Contact Map
# of Pictures Together
2 0 1 2 1 0 1 0 0
0 3 2 1 0 0 0 0 0
1 2 16 6 5 4 11 1 1
2 1 6 8 6 3 4 0 0
1 0 5 6 8 4 5 1 0
0 0 4 3 4 5 5 0 0
1 0 11 4 5 5 11 1 1
0 0 1 0 1 0 1 2 1
0 0 1 0 0 0 1 1 1
16
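A minimal sketch of how such a photo-based contact map could be tallied. The function name, the photo list, and every name other than Homer are hypothetical and are not the data behind the slide; the only assumption is that each photo is a list of the people who appear in it.

```python
from itertools import combinations

def contact_map(photos, people):
    """Count, for every pair of people, how many photos show both of them."""
    index = {name: i for i, name in enumerate(people)}
    n = len(people)
    counts = [[0] * n for _ in range(n)]
    for photo in photos:
        present = sorted({index[name] for name in photo})
        # Diagonal: a person is always in the same photo as themselves.
        for i in present:
            counts[i][i] += 1
        # Off-diagonal: count each unordered pair once, store it symmetrically.
        for i, j in combinations(present, 2):
            counts[i][j] += 1
            counts[j][i] += 1
    return counts

# Hypothetical input for illustration only.
people = ["Homer", "Marge", "Bart"]
photos = [["Homer", "Marge"], ["Homer", "Marge", "Bart"], ["Bart"]]
cmap = contact_map(photos, people)
```

The result is symmetric by construction, like the matrix on the slide: high counts suggest two people live close together, low counts suggest they live far apart.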
SLIDE 13
Hi-C
3D Genome Sequencing
SLIDE 14 Hi-C: genome-wide Chromosome Conformation Capture
Erez Lieberman-Aiden, Nynke van Berkum et al. Science 2009
SLIDE 15 Computational Challenge I Alignment, calculate contacts
…CTGCCTCCTCGCGG CCGCGTGGTGGCAG…
DNA Reference Sequence Align to reference genome
… …
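Once both ends of a read pair have been aligned, "calculating contacts" amounts to counting pairs per pair of genomic bins. The sketch below assumes a single chromosome, positions given in base pairs, and a fixed bin size; the function name and the sample coordinates are hypothetical.

```python
from collections import Counter

def bin_contacts(read_pairs, resolution):
    """Count contacts per pair of fixed-size genomic bins."""
    counts = Counter()
    for pos_a, pos_b in read_pairs:
        bin_a, bin_b = pos_a // resolution, pos_b // resolution
        # Use an ordered key so that (a, b) and (b, a) land in the same cell.
        counts[(min(bin_a, bin_b), max(bin_a, bin_b))] += 1
    return counts

# Hypothetical aligned positions of four read pairs, binned at 250 kb.
pairs = [(120_000, 900_000), (130_000, 880_000), (910_000, 100_000), (300_000, 310_000)]
matrix = bin_contacts(pairs, resolution=250_000)
```

A sparse counter is used here because most bin pairs in a genome-wide map are empty or nearly so.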
SLIDE 16 Alignment is not trivial
Reference: …CTGCC_TCCTCGCGG…
Deletion: …CTGC__TCCTCGCGG…
Substitution: …CTGAA_TCCTCGCGG…
Insertion: …CTGCCCTCCTCGCGG…
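To make the three kinds of mismatch concrete, here is a toy edit-distance calculation using the classic dynamic-programming recurrence. It is an illustration only: the deck does not say which aligner was used, and production aligners rely on genome indexes rather than a full table like this. The sequences are the slide's examples with the gap placeholders removed.

```python
def edit_distance(read, reference):
    """Minimum number of substitutions, insertions and deletions
    needed to turn `read` into `reference`."""
    prev = list(range(len(reference) + 1))
    for i, r in enumerate(read, start=1):
        curr = [i]
        for j, g in enumerate(reference, start=1):
            curr.append(min(
                prev[j] + 1,             # drop r from the read
                curr[j - 1] + 1,         # insert g into the read
                prev[j - 1] + (r != g),  # match, or substitute r with g
            ))
        prev = curr
    return prev[-1]

reference = "CTGCCTCCTCGCGG"
reads = {
    "deletion": "CTGCTCCTCGCGG",
    "substitution": "CTGAATCCTCGCGG",
    "insertion": "CTGCCCTCCTCGCGG",
}
distances = {kind: edit_distance(seq, reference) for kind, seq in reads.items()}
```

The deletion and insertion reads each differ from the reference by one edit, and the substitution read by two, so all three can still be placed on the reference despite not matching it exactly.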
SLIDE 17
Computational HW and SW setup
SLIDE 18 8 x Power8 Servers
2 Sockets x 12 cores x 8 threads = 192 virtual cores each
Total of 1,536 virtual cores in cluster.
- 4 X 256GB RAM
- 2 X 1024GB RAM
- 2 X 256GB RAM with NVIDIA K40 Tesla
Model 8247-22L and 8247-42L Byte order: BI-Endian
Rice RSCG PowerOmics hardware
SLIDE 19 Tesla K40
- Stream Processors: 2880
- Core Clock: 745MHz
- Boost Clock(s): 810MHz, 875MHz
- Memory Clock: 6GHz GDDR5
- VRAM: 12GB
- Single Precision: 4.29 TFLOPS
- Double Precision: 1.43 TFLOPS (1/3)
GPUs
SLIDE 20 Storage
- IBM GPFS Storage Server (Model 24)
- 4 X JBOD
- Total of 361 TB fast scratch disk space
- (Up to 1.4 petabytes)
- FlashSystem 840 20TB Flash
SLIDE 21 Interconnect:
- 56 Gigabit 36-port FDR IB switch
- Mellanox Next gen Connect-IB FDR Host Channel Adapters
- 10-Gigabit Ethernet
- Internet2
Interconnect
SLIDE 22 Rice RSCG PowerOmics software
Cluster management
- IBM Platform LSF, PPM, PAC, PowerKVM 2.1.0
Operating system
- Ubuntu 14.04 (little-endian) + Red Hat Enterprise Linux 7.0
Storage
- Mellanox OFED 2.4-1
- GPFS 4.1
Scientific
SLIDE 23 Challenge - Alignment of billions of contacts
High Resolution Map
13 billion reads forming 5 billion contacts in the map
IBM Power8 Cluster
675 read alignments / second / CPU core
192 cores
About 27 hours
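The estimate follows directly from the figures on this slide. A short check, assuming perfect scaling across cores and no I/O overhead (the function name is hypothetical):

```python
def alignment_hours(reads, reads_per_second_per_core, cores):
    """Wall-clock hours to align `reads`, assuming perfect scaling across cores."""
    seconds = reads / (reads_per_second_per_core * cores)
    return seconds / 3600

# 13 billion reads, 675 alignments per second per core, 192 cores.
hours = alignment_hours(13e9, 675, 192)
```

This gives a little under 28 hours, in line with the figure quoted on the slide.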
…CTGCCTCCTCGCGG…
SLIDE 24 Chromosome
Hi-C
GENERATES GENOME-WIDE CONTACT MAPS
Genome
SLIDE 25 Genome
Hi-C
GENERATES GENOME-WIDE CONTACT MAPS
SLIDE 26 Genome Chromosome 8
Hi-C
GENERATES GENOME-WIDE CONTACT MAPS
700 Reads/250 kb²
SLIDE 27 A A
Hi-C
GENERATES GENOME-WIDE CONTACT MAPS
700 Reads/250 kb²
SLIDE 28 A B A B
Hi-C
GENERATES GENOME-WIDE CONTACT MAPS
700 Reads/250 kb²
SLIDE 29 PART II: BIOLOGY
Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome
Erez Lieberman-Aiden, Nynke van Berkum et al. Science 2009
SLIDE 30 Genomic analysis of compartments
Genes
Chromosome 14
1 Mb² Pixels
The two compartments correlate strongly with chromatin
100 kb² Pixels
SLIDE 31 The whole genome is plaid
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
SLIDE 32
A TOUR OF THE NUCLEUS
SLIDE 33 Organization observed at three distinct scales
NUCLEAR SCALE 100Mb, CHROMOSOME SCALE 10Mb, MEGABASE SCALE 1Mb
SLIDE 34 Organization observed at three distinct scales
NUCLEAR SCALE 100Mb, CHROMOSOME SCALE 10Mb, MEGABASE SCALE 1Mb
SLIDE 35 Organization observed at three distinct scales
NUCLEAR SCALE 100Mb, CHROMOSOME SCALE 10Mb, MEGABASE SCALE 1Mb
SLIDE 36 A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping
Suhas Rao*, Miriam Huntley*, Neva Durand, Elena Stamenova, Ivan Bochkov, James Robinson, Adrian Sanborn, Ido Machol, Arina Omer, Eric Lander, Erez Lieberman Aiden Cell 2014
SLIDE 37 5 billion contacts 30 million contacts
More Contacts, Higher Resolution
SLIDE 38 Detection of Chromatin Loops Genome-wide via Hi-C
A A-2ε A-ε A+ε A+2ε B-ε B-2ε B B+ε B+2ε
SLIDE 39 Into the loops
L3 L2 L1 L1 L2 L3
SLIDE 40 Computational Challenge III
Loop calling
Which one shows a loop?
SLIDE 41 X ✔
X
3D Map Features
X
SLIDE 42 Computational Challenge III
Loop calling
- Apply 4 filters for each pixel.
- 20-gigapixel image.
- Millions of parallel filters.
NVIDIA Tesla GPU
200x faster than previous CPU implementation – from 3 weeks to 3 hours.
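The deck does not spell out the four filters, so the sketch below uses a single hypothetical one: it compares each pixel of the contact map with the mean of a surrounding ring of pixels and keeps pixels that stand out. The window sizes and the threshold are assumptions chosen for illustration. What matters is that every pixel is scored independently of every other, which is the property that lets the work be spread across thousands of GPU threads.

```python
def local_enrichment(matrix, row, col, inner=1, outer=3):
    """Ratio of one pixel to the mean of a ring of pixels around it."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    total, count = 0.0, 0
    for r in range(max(0, row - outer), min(n_rows, row + outer + 1)):
        for c in range(max(0, col - outer), min(n_cols, col + outer + 1)):
            # Skip the pixel itself and its immediate neighbours (the hole).
            if abs(r - row) <= inner and abs(c - col) <= inner:
                continue
            total += matrix[r][c]
            count += 1
    background = total / count if count else 0.0
    return matrix[row][col] / background if background > 0 else 0.0

def call_peaks(matrix, threshold=2.0):
    """Return pixels whose enrichment over the local background meets the threshold."""
    return [(r, c)
            for r in range(len(matrix))
            for c in range(len(matrix[0]))
            if local_enrichment(matrix, r, c) >= threshold]

# Hypothetical 7x7 map: flat background with one bright pixel in the centre.
demo = [[1] * 7 for _ in range(7)]
demo[3][3] = 10
peaks = call_peaks(demo)
```

On a GPU, the body of `local_enrichment` would become the per-thread work, with one thread assigned to each pixel.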
SLIDE 43
10,000 Loops in the Human Genome
SLIDE 44
Loops turn genes on and off
Lung fibroblast cell Lymphoblastoid cell
SLIDE 45
SUMMARY OF COMPUTATIONAL EFFORTS
SLIDE 46 Sequence alignment proportions
Genome data production and analysis
- In about 36 months we produced the sequence equivalent of more than 2,200x coverage of the human genome.
- For reference, the Human Genome Project produced 12.6x coverage over the span of 4 years.
Storage
- We currently have 25 TB of raw sequenced data.
- We sequence 1 TB each month.
- After processing the raw sequenced data, we store 3 TB of raw and processed data.
SLIDE 47 Computational speed up
Cluster processing
- We produce 1 billion reads per month.
- Power8 is capable of processing alignments at 675 reads/second per CPU core.
- 50% faster than the cluster system we were using before.
- At this speed, we consume about 17 “CPU days” per month.
- With the Power8 cluster having over 192 cores, the jobs complete processing in about 2 hours.
GPU processing
- Using the NVIDIA Tesla K40, we run our loop calling algorithm over a 20-gigapixel map 200x faster than the CPU implementation.
- Instead of 3 weeks, we get the work done in only 3 hours.
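The "CPU days" figure for cluster processing can be reproduced from the monthly read count and the per-core alignment rate. A small check, assuming the work divides evenly across 192 cores (the names are hypothetical):

```python
SECONDS_PER_DAY = 86_400

def core_days(reads, reads_per_second_per_core):
    """Compute time, in days on a single core, needed to align `reads`."""
    return reads / reads_per_second_per_core / SECONDS_PER_DAY

monthly_core_days = core_days(1e9, 675)            # one month of reads
wall_clock_hours = monthly_core_days * 24 / 192    # spread over 192 cores
```

This comes to roughly 17 core-days per month, or a little over 2 hours of wall-clock time on the cluster.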
SLIDE 48
aidenlab.org/juicebox
SLIDE 49 Aiden Lab
Erez Lieberman Aiden
Suhas Rao, Miriam Huntley, Neva C Durand, Elena Stamenova, Adrian Sanborn, Arina Omer, Ivan Bochkov, Olga Dudchenko, Robert Nnake, Su-Chen Huang, Muhammad Shamim, Chris Lui, Sarah Nyquist, Sanjit Batra, Ashok Cutkosky, Najeeb Tarazi, Jian Li
Broad Institute
Eric Lander, Jim Robinson
GREETINGS FROM ANOTHER DIMENSION