Johnny Israeli
Accelerating Sequencing with GPU Computing and Deep Learning
2
[Chart: performance on a log scale from 10^2 to 10^7. Single-threaded perf: 1.5X per year, then 1.1X per year. GPU-Computing perf: 1.5X per year. Stack labels: APPLICATIONS, SYSTEMS, ALGORITHMS, CUDA, ARCHITECTURE]
COMPUTE TRENDS
3
Publications
COMPUTE TRENDS
4
Sequencing Trends
5
SEQUENCING TRENDS
Sequencing Data Growing in Volume and Complexity
- Decreasing Cost
- Increasing Read Length
- Rise of Single Cell Data
6
SEQUENCING TRENDS
7
[Chart: Worldwide Annual Sequencing Capacity, 2000 to 2025, log scale from 10^12 to 10^21]
SEQUENCING TRENDS
8
Sequencing Data Types
9
SEQUENCING TRENDS: Genomics
*ENA Database
10
SEQUENCING TRENDS: Transcriptomics
*ENA Database
11
SEQUENCING TRENDS: Epigenomics
*ENA Database
12
SEQUENCING TRENDS: Nanopore Long Read Sequencing
*ENA Database
13
Variant Calling
Sequence DNA Map to Reference
Variant Calling
Reference:
TGGATTTGAAAACGGAGCAAATGACTG
Illumina Reads:
TGGATTTGAAAACGGAGCAAATGACTG
TGGATTTGAAAACGGAGCAAATGACTG
TGGATTTGAAAACAGAGCAAATGACTG
TGGATTTGAAAACAGAGCAAATGACTG
TGGATTTGAAAACAGAGCAAATGACTG
TGGATTTGAAAACGGAGCAAATGACTG
Likely heterozygous variant
- Identify sites with potential mismatch
- True variants or instrument errors?
- SNPs or insertions or deletions?
- Heterozygous or homozygous variants?
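The questions above can be illustrated with a minimal pileup sketch over the reads shown on this slide. It assumes the first sequence is the reference, that all reads are already aligned and of equal length (so insertions and deletions are out of scope), and that the two allele-fraction thresholds are illustrative values chosen here, not values from the talk.

```python
from collections import Counter

# Sequences copied from the slide; treating the first one as the reference
# is an assumption based on the slide labels.
REFERENCE = "TGGATTTGAAAACGGAGCAAATGACTG"
READS = [
    "TGGATTTGAAAACGGAGCAAATGACTG",
    "TGGATTTGAAAACGGAGCAAATGACTG",
    "TGGATTTGAAAACAGAGCAAATGACTG",
    "TGGATTTGAAAACAGAGCAAATGACTG",
    "TGGATTTGAAAACAGAGCAAATGACTG",
    "TGGATTTGAAAACGGAGCAAATGACTG",
]


def call_sites(reference, reads, min_alt_fraction=0.2, hom_fraction=0.8):
    """Label every column of the pileup that contains a mismatch.

    Both thresholds are hypothetical: below min_alt_fraction a mismatch is
    treated as a likely instrument error, at or above hom_fraction as a
    likely homozygous variant, and in between as likely heterozygous.
    """
    calls = []
    for pos, ref_base in enumerate(reference):
        column = Counter(read[pos] for read in reads)
        alt_counts = {base: n for base, n in column.items() if base != ref_base}
        if not alt_counts:
            continue  # every read agrees with the reference here
        alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
        fraction = alt_n / len(reads)
        if fraction < min_alt_fraction:
            label = "likely instrument error"
        elif fraction >= hom_fraction:
            label = "likely homozygous variant"
        else:
            label = "likely heterozygous variant"
        calls.append((pos, ref_base, alt_base, fraction, label))
    return calls


for call in call_sites(REFERENCE, READS):
    print(call)
```

On the slide's reads, half of the reads carry A where the reference has G, which is why the site is labeled a likely heterozygous variant.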
Example Pileup Input Data
[Figure: pileup showing a Heterozygous SNP; axes: Read Index, Position]
16
GATK Variant Calling Pipeline
Variant Calling Pipeline: Align to Reference → Sort → Mark Duplicates → Calibrate → Call Variants → Joint Call → Filter Variants
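As a rough illustration of the steps above, the sketch below builds one command line per stage without executing anything. The slide names only the stages; the choice of `bwa` and `samtools` for alignment and sorting, the mapping of each stage to a specific GATK tool, the file names, and the example filter expression are all assumptions made here. Read groups, indexing, and multi-sample handling are omitted.

```python
def gatk_pipeline_commands(sample, ref="ref.fa", known_sites="known_sites.vcf.gz"):
    """Return one argv list per pipeline step; nothing is executed.

    All file names are hypothetical placeholders derived from `sample`.
    """
    return [
        # Align to Reference (bwa is an assumed aligner, not named on the slide)
        ["bwa", "mem", "-o", f"{sample}.sam", ref,
         f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"],
        # Sort
        ["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.sam"],
        # Mark Duplicates
        ["gatk", "MarkDuplicates", "-I", f"{sample}.sorted.bam",
         "-O", f"{sample}.dedup.bam", "-M", f"{sample}.dup_metrics.txt"],
        # Calibrate: build the recalibration table, then apply it
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", f"{sample}.dedup.bam",
         "--known-sites", known_sites, "-O", f"{sample}.recal.table"],
        ["gatk", "ApplyBQSR", "-R", ref, "-I", f"{sample}.dedup.bam",
         "--bqsr-recal-file", f"{sample}.recal.table", "-O", f"{sample}.recal.bam"],
        # Call Variants
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", f"{sample}.recal.bam",
         "-O", f"{sample}.g.vcf.gz", "-ERC", "GVCF"],
        # Joint Call
        ["gatk", "GenotypeGVCFs", "-R", ref, "-V", f"{sample}.g.vcf.gz",
         "-O", f"{sample}.vcf.gz"],
        # Filter Variants (the expression is only an example)
        ["gatk", "VariantFiltration", "-R", ref, "-V", f"{sample}.vcf.gz",
         "-O", f"{sample}.filtered.vcf.gz",
         "--filter-name", "lowQD", "--filter-expression", "QD < 2.0"],
    ]


for argv in gatk_pipeline_commands("sample1"):
    print(" ".join(argv))
```

Returning argv lists rather than running them keeps the sketch inspectable; each list could be handed to `subprocess.run` once the paths are real.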
17
Accelerated GATK Variant Calling Pipeline
Variant Calling Pipeline: Align to Reference → Sort → Mark Duplicates → Calibrate → Call Variants → Joint Call → Filter Variants
Parabricks: Alignment, Preprocessing, Variant Calling, Joint Genotyping, Variant Processing
18
Accelerated Variant Calling Pipelines
Parabricks: Alignment, Preprocessing, Variant Calling, Joint Genotyping, Variant Processing
Germline, Copy Number, Somatic
Alignment + Preprocessing, HaplotypeCaller, Mutect2, GenotypeGVCFs, DeepVariant
Whole Genome Processing in Minutes
Deep Averaging Network (DAN)
- PyTorch-based 1D model
- Learned embeddings of bases
- Encoding variant proposals
- Downsample easy variant candidates during training
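The bullets above can be sketched in a few lines. This is a dependency-free illustration in plain Python rather than PyTorch, with random rather than learned parameters; the embedding size, the two output classes, the single linear layer, and the `is_easy` predicate are all assumptions made for the example, not details from the talk.

```python
import math
import random

BASES = "ACGT"


def init_params(embed_dim=4, n_classes=2, seed=0):
    """Random stand-ins for the learned base embeddings and output layer."""
    rng = random.Random(seed)
    embeddings = {b: [rng.uniform(-1, 1) for _ in range(embed_dim)] for b in BASES}
    weights = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(n_classes)]
    return embeddings, weights


def dan_forward(sequence, embeddings, weights):
    """Embed each base, average over positions, apply a linear layer and softmax."""
    dim = len(next(iter(embeddings.values())))
    mean = [sum(embeddings[b][d] for b in sequence) / len(sequence) for d in range(dim)]
    logits = [sum(w * m for w, m in zip(row, mean)) for row in weights]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def downsample_easy(candidates, is_easy, keep_prob=0.1, seed=0):
    """Keep every hard candidate and a random fraction of the easy ones."""
    rng = random.Random(seed)
    return [c for c in candidates if not is_easy(c) or rng.random() < keep_prob]


embeddings, weights = init_params()
print(dan_forward("ACGTAC", embeddings, weights))
```

Because the embeddings are averaged, this toy model ignores base order; how the real model encodes position and variant proposals is not stated on the slide.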
DAN Development
Variant Calling Errors
Variant Calling Error Breakdown
23
Atac Sequencing
24
DNA: Open And Closed
- Closed DNA: inactive
- Open DNA: active
- Open DNA changes affect development & disease
25
Atac Sequencing
Mapping Open DNA Sites
Sequence Open DNA → Map & Count Reads
[Figure: read counts along the genome with two regions labeled "Open DNA site"]
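The "Map & Count Reads" step can be illustrated with a small coverage sketch. It assumes reads arrive as half-open `(start, end)` intervals on a single region, and it marks open DNA sites with a fixed depth threshold, which is a deliberate simplification chosen here rather than the peak-calling method used in practice.

```python
def coverage_track(read_intervals, region_length):
    """Per-base read counts from half-open (start, end) read alignments."""
    diff = [0] * (region_length + 1)
    for start, end in read_intervals:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1  # coverage rises where the read begins
            diff[end] -= 1    # and falls just past where it ends
    track, depth = [], 0
    for delta in diff[:region_length]:
        depth += delta
        track.append(depth)
    return track


def open_sites(track, min_depth):
    """Half-open (start, end) runs where coverage is at least min_depth."""
    sites, start = [], None
    for pos, depth in enumerate(track):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            sites.append((start, pos))
            start = None
    if start is not None:
        sites.append((start, len(track)))
    return sites


reads = [(0, 4), (2, 6), (3, 5), (10, 12)]  # hypothetical alignments
track = coverage_track(reads, 12)
print(track)
print(open_sites(track, min_depth=2))
```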
26
Atac-seq Limits
Atac-seq signal degrades due to:
- Less sequencing
- Low quality sample preparation
- Small cell populations
27
AtacWorks SDK
AI-Denoised ATAC-seq Data Processing
Sequence Open DNA → Map, Align, Count
[Figure: signal tracks for High Quality Sequencing, Low Quality Sequencing, and Low Quality Sequencing Denoised with AtacWorks AI]
28
AtacWorks Model
Denoising + Open Chromatin Identification
Input (Noisy ATAC-Seq data)
Resblock 1, Resblock 2, Resblock 3, Resblock 4, Resblock 5, Resblock 6, Resblock 7
Outputs:
- Predicted Coverage (Evaluation: MSE, Pearson correlation)
- Predicted open chromatin (Evaluation: AUPRC)
Resblock: Conv and ReLU layers with a residual sum (⊕)
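A residual block of the kind named on this slide can be sketched in plain Python. The exact layer order inside the AtacWorks Resblock is not recoverable from the slide text, so the arrangement below (Conv, ReLU, Conv, add the input, ReLU) is one common layout assumed for illustration, on a single channel with odd-length kernels.

```python
def conv1d_same(signal, kernel, bias=0.0):
    """1D convolution with zero padding; output length equals input length.

    Assumes an odd-length kernel so the window is centered on each position.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = bias
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def relu(values):
    return [max(0.0, v) for v in values]


def resblock(signal, kernel_a, kernel_b):
    """Conv -> ReLU -> Conv, then add the input (the residual sum) and ReLU."""
    hidden = relu(conv1d_same(signal, kernel_a))
    hidden = conv1d_same(hidden, kernel_b)
    return relu([h + s for h, s in zip(hidden, signal)])


noisy = [1.0, 0.0, 3.0, 0.0, 2.0]           # hypothetical noisy coverage
smooth = [1 / 3, 1 / 3, 1 / 3]              # hypothetical kernels
print(resblock(noisy, smooth, smooth))
```

The skip connection means that with all-zero kernels the block passes a non-negative input through unchanged, which is the property that makes stacks of such blocks easier to train.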
29
Denoising Low Sequencing Data
[Figure: signal tracks for 50 Million Reads, 1 Million Reads, and 1 Million Reads + AtacWorks]
AtacWorks identifies open chromatin from low-coverage data
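The model slide lists MSE and Pearson correlation as the evaluation for predicted coverage. A comparison like the one here, a denoised low-coverage track against a high-coverage track, could be scored along these lines; the track values are made up for the example.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length tracks."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(pred, target):
    """Pearson correlation; raises ZeroDivisionError if either track is constant."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


high_coverage = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0]  # hypothetical values
denoised = [0.0, 1.5, 3.5, 8.0, 4.5, 1.0]       # hypothetical values
print(mse(denoised, high_coverage), pearson(denoised, high_coverage))
```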
30
Genome-wide Sequencing Reduction
AtacWorks Reduces Sequencing Requirements 3x
[Chart: 1M Reads vs. 1M Reads + AtacWorks]
31
Denoising Low Quality Sample
AtacWorks improves signal-to-noise ratio in low quality samples
[Chart x-axis: Distance from transcription start site]
32
Denoising Single Cell Atac-seq Data
AtacWorks Improves Open DNA Detection From Few Cells
[Chart: Open DNA Detection auPRC for 90 Cells vs. 90 Cells With AtacWorks]
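The auPRC metric on this slide summarizes how well per-position scores rank true open DNA sites ahead of everything else. One common estimate of the area under the precision-recall curve is average precision, sketched below; whether AtacWorks computes auPRC exactly this way is not stated in the slides, and ties between scores get no special handling here.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision at each true positive,
    walking the predictions from highest to lowest score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


scores = [0.9, 0.8, 0.7, 0.1]  # hypothetical open-DNA scores
labels = [1, 0, 1, 0]          # 1 marks a true open DNA position
print(average_precision(scores, labels))
```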
33
AtacWorks SDK
- SDK on Clara Genomics: https://github.com/clara-genomics/AtacWorks
- AtacWorks Preprint: https://www.biorxiv.org/content/10.1101/829481v1
- Reduce Sequencing Cost
- Increase Single Cell Resolution
- Improve Sample Quality
[Charts: 90 Cells vs. 90 Cells + AtacWorks; 1M Reads vs. 1M Reads + AtacWorks]
34
Genome Assembly
35
Long Read De Novo Assembly
ACTCGGTCATTCGTGCTTTATC
Step 1: Mapping to detect overlaps between reads
Step 2: Overlap graph traversal to generate draft genomes
Step 3: Error correction to polish genomes
GCGTTATCGTCTACTTCGT
Draft genome Original reads
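Steps 1 and 2 can be illustrated with a toy overlap sketch. The three reads below are error-free substrings cut from the first sequence shown on this slide, chosen for the example; exact suffix-prefix matching and a greedy left-to-right merge are simplifications assumed here, since real long reads are noisy and are mapped with approximate methods.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_edges(reads, min_len=3):
    """Step 1: all-vs-all overlap detection, returned as (i, j, length) edges."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = suffix_prefix_overlap(a, b, min_len)
                if length:
                    edges.append((i, j, length))
    return edges


def merge(a, b, overlap_len):
    """Join two reads, keeping the shared overlap only once."""
    return a + b[overlap_len:]


# Toy reads: substrings of ACTCGGTCATTCGTGCTTTATC from the slide.
reads = ["ACTCGGTCATT", "TCATTCGTGC", "CGTGCTTTATC"]
edges = overlap_edges(reads)
print(edges)

# Step 2 (simplified): walk the chain of overlaps to build a draft.
draft = reads[0]
for i, j, length in edges:
    draft = merge(draft, reads[j], length)
print(draft)
```

Step 3, polishing, has no counterpart here because the toy reads contain no errors.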
36
Genome Assembly Workflow
Genome Assembly Pipeline
Steps: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap, MiniASM, Racon x 5, Medaka
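The workflow above, including the five rounds of Racon polishing, could be laid out as shell commands roughly as follows. Nothing is executed. The slide names only the tools, so the presets (`ava-ont`, `map-ont`, which assume Nanopore reads), the file names, the `awk` step that converts the GFA draft to FASTA, and the use of `medaka_consensus` are assumptions made for the sketch.

```python
def assembly_commands(reads="reads.fastq", polish_rounds=5):
    """Return shell command strings for the pipeline; nothing is executed.

    All file names are hypothetical placeholders.
    """
    cmds = [
        # Overlap: all-vs-all read mapping
        f"minimap2 -x ava-ont {reads} {reads} > overlaps.paf",
        # Assemble: build a draft from the overlaps
        f"miniasm -f {reads} overlaps.paf > draft.gfa",
        # Convert GFA segment lines to FASTA for the polishing steps
        "awk '/^S/ {print \">\" $2 \"\\n\" $3}' draft.gfa > round0.fasta",
    ]
    for i in range(polish_rounds):
        # Align: map the reads back to the current draft
        cmds.append(f"minimap2 -x map-ont round{i}.fasta {reads} > round{i}.paf")
        # Polish: one Racon round produces the next draft
        cmds.append(f"racon {reads} round{i}.paf round{i}.fasta > round{i + 1}.fasta")
    # DL Polish
    cmds.append(f"medaka_consensus -i {reads} -d round{polish_rounds}.fasta -o medaka_out")
    return cmds


for cmd in assembly_commands():
    print(cmd)
```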
37
Accelerated Genome Assembly Workflow
Before ClaraGenomicsAnalysis
Genome Assembly Pipeline
Steps: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap, MiniASM, Racon x 5, Medaka
GPU acceleration: cuDNN
38
Accelerated Genome Assembly Workflow
ClaraGenomicsAnalysis 0.1
Genome Assembly Pipeline
Steps: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap, MiniASM, Racon x 5, Medaka
GPU acceleration: cuDNN, cudaPOA
39
Accelerated Genome Assembly Workflow
ClaraGenomicsAnalysis 0.2
Genome Assembly Pipeline
Steps: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap, MiniASM, Racon x 5, Medaka
GPU acceleration: cuDNN, cudaPOA, cudaAligner
40
Accelerated Genome Assembly Workflow
ClaraGenomicsAnalysis 0.3
Genome Assembly Pipeline
Steps: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap, MiniASM, Racon x 5, Medaka
GPU acceleration: cuDNN, cudaPOA, cudaAligner, cudaMapper
41
ClaraGenomicsAnalysis SDK
Enabling Accelerated Genome Assembly
[Chart: Bacteria Genome Assembly Acceleration, Azure v32 CPU vs. V100 GPU]
Assembly Pipeline
Steps: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap, MiniASM, Racon x 5, Medaka
GPU acceleration: cuDNN, cudaPOA, cudaAligner, cudaMapper
42
CLARA GENOMICS SW
Open Source CUDA-Accelerated Sequencing Analysis Tools
SDKs: AtacWorks SDK, ClaraGenomicsAnalysis SDK
APPLICATIONS: BASECALLING, GENOME ASSEMBLY, AI-DENOISED ATAC-SEQ
APIs: Python API, C++ API
Modules: cudaMapper, cudaPOA, cudaAligner, Genomics I/O, Transfer Learning, Reference Models, Optimized Inference
CUDA
- Reference Applications
- Integration with 3rd Party Applications and Workflows
- C++ and Python APIs
- CUDA Accelerated HPC and Deep Learning Modules
43
Useful Links
- Parabricks: https://www.parabricks.com
- ClaraGenomicsAnalysis
  - SDK on GitHub: https://github.com/clara-genomics/ClaraGenomicsAnalysis
  - C++ API Examples: cudapoa, cudaaligner
  - Python API Examples: cudapoa, cudaaligner
- AtacWorks
  - SDK on GitHub: https://github.com/clara-genomics/AtacWorks
  - AtacWorks Preprint: https://www.biorxiv.org/content/10.1101/829481v1
- 3rd party integrations:
  - Racon: https://github.com/lbcb-sci/racon
  - Raven: https://github.com/lbcb-sci/raven
  - Bonito: https://github.com/nanoporetech/bonito
- Additional GPU Accelerated Genomics Applications:
  - Kipoi Model Zoo: https://ngc.nvidia.com/catalog/containers/hpc:kipoi
  - SigProfiler: https://github.com/AlexandrovLab/SigProfilerExtractor