Accelerating Sequencing with GPU Computing and Deep Learning, Johnny Israeli (PowerPoint presentation)

SLIDE 1

Johnny Israeli

Accelerating Sequencing with GPU Computing and Deep Learning

SLIDE 2

COMPUTE TRENDS

Figure: performance over time on a log scale from 10^2 to 10^7.
  • GPU-Computing perf: 1.5X per year
  • Single-threaded perf: 1.5X per year, later 1.1X per year
  • Labels: APPLICATIONS, SYSTEMS, ALGORITHMS, CUDA, ARCHITECTURE
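
To make the growth rates on this slide concrete, the sketch below compounds 1.5X per year and 1.1X per year over the same period. The ten-year horizon is an assumption chosen for illustration; the slide does not state one.

```python
def compound_growth(rate_per_year: float, years: int) -> float:
    """Return the cumulative speedup after compounding a yearly rate."""
    return rate_per_year ** years


YEARS = 10  # illustrative horizon, not taken from the slide

fast = compound_growth(1.5, YEARS)  # 1.5X per year
slow = compound_growth(1.1, YEARS)  # 1.1X per year

print(f"1.5X per year for {YEARS} years: {fast:.1f}X")
print(f"1.1X per year for {YEARS} years: {slow:.1f}X")
print(f"Gap after {YEARS} years: {fast / slow:.1f}X")
```

Over ten years, 1.5X per year compounds to roughly 58X while 1.1X per year compounds to roughly 2.6X, which is why a small difference in the yearly rate dominates the long-run trend.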

SLIDE 3

Publications

COMPUTE TRENDS

SLIDE 4

Sequencing Trends

SLIDE 5

SEQUENCING TRENDS

Sequencing Data Growing in Volume and Complexity

  • Decreasing Cost
  • Increasing Read Length
  • Rise of Single Cell Data

SLIDE 6

SEQUENCING TRENDS

SLIDE 7

SEQUENCING TRENDS

Worldwide Annual Sequencing Capacity

Figure: capacity by year, 2000 to 2025, on a log scale from 10^12 to 10^21.

SLIDE 8

Sequencing Data Types

SLIDE 9

SEQUENCING TRENDS: Genomics

*ENA Database

SLIDE 10

SEQUENCING TRENDS: Transcriptomics

*ENA Database

SLIDE 11

SEQUENCING TRENDS: Epigenomics

*ENA Database

SLIDE 12

SEQUENCING TRENDS: Nanopore Long Read Sequencing

*ENA Database

SLIDE 13

Variant Calling

SLIDE 14

Variant Calling

Sequence DNA → Map to Reference

Reference:      TGGATTTGAAAACGGAGCAAATGACTG
Illumina Reads: TGGATTTGAAAACGGAGCAAATGACTG
                TGGATTTGAAAACGGAGCAAATGACTG
                TGGATTTGAAAACAGAGCAAATGACTG
                TGGATTTGAAAACAGAGCAAATGACTG
                TGGATTTGAAAACAGAGCAAATGACTG
                TGGATTTGAAAACGGAGCAAATGACTG

Likely heterozygous variant

  • Identify sites with potential mismatch
  • True variants or instrument errors?
  • SNPs or insertions or deletions?
  • Heterozygous or homozygous variants?
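
The questions above can be illustrated with a minimal sketch that scans the slide's sequences column by column. It assumes the first sequence on the slide is the reference and the other six are gap-free reads aligned end to end. The `naive_genotype` threshold rule and its `min_fraction` parameter are hypothetical simplifications, not the method used by GATK or any caller mentioned in this deck.

```python
from collections import Counter

REFERENCE = "TGGATTTGAAAACGGAGCAAATGACTG"
READS = [
    "TGGATTTGAAAACGGAGCAAATGACTG",
    "TGGATTTGAAAACGGAGCAAATGACTG",
    "TGGATTTGAAAACAGAGCAAATGACTG",
    "TGGATTTGAAAACAGAGCAAATGACTG",
    "TGGATTTGAAAACAGAGCAAATGACTG",
    "TGGATTTGAAAACGGAGCAAATGACTG",
]


def candidate_sites(reference, reads):
    """Yield (position, base counts) for columns where any read mismatches."""
    for pos, ref_base in enumerate(reference):
        column = Counter(read[pos] for read in reads)
        if any(base != ref_base for base in column):
            yield pos, column


def naive_genotype(ref_base, column, min_fraction=0.2):
    """Toy threshold rule; real callers also model base and mapping quality."""
    depth = sum(column.values())
    alt_fraction = (depth - column[ref_base]) / depth
    if alt_fraction < min_fraction:
        return "likely error"
    if alt_fraction > 1 - min_fraction:
        return "homozygous variant"
    return "heterozygous variant"


for pos, column in candidate_sites(REFERENCE, READS):
    print(pos, dict(column), naive_genotype(REFERENCE[pos], column))
```

On this data the only mismatching column is position 13 (0-based), where three reads carry G and three carry A, so the toy rule reports a heterozygous variant, in line with the slide's annotation.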
SLIDE 15

Example Pileup Input Data

Figure: pileup showing a heterozygous SNP (axes: Read Index, Position).

SLIDE 16

GATK Variant Calling Pipeline

Variant Calling Pipeline: Align to Reference → Sort → Mark Duplicates → Calibrate → Call Variants → Joint Call → Filter Variants

SLIDE 17

Accelerated GATK Variant Calling Pipeline

Variant Calling Pipeline: Align to Reference → Sort → Mark Duplicates → Calibrate → Call Variants → Joint Call → Filter Variants

Parabricks: Alignment, Preprocessing, Variant Calling, Joint Genotyping, Variant Processing

SLIDE 18

Accelerated Variant Calling Pipelines

Parabricks: Alignment, Preprocessing, Variant Calling, Joint Genotyping, Variant Processing

Pipelines: Germline, Copy Number, Somatic

Tools: Alignment + Preprocessing, HaplotypeCaller, Mutect2, GenotypeGVCFs, DeepVariant

Whole Genome Processing in Minutes

SLIDE 19

Deep Averaging Network (DAN)

SLIDE 20

DAN Development

  • PyTorch-based 1D model
  • Learned embeddings of bases
  • Encoding variant proposals
  • Downsample easy variant candidates during training
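
As a rough illustration of the deep averaging idea, the sketch below embeds each base, averages the embeddings over positions, and classifies the result. It is written in plain Python instead of PyTorch so that it runs without dependencies. The embedding size, the three class labels, the single linear layer, and the random weights are all assumptions for illustration and do not reflect the presenter's actual model.

```python
import math
import random

BASES = "ACGT"
EMBED_DIM = 4    # illustrative embedding size
NUM_CLASSES = 3  # hypothetical labels: no variant, heterozygous, homozygous

rng = random.Random(0)

# One vector per base; random here, learned during training in a real model.
embedding = {b: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for b in BASES}
# A single linear layer standing in for a deeper feed-forward stack.
weights = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dan_forward(sequence):
    """Embed each base, average over positions, then classify."""
    vectors = [embedding[b] for b in sequence]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    logits = [sum(w * x for w, x in zip(row, mean)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


probs = dan_forward("AACGGAG")
print([round(p, 3) for p in probs])
```

Because the embeddings are averaged, this bare version is insensitive to the order of bases; any positional information would have to come from how the variant proposal is encoded before averaging.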

SLIDE 21

Variant Calling Errors

SLIDE 22

Variant Calling Error Breakdown

SLIDE 23

ATAC Sequencing

SLIDE 24

DNA: Open And Closed

  • Closed DNA: inactive
  • Open DNA: active
  • Open DNA changes affect development & disease

SLIDE 25

ATAC Sequencing

Mapping Open DNA Sites

Sequence Open DNA → Map & Count Reads

Figure: two regions labeled "Open DNA site".
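
The "Map & Count Reads" step can be sketched as a per-base coverage count followed by a simple depth threshold. The read coordinates, the genome length, and the `min_depth` cutoff are hypothetical; production peak callers use statistical models instead of a fixed threshold.

```python
def coverage(read_intervals, genome_length):
    """Per-base read depth from half-open [start, end) aligned intervals."""
    depth = [0] * genome_length
    for start, end in read_intervals:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


def open_sites(depth, min_depth):
    """Return half-open (start, end) runs where depth >= min_depth."""
    sites, run_start = [], None
    for pos, d in enumerate(depth):
        if d >= min_depth and run_start is None:
            run_start = pos
        elif d < min_depth and run_start is not None:
            sites.append((run_start, pos))
            run_start = None
    if run_start is not None:
        sites.append((run_start, len(depth)))
    return sites


# Hypothetical aligned reads on a 30-base toy genome.
reads = [(2, 8), (3, 9), (4, 10), (20, 26), (21, 27), (22, 28), (14, 16)]
depth = coverage(reads, 30)
print(open_sites(depth, min_depth=2))  # [(3, 9), (21, 27)]
```

The isolated read at (14, 16) never reaches the depth cutoff, so only the two stacked clusters are reported as open sites.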

SLIDE 26

ATAC-seq Limits

ATAC-seq signal degrades due to:

  • Less sequencing
  • Low quality sample preparation
  • Small cell populations
SLIDE 27

AtacWorks SDK

AI-Denoised ATAC-seq Data Processing

Sequence Open DNA → Map, Align, Count

Panels: High Quality Sequencing; Low Quality Sequencing; Low Quality Sequencing Denoised with AtacWorks AI

SLIDE 28

AtacWorks Model

Denoising + Open Chromatin Identification

  • Input: noisy ATAC-seq data
  • Network: Resblock 1 through Resblock 7, each built from Conv and ReLU layers
  • Output 1: predicted coverage (evaluation: MSE, Pearson correlation)
  • Output 2: predicted open chromatin (evaluation: AUPRC)
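
To show what a residual block over a 1D coverage track and the two regression metrics look like, here is a dependency-free sketch. The block wiring `ReLU(x + Conv(ReLU(Conv(x))))`, the fixed kernels, and the toy coverage values are assumptions; the slide does not specify the exact layer order, and a real model learns its kernels.

```python
import math


def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; assumes an odd kernel length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def resblock(signal, kernel_a, kernel_b):
    """Assumed wiring: ReLU(x + Conv(ReLU(Conv(x))))."""
    hidden = relu(conv1d_same(signal, kernel_a))
    residual = conv1d_same(hidden, kernel_b)
    return relu([x + r for x, r in zip(signal, residual)])


def mse(pred, target):
    """Mean squared error between two equal-length tracks."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(pred, target):
    """Pearson correlation; undefined if either track is constant."""
    mp, mt = sum(pred) / len(pred), sum(target) / len(target)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


noisy = [0.0, 3.0, 1.0, 4.0, 0.0, 5.0, 1.0, 0.0]  # hypothetical noisy coverage
clean = [0.0, 2.0, 3.0, 3.0, 3.0, 2.0, 1.0, 0.0]  # hypothetical clean coverage
smooth = [1 / 3, 1 / 3, 1 / 3]                     # fixed kernel for illustration

out = noisy
for _ in range(7):  # seven blocks, matching the count on the slide
    out = resblock(out, smooth, [0.0, 0.1, 0.0])

print("MSE:", round(mse(out, clean), 3))
print("Pearson:", round(pearson(out, clean), 3))
```

The skip connection means a block with a zero second kernel passes non-negative input through unchanged, which is the property that lets deep stacks of such blocks train stably.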

SLIDE 29

Denoising Low Sequencing Data

Panels: 50 Million Reads; 1 Million Reads; 1 Million Reads + AtacWorks

AtacWorks identifies open chromatin from low-coverage data

SLIDE 30

Genome-wide Sequencing Reduction

AtacWorks Reduces Sequencing Requirements 3x

Panels: 1M Reads; 1M Reads + AtacWorks

SLIDE 31

Denoising Low Quality Sample

AtacWorks improves signal-to-noise ratio in low quality samples

Figure axis: Distance from transcription start site

SLIDE 32

Denoising Single Cell ATAC-seq Data

AtacWorks Improves Open DNA Detection From Few Cells

Figure: Open DNA Detection AUPRC, 90 Cells vs. 90 Cells With AtacWorks
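
For readers unfamiliar with the metric, the sketch below computes AUPRC as average precision, one common step-wise estimator of the area under the precision-recall curve. The labels and scores are hypothetical, tied scores are not treated specially, and this is not necessarily the estimator used to produce the figure.

```python
def auprc(labels, scores):
    """Average precision: mean precision at each true positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_positives


# Hypothetical per-position labels (1 = open DNA) and model scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(round(auprc(labels, scores), 4))  # 0.7222
```

A perfect ranking, with every open position scored above every closed one, yields 1.0; the toy ranking above is penalized because one open position is ranked last.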

SLIDE 33

AtacWorks SDK

  • SDK on Clara Genomics: https://github.com/clara-genomics/AtacWorks
  • AtacWorks Preprint: https://www.biorxiv.org/content/10.1101/829481v1

  • Reduce Sequencing Cost: 1M Reads vs. 1M Reads + AtacWorks
  • Increase Single Cell Resolution: 90 Cells vs. 90 Cells + AtacWorks
  • Improve Sample Quality

SLIDE 34

Genome Assembly

SLIDE 35

Long Read De Novo Assembly

  • Step 1: Mapping to detect overlaps between reads
  • Step 2: Overlap graph traversal to generate draft genomes
  • Step 3: Error correction to polish genomes

Figure sequences: ACTCGGTCATTCGTGCTTTATC, GCGTTATCGTCTACTTCGT (labels: Draft genome, Original reads)
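
Steps 1 and 2 can be sketched with exact suffix-prefix overlaps and a greedy merge. The three reads are hypothetical, error-free fragments cut from the first 16 bases of the sequence on the slide. Real long reads contain errors, so tools such as MiniMap rely on approximate matching, and MiniASM traverses an overlap graph; the greedy merge here is a simplified stand-in for both.

```python
def overlap_length(left, right, min_overlap=3):
    """Longest suffix of `left` equal to a prefix of `right` (0 if too short)."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining overlaps; leave contigs separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


# Hypothetical error-free reads from the toy genome "ACTCGGTCATTCGTGC".
reads = ["ACTCGGTC", "GGTCATTC", "ATTCGTGC"]
print(greedy_assemble(reads))  # ['ACTCGGTCATTCGTGC']
```

Step 3 (error correction) is omitted because these toy reads are error-free; with noisy reads the draft would need polishing against the original reads.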

SLIDE 36

Genome Assembly Workflow

Genome Assembly Pipeline

Stages: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap → MiniASM → Racon x 5 → Medaka

SLIDE 37

Accelerated Genome Assembly Workflow

Before ClaraGenomicsAnalysis

Genome Assembly Pipeline

Stages: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap → MiniASM → Racon x 5 → Medaka
GPU acceleration: cuDNN

SLIDE 38

Accelerated Genome Assembly Workflow

ClaraGenomicsAnalysis 0.1

Genome Assembly Pipeline

Stages: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap → MiniASM → Racon x 5 → Medaka
GPU acceleration: cuDNN, cudaPOA

SLIDE 39

Accelerated Genome Assembly Workflow

ClaraGenomicsAnalysis 0.2

Genome Assembly Pipeline

Stages: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap → MiniASM → Racon x 5 → Medaka
GPU acceleration: cuDNN, cudaPOA, cudaAligner

SLIDE 40

Accelerated Genome Assembly Workflow

ClaraGenomicsAnalysis 0.3

Genome Assembly Pipeline

Stages: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap → MiniASM → Racon x 5 → Medaka
GPU acceleration: cuDNN, cudaPOA, cudaAligner, cudaMapper

SLIDE 41

ClaraGenomicsAnalysis SDK

Enabling Accelerated Genome Assembly

Bacteria Genome Assembly Acceleration: Azure v32 CPU vs. V100 GPU

Assembly Pipeline

Stages: Overlap → Assemble → Align → Polish → DL Polish
Tools: MiniMap → MiniASM → Racon x 5 → Medaka
GPU acceleration: cuDNN, cudaPOA, cudaAligner, cudaMapper

SLIDE 42

CLARA GENOMICS SW

Open Source CUDA-Accelerated Sequencing Analysis Tools

AtacWorks SDK ClaraGenomicsAnalysis SDK

APPLICATIONS: Basecalling, Genome Assembly, AI-Denoised ATAC-seq

AtacWorks SDK: Transfer Learning, Genomics I/O, Reference Models, Optimized Inference

ClaraGenomicsAnalysis SDK: cudaMapper, cudaPOA, cudaAligner, C++ API, Python API

CUDA

  • Reference Applications
  • Integration with 3rd Party Applications and Workflows
  • C++ and Python APIs
  • CUDA Accelerated HPC and Deep Learning Modules

SLIDE 43

Useful Links

  • Parabricks: https://www.parabricks.com
  • ClaraGenomicsAnalysis
      • SDK on GitHub: https://github.com/clara-genomics/ClaraGenomicsAnalysis
      • C++ API Examples: cudapoa, cudaaligner
      • Python API Examples: cudapoa, cudaaligner
  • AtacWorks
      • SDK on GitHub: https://github.com/clara-genomics/AtacWorks
      • AtacWorks Preprint: https://www.biorxiv.org/content/10.1101/829481v1
  • 3rd party integrations:
      • Racon: https://github.com/lbcb-sci/racon
      • Raven: https://github.com/lbcb-sci/raven
      • Bonito: https://github.com/nanoporetech/bonito
  • Additional GPU Accelerated Genomics Applications:
      • Kipoi Model Zoo: https://ngc.nvidia.com/catalog/containers/hpc:kipoi
      • SigProfiler: https://github.com/AlexandrovLab/SigProfilerExtractor
SLIDE 44

Johnny Israeli

Accelerating Sequencing with GPU Computing and Deep Learning