GenAx: A Genome Sequence Accelerator
Daichi Fujiki et al.
Presented by: Amani Alkayyali, Ben Cyr
EECS 573 GenAx Paper Presentation

Genome Sequencing
● DNA: Thymine, Cytosine, Adenine, Guanine
● Genome Sequencing: Determining T, C, A, G order
● Genome Sequencing Goals:
  ○ Understanding entire DNA sequence as a system
[Figure: base labels (Thymine, Adenine, Cytosine, Guanine) and a letter-string analogy in which fragments such as "thequick", "brownfo", "thedoglay", and "quietlydreaming" are picked out of a longer run of letters]

Uses
● Individualized treatment and personalized medicine
  ○ Understanding an individual’s cancer cell mutations
● Understanding causes of diseases
https://rnsights.com/the-push-for-personalized-medicine/

Methods
● Steps of genome sequencing:
  ○ Break into small pieces (reads) at random positions
  ○ Determine the sequence
  ○ Figure out which pieces fit together (read alignment)
● Two approaches:
  ○ Clone-by-Clone
  ○ Whole Genome Sequencing
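The first step above (breaking the genome into reads at random positions) can be sketched in a few lines. This is only a toy illustration: the function name `sample_reads`, the read length, and the use of the letter string from the earlier slide as a stand-in genome are all assumptions made for the example. The start position is kept with each read purely so the result can be checked; a real aligner does not know it.

```python
import random


def sample_reads(genome: str, read_len: int, n_reads: int, seed: int = 0) -> list[tuple[int, str]]:
    """Cut n_reads substrings of length read_len at random start positions."""
    rng = random.Random(seed)              # fixed seed so the sketch is repeatable
    last_start = len(genome) - read_len    # last position where a full read still fits
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append((start, genome[start:start + read_len]))
    return reads


# Toy "genome" borrowed from the letter-string analogy (hypothetical input).
genome = "thequickbrownfoxjumpedoverthelazydog"
reads = sample_reads(genome, read_len=8, n_reads=5)
for start, read in reads:
    print(start, read)
```

Read alignment is then the inverse problem: given only the read strings, recover where each one came from.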

Current State: Genome Sequencing and Computing
● Expensive
  ○ 2001: $3 billion for the first human genome sequencing
● Requires several hundred to thousands of CPU hours
● Large output
  ○ Sequencing 1 million genomes produces over 300 petabytes of data
● Moore’s Law tapering motivates hardware acceleration
● BWA-MEM: Burrows-Wheeler Aligner
  ○ Broad Institute’s standard software for read alignment

Goals
● Smaller seeds → More parallelism
● Improving locality of data access
● Improve upon Smith-Waterman and Levenshtein Automata (LA)
  ○ Improve scaling
● Accelerator for read alignment
● Resolve issues from variants and sequencing errors

Sequence Aligners
● Edit distance: minimum number of deletions, insertions, or substitutions needed to turn one string into the other
● Seeding: finding potential matches
● Seed extension: finding the best match
[Figure: read alignment against the reference genome in two steps: 1. Seeding, 2. Seed Extension]
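To make the edit distance definition concrete, here is a minimal sketch of the textbook dynamic-programming computation (unit cost per deletion, insertion, or substitution). It is a plain software reference, not the automaton-based method the paper proposes; the function name `edit_distance` is chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))        # one deletion
print(edit_distance("kitten", "sitting"))  # classic example with distance 3
```

The table has one cell per pair of positions, so the work grows with the product of the two string lengths.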

Seeding Algorithm
● Seeding locates the potential match locations
● Finds the “seeds” for the seed extension phase
  ○ “k-mers”: string matches of length k
  ○ Super Maximal Exact Matches (SMEMs): maximum-length match extending from a k-mer
● Key Idea: Intersect sets of k-mers until the longest match is found.
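The key idea can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: it only extends a match to the right from the start of the read, it steps one base at a time, and the names `build_kmer_index` and `longest_exact_match` are invented for the example. Each k-mer of the read is looked up in an index of reference positions, the positions are shifted back by the k-mer's offset in the read, and the candidate set is intersected until it would become empty.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, set[int]]:
    """Map every k-mer in the reference to the set of positions where it starts."""
    index = defaultdict(set)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].add(pos)
    return index


def longest_exact_match(read: str, index: dict[str, set[int]], k: int) -> tuple[int, set[int]]:
    """Return (match length, reference start positions) for the longest
    exact match that begins at read[0]."""
    if len(read) < k:
        return 0, set()
    candidates = set(index.get(read[:k], set()))
    if not candidates:
        return 0, set()
    length = k
    for offset in range(1, len(read) - k + 1):
        hits = index.get(read[offset:offset + k], set())
        # Shift hits back so they name the start of the whole match, then intersect.
        narrowed = candidates & {p - offset for p in hits}
        if not narrowed:
            break  # no location supports a longer match
        candidates = narrowed
        length = offset + k
    return length, candidates


reference = "ACGTACGTTTGACCA"
index = build_kmer_index(reference, k=4)
print(longest_exact_match("ACGTTTGAGG", index, k=4))
```

In this toy input the first k-mer "ACGT" occurs at two reference positions, and the intersections narrow the candidates down to the single position that supports the longest match.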

Seeding Algorithm: example with K = 4

Seeding Accelerator
● Index and Position Tables are kept in large SRAM blocks
● Intersection computation with Content Addressable Memory (CAM)
  ○ A CAM reports very quickly whether a given value is stored in it
  ○ Small CAM table with 512 indices
  ○ When k = 12 (average case), there are usually fewer than 500 matches
● If there are more than 512 indices, use binary search
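A rough software analogy of the intersection step: a Python `set` stands in for the CAM (fast "is this value present?" lookups), and a binary search over a sorted position list is used when the list is too large for the CAM. Which operand is loaded into the CAM, the function names, and the assumption that position lists are sorted are all choices made for this sketch, not details taken from the slides.

```python
from bisect import bisect_left

CAM_ENTRIES = 512  # CAM capacity quoted on the slide


def contains_sorted(sorted_positions: list[int], value: int) -> bool:
    """Binary-search membership test on a sorted list."""
    i = bisect_left(sorted_positions, value)
    return i < len(sorted_positions) and sorted_positions[i] == value


def intersect_positions(candidates: list[int], hits: list[int]) -> list[int]:
    """Intersect two sorted position lists, keeping the order of hits."""
    if len(candidates) <= CAM_ENTRIES:
        cam = set(candidates)  # software stand-in for loading the CAM
        return [p for p in hits if p in cam]
    # Too many entries for the CAM: fall back to binary search per lookup.
    return [p for p in hits if contains_sorted(candidates, p)]


print(intersect_positions([1, 5, 9], [5, 6, 9]))                           # CAM path
print(len(intersect_positions(list(range(0, 2000, 2)), list(range(0, 2000, 3)))))  # fallback path
```

Both paths return the same result; the difference is only in how each membership lookup is answered.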

Silla: String Independent Local Levenshtein Automata
● Seed extension algorithm
● Finite-state automata
● Traceback: trace of edits needed to align
● Scored using an affine gap function
● Insertions, deletions, substitutions
● 3D vs 2D Silla
● Merging confluence paths
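As an illustration of affine gap scoring applied to a traceback, the sketch below scores a string of edit operations. Opening a gap carries a fixed penalty and every gapped base adds a smaller one, so one long gap is cheaper than several short ones. The trace alphabet (M, X, I, D), the function name, and the penalty values are assumptions for this example; the slides do not state the scoring parameters, and conventions for counting the first gapped base vary.

```python
from itertools import groupby


def affine_gap_score(trace: str, match: int = 1, mismatch: int = 4,
                     gap_open: int = 6, gap_extend: int = 1) -> int:
    """Score an edit trace: M = match, X = substitution,
    I = insertion, D = deletion. Penalty values are illustrative only."""
    score = 0
    for op, run in groupby(trace):  # consecutive identical ops form one run
        n = len(list(run))
        if op == "M":
            score += n * match
        elif op == "X":
            score -= n * mismatch
        elif op in ("I", "D"):
            # One opening penalty per gap, plus an extension penalty per base.
            score -= gap_open + n * gap_extend
        else:
            raise ValueError(f"unknown trace operation: {op!r}")
    return score


print(affine_gap_score("MMMMDDMMMM"))  # one gap of length 2
print(affine_gap_score("MMMMDMMDMM"))  # two gaps of length 1
```

Both traces contain eight matches and two deleted bases, yet the single two-base gap scores higher than the two separate one-base gaps.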

SillaX: Silla Accelerator
● Edit distance, affine gap penalty, traceback
● State = processing element, communicates with neighbor
● Retro comparison = two shift registers
● Scoring → Clipping
● Composable Subgrids
● Verified on human genome
● 62.9x speedup over Smith-Waterman

GenAx
● Combines the seeding accelerator and SillaX
● Direct replacement for the BWA-MEM software sequence aligner

GenAx Architecture

GenAx Performance Test
● Compared with two other sequence aligners
  ○ Intel Xeon processor running BWA-MEM (128 GB DDR4)
  ○ Nvidia TITAN Xp running CUSHAW2
● Synthesized and simulated GenAx with a 28nm process
● Used real human genome reference from dataset
  ○ 800 million reads at 101 base pairs per read

Performance Results
● GenAx vs BWA-MEM
  ○ 31.7x speedup
  ○ 12x less power
  ○ ~10 hrs vs. ~300 hrs
● Even better vs GPU
  ○ 72.4x speedup

Conclusion and Contributions
● Silla:
  ○ Computes edit distance between two strings
  ○ String independent and local communication
● SillaX:
  ○ Accelerator for Silla supporting traceback
● GenAx:
  ○ SillaX + Seeding Accelerator
  ○ Drop-in replacement for BWA-MEM software

Discussion Questions
● GenAx might take large performance hits when handling certain inputs (e.g., large K edit distances, many “k-mer” seeds). Is it worth using GenAx even if it is not flexible enough to handle these edge cases?
● Are composable systems (many small systems combined to form one large system) a good solution for scaling?
● The authors ran the performance test on one specific genome and read configuration. Do you think this is enough to show the usefulness of GenAx?
