
On Accelerating Pair-HMM Computations in Programmable Hardware



  1. On Accelerating Pair-HMM Computations in Programmable Hardware. Subho S. Banerjee, Mohamed el-Hadedy, Ching Y. Tan, Zbigniew T. Kalbarczyk, Steve Lumetta, Ravishankar K. Iyer

  2. Contributions
  • Design and implementation of an accelerator that computes the Forward Algorithm (FA) on Pair Hidden Markov Models (PHMMs).
  • Demonstrate the value of the accelerator in computational genomics workflows, where PHMMs are used to identify mutations in genomes.
  • Optimize the accelerator architecture for both the algorithm and the characteristics of common input data.
  • Reduce compute time: 14.85× higher throughput.
  • Reduce operational cost (in terms of energy consumption): 147.49× higher throughput per unit energy.
  [Figure: throughput comparison of this work against CPU, GPU [10]-[12], [6], and other FPGA [13] implementations; citations are consistent with those in the paper.]

  3. Forward Algorithm on Pair-HMM Models
  • PHMM models are Bayesian multinets that allow a probabilistic interpretation of the alignment problem.
  • An alignment models the homology between two sequences via a series of mutations, insertions, and deletions of nucleotides.
  • The FA algorithm computes a statistical-similarity score by considering all alignments between two sequences and computing the overall alignment probability by summing over them.
  • The computation can be described by the recurrence equations below, which exhibit anti-diagonal data dependencies.
  [Figure: plate-notation graphical model showing class nodes for symbols in Sequence 1 and Sequence 2, hidden states, and transitions between hidden states.]
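  The equations themselves were rendered as an image on the original slide. A standard formulation of the Pair-HMM forward recurrences (following Durbin et al.; the transition probabilities a_{..} and gap-emission terms q(.) below are textbook notation, assumed here rather than copied from the paper) is:

      f_M(i,j) = P(x_i, y_j) [ a_{MM} f_M(i-1,j-1) + a_{IM} f_I(i-1,j-1) + a_{DM} f_D(i-1,j-1) ]
      f_I(i,j) = q(x_i) [ a_{MI} f_M(i-1,j) + a_{II} f_I(i-1,j) ]
      f_D(i,j) = q(y_j) [ a_{MD} f_M(i,j-1) + a_{DD} f_D(i,j-1) ]

  Each cell (i, j) depends only on cells (i-1, j), (i, j-1), and (i-1, j-1), so every cell on an anti-diagonal i + j = k can be computed in parallel once diagonals k-1 and k-2 are available.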

  4. PHMM Forward Algorithm in Bioinformatics
  • PHMMs form the basis of the variant-detection tool GATK HaplotypeCaller.
  • Used to pick the n-best haplotypes by maximizing the likelihood of a read originating from each haplotype; the FA algorithm computes this likelihood.
  • Constitutes >70% of the runtime of the GATK HaplotypeCaller.
  • Executes more than 3×10^7 times for a standard clinical human dataset.
  Diagram from GATK Documentation: https://software.broadinstitute.org/gatk/documentation/article.php?id=4148

  5. Shortcomings of Related Work
  • Past work explores the use of FPGAs/ASICs, based on systolic-array designs that exploit the anti-diagonal parallelism in the recurrence pattern.
  • A common shortcoming is that these designs are optimized only for the algorithm and not for input-data characteristics.
  • Input-size variability can lead to idle cycles in systolic-array-based designs.
  [Figure: CDF of input string sizes for computation on the NA12878 sample, showing a nearly uniform distribution across small (<250) and large (>350) inputs.]

  6. Our Design
  • Design goal: optimize the design to execute different input sizes in parallel.
  • Expend the chip budget on maximizing inter-task parallelism.
  • Handle intra-task parallelism through aggressive pipelining, with a specialized data path and schedule that ensure there are no idle cycles while computing.
  • Host-accelerator interface uses IBM CAPI, via the IBM-supplied POWER Service Layer (PSL).
  • An out-of-order issue unit dispatches tasks to the PEs; write-back logic is encapsulated in the bus-scheduling strategy.
  • A memory scheduler minimizes the scratchpad-buffer size used to store intermediate results.
  [Figure: block diagram of the accelerator. An input bus (250 MHz) feeds serializers and a lookup table that converts ASCII-encoded quality scores to IEEE-754-encoded "a" parameters; an array of PEs (400 MHz) implements the PHMM data path and computes the "f" metrics; an input cache, a scratchpad buffer with read/write address generators, and an internal output memory cache and scheduler connect the PEs to the output bus (250 MHz).]

  7. Processing Element (PE) Design
  • Goal: schedule operations to minimize idle cycles.
  • The schedule presented has no idle cycles and temporally multiplexes the adder and the two multipliers among operations.
  • The entire pipeline is 8-deep (8 operations in flight at a time).
  [Figure: circuit representation of the computation datapath (operations A through L mapped onto one adder and two multipliers) and the corresponding Gantt chart of the operation schedule over cycles 2-32.]

  8. Minimize Storage Requirements
  • Temporary scratchpad space is required to store intermediate values produced by the FA algorithm.
  • We minimize this space by following the anti-diagonal recursion pattern of the FA algorithm: memory is filled along the anti-diagonal of the recursion lattice, and computing a new block overwrites values that are no longer needed.
  • As a result, we need only O(L) space instead of O(L^2) space to store the entire matrix (see the sketch below).
  [Figure: recursion lattice from Equation 1 alongside the scratchpad-memory state, showing completed, stored, current, and remaining blocks.]
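  A minimal software sketch of the O(L) anti-diagonal storage scheme (hypothetical Python illustrating the idea, not the paper's hardware implementation; the transition constants and emission function are made-up placeholders):

      # Placeholder transition probabilities (not the paper's trained values).
      A_MM, A_IM, A_DM = 0.9, 0.05, 0.05   # into the match state
      A_MI, A_II = 0.1, 0.9                # into the insert state
      A_MD, A_DD = 0.1, 0.9                # into the delete state

      def emit(xc, yc):
          """Toy match/mismatch emission probability."""
          return 0.9 if xc == yc else 0.1 / 3.0

      def forward_antidiagonal(x, y):
          """Pair-HMM forward probability with O(L) anti-diagonal storage.

          Only the two most recent anti-diagonals of the (M, I, D) lattices
          are kept, mirroring the accelerator's scratchpad strategy.
          """
          n, m = len(x), len(y)
          prev2, prev = {}, {0: (1.0, 0.0, 0.0)}   # diagonal 0 holds cell (0, 0)
          for d in range(1, n + m + 1):            # sweep anti-diagonals i + j = d
              curr = {}
              for i in range(max(0, d - m), min(n, d) + 1):
                  j = d - i
                  fM = fI = fD = 0.0
                  if i >= 1 and j >= 1 and (i - 1) in prev2:
                      pM, pI, pD = prev2[i - 1]    # cell (i-1, j-1)
                      fM = emit(x[i - 1], y[j - 1]) * (A_MM * pM + A_IM * pI + A_DM * pD)
                  if i >= 1 and (i - 1) in prev:
                      qM, qI, _ = prev[i - 1]      # cell (i-1, j)
                      fI = 0.25 * (A_MI * qM + A_II * qI)   # uniform gap emission
                  if j >= 1 and i in prev:
                      rM, _, rD = prev[i]          # cell (i, j-1)
                      fD = 0.25 * (A_MD * rM + A_DD * rD)
                  curr[i] = (fM, fI, fD)
              prev2, prev = prev, curr             # retain only two diagonals
          return sum(prev[n])                      # f_M + f_I + f_D at cell (n, m)

  For example, forward_antidiagonal("ACGT", "ACGT") returns the total alignment probability while never holding more than two anti-diagonals in memory.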

  9. Dealing with Accelerator Invocation Overheads
  • Accelerator invocation overhead significantly reduces performance because of the OS overhead of initializing the accelerator.
  • Solution: amortize the cost of accelerator invocation by batching multiple invocations. The OS sends a batch of tasks to the accelerator, and the hardware distributes them across the PEs; with a batch of B tasks, the per-task share of a fixed OS overhead T_OS drops to roughly T_OS / B.
  • We demonstrate several approaches to selecting task batches:
    • Simple task batching
    • Common-prefix memoization
    • FA on partially ordered strings
  [Figure: mean latency (μs) per PHMM task vs. batch size (1 to 10,000 tasks). Task batching yields a significant drop in mean latency as the OS overhead is amortized over large batches.]
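  A back-of-the-envelope model of why batching helps (hypothetical Python; t_os and t_task are made-up numbers chosen only to mirror the shape of the latency curve on the slide):

      def mean_task_latency(batch_size, t_os=1000.0, t_task=1.0):
          """Mean latency per task (μs) when one OS invocation of cost t_os
          is shared by batch_size tasks. Both constants are illustrative."""
          return t_os / batch_size + t_task

      for b in (1, 10, 100, 1000, 10000):
          print(b, mean_task_latency(b))   # per-task latency falls roughly as 1/B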

  10. Common Prefix Memoization
  • Similar inputs to the PHMM often have common prefixes; the naïve algorithm recomputes the PHMM FA for all pairs of strings.
  • Our solution:
    • Construct a prefix trie to find the longest common prefix in an input task batch.
    • Compute the PHMM FA for the prefix only once and reuse the precomputed values.
    • Saves compute time and host-accelerator bandwidth.
  • Example: (AAACGCA, AAACCGG); (AAACGCC, AAACCGG); (AAACGCG, AAACCGG)
    • The reads (input 1) share a common prefix and are paired with a single haplotype (input 2).
    • Construct a trie for input 1, precompute the matrix for the prefix on the accelerator, then compute the last row and column on the host CPU (a sketch of the prefix step follows below).
  [Figure: three steps: (1) precompute the compressed trie of prefixes; (2) reuse the precomputed values for the shared prefix; (3) compute the last row for each string.]
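  A minimal sketch of the prefix step (hypothetical Python; for brevity it uses os.path.commonprefix on the whole batch instead of the paper's compressed trie, which additionally handles batches with several distinct prefixes):

      import os   # os.path.commonprefix compares strings character by character

      def split_batch_by_prefix(reads):
          """Return (prefix, suffixes) for a batch of reads paired with one haplotype.

          The forward-matrix rows covering `prefix` are identical for every read,
          so they can be computed once and memoized; only the rows for each
          suffix need to be recomputed per read.
          """
          prefix = os.path.commonprefix(reads)
          return prefix, [r[len(prefix):] for r in reads]

      # The example batch from the slide: three reads against one haplotype.
      reads = ["AAACGCA", "AAACGCC", "AAACGCG"]
      prefix, suffixes = split_batch_by_prefix(reads)
      print(prefix)     # AAACGC
      print(suffixes)   # ['A', 'C', 'G']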

  11. FA on Partially Ordered Strings
  • Inputs to the PHMM accelerator in GATK are computed from De Bruijn graphs.
  • Core idea: do not dispatch multiple paths from the De Bruijn graph as separate tasks; dispatch the entire graph at the same time.
  • We present an extension of the partial-order alignment (POA) algorithm [1] for computing the FA between a single read and an entire De Bruijn graph (a rough sketch follows below).
  [Figure: traditional PHMM dependency lattice vs. the POA-based PHMM dependency lattice over the graph.]
  [1] C. Lee, C. Grasso, and M. F. Sharlow, "Multiple sequence alignment using partial order graphs," Bioinformatics, vol. 18, no. 3, pp. 452-464, Mar 2002.
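  A rough sketch of the generalization (hypothetical Python, simplified from the POA idea and not the paper's exact extension: the haplotype axis becomes a DAG, so the column dependency j-1 is replaced by a sum over graph predecessors; it reuses the A_* constants and emit() from the slide-8 sketch above):

      def forward_poa(x, labels, preds, order):
          """Pair-HMM forward of read x against a DAG of haplotypes.

          labels[j] is the base at graph node j, preds[j] its predecessor
          nodes, and order a topological ordering. Nodes with no
          predecessors connect to a virtual start column.
          """
          n = len(x)
          # Virtual start column: only state M is occupied before any input.
          start = [(0.0, 0.0, 0.0)] * (n + 1)
          start[0] = (1.0, 0.0, 0.0)
          for i in range(1, n + 1):                    # leading insertions
              qM, qI, _ = start[i - 1]
              start[i] = (0.0, 0.25 * (A_MI * qM + A_II * qI), 0.0)
          col = {}                                     # node -> column of (fM, fI, fD)
          for j in order:
              left = [col[p] for p in preds[j]] or [start]
              c = [(0.0, 0.0, 0.0)] * (n + 1)
              for i in range(n + 1):
                  fM = fI = fD = 0.0
                  for pcol in left:                    # sum over graph predecessors
                      pM, pI, pD = pcol[i]
                      fD += 0.25 * (A_MD * pM + A_DD * pD)
                      if i >= 1:
                          uM, uI, uD = pcol[i - 1]
                          fM += emit(x[i - 1], labels[j]) * (A_MM * uM + A_IM * uI + A_DM * uD)
                  if i >= 1:
                      qM, qI, _ = c[i - 1]             # insert stays in this column
                      fI = 0.25 * (A_MI * qM + A_II * qI)
                  c[i] = (fM, fI, fD)
              col[j] = c
          sinks = set(order) - {p for ps in preds.values() for p in ps}
          return sum(sum(col[j][n]) for j in sinks)    # total mass at read end

  For example, a graph with a bubble (A, then C or G, then T) can be scored against a read in one call: forward_poa("ACT", ["A", "C", "G", "T"], {0: [], 1: [0], 2: [0], 3: [1, 2]}, [0, 1, 2, 3]), instead of dispatching the two paths ACT and AGT as separate tasks.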

  12. Results: Performance Benchmarking
  • PHMM micro-benchmark: 14.85× higher throughput than an 8-core CPU baseline (which uses SIMD and multi-threading), and a 147.49× improvement in throughput per unit of energy expended.
  • End-to-end GATK HaplotypeCaller application: 3.287× speedup over the CPU-only baseline; 3.48× is the maximum attainable speedup according to Amdahl's Law.
  [Figure: left, performance of the accelerator in the PHMM micro-benchmark vs. the best GPU [12], the best FPGA [13], and a POWER8 chip; right, end-to-end HaplotypeCaller performance approaching the Amdahl's Law limit.]

  13. Results: On-Chip Resource Utilization
  • The design fits 44 PEs on a Xilinx XC7VX690T; the use of logic slices is the limiting factor.
  • Potential for larger gains in micro-benchmark performance on larger FPGAs, where memory bandwidth becomes the bottleneck [simulation results in the paper].
  • Negligible gains to be had in end-to-end application performance: the design is already close to the Amdahl's Law limit.
  [Figure: physical layout on the Xilinx XC7VX690T and a breakdown of resource utilization across CAPI interface logic, clock, signals, logic, BRAM, DSP, PCIe, and MMCM.]
