SLIDE 2

Efficient all-against-all protein similarity matrix computation using OpenCL

Genome-oriented bioinformatics lab, WS2013/2014
Uli Köhler & Anton Smirnov
LMU & TUM
Helmholtz-Zentrum München
Supervisor: Mathias Walter
February 24th, 2014

SLIDE 3

Introduction: SIMAP

SIMAP I

Similarity Matrix of Proteins:
- Database of protein similarities
- Compares all-against-all
- Currently ~73 million protein sequences → 5.3 · 10¹⁵ alignments
- BOINC-SIMAP: distributed computing

      p1    p2    p3
p1    −     5     ...
p2    ...   −     ...
p3    ...   170   −
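The alignment count on this slide is consistent with counting every ordered pair of distinct sequences, i.e. the off-diagonal cells of the matrix: n(n − 1) ≈ n². Aligning each unordered pair only once would roughly halve it. A small Python check, assuming n = 73,000,000 exactly (the slide only says "~73 million"):

```python
def ordered_pairs(n: int) -> int:
    """Off-diagonal cells of an n x n similarity matrix: every (query, target) with query != target."""
    return n * (n - 1)


def unordered_pairs(n: int) -> int:
    """Pairs counted once, as if S[i][j] and S[j][i] were not computed separately."""
    return n * (n - 1) // 2


n = 73_000_000  # assumption: "~73 million" sequences, rounded
print(f"{ordered_pairs(n):.2e}")    # 5.33e+15
print(f"{unordered_pairs(n):.2e}")  # 2.66e+15
```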

Köhler U, Smirnov A (LMU, TUM) Efficient S/W using OpenCL February 24th, 2014 3 / 21

SLIDE 4

Introduction: SIMAP

SIMAP II

- Currently uses the FASTA algorithm (fast, but based on suboptimal heuristics)
- For high-scoring hits, Smith-Waterman is currently in use
- Smith-Waterman provides better accuracy
- Requires an efficient, parallelized implementation

SLIDE 5

Introduction: Hardware

Computational hardware

- CPU: ~1-12 cores, available anywhere
- GPU: 1000+ cores, good availability
- FPGA (field-programmable gate array):
  - Configurable number of cores
  - Difficult to use
  - Expensive

SLIDE 6

Parallelization and OpenCL: OpenCL

OpenCL

- Programming framework for parallel computing
- High-level abstraction over low-level routines
- Runs on CPUs, GPUs & FPGAs without modification
- Driver optimizes code for specific devices

SLIDE 7

Parallelization and OpenCL: Smith-Waterman

Smith-Waterman parallelization

- Intra-task
- Inter-task
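Intra-task parallelization splits a single alignment across threads: in Smith-Waterman, cell (i, j) depends only on (i−1, j−1), (i−1, j) and (i, j−1), so all cells on one anti-diagonal i + j = d are independent of each other. Inter-task parallelization instead hands each thread a whole alignment. The following Python sketch illustrates both orders under stated assumptions: hypothetical +2/−1 scoring with a linear gap penalty of 1 (not the affine scheme used later), and hypothetical function names. It runs sequentially; each inner anti-diagonal loop is what a device could execute in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

MATCH, MISMATCH, GAP = 2.0, -1.0, 1.0  # hypothetical scoring, linear gap penalty


def subst(x: str, y: str) -> float:
    return MATCH if x == y else MISMATCH


def cell(H: list[list[float]], a: str, b: str, i: int, j: int) -> float:
    """Smith-Waterman recurrence for one cell; needs (i-1, j-1), (i-1, j) and (i, j-1)."""
    return max(0.0, H[i - 1][j - 1] + subst(a[i - 1], b[j - 1]),
               H[i - 1][j] - GAP, H[i][j - 1] - GAP)


def sw_row_major(a: str, b: str) -> float:
    """Reference order: fill the matrix row by row."""
    H = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            H[i][j] = cell(H, a, b, i, j)
    return max(map(max, H))


def sw_wavefront(a: str, b: str) -> float:
    """Intra-task order: fill one anti-diagonal (i + j == d) at a time.

    Every cell on anti-diagonal d depends only on diagonals d-1 and d-2, so the
    iterations of the inner loop are independent and could run in parallel.
    """
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            H[i][d - i] = cell(H, a, b, i, d - i)
    return max(map(max, H))


def all_against_all_inter_task(seqs: list[str]) -> dict[tuple[int, int], float]:
    """Inter-task order: each (query, target) pair is an independent task."""
    pairs = [(i, j) for i, j in product(range(len(seqs)), repeat=2) if i != j]
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda p: sw_row_major(seqs[p[0]], seqs[p[1]]), pairs))
    return dict(zip(pairs, scores))
```

Both fill orders produce the same score. The thread pool only shows the task decomposition; for this CPU-bound Python loop it will not give a real speedup.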

SLIDE 8

Parallelization and OpenCL: Padding & sizeclasses

Sequence length optimization

Maximal efficiency of a Smith-Waterman implementation:
- Many optimizations require sequences of equal length
- Equal length can boost performance by several orders of magnitude
- Pad sequences with ε
- Alignment score must not change → substitution score for ε: −∞
- Problem: padding increases the matrix size → large overhead
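A minimal Python sketch of why padding with ε and a −∞ substitution score leaves the local alignment score unchanged. Assumptions: padding is appended at the end of each sequence, and scoring is a hypothetical +2/−1 with a linear gap penalty of 1. A path can enter an ε row or column only through a gap, which lowers the score, and no real residue follows the padding, so the maximum cannot grow.

```python
import math

PAD = "ε"                              # padding symbol from the slide
MATCH, MISMATCH, GAP = 2.0, -1.0, 1.0  # hypothetical scoring, linear gap penalty


def subst(x: str, y: str) -> float:
    if x == PAD or y == PAD:
        return -math.inf               # padding must never be aligned
    return MATCH if x == y else MISMATCH


def sw_score(a: str, b: str) -> float:
    """Score-only Smith-Waterman with two rolling rows (linear memory)."""
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(0.0, prev[j - 1] + subst(a[i - 1], b[j - 1]),
                         prev[j] - GAP, cur[j - 1] - GAP)
            best = max(best, cur[j])
        prev = cur
    return best


def pad_to(seq: str, length: int) -> str:
    """Append padding symbols until seq has the given length."""
    if len(seq) > length:
        raise ValueError("sequence longer than target length")
    return seq + PAD * (length - len(seq))


seqs = ["AKL", "ACMML", "KLMA"]
width = max(map(len, seqs))
for a in seqs:
    for b in seqs:
        assert sw_score(a, b) == sw_score(pad_to(a, width), pad_to(b, width))
```

The padded alignment fills width × width cells instead of len(a) × len(b), which is the overhead named in the last bullet.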

SLIDE 9

Parallelization and OpenCL: Padding & sizeclasses

Sizeclasses

Solution: extension with sizeclasses / adaptive binning
- Divide the range of sequence lengths into different classes
- Pad only within one sizeclass
- Multiple sizeclasses reduce overall padding

      A     K     L     ε     ε
A     ...   ...   ...
C     ...   ...   ...
M     ...   ...   ...
M     ...   ...   ...
L     ...   ...   ...
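A minimal Python sketch of how sizeclasses reduce padding. Assumptions: each sequence is padded to the upper bound of its sizeclass, overhead is measured as padding residues per real residue, and both the lengths and the sizeclass bounds are made-up example values, not SIMAP data.

```python
from bisect import bisect_left


def assign_sizeclass(length: int, upper_bounds: list[int]) -> int:
    """Index of the first sizeclass whose upper bound is >= length (bounds sorted ascending)."""
    k = bisect_left(upper_bounds, length)
    if k == len(upper_bounds):
        raise ValueError(f"length {length} exceeds the largest sizeclass")
    return k


def padding_overhead(lengths: list[int], upper_bounds: list[int]) -> float:
    """Padding residues (ε) per real residue when each sequence is padded
    to the upper bound of its sizeclass."""
    padding = sum(upper_bounds[assign_sizeclass(n, upper_bounds)] - n for n in lengths)
    return padding / sum(lengths)


lengths = [120, 150, 180, 300, 350, 900, 1500]  # made-up example lengths [AA]
print(padding_overhead(lengths, [1500]))            # one sizeclass: 2.0
print(padding_overhead(lengths, [200, 400, 1500]))  # three sizeclasses: ~0.257
```

Counting residues understates the effect: the number of matrix cells per alignment grows with the product of the two padded lengths.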

SLIDE 10

Parallelization and OpenCL: Padding & sizeclasses

[Figure: SIMAP sequence length distribution. x-axis: sequence length [AA]; y-axis: absolute frequency]

SLIDE 11

Results and benchmarks: CLSW implementation details

CLSW: OpenCL Smith-Waterman

Objective: develop a proof-of-concept, score-only OpenCL Smith-Waterman
- Uses inter-task parallelization
- All-against-all with affine gap costs
- Can serve as the basis for a vendor-independent, fast Smith-Waterman implementation
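A minimal Python sketch of a score-only Smith-Waterman with affine gap costs (Gotoh's recurrences) in linear memory, plus an all-against-all loop. This is not the CLSW kernel. Assumptions: a gap of length k costs gap_open + (k − 1) · gap_extend (tools differ in this convention), scoring is a hypothetical +2/−1 instead of a protein substitution matrix such as BLOSUM62, and scores are floats. In an inter-task OpenCL design, each (query, target) pair would presumably map to its own work-item rather than to a loop iteration.

```python
import math
from typing import Callable, Optional

Subst = Callable[[str, str], float]


def simple_subst(x: str, y: str) -> float:
    """Hypothetical scoring; a real protein search would use e.g. BLOSUM62."""
    return 2.0 if x == y else -1.0


def sw_affine(a: str, b: str, subst: Subst = simple_subst,
              gap_open: float = 3.0, gap_extend: float = 1.0) -> float:
    """Score-only Smith-Waterman with affine gaps (Gotoh), O(len(b)) memory.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    m = len(b)
    h_prev = [0.0] * (m + 1)        # H: best local score ending at (i-1, j)
    f_prev = [-math.inf] * (m + 1)  # F: best score ending in a vertical gap, previous row
    best = 0.0
    for i in range(1, len(a) + 1):
        h_cur = [0.0] * (m + 1)
        f_cur = [-math.inf] * (m + 1)
        e = -math.inf               # E: best score ending in a horizontal gap, current row
        for j in range(1, m + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            h = max(0.0, h_prev[j - 1] + subst(a[i - 1], b[j - 1]), e, f_cur[j])
            h_cur[j] = h
            best = max(best, h)
        h_prev, f_prev = h_cur, f_cur
    return best


def all_against_all(seqs: list[str]) -> list[list[Optional[float]]]:
    """Score matrix S[i][j] for every ordered pair i != j; the diagonal stays empty (None)."""
    n = len(seqs)
    return [[None if i == j else sw_affine(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]
```

For example, aligning "AAAACCCC" against "AAAAGGCCCC" scores 8 matches · 2 − (3 + 1) = 12.0, using a single gap of length 2.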

SLIDE 12

Results and benchmarks: Implementation

Implementation aspects

- Written in pure C++11 & OpenCL 1.1
- No external dependencies, compact binary
- Tested with a SIMAP subset
- Verified using the SeqAn library
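SeqAn is a C++ library, so the following Python sketch only illustrates the general idea behind this kind of verification: compare an optimized scorer against an independent, straightforward reference on many random sequence pairs. Both functions are hypothetical stand-ins (a single-row implementation under test and a full-matrix reference), sharing a hypothetical +2/−1 scoring with a linear gap penalty of 1.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard amino acids
MATCH, MISMATCH, GAP = 2.0, -1.0, 1.0  # hypothetical scoring, linear gap penalty


def subst(x: str, y: str) -> float:
    return MATCH if x == y else MISMATCH


def reference_score(a: str, b: str) -> float:
    """Straightforward Smith-Waterman: full (len(a)+1) x (len(b)+1) matrix."""
    H = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            H[i][j] = max(0.0, H[i - 1][j - 1] + subst(a[i - 1], b[j - 1]),
                          H[i - 1][j] - GAP, H[i][j - 1] - GAP)
    return max(max(row) for row in H)


def fast_score(a: str, b: str) -> float:
    """Implementation under test: same recurrence, only one row kept in memory."""
    row = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        diag = 0.0                     # H[i-1][j-1]
        for j in range(1, len(b) + 1):
            up = row[j]                # H[i-1][j], about to be overwritten
            row[j] = max(0.0, diag + subst(a[i - 1], b[j - 1]), up - GAP, row[j - 1] - GAP)
            diag = up
            best = max(best, row[j])
    return best


def cross_check(trials: int = 200, max_len: int = 30, seed: int = 0) -> int:
    """Return the number of random pairs on which the two scorers disagree."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        a = "".join(rng.choices(ALPHABET, k=rng.randint(0, max_len)))
        b = "".join(rng.choices(ALPHABET, k=rng.randint(0, max_len)))
        if fast_score(a, b) != reference_score(a, b):
            failures += 1
    return failures
```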

SLIDE 13

Results and benchmarks: Advantages

Core advantages

- SWIPE: integer scores ↔ CLSW: floating-point scores
  → Enables composition-based score adjustment
  → Higher accuracy
- Concise codebase: < 1,000 lines of C++ code
- OpenCL Smith-Waterman: < 50 lines of code (SWIPE: 10,000 lines of code)
- Existing GPU implementations are based on CUDA → they only run on NVIDIA GPUs
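Composition-based score adjustment modifies substitution scores depending on the amino acid composition of the compared sequences, so the adjusted scores are generally non-integer. A minimal Python sketch of why that favours floating-point arithmetic, assuming a hypothetical uniform scaling factor of 1.3 on per-column scores (this is not the actual adjustment procedure): rounding every cell to an integer, as an integer-only implementation must, can change which of two alignments scores higher.

```python
SCALE = 1.3  # hypothetical per-pair adjustment factor


def float_score(columns: list[int]) -> float:
    """Sum of scaled column scores, kept in floating point."""
    return sum(SCALE * c for c in columns)


def int_score(columns: list[int]) -> int:
    """Sum of scaled column scores, each rounded to the nearest integer."""
    return sum(round(SCALE * c) for c in columns)


x = [1, 1]  # two weakly positive columns
y = [2]     # one stronger column

print(float_score(x), float_score(y))  # 2.6 2.6 (tie)
print(int_score(x), int_score(y))      # 2 3 (y now ranks higher)
```

Integer implementations can shrink this effect by scaling scores up before rounding, but they cannot remove it entirely.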

SLIDE 14

Results and benchmarks: Outlook

[Figure: program runtime [s] of ssearch36, swipe, swipe-MT and CLSW]

1,000 × 1,000 sequences benchmark; 1,000 AA (query); 1,000 AA (target)

SLIDE 15

Results and benchmarks: Outlook

[Figure: program runtime [s] of ssearch36, swipe, swipe-MT and CLSW]

4,000 × 1,000 sequences benchmark; 20 AA (query); 1,000 AA (target)

SLIDE 16

Results and benchmarks: Outlook

Integration into SIMAP

- Since 2005: CPU clients only
- Since 2014: also an ARM client for Android
- Users have regularly asked for GPU clients since 2005
- CLSW was built to be integrable into BOINC → leverage a huge amount of computing power
- Still, a lot of work remains to be done

SLIDE 17

Results and benchmarks: Outlook

Other uses

- 3-4 times faster than SWIPE for short query sequences → shotgun proteomics, NGS?
- Huge optimization potential → reduce overhead, 5-10x speedup
- Platforms unsupported by SWIPE (e.g. 32-bit platforms)

SLIDE 18

Conclusion

Conclusion

- CLSW: portable, GPU-based Smith-Waterman
- Fast for short queries; can be optimized for long queries
- Floating-point score calculation → composition-based score adjustment
- GPU computing is underestimated in computational biology

SLIDE 19

Conclusion: Acknowledgements

Thank you for your attention!

Special thanks to Mathias Walter & Thomas Rattei who made this project possible!

Questions?

SLIDE 20

Advanced Topics: Kernel sizeclasses

[Figure: runtime [s] versus row buffer size]

20 AA × 20 AA; 4,000 × 4,000 alignments with variable row buffer

SLIDE 21

Advanced Topics: Sizeclass mathematical background

Sizeclass: $(\alpha \cdot \text{sizeclass penalty}) + (\beta \cdot |\text{sizeclass}|)$

- Difficult to determine optimal values for $\alpha$ and $\beta$
- Idea: use population quantiles (e.g. $q_{0.01\%}$ to $q_{100\%}$) as sizeclass boundaries
- Postprocessing: divide sizeclasses with penalty > threshold
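A minimal Python sketch of the quantile idea with the postprocessing step, which sidesteps choosing $\alpha$ and $\beta$ directly. Assumptions (all hypothetical): nearest-rank quantiles of the length distribution serve as sizeclass upper bounds, with $q_{100\%}$ always included; the penalty of a sizeclass is padding residues per real residue when padding to its longest sequence; an over-threshold sizeclass is split at the largest gap between consecutive lengths; and the example lengths, quantile levels and threshold are made up.

```python
import math


def quantile_bounds(lengths: list[int], levels: list[float]) -> list[int]:
    """Nearest-rank quantiles (levels in (0, 1]) used as sizeclass upper bounds."""
    data = sorted(lengths)
    bounds = {data[max(0, math.ceil(q * len(data)) - 1)] for q in levels}
    bounds.add(data[-1])  # q100%: every sequence fits into some sizeclass
    return sorted(bounds)


def sizeclasses(lengths: list[int], bounds: list[int]) -> list[list[int]]:
    """Group lengths by the first upper bound >= length; drop empty classes."""
    classes: list[list[int]] = [[] for _ in bounds]
    for n in lengths:
        classes[next(k for k, b in enumerate(bounds) if n <= b)].append(n)
    return [c for c in classes if c]


def penalty(cls: list[int]) -> float:
    """Hypothetical penalty: padding residues per real residue when padding to max(cls)."""
    return (len(cls) * max(cls) - sum(cls)) / sum(cls)


def split_until_below(cls: list[int], threshold: float) -> list[list[int]]:
    """Postprocessing: split a sizeclass at its largest length gap while penalty > threshold."""
    cls = sorted(cls)
    if len(cls) < 2 or penalty(cls) <= threshold:
        return [cls]
    gaps = [cls[k + 1] - cls[k] for k in range(len(cls) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return split_until_below(cls[:cut], threshold) + split_until_below(cls[cut:], threshold)


def build_sizeclasses(lengths: list[int], levels: list[float], threshold: float) -> list[list[int]]:
    result: list[list[int]] = []
    for cls in sizeclasses(lengths, quantile_bounds(lengths, levels)):
        result.extend(split_until_below(cls, threshold))
    return result


print(build_sizeclasses([100, 110, 120, 400, 420, 2000], [0.5, 1.0], 0.5))
# [[100, 110, 120], [400, 420], [2000]]
```

Here the median bound leaves [400, 420, 2000] with a penalty of about 1.13, so postprocessing splits off 2000.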
