BioBits
Efficient streaming applications on multi-core with FastFlow: the biosequence alignment test-bed
Marco Aldinucci, Computer Science Dept. - University of Torino - Italy
Marco Danelutto, Massimiliano Meneghin, Massimo Torquati, Computer Science Dept. - University of Pisa - Italy
Peter Kilpatrick, Computer Science Dept. - Queen’s University Belfast - U.K.
ParCo 2009 - Sep. 1st - Lyon - France

Outline
• Motivation
  • Commodity architecture evolution
  • Efficiency for fine-grained computation
  • POSIX thread evaluation
• FastFlow
  • Architecture
  • Implementation
• Experimental results
  • Micro-benchmarks
  • Real-world App: the Smith-Waterman sequence alignment application
• Conclusion, future works, and surprise dessert (before lunch)

[< 2004] Shared Front-Side Bus (Centralized Snooping)

[2005] Dual Independent Buses (Centralized Snooping)

[2007] Dedicated High-Speed Interconnects (Centralized Snooping)

[2009] QuickPath (MESI-F Directory Coherence)

This and next generation SCM
• Exploit cache coherence, and this is likely to remain the case in the near future
• Memory fences are expensive
  • Increasing core count will make it worse
  • Atomic operations do not solve the problem (still fences)
• Fine-grained parallelism is off-limits
  • I/O bound problems, high-throughput, streaming, irregular DP problems
  • Automatic and assisted parallelization

Micro-benchmarks: farm of tasks
Used to implement: parameter sweeping, master-worker, etc.

    int main () {
      spawn_thread(Emitter);
      for (i = 0; i < nworkers; ++i) {
        spawn_thread(Worker);
      }
      wait_end();
    }

    void Emitter () {
      for (i = 0; i < streamLen; ++i) {
        task = create_task();
        queue = SELECT_WORKER_QUEUE();
        queue->PUSH(task);
      }
    }

    void Worker() {
      while (!end_of_stream) {
        myqueue->POP(&task);
        do_work(task);
      }
    }

[Diagram: farm, E → W1, W2, ..., Wn → C]
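The farm pseudocode can be made concrete with standard C++ threads. The sketch below is only an illustration of the pattern, not the benchmark code behind the slides: `LockedQueue`, `run_farm`, the `kEndOfStream` sentinel, the round-robin worker selection and the "multiply by two" stand-in for `do_work` are all assumptions. It uses a mutex-protected queue per worker, i.e. the lock/unlock baseline.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical lock-based queue standing in for the slide's PUSH/POP.
class LockedQueue {
public:
    void push(int task) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            items_.push(task);
        }
        ready_.notify_one();
    }
    int pop() {  // blocks until an item is available
        std::unique_lock<std::mutex> guard(mutex_);
        ready_.wait(guard, [this] { return !items_.empty(); });
        int task = items_.front();
        items_.pop();
        return task;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> items_;
};

constexpr int kEndOfStream = -1;  // assumed sentinel, not from the slides

// Runs one emitter and nworkers workers; returns the sum of all work results.
long run_farm(int stream_len, int nworkers) {
    assert(stream_len >= 0 && nworkers > 0);
    std::vector<LockedQueue> queues(nworkers);
    std::vector<long> partial(nworkers, 0L);
    std::vector<std::thread> workers;
    for (int w = 0; w < nworkers; ++w) {
        workers.emplace_back([&queues, &partial, w] {
            for (;;) {
                int task = queues[w].pop();
                if (task == kEndOfStream) break;
                partial[w] += static_cast<long>(task) * 2;  // do_work stand-in
            }
        });
    }
    std::thread emitter([&queues, stream_len, nworkers] {
        for (int i = 0; i < stream_len; ++i)
            queues[i % nworkers].push(i);  // round-robin worker selection
        for (int w = 0; w < nworkers; ++w)
            queues[w].push(kEndOfStream);  // tell every worker to stop
    });
    emitter.join();
    for (auto& t : workers) t.join();
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

Calling `run_farm(1000, 4)` streams the tasks 0..999 to four workers. Every push and pop here takes a lock, which is the per-task overhead that dominates when the work per task is very small.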

Using POSIX lock/unlock queues
[Figure: farm (E → W1 ... Wn → C) speedup vs. number of cores (2-8); series: Ideal, and task grains of 50 μs, 5 μs, 0.5 μs]

Using CompareAndSwap queues
[Figure: farm (E → W1 ... Wn → C) speedup vs. number of cores (2-8); series: Ideal, and task grains of 50 μs, 5 μs, 0.5 μs]

Evaluation
• Poor performance for fine-grained computations
• Memory fences seriously affect the performance

What about avoiding fences in SCM?
• High-level semantics matters!
• DP paradigms entail bidirectional data exchange among cores
  • Cache reconciliation can be made faster but not avoided
• Task parallel, streaming, and systolic paradigms usually result in a one-way data flow
  • Is cache coherency really strictly needed?
  • Well described by data-flow graphs (streaming networks)

Streaming Networks
• A streaming network can be easily built from
  • POSIX (or other) threads
  • Asynchronous channels
• But exploiting a global address space
  • Threads can still share the memory using locks
• Asynchronous channels
  • Thread lifecycle control + FIFO queue
  • Queue: Single Producer Single Consumer (SPSC), Single Producer Multiple Consumer (SPMC), Multiple Producer Single Consumer (MPSC), Multiple Producer Multiple Consumer (MPMC)
  • Lifecycle: ready - active waiting (yield + over-provisioning)
[Diagram: SPSC, SPMC, MPSC and MPMC channel shapes]

Queues: state of the art
• MPMC
  • Dozens of “lock-free” (and wait-free) proposals
  • The quality is usually measured by the number of atomic operations (CAS): CAS ≥ 1
• SPSC
  • Lock-free, fence-free: J. Giacomoni, T. Moseley, and M. Vachharajani. FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. PPoPP 2008. ACM.
  • Supports Total Store Order OOO architectures (e.g. Intel Core)
  • Active waiting. Use the OS as little as possible.
• Native SPMC and MPSC
  • See MPMC
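To make the SPSC idea concrete, here is a minimal sketch in the spirit of the FastForward design cited above: a ring of pointers in which a null slot means "empty", so the producer and the consumer each keep a private index and never read the other's. It is not the paper's code nor FastFlow's; `SpscQueue` and `stream_sum` are hypothetical names. Portable C++ needs acquire/release atomics for correctness; on x86 (Total Store Order) these compile to plain loads and stores, with no fence instruction and no CAS.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Single-producer single-consumer ring of pointers. A null slot means "empty",
// so the two threads never share an index variable.
template <typename T>
class SpscQueue {
public:
    explicit SpscQueue(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
        for (auto& s : slots_) s.store(nullptr, std::memory_order_relaxed);
    }
    // Producer side only. Returns false when the next slot is still occupied.
    bool push(T* item) {
        assert(item != nullptr);
        std::atomic<T*>& slot = slots_[head_];
        if (slot.load(std::memory_order_acquire) != nullptr) return false;
        slot.store(item, std::memory_order_release);
        head_ = (head_ + 1) % slots_.size();
        return true;
    }
    // Consumer side only. Returns false when the queue is empty.
    bool pop(T** item) {
        std::atomic<T*>& slot = slots_[tail_];
        T* value = slot.load(std::memory_order_acquire);
        if (value == nullptr) return false;
        *item = value;
        slot.store(nullptr, std::memory_order_release);
        tail_ = (tail_ + 1) % slots_.size();
        return true;
    }
private:
    std::vector<std::atomic<T*>> slots_;
    alignas(64) std::size_t head_ = 0;  // touched by the producer only
    alignas(64) std::size_t tail_ = 0;  // touched by the consumer only
};

// Streams the integers 1..count between two threads (active waiting on both
// sides) and returns the sum observed by the consumer.
long stream_sum(int count, std::size_t capacity) {
    assert(count >= 0);
    std::vector<int> values(count);
    for (int i = 0; i < count; ++i) values[i] = i + 1;
    SpscQueue<int> queue(capacity);
    long sum = 0;
    std::thread consumer([&] {
        for (int received = 0; received < count;) {
            int* item = nullptr;
            if (queue.pop(&item)) { sum += *item; ++received; }
            else std::this_thread::yield();
        }
    });
    for (int i = 0; i < count; ++i)
        while (!queue.push(&values[i])) std::this_thread::yield();
    consumer.join();
    return sum;
}
```

For example, `stream_sum(10000, 64)` pushes ten thousand items through a 64-slot ring. Both sides spin with `yield` rather than blocking in the OS, which matches the "active waiting" point above.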

SPMC and MPSC via SPSC + control
• SPMC(x): fence-free queue with x consumers
  • One SPSC “input” queue and x SPSC “output” queues
  • One flow of control (thread) dispatches items from input to outputs
• MPSC(y): fence-free queue with y producers
  • One SPSC “output” queue and y SPSC “input” queues
  • One flow of control (thread) gathers items from inputs to output
• x and y can be dynamically changed
• MPMC = MPSC + SPMC
  • Just juxtapose the two parametric networks
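The SPMC(x) construction can be sketched as one input SPSC queue, x output SPSC queues, and a dispatcher thread between them. The code below is an illustrative assumption, not FastFlow's implementation: `Spsc`, `spmc_demo`, the `end_marker` sentinel, the fixed capacity of 64 and the round-robin dispatch policy are all hypothetical choices. No queue is ever touched by more than one producer and one consumer, so no CAS is needed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Compact SPSC ring of pointers (null slot = empty).
class Spsc {
public:
    explicit Spsc(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
        for (auto& s : slots_) s.store(nullptr, std::memory_order_relaxed);
    }
    void push(int* item) {  // producer only; active waiting while full
        while (slots_[head_].load(std::memory_order_acquire) != nullptr)
            std::this_thread::yield();
        slots_[head_].store(item, std::memory_order_release);
        head_ = (head_ + 1) % slots_.size();
    }
    int* pop() {  // consumer only; active waiting while empty
        int* item;
        while ((item = slots_[tail_].load(std::memory_order_acquire)) == nullptr)
            std::this_thread::yield();
        slots_[tail_].store(nullptr, std::memory_order_release);
        tail_ = (tail_ + 1) % slots_.size();
        return item;
    }
private:
    std::vector<std::atomic<int*>> slots_;
    std::size_t head_ = 0, tail_ = 0;
};

// SPMC(x): one input queue, x output queues, one dispatcher thread.
// Streams 1..count and returns the sum of the items each consumer received.
std::vector<long> spmc_demo(int count, int x) {
    assert(count >= 0 && x > 0);
    static int end_marker = 0;  // hypothetical end-of-stream sentinel
    std::vector<int> values(count);
    for (int i = 0; i < count; ++i) values[i] = i + 1;
    Spsc input(64);
    std::vector<std::unique_ptr<Spsc>> outputs;
    for (int c = 0; c < x; ++c) outputs.push_back(std::make_unique<Spsc>(64));
    std::vector<long> sums(x, 0L);

    std::thread dispatcher([&] {
        for (int next = 0;; next = (next + 1) % x) {
            int* item = input.pop();
            if (item == &end_marker) break;
            outputs[next]->push(item);  // round-robin policy (an assumption)
        }
        for (auto& out : outputs) out->push(&end_marker);
    });
    std::vector<std::thread> consumers;
    for (int c = 0; c < x; ++c)
        consumers.emplace_back([&, c] {
            for (int* item; (item = outputs[c]->pop()) != &end_marker;)
                sums[c] += *item;
        });
    for (int i = 0; i < count; ++i) input.push(&values[i]);
    input.push(&end_marker);
    dispatcher.join();
    for (auto& t : consumers) t.join();
    return sums;
}
```

MPSC(y) is the mirror image: a gatherer thread polls y input queues and forwards to a single output queue. Placing the two back to back gives the MPMC network.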

FastFlow: A step forward
• Implements lock-free SPSC, SPMC, MPSC, MPMC queues
  • Exploiting streaming networks
• Features can be composed as parametric streaming networks (graphs)
  • E.g. an optimized memory allocator can be added by fusing the allocator graphs with the application graphs
  • Not described here
• Features are represented as skeletons, whose compilation targets are streaming networks
• C++ STL-like implementation
  • Can be used as a low-level library
  • Can be used to generatively compile skeletons into streaming networks
• Blazing fast on fine-grained computations

Very fine grain (0.5 μs)
[Figure: farm (E → W1 ... Wn → C) speedup vs. number of cores (2-8); series: Ideal, POSIX lock, CAS, FastFlow]

Fine grain (5 μs)
[Figure: farm (E → W1 ... Wn → C) speedup vs. number of cores (2-8); series: Ideal, POSIX lock, CAS, FastFlow]

Medium grain (50 μs)
[Figure: farm (E → W1 ... Wn → C) speedup vs. number of cores (2-8); series: Ideal, POSIX lock, CAS, FastFlow]

Biosequence alignment
• Smith-Waterman algorithm
  • Local alignment
  • Time and space demanding, O(mn); often replaced by the approximate BLAST
  • Dynamic programming
• Real-world application
  • It has been accelerated using FPGA, GPU (CUDA), SSE2/x86, IBM Cell
• Best software implementation
  • SWPS3: evolution of Farrar’s implementation
  • SSE2 + POSIX IPC

Smith-Waterman algorithm Local alignment - dynamic programming - O(nm)

• Substitution Matrix: describes the rate at which one character in a sequence changes to other character states over time
• Gap Penalty: describes the costs of gaps, possibly as a function of gap length
Experiment parameters
• Affine Gap Penalty: 10-2k, 5-2k, ...
• Substitution Matrix: BLOSUM50
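As an illustration of the dynamic-programming recurrence with an affine gap penalty, the sketch below computes the best local alignment score in O(mn) time using the standard three-state formulation (match state H, horizontal gap E, vertical gap F). Two things are assumptions and not taken from the slides: the function name `smith_waterman`, and the reading of "10-2k" as a gap of length k costing 10 + 2k. A toy match/mismatch score replaces BLOSUM50 to keep the example short. This is plain scalar code, not the SSE2-vectorized SWPS3 kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Best local alignment score of a and b. A gap of length k costs
// gap_open + k * gap_extend (one reading of the "10-2k" notation).
// match/mismatch is a toy substitution function, NOT BLOSUM50.
int smith_waterman(const std::string& a, const std::string& b,
                   int match, int mismatch, int gap_open, int gap_extend) {
    assert(gap_open >= 0 && gap_extend >= 0);
    const int kNegInf = -1000000000;
    // Two rolling rows of H, plus one row of F (gap running down a column).
    std::vector<int> h_prev(b.size() + 1, 0), h_cur(b.size() + 1, 0);
    std::vector<int> f(b.size() + 1, kNegInf);
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        int e = kNegInf;  // gap running along the current row
        h_cur[0] = 0;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            // Either open a new gap from H or extend the existing one.
            e = std::max(h_cur[j - 1] - gap_open - gap_extend, e - gap_extend);
            f[j] = std::max(h_prev[j] - gap_open - gap_extend, f[j] - gap_extend);
            int diag = h_prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            // The 0 lets a local alignment restart anywhere.
            h_cur[j] = std::max({0, diag, e, f[j]});
            best = std::max(best, h_cur[j]);
        }
        std::swap(h_prev, h_cur);
    }
    return best;
}
```

With match 5, mismatch -4, open 1 and extend 1, aligning "AAAATTTT" against "AAAACTTTT" scores 38: eight matches (40) minus one gap of length one (2), which beats the best ungapped alignment (31).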

Biosequence testbed
• Each query sequence (protein) is aligned against the whole protein DB
  • E.g. compare an unknown sequence against a DB of known sequences
• SWPS3 implementation exploits POSIX processes and pipes
  • Faster than POSIX threads + locks
• UniProtKB Swiss-Prot
  • 471472 sequences
  • 167326533 amino-acids
  • Shared memory (read-only)
[Diagram: query sequences → SW1, SW2, ..., SWn (threads or processes or ...) → results]

Smith-Waterman (10-2k gap penalty)
[Figure: GCUPS (the higher the better) vs. query sequence length (144, 189, 246, 464, 553, 1000, 2005, 3005, 4061, 22152); series: SWPS3, FastFlow]

Smith-Waterman (5-2k gap penalty)
[Figure: GCUPS (the higher the better) vs. query sequence length (144, 189, 246, 464, 553, 1000, 2005, 3005, 4061, 22152); series: SWPS3, FastFlow]
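The throughput metric on these plots, GCUPS (giga cell updates per second), is the number of dynamic-programming matrix cells computed per second, divided by 10^9. For one query aligned against a whole database, the cell count is the query length times the total number of database residues. The helper below is a small sketch of that arithmetic; the name `gcups` and the timing used in the example are hypothetical, not measurements from the talk.

```cpp
#include <cassert>
#include <cstdint>

// Giga cell updates per second for one query aligned against a database:
// the DP matrix has query_len * db_residues cells in total.
double gcups(std::uint64_t query_len, std::uint64_t db_residues, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(query_len) * static_cast<double>(db_residues)
           / seconds / 1e9;
}
```

For instance, a 1000-residue query against the 167326533 amino acids of the Swiss-Prot release used here, finishing in an assumed 10 seconds, would give `gcups(1000, 167326533, 10.0)`, about 16.7 GCUPS.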

Conclusions
• FastFlow efficiently supports streaming applications on commodity SCM (e.g. Intel Core architecture)
  • More efficiently than POSIX threads (standard or CAS lock)
• Smith-Waterman algorithm with FastFlow
  • Obtained from SWPS3 by syntactically substituting reads and writes on POSIX pipes with FastFlow push and pop
  • In turn, POSIX pipes are faster than POSIX threads + locks in this case
  • Scores twice the speed of the best known parallel implementation (SWPS3) on the same hardware (Intel 2 x quad-core 2.5 GHz)

Future Work
FastFlow
• Is open source (STL-like C++ library will be released soon) [✔]
  • Contact me if you are interested
• Include a specialized (very fast) parallel memory allocator [✔]
• Can be used to automatically parallelize a wide class of problems [ ]
  • Since it efficiently supports fine-grained computations
• Can be used as a compilation target for skeletons [ ]
  • Support parametric parallelism schemas and support compositionality (can be formalized as graph rewriting)
• Can be extended for CC-NUMA architectures [ ]
• Can be used to extend Intel TBB and OpenMP [✔]
  • Increasing the performance of those tools

5-2k gap penalty
[Figure: GCUPS per query sequence (protein), with query sequence length (protein length) from 144 to 22152; series: FastFlow, OpenMP, Cilk, TBB, SWPS3]
Query sequences (proteins): P02232, P01111, P05013, P14942, P00762, P07327, P01008, P10635, P25705, P03435, P27895, P07756, P04775, P19096, P28167, P0C6B8, P20930, Q9UKN1, Q8WXI7
FastFlow is also faster than OpenMP, Intel TBB and Cilk (at least for streaming on Intel 2 x quad-core)
