Online Tuning of Stream Programs
or How To Get The Most Out Of Your Multicore
Walter F. Tichy
Institute for Program Structures and Data Organization
KIT – University of Baden-Württemberg and National Research Center of the Helmholtz Association

Where is Karlsruhe?
University of Karlsruhe - KIT, Germany
Faculty of Computer Science
• One of the leading CS departments in Europe
• >40 faculty, >400 PhD students in CS

The changing parallel computing landscape
Cray vector computer, 1976
Multicore transformation

The first five-core mobile phone
HTC One X, Feb. 2012, powered by Nvidia Tegra 3

Nvidia Tegra 3

Nvidia Tegra 3 Schematic
• 1 core at 500 MHz (battery saver)
• 4 cores at 1.5 GHz
• 1 GPU

• AMD Opteron, 12 cores: ~1.8 billion transistors on 2 × 3.46 cm²
• Sun Niagara3, 16 cores: ~1 billion transistors on 3.7 cm²
• Intel SCC, 48 cores: ~1.3 billion transistors on 5.6 cm²
• Intel, 8 cores: ~2.3 billion transistors on 6.8 cm²
• Intel, 4 cores: ~582 million transistors on 2.86 cm²
• Intel Sandy Bridge, 4+6 cores: ~1 billion transistors on 2.2 cm²
• Intel, 2 cores: ~167 million transistors on 1.1 cm²

The 2011 Intel Sandy Bridge
• Currently: 4 CPUs, 6 graphics execution units
• Later: 8 CPUs, 12 graphics execution units

Fixing Parallel Performance Problems
• Parallelization is complex and error-prone
• Parallel programs contain a number of tuning parameters
• Manual optimization is difficult and time-consuming
• Each target platform may require re-tuning
• Auto-tuning: let the computer do the tuning!
Examples of tuning parameters:
• Number of pipeline stages
• Choice of best algorithm implementation
• Order of execution
• Size of data partitions
• Number of workers
• Type of core
• Load balancing strategy
[Figure: parameters a, b, c with example values (1, 2, 3) and (4, 5, 6) for two platforms, A and B]

Online Auto-Tuning
• Auto-tuning cycle:
  1. Optimize: calculate new parameter values, yielding a parameter configuration
  2. Apply the configuration to the parallel program with tuning parameters, yielding an executable program
  3. Execute and measure the program; the result of the measurement is a performance value that feeds back into step 1
• Example (pseudo code):
    TuningParameter numthreads(3, 64);               // tuning parameter
    TuningParameter blocksize(100, 900, 100);        // tuning parameter
    for (int i = 0; i < numfiles; ++i) {
        startMeasurement();                          // measurement section begins
        compress(files[i], blocksize, numthreads);
        stopMeasurement();                           // measurement section ends
    }
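The cycle above can be sketched in Python as follows. This is a minimal illustration, not the tuner from the talk: the slides do not say which search algorithm is used, so the sketch substitutes a plain exhaustive sweep. The names `TuningParameter`, `online_tune`, `process` and `clock` are assumptions made for the example, and reading the constructor arguments as (low, high, step) is only one possible interpretation of the pseudo code.

```python
import itertools
import time


class TuningParameter:
    """An integer tuning parameter with an inclusive range and a step.

    Assumption: (low, high, step) is one reading of the slide's
    TuningParameter(...) arguments, not a documented signature.
    """

    def __init__(self, name, low, high, step=1):
        self.name = name
        self.values = list(range(low, high + 1, step))


def online_tune(params, work_items, process, clock=time.perf_counter):
    """Process each work item once while tuning during the run.

    Every item is one measurement section. Each untried configuration is
    measured on one item (exhaustive sweep, a stand-in for a real search
    algorithm); once all were tried, the fastest one seen so far is reused.

    Returns (best_config, history); history holds (config, elapsed) pairs.
    """
    names = [p.name for p in params]
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*(p.values for p in params))]
    untried = iter(candidates)
    best_config, best_time = None, float("inf")
    history = []
    for item in work_items:
        # Optimize: pick the next configuration, or fall back to the best.
        config = next(untried, best_config)
        # Apply the configuration, then execute and measure.
        start = clock()
        process(item, **config)
        elapsed = clock() - start
        history.append((config, elapsed))
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, history
```

With the slide's example, `process` would wrap `compress(file, blocksize, numthreads)` and `work_items` would be the list of files. A sweep like this only pays off when there are many more work items than configurations, which is why a real online tuner would use a more economical search.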

Auto-Tuning: BZip2 example
Parallelized BZip2, compressing 50 files on a machine with 8 cores
• Initial tuning parameter values: 3 threads, block size 700 kB
• Runtime without tuning: 22.9 s
• Runtime with auto-tuner: 8 s
• Best possible time (start with best configuration): 6.5 s

Stream Programming Paradigm
• A stream of elements flows through a graph of processing modules called filters.
• Task parallelism: F1 → (F2 and F3 side by side) → F4
• Pipeline parallelism: F1 → F2 → F3 → F4 → F5
• Data parallelism (by filter replication): F1 → Split → (F2, F2, F2) → Join → F3
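Data parallelism by filter replication can be illustrated with a short Python sketch. It is an assumption-laden toy, not the stream language used in the talk: `run_pipeline` and its `replication` argument are hypothetical names, filters are modeled as plain functions, and the stages run one after another over a finite list, so the sketch shows only the split/join replication of a single filter and no pipeline parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(source, filters, replication=None):
    """Push every element of `source` through `filters` in order.

    `replication` maps a filter index to its number of parallel copies
    (the replication factor); unlisted filters run as a single copy.
    Element order is preserved at every join.
    """
    replication = replication or {}
    stream = list(source)
    for index, stage in enumerate(filters):
        copies = replication.get(index, 1)
        if copies > 1:
            # Split: hand elements to `copies` workers; join: collect the
            # results in the original order (Executor.map keeps order).
            with ThreadPoolExecutor(max_workers=copies) as pool:
                stream = list(pool.map(stage, stream))
        else:
            stream = [stage(x) for x in stream]
    return stream
```

For example, `run_pipeline(range(5), [f1, f2, f3], replication={1: 3})` runs three copies of the second filter, mirroring the F1 → Split → F2 × 3 → Join → F3 layout. The replication factor is exactly the kind of value an auto-tuner could adjust.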

(Some) Implicit Tuning Parameters
• Replication factor: F1 → S → (n copies of a filter) → J
• Cut-off depth
• Alternative algorithms/cores: AL1 ? AL2 ··· ALn

Measurement Sections in Stream Programs
• "Classic" fork/join pattern: each measurement section covers a sequential part followed by its parallel parts (parallel 1, parallel 2, parallel 3), and the sections repeat one after another
• Stream program: measurement section(s)?
  [Figure: after a sequential start, filters 1 to 4 execute interleaved on different stream elements, with no evident section boundaries]
• Solution:
  • Count "heartbeats" (events triggered by stream elements)
  • Use heartbeats to evaluate performance

Using Heartbeats for Online Tuning
• Heartbeats are emitted by sink filters
• The faster the heartbeat, the better the performance
• Heartbeats serve as an input signal for online auto-tuners
[Figure: illustrating example. Filters 1 to 4 run interleaved along a time axis; the auto-tuner applies a new parameter configuration three times, and the values 50, 70 and 80 are shown at the auto-tuner]
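A heartbeat signal of this kind can be sketched in Python as a sliding-window rate counter. The class name `HeartbeatMonitor`, the window length and the explicit `now` timestamps are assumptions made for the example; the slides do not describe how the rate is actually computed.

```python
from collections import deque


class HeartbeatMonitor:
    """Sliding-window heartbeat rate, fed by a sink filter."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self.beats = deque()  # timestamps of recent heartbeats, oldest first

    def beat(self, now):
        """Record one heartbeat at time `now` (in seconds)."""
        self.beats.append(now)

    def rate(self, now):
        """Return heartbeats per second over the last `window` seconds."""
        # Drop heartbeats that have fallen out of the window.
        while self.beats and self.beats[0] <= now - self.window:
            self.beats.popleft()
        return len(self.beats) / self.window
```

A sink filter would call `beat(time.monotonic())` for each element it emits. An online tuner could then read `rate(...)` some time after applying a new configuration and keep the configuration only if the rate went up.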

Benchmark 1: Video zoom
Filters: Read, Cut, split (S), Scale I, Scale II, join (J), Write
Legend: * replicable; ? first come/first serve
[Chart: execution time (axis 0 to 1200) for pre, tun and best on Quadcore, Dell and Niagara]
• pre: statically predicted
• tun: online auto-tuned
• best: started with best known configuration, without auto-tuning

Benchmark 2: Electric (Placement of circuits on a die)
• Part of VLSI design application
• 5 filters with feedback loop and teleports
• 4 tuning parameters
Filters: Producer, Calculate forces, Movement, Repair overlaps, Finish (three of them marked * replicable)

Electric: Results
[Chart: execution time (axis 0 to 1400) for pre, tun and best on Quadcore, Dell and Niagara]

Benchmarks on 4 cores
[Chart: fractions of best parallel performance (= 100%), axis 0% to 120%, for seq, pre and tun on the benchmarks DS, Electric, Series, Vscale and Vzoom]

Benchmarks on 64 cores (Niagara)
[Chart: axis 0% to 120%, for seq, pre and tun on the benchmarks DS, Electric, Series, Vscale and Vzoom]

Related Work (Selection)
• ATLAS/AEOS (Whaley et al., 2000)
  • Auto-tuning system for algebraic operations and algorithms
  • Domain-specific approach
  • No support for parallel programs
• Active Harmony (Tapus et al., 2002)
  • Search-based auto-tuning system for library optimization
  • Comprehensive analysis of search algorithms
  • Not applicable to parallel programs
• MATE (Morajko et al., 2007)
  • Model-based tuning system for distributed PVM programs
  • Provides good performance predictions
  • Limited to special program structures
• ATUNE (Schaefer, Tichy, 2010)
  • General-purpose auto-tuner
  • Offline tuner (trial runs)
  • Pattern language for expressing parallel patterns (TADL)

Benchmark 3: Desktop search
Filters: Read directory, Read file, Write to index (* replicable)
[Chart: execution time (axis 0 to 120) for pre, tun and best on Quadcore, Dell and Niagara]

Summary
• Computers are not the bottleneck. Programmers are!
• Stream programming simplifies parallel programming
  • Typical parallel patterns are easy to write
• Auto-tuning finds optimal operating conditions
  • Saves lots of tuning work
• Further research
  • Improved online search algorithms
  • Use a static model to predict good starting values
  • Use auto-tuning to distribute work over heterogeneous cores

THANK YOU! QUESTIONS?
With many thanks to Frank Otto, Thomas Karcher, Jonas Thedering, Victor Pankratius
For more information, see: http://www.ipd.kit.edu/Tichy/

BACKUP SLIDES

Benchmarks on 8 cores
[Chart: axis 0% to 120%, for seq, pre and tun on the benchmarks DS, Electric, Series, Vscale and Vzoom]
