KIT – University of Baden-Württemberg and National Research Center of the Helmholtz Association
Institute for Program Structures and Data Organization
Online Tuning of Stream Programs
or How To Get the Most Out of Your Multicore
Walter F. Tichy
Multicore-Transformation
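[Figure: multicore chips, labeled with transistor count and die area]
~1.8 billion transistors on 2 × 3.46 cm²
~1.3 billion transistors on 5.6 cm²
~167 million transistors on 1.1 cm²
~2.3 billion transistors on 6.8 cm²
~1 billion transistors on 3.7 cm²
~582 million transistors on 2.86 cm²
~1 billion transistors on 2.2 cm²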
Victor Pankratius
Parallelization is complex and error-prone
Parallel programs contain a number …
Manual optimization difficult and …
Each target platform may require …
Auto-Tuning: Let the computer do …
[Figure: candidate parameter configurations (a=1, b=2, c=3) and (a=4, b=5, c=6); which configuration (a=?, b=?, c=?) performs best?]
Auto-Tuning Cycle: Example (pseudo code)
Parallel program with tuning parameters
→ Apply configuration to program
→ Executable program
→ Execute and measure program
→ Result of measurement: performance value
→ Optimize (calculate new parameter values)
→ Parameter configuration
→ Apply configuration to program (the cycle repeats)
    TuningParameter numthreads(3, 64);           // tuning parameter
    TuningParameter blocksize(100, 900, 100);    // tuning parameter
    for (int i = 0; i < numfiles; ++i) {
        startMeasurement();                      // measurement section begins
        compress(files[i], blocksize, numthreads);
        stopMeasurement();                       // measurement section ends
    }
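The pseudo code above can be turned into a small self-contained sketch of one tuning run. Everything here is an assumption made for illustration: the reading of `TuningParameter` as (min, max, step), the `Configuration` struct, the exhaustive search (the slides do not name a search strategy), and the synthetic cost function, which is not measured data.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical stand-in for the slide's TuningParameter: an integer range
// with a step width. The (min, max, step) reading is an assumption.
struct TuningParameter {
    int lo, hi, step;
    TuningParameter(int lo_, int hi_, int step_ = 1)
        : lo(lo_), hi(hi_), step(step_) {
        assert(lo <= hi && step > 0);
    }
    // All values the parameter may take.
    std::vector<int> values() const {
        std::vector<int> out;
        for (int v = lo; v <= hi; v += step) out.push_back(v);
        return out;
    }
};

// One parameter configuration plus its measured performance value.
struct Configuration {
    int numthreads;
    int blocksize;
    double cost;  // lower is better (e.g., execution time)
};

// Apply each configuration, measure it, and keep the best one.
// `measure` plays the role of the measurement section and returns a cost.
Configuration tuneExhaustively(
    const TuningParameter& numthreads, const TuningParameter& blocksize,
    const std::function<double(int, int)>& measure) {
    Configuration best{numthreads.lo, blocksize.lo,
                       std::numeric_limits<double>::infinity()};
    for (int t : numthreads.values()) {
        for (int b : blocksize.values()) {
            double cost = measure(t, b);                // execute and measure
            if (cost < best.cost) best = {t, b, cost};  // keep the better one
        }
    }
    return best;
}

// Synthetic cost model for demonstration only: minimal at 8 threads and
// block size 500.
double syntheticCost(int threads, int blocksize) {
    double dt = threads - 8;
    double db = (blocksize - 500) / 100.0;
    return 1.0 + dt * dt + db * db;
}
```

In a real tuner, `measure` would run the program section between `startMeasurement()` and `stopMeasurement()`, and a smarter search would replace the two nested loops.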
A stream of elements flows through a graph of processing …
Task parallelism, pipeline parallelism, data parallelism
Split/Join
Replication factor: filter F is replicated as F1, F2, …, Fn between split (S) and join (J)
Cut-off depth
Alternative algorithms/cores: AL1, AL2, …, ALn
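A minimal sketch of what a replication factor could mean in code: one filter is copied n times between a split and a join. The function name `runReplicated`, the round-robin split, and the use of `std::thread` are assumptions for illustration, not the API of the system presented in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Applies `filter` to every stream element using `replicas` copies of it.
// Split: element i goes to replica i % replicas (round-robin).
// Join: each result is stored at position i, so stream order is preserved.
std::vector<int> runReplicated(const std::vector<int>& input,
                               const std::function<int(int)>& filter,
                               std::size_t replicas) {
    assert(replicas > 0);
    std::vector<int> output(input.size());
    std::vector<std::thread> workers;
    for (std::size_t r = 0; r < replicas; ++r) {
        workers.emplace_back([&, r] {
            // Each replica writes to its own disjoint slots: no locking needed.
            for (std::size_t i = r; i < input.size(); i += replicas)
                output[i] = filter(input[i]);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all replicas
    return output;
}
```

The result is the same for every replication factor; only the run time changes, which is what makes the factor a tuning parameter.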
"Classic" fork/join pattern: sequential and parallel sections alternate; each parallel section is a measurement section
Stream program: all filters (Filter 1 to Filter 4) run concurrently; measurement section(s)?
Solution:
Count "heart beats" (events triggered by stream elements)
Use heart beats to evaluate performance
Heartbeats are emitted by sink filters
The faster the heartbeat, the better the performance
Heartbeats serve as an input signal for online auto-tuners
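A heartbeat counter could look roughly like the sketch below. The class name `HeartbeatMonitor` and its methods are hypothetical. Time is passed in explicitly to keep the example deterministic; a real monitor would read a clock such as `std::chrono::steady_clock`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical heartbeat monitor: a sink filter calls beat() once per
// stream element it consumes; a tuner periodically calls sampleRate().
class HeartbeatMonitor {
public:
    // Called by sink filters, possibly from several threads.
    void beat() { count_.fetch_add(1, std::memory_order_relaxed); }

    // Total number of heartbeats so far.
    std::uint64_t count() const {
        return count_.load(std::memory_order_relaxed);
    }

    // Heartbeats per second since the previous sample; starts a new window.
    // Meant to be called from a single tuner thread.
    double sampleRate(double nowSeconds) {
        assert(nowSeconds > windowStart_);
        std::uint64_t current = count();
        double rate = static_cast<double>(current - windowCount_) /
                      (nowSeconds - windowStart_);
        windowCount_ = current;
        windowStart_ = nowSeconds;
        return rate;
    }

private:
    std::atomic<std::uint64_t> count_{0};
    std::uint64_t windowCount_ = 0;  // heartbeat count at window start
    double windowStart_ = 0.0;       // window start time in seconds
};
```

A higher sampled rate means more stream elements reached the sink per second, so the tuner can compare configurations without fixed measurement sections.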
[Figure: timeline of a running stream program (Filters 1 to 4); a new parameter configuration is applied at several points in time]
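One way to picture online tuning is a simple hill climber that receives the heartbeat rate measured under the current configuration and answers with the next configuration to apply. This is a sketch under stated assumptions: the slides do not specify the search algorithm, the class `OnlineHillClimber` is hypothetical, and the rate function is synthetic.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical online tuner for one integer parameter, for example a
// replication factor. Hill climbing is an assumed search strategy.
class OnlineHillClimber {
public:
    OnlineHillClimber(int start, int min, int max)
        : value_(start), min_(min), max_(max), bestValue_(start) {
        assert(min <= start && start <= max);
    }

    int current() const { return value_; }
    int bestValue() const { return bestValue_; }

    // Report the heartbeat rate measured under the current value;
    // returns the value to apply next.
    int report(double rate) {
        if (rate > bestRate_) { bestRate_ = rate; bestValue_ = value_; }
        if (hasLast_ && rate < lastRate_) direction_ = -direction_;  // slower: turn around
        lastRate_ = rate;
        hasLast_ = true;
        int candidate = value_ + direction_;
        if (candidate < min_ || candidate > max_) {
            direction_ = -direction_;  // bounce off the range limit
            candidate = value_ + direction_;
            if (candidate < min_ || candidate > max_) candidate = value_;
        }
        value_ = candidate;
        return value_;
    }

private:
    int value_, min_, max_;
    int direction_ = 1;
    bool hasLast_ = false;
    double lastRate_ = 0.0;
    double bestRate_ = -1.0;
    int bestValue_;
};

// Synthetic heartbeat rate for demonstration only: peaks at 4 replicas.
double syntheticHeartbeatRate(int replicas) {
    return 100.0 - 10.0 * std::abs(replicas - 4);
}
```

The climber keeps probing the neighbors of the best value, so it can follow a shifting optimum while the program runs, at the cost of occasionally trying a slightly worse configuration.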
[Figure: stream graph with filters Read, Cut, Scale I, Scale II *, and Write, with split (S) and join (J) nodes; label: First Come/First Serve]
[Figure: execution time (axis 0 to 1200) for pre, tun, and best on Quadcore, Dell, and Niagara]
pre: statically predicted
tun: online auto-tuned
best: started with best known configuration, without auto-tuning
Part of VLSI design application
5 filters with feedback loop and teleports
4 tuning parameters
Filters: Producer, Repair, Movement *, Calculate forces, Finish
[Figure: execution time (axis 0 to 1400) for pre, tun, and best on Quadcore, Dell, and Niagara]
[Figure: fractions of best parallel performance (= 100%) for seq, pre, and tun; categories: DS Electric Series Vscale Vzoom]
ATLAS/AEOS (Whaley et al., 2000)
Auto-tuning system for algebraic operations and algorithms
Domain-specific approach
No support for parallel programs
Active Harmony (Tapus et al., 2002)
Search-based auto-tuning system for library optimization
Comprehensive analysis of search algorithms
Not applicable to parallel programs
MATE (Morajko et al., 2007)
Model-based tuning system for distributed PVM programs
Provides good performance predictions
Limited to special program structures
ATUNE (Schaefer, Tichy, 2010)
General-purpose auto-tuner
Offline tuner (trial runs)
Pattern language for expressing parallel patterns (TADL)
[Figure: execution time (axis 0 to 120) for pre, tun, and best on Quadcore, Dell, and Niagara]
Filters: Read directory, Read file *, Write to index
Computers are not the bottleneck. Programmers are!
Stream programming simplifies parallel programming
Typical parallel patterns easy to write
Auto-tuning finds optimal operating conditions
Saves lots of tuning work
Further research
Improved online search algorithms
Use static model to predict good starting values
Use auto-tuning to distribute work over heterogeneous cores