Online Tuning of Stream Programs or How To Get The Most Out Of Your Multicore

SLIDE 1

KIT – University of Baden-Württemberg and National Research Center of the Helmholtz Association

Institute for Program Structures and Data Organization

Online Tuning of Stream Programs

or

How To Get The Most Out Of Your Multicore

Walter F. Tichy

SLIDE 2

Where is Karlsruhe? University of Karlsruhe - KIT, Germany

Faculty of Computer Science

  • One of the leading CS departments in Europe
  • >40 faculty, >400 PhD students in CS

SLIDE 3

The changing parallel computing landscape

Multicore transformation

Cray vector computer 1976

SLIDE 4

The first five-core mobile phone

HTC One X, Feb. 2012, powered by Nvidia Tegra 3

SLIDE 5

Nvidia Tegra 3

SLIDE 6

Nvidia Tegra 3 Schematic

  • 1 core at 500 MHz (battery saver)
  • 4 cores at 1.5 GHz
  • 1 GPU

SLIDE 7

  • AMD Opteron, 12 cores: ~1.8 billion transistors on 2 x 3.46 cm²
  • Intel SCC, 48 cores: ~1.3 billion transistors on 5.6 cm²
  • Intel, 2 cores: ~167 million transistors on 1.1 cm²
  • Intel, 8 cores: ~2.3 billion transistors on 6.8 cm²
  • Sun Niagara3, 16 cores: ~1 billion transistors on 3.7 cm²
  • Intel, 4 cores: ~582 million transistors on 2.86 cm²
  • Intel Sandy Bridge, 4+6 cores: ~1 billion transistors on 2.2 cm²

SLIDE 8

The 2011 Intel Sandy Bridge

  • Currently: 4 CPUs, 6 graphics execution units
  • Later: 8 CPUs, 12 graphics execution units

SLIDE 9

  • AMD Opteron, 12 cores: ~1.8 billion transistors on 2 x 3.46 cm²
  • Intel SCC, 48 cores: ~1.3 billion transistors on 5.6 cm²
  • Intel, 2 cores: ~167 million transistors on 1.1 cm²
  • Intel, 8 cores: ~2.3 billion transistors on 6.8 cm²
  • Sun Niagara3, 16 cores: ~1 billion transistors on 3.7 cm²
  • Intel, 4 cores: ~582 million transistors on 2.86 cm²
  • Intel Sandy Bridge, 4+6 cores: ~1 billion transistors on 2.2 cm²

Victor Pankratius

SLIDE 10

Fixing Parallel Performance Problems

  • Parallelization is complex and error-prone
  • Parallel programs contain a number of tuning parameters
  • Manual optimization is difficult and time-consuming
  • Each target platform may require re-tuning
  • Auto-tuning: let the computer do the tuning!

[Figure: platform A is configured with a=1, b=2, c=3 and platform B with a=4, b=5, c=6; for a new platform the values are unknown (a=?, b=?, c=?).]

Examples for Tuning Parameters

  • Number of pipeline stages
  • Choice of best algorithm implementation
  • Order of execution
  • Size of data partitions
  • Number of workers
  • Type of core
  • Load balancing strategy

SLIDE 11

Online Auto-Tuning

Auto-Tuning Cycle:

  • Start: parallel program with tuning parameters
  • Optimize (calculate new parameter values), yielding a parameter configuration
  • Apply the configuration to the program, yielding an executable program
  • Execute and measure the program, yielding a performance value
  • Feed the performance value back into the optimizer and repeat

Example (pseudo code):

    TuningParameter numthreads(3, 64);           // tuning parameter
    TuningParameter blocksize(100, 900, 100);    // tuning parameter
    for (int i = 0; i < numfiles; ++i) {
        startMeasurement();                      // measurement section begins
        compress(files[i], blocksize, numthreads);
        stopMeasurement();                       // measurement section ends
    }
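The cycle above can be sketched in Python as a greedy search that tries one neighboring parameter value per work item. This is only an illustration under stated assumptions: the `TuningParameter` class, `tune_online`, and the synthetic `fake_measure` cost function are hypothetical names made up for this sketch, not the API used in the talk, and the simple "keep it if it was faster" rule stands in for whatever search algorithm the real auto-tuner uses.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class TuningParameter:
    """Hypothetical integer tuning parameter with an inclusive range."""
    name: str
    value: int
    lo: int
    hi: int
    step: int = 1


def tune_online(items, measure, params):
    """Process every item once while adjusting `params` between items.

    `measure(item, config)` must do the real work for `item` and return the
    measured cost (for example the elapsed time in seconds).
    """
    # Candidate moves: one step down or up for each parameter, tried in turn.
    moves = cycle([(p, d) for p in params for d in (-1, +1)])
    best_cost = None
    history = []
    for item in items:
        param, direction = next(moves)
        old = param.value
        trial = old + direction * param.step
        if best_cost is not None and param.lo <= trial <= param.hi:
            param.value = trial              # apply a trial configuration
        config = {p.name: p.value for p in params}
        cost = measure(item, config)         # execute and measure
        history.append((config, cost))
        if best_cost is None or cost < best_cost:
            best_cost = cost                 # keep the improvement
        else:
            param.value = old                # revert to the accepted value
    return {p.name: p.value for p in params}, history


# Synthetic stand-in for "compress one file and time it" (an assumption):
# the cost is lowest at 8 threads and a block size of 500.
def fake_measure(item, config):
    return abs(config["numthreads"] - 8) + abs(config["blocksize"] - 500) / 100


params = [
    TuningParameter("numthreads", value=3, lo=1, hi=64),
    TuningParameter("blocksize", value=700, lo=100, hi=900, step=100),
]
final, history = tune_online(range(50), fake_measure, params)
print(final)
```

The sketch compares costs across different work items, which is only reasonable when the items are of similar size, and it ignores measurement noise. A real online tuner would need to normalize or average its measurements.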

SLIDE 12

Auto-Tuning: BZip2 example

Parallelized BZip2, compressing 50 files on a machine with 8 cores

  • Initial tuning parameter values: 3 threads, block size 700 kB
  • Runtime without tuning: 22.9 s
  • Runtime with auto-tuner: 8 s
  • Best possible time (start with best configuration): 6.5 s
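To put the three runtimes in relation, the following small calculation derives the speedups from the numbers on this slide. Only the three runtimes come from the slide; the variable names are chosen here for illustration.

```python
# Runtimes in seconds, taken from the slide.
untuned = 22.9   # fixed initial configuration, no tuning
tuned = 8.0      # online auto-tuner adjusts parameters during the run
best = 6.5       # run started directly with the best configuration

speedup_tuned = untuned / tuned    # gain from online tuning
speedup_best = untuned / best      # upper bound with perfect foreknowledge
tuned_vs_best = tuned / best       # cost of searching while running

print(f"online tuning speedup:   {speedup_tuned:.2f}x")
print(f"best-configuration gain: {speedup_best:.2f}x")
print(f"tuned / best runtime:    {tuned_vs_best:.2f}")
```

The difference between the tuned and the best run (1.5 s here) is the time spent exploring slower configurations before the tuner settles.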

SLIDE 13

Stream Programming Paradigm

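A stream of elements flows through a graph of processing modules called filters.

  • Task parallelism
  • Pipeline parallelism
  • Data parallelism (by filter replication)

[Figure: three filter graphs built from filters F1 to F5, one per kind of parallelism; in the data-parallel graph, replicated copies of F2 sit between a split and a join.]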
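A minimal Python sketch of such a filter graph is shown below, assuming a linear pipeline whose filters are connected by queues and where a stateless filter may be replicated. The helper names `run_filter` and `run_pipeline` are hypothetical and not taken from the talk. Queues between stages give pipeline parallelism, and several threads reading the same queue play the role of split and join for a replicated filter.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def run_filter(fn, inbox, outbox, replicas=1):
    """Start `replicas` threads that apply `fn` to elements from `inbox`."""
    remaining = [replicas]
    lock = threading.Lock()

    def worker():
        while True:
            item = inbox.get()
            if item is _END:
                inbox.put(_END)           # let sibling replicas see the end too
                with lock:
                    remaining[0] -= 1
                    if remaining[0] == 0:
                        outbox.put(_END)  # last replica closes the output
                return
            outbox.put(fn(item))

    threads = [threading.Thread(target=worker) for _ in range(replicas)]
    for t in threads:
        t.start()
    return threads


def run_pipeline(items, stages):
    """Run `items` through `stages`, a list of (function, replicas) pairs."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = []
    for (fn, replicas), inbox, outbox in zip(stages, queues, queues[1:]):
        threads += run_filter(fn, inbox, outbox, replicas)
    for item in items:
        queues[0].put(item)
    queues[0].put(_END)
    results = []
    while (item := queues[-1].get()) is not _END:
        results.append(item)
    for t in threads:
        t.join()
    return results


# F1 -> F2 (replicated three times) -> F3
outputs = run_pipeline(
    range(10),
    [(lambda x: x + 1, 1), (lambda x: x * x, 3), (lambda x: -x, 1)],
)
print(sorted(outputs))
```

With more than one replica the output order is not guaranteed, so the example sorts the results. In CPython, threads illustrate the structure rather than a real speedup for CPU-bound filters.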

SLIDE 14

(Some) Implicit Tuning Parameters

  • Replication factor [Figure: filter F replicated as F1, F2, …, Fn between split S and join J]
  • Cut-off depth
  • Alternative algorithms/cores [Figure: a choice "?" among AL1, AL2, …, ALn]

SLIDE 15

Measurement Sections in Stream Programs

  • "Classic" fork/join pattern: [Figure: sequential phases alternate with groups of parallel tasks (parallel 1, parallel 2, parallel 3); three measurement sections are marked.]
  • Stream program: [Figure: Filters 1 to 4 run concurrently; where do the measurement section(s) go?]
  • Solution: count "heartbeats" (events triggered by stream elements) and use them to evaluate performance.

SLIDE 16

Using Heartbeats for Online Tuning

  • Heartbeats are emitted by sink filters
  • The faster the heartbeat, the better the performance
  • Heartbeats serve as an input signal for online auto-tuners

Illustrating example:

[Figure: over time, the auto-tuner reads heartbeat values of 50, 70, and 80 and issues a new parameter configuration after each reading; the filter graph (Filters 1 to 4, some replicated) is reshaped accordingly.]
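One way to turn heartbeats into a performance signal is a sliding-window rate counter, sketched here in Python. The class `HeartbeatMonitor` and its methods are hypothetical names for this illustration and are not the interface used in the talk. The clock is injectable so the example below can run with fixed timestamps.

```python
import threading
import time
from collections import deque


class HeartbeatMonitor:
    """Counts heartbeats and reports the rate over a sliding time window."""

    def __init__(self, window_seconds=1.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._beats = deque()
        self._lock = threading.Lock()

    def beat(self):
        """Called by a sink filter each time it finishes a stream element."""
        with self._lock:
            self._beats.append(self.clock())

    def rate(self):
        """Heartbeats per second within the most recent window."""
        now = self.clock()
        with self._lock:
            # Drop heartbeats that have fallen out of the window.
            while self._beats and self._beats[0] <= now - self.window:
                self._beats.popleft()
            return len(self._beats) / self.window


# Demonstration with a fake clock so the numbers are reproducible.
fake_now = [0.0]
monitor = HeartbeatMonitor(window_seconds=10.0, clock=lambda: fake_now[0])

for t in (0.0, 2.0, 4.0, 6.0, 8.0):      # slow phase: one beat every 2 s
    fake_now[0] = t
    monitor.beat()
fake_now[0] = 9.0
rate_slow = monitor.rate()

for t in range(10, 20):                   # fast phase: one beat every second
    fake_now[0] = float(t)
    monitor.beat()
fake_now[0] = 19.0
rate_fast = monitor.rate()

print(rate_slow, rate_fast)
```

An online tuner could read `rate()` after applying each new configuration and keep the configuration only if the rate went up.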

SLIDE 17

Benchmark 1: Video zoom

[Figure: filter graph with Read, Cut, Scale I, Scale II, and Write, connected through a split (S) and a join (J). Filters marked * are replicable; ? marks first come/first serve.]

[Chart: execution time (axis 0 to 1200) for pre, tun, and best on Quadcore, Dell, and Niagara.]

  • pre: statically predicted
  • tun: online auto-tuned
  • best: started with best known configuration, without auto-tuning

SLIDE 18

Benchmark 2: Electric (Placement of circuits on a die)

  • Part of a VLSI design application
  • 5 filters with feedback loop and teleports
  • 4 tuning parameters

[Figure: filter graph with Producer, Repair overlaps, Movement, Calculate forces, and Finish. Filters marked * are replicable.]

SLIDE 19

Electric: Results

[Chart: execution time (axis 0 to 1400) for pre, tun, and best on Quadcore, Dell, and Niagara.]

SLIDE 20

Benchmarks on 4 cores

[Chart: fraction of best parallel performance (= 100%) for seq, pre, and tun on the benchmarks DS, Electric, Series, Vscale, and Vzoom; axis 0% to 120%.]

SLIDE 21

Benchmarks on 64 cores (Niagara)

[Chart: seq, pre, and tun for the benchmarks DS, Electric, Series, Vscale, and Vzoom; axis 0% to 120%.]

SLIDE 22

Related Work (Selection)

ATLAS/AEOS (Whaley et al., 2000)

  • Auto-tuning system for algebraic operations and algorithms
  • Domain-specific approach
  • No support for parallel programs

Active Harmony (Tapus et al., 2002)

  • Search-based auto-tuning system for library optimization
  • Comprehensive analysis of search algorithms
  • Not applicable to parallel programs

MATE (Morajko et al., 2007)

  • Model-based tuning system for distributed PVM programs
  • Provides good performance predictions
  • Limited to special program structures

ATUNE (Schaefer, Tichy, 2010)

  • General-purpose auto-tuner
  • Offline tuner (trial runs)
  • Pattern language for expressing parallel patterns (TADL)

SLIDE 23

Benchmark 3: Desktop search

[Figure: filter graph with Read directory, Read file, and Write to index. Filters marked * are replicable.]

[Chart: execution time (axis 0 to 120) for pre, tun, and best on Quadcore, Dell, and Niagara.]

SLIDE 24

Summary

  • Computers are not the bottleneck. Programmers are!
  • Stream programming simplifies parallel programming
      • Typical parallel patterns are easy to write
      • Auto-tuning finds optimal operating conditions
      • Saves lots of tuning work

Further research

  • Improved online search algorithms
  • Use a static model to predict good starting values
  • Use auto-tuning to distribute work over heterogeneous cores

SLIDE 25

THANK YOU! QUESTIONS?

For more information, see: http://www.ipd.kit.edu/Tichy/

With many thanks to Frank Otto, Thomas Karcher, Jonas Thedering, Victor Pankratius

SLIDE 26

BACKUP SLIDES

SLIDE 27

Benchmarks on 8 cores

[Chart: seq, pre, and tun for the benchmarks DS, Electric, Series, Vscale, and Vzoom; axis 0% to 120%.]