SLIDE 1

Atune-IL: An Instrumentation Language for Auto-Tuning Parallel Applications

Christoph A. Schaefer, Victor Pankratius, Walter F. Tichy
Institute for Program Structures and Data Organization (IPD), University of Karlsruhe, 2009

Software Engineering Seminar

Michael Berli, December 14th 2011

SLIDE 4

Motivation

Parallel Program

http://www.teknovadi.com/lenovo-laptop/lenovo-ideapad-b560/
http://lian-li.com/v2/tw/product/upload/image/pc-60fn/pc-60fn-26.jpg
http://www.netzwelt.de/news/74776_3-apple-intel-produktuebersicht-erstes-fazit.html
http://www.shoppydoo.co.uk/prices-desktop-packard_bell_ipower.html
http://stodolatest.pl/produkt/HP_6735s_KU221EA/opinie/0/0/1

  • Number of Cores
  • Memory Management
  • Cache sizes

gain optimal performance

SLIDE 5

Motivation

Parallel Program

http://www.iconarchive.com/show/soft-scraps-icons-by-deleket/Gear-icon.html

Program 1, Program 2, Program 3, Program 4, Program 5

adjust tuning parameters

SLIDE 6

Automatic Performance Tuning

  • Auto-Tuner: Generate several program variants automatically
  • on a specific architecture
  • find an optimal tuning parameter configuration

Auto-Tuner

[Diagram: the Auto-Tuner passes a parameter configuration (p1, p2, p3) to the parallel program and receives performance data in return]

p1: 2, 4, 6, 8
p2: "static", "dynamic"
p3: algo1, algo2

http://www.teknovadi.com/lenovo-laptop/lenovo-ideapad-b560/

SLIDE 10

Automatic Performance Tuning

  • Huge search space
  • cross product of parameter domains

p1: 2, 4, 6, 8          |dom(p1)| = 4
p2: static, dynamic     |dom(p2)| = 2
p3: algo1, algo2        |dom(p3)| = 2

Search space size: 4 × 2 × 2 = 16 parameter configurations
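The cross product on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the parameter names and values are copied from the slide, while the dictionary layout and variable names are assumptions made for the example.

```python
from itertools import product
from math import prod

# Parameter domains copied from the slide.
domains = {
    "p1": [2, 4, 6, 8],
    "p2": ["static", "dynamic"],
    "p3": ["algo1", "algo2"],
}

# Size of the search space = product of the domain sizes.
search_space_size = prod(len(values) for values in domains.values())

# Enumerate every configuration as a dict {parameter: value}.
configurations = [dict(zip(domains, combo)) for combo in product(*domains.values())]

print(search_space_size)   # 16
print(configurations[0])   # {'p1': 2, 'p2': 'static', 'p3': 'algo1'}
```

Each additional parameter multiplies the size, which is why the number of configurations grows so quickly.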

SLIDE 12

Automatic Performance Tuning

  • Huge search space
  • cross product of parameter domains

13 parameters → 24 million parameter configurations
1% of the search space = 240,000 program variants

Need to prune the search space!

SLIDE 14

Automatic Performance Tuning

  • Three ways to prune the search space
  • trial and error
  • make use of heuristics / previous tuning iterations

✓ use the developer's knowledge

  • Atune-IL: annotate tuning parameters, independent sections, monitoring probes, ...

SLIDE 15

Atune's tuning cycle

Program Code → instrument with Atune-IL → Instrumented Program Code

Tuning loop:
  1. Parser: generate a program variant based on configuration c
  2. compile & execute the program variant
  3. performance feedback to the Optimizer
  4. Optimizer: find a new configuration c

Result: Optimal Program Variant

Atune-IL is independent of the host language and independent of the application domain.
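The loop in this cycle can be sketched as follows. The slides do not describe the optimizer's search strategy, so this sketch swaps in a plain exhaustive search. The `run_variant` function, its made-up cost formula, and the parameter names `num_threads` and `schedule` are hypothetical stand-ins for compiling and executing a real program variant.

```python
from itertools import product

def run_variant(config):
    """Hypothetical stand-in for 'compile & execute program variant'.

    Returns a made-up cost so that the sketch runs without a real program.
    """
    penalty = 0.0 if config["schedule"] == "dynamic" else 0.5
    return abs(config["num_threads"] - 6) + penalty

def tune(domains, measure):
    """Exhaustive search: try each configuration c and keep the cheapest one."""
    best_config, best_cost = None, float("inf")
    for combo in product(*domains.values()):
        config = dict(zip(domains, combo))   # find new configuration c
        cost = measure(config)               # execute variant, collect performance feedback
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost

domains = {"num_threads": [2, 4, 6, 8], "schedule": ["static", "dynamic"]}
best_config, best_cost = tune(domains, run_variant)
print(best_config, best_cost)
```

A real tuner would replace the exhaustive loop with a smarter search, since running every variant is exactly what the pruning techniques try to avoid.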

SLIDE 25

Numeric Parameters

  • SETVAR keyword

public void SETVAR_Example() {
    int numThreads = 2;
    for (int i = 1; i <= numThreads; i++) {
        Thread.Create(StartCalculation);
    }
    WaitAll();
}

SLIDE 26

Numeric Parameters

  • SETVAR keyword

public void SETVAR_Example() {
    int numThreads = 2;
    #pragma atune SETVAR numThreads TYPE int VALUES 2-16 STEP 2
    for (int i = 1; i <= numThreads; i++) {
        Thread.Create(StartCalculation);
    }
    WaitAll();
}

→ numThreads takes the values 2, 4, ..., 16
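To make the value range concrete, the following sketch expands a numeric SETVAR annotation into its list of values. It is not the Atune-IL parser: the regular expression only covers the form shown on this slide, the function name `parse_setvar_int` is hypothetical, and the default step of 1 is an assumption (it matches the later `VALUES 2-8` example, which yields seven values).

```python
import re

# Assumed pattern for the numeric form shown on the slide:
#   #pragma atune SETVAR <name> TYPE int VALUES <lo>-<hi> [STEP <n>]
SETVAR_INT = re.compile(
    r"#pragma\s+atune\s+SETVAR\s+(\w+)\s+TYPE\s+int\s+VALUES\s+(\d+)-(\d+)(?:\s+STEP\s+(\d+))?"
)

def parse_setvar_int(line):
    """Return (name, list of values) for a numeric SETVAR annotation."""
    match = SETVAR_INT.search(line)
    if match is None:
        raise ValueError(f"not a numeric SETVAR annotation: {line!r}")
    name, low, high, step = match.groups()
    step = int(step) if step else 1   # assumption: STEP defaults to 1
    return name, list(range(int(low), int(high) + 1, step))

name, values = parse_setvar_int(
    "#pragma atune SETVAR numThreads TYPE int VALUES 2-16 STEP 2"
)
print(name, values)   # numThreads [2, 4, 6, 8, 10, 12, 14, 16]
```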

SLIDE 27

Architectural Parameters

  • SETVAR keyword

public void SETVAR_Example2() {
    SortAlgorithm sortAlgo = new ParallelMergeSort();
    #pragma atune SETVAR sortAlgo TYPE generic VALUES "new QuickSort()", "new ParallelMergeSort()"
    if (sortAlgo != null)
        sortAlgo.run();
}

SLIDE 28

Parameter Dependencies

  • DEPENDS keyword

public void DEPENDS_Example() {
    SortAlgorithm sortAlgo = new ParallelMergeSort();
    #pragma atune SETVAR sortAlgo TYPE generic VALUES "new QuickSort()", "new ParallelMergeSort()"

    int depth = 2;
    #pragma atune SETVAR depth TYPE int VALUES 2-8

    if (sortAlgo != null)
        sortAlgo.run(depth);
}

→ 14 combinations (2 algorithms × 7 depth values)

SLIDE 29

Parameter Dependencies

  • DEPENDS keyword

public void DEPENDS_Example() {
    SortAlgorithm sortAlgo = new ParallelMergeSort();
    #pragma atune SETVAR sortAlgo TYPE generic VALUES "new QuickSort()", "new ParallelMergeSort()"

    int depth = 2;
    #pragma atune SETVAR depth TYPE int VALUES 2-8 DEPENDS sortAlgo VALUES "new ParallelMergeSort()"

    if (sortAlgo != null)
        sortAlgo.run(depth);
}

→ 8 combinations instead of 14!
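The two counts can be checked by enumerating the configurations. This is a small sketch in Python rather than in the host language. It assumes that, with DEPENDS, `depth` is only varied when `sortAlgo` is `new ParallelMergeSort()`, and that QuickSort then contributes a single configuration.

```python
sort_algos = ["new QuickSort()", "new ParallelMergeSort()"]
depths = list(range(2, 9))   # VALUES 2-8, i.e. seven values

# Without DEPENDS: every algorithm is paired with every depth.
independent = [(algo, depth) for algo in sort_algos for depth in depths]

# With DEPENDS: depth is only varied for ParallelMergeSort;
# QuickSort contributes one configuration because depth is irrelevant to it.
dependent = []
for algo in sort_algos:
    if algo == "new ParallelMergeSort()":
        dependent.extend((algo, depth) for depth in depths)
    else:
        dependent.append((algo, None))

print(len(independent), len(dependent))   # 14 8
```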

SLIDE 30

Tuning Blocks

  • Define independent sections

[Figure labels: Tuning-Approach 1, Tuning-Approach 2, Tuning-Block 1, Tuning-Block 2]

http://www.ipd.uni-karlsruhe.de/multicore/research/download/ATuneIL-Autotuning.pdf

SLIDE 32

Tuning Blocks

  • STARTBLOCK keyword

public void TUNINGBLOCKS_Example() {
    #pragma atune STARTBLOCK parallelSection
    // other tuning parameters...
    int numThreads = 2;
    #pragma atune SETVAR numThreads TYPE int VALUES 2-16 STEP 2
    for (int i = 1; i <= numThreads; i++) {
        Thread.Create(StartCalculation);
    }
    WaitAll();
    #pragma atune ENDBLOCK
}
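Why independent blocks shrink the search space can be illustrated with a short calculation. The sketch assumes that independent tuning blocks can be tuned one after another, so that the number of variants adds up across blocks instead of multiplying. The block names and domain sizes below are hypothetical and are not taken from the case study.

```python
from math import prod

# Hypothetical domain sizes per tuning block (not taken from the case study).
blocks = {
    "parallelSection": [8, 2],   # e.g. numThreads with 8 values plus one two-valued parameter
    "otherSection": [4, 3],
}

# Tuning everything together: cross product over all parameters of all blocks.
joint = prod(size for sizes in blocks.values() for size in sizes)

# Tuning blocks one after another: cross product inside a block, sum across blocks.
blockwise = sum(prod(sizes) for sizes in blocks.values())

print(joint, blockwise)   # 192 28
```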

SLIDE 33

Nested Tuning Blocks

  • INSIDE keyword
  • optimization starts with the innermost block
  • combines inner and outer parameters successively

public void TUNINGBLOCKS_Example() {
    #pragma atune STARTBLOCK parallelSection
    int numThreads = 2;
    #pragma atune SETVAR numThreads TYPE int VALUES 2-16 STEP 2
    for (int i = 1; i <= numThreads; i++) {
        Thread.Create(StartCalculation);
    }
    WaitAll();
    #pragma atune ENDBLOCK
}

SLIDE 34

Nested Tuning Blocks

  • INSIDE keyword
  • optimization starts with the innermost block
  • combines inner and outer parameters successively

public void StartCalculation() {
    #pragma atune STARTBLOCK nestedSection INSIDE parallelSection
    /* calculation with own tuning parameters */
    #pragma atune ENDBLOCK
}

SLIDE 35

Monitoring Probes

public void TUNINGBLOCKS_Example() {
    #pragma atune STARTBLOCK parallelSection
    #pragma atune GAUGE execTime
    int numThreads = 2;
    #pragma atune SETVAR numThreads TYPE int VALUES 2-16 STEP 2
    for (int i = 1; i <= numThreads; i++) {
        Thread.Create(StartCalculation);
    }
    WaitAll();
    #pragma atune GAUGE execTime
    #pragma atune ENDBLOCK
}
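What such a pair of probes might measure can be sketched as below. The slides do not spell out the GAUGE semantics, so this assumes that two probes with the same name delimit a region whose execution time is reported to the tuner. The `gauge` helper and the `measurements` dictionary are hypothetical names for this example.

```python
import time
from contextlib import contextmanager

measurements = {}   # probe name -> elapsed seconds

@contextmanager
def gauge(name):
    """Time the enclosed region, like a pair of equally named GAUGE probes (assumed semantics)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        measurements[name] = time.perf_counter() - start

with gauge("execTime"):
    total = sum(i * i for i in range(100_000))   # stand-in for the parallel section

print(f"execTime = {measurements['execTime']:.6f} s")
```

The recorded value plays the role of the performance feedback that the optimizer uses to compare program variants.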

http://www.iconfinder.com/icondetails/48883/256/

SLIDE 36

Context

✓ Motivation
✓ Introduction to Auto-Tuning
✓ Atune's Tuning Cycle
✓ Atune-IL

  • Case Study
  • Results
  • Pros & Cons

http://images.productwiki.com/upload/images/safari_iphone_app-400-400.jpg

SLIDE 37

Case Study

  • MetaboliteID (Agilent Technologies)
  • Identify effects caused by a drug on a very low level
  • by comparing control samples to metabolite samples

[Diagram: over time, Sample 1, Sample 2, ... are each compared with the control sample to determine the drug effect]

http://www.tjohnsonmedia.com/wp-content/uploads/2011/11/kid-icon-256.jpg, http://icons.iconarchive.com/icons/devcom/medical/256/pill-icon.png

SLIDE 38

Metabolite ID

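[Diagram: MetaboliteID processing from Input to Output for the control sample and Samples 1 and 2. Labels: Tuning Block 1, Tuning Block 2 (plus a further block marked "similar to Tuning Block 2"), Parallel Section, Algorithm 2, Algorithm 3. Parameters shown: NumW1, NumW2, NumW3, Lb1, Lb2, pSize1, pSize2. Two ExecTime measurements.]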

SLIDE 41

Case Study: Results 1

  • Search space reduction
  • ~24 million combinations (all combinations) → 1,600 combinations (instrumented with Atune-IL)
  • Difference in execution time
  • 45% between the best and the worst configuration (8-core machine)

SLIDE 42

Case Study: Results 2

  • Implementation effort

747 LOC (manually implemented) vs. 25 LOC (using Atune-IL)

SLIDE 43

Related Work

  • POET¹
  • independent of application domain / host language
  • optimization at source code level
  • XLanguage
  • #pragma approach
  • C / C++ code transformations
  • loop unrolling

¹ Parameterized Optimizations for Empirical Tuning

SLIDE 45

Pros and Cons

✓ Drastic search space reduction
✓ Host-language independent
✓ Independent of application domain
✓ Portability, maintenance

  • Nothing available on the web
  • Portability, maintenance
  • Paper is incomplete / wrong
  • WEIGHT not specified
  • START and DEFAULT mixed up
  • LOC: depends heavily on programming style
  • template files ignored
  • Section 6.4 "Results" is weak