Context | DF & Problematic | Detection of SDF Bottleneck Actors | Profiling of DF Programs | Perspectives

Performance Evaluation of Throughput Constrained Dataflow Programs Executed On Shared-Memory Multi-Core Architectures

Manuel SELVA

Supervised by: Lionel MOREL (CITI), Kevin MARQUET (CITI), Stéphane FRÉNOT (CITI), Frédéric SOINNE (Bull), Stéphane ZENG (Bull)

July 2, 2015

Moore’s Law

What To Do With Transistors?

[Figure: trends in transistors, frequency, power and instruction-level parallelism]

Multi-core

Multi-core Architectures

  • Intel Nehalem - 4 cores - 2009
  • Kalray MPPA - 256 cores - 2013
  • Samsung Exynos - 2 x 4 cores - 2012

One taxonomy

  • Computing homogeneity
  • Memory organization

Centralized Shared Memory

[Diagram: Core 1 to Core 4 sharing memory (Mem); annotations: bandwidth bottleneck, memory wall]

Distributed Shared Memory - a.k.a. NUMA

[Diagram: Core 1 to Core 8, with two memories (Mem) linked by an interconnect]

Distributed Private Memory

[Diagram: Core 1 to Core 32, with memories (Mem) connected by a network]

Software Challenge: Several Applications

[Diagram: Application 1 to Application 4 running on Core 1 to Core 4]

Software Challenge: Single Application

[Diagram: one application, Core 1 to Core 3]

Problem

  • Identify several activities
  • Handle communication and synchronization

Solutions

  • (Semi-)automatically split existing apps
  • (Re)write apps using concurrent programming models: threads, data parallelism, dataflow

Dataflow Applications Examples

  • Medical image processing [Albers2012]
  • Software Defined Radio [Dardaillon2014]
  • Video Decoding [Lucarz09]

Outline

  • Dataflow Programming & Problematic
  • Detection of SDF Bottleneck Actors
  • Profiling of Dataflow Programs
  • Conclusion & Perspectives

Dataflow Programming Model: Syntax

[Diagram: dataflow application graph with actors A, B, C and D]

  • Actors with sequential atomic function
  • Communication over FIFO channels only

Dataflow Programming Model: Semantics

[Diagram: actors A, B, C and D]

Actor activation is driven by token availability.

Why is dataflow interesting?

  • Actors can be executed in parallel
  • Communication abstraction
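
The firing rule above can be illustrated with a minimal sketch. The graph shape (A feeds B and C, which both feed D), the one-token-per-firing rates and all function names are assumptions made for illustration; they are not taken from the slides. An actor fires only when every one of its input FIFOs holds a token.

```python
from collections import deque

# Hypothetical graph: A -> B, A -> C, B -> D, C -> D (shape assumed for illustration).
fifos = {("A", "B"): deque(), ("A", "C"): deque(),
         ("B", "D"): deque(), ("C", "D"): deque()}
actors = ["A", "B", "C", "D"]


def inputs(actor):
    return [f for (src, dst), f in fifos.items() if dst == actor]


def outputs(actor):
    return [f for (src, dst), f in fifos.items() if src == actor]


def can_fire(actor, source_budget):
    """An actor is fireable when each input FIFO holds at least one token."""
    ins = inputs(actor)
    if not ins:  # source actor: bounded by an external budget instead
        return source_budget[actor] > 0
    return all(len(f) >= 1 for f in ins)


def fire(actor, source_budget):
    """Consume one token per input FIFO and produce one per output FIFO."""
    ins = inputs(actor)
    if not ins:
        source_budget[actor] -= 1
    for f in ins:
        f.popleft()
    for f in outputs(actor):
        f.append(actor)  # the token payload does not matter here


def run(source_firings):
    """Fire any enabled actor until none is enabled; return the firing trace."""
    budget = {"A": source_firings}
    trace = []
    progress = True
    while progress:
        progress = False
        for actor in actors:
            if can_fire(actor, budget):
                fire(actor, budget)
                trace.append(actor)
                progress = True
    return trace
```

Calling `run(2)` lets the source A fire twice; B, C and D then fire only once their inputs are populated. Nothing in the rule itself prevents B and C from running on different cores, which is the parallelism the slide refers to.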

Dataflow Execution Model: 4 Cores

[Diagram: the compiler turns the graph of actors A, B, C and D into code (C; D; B; A;), and the mapper places it on Core 1 to Core 4, which share RAM (A; B; C; D;)]

Dataflow Execution Model: 2 Cores

[Diagram: the compiler turns the graph of actors A, B, C and D into code (C; D; B; A;), and the mapper places it on Core 1 and Core 2, which share RAM (A; C; D; B;)]

Dataflow Execution Model: Single Core

[Diagram: the compiler turns the graph of actors A, B, C and D into code (C; D; B; A;), and the mapper places it on a single core, Core 1, with its RAM (A; B; C; D;)]

Motivation

[Figure: speedup vs single-core (1 to 3) against number of cores (1 to 12), for different inputs; HEVC decoding, 200 frames, 33 actors]

Problem Statement

How to understand and identify performance bottlenecks in dataflow programs?

  • Contribution 1: Automatic instrumentation to detect bottleneck actors in SDF graphs
  • Contribution 2: CPU/memory profiling to analyse (and fix) bottlenecks on dataflow programs

Synchronous Dataflow [Lee87] - SDF

[Diagram: SDF graph of actors A, B, C and D; edge rate annotations as extracted: 2 1 1 3 1 6 3 3 2]

Token consumption/production rates are static

  • Memory boundedness
  • Static scheduling of actors
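
Static rates are what make static scheduling possible: the number of firings of each actor per schedule period follows from the balance equations, one per edge, stating that tokens produced equal tokens consumed. The sketch below solves them for a hypothetical chain A → B → C → D. The rates are assumed for illustration, since the slide's edge annotations cannot be matched to edges from the extracted text, and `repetition_vector` is an illustrative name.

```python
from fractions import Fraction
from math import lcm  # multi-argument lcm requires Python 3.9+


def repetition_vector(actors, edges):
    """Smallest positive integer firing counts satisfying, for every edge
    (src, dst, produced_per_firing, consumed_per_firing):
        q[src] * produced == q[dst] * consumed
    """
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
            elif src in q and dst in q and q[src] * prod != q[dst] * cons:
                raise ValueError("inconsistent rates: no bounded-memory schedule")
    if len(q) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer one.
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}


# Hypothetical rates: A produces 2 / B consumes 1, B produces 1 / C consumes 3,
# C produces 3 / D consumes 2.
edges = [("A", "B", 2, 1), ("B", "C", 1, 3), ("C", "D", 3, 2)]
print(repetition_vector(["A", "B", "C", "D"], edges))
```

A consistent solution also bounds every FIFO, which is the memory boundedness property listed above; inconsistent rates raise an error instead.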

Applications Have Throughput Requirements

[Figure: example applications annotated with required rates: X images/s, X frames/s, X images/s]

Activation Frequency Analysis

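[Diagram: SDF graph of actors A, B, C and D (edge rate annotations as extracted: 2 1 1 3 1 6 3 3 2) with a throughput constraint of 1000 tokens/s; derived frequencies shown on the actors: 500Hz, 250Hz, 500Hz, 250Hz]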

Use throughput constraint and static rates

  • Required Activation Frequency (RAF)
  • Time information in addition to data exchanges
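
One way to derive required activation frequencies is to start from the throughput constraint at the constrained actor and walk the edges, since the token rate on an edge must be the same seen from its producer and from its consumer. The sketch below does this for a hypothetical chain; the rates, the choice of D as the constrained actor and the function name are assumptions, not the graph of the slide.

```python
from fractions import Fraction


def required_activation_frequencies(edges, sink, sink_tokens_per_firing, throughput):
    """Propagate a throughput constraint (tokens/s at the sink) through static rates.

    edges: list of (src, dst, produced_per_firing, consumed_per_firing).
    On every edge: raf[src] * produced == raf[dst] * consumed (tokens/s).
    """
    raf = {sink: Fraction(throughput, sink_tokens_per_firing)}
    pending = list(edges)
    while pending:
        remaining = []
        for src, dst, prod, cons in pending:
            if dst in raf and src not in raf:
                raf[src] = raf[dst] * cons / prod
            elif src in raf and dst not in raf:
                raf[dst] = raf[src] * prod / cons
            elif src in raf and dst in raf:
                if raf[src] * prod != raf[dst] * cons:
                    raise ValueError("inconsistent rates")
            else:
                remaining.append((src, dst, prod, cons))
        if len(remaining) == len(pending):
            raise ValueError("some edges are not connected to the sink")
        pending = remaining
    return raf


# Hypothetical chain; D must deliver 1000 tokens/s and emits 2 tokens per firing.
edges = [("A", "B", 2, 1), ("B", "C", 1, 2), ("C", "D", 2, 1)]
raf = required_activation_frequencies(edges, "D", 2, 1000)
for actor in "ABCD":
    print(actor, float(raf[actor]), "Hz")
```

With these assumed rates, D must fire at 500 Hz, C at 250 Hz, B at 500 Hz and A at 250 Hz. The frequencies add the time dimension that the rates alone do not carry.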

Instrumentation of SDF Actors

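[Diagram: SDF graph of actors A, B, C and D (edge rate annotations as extracted: 2 1 1 3 1 6 3 3 2) with a throughput constraint of 1000 tokens/s. The extended compiler generates the actor code (C; D; B; A;) annotated with frequencies (250Hz, 500Hz, 250Hz, 500Hz). The mapper places the code on Core 1 and Core 2, which share RAM (A; C; D; B;). Final annotation: C; 150Hz.]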
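
The final annotation (C; 150Hz) suggests comparing a frequency observed at runtime with the required one. A minimal sketch of such a check follows, assuming the instrumentation simply counts activations over a time window. The function name, the tolerance and all numbers are made up for illustration; in a real run the neighbours of a slow actor would typically slow down too, and the slides do not detail how the culprit is singled out.

```python
def find_bottlenecks(required_hz, activation_counts, window_s, tolerance=0.05):
    """Return {actor: measured_hz} for actors firing below their required frequency.

    required_hz:       actor -> required activation frequency (Hz)
    activation_counts: actor -> activations counted during the window
    window_s:          length of the measurement window in seconds
    tolerance:         accepted relative shortfall (an assumption of this sketch)
    """
    bottlenecks = {}
    for actor, required in required_hz.items():
        measured = activation_counts.get(actor, 0) / window_s
        if measured < required * (1 - tolerance):
            bottlenecks[actor] = measured
    return bottlenecks


# Made-up measurement over a 2 s window.
required = {"A": 500, "B": 250, "C": 250, "D": 500}
counts = {"A": 1000, "B": 500, "C": 300, "D": 1000}
print(find_bottlenecks(required, counts, window_s=2.0))
```

Here only C falls short: 300 activations in 2 s is 150 Hz against a required 250 Hz.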

Conclusion

  • Identification of SDF bottleneck actors
  • Validation on StreamIt benchmarks [Thies10]

Limitations

  • Not all real-life applications fit into SDF
  • Identifies where the problem is, but not its origin

Dynamic Dataflow [Lee95] - DDF

[Diagram: actors A, B, C and D; every edge rate is marked *]

Token consumption/production rates are dynamic

  • No static analyses
  • Runtime mechanisms required for scheduling
  • Runtime mechanisms to identify bottlenecks

How To Understand and Identify Performance Bottlenecks in Dataflow Programs?

[Diagram: two Xeon X5650 processors linked by QPI. The first holds Core 1 to Core 6 (L1 and L2 per core), an L3 and a memory controller attached to Memory Bank 1; the second holds Core 7 to Core 12 with the same layout, attached to Memory Bank 2. PMUs sit in both the core domain and the uncore domain.]

Correlate hardware profiling to the DF graph

Dataflow Profiler

CPU Profiling

  • Measure activation time
  • At core level
  • At actors level

Memory Profiling

  • Based on hardware mechanisms
  • Latency of accesses to FIFO
  • Latency of accesses to the actors’ internal state
  • Identification of hardware bottlenecks
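
The CPU side of such a profiler can be sketched as a wrapper that times every activation and accumulates the result per actor and per core. This is only an illustration: the class and method names are hypothetical, wall-clock nanoseconds stand in for the cycle counts used in the slides, and the core identifier is assumed to be known from the mapping.

```python
import time
from collections import defaultdict


class ActivationProfiler:
    """Accumulates activation time at actor level and at core level."""

    def __init__(self):
        self.by_actor = defaultdict(int)     # actor name -> total time (ns)
        self.by_core = defaultdict(int)      # core id -> total time (ns)
        self.activations = defaultdict(int)  # actor name -> activation count

    def run(self, actor_name, core_id, action, *args):
        """Execute one activation of an actor and record how long it took."""
        start = time.perf_counter_ns()
        try:
            return action(*args)
        finally:
            elapsed = time.perf_counter_ns() - start
            self.by_actor[actor_name] += elapsed
            self.by_core[core_id] += elapsed
            self.activations[actor_name] += 1

    def core_shares(self):
        """Work distribution by core, in percent of the total work time."""
        total = sum(self.by_core.values())
        if total == 0:
            return {}
        return {core: 100.0 * t / total for core, t in self.by_core.items()}


profiler = ActivationProfiler()
profiler.run("A", 0, sum, range(200_000))  # stand-in for an actor's action
profiler.run("B", 1, sum, range(100_000))
print(dict(profiler.activations), profiler.core_shares())
```

The per-core shares are the kind of data needed to judge how well the work is balanced across cores, and their sum across cores gives a total work time.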

Cores Balance

[Figure: work distribution by core (%, 20 to 100) against number of cores (1 to 12), one series per core (Core 1 to Core 12); HEVC, input Kimono, 200 frames. Bar labels as extracted: 28 27 27 29 29 31 34 36 36 43 54 100. Annotations: "Single actor: Inter pred." and "Split it [Jerbi14]".]

Total Work Time is Increasing

[Figure: total work time (cycles, 5 to 8 ·10^10) against number of cores (1 to 12); HEVC, input Kimono, 200 frames; annotation: +49%]

Communication Overhead On NUMA

[Diagram: two Xeon X5650 processors linked by QPI. The first holds Core 1 to Core 6 (L1 and L2 per core), an L3 and a memory controller attached to Memory Bank 1; the second holds Core 7 to Core 12 with the same layout, attached to Memory Bank 2. PMUs sit in both the core domain and the uncore domain.]

  • Remote vs local latency: +30% [Molka2009, David2013]
  • Cache coherency protocol: QPI overhead [Molka2009]
  • Memory controllers and QPI links contention [Dashti2013]

NUMA - Performance Monitoring Unit

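[Diagram: two Xeon X5650 processors linked by QPI. The first holds Core 1 to Core 6 (L1 and L2 per core), an L3 and a memory controller attached to Memory Bank 1; the second holds Core 7 to Core 12 with the same layout, attached to Memory Bank 2. PMUs sit in both the core domain and the uncore domain.]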

Hardware profiling mechanisms

  • Hard to program

A library for NUMA Profiling

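[Diagram: access paths to the hardware PMU, which otherwise requires writing assembler and running in supervisor mode. In the Linux kernel: the perf_event_open() system call and the /dev/cpu/msr kernel module. On top: Linux Perf, PAPI, numap, Intel PCM.]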

  • Architecture abstraction
  • Memory bandwidth profiling
  • Memory access sampling

Using numap for Dataflow Memory Profiling

[Diagram: two Xeon X5650 processors linked by QPI. The first holds Core 1 to Core 6 (L1 and L2 per core), an L3 and a memory controller attached to Memory Bank 1; the second holds Core 7 to Core 12 with the same layout, attached to Memory Bank 2. PMUs sit in both the core domain and the uncore domain.]

  • Do DF applications saturate memory bandwidth?
  • Associate remote accesses (sampled addresses, e.g. @=0x7123CFF) with actors and FIFOs

Memory Bandwidth Usage

[Figure: average bandwidth (GB/s, 5 to 25) against number of cores (1 to 12), read and write series, with the read and write max bandwidths marked; HEVC, input Kimono, 200 frames]

Communication Cost

[Figure: % of accesses (20 to 100) served by each level (L1, LFB, L2, L3, RemoteCache, LocalRAM, RemoteRAM) against number of cores (1 to 12), with the average memory latency in cycles; HEVC, input Kimono, 200 frames. Average memory latency labels as extracted: 21.1 20.0 19.4 16.0 16.0 14.5 12.9 11.4 10.0 9.5 8.2 7.9. Bar-segment labels as extracted: 18 18 25 25 21 14 17 26 32 25 18 17 14 17 22 11 16 16 16 19 18 17 89 90 79 75 64 57 52 47 47 39 39 37.]

Where to Optimize?

[Figure: two locations flagged "High latency"]

Link memory samples to FIFO channels
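
Linking samples to FIFO channels amounts to an address range lookup: each sampled address is matched against the memory regions that back the FIFOs. The sketch below shows one way to do it with a sorted index. The region addresses, labels and sample values are invented, and the (address, latency) sample shape is an assumption rather than numap's actual output format.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """regions: non-overlapping (start_address, size_in_bytes, label) tuples."""
    ordered = sorted(regions)
    starts = [start for start, _, _ in ordered]
    return starts, ordered


def label_of(address, index):
    """Return the label of the region containing the address, or None."""
    starts, ordered = index
    i = bisect_right(starts, address) - 1
    if i >= 0:
        start, size, label = ordered[i]
        if address < start + size:
            return label
    return None


def average_latency_by_region(samples, regions):
    """samples: iterable of (address, latency_in_cycles) pairs."""
    index = build_index(regions)
    total = defaultdict(int)
    count = defaultdict(int)
    for address, latency in samples:
        label = label_of(address, index) or "other"
        total[label] += latency
        count[label] += 1
    return {label: total[label] / count[label] for label in total}


# Invented FIFO buffers and memory samples.
regions = [(0x1000, 0x100, "FIFO A->B"), (0x2000, 0x100, "FIFO C->D")]
samples = [(0x1010, 200), (0x1020, 100), (0x2000, 10), (0x3000, 5)]
print(average_latency_by_region(samples, regions))
```

FIFOs with a high average latency are the candidates for relocation or remapping.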

Conclusion

Contribution 1: Detect SDF bottleneck actors

  • Independent of the language
  • Independent of the architecture
  • Published [Selva15]

Contribution 2: CPU/memory profiling for DF programs

  • Implementation in Orcc [Yviquel13]
  • Memory profiling with the help of numap
  • Conclusions about where to optimize
  • Journal article to be submitted in July

Perspectives

Tooling

  • Integration in the official Orcc release
  • Open source numap (in PAPI?)

Research

  • Detect bottleneck actors in less restrictive models
  • Optimize dataflow runtime using memory sampling results
  • Towards a dataflow-aware operating system
  • What about other concurrent programming models?

Thank you for your attention

Bibliography

◮ A. H. R. Albers and P. H. N. de With. Task complexity analysis and QoS management for mapping dynamic video-processing tasks on a multi-core platform. Journal of Real-Time Image Processing, 7(3):185–202, 2012.

◮ Thomas W. Bartenstein and Yu David Liu. Rate types for stream programs. SIGPLAN Not., 49(10):213–232, October 2014.

◮ A. Bonfietti, L. Benini, M. Lombardi, and M. Milano. An efficient and complete approach for throughput-maximal SDF allocation and scheduling on multi-core platforms. In Design, Automation Test in Europe Conference Exhibition (DATE), 2010, pages 897–902, March 2010.

◮ Yoonseo Choi, Cheng-Hong Li, Dilma Da Silva, Alan Bivens, and Eugen Schenfeld. Adaptive task duplication using on-line bottleneck detection for streaming applications. In Proceedings of the 9th Conference on Computing Frontiers, CF ’12, pages 163–172, New York, NY, USA, 2012. ACM.

◮ Rebecca L. Collins and Luca P. Carloni. Flexible filters: Load balancing through backpressure for stream programs. In Proceedings of the Seventh ACM International Conference on Embedded Software, EMSOFT ’09, pages 205–214, New York, NY, USA, 2009. ACM.

◮ Mickaël Dardaillon, Kevin Marquet, Tanguy Risset, Jérôme Martin, and Henri-Pierre Charles. A compilation flow for parametric dataflow: Programming model, scheduling, and application to heterogeneous MPSoC. In Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES ’14, pages 8:1–8:10, New York, NY, USA, 2014. ACM.

◮ Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quéma, and Mark Roth. Traffic management: A holistic approach to memory placement on NUMA systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, pages 381–394, New York, NY, USA, 2013. ACM.

◮ Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 33–48, New York, NY, USA, 2013. ACM.

◮ Jack Dongarra, Kevin London, Shirley Moore, Phil Mucci, and Dan Terpstra. Using PAPI for hardware performance monitoring on Linux systems. In Conference on Linux Clusters: The HPC Revolution, Linux Clusters Institute, 2001.

◮ Andi Drebes, Antoniu Pop, Karine Heydemann, Albert Cohen, and Nathalie Drach. Aftermath: A graphical tool for performance analysis and debugging of fine-grained task-parallel programs and run-time systems. In Seventh Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2014), Vienna, Austria, January 2014.

◮ A.-H. Ghamarian, M. C. W. Geilen, S. Stuijk, T. Basten, A. J. M. Moonen, M. J. G. Bekooij, B. D. Theelen, and M. R. Mousavi. Throughput analysis of synchronous data flow graphs. In Application of Concurrency to System Design, 2006. ACSD 2006. Sixth International Conference on, pages 25–36, 2006.

◮ A. H. Ghamarian, M. C. W. Geilen, T. Basten, and S. Stuijk. Parametric throughput analysis of synchronous data flow graphs. In Design, Automation and Test in Europe, 2008. DATE ’08, pages 116–121, March 2008.

◮ Khaled Jerbi, Daniele Renzi, Damien de Saint-Jorre, Hervé Yviquel, Mickaël Raulet, Claudio Alberti, and Marco Mattavelli. Development and optimization of high level dataflow programs: the HEVC decoder design case. In 48th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, United States, November 2014.

◮ Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. MemProf: A memory profiler for NUMA multicore systems. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC’12, pages 5–5, Berkeley, CA, USA, 2012. USENIX Association.

◮ Edward A. Lee and D. G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–1245, September 1987.

◮ Edward A. Lee and T. M. Parks. Dataflow process networks. Proceedings of the IEEE, 83(5):773–801, May 1995.

◮ Xu Liu and John Mellor-Crummey. A tool to analyze the performance of multithreaded programs on NUMA architectures. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, pages 259–272, New York, NY, USA, 2014. ACM.

◮ I. Amer, C. Lucarz, G. Roquier, M. Mattavelli, M. Raulet, J.-F. Nezan, and O. Deforges. Reconfigurable video coding on multicore. Signal Processing Magazine, IEEE, 26(6):113–123, November 2009.

◮ Daniel Molka, Daniel Hackenberg, Robert Schöne, and Matthias S. Müller. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT ’09, pages 261–270, Washington, DC, USA, 2009. IEEE Computer Society.

◮ http://oprofile.sourceforge.net/.

◮ https://perf.wiki.kernel.org/index.php/.

◮ Manuel Selva, Lionel Morel, Kevin Marquet, and Stéphane Frénot. A monitoring system for runtime adaptations of streaming applications. In Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on, March 2015.

◮ S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal. Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Design Automation Conference, 2007. DAC ’07. 44th ACM/IEEE, pages 777–782, 2007.

◮ William Thies and Saman Amarasinghe. An empirical characterization of stream programs and its implications for language and compiler design. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, page 365, 2010.

◮ Hervé Yviquel, Antoine Lorence, Khaled Jerbi, Gildas Cocherel, Alexandre Sanchez, and Mickaël Raulet. Orcc: Multimedia development made easy. In Proceedings of the 21st ACM International Conference on Multimedia, MM ’13, pages 863–866. ACM, 2013.

Proposal

  • Compile time, extend DF languages: throughput expression and exploitation in SDF
  • Runtime, profile app/resources: throughput violation and bottlenecks identification
  • On-line or off-line, adapt execution choices: adaptation of actors mapping and FIFO location

1st Contrib - Related Work

Use information computed statically at runtime

  • Statically compute SDF maximal throughput [Ghamarian06, Ghamarian08, Stuijk07, Bonfietti10, Bartenstein14]: actors execution time statically known
  • Identify bottlenecks using FIFOs filling [Collins09, Choi12]: complex runtime mechanisms

2nd Contrib - Related Work

Correlate low-level profiling with the DF programming model

  • Non-NUMA-specific profiling abstraction in existing APIs [Dongarra01]
  • Non-DF-aware profilers [Oprofile, Perf, Lachaize2012, Liu2014]
  • Same ideas for a task-parallel programming model [Drebes14]

Cores Imbalance

[Diagram: actors A, B, C and D with every edge rate marked *; timeline of activations per core (axes: cores, time): A1 to A7, B1 to B6, C1 to C3, D1 to D3]