An Empirical Characterization of Stream Programs and its Implications for Language and Compiler Design



slide-1
SLIDE 1

An Empirical Characterization of Stream Programs and its Implications for Language and Compiler Design

Bill Thies¹ and Saman Amarasinghe²

¹ Microsoft Research India
² Massachusetts Institute of Technology

PACT 2010

slide-2
SLIDE 2

What Does it Take to Evaluate a New Language?

[Figure: bar chart of lines of code per language evaluation (axis ticks at 1000 and 2000): StreamIt (PACT'10), AG (LDTA'06), Contessa (FPT'07), RASCAL (SCAM'09), Anne (PLDI'10), NDL (LCTES'04), Teapot (PLDI'96), UR (PLDI'10), Facile (PLDI'01)]

slide-3
SLIDE 3

What Does it Take to Evaluate a New Language?

Small studies make it hard to assess:

  • Experiences of new users over time
  • Common patterns across large programs

[Figure: bar chart of lines of code per language evaluation (axis ticks at 1000 and 2000): StreamIt (PACT'10), AG (LDTA'06), Contessa (FPT'07), RASCAL (SCAM'09), Anne (PLDI'10), NDL (LCTES'04), Teapot (PLDI'96), UR (PLDI'10), Facile (PLDI'01)]

slide-4
SLIDE 4

What Does it Take to Evaluate a New Language?

[Figure: the same bar chart rescaled to 10K, 20K, and 30K lines of code: StreamIt (PACT'10), AG (LDTA'06), Contessa (FPT'07), RASCAL (SCAM'09), Anne (PLDI'10), NDL (LCTES'04), Teapot (PLDI'96), UR (PLDI'10), Facile (PLDI'01)]

slide-6
SLIDE 6

What Does it Take to Evaluate a New Language?

Our characterization:

  • 65 programs
  • 34,000 lines of code
  • Written by 22 students
  • Over period of 8 years

This allows:

  • Non-trivial benchmarks
  • Broad picture of application space
  • Understanding long-term user experience

[Figure: bar chart of lines of code per language evaluation (axis ticks at 10K, 20K, and 30K): StreamIt (PACT'10), AG (LDTA'06), Contessa (FPT'07), RASCAL (SCAM'09), Anne (PLDI'10), NDL (LCTES'04), Teapot (PLDI'96), UR (PLDI'10), Facile (PLDI'01)]

slide-7
SLIDE 7

Streaming Application Domain

  • For programs based on streams of data
    – Audio, video, DSP, networking, and cryptographic processing kernels
    – Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics
  • Properties of stream programs
    – Regular and repeating computation
    – Independent filters with explicit communication

[Figure: example stream graph with nodes AtoD, FMDemod, Duplicate, LPF1 to LPF3, HPF1 to HPF3, RoundRobin, Adder, Speaker]

slide-8
SLIDE 8

StreamIt: A Language and Compiler for Stream Programs

  • Key idea: design language that enables static analysis
  • Goals:
    1. Improve programmer productivity in the streaming domain
    2. Expose and exploit the parallelism in stream programs
  • Project contributions:
    – Language design for streaming [CC'02, CAN'02, PPoPP'05, IJPP'05]
    – Automatic parallelization [ASPLOS'02, G.Hardware'05, ASPLOS'06, MIT'10]
    – Domain-specific optimizations [PLDI'03, CASES'05, MM'08]
    – Cache-aware scheduling [LCTES'03, LCTES'05]
    – Extracting streams from legacy code [MICRO'07]
    – User + application studies [PLDI'05, P-PHEC'05, IPDPS'06]

slide-9
SLIDE 9

StreamIt Language Basics

  • High-level, architecture-independent language
    – Backend support for uniprocessors, multicores (Raw, SMP), cluster of workstations
  • Model of computation: synchronous dataflow [Lee & Messerschmitt, 1987]
    – Program is a graph of independent filters
    – Filters have an atomic execution step with known input / output rates
    – Compiler is responsible for scheduling and buffer management
  • Extensions to synchronous dataflow
    – Dynamic I/O rates
    – Support for sliding window operations
    – Teleport messaging [PPoPP'05]

[Figure: pipeline Input → Decimate → Output, annotated with I/O rates and steady-state multiplicities: Input ×10, Decimate ×1, Output ×1]
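A minimal Python sketch of one steady-state iteration of the Input → Decimate → Output pipeline, assuming Input pushes 1 item per firing, Decimate pops 10 and pushes 1, and Output pops 1. The name `run_steady_state`, the deques standing in for compiler-managed buffers, and the choice that Decimate keeps the first item of each group are illustrative assumptions, not StreamIt's implementation.

```python
from collections import deque


def run_steady_state(source_items):
    """Fire Input x10, Decimate x1, Output x1 and report leftover buffer sizes."""
    items = iter(source_items)
    chan_a = deque()  # channel Input -> Decimate
    chan_b = deque()  # channel Decimate -> Output
    printed = []

    for _ in range(10):                 # Input: push 1, fires 10 times
        chan_a.append(next(items))

    # Decimate: pop 10, push 1, fires once (keeps the first item: an assumption)
    group = [chan_a.popleft() for _ in range(10)]
    chan_b.append(group[0])

    printed.append(chan_b.popleft())    # Output: pop 1, fires once
    return printed, len(chan_a), len(chan_b)


printed, left_a, left_b = run_steady_state(range(100))
print(printed, left_a, left_b)
```

Both channels are empty again after the iteration, which is what lets a static schedule repeat indefinitely with bounded buffers.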

slide-10
SLIDE 10

Example Filter: Low Pass Filter

    float->float filter LowPassFilter (int N, float[N] weights) {
      work peek N push 1 pop 1 {
        float result = 0;
        for (int i = 0; i < weights.length; i++) {
          result += weights[i] * peek(i);
        }
        push(result);
        pop();
      }
    }

[Figure annotations: the filter reads a window of N items; this filter is stateless]
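A rough Python model of the work function's peek / pop / push behavior, written as a sketch rather than as StreamIt's actual runtime. The input channel is modeled as a list plus a read position; `low_pass_filter` and `pos` are hypothetical names.

```python
def low_pass_filter(stream, weights):
    """Fire the work function (peek N, pop 1, push 1) while N items are available."""
    n = len(weights)
    out = []
    pos = 0  # index of the item that pop() would consume next
    while pos + n <= len(stream):
        # peek(i) reads item i of the window without consuming it
        result = sum(weights[i] * stream[pos + i] for i in range(n))
        out.append(result)  # push(result)
        pos += 1            # pop()
    return out


print(low_pass_filter([1, 2, 3, 4], [0.5, 0.5]))
```

Each firing reads only the input and the fixed weights, so no value is carried from one firing to the next.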

slide-11
SLIDE 11

Example Filter: Low Pass Filter

    float->float filter LowPassFilter (int N) {
      float[N] weights;
      work peek N push 1 pop 1 {
        float result = 0;
        for (int i = 0; i < weights.length; i++) {
          result += weights[i] * peek(i);
        }
        weights = adaptChannel();
        push(result);
        pop();
      }
    }

[Figure annotations: the filter reads a window of N items; because weights is updated on each execution, this filter is stateful]

slide-12
SLIDE 12

Structured Streams

  • Each structure is single-input, single-output
  • Hierarchical and composable

[Figure: the four constructs: filter, pipeline, splitjoin (splitter and joiner), and feedback loop (joiner and splitter); each nested block may be any StreamIt language construct]
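The single-input, single-output property is what makes the constructs composable. A small Python sketch of that idea, assuming each stream is a function from a list of items to a list of items; `pipeline`, `splitjoin_duplicate`, and `map_filter` are hypothetical helpers, and the joiner here is a round-robin that takes one item per branch.

```python
def map_filter(fn):
    """A stateless filter that pops 1 item and pushes 1 item per firing."""
    return lambda items: [fn(x) for x in items]


def pipeline(*stages):
    """Connect streams in sequence; the result is itself a stream."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin_duplicate(*branches):
    """Duplicate splitter feeding parallel branches, then a round-robin joiner."""
    def run(items):
        outputs = [branch(list(items)) for branch in branches]
        joined = []
        for group in zip(*outputs):  # one item from each branch in turn
            joined.extend(group)
        return joined
    return run


graph = pipeline(
    map_filter(lambda x: x + 1),
    splitjoin_duplicate(map_filter(lambda x: x * 2), map_filter(lambda x: -x)),
)
print(graph([1, 2]))
```

Because a pipeline or splitjoin has the same shape as a filter, either can be nested wherever a filter is expected.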

slide-13
SLIDE 13

StreamIt Benchmark Suite (1/2)

  • Realistic applications (30):
    – MPEG2 encoder / decoder
    – Ground Moving Target Indicator
    – Mosaic
    – MP3 subset
    – Medium Pulse Compression Radar
    – JPEG decoder / transcoder
    – Feature Aided Tracking
    – HDTV
    – H264 subset
    – Synthetic Aperture Radar
    – GSM Decoder
    – 802.11a transmitter
    – DES encryption
    – Serpent encryption
    – Vocoder
    – RayTracer
    – 3GPP physical layer
    – Radar Array Front End
    – Freq-hopping radio
    – Orthogonal Frequency Division Multiplexer
    – Channel Vocoder
    – Filterbank
    – Target Detector
    – FM Radio
    – DToA Converter

slide-14
SLIDE 14

StreamIt Benchmark Suite (2/2)

  • Libraries / kernels (23):
    – Autocorrelation
    – Cholesky
    – CRC
    – DCT (1D / 2D, float / int)
    – FFT (4 granularities)
    – Lattice
    – Matrix Multiplication
    – Oversampler
    – Rate Convert
    – Time Delay Equalization
    – Trellis
    – VectAdd
  • Graphics pipelines (4):
    – Reference pipeline
    – Phong shading
    – Shadow volumes
    – Particle system
  • Sorting routines (8):
    – Bitonic sort (3 versions)
    – Bubble Sort
    – Comparison counting
    – Insertion sort
    – Merge sort
    – Radix sort

slide-15
SLIDE 15

3GPP

slide-16
SLIDE 16

802.11a

slide-17
SLIDE 17

Bitonic Sort

slide-18
SLIDE 18

Note to online viewers:

For high-resolution stream graphs of all benchmarks, please see pp. 173-240 of this thesis: http://groups.csail.mit.edu/commit/papers/09/thies-phd-thesis.pdf

slide-19
SLIDE 19

Characterization Overview

  • Focus on architecture-independent features
    – Avoid performance artifacts of the StreamIt compiler
    – Estimate execution time statically (not perfect)
  • Three categories of inquiry:
    1. Throughput bottlenecks
    2. Scheduling characteristics
    3. Utilization of StreamIt language features
slide-20
SLIDE 20

Lessons Learned from the StreamIt Language

  • What we did right
  • What we did wrong
  • Opportunities for doing better

slide-21
SLIDE 21
1. Expose Task, Data, & Pipeline Parallelism

Data parallelism

  • Analogous to DOALL loops

Task parallelism

Pipeline parallelism

[Figure: stream graph with a splitter and joiner, annotated with the three kinds of parallelism]

slide-22
SLIDE 22
1. Expose Task, Data, & Pipeline Parallelism

Data parallelism

Task parallelism

Pipeline parallelism

[Figure: stream graph in which a stateless filter is replicated between a splitter and a joiner, annotated with data, task, and pipeline parallelism]

slide-23
SLIDE 23
1. Expose Task, Data, & Pipeline Parallelism

Data parallelism

  • 74% of benchmarks contain entirely data-parallel filters
  • In other benchmarks, 5% to 96% (median 71%) of work is data-parallel

Task parallelism

  • 82% of benchmarks contain at least one splitjoin
  • Median of 8 splitjoins per benchmark

Pipeline parallelism

[Figure: stream graph with splitters and joiners, annotated with data, task, and pipeline parallelism]

slide-24
SLIDE 24

Characterizing Stateful Filters

[Figure: pie charts. 763 filter types: 94% stateless, 6% stateful. 49 stateful types: 55% avoidable state, 45% algorithmic state]

Sources of Algorithmic State

    – MPEG2: bit-alignment, reference frame encoding, motion prediction, …
    – HDTV: Pre-coding and Ungerboeck encoding
    – HDTV + Trellis: Ungerboeck decoding
    – GSM: Feedback loops
    – Vocoder: Accumulator, adaptive filter, feedback loop
    – OFDM: Incremental phase correction
    – Graphics pipelines: persistent screen buffers

slide-25
SLIDE 25

Characterizing Stateful Filters

[Figure: pie charts. 763 filter types: 94% stateless, 6% stateful. 49 stateful types: 55% avoidable state, 45% algorithmic state. Of the 27 types with “avoidable state”, one slice is due to induction variables]

Sources of Algorithmic State

    – MPEG2: bit-alignment, reference frame encoding, motion prediction, …
    – HDTV: Pre-coding and Ungerboeck encoding
    – HDTV + Trellis: Ungerboeck decoding
    – GSM: Feedback loops
    – Vocoder: Accumulator, adaptive filter, feedback loop
    – OFDM: Incremental phase correction
    – Graphics pipelines: persistent screen buffers

slide-26
SLIDE 26

2. Eliminate Stateful Induction Variables

[Figure: pie charts. 763 filter types: 94% stateless, 6% stateful. 49 stateful types: 55% avoidable state, 45% algorithmic state. Of the 27 types with “avoidable state”, one slice is due to induction variables]

Sources of Induction Variables

    – MPEG encoder: counts frame # to assign picture type
    – MPD / Radar: count position in logical vector for FIR
    – Trellis: noise source flips every N items
    – MPEG encoder / MPD: maintain logical 2D position (row/column)
    – MPD: reset accumulator when counter overflows

Opportunity: Language primitive to return current iteration?

slide-27
SLIDE 27

2. Eliminate Stateful Induction Variables

[Figure: pie charts. 763 filter types: 94% stateless, 6% stateful. 49 stateful types: 55% avoidable state, 45% algorithmic state. The 27 types with “avoidable state” are split into: due to induction variables, due to granularity, due to message handlers]

Sources of Induction Variables

    – MPEG encoder: counts frame # to assign picture type
    – MPD / Radar: count position in logical vector for FIR
    – Trellis: noise source flips every N items
    – MPEG encoder / MPD: maintain logical 2D position (row/column)
    – MPD: reset accumulator when counter overflows

Opportunity: Language primitive to return current iteration?
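A hedged Python sketch of what such a primitive could buy. The first function keeps a frame counter as filter state; the second receives the iteration number as an argument, standing in for a hypothetical language primitive (none is confirmed by the slides). The picture-type rule (an "I" frame every `gop` frames, "P" otherwise) is an invented stand-in for the MPEG encoder's logic.

```python
def picture_types_stateful(num_frames, gop=3):
    """Counter kept across firings: each firing depends on the previous one."""
    frame = 0  # induction variable stored as filter state
    out = []
    for _ in range(num_frames):
        out.append("I" if frame % gop == 0 else "P")
        frame += 1
    return out


def picture_type(iteration, gop=3):
    """Stateless work function, given the current iteration number."""
    return "I" if iteration % gop == 0 else "P"


def picture_types_stateless(num_frames, gop=3):
    # Firings are independent, so they could run in any order or in parallel.
    return [picture_type(i, gop) for i in range(num_frames)]


print(picture_types_stateful(7))
print(picture_types_stateless(7))
```

The two versions produce the same output, but only the second leaves the filter free of mutable state.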

slide-28
SLIDE 28
3. Expose Parallelism in Sliding Windows

[Figure: an FIR filter sliding a window over input items 1 to 11 to produce the output]

  • Legacy codes obscure parallelism in sliding windows
    – In von-Neumann languages, modulo functions or copy/shift operations prevent detection of parallelism in sliding windows
  • Sliding windows are prevalent in our benchmark suite
    – 57% of realistic applications contain at least one sliding window
    – Programs with sliding windows have 10 instances on average
    – Without this parallelism, 11 of our benchmarks would have a new throughput bottleneck (work: 3% - 98%, median 8%)
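To make the contrast concrete, here is an illustrative Python sketch of the same FIR computation written two ways. `fir_circular` imitates the legacy style, with a circular buffer indexed by modulo arithmetic that chains every iteration to the previous one through `buf` and `head`. `fir_window` computes each output directly from its own input window, as a peek-based filter would. Both function names are hypothetical.

```python
def fir_circular(samples, weights):
    """Legacy style: circular buffer plus modulo indexing carried across iterations."""
    n = len(weights)
    buf = [0] * n
    head = 0  # slot that will be overwritten next; also the oldest item once full
    out = []
    for count, x in enumerate(samples, start=1):
        buf[head] = x
        head = (head + 1) % n
        if count >= n:
            out.append(sum(weights[i] * buf[(head + i) % n] for i in range(n)))
    return out


def fir_window(samples, weights):
    """Sliding-window style: output k depends only on samples[k : k + n]."""
    n = len(weights)
    return [
        sum(weights[i] * samples[k + i] for i in range(n))
        for k in range(len(samples) - n + 1)
    ]


data = [1, 2, 3, 4, 5, 6, 7]
print(fir_circular(data, [1, 2, 3]))
print(fir_window(data, [1, 2, 3]))
```

In the second form the outputs are visibly independent of one another, which is the parallelism that the buffer management in the first form hides.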

slide-29
SLIDE 29

Characterizing Sliding Windows

34 Sliding Window Types

  • 44% FIR filters (push 1, pop 1, peek N)
    – 3GPP, OFDM, Filterbank, TargetDetect, DToA, Oversampler, RateConvert, Vocoder, ChannelVocoder, FMRadio
  • 29% One-item windows (pop N, peek N+1)
    – Mosaic, HDTV, FMRadio, JPEG decode / transcode, Vocoder
  • 27% Miscellaneous
    – MP3: reordering (peek >1000)
    – 802.11: error codes (peek 3-7)
    – Vocoder / A.beam: skip data
    – Channel Vocoder: sliding correlation (peek 100)

slide-30
SLIDE 30
4. Expose Startup Behaviors

  • Example: difference encoder (JPEG, Vocoder)

Stateful:

    int->int filter Diff_Encoder() {
      int state = 0;
      work push 1 pop 1 {
        push(peek(0) - state);
        state = pop();
      }
    }

Stateless:

    int->int filter Diff_Encoder() {
      prework push 1 peek 1 {
        push(peek(0));
      }
      work push 1 pop 1 peek 2 {
        push(peek(1) - peek(0));
        pop();
      }
    }

  • Required by 15 programs:
    – For delay: MPD, HDTV, Vocoder, 3GPP, Filterbank, DToA, Lattice, Trellis, GSM, CRC
    – For picture reordering (MPEG)
    – For initialization (MPD, HDTV, 802.11)
    – For difference encoder or decoder: JPEG, Vocoder
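A small Python sketch comparing the two difference encoders. It assumes the prework step pushes the first item without consuming it, so that the first steady-state firing can still peek at it. The function names are hypothetical and the lists stand in for StreamIt channels.

```python
def diff_encoder_stateful(items):
    """Remembers the previous item in a field, as in the stateful version."""
    state = 0
    out = []
    for x in items:
        out.append(x - state)  # push(peek(0) - state)
        state = x              # state = pop()
    return out


def diff_encoder_prework(items):
    """Prework handles the first item; the steady state uses a two-item window."""
    if not items:
        return []
    out = [items[0]]                         # prework: push(peek(0))
    for k in range(len(items) - 1):
        out.append(items[k + 1] - items[k])  # work: push(peek(1) - peek(0)); pop()
    return out


print(diff_encoder_stateful([5, 7, 4]))
print(diff_encoder_prework([5, 7, 4]))
```

The outputs match, and the second version carries nothing between firings apart from the items still sitting on its input.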

slide-31
SLIDE 31
5. Surprise: Mis-Matched Data Rates Uncommon

[Figure: CD-DAT benchmark, a pipeline of four filters with I/O rates 1→2, 3→2, 7→8, 7→5 and multiplicities ×147, ×98, ×28, ×32]

Converts CD audio (44.1 kHz) to digital audio tape (48 kHz)

  • This is a driving application in many papers
    – E.g.: [MBL94] [TZB99] [BB00] [BML95] [CBL01] [MB04] [KSB08]
    – Due to large filter multiplicities, clever scheduling is needed to control code size, buffer size, and latency
  • But are mis-matched rates common in practice? No!
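The CD-DAT multiplicities follow from the synchronous dataflow balance equations: for each pair of adjacent filters, items pushed per steady state must equal items popped. A minimal Python sketch, reading the slide's rate annotations as (pop, push) pairs of 1→2, 3→2, 7→8, 7→5; that reading is an assumption, though it is consistent with the 44.1 kHz to 48 kHz ratio of 147:160. The function name `multiplicities` is hypothetical.

```python
from fractions import Fraction
from math import gcd


def multiplicities(rates):
    """Smallest positive integer firing counts for a pipeline of (pop, push) filters."""
    m = [Fraction(1)]
    for (_, push_prev), (pop_next, _) in zip(rates, rates[1:]):
        # balance equation: m_prev * push_prev == m_next * pop_next
        m.append(m[-1] * push_prev / pop_next)
    scale = 1
    for f in m:  # least common multiple of all denominators
        scale = scale * f.denominator // gcd(scale, f.denominator)
    return [int(f * scale) for f in m]


cd_dat = [(1, 2), (3, 2), (7, 8), (7, 5)]
print(multiplicities(cd_dat))
```

With matched rates such as `[(1, 1), (1, 1)]` the same function returns all ones, which is the common case reported for the benchmark suite.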
slide-32
SLIDE 32
5. Surprise: Mis-Matched Data Rates Uncommon

[Figure: excerpt from JPEG transcoder stream graph; annotation: execute once per steady state]

slide-33
SLIDE 33

Characterizing Mis-Matched Data Rates

  • In our benchmark suite:
    – 89% of programs have a filter with a multiplicity of 1
    – On average, 63% of filters share the same multiplicity
    – For 68% of benchmarks, the most common multiplicity is 1
  • Implication for compiler design: Do not expect advanced buffering strategies to have a large impact on average programs
    – Example: Karczmarek, Thies, & Amarasinghe, LCTES'03
    – Space saved on CD-DAT: 14x
    – Space saved on other programs (median): 1.2x

slide-34
SLIDE 34
6. Surprise: Multi-Phase Filters Cause More Harm than Good

  • A multi-phase filter divides its execution into many steps
    – Formally known as cyclo-static dataflow
    – Possible benefits:
        • Shorter latencies
        • More natural code

[Figure: a filter F executing in two phases, Step 1 and Step 2, each with its own I/O rates]

  • We implemented multi-phase filters, and we regretted it
    – Programmers did not understand the difference between a phase of execution and a normal function call
    – Compiler was complicated by the presence of phases
  • However, phases proved important for splitters / joiners
    – Routing items needs to be done with minimal latency
    – Otherwise buffers grow large, and deadlock in one case (GSM)

slide-35
SLIDE 35
7. Programmers Introduce Unnecessary State in Filters

  • Programmers do not implement things how you expect

Stateful:

    void->int filter SquareWave() {
      int x = 0;
      work push 1 {
        push(x);
        x = 1 - x;
      }
    }

Stateless:

    void->int filter SquareWave() {
      work push 2 {
        push(0);
        push(1);
      }
    }

  • Opportunity: add a “stateful” modifier to filter decl?
    – Require programmer to be cognizant of the cost of state
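As a quick check that the two SquareWave variants emit the same stream, here is a Python sketch in which each loop iteration models one firing of the work function. The function names and the `firings` parameter are hypothetical.

```python
def square_wave_stateful(firings):
    """work push 1: toggles a field x on every firing."""
    x = 0
    out = []
    for _ in range(firings):
        out.append(x)  # push(x)
        x = 1 - x
    return out


def square_wave_stateless(firings):
    """work push 2: emits one full period per firing and keeps no state."""
    out = []
    for _ in range(firings):
        out.extend([0, 1])  # push(0); push(1)
    return out


print(square_wave_stateful(6))
print(square_wave_stateless(3))
```

Six firings of the stateful filter match three firings of the stateless one; only the push rate differs.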

slide-36
SLIDE 36
8. Leverage and Improve Upon Structured Streams

  • Overall, programmers found it useful and tractable to write programs using structured streams
    – Syntax is simple to write, easy to read
  • However, structured streams are occasionally unnatural
    – And, in rare cases, insufficient

slide-37
SLIDE 37
8. Leverage and Improve Upon Structured Streams

[Figure: an original (unstructured) stream graph next to its structured equivalent]

Compiler recovers unstructured graph using synchronization removal [Gordon 2010]

slide-38
SLIDE 38
8. Leverage and Improve Upon Structured Streams

[Figure: an original (unstructured) stream graph next to its structured equivalent]

  • Characterization:
    – 49% of benchmarks have an Identity node
    – In those benchmarks, Identities account for 3% to 86% (median 20%) of instances
  • Opportunity:
    – Bypass capability (ala GOTO) for streams

slide-39
SLIDE 39

Related Work

  • Benchmark suites in von-Neumann languages often include stream programs, but lose high-level properties
    – MediaBench
    – ALPBench
    – HandBench
    – MiBench
    – SPEC
    – PARSEC
    – Berkeley MM Workload
    – NetBench
    – Perfect Club
  • Brook language includes 17K LOC benchmark suite
    – Brook disallows stateful filters; hence, more data parallelism
    – Also more focus on dynamic rates & flexible program behavior
  • Other stream languages lack benchmark characterization
    – StreamC / KernelC
    – Cg
    – Baker
    – SPUR
    – Spidle
  • In-depth analysis of 12 StreamIt “core” benchmarks published concurrently to this paper [Gordon 2010]

slide-40
SLIDE 40

Conclusions

  • First characterization of a streaming benchmark suite that was written in a stream programming language
    – 65 programs; 22 programmers; 34 KLOC
  • Implications for streaming languages and compilers:
    – DO: expose task, data, and pipeline parallelism
    – DO: expose parallelism in sliding windows
    – DO: expose startup behaviors
    – DO NOT: optimize for unusual case of mis-matched I/O rates
    – DO NOT: bother with multi-phase filters
    – TRY: to prevent users from introducing unnecessary state
    – TRY: to leverage and improve upon structured streams
    – TRY: to prevent induction variables from serializing filters
  • Exercise care in generalizing results beyond StreamIt
slide-41
SLIDE 41

Acknowledgments: Authors of the StreamIt Benchmarks

  • Sitij Agrawal
  • Basier Aziz
  • Jiawen Chen
  • Matthew Drake
  • Shirley Fung
  • Michael Gordon
  • Ola Johnsson
  • Andrew Lamb
  • Chris Leger
  • Michal Karczmarek
  • David Maze
  • Ali Meli
  • Mani Narayanan
  • Satish Ramaswamy
  • Rodric Rabbah
  • Janis Sermulins
  • Magnus Stenemo
  • Jinwoo Suh
  • Zain ul-Abdin
  • Amy Williams
  • Jeremy Wong