SLIDE 1

Learning to Generate Fast Signal Processing Implementations

Bryan Singer
Joint work with Manuela Veloso
Shorter version to be presented at ICML-2001

SLIDE 2

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 3

Signal Processing

Many signal processing algorithms:

  • take as input a signal X represented as a vector
  • produce a transformation of the signal, Y = AX

Issue:

  • A naïve implementation as a matrix multiplication is slow

Example signal processing applications:

  • Real time audio, image, speech processing
  • Analysis of large data sets
SLIDE 4

Factoring Signal Transforms

  • Transformation matrices are highly structured
  • Can factor transformation matrices
  • Factorizations allow for faster implementations
SLIDE 5

Walsh-Hadamard Transform (WHT)

Highly structured, for example:

$$WHT(2^2) = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$

Factorization or break-down rule:

$$WHT(2^n) = \prod_{i=1}^{t} \left( I_{2^{n_1+\cdots+n_{i-1}}} \otimes WHT(2^{n_i}) \otimes I_{2^{n_{i+1}+\cdots+n_t}} \right)$$

for positive integers n_i such that n = n_1 + ··· + n_t

SLIDE 6

WHT Example

$$WHT(2^5) = [WHT(2^3) \otimes I_{2^2}]\,[I_{2^3} \otimes WHT(2^2)]$$
$$= [\{(WHT(2^1) \otimes I_{2^2})(I_{2^1} \otimes WHT(2^2))\} \otimes I_{2^2}]\,[I_{2^3} \otimes \{(WHT(2^1) \otimes I_{2^1})(I_{2^1} \otimes WHT(2^1))\}]$$

We can visualize this as a split tree: the root 5 splits into 3 and 2, the node 3 splits into 1 and 2, and the node 2 splits into 1 and 1.

[Split tree diagram: 5 → (3 → (1, 2)), (2 → (1, 1))]

1-1 correspondence between split trees and formulas
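This correspondence can be checked mechanically. Below is a small sketch using our own nested-tuple encoding of split trees (not the WHT package's representation): a tree is an integer n for a WHT(2^n) leaf, or a tuple of subtrees; expanding a tree applies the break-down rule at every internal node.

```python
import numpy as np
from functools import reduce

H2 = np.array([[1, 1], [1, -1]])

def size(tree):
    """Total exponent n of a (sub)tree."""
    return tree if isinstance(tree, int) else sum(size(t) for t in tree)

def expand(tree):
    """Expand a split tree into the matrix of its formula."""
    if isinstance(tree, int):                 # leaf: direct WHT(2^n)
        return reduce(np.kron, [H2] * tree)
    sizes = [size(t) for t in tree]
    result = np.eye(2 ** sum(sizes))
    for i, sub in enumerate(tree):            # one break-down-rule factor per child
        left = 2 ** sum(sizes[:i])            # I of size 2^{n_1+...+n_{i-1}}
        right = 2 ** sum(sizes[i + 1:])       # I of size 2^{n_{i+1}+...+n_t}
        result = result @ np.kron(np.kron(np.eye(left), expand(sub)), np.eye(right))
    return result

# The split tree from this slide: 5 -> (3 -> (1, 2)), (2 -> (1, 1))
tree = ((1, 2), (1, 1))
assert np.allclose(expand(tree), reduce(np.kron, [H2] * 5))
```

Every split tree expands to the same transform; only the operation schedule, and hence the running time, differs.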

SLIDE 7

Search Space

Large number of factorizations:

  • WHT(2^n) has Θ((4 + √8)^n / n^{3/2}) different split trees
  • WHT(2^n) has Θ(5^n / n^{3/2}) different binary split trees
  • WHT(2^10) has 51,819 binary split trees
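The binary count follows from a short recurrence: a node of size n is either a leaf, or it splits into children of sizes k and n−k. A minimal sketch, assuming leaves may be any size 2^1 through 2^8 (the unrolled-code sizes noted later in the talk); it reproduces the 51,819 figure above.

```python
from functools import lru_cache

MAX_LEAF = 8  # leaves are unrolled code for sizes 2^1 .. 2^8

@lru_cache(maxsize=None)
def binary_split_trees(n):
    """Number of binary split trees for WHT(2^n)."""
    count = 1 if 1 <= n <= MAX_LEAF else 0            # option 1: the node is a leaf
    count += sum(binary_split_trees(k) * binary_split_trees(n - k)
                 for k in range(1, n))                # option 2: split into k and n-k
    return count

print(binary_split_trees(10))  # -> 51819, matching the slide
```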
SLIDE 8

Varying Performance

Varying performance of factorizations:

  • Formulas have very different running times
  • Small changes in the split tree can lead to significantly different running times
  • Optimal formulas are different across machines

Reasons:

  • Cache performance
  • Utilization of execution units
  • Number of registers
SLIDE 9

Histogram of WHT(2^16) Running Times

[Histogram: number of formulas vs. running time in CPU cycles (×10^7)]

SLIDE 10

Problem

Huge search space of formulas. Want to find the fastest formula:

  • For a given transform
  • For a given size
  • For a given machine
  • But for any input vector

Our Approach: Learn to generate fast formulas

  • Learn to predict cache misses for leaves
  • Use this as the base case for determining the values of different splittings
  • Construct fast formulas by choosing the best splittings
SLIDE 11

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 12

Run Times and Cache Misses

[Scatter plot: level 1 data cache misses vs. runtime in CPU cycles]
SLIDE 13

Run Times and Cache Misses

  • Fastest formula has minimal number of cache misses
  • Minimizing cache misses produces a small group of formulas which contains the fastest formula

SLIDE 14

WHT Leaves

  • WHT leaves are implemented as unrolled code (sizes 2^1 to 2^8)
  • Internal nodes recursively call their children
  • All run time and cache misses occur in the leaves
  • Total run time or cache misses of a formula is the sum of that incurred by the leaves
  • If we can predict for leaves, then we can predict for entire formulas

SLIDE 15

Leaf Cache Misses: WHT(2^16) Example

[Histogram: number of leaves vs. level 1 data cache misses, with marks at 2^14, 2^15, 2^16, and 2^17]

SLIDE 16

Leaf Cache Misses

  • The number of cache misses incurred by leaves is only one of a few possible values
  • These values are fractions of the transform size
  • We can predict one of a few categories instead of a real-valued number of cache misses
  • We can learn across different sizes by learning the categories corresponding to fractions of the transform size

SLIDE 17

Review of Observations

  • Fastest formula has minimal number of cache misses
  • All computation performed in the leaves
  • Leaf cache misses only have a few values
  • Leaf cache misses are fractions of transform size
SLIDE 18

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 19

Predicting Leaf Cache Misses

  • Want to learn to accurately predict leaf cache misses
  • Should then be able to predict cache misses for entire formulas

SLIDE 20

Learning Algorithm

  1. Collect cache misses for leaves of WHT formulas
  2. Classify (cache misses / transform size) as:
     • near-zero if less than 1/8
     • near-quarter if less than 1/2
     • near-whole if less than 3/2
     • large otherwise
  3. Train a classification algorithm to predict one of the four classes given a leaf
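Step 2's bucketing, written out as a tiny helper (the function name is ours; the thresholds are from the slide):

```python
def miss_category(cache_misses, transform_size):
    """Bucket a leaf's cache misses as a fraction of the transform size."""
    fraction = cache_misses / transform_size
    if fraction < 1 / 8:
        return "near-zero"
    if fraction < 1 / 2:
        return "near-quarter"
    if fraction < 3 / 2:
        return "near-whole"
    return "large"

assert miss_category(2 ** 14, 2 ** 16) == "near-quarter"  # misses = size / 4
```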

SLIDE 21

Features for WHT Leaves

Need to describe WHT leaves with features. Could use:

  • Size of the given leaf
  • Stride of the given leaf

Stride:

  • Determines how a node accesses its input and output data
  • Greatly impacts cache performance
  • Determined by the location of the node in the split tree
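How stride follows from tree location: in the break-down rule, the factor I ⊗ WHT(2^{n_i}) ⊗ I_{2^{n_{i+1}+···+n_t}} applies WHT(2^{n_i}) to subvectors at stride 2^{n_{i+1}+···+n_t}. A minimal sketch under that convention (our own code, not the WHT package's traversal) that walks a split tree and collects each leaf's (size, stride):

```python
def size(tree):
    return tree if isinstance(tree, int) else sum(size(t) for t in tree)

def leaf_features(tree, stride=1):
    """Yield (size, stride) for each leaf of a split tree.

    A node at stride s whose children have sizes n_1..n_t gives child i
    the stride s * 2^(n_{i+1}+...+n_t), from the Kronecker factor above.
    """
    if isinstance(tree, int):
        yield (tree, stride)
        return
    sizes = [size(t) for t in tree]
    for i, sub in enumerate(tree):
        yield from leaf_features(sub, stride * 2 ** sum(sizes[i + 1:]))

# Leaves of the WHT(2^5) split tree from slide 6, with their strides:
print(list(leaf_features(((1, 2), (1, 1)))))  # [(1, 16), (2, 4), (1, 2), (1, 1)]
```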
SLIDE 22

More Features for WHT Leaves

  • Size and stride of the given leaf
  • Size and stride of the parent of the given leaf
  • Size and stride of the common parent

[Split tree diagram: nodes A through G, with each leaf annotated with its previous leaf (PrevLeaf) and the common parent (ComPar) of the leaf and that previous leaf]

SLIDE 23

Review: Learning Algorithm

  1. Collect cache misses for leaves of WHT formulas
  2. Classify (cache misses / transform size) as:
     • near-zero if less than 1/8
     • near-quarter if less than 1/2
     • near-whole if less than 3/2
     • large otherwise
  3. Describe leaves with features
  4. Train a classification algorithm to predict one of the four classes given the features for a leaf

SLIDE 24

Evaluation

  • Trained a decision tree
  • Used a random 10% of the leaves of all binary WHT(2^16) split trees with no leaves of size 2^1
  • Evaluated performance using subsets of formulas known to be fastest
  • Cannot evaluate over all formulas because there are too many possible formulas
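A sketch of the training step, using scikit-learn as a stand-in (the slides do not name the decision-tree implementation). The feature rows follow slides 21-22: size and stride of the leaf, its parent, and the common parent; the rows below are illustrative values only, not measured data.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows: (leaf size, leaf stride, parent size, parent stride,
# common-parent size, common-parent stride), encoded as log2 values.
X = [
    [2, 0, 3, 0, 16, 0],
    [4, 12, 6, 10, 16, 0],
    [3, 13, 5, 11, 16, 0],
]
y = ["near-zero", "large", "near-whole"]  # categories from slide 20

clf = DecisionTreeClassifier()
clf.fit(X, y)
print(clf.predict([[2, 0, 3, 0, 16, 0]]))
```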

SLIDE 25

Leaf Cache Miss Category Performance

Error rates for predicting the cache miss category incurred by leaves:

  Binary No-2^1-Leaf        Binary No-2^1-Leaf Rightmost
  Size     Errors           Size     Errors
  2^12     0.5%             2^17     1.7%
  2^13     1.7%             2^18     1.7%
  2^14     0.9%             2^19     1.7%
  2^15     0.9%             2^20     1.6%
  2^16     0.7%             2^21     1.6%

Trained on one size, performs well across many!

SLIDE 26

Predicting Cache Misses for Entire Formulas

Average percentage error for predicting cache misses for entire formulas:

  Binary No-2^1-Leaf        Binary No-2^1-Leaf Rightmost
  Size     Errors           Size     Errors
  2^12     12.7%            2^17     8.2%
  2^13     8.6%             2^18     8.2%
  2^14     6.7%             2^19     7.9%
  2^15     5.2%             2^20     8.1%
  2^16     4.6%             2^21     10.4%

$$\text{Error} = \frac{1}{|TestSet|} \sum_{i \in TestSet} \frac{|a_i - p_i|}{a_i}$$

where a_i and p_i are the actual and predicted number of cache misses for formula i.
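The same metric as code (function name ours):

```python
def avg_pct_error(actual, predicted):
    """Mean of |a_i - p_i| / a_i over a test set."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

print(avg_pct_error([100.0, 200.0], [90.0, 230.0]))  # 0.125, i.e. 12.5%
```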

SLIDE 27

Runtime Versus Predicted Cache Misses

[Two scatter plots of predicted number of cache misses vs. actual running time in CPU cycles: Binary No-2^1-Leaf WHT(2^14), and Binary No-2^1-Leaf Rightmost WHT(2^20)]
SLIDE 28

Review: Predicting Cache Misses

By learning to predict leaf cache misses:

  • Accurately predict cache misses for entire formulas
  • Fastest formulas have fewest predicted cache misses
  • Predict accurately across many transform sizes while trained on only one size

SLIDE 29

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 30

Generating Fast Formulas

  • Can now quickly predict cache misses for a formula
  • Fastest formulas have minimal cache misses
  • But still MANY formulas to search through

Can we learn to generate fast formulas?

SLIDE 31

Generating Fast Formulas: Approach

Control Learning Problem:

  • Learn to control the generation of formulas to produce fast ones

[Diagram: a single root node of size 20]

Want to grow the fastest WHT split tree:

  • Begin with a root node of the desired size
SLIDE 32

Generating Fast Formulas: Approach

Control Learning Problem:

  • Learn to control the generation of formulas to produce fast ones

[Diagram: root node 20 with children 4 and 16]

Want to grow the fastest WHT split tree:

  • Begin with a root node of the desired size
  • Grow the best possible children
SLIDE 33

Generating Fast Formulas: Approach

Control Learning Problem:

  • Learn to control the generation of formulas to produce fast ones

[Diagram: root node 20 with children 4 and 16; node 4 has children 2 and 2]

Want to grow the fastest WHT split tree:

  • Begin with a root node of the desired size
  • Grow the best possible children
  • Recurse on each of the children
SLIDE 34

Generating Fast Formulas: Approach

  • Try to formulate in terms of Markov Decision Processes (MDPs) and Reinforcement Learning (RL)
  • Final formulation is not an MDP
  • Final formulation borrows concepts from RL
SLIDE 35

MDPs

An MDP is a tuple (S, A, T, C):

  • S is a set of states
  • A is a set of actions
  • T: S × A → S is a transition function that maps the current state and action to the next state
  • C: S × A → ℝ is a cost function that maps the current state and action onto its real-valued cost

Markov Property: T and C depend only on the current state and action
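As a data structure, this (deterministic) tuple is tiny; a minimal typed sketch, with field names of our choosing:

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, TypeVar

S = TypeVar("S")  # states
A = TypeVar("A")  # actions

@dataclass(frozen=True)
class MDP(Generic[S, A]):
    actions: Callable[[S], List[A]]   # actions available in a state
    transition: Callable[[S, A], S]   # T: S x A -> S (deterministic here)
    cost: Callable[[S, A], float]     # C: S x A -> R
```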

slide-36
SLIDE 36

MDPs and RL

Agent:

  • Observes current state
  • Selects action to take
  • Receives the cost for that action in that state
  • Observes next state, and repeat

Reinforcement learning provides methods for finding a policy π: S → A that selects at each state the action minimizing the sum of costs incurred

SLIDE 37

Basic Formulation

Given a size, want to grow a fast WHT split tree:

  • States = unexpanded nodes in the split tree
  • Start state = root node of the given size with no children
  • Actions = ways to split a node, giving it children, OR make the node a leaf
  • Cost Function:
     • Zero when giving children to a node
     • The leaf's run time when making a node a leaf
  • Goal = minimize the sum of costs
SLIDE 38

Detail: State Space Representation

States = unexpanded nodes in the split tree. But how do we represent the states?

Modified leaf features for arbitrary nodes:

  • Size and stride of the given node
  • Size and stride of the parent of the given node
  • Size and stride of the common parent to this node
SLIDE 39

Detail: Cost Function

Ideal Cost Function:

  • Zero when giving children to a node
  • The leaf's run time when making a node a leaf

But a leaf's run time is not easily obtained. However, we can predict cache misses for leaves!

Used Cost Function:

  • Zero when giving children to a node
  • The leaf's predicted cache misses when making a node a leaf

Now we really minimize the number of cache misses

SLIDE 40

Difficulty: Transition Function

What is the transition function for this problem? Given that 2 children of the root are grown:

  • Which node is the next state?
  • When will we transition back to the sibling?
  • Where to transition to from a leaf node?
  • And still maintain the Markov property?

We depart from the MDP framework here . . .

SLIDE 41

Our Approach

Problem advantages:

  • Deterministic and known actions
  • Deterministic and known cost function (the learned decision tree)

Approach:

  • Define an optimal value function on states
  • Run dynamic programming to determine the value function (basically like solving an MDP)

SLIDE 42

Value Function

Define an optimal value function on states:

  • Value of a state is the cost of the best subtree
  • Value of root node is the cost of the best formula
  • Choose children that have minimal sum of values
SLIDE 43

Mathematically: Value Function on States

State = unexpanded node in the split tree, described by 6 features

The optimal value of a state is:

$$V^*(state) = \min_{subtrees} \sum_{leaf \in subtree} CacheMisses(leaf)$$

  • Min over all possible subtrees of the given state
  • CacheMisses() returns the predicted number of cache misses for the given leaf

SLIDE 44

Recursive Formulation of Value Function

Define:

$$LeafCM(state) = \begin{cases} CacheMisses(state), & \text{if } state \text{ can be a leaf} \\ \infty, & \text{if } state \text{ cannot be a leaf} \end{cases}$$

and

$$SplitV(state) = \min_{splittings} \sum_{child \in splitting} V^*(child)$$

Then:

$$V^*(state) = \min\{LeafCM(state),\ SplitV(state)\}$$

SLIDE 45

Computing the Value Function

Use dynamic programming to calculate the value function:

  • Consider all possible sets of children of the root
  • Recursively call DP on each of the children, memoizing results
  • Determine the set of children with the minimal sum of values
  • The root's value is this minimal sum of values
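A compact sketch of this dynamic program, under two simplifying assumptions: splits are binary, and `predicted_cache_misses` is a placeholder for the learned decision tree (real states carry the six size/stride features rather than just the node size). The `best_tree` helper performs the generation step described on the next slide.

```python
from functools import lru_cache

MAX_LEAF = 8  # leaves are unrolled code for sizes 2^1 .. 2^8

def predicted_cache_misses(n):
    """Placeholder for the learned predictor: misses for a WHT(2^n) leaf."""
    return float(2 ** n)  # illustrative stand-in only

@lru_cache(maxsize=None)
def value(n):
    """V*(n): minimal predicted cache misses over all subtrees of size n."""
    leaf_cm = predicted_cache_misses(n) if n <= MAX_LEAF else float("inf")
    split_v = min((value(k) + value(n - k) for k in range(1, n)),
                  default=float("inf"))
    return min(leaf_cm, split_v)

def best_tree(n):
    """Grow a split tree realizing V*(n) by choosing the best splitting."""
    if n <= MAX_LEAF and predicted_cache_misses(n) <= value(n):
        return n                                     # make this node a leaf
    k = min(range(1, n), key=lambda k: value(k) + value(n - k))
    return (best_tree(k), best_tree(n - k))

print(value(20), best_tree(20))
```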
SLIDE 46

Generating Fast Formulas

Generate the split tree with minimal value (or near minimal):

  • Consider all possible sets of children of the root
  • Choose those that have the minimal sum of values
  • Recurse on the children
SLIDE 47

Evaluation

Difficulty:

  • Do not know what the optimal formula is
  • Too many formulas to exhaust over at larger sizes

Possible:

  • Exhaust over limited subspaces
  • Limit based on signal processing knowledge and prior experience from using different search methods
  • Compare my method with the best found by this limited exhaustive search

SLIDE 48

Fast Formula Generation Results

  Size    Formulas     Fastest Known    # of Fastest Formulas
          Generated    Generated?       Included in Generated
  2^12    101          yes              77
  2^13    86           yes              4
  2^14    101          yes              70
  2^15    86           yes              11
  2^16    101          yes              68
  2^17    86           yes              15
  2^18    101          yes              25
  2^19    86           yes              16
  2^20    101          yes              16

SLIDE 49

Histograms: WHT(2^20)

[Two histograms of number of formulas vs. running time in CPU cycles (×10^8), Binary No-2^1-Leaf Rightmost: the limited exhaustive search, and our method]

SLIDE 50

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 51

Conclusions

  • New method for constructing fast WHT formulas
  • Generates the fastest known formulas!
  • Method can be trained on data for one size and perform well across many sizes
  • Also, can learn to accurately predict cache misses of formulas

Ongoing and future work:

  • Test and extend to other architectures
  • Extend to other transforms
SLIDE 52

Acknowledgements

SPIRAL group:

  • José Moura, ECE, CMU
  • Manuela Veloso, CS, CMU
  • Jeremy Johnson, MCS, Drexel
  • Bob Johnson, MathStar
  • David Padua, CS, University of Illinois
  • Viktor Prasanna, CS, USC
  • Markus Püschel, ECE, CMU
  • Gavin Haentjens, ECE, CMU
  • David Sepiashvili, ECE, CMU
  • Jianxin Xiong, CS, University of Illinois
SLIDE 53

Questions?