SLIDE 1

Learning to Generate Fast Signal Processing Implementations

Bryan Singer
Joint work with Manuela Veloso
Shorter version to be presented at ICML-2001

SLIDE 2

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 3

Signal Processing

Many signal processing algorithms:

  • take as input a signal X represented as a vector
  • produce a transformation of the signal, Y = AX

Issue:

  • A naïve implementation as a matrix multiplication is slow

Example signal processing applications:

  • Real time audio, image, speech processing
  • Analysis of large data sets
SLIDE 4

Factoring Signal Transforms

  • Transformation matrices are highly structured
  • Can factor transformation matrices
  • Factorizations allow for faster implementations
SLIDE 5

Walsh-Hadamard Transform (WHT)

Highly structured, for example:

$$WHT(2^2) = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$

Factorization or break-down rule:

$$WHT(2^n) = \prod_{i=1}^{t} \left( I_{2^{n_1+\cdots+n_{i-1}}} \otimes WHT(2^{n_i}) \otimes I_{2^{n_{i+1}+\cdots+n_t}} \right)$$

for positive integers n_i such that n = n_1 + ··· + n_t

SLIDE 6

WHT Example

$$WHT(2^5) = [WHT(2^3) \otimes I_{2^2}]\,[I_{2^3} \otimes WHT(2^2)]$$
$$= [\{(WHT(2^1) \otimes I_{2^2})(I_{2^1} \otimes WHT(2^2))\} \otimes I_{2^2}]\,[I_{2^3} \otimes \{(WHT(2^1) \otimes I_{2^1})(I_{2^1} \otimes WHT(2^1))\}]$$

We can visualize this as a split tree: the root 5 splits into 3 and 2, the node 3 splits into 1 and 2, and the node 2 splits into 1 and 1.

[Split tree diagram: 5 → (3 → (1, 2)), (2 → (1, 1))]

1-1 correspondence between split trees and formulas
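This correspondence can be checked mechanically. Below is a small sketch using our own nested-tuple encoding of split trees (not the WHT package's representation): a tree is an integer n for a WHT(2^n) leaf, or a tuple of subtrees; expanding a tree applies the break-down rule at every internal node.

```python
import numpy as np
from functools import reduce

H2 = np.array([[1, 1], [1, -1]])

def size(tree):
    """Total exponent n of a (sub)tree."""
    return tree if isinstance(tree, int) else sum(size(t) for t in tree)

def expand(tree):
    """Expand a split tree into the matrix of its formula."""
    if isinstance(tree, int):                 # leaf: direct WHT(2^n)
        return reduce(np.kron, [H2] * tree)
    sizes = [size(t) for t in tree]
    result = np.eye(2 ** sum(sizes))
    for i, sub in enumerate(tree):            # one break-down-rule factor per child
        left = 2 ** sum(sizes[:i])            # I of size 2^{n_1+...+n_{i-1}}
        right = 2 ** sum(sizes[i + 1:])       # I of size 2^{n_{i+1}+...+n_t}
        result = result @ np.kron(np.kron(np.eye(left), expand(sub)), np.eye(right))
    return result

# The split tree from this slide: 5 -> (3 -> (1, 2)), (2 -> (1, 1))
tree = ((1, 2), (1, 1))
assert np.allclose(expand(tree), reduce(np.kron, [H2] * 5))
```

Every split tree expands to the same transform; only the operation schedule, and hence the running time, differs.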

SLIDE 7

Search Space

Large number of factorizations:

  • WHT(2^n) has Θ((4 + √8)^n / n^{3/2}) different split trees
  • WHT(2^n) has Θ(5^n / n^{3/2}) different binary split trees
  • WHT(2^10) has 51,819 binary split trees
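The binary count follows from a short recurrence: a node of size n is either a leaf, or it splits into children of sizes k and n−k. A minimal sketch, assuming leaves may be any size 2^1 through 2^8 (the unrolled-code sizes noted later in the talk); it reproduces the 51,819 figure above.

```python
from functools import lru_cache

MAX_LEAF = 8  # leaves are unrolled code for sizes 2^1 .. 2^8

@lru_cache(maxsize=None)
def binary_split_trees(n):
    """Number of binary split trees for WHT(2^n)."""
    count = 1 if 1 <= n <= MAX_LEAF else 0            # option 1: the node is a leaf
    count += sum(binary_split_trees(k) * binary_split_trees(n - k)
                 for k in range(1, n))                # option 2: split into k and n-k
    return count

print(binary_split_trees(10))  # -> 51819, matching the slide
```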
SLIDE 8

Varying Performance

Varying performance of factorizations:

  • Formulas have very different running times
  • Small changes in the split tree can lead to significantly different running times
  • Optimal formulas are different across machines

Reasons:

  • Cache performance
  • Utilization of execution units
  • Number of registers
SLIDE 9

Histogram of WHT(2^16) Running Times

[Histogram: number of formulas vs. running time in CPU cycles (×10^7)]

SLIDE 10

Problem

Huge search space of formulas. Want to find the fastest formula:

  • For a given transform
  • For a given size
  • For a given machine
  • But for any input vector

Our Approach: Learn to generate fast formulas

  • Learn to predict cache misses for leaves
  • Use this as the base case for determining the values of different splittings
  • Construct fast formulas by choosing the best splittings
SLIDE 11

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 12

Run Times and Cache Misses

[Scatter plot: level 1 data cache misses vs. runtime in CPU cycles]
SLIDE 13

Run Times and Cache Misses

  • Fastest formula has minimal number of cache misses
  • Minimizing cache misses produces a small group of formulas which contains the fastest formula

SLIDE 14

WHT Leaves

  • WHT leaves are implemented as unrolled code (sizes 2^1 to 2^8)
  • Internal nodes recursively call their children
  • All run time and cache misses occur in the leaves
  • Total run time or cache misses of a formula is the sum of that incurred by the leaves
  • If we can predict for leaves, then we can predict for entire formulas

SLIDE 15

Leaf Cache Misses: WHT(2^16) Example

[Histogram: number of leaves vs. level 1 data cache misses, with marks at 2^14, 2^15, 2^16, and 2^17]

SLIDE 16

Leaf Cache Misses

  • The number of cache misses incurred by leaves is only one of a few possible values
  • These values are fractions of the transform size
  • We can predict one of a few categories instead of a real-valued number of cache misses
  • We can learn across different sizes by learning the categories corresponding to fractions of the transform size

SLIDE 17

Review of Observations

  • Fastest formula has minimal number of cache misses
  • All computation performed in the leaves
  • Leaf cache misses only have a few values
  • Leaf cache misses are fractions of transform size
SLIDE 18

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 19

Predicting Leaf Cache Misses

  • Want to learn to accurately predict leaf cache misses
  • Should then be able to predict cache misses for entire formulas

SLIDE 20

Learning Algorithm

  1. Collect cache misses for leaves of WHT formulas
  2. Classify (cache misses / transform size) as:
     • near-zero if less than 1/8
     • near-quarter if less than 1/2
     • near-whole if less than 3/2
     • large otherwise
  3. Train a classification algorithm to predict one of the four classes given a leaf
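Step 2's bucketing, written out as a tiny helper (the function name is ours; the thresholds are from the slide):

```python
def miss_category(cache_misses, transform_size):
    """Bucket a leaf's cache misses as a fraction of the transform size."""
    fraction = cache_misses / transform_size
    if fraction < 1 / 8:
        return "near-zero"
    if fraction < 1 / 2:
        return "near-quarter"
    if fraction < 3 / 2:
        return "near-whole"
    return "large"

assert miss_category(2 ** 14, 2 ** 16) == "near-quarter"  # misses = size / 4
```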

SLIDE 21

Features for WHT Leaves

Need to describe WHT leaves with features. Could use:

  • Size of the given leaf
  • Stride of the given leaf

Stride:

  • Determines how a node accesses its input and output data
  • Greatly impacts cache performance
  • Determined by the location of the node in the split tree
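How stride follows from tree location: in the break-down rule, the factor I ⊗ WHT(2^{n_i}) ⊗ I_{2^{n_{i+1}+···+n_t}} applies WHT(2^{n_i}) to subvectors at stride 2^{n_{i+1}+···+n_t}. A minimal sketch under that convention (our own code, not the WHT package's traversal) that walks a split tree and collects each leaf's (size, stride):

```python
def size(tree):
    return tree if isinstance(tree, int) else sum(size(t) for t in tree)

def leaf_features(tree, stride=1):
    """Yield (size, stride) for each leaf of a split tree.

    A node at stride s whose children have sizes n_1..n_t gives child i
    the stride s * 2^(n_{i+1}+...+n_t), from the Kronecker factor above.
    """
    if isinstance(tree, int):
        yield (tree, stride)
        return
    sizes = [size(t) for t in tree]
    for i, sub in enumerate(tree):
        yield from leaf_features(sub, stride * 2 ** sum(sizes[i + 1:]))

# Leaves of the WHT(2^5) split tree from slide 6, with their strides:
print(list(leaf_features(((1, 2), (1, 1)))))  # [(1, 16), (2, 4), (1, 2), (1, 1)]
```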
SLIDE 22

More Features for WHT Leaves

  • Size and stride of the given leaf
  • Size and stride of the parent of the given leaf
  • Size and stride of the common parent

[Split tree diagram: nodes A through G, with each leaf annotated with its previous leaf (PrevLeaf) and the common parent (ComPar) of the leaf and that previous leaf]

SLIDE 23

Review: Learning Algorithm

  1. Collect cache misses for leaves of WHT formulas
  2. Classify (cache misses / transform size) as:
     • near-zero if less than 1/8
     • near-quarter if less than 1/2
     • near-whole if less than 3/2
     • large otherwise
  3. Describe leaves with features
  4. Train a classification algorithm to predict one of the four classes given the features for a leaf

SLIDE 24

Evaluation

  • Trained a decision tree
  • Used a random 10% of the leaves of all binary WHT(2^16) split trees with no leaves of size 2^1
  • Evaluated performance using subsets of formulas known to be fastest
  • Cannot evaluate over all formulas because there are too many possible formulas
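A sketch of the training step, using scikit-learn as a stand-in (the slides do not name the decision-tree implementation). The feature rows follow slides 21-22: size and stride of the leaf, its parent, and the common parent; the rows below are illustrative values only, not measured data.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows: (leaf size, leaf stride, parent size, parent stride,
# common-parent size, common-parent stride), encoded as log2 values.
X = [
    [2, 0, 3, 0, 16, 0],
    [4, 12, 6, 10, 16, 0],
    [3, 13, 5, 11, 16, 0],
]
y = ["near-zero", "large", "near-whole"]  # categories from slide 20

clf = DecisionTreeClassifier()
clf.fit(X, y)
print(clf.predict([[2, 0, 3, 0, 16, 0]]))
```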

SLIDE 25

Leaf Cache Miss Category Performance

Error rates for predicting the cache miss category incurred by leaves:

  Binary No-2^1-Leaf        Binary No-2^1-Leaf Rightmost
  Size     Errors           Size     Errors
  2^12     0.5%             2^17     1.7%
  2^13     1.7%             2^18     1.7%
  2^14     0.9%             2^19     1.7%
  2^15     0.9%             2^20     1.6%
  2^16     0.7%             2^21     1.6%

Trained on one size, performs well across many!

SLIDE 26

Predicting Cache Misses for Entire Formulas

Average percentage error for predicting cache misses for entire formulas:

  Binary No-2^1-Leaf        Binary No-2^1-Leaf Rightmost
  Size     Errors           Size     Errors
  2^12     12.7%            2^17     8.2%
  2^13     8.6%             2^18     8.2%
  2^14     6.7%             2^19     7.9%
  2^15     5.2%             2^20     8.1%
  2^16     4.6%             2^21     10.4%

$$\text{Error} = \frac{1}{|TestSet|} \sum_{i \in TestSet} \frac{|a_i - p_i|}{a_i}$$

where a_i and p_i are the actual and predicted number of cache misses for formula i.
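The same metric as code (function name ours):

```python
def avg_pct_error(actual, predicted):
    """Mean of |a_i - p_i| / a_i over a test set."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

print(avg_pct_error([100.0, 200.0], [90.0, 230.0]))  # 0.125, i.e. 12.5%
```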

SLIDE 27

Runtime Versus Predicted Cache Misses

[Two scatter plots of predicted number of cache misses vs. actual running time in CPU cycles: Binary No-2^1-Leaf WHT(2^14), and Binary No-2^1-Leaf Rightmost WHT(2^20)]
SLIDE 28

Review: Predicting Cache Misses

By learning to predict leaf cache misses:

  • Accurately predict cache misses for entire formulas
  • Fastest formulas have fewest predicted cache misses
  • Predict accurately across many transform sizes while trained on only one size

SLIDE 29

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 30

Generating Fast Formulas

  • Can now quickly predict cache misses for a formula
  • Fastest formulas have minimal cache misses
  • But still MANY formulas to search through

Can we learn to generate fast formulas?

SLIDE 31

Generating Fast Formulas: Approach

Control Learning Problem:

  • Learn to control the generation of formulas to produce fast ones

[Diagram: a single root node of size 20]

Want to grow the fastest WHT split tree:

  • Begin with a root node of the desired size
SLIDE 32

Generating Fast Formulas: Approach

Control Learning Problem:

  • Learn to control the generation of formulas to produce fast ones

[Diagram: root node 20 with children 4 and 16]

Want to grow the fastest WHT split tree:

  • Begin with a root node of the desired size
  • Grow the best possible children
SLIDE 33

Generating Fast Formulas: Approach

Control Learning Problem:

  • Learn to control the generation of formulas to produce fast ones

[Diagram: root node 20 with children 4 and 16; node 4 has children 2 and 2]

Want to grow the fastest WHT split tree:

  • Begin with a root node of the desired size
  • Grow the best possible children
  • Recurse on each of the children
SLIDE 34

Generating Fast Formulas: Approach

  • Try to formulate in terms of Markov Decision Processes (MDPs) and Reinforcement Learning (RL)
  • Final formulation is not an MDP
  • Final formulation borrows concepts from RL
SLIDE 35

MDPs

An MDP is a tuple (S, A, T, C):

  • S is a set of states
  • A is a set of actions
  • T: S × A → S is a transition function that maps the current state and action to the next state
  • C: S × A → ℝ is a cost function that maps the current state and action onto its real-valued cost

Markov Property: T and C depend only on the current state and action
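As a data structure, this (deterministic) tuple is tiny; a minimal typed sketch, with field names of our choosing:

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, TypeVar

S = TypeVar("S")  # states
A = TypeVar("A")  # actions

@dataclass(frozen=True)
class MDP(Generic[S, A]):
    actions: Callable[[S], List[A]]   # actions available in a state
    transition: Callable[[S, A], S]   # T: S x A -> S (deterministic here)
    cost: Callable[[S, A], float]     # C: S x A -> R
```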

slide-36
SLIDE 36

MDPs and RL

Agent:

  • Observes current state
  • Selects action to take
  • Receives the cost for that action in that state
  • Observes next state, and repeat

Reinforcement learning provides methods for finding a policy π: S → A that selects at each state the action minimizing the sum of costs incurred

SLIDE 37

Basic Formulation

Given a size, want to grow a fast WHT split tree:

  • States = unexpanded nodes in the split tree
  • Start state = root node of the given size with no children
  • Actions = ways to split a node, giving it children, OR make the node a leaf
  • Cost Function:
     • Zero when giving children to a node
     • The leaf's run time when making a node a leaf
  • Goal = minimize the sum of costs
SLIDE 38

Detail: State Space Representation

States = unexpanded nodes in the split tree. But how do we represent the states?

Modified leaf features for arbitrary nodes:

  • Size and stride of the given node
  • Size and stride of the parent of the given node
  • Size and stride of the common parent to this node
SLIDE 39

Detail: Cost Function

Ideal Cost Function:

  • Zero when giving children to a node
  • The leaf's run time when making a node a leaf

But a leaf's run time is not easily obtained. However, we can predict cache misses for leaves!

Used Cost Function:

  • Zero when giving children to a node
  • The leaf's predicted cache misses when making a node a leaf

Now we really minimize the number of cache misses

SLIDE 40

Difficulty: Transition Function

What is the transition function for this problem? Given that 2 children of the root are grown:

  • Which node is the next state?
  • When will we transition back to the sibling?
  • Where to transition to from a leaf node?
  • And still maintain the Markov property?

We depart from the MDP framework here . . .

SLIDE 41

Our Approach

Problem advantages:

  • Deterministic and known actions
  • Deterministic and known cost function (the learned decision tree)

Approach:

  • Define an optimal value function on states
  • Run dynamic programming to determine the value function (basically like solving an MDP)

SLIDE 42

Value Function

Define an optimal value function on states:

  • Value of a state is the cost of the best subtree
  • Value of root node is the cost of the best formula
  • Choose children that have minimal sum of values
SLIDE 43

Mathematically: Value Function on States

State = unexpanded node in the split tree, described by 6 features

The optimal value of a state is:

$$V^*(state) = \min_{subtrees} \sum_{leaf \in subtree} CacheMisses(leaf)$$

  • Min over all possible subtrees of the given state
  • CacheMisses() returns the predicted number of cache misses for the given leaf

SLIDE 44

Recursive Formulation of Value Function

Define:

$$LeafCM(state) = \begin{cases} CacheMisses(state), & \text{if } state \text{ can be a leaf} \\ \infty, & \text{if } state \text{ cannot be a leaf} \end{cases}$$

and

$$SplitV(state) = \min_{splittings} \sum_{child \in splitting} V^*(child)$$

Then:

$$V^*(state) = \min\{LeafCM(state),\ SplitV(state)\}$$

SLIDE 45

Computing the Value Function

Use dynamic programming to calculate the value function:

  • Consider all possible sets of children of the root
  • Recursively call DP on each of the children, memoizing results
  • Determine the set of children with the minimal sum of values
  • The root's value is this minimal sum of values
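A compact sketch of this dynamic program, under two simplifying assumptions: splits are binary, and `predicted_cache_misses` is a placeholder for the learned decision tree (real states carry the six size/stride features rather than just the node size). The `best_tree` helper performs the generation step described on the next slide.

```python
from functools import lru_cache

MAX_LEAF = 8  # leaves are unrolled code for sizes 2^1 .. 2^8

def predicted_cache_misses(n):
    """Placeholder for the learned predictor: misses for a WHT(2^n) leaf."""
    return float(2 ** n)  # illustrative stand-in only

@lru_cache(maxsize=None)
def value(n):
    """V*(n): minimal predicted cache misses over all subtrees of size n."""
    leaf_cm = predicted_cache_misses(n) if n <= MAX_LEAF else float("inf")
    split_v = min((value(k) + value(n - k) for k in range(1, n)),
                  default=float("inf"))
    return min(leaf_cm, split_v)

def best_tree(n):
    """Grow a split tree realizing V*(n) by choosing the best splitting."""
    if n <= MAX_LEAF and predicted_cache_misses(n) <= value(n):
        return n                                     # make this node a leaf
    k = min(range(1, n), key=lambda k: value(k) + value(n - k))
    return (best_tree(k), best_tree(n - k))

print(value(20), best_tree(20))
```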
SLIDE 46

Generating Fast Formulas

Generate the split tree with minimal value (or near minimal):

  • Consider all possible sets of children of the root
  • Choose those that have the minimal sum of values
  • Recurse on the children
SLIDE 47

Evaluation

Difficulty:

  • Do not know what the optimal formula is
  • Too many formulas to exhaust over at larger sizes

Possible:

  • Exhaust over limited subspaces
  • Limit based on signal processing knowledge and prior experience from using different search methods
  • Compare my method with the best found by this limited exhaustive search

SLIDE 48

Fast Formula Generation Results

  Size    Formulas     Fastest Known    # of Fastest Formulas
          Generated    Generated?       Included in Generated
  2^12    101          yes              77
  2^13    86           yes              4
  2^14    101          yes              70
  2^15    86           yes              11
  2^16    101          yes              68
  2^17    86           yes              15
  2^18    101          yes              25
  2^19    86           yes              16
  2^20    101          yes              16

SLIDE 49

Histograms: WHT(2^20)

[Two histograms of number of formulas vs. running time in CPU cycles (×10^8), Binary No-2^1-Leaf Rightmost: the limited exhaustive search, and our method]

SLIDE 50

Overview

  • Background and Motivation
  • Key Signal Processing Observations
  • Predicting Leaf Cache Misses
  • Generating Fast Formulas
  • Conclusions
SLIDE 51

Conclusions

  • New method for constructing fast WHT formulas
  • Generates the fastest known formulas!
  • Method can be trained on data for one size and perform well across many sizes
  • Also, can learn to accurately predict cache misses of formulas

Ongoing and future work:

  • Test and extend to other architectures
  • Extend to other transforms
SLIDE 52

Acknowledgements

SPIRAL group:

  • José Moura, ECE, CMU
  • Manuela Veloso, CS, CMU
  • Jeremy Johnson, MCS, Drexel
  • Bob Johnson, MathStar
  • David Padua, CS, University of Illinois
  • Viktor Prasanna, CS, USC
  • Markus Püschel, ECE, CMU
  • Gavin Haentjens, ECE, CMU
  • David Sepiashvili, ECE, CMU
  • Jianxin Xiong, CS, University of Illinois
SLIDE 53

Questions?