School of Computing, University of Utah (Impulse Adaptable Memory System)
Computation Regrouping: Restructuring Programs for Temporal Data Cache Locality

Venkata K. Pingali, Sally A. McKee, Wilson C. Hsieh, John B. Carter
http://www.cs.utah.edu/impulse


Problem: Memory Performance

[Chart: normalized execution time for FFTW, RAY TRACE, CUDD, R-TREE, HEALTH, EM3D, and IRREG, split into memory, TLB, and computation components.]

60-80% of execution time is spent in memory stalls (measured with Perfex).

Platform: 194 MHz MIPS R10K processor, 32 KB L1D, 32 KB L1I, 2 MB L2.


Related Work

Compiler approaches
– Loop, data, and integrated restructuring: tiling, permutation, fusion, fission [CarrMcKinley94]
– Data-centric: multi-level fusion [DingKennedy01], compile-time resolution [Rogers89]

Prefetching
– Hardware- or software-based, simple, efficient models: jump pointers, prefetch arrays [Karlsson00], dependence-based [Roth98]

Cache-conscious, application-level approaches
– Algorithmic changes: sorting [LaMarca96], query processing, matrix multiplication
– Data structure modifications: clustering, coloring, compression [Chilimbi99]
– Application construction: cohort scheduling [Larus02]


Computation Regrouping

Logical operations
– Short streams of independent computation performing a unit task
– Examples: an R-Tree query, an FFTW column walk, processing one ray in RAY TRACE

Application-dependent optimization
– Improves temporal locality
– Techniques: deferred execution, early execution, filtered execution, computation merging

Preliminary performance improvements are encouraging
– Speedups range from 1.26 to 3.03
– Modest code changes


Access Matrix

[Figure: access matrix with logical operations ordered in time on one axis and data objects on the other, contrasting the original schedule with the regrouped computations.]


Optimization Process Summary

Identify logical operations

Identify the data object set
– whose accesses result in cache misses
– that can fit into the L2 cache

Identify suitable computations
– Deferrable
– Easily parameterizable
– Estimated gain

Extend data/control structures
– Extensions to store the regrouped computation
– Extensions to the data structure to support partial execution

Decide the run-time strategy
– Temporal/spatial constraints
– Estimation of gain


Filtered Execution: IRREG

Simplified CFD code performing a series of indirect accesses. If the index vector is random, the working set is as large as the data array, and memory stalls account for more than 80% of execution time. Logical operation: a set of remote accesses.

Unoptimized:

    for (i = 0; i < n; i++)
        sum += data[index[i]];

[Figure: index vector pointing into the data array.]


Filtered Execution: IRREG

Defer accesses to data outside the current window. Additional computation cost is significant: the index vector is rescanned once per window instead of once in total. Tradeoff: window size vs. number of passes.

Optimized:

    for (k = 0; k < n; k += block)
        for (i = 0; i < n; i++)
            if (index[i] >= k && index[i] < k + block)
                sum += data[index[i]];

[Figure: each pass touches only the data elements inside its window.]
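The transformation above can be sketched as a small self-contained C routine (the names `filtered_sum`, `n_accesses`, and `data_len` are hypothetical, not the paper's code). Each pass rescans the whole index vector but only dereferences `data[]` entries inside the current window, so the working set per pass is at most `block` elements:

```c
/* Filtered-execution sketch for the IRREG reduction (hypothetical
 * helper names).  One pass per window of the data array; each pass
 * touches only data[] entries whose index falls in that window. */
double filtered_sum(const double *data, const int *index,
                    int n_accesses, int data_len, int block)
{
    double sum = 0.0;
    for (int k = 0; k < data_len; k += block)     /* one pass per window */
        for (int i = 0; i < n_accesses; i++)      /* rescan the index vector */
            if (index[i] >= k && index[i] < k + block)
                sum += data[index[i]];
    return sum;
}
```

The result is identical to the single-pass loop; only the access order changes, trading extra index scans for a cache-resident window of `data[]`.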


Deferred Execution: R-Tree

Height-balanced tree with branching factor 2-15, used for spatial searches.

Problem: data-dependent accesses, and a large working set across queries/deletes.

Logical operations: insert, delete, query.

[Figure: two queries traversing overlapping paths through the tree.]
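The deferral idea can be sketched in miniature, simplified here to a binary search tree rather than a real R-tree (all names are hypothetical, not the thesis implementation). Queries accumulate in a batch, and a single recursive walk resolves all of them, so each node is brought into cache once per batch instead of once per query:

```c
/* Deferred-execution sketch: batch lookups against a binary search
 * tree (stand-in for the R-tree).  flush_batch partitions the pending
 * queries at each node, visiting every node at most once per batch. */
#include <stddef.h>

#define MAXQ 64

struct node { int key; struct node *left, *right; };

struct batch {
    int q[MAXQ];      /* queued query keys          */
    int found[MAXQ];  /* result flag for each query */
    int n;
};

static void enqueue_query(struct batch *b, int key)
{
    b->q[b->n] = key;
    b->found[b->n] = 0;
    b->n++;
}

/* qs[0..m) holds indices of queries that can still reach this subtree. */
static void flush_batch(struct node *t, struct batch *b, int *qs, int m)
{
    if (t == NULL || m == 0) return;
    int lo[MAXQ], hi[MAXQ], nlo = 0, nhi = 0;
    for (int i = 0; i < m; i++) {
        int qi = qs[i];
        if (b->q[qi] == t->key)      b->found[qi] = 1;
        else if (b->q[qi] < t->key)  lo[nlo++] = qi;
        else                         hi[nhi++] = qi;
    }
    flush_batch(t->left, b, lo, nlo);
    flush_batch(t->right, b, hi, nhi);
}
```

Individual query latency grows while results wait in the queue, which is exactly the throughput-for-latency trade the later slides measure.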


R-Tree Regrouping

[Figure: four queries regrouped so that each tree node is visited once for the whole batch.]


Regrouping: Perfex Estimates

[Chart: normalized execution time (memory, TLB, computation, overhead) for RAY TRACE, CUDD, R-TREE, EM3D, and IRREG alongside their regrouped versions; the optimized totals are 57, 72, 55, 52, and 80 respectively.]


Regrouping vs. Clustering (R-Tree)

[Chart: normalized execution time (memory, TLB, computation, overhead) for the original R-Tree and its clustered, regrouped, and combined versions.]


Discussion

Downsides
– Useful only for a subset of inputs
– Increased code complexity
– Hard to automate

Application structure is crucial to low regrouping overhead
– Commutative operations
– Program-level parallelism and independence

Execution speed is traded for output ordering and per-operation latency

Summary

Regrouping exploits (1) the low cost of computation and (2) application-level parallelism.

– Improves temporal locality
– Changes are small compared to overall code size
– Hand-optimized applications show good performance improvements



Implementation Techniques

[Figure: schematic contrasting the original computation schedule with the four regrouping techniques: deferred execution, computation merging, early execution, and filtered execution.]




Performance

SGI Power Onyx, R10K, 2MB L2, 32K L1D, 32K L1I

Benchmark   Input           Technique           Speedup
FFTW        10K*32*32       Early               2.53
RAY TRACE   Balls, 256*256  Filtered            1.98
CUDD        C3540.blif      Early + Deferred    1.26
IRREG       MOL2            Filtered            1.74
HEALTH      6, 500          Merging             3.03
EM3D        128K nodes      Merging + Filtered  1.43
R-TREE      dm23.in         Deferred            1.87


Application Analysis

Bad memory behavior
– Working set larger than L2
– Data-dependent accesses
– Hard to optimize with a compiler

Benchmark   Source          Domain             Access Characteristics
R-TREE      DARPA           Databases          Pointer chasing
RAY TRACE   DARPA           Graphics           Pointer chasing + strided accesses
CUDD        U. of Colorado  CAD                Pointer chasing
EM3D        Public domain   Scientific         Indirect accesses + pointer chasing
IRREG       Public domain   Scientific         Indirect accesses
HEALTH      Public domain   Simulator          Pointer chasing
FFTW        DARPA/MIT       Signal processing  Strided accesses


Thesis Overview

Problem: complex applications are increasingly limited by memory performance.

Proposed approach: computation regrouping
– Application structure
– Generic implementation techniques
– Performance
– Simple scheduling abstraction


Characteristics of Logical Operations

– Access a large number of objects
– Low reuse of data objects within a single operation
– Low computation per access
– May have a high degree of reuse across operations
– Data-dependent access sequence
– Strict ordering among operations


Contributions

Showing that computation regrouping is a viable alternative.

Characterizing the applications that can be optimized.

Developing four implementation techniques to realize computation regrouping:
– Deferred execution
– Computation merging
– Early execution
– Filtered execution

Developing a simple abstraction with potential for automation (locality grouping).


Techniques Summary

Deferred execution, e.g., R-TREE, CUDD
– Postpone execution until sufficient computation accessing the same data has been gathered

Computation merging, e.g., HEALTH, EM3D
– A special case of deferral
– Application-specific merging of deferred computation

Early execution, e.g., FFTW, CUDD
– Execute future computation that accesses the same data

Filtered execution, e.g., IRREG, EM3D
– Brute-force technique
– Use a sliding window to filter accesses to the current working set
– As many passes as necessary


Deferred Execution - HEALTH

Colombian health-system simulation: essentially a traversal of a quadtree with linked lists attached at the nodes.

Key operation: counter updates of the nodes on a waiting list.
Logical operation: one simulation time step.

[Figure: quadtree node with its attached waiting list.]


Deferred Execution - HEALTH

Key idea: defer the waiting-list traversals and remember the cumulative counter update.

Specific technique: computation merging
– Overhead: space and processing
– Benefit: one traversal instead of many

[Figure: deferred updates accumulated at the quadtree node instead of walked through the waiting list.]
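The merged counter update can be sketched as follows (hypothetical structure names; HEALTH's real data structures differ). The list keeps a single deferred tick count; each patient records the count at the moment it joined, so its true wait time is reconstructed on removal without any per-tick list traversal:

```c
/* Computation-merging sketch for HEALTH-style counter updates
 * (hypothetical names).  tick() is O(1) instead of O(list length);
 * the merged update is applied only when a patient leaves the list. */
struct patient { int wait; struct patient *next; };

struct waitlist {
    struct patient *head;
    int deferred;            /* time steps not yet applied to members */
};

/* Logical operation "one time step", merged into a single counter. */
static void tick(struct waitlist *w) { w->deferred++; }

static void push(struct waitlist *w, struct patient *p)
{
    p->wait = -w->deferred;  /* remember the epoch at which p joined */
    p->next = w->head;
    w->head = p;
}

/* Fold the accumulated ticks into the patient's wait on removal. */
static struct patient *pop(struct waitlist *w)
{
    struct patient *p = w->head;
    if (p) {
        p->wait += w->deferred;   /* elapsed ticks since p joined */
        w->head = p->next;
    }
    return p;
}
```

One addition at removal replaces many list walks, at the cost of a little extra state per list, which matches the slide's "space and processing" overhead.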


Benchmarks

Benchmark   Logical Operation
R-TREE      Tree operations, i.e., insert, delete, and query
RAY TRACE   A scan of the input scene by one ray
CUDD        Hash-table operations performed during a variable swap
EM3D        A group of accesses to a set of remote nodes
IRREG       A group of accesses to a set of remote nodes
HEALTH      One time step
FFTW        Column walks of a 3D array


Discussion

Correctness
– Breaking the strict ordering of logical operations changes the completion and output order

Subtle performance issues
– Increased throughput at the cost of increased average latency and standard deviation
– Sensitivity to optimization parameters


R-Tree Performance Characteristics

[Chart: throughput (queries/s) and average result latency (s) as a function of queue size.]

Synthetic input: query operations on a large static tree.


[Chart: the same throughput and latency curves, annotated with the queue-size sweet spot.]


R-Tree Sensitivity to Optimization Parameters

[Chart: execution time for queue sizes 32 to 512 under several queue-placement patterns, compared with the original time of 4783.]

The choice of optimization parameters is important: there is a 1.4x difference between the best and worst execution times.


R-Tree Clustering

[Figure: inter-node clustering combined with intra-node clustering.]


Locality Grouping (LG)

Locality groups: user-identified groups of tasks that share objects.

Library interface, runtime scheduling, and a simple abstraction:
– lg *createlg(), void deletelg(lg *)
– void addtolg(lg *, void *data, void (*proc)(void *))
– void flushlg(lg *)


Locality Grouping - Health

Group (HEALTH usage):

    node->group = CreateLG();
    if (list != NULL && only_increment) {
        AddToLG(node->group, list, perform_increment);
    } else {
        FlushLG(node->group);
        perform_update(list);
        ...
    }
    DeleteLG(node->group);

Task dispatch (library internals):

    AddToLG(g, arg, func) {
        op = malloc(sizeof(*op));
        op->arg = arg;
        op->func = func;
        enqueue(g->ops_list, op);
    }

    FlushLG(g) {
        while (op = dequeue(g->ops_list)) {
            (*op->func)(op->arg);
        }
    }
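A complete, compilable sketch of the LG interface along the lines of the slide's pseudocode (the struct layouts and the `incr` demo task are assumptions, not the Impulse implementation). Deferred tasks queue up and run back to back at flush time, so tasks that share data execute while that data is cache-resident:

```c
/* Minimal locality-grouping (LG) library sketch, hypothetical layout. */
#include <stdlib.h>

typedef struct lg_op {
    void *arg;
    void (*proc)(void *);
    struct lg_op *next;
} lg_op;

typedef struct lg { lg_op *head, *tail; } lg;

lg *createlg(void) { return calloc(1, sizeof(lg)); }

/* Defer a task: record it instead of running it now. */
void addtolg(lg *g, void *data, void (*proc)(void *))
{
    lg_op *op = malloc(sizeof(lg_op));
    op->arg = data;
    op->proc = proc;
    op->next = NULL;
    if (g->tail) g->tail->next = op; else g->head = op;
    g->tail = op;
}

/* Run every deferred task back to back, then empty the queue. */
void flushlg(lg *g)
{
    for (lg_op *op = g->head; op; ) {
        lg_op *next = op->next;
        op->proc(op->arg);
        free(op);
        op = next;
    }
    g->head = g->tail = NULL;
}

void deletelg(lg *g) { flushlg(g); free(g); }

/* Example task (hypothetical): increment the int that arg points to. */
void incr(void *arg) { ++*(int *)arg; }
```

A group is just a FIFO of (argument, procedure) pairs; the scheduling decision of when to call `flushlg` is where the locality payoff lives.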


Performance

SGI Power Onyx, R10K, 2MB L2, 32K L1D, 32K L1I

[Chart: speedups for HEALTH and FFTW, comparing hand-coded regrouping with the locality-grouping library.]


Conclusion

Computation regrouping is an effective software alternative.

Identified applications that can be optimized using regrouping.

Developed four implementation techniques to realize regrouping.

Demonstrated speedups ranging from 1.29 to 2.13.


Acknowledgements

John, Sally, and Wilson, and the rest of the Impulse group.


Questions?


Key Issues

Notion of a window
– A sequence of computations
– Captures the assumptions the compiler can make about previous accesses
– No 1-1 mapping between code and accesses

Exploiting reuse requires a large "window"
– Many existing optimizations look only at a small window
– A small window is sufficient for many application/input combinations

Some current optimizations use a large "window"
– Multi-level fusion [DingKennedy01]
– Loop-tiling algorithms


Regrouping Properties

Implementation
– Supports deferring deletes and queries
– Inserts executed out of band
– R-tree extensions to support correct operation

High overhead: profitable only for large, reasonably stable trees.

Output is interleaved; throughput and per-operation latency both increase
– May be suitable for batch processing


Regrouping Example - FFTW

Fast Fourier transform implementation. Key operation: column walks. N column walks share the same cache lines, and the application spends 50-90% of its time in these column walks.

[Figure: a single column walk through the FFTW array.]


Regrouping Example - FFTW

Fast Fourier transform implementation. Key operation: column walks. N column walks share the same cache lines; early execution walks them together.

– Overhead: control
– Benefit: cache misses reduced to 1/8

[Figure: N adjacent column walks proceeding together through the array.]
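Early execution over grouped column walks can be sketched like this (hypothetical: a plain column sum on a row-major matrix stands in for FFTW's butterflies, and `group` plays the role of N). Walking `group` adjacent columns in one pass reuses every fetched cache line `group` times instead of once:

```c
/* Early-execution sketch (hypothetical names): instead of one walk
 * per column, each walk down the rows also does the work for the
 * next group-1 columns, whose elements sit in the same cache lines. */
void column_sums_grouped(const double *a, int rows, int cols,
                         double *out, int group)
{
    for (int j0 = 0; j0 < cols; j0 += group) {
        int jend = (j0 + group < cols) ? j0 + group : cols;
        for (int j = j0; j < jend; j++) out[j] = 0.0;
        for (int i = 0; i < rows; i++)        /* one walk down the rows  */
            for (int j = j0; j < jend; j++)   /* early-execute neighbours */
                out[j] += a[i * cols + j];
    }
}
```

With eight doubles per 64-byte cache line, grouping eight columns would cut the cold misses of the walks by roughly the 1/8 factor the slide cites.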


Observations

A single application may be optimized using multiple techniques.

Optimizations can be implemented at varying levels of aggressiveness.

Performance is sensitive to optimization parameters.


Characteristics of Logical Operations

– Access a large number of objects
– Low reuse of data objects within a single operation
– May have a high degree of reuse across operations
– Data-dependent access sequence
– Strict ordering among operations

Together, these properties ensure that prefetching, clustering, and large caches each have limited impact.