Software Controlled Memory Bandwidth - Deepak N. Agarwal (AMD) - PowerPoint Presentation Transcript

SLIDE 1
  • Deepak N. Agarwal (AMD)
  • Wanli Liu (University of Maryland)
  • Dr. Donald Yeung (University of Maryland)

Software Controlled Memory Bandwidth

SLIDE 2
  • Processor Improvement
    • Clock Speed Increase
    • More ILP
  • Latency Tolerance Techniques Used
    • Non-Blocking Caches, Prefetching, Multi-Threading, etc.
  • Pin Limitation and Packaging Considerations

Factors Stressing Memory Bandwidth

SLIDE 3

Bandwidth Impacts Performance

Increasing memory bandwidth from 2 GB/s to 4 GB/s improves performance by 38%

SLIDE 4

Opportunity

Overall fetch wastage = 51.3%

SLIDE 5

[Figure: a densely traversed matrix combined with sparsely touched elements]

Dense/Sparse Applications

Matrix Addition (dense):

    for (j = 0; j < X; j++) {
        for (i = 0; i < X; i++) {
            C[j][i] = A[j][i] + B[j][i];
        }
    }

Linked List (sparse):

    while (ptr) {
        sum += ptr->data;
        ptr = ptr->next;
    }

SLIDE 6

Hardware vs. Software Techniques

Spatial Footprint Predictor (S. Kumar, ISCA ’98)

  • Hardware Technique
  • Selectively Prefetches Required Data Elements

Our Technique

  • Complexity-Effective, Software-Centric Approach
  • Sparse Memory Accesses Detected at the Source-Code Level

Contribution

SLIDE 7
  • Motivation
  • Our Technique
  • Experimental Results
  • Conclusion

Roadmap

SLIDE 8

Approach

  • Identify Sparse Memory Accesses
  • Compute Transfer Size
  • Annotate Selected Memory Instructions

[Figure: the processor issues a sparse load from code such as

    while (ptr) {
        ptr = ptr->next;
    }

the annotation lets the memory system compute the transfer size and move just the required bytes into the cache]

SLIDE 9

Sparse Memory Access Patterns

  • Affine Array Accesses
  • Indexed Array Accesses
  • Pointer Chasing Accesses

SLIDE 10

    for (i = 0; i < X; i += N) {
        sum += A[i];
    }

Affine Array Accesses

SLIDE 11

    for (i = 0; i < N; i++) {
        sum += A[B[i]];
    }

Indexed Array Accesses

SLIDE 12

    for (ptr = root; ptr; ) {
        sum += ptr->data;
        ptr = ptr->next;
    }

Pointer Chasing Accesses

SLIDE 13

Computing Transfer Size

Pointer chasing, with structure layout | data1 | data2 | back | fwd |:

    while (ptr->fwd) {
        sum += ptr->data1;
        ptr = ptr->fwd;
    }

Indexed array:

    for (i = 0; i < N; i++) {
        sum += A[B[i]];
    }

    Load #1: Size #1 = 16 bytes (Normal Load)
    Load #2: Size #2 = 4 bytes, sizeof(A[i]) (Sparse Load)

SLIDE 14

Annotating Memory Instructions

Memory Instructions with Size Information

SLIDE 15

Sectored caches

SLIDE 16

    Ld   R0, (&R1)
    Ld   R0, (&R2)
    Ld8  R0, (&R3)
    Ld   R0, (&R5)
    Ld16 R0, (&R4)

Fetching Variable Sized Data

[Figure: the load stream above accesses a sectored cache; outcomes shown are Sector Miss and Sector Hit / Cache Block Miss, with misses serviced from Lower-Level Memory]

SLIDE 17

Application   Suite        Access Pattern
-----------   ----------   --------------
IRREG         Scientific   Indexed Array
MOLDYN        Scientific   Indexed Array
NBF           Scientific   Indexed Array
HEALTH        Olden        Ptr. Chasing
MST           Olden        Ptr. Chasing
BZIP2         SPEC2000     Indexed Array
MCF           SPEC2000     Affine Array, Ptr. Chasing

Application Overview

SLIDE 18

Experimental Methodology

Cache Simulations

  • Traffic and Miss-rate Behavior
  • SFP-Ideal (8 Mbytes)
  • SFP-Real (32 Kbytes)

Performance Simulations

  • Comparison with Conventional
  • Latency Tolerant Study
  • Prefetching
  • Bandwidth Sensitivity

Processor Model      Superscalar
Processor Speed      2 GHz
Issue Width          8
Memory Latency       120
Memory Bandwidth     2 GB/s
Memory Bus Width     8 Bytes
DRAM Banks           64

Processor and Memory Parameters

SLIDE 19

Traffic Behavior

[Chart: MCF traffic for Conventional, Annotated, MTC, SFP-Real, and SFP-Ideal]

Traffic Reduction for MCF – 57%

SLIDE 20

Traffic Behavior

[Chart: traffic for Irreg, Moldyn, NBF, Health, MST, and BZIP2]

Overall traffic reduces by 31-71%

SLIDE 21

Miss-Rates

[Chart: MCF miss rates for Conventional, Annotated, and SFP-Ideal]

Miss rate increases by 18% (MCF)

SLIDE 22

Miss-Rates

[Chart: miss rates for Irreg, Moldyn, NBF, Bzip2, MST, and Health]

Overall miss rate increases by 7-43%

SLIDE 23

Baseline Performance

Overall performance improves by 17%

SLIDE 24

Baseline Performance with Prefetching

Overall performance improves by 26%

SLIDE 25

Bandwidth Sensitivity

SLIDE 26

Bandwidth Sensitivity

[Chart: bandwidth sensitivity for Irreg, NBF, Health, Moldyn, Bzip2, and MST]

SLIDE 27
  • A complexity-effective approach to the memory bandwidth bottleneck
  • Sparse memory references can be identified at the source-code level
  • Software can effectively control memory bandwidth
  • Performance numbers:
    • Cache traffic reduces by 31-71%; miss rates increase by 7-43%
    • 17% performance gain over normal caches
    • Annotated s/w prefetching gains 26% over normal prefetching
  • Our technique loses effectiveness at higher bandwidth

Conclusion