SLIDE 1
AMD
University of Maryland
Software Controlled Memory Bandwidth
SLIDE 2
- Processor Improvement
- Clock Speed Increase
- More ILP
- Latency Tolerance Techniques Used
- Non-Blocking Caches, Prefetching, Multi-Threading, etc.
- Pin Limitation and Packaging Considerations
Factors Stressing Memory Bandwidth
SLIDE 3
Bandwidth Impacts Performance
From 2 GB/s to 4 GB/s, performance improves by 38%
SLIDE 4
Opportunity
Overall fetch wastage = 51.3%
SLIDE 5
= Dense/Sparse Applications
Matrix Addition (dense):
for(j=0;j<X;j++){ for(i=0;i<X;i++){ C[j][i] = A[j][i] + B[j][i]; } }

Linked List (sparse):
while(ptr){ sum += ptr->data; ptr = ptr->next; }
SLIDE 6 Hardware vs. Software Techniques
Spatial Footprint Predictor (S. Kumar, ISCA'98)
- Hardware Technique
- Selectively Prefetches Required Data Elements
- Complexity-Effective, Software-Centric Approach
- Sparse Memory Accesses Detected at Source Code Level
Contribution
SLIDE 7
- Motivation
- Our Technique
- Experimental Results
- Conclusion
Roadmap
SLIDE 8 Approach
- Identify Sparse Memory Accesses
- Compute Transfer Size
- Annotate Selected Memory Instructions
[Diagram: for sparse code such as while(ptr){ ptr = ptr->next; }, the processor issues a sparse load with a computed transfer size; the memory system moves just the needed bytes into the cache.]
SLIDE 9 Sparse Memory Access Patterns
- Affine Array Accesses
- Indexed Array Accesses
- Pointer Chasing Accesses
SLIDE 10
for(i=0;i<X;i+=N){ sum+= A[i]; }
Affine Array Accesses
SLIDE 11
for(i=0;i<N;i++){ sum+= A[B[i]]; }
Indexed Array Accesses
SLIDE 12
for(ptr=root; ptr; ){ sum += ptr->data; ptr = ptr->next; }
Pointer Chasing Accesses
SLIDE 13
Computing Transfer Size
Structure layout: { data1, data2, back, fwd }
while(ptr->fwd){ sum += ptr->data1; ptr = ptr->fwd; }
Size #1 = 16 bytes (normal load: the whole structure)
Size #2 = 4 bytes (sparse load: sizeof(data1))

for(i=0;i<N;i++){ sum += A[B[i]]; }
Load #1: B[i]; Load #2: A[B[i]], transfer size = sizeof(A[i]) (sparse load)
SLIDE 14
Annotating Memory Instructions
Memory Instructions with Size Information
SLIDE 15
Sectored caches
SLIDE 16 . . . .
Ld R0(&R1) Ld R0(&R2) Ld8 R0(&R3) Ld R0(&R5) Ld16 R0(&R4)
. . . .
Fetching Variable Sized Data
Sector Miss Sector Hit/ Cache block miss Sector Hit/ Cache Block Miss Sector Miss Lower Level Memory
SLIDE 17
IRREG    Scientific   Indexed Array
MOLDYN   Scientific   Indexed Array
NBF      Scientific   Indexed Array
HEALTH   Olden
MST      Olden
BZIP2    SPEC2000     Indexed Array
MCF      SPEC2000     Affine Array, Ptr. Chasing
Application Overview
SLIDE 18 Experimental Methodology
Cache Simulations
- Traffic and Miss-rate Behavior
- SFP-Ideal (8 Mbytes)
- SFP-Real (32 Kbytes)
Performance Simulations
- Comparison with Conventional
- Latency Tolerant Study
- Prefetching
- Bandwidth Sensitivity
Processor Model: Superscalar
Processor Speed: 2 GHz
Issue Width: 8
Memory Latency: 120
Memory Bandwidth: 2 GB/s
Memory Bus Width: 8 Bytes
DRAM Banks: 64
Processor and Memory parameters
SLIDE 19 Traffic Behavior
MCF
Schemes: Conventional, Annotated, MTC, SFP-Real, SFP-Ideal
Traffic Reduction for MCF – 57%
SLIDE 20
Traffic Behavior
Benchmarks: Irreg, Moldyn, NBF, Health, MST, BZIP2
Overall Traffic Reduces by 31 - 71%
SLIDE 21 Miss-Rates
Schemes: Conventional, Annotated, SFP-Ideal
MCF: Miss rate increases by 18%
SLIDE 22
Miss-Rates
Benchmarks: Irreg, Moldyn, NBF, Bzip2, MST, Health
Overall Miss rate increases by 7-43%
SLIDE 23
Baseline Performance
Overall performance improves by 17%
SLIDE 24
Baseline Performance with Prefetching
Overall performance improves by 26%
SLIDE 25
Bandwidth Sensitivity
SLIDE 26
Bandwidth Sensitivity
Benchmarks: Irreg, NBF, Health, Moldyn, Bzip2, MST
SLIDE 27
- A complexity-effective way to attack the memory bandwidth bottleneck
- Sparse memory references can be identified at the source-code level
- Software can effectively control memory bandwidth
- Performance numbers:
  - Cache traffic reduces by 31-71%; miss rates increase by 7-43%
  - 17% performance gain over normal caches
  - Annotated s/w prefetching gains 26% over normal prefetching
- Our technique loses effectiveness at higher bandwidth
Conclusion