

SLIDE 1

Luis Garrido, 2012

IEE5008 –Autumn 2012 Memory Systems Survey on the Off-Chip Scheduling of Memory Accesses in the Memory Interface of GPUs

Garrido Platero, Luis Angel EECS Graduate Program National Chiao Tung University luis.garrido.platero@gmail.com

SLIDE 2

NCTU IEE5008 Memory Systems 2012 Luis Garrido

Outline

Introduction Overview of GPU Architectures

The SIMD Execution Model Memory Requests in GPUs

Differences between GDDRx and DDRx State-of-the-art Memory Scheduling Techniques

Effect of instruction fetch and memory scheduling in GPU Performance An alternative Memory Access Scheduling in Manycore Accelerators


SLIDE 3

Outline

DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
Conclusion
References


SLIDE 4

Introduction

Memory controllers are a critical performance bottleneck
Scheduling policies must comply with the SIMD execution model
Characteristics of the memory requests in GPU architectures
Integration of GPU+CPU systems


SLIDE 5

Overview: SIMD execution model


[Figure: block diagram of a GPU streaming multiprocessor (cores, DP units, LD/ST units, a 65536 x 32-bit register file, warp schedulers, instruction cache, 64 KB shared memory / L1 cache, 48 KB read-only data cache, texture units) connected through an interconnect network to the L2 unified cache, memory controllers and the GigaThread engine, alongside the CUDA hierarchy: a grid of thread blocks, e.g. Block(0,2), each holding a 2-D array of threads.]

SLIDE 6

Overview: Memory requests in GPUs

A varying number of accesses with different characteristics is generated simultaneously


[Figure: example of the memory requests (A-F) issued by warps in consecutive cycles i through i+6.]

Load/Store units handle the data fetch
Concept of memory coalescing: accesses can be merged
  Intra-core merging
  Inter-core merging
Merging reduces the number of transactions
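As a rough illustration of intra-warp merging, the following Python sketch (an assumption of this summary, not code from any of the surveyed papers) maps per-thread byte addresses to aligned 128-byte segments and counts the resulting transactions:

```python
def coalesce(addresses, seg_size=128):
    """Merge the byte addresses touched by one warp into the minimal set
    of aligned memory segments (transactions).

    Illustrative sketch only: real coalescing rules vary by architecture;
    seg_size = 128 mirrors a 128-byte memory segment.
    """
    # Each address falls into exactly one aligned segment.
    segments = {addr // seg_size for addr in addresses}
    return sorted(seg * seg_size for seg in segments)

# 32 threads reading consecutive 4-byte words: one 128-byte transaction.
print(len(coalesce([tid * 4 for tid in range(32)])))        # 1
# The same words shifted by 4 bytes (misaligned): two transactions.
print(len(coalesce([4 + tid * 4 for tid in range(32)])))    # 2
# Scattered accesses: one transaction per thread in the worst case.
print(len(coalesce([tid * 4096 for tid in range(32)])))     # 32
```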

SLIDE 7

Overview: Memory requests in GPUs


The number of transactions depends on:
  Parameters of the memory subsystem: number of cache levels, cache line size, etc.
  The application's behavior
  The scheduling policy and the GDDRx controller's capabilities

[Figure: warp access patterns over byte addresses 0-384 and the resulting transaction counts. On newer devices: a) coalesced accesses = 1 transaction; b) same-word accesses = 1 transaction; c) scattered accesses = k transactions; d) misaligned accesses = 2 transactions; e) permuted accesses = 1 transaction. On older devices: coalesced accesses = 4 transactions, same-word accesses = 1 transaction, scattered accesses = k transactions, misaligned accesses <= 6 transactions, permuted accesses = 4 transactions.]

SLIDE 8

Differences in GDDRx and DDRx

GDDRx can receive and send data in the same clock cycle
More channels than DDRx
Higher clock speeds in GDDRx
Designed for higher bandwidth demand
Different power-consumption characteristics: GDDRx requires better heat dissipation


SLIDE 9

Memory Request Scheduling Problem

Bandwidth limitation. It is desirable to increase:
  Row-buffer locality
  Bank-level parallelism
and to reduce the negative interference among memory accesses.
Root cause of the problem: the massive number of simultaneous memory requests.
Memory scheduling mechanisms can partially alleviate the issue.
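Row-buffer locality is exactly what the baseline FR-FCFS policy exploits. A minimal Python sketch of first-ready, first-come-first-served scheduling (the class and its single-rank open-row model are illustrative assumptions, not code from any surveyed paper):

```python
from collections import deque

class FRFCFS:
    """Minimal first-ready, first-come-first-served scheduler sketch.

    Requests are (bank, row) pairs; row hits (requests to the currently
    open row of a bank) are served before older row misses.
    """
    def __init__(self, num_banks):
        self.open_row = [None] * num_banks   # open row per bank
        self.queue = deque()                 # requests in arrival order

    def add(self, bank, row):
        self.queue.append((bank, row))

    def next(self):
        if not self.queue:
            return None
        # First ready: the oldest request that hits an open row...
        for req in self.queue:
            bank, row = req
            if self.open_row[bank] == row:
                self.queue.remove(req)
                return req
        # ...otherwise plain FCFS: serve the oldest request, opening its row.
        bank, row = req = self.queue.popleft()
        self.open_row[bank] = row
        return req

sched = FRFCFS(num_banks=2)
for bank, row in [(0, 5), (1, 9), (0, 5), (0, 7)]:
    sched.add(bank, row)
print(sched.next())  # (0, 5): oldest request, opens row 5 on bank 0
print(sched.next())  # (0, 5): row hit served before the older (1, 9)
```

Note how the younger row hit jumps ahead of the older miss; that reordering is the source of both the bandwidth gain and the buffering complexity criticized later in this survey.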


SLIDE 10

Effect of IF and memory scheduling

Paper: N.B. Lakshminarayana, H. Kim, "Effect of Instruction Fetch and Memory Scheduling on GPU Performance", Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
Analyzes the effect of different instruction-fetch (IF) and memory scheduling policies:
  Seven IF policies: RR, ICOUNT, LRF, FAIR, ALL, BAR, MEM-BAR
  Four memory scheduling policies: FCFS, FR-FCFS, FR-FAIR, REMINST


SLIDE 11

Effect of IF and memory scheduling

GPUOcelot was used to evaluate performance
Instruction count of warps: the Max-Warp to Min-Warp ratio


SLIDE 12

Effect of IF and memory scheduling

Most IF policies provide an improvement of about 3-6% on average, except for ICOUNT.
The same IF policies do not bring FCFS up to FR-FCFS performance.
Memory-intensive applications improve with FR-FAIR and REMINST.
Adequate IF policies can increase memory-access merging.

SLIDE 13

Effect of IF and memory scheduling

For asymmetric applications, the best performance comes from FR-FCFS + RR.
No policy provides a good benefit across all asymmetric applications: the application's characteristics govern a policy's effectiveness.
The regularity of the warps gives clues about application behavior.


SLIDE 14

An Alternative Memory Access Scheduling

Paper: Y. Kim, H. Lee, J. Kim, “An Alternative Memory Access Scheduling in Manycore Accelerators”, in Conf. on Parallel Architectures and Compilation Techniques, 2011.

Algorithm 1: mLCA
  At each core c:
    if all requests in a warp satisfy rc(m,b) = 0 and O(c) < t then
      for each request within the warp do
        inject request; rc(m,b)++; O(c)++
      end for
    else
      throttle
    end if

FR-FCFS requires a complex hardware structure
Idea: make scheduling decisions near the cores
Avoid network congestion: the mLCA algorithm
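The mLCA listing can be turned into runnable form. The following Python sketch is one illustrative reading of the throttling test: only rc(m,b), O(c) and the threshold t come from the slide; the class structure and method names are assumptions.

```python
from collections import defaultdict

class Core:
    """Per-core injection throttling in the spirit of the mLCA listing.

    rc[(m, b)] counts this core's in-flight requests to memory
    controller m, bank b; `outstanding` plays the role of O(c) and
    `threshold` the role of t in the pseudocode.
    """
    def __init__(self, threshold):
        self.rc = defaultdict(int)
        self.outstanding = 0
        self.threshold = threshold

    def try_inject(self, warp_requests):
        """warp_requests: (controller, bank) targets of one warp.
        Returns True if the warp was injected, False if throttled."""
        if self.outstanding >= self.threshold:
            return False                      # O(c) limit reached
        if any(self.rc[target] != 0 for target in warp_requests):
            return False                      # some rc(m,b) != 0: throttle
        for target in warp_requests:
            self.rc[target] += 1              # inject req; rc(m,b)++
            self.outstanding += 1             # O(c)++
        return True

    def complete(self, target):
        """Called when a request to (m, b) has been serviced."""
        self.rc[target] -= 1
        self.outstanding -= 1

core = Core(threshold=4)
print(core.try_inject([(0, 1), (0, 2)]))  # True: banks free, under threshold
print(core.try_inject([(0, 1)]))          # False: rc(0,1) != 0 -> throttle
```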

SLIDE 15

An Alternative Memory Access Scheduling

Observation: locality information is affected by the NoC traffic.
Multiple requests are grouped into superpackets, considering the order of the requests and source/destination criteria.
Two grouping schemes: ICR and OCR.
GPGPU-Sim was used for the experiments.
95% speedup with BIFO + mLCA.

SLIDE 16

An Alternative Memory Access Scheduling

The authors did not explain the source of the improvement obtained.
They did not characterize the applications evaluated, which makes it difficult to assess robustness and effectiveness.
Key idea to retain: distributing the scheduling tasks across different points of the memory subsystem.
SLIDE 17

Scheduling Based on a Potential Function

Paper: N.B. Lakshminarayana, J. Lee, H. Kim, J. Shin, "DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function", IEEE Computer Architecture Letters, vol. 11, no. 2, pp. 33-36, July-Dec. 2012.
Models DRAM behavior and the scheduling policy, called α-SJF (α-Shortest Job First), through a potential function.
The memory requests of the warps are treated as jobs.
Two GPU features are captured: large-scale multithreading and SIMD execution.

SLIDE 18

Scheduling Based on a Potential Function

α-SJF alone does not consider row-buffer locality
The policy chooses between SJF and FR-FCFS, bringing in:
  A tolerance metric
  The SIMD execution model
  Row-buffer locality at the DRAM
Baseline potential function (for GDDR5 memories, m = 3):

r(t) = Σ_{i=1..N} q_i(t) + Σ_{i=1..N} p_i(t)    (Eq. 1)

r(t+1) = r(t) - 1      (hit in the DRAM row buffer)
r(t+1) = r(t) - 1/m    (miss in the DRAM row buffer)

m = (service time of a request with a DRAM row-buffer miss) / (service time of a request with a DRAM row-buffer hit)
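The hit/miss asymmetry behind the update rule can be made concrete. In this illustrative Python sketch (using the slide's m = 3 for GDDR5), a row-buffer miss consumes m DRAM service slots while a hit consumes one, which is equivalent to the potential dropping by 1 per slot on a hit and by 1/m per slot on a miss:

```python
def service_steps(requests, m=3):
    """DRAM service slots needed to serve a sequence of requests under
    the potential model above: a row-buffer hit consumes one slot, a
    miss consumes m slots (m = miss-to-hit service-time ratio; the
    slide gives m = 3 for GDDR5).

    requests: list of booleans, True = row-buffer hit. Illustrative only.
    """
    return sum(1 if hit else m for hit in requests)

# Four row-buffer hits drain the job's potential in 4 slots...
print(service_steps([True] * 4))     # 4
# ...while four misses take 3x as long on GDDR5.
print(service_steps([False] * 4))    # 12
```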

SLIDE 19

Scheduling Based on a Potential Function

To reflect SIMD execution, a new parameter α is introduced. α serves two purposes:
  Select a request from the warp with the shortest queue
  Schedule requests from the same warp together
How is the selection done? By changing the value of α depending on the benchmark.

Introducing the tolerance metric:

r(t) = Σ_{i=1..N} q_i(t)^α + Σ_{i=1..N} p_i(t)^α,  0 ≤ α ≤ 1    (Eq. 2)

f_tol(c_k, t) = r_k(t) · w_c    (Eq. 3)
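To see how the exponent α in Eq. 2 produces SJF-like behavior, the following Python sketch (an illustrative reading, not the paper's implementation) serves whichever warp queue minimizes the potential after the step; with a sublinear α the shortest queue wins:

```python
def potential(queues, alpha):
    """Eq. 2-style potential: the sum of per-warp queue lengths raised
    to alpha (0 <= alpha <= 1). A sublinear alpha rewards finishing one
    warp's requests over spreading service evenly across warps."""
    return sum(q ** alpha for q in queues)

def best_service(queues, alpha):
    """Pick which warp's queue to decrement so that the potential after
    the step is smallest: an illustrative reading of alpha-SJF."""
    candidates = []
    for i, q in enumerate(queues):
        if q > 0:
            after = queues[:i] + [q - 1] + queues[i + 1:]
            candidates.append((potential(after, alpha), i))
    return min(candidates)[1]

# With alpha < 1 the shortest queue is served first (SJF-like):
print(best_service([4, 1, 3], alpha=0.5))  # 1
# With alpha = 1 the potential drop is identical for every choice,
# so the exponent is what encodes the shortest-job preference.
```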
SLIDE 20

Scheduling Based on a Potential Function

The complete scheduling policy is formulated with two particular features:
  Tolerance-aware core selection
  Queue selection
MacSim was used to perform the experiments, with two variations of the policy:
  One services writes only after all reads have been served
  One gives no preference between reads and writes

r(t) = Σ_{i=1..N} q_i(t)^α · 1/f_tol(c_q, t) + Σ_{i=1..N} p_i(t)^α · 1/f_tol(c_p, t)    (Eq. 4)
SLIDE 21

Scheduling Based on a Potential Function

Backprop, NearestNeighbor, OceanFFT and Streamcluster achieve 3%, 9%, 11% and 7% speedups, respectively.
When α-SJF or α-SJFW boosts performance, larger values of α often show more benefit.
This shows the adaptability of the potential function for the α-SJF policy.

[Figure: relative IPC (0.6-1.15) of α-SJF and α-SJFW with α = 0.25, 0.50, 0.75, 1.00 on BackProp, BlackScholes, CFD, NearestNeighbor, OceanFFT and StreamCluster.]

SLIDE 22

Scheduling Based on a Potential Function

The complexity of the access patterns determines the performance of the scheduling policy; a higher degree of coordination is needed.
As it stands, the approach is impractical:
  The potential function needs to account for the cost that serving a row-miss request imposes on all subsequent requests
  It is necessary to consider not only the SIMD execution model but also the various behavioral cases (adaptability)

SLIDE 23

Staged Memory Scheduling

Paper: R. Ausavarungnirun, K.K. Chang, L. Subramanian, G. Loh, O. Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems", Proc. of the Intl. Symposium on Computer Architecture, 2012.
Addresses the scheduling problem for a CPU+GPU system: the huge number of accesses generated by the GPU can harm CPU performance.

[Figure: CPU memory requests and GPU memory requests in the request buffers, panels a)-c).]

SLIDE 24

Staged Memory Scheduling

Centralized out-of-order scheduling (FR-FCFS) is not adequate:
  The size of the buffers increases
  Combinational complexity
  Latency, area and power
  Complex logic is needed to exploit row-buffer locality and meet GDDR timing constraints
Proposed mechanism: the Staged Memory Scheduler (SMS)
  Decouples the scheduling tasks from the memory controller
  Distributes the tasks across simpler hardware structures

SLIDE 25

Staged Memory Scheduling

SMS has three stages:
  Detect basic application memory-access characteristics
  Prioritize across applications on the CPU and GPU cores
  DRAM command scheduling: meet timing constraints and resolve resource conflicts
Different behavior for different applications: robustness
SMS has three corresponding components:
  Batch builder
  Batch scheduler
  DRAM command scheduler

SLIDE 26

Staged Memory Scheduling

Batch: a group of memory requests from the same application with row-buffer locality
Batch builder: one FIFO per memory-access source (core)
The batch scheduler chooses a batch and sends its accesses to the DRAM command scheduler:
  It considers ready batches only
  It chooses based on SJF or round-robin (RR)
Applications are classified based on memory intensity
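The batch-builder stage can be sketched as follows (Python, illustrative; the real batch builder also bounds batch sizes and ages out partial batches, details not shown on the slide):

```python
from collections import defaultdict

def build_batches(requests):
    """Group each source's in-order requests into batches whose members
    share a (bank, row), in the spirit of the SMS batch builder.

    requests: list of (source, bank, row) tuples in arrival order.
    Returns a dict mapping each source to its list of batches.
    """
    batches = defaultdict(list)
    for source, bank, row in requests:
        per_src = batches[source]
        # Extend the open batch if the new request hits the same row...
        if per_src and per_src[-1][-1][1:] == (bank, row):
            per_src[-1].append((source, bank, row))
        else:
            # ...otherwise close it and start a new batch.
            per_src.append([(source, bank, row)])
    return dict(batches)

reqs = [("cpu0", 0, 5), ("cpu0", 0, 5), ("gpu", 1, 9), ("cpu0", 0, 7)]
bb = build_batches(reqs)
print([len(batch) for batch in bb["cpu0"]])   # [2, 1]
```

Because every request inside a batch targets the same row, the downstream DRAM command scheduler can serve a whole batch as row hits after a single activation.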

SLIDE 27

Staged Memory Scheduling

A configurable parameter p is introduced:
  The batch scheduler chooses SJF or RR based on p
  When p is high, SJF is used, prioritizing CPU applications
  When p is low, RR is used, prioritizing GPU applications
Architecture of the DRAM command scheduler:
  One FIFO coupled with each DRAM bank
  The choice of FIFO depends on the opened row buffers
SMS (which adds a two-cycle overhead) is bypassed when:
  The CPU+GPU are executing latency-sensitive applications
  The system is lightly loaded
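The p-driven choice between SJF and RR can be sketched as below (Python; treating p as a probability and using a uniform random pick as a stand-in for round-robin are assumptions of this sketch):

```python
import random

def pick_batch(ready, p, rng=random):
    """Stage-two batch scheduler sketch: with probability p pick the
    source with the fewest outstanding requests (SJF, which tends to
    favor latency-sensitive CPU applications); otherwise pick uniformly
    at random as a stand-in for round-robin (which tends to favor the
    bandwidth-hungry GPU).

    ready: dict mapping source name to its outstanding request count.
    """
    if rng.random() < p:
        return min(ready, key=ready.get)   # SJF: shortest queue first
    return rng.choice(sorted(ready))       # RR stand-in (assumption)

ready = {"cpu0": 2, "cpu1": 3, "gpu": 64}
# p = 1.0: always SJF, so the lightest CPU source wins.
print(pick_batch(ready, p=1.0))   # cpu0
```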

SLIDE 28

Staged Memory Scheduling

For non-SMS memory controllers, the system reserves half of the request-buffer entries for CPU requests.
The authors used an in-house cycle-accurate simulator for the experiments.
Performance is evaluated with the following metrics:

CPU_WS = Σ_{i=1..NumCores} IPC_i^shared / IPC_i^alone    (Eq. 5)

GPU_speedup = GPUFrameRate^shared / GPUFrameRate^alone    (Eq. 6)

CGWS = CPU_WS + GPUweight · GPU_speedup    (Eq. 7)

Unfairness = max( max_i IPC_i^alone / IPC_i^shared , GPUFrameRate^alone / GPUFrameRate^shared )
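Eqs. 5-7 and the unfairness metric translate directly into code; a short Python sketch with hypothetical per-core numbers:

```python
def cpu_ws(ipc_shared, ipc_alone):
    """Eq. 5: CPU weighted speedup, the sum over cores of
    IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def gpu_speedup(frame_shared, frame_alone):
    """Eq. 6: shared GPU frame rate over standalone frame rate."""
    return frame_shared / frame_alone

def cgws(ipc_shared, ipc_alone, frame_shared, frame_alone, gpu_weight):
    """Eq. 7: CPU-GPU weighted speedup."""
    return (cpu_ws(ipc_shared, ipc_alone)
            + gpu_weight * gpu_speedup(frame_shared, frame_alone))

def unfairness(ipc_shared, ipc_alone, frame_shared, frame_alone):
    """Maximum slowdown across the CPU cores and the GPU."""
    slowdowns = [a / s for s, a in zip(ipc_shared, ipc_alone)]
    slowdowns.append(frame_alone / frame_shared)
    return max(slowdowns)

# Two CPU cores halved and unchanged in IPC; GPU frame rate halved.
print(cgws([1.0, 2.0], [2.0, 2.0], 30.0, 60.0, gpu_weight=2.0))  # 2.5
print(unfairness([1.0, 2.0], [2.0, 2.0], 30.0, 60.0))            # 2.0
```

Note how GPUweight lets the same numbers favor either side: a large weight makes the GPU's frame-rate slowdown dominate CGWS, which is exactly the trade-off discussed on the next slides.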

SLIDE 29

Staged Memory Scheduling

SMS with p = 0.9 improves CPU performance by 22.1%/35.7% on average over ATLAS and TCM, by restricting the bandwidth allocated to the GPU.
SMS-0.9 reduces the GPU frame rate, but it stays higher than 30 frames per second.
For p = 0, the frame rate increases.

[Figure: CPU weighted speedup and GPU frame rate across workload categories (L, ML, M, HL, HML, HM, H, Avg) for FR-FCFS, ATLAS, TCM, SMS-0.9 and SMS-0.]

SLIDE 30

Staged Memory Scheduling

SMS-0.9 achieves higher CGWS than FR-FCFS, ATLAS and TCM for small GPUweight (< 7.5).
As GPUweight increases, SMS-0.9 has lower CGWS.
A p that maximizes CGWS favors the GPU, slowing down the CPU applications and thus leading to higher unfairness.
For smaller GPUweight, SMS provides the highest fairness.
SMS consumes 66.7% less leakage power and occupies 46.3% less area.

[Figure: normalized CGWS and unfairness as a function of GPUweight (0.001-1000) for FR-FCFS, ATLAS, TCM and the SMS variants (SMS-0.9, SMS-0.05, SMS-0.02, SMS-0, SMS-Max), panels a)-c).]

SLIDE 31

Staged Memory Scheduling

Comparing only against policies that are agnostic of the SIMD execution model seems unfair; it is necessary to compare against other scheduling mechanisms already tested on GPU systems.
The work provides a key metric for characterizing applications: memory intensity.
Only graphics applications were considered; an application-aware approach is necessary to extend the applicability of the proposed mechanism.

SLIDE 32

Conclusion

Widely used approach: FR-FCFS. Why strive for different solutions?
General approaches:
  A centralized mechanism (FR-FCFS), improving over the base policy
  Throttling the instruction-issue mechanism
  Distributing the scheduling tasks across different points of the GPU system
Implementing a general-case scheduling mechanism is a challenging task.


SLIDE 33

Conclusion

Any mechanism must consider the execution model together with the application's behavior.
A thorough comparison of the mechanisms presented is not straightforward: each mechanism is designed to exploit specific characteristics.


SLIDE 34

References

N.B. Lakshminarayana, H. Kim, "Effect of Instruction Fetch and Memory Scheduling on GPU Performance", Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.
Y. Kim, H. Lee, J. Kim, "An Alternative Memory Access Scheduling in Manycore Accelerators", Intl. Conf. on Parallel Architectures and Compilation Techniques, 2011.


SLIDE 35

References

N.B. Lakshminarayana, J. Lee, H. Kim, J. Shin, "DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function", IEEE Computer Architecture Letters, vol. 11, no. 2, pp. 33-36, July-Dec. 2012.
R. Ausavarungnirun, K.K. Chang, L. Subramanian, G. Loh, O. Mutlu, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems", Proc. of the International Symposium on Computer Architecture, 2012.


SLIDE 36

References

H. Kim, J. Lee, N. Lakshminarayana, J. Sim, J. Lim, T. Pho, "MacSim: A CPU-GPU Heterogeneous Simulation Framework. User Guide", Georgia Institute of Technology, 2012.
NVIDIA Corporation, "NVIDIA CUDA C Programming Guide", v4.2, 2012.
G. Diamos, A. Kerr, S. Yalamanchili, N. Clark, "Ocelot: A Dynamic Optimization Framework for Bulk-Synchronous Applications in Heterogeneous Systems", PACT '10.
