
slide-1
SLIDE 1

Dynamic Memory Dependence Predication

Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles

6/19/2018

slide-2
SLIDE 2

Background

  • 1. Store instructions do not update the cache until they are retired (too late).
  • 2. A store queue is implemented to keep the speculative store instructions before they are retired.
  • 3. Load instructions need to associatively search the store queue to execute early.

slide-4
SLIDE 4

Store Queue Design

[Figure: store queue holding SW1, SW2, SW3, SW4, searched by the load LW]

Search rules:

  • The addresses match.
  • The store has to be older than the load.
  • If there are multiple matching stores, the youngest store is selected.

Due to this hardware complexity, the store queue does not scale well.
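
The three search rules can be sketched in software as follows. This is only an illustrative model, not the hardware design from the talk: the `StoreEntry` type, the sequence numbers, and the example addresses are assumptions, and real hardware performs the comparison in parallel with an associative (CAM) lookup rather than a loop.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    seq: int   # program-order sequence number (smaller means older); assumed field
    addr: int  # store address
    data: int  # store data


def forward_from_store_queue(store_queue: List[StoreEntry],
                             load_seq: int,
                             load_addr: int) -> Optional[int]:
    """Return data from the youngest older store with a matching address, else None."""
    best = None
    for st in store_queue:
        # Rule 1: addresses match. Rule 2: the store is older than the load.
        if st.addr == load_addr and st.seq < load_seq:
            # Rule 3: among multiple matches, keep the youngest store.
            if best is None or st.seq > best.seq:
                best = st
    return None if best is None else best.data


# Hypothetical contents standing in for SW1..SW4.
sq = [
    StoreEntry(seq=1, addr=0x100, data=11),
    StoreEntry(seq=2, addr=0x200, data=22),
    StoreEntry(seq=3, addr=0x200, data=33),
    StoreEntry(seq=4, addr=0x300, data=44),
]
```

Every load has to be compared against every entry, with an age check on top, which is the complexity the slide refers to.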

slide-5
SLIDE 5

Store-Queue-Free Architecture

[Figure: memory dependence prediction pairs the load LW with the store SW2; physical register P10 is shared through memory cloaking]

    SW1
    SW2 : SW $9(P10), 0($7)
    SW3
    SW4
    LW  : LW $6, 0($8)

The DEF-store-load-USE dependence chain is collapsed into a direct DEF-USE dependence.
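
A minimal sketch of the renaming step behind memory cloaking, assuming a rename map modeled as a plain dictionary. The function name and the physical registers `P3` and `P4` are hypothetical; only `$6`, `$7`, `$8`, `$9`, and `P10` come from the slide.

```python
def rename_with_cloaking(rename_map, load_dest, predicted_store_src):
    """Map the load's destination to the physical register holding the store's data."""
    updated = dict(rename_map)  # leave the caller's map untouched
    updated[load_dest] = updated[predicted_store_src]
    return updated


# $9 is the data register of SW2 and is mapped to P10, as on the slide.
rename_map = {"$9": "P10", "$7": "P3", "$8": "P4"}  # P3, P4 are made up
after = rename_with_cloaking(rename_map, load_dest="$6", predicted_store_src="$9")
```

After renaming, consumers of `$6` read `P10` directly, so they no longer wait for the store and the load to go through memory.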

slide-9
SLIDE 9

Store-Queue-Free Architecture

  • If the memory dependence prediction is wrong, a misspeculation recovery is launched.
  • If a load is frequently mispredicted, it is marked as a low confidence load.
  • A low confidence load gets its data only from the cache. Therefore, it has to wait until the predicted store (SW2) retires and updates the cache (too late).
  • Low confidence loads are kept in a special buffer, from which they are selected to execute once their predicted stores retire.

    SW1
    SW2 : SW $9(P10), 0($7)
    SW3
    SW4
    LW  : LW $6, 0($8)

[Figure: P10 is shared between SW2 and LW through memory cloaking]
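
The bullets above can be modeled roughly as follows. The slides do not say how confidence is tracked, so the saturating counter, its width, and its threshold are assumptions made for illustration; the buffer class is likewise a hypothetical software stand-in for the special buffer.

```python
from collections import defaultdict


class ConfidenceTable:
    """Hypothetical per-load saturating counter (width and threshold are assumed)."""

    def __init__(self, max_count=3, threshold=2):
        self.max_count = max_count
        self.threshold = threshold
        self.counters = defaultdict(lambda: max_count)

    def update(self, load_pc, mispredicted):
        c = self.counters[load_pc]
        if mispredicted:
            self.counters[load_pc] = max(c - 1, 0)
        else:
            self.counters[load_pc] = min(c + 1, self.max_count)

    def is_low_confidence(self, load_pc):
        return self.counters[load_pc] < self.threshold


class DelayedLoadBuffer:
    """Holds low confidence loads until their predicted store retires."""

    def __init__(self):
        self.waiting = defaultdict(list)  # predicted store id -> waiting load ids

    def insert(self, load_id, predicted_store_id):
        self.waiting[predicted_store_id].append(load_id)

    def on_store_retire(self, store_id):
        # Loads released here may now read the cache safely.
        return self.waiting.pop(store_id, [])
```

In this model a load that waits on SW2 is released only by `on_store_retire("SW2")`, which mirrors the "too late" problem: the load cannot start until retirement.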

slide-10
SLIDE 10

Load instruction distribution

Direct access

  • Read data from the cache.

Bypassing

  • Rename the destination register with the store register (memory cloaking).

Delayed access

  • Do not read the cache until the store is retired (low confidence).

[Figure: chart values 78.75%, 14.27%, 6.98%]

slide-11
SLIDE 11

Delayed access vs. bypassing

[Figure: average execution time comparison]

A value greater than zero means that the delayed access load instructions take more cycles to execute.

slide-15
SLIDE 15

Dynamic Memory Dependence Predication

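[Figure: DMDP compares the store and load addresses (==?); $33 carries the value read from the data cache]

    SW1
    SW2 : SW $9, 0($7)
    SW3
    SW4
    LW  : LW $6, 0($8)

Instructions shown for the load:

    CMP  $32, $7, $8
    LW   $33, 0($8)
    CMOV $6, $32, $9
    CMOV $6, !$32, $33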
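
The four instructions on this slide can be walked through with a small software model. This is a sketch under simplifying assumptions: registers are a dictionary, the cache is a dictionary from address to value with a default of 0, and both memory offsets are 0 as on the slide. The function name `dmdp_load` is made up.

```python
def dmdp_load(regs, cache):
    """Model the predicated sequence that produces $6 for LW $6, 0($8)."""
    regs = dict(regs)
    # CMP $32, $7, $8 : does the predicted store's address equal the load's?
    regs["$32"] = regs["$7"] == regs["$8"]
    # LW $33, 0($8) : read the cache regardless of the prediction
    regs["$33"] = cache.get(regs["$8"], 0)
    # CMOV $6, $32, $9 : addresses match, take the store's data
    if regs["$32"]:
        regs["$6"] = regs["$9"]
    # CMOV $6, !$32, $33 : addresses differ, take the cache value
    if not regs["$32"]:
        regs["$6"] = regs["$33"]
    return regs


cache = {0x200: 5}
same = dmdp_load({"$7": 0x100, "$8": 0x100, "$9": 42}, cache)
diff = dmdp_load({"$7": 0x100, "$8": 0x200, "$9": 42}, cache)
```

Exactly one of the two conditional moves takes effect, so `$6` is produced without waiting for the store to retire.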

slide-16
SLIDE 16

DMDP can also mispredict dependence

[Figure: the DMDP example (SW1, SW2 : SW $9, 0($7), SW3, SW4, LW : LW $6, 0($8)) with the address comparison (==?), the data cache, and a "Real dependence" annotation]

slide-17
SLIDE 17

When a predicate is inserted for a load instruction:

  • 1. The load depends on the predicted store ☑
  • 2. The load does not depend on any in-flight store ☑
  • 3. The load depends on a different store ☒
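
The three cases can be checked with a small model that compares the predicated result against the architecturally correct value. It is a sketch only: the in-flight stores are a list of hypothetical `(address, data)` pairs ordered oldest to youngest, all older than the load, and none of them has reached the cache yet.

```python
def true_value(cache, inflight_stores, load_addr):
    """Correct value: youngest in-flight store to the address, else the cache."""
    for addr, data in reversed(inflight_stores):
        if addr == load_addr:
            return data
    return cache.get(load_addr, 0)


def predicated_value(cache, predicted_store, load_addr):
    """Value chosen by the predicate: store data on a match, else the cache."""
    addr, data = predicted_store
    return data if addr == load_addr else cache.get(load_addr, 0)


cache = {0x100: 1, 0x200: 2}
stores = [(0x100, 10), (0x300, 30)]  # hypothetical in-flight stores
predicted = stores[0]

results = {}
for case, load_addr in [("predicted store", 0x100),
                        ("no in-flight store", 0x200),
                        ("different store", 0x300)]:
    results[case] = (predicated_value(cache, predicted, load_addr)
                     == true_value(cache, stores, load_addr))
```

In this model only the "different store" case yields a wrong value, because the predicate falls back to a cache line that the other store has not updated yet.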


slide-19
SLIDE 19

Memory dependence prediction results over low confidence loads

IndepStore (88.6%)

  • The load is independent of any in-flight store.

DiffStore (3.7%)

  • The load is dependent on a different store.

Correct (7.7%)

  • The prediction is right.

DMDP covers IndepStore (88.6%) and Correct (7.7%), 96.3% in total; only DiffStore (3.7% of low confidence loads) is not covered.

slide-20
SLIDE 20

Simulation configuration

ROB / RS / PRF            256 / 64 / 320
Fetch / Decode / Issue    8 / 8 / 8
Store Queue (baseline)    Unlimited entries, 4 cycles latency
Store Buffer              16 entries, store coalesce
Cache                     iL1, dL1: 32kB 8-way set associative, 4 cycles hit latency, 2 read ports, 1 write port
                          L2: 512kB 8-way set associative, 10 cycles hit latency
Memory                    16GB DDR3L-1600, 2 channels, 2 ranks, 8 banks, open page, up to 64 pending requests
Recovery penalty          Minimum 15 cycles
IntALU / IntMUL           1 cycle / 3 cycles
IntDIV, FP ALU            7 cycles
Branch predictor          8kB TAGE
Tech node                 22nm
Clock frequency           3.2GHz

  • 1. Baseline : a superscalar processor with unlimited store queue entries, using store-set for memory dependence prediction.
  • 2. NoSQ : the store-queue-free architecture.
  • 3. DMDP : dynamic memory dependence predication.
  • 4. Perfect : NoSQ + perfect memory dependence predictor.

slide-21
SLIDE 21

Evaluation results

[Figure: IPC normalized to the baseline; values shown: 0.992, 1.049, 1.068]

slide-22
SLIDE 22

Evaluation results

Average execution time for low confidence loads is significantly reduced (DMDP vs. NoSQ):

  • hmmer: 37.19 cycles -> 8.91 cycles
  • wrf: 61.59 cycles -> 12.78 cycles
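
The size of the reduction follows from simple division. The sketch below assumes the first number in each pair is the NoSQ latency and the second is the DMDP latency, which is how the arrow on the slide reads; the helper name is made up.

```python
def reduction_factor(before_cycles, after_cycles):
    """How many times shorter the average execution time became."""
    return before_cycles / after_cycles


latencies = {"hmmer": (37.19, 8.91), "wrf": (61.59, 12.78)}  # cycles, from the slide
factors = {name: reduction_factor(nosq, dmdp)
           for name, (nosq, dmdp) in latencies.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x")
```

Under that reading, low confidence loads complete roughly 4.2 times faster in hmmer and roughly 4.8 times faster in wrf.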

slide-23
SLIDE 23

Evaluation results

DMDP still encounters memory dependence mispredictions in some benchmarks (MPKI: mispredictions per 1k retired instructions):

  • bzip2: 1.409 MPKI
  • hmmer: 1.029 MPKI
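
For reference, MPKI is computed as below. The raw counts in the example are invented to reproduce the bzip2 figure; the slide reports only the resulting rate.

```python
def mpki(mispredictions, retired_instructions):
    """Mispredictions per 1000 retired instructions."""
    return mispredictions * 1000 / retired_instructions


# Hypothetical counts: 1409 mispredictions over one million retired instructions.
example = mpki(1409, 1_000_000)
```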

slide-24
SLIDE 24

Energy Delay Product

[Figure: EDP normalized to NoSQ; value shown: 0.933]

  • 1. Reducing the execution time.
  • 2. Fewer memory dependence mispredictions mean fewer misprediction recoveries.
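
Energy-delay product is energy multiplied by execution time, so both bullets above lower it. The sketch below shows the normalization; the relative energy and delay values are invented for illustration and are not the measurements behind the slide.

```python
def edp(energy, delay):
    """Energy-delay product."""
    return energy * delay


def normalized_edp(energy, delay, ref_energy, ref_delay):
    """EDP relative to a reference design (values below 1.0 are better)."""
    return edp(energy, delay) / edp(ref_energy, ref_delay)


# Hypothetical: 2% less energy and 5% less time than the reference.
example = normalized_edp(0.98, 0.95, ref_energy=1.0, ref_delay=1.0)
```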

slide-25
SLIDE 25

Conclusion

  • 1. DMDP is the first mechanism to use predication for memory dependence handling.
  • 2. The storage for maintaining the low confidence loads is completely removed.
  • 3. The memory dependence is translated to register dependence and will be checked in the reservation station.