The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors



SLIDE 1

e-mail: meacacio@ditec.um.es

The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors

Manuel E. Acacio, José González, José M. García and José Duato

SLIDE 2

Introduction

Scalable shared-memory multiprocessors
  Based on the use of directories
  Known as cc-NUMA architectures

Long L2 miss latencies
  Mainly caused by the indirection introduced by the access to the directory information
  – Network latency
  – Directory latency

Upgrade misses
  Important fraction of the L2 miss rate (> 40%)
  Store instruction for which a read-only copy of the line is found in the local L2 cache
  Exclusive ownership is required

SLIDE 3

Introduction

Upgrade misses in a conventional cc-NUMA

– Line L shared by nodes 1, 3 and 4
– Directory: Node 2
– Node 1 issues an Upgrade for L

[Figure: message sequence for the upgrade miss. Node 1 holds line L in the Shared state and suffers a store miss; sharer nodes 3 and 4 also hold L in the Shared state; directory node 2 records nodes 1, 3 and 4 as sharers and finally records Node 1 as the owner of L.]

1st: Store Miss (UPGR), from Node 1 to the directory
2nd: Inv L, from the directory to sharer nodes 3 and 4
3rd: Ack, from nodes 3 and 4
4th: Line L Ownership, from the directory to Node 1

SLIDE 4

Introduction

Upgrade misses using prediction

– Line L shared by nodes 1, 3 and 4
– Directory: Node 2
– Node 1 issues an Upgrade for L

[Figure: message sequence for the same upgrade miss with prediction. Node 1 predicts nodes 3 and 4 as sharers; directory node 2 checks its sharer list for L (nodes 1, 3 and 4) and finds both predictions correct; Node 1 becomes the owner of L.]

1st: Store Miss (UPGR) from Node 1 to the directory, and Inv L from Node 1 to nodes 3 and 4, all sent in parallel
2nd: Line L Ownership from the directory, and Acks from nodes 3 and 4
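The two diagrams can be compared by counting the sequential message steps on the critical path. The following is only an illustrative sketch: the function names are hypothetical, every message is assumed to cost one step, directory lookup time is ignored, and the fallback for mispredicted sharers (the directory invalidates them before replying) is an assumption about timing rather than something the slides quantify.

```python
def conventional_upgrade_steps(other_sharers):
    """Sequential message steps when the directory resolves the upgrade miss."""
    steps = 1                      # 1st: UPGR, requester -> directory
    if other_sharers:
        steps += 2                 # 2nd: Inv to the sharers; 3rd: their Acks
    return steps + 1               # last: ownership reply to the requester


def predicted_upgrade_steps(other_sharers, predicted):
    """Sequential steps when the requester invalidates predicted sharers itself."""
    missed = set(other_sharers) - set(predicted)
    if not missed:
        # 1st: UPGR and Invs leave the requester in parallel;
        # 2nd: ownership reply and Acks come back in parallel.
        return 2
    # Assumption: sharers that were not predicted are invalidated by the
    # directory, so the miss takes as long as a conventional one.
    return conventional_upgrade_steps(missed)


if __name__ == "__main__":
    # Scenario from the slides: Node 1 upgrades L, other sharers are 3 and 4.
    print(conventional_upgrade_steps({3, 4}))        # 4
    print(predicted_upgrade_steps({3, 4}, {3, 4}))   # 2
```

Under these assumptions a fully correct prediction halves the number of sequential steps, which is the effect the two figures illustrate.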

SLIDE 5

Introduction

Two key observations motivate our work:
  Repetitive behavior found for upgrade misses
  Small number of invalidations sent on an upgrade miss

Two main elements must be developed:

An effective prediction engine
– Accessed on an upgrade miss
– Provides a list of the sharers

A coherence protocol
– Properly extended to support the use of prediction

SLIDE 6

Outline

Introduction
Predictor Design for Upgrade Misses
Extensions to a MESI Coherence Protocol
Performance Evaluation
Conclusions

SLIDE 7

Predictor Design for Upgrade…

Predictor characteristics:

Address-based predictor
– Accessed using the effective address of the line

3 pointers per entry
– Small number of sharers per line
– Addition of confidence bits for each pointer
– (3 × log2 N + 6) bits per entry

Implemented as a non-tagged table
– Initially, all 2-bit counters store 0
– Predictor is probed on each upgrade miss
    Miss predicted when confidence
– Predictor is updated in two situations:
    On the reply from the directory
    On a load miss serviced with a cache-to-cache ($-to-$) transfer (migratory data)
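A minimal sketch of such a table may help to fix ideas. It follows the characteristics listed above (non-tagged, indexed by line address, three pointers per entry, one 2-bit counter per pointer, all counters initially 0). Everything else is an assumption made for illustration: the class and method names, the 64-byte line size, the confidence threshold of 2 (the condition is cut off in the slide), and the training and replacement policy in `update`.

```python
class UpgradePredictor:
    """Non-tagged, address-indexed table of 3 (pointer, 2-bit counter) slots."""

    def __init__(self, num_entries=16 * 1024, line_bytes=64, threshold=2):
        self.num_entries = num_entries
        self.line_bytes = line_bytes    # assumed cache line size
        self.threshold = threshold      # assumed confidence threshold
        # Each slot is [node_pointer, confidence]; counters start at 0.
        self.table = [[[None, 0] for _ in range(3)] for _ in range(num_entries)]

    def _entry(self, address):
        # No tags: lines that map to the same index share one entry.
        return self.table[(address // self.line_bytes) % self.num_entries]

    def predict(self, address):
        """Probe on an upgrade miss; return the confidently predicted sharers."""
        return {node for node, conf in self._entry(address)
                if node is not None and conf >= self.threshold}

    def update(self, address, observed_sharers):
        """Train with the sharers observed for the line (assumed policy)."""
        entry = self._entry(address)
        for slot in entry:
            if slot[0] in observed_sharers:
                slot[1] = min(3, slot[1] + 1)    # saturate at 3 (2 bits)
            elif slot[0] is not None:
                slot[1] = max(0, slot[1] - 1)
        known = {slot[0] for slot in entry}
        for node in observed_sharers:
            if node in known:
                continue
            victim = min(entry, key=lambda s: s[1])
            if victim[1] == 0:                   # reuse free or distrusted slots only
                victim[0], victim[1] = node, 1


if __name__ == "__main__":
    predictor = UpgradePredictor(num_entries=16)
    predictor.update(0x1000, {3, 4})
    predictor.update(0x1000, {3, 4})
    print(sorted(predictor.predict(0x1000)))     # [3, 4]
```

With this policy a sharer must be observed twice before it is predicted, which reflects the "repetitive behavior" observation; a real design could choose a different threshold or update rule.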

SLIDE 8

Predictor Design for Upgrade…

Predictor Anatomy

SLIDE 9

Outline

Introduction
Predictor Design for Upgrade Misses
Extensions to a MESI Coherence Protocol
Performance Evaluation
Conclusions

SLIDE 10

Extensions to a MESI Protocol

Changes to the requesting node, the sharer nodes and the home directory

Requesting Node Operation

On suffering a predicted UPGRADE MISS
– Create and send invalidation messages to the predicted nodes
    Set the Predicted bit of each message to 1
– Send the miss to the directory
    Set the Predicted bit of the message to 1 and include the list of predicted nodes
– Collect the directory reply and the ACK / NACK from the predicted nodes:
    Re-invalidate those real sharers that replied NACK (if any)
– Gain exclusive ownership
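The requesting-node steps above can be sketched as follows. The `Message` type, its field names and both function names are hypothetical, and the sketch assumes that the directory reply tells the requester which nodes were real sharers; the slides do not describe the message formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    kind: str                       # "INV" or "UPGR"
    dest: int                       # destination node
    line: int                       # line address
    predicted: bool = False         # the Predicted bit
    predicted_nodes: frozenset = frozenset()


def start_predicted_upgrade(line, home, predicted_nodes):
    """Build the messages sent when a predicted upgrade miss starts."""
    # Invalidations go straight to the predicted sharers, Predicted bit set.
    msgs = [Message("INV", node, line, predicted=True)
            for node in sorted(predicted_nodes)]
    # The miss still goes to the directory and carries the predicted list.
    msgs.append(Message("UPGR", home, line, predicted=True,
                        predicted_nodes=frozenset(predicted_nodes)))
    return msgs


def nodes_to_reinvalidate(real_sharers, replies):
    """Real sharers that answered NACK must be invalidated again."""
    return {node for node in real_sharers if replies.get(node) == "NACK"}


if __name__ == "__main__":
    for msg in start_predicted_upgrade(line=0x40, home=2, predicted_nodes={3, 4}):
        print(msg.kind, "->", msg.dest)
    print(nodes_to_reinvalidate({3, 4}, {3: "ACK", 4: "NACK"}))   # {4}
```

Exclusive ownership would be taken only after the directory reply has arrived and every real sharer has acknowledged an invalidation.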

SLIDE 11

Extensions to a MESI Protocol

Sharer Node Operation

On receiving a predicted INVALIDATION message, depending on the state of the line:
– Pending Load Miss: store the invalidation and return NACK
– Pending UPGR Miss (line in the Shared state):
    Directory reply not received: return ACK and invalidate the line
    Directory reply previously received: return NACK
– No pending UPGR Miss and line in the Shared state:
    Return ACK and invalidate
    Insert the tag in the Invalidated Lines Table (ILT)
– Otherwise, return a NACK message

On suffering a Load Miss
– If an entry is found in the ILT, set the Invalidated bit of the message to 1
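The case analysis above maps naturally onto a small decision routine. This is a sketch under stated assumptions: the `SharerNode` class and all of its fields are hypothetical bookkeeping, MESI states are plain strings, and the ILT is modelled as an unbounded set, whereas the evaluated configuration uses a small table.

```python
from dataclasses import dataclass, field


@dataclass
class SharerNode:
    state: dict = field(default_factory=dict)              # tag -> "M"/"E"/"S"/"I"
    pending_load: set = field(default_factory=set)         # tags with a pending load miss
    pending_upgr: set = field(default_factory=set)         # tags with a pending upgrade miss
    dir_reply_received: set = field(default_factory=set)   # upgrades already answered
    stored_invalidations: set = field(default_factory=set)
    ilt: set = field(default_factory=set)                  # Invalidated Lines Table

    def on_predicted_invalidation(self, tag):
        """Return "ACK" or "NACK" for a predicted invalidation of `tag`."""
        if tag in self.pending_load:
            self.stored_invalidations.add(tag)   # remember it, but refuse for now
            return "NACK"
        shared = self.state.get(tag, "I") == "S"
        if tag in self.pending_upgr and shared:
            if tag in self.dir_reply_received:
                return "NACK"                    # this node already won the line
            self.state[tag] = "I"
            return "ACK"
        if shared:                               # Shared, no pending upgrade miss
            self.state[tag] = "I"
            self.ilt.add(tag)
            return "ACK"
        return "NACK"

    def invalidated_bit_for_load_miss(self, tag):
        """Value of the Invalidated bit carried by a load miss for `tag`."""
        return tag in self.ilt


if __name__ == "__main__":
    node = SharerNode(state={0x40: "S"})
    print(node.on_predicted_invalidation(0x40))       # ACK
    print(node.invalidated_bit_for_load_miss(0x40))   # True
```

The ILT entry is what later lets the directory recognise a load miss issued by a node that was invalidated through a prediction.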

SLIDE 12

Extensions to a MESI Protocol

Predictor + ILT added to each node

Anatomy of the Invalidated Lines Table (ILT)

SLIDE 13

Extensions to a MESI Protocol

Directory Node Operation

On receiving a predicted UPGRADE MISS
– If the line is in the Shared state
    All sharers predicted (TOTAL HIT): send reply
    Some actual sharers not predicted (PARTIAL HIT) or none correctly predicted (TOTAL MISS): invalidate and send reply
– Otherwise, process as usual (NOT INV)

On receiving a Load Miss
– If the Invalidated bit of the message is set and the requesting node is present in the sharing code: wait until the UPGR completes
– Otherwise, process as usual
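The directory's classification of a predicted upgrade miss can be sketched as a pure function. The function name, the string encoding of states and the returned tuple are assumptions; the four outcome labels are the ones used in the slides. The sketch assumes the nodes the directory must still invalidate are exactly the actual sharers missing from the prediction.

```python
def classify_predicted_upgrade(line_state, sharers, requester, predicted):
    """Return (outcome, nodes_the_directory_must_invalidate)."""
    if line_state != "S":
        return "NOT INV", set()          # handled as a regular request
    actual = set(sharers) - {requester}  # the requester keeps its own copy
    missed = actual - set(predicted)
    if not missed:
        return "TOTAL HIT", set()        # reply at once, nothing left to invalidate
    if actual & set(predicted):
        return "PARTIAL HIT", missed
    return "TOTAL MISS", missed


if __name__ == "__main__":
    # Scenario from the slides: L shared by nodes 1, 3 and 4; Node 1 upgrades.
    print(classify_predicted_upgrade("S", {1, 3, 4}, 1, {3, 4}))   # ('TOTAL HIT', set())
    print(classify_predicted_upgrade("S", {1, 3, 4}, 1, {3}))      # ('PARTIAL HIT', {4})
```

Predicted nodes that are not real sharers do not change the outcome here; they simply answer NACK to the requester.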

SLIDE 14

Outline

Introduction
Predictor Design for Upgrade Misses
Extensions to a MESI Coherence Protocol
Performance Evaluation
Conclusions

SLIDE 15

Performance Evaluation

RSIM multiprocessor simulator
We assume that predictors do not add any cycle

Benchmarks
– Applications with more than 25% upgrade misses, covering a variety of patterns
    EM3D, FFT, MP3D, Ocean and Unstructured

SLIDE 16

Performance Evaluation

Experimental Framework

Compared systems:
– Base: Traditional cc-NUMA using a bit-vector directory
– UPT: Added unlimited Prediction Table and ILT
– LPT: Added a "realistic" Prediction Table and ILT
    Prediction Table: 16K entries (non-tagged)
    ILT: 128 entries (fully associative)
    Total size less than 48 KB (1 MB L2 caches)

We study:
– Predictor accuracy
– Impact on latency of upgrade misses
– Impact on latency of load & store misses
– Impact on execution time
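As a rough check of the storage figure, the per-entry cost given in the predictor design, (3 x log2N + 6) bits, can be combined with the 16K-entry table size. The node count N is not stated in these slides, so the value 16 used below is purely a hypothetical example, and the function names are illustrative. The ILT is left out because its entry width is not given.

```python
def predictor_entry_bits(num_nodes):
    """3 node pointers of log2(N) bits each, plus three 2-bit counters."""
    pointer_bits = (num_nodes - 1).bit_length()   # ceil(log2 N) for N >= 1
    return 3 * pointer_bits + 6


def prediction_table_bytes(num_entries, num_nodes):
    """Storage of a non-tagged prediction table, in bytes."""
    return num_entries * predictor_entry_bits(num_nodes) // 8


if __name__ == "__main__":
    nodes = 16                                    # assumed system size
    print(predictor_entry_bits(nodes))            # 18
    print(prediction_table_bytes(16 * 1024, nodes) / 1024, "KB")
```

For the assumed 16 nodes the table alone takes 36 KB, which leaves room for the ILT within the quoted bound of 48 KB; other node counts give different totals.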

SLIDE 17

Performance Evaluation

Results (1). Predictor Accuracy

[Figure: Predictor Accuracy. Stacked bars for UPT and LPT on EM3D, Unstruct, Ocean, MP3D and FFT; y-axis: % Inv Misses; categories: Not Inv, Not Predict, Total Miss, Partial Hit, Total Hit.]

SLIDE 18

Performance Evaluation

Results (2). Average Upgrade Miss Latency

[Figure: Average Upgrade Miss Latency. Stacked bars for Base, UPT and LPT on EM3D, Unstruct, Ocean, MP3D and FFT; y-axis: Normalized Latency; components: Misc, Directory, Network.]

SLIDE 19

Performance Evaluation

Results (3). Average Load/Store Miss Latency

[Figure: Average Load and Store Miss Latencies. Bars for Base, UPT and LPT, shown separately for load and store misses, on EM3D, FFT, MP3D, Ocean and Unstruct; y-axis: Normalized Latency.]

SLIDE 20

Performance Evaluation

Results (4). Application Speed-ups

[Figure: Application Speed-ups. Bars for UPT and LPT on EM3D, FFT, MP3D, Ocean and Unstruct; y-axis: Speed-up, from 0% to 16%.]

SLIDE 21

Outline

Introduction
Predictor Design for Upgrade Misses
Extensions to a MESI Coherence Protocol
Performance Evaluation
Conclusions

SLIDE 22

Conclusions

Conclusions (1)

Upgrade misses are caused by a store instruction when a read-only copy is found:
– Message sent to directory
– Directory lookup
– Invalidations sent to sharers
– Replies to the invalidations sent back
– Ownership message returned

They account for an important fraction of the L2 miss rate (> 40%)

We propose the use of prediction for accelerating them
– On an upgrade miss: predict sharers and invalidate them in parallel with the access to the directory
– Based on:
    Repetitive behavior
    Small number of invalidations per upgrade miss

SLIDE 23

Conclusions

Conclusions (2)

Results:
– A large fraction of upgrade misses is successfully predicted
– Reductions of more than 40% in average upgrade miss latency
– Load miss latencies are not affected in most cases
– Speed-ups in application execution time of up to 14%

SLIDE 24

e-mail: meacacio@ditec.um.es

The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors

Manuel E. Acacio, José González, José M. García and José Duato
