slide-1
SLIDE 1

Cost-Effective Compiler Directed Memory Prefetching and Bypassing

Daniel Ortega, Eduard Ayguadé, Jean-Loup Baer and Mateo Valero

Departamento de Arquitectura de Computadores, Universidad Politécnica de Cataluña, Barcelona ({dortega,eduard,mateo}@ac.upc.es)

Department of Computer Science and Engineering, University of Washington, Seattle (baer@cs.washington.edu)

PACT 2002 – p.1/20

slide-2
SLIDE 2

Conventional Cache Hierarchy

[Figure: Regs, L1, L2 and Main Memory in sequence. Latencies: L1 1 cycle, L2 10s of cycles, Main Memory >100 cycles.]

In the later builds of this slide the L1 latency is relabelled "3-5 cycles (more?)" and the L1 is annotated "(smaller?)".

slide-5
SLIDE 5

Our approach

Attack register-L1 gap with memory instruction bypassing

Use hardware prefetcher for L1-L2 gap

Directed by the software, thus simple hardware

No recovery needed in case of misprediction

slide-6
SLIDE 6

Limit Study

[Figure: bar chart of speedup (axis from 1.0 to 2.5) for applu, apsi, hydro2d, swim, tomcatv and the average. Four configurations: no prefetching with no bypassing, no prefetching with perfect bypassing, perfect prefetching with no bypassing, perfect prefetching with perfect bypassing.]

Limit values for a 4-way machine

slide-7
SLIDE 7

Index

Motivation

Memory Instruction Bypassing

Compiler Directed Memory Prefetcher

Comparison with APDP


slide-8
SLIDE 8

MIB through Renaming I

Original loop body (register numbers did not survive extraction and are shown as r…):

    ...
    pref a[i]
    ...
    load r…, a[i+1]
    load r…, a[i]
    load r…, a[i+3]
    ...
    loop branch

Transformed loop body, in which the pref names a register and carries an operand list (also lost in extraction):

    ...
    pref r…, a[i] (…)
    ...
    load r…, a[i+1]
    load r…, a[i]
    load r…, a[i+3]
    ...
    loop branch

slide-10
SLIDE 10

MIB through Renaming II

[Figure, built up over six steps: a Renaming Table and a Special Renaming Table, each with entries labelled r… that hold values labelled f…, shown next to the transformed loop body.]

    ...
    pref r…, a[i] (…)
    ...
    load r…, a[i+1]
    load r…, a[i]
    load r…, a[i+3]
    ...
    loop branch

Step 1: the Renaming Table holds three f… values; the Special Renaming Table is empty

Step 2: the pref fills three entries of the Special Renaming Table

Steps 3 to 5: each load moves one f… value from the Special Renaming Table into the Renaming Table

Step 6: at the loop branch the Special Renaming Table is empty again
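The table movement shown in the builds can be mimicked with a small toy model. This is only a sketch under stated assumptions, not the hardware described in the talk: it assumes the pref names the logical registers that later loads will write, that one physical register is reserved per name, and that a load finding its register in the special table needs no memory access. The class `Renamer`, the register names `r10` to `r12` and the free-list numbering are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Renamer:
    """Toy model of a renaming table plus a special renaming table."""
    rename: dict = field(default_factory=dict)    # logical -> physical
    special: dict = field(default_factory=dict)   # logical -> physical, filled by pref
    free: count = field(default_factory=lambda: count(100))  # hypothetical free list
    memory_accesses: int = 0

    def pref(self, dest_regs):
        # Assumption: the pref lists the logical registers of the loads it
        # covers, and one physical register is reserved for each of them.
        for reg in dest_regs:
            self.special[reg] = f"f{next(self.free)}"

    def load(self, dest_reg):
        if dest_reg in self.special:
            # Bypass: move the reserved mapping into the normal table.
            self.rename[dest_reg] = self.special.pop(dest_reg)
            return "bypassed"
        # No reserved register: behave like an ordinary load.
        self.rename[dest_reg] = f"f{next(self.free)}"
        self.memory_accesses += 1
        return "normal"


renamer = Renamer()
renamer.pref(["r10", "r11", "r12"])
results = [renamer.load(reg) for reg in ("r12", "r10", "r11")]
```

After the three loads the special table is empty again and no memory access has been counted, which matches the final build of the slide.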

slide-16
SLIDE 16

Index

Motivation

Memory Instruction Bypassing

Compiler Directed Memory Prefetcher

Comparison with APDP


slide-17
SLIDE 17

Decoupled Prefetcher

Compiler inserts prefetching operations

Instructions bring data closer to the processor

Memory instruction bypassing takes care of bringing data to the register file from L1

Compiler also instructs prefetching hardware to prefetch ahead (to L1)

The number of prefetches is minimised by compiler control

slide-18
SLIDE 18

Prefetching Mechanism

[Figure: a prefetch table whose entries hold "last @" (last address), "type" and "pctag", with the datapath built up in steps.]

Decode phase: the PC indexes the table and is matched against pctag; the entry's last @ is read

Address calc.: the stride is obtained from eff @ (the effective address) and last @

Generate new prefetch: the stride is multiplied by N and added to form the new address

N depends on type

The new prefetch waits for a free port
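The steps above can be sketched in software. This is a minimal model under assumptions, not the authors' hardware: it assumes a direct-mapped table indexed by `pc % entries`, a full-PC tag, a stride computed as eff @ minus last @, and a new prefetch address of eff @ + N × stride. The type names "pref" and "load" are borrowed from the lookahead legend on the results slide; the port arbitration ("wait for free port") is not modelled. `PrefetchTable` and `Entry` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    pctag: int       # tag identifying the instruction that owns the entry
    last_addr: int   # "last @": last effective address seen
    kind: str        # "type": selects the lookahead N


class PrefetchTable:
    def __init__(self, entries=32, lookahead=None):
        self.entries = entries
        # N per type; (pref 1, load 1) is one of the policies on the slides.
        self.lookahead = lookahead or {"pref": 1, "load": 1}
        self.table = {}

    def access(self, pc, eff_addr, kind):
        """Return the address to prefetch, or None if nothing is generated."""
        index = pc % self.entries            # assumed direct-mapped indexing
        entry = self.table.get(index)
        prefetch = None
        if entry is not None and entry.pctag == pc:
            stride = eff_addr - entry.last_addr      # assumed: eff @ - last @
            if stride != 0:
                prefetch = eff_addr + stride * self.lookahead[entry.kind]
            entry.last_addr = eff_addr
        else:
            # First sighting (or conflict): allocate the entry, no prefetch yet.
            self.table[index] = Entry(pc, eff_addr, kind)
        return prefetch


table = PrefetchTable(entries=32, lookahead={"pref": 1, "load": 2})
first = table.access(0x40, 1000, "load")    # allocates the entry
second = table.access(0x40, 1008, "load")   # stride 8, N = 2
```

With a stride of 8 and N = 2, the second access asks for address 1008 + 2 × 8 = 1024.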

slide-24
SLIDE 24

Performance Results

[Figure: bar chart of speedup (axis from 1.0 to 2.0) for applu, apsi, hydro2d, swim, tomcatv and the average. Configurations: O3 LoadStore; O3+SW pref LoadStore; O3+SW pref only Bypassing; O3+SW pref Proposal with 16, 32, 64 and infinite entries.]

Effect of number of streams in a 4-way machine

slide-25
SLIDE 25

Performance Results II

[Figure: bar chart of speedup (axis from 1.0 to 2.0) for applu, apsi, hydro2d, swim, tomcatv and the average. Configurations: O3 LoadStore; O3+SW pref LoadStore; O3+SW pref only Bypassing; O3+SW pref Proposal with N=(pref 1, load 1), N=(pref 1, load 2) and N=(pref 2, load 1).]

Effect of lookahead policy in a 4-way machine with 32 entries

slide-26
SLIDE 26

Index

Motivation

Memory Instruction Bypassing

Compiler Directed Memory Prefetcher

Comparison with APDP


slide-27
SLIDE 27

Comparison with APDP

APDP: Address Prediction for Data Prefetching

Predicts the addresses of memory operations

Makes the prediction available as soon as the load reaches the decode stage

Needs a recovery mechanism in case of misprediction

Dynamic mechanism

  • J. González and A. González, ICS 97

slide-28
SLIDE 28

Prefetching Mechanism

[Figure: the APDP table, indexed by PC, whose entries hold "address", "stride", "conf.", "val" and "v", with the datapath built up in steps.]

Decode phase: the PC selects a table entry

Is value correct? If yes, the value is bypassed to the register file

Address calc.: the new address is checked against the table entry

Update table: the entry is updated with the new address

Generate prefetch? An adder produces the new prediction from the new address
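For contrast with the compiler-directed table, a generic stride address predictor with a confidence counter can be sketched as follows. This is not the APDP implementation from the cited paper: the 2-bit saturating counter, the threshold of 2 and the update rule are assumptions chosen for illustration, and the "val" and "v" fields are left out. `StridePredictor` and `PredEntry` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PredEntry:
    address: int     # last effective address seen for this PC
    stride: int = 0
    conf: int = 0    # saturating counter in 0..3 (assumed width)


class StridePredictor:
    THRESHOLD = 2    # assumed confidence needed before predicting

    def __init__(self):
        self.table = {}

    def predict(self, pc):
        """Decode phase: return a predicted address, or None if not confident."""
        entry = self.table.get(pc)
        if entry is not None and entry.conf >= self.THRESHOLD:
            return entry.address + entry.stride
        return None

    def update(self, pc, actual):
        """Address calc.: check the prediction and update the table."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = PredEntry(actual)
            return False
        correct = (entry.address + entry.stride) == actual
        if correct:
            entry.conf = min(entry.conf + 1, 3)
        else:
            entry.conf = max(entry.conf - 1, 0)
            entry.stride = actual - entry.address
        entry.address = actual
        return correct


predictor = StridePredictor()
outcomes = [predictor.update(0x10, addr) for addr in (100, 108, 116, 124)]
next_prediction = predictor.predict(0x10)
```

A `False` returned by `update` after a confident prediction is the case that needs the recovery mechanism listed on the previous slide; the compiler-directed proposal avoids that case by construction.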

slide-35
SLIDE 35

Comparison Results

[Figure: bar chart of speedup (axis from 1.0 to 2.0) for applu, apsi, hydro2d, swim, tomcatv and the average. Configurations: LoadStore, APDP only prefetching, APDP (squash) and APDP (selective), each shown for O3 and for O3+SW pref.]

Effect of only prefetching in APDP (4-way and 2 ports)

slide-36
SLIDE 36

Comparison Results

[Figure: bar chart of speedup (axis from 0.9 to 1.2) for applu, apsi, hydro2d, swim, tomcatv and the average. Configurations: O3 LoadStore; O3 APDP (squash); O3 APDP (selective); O3 only Bypassing; O3 Proposal with 16 entries; O3 Proposal with 32 entries.]

4-way machine comparison against APDP

slide-37
SLIDE 37

Memory traffic

            applu   apsi   hydro2d   swim   tomcatv
LoadStore   1.00    1.00   1.00      1.00   1.00
APDP sel.   1.14    1.14   1.24      1.21   1.21
APDP sq.    1.14    1.15   1.24      1.21   1.22
Bypassing   0.90    0.74   0.82      0.72   0.89
Proposal    0.98    0.85   0.95      0.83   0.94
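The traffic table has no average column, so a per-scheme summary may help when reading it. The sketch below takes the values straight from the table and computes an arithmetic mean across the five benchmarks; the choice of arithmetic mean is an assumption, since the slides do not say how their chart averages were formed.

```python
from statistics import mean

BENCHMARKS = ["applu", "apsi", "hydro2d", "swim", "tomcatv"]

# Memory traffic, with the LoadStore baseline at 1.00, as in the table.
TRAFFIC = {
    "LoadStore": [1.00, 1.00, 1.00, 1.00, 1.00],
    "APDP sel.": [1.14, 1.14, 1.24, 1.21, 1.21],
    "APDP sq.":  [1.14, 1.15, 1.24, 1.21, 1.22],
    "Bypassing": [0.90, 0.74, 0.82, 0.72, 0.89],
    "Proposal":  [0.98, 0.85, 0.95, 0.83, 0.94],
}


def average_traffic(table):
    """Arithmetic mean of the traffic per scheme, rounded to 3 decimals."""
    return {scheme: round(mean(values), 3) for scheme, values in table.items()}


averages = average_traffic(TRAFFIC)
for scheme, value in averages.items():
    print(f"{scheme:<10} {value:.3f}")
```

On these numbers both APDP variants average about 19% more traffic than LoadStore (1.188 and 1.192), while Bypassing and the Proposal average below the baseline (0.814 and 0.910).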

slide-38
SLIDE 38

Any questions?


slide-39
SLIDE 39

Thank you

dortega@ac.upc.es
