SLIDE 1

Improving Data Access Efficiency by Using Context-Aware Loads and Stores

Alen Bardizbanyan, Magnus Själander†, David Whalley‡, Per Larsson-Edefors

Chalmers University of Technology

†Uppsala University ‡Florida State University

SLIDE 2

Conventional L1 DC Access

[Pipeline diagram: ADDR-GEN (AGU adds base address + offset) → SRAM-ACCESS (DTLB, tag arrays TAG-0…TAG-N, and data arrays DATA-0…DATA-N all read in parallel; tag compares drive way select) → DATA-FORMATTING (format data, forward to the execution units and register file, writeback).]
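The conventional parallel lookup can be sketched in C. This is an illustrative model only, not the paper's implementation; the struct layout, sizes, and function names are invented for the example:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define LINE_WORDS 8

/* One L1 DC set: a valid bit, tag, and data line per way (illustrative). */
typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint32_t data[WAYS][LINE_WORDS];
} cache_set_t;

/* Conventional access: every tag array AND every data array is read in
 * parallel; the tag compares then select which already-read way to
 * forward.  Energy is spent on WAYS data reads although at most one
 * of them is used. */
bool conventional_load(const cache_set_t *set, uint32_t tag,
                       unsigned word, uint32_t *out)
{
    uint32_t read_out[WAYS];
    for (int w = 0; w < WAYS; w++)      /* parallel data-array reads */
        read_out[w] = set->data[w][word];
    for (int w = 0; w < WAYS; w++)      /* parallel tag compares */
        if (set->valid[w] && set->tag[w] == tag) {
            *out = read_out[w];         /* way select */
            return true;
        }
    return false;                       /* miss */
}
```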

SLIDE 3

Energy Usage of a 4-way L1 Data Cache

[Diagram: 4-way L1 DC access. The address splits into a virtual page number (VPN), set index, and line offset; the DTLB translates the VPN to a physical page number (PPN), which is compared against the four tag arrays while all four data arrays are read; way-select logic produces Data Out.]

[Chart: contribution to overall L1 load access energy — 60% / 30% / 10%.]

SLIDE 4

Energy Usage of a 4-way L1 Data Cache


60% of the energy is due to reading the data memories in parallel

SLIDE 5

Energy Breakdown of an Embedded Processor

  • L1 DC: 21.7%
  • L1 IC: 34.8%
  • Pipeline: 31.4%
  • Clock: 12.1%

SLIDE 6

Phased L1 DC Access

[Pipeline diagram: phased access — ADDR-GEN → DTLB/TAG-ACCESS (DTLB and tag arrays only) → DATA-ACCESS (only the matching data way is read) → DATA-FORMATTING.]
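In C, the phased scheme might look like the following sketch (invented types again; the real design operates on SRAM arrays, not structs):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define LINE_WORDS 8

typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint32_t data[WAYS][LINE_WORDS];
} cache_set_t;

/* Phase 1: DTLB + tag arrays only -- no data array is touched yet. */
static int tag_phase(const cache_set_t *set, uint32_t tag)
{
    for (int w = 0; w < WAYS; w++)
        if (set->valid[w] && set->tag[w] == tag)
            return w;
    return -1;
}

/* Phase 2: read exactly one data array -- the hit way.  This is where
 * the energy saving over the parallel access comes from, at the cost
 * of the extra cycle behind the ~8% average slowdown shown next. */
bool phased_load(const cache_set_t *set, uint32_t tag,
                 unsigned word, uint32_t *out)
{
    int way = tag_phase(set, tag);
    if (way < 0)
        return false;
    *out = set->data[way][word];
    return true;
}
```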

SLIDE 7

Phased L1 DC Performance Overhead

[Chart: normalized execution time (1.00–1.16) for each MiBench benchmark (adpcm, basicmath, bitcount, blowfish, crc, dijkstra, fft, gsm, ispell, jpeg, lame, patricia, pgp, qsort, rijndael, rsynth, sha, stringsearch, susan, tiff) and the average.]

Average performance overhead of 8%.

SLIDE 8

Context Aware Loads — Case0

[Pipeline diagram: conventional access — ADDR-GEN → SRAM-ACCESS (DTLB, tag, and data arrays in parallel) → DATA-FORMATTING.]

r[2] = M[r[4]+76]
r[3] = r[3]+r[2]


SLIDE 11

Context Aware Loads — Case1

r[2] = M[r[4]]
r[3] = r[3]+r[2]

[Pipeline diagram: conventional access — ADDR-GEN → SRAM-ACCESS (DTLB, tag, and data arrays in parallel) → DATA-FORMATTING.]


SLIDE 13

Context Aware Loads — Case1

r[2] = M[r[4]]
r[3] = r[3]+r[2]

[Pipeline diagram: no ADDR-GEN stage — the DTLB and tag arrays are accessed directly, then DATA-ACCESS of the selected way and DATA-FORMATTING.]

SLIDE 14

Context Aware Loads — Case2

[Pipeline diagram: no ADDR-GEN stage — the DTLB and tag arrays are accessed directly, then DATA-ACCESS of the selected way and DATA-FORMATTING.]

r[2] = M[r[4]+4]
r[3] = r[3]+r[2]


SLIDE 16

Context Aware Loads — Case2

[Pipeline diagram: ADDR-GEN (AGU: base address + offset) overlapped with DTLB/tag access, then DATA-ACCESS of the single selected way and DATA-FORMATTING.]

r[2] = M[r[4]+4]
r[3] = r[3]+r[2]

SLIDE 17

Context Aware Loads — Case3

[Pipeline diagram: ADDR-GEN (AGU: base address + offset) overlapped with DTLB/tag access, then DATA-ACCESS of the single selected way and DATA-FORMATTING.]

r[2] = M[r[4]+4]
<3 or more insts>
r[3] = r[3]+r[2]


SLIDE 19

Context Aware Loads — Case3

[Pipeline diagram: ADDR-GEN → DTLB/TAG-ACCESS → DATA-ACCESS of the selected way → DATA-FORMATTING.]

r[2] = M[r[4]+4]
<3 or more insts>
r[3] = r[3]+r[2]

SLIDE 20

Context Aware Loads — Cases 0-3

[Four pipeline diagrams, one per case.]

Case0: Normal Access
  r[2] = M[r[4]+76]
  r[3] = r[3]+r[2]

Case1: Avoids Stalls
  r[2] = M[r[4]]
  r[3] = r[3]+r[2]

Case2: 1x Data Array Access
  r[2] = M[r[4]+4]
  r[3] = r[3]+r[2]

Case3: 1x Data Array Access, No Tag Speculation
  r[2] = M[r[4]+4]
  <3 or more insts>
  r[3] = r[3]+r[2]
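The four cases amount to a compile-time decision per load. The thresholds below are illustrative guesses (zero offset for Case1, an offset small enough to speculate on for Case2, at least three independent instructions before the first use for Case3), not the paper's exact criteria:

```c
typedef enum {
    CASE0_NORMAL,       /* plain parallel access                        */
    CASE1_NO_OFFSET,    /* zero offset: avoids stalls                   */
    CASE2_SPECULATIVE,  /* small offset: 1x data array access           */
    CASE3_PHASED        /* use is far away: phased, no tag speculation  */
} load_case_t;

/* offset: the load's immediate displacement;
 * insts_until_use: independent instructions before the value is used. */
load_case_t classify_load(int offset, int insts_until_use)
{
    if (insts_until_use >= 3)
        return CASE3_PHASED;           /* extra cycles fully hidden */
    if (offset == 0)
        return CASE1_NO_OFFSET;        /* address = base register   */
    if (offset > -32 && offset < 32)   /* illustrative bound        */
        return CASE2_SPECULATIVE;
    return CASE0_NORMAL;
}
```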

SLIDE 21

Context Aware Loads — Pipeline

[Pipeline diagram: ADDR-GEN → DTLB/TAG-ACCESS → DATA-ACCESS → DATA-FORMATTING, shown against the execution units, register file, and forwarding paths.]

SLIDE 22

Strided Accesses

L3: r[2]=M[r[4]];
    ...
    r[4]=r[4]+4;
    PC=r[4]!=r[5],L3;

r[22]=M[r[sp]+100];
r[21]=M[r[sp]+96];
r[20]=M[r[sp]+92];
...

SLIDE 23

Strided Accesses — Strided Access Structure

L3: r[2]=M[r[4]];
    ...
    r[4]=r[4]+4;
    PC=r[4]!=r[5],L3;

r[22]=M[r[sp]+100];
r[21]=M[r[sp]+96];
r[20]=M[r[sp]+92];
...

[Diagram: Strided Access Structure — each entry holds a valid bit (V), the L1 DC way, the L1 DC tag, a word index, data-valid (DV) bits, the PPN, an offset, and the strided data (SD).]
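A minimal stride detector in C, assuming one SAS entry per load instruction; the entry layout and the confirm-after-two-matches policy are assumptions made for this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* One Strided Access Structure entry: remembers the last address this
 * load touched and the stride between its last two accesses. */
typedef struct {
    uint32_t last_addr;
    int32_t  stride;
    bool     confirmed;   /* same stride observed twice in a row */
} sas_entry_t;

/* Feed each address the load generates; returns true once the access
 * matches the predicted stride, i.e. the SAS could have supplied the
 * way/translation (and possibly the data) without touching the L1 DC. */
bool sas_observe(sas_entry_t *e, uint32_t addr)
{
    int32_t s   = (int32_t)(addr - e->last_addr);
    bool    hit = e->confirmed && s == e->stride;
    e->confirmed = (s == e->stride);
    e->stride    = s;
    e->last_addr = addr;
    return hit;
}
```

Negative strides (like the r[22]/r[21]/r[20] stack loads above, stride −4) are covered too, since the stride is signed.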

SLIDE 24

Context Aware Loads — Case4

r[2] = M[r[4]]
r[3] = r[3]+r[2]

[Pipeline diagram: the SAS-held tag and index (T+I) replace the DTLB lookup and tag checks; a single data way is read, then formatted and forwarded.]

SLIDE 25

Context Aware Loads — Case5

r[2] = M[r[4]+4]
r[3] = r[3]+r[2]

[Pipeline diagram: ADDR-GEN as usual, but the SAS-held tag and index (T+I) replace the DTLB lookup and tag checks; a single data way is read.]

SLIDE 26

Context Aware Loads — Case6

r[2] = M[r[4]]
r[3] = r[3]+r[2]

[Pipeline diagram: the SAS supplies the tag, index, and the strided data itself (T+I+SD) — no DTLB, tag-array, or data-array access.]

SLIDE 27

Context Aware Loads — Case7

r[2] = M[r[4]+4]
r[3] = r[3]+r[2]

[Pipeline diagram: ADDR-GEN as usual; the SAS-held tag and index (T+I) plus the strided data (SD) eliminate the DTLB, tag-array, and data-array accesses.]

SLIDE 28

Context Aware Loads — Cases 4-7

[Four pipeline diagrams, one per case.]

Case4: Avoid 1 Stall; No DTLB Access; No Tag Checks; 1x Data Array Access
  r[2] = M[r[4]]
  r[3] = r[3]+r[2]

Case5: No DTLB Access; No Tag Checks; 1x Data Array Access
  r[2] = M[r[4]+4]
  r[3] = r[3]+r[2]

Case6: Avoid 2 Stalls; No DTLB Access; No Tag Checks; No Data Array Access
  r[2] = M[r[4]]
  r[3] = r[3]+r[2]

Case7: Avoid 1 Stall; No DTLB Access; No Tag Checks; No Data Array Access
  r[2] = M[r[4]+4]
  r[3] = r[3]+r[2]
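Putting Cases 4–7 together, the SAS-assisted load path might be dispatched as in this sketch (hypothetical structure; `arrays_read` is just an energy proxy counting activated data ways):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define LINE_WORDS 8

/* What the SAS can tell us about a predicted strided access. */
typedef struct {
    bool     ti_valid;   /* tag+index (T+I) known: way is identified */
    int      way;
    bool     sd_valid;   /* the strided data word itself is cached   */
    uint32_t sd;
} sas_info_t;

/* Cases 6/7: the SAS serves the data outright -- zero array reads.
 * Cases 4/5: the SAS supplies the way -- one data-array read, no DTLB
 * access, no tag checks.  Otherwise fall back to a normal access. */
bool sas_serve_load(const sas_info_t *h,
                    uint32_t data[WAYS][LINE_WORDS],
                    unsigned word, uint32_t *out, int *arrays_read)
{
    if (h->ti_valid && h->sd_valid) {
        *arrays_read = 0;            /* data straight from the SAS */
        *out = h->sd;
        return true;
    }
    if (h->ti_valid) {
        *arrays_read = 1;            /* read only the known way */
        *out = data[h->way][word];
        return true;
    }
    *arrays_read = WAYS;             /* conventional parallel access */
    return false;
}
```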

SLIDE 29

Context Aware Loads — Pipeline

[Pipeline diagram: ADDR-GEN → DTLB/TAG-ACCESS → DATA-ACCESS → DATA-FORMATTING, now annotated with the SAS-provided T+I and SD bypass paths.]

SLIDE 30

Instruction Format

(a) MIPS instruction I format:
  opcode (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

(b) Enhanced load and store instruction format:
  opcode (6 bits) | rs (5 bits) | rt (5 bits) | immediate (14−n bits) | meminfo (2+n bits)
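A bit-level sketch of the enhanced encoding in (b), choosing n = 1 purely for illustration (13 immediate bits, 3 meminfo bits). Whether meminfo sits above or below the shortened immediate is an assumption here:

```c
#include <stdint.h>

/* n = 1 for illustration: the immediate shrinks from 16 to 13 bits and
 * 3 meminfo bits describe the access context (case information). */
#define N            1
#define IMM_BITS     (14 - N)   /* 13 */
#define MEMINFO_BITS (2 + N)    /*  3 */

/* opcode(6) | rs(5) | rt(5) | immediate(14-n) | meminfo(2+n) */
uint32_t encode_loadstore(uint32_t opcode, uint32_t rs, uint32_t rt,
                          uint32_t imm, uint32_t meminfo)
{
    return (opcode << 26) | (rs << 21) | (rt << 16)
         | ((imm & ((1u << IMM_BITS) - 1)) << MEMINFO_BITS)
         | (meminfo & ((1u << MEMINFO_BITS) - 1));
}

uint32_t decode_meminfo(uint32_t insn)
{
    return insn & ((1u << MEMINFO_BITS) - 1);
}

uint32_t decode_imm(uint32_t insn)
{
    return (insn >> MEMINFO_BITS) & ((1u << IMM_BITS) - 1);
}
```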

SLIDE 31

Evaluation Framework

  • VPO compiler
  • MiBench
  • SimpleScalar
  • L1 DC: 16KiB, 4-way, 32-byte line
  • DTLB: 16 entries, fully associative
  • Energy: P&R netlist in 65-nm technology
SLIDE 32

SAS Entries

[Chart: relative data access energy (L1 DC, DTLB, and SAS components, 0.45–0.60) for 1, 3, and 7 SAS entries.]

SLIDE 33

Classification of Loads and Stores

[Chart: fraction of all loads or stores falling into each memory access classification — C0–C7 and UL for loads; S0–S3 and US for stores.]

SLIDE 34

Per Case

[Chart: data access energy savings (5–25%) per classification C2–C7, S2, S3, split into L1 DC and DTLB contributions.]

SLIDE 35

Per Case

[Chart: data access energy savings (5–25%) per classification C2–C7, S2, S3, split into L1 DC and DTLB contributions.]

[Chart: execution time savings (0–3.5%) for classifications C1, C4, C6, and C7.]

SLIDE 36

Energy Improvements

[Chart: normalized data access energy (L1 DC, DTLB, SAS) for each MiBench benchmark and the average.]

SLIDE 37

Energy Improvements


On average 43% energy usage reduction

SLIDE 38

Execution Time Improvements

[Chart: normalized execution time (0.75–1.0) for each MiBench benchmark and the average.]

SLIDE 39

Execution Time Improvements


On average 6% execution time improvement

SLIDE 40

Summary

  • Conventional associative L1 caches are power-hungry
  • Context-aware data accesses reduce L1 data cache power
  • Speculative and early tag access improves performance
SLIDE 41

Summary

  • 43% reduction in L1 data cache and DTLB energy usage
  • 6% performance improvement