SLIDE 1

LEVERAGING CACHES TO ACCELERATE HASH TABLES AND MEMOIZATION

GUOWEI ZHANG AND DANIEL SANCHEZ

MICRO 2019

SLIDE 2

Executive summary

• Hash tables suffer from poor core utilization & poor spatial locality
• HTA accelerates hash tables with simple ISA & HW changes
  ◦ Adopts HTA table format that leverages cache characteristics
  ◦ Leaves rare cases to software
• HTA accelerates hash-table-intensive applications by up to 2x
• HTA-based memoization improves performance significantly

[Figure: two cores, each with L1I, L1D, and L2 caches, and an LLC. Flat-HTA reduces runtime overheads (poor core utilization); Hierarchical-HTA improves spatial locality (poor spatial locality).]

SLIDE 3

Hash table performance is critical

Application domains: database, key-value store, networking, genomics, memoization

Hash table interface:

    found, value = hashtable.lookup(key);
    hashtable.insert(key, value);
    hashtable.delete(key);

• Hash table performance is critical for memoization
  ◦ Uses hash tables to skip repetitive computation
  ◦ Beneficial only if hash table lookups are cheaper than memoized code

[Figure: a hash table mapping keys to values.]
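The "beneficial only if" condition on this slide can be made concrete with a back-of-the-envelope cost model. The sketch below is an illustration and does not come from the slides: the function names, the parameters, and all cycle counts are assumptions, and the model ignores cache effects and overlap between operations.

```python
def memoized_cost(t_compute, t_lookup, t_insert, hit_rate):
    """Expected cost per call of a memoized region (hypothetical cost model).

    Every call pays for a lookup. A miss additionally pays for the original
    computation and for inserting the result into the table.
    """
    miss_rate = 1.0 - hit_rate
    return t_lookup + miss_rate * (t_compute + t_insert)


def memoization_pays_off(t_compute, t_lookup, t_insert, hit_rate):
    """True when the memoized region is cheaper than always recomputing."""
    return memoized_cost(t_compute, t_lookup, t_insert, hit_rate) < t_compute


# Illustrative cycle counts only; none of these numbers come from the slides.
# Expensive table operations: 60 + 0.5 * (100 + 60) = 140 > 100, so no benefit.
print(memoization_pays_off(t_compute=100, t_lookup=60, t_insert=60, hit_rate=0.5))
# Cheap table operations: 10 + 0.5 * (100 + 10) = 65 < 100, so memoization helps.
print(memoization_pays_off(t_compute=100, t_lookup=10, t_insert=10, hit_rate=0.5))
```

Under this model, lowering the lookup and insert costs is what makes short code regions worth memoizing, which is the motivation the slide gives for faster hash tables.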

SLIDE 4

Issue 1: Poor core utilization

Data-dependent branches

  • High misprediction rate
  • High penalty

Poor use of core backend

  • Frequent misses
  • Hard-to-overlap due to too many µops

[Figure: normalized cycles for Baseline and Flat-HTA, broken down into backend stalls, wrong path execution, and other.]

Flat-HTA reduces runtime overheads!

SLIDE 7

Issue 2: Poor spatial locality

[Figure: contents of the LLC and the L2. Conventional system: the pairs (k0, v0) to (k3, v3) are spread across Line 0, Line 16, and Line 32 in both the LLC and the L2, which wastes cache capacity. Hierarchical-HTA: the L2 holds (k0, v0) to (k3, v3) together in Line 0.]

Hierarchical-HTA improves spatial locality!
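The capacity argument in the figure can be illustrated by counting cache lines. This is a rough sketch under assumptions that are not stated on the slide: the helper names are hypothetical, the mapping of pairs to home lines is invented for the example, and the number of pairs per line is a parameter rather than a property of the real design.

```python
import math


def lines_needed_conventional(home_lines):
    """Distinct cache lines that must be cached to hold every hot pair
    when each pair is cached together with the rest of its home line."""
    return len(set(home_lines))


def lines_needed_packed(num_hot_pairs, pairs_per_line):
    """Cache lines needed when hot pairs are packed next to each other."""
    return math.ceil(num_hot_pairs / pairs_per_line)


# Assumed placement of four hot pairs over lines 0, 16, and 32.
home_lines = [0, 16, 16, 32]
print(lines_needed_conventional(home_lines))           # 3 lines
print(lines_needed_packed(len(home_lines), 4))         # 1 line
```

With these assumed numbers the smaller cache spends three lines on four hot pairs in the conventional case and one line when the pairs are packed.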

SLIDE 8

Prior hardware acceleration underused caches

• Domain-specific management [Costa 2000, Choi 2008, Chalamalasetti 2013, Lim 2013, Gope 2017…]
  ◦ E.g., PHP processing, distributed key-value store, memoization
  ◦ Requires dedicated on-chip storage (e.g., 98KB [Costa et al. 2000])
  ◦ Or bypasses memory hierarchy [Lloyd 2017, Tanaka 2014, Xu 2016…]

HTA is general
HTA avoids dedicated on-chip storage
HTA exploits memory hierarchy for spatial locality

SLIDE 9

HTA: Hash Table Acceleration

SLIDE 10

HTA overview

1. Table format
2. ISA extensions
3. Hardware implementation
   • Flat-HTA: reduces runtime overheads
   • Hierarchical-HTA: improves spatial locality

Make the common case fast!

[Figure: two cores with L1I, L1D, and L2 caches and an LLC. Flat-HTA: the core pipeline (Fetch, Decode, Issue, Execute, Commit, Mem) with address calculation and line comparison accelerated by the HTA function unit. Hierarchical-HTA: pairs (k0, v0) to (k3, v3) shown in Line 0 and Line 16 of the LLC and in Line 0 of the L2. Other labels: Overflow, Key.]

SLIDE 11

HTA table format

Conventional table

  • Variable number of probes
  • Introduces hard-to-predict branches
  • Minimizes work

    while (key != curSlot.key) {
      // Probe next slot
    }

HTA table

  • Small, fixed number of probes
  • Overflows are handled by software path
  • Avoids hard-to-predict branches
  • Enables hardware acceleration

[Figure: a 128-bit key held in Reg0 and Reg1 is hashed by H to M bits that select one of 2^M cache lines in memory. Each cache line holds Key 0 (128b), Value 0 (64b), Key 1 (128b), Value 1 (64b), and 128b unused.]
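The line layout in the figure (128b key, 64b value, 128b key, 64b value, 128b unused) adds up to 512 bits, that is, one 64-byte cache line. The sketch below packs such a line and computes a line address. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, BLAKE2b is only a stand-in for the unspecified hash H, and it assumes the figure's "M" is the number of index bits selecting one of 2^M lines.

```python
import hashlib
import struct

LINE_BYTES = 64
# Key 0 (16 bytes), Value 0 (8 bytes), Key 1 (16 bytes), Value 1 (8 bytes),
# then 16 unused padding bytes. "<" disables alignment padding.
LINE_FORMAT = "<16sQ16sQ16x"
assert struct.calcsize(LINE_FORMAT) == LINE_BYTES


def line_index(key: bytes, m_bits: int) -> int:
    """Hash a 128-bit key to an M-bit line index (BLAKE2b stands in for H)."""
    if len(key) != 16:
        raise ValueError("key must be 128 bits (16 bytes)")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") & ((1 << m_bits) - 1)


def line_address(base: int, key: bytes, m_bits: int) -> int:
    """Address of the single cache line that may hold the key."""
    return base + line_index(key, m_bits) * LINE_BYTES


def pack_line(key0: bytes, value0: int, key1: bytes, value1: int) -> bytes:
    """Serialize two key-value pairs into one 64-byte line."""
    return struct.pack(LINE_FORMAT, key0, value0, key1, value1)


def unpack_line(line: bytes):
    """Return the two (key, value) pairs stored in a 64-byte line."""
    key0, value0, key1, value1 = struct.unpack(LINE_FORMAT, line)
    return [(key0, value0), (key1, value1)]
```

Because every key maps to exactly one line, a lookup touches a fixed amount of data, which is what the "small, fixed number of probes" bullet describes.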

SLIDE 12

HTA ISA extensions

Single-threaded lookup

    lookup:
      hta_lookup <table_id>, <key_reg>, <value_reg>, done
      call swLookup            # Accesses software hash table
    done:
      …

Branch semantics

  • Easy to predict
  • Exploits existing predictors

    if (key is found) or (line is not full):
      taken        # done
    else:
      not taken    # call swLookup

Single-threaded insert

    insert:
      hta_swap <table_id>, <key_reg>, <value_reg>, done
      call swHandleInsert      # Accesses software hash table
    done:
      …

Multi-threaded insert

    insert:
      hta_update <table_id>, <key_reg>, <value_reg>, done
      call swLockLine
      hta_swap <table_id>, <key_reg>, <value_reg>, release
      call swHandleInsert
    release:
      call swUnlockLine
    done:
      …

  • We prototype a CISC (x86) implementation
  • RISC is also possible
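The branch semantics of `hta_lookup` can be mimicked in software to see why the fallback to `swLookup` is rare. The model below is a toy sketch, not the HTA hardware: the class `HtaModel`, the two-slot line size, and the use of Python's built-in `hash` are assumptions. In particular, the `insert` method is a simplification that sends a pair to the software table when its line is full; it does not model what `hta_swap` or `hta_update` actually do, which the slide does not spell out.

```python
SLOTS_PER_LINE = 2  # assumed: two key-value slots per line


class HtaModel:
    """Toy software model of an HTA-style table with a software overflow path."""

    def __init__(self, num_lines):
        self.lines = [[] for _ in range(num_lines)]  # each line: list of (key, value)
        self.overflow = {}  # stands in for the software hash table

    def _line(self, key):
        return self.lines[hash(key) % len(self.lines)]

    def hta_lookup(self, key):
        """Return (taken, found, value), following the slide's branch semantics."""
        line = self._line(key)
        for k, v in line:
            if k == key:
                return True, True, v       # key is found: taken
        if len(line) < SLOTS_PER_LINE:
            return True, False, None       # line is not full: taken, key is absent
        return False, False, None          # full line without a match: not taken

    def lookup(self, key):
        """Fast path first; fall back to the software table only when not taken."""
        taken, found, value = self.hta_lookup(key)
        if taken:
            return found, value
        if key in self.overflow:           # plays the role of swLookup
            return True, self.overflow[key]
        return False, None

    def insert(self, key, value):
        """ASSUMED insert policy: update in place, else use a free slot,
        else hand the pair to the software table."""
        line = self._line(key)
        for i, (k, _) in enumerate(line):
            if k == key:
                line[i] = (key, value)
                return
        if len(line) < SLOTS_PER_LINE:
            line.append((key, value))
            return
        self.overflow[key] = value
```

In this model a key can only live in the software table if its line is full, so a non-full line without a match proves the key is absent and the branch can be taken without consulting software.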

SLIDE 13

Flat-HTA implementation

[Figure: core pipeline (Fetch, Decode, Issue, Execute, Commit, Mem) with an added HTA function unit.]

HTA function unit

  • Address calculation: key → lineAddr
  • Line comparison: lineValue → outcome
  • Area: 0.055% of core

SLIDE 14

Hierarchical-HTA overview

[Figure: key-value pairs stored in cache lines of the L2 and the LLC. Legend: frequently-accessed pair, infrequently-accessed pair, empty slot, cache line.]

SLIDE 17

Check out paper for more

• Hierarchical-HTA implementation
  ◦ Maintains coherence conservatively
  ◦ Handles overflows conservatively
• Details on ISA and Flat-HTA implementation

SLIDE 18

Methodology

• Simulation with zsim
• System
  ◦ 1 to 16 cores
  ◦ 2MB LLC per core
• Schemes
  ◦ Baseline: best of
    ▪ Google dense_hash_map
    ▪ C++11 unordered_map
  ◦ HTA-SW
    ▪ w/ HTA table format
    ▪ w/o HTA function unit
  ◦ Flat-HTA
  ◦ Hierarchical-HTA
• Applications
  ◦ bfcounter (bioinformatics)
  ◦ lzw (data compression)
  ◦ hashjoin (database)
  ◦ ycsb-read (key-value store)
  ◦ ycsb-write (key-value store)

[Figure: cores, each with L1I, L1D, and L2 caches, connected to a shared LLC.]

SLIDE 19

Flat-HTA speedups

[Figure: speedup of Baseline, HTA-SW (software-only), and Flat-HTA on bfcounter, lzw, hashjoin, ycsb-read, and ycsb-write.]

SLIDE 22

Flat-HTA cycles breakdown

[Figure: normalized cycles for Baseline (B), HTA-SW (S, software-only), and Flat-HTA (F) on bfcounter, lzw, hashjoin, ycsb-read, and ycsb-write, broken down into backend stall, wrong path execution, and others.]

SLIDE 23

Flat-HTA on multithreaded applications

[Figure: speedup versus number of cores (1 to 16) for Flat-HTA and Baseline on ycsb-read and ycsb-write.]

SLIDE 24

HTA on memoization

• Example

    memo_exp:
      hta_lookup <table id>, <key reg>, <value reg>, done
      call exp
      hta_swap <table id>, <key reg>, <value reg>, done
    done:
      …

• Schemes
  ◦ Baseline (no memoization)
  ◦ Software memoization
  ◦ HTA memoization
• Applications selected from
  ◦ SPECCPU2006
  ◦ SPECOMP2001
  ◦ SPECOMP2012
  ◦ PARSEC
  ◦ SPLASH2
  ◦ BioParallel

SLIDE 25

Flat-HTA speedups on memoization

[Figure: speedup of Baseline, Software Memoization, and HTA Memoization on bwaves, bschols, equake, water, nab, and semphy.]

SLIDE 27

Conclusion

• HTA accelerates hash tables and memoization
  ◦ Adopts a new hash table format
  ◦ Accelerates common cases in HW; leaves rare cases to SW
• Flat-HTA reduces runtime overheads significantly
  ◦ Requires minor (0.055% area) changes to cores
• Hierarchical-HTA improves spatial locality
  ◦ Needs changes to cores and cache controllers
• HTA improves hash-table-intensive applications by up to 2x
• HTA enables memoization of small code regions

SLIDE 28

THANKS FOR YOUR ATTENTION! QUESTIONS ARE WELCOME!