Leveraging Caches to Accelerate Hash Tables and Memoization
Guowei Zhang and Daniel Sanchez
MICRO 2019
Executive summary
- Hash tables suffer from poor core utilization and poor spatial locality
- HTA accelerates hash tables with simple ISA and hardware changes
  - Adopts the HTA table format, which leverages cache characteristics
  - Leaves rare cases to software
- HTA accelerates hash-table-intensive applications by up to 2x
- HTA-based memoization improves performance significantly
[Figure: multicore system; each core has private L1I, L1D, and L2 caches, and all cores share the LLC]
- Flat-HTA: reduces runtime overheads (targets poor core utilization)
- Hierarchical-HTA: improves spatial locality (targets poor spatial locality)
Hash table performance is critical
- Hash tables are used in databases, key-value stores, networking, genomics, and memoization
- Typical interface:
  found, value = hashtable.lookup(key);
  hashtable.insert(key, value);
  hashtable.delete(key);
- Hash table performance is critical for memoization
  - Memoization uses hash tables to skip repetitive computation
  - Beneficial only if hash table lookups are cheaper than the memoized code
[Figure: a hash table mapping keys to values]
Issue 1: Poor core utilization
- Data-dependent branches
  - High misprediction rate
  - High penalty
- Poor use of core backend
  - Frequent misses
  - Hard to overlap due to too many µops
[Figure: normalized cycles for Baseline and Flat-HTA, broken down into backend stalls, wrong-path execution, and other]
Flat-HTA reduces runtime overheads!
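To make the data-dependent branch concrete, here is a minimal Python sketch of a conventional open-addressing lookup with linear probing. The table size, the keys, `toy_hash`, and `probe_lookup` are assumptions chosen for illustration and are not code from the talk. The number of loop iterations depends on the key and on what else is in the table, which is why the loop's exit branch is hard to predict.

```python
EMPTY = None  # marker for an unused slot


def toy_hash(key, num_slots):
    # Assumption: a trivial modulo hash, used only to make collisions obvious.
    return key % num_slots


def insert(table, key, value):
    n = len(table)
    i = toy_hash(key, n)
    for _ in range(n):
        if table[i] is EMPTY or table[i][0] == key:
            table[i] = (key, value)
            return
        i = (i + 1) % n  # slot taken by another key: probe the next slot
    raise RuntimeError("table is full")


def probe_lookup(table, key):
    """Return (found, value, probes); probes varies with the data."""
    n = len(table)
    i = toy_hash(key, n)
    probes = 0
    while probes < n:
        probes += 1
        slot = table[i]
        if slot is EMPTY:       # data-dependent exit: key is absent
            return False, None, probes
        if slot[0] == key:      # data-dependent exit: key is present
            return True, slot[1], probes
        i = (i + 1) % n
    return False, None, probes


# Keys 0, 8, and 16 all map to slot 0 of an 8-slot table.
table = [EMPTY] * 8
for k, v in [(0, "a"), (8, "b"), (16, "c")]:
    insert(table, k, v)
```

In this toy table, looking up key 0 takes one probe while key 16 takes three, so the same loop exits after a different number of iterations on each call.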
Issue 2: Poor spatial locality
- Conventional system: pairs (k0, v0) to (k3, v3) are spread across cache lines 0, 16, and 32, and those lines are held in both the L2 and the LLC
  - Wastes cache capacity
- Hierarchical-HTA: the L2 holds the same pairs in a single line (Line 0 in the figure)
Improves spatial locality!
Prior hardware acceleration underused caches
- Domain-specific management [Costa 2000, Choi 2008, Chalamalasetti 2013, Lim 2013, Gope 2017…]
  - E.g., PHP processing, distributed key-value store, memoization
  - Requires dedicated on-chip storage (e.g., 98KB [Costa et al. 2000])
  - Or bypasses memory hierarchy [Lloyd 2017, Tanaka 2014, Xu 2016…]
- HTA is general
- HTA avoids dedicated on-chip storage
- HTA exploits memory hierarchy for spatial locality
HTA: Hash Table Acceleration
HTA overview
1. Table format
2. ISA extensions
3. Hardware implementation
   - Flat-HTA: reduces runtime overheads
   - Hierarchical-HTA: improves spatial locality
[Figure: cores with L1I, L1D, and L2 caches and a shared LLC; core pipeline (Fetch, Decode, Issue, Execute, Commit, Mem) with address calculation and line comparison accelerated by the HTA function unit; legend entries for key and overflow]
Make the common case fast!
HTA table format
Conventional table
- Variable number of probes
- Introduces hard-to-predict branches
- Minimizes work
  while (key != curSlot.key) {
    // Probe next slot
  }
[Figure: a 128-bit key held in two registers (Reg0, Reg1) is hashed by H to M bits, which select one of 2^M cache lines in memory. Each line holds Key 0 (128b), Value 0 (64b), Key 1 (128b), Value 1 (64b), and 128 unused bits.]
HTA table
- Small, fixed number of probes
- Overflows are handled by software path
- Avoids hard-to-predict branches
- Enables hardware acceleration
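The layout in the figure can be checked with a little arithmetic. The sketch below is a hypothetical Python model, not the authors' implementation: it assumes a 64-byte line (the field widths on the slide sum to 512 bits), uses key 0 as an empty-slot marker, and uses the low M bits of the key as a stand-in for the hash H, none of which the slides specify.

```python
KEY_BYTES = 16    # 128-bit key
VALUE_BYTES = 8   # 64-bit value
LINE_BYTES = 64   # assumed 512-bit cache line
PAIR_BYTES = KEY_BYTES + VALUE_BYTES
PAIRS_PER_LINE = LINE_BYTES // PAIR_BYTES                 # 2 pairs fit
UNUSED_BYTES = LINE_BYTES - PAIRS_PER_LINE * PAIR_BYTES   # 16 bytes = 128 bits
EMPTY_KEY = 0     # assumption: key 0 marks an empty slot


def line_index(key, m_bits):
    # Assumption: low M bits of the key stand in for the real hash H.
    return key & ((1 << m_bits) - 1)


def pack_line(pairs):
    """Pack up to PAIRS_PER_LINE (key, value) pairs into one 64-byte line."""
    assert len(pairs) <= PAIRS_PER_LINE
    buf = bytearray(LINE_BYTES)
    for slot, (key, value) in enumerate(pairs):
        off = slot * PAIR_BYTES
        buf[off:off + KEY_BYTES] = key.to_bytes(KEY_BYTES, "little")
        buf[off + KEY_BYTES:off + PAIR_BYTES] = value.to_bytes(VALUE_BYTES, "little")
    return bytes(buf)


def compare_line(line, key):
    """Check every slot of one line; returns (found, value, line_full).

    The loop always runs PAIRS_PER_LINE times, so the probe count is fixed.
    """
    found, value, used = False, None, 0
    for slot in range(PAIRS_PER_LINE):
        off = slot * PAIR_BYTES
        k = int.from_bytes(line[off:off + KEY_BYTES], "little")
        if k == EMPTY_KEY:
            continue
        used += 1
        if k == key:
            found = True
            value = int.from_bytes(line[off + KEY_BYTES:off + PAIR_BYTES], "little")
    return found, value, used == PAIRS_PER_LINE
```

Because a lookup touches exactly one line and compares a fixed number of slots, the only variable outcome is whether the line is full, which is the case the slide leaves to the software path.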
HTA ISA extensions
Single-threaded lookup
  lookup:
    hta_lookup <table_id>, <key_reg>, <value_reg>, done
    call swLookup          # Accesses software hash table
  done:
    …
Branch semantics
- Easy to predict
- Exploits existing predictors
  if (key is found) or (line is not full):
    taken        # done
  else:
    not taken    # call swLookup
Single-threaded insert
  insert:
    hta_swap <table_id>, <key_reg>, <value_reg>, done
    call swHandleInsert    # Accesses software hash table
  done:
    …
Multi-threaded insert
  insert:
    hta_update <table_id>, <key_reg>, <value_reg>, done
    call swLockLine
    hta_swap <table_id>, <key_reg>, <value_reg>, release
    call swHandleInsert
  release:
    call swUnlockLine
  done:
    …
- We prototype a CISC (x86) implementation
- RISC is also possible
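The branch semantics of `hta_lookup` can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the hardware behavior: `HtaModel`, the modulo hash, the two-pair line capacity, and the overflow policy in `insert` are all hypothetical. In particular, the slides do not say which pair `hta_swap` hands to software when a line is full, so this model simply sends the new pair to the software table.

```python
PAIRS_PER_LINE = 2  # assumption, matching the two-pair line on the format slide


class HtaModel:
    def __init__(self, num_lines):
        self.lines = [dict() for _ in range(num_lines)]  # HTA-format table
        self.sw_table = {}                               # software overflow table

    def _line(self, key):
        # Assumption: modulo hash as a stand-in for the real hash function.
        return self.lines[key % len(self.lines)]

    def hta_lookup(self, key):
        """Model of the branch: returns (taken, value)."""
        line = self._line(key)
        if key in line:
            return True, line[key]   # key found: taken
        if len(line) < PAIRS_PER_LINE:
            return True, None        # line not full: key cannot have overflowed
        return False, None           # full line, no match: not taken

    def lookup(self, key):
        taken, value = self.hta_lookup(key)
        if taken:
            return value
        return self.sw_table.get(key)  # plays the role of swLookup

    def insert(self, key, value):
        line = self._line(key)
        if key in line or len(line) < PAIRS_PER_LINE:
            line[key] = value            # common case, no software involved
        else:
            self.sw_table[key] = value   # assumed overflow policy
```

The software table is consulted only when a lookup misses in a full line, so as long as lines rarely fill up the branch is almost always taken and stays easy to predict.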
Flat-HTA implementation
[Figure: core pipeline (Fetch, Decode, Issue, Execute, Commit, Mem) with an added HTA function unit]
- Address calculation: key → lineAddr
- Line comparison: lineValue → outcome
- Area: 0.055% of core
Hierarchical-HTA overview
[Figure: contents of the L2 and the LLC under Hierarchical-HTA. Legend: frequently-accessed pair, infrequently-accessed pair, empty slot, cache line.]
Check out the paper for more
- Hierarchical-HTA implementation
  - Maintains coherence conservatively
  - Handles overflows conservatively
- Details on ISA and Flat-HTA implementation
Methodology
- Simulation with zsim
- System
  - 1 to 16 cores
  - 2MB LLC per core
- Schemes
  - Baseline: best of
    - Google dense_hash_map
    - C++11 unordered_map
  - HTA-SW
    - w/ HTA table format
    - w/o HTA function unit
  - Flat-HTA
  - Hierarchical-HTA
- Applications
  - bfcounter (bioinformatics)
  - lzw (data compression)
  - hashjoin (database)
  - ycsb-read (key-value store)
  - ycsb-write (key-value store)
[Figure: cores with private L1I, L1D, and L2 caches and a shared LLC]
Flat-HTA speedups
[Figure: speedup of Baseline, HTA-SW (software-only), and Flat-HTA on bfcounter, lzw, hashjoin, ycsb-read, and ycsb-write]
Flat-HTA cycles breakdown
[Figure: normalized cycles, split into backend stalls, wrong-path execution, and others, for Baseline (B), HTA-SW (S, software-only), and Flat-HTA (F) on bfcounter, lzw, hashjoin, ycsb-read, and ycsb-write]
Flat-HTA on multithreaded applications
[Figure: speedup versus number of cores (1 to 16) for Flat-HTA and Baseline on ycsb-read and ycsb-write]
HTA on memoization
- Schemes
  - Baseline (no memoization)
  - Software memoization
  - HTA memoization
- Applications selected from
  - SPECCPU2006
  - SPECOMP2001
  - SPECOMP2012
  - PARSEC
  - SPLASH2
  - BioParallel
- Example
  memo_exp:
    hta_lookup <table id>, <key reg>, <value reg>, done
    call exp
    hta_swap <table id>, <key reg>, <value reg>, done
  done:
    …
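The control flow of `memo_exp` can be mirrored at a high level in Python. This is an illustrative sketch only: `MemoExp` is a hypothetical wrapper, and a plain `dict` stands in for the HTA table, so none of the hardware behavior is modeled. A hit skips the call to `exp`; a miss computes the value and stores it for next time.

```python
import math


class MemoExp:
    def __init__(self):
        self.table = {}  # stand-in for the HTA table (assumption)
        self.calls = 0   # how many times the real exp actually ran

    def __call__(self, x):
        if x in self.table:       # lookup hit: jump straight to done
            return self.table[x]
        self.calls += 1
        value = math.exp(x)       # call exp
        self.table[x] = value     # store the new pair (hta_swap in the slide)
        return value


memo_exp = MemoExp()
first = memo_exp(1.0)   # miss: computes exp(1.0)
second = memo_exp(1.0)  # hit: served from the table
```

Memoization pays off here only when the lookup is cheaper than the call it replaces, which is why a short function such as `exp` needs a very cheap lookup to benefit.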
Flat-HTA speedups on memoization
[Figure: speedup of Baseline, Software Memoization, and HTA Memoization on bwaves, bschols, equake, water, nab, and semphy]
Conclusion
- HTA accelerates hash tables and memoization
  - Adopts a new hash table format
  - Accelerates common cases in HW; leaves rare cases to SW
- Flat-HTA reduces runtime overheads significantly
  - Requires minor (0.055% area) changes to cores
- Hierarchical-HTA improves spatial locality
  - Needs changes to cores and cache controllers
- HTA improves hash-table-intensive applications by up to 2x
- HTA enables memoization of small code regions