Leveraging Caches to Accelerate Hash Tables and Memoization
Guowei Zhang and Daniel Sanchez
MICRO 2019

Executive summary
• Hash tables suffer from poor core utilization and poor spatial locality
• HTA accelerates hash tables with simple ISA and hardware changes
  - Adopts the HTA table format, which leverages cache characteristics
  - Leaves rare cases to software
  - Flat-HTA reduces runtime overheads
  - Hierarchical-HTA improves spatial locality
• HTA accelerates hash-table-intensive applications by up to 2x
• HTA-based memoization improves performance significantly
[Figure: cores, each with L1I, L1D, and L2 caches, sharing an LLC]

Hash table performance is critical
[Figure: a hash table mapping keys to values, with its three operations]
    found, value = hashtable.lookup(key);
    hashtable.insert(key, value);
    hashtable.delete(key);
• Application domains: databases, key-value stores, networking, genomics, memoization
• Hash table performance is critical for memoization
  - Memoization uses hash tables to skip repetitive computation
  - It is beneficial only if hash table lookups are cheaper than the memoized code
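The break-even condition in the last bullet can be made concrete with a simple cost model. The sketch below is an illustration only: the function name, its parameters, and the model itself (every call pays for a lookup, and misses also pay for the computation and an insert) are assumptions, not something taken from the slides.

```python
def memoization_speedup(t_compute, t_lookup, t_insert, hit_rate):
    """Estimated speedup of a memoized call over always recomputing.

    All four parameters are hypothetical inputs to a simple cost model,
    not measurements from the slides.
    """
    # Every call pays for a lookup; misses also pay for the computation
    # and for inserting the new result.
    t_memoized = t_lookup + (1.0 - hit_rate) * (t_compute + t_insert)
    return t_compute / t_memoized


# Expensive region, cheap lookups: memoization pays off (speedup > 1).
print(memoization_speedup(t_compute=100, t_lookup=10, t_insert=10, hit_rate=0.9) > 1.0)
# Region that costs about as much as a lookup: memoization loses (speedup < 1).
print(memoization_speedup(t_compute=10, t_lookup=10, t_insert=10, hit_rate=0.9) > 1.0)
```

Under this model, cheaper lookups lower the break-even point, so smaller code regions become worth memoizing.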

Issue 1: Poor core utilization
• Data-dependent branches
  - High misprediction rate
  - High penalty
• Poor use of core backend
  - Frequent misses
  - Hard to overlap due to too many µops
[Figure: normalized cycles for Baseline and Flat-HTA, broken down into backend stalls, wrong-path execution, and other]
Flat-HTA reduces runtime overheads!

Issue 2: Poor spatial locality
[Figure: the LLC holds the pairs (k0, v0), (k1, v1), (k2, v2), (k3, v3) spread across lines 0, 16, and 32]
• Conventional system: the L2 caches the same sparsely filled lines (0, 16, 32). Wastes cache capacity
• Hierarchical-HTA: the L2 holds (k0, v0), (k1, v1), (k2, v2), (k3, v3) together in line 0. Improves spatial locality!

Prior hardware acceleration underused caches
• Domain-specific management [Costa 2000, Choi 2008, Chalamalasetti 2013, Lim 2013, Gope 2017…]
  - E.g., PHP processing, distributed key-value stores, memoization
  - Requires dedicated on-chip storage (e.g., 98KB [Costa et al. 2000])
  - Or bypasses the memory hierarchy [Lloyd 2017, Tanaka 2014, Xu 2016…]
• HTA is general
• HTA avoids dedicated on-chip storage
• HTA exploits the memory hierarchy for spatial locality

HTA: Hash Table Acceleration

HTA overview
Make the common case fast!
1. Table format
2. ISA extensions
3. Hardware implementation
[Figure: table format (key, overflow); lookups accelerated by an HTA function unit (address calculation, line comparison) in a Fetch, Decode, Issue, Execute, Mem, Commit pipeline; cores with L1I, L1D, and L2 caches sharing an LLC]
• Flat-HTA reduces runtime overheads
• Hierarchical-HTA improves spatial locality

HTA table format
[Figure: a 128b key (labels Reg0, Reg1) is hashed (H) to M bits that select one of 2^M cache lines in memory. Each line holds Key 0 (128b), Key 1 (128b), Value 0 (64b), Value 1 (64b), and 128b unused]
Conventional table
• Variable number of probes
• Introduces hard-to-predict branches
    while (key != curSlot.key) {
        // Probe next slot
    }
HTA table
• Small, fixed number of probes
• Overflows are handled by a software path
• Minimizes work
• Avoids hard-to-predict branches
• Enables hardware acceleration
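The line layout on this slide adds up to one 64-byte cache line: 128b + 128b + 64b + 64b + 128b = 512 bits. A minimal sketch of that arithmetic follows, assuming the field order shown on the slide and assuming the low M bits of the hash select the line. The function names are hypothetical. How empty slots are marked is not shown on the slide, so the sketch does not model it.

```python
import struct

LINE_BYTES = 64  # 128b + 128b + 64b + 64b + 128b = 512 bits
# Two 16-byte keys, two 8-byte values, 16 unused bytes (field order as on the slide).
LINE_FORMAT = "<16s16sQQ16x"


def line_address(table_base, key_hash, log2_lines):
    """Address of the single line a key can live in (log2_lines plays the role of M).

    Assumption: the low M bits of the hash pick the line.
    """
    line_index = key_hash & ((1 << log2_lines) - 1)
    return table_base + line_index * LINE_BYTES


def pack_line(key0, value0, key1, value1):
    """Pack two (key, value) pairs into one 64-byte line image."""
    return struct.pack(LINE_FORMAT, key0, key1, value0, value1)


def probe_line(line_image, key):
    """Compare the key against both slots: a fixed number of probes, no loop."""
    key0, key1, value0, value1 = struct.unpack(LINE_FORMAT, line_image)
    if key == key0:
        return value0
    if key == key1:
        return value1
    return None  # not in this line; overflowed pairs are left to software


k0 = (1).to_bytes(16, "little")
k1 = (2).to_bytes(16, "little")
line = pack_line(k0, 111, k1, 222)
print(len(line), probe_line(line, k1))
```

Because every lookup touches exactly one line and compares a fixed number of slots, there is no data-dependent probe loop like the one in the conventional table.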

HTA ISA extensions
Single-threaded lookup
    lookup:
        hta_lookup <table_id>, <key_reg>, <value_reg>, done
        call swLookup          # Accesses software hash table
    done:
        …
Branch semantics
    if (key is found) or (line is not full):
        taken        # done
    else:
        not taken    # call swLookup
• Easy to predict
• Exploits existing predictors
Single-threaded insert
    insert:
        hta_swap <table_id>, <key_reg>, <value_reg>, done
        call swHandleInsert    # Accesses software hash table
    done:
        …
Multi-threaded insert
    insert:
        hta_update <table_id>, <key_reg>, <value_reg>, done
        call swLockLine
        hta_swap <table_id>, <key_reg>, <value_reg>, release
        call swHandleInsert
    release:
        call swUnlockLine
    done:
        …
• We prototype a CISC (x86) implementation
• RISC is also possible
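The single-threaded flows above can be modeled in a few lines of software. This is a sketch under stated assumptions, not the hardware behavior: one Python list stands in for one table line, a dict stands in for the software hash table, and two slots per line is assumed. The lookup branch rule follows the slide. The behavior of hta_swap on a full line (displace a resident pair and fall through so software can store it) is an assumption made for illustration only.

```python
SLOTS_PER_LINE = 2  # assumption: two pairs per line


def hta_lookup(line, key):
    """Model of the branch semantics on the slide. Returns (taken, value)."""
    for k, v in line:
        if k == key:
            return True, v      # key found: branch to done
    if len(line) < SLOTS_PER_LINE:
        return True, None       # line not full: branch to done, nothing in software
    return False, None          # not taken: fall through to swLookup


def hta_swap(line, key, value):
    """ASSUMED semantics, for illustration. Returns (taken, displaced_pair)."""
    for i, (k, _) in enumerate(line):
        if k == key:
            line[i] = (key, value)
            return True, None
    if len(line) < SLOTS_PER_LINE:
        line.append((key, value))
        return True, None
    displaced = line[0]
    line[0] = (key, value)
    return False, displaced     # not taken: fall through to swHandleInsert


def lookup(line, sw_table, key):
    taken, value = hta_lookup(line, key)
    return value if taken else sw_table.get(key)  # dict.get stands in for swLookup


def insert(line, sw_table, key, value):
    taken, displaced = hta_swap(line, key, value)
    if not taken:
        sw_table[displaced[0]] = displaced[1]     # stands in for swHandleInsert


line, sw_table = [], {}
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    insert(line, sw_table, key, value)
print(lookup(line, sw_table, "a"), lookup(line, sw_table, "c"))
```

In this model the software table is consulted only when the line is full and the key is absent, which is the rare case the slides leave to software.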

Flat-HTA implementation
[Figure: pipeline stages Fetch, Decode, Issue, Execute, Mem, Commit, with an HTA function unit]
• Address calculation: key → lineAddr
• Line comparison: lineValue → outcome
• Area: 0.055% of core

Hierarchical-HTA overview
[Figure: LLC cache lines 0 to 15 and L2 cache lines 0 to 3. Legend: frequently-accessed pair, infrequently-accessed pair, empty slot, cache line]

Check out the paper for more
• Hierarchical-HTA implementation
  - Maintains coherence conservatively
  - Handles overflows conservatively
• Details on ISA and Flat-HTA implementation

Methodology
• Schemes
  - Baseline: best of
      Google dense_hash_map
      C++11 unordered_map
  - HTA-SW
      w/ HTA table format
      w/o HTA function unit
  - Flat-HTA
  - Hierarchical-HTA
• Applications
  - bfcounter (bioinformatics)
  - lzw (data compression)
  - hashjoin (database)
  - ycsb-read (key-value store)
  - ycsb-write (key-value store)
• Simulation with zsim
• System
  - 1 to 16 cores
  - 2MB LLC per core
[Figure: cores, each with L1I, L1D, and L2 caches, sharing an LLC]

Flat-HTA speedups
[Figure: speedup of Baseline, HTA-SW (software-only), and Flat-HTA on bfcounter, lzw, hashjoin, ycsb-read, and ycsb-write]

Flat-HTA cycles breakdown
[Figure: normalized cycles, broken down into backend stalls, wrong-path execution, and others, for Baseline (B), HTA-SW (S, software-only), and Flat-HTA (F) on bfcounter, lzw, hashjoin, ycsb-read, and ycsb-write]

Flat-HTA on multithreaded applications
[Figure: speedup of Baseline and Flat-HTA on ycsb-read and ycsb-write, from 1 to 16 cores]

HTA on memoization
• Example
    memo_exp:
        hta_lookup <table id>, <key reg>, <value reg>, done
        call exp
        hta_swap <table id>, <key reg>, <value reg>, done
    done:
        …
• Schemes
  - Baseline (no memoization)
  - Software memoization
  - HTA memoization
• Applications selected from
  - SPECCPU2006
  - SPECOMP2001
  - SPECOMP2012
  - PARSEC
  - SPLASH2
  - BioParallel
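The control flow of the memo_exp example resembles ordinary software memoization. The sketch below is a software analogy only: a Python dict stands in for the HTA table, and make_memo_exp and the exp_calls counter are hypothetical names added to show how often the memoized function actually runs.

```python
import math


def make_memo_exp():
    table = {}                   # stand-in for an HTA table
    stats = {"exp_calls": 0}     # hypothetical counter, for illustration

    def memo_exp(x):
        if x in table:           # lookup hit: skip straight to done
            return table[x]
        stats["exp_calls"] += 1
        result = math.exp(x)     # miss: call exp
        table[x] = result        # record the result for later calls
        return result

    return memo_exp, stats


memo_exp, stats = make_memo_exp()
memo_exp(1.0)
memo_exp(1.0)                    # second call is served from the table
print(stats["exp_calls"])
```

Only the first call with a given argument pays for exp. Repeated arguments cost one table lookup each.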

Flat-HTA speedups on memoization
[Figure: speedup of Baseline, Software Memoization, and HTA Memoization on bschols, semphy, bwaves, equake, nab, and water]

Conclusion
• HTA accelerates hash tables and memoization
  - Adopts a new hash table format
  - Accelerates common cases in HW; leaves rare cases to SW
• Flat-HTA reduces runtime overheads significantly
  - Requires minor (0.055% area) changes to cores
• Hierarchical-HTA improves spatial locality
  - Needs changes to cores and cache controllers
• HTA improves hash-table-intensive applications by up to 2x
• HTA enables memoization of small code regions

Thanks for your attention! Questions are welcome!
