slide-1
SLIDE 1

Technische Universität Dresden

HASHI: An Application-Specific Instruction Set Extension for Hashing

Oliver Arnold, Sebastian Haas, Gerhard Fettweis, Benjamin Schlegel, Thomas Kissinger, Tomas Karnagel, Wolfgang Lehner

Technische Universität Dresden, Dresden, Germany

slide-2
SLIDE 2

Motivation

TU Dresden

Today’s Database Systems

  • Fat cores (area & power)
  • Few HW adaptations
  • CMOS scaling

Database Processors

  • Processors built from scratch
  • Long development cycle
  • High development costs

Our Approach

  • HW/SW codesign
  • Customizable processor
  • Hashing-specific ISA extensions
  • Tool flow: short HW development cycles

slide-3
SLIDE 3

Application Scenario 1: Integer Hash Function

Bit Extraction

Selection of specific bits in a 32-bit key via arbitrary hash mask

[Figure, bit extraction: <32 Bit Key> -> Bit Selection (32 Bit -> n Bit) -> Shuffle Network -> Result (n Bit)]

[Figure, sampling: histogram over Bit 0 ... Bit 31 of the scanned <32 Bit Key> values]

Sampling

Scanning a subset of the data set to choose the most efficient hash mask
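
The slide does not say how the "most efficient" mask is picked from the histogram. As a minimal sketch, assuming a bit counts as useful when it is set in roughly half of the sampled keys, a mask with n bits could be chosen as follows. The function name `choose_hash_mask` and the 50% criterion are illustrative assumptions, not taken from the presentation.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: scan a sample of keys, build a per-bit histogram,
 * and return a 32-bit mask with n_bits positions set.
 * Assumption (not from the slides): prefer bits that are set in close to
 * 50% of the sampled keys, since they split the keys most evenly. */
uint32_t choose_hash_mask(const uint32_t *sample, size_t n_sample, int n_bits)
{
    size_t ones[32] = {0};              /* histogram: how often bit b is set */
    for (size_t i = 0; i < n_sample; i++)
        for (int b = 0; b < 32; b++)
            ones[b] += (sample[i] >> b) & 1u;

    uint32_t mask = 0;
    for (int k = 0; k < n_bits && k < 32; k++) {
        int best = -1;
        size_t best_dist = 0;
        for (int b = 0; b < 32; b++) {
            if (mask & (1u << b))
                continue;               /* bit already selected */
            /* distance of 2*ones[b] from n_sample, i.e. from a 50% ratio */
            size_t twice = 2 * ones[b];
            size_t dist = twice > n_sample ? twice - n_sample
                                           : n_sample - twice;
            if (best < 0 || dist < best_dist) {
                best = b;
                best_dist = dist;
            }
        }
        mask |= 1u << best;
    }
    return mask;
}
```

For a sample in which only the two lowest bits vary, such as 0xA0000000 to 0xA0000003, a request for two bits selects bits 0 and 1. Ties are resolved in favor of the lower bit position.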

slide-4
SLIDE 4

Application Scenario 2

CityHash32

  • Non-cryptographic hash function for strings
  • Returns 32-bit hash value

Hash Table Operators (Insert, Lookup)

  • Operate on 32-bit keys
  • Apply integer hash function

    unsigned int CityHash32(char *s, int len){
        int hash = comp_1(s+len-20);
        int i = (len-1)/20;
        do {
            hash = comp_2(s, hash);
            s += 20;
        } while(--i != 0);
        return comp_3(hash);
    }
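
The hash table operators on this slide are only named, not shown. The following is a minimal sketch of Insert and Lookup on 32-bit keys with a bit-extraction hash, assuming a fixed-size open-addressing table with linear probing. The table layout, the probing scheme, and all names (`HashTable`, `ht_init`, `ht_insert`, `ht_lookup`, `hash_key`) are assumptions made for illustration; the presentation does not describe the table organization.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define TABLE_BITS 4                    /* assumed: table has 2^4 slots */
#define TABLE_SIZE (1u << TABLE_BITS)

typedef struct {
    uint32_t keys[TABLE_SIZE];
    bool     used[TABLE_SIZE];
    uint32_t hash_mask;                 /* bits of the key used as hash */
} HashTable;

/* Integer hash function: gather the key bits selected by mask. */
static uint32_t hash_key(uint32_t key, uint32_t mask)
{
    uint32_t out = 0;
    int pos = 0;
    for (int b = 0; b < 32; b++)
        if (mask & (1u << b))
            out |= ((key >> b) & 1u) << pos++;
    return out;
}

void ht_init(HashTable *t, uint32_t hash_mask)
{
    memset(t, 0, sizeof *t);
    t->hash_mask = hash_mask;
}

/* Insert: returns false only if the table is full. */
bool ht_insert(HashTable *t, uint32_t key)
{
    uint32_t slot = hash_key(key, t->hash_mask);
    for (uint32_t probe = 0; probe < TABLE_SIZE; probe++) {
        uint32_t s = (slot + probe) & (TABLE_SIZE - 1);
        if (!t->used[s]) {
            t->used[s] = true;
            t->keys[s] = key;
            return true;
        }
        if (t->keys[s] == key)
            return true;                /* key already present */
    }
    return false;
}

/* Lookup: probe until the key or an empty slot is found. */
bool ht_lookup(const HashTable *t, uint32_t key)
{
    uint32_t slot = hash_key(key, t->hash_mask);
    for (uint32_t probe = 0; probe < TABLE_SIZE; probe++) {
        uint32_t s = (slot + probe) & (TABLE_SIZE - 1);
        if (!t->used[s])
            return false;
        if (t->keys[s] == key)
            return true;
    }
    return false;
}
```

With `hash_mask = 0xF`, the keys 1 and 17 hash to the same slot, so the second insert lands in the next free slot and both lookups still succeed.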

slide-5
SLIDE 5

Customizable Processor Model

Basic Core: Tensilica LX5

[Block diagram: Processor with Inst. Fetch, Instruction Set (Basic RISC ISA, Hash-Specific ISA), Basic Registers, Hash-Specific Registers, Hash-Specific States, L/S Unit 0 and L/S Unit 1; Local Memory (Inst.), Local Memory (Data0), Local Memory (Data1); Interconnection; Data Prefetcher]

slide-6
SLIDE 6

Integer Hash Function: C code

    unsigned int hash, shVal, shVal_neg;
    unsigned int mask = 0xFFFFFFFF;
    for(i=0; i<keySize; i++){
        //load key, bit selection
        hash = key[i] & hashFunc;
        //extract bits
        for(j=30; j>=0; j--){
            if(!(hashFunc & (0x1<<j))){
                //partial shift right
                shVal = hash & (mask<<j);
                shVal_neg = hash & ~(mask<<j);
                hash = (shVal>>1) | shVal_neg;
            }
        }
        //store hash value
        hashValue[i] = hash;
    }

Pure C code
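
To make the behavior of the pure C loop easier to check, the sketch below wraps the slide's per-key logic in a function and places a simpler reference next to it that gathers the selected bits one at a time. The function names `extract_bits_slide` and `extract_bits_ref` are hypothetical, and the explicit `uint32_t` types and local declarations are additions for a self-contained example.

```c
#include <stdint.h>

/* The per-key body of the slide's loop: keep the bits selected by
 * hashFunc, then close every gap by shifting the bits above it right. */
uint32_t extract_bits_slide(uint32_t key, uint32_t hashFunc)
{
    const uint32_t mask = 0xFFFFFFFFu;
    uint32_t hash = key & hashFunc;             /* bit selection */
    for (int j = 30; j >= 0; j--) {
        if (!(hashFunc & (1u << j))) {          /* bit j is not selected */
            uint32_t shVal     = hash & (mask << j);   /* bits at or above j */
            uint32_t shVal_neg = hash & ~(mask << j);  /* bits below j */
            hash = (shVal >> 1) | shVal_neg;    /* partial shift right */
        }
    }
    return hash;
}

/* Reference: copy each selected bit to the next free output position,
 * from the least significant bit upward. */
uint32_t extract_bits_ref(uint32_t key, uint32_t hashFunc)
{
    uint32_t out = 0;
    int pos = 0;
    for (int b = 0; b < 32; b++) {
        if (hashFunc & (1u << b)) {
            out |= ((key >> b) & 1u) << pos;
            pos++;
        }
    }
    return out;
}
```

For key 0xB (binary 1011) and hash mask 0xA (binary 1010), both functions return 3: bits 1 and 3 of the key are both set and are packed into the two lowest result bits.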

slide-7
SLIDE 7

Integer Hash Function: C code

C code with new instructions

    //init pointer, variables
    init_states(key, hashValue, hashFunc);
    LD_0(); LD_1();
    //load keys, extract bits, store hash values
    for(i=0; i<(keySize/16); i++){
        LD_0(); LD_1(); HOP();
        LD_0(); LD_1(); HOP();
        ST_0(); ST_1();
    }
    HOP(); ST_0(); ST_1();

slide-8
SLIDE 8

Integer Hash Function: C code

Annotations on the C code with new instructions: 1 cycle (shown three times)

slide-9
SLIDE 9

Integer Hash Function: ISA Extensions

[Figure, dataflow: LD_0, LD_1, HOP and ST; Local Data Memory 0 and 1 with Load-Store Unit 0 and 1; Key_0 to Key_7 pass through eight HASH Op. units, configured by Hash Func, to give Result_0 to Result_7]

slide-10
SLIDE 10

Integer Hash Function: Pipeline Snippet

[Figure: pipeline diagram over Cycle n to Cycle (n+8) with overlapping LD_0, LD_1, HOP, ST_0 and ST_1 operations]

Latency: 6 cycles

slide-11
SLIDE 11

Integer Hash Function: Throughput

Throughput = nkey / t
nkey: number of keys
t: time to perform the operation

[Chart: throughput per configuration, up to the final processor]

  • +1 Load-Store unit (2x)
  • + Extended ISA (500x)
  • Data bus: 32->128 bit (2x)

slide-12
SLIDE 12

Results: Throughput

[Chart: speedup of HASHI (final processor) vs. 108Mini; bar values: 386x, 354x, 2303x, 1288x, 125x]

slide-13
SLIDE 13

Results: Timing and Area

[Chart: relative area consumption (HASHI), final processor]

slide-14
SLIDE 14

Results: Comparison

[Chart: measures, HASHI vs. Intel; annotations: 3x/7x lower, 57x/176x lower, 113x/271x lower]

slide-15
SLIDE 15

Conclusion

Hardware/Software Codesign approach

Results

  • High database throughput
  • Highly reduced area and power consumption
  • 170x less energy consumption than a high-end x86 processor (@ same performance)

Silicon Prototype

  • Tape-out April 2014
  • 28 nm LP process: Globalfoundries
  • ISA: Hash Functions, Hash Table Operators etc.

[1] Nöthen et al., A 105GOPS 36mm2 Heterogeneous SDR MPSoC with Energy-Aware Dynamic Scheduling and Iterative Detection-Decoding for 4G in 65nm CMOS, ISSCC, 2014.