Overview on Hardware Optimizations for Database Engines



SLIDE 1

Overview on Hardware Optimizations for Database Engines

Annett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner

BTW 2017, Stuttgart, Germany, 2017-03-09

SLIDE 2

Interaction DB-Engine and Hardware

Applications/Database Engines Modern Hardware

Well-Known Challenge:

Exploit hardware technology by specific data management techniques (indexing, data storage, query & transaction processing)

[Figure: two plots over the years 1970 to 2020. Main Memory: memory (KByte), logarithmic scale. CPU: #cores.]

SLIDE 3

Era of Dark Silicon

MOORE'S LAW

§ Number of transistors in a dense integrated circuit doubles approximately every two years.

DARK SILICON

§ We can no longer power the transistors that Moore is giving us

[Figure: #transistors (x1000) and process (nm) over the years 1970 to 2020, logarithmic scale]

http://engineering.nyu.edu/garg/node/31

SLIDE 4

HW/SW Co-Design for DB-Engines

Applications/Database Engines Modern Hardware

Challenge:

HW/SW Co-Design for Database Engines

Specialization of Hardware to overcome Dark Silicon

SLIDE 5

Outline

HARDWARE FOUNDATION

EXTENSIONS FOR PROCESSING ELEMENTS

INTELLIGENT DMA CONTROLLER

SLIDE 6

Hardware Foundation

TOMAHAWK PLATFORM

SLIDE 7

Hardware Foundation – Zoom In

SLIDE 8

Hardware Foundation – Zoom In (2)

CORE MANAGER (CM)

§ Extended Xtensa-LX5 from Tensilica (now Cadence)
§ 32KB for code
§ 64KB for data

PROCESSING ELEMENTS (PE)

§ Xtensa-LX5 from Tensilica (now Cadence)
§ 32KB for code
§ 2x32KB for data on PE

APPLICATION CORE (APP)

§ 570T core from Tensilica (now Cadence)

Control-Plane

SLIDE 9

Outline

Control-Plane

PART I: EXTENSIONS OF PROCESSING ELEMENTS

SLIDE 10

Development Flow

DEVELOPMENT OF INSTRUCTION SET EXTENSIONS WITH TENSILICA TOOLS

§ Tensilica Instruction Extension (TIE) language
§ C/TIE compiler
§ Cycle accurate simulator/debugger
§ Processor generator

SYNTHESIS OF RTL CODE

§ Synopsys Design Compiler, PrimeTime PX
§ TSMC CMOS LP 65nm libraries

    int res = (v0 + v1 + v2) >> shift8; // shift8 -> internal state
    int res = add3_shift(v0, v1, v2);
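
The two code lines on this slide contrast a plain C expression with a call to a custom instruction. As a rough illustration only, the plain-C sketch below emulates what such a fused add-and-shift instruction could compute. It is not TIE code: the variable `shift8_state` and the helper `set_shift8` are hypothetical stand-ins for the internal processor state that the slide calls `shift8`.

```c
#include <stdint.h>

/* Hypothetical model of the internal state "shift8" (assumption:
 * on the real processor this lives in an application-specific state
 * register, not in a C variable). */
static unsigned shift8_state = 0;

/* Hypothetical helper that sets the modelled internal state. */
void set_shift8(unsigned shift)
{
    shift8_state = shift;
}

/* Software emulation of the fused instruction: add three operands
 * and shift the sum right by the amount held in the internal state. */
int add3_shift(int v0, int v1, int v2)
{
    return (v0 + v1 + v2) >> shift8_state;
}
```

In this model the shift amount is set once and then reused by every call, which mirrors the idea that the extension keeps it as state instead of passing it as an operand.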

SLIDE 11

Investigated Database Primitives

Primitives

§ Bitmap Compression and Processing (AND, OR, XOR): WAH, PLWAH, COMPAX
§ Hashing: Hash + Lookup, Hash + Insert, Hash Keys, Hash Sampling, CityHash32
§ Sorted Set Operations: Merge Sort, Intersection, Union, Difference, Sort-Merge Join, Sort-Merge Aggregation (SUM)

2014

SLIDE 12

General Approach for all Extensions

Extended Tensilica LX5 Processor

Instruction Set

§ Basic RISC Instruction Set
§ Application-Specific Instruction Set

Register Files

§ Basic Registers
§ Application-Specific Registers
§ Application-Specific States

[Figure: block diagram with Instruction fetch, Load-Store Unit 0, Load-Store Unit 1, Data Prefetcher, Interconnect, Local Instruction Memory, Local Data Memory 0, Local Data Memory 1. Widths shown: 64 bit, 128 bit, 128 bit.]

SLIDE 13

Bitmap Primitives

BITMAPS ARE A SPECIAL KIND OF INDEX

§ bit length equals number of tuples

BITMAP COMPRESSION

WORD-ALIGNED HYBRID (WAH) CODE

§ Stateless compression
§ Run-length-encoding (RLE)

  • runs of 0s and 1s

§ WAH bitmaps contain

  • RLE-compressed fills and
  • uncompressed literals

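Table T with bitmap index on X:

    OID  X  |  b1 (=0)  b2 (=1)  b3 (=2)  b4 (=3)
    1    0  |  1        0        0        0
    2    1  |  0        1        0        0
    3    3  |  0        0        0        1
    4    2  |  0        0        1        0
    5    3  |  0        0        0        1
    6    3  |  0        0        0        1
    7    1  |  0        1        0        0
    8    3  |  0        0        0        1

select * from T where X < 2

The predicate X < 2 is answered by a bit-wise OR of the bitmaps b1 and b2.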
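
As a minimal sketch of how such a bitmap index answers the query, the C code below builds one uncompressed bitmap per distinct value and evaluates `X < bound` with a bit-wise OR. The function names, the limit of 32 tuples, and the column values used in the usage note (X = 0, 1, 3, 2, 3, 3, 1, 3 for OIDs 1 to 8) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_VALUES 4 /* distinct values of X: 0..3 (assumption) */

/* Build one bitmap per distinct value. Bit i of bitmaps[v] is set
 * when tuple i (OID i + 1) has X == v. Handles at most 32 tuples. */
void build_bitmap_index(const uint8_t *x, size_t n,
                        uint32_t bitmaps[NUM_VALUES])
{
    for (int v = 0; v < NUM_VALUES; v++)
        bitmaps[v] = 0;
    for (size_t i = 0; i < n && i < 32; i++)
        if (x[i] < NUM_VALUES)
            bitmaps[x[i]] |= (uint32_t)1 << i;
}

/* Evaluate "X < bound" by OR-ing the bitmaps of all qualifying values. */
uint32_t select_less_than(const uint32_t bitmaps[NUM_VALUES], int bound)
{
    uint32_t result = 0;
    for (int v = 0; v < bound && v < NUM_VALUES; v++)
        result |= bitmaps[v];
    return result;
}
```

With the assumed column values, `select_less_than(bitmaps, 2)` sets the bits for OIDs 1, 2 and 7, which are exactly the tuples with X = 0 or X = 1.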

SLIDE 14

Bit-Wise OR on Compressed Bitmaps

32-bit words, in hex:

    b1:      40000380 00000000 00000000 001FFFFF
    WAH b1:  40000380 80000002 001FFFFF
             (Literal, 0 fill, Literal)

    b2:      7FFFFFFF 7FFFFFFF 7C0001E0 3FE00000
    WAH b2:  C0000002 7C0001E0 3FE00000
             (1 fill, Literal, Literal)

Fill words:

§ 10<runlength>: 0 fill, expands to 00000000 ...
§ 11<runlength>: 1 fill, expands to 7FFFFFFF ...

Bit-wise OR: the words of b1 and b2 are combined pairwise (OR).

Logical operations (AND, OR, XOR) on two compressed bitmaps:

1) Load WAH word(s)
2) Calculate output (Fill-Fill, Literal-Fill, Literal-Literal)
3) Combine output
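
The encoding on this slide can be sketched in a few lines of C. This is an illustrative sketch, not the authors' implementation: the function name `wah_compress` is hypothetical, the input is assumed to be already split into 31-bit groups stored in 32-bit words, and a single all-0 or all-1 group is assumed to be written as a fill with run length 1.

```c
#include <stddef.h>
#include <stdint.h>

#define WAH_ZERO_FILL 0x80000000u /* 10<runlength> */
#define WAH_ONE_FILL  0xC0000000u /* 11<runlength> */
#define WAH_RUN_MASK  0x3FFFFFFFu /* low 30 bits hold the run length */
#define WAH_ALL_ONES  0x7FFFFFFFu /* a 31-bit group of ones */

/* Compress n 31-bit groups into WAH words. The caller provides an
 * output buffer with room for n words. Returns the number of WAH
 * words written. */
size_t wah_compress(const uint32_t *groups, size_t n, uint32_t *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t g = groups[i] & WAH_ALL_ONES;
        if (g == 0 || g == WAH_ALL_ONES) {
            uint32_t fill = g ? WAH_ONE_FILL : WAH_ZERO_FILL;
            /* Extend the previous word if it is a fill of the same kind. */
            if (m > 0 && (out[m - 1] & WAH_ONE_FILL) == fill &&
                (out[m - 1] & WAH_RUN_MASK) < WAH_RUN_MASK)
                out[m - 1]++;
            else
                out[m++] = fill | 1u;
        } else {
            out[m++] = g; /* literal word, top bit 0 */
        }
    }
    return m;
}
```

Applied to the four words of b1 this yields 40000380 80000002 001FFFFF, and applied to b2 it yields C0000002 7C0001E0 3FE00000, matching the slide.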

SLIDE 15

C-Code

    while (Xidx != Xsize && Yidx != Ysize) {
        // new X or Y? Calculate new fill count
        …
        if (XisFill == 1 && YisFill == 1) {
            // Fill-Fill: consume the shorter of the two runs
            if (XfillWords < YfillWords)
                min = XfillWords;
            else
                min = YfillWords;
            writeFill(comprResultBI, &Zidx, X[Xidx] | Y[Yidx], min);
            XfillWords -= min;
            YfillWords -= min;
        } else if ((XisFill == 1 && YisFill == 0) || (XisFill == 0 && YisFill == 1)) {
            // Literal-Fill
            if (XisFill == 1) {
                XfillWords--;
                if ((X[Xidx] & 0xC0000000) == 0xC0000000)
                    writeFill(comprResultBI, &Zidx, 0xC0000000, 1); // 1 fill: result is all ones
                else {
                    comprResultBI[Zidx] = Y[Yidx]; // 0 fill: result is the literal
                    Zidx++;
                }
            }
            if (YisFill == 1) {
                YfillWords--;
                if ((Y[Yidx] & 0xC0000000) == 0xC0000000)
                    writeFill(comprResultBI, &Zidx, 0xC0000000, 1);
                else {
                    comprResultBI[Zidx] = X[Xidx];
                    Zidx++;
                }
            }
        } else {
            // Literal-Literal: the result may collapse into a fill
            result = X[Xidx] | Y[Yidx];
            if ((result & 0x7FFFFFFF) == 0x7FFFFFFF)
                writeFill(comprResultBI, &Zidx, 0xC0000000, 1);
            else if ((result & 0x7FFFFFFF) == 0)
                writeFill(comprResultBI, &Zidx, 0x80000000, 1);
            else {
                comprResultBI[Zidx] = X[Xidx] | Y[Yidx];
                Zidx++;
            }
        }
    }
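
The excerpt on this slide leaves out the decoding of the input words and the `writeFill` helper. The following self-contained C sketch shows one way to perform the same OR on two WAH streams. It is a simplification and not the authors' code: it expands fills one 31-bit group at a time instead of handling the Fill-Fill case in bulk, and the names `wah_cursor`, `wah_next`, `wah_append` and `wah_or` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define WAH_FILL_FLAG 0x80000000u /* top bit set: fill word */
#define WAH_FILL_TYPE 0xC0000000u /* top two bits: 10 = 0 fill, 11 = 1 fill */
#define WAH_RUN_MASK  0x3FFFFFFFu /* low 30 bits: run length */
#define WAH_ALL_ONES  0x7FFFFFFFu /* a 31-bit group of ones */

/* Cursor that yields one 31-bit group at a time from a WAH stream. */
typedef struct {
    const uint32_t *words;
    size_t size;
    size_t idx;
    uint32_t remaining;  /* groups left in the current fill */
    uint32_t fill_group; /* group value of the current fill */
} wah_cursor;

/* Fetch the next group; returns 0 when the stream is exhausted. */
static int wah_next(wah_cursor *c, uint32_t *group)
{
    while (c->remaining == 0) {
        if (c->idx == c->size)
            return 0;
        uint32_t w = c->words[c->idx++];
        if (!(w & WAH_FILL_FLAG)) { /* literal word */
            *group = w;
            return 1;
        }
        c->remaining = w & WAH_RUN_MASK;
        c->fill_group =
            ((w & WAH_FILL_TYPE) == WAH_FILL_TYPE) ? WAH_ALL_ONES : 0u;
    }
    c->remaining--;
    *group = c->fill_group;
    return 1;
}

/* Append one group to the output, merging adjacent fills of the same
 * kind. Returns the new output length. */
static size_t wah_append(uint32_t *out, size_t n, uint32_t group)
{
    if (group == 0 || group == WAH_ALL_ONES) {
        uint32_t fill = group ? WAH_FILL_TYPE : WAH_FILL_FLAG;
        if (n > 0 && (out[n - 1] & WAH_FILL_TYPE) == fill &&
            (out[n - 1] & WAH_RUN_MASK) < WAH_RUN_MASK) {
            out[n - 1]++; /* increase the fill counter */
            return n;
        }
        out[n] = fill | 1u;
        return n + 1;
    }
    out[n] = group;
    return n + 1;
}

/* OR two WAH streams. The output buffer needs one word per group. */
size_t wah_or(const uint32_t *x, size_t xsize,
              const uint32_t *y, size_t ysize, uint32_t *out)
{
    wah_cursor cx = { x, xsize, 0, 0, 0 };
    wah_cursor cy = { y, ysize, 0, 0, 0 };
    uint32_t gx, gy;
    size_t n = 0;
    while (wah_next(&cx, &gx) && wah_next(&cy, &gy))
        n = wah_append(out, n, gx | gy);
    return n;
}
```

For the inputs 40000380 80000002 001FFFFF and C0000002 7C0001E0 3FE00000 this sketch produces C0000002 7C0001E0 3FFFFFFF. Expanding group by group keeps the code short at the cost of the bulk Fill-Fill shortcut that the slide's loop uses.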

SLIDE 16

Processing with PE Extension

[Figure: data flow through the extended PE. Stages: Initial Load, Load, Prepare (Preprocessing), Operation, Postprocessing, Store. Inputs and outputs reside in Memory 0 and Memory 1; application-specific states hold the data between the stages.]

Example memory content (32-bit words, in hex):

    0000000F 00000003 40000380 80000002 001FFFFF C0000002 7C0001E0 3FE00000

Steps shown:

  • ldXstream(), ldYstream(): load the input streams
  • Align to 128-bit lines
  • Is word fill or literal? Fill -> overwrite input words
  • Perform operation OR, e.g. 00000000.. v 11111111.. => 11111111..
  • Write to output stream -> append or overwrite previous word with increased fill counter
  • Buffer result
  • Proceed to next word (4x): 4 x WAHinst()

SLIDE 17

Bit-Wise OR on Compressed Bitmaps

32-bit words, in hex:

    b1:      40000380 00000000 00000000 001FFFFF
    WAH b1:  40000380 80000002 001FFFFF
             (Literal, 0 fill, Literal)

    b2:      7FFFFFFF 7FFFFFFF 7C0001E0 3FE00000
    WAH b2:  C0000002 7C0001E0 3FE00000
             (1 fill, Literal, Literal)

Bit-wise OR: the words of b1 and b2 are combined pairwise (OR).

Code with Extension

    do {
        ldXstream();
        ldYstream();
        WAHinst();
        WAHinst();
        WAHinst();
    } while (WAHinst());

SLIDE 18

Many More Extensions

Primitives

§ Bitmap Compression and Processing (AND, OR, XOR): WAH, PLWAH, COMPAX
§ Hashing: Hash + Lookup, Hash + Insert, Hash Keys, Hash Sampling, CityHash32
§ Sorted Set Operations: Merge Sort, Intersection, Union, Difference, Sort-Merge Join, Sort-Merge Aggregation (SUM)

Processor Extension

    BitiX          X X X
    HASHI          X X X X X
    Titan3D        X X X X X X X
    Tomahawk DBA   X X X X X X

SLIDE 19

Evaluation

REFERENCE PROCESSORS

§ Tomahawk DBA Processor --> Set of different DB-Extensions for WAH-Compression, Hashing, and Sorted-Set Operations

Comparison

    Processor            | Description                                                                            | Technology [nm] | A_total [mm²] | f_MAX [GHz] | P_MAX [W] @ f_MAX
    Tomahawk without DBA | Basic Xtensa LX5 without instruction set extensions, 1 LSU, 32-bit memory interface    | 28              | 15.92         | 0.555       | 0.7
    Tomahawk with DBA    | Set of different DB-Extensions for WAH-Compression, Hashing and Sorted-Set Operations  | 28              | 18            | 0.5         | 0.753
    Intel i7-6500U       | Low-power Intel 2-core processor based on Skylake architecture, 4MB L3 cache           | 14              | 99*           | 3.1         | 25

SLIDE 20

Evaluation - Bitmaps

SLIDE 21

Outline

PART II: INTELLIGENT DMA CONTROLLER

SLIDE 22

Problem Statement

[Figure: block diagram. T2 RISC Cores with Local Memory, APP (Tensilica 570T), CM (LX4-ISA_E), Local Memory and Cache, connected via the NoC to the Memory Controller (Synopsys DWC DDR2) and the Memory (Micron DDR2 SDRAM). Latency labels: tAPP, tAN, tNA, tNMc, tMcN, tMcM, tMMc. Example entries: 0xCCA 1, 0x00B 2, 0x0FA 3, 0x1FD 4, 0xDE1 5, 0x0ED 6, 0x00E 7, 0xD0A.]

Problem: Many round-trips for key lookups

Approach: "Teach B-trees to the memory controller"

SLIDE 23

Intelligent Main Memory Controller (iDMA)

[Figure: block diagram with an added Pointer Chaser. Cores with Local Memory, APP, CM, Local Memory and Cache, connected via the NoC to the Pointer Chaser, the Memory Controller (Synopsys DWC DDR2) and the Memory (Micron DDR2 SDRAM). Latency labels: tCN, tNC, tNP, tPN, tPMc, tMcP, tMcM, tMMc. Example entries: 0xCC6 1, 0x000 2, 0x0F0 3, 0x1FD 4, 0xDE1 5, 0x0ED 6, 0x00E 7, 0xD0A.]

Vision (and first simulations)

  • Intelligent memory controller
  • Is aware of the semantics of memory layout
  • Implements core operations (e.g. lookup)

Implementation (not yet in silicon)

  • 0.183 mm² PE at 200 MHz
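
To make the round-trip argument concrete, here is a deliberately simplified C sketch. It is an assumption-laden illustration and not the iDMA design: it uses a binary search tree instead of a B-tree, the names `tree_node`, `tree_lookup`, `roundtrips_core_side` and `roundtrips_controller_side` are hypothetical, and the cost model only counts how often a request crosses the NoC.

```c
#include <stddef.h>
#include <stdint.h>

/* One node of a binary search tree stored in main memory.
 * Children are array indices; -1 marks a missing child. */
typedef struct {
    uint32_t key;
    int32_t left;
    int32_t right;
} tree_node;

/* Look up `key` starting at index `root`. Returns the index of the
 * matching node or -1. `*fetches` counts how many nodes were read. */
int32_t tree_lookup(const tree_node *mem, int32_t root, uint32_t key,
                    unsigned *fetches)
{
    int32_t cur = root;
    *fetches = 0;
    while (cur >= 0) {
        const tree_node *n = &mem[cur];
        (*fetches)++;
        if (key == n->key)
            return cur;
        cur = (key < n->key) ? n->left : n->right;
    }
    return -1;
}

/* Assumed cost model: if the core chases the pointers itself, every
 * node fetch is one round trip over the NoC to main memory. */
unsigned roundtrips_core_side(unsigned fetches)
{
    return fetches;
}

/* Assumed cost model: if the lookup runs next to the memory
 * controller, only the request and its answer cross the NoC. */
unsigned roundtrips_controller_side(unsigned fetches)
{
    (void)fetches; /* node fetches stay local to the controller */
    return 1;
}
```

Under these assumptions a lookup that touches three nodes costs three NoC round trips when the core traverses the tree and one when the traversal is delegated to the memory controller.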
SLIDE 24

First iDMA Design

SLIDE 25

Evaluation using Simulator

SLIDE 26

Summary

HARDWARE FOUNDATION

EXTENSIONS FOR PROCESSING ELEMENTS

INTELLIGENT DMA CONTROLLER

SLIDE 27

Overview on Hardware Optimizations for Database Engines

Annett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner

BTW 2017, Stuttgart, Germany, 2017-03-09