Q100: The Architecture and Design of a DATABASE PROCESSING UNIT - PowerPoint PPT Presentation


SLIDE 1

Q100: The Architecture and Design of a DATABASE PROCESSING UNIT

Lisa Wu, Andrea Lottarini, Tim Paine, Martha Kim, and Ken Ross
Columbia University, NYC

Thursday, March 6, 2014

SLIDE 2

DPUs are analogous to GPUs

[Figure: graphics workloads pair a CPU with a GPU; database workloads pair a CPU with a DPU]

SLIDES 3-10

Q100: A Stream-Based DPU

  • Accelerates analytic queries (e.g., a sales projection), rather than transactional processing (e.g., the sale of an airline ticket)
  • Direct hardware support for relational operators: JOIN, AGGREGATE, SORT, SELECT
  • Processes data as streams that flow IN to and OUT of each relational operator
  • Combines spatial and temporal instructions to form a DPU ISA

SLIDES 11-13

Query

SELECT s_season, SUM(s_qty) AS sum_qty
FROM sales
WHERE s_shipdate >= '2013-01-01'
GROUP BY s_season
ORDER BY s_season

Plan

[Query-plan diagram over the SALES table: Bool Gen, Column Select, Column Filter, Stitch, Partition, Aggregate, and Append operators combine to produce the final answer]

Nodes = Relational Operators
Edges = Data Dependencies

SLIDES 14-16

Query Plan → Q100 Program

[Figure: the query plan on the left is mapped onto the Q100 device on the right: an array of tiles between two memory interfaces, linked by an interconnect]

SLIDES 17-20

Query Plan → Q100 Program

[Figure: the query plan is cut into spatial instructions (the operators configured onto tiles at one time) grouped into temporal instructions (the sequence of steps in which those groups execute)]
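
To make the spatial/temporal split concrete, here is a minimal sketch of one way the Q100 program for the example query could be represented in software. The class and field names are illustrative assumptions, not the paper's actual ISA encoding.

```python
# Hypothetical representation of a Q100 program: a list of temporal
# instructions, each holding the spatial instructions (operator-to-tile
# configurations) that execute concurrently in that step. Illustrative
# names only; not the paper's actual ISA encoding.
from dataclasses import dataclass

@dataclass
class SpatialInstruction:
    op: str        # tile type, e.g. "boolgen", "colfilter", "agg_sum"
    inputs: list   # names of the streams this tile consumes
    output: str    # name of the stream this tile produces

# Each inner list is one temporal instruction: its spatial instructions
# are laid out on tiles at the same time and stream data to one another.
program = [
    [  # step 1: evaluate the WHERE clause
        SpatialInstruction("boolgen",   ["s_shipdate", "'2013-01-01'"], "bool1"),
        SpatialInstruction("colfilter", ["s_season", "bool1"], "season_f"),
        SpatialInstruction("colfilter", ["s_qty", "bool1"], "qty_f"),
    ],
    [  # step 2: GROUP BY s_season, SUM(s_qty)
        SpatialInstruction("agg_sum", ["season_f", "qty_f"], "sum_qty"),
    ],
]
```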

SLIDES 21-30

Q100 Execution and Efficiencies

[Animation: the SALES table is read from memory, split into partitioned tables, and streamed through the tiles, producing temp columns and temp tables along the way]

  • Read a datum once, perform multiple operations
  • Pipeline parallelism
  • Data parallelism
  • Minimize spills/fills
  • Use coarse-grain hardware primitives that operate on coarse-grain data
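
As an illustration of these efficiencies (not the hardware mechanism itself, which uses tiles rather than threads), the sketch below chains streaming operators so each datum is read once and flows through the whole pipeline, while separate partitions are processed in parallel. All names and the toy data are made up.

```python
# Illustrative model of pipeline parallelism (generators chain operators,
# so each row is read once and never materialized between them) and data
# parallelism (independent partitions processed concurrently).
from concurrent.futures import ThreadPoolExecutor

def stream_filter(rows, pred):
    for r in rows:
        if pred(r):
            yield r                      # flows straight to the next stage

def stream_group_sum(rows, key, val):
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[val]
    return totals

def run_pipeline(partition):
    kept = stream_filter(partition, lambda r: r["s_shipdate"] >= "2013-01-01")
    return stream_group_sum(kept, "s_season", "s_qty")

partitions = [                           # toy partitioned SALES table
    [{"s_season": "spring", "s_qty": 3, "s_shipdate": "2013-04-01"},
     {"s_season": "spring", "s_qty": 2, "s_shipdate": "2012-11-02"}],
    [{"s_season": "winter", "s_qty": 5, "s_shipdate": "2013-12-20"}],
]
with ThreadPoolExecutor() as pool:       # one pipeline per partition
    print(list(pool.map(run_pipeline, partitions)))
```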

SLIDES 31-36

  • How do we implement these operators?
  • How many tiles should there be, and of what type?
  • How do we generate these query plans?
  • What kind of interconnect should we use? What bandwidth is needed on- and off-chip?
  • How do we schedule the plans?
  • Is the Q100 performant and energy efficient?

SLIDE 37

How do we implement these operators?

SLIDES 38-40

Example Tile: Boolean Generator

[Figure: the BOOLGEN tile compares two input streams (IN 0, IN 1) under a configurable op (here ==), gated by an enable signal (EN), and emits a boolean stream (OUT) that feeds the IN 2 port of a COLUMN FILTER tile]

Implements predicates such as: WHERE s_shipdate >= '2013-01-01'
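
A minimal software model of this tile pair, assuming element-wise streams; the port roles follow the slide, while the function names and the OPS table are illustrative.

```python
# Software model of a boolean-generator tile feeding a column-filter tile.
import operator

OPS = {"==": operator.eq, ">=": operator.ge, "<": operator.lt}

def boolgen(in0, in1, op):
    """Compare two input streams element-wise; emit a boolean stream."""
    f = OPS[op]
    return [f(a, b) for a, b in zip(in0, in1)]

def colfilter(col, mask):
    """Keep only the column entries whose mask bit is True (port IN 2)."""
    return [v for v, keep in zip(col, mask) if keep]

s_shipdate = ["2012-12-30", "2013-01-05", "2013-02-11"]
s_qty      = [2, 7, 4]

# WHERE s_shipdate >= '2013-01-01' (the constant is broadcast on IN 1)
mask = boolgen(s_shipdate, ["2013-01-01"] * len(s_shipdate), ">=")
print(colfilter(s_qty, mask))            # -> [7, 4]
```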

SLIDES 41-42

Example Tile: Aggregator

[Figure: the AGG tile consumes a group stream (GRP) and a data stream (DATA) under a configurable op (here sum), gated by an enable signal (EN), and emits the aggregated stream (OUT); an upstream SORT tile supplies GRP and DATA so that each group arrives contiguously]
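
Because the upstream sorter makes each group contiguous, the aggregator only ever needs one running total. A sketch under that assumption (function and variable names are illustrative):

```python
# Streaming group-by aggregation over sorted input: a group boundary in
# the GRP stream flushes the single running total.
def agg_sum(grp, data):
    out, cur, total = [], None, 0
    for g, d in zip(grp, data):
        if g != cur:
            if cur is not None:
                out.append((cur, total))
            cur, total = g, 0
        total += d
    if cur is not None:
        out.append((cur, total))         # flush the last group
    return out

# Input already sorted by group, so "fall" never reappears later.
print(agg_sum(["fall", "fall", "spring"], [3, 4, 5]))
# -> [('fall', 7), ('spring', 5)]
```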

SLIDES 43-45

Example Tile: Sorter

[Figure: the SORT tile consumes GRP and DATA streams and emits them sorted]

Limitation: the sorter handles only a bounded number of records, so larger inputs are first split by a PARTITION tile and each partition is sorted independently.
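
A sketch of that workaround: range-partition the input so each chunk fits in the sorter, sort each chunk, and concatenate. The capacity constant and the boundary value are illustrative assumptions.

```python
# Partition-then-sort around a bounded-capacity sorter. Because the
# partitions are disjoint key ranges, concatenating the sorted
# partitions yields a fully sorted stream.
SORTER_CAPACITY = 4                      # illustrative hardware bound

def partition(records, boundaries):
    """Range-partition (key, value) records into len(boundaries)+1 buckets."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, val in records:
        buckets[sum(key >= b for b in boundaries)].append((key, val))
    return buckets

def sort_tile(records):
    assert len(records) <= SORTER_CAPACITY, "exceeds sorter capacity"
    return sorted(records)

records = [(9, "a"), (1, "b"), (5, "c"), (3, "d"), (7, "e"), (2, "f")]
out = []
for bucket in partition(records, boundaries=[5]):
    out.extend(sort_tile(bucket))        # each bucket fits in one sorter
print(out)  # [(1, 'b'), (2, 'f'), (3, 'd'), (5, 'c'), (7, 'e'), (9, 'a')]
```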

SLIDES 46-48

Q100 Tiles

  • Functional Tiles (7): Aggregator, ALU, Boolean Generator, Column Filter, Joiner, Partitioner, Sorter
  • Auxiliary Tiles (4): Table Appender, Column Selector, Column Concatenator, Column Stitcher

Tile Characterization Methodology: a Verilog implementation of each tile was synthesized, placed, and routed using Synopsys 32nm Generic Libraries.

SLIDES 49-52

Tile Characterization

[Bar charts: critical path (ns), power (mW), and area (mm^2) for each tile (AGG, ALU, BOOLGEN, COLFILTER, JOIN, PART, SORT, APPEND, COLSELECT, CONCAT, STITCH)]

Max Freq: 315 MHz

SLIDE 53

How many tiles should there be, and of what type?

SLIDE 54

Unbounded Design Space

[Table: each of the 11 tile types (Aggregator, ALU, Boolean Generator, Column Filter, Joiner, Partitioner, Sorter, Table Appender, Column Selector, Column Concatenator, Column Stitcher) may be instantiated 1, 2, 3, ... times with no upper bound]

SLIDE 55

Performance Simulation Methodology

  • TPC-H as target workload
  • Home-grown C++ simulator, validated against MonetDB
  • Models completion cycles for each spatial and temporal instruction
  • Models memory access overheads
  • Completion time for each query is converted to throughput using the Q100 frequency
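
A toy version of the last conversion step, using the 315 MHz maximum frequency from the tile characterization; the per-query cycle counts are made-up placeholders, not simulator output.

```python
# Convert simulated completion cycles into runtime and query throughput.
Q100_FREQ_HZ = 315e6                     # max frequency from slide 52

def runtime_seconds(completion_cycles):
    return completion_cycles / Q100_FREQ_HZ

sim_cycles = {"Q1": 1_200_000, "Q6": 300_000}   # hypothetical counts
total = sum(runtime_seconds(c) for c in sim_cycles.values())
print(f"throughput: {len(sim_cycles) / total:.1f} queries/s")
```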

SLIDES 56-57

Example: Bounding ALU Count

[Chart: per-query runtime relative to a 1-ALU design vs. number of ALUs (1-10), one line per TPC-H query (Q1-Q8, Q10-Q12, Q14-Q21)]

SLIDES 58-62

Bounded Design Space

[Table: per-query sweeps like the one above bound the useful instance count of each of the 11 tile types]

2.9 Million Designs!!

Explore only the tiles that consume >= 5 mW: 150 Designs
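
To show the flavor of the two pruning steps, here is a sketch that sizes a bounded space and then restricts the sweep to high-power tiles. The per-tile bounds and power numbers are illustrative stand-ins, not the paper's values (the slides report 2.9 million bounded designs and 150 swept designs).

```python
# Size a bounded tile design space, then prune it to a power-aware sweep.
from itertools import product
from math import prod

max_count = {"agg": 4, "alu": 6, "boolgen": 6, "colfilter": 8,
             "join": 7, "part": 3, "sort": 2}           # assumed bounds
power_mw  = {"agg": 8.0, "alu": 12.0, "boolgen": 0.2, "colfilter": 0.4,
             "join": 15.0, "part": 9.0, "sort": 20.0}   # assumed powers

# Step 1: the bounded space is the product of the per-tile maxima.
print("bounded designs:", prod(max_count.values()))

# Step 2: sweep only tiles drawing >= 5 mW; fix the cheap ones at max.
swept = [t for t in max_count if power_mw[t] >= 5.0]
designs = list(product(*(range(1, max_count[t] + 1) for t in swept)))
print("designs to simulate:", len(designs))
```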

SLIDES 63-66

Q100 Designs for Further Evaluation

[Scatter plot: TPC-H runtime (milliseconds) vs. power (watts) for the remaining designs, with three called out]

  • Low Power: 1 ALU, 1 Partitioner, 1 Sorter
  • Pareto: 4 ALUs, 2 Partitioners, 1 Sorter
  • High Perf: 5 ALUs, 3 Partitioners, 6 Sorters

SLIDE 67

Bandwidth needs on- and off-chip

SLIDES 68-72

Interconnect (Network-on-Chip) Bandwidth Needs

[Chart: runtime normalized to an IDEAL (unconstrained) NoC vs. NoC bandwidth limit (GB/s) for the Low Power, Pareto, and High Perf designs]

NoC limit chosen @ 6.3 GB/s, scaled down from the Intel TeraFlop chip

SLIDES 73-77

Bandwidth to/from Memory

[Charts: per-query read and write memory bandwidth (GB/s) for the Low Power, Pareto, and High Perf designs, with queries ordered by bandwidth demand]

BW Write Limit @ 10 GB/s
BW Read Limit @ 20 or 30 GB/s

SLIDE 78

Is the Q100 performant and energy efficient?

SLIDE 79

Software Comparison Methodology

  • MonetDB on a Sandy Bridge server
  • Energy measurements:
    • Intel's Running Average Power Limit (RAPL) energy meters
    • Core domain only
    • Sample energy counters at 10 ms intervals
    • Exclude machine idle power
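
For reference, a minimal sketch of this measurement style on Linux, reading a RAPL energy counter through the powercap sysfs interface. The sysfs path is typical for the core ("pp0") domain but varies by machine, and this is not the authors' harness; idle-power subtraction is left out.

```python
# Sample a RAPL energy counter every 10 ms and derive average power.
import time

RAPL = "/sys/class/powercap/intel-rapl:0:0/energy_uj"  # core domain (typical)

def read_uj():
    with open(RAPL) as f:
        return int(f.read())

deltas, last = [], read_uj()
for _ in range(100):                     # 1 second of 10 ms samples
    time.sleep(0.010)
    cur = read_uj()
    deltas.append(cur - last)            # microjoules per interval
    last = cur                           # (counter wraparound ignored here)

avg_power_w = sum(deltas) / len(deltas) / 0.010 / 1e6
print(f"average power: {avg_power_w:.2f} W")
```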

SLIDES 80-85

Comparison with Software (MonetDB)

[Charts: runtime, power, and energy of the Low Power, Pareto, and High Perf designs relative to MonetDB, as input data size scales up to 100X]

  • 37X-70X better performance, at 1/1000th the energy consumption
  • At 100X the input data size: 10X the performance, at less than 1/100th the energy consumption

SLIDE 86

Conclusions

  • Q100 is a highly efficient domain-specific accelerator for analytical database workloads
  • Its ISA exploits parallelism and streaming efficiencies
  • At < 15% of the area and power of a Xeon core, a Q100 device achieves exceptional performance and energy efficiency
  • Exciting research opportunities remain for DPUs

SLIDE 87

FIN