NAVIGATING BIG DATA with High-Throughput, Energy-Efficient Data Partitioning - PowerPoint PPT Presentation

SLIDE 1

NAVIGATING BIG DATA

with High-Throughput, Energy-Efficient Data Partitioning

Lisa Wu, R.J. Barker, Martha Kim, and Ken Ross
Columbia University

Sunday, July 28, 2013

SLIDE 2

BIG DATA is here

Sources: IDC Worldwide Big Data Technology and Services 2012-2015 Forecast, #233485, March 2012; The Next Web; DAZEINFO; NetApp

SLIDE 3

Do sunblock sales correlate with weather?

SLIDE 4

JOINs = Cross Reference

SALES and WEATHER cross-referenced on DATE: JOIN_DATE(SALES, WEATHER)

SLIDE 5

JOINs = 47% of TPC-H Execution Time

[Chart: fraction of runtime spent in join (0-100%) for each TPC-H query, plus the average across all queries.]

SLIDE 6

Naïve JOINs of BIG DATA Thrash the Cache

Scan one table (SALES) sequentially; every row triggers a random lookup into the other (WEATHER).

Small table doesn't fit into cache ➛ lookups thrash
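
To make the thrashing concrete, here is a minimal C++ sketch of a naive hash join; the table layouts and field names are illustrative assumptions, not from the talk. The scan of the outer table is sequential, but each probe into the hash table is a random lookup, so once that table outgrows the cache most probes miss.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct WeatherRow { int32_t date; float temperature; };
struct SaleRow    { int32_t date; float amount; };

// Build a hash table over the "small" table, then scan the larger table and
// probe the hash table once per row.
std::vector<std::pair<SaleRow, WeatherRow>>
naive_join(const std::vector<SaleRow>& sales,
           const std::vector<WeatherRow>& weather) {
    std::unordered_map<int32_t, WeatherRow> by_date;          // the "small" table
    for (const WeatherRow& w : weather) by_date.emplace(w.date, w);

    std::vector<std::pair<SaleRow, WeatherRow>> matches;
    for (const SaleRow& s : sales) {                          // sequential scan
        auto it = by_date.find(s.date);                       // random lookup:
        if (it != by_date.end())                              // misses dominate once
            matches.emplace_back(s, it->second);              // by_date exceeds the cache
    }
    return matches;
}
```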

SLIDE 7

Partitioned JOIN

Split SALES into SALES_0 ... SALES_3 and WEATHER into WEATHER_0 ... WEATHER_3, then join matching partitions on DATE:

JOIN_DATE(SALES_0, WEATHER_0)
JOIN_DATE(SALES_1, WEATHER_1)
JOIN_DATE(SALES_2, WEATHER_2)
JOIN_DATE(SALES_3, WEATHER_3)
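
A compact sketch of the partitioned join above, under the assumption of a simple hash-on-date partitioning function with four partitions; the row layout and helper names are illustrative. The point is that SALES_i only ever needs to meet WEATHER_i, and each pair can be sized to stay cache resident.

```cpp
#include <cstdint>
#include <vector>

struct Row { int32_t date; float value; };

// One partitioning pass: rows with the same date always land in the same bucket.
std::vector<std::vector<Row>> partition_by_date(const std::vector<Row>& table,
                                                int partition_count) {
    std::vector<std::vector<Row>> parts(partition_count);
    for (const Row& r : table)
        parts[r.date % partition_count].push_back(r);   // e.g. SALES_i, WEATHER_i
    return parts;
}

void partitioned_join(const std::vector<Row>& sales,
                      const std::vector<Row>& weather) {
    const int k = 4;                                     // four partitions, as drawn
    auto sales_parts   = partition_by_date(sales, k);
    auto weather_parts = partition_by_date(weather, k);
    for (int i = 0; i < k; ++i) {
        // Join SALES_i with WEATHER_i only; each pair is small enough to stay
        // cache resident, so the probes no longer thrash.
        // join_one_partition(sales_parts[i], weather_parts[i]);
    }
}
```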

SLIDE 8

JOIN Runtime

[Timeline: Naive Join as one long run vs. Partitioned Join split into (1) Partition SALES, (2) Partition WEATHER, (3) Join.]

Partition ≈ 50% of State-of-the-Art Joins; Aggregate and Sort use partitioning as well.

SLIDE 9

Data Partitioning is...

  • broadly applicable in the database domain
  • partitioned operations reduce I/O cost, increase parallel processing, and reduce shipping costs
  • widely used by commercial databases
  • Oracle 11g, IBM DB2, Microsoft SQL Server
  • applicable in the non-database domain
  • divide-and-conquer, map-reduce

SLIDE 10

SW Partitioning Performance

[Chart: partitioning throughput (GB/s) vs. number of partitions (150-600) for 1 thread and 16 threads. Potential system memory throughput is 25.6 GB/s; the software curves top out around 3 GB/s.]

SLIDE 11

Research Overview

[Block diagram: two cores, each with its L1/L2, a HARP, and SBin/SBout stream buffers, sharing the memory controller and memory. The HARPs and stream buffers are the new modules.]

Hardware accelerated range partitioner (HARP): 7.8X more performance @ 7.5X less energy

Streaming framework: can keep up with the throughput of HARP

SLIDE 12

Remainder of the Talk

  • Brief System Overview
  • HARP UArch
  • Streaming Framework UArch
  • HARP and Streaming Framework Evaluation
  • Discussion

SLIDE 13

HARP System Architecture

[Block diagram: Core with L1/L2, HARP, and SBin/SBout stream buffers, connected to the memory controller and memory.]

HARP communicates to/from memory through stream buffers: SBin, SBout

HW partitioning with SW configuration

SLIDE 14

Programming Model

Original SW: loop over the table — lock acquire, partition (reads and writes), lock release.

Original SW in Assembly: the reads and writes become LDs and STs, plus CMP, BR, etc. Executed on unmodified hardware.

Modified SW in Assembly: the table LDs and STs become stream-buffer instructions (SBLDs, SBSTs), and the compare-and-route partitioning work is hardware accelerated. Executed on HARP.
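
A C-level view of this slide, as a hedged sketch: the sbload/sbstore mnemonics come from the talk's SB ISA, but the wrapper functions, cache-line type, and loop structure below are assumptions made only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

struct CacheLine { uint8_t bytes[64]; };

// Stand-ins for the stream-buffer instructions (no-ops in this sketch).
inline void sbload(const CacheLine* /*src*/) { /* would enqueue a line into SBin  */ }
inline void sbstore(CacheLine* /*dst*/)      { /* would write a line out via SBout */ }

void partition_with_harp(const CacheLine* table, std::size_t in_lines,
                         CacheLine* out, std::size_t out_lines) {
    // Original SW: a loop of LDs, splitter CMPs, BRs, and STs, all on the core.
    // Modified SW: the core only issues sbloads and sbstores; the compare and
    // route work happens inside HARP. A real driver would interleave the two
    // loops so SBin never runs dry and SBout never fills up.
    for (std::size_t i = 0; i < in_lines; ++i)
        sbload(&table[i]);        // feed input cache lines to HARP via SBin
    for (std::size_t i = 0; i < out_lines; ++i)
        sbstore(&out[i]);         // drain partitioned cache lines from SBout
}
```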

SLIDE 15

Range Partition

Input records (keys shown; the records also carry fields X, Y, Z): 15, 8, 27, 20, 52, 29, 16, 31

Splitters: 10, 20, 30

Each key is compared (<= / >) against the splitters to choose one of four partitions:

  • <= 10: 8
  • 10 < key <= 20: 15, 20, 16
  • 20 < key <= 30: 27, 29
  • > 30: 52, 31
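
The same range partitioning in software, as a minimal C++ sketch that reproduces the slide's example; the record layout is an illustrative assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record { int32_t key; int64_t payload; };

// Splitters {10, 20, 30} define four partitions; a record with key k goes to
// the first partition whose splitter satisfies k <= splitter, and to the last
// partition if it is greater than every splitter.
std::vector<std::vector<Record>>
range_partition(const std::vector<Record>& input,
                const std::vector<int32_t>& splitters) {
    std::vector<std::vector<Record>> parts(splitters.size() + 1);
    for (const Record& r : input) {
        std::size_t p = splitters.size();                // default: last partition
        for (std::size_t i = 0; i < splitters.size(); ++i) {
            if (r.key <= splitters[i]) { p = i; break; } // first splitter that fits
        }
        parts[p].push_back(r);
    }
    return parts;
}

// With splitters {10, 20, 30}, keys {15, 8, 27, 20, 52, 29, 16, 31} land in
// partitions {8}, {15, 20, 16}, {27, 29}, {52, 31}, matching the slide.
```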

SLIDE 16

HARP Microarchitecture

From SBin → (1) Serializer → (2) Conveyor of <= comparators → (3) Merge (with per-partition write enables, WE) → To SBout

HARP ISA: set_splitter, partition_start, partition_stop

SLIDE 17

Step 1: HARP Configuration

set_splitter loads the splitter values (10, 20, 30) into the comparator conveyor.

SLIDE 18

Step 2: Signal HARP to Start Processing

partition_start signals HARP to begin partitioning.

SLIDE 19

Step 3: Serialize SBin Cachelines into Records

The Serializer splits each incoming SBin cacheline into individual records.

SLIDE 20

Step 4: Comparator Conveyor

Each record rides the conveyor past the splitter comparators; e.g. a record with key 15 passes the 10 splitter, satisfies <= 20, and is routed to partition 2.

SLIDE 21

Step 5: Merge Output Records to SBout

The Merge stage collects routed records into per-partition cachelines (via the write enables) and emits them to SBout.

SLIDE 22

Step 6: Drain In-Flight Records and Signal HARP to Stop Processing

partition_stop drains any in-flight records and then stops HARP.
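
Steps 1-6 amount to a configure / start / stream / drain / stop sequence. Below is a hedged, driver-side sketch: set_splitter, partition_start, and partition_stop are the HARP ISA names from these slides, while the wrapper functions are stand-ins invented for illustration.

```cpp
#include <cstdint>
#include <vector>

// Stand-ins for the HARP instructions (no-ops in this sketch).
inline void set_splitter(int /*index*/, int32_t /*value*/) {}  // Step 1: configure
inline void partition_start() {}                               // Step 2: start
inline void partition_stop()  {}                               // Step 6: drain, stop

void run_harp(const std::vector<int32_t>& splitters) {
    for (int i = 0; i < static_cast<int>(splitters.size()); ++i)
        set_splitter(i, splitters[i]);   // e.g. splitters 10, 20, 30
    partition_start();
    // While HARP runs, Steps 3-5 happen in hardware: SBin cachelines are
    // serialized into records, each record rides the comparator conveyor to
    // pick its partition, and the Merge stage packs results into SBout.
    partition_stop();                    // drain in-flight records, stop HARP
}
```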

SLIDE 23

Streaming Framework Architecture

Software-controlled data streaming in/out.

Inspired by Jouppi's work: Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In ISCA, 1990.

SLIDE 24

Step 1: Issue sbload from Core

[Diagram: Core with L1/L2 and LLC, memory, HARP with SBin and SBout, plus a Req Buffer and Store Buffer.]

SB ISA: sbload, sbstore, sbsave, sbrestore

SLIDE 25

Step 2: Send sbload from Req Buffer to Memory

The request is not serviced by the caches (✗) and is sent from the Req Buffer to memory. Legend: C = Cache, S = SB.

SLIDE 26

Step 3: Data Return from Memory to SBin

SLIDE 27

Step 4: HARP Pulls Data from SBin and Pushes Data to SBout

SLIDE 28

Step 5: Issue sbstore from Core

SLIDE 29

Step 6: Data Copied from Head of SBout to Store Buffer

SLIDE 30

Step 7: Data Written Back to Memory via Existing Store Datapath

SLIDE 31

Interrupts and Context Switches

sbsave and sbrestore are architectural: they save and restore stream-buffer state so partitioning can survive interrupts and context switches.
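
A hedged sketch of how system software might use these two instructions around a context switch; only the sbsave/sbrestore names come from the slide, while the save-area layout and function names are assumptions for illustration.

```cpp
#include <cstdint>

struct SBState { uint8_t bytes[4096]; };          // hypothetical SBin/SBout snapshot

// Stand-ins for the SB ISA instructions (no-ops in this sketch).
inline void sbsave(SBState* /*dst*/) {}           // spill stream-buffer contents
inline void sbrestore(const SBState* /*src*/) {}  // reload stream-buffer contents

struct Task { SBState sb; /* ... registers, page tables, ... */ };

void context_switch(Task& prev, Task& next) {
    sbsave(&prev.sb);        // stream buffers are architectural state, so the
    // ... save and restore the rest of the architectural state ...
    sbrestore(&next.sb);     // partition in flight can resume after the switch
}
```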

SLIDE 32

Accelerator Integration Choice

  • Tightly coupled and software controlled:
  • area/power savings
  • coherence
  • utilize hardware prefetchers
  • software-managed data layout
  • address-free domain for accelerators

SLIDE 33

Remainder of the Talk

  • Brief System Overview
  • HARP UArch
  • Streaming Framework UArch
  • HARP and Streaming Framework Evaluation
  • Discussion and DSE

SLIDE 34

Evaluation Methodology

  • HARP
  • Bluespec System Verilog implementation
  • Cycle-accurate simulation in BlueSim
  • Synthesis, P&R with Synopsys (32nm std cells)
  • Streaming framework
  • 3 versions of 1GB table memcpy: C-lib, ASM (scalar), ASM (vector)
  • Conservative area/power estimates with CACTI

SLIDE 35

Area and Power Overheads

[Charts: area (% of a Xeon core) and power (% of a Xeon core) for HARP and the stream buffers, vs. number of partitions (15-511).]

SLIDE 36

SW Partitioning Performance

[Chart: partitioning throughput (GB/s) vs. number of partitions (150-600) for 1 thread and 16 threads.]

SLIDE 37

Performance Evaluation

[Chart: partitioning throughput (GB/s) vs. number of partitions for 1 thread, 16 threads, and 1 thread + HARP, annotated with 7.8x and 8.8x speedups for the HARP configuration.]

SLIDE 38

Does the Streaming Framework Provide Sufficient BW to Feed HARP?

[Chart: partitioning throughput (GB/s) vs. number of partitions for 1 thread + HARP, compared against memcpy bandwidth reference lines: scalar ASM and vector ASM from our measurements, and vector ASM from the literature.]

SLIDE 39

HARP Energy vs. SW

[Chart: partitioning energy (J/GB) vs. number of partitions for 1 thread, 16 threads, and 1 thread + HARP, annotated with 6.3x and 7.3x energy reductions.]

SLIDE 40

Remainder of the Talk

  • Brief System Overview
  • HARP UArch
  • Streaming Framework UArch
  • HARP and Streaming Framework Evaluation
  • Discussion

SLIDE 41

Design Space Exploration (in the paper)

Baseline: 255-way partitioning, 4B keys, 16B records.

Swept: number of partitions (15, 31, 63, 127, 255, 511), key size (4B, 8B, 16B), and record size (4B, 8B, 16B).

SLIDE 42

Coping with Fixed Resources: Partitioning Factor > Partitioner Size

Cascade HARPs: a second HARP re-partitions each output of the first, multiplying the effective partitioning factor.
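
One way to read the cascade: if a single partitioner pass supports only a fixed fan-out, a second pass over each first-pass output multiplies the effective partitioning factor. A minimal sketch, with partition_once() standing in for one HARP pass (its signature is an assumption):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Record { int32_t key; int64_t payload; };

// One pass with a fan-out of at most `fanout` partitions, e.g. one HARP run.
// Declared but not defined here: it stands in for the single-HARP partitioner.
std::vector<std::vector<Record>> partition_once(const std::vector<Record>& in,
                                                int fanout);

// Two cascaded passes give up to fanout * fanout partitions overall.
std::vector<std::vector<Record>> partition_twice(const std::vector<Record>& in,
                                                 int fanout) {
    std::vector<std::vector<Record>> out;
    for (auto& coarse : partition_once(in, fanout))          // first HARP
        for (auto& fine : partition_once(coarse, fanout))    // second HARP
            out.push_back(std::move(fine));
    return out;
}
```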

SLIDE 43

Coping with Fixed Resources: Record Width > Record Size

[Diagram: the work is split across two HARPs.]

SLIDE 44

Conclusion

  • Data partitioning, which does not appear compute-heavy, can still benefit from acceleration
  • Microarchitecture to pair streaming accelerator(s) with the CPU so they work closely together
  • Demonstrate how accelerators can rebalance the system and improve memory bandwidth utilization, a scarce resource in big data analytics

SLIDE 45

FIN