NAVIGATING BIG DATA
with High-Throughput, Energy-Efficient Data Partitioning
Lisa Wu, R.J. Barker, Martha Kim, and Ken Ross Columbia University
Sunday, July 28, 2013
BIG DATA is here
Sources: IDC Worldwide Big Data Technology and Services 2012-2015 Forecast, #233485, March 2012; The Next Web; DAZEINFO; NetApp
SALES WEATHER
JOIN_DATE(SALES, WEATHER)
[Figure: share of runtime spent in join (0% to 100%) for each TPC-H query, ordered 17, 9, 11, 5, 7, 8, 22, 1, 2, 12, 21, 18, 19, 10, 15, 3, 20, 4, 16, 14, 13, 6, followed by the average.]
SALES WEATHER
Scan Random Lookup
Small table doesn’t fit into cache ➛ lookups thrash
SALES ➛ SALES_0, SALES_1, SALES_2, SALES_3
WEATHER ➛ WEATHER_0, WEATHER_1, WEATHER_2, WEATHER_3
JOIN_DATE(SALES_0, WEATHER_0), JOIN_DATE(SALES_1, WEATHER_1), JOIN_DATE(SALES_2, WEATHER_2), JOIN_DATE(SALES_3, WEATHER_3)
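The contrast between joining SALES and WEATHER directly and joining them partition by partition can be sketched in software. This is only an illustration under stated assumptions, not the authors' implementation: the row layout (tuples with the date first), the function names, the sample rows, and the splitter dates are all hypothetical, and range partitioning on the date via `bisect_left` is an assumed choice of partition function.

```python
from bisect import bisect_left


def naive_join(sales, weather):
    """Scan SALES and probe a lookup table built over all of WEATHER."""
    by_date = {}
    for date, conditions in weather:
        by_date.setdefault(date, []).append(conditions)
    # Every probe is a random lookup; once by_date outgrows the cache,
    # these lookups are the ones that thrash.
    return [(date, amount, c)
            for date, amount in sales
            for c in by_date.get(date, ())]


def range_partition(rows, splitters):
    """Split rows into len(splitters) + 1 buckets by their first field."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for row in rows:
        parts[bisect_left(splitters, row[0])].append(row)
    return parts


def partitioned_join(sales, weather, splitters):
    """Partition both tables the same way, then join matching partitions."""
    out = []
    pairs = zip(range_partition(sales, splitters),
                range_partition(weather, splitters))
    for s_part, w_part in pairs:
        # Each per-partition lookup table is smaller than the full one.
        out.extend(naive_join(s_part, w_part))
    return out


# Hypothetical sample data; ISO date strings compare in calendar order.
sales = [("2013-07-01", 120), ("2013-07-15", 80),
         ("2013-07-28", 200), ("2013-07-15", 45)]
weather = [("2013-07-01", "sun"), ("2013-07-15", "rain"),
           ("2013-07-28", "fog")]
splitters = ["2013-07-08", "2013-07-16", "2013-07-24"]  # 4 partitions

# Both strategies produce the same joined rows, possibly in another order.
assert sorted(partitioned_join(sales, weather, splitters)) == \
    sorted(naive_join(sales, weather))
```

Because both tables are split with the same splitters, matching dates always land in partitions with the same index, so each small join can be done independently.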
[Timeline: Naive Join vs. Partitioned Join (Partition SALES, Partition WEATHER, Join; labels 1 2 3)]
Partition ≈ 50% of State-of-the-Art Joins
Aggregate Sort
increase parallel processing, and reduce shipping costs
[Figure: partitioning throughput (GB/s) vs. number of partitions for 1 thread and 16 threads, compared with potential system memory throughput. Callouts: 25.6 GB/s, 3 GB/s.]
[Diagram: two cores, each with L1 and L2 and with SBin, HARP, and SBout attached, connected to the memory controller and memory.]
New Modules
Hardware accelerated range partitioner (HARP): 7.8X more performance @ 7.5X less energy
Streaming framework: Can keep up with the throughput of HARP
UArch Framework Evaluation
[Diagram: Original SW (Lock Acq., Table Partition with RDs and WRs, Lock Rel.); Original SW in Assembly (LDs, STs, CMP, BR, etc.), executed on unmodified hardware; Modified SW in Assembly (SBLDs, SBSTs, SB Insts, ASM), hardware accelerated.]
X Y Z
Input keys: 15 8 27 20 52 29 16 31
Splitters: 10 20 30
<= 10: 8
> 10 and <= 20: 15 20 16
> 20 and <= 30: 27 29
> 30: 52 31
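The range-partitioning example with splitters 10, 20, 30 can be reproduced with a few lines of code. This is a minimal sketch that mirrors the `<=` and `>` comparisons shown on the slide; the function names are hypothetical and the linear scan over splitters is chosen for readability, not speed.

```python
def partition_index(key, splitters):
    """Return the index of the first range whose upper bound is >= key."""
    for i, splitter in enumerate(splitters):
        if key <= splitter:
            return i
    return len(splitters)  # key > every splitter


def range_partition(keys, splitters):
    """Distribute keys into len(splitters) + 1 ranges, keeping input order."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for key in keys:
        parts[partition_index(key, splitters)].append(key)
    return parts


keys = [15, 8, 27, 20, 52, 29, 16, 31]
splitters = [10, 20, 30]  # must be sorted in ascending order

for part in range_partition(keys, splitters):
    print(part)
# prints:
# [8]
# [15, 20, 16]
# [27, 29]
# [52, 31]
```

Three splitters yield four ranges, and keys within a range keep the order in which they arrived.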
[Diagram: HARP microarchitecture. Records arrive from SBin and pass through (1) Serializer, (2) Conveyor, a chain of comparators (<, =) loaded with splitters 10, 20, 30, and (3) Merge, with write enables (WE), before leaving to SBout. Example shown: key 15 moves along the conveyor and leaves as "15, part2".]
HARP ISA: set_splitter, partition_start, partition_stop
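The Serializer, Conveyor, and Merge flow can be imitated in software to make the data path concrete. The sketch below is a toy model and not the hardware design: the class `ToyHarp` and its `feed` method are hypothetical, the method names `set_splitter`, `partition_start`, and `partition_stop` are borrowed from the HARP ISA listed on the slide but their operands and semantics here are assumptions, and the partition numbering assumes that each comparator separates `<` from `=` (as the comparator labels suggest), so k splitters give 2k + 1 partitions. Under that assumed zero-based numbering, key 15 lands in partition 2.

```python
class ToyHarp:
    """Sequential software stand-in for Serializer -> Conveyor -> Merge."""

    def __init__(self, num_splitters):
        self.splitters = [None] * num_splitters
        self.running = False
        # Assumed layout: a "<" bucket and an "=" bucket per splitter,
        # plus one final bucket for keys above every splitter.
        self.out = [[] for _ in range(2 * num_splitters + 1)]

    def set_splitter(self, index, value):
        """Load one comparator; splitters must end up in ascending order."""
        self.splitters[index] = value

    def partition_start(self):
        self.running = True

    def partition_stop(self):
        self.running = False

    def _conveyor(self, key):
        """Move the key past each comparator in turn; return its partition."""
        for i, splitter in enumerate(self.splitters):
            if key < splitter:
                return 2 * i
            if key == splitter:
                return 2 * i + 1
        return 2 * len(self.splitters)

    def feed(self, burst):
        """Serializer: accept a burst of records and handle one at a time."""
        if not self.running:
            raise RuntimeError("call partition_start() first")
        for record in burst:
            key = record[0]
            # Merge: append the record to the chosen partition's buffer.
            self.out[self._conveyor(key)].append(record)


harp = ToyHarp(3)
for i, value in enumerate([10, 20, 30]):
    harp.set_splitter(i, value)
harp.partition_start()
harp.feed([(15, "a"), (8, "b"), (20, "c"), (52, "d")])
harp.partition_stop()
```

The model processes one record per loop iteration and says nothing about throughput; the point of a pipelined conveyor is that many records are in flight at once, which a sequential loop cannot show.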
Inspired by Jouppi’s work
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In ISCA, 1990.
[Diagram: streaming framework. Core with L1 and L2; SBin, HARP, and SBout; Req Buffer and Store Buffer; LLC; Memory. Legend: C: Cache, S: SB.]
SB ISA: sbload, sbstore, sbsave, sbrestore
Architectural: sbsave, sbrestore
(scalar), ASM(vector)
[Figure: area (% Xeon core) of HARP and stream buffers vs. number of partitions (15, 31, 63, 127, 255, 511).]
[Figure: power (% Xeon core) of HARP and stream buffers vs. number of partitions (15, 31, 63, 127, 255, 511).]
[Figure: partitioning throughput (GB/s) vs. number of partitions for 1 thread, 16 threads, and 1 thread + HARP.]
[Figure: partitioning throughput (GB/s) vs. number of partitions for 1 thread + HARP, compared with our measurements (scalar ASM, vector ASM, memcpy) and figures from the literature (vector ASM, memcpy).]
[Figure: partitioning energy (J/GB) vs. number of partitions for 1 thread, 16 threads, and 1 thread + HARP.]
255-way partitioning, 4B keys, 16B records
[Options shown: 255, 511, 127, 63, 31, 15; 16B, 8B, 4B; 8B, 16B, 4B]
compute-heavy, can still benefit from acceleration
to work closely with CPU
system and improve memory bandwidth utilization, a scarce resource in big data analytics