Camdoop
Exploiting In-network Aggregation for Big Data Applications
Paolo Costa costa@imperial.ac.uk
joint work with Austin Donnelly, Antony Rowstron, and Greg O’Shea (MSR Cambridge)
MapReduce Overview
Map
− Processes input data and generates (key, value) pairs
Shuffle
− Distributes the intermediate pairs to the reduce tasks
Reduce
− Aggregates all values associated with each key
Paolo Costa Camdoop: Exploiting In-network Aggregation for Big Data Applications
[Figure: the input file is split into chunks 0-2, each processed by a map task; the intermediate results go to three reduce tasks, which produce the final results]
networks
− All-to-all traffic pattern with O(N^2) flows
− Led to proposals for full-bisection bandwidth
the intermediate results
the intermediate size
How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?
users can specify a combiner function
− Aggregates the local intermediate pairs
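As an illustration of what a combiner does, here is a minimal word-count sketch. The function names (`map_task`, `combiner`) and the sample chunk are hypothetical and are not taken from Camdoop or any MapReduce implementation; the point is only that local pairs with the same key are merged before anything is sent over the network.

```python
from collections import Counter

def map_task(chunk):
    """Emit one (word, 1) pair per word in the chunk (hypothetical map function)."""
    return [(word, 1) for word in chunk.split()]

def combiner(pairs):
    """Aggregate the local intermediate pairs by key before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

chunk = "a b a c a b"
intermediate = map_task(chunk)      # six pairs, one per word
combined = combiner(intermediate)   # three pairs, one per distinct word
print(len(intermediate), combined)
```

In this toy case the combiner halves the number of pairs that would enter the shuffle phase.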
MapReduce to perform multiple steps of combiners
− e.g., rack-level aggregation [Yu et al., SOSP’09]
What happens when we map the tree to a typical data center topology?
− The server link is the bottleneck
− Full-bisection bandwidth does not help here
− Mismatch between physical and logical topology
− Two logical links are mapped onto the same physical link
− Only 500 Mbps per child
[Figure: logical topology vs. physical topology with a ToR switch]
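The 500 Mbps figure is consistent with a simple back-of-the-envelope calculation, sketched below. It assumes a 1 Gbps server link shared equally by the incoming logical links; both the equal-sharing assumption and the helper name `per_child_bandwidth_mbps` are illustrative.

```python
def per_child_bandwidth_mbps(link_mbps, in_degree):
    """Bandwidth available to each child when in_degree logical links
    share one physical server link equally (an assumed sharing model)."""
    return link_mbps / in_degree

# Two children mapped onto the same 1 Gbps server link.
print(per_child_bandwidth_mbps(1000, 2))
```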
Camdoop Goal
Perform the combiner functions within the network as
Reduce shuffle time by aggregating packets on path
How Can We Perform In-network Processing?
− Direct-connect topology
− 3D torus
− Uses no switches / routers for internal traffic
and process packets
− This defines a key-space (=> key-based routing)
− Coordinates are locally re-mapped in case of failures
[Figure: 3D torus with x, y, z axes; example servers at (1,2,1) and (1,2,2)]
Key property
− No distinction between network and computation devices
− Servers can perform arbitrary packet processing on-path
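To make the 3D torus concrete, the sketch below lists the six direct neighbors of a server given its (x, y, z) coordinate, with wrap-around at the edges. The function name and the side length `k` are assumptions for illustration; this is not CamCube's API.

```python
def torus_neighbors(coord, k):
    """Return the six one-hop neighbors of coord on a k x k x k torus."""
    x, y, z = coord
    return [
        ((x + 1) % k, y, z), ((x - 1) % k, y, z),
        (x, (y + 1) % k, z), (x, (y - 1) % k, z),
        (x, y, (z + 1) % k), (x, y, (z - 1) % k),
    ]

neighbors = torus_neighbors((1, 2, 2), k=3)
print(neighbors)
```

On a 3 x 3 x 3 torus, (1,2,1) is among the neighbors of (1,2,2), which matches the two example coordinates shown on the slide.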
… on a switched topology … on CamCube
becomes the bottleneck
and physical topology
1 Gbps
1/in-degree Gbps
Programming Model
MapReduce model
− Each server runs map tasks on local chunks
− Combiners aggregate the results of the local map tasks and of the children (if any) and stream the results to the parents
− The root runs the reduce task and generates the final output
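The tree-based model can be sketched as a recursive aggregation: each node combines its own map output with whatever its children send up, and the root ends up with the final result. The tree shape, node names, and data below are hypothetical, and the sketch computes results bottom-up rather than streaming them as the slides describe.

```python
from collections import Counter

def aggregate(node, children, local_output):
    """Combine a node's local map output with its children's aggregated results."""
    total = Counter(local_output[node])
    for child in children.get(node, []):
        total.update(aggregate(child, children, local_output))
    return total

# Hypothetical tree: root has children a and b; a has child c.
children = {"root": ["a", "b"], "a": ["c"]}
local_output = {
    "root": {"x": 1},
    "a": {"x": 2, "y": 1},
    "b": {"y": 4},
    "c": {"x": 5},
}
final = aggregate("root", children, local_output)
print(dict(final))
```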
Network locality
How to map the tree nodes to servers?
− Map task outputs are always read from the local disk
− Parents and children are mapped onto physical neighbors
This ensures maximum locality and
One physical link is used by one and only one logical link
[Figure: logical view vs. physical view (3D torus), with servers (1,2,1), (1,2,2) and (1,1,1)]
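The two mapping properties can be checked mechanically, as in the sketch below. It assumes, purely for illustration, that (1,2,1) is the parent of (1,2,2) and (1,1,1); the slides show the three coordinates but the parent choice and the helper names are assumptions.

```python
def torus_distance(a, b, k):
    """Hop count between two coordinates on a k x k x k torus."""
    return sum(min((p - q) % k, (q - p) % k) for p, q in zip(a, b))

def uses_only_neighbor_links(tree_edges, k):
    """True if every (child, parent) edge connects physical neighbors."""
    return all(torus_distance(c, p, k) == 1 for c, p in tree_edges)

def one_logical_link_per_physical_link(tree_edges):
    """True if no physical link carries more than one logical link."""
    links = [frozenset(edge) for edge in tree_edges]
    return len(links) == len(set(links))

# Assumed parent: (1,2,1); children: (1,2,2) and (1,1,1).
edges = [((1, 2, 2), (1, 2, 1)), ((1, 1, 1), (1, 2, 1))]
print(uses_only_neighbor_links(edges, k=3), one_logical_link_per_physical_link(edges))
```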
Load Distribution
− First issue: poor server load distribution (different in-degree)
− Second issue: poor bandwidth utilization (only 1 Gbps instead of 6)
Solution: stripe the data across 6 disjoint trees
− Different links are used, which improves load distribution
− All links are used => (up to) 6 Gbps / server
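A minimal sketch of striping follows. The slides do not say how data is assigned to trees, so the round-robin assignment by chunk index used here is an assumption, as are the names `stripe` and `NUM_TREES`.

```python
NUM_TREES = 6  # one tree per outgoing link of a server in the 3D torus

def stripe(chunks, num_trees=NUM_TREES):
    """Assign chunks to trees round-robin (assumed policy, for illustration)."""
    trees = [[] for _ in range(num_trees)]
    for i, chunk in enumerate(chunks):
        trees[i % num_trees].append(chunk)
    return trees

trees = stripe(list(range(12)))
print(trees)
```

With an even assignment, each of the six links carries about one sixth of the data, which is where the "(up to) 6 Gbps / server" figure comes from.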
Fault-tolerance
− CamCube remaps coordinates in case of failures
Testbed
− 27-server CamCube (3 x 3 x 3)
− Quad-core Intel Xeon 5520 2.27 GHz
− 12 GB RAM
− 6 Intel PRO/1000 PT 1 Gbps ports
− Runtime & services implemented in user-space
Simulator
− Packet-level simulator (CPU overhead not modelled)
− 512-server (8 x 8 x 8) CamCube
Design and implementation recap
Configurations compared: Camdoop, TCP Camdoop (switch), Camdoop (no agg)
Features compared: shuffle & reduce parallelized, CamCube, six disjoint trees, in-network aggregation
Shuffle & reduce parallelized
− Since all streams are ordered, as soon as the root receives at least
− No need to store intermediate results to disk on reduce servers
TCP Camdoop (switch)
− 27 CamCube servers attached to a ToR switch
− TCP is used to transfer data in the shuffle phase
Camdoop (no agg)
− Like Camdoop but without in-network aggregation
− Shows the impact of just running on CamCube
[Figure: timelines of the map, shuffle and reduce phases]
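One way to see why ordered streams let shuffle and reduce overlap is a streaming k-way merge: the reducer can emit the result for a key as soon as every input stream has moved past it, without buffering whole streams. The sketch below uses Python's standard `heapq.merge` and `itertools.groupby`; it is an illustration of the idea, not Camdoop's implementation, and the sample streams are made up.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_and_reduce(streams, reduce_fn=sum):
    """Lazily merge key-sorted (key, value) streams and reduce each key's values."""
    merged = heapq.merge(*streams, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(value for _, value in group)

# Three hypothetical child streams, each already sorted by key.
streams = [
    [("a", 1), ("c", 2)],
    [("a", 3), ("b", 1)],
    [("b", 4), ("c", 1)],
]
result = list(merge_and_reduce(streams))
print(result)
```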
competitive against Hadoop and Dryad
− Shuffle and reduce parallelized
− Fine-tuned implementation
We consider these as our baselines
[Figure: bar chart of time in seconds (log scale) for Sort and WordCount; series: Hadoop, Dryad/DryadLINQ, TCP Camdoop (switch), Camdoop (no agg); lower is better]
Output size / intermediate size (S)
− S = 1 (no aggregation)
− S = 1/N ≈ 0 (full aggregation)
− We use synthetic workloads to explore different values of S
Number of reduce tasks (R)
− R = 1 (all-to-one)
− R = N (all-to-all)
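The two extremes of S can be reproduced with a toy calculation. The sketch below measures sizes in number of distinct keys rather than bytes, which is a simplifying assumption, and the function name `aggregation_ratio` is hypothetical.

```python
def aggregation_ratio(per_server_keys):
    """S = output size / intermediate size, counted in keys (simplification)."""
    intermediate = sum(len(keys) for keys in per_server_keys)
    output = len(set().union(*per_server_keys))
    return output / intermediate

n = 4
same_keys = [{"a", "b"} for _ in range(n)]            # every server emits the same keys
disjoint_keys = [{f"k{i}"} for i in range(n)]         # no key is shared between servers
print(aggregation_ratio(same_keys), aggregation_ratio(disjoint_keys))
```

When all N servers emit the same keys, S = 1/N (full aggregation); when no keys are shared, S = 1 (no aggregation).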
[Figure: time in seconds (log scale) vs. output size / intermediate size (S), from full aggregation to no aggregation; series: TCP Camdoop (switch), Camdoop (no agg), Camdoop; lower is better]
Chart annotations:
− Performance independent of S
− Impact of in-network aggregation
− Impact of running
− Facebook reported aggregation ratio
[Figure: time in seconds (log scale) vs. output size / intermediate size (S), from full aggregation to no aggregation; series: TCP Camdoop (switch), Camdoop (no agg), Camdoop, Switch 1 Gbps (bound); lower is better]
Chart annotations:
− Impact of running
− Impact of in-network aggregation
− Switch 1 Gbps (bound): maximum theoretical performance over the switch
[Figure: time in seconds (log scale) vs. # reduce tasks (R), from R = 1 (all-to-one) to R = 27 (all-to-all); series: TCP Camdoop (switch), Camdoop (no agg), Camdoop; lower is better]
Factors marked on the chart: 4.31x, 10.19x, 1.13x, 1.91x
Chart annotations (attached to different series):
− R does not (significantly) impact performance
− Performance depends on R
− Implementation bottleneck
All resources are used even when R = 1
Camdoop decouples the job execution time from the number of output files generated
[Figure: time in seconds (log scale) vs. # reduce tasks (R, log scale), from R = 1 (all-to-one) to R = 512 (all-to-all), with N = 512 and S = 0; series: Switch 1 Gbps (bound), Camdoop; lower is better]
Chart annotations:
− 512x
− This assumes full-bisection bandwidth
also common in interactive services
− e.g., Bing Search, Google Dremel
− 10s to 100s of KB returned per server
− Single result must be returned to the user
− E.g., N servers generate their best k responses each and the final result contains the best k responses
[Figure: a web server receives requests and forwards them, via a cache, to parent servers and leaf servers]
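The "best k responses" example is a natural fit for in-network aggregation because top-k merging can be applied at every level of the tree without changing the final answer. The sketch below uses the standard `heapq.nlargest`; the (score, document) tuples and the function name `merge_top_k` are made up for illustration.

```python
import heapq

def merge_top_k(response_lists, k):
    """Keep only the k best (score, doc) responses across several lists."""
    return heapq.nlargest(k, (r for responses in response_lists for r in responses))

leaf1 = [(0.9, "d1"), (0.4, "d2")]
leaf2 = [(0.8, "d3"), (0.7, "d4")]
leaf3 = [(0.85, "d5"), (0.1, "d6")]

flat = merge_top_k([leaf1, leaf2, leaf3], k=2)
# A parent first merges two leaves, then its own parent merges in the third.
hierarchical = merge_top_k([merge_top_k([leaf1, leaf2], k=2), leaf3], k=2)
print(flat, hierarchical)
```

Each intermediate server forwards at most k responses, so the data sent toward the root stays small regardless of N.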
[Figure: time in ms vs. input data size per server (20 KB and 200 KB); series: TCP Camdoop (switch), Camdoop (no agg), Camdoop; lower is better]
In-network aggregation can be beneficial also for (small-scale data) interactive services
− Explores the benefits of in-network processing by running combiners within the network
− No change in the programming model
− Achieves lower shuffle and reduce time
− Decouples performance from the # of output files
− AMD SeaMicro – a 512-core cluster for data centers using a 3D torus
− Fast interconnect: 5 Gbps / link