Camdoop: Exploiting In-network Aggregation for Big Data Applications



SLIDE 1

Camdoop

Exploiting In-network Aggregation for Big Data Applications

Paolo Costa costa@imperial.ac.uk

joint work with Austin Donnelly, Antony Rowstron, and Greg O’Shea (MSR Cambridge)

SLIDE 2

MapReduce Overview

  • Map

− Processes input data and generates (key, value) pairs

  • Shuffle

− Distributes the intermediate pairs to the reduce tasks

  • Reduce

− Aggregates all values associated with each key


[Figure: an input file is split into chunks (Chunk 0, Chunk 1, Chunk 2), each processed by a map task; the intermediate results are shuffled to reduce tasks, which produce the final results]
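To make the three phases concrete, here is a minimal word-count sketch (illustrative Python, not taken from the talk; all names are hypothetical):

```python
from collections import defaultdict

def map_task(chunk):
    """Map: emit a (key, value) pair for every word in a chunk."""
    for word in chunk.split():
        yield (word, 1)

def shuffle(map_outputs):
    """Shuffle: group the intermediate pairs by key across all map tasks."""
    groups = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: aggregate all values associated with a key."""
    return (key, sum(values))

chunks = ["a rose is a rose", "is a rose"]          # Chunk 0, Chunk 1
grouped = shuffle(map_task(chunk) for chunk in chunks)
print(sorted(reduce_task(k, v) for k, v in grouped.items()))
# [('a', 3), ('is', 2), ('rose', 3)]
```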

SLIDE 3

Problem

  • The shuffle phase is challenging for data center networks

− All-to-all traffic pattern with O(N²) flows
− Led to proposals for full-bisection bandwidth


[Figure: MapReduce data flow (splits, map tasks, shuffle, reduce tasks, final results)]

SLIDE 4

Data Reduction

  • The final results are typically much smaller than the intermediate results

  • In most Facebook jobs the final size is 5.4% of the intermediate size

  • In most Yahoo jobs the ratio is 8.2%


SLIDE 5

Data Reduction


How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?


SLIDE 6

Background: Combiners



  • To reduce the data transferred in the shuffle, users can specify a combiner function

− Aggregates the local intermediate pairs

  • Server-side only => limited aggregation
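A minimal sketch of a combiner (hypothetical Python, not Hadoop's API): it collapses one map task's local pairs before they leave the server, which is exactly why the aggregation it can achieve is limited.

```python
from collections import defaultdict

def combiner(local_pairs):
    """Pre-aggregate the (key, value) pairs of ONE map task before the
    shuffle; pairs from other servers cannot be combined here."""
    totals = defaultdict(int)
    for key, value in local_pairs:
        totals[key] += value
    return list(totals.items())

# Five local pairs shrink to three before being sent over the network.
local = [("rose", 1), ("a", 1), ("rose", 1), ("is", 1), ("a", 1)]
print(sorted(combiner(local)))  # [('a', 2), ('is', 1), ('rose', 2)]
```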


SLIDE 7

[Figure: same data flow, with a combiner inserted after each map task's output]

SLIDE 8

Distributed Combiners

  • It has been proposed to use aggregation trees in MapReduce to perform multiple steps of combiners

− e.g., rack-level aggregation [Yu et al., SOSP'09]

SLIDE 9

Logical and Physical Topology

What happens when we map the tree to a typical data center topology?

− Two logical links are mapped onto the same physical link: a mismatch between the physical and logical topology
− The server link is the bottleneck; full-bisection bandwidth does not help here

[Figure: logical aggregation tree vs. the physical topology with a ToR switch]

SLIDE 10

Only 500 Mbps per child: a parent receives its two children's streams over a single 1 Gbps link, so each child can send at most half of it.

SLIDE 11

Camdoop Goal

Perform the combiner functions within the network, as opposed to application-level solutions

Reduce the shuffle time by aggregating packets on path

SLIDE 12

How Can We Perform In-network Processing?

  • We exploit CamCube

− Direct-connect topology
− 3D torus
− Uses no switches / routers for internal traffic

  • Servers intercept, forward, and process packets

  • Nodes have (x, y, z) coordinates

− This defines a key-space (=> key-based routing)
− Coordinates are locally re-mapped in case of failures

[Figure: 3D torus with x, y, z axes; neighboring servers such as (1,2,1) and (1,2,2)]


SLIDE 15


Key property

No distinction between network and computation devices
Servers can perform arbitrary packet processing on-path
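A small sketch of the coordinate space (assuming a k x k x k torus with wrap-around, like the 3 x 3 x 3 testbed; illustrative, not CamCube's API): each server has exactly six one-hop neighbors, obtained by moving +/-1 along each axis modulo k.

```python
def neighbors(node, k=3):
    """The six one-hop neighbors of a server in a k x k x k 3D torus."""
    x, y, z = node
    return [((x + 1) % k, y, z), ((x - 1) % k, y, z),
            (x, (y + 1) % k, z), (x, (y - 1) % k, z),
            (x, y, (z + 1) % k), (x, y, (z - 1) % k)]

# (1,2,2) from the figure is a one-hop neighbor of (1,2,1).
print(neighbors((1, 2, 1)))
# [(2, 2, 1), (0, 2, 1), (1, 0, 1), (1, 1, 1), (1, 2, 2), (1, 2, 0)]
```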

SLIDE 16

Mapping a tree…

… on a switched topology:

  • The 1 Gbps server link becomes the bottleneck (each child gets 1/in-degree Gbps)

… on CamCube:

  • Packets are aggregated on path (=> less traffic)
  • 1:1 mapping between logical and physical topology

SLIDE 17

Camdoop Design

Goals

1. No change in the programming model
2. Exploit network locality
3. Good server and link load distribution
4. Fault-tolerance

SLIDE 18

Design Goal #1

Programming Model

  • Camdoop adopts the same MapReduce model

  • GFS-like distributed file-system

− Each server runs map tasks on local chunks

  • We use a spanning tree (sketched below)

− Combiners aggregate the local map task output and the children's results (if any) and stream the result to the parent
− The root runs the reduce task and generates the final output

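A sketch of the spanning-tree aggregation (hypothetical Python; `combine` stands in for the user's combiner function): each node merges its local map output with its children's streams and forwards the aggregate to its parent, so the root runs the reduce over already-combined data.

```python
from collections import defaultdict

def combine(pairs):
    """User-supplied combiner: here, sum values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def node_output(server, children, local_pairs):
    """What a tree node sends to its parent: the combiner applied to its
    local map output plus everything received from its children."""
    incoming = list(local_pairs[server])
    for child in children.get(server, []):
        incoming.extend(node_output(child, children, local_pairs))
    return combine(incoming)

children = {"root": ["a", "b"]}                      # spanning tree over servers
local = {"root": [("x", 1)], "a": [("x", 2)], "b": [("y", 1)]}
print(sorted(node_output("root", children, local)))  # [('x', 3), ('y', 1)]
```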

SLIDE 19

Design Goal #2

Network locality


How to map the tree nodes to servers?


SLIDE 20

Map task outputs are always read from the local disk

SLIDE 21

The parent and children are mapped onto physical neighbors

[Figure: tree edges between neighboring servers, e.g., (1,2,1), (1,2,2), (1,1,1)]

SLIDE 22

This ensures maximum locality and optimizes the network transfer
SLIDE 23

Network Locality

One physical link is used by one and only one logical link

[Figure: logical view vs. physical view (3D torus)]

SLIDE 24

Design Goal #3

Load Distribution


SLIDE 25

Design Goal #3

Load Distribution

First issue: poor server load distribution (servers have a different in-degree in a single tree)

SLIDE 26

Design Goal #3

Load Distribution

Second issue: poor bandwidth utilization (only 1 Gbps per server is used, instead of 6)

SLIDE 27

Design Goal #3

Load Distribution

Solution: stripe the data across disjoint trees

− Different links are used
− Improves load distribution

SLIDE 28

Design Goal #3

Solution: stripe the data across 6 disjoint trees

− All links are used => up to 6 Gbps per server
− Good load distribution
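A sketch of one plausible striping policy (hypothetical; the exact scheme is in the paper): partition the key space across the 6 edge-disjoint trees so every server's six links carry a share of the shuffle traffic.

```python
import zlib

def tree_for_key(key, num_trees=6):
    """Deterministically assign each intermediate key to one of the six
    disjoint aggregation trees, spreading load over all six links."""
    return zlib.crc32(key.encode()) % num_trees

for key in ["rose", "is", "a", "cube"]:
    print(key, "-> tree", tree_for_key(key))
```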

SLIDE 29

Design Goal #4

Fault-tolerance

  • The tree is built in the coordinate space

− CamCube remaps coordinates in case of failures

  • Details in the paper

SLIDE 30

Evaluation

Testbed

− 27-server CamCube (3 x 3 x 3)
− Quad-core Intel Xeon 5520, 2.27 GHz
− 12 GB RAM
− 6 Intel PRO/1000 PT 1 Gbps ports
− Runtime & services implemented in user-space

Simulator

− Packet-level simulator (CPU overhead not modelled)
− 512-server (8 x 8 x 8) CamCube

SLIDE 31

Evaluation

Design and implementation recap

Camdoop: shuffle & reduce parallelized

  • The reduce phase is parallelized with the shuffle phase

− Since all streams are ordered, as soon as the root receives at least one packet from all children, it can start the reduce function
− No need to store intermediate results to disk on the reduce servers

[Figure: timeline with the map, shuffle, and reduce phases overlapping]
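A sketch of why ordered streams allow this pipelining (illustrative Python; assumes each child delivers its pairs sorted by key): the root merges the sorted streams incrementally and can reduce a key as soon as every child has advanced past it.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def streaming_reduce(child_streams, reduce_fn):
    """Merge key-sorted streams from the children and emit each key as
    soon as all of its values have arrived; nothing is spilled to disk."""
    merged = heapq.merge(*child_streams, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(value for _, value in group)

streams = [iter([("a", 1), ("c", 2)]), iter([("a", 3), ("b", 1)])]
print(list(streaming_reduce(streams, sum)))
# [('a', 4), ('b', 1), ('c', 2)]
```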

SLIDE 32

Evaluation

Design and implementation recap

Camdoop: runs on CamCube, stripes data across six disjoint trees, performs in-network aggregation, and parallelizes shuffle & reduce

SLIDE 33

Evaluation

Design and implementation recap

  • TCP Camdoop (switch)

− 27 CamCube servers attached to a ToR switch
− TCP is used to transfer data in the shuffle phase

SLIDE 34

Evaluation

Design and implementation recap

  • Camdoop (no agg)

− Like Camdoop but without in-network aggregation
− Shows the impact of just running on CamCube

SLIDE 35

Validation against Hadoop & Dryad

  • Sort and WordCount

  • The Camdoop baselines are competitive against Hadoop and Dryad

  • Several reasons:

− Shuffle and reduce parallelized
− Fine-tuned implementation

[Chart: time (s, log scale) for Sort and WordCount: Hadoop, Dryad/DryadLINQ, TCP Camdoop (switch), Camdoop (no agg)]

SLIDE 36

We consider these as our baselines

SLIDE 37

Parameter Sweep

  • Output size / intermediate size (S)

− S = 1 (no aggregation): every key is unique
− S = 1/N ≈ 0 (full aggregation): every key appears in all map task outputs
− We use synthetic workloads to explore different values of S (worked example below)

  • Intermediate data size is 22.2 GB (843 MB/server)

  • Reduce tasks (R)

− R = 1 (all-to-one): e.g., interactive queries, top-k jobs
− R = N (all-to-all): the common setup in MapReduce jobs; N output files are generated
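A worked example of the two extremes of S (following the definitions above; the numbers are illustrative): if each of the N map tasks emits the same K distinct keys, full aggregation collapses N*K intermediate pairs into K output pairs.

```python
N, K = 27, 1_000_000        # N map tasks, K distinct keys each (illustrative)

intermediate = N * K        # pairs before aggregation
full_agg = K                # every key appears in all map outputs
no_agg = N * K              # every key is unique

print(full_agg / intermediate)  # S = 1/N, roughly 0.037 (full aggregation)
print(no_agg / intermediate)    # S = 1 (no aggregation)
```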

SLIDE 38

All-to-one (R=1)

[Chart: time (s, log scale) vs. output size / intermediate size (S), from full aggregation (S ≈ 0) to no aggregation (S = 1): TCP Camdoop (switch), Camdoop (no agg), Camdoop]

SLIDE 39

Annotations: the performance of TCP Camdoop (switch) and Camdoop (no agg) is independent of S; the gap between them shows the impact of running on CamCube, and the gap to Camdoop shows the impact of in-network aggregation.

SLIDE 40

The chart also marks the Facebook-reported aggregation ratio (cf. the 5.4% figure earlier).

SLIDE 41

All-to-all (R=27)

[Chart: time (s, log scale) vs. output size / intermediate size (S): TCP Camdoop (switch), Camdoop (no agg), Camdoop, and the Switch 1 Gbps (bound) line]

SLIDE 42

Annotations: the Switch 1 Gbps (bound) line is the maximum theoretical performance over the switch; the gaps again show the impact of running on CamCube and of in-network aggregation.

SLIDE 43

Number of reduce tasks (S=0)

[Chart: time (s, log scale) vs. number of reduce tasks R, from 1 (all-to-one) to 27 (all-to-all): TCP Camdoop (switch), Camdoop (no agg), Camdoop; annotated speedups of 4.31x and 10.19x (all-to-one) and 1.13x and 1.91x (all-to-all)]

SLIDE 44

Annotations: for the baselines, performance depends on R; for Camdoop, R does not (significantly) impact performance. An implementation bottleneck is also marked on the chart.

SLIDE 45

All resources are used even when R = 1: Camdoop decouples the job execution time from the number of output files generated.

SLIDE 46

Behavior at scale (simulated)

[Chart: N=512, S=0; time (s, log scale) vs. number of reduce tasks R (log scale, 1 to 512): Switch 1 Gbps (bound) vs. Camdoop, with up to a 512x gap in the all-to-one case]

SLIDE 47

Note: the Switch 1 Gbps (bound) line assumes full-bisection bandwidth.

SLIDE 48
Beyond MapReduce

  • More experiments (failures, multiple jobs, …) in the paper

SLIDE 49

Beyond MapReduce

  • The partition-aggregate model is also common in interactive services

− e.g., Bing Search, Google Dremel

  • Small-scale data

− 10s to 100s of KB returned per server

  • Typically, these services use one reduce task (R=1)

− A single result must be returned to the user

  • Full aggregation is common (S ≈ 0)

− e.g., N servers each generate their best k responses and the final result contains the best k responses (see the sketch below)

[Figure: partition-aggregate tree: web server with cache, parent servers, leaf servers]
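A sketch of this full-aggregation case (illustrative Python): merging per-server top-k lists yields an aggregate no larger than each input, so S stays ≈ 0 at every level of the tree.

```python
import heapq

def merge_top_k(per_server_results, k):
    """Combine per-server top-k (score, item) lists into a global top-k;
    the output is no larger than any single input list."""
    all_results = (r for results in per_server_results for r in results)
    return heapq.nlargest(k, all_results)

servers = [[(0.9, "a"), (0.5, "b")], [(0.8, "c"), (0.7, "d")]]
print(merge_top_k(servers, k=2))  # [(0.9, 'a'), (0.8, 'c')]
```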

SLIDE 50

Small-scale data (R=1, S=0)

[Chart: time (ms) for 20 KB and 200 KB of input data per server: TCP Camdoop (switch), Camdoop (no agg), Camdoop]

SLIDE 51

In-network aggregation can also be beneficial for (small-scale data) interactive services

SLIDE 52
Conclusions

  • Camdoop

− Explores the benefits of in-network processing by running combiners within the network
− No change in the programming model
− Achieves lower shuffle and reduce time
− Decouples performance from the number of output files

  • A final thought: how would Camdoop run on this?

− AMD SeaMicro: a 512-core cluster for data centers using a 3D torus
− Fast interconnect: 5 Gbps / link