Agenda 1 cs848 Models and Applications of Distributed Data - - PowerPoint PPT Presentation



SLIDE 1

Agenda

  • Background
  • Model
  • Hamming Distance 1
  • Triangle Finding
  • Matrix Multiplication

cs848 Models and Applications of Distributed Data Processing Systems

SLIDE 2

Background

The Problem

  • Tradeoff between parallelism and communication cost in a map-reduce computation.
  • The finer we partition the work of the reducers so that more parallelism can be extracted, the greater will be the total communication between mappers and reducers.
  • Limited bandwidth
  • Limited resources (memory, processing units, …)

SLIDE 3

Background

Why it is important

  • Explore the bounds on the cost of a map-reduce computation.
  • Optimize the algorithms for a given problem.

SLIDE 4

Background

Previous Work

  • First work that addresses the tradeoff between reducer size and communication cost in one-round map-reduce computations.
  • Theta-join implementation by map-reduce: only one special case.
  • Limit the input size of any reducer: limits consideration to algorithms that we might think of as truly parallel.

SLIDE 5

Model

  • A model of problems that can be solved in a single round of map-reduce computation.

Two Parameters

  • Replication rate r: average number of key-value pairs to which each input is mapped by the mappers.
  • Reducer size q: the maximum number of inputs that one reducer can receive.

SLIDE 6

Model

[Figure not preserved; its labels read: r = 2, p = 4]
SLIDE 7

Model

Tradeoff

  • Determine the best algorithm for a problem where: $r = f(q)$
  • Cost of solving the problem: $a f(q) + b q \; (+\, c q^2)$
  • Replication rate: $r = \sum_{i=1}^{p} q_i / |I|$
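A minimal sketch of the two quantities on this slide, assuming q_i is the number of inputs assigned to reducer i, p is the number of reducers, and |I| is the total number of inputs. The function names, the sample loads, and the default weights a, b, c are hypothetical and chosen only for illustration.

```python
def replication_rate(reducer_loads, num_inputs):
    """Replication rate r = (q_1 + ... + q_p) / |I|.

    reducer_loads: list of q_i, the number of inputs sent to each reducer.
    num_inputs:    |I|, the number of distinct inputs.
    """
    return sum(reducer_loads) / num_inputs


def cost(r, q, a=1.0, b=1.0, c=0.0):
    """Cost a*r + b*q (+ c*q^2); a, b, c are illustrative weights."""
    return a * r + b * q + c * q * q


# Hypothetical example: 8 inputs, 4 reducers, each receiving 4 inputs.
loads = [4, 4, 4, 4]
r = replication_rate(loads, num_inputs=8)   # each input is sent to 2 reducers on average
q = max(loads)                              # reducer size actually used
total = cost(r, q)
```

With r expressed as a function f(q), minimizing `cost(f(q), q)` over q is one way to read "determine the best algorithm" on this slide.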

SLIDE 8

Model

Mapping Schemas

  • No reducer is assigned more than q inputs.
  • For every output, there is (at least) one reducer that is assigned all of the inputs for that output. We say such a reducer covers the output. This reducer need not be unique, and it is permitted that these same inputs are assigned also to other reducers.
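The two conditions above can be checked mechanically. The sketch below is illustrative only: the data layout (a dict from reducer id to its set of inputs, and a list of input sets, one per output) and the function name are assumptions, not something defined on the slides.

```python
def is_valid_mapping_schema(assignment, outputs, q):
    """Check the two mapping-schema conditions.

    assignment: dict mapping a reducer id to the set of inputs assigned to it.
    outputs:    list of sets; each set holds the inputs one output depends on.
    q:          reducer size limit.
    """
    # Condition 1: no reducer is assigned more than q inputs.
    if any(len(inputs) > q for inputs in assignment.values()):
        return False
    # Condition 2: every output is covered by at least one reducer.
    return all(
        any(needed <= inputs for inputs in assignment.values())
        for needed in outputs
    )


# Hypothetical example: Hamming distance 1 on 2-bit strings.
outputs = [{"00", "01"}, {"00", "10"}, {"01", "11"}, {"10", "11"}]
by_first_bit = {"f0": {"00", "01"}, "f1": {"10", "11"}}
by_second_bit = {"s0": {"00", "10"}, "s1": {"01", "11"}}
full = {**by_first_bit, **by_second_bit}

ok = is_valid_mapping_schema(full, outputs, q=2)              # every pair is covered
partial = is_valid_mapping_schema(by_first_bit, outputs, q=2)  # {"00", "10"} is not covered
too_small = is_valid_mapping_schema(full, outputs, q=1)        # reducers exceed the limit
```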

SLIDE 9

Model

SLIDE 10

Model

Steps:

Q1: Is this assumption reasonable?

Q2: Can it be applied to most problems, or only to a few specific problems?

SLIDE 11

Model

SLIDE 12

Model

SLIDE 13

Hamming Distance 1


proof in technical report: F. N. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman. Upper and lower bounds on the cost of a map-reduce computation. CoRR, abs/1206.4377, 2012.

SLIDE 14

Hamming Distance 1

SLIDE 15

Hamming Distance 1

Upper Bound: Splitting Algorithm
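The slide content was not preserved, so the following is only a sketch of the splitting idea as it is commonly described for this problem: each b-bit string is sent to one reducer keyed by its first half and one keyed by its second half. Two strings at Hamming distance 1 differ in a single bit, so they agree on the other half and meet at that reducer. This gives r = 2 and q = 2^(b/2). The function names are hypothetical, and b is assumed to be even.

```python
from collections import defaultdict
from itertools import combinations


def splitting_map(s):
    """Emit two key-value pairs per input string (replication rate 2)."""
    half = len(s) // 2
    yield ("first", s[:half]), s    # reducer for strings sharing this first half
    yield ("second", s[half:]), s   # reducer for strings sharing this second half


def hamming1_pairs(strings):
    """Simulate one map-reduce round; return the pairs found and the reducers."""
    reducers = defaultdict(list)
    for s in strings:
        for key, value in splitting_map(s):
            reducers[key].append(value)

    pairs = set()
    for group in reducers.values():
        for u, v in combinations(group, 2):
            if sum(x != y for x, y in zip(u, v)) == 1:
                pairs.add((min(u, v), max(u, v)))
    return pairs, reducers


# All 4-bit strings: 16 inputs, 8 reducers, each receiving 2^(4/2) = 4 inputs.
inputs = [format(i, "04b") for i in range(16)]
pairs, reducers = hamming1_pairs(inputs)
```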

SLIDE 16

Hamming Distance 1

Upper Bound for large q: Replicas on neighboring reducers

SLIDE 17

Hamming Distance 1

  • Analysis for Hamming Distance 1 does not generalize easily to higher distances.
  • The bound on the number of outputs covered by a reducer is much higher.

SLIDE 18

Triangle Finding

  • We are given a graph as input and want to find all triples of nodes such that the graph has an edge between each pair of these three nodes.
  • Alon class of sample graphs: the nodes can be partitioned into disjoint sets such that the subgraph induced by each set either:
  • is a single edge between two nodes, or
  • contains an odd-length Hamiltonian cycle.
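For the triangle-finding problem stated in the first bullet, one common single-round scheme hashes nodes into k buckets and uses one reducer per multiset of three buckets. The sketch below assumes that scheme; it is not necessarily the algorithm presented in the talk. The bucket function (node id modulo k) and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bucket(node, k):
    """Hypothetical hash: integer node ids are spread over k buckets."""
    return node % k


def find_triangles(edges, k):
    """Simulate one map-reduce round; each edge is sent to k reducers."""
    reducers = defaultdict(set)
    for u, v in edges:
        edge = (min(u, v), max(u, v))
        for c in range(k):
            # Reducer key: sorted multiset of the two endpoint buckets plus c.
            key = tuple(sorted((bucket(u, k), bucket(v, k), c)))
            reducers[key].add(edge)

    found = set()
    for key, local_edges in reducers.items():
        nodes = sorted({n for e in local_edges for n in e})
        for x, y, z in combinations(nodes, 3):
            if {(x, y), (y, z), (x, z)} <= local_edges:
                # Report a triangle only at the reducer matching its buckets,
                # so it is not produced by several reducers.
                if tuple(sorted(bucket(n, k) for n in (x, y, z))) == key:
                    found.add((x, y, z))
    return found


edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5)]
triangles = find_triangles(edges, k=3)
```

In this sketch the replication rate is k, since every edge is sent to exactly k reducers; a larger k means smaller reducers and more communication.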

SLIDE 19

Matrix Multiplication

SLIDE 20

Matrix Multiplication

Matrix Multiplication Using Two Phases

SLIDE 21

Reference

  • http://www.slideshare.net/tzulitai/upper-and-lower-bound-on-the-cost-of-a-map-reduce-computation
  • http://shonan.nii.ac.jp/shonan/seminar011/files/2012/01/ullman.pdf

SLIDE 22

Thank you

Q&A
