Mapreduce With Parallelizable Reduce S. Muthu Muthukrishnan - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Mapreduce With Parallelizable Reduce

  • S. Muthu Muthukrishnan
slide-7
SLIDE 7

Some Premises

■ At a deliberately high level, we know the MapReduce system.

■ Parallel. Map and Reduce functions. Used when data is large. Changing system.

■ There is a nice PRAM theory of parallel algorithms.

■ NC, prefix sums, list ranking, and more.

■ Goal: Develop a useful theory of MapReduce algorithms.

■ An algorithmist's role. Interesting problems, algorithms. Bridge from the other side.

slide-13
SLIDE 13

Thoughts Circa 2006

■ Prefix sum in $O(1)$ rounds.

■ Problem: $A[1, \ldots, n] \Rightarrow PA[1, \cdots, n]$ where $PA[i] = \sum_{j \le i} A[j]$.

■ Solution:

  ■ Assign $A[i\sqrt{n} + 1, \cdots, (i+1)\sqrt{n}]$ to key $i$.

  ■ Solve problem on $B[1, \sqrt{n}]$ with one proc, $B[i] = \sum_{j = i\sqrt{n}+1}^{(i+1)\sqrt{n}} A[j]$. Doable?

  ■ Solve problem for key $i$ with $PB[i-1]$. Doable?

■ List ranking in $O(1)$ rounds?

■ Some graph algorithms in $O(1)$ rounds recently.
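The three-step prefix-sum solution above can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the function name `prefix_sums_mapreduce`, the 0-indexed blocks, and the block size of ceil(sqrt(n)) are choices made here, and each "round" is an ordinary Python loop rather than a real MapReduce job.

```python
import math
from itertools import accumulate

def prefix_sums_mapreduce(A):
    """Prefix sums of A in a constant number of simulated rounds."""
    n = len(A)
    if n == 0:
        return []
    size = math.isqrt(n - 1) + 1  # ceil(sqrt(n)) for n >= 1 (illustrative block size)
    num_keys = (n + size - 1) // size
    # Round 1, map: assign each contiguous block of A to key i.
    blocks = {i: A[i * size:(i + 1) * size] for i in range(num_keys)}
    # Round 1, reduce: key i emits its block total B[i].
    B = [sum(blocks[i]) for i in range(num_keys)]
    # Round 2: a single processor computes the prefix sums PB of the short array B.
    PB = list(accumulate(B))
    # Round 3: key i shifts its local prefix sums by PB[i-1] (0 for the first block).
    out = []
    for i in range(num_keys):
        offset = PB[i - 1] if i > 0 else 0
        out.extend(offset + s for s in accumulate(blocks[i]))
    return out
```

Both "Doable?" questions come down to whether about sqrt(n) items fit on one processor: that is the size of each block and of the array B in this sketch.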

slide-16
SLIDE 16

SIROCCO Challenge

■ Problem: Given graph $G = (V, E)$, count the number of triangles.¹

■ Solution:

  ■ For each edge $(u, v)$, generate a tuple $(u, v, 0)$.

  ■ For each vertex $v$ and for each pair of neighbors $x, z$ of $v$, generate a tuple $(x, z, 1)$.

  ■ Presence of both a 0 and a 1 tuple for an edge is a triangle.

■ Solution: The number of triangles is $\frac{1}{6} \sum_i \lambda_i^3$, where $\lambda_i$ are the eigenvalues of the adjacency matrix $A$ of $G$ in sorted order.

  ■ $(A^3)_{ii}$ counts the closed walks of length 3 through $i$, which is twice the number of triangles involving $i$.

  ■ The trace of $A^3$ is 6 times the number of triangles.

  ■ If $\lambda$ is an eigenvalue of $A$, i.e., $Ax = \lambda x$, then $\lambda^3$ is an eigenvalue of $A^3$.

  ■ In practice, computing the top few eigenvalues suffices.

¹ For example, see Fast Counting of Triangles in Large Real Networks without Counting: Algorithms and Laws, ICDM 08, by C. Tsourakakis.
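A small sketch of both solutions, run in one process on a toy graph. The function names are hypothetical, and the counting rule in the reduce step (sum the 1 tuples on every pair that also has a 0 tuple, then divide by 3) is one way to read the slide, not something it states. The second function does not compute eigenvalues; it uses the trace of $A^3$ directly, which equals the sum of the cubed eigenvalues.

```python
from collections import defaultdict
from itertools import combinations

def count_triangles_tuples(edges):
    """Count triangles with the (u, v, 0) / (x, z, 1) tuple scheme."""
    norm = {tuple(sorted(e)) for e in edges}  # undirected, deduplicated edges
    adj = defaultdict(set)
    for u, v in norm:
        adj[u].add(v)
        adj[v].add(u)
    groups = defaultdict(list)  # simulated shuffle: flags grouped by vertex pair
    for u, v in norm:
        groups[(u, v)].append(0)
    for v, nbrs in adj.items():
        for x, z in combinations(sorted(nbrs), 2):
            groups[(x, z)].append(1)
    # Reduce: a pair that is an edge (has a 0) closes one triangle per 1 tuple.
    closed = sum(flags.count(1) for flags in groups.values() if 0 in flags)
    return closed // 3  # every triangle is seen once from each of its 3 edges

def count_triangles_trace(edges, n):
    """Check via trace(A^3) / 6, for vertices numbered 0..n-1."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    A3 = matmul(matmul(A, A), A)
    return sum(A3[i][i] for i in range(n)) // 6
```

Note that the tuple scheme generates one tuple per pair of neighbors, so a vertex of degree d emits about d²/2 tuples; high-degree vertices dominate the cost.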

slide-18
SLIDE 18

Eigenvalue Estimation

$A$ is an $n \times n$ real-valued matrix.

■ Lanczos method.

■ Sketches. $Ar$ for a pseudo-random $n \times d$ matrix $r$, $d \ll n$. Will the $O(nd)$ sketch fit into one machine?

slide-19
SLIDE 19

Special Case

Motivation: Logs processing.

    x = inputrecord;
    x-squared = x * x;
    aggregator: table sum;
    emit aggregator <- x-squared;

MUD Algorithm $m = (\Phi, \oplus, \eta)$.

■ Local function $\Phi : \Sigma \to Q$ maps an input item to a message.

■ Aggregator $\oplus : Q \times Q \to Q$ maps two messages to a single message.

■ Post-processing operator $\eta : Q \to \Sigma$ produces the final output when applied to $m_{\mathcal{T}}(x)$, the message obtained by aggregating $\Phi(x_1), \ldots, \Phi(x_n)$ along the tree $\mathcal{T}$.

■ Computes a function $f$ if $\eta(m_{\mathcal{T}}(\cdot)) = f$ for all trees $\mathcal{T}$.
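The definition above can be exercised with a small evaluator that merges messages along a randomly chosen binary tree. This is a sketch under assumptions: `mud_evaluate` and `sum_of_squares` are names invented here, the input must be non-empty, and the leaves are shuffled on the assumption that a tree may place the inputs in any order.

```python
import random

def mud_evaluate(phi, combine, eta, xs, rng):
    """Apply phi to each item, merge along a random binary tree, then apply eta."""
    msgs = [phi(x) for x in xs]
    rng.shuffle(msgs)                      # arbitrary leaf order (assumption)
    while len(msgs) > 1:
        i = rng.randrange(len(msgs) - 1)   # pick an adjacent pair to aggregate
        msgs[i:i + 2] = [combine(msgs[i], msgs[i + 1])]
    return eta(msgs[0])

def sum_of_squares(xs, seed=0):
    """The log-processing example: phi squares, the aggregator adds, eta is the identity."""
    return mud_evaluate(lambda x: x * x, lambda a, b: a + b, lambda q: q,
                        xs, random.Random(seed))
```

Running the same input with different seeds gives different trees; a triple that computes $f$ in the sense above must return the same value on all of them.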

slide-20
SLIDE 20

MUD Examples

$\Phi(x) = \langle x, x \rangle$

$\oplus(\langle a_1, b_1 \rangle, \langle a_2, b_2 \rangle) = \langle \min(a_1, a_2), \max(b_1, b_2) \rangle$

$\eta(\langle a, b \rangle) = b - a$

Figure: mud algorithm for computing the total span.
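The total-span triple translates almost directly into code. In this sketch the names `phi`, `combine`, `eta` and `span` are chosen here, and a left-to-right fold stands in for one particular aggregation tree.

```python
from functools import reduce

def phi(x):
    """Local function: a single item is both the minimum and the maximum."""
    return (x, x)

def combine(m1, m2):
    """Aggregator: keep the smaller minimum and the larger maximum."""
    (a1, b1), (a2, b2) = m1, m2
    return (min(a1, a2), max(b1, b2))

def eta(m):
    """Post-processing: span = max - min."""
    a, b = m
    return b - a

def span(xs):
    """Fold the messages left to right (one particular tree); xs must be non-empty."""
    return eta(reduce(combine, map(phi, xs)))
```

Because `min` and `max` are associative and commutative, merging two partial results gives the same answer as a single pass, for example `eta(combine(reduce(combine, map(phi, left)), reduce(combine, map(phi, right))))`.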

slide-21
SLIDE 21

MUD Examples

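$\Phi(x) = \langle x, h(x), 1 \rangle$

$$\oplus(\langle a_1, h(a_1), c_1 \rangle, \langle a_2, h(a_2), c_2 \rangle) = \begin{cases} \langle a_i, h(a_i), c_i \rangle & \text{if } h(a_i) < h(a_j) \\ \langle a_1, h(a_1), c_1 + c_2 \rangle & \text{otherwise} \end{cases}$$

$\eta(\langle a, b, c \rangle) = a$ if $c = 1$

Figure: Mud algorithm for computing a uniform random sample of the unique items in a set. Here $h$ is an approximate minwise hash function.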
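A sketch of the sampling triple follows. Several parts are assumptions made for illustration: a seeded SHA-256 value stands in for the approximate minwise hash $h$, equal hash values are treated as meaning "same item", and `eta` returns `None` when $c \ne 1$, a case the figure leaves unspecified.

```python
import hashlib
from functools import reduce

def h(x, seed=0):
    """Stand-in for the minwise hash: a seeded SHA-256 value (an assumption)."""
    digest = hashlib.sha256(f"{seed}:{x!r}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def phi(x):
    """Local function: item, its hash, and a count of 1."""
    return (x, h(x), 1)

def combine(m1, m2):
    """Aggregator: keep the message with the smaller hash; on a tie, add the counts."""
    if m1[1] < m2[1]:
        return m1
    if m2[1] < m1[1]:
        return m2
    return (m1[0], m1[1], m1[2] + m2[2])  # equal hashes: treated as the same item

def eta(m):
    """Post-processing: output the item only if it was seen exactly once."""
    a, _, c = m
    return a if c == 1 else None  # None for c != 1 is an assumption of this sketch

def sample_unique(xs):
    """Left-to-right fold over a non-empty input."""
    return eta(reduce(combine, map(phi, xs)))
```

The surviving message always belongs to the item with the smallest hash, and its count is the number of times that item occurred, so the result does not depend on the order of aggregation.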

slide-22
SLIDE 22

Streaming

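■ A streaming algorithm is a pair $s = (\sigma, \eta)$.

■ The operator $\sigma : Q \times \Sigma \to Q$ maps a state and an input item to a new state.

■ $\eta : Q \to \Sigma$ converts the final state to the output.

■ On input $x \in \Sigma^n$, the streaming algorithm computes $f = \eta(s^0(x))$, where $0$ is the starting state, and $s^q(x) = \sigma(\sigma(\ldots \sigma(\sigma(q, x_1), x_2), \ldots, x_{k-1}), x_k)$.

■ Communication complexity is $\log |Q|$.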
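The nested application of $\sigma$ is a left fold, which the following sketch makes explicit. The helper names and the toy algorithm (sum of the items modulo 7, so that $|Q| = 7$) are illustrative choices, not taken from the slides.

```python
import math
from functools import reduce

def run_stream(sigma, q, xs):
    """s^q(x): fold sigma over the stream, starting from state q."""
    return reduce(sigma, xs, q)

def streaming_output(sigma, eta, xs, q0=0):
    """f = eta(s^0(x)), with q0 playing the role of the starting state 0."""
    return eta(run_stream(sigma, q0, xs))

def sigma_mod7(q, x):
    """Toy state update: running sum modulo 7, so the state set Q has 7 elements."""
    return (q + x) % 7

def eta_identity(q):
    return q

# Bits needed to transmit one state: ceil(log2 |Q|) with |Q| = 7.
STATE_BITS = math.ceil(math.log2(7))
```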

slide-26
SLIDE 26

MUD vs Streaming

■ For a mud algorithm $m = (\Phi, \oplus, \eta)$, there is a streaming algorithm $s = (\sigma, \eta)$ of the same complexity with the same output, by setting $\sigma(q, x) = \oplus(q, \Phi(x))$.

■ Central question: Can MUD simulate streaming?

■ Count the occurrences of the first odd number on the stream.

■ Symmetric problems? Symmetric index problem.

$S = (a, 1, x_1, p), (a, 2, x_2, p), \ldots, (a, n, x_n, p), (b, 1, y_1, q), (b, 2, y_2, q), \ldots, (b, n, y_n, q).$

Additionally, we have $x_q = y_p$. Compute the function $f(S) = x_q$.
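The first and third bullets above can be sketched in a few lines. `streaming_from_mud` builds $\sigma(q, x) = \oplus(q, \Phi(x))$ from a mud algorithm; it assumes the start state is an identity for the aggregator (0 for addition here). `count_first_odd` is a streaming algorithm whose answer depends on the order of the stream, which is why it is a natural candidate for something a mud algorithm cannot do. All names are hypothetical.

```python
from functools import reduce

def streaming_from_mud(phi, combine):
    """Build sigma(q, x) = combine(q, phi(x)) from a mud algorithm's phi and aggregator."""
    return lambda q, x: combine(q, phi(x))

# Mud algorithm for the sum of squares; 0 is the identity of +, so it can be the start state.
sigma_sq = streaming_from_mud(lambda x: x * x, lambda a, b: a + b)

def first_odd_sigma(q, x):
    """State q = (first odd number seen so far or None, number of times it occurred)."""
    first, count = q
    if first is None:
        return (x, 1) if x % 2 else q
    return (first, count + (x == first))

def count_first_odd(xs):
    """Count the occurrences of the first odd number on the stream (0 if there is none)."""
    return reduce(first_odd_sigma, xs, (None, 0))[1]
```

For example, the streams `[2, 3, 5, 3, 5, 5]` and `[2, 5, 3, 3, 5, 5]` contain the same items but give different counts, so the function is not symmetric.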

slide-28
SLIDE 28

MUD vs Streaming

For any symmetric function $f : \Sigma^n \to \Sigma$ computed by a $g(n)$-space, $c(n)$-communication streaming algorithm $(\sigma, \eta)$, with $g(n) = \Omega(\log n)$ and $c(n) = \Omega(\log n)$, there exists an $O(c(n))$-communication, $O(g^2(n))$-space mud algorithm $(\Phi, \oplus, \eta)$ that also computes $f$.

slide-31
SLIDE 31

MUD vs Streaming: 2 parties

■ $x_A$ and $x_B$ are partitions of the input sequence $x$ sent to Alice and Bob.

■ Alice runs the streaming algorithm on her input sequence to produce the state $q_A = s^0(x_A)$, and sends this to Carol. Similarly, Bob sends $q_B = s^0(x_B)$ to Carol.

■ Carol receives the states $q_A$ and $q_B$, which contain the sizes $n_A$ and $n_B$ of the input sequences $x_A$ and $x_B$, and needs to calculate $f = \eta(s^0(x_A \| x_B))$.

slide-33
SLIDE 33

2 Parties Communication

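■ Carol finds sequences $x'_A$ and $x'_B$ of length $n_A$ and $n_B$ such that $q_A = s^0(x'_A)$ and $q_B = s^0(x'_B)$.

■ Carol then outputs $\eta(s^0(x'_A \cdot x'_B))$.

$$\begin{aligned}
\eta(s^0(x'_A \cdot x'_B)) &= \eta(s^0(x_A \cdot x'_B)) \\
&= \eta(s^0(x'_B \cdot x_A)) \\
&= \eta(s^0(x_B \cdot x_A)) \\
&= \eta(s^0(x_A \cdot x_B)) \\
&= f(x_A \cdot x_B) \\
&= f(x).
\end{aligned}$$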
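Carol's step can be illustrated with a brute-force search on a toy problem. Everything concrete here is an assumption chosen for the example: the symmetric function is the sum of the items modulo 5, the alphabet is {0, ..., 4}, and the state is the pair (items seen, sum mod 5) so that it carries the sequence length, as the protocol requires.

```python
from functools import reduce
from itertools import product

ALPHABET = range(5)   # small illustrative alphabet (an assumption)
START = (0, 0)        # state = (items seen, sum mod 5); plays the role of state 0

def sigma(q, x):
    n, s = q
    return (n + 1, (s + x) % 5)

def eta(q):
    return q[1]

def run(q, xs):
    """s^q(x): fold sigma over xs starting from state q."""
    return reduce(sigma, xs, q)

def find_preimage(q):
    """Carol's search: some x' of length n with s^0(x') = q, found by enumeration."""
    n = q[0]
    for cand in product(ALPHABET, repeat=n):
        if run(START, cand) == q:
            return cand
    raise ValueError("state is not reachable from the start state")

def carol(q_a, q_b):
    """Output eta(s^0(x'_A . x'_B)) using only the two received states."""
    x_a, x_b = find_preimage(q_a), find_preimage(q_b)
    return eta(run(START, x_a + x_b))
```

Carol never sees $x_A$ or $x_B$, only states that some sequences of the right lengths would have produced. The enumeration takes time exponential in the sequence length, so this sketch says nothing about efficiency.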

slide-38
SLIDE 38

Space Efficient 2 Party Communication

■ Non-deterministic simulation:

  ■ First, guess the symbols of $x'_A$ one at a time, simulating the streaming algorithm $s^0(x'_A)$ on the guess. If after $n_A$ guessed symbols we have $s^0(x'_A) \ne q_A$, reject this branch.

  ■ Then, guess the symbols of $x'_B$, simulating (in parallel) $s^0(x'_B)$ and $s^{q_A}(x'_B)$. If after $n_B$ steps we have $s^0(x'_B) \ne q_B$, reject this branch; otherwise, output $q_C = s^{q_A}(x'_B)$.

■ This procedure is a non-deterministic, $O(g(n))$-space algorithm for computing a valid $q_C$.

■ By Savitch's theorem, it follows that there is a deterministic, $O(g^2(n))$-space algorithm.

■ Simulation time is superpolynomial.

slide-39
SLIDE 39

Proof

■ Finish the proof for an arbitrary computation tree inductively.

■ Extends to streaming algorithms for approximating $f$ that work by computing some other function $g$ exactly over the stream, for example, sketch-based algorithms that maintain $c_i = \langle x, v_i \rangle$, where $x$ is the input vector, for some vectors $v_i$. Public randomness.

■ Doesn't extend to randomized algorithms with private randomness, partial functions, etc.

slide-40
SLIDE 40

Multiple Keys

■ Any $N$-processor, $M$-memory, $T$-time EREW-PRAM algorithm which has a $\log(N + M)$-bit word in every memory location can be simulated by an $O(T)$-round, $(N + M)$-key mud algorithm with communication complexity $O(\log(N + M))$ bits per key.

■ In particular, any problem in class NC has a $\mathrm{polylog}(n)$-round, $\mathrm{poly}(n)$-key mud algorithm with communication complexity $O(\log(n))$ bits per key.

slide-41
SLIDE 41

Concluding Remarks

■ Jon Feldman, S. Muthukrishnan, Anastasios Sidiropoulos, Clifford Stein, Zoya Svitkina: On distributing symmetric streaming computations. SODA 2008: 710-719