SLIDE 1

MapReduce

Doug Woos

SLIDE 2

Logistics notes

Deadlines, etc. up on website
Slip day policy
Piazza!!!

https://piazza.com/washington/spring2017/cse452

SLIDE 3

Outline

  • Why MapReduce?
  • Programming model
  • Implementation
  • Technical details (performance, failure, limitations)
  • Lab 1
  • Piazza discussion
SLIDE 4

Why MapReduce?

Distributed systems are hard

  • Failure
  • Consistency
  • Performance
  • Testing
  • etc., etc.

Shouldn’t have to write one for every task

  • separation of concerns
SLIDE 5

Separation of concerns

User program
MapReduce
Distributed filesystem

SLIDE 6

Separation of concerns

User program
MapReduce
Distributed filesystem

SLIDE 7

Programming model

Input: list of key/value pairs

[(“1”, “in the town where I was born”),
 (“2”, “lived a man who sailed to sea”),
 (“3”, “and he told us of his life”),
 (“4”, “in the land of submarines”),
 …]

Output: list of key/value pairs

[(“13”, “yellow”),
 (“9”, “submarine”),
 (“7”, “in”),
 (“7”, “we”),
 …]

SLIDE 8

Programming model

Map: (k1, v1) -> [(k2, v2)]

for word in value:
 emit (word, “1”)

Reduce: (k2, [v2]) -> [v3]

emit len(values)
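For concreteness, here is a minimal runnable Go sketch of word count under this model. The KeyValue type and the mapF/reduceF names are illustrative, not necessarily the lab's exact API.

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // KeyValue is an illustrative intermediate key/value pair.
    type KeyValue struct {
        Key, Value string
    }

    // mapF: (k1, v1) -> [(k2, v2)]. Emits (word, "1") for every word in the line.
    func mapF(key, value string) []KeyValue {
        var out []KeyValue
        for _, word := range strings.Fields(value) {
            out = append(out, KeyValue{word, "1"})
        }
        return out
    }

    // reduceF: (k2, [v2]) -> v3. The count is just how many values were emitted.
    func reduceF(key string, values []string) string {
        return strconv.Itoa(len(values))
    }

    func main() {
        fmt.Println(mapF("1", "in the town where I was born"))
        fmt.Println(reduceF("in", []string{"1", "1", "1"})) // "3"
    }

Note that reduceF never inspects the values themselves; it only needs to know how many there are, which is why the pseudocode above is simply emit len(values).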

SLIDE 9

Programming model

Map runs on every key/value pair, produces new pairs

[(“In”, “1”), (“the”, “1”), (“town”, “1”), (“where”, “1”), …]

Resulting pairs sorted by key

[[(“a”, “1”), (“a”, “1”), (“a”, “1”), …],
 [(“and”, “1”), (“and”, “1”), (“and”, “1”), …],
 …]

Reduce runs on every key and all associated values

[(“13”, “yellow”),
 (“9”, “submarine”),
 (“7”, “in”),
 (“7”, “we”),
 …]
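The sort-and-group step can be made concrete with a small sequential sketch: sort map output by key, group equal keys, then call a reduce function once per group. Names and types here are again illustrative, not the lab's API.

    package main

    import (
        "fmt"
        "sort"
        "strconv"
    )

    type KeyValue struct {
        Key, Value string
    }

    // reduceSequentially sorts map output by key, groups equal keys, and
    // calls reduceF once per key with all of that key's values.
    func reduceSequentially(pairs []KeyValue, reduceF func(string, []string) string) []KeyValue {
        sort.Slice(pairs, func(i, j int) bool { return pairs[i].Key < pairs[j].Key })
        var out []KeyValue
        for i := 0; i < len(pairs); {
            j := i
            var values []string
            for j < len(pairs) && pairs[j].Key == pairs[i].Key {
                values = append(values, pairs[j].Value)
                j++
            }
            out = append(out, KeyValue{pairs[i].Key, reduceF(pairs[i].Key, values)})
            i = j
        }
        return out
    }

    func main() {
        mapOutput := []KeyValue{{"in", "1"}, {"the", "1"}, {"in", "1"}, {"town", "1"}}
        counts := reduceSequentially(mapOutput, func(k string, vs []string) string {
            return strconv.Itoa(len(vs))
        })
        fmt.Println(counts) // [{in 2} {the 1} {town 1}]
    }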

SLIDE 10

Other example programs

Surprising anagram finder

  • Map: emit (sorted(value), value)
  • Reduce: emit highest-scoring anagram in values

PageRank

  • Map: for outbound link in page.links: emit (url, page.rank)
  • Reduce: page.rank = sum(page.rank for page in links) / len(page.links)

Others?
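A hedged Go sketch of the anagram job (word length stands in for whatever scoring the real job would use, and all names here are made up for illustration):

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    type KeyValue struct {
        Key, Value string
    }

    // anagramMapF keys each word by its letters in sorted order, so every
    // anagram of a word ends up under the same intermediate key.
    func anagramMapF(key, word string) []KeyValue {
        letters := strings.Split(word, "")
        sort.Strings(letters)
        return []KeyValue{{strings.Join(letters, ""), word}}
    }

    // anagramReduceF picks one "best" word per group; length is a stand-in
    // for a real scoring function.
    func anagramReduceF(key string, words []string) string {
        best := ""
        for _, w := range words {
            if len(w) > len(best) {
                best = w
            }
        }
        return best
    }

    func main() {
        fmt.Println(anagramMapF("", "listen")) // [{eilnst listen}]
        fmt.Println(anagramReduceF("eilnst", []string{"listen", "silent", "enlist"}))
    }

Keying by the sorted letters is what makes this work: Reduce sees all words that are anagrams of one another together and can pick the most surprising one.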

SLIDE 11

Separation of concerns

User program
MapReduce
Distributed filesystem

SLIDE 12

MapReduce Implementation

Goals:

  • Run on large amount of data
  • Run in parallel
  • Tolerate failures/slowness at worker nodes

Assume:

  • Distributed filesystem
  • No master failures
SLIDE 13

MapReduce Architecture

Master Worker Worker Worker Worker Worker

Distributed filesystem

SLIDE 14

MapReduce steps

Master Worker Worker

Distributed filesystem

SLIDE 15

MapReduce steps

Master Worker Worker

Distributed filesystem

Register()

SLIDE 16

MapReduce steps

Master Worker Worker

Distributed filesystem

M map tasks, R reduce tasks

SLIDE 17

MapReduce steps

Master Worker Worker

Distributed filesystem

Split input into M roughly fixed-size splits

SLIDE 18

MapReduce steps

Master Worker Worker

Distributed filesystem

Write splits mrtmp.<name>-<m>

SLIDE 19

MapReduce steps

Master Worker Worker

Distributed filesystem

DoMap(m) (× M)

SLIDE 20

MapReduce steps

Master Worker Worker

Distributed filesystem

Get k-v pairs mrtmp.<name>-<m>

SLIDE 21

MapReduce steps

Master Worker Worker

Distributed filesystem

Call Map() on k-v pairs
Partition results into R “regions”

SLIDE 22

MapReduce steps

Master Worker Worker

Distributed filesystem

Write regions mrtmp.<name>-<m>-<r>
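The partition step from the previous slide is typically just a hash of the intermediate key modulo R. Below is a sketch: the mrtmp.<name>-<m>-<r> naming follows the slide, while the FNV hash is only an assumed choice for illustration.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // regionFor assigns an intermediate key to one of R reduce regions.
    func regionFor(key string, R int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32()) % R
    }

    // intermediateName builds the mrtmp.<name>-<m>-<r> file name for the
    // output of map task m destined for reduce region r.
    func intermediateName(jobName string, m, r int) string {
        return fmt.Sprintf("mrtmp.%s-%d-%d", jobName, m, r)
    }

    func main() {
        R := 3
        r := regionFor("submarine", R)
        fmt.Println(r, intermediateName("wordcount", 2, r))
    }

Because every map task uses the same hash, all pairs with a given key land in the same region r, so reduce task r can find everything it needs in the files mrtmp.<name>-<m>-<r> written by each map task m.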

SLIDE 23

MapReduce steps

Master Worker Worker

Distributed filesystem

return

SLIDE 24

MapReduce steps

Master Worker Worker

Distributed filesystem

Wait for M Map tasks to finish

SLIDE 25

MapReduce steps

Master Worker Worker

Distributed filesystem

DoReduce(r) (× R)

SLIDE 26

MapReduce steps

Master Worker Worker

Distributed filesystem

Get k-v pairs for r mrtmp.<name>-<m>-<r>

SLIDE 27

MapReduce steps

Master Worker Worker

Distributed filesystem

Sort pairs
Run Reduce() per key

SLIDE 28

MapReduce steps

Master Worker Worker

Distributed filesystem

Write results

SLIDE 29

MapReduce steps

Master Worker Worker

Distributed filesystem

return

SLIDE 30

Separation of concerns

User program
MapReduce
Distributed filesystem

SLIDE 31

Distributed filesystem

Will cover later in the quarter!
In the lab, just use the local FS
For now, it’s sort of a black box
But: why the 64MB default split size? What if we didn’t have a distributed filesystem?

SLIDE 32

Technical details

  • Failures
  • Performance
  • Optimizations
  • Limitations
SLIDE 33

Handling failure

Basically: just re-run the job

  • Handle stragglers, failures in the same way
  • If the master fails, have to start over
  • How would we handle a master failure?

Why is this easy in MapReduce? Why wouldn’t this be easy in other systems?

  • Can I re-run “charge user’s credit card?”
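As a sketch of the “just re-run it” idea (not the lab’s actual scheduler): a master hands a task to an idle worker and, if the call fails or times out, simply gives the same task to another worker. Here callWorker and the channel of idle workers are illustrative stand-ins.

    package main

    import (
        "errors"
        "fmt"
    )

    // callWorker stands in for the RPC that asks a worker to run one task;
    // in a real system it can fail or hang.
    func callWorker(worker string, task int) error {
        if worker == "worker-2" {
            return errors.New("connection lost") // simulate a failed worker
        }
        fmt.Printf("%s finished task %d\n", worker, task)
        return nil
    }

    // runTask keeps handing the same task to idle workers until one succeeds.
    // Re-running is safe because map/reduce tasks are deterministic and have
    // no external side effects, unlike charging a credit card.
    func runTask(task int, idle chan string) {
        for {
            w := <-idle
            if err := callWorker(w, task); err == nil {
                idle <- w // worker is idle again
                return
            }
            // on failure, forget this worker and retry the task elsewhere
        }
    }

    func main() {
        idle := make(chan string, 3)
        for _, w := range []string{"worker-1", "worker-2", "worker-3"} {
            idle <- w
        }
        for task := 0; task < 4; task++ {
            runTask(task, idle)
        }
    }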
SLIDE 34

Fault-tolerance model

Master never fails
Workers are fail-stop

  • Don’t send garbled packets
  • Don’t otherwise misbehave
  • Can reboot

Packets can be dropped

SLIDE 35

Performance

How much speedup do we want on N servers? How much speedup do we expect on N servers? What are the bottlenecks?

SLIDE 36

Optimizations

Data locality is key

  • Run Map jobs near data
  • Can we run Reduce jobs near data?

Run Reduce function on each Map node’s results

  • “Combiner” function in the paper
  • When can we do this?
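For word count, the combiner is essentially the reduce function run locally on each map task’s output before anything is written out or shipped over the network. A hedged sketch; the names are illustrative:

    package main

    import (
        "fmt"
        "strconv"
    )

    type KeyValue struct {
        Key, Value string
    }

    // combine collapses a map task's local output, e.g. many copies of
    // ("the", "1") into a single ("the", "1000"). This is only correct when
    // the reduce function is associative and commutative, as counting is.
    func combine(pairs []KeyValue) []KeyValue {
        counts := make(map[string]int)
        for _, kv := range pairs {
            n, _ := strconv.Atoi(kv.Value)
            counts[kv.Key] += n
        }
        var out []KeyValue
        for k, n := range counts {
            out = append(out, KeyValue{k, strconv.Itoa(n)})
        }
        return out
    }

    func main() {
        local := []KeyValue{{"the", "1"}, {"the", "1"}, {"a", "1"}}
        fmt.Println(combine(local)) // e.g. [{the 2} {a 1}] (map order varies)
    }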
SLIDE 37

Limitations

What problems doesn’t MR solve?

SLIDE 38

DeWitt/Stonebraker critique

  • 1. A giant step backward in the programming paradigm for large-scale data-intensive applications
  • 2. A sub-optimal implementation, in that it uses brute force instead of indexing
  • 3. Not novel at all: represents a specific implementation of well-known techniques developed nearly 25 years ago
  • 4. Missing most of the features that are routinely included in current DBMS
  • 5. Incompatible with all of the tools DBMS users have come to depend on

SLIDE 39

Lab 1

Linked from the course website now!
Due next Friday (April 7), 9:00pm
Turn-in procedure:

  • Dropbox on course site
  • One partner turns in code
  • Both partners turn in brief writeup
  • Writeup: how long it took, which parts you did
SLIDE 40

Lab 1

Three parts:

  • Implement word count
  • Implement naive MapReduce master
  • Handle worker failures

Some simplifications w.r.t. the paper:

  • Map takes strings, not k/v pairs
  • Runs locally, so no separation between local/global FS
  • No partial failures (no file-write issues)
SLIDE 41

Lab 1

Partly a warm-up exercise: learn Go, etc.
Go tutorial section tomorrow
Some general hints next lecture
Have fun!

SLIDE 42

Discussion

What’s the deal with master failure? Why is atomic rename important? Why not store intermediate results in RAM?

  • Apache Spark

Aren’t some Reduce jobs much larger? What about infinite loops? Why does novelty matter?