MapReduce February 13, 2020 Data Science CSCI 1951A Brown - - PowerPoint PPT Presentation



slide-1
SLIDE 1

MapReduce

February 13, 2020
Data Science CSCI 1951A, Brown University
Instructor: Ellie Pavlick
HTAs: Josh Levin, Diane Mutako, Sol Zitter

1

slide-2
SLIDE 2

Announcements

  • Project Pitch Presentations
  • SQL Grades, late handins
  • Questions? Concerns? Anything?

2

slide-3
SLIDE 3

Today

3

slide-4
SLIDE 4
  • Functional-programming paradigm (inspired by LISP and friends)
  • Two functions:
  • Map: (in_key, in_value) -> list_of(out_key, intermediate_value)
  • Reduce: (out_key, list_of(intermediate_value)) -> list_of(out_value)

MapReduce

https://research.google.com/archive/mapreduce-osdi04-slides

4

slide-7
SLIDE 7
  • Functional-programming paradigm (inspired by LISP and friends)
  • Two functions:
  • Map: (in_key, in_value) -> list_of(intermediate_key, intermediate_value)
  • Reduce: (out_key, list_of(intermediate_value)) -> list_of(out_value)

MapReduce

https://research.google.com/archive/mapreduce-osdi04-slides

7

slide-8
SLIDE 8
  • Functional-programming paradigm (inspired by LISP and friends)
  • Two functions:
  • Map: (in_key, in_value) -> list_of(intermediate_key, intermediate_value)
  • Reduce: (intermediate_key, list_of(intermediate_value)) -> (out_key, out_value)

MapReduce

https://research.google.com/archive/mapreduce-osdi04-slides
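
The two signatures above can be written down as Python types. This is only a sketch: `MapFn`, `ReduceFn`, and `run_mapreduce` are made-up names, and the driver runs everything sequentially on one machine, which a real MapReduce framework would not do.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable

# Map:    (in_key, in_value) -> list of (intermediate_key, intermediate_value)
MapFn = Callable[[object, object], Iterable[tuple[Hashable, object]]]
# Reduce: (intermediate_key, list of intermediate_value) -> (out_key, out_value)
ReduceFn = Callable[[Hashable, list], tuple[object, object]]

def run_mapreduce(records, map_fn: MapFn, reduce_fn: ReduceFn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for mid_key, mid_value in map_fn(in_key, in_value):
            groups[mid_key].append(mid_value)
    return [reduce_fn(key, values) for key, values in groups.items()]

# Toy usage (hypothetical task): longest word length per first letter.
records = [(1, "hello world"), (2, "hi there")]
result = dict(run_mapreduce(
    records,
    lambda _key, text: [(w[0], len(w)) for w in text.split()],
    lambda letter, lengths: (letter, max(lengths)),
))
print(result)
```

The user only supplies the two functions; the grouping step in the middle belongs to the framework.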

8

slide-9
SLIDE 9
  • Functional-programming paradigm (inspired by LISP and friends)
  • Two functions:
  • Map: (in_key, in_value) -> list_of(intermediate_key, intermediate_value)
  • Reduce: (intermediate_key, list_of(intermediate_value)) -> (out_key, out_value)

MapReduce

https://research.google.com/archive/mapreduce-osdi04-slides

“group by”

9

slide-10
SLIDE 10
  • Functional-programming paradigm (inspired by LISP and friends)
  • Two functions:
  • Map: (in_key, in_value) -> list_of(intermediate_key, intermediate_value)
  • Reduce: (intermediate_key, list_of(intermediate_value)) -> (out_key, out_value)

MapReduce

https://research.google.com/archive/mapreduce-osdi04-slides

Extremely Vague General

10

slide-11
SLIDE 11
  • Functional-programming paradigm (inspired by LISP and friends)
  • Two functions:
  • Map: (in_key, in_value) -> list_of(out_key, intermediate_value)
  • Reduce: (out_key, list_of(intermediate_value)) -> list_of(out_value)

MapReduce

https://research.google.com/archive/mapreduce-osdi04-slides

  • distributed grep
  • distributed sort
  • web link-graph reversal
  • web access log stats
  • inverted index construction
  • document clustering
  • machine learning
  • statistical machine translation
  • …
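
As one illustration of how an application from this list fits the two-function mold, here is a rough sketch of distributed grep. It assumes the mapper emits a line only when it matches and the reducer passes records through unchanged. The function names and the single-process `run` driver are hypothetical.

```python
import re
from collections import defaultdict

def grep_map(line_id, line, pattern):
    """Map: emit the line, keyed by its id, only if it matches the pattern."""
    return [(line_id, line)] if re.search(pattern, line) else []

def grep_reduce(line_id, lines):
    """Reduce: identity; matching lines pass through unchanged."""
    return [(line_id, line) for line in lines]

def run(records, pattern):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for mid_key, mid_value in grep_map(key, value, pattern):
            groups[mid_key].append(mid_value)
    out = []
    for mid_key, values in groups.items():
        out.extend(grep_reduce(mid_key, values))
    return sorted(out)

lines = [(1, "hello world"), (2, "oh hi there world"), (3, "why hello there , world")]
matches = run(lines, r"hello")
print(matches)
```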

11

slide-12
SLIDE 12

Map Reduce

  • One “master” scheduler which assigns tasks (mapping or reducing) to machines
  • No shared state between machines, hence massively parallelizable
  • Assume very high failure rates on workers

12

slide-15
SLIDE 15

Map Reduce

  • One “master” scheduler which assigns tasks (mapping or reducing) to machines
  • No shared state between machines, hence massively parallelizable
  • Assume very high failure rates on workers

You will use Spark in your homework. The same algorithmic ideas apply, with different data/memory management under the hood.

15

slide-16
SLIDE 16

Counting Words

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Documents

16

slide-17
SLIDE 17

Counting Words

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Documents

hello 2
world 4
oh 1
hi 1
there 2
why 1
! 1
how 1
…

Counts for each word

17

slide-18
SLIDE 18

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

18

slide-19
SLIDE 19

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

19

slide-20
SLIDE 20

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

20

slide-21
SLIDE 21

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1) (hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

21

slide-22
SLIDE 22

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Reducer 1 Reducer 2 Reducer 3 Reducer 4 Reducer 5

(hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

22

slide-23
SLIDE 23

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Reducer 1

(hello, 2)

Reducer 2

(world, 4)

Reducer 3

(oh, 1)

Reducer 4

(hi, 1)

Reducer 5

(there, 2) (hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

23

slide-24
SLIDE 24

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Reducer 1

(hello, 2)

Reducer 2

(world, 4)

Reducer 3

(oh, 1)

Reducer 4

(hi, 1)

Reducer 5

(there, 2) (hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

Input

24

slide-25
SLIDE 25

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Reducer 1

(hello, 2)

Reducer 2

(world, 4)

Reducer 3

(oh, 1)

Reducer 4

(hi, 1)

Reducer 5

(there, 2) (hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

Input Map Phase

25

slide-26
SLIDE 26

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Reducer 1

(hello, 2)

Reducer 2

(world, 4)

Reducer 3

(oh, 1)

Reducer 4

(hi, 1)

Reducer 5

(there, 2) (hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

Input Map Phase Shuffle Phase (“Group By”)

26

slide-27
SLIDE 27

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Reducer 1

(hello, 2)

Reducer 2

(world, 4)

Reducer 3

(oh, 1)

Reducer 4

(hi, 1)

Reducer 5

(there, 2) (hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

Input Map Phase Shuffle Phase (“Group By”)

NOT Sort! (No guarantee about order of values…)

27

slide-28
SLIDE 28

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Reducer 1

(hello, 2)

Reducer 2

(world, 4)

Reducer 3

(oh, 1)

Reducer 4

(hi, 1)

Reducer 5

(there, 2) (hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

Input Map Phase Shuffle Phase (“Group By”) Reduce Phase

28

slide-29
SLIDE 29

Input:

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Map Phase:

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Shuffle Phase (“Group By”):

(hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

Reduce Phase:

Reducer 1: (hello, 2)
Reducer 2: (world, 4)
Reducer 3: (oh, 1)
Reducer 4: (hi, 1)
Reducer 5: (there, 2)

Guarantees same key processed together. Use for e.g. uniquing, sorting, etc.
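
The phases can be imitated in a few lines of plain Python. This is a minimal single-process sketch, not how a cluster actually executes the job; the DocIDs and the helper names `map_fn`, `shuffle`, and `reduce_fn` are assumptions for illustration.

```python
from collections import defaultdict

# The four documents from the running example, keyed by a made-up DocID.
documents = {
    1: "hello world",
    2: "oh hi there world",
    3: "why hello there , world",
    4: "world ! how the hell are ya ?",
}

def map_fn(doc_id, text):
    """Map: emit (word, 1) for every whitespace-separated token."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle ("group by"): collect all values that share a key.
    Only the grouping is guaranteed, not any ordering of keys or values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    """Reduce: sum the counts for one word."""
    return (word, sum(counts))

# Map phase: each document could be handled by a different mapper.
mapped = [pair for doc_id, text in documents.items() for pair in map_fn(doc_id, text)]
# Shuffle phase, then reduce phase (one reduce call per distinct key).
word_counts = dict(reduce_fn(word, counts) for word, counts in shuffle(mapped).items())
print(word_counts)
```

Because every `(world, 1)` pair lands in the same group, a single reduce call sees all four of them, which is the guarantee the shuffle provides.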

29

slide-31
SLIDE 31

hello world
oh hi there world
why hello there , world
world ! how the hell are ya ?

Mapper 1 Mapper 2 Mapper 3 Mapper 4

(hello, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (world, 1) (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1) (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)

Reducer 1

(hello, 2)

Reducer 2

(world, 4)

Reducer 3 Reducer 4

(hi, 1) (oh, 1) (there, 2)

Reducer 5

(hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

31

slide-32
SLIDE 32

hello world

oh hi

why hello world ! how Mapper 1 Mapper 3 Mapper 4 Mapper 5

(hello, 1) (world, 1) (there, 1) (world, 1) (there, 1) (,, 1) (world, 1) (the, 1) (hell, 1) (are, 1)

Mapper 2 Mapper 6 there world

(oh, 1) (hi, 1)

there , world Mapper 7 Mapper 7 the hell are ya ?

(why, 1) (hello, 1) (world, 1) (!, 1) (how, 1) (ya, 1) (?, 1)

Reducer 1

(hello, 2)

Reducer 2

(world, 4)

Reducer 3 Reducer 4

(hi, 1) (oh, 1) (there, 2)

Reducer 5

(hello, 1) (hello, 1) (world, 1) (world, 1) (world, 1) (world, 1) (oh, 1) (hi, 1) (there, 1) (there, 1)

32

slide-33
SLIDE 33

Map Reduce

// define your mapper function(s)
def MapFn: (String, String) -> (String, Int) {
  TODO;
}

// define your reduce function(s)
def ReduceFn: (String, List(Int)) -> (String, Int) {
  TODO;
}

// define your pipeline
Table<String, String> table = read(table_path)
Table<String, Int> output = table.MapFn().ReduceFn();
write(output)

33

slide-34
SLIDE 34

Map Reduce

// define your mapper function(s)
def MapFn: (String, String) -> (String, Int) {
  TODO;
}

// define your reduce function(s)
def ReduceFn: (String, List(Int)) -> (String, Int) {
  TODO;
}

// define your pipeline
Table<String, String> table = read(table_path)
Table<String, Int> output = table.MapFn().ReduceFn();
write(output)

Warning: Code Snippets/ Pseudocode (Don’t assume this will look exactly like this in the hw)

34

slide-35
SLIDE 35

// define your mapper function(s)
def MapFn: (String, String) -> (String, Int) {
  TODO;
}

// define your reduce function(s)
def ReduceFn: (String, List(Int)) -> (String, Int) {
  TODO;
}

// define your pipeline
Table<String, String> table = read(table_path)
Table<String, Int> output = table.MapFn().ReduceFn();
write(output)

Map Reduce

table

DocID  Text
1      hello world
2      oh hi there world
3      why hello there , world
4      world ! how the hell are ya ?

35

slide-36
SLIDE 36

// define your mapper function(s)
def MapFn: (String, String) -> (String, Int) {
  TODO;
}

// define your reduce function(s)
def ReduceFn: (String, List(Int)) -> (String, Int) {
  TODO;
}

// define your pipeline
Table<String, String> table = read(table_path)
Table<String, Int> output = table.MapFn().ReduceFn();
write(output)

Map Reduce

output

Word   Count
hello  2
world  4
oh     1
hi     1
there  2

table

DocID  Text
1      hello world
2      oh hi there world
3      why hello there , world
4      world ! how the hell are ya ?

36

slide-37
SLIDE 37

// define your mapper function(s)
def MapFn: (String, String) -> (String, Int) {
  TODO;
}

// define your reduce function(s)
def ReduceFn: (String, List(Int)) -> (String, Int) {
  TODO;
}

// define your pipeline
Table<String, String> table = read(table_path)
Table<String, Int> output = table.MapFn().ReduceFn();
write(output)

Map Reduce

Lots of data types: String, Int, Float, Tuples thereof

37

slide-38
SLIDE 38

Map Reduce

// enumerate occurrences of each word, with count of 1
def MapFn: (String, String) -> (String, Int) {
  for w in input.value().split() {
    emit(w, 1);
  }
}

38

slide-39
SLIDE 39

Map Reduce

// enumerate occurrences of each word, with count of 1
def MapFn: (String, String) -> (String, Int) {
  for w in input.value().split() {
    emit(w, 1);
  }
}

String

39

slide-40
SLIDE 40

Map Reduce

// sum the total counts of each word
def ReduceFn: (String, List(Int)) -> (String, Int) {
  sum = 0;
  for c in input.value() {
    sum += c;
  }
  emit(input.key(), sum);
}

40

slide-41
SLIDE 41

Map Reduce

// sum the total counts of each word
def ReduceFn: (String, List(Int)) -> (String, Int) {
  sum = 0;
  for c in input.value() {
    sum += c;
  }
  emit(input.key(), sum);
}

list of ints (counts)

41

slide-42
SLIDE 42

// sum the total counts of each word
def ReduceFn: (String, List(Int)) -> (String, Int) {
  sum = 0;
  for c in input.value() {
    sum += c;
  }
  emit(input.key(), sum);
}

Map Reduce

the word list of ints (counts)

42

slide-43
SLIDE 43

// enumerate occurrences of each word with count of 1
def MapFn: (String, String) -> (String, Int) {
  for w in input.value().split() {
    emit(w, 1);
  }
}

// sum the total counts of each word
def ReduceFn: (String, List(Int)) -> (String, Int) {
  emit(input.key(), sum([c for c in input.value()]));
}

// define your pipeline
def main() {
  Table<String, String> table = read(table_path)
  Table<String, Int> output = table.MapFn().ReduceFn();
  write(output)
}

Find the number of occurrences of each word?

Input: String
Map: output (word, 1) for every word.
Reduce: Sum counts for each word

43

slide-44
SLIDE 44

Find the number of unique documents that each word occurs in?

(non)Clicker Question!

44

slide-45
SLIDE 45

// enumerate occurrences of each word with count of 1
def MapFn1: String -> (String, Int) {
  ???
}

def ReduceFn1: (String, List(Int)) -> (String, Int) {
  ???
}

// sum the total counts of each word
def ReduceFn2: (String, List(Int)) -> (String, Int) {
  ???
}

// define your pipeline
def main() {
  Table<String, String> table = read(table_path)
  Table<String, Int> output = table.MapFn1().ReduceFn1().ReduceFn2();
  write(output)
}

Find the number of unique documents that each word occurs in?

(non)Clicker Question!

45

slide-46
SLIDE 46

// enumerate occurrences of each word with count of 1
def MapFn1: String -> (String, Int) {
  ???
}

def ReduceFn1: (String, List(Int)) -> (String, Int) {
  ???
}

// sum the total counts of each word
def ReduceFn2: (String, List(Int)) -> (String, Int) {
  ???
}

// define your pipeline
def main() {
  Table<String, String> table = read(table_path)
  Table<String, Int> output = table.MapFn1().ReduceFn1().ReduceFn2();
  write(output)
}

Find the number of unique documents that each word occurs in?

(non)Clicker Question!

No using sets! (use reducers instead)

46

slide-48
SLIDE 48

D1: hello world, just saying hello
D2: oh hi, hi there world
D3: why hello there , world
D4: world ! how the hell are ya ? ? ?

48

slide-49
SLIDE 49

D1: hello world, just saying hello
D2: oh hi, hi there world
D3: why hello there , world
D4: world ! how the hell are ya ? ? ?

Mapper Mapper Mapper Mapper

((D1, hello), 1) ((D1, world), 1) … ((D1, hello), 1) …. …. ((D4, world), 1) … ((D4, ?), 1) ((D4, ?), 1)

49

slide-50
SLIDE 50

D1: hello world, just saying hello
D2: oh hi, hi there world
D3: why hello there , world
D4: world ! how the hell are ya ? ? ?

Mapper Mapper Mapper Mapper

((D1, hello), 1) ((D1, world), 1) … ((D1, hello), 1) …. …. ((D4, world), 1) … ((D4, ?), 1) ((D4, ?), 1)

Reducer 1 Reducer 2 Reducer 3 Reducer 4

50

slide-51
SLIDE 51

D1: hello world, just saying hello
D2: oh hi, hi there world
D3: why hello there , world
D4: world ! how the hell are ya ? ? ?

Mapper Mapper Mapper Mapper

Reducer 1 Reducer 2 Reducer 3 Reducer 4

((D1, hello), 1) ((D1, world), 1) … ((D1, hello), 1) …. …. ((D4, world), 1) … ((D4, ?), 1) ((D4, ?), 1) (hello, 1) (world, 1) (world, 1) (?, 1)

51

slide-52
SLIDE 52

D1: hello world, just saying hello
D2: oh hi, hi there world
D3: why hello there , world
D4: world ! how the hell are ya ? ? ?

Mapper Mapper Mapper Mapper

Reducer 1 Reducer 2 Reducer 3

Reducer 1 Reducer 2 Reducer 3 Reducer 4

((D1, hello), 1) ((D1, world), 1) … ((D1, hello), 1) …. …. ((D4, world), 1) … ((D4, ?), 1) ((D4, ?), 1) (hello, 1) (world, 1) (world, 1) (?, 1)

52

slide-53
SLIDE 53

D1: hello world, just saying hello
D2: oh hi, hi there world
D3: why hello there , world
D4: world ! how the hell are ya ? ? ?

Mapper Mapper Mapper Mapper

Reducer 1 Reducer 2 Reducer 3

((D1, hello), 1) ((D1, world), 1) … ((D1, hello), 1) …. …. ((D4, world), 1) … ((D4, ?), 1) ((D4, ?), 1) (hello, 1) (world, 1) (world, 1) (?, 1) (hello, 2) (world, 4) (?, 1)

Reducer 1 Reducer 2 Reducer 3 Reducer 4

53

slide-54
SLIDE 54

D1: hello world, just saying hello
D2: oh hi, hi there world
D3: why hello there , world
D4: world ! how the hell are ya ? ? ?

Mapper Mapper Mapper Mapper

Reducer 1 Reducer 2 Reducer 3

((D1, hello), 1) ((D1, world), 1) … ((D1, hello), 1) …. …. ((D4, world), 1) … ((D4, ?), 1) ((D4, ?), 1) (hello, 1) (world, 1) (world, 1) (?, 1) (hello, 2) (world, 4) (?, 1)

Reducer 1 Reducer 2 Reducer 3 Reducer 4

Why can’t we use mappers for this step?

54

slide-55
SLIDE 55

D1: hello world, just saying hello
D2: oh hi, hi there world
D3: why hello there , world
D4: world ! how the hell are ya ? ? ?

Mapper Mapper Mapper Mapper

Reducer 1 Reducer 2 Reducer 3

((D1, hello), 1) ((D1, world), 1) … ((D1, hello), 1) …. …. ((D4, world), 1) … ((D4, ?), 1) ((D4, ?), 1) (hello, 1) (world, 1) (world, 1) (?, 1) (hello, 2) (world, 4) (?, 1)

Reducer 1 Reducer 2 Reducer 3 Reducer 4

Why can’t we use mappers for this step?

Same keys won’t necessarily get processed together…

55

slide-57
SLIDE 57

Find the number of unique documents that each word occurs in?

// enumerate occurrences of each word with count of 1
def MapFn1: (String, String) -> ((String, String), Int) {
  for w in input.value().split() {
    emit((input.key(), w), 1)
  }
}

def ReduceFn1: ((String, String), List(Int)) -> (String, Int) {
  emit(input.key()[1], 1)
}

// sum the total counts of each word
def ReduceFn2: (String, List(Int)) -> (String, Int) {
  sum = 0;
  for c in input.value() {
    sum += c;
  }
  emit(input.key(), sum);
}

// define your pipeline
def main() {
  Table<String, String> table = read(table_path)
  Table<String, Int> output = table.MapFn1().ReduceFn1().ReduceFn2();
  write(output)
}

57

slide-58
SLIDE 58

Find the number of unique documents that each word occurs in?

// enumerate occurrences of each word with count of 1
def MapFn1: (String, String) -> ((String, String), Int) {
  for w in input.value().split() {
    emit((input.key(), w), 1)
  }
}

def ReduceFn1: ((String, String), List(Int)) -> (String, Int) {
  emit(input.key()[1], 1)
}

// sum the total counts of each word
def ReduceFn2: (String, List(Int)) -> (String, Int) {
  sum = 0;
  for c in input.value() {
    sum += c;
  }
  emit(input.key(), sum);
}

// define your pipeline
def main() {
  Table<String, String> table = read(table_path)
  Table<String, Int> output = table.MapFn1().ReduceFn1().ReduceFn2();
  write(output)
}

ignore the value list! (“unique”)
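
A small Python imitation of this two-reducer pipeline may make the trick easier to see. It is a sketch under assumptions: the documents are pre-tokenized with spaces around punctuation so that `split()` yields clean words, and `group_by_key` is a made-up stand-in for the framework's shuffle.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle: gather every value emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def map_fn1(doc_id, text):
    # Key on the (doc_id, word) pair so repeats within one document share a key.
    return [((doc_id, word), 1) for word in text.split()]

def reduce_fn1(key, counts):
    # Ignore the value list: emit one record per distinct (doc_id, word) key.
    _doc_id, word = key
    return (word, 1)

def reduce_fn2(word, ones):
    # Each 1 now stands for one distinct document containing the word.
    return (word, sum(ones))

# Assumption: punctuation is already separated by spaces.
documents = {
    "D1": "hello world , just saying hello",
    "D2": "oh hi , hi there world",
    "D3": "why hello there , world",
    "D4": "world ! how the hell are ya ? ? ?",
}

mapped = [pair for doc_id, text in documents.items() for pair in map_fn1(doc_id, text)]
stage1 = [reduce_fn1(key, counts) for key, counts in group_by_key(mapped).items()]
doc_counts = dict(reduce_fn2(word, ones) for word, ones in group_by_key(stage1).items())
print(doc_counts["hello"], doc_counts["world"], doc_counts["?"])
```

The first reducer does the deduplication that a set would otherwise do: "hello" appears twice in D1 but contributes only one `(hello, 1)` record.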

58

slide-59
SLIDE 59

Clicker Question!

59

slide-60
SLIDE 60

Find the number of unique documents that each word occurs in?

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit((input.key(), w), 1) }
}
def ReduceFn1: {
  emit(input.key()[1], 1)
}
// sum the total counts of each word
def ReduceFn2: {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

60

slide-61
SLIDE 61

Find the number of unique documents that each word occurs in?

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit(input.key(), w) }
}
def ReduceFn1: {
  for w in input.value() { emit(w, 1) }
}
// sum the total counts of each word
def ReduceFn2: (S, I) -> (S, I) {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit((input.key(), w), 1) }
}
def ReduceFn1: {
  emit(input.key()[1], 1)
}
// sum the total counts of each word
def ReduceFn2: {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

61

slide-62
SLIDE 62

Find the number of unique documents that each word occurs in?

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit(input.key(), w) }
}
def ReduceFn1: {
  for w in input.value() { emit(w, 1) }
}
// sum the total counts of each word
def ReduceFn2: (S, I) -> (S, I) {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit((input.key(), w), 1) }
}
def ReduceFn1: {
  emit(input.key()[1], 1)
}
// sum the total counts of each word
def ReduceFn2: {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

Clicker Question! Do these produce the same output?
(a) Yes
(b) No

62

slide-64
SLIDE 64

Find the number of unique documents that each word occurs in?

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit(input.key(), w) }
}
def ReduceFn1: {
  for w in input.value() { emit(w, 1) }
}
// sum the total counts of each word
def ReduceFn2: (S, I) -> (S, I) {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit((input.key(), w), 1) }
}
def ReduceFn1: {
  emit(input.key()[1], 1)
}
// sum the total counts of each word
def ReduceFn2: {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

Do these produce the same output?
(a) Yes
(b) No

unique documents a word occurs in

Clicker Question!

64

slide-65
SLIDE 65

Find the number of unique documents that each word occurs in?

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit(input.key(), w) }
}
def ReduceFn1: {
  for w in input.value() { emit(w, 1) }
}
// sum the total counts of each word
def ReduceFn2: (S, I) -> (S, I) {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

// enumerate occurrences of each word with count of 1
def MapFn1: {
  for w in input.value().split() { emit((input.key(), w), 1) }
}
def ReduceFn1: {
  emit(input.key()[1], 1)
}
// sum the total counts of each word
def ReduceFn2: {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

Do these produce the same output?
(a) Yes
(b) No

unique documents a word occurs in ???

Clicker Question!

65

slide-66
SLIDE 66

Clicker Question! What will this produce?

Input K: V
Doc1: here are some words
Doc2: words words words
Doc3: here are words

def MapFn1: (S, S) -> (S, S) {
  for w in input.value().split() { emit(input.key(), w) }
}
def ReduceFn1: (S, S) -> (S, I) {
  for w in input.value() { emit(w, 1) }
}
def ReduceFn2: (S, I) -> (S, I) {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

(a) here: 2, are: 2, some: 1, words: 3
(b) here: 2, are: 2, some: 1, words: 5
(c) here: 1, are: 1, some: 1, words: 1

66

slide-68
SLIDE 68

Clicker Question! What will this produce?

Input K: V
Doc1: here are some words
Doc2: words words words
Doc3: here are words

def MapFn1: (S, S) -> (S, S) {
  for w in input.value().split() { emit(input.key(), w) }
}
def ReduceFn1: (S, S) -> (S, I) {
  for w in input.value() { emit(w, 1) }
}
def ReduceFn2: (S, I) -> (S, I) {
  sum = 0;
  for (w, c) in input { sum += c; }
  emit(w, sum);
}

(a) here: 2, are: 2, some: 1, words: 3
(b) here: 2, are: 2, some: 1, words: 5
(c) here: 1, are: 1, some: 1, words: 1

Reducer is by DocId only, so just counts total occurrences
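
The point can be checked by imitating the pipeline in plain Python. This is a single-process sketch; `group_by_key` is a hypothetical stand-in for the shuffle, and the function names mirror the pseudocode.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle: gather every value emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def map_fn1(doc_id, text):
    # Keyed by DocID only, so the word is carried as the value.
    return [(doc_id, word) for word in text.split()]

def reduce_fn1(doc_id, words):
    # Every word in the value list is re-emitted, repeats included.
    return [(word, 1) for word in words]

def reduce_fn2(word, ones):
    return (word, sum(ones))

docs = {
    "Doc1": "here are some words",
    "Doc2": "words words words",
    "Doc3": "here are words",
}

mapped = [pair for doc_id, text in docs.items() for pair in map_fn1(doc_id, text)]
stage1 = [pair for doc_id, words in group_by_key(mapped).items()
          for pair in reduce_fn1(doc_id, words)]
totals = dict(reduce_fn2(word, ones) for word, ones in group_by_key(stage1).items())
print(totals)
```

Nothing in this version collapses the three copies of "words" in Doc2, so the result is the total occurrence count, option (b).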

68

slide-69
SLIDE 69

Other MapReduce Functions

  • Sort
  • Unique
  • Sample
  • First
  • Filter
  • Join

69

slide-71
SLIDE 71

Other MapReduce Functions

  • Sort
  • Unique
  • Sample
  • First
  • Filter
  • Join
  • Joins are usually computed “under the hood” by most MR implementations (like in SQL)
  • But you can imagine having to do them yourself…
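
Two of the simpler functions in the list, Filter and Unique, can be sketched by hand as map/reduce steps. These are illustrative guesses at how such operations could be expressed, not the implementation of any particular framework; all names are hypothetical.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle: gather every value emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

words = "hello world oh hi there world why hello there".split()

# Filter: a map-only job; the mapper emits a record only if it passes the test.
def filter_map(word):
    return [(word, None)] if len(word) > 2 else []

# Unique: the mapper keys on the record itself, and the reducer emits each
# key once, ignoring how many values were grouped under it.
def unique_map(word):
    return [(word, None)]

def unique_reduce(word, _values):
    return word

filtered = [key for w in words for key, _ in filter_map(w)]
unique = sorted(
    unique_reduce(key, values)
    for key, values in group_by_key(pair for w in words for pair in unique_map(w)).items()
)
print(filtered)
print(unique)
```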

71

slide-72
SLIDE 72

Real Life Application

72

slide-73
SLIDE 73

Is Charles Mingus a composer?

Real Life Application

73

slide-74
SLIDE 74

Is Charles Mingus a composer?

“Mingus is a composer”

Real Life Application

74

slide-76
SLIDE 76

Is Charles Mingus a 1950s American jazz composer?

“Mingus is a 1950s American jazz composer”

Real Life Application

76

slide-77
SLIDE 77

Is Charles Mingus a 1950s American jazz composer?

Real Life Application

77

slide-78
SLIDE 78

Is Charles Mingus a 1950s American jazz composer?

“Mingus is a 1950s American jazz composer”

… if Mingus is a composer worthy of our attention, it must be because…
A virtuoso bassist and composer, Mingus irrevocably changed the face of jazz…
Mingus dominated the scene back in the 1950s and 1960s.
Mingus was truly a product of America in all its historic complexities…

Real Life Application

78

slide-79
SLIDE 79

ComposerX is a 1950s composer. ComposerX dominated the scene back in the 1950s and 1960s.

Real Life Application

79

slide-80
SLIDE 80

Real Life Application

Subject         Predicate   Object
Barack Obama    won         the electoral vote
Kamala Lopez    wrote       an op-ed for HuffPo
Charles Mingus  wrote       jazz
Barack Obama    opposed     the appropriations bill
Barack Obama    listens to  jazz

Category                    Entity
Person                      Barack Obama
Person                      Kamala Lopez
Person                      Charles Mingus
Huffington Post Columnists  Barack Obama
Huffington Post Columnists  Kamala Lopez
US Presidents               Barack Obama
Jazz Composers              Charles Mingus

80

slide-81
SLIDE 81

Subject         Predicate   Object
Barack Obama    won         the electoral vote
Kamala Lopez    wrote       an op-ed for HuffPo
Charles Mingus  wrote       jazz
Barack Obama    opposed     the appropriations bill
Barack Obama    listens to  jazz

Joins

Desired output:

Subject       Predicate  Object               Categories
Barack Obama  won        the electoral vote   Person, US_Presidents, Huffington_Post_Columnists
Kamala Lopez  wrote      an op-ed for HuffPo  Person, Huffington_Post_Columnists, Actor

Category                    Entity
Person                      Barack Obama
Person                      Kamala Lopez
Person                      Charles Mingus
Huffington Post Columnists  Barack Obama
Huffington Post Columnists  Kamala Lopez
US Presidents               Barack Obama
Jazz Composers              Charles Mingus

81

slide-83
SLIDE 83

Joins

Facts

Subject | Predicate | Object
Barack Obama | won | the electoral vote
Kamala Lopez | wrote | an op-ed for HuffPo
Charles Mingus | wrote | jazz
Barack Obama | opposed | the appropriations bill
Barack Obama | listens to | jazz

Categories

Category | Entity
Person | Barack Obama
Person | Kamala Lopez
Person | Charles Mingus
Huffington Post Columnists | Barack Obama
Huffington Post Columnists | Kamala Lopez
US Presidents | Barack Obama
Jazz Composers | Charles Mingus

SELECT * FROM Facts, Categories WHERE Subject = Entity GROUP BY Subject

83

slide-85
SLIDE 85

Joins

Facts

Subject | Predicate | Object
Barack Obama | won | the electoral vote
Kamala Lopez | wrote | an op-ed for HuffPo
Charles Mingus | wrote | jazz
Barack Obama | opposed | the appropriations bill
Barack Obama | listens to | jazz

Categories

Category | Entity
Person | Barack Obama
Person | Kamala Lopez
Person | Charles Mingus
Huffington Post Columnists | Barack Obama
Huffington Post Columnists | Kamala Lopez
US Presidents | Barack Obama
Jazz Composers | Charles Mingus

SELECT * FROM Facts, Categories WHERE Subject = Entity GROUP BY Subject

Key: String
Value: (list_of((String, String, String)), list_of((String, String)))

85

slide-86
SLIDE 86

Joins

Facts

Subject | Predicate | Object
Barack Obama | won | the electoral vote
Kamala Lopez | wrote | an op-ed for HuffPo
Charles Mingus | wrote | jazz
Barack Obama | opposed | the appropriations bill
Barack Obama | listens to | jazz

Categories

Category | Entity
Person | Barack Obama
Person | Kamala Lopez
Person | Charles Mingus
Huffington Post Columnists | Barack Obama
Huffington Post Columnists | Kamala Lopez
US Presidents | Barack Obama
Jazz Composers | Charles Mingus

SELECT * FROM Facts, Categories WHERE Subject = Entity GROUP BY Subject

Key: String (the entity)
Value: (list_of((String, String, String)), list_of((String, String)))

86

slide-87
SLIDE 87

Joins

Facts

Subject | Predicate | Object
Barack Obama | won | the electoral vote
Kamala Lopez | wrote | an op-ed for HuffPo
Charles Mingus | wrote | jazz
Barack Obama | opposed | the appropriations bill
Barack Obama | listens to | jazz

Categories

Category | Entity
Person | Barack Obama
Person | Kamala Lopez
Person | Charles Mingus
Huffington Post Columnists | Barack Obama
Huffington Post Columnists | Kamala Lopez
US Presidents | Barack Obama
Jazz Composers | Charles Mingus

SELECT * FROM Facts, Categories WHERE Subject = Entity GROUP BY Subject

Key: String
Value: (list_of((String, String, String)), list_of((String, String)))
First list, list_of((String, String, String)): all the facts for that entity

87

slide-88
SLIDE 88

Joins

Facts

Subject | Predicate | Object
Barack Obama | won | the electoral vote
Kamala Lopez | wrote | an op-ed for HuffPo
Charles Mingus | wrote | jazz
Barack Obama | opposed | the appropriations bill
Barack Obama | listens to | jazz

Categories

Category | Entity
Person | Barack Obama
Person | Kamala Lopez
Person | Charles Mingus
Huffington Post Columnists | Barack Obama
Huffington Post Columnists | Kamala Lopez
US Presidents | Barack Obama
Jazz Composers | Charles Mingus

SELECT * FROM Facts, Categories WHERE Subject = Entity GROUP BY Subject

Key: String
Value: (list_of((String, String, String)), list_of((String, String)))
Second list, list_of((String, String)): all the categories for that entity

88

slide-89
SLIDE 89

Joins

Facts

Subject | Predicate | Object
Barack Obama | won | the electoral vote
Kamala Lopez | wrote | an op-ed for HuffPo
Charles Mingus | wrote | jazz
Barack Obama | opposed | the appropriations bill
Barack Obama | listens to | jazz

Categories

Category | Entity
Person | Barack Obama
Person | Kamala Lopez
Person | Charles Mingus
Huffington Post Columnists | Barack Obama
Huffington Post Columnists | Kamala Lopez
US Presidents | Barack Obama
Jazz Composers | Charles Mingus

SELECT * FROM Facts, Categories WHERE Subject = Entity GROUP BY Subject

Key: String
Value: (list_of((String, String, String)), list_of((String, String)))
Second list, list_of((String, String)): all the categories for that entity

// rekey table by entity
def MapFn1: (String, Obj) -> (String, Obj) {
  emit(input.value().entity(), input.value())
}

// rekey table by subject
def MapFn2: (String, Obj) -> (String, Obj) {
  emit(input.value().subject(), input.value())
}

// define your pipeline
def main() {
  Table<String, Obj> cats = read(table1_path).MapFn1()
  Table<String, Obj> facts = read(table2_path).MapFn2()
  output = cats.join(facts).MapFn3(. . .
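The rekey-then-join idea above can be sketched in plain Python. This is only a single-process illustration, not the framework the slide's pseudocode targets: the helper names (`map_facts`, `map_categories`, `shuffle`, `reduce_join`) are hypothetical, and the "fact"/"category" tags are an assumption used to tell the two tables apart after grouping.

```python
from collections import defaultdict

# Rows copied from the Facts and Categories tables on the slide.
facts = [
    ("Barack Obama", "won", "the electoral vote"),
    ("Kamala Lopez", "wrote", "an op-ed for HuffPo"),
    ("Charles Mingus", "wrote", "jazz"),
    ("Barack Obama", "opposed", "the appropriations bill"),
    ("Barack Obama", "listens to", "jazz"),
]
categories = [
    ("Person", "Barack Obama"),
    ("Person", "Kamala Lopez"),
    ("Person", "Charles Mingus"),
    ("Huffington Post Columnists", "Barack Obama"),
    ("Huffington Post Columnists", "Kamala Lopez"),
    ("US Presidents", "Barack Obama"),
    ("Jazz Composers", "Charles Mingus"),
]


def map_facts(row):
    """Rekey a fact row by its subject (hypothetical stand-in for MapFn2)."""
    subject, _, _ = row
    yield subject, ("fact", row)


def map_categories(row):
    """Rekey a category row by its entity (hypothetical stand-in for MapFn1)."""
    _, entity = row
    yield entity, ("category", row)


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, tagged_rows):
    """Emit Key: entity, Value: (all facts, all categories) for that entity."""
    fact_rows = [row for tag, row in tagged_rows if tag == "fact"]
    category_rows = [row for tag, row in tagged_rows if tag == "category"]
    return key, (fact_rows, category_rows)


def join(facts, categories):
    pairs = []
    for row in facts:
        pairs.extend(map_facts(row))
    for row in categories:
        pairs.extend(map_categories(row))
    return dict(reduce_join(key, rows) for key, rows in shuffle(pairs).items())


joined = join(facts, categories)
```

Here `joined["Barack Obama"]` holds a pair of lists, matching the Key/Value shape on the slide: the entity's facts first, then its categories.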

89

slide-90
SLIDE 90

Bottlenecks!

90

slide-91
SLIDE 91

Doc1 Doc2 … DocN
Mappers: (DocID, Doc) -> (DocID, Sent)
Sent1 Sent2 … SentM
Mappers: (DocID, Sent) -> (Word, Count)
Word1 Word2 … WordK
Reducers: (Word, Count) -> Word, sum(Count)

91

slide-92
SLIDE 92

Doc1 Doc2 … DocN
Mappers: (DocID, Doc) -> (DocID, Sent)
Sent1 Sent2 … SentM
Mappers: (DocID, Sent) -> (Word, Count)
Word1 Word2 … WordK
Reducers: (Word, Count) -> Word, sum(Count)

Clicker Question!

92

slide-93
SLIDE 93

Doc1 Doc2 … DocN
Mappers: (DocID, Doc) -> (DocID, Sent)
Sent1 Sent2 … SentM
Mappers: (DocID, Sent) -> (Word, Count)
Word1 Word2 … WordK
Reducers: (Word, Count) -> Word, sum(Count)

Clicker Question! In the best-case scenario, how much parallelization could we get here (maximum number of mappers)? (a) N (b) log(N) (c) 5

93

slide-95
SLIDE 95

Doc1 Doc2 … DocN
Mappers: (DocID, Doc) -> (DocID, Sent)
Sent1 Sent2 … SentM
Mappers: (DocID, Sent) -> (Word, Count)
Word1 Word2 … WordK
Reducers: (Word, Count) -> Word, sum(Count)

Clicker Question! How about here? (a) N (b) M (c) N*M

95

slide-97
SLIDE 97

Doc1 Doc2 … DocN
Mappers: (DocID, Doc) -> (DocID, Sent)
Sent1 Sent2 … SentM
Mappers: (DocID, Sent) -> (Word, Count)
Word1 Word2 … WordK
Reducers: (Word, Count) -> Word, sum(Count)

Clicker Question! How about here? (a) N (b) M (c) N*M

Mapping doesn’t require the same keys to route to the same machine.

97

slide-98
SLIDE 98

Clicker Question! Which is (likely to be) faster?

(a) Mapper1: (DocID, Doc) -> (DocID, Sent)
    Mapper2: (DocID, Sent) -> (Word, Count)
    Reducer: (Word, Count) -> Word, sum(Count)

(b) Mapper: (DocID, Doc) -> (Word, Count)
    Reducer: (Word, Count) -> Word, sum(Count)

(c) They are the same

98

slide-99
SLIDE 99

Clicker Question! Which is (likely to be) faster?

(a) Mapper1: (DocID, Doc) -> (DocID, Sent)
    Mapper2: (DocID, Sent) -> (Word, Count)
    Reducer: (Word, Count) -> Word, sum(Count)

(b) Mapper: (DocID, Doc) -> (Word, Count)
    Reducer: (Word, Count) -> Word, sum(Count)

(c) They are the same

Doc = list_of(Sentence)
Sentence = list_of(Word)

99

slide-101
SLIDE 101

Clicker Question! Which is (likely to be) faster?

(a) Mapper1: (DocID, Doc) -> (DocID, Sent)
    Mapper2: (DocID, Sent) -> (Word, Count)
    Reducer: (Word, Count) -> Word, sum(Count)

(b) Mapper: (DocID, Doc) -> (Word, Count)
    Reducer: (Word, Count) -> Word, sum(Count)

(c) They are the same

Smaller jobs = more dynamic load balancing and faster recovery from failure

101

slide-102
SLIDE 102

Clicker Question! Which is (likely to be) faster?

(a) Mapper1: (DocID, Doc) -> (DocID, Sent)
    Mapper2: (DocID, Sent) -> (Word, Count)
    Reducer: (Word, Count) -> Word, sum(Count)

(b) Mapper: (DocID, Doc) -> (Word, Count)
      for sentence in doc:
        for word in sentence:
          blah blah
    Reducer: (Word, Count) -> Word, sum(Count)

(c) They are the same

In general, nested loops should be refactored into multiple mappers
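A minimal Python sketch of that refactoring, assuming a toy in-memory corpus (the `docs` dictionary and all function names are hypothetical). It runs sequentially, so it shows no speedup; it only shows that the two pipelines compute the same counts while the second one exposes smaller units of work (sentences rather than whole documents) that a framework could schedule independently.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy corpus: DocID -> list of sentences.
docs = {
    "d1": ["the cat sat", "the dog ran"],
    "d2": ["a cat ran"],
}


def nested_mapper(doc_id, doc):
    """Single mapper with a nested loop: (DocID, Doc) -> (Word, Count)."""
    for sentence in doc:
        for word in sentence.split():
            yield word, 1


def mapper1(doc_id, doc):
    """First stage: (DocID, Doc) -> (DocID, Sent)."""
    for sentence in doc:
        yield doc_id, sentence


def mapper2(doc_id, sentence):
    """Second stage: (DocID, Sent) -> (Word, Count)."""
    for word in sentence.split():
        yield word, 1


def reducer(pairs):
    """(Word, Count) -> Word, sum(Count), done here for all words at once."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# One mapper per document: two map tasks for this corpus.
one_stage = reducer(
    chain.from_iterable(nested_mapper(k, v) for k, v in docs.items())
)

# Two chained mappers: the second stage has one task per sentence (three here).
sentences = list(chain.from_iterable(mapper1(k, v) for k, v in docs.items()))
two_stage = reducer(chain.from_iterable(mapper2(k, s) for k, s in sentences))
```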

102

slide-103
SLIDE 103

Doc1 Doc2 … DocN
Mappers: (DocID, Doc) -> (Sent, 1)
Sent1 Sent2 … SentM
Mappers: (Sent, 1) -> (Word, Count)
Word1 Word2 … WordK
Reducers: (Word, Count) -> Word, sum(Count)

(non)Clicker Question! What might be bad here?

103

slide-104
SLIDE 104

Doc1 Doc2 … DocN
Mappers: (DocID, Doc) -> (Sent, 1)
Sent1 Sent2 … SentM
Mappers: (Sent, 1) -> (Word, Count)
Word1 Word2 … WordK
Reducers: (Word, Count) -> Word, sum(Count)

(non)Clicker Question! What might be bad here?

Skewed Key Distributions! (Need all values with the same key to be together, so can’t automatically load balance)
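To make the load-balancing problem concrete, here is a small Python sketch under stated assumptions: records are routed to reducers by a hash of the key (CRC32 is used only so the result is repeatable; real frameworks choose their own partitioner), and the key list is invented for illustration. The names `reducer_for` and `partition_loads` are hypothetical.

```python
import zlib
from collections import Counter


def reducer_for(key, num_reducers):
    """Pick a reducer for a key; every record with this key goes to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_loads(keys, num_reducers):
    """Count how many records each reducer receives."""
    loads = Counter(reducer_for(key, num_reducers) for key in keys)
    return [loads[i] for i in range(num_reducers)]


# Invented skewed input: one very common key plus 30 rare ones.
skewed_keys = ["the"] * 70 + [f"word{i}" for i in range(30)]
loads = partition_loads(skewed_keys, num_reducers=4)
```

Whichever reducer is assigned "the" receives at least 70 of the 100 records, and adding more reducers does not help, because a single key cannot be split across machines.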

104

slide-105
SLIDE 105

Zipf’s Law

[Plot: word frequency vs. word rank]

https://en.wikipedia.org/wiki/Zipf%27s_law

105

slide-106
SLIDE 106

Zipf’s Law

[Plot: word frequency vs. word rank]

“The frequency of any word is inversely proportional to its rank in the frequency table” (Wikipedia)

https://en.wikipedia.org/wiki/Zipf%27s_law

106

slide-107
SLIDE 107

Zipf’s Law

[Plot: word frequency vs. word rank]

the = 7%

https://en.wikipedia.org/wiki/Zipf%27s_law

“The frequency of any word is inversely proportional to its rank in the frequency table” (Wikipedia)

107

slide-108
SLIDE 108

Zipf’s Law

[Plot: word frequency vs. word rank]

the = 7%
of = 3.5%

https://en.wikipedia.org/wiki/Zipf%27s_law

“The frequency of any word is inversely proportional to its rank in the frequency table” (Wikipedia)

108

slide-109
SLIDE 109

Zipf’s Law

[Plot: word frequency vs. word rank]

The most frequent 0.2% of words make up 50% of occurrences.
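A rough Python check of how concentrated an idealized Zipf distribution is. This is a sketch under assumptions: frequency is taken to be exactly proportional to 1/rank, and the vocabulary size of 100,000 is an arbitrary choice, so the result will not match real corpus statistics exactly. The function name `zipf_share` is hypothetical.

```python
def zipf_share(top_k, vocab_size):
    """Share of all occurrences covered by the top_k ranks under an ideal 1/rank law."""
    total = sum(1.0 / rank for rank in range(1, vocab_size + 1))
    head = sum(1.0 / rank for rank in range(1, top_k + 1))
    return head / total


vocab_size = 100_000        # assumed vocabulary size, for illustration only
top_k = vocab_size // 500   # the most frequent 0.2% of words
share = zipf_share(top_k, vocab_size)

# Under the same law, rank 1 is twice as frequent as rank 2,
# which is consistent with "the = 7%" and "of = 3.5%".
rank1 = zipf_share(1, vocab_size)
rank2 = zipf_share(2, vocab_size) - rank1
```

With these assumptions `share` comes out at roughly one half, in the same range as the figure quoted on the slide.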

109

slide-110
SLIDE 110

Subject | Predicate | Object | Categories
Barack Obama | won | the electoral vote | Person, US_Presidents, Huffington_Post_Columnists
Kamala Lopez | wrote | an op-ed for HuffPo | Person, Huffington_Post_Columnists, Actor

Predicate | Object | Category | Score
won | the electoral vote | US_Presidents | 0.92
won | the electoral vote | Person | 0.89
won | the electoral vote | Huffington Post Columnists | 0.23
wrote | an op-ed for HuffPo | Huffington Post Columnists | 0.99
wrote | an op-ed for HuffPo | Person | 0.91

Real Life Application

110

slide-111
SLIDE 111

Subject | Predicate | Object | Categories
Barack Obama | won | the electoral vote | Person, US_Presidents, Huffington_Post_Columnists
Kamala Lopez | wrote | an op-ed for HuffPo | Person, Huffington_Post_Columnists, Actor

Predicate | Object | Category | Score
won | the electoral vote | US_Presidents | 702,345
won | the electoral vote | Person | 812,485
won | the electoral vote | Huffington Post Columnists | 24,571
wrote | an op-ed for HuffPo | Huffington Post Columnists | 134,213
wrote | an op-ed for HuffPo | Person | 136,091

Real Life Application

111

slide-112
SLIDE 112

Mapper1: (subject, predicate, object), list_of(categories) -> category, (predicate, object)
Reducer1: category, list_of(predicate, object) -> (category, predicate, object), 1
Reducer2: (category, predicate, object), list_of(count) -> (category, predicate, object), total

First Attempt

112

slide-114
SLIDE 114

Mapper1: (subject, predicate, object), list_of(categories) -> category, (predicate, object)
Reducer1: category, list_of(predicate, object) -> (category, predicate, object), 1
Reducer2: (category, predicate, object), list_of(count) -> (category, predicate, object), total

First Attempt

Every tuple involving a single category (e.g. “Person”) has to go through the same reducer…

114

slide-117
SLIDE 117

Mapper1: (subject, predicate, object), list_of(categories) -> (category, predicate, object), 1
Reducer2: (category, predicate, object), list_of(count) -> (category, predicate, object), total

So much better!
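The difference between the two designs can be sketched in Python by counting how many records share each intermediate key. This is an illustrative single-process sketch: the two input rows come from the joined table shown earlier, and the function names (`mapper_by_category`, `mapper_by_triple`, `records_per_key`, `count_triples`) are hypothetical.

```python
from collections import Counter

# Joined rows: ((subject, predicate, object), list_of(categories)).
rows = [
    (("Barack Obama", "won", "the electoral vote"),
     ["Person", "US_Presidents", "Huffington_Post_Columnists"]),
    (("Kamala Lopez", "wrote", "an op-ed for HuffPo"),
     ["Person", "Huffington_Post_Columnists", "Actor"]),
]


def mapper_by_category(fact, categories):
    """First attempt: key by category alone."""
    _, predicate, obj = fact
    for category in categories:
        yield category, (predicate, obj)


def mapper_by_triple(fact, categories):
    """Revised version: key by (category, predicate, object) and emit a count of 1."""
    _, predicate, obj = fact
    for category in categories:
        yield (category, predicate, obj), 1


def records_per_key(mapper, rows):
    """How many intermediate records land on each key (and so on one reducer)."""
    return Counter(key for fact, cats in rows for key, _ in mapper(fact, cats))


def count_triples(rows):
    """Reducer2 equivalent: sum the 1s for each (category, predicate, object)."""
    totals = Counter()
    for fact, cats in rows:
        for key, one in mapper_by_triple(fact, cats):
            totals[key] += one
    return dict(totals)


first_attempt = records_per_key(mapper_by_category, rows)
revised = records_per_key(mapper_by_triple, rows)
totals = count_triples(rows)
```

In `first_attempt`, every row tagged "Person" piles onto one key, and that pile grows with the whole dataset. In `revised`, a key only collects rows that share the same category, predicate, and object, so the work spreads over many more keys.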

117

slide-118
SLIDE 118
ok ok ok go go go.

enjoy the long weekend!

118