Data-Intensive Distributed Computing
Part 3: Analyzing Text (2/2)
CS 451/651 431/631 (Winter 2018)
January 30, 2018
Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Source: http://www.flickr.com/photos/guvnah/7861418602/
[Figure: abstract IR architecture. A query and the document collection are mapped into a query representation and a document representation, which a comparison function matches to produce results.]
[Figure: the toy collection used throughout. Doc 1: "one fish, two fish"; Doc 2: "red fish, blue fish"; Doc 3: "cat in the hat"; Doc 4: "green eggs and ham". Successive builds show the term-document incidence matrix over the vocabulary (blue, cat, egg, fish, green, ham, hat, red, two, …) and the corresponding inverted index, where each term maps to a postings list of (docid, term frequency) pairs, e.g. fish → (1, 2), (2, 2).]
[Figure: the same index extended with positional information: each posting also records the offsets at which the term occurs in the document, e.g. fish → (1, 2, [2,4]), (2, 2, [2,4]).]
[Figure: inverted indexing in MapReduce on the toy collection. Each mapper emits (term, (docid, tf)) pairs for its document, e.g. (fish, (1, 2)) and (two, (1, 1)) from Doc 1. Shuffle and sort: aggregate values by keys, so each reducer receives a term together with all of its postings.]
class Mapper {
  def map(docid: Long, doc: String) = {
    val counts = new Map()
    for (term <- tokenize(doc)) {
      counts(term) += 1                  // term frequencies within this document
    }
    for ((term, tf) <- counts) {
      emit(term, (docid, tf))            // one posting per unique term
    }
  }
}

class Reducer {
  def reduce(term: String, postings: Iterable[(docid, tf)]) = {
    val p = new List()
    for ((docid, tf) <- postings) {
      p.append((docid, tf))              // buffer the term's entire postings list
    }
    p.sort()                             // sort by docid, in memory
    emit(term, p)
  }
}
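This baseline works, but note what the reducer is doing: it buffers every posting for a term in memory and then sorts the whole list itself, which becomes untenable for frequent terms in large collections. The fix is value-to-key conversion, illustrated next.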
[Figure: value-to-key conversion. Instead of one key fish with the values (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), emit composite keys (fish, 1), (fish, 9), (fish, 21), (fish, 34), (fish, 35), (fish, 80), each carrying just the tf as value, so the framework sorts postings by docid during the shuffle.]
Let the framework do the sorting!
class Mapper {
  def map(docid: Long, doc: String) = {
    val counts = new Map()
    for (term <- tokenize(doc)) {
      counts(term) += 1
    }
    for ((term, tf) <- counts) {
      emit((term, docid), tf)            // value-to-key: docid moves into the key
    }
  }
}

class Reducer {
  var prev = null
  val postings = new PostingsList()

  def reduce(key: Pair, tf: Iterable[Int]) = {
    if (prev != null && key.term != prev) {
      emit(prev, postings)               // previous term's postings are complete
      postings.reset()
    }
    postings.append(key.docid, tf.first) // postings arrive already sorted by docid
    prev = key.term
  }

  def cleanup() = {
    emit(prev, postings)                 // don't forget the final term
  }
}
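One more piece is needed for correctness: with (term, docid) as the key, the default partitioner would scatter a term's postings across reducers. A minimal sketch of the required custom partitioner, assuming Hadoop's Partitioner API and the hypothetical Pair key class from the pseudocode above:

import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.mapreduce.Partitioner

// Partition on the term alone, so every (term, docid) key for a given term
// reaches the same reducer, while the framework still sorts on the full
// composite key. Pair is the hypothetical composite key from the slides.
class TermPartitioner extends Partitioner[Pair, IntWritable] {
  override def getPartition(key: Pair, value: IntWritable, numPartitions: Int): Int =
    (key.term.hashCode & Int.MaxValue) % numPartitions
}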
[Figure: postings compression. The sorted docids 1, 9, 21, 34, 35, 80 in fish's postings are replaced by the gaps 1, 8, 12, 13, 1, 45 between consecutive docids. Gaps are small integers, so a variable-length code packs them tightly: one byte covers values up to 7 bits, two bytes up to 14 bits, three bytes up to 21 bits.]
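A minimal sketch of such a variable-byte encoder, assuming one common convention (7 payload bits per byte, high bit set on the last byte of a value):

// Turn sorted docids into gaps, then write each gap in as few bytes as possible.
def encodeVByte(value: Int, out: java.io.ByteArrayOutputStream): Unit = {
  var v = value
  while (v >= 128) {
    out.write(v & 0x7f)   // low 7 bits, continuation byte (high bit clear)
    v >>>= 7
  }
  out.write(v | 0x80)     // final byte of this value (high bit set)
}

val docids = List(1, 9, 21, 34, 35, 80)
val gaps = docids.head :: docids.sliding(2).map { case Seq(a, b) => b - a }.toList
// gaps == List(1, 8, 12, 13, 1, 45): each fits in a single byte here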
[Figure: a special symbol « paired with each term, e.g. keys (fish, «) and (two, «) emitted alongside (fish, 1) and (two, 1) for Doc 1. Because « sorts before any docid, each term's « pairs reach the reducer ahead of its postings, letting the reducer accumulate per-term statistics such as the document frequency before it writes the postings list (the order-inversion pattern).]
class Mapper {
  val m = new Map()                      // partial postings, buffered across documents

  def map(docid: Long, doc: String) = {
    val counts = new Map()
    for (term <- tokenize(doc)) {
      counts(term) += 1
    }
    for ((term, tf) <- counts) {
      m(term).append((docid, tf))
    }
    if (memoryFull()) flush()
  }

  def cleanup() = {
    flush()                              // emit whatever is still buffered
  }

  def flush() = {
    for (term <- m.keys) {
      emit(term, new PostingsList(m(term)))
    }
    m.clear()
  }
}

class Reducer {
  def reduce(term: String, lists: Iterable[PostingsList]) = {
    var f = new PostingsList()
    for (list <- lists) {
      f = f + list                       // merge the partial postings lists
    }
    emit(term, f)
  }
}
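The buffering is the whole point: instead of emitting one (docid, tf) pair per unique term per document, each flush emits partially built postings lists, which drastically shrinks the intermediate data the framework must shuffle. That difference is exactly what the IP-versus-LP comparison below measures.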
[Plot: indexing time (minutes) versus number of documents (millions, up to 100) for the IP and LP algorithms. Both scale linearly (R² = 0.994 and R² = 0.996), with LP consistently faster.]
Alg.   Time       Intermediate Pairs   Intermediate Size
IP     38.5 min   13 × 10^9            306 × 10^9 bytes
LP     29.6 min   614 × 10^6           85 × 10^9 bytes
From: Elsayed et al., Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother? 2010
Experiments on the ClueWeb09 collection (segments 1 + 2): 101.8m documents (472 GB compressed, 2.97 TB uncompressed).
In Spark, this aggregation pattern is captured by aggregateByKey, which turns an RDD[(K, V)] into an RDD[(K, U)] given a zero value of type U, a seqOp: (U, V) ⇒ U that folds one value into a partial aggregate, and a combOp: (U, U) ⇒ U that merges partial aggregates.
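A minimal sketch of index construction on top of aggregateByKey, under the simplification that a postings list is a plain Scala list of (docid, tf) pairs rather than a compressed structure:

import org.apache.spark.sql.SparkSession

object InvertedIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("InvertedIndex").getOrCreate()
    val sc = spark.sparkContext

    // A toy stand-in for a real document collection
    val docs = sc.parallelize(Seq(
      (1L, "one fish two fish"),
      (2L, "red fish blue fish")))

    // Emit (term, (docid, tf)) pairs, just like the MapReduce mapper
    val pairs = docs.flatMap { case (docid, text) =>
      text.split("\\s+").groupBy(identity)
        .map { case (term, occ) => (term, (docid, occ.length)) }
    }

    // U = List[(Long, Int)]: seqOp folds one posting into a partial list,
    // combOp concatenates partial lists from different partitions
    val index = pairs
      .aggregateByKey(List.empty[(Long, Int)])(
        (u, v) => v :: u,
        (u1, u2) => u1 ++ u2)
      .mapValues(_.sortBy(_._1))   // sort each postings list by docid

    index.collect().foreach(println)
    spark.stop()
  }
}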
Source: Wikipedia (Walnut)
[Figure: abstract IR architecture revisited, shifting focus from indexing to retrieval: the query representation is compared against document representations to produce ranked results.]
[Figure: Boolean retrieval for the query ( blue AND fish ) OR ham, evaluated as the postfix tree blue fish AND ham OR over sorted postings lists. Intersecting blue's and fish's postings yields {2, 5, 9}; taking the union with ham's postings yields the final result {1, 2, 3, 4, 5, 9}.]
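A minimal sketch of the two postings operators over sorted docid lists; the raw postings below are illustrative stand-ins chosen to reproduce the figure's intermediate and final results:

// AND: linear-time merge (intersection) of two sorted postings lists
def and(a: List[Int], b: List[Int]): List[Int] = (a, b) match {
  case (x :: xs, y :: ys) =>
    if (x == y) x :: and(xs, ys)
    else if (x < y) and(xs, b)
    else and(a, ys)
  case _ => Nil
}

// OR: union of two sorted postings lists (a real engine also merges linearly)
def or(a: List[Int], b: List[Int]): List[Int] =
  (a ++ b).distinct.sorted

val blue = List(2, 5, 9)
val fish = List(1, 2, 3, 5, 6, 7, 8, 9)
val ham  = List(1, 3, 4)
or(and(blue, fish), ham)   // → List(1, 2, 3, 4, 5, 9), matching the figure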
Ranked retrieval replaces Boolean matching with a similarity score; with documents as term-weight vectors, the standard choice is the cosine of the angle between them:

$$\mathrm{sim}(d_j, d_k) = \cos\theta = \frac{\sum_{i=0}^{n} w_{j,i}\, w_{k,i}}{\sqrt{\sum_{i=0}^{n} w_{j,i}^{2}}\;\sqrt{\sum_{i=0}^{n} w_{k,i}^{2}}}$$
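A direct transcription of the formula, assuming dense weight vectors indexed by term id:

// Cosine similarity between two document weight vectors of equal length
def cosine(wj: Array[Double], wk: Array[Double]): Double = {
  val dot   = wj.zip(wk).map { case (a, b) => a * b }.sum
  val normJ = math.sqrt(wj.map(x => x * x).sum)
  val normK = math.sqrt(wk.map(x => x * x).sum)
  dot / (normJ * normK)
}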
[Figure: term-at-a-time query evaluation for the query blue fish. The postings for fish (docids 1, 9, 21, 34, 35, 80, …) and blue (docids 9, 21, 35, …) are traversed one list at a time, with partial document scores held in accumulators (e.g., a hash table), ultimately giving Score_q(doc n) = s for each matching document. To retain only the top k results, keep scores in a bounded min-heap: document score in top k? Yes: insert the document score, extract-min if the heap grows too large. No: do nothing.]
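A minimal sketch of this evaluation strategy; the postings representation (term → list of (docid, weight)) and the query are hypothetical stand-ins:

import scala.collection.mutable

// Term-at-a-time scoring: accumulate partial scores per document in a
// hash table, then keep the top k with a bounded min-heap.
def topK(postings: Map[String, List[(Int, Double)]],
         query: Seq[String], k: Int): List[(Int, Double)] = {
  val acc = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
  for (term <- query; (docid, w) <- postings.getOrElse(term, Nil))
    acc(docid) += w                       // one term's contributions at a time
  // Priority queue ordered so the *lowest* score sits at the head
  val heap = mutable.PriorityQueue.empty[(Double, Int)](Ordering.by(t => -t._1))
  for ((docid, score) <- acc) {
    heap.enqueue((score, docid))          // insert document score
    if (heap.size > k) heap.dequeue()     // extract-min if heap too large
  }
  heap.dequeueAll.toList.map { case (s, d) => (d, s) }.sortBy(-_._2)
}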
[Figure: serving architecture of a production search engine. Brokers route each query to one of several datacenters; within a datacenter the index is organized into tiers, each tier partitioned across machines, each partition replicated, with caches in front of the partitions.]
Source: Wikipedia (Japanese rock garden)