

SLIDE 1

Data-Intensive Distributed Computing
CS 431/631 451/651 (Fall 2020)

Part 4: Analyzing Text (1/2)

Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

SLIDE 2

Structure of the Course

“Core” framework features and algorithm design

MapReduce, Apache Hadoop, Apache Spark

SLIDE 3

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text, Analyzing Graphs, Analyzing Relational Data, Data Mining

SLIDE 4

Pairs. Stripes.

Seems pretty trivial… More than a “toy problem”? Answer: language models

SLIDE 5

Language Models

Assigning a probability to a sentence

Why?
  • Machine translation: P(High winds tonight) > P(Large winds tonight)
  • Spell correction: P(Waterloo is a great city) > P(Waterloo is a grate city)
  • Speech recognition: P(I saw a van) > P(eyes awe of an)

Slide from Dan Jurafsky

Given a sentence with T words, assign a probability to it. This has many applications in natural language processing.

SLIDE 6

Language Models

P(“Waterloo is a great city”) = P(Waterloo) x P(is | Waterloo) x P(a | Waterloo is) x P(great | Waterloo is a) x P(city | Waterloo is a great)

[chain rule] Is this tractable?

Given a sentence with T words, we assign it a probability using the chain rule: P(A,B) = P(B) P(A|B). It’s becoming too complicated, so let’s simplify it.
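Spelled out in general form (the standard chain-rule factorization, written here for reference since the slide’s equation is not preserved in this text):

    P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})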

SLIDE 7

Approximating Probabilities: N-Grams

Basic idea: limit history to a fixed number of (N – 1) words (Markov Assumption)

N = 1: Unigram Language Model

N-grams are different levels of simplification of the chain rule. The unigram model is the simplest, hence the most inaccurate.

SLIDE 8

Approximating Probabilities: N-Grams

Basic idea: limit history to a fixed number of (N – 1) words (Markov Assumption)

N = 2: Bigram Language Model

Since we also want to include the first word in the bigram model, we need a dummy beginning-of-sentence marker <s>. We usually also have an end-of-sentence marker </s>, but for the sake of brevity I don’t show that here.
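Under the bigram assumption, the chain-rule product collapses to conditioning on a single previous word (standard formulation, not preserved from the slide image):

    P(w_1, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-1}), \quad \text{with } w_0 = \langle s \rangle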

SLIDE 9

Approximating Probabilities: N-Grams

Basic idea: limit history to a fixed number of (N – 1) words (Markov Assumption)

N = 3: Trigram Language Model
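More generally, an N-gram model makes the approximation (standard formulation, written out since the slide’s equations are not preserved in this text):

    P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-N+1}, \ldots, w_{t-1})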

SLIDE 10

Building N-Gram Language Models

We already know how to do this in MapReduce! Compute maximum likelihood estimates (MLE) for individual n-gram probabilities.

Unigram, bigram, and beyond: the approach generalizes to higher-order n-grams. State-of-the-art models use ~5-grams.
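The MLEs are just relative frequencies. In standard notation (my reconstruction, since the slide’s formulas are not preserved here), with c(·) a corpus count and N the total number of tokens:

    Unigram: P(w_i) = \frac{c(w_i)}{N}

    Bigram: P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}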

SLIDE 11

Estimating Probability Distributions

Sparsity problem

Let’s now see how we can use these models and what problems they have.

SLIDE 12

Example: Bigram Language Model

Training Corpus:
  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>

Bigram Probability Estimates:
  P( I | <s> ) = 2/3 = 0.67
  P( Sam | <s> ) = 1/3 = 0.33
  P( am | I ) = 2/3 = 0.67
  P( do | I ) = 1/3 = 0.33
  P( </s> | Sam ) = 1/2 = 0.50
  P( Sam | am ) = 1/2 = 0.50
  ...

Note: we don’t ever cross sentence boundaries.
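A small Scala sketch (my illustration, not the slides’ code) that reproduces these estimates from the toy corpus; whitespace tokenization is assumed:

    // Recompute the slide's bigram MLEs from the toy training corpus.
    val corpus = Seq("I am Sam", "Sam I am", "I do not like green eggs and ham")

    // Wrap each sentence with <s> ... </s> markers, as on the slide.
    val sentences = corpus.map(s => "<s>" +: s.split(" ").toSeq :+ "</s>")

    // Count unigrams (contexts) and bigrams.
    val unigramCounts = sentences.flatten.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    val bigramCounts = sentences.flatMap(_.sliding(2).map(p => (p(0), p(1))))
                                .groupBy(identity).map { case (b, bs) => (b, bs.size) }

    // MLE: P(w | prev) = c(prev, w) / c(prev)
    def p(w: String, prev: String): Double =
      bigramCounts.getOrElse((prev, w), 0).toDouble / unigramCounts(prev)

    // p("I", "<s>") == 2.0/3, p("am", "I") == 2.0/3, p("Sam", "am") == 0.5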

SLIDE 13

Data Sparsity

Bigram Probability Estimates:
  P( I | <s> ) = 2/3 = 0.67
  P( Sam | <s> ) = 1/3 = 0.33
  P( am | I ) = 2/3 = 0.67
  P( do | I ) = 1/3 = 0.33
  P( </s> | Sam ) = 1/2 = 0.50
  P( Sam | am ) = 1/2 = 0.50
  ...

P(I like ham) = P( I | <s> ) P( like | I ) P( ham | like ) P( </s> | ham ) = 0

Issue: Sparsity! Why is the 0 bad?

SLIDE 14

Solution: Smoothing

Zeros are bad for any statistical estimator. We need better estimators because MLEs give us a lot of zeros; a distribution without zeros is “smoother”.

The Robin Hood philosophy: take from the rich (seen n-grams) and give to the poor (unseen n-grams).

Lots of techniques: Laplace, Good-Turing, Katz backoff, Jelinek-Mercer. Kneser-Ney represents best practice.

SLIDE 15

Laplace Smoothing

Simplest and oldest smoothing technique: just add 1 to all n-gram counts, including the unseen ones. So, what do the revised estimates look like?

SLIDE 16

Laplace Smoothing

Unigrams, Bigrams
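The slide’s formulas are not preserved in this text; the standard add-one estimates, with V the vocabulary size and N the total token count, are:

    Unigrams: P_{\text{Laplace}}(w_i) = \frac{c(w_i) + 1}{N + V}

    Bigrams: P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i) + 1}{c(w_{i-1}) + V}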

SLIDE 17

Other variations…

Many smoothing algorithms use this general representation. For example: Kneser-Ney.
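The “general representation” itself is not preserved in this text; one common interpolated form from the smoothing literature (e.g., Chen and Goodman), with c' a discounted count and \gamma the leftover probability mass, is:

    P(w_i \mid w_{i-n+1}^{i-1}) = \frac{c'(w_{i-n+1}^{i})}{\sum_{w} c'(w_{i-n+1}^{i-1}\, w)} + \gamma(w_{i-n+1}^{i-1}) \, P(w_i \mid w_{i-n+2}^{i-1})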

SLIDE 18

Interesting intuition behind Kneser-Ney smoothing

Let’s complete this sentence: “I cannot find my reading …”
“I cannot find my reading francisco”?

Problem: “francisco” appears more frequently than “glasses”, so the unigram probability is misleading here.

Solution: “francisco” only appears after “san”! Instead of the unigram probability, use the number of contexts “francisco” appears in.

SLIDE 19

Stupid Backoff

Let’s break all the rules, but throw lots of data at the problem!

Source: Brants et al. (EMNLP 2007)

SLIDE 20

Stupid Backoff Implementation

Straightforward approach: count each order separately. More clever approach: count all orders together.

[Figure: a sorted stream of n-grams “A B”, “A B C”, “A B D”, “A B E”, … is scanned once, remembering the count of each shared prefix as it goes.]

S(C | A B) = f(A B C) / f(A B)
S(D | A B) = f(A B D) / f(A B)
S(E | A B) = f(A B E) / f(A B)
…
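A minimal Scala sketch of the Stupid Backoff score itself (my illustration, not the slides’ code); `counts` holds n-gram frequencies keyed by token lists, `totalTokens` is the corpus size, and α = 0.4 follows Brants et al.:

    // Stupid Backoff: back off to shorter contexts with a fixed penalty alpha,
    // producing a score S (deliberately not a normalized probability).
    object StupidBackoff {
      val alpha = 0.4  // backoff penalty from Brants et al. (2007)

      def score(counts: Map[List[String], Long], totalTokens: Long,
                word: String, context: List[String]): Double = {
        if (context.isEmpty) {
          // Base case: relative frequency of the unigram.
          counts.getOrElse(List(word), 0L).toDouble / totalTokens
        } else {
          val num = counts.getOrElse(context :+ word, 0L)
          val denom = counts.getOrElse(context, 0L)
          if (num > 0 && denom > 0) num.toDouble / denom
          else alpha * score(counts, totalTokens, word, context.tail)  // drop oldest word
        }
      }
    }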

SLIDE 21

Kneser-Ney (KN) and Stupid Backoff (SB)

KN fails to train on a 1.8 TB dataset (in reasonable time).

SLIDE 22

Kneser-Ney (KN) and Stupid Backoff (SB)

Translation accuracy vs. training data size: SB outperforms KN when the training set is big enough. KN fails to train on big datasets.

SLIDE 23

Search!

Source: http://www.flickr.com/photos/guvnah/7861418602/

SLIDE 24

The Central Problem in Search

Do these represent the same concepts?

The searcher encodes concepts as query terms; the author encodes concepts as document terms: “tragic love story” vs. “fateful star-crossed romance”.

Why is IR hard? Because language is hard!

SLIDE 25

Abstract IR Architecture

Offline: documents → representation function → document representations → index
Online: query → representation function → query representation
A comparison function matches the query representation against the index and returns hits.

SLIDE 26

How do we represent text?

Remember: computers don’t “understand” anything!

“Bag of words”:
  • Treat all the words in a document as index terms
  • Assign a “weight” to each term based on “importance” (or, in the simplest case, presence/absence of the word)
  • Disregard order, structure, meaning, etc. of the words
  • Simple, yet effective!

Assumptions:
  • Term occurrence is independent
  • Document relevance is independent
  • “Words” are well-defined
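A minimal bag-of-words sketch in Scala (my illustration; crude whitespace/punctuation tokenization and case folding are assumed):

    // Bag of words: map each term to its count, discarding order and structure.
    def bagOfWords(doc: String): Map[String, Int] =
      doc.toLowerCase
         .split("[^a-z0-9]+")          // crude tokenizer: split on non-alphanumerics
         .filter(_.nonEmpty)
         .groupBy(identity)
         .map { case (term, occurrences) => (term, occurrences.length) }

    // bagOfWords("one fish, two fish") == Map("one" -> 1, "fish" -> 2, "two" -> 1)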

SLIDE 27

What’s a word?

天主教教宗若望保祿二世因感冒再度住進醫院。 這是他今年第二度因同樣的病因住院。لاقوكرامفيجير-قطانلامساب ةيجراخلاةيليئارسلئا-نإنوراشلبق ةوعدلاموقيسوةرمللىلولؤاةرايزب سنوت،يتلاتناكةرتفلةليوطرقملا يمسرلاةمظنملريرحتلاةينيطسلفلادعباهجورخنمنانبلماع1982. Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России. भारत सरकार ने आरॎथिक सर्शेक्सण मेः रॎर्शत्थीय र्शरॎि 2005-06 मेः सात फीसदी रॎर्शकास दर हारॎसल करने का आकलन रॎकया है और कर सुधार पर ज़ौर रॎदया है 日米連合で台頭中国に対処…アーミテージ前副長官提言 조재영 기자= 서울시는 25일 이명박 시장이 `행정중심복합도시'' 건설안 에 대해 `군대라도 동원해 막고싶은 심정''이라고 말했다는 일부 언론의 보도를 부인했다.

SLIDE 28

Sample Document: “McDonald's slims down spuds”

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …

“Bag of Words”:
  14 × McDonalds
  12 × fat
  11 × fries
  8 × new
  7 × french
  6 × company, said, nutrition
  5 × food, oil, percent, reduce, taste, Tuesday
  …

SLIDE 29

Counting Words…

Documents → Bag of Words → Inverted Index

Getting from documents to a bag of words: case folding, tokenization, stopword removal, stemming. What we’d really need: syntax, semantics, word knowledge, etc.

SLIDE 30

Count.

Source: http://www.flickr.com/photos/guvnah/7861418602/

SLIDE 31

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Term-document matrix:

          Doc 1   Doc 2   Doc 3   Doc 4
  blue              1
  cat                       1
  egg                               1
  fish      1       1
  green                             1
  ham                               1
  hat                       1
  one       1
  red               1
  two       1

What goes in each cell? boolean? count? positions?

SLIDE 32

Abstract IR Architecture

Offline: documents → representation function → document representations → index
Online: query → representation function → query representation
A comparison function matches the query representation against the index and returns hits.

SLIDE 33

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Term-document matrix (same as before):

          Doc 1   Doc 2   Doc 3   Doc 4
  blue              1
  cat                       1
  egg                               1
  fish      1       1
  green                             1
  ham                               1
  hat                       1
  one       1
  red               1
  two       1

Indexing: building this structure. Retrieval: manipulating this structure. Where have we seen this before?

SLIDE 34

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Turning the term-document matrix into postings lists (one list of docids per term):

  blue  → 2
  cat   → 3
  egg   → 4
  fish  → 1, 2
  green → 4
  ham   → 4
  hat   → 3
  one   → 1
  red   → 2
  two   → 1

SLIDE 35

Indexing: Performance Analysis

Fundamentally, a large sorting problem:
  • Terms usually fit in memory
  • Postings usually don’t

How is it done on a single machine? How can it be done with MapReduce?

First, let’s characterize the problem size:
  • Size of vocabulary
  • Size of postings

SLIDE 36

Vocabulary Size: Heaps’ Law

M = kT^b

M is vocabulary size
T is collection size (number of tokens)
k and b are constants; typically, k is between 10 and 100, and b is between 0.4 and 0.6

Heaps’ Law: linear in log-log space. Surprise: vocabulary size grows unbounded!

SLIDE 37

Heaps’ Law for RCV1

Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 - August 19, 1997)

Fit: k = 44, b = 0.49

Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)
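As a quick sanity check (my arithmetic, assuming RCV1 contains on the order of 10^8 tokens):

    M = 44 \cdot (10^8)^{0.49} \approx 44 \cdot 10^{3.92} \approx 3.7 \times 10^5

i.e., a few hundred thousand distinct terms, which is the right scale for RCV1’s vocabulary.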

SLIDE 38

Postings Size: Zipf’s Law

f(k; s, N) = \frac{1/k^s}{\sum_{n=1}^{N} (1/n^s)}

N is the number of elements
k is the rank
s is the characteristic exponent

Zipf’s Law: (also) linear in log-log space. A specific case of power-law distributions.

In other words: a few elements occur very frequently, while many elements occur very infrequently.

SLIDE 39

Zipf’s Law for RCV1

Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 - August 19, 1997)

The fit isn’t that good… but it’s good enough!

Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)

SLIDE 40

Zipf’s Law for Wikipedia

Rank versus frequency for the first 10 million words in 30 Wikipedias (dumps from October 2015)

SLIDE 41

Figure from: Newman, M. E. J. (2005) “Power laws, Pareto distributions and Zipf's law.” Contemporary Physics 46:323–351.

SLIDE 42

MapReduce: Index Construction

Map over all documents:
  • Emit term as key, (docid, tf) as value
  • Emit other information as necessary (e.g., term position)

Sort/shuffle: group postings by term

Reduce:
  • Gather and sort the postings (typically by docid)
  • Write postings to disk

MapReduce does all the heavy lifting!

SLIDE 43

Inverted Indexing with MapReduce

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat

Map output, as (term, (docid, tf)) pairs:
  Doc 1: (one, (1, 1)), (two, (1, 1)), (fish, (1, 2))
  Doc 2: (red, (2, 1)), (blue, (2, 1)), (fish, (2, 2))
  Doc 3: (cat, (3, 1)), (hat, (3, 1))

Shuffle and Sort: aggregate values by keys

Reduce output, as postings lists:
  one  → (1, 1)
  two  → (1, 1)
  fish → (1, 2), (2, 2)
  red  → (2, 1)
  blue → (2, 1)
  cat  → (3, 1)
  hat  → (3, 1)

SLIDE 44

Inverted Indexing: Pseudo-Code

class Mapper {
  def map(docid: Long, doc: String) = {
    // Count term frequencies within this one document.
    val counts = new Map()
    for (term <- tokenize(doc)) {
      counts(term) += 1
    }
    // Emit one (term, (docid, tf)) pair per distinct term.
    for ((term, tf) <- counts) {
      emit(term, (docid, tf))
    }
  }
}

class Reducer {
  def reduce(term: String, postings: Iterable[(docid, tf)]) = {
    // Gather this term's postings, sort them (typically by docid), and write them out.
    val p = new List()
    for ((docid, tf) <- postings) {
      p.append((docid, tf))
    }
    p.sort()
    emit(term, p)
  }
}
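Since the course also covers Spark, here is a minimal Spark sketch of the same pipeline (my illustration, not the slides’ code; `tokenize` is the assumed helper defined below):

    import org.apache.spark.rdd.RDD

    // Assumed helper: split a document into lowercase terms.
    def tokenize(doc: String): Seq[String] =
      doc.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

    // docs: RDD of (docid, text); result: each term with its postings sorted by docid.
    def invertedIndex(docs: RDD[(Long, String)]): RDD[(String, List[(Long, Int)])] =
      docs.flatMap { case (docid, doc) =>
            tokenize(doc).groupBy(identity)
                         .map { case (term, ts) => (term, (docid, ts.size)) }
          }
          .groupByKey()                       // plays the role of MapReduce's shuffle
          .mapValues(_.toList.sortBy(_._1))   // sort postings by docid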

Stay tuned…
