SLIDE 1

Probability & Language Modeling

CMSC 473/673 UMBC

Some slides adapted from 3SLP, Jason Eisner

SLIDE 2

[Diagram: levels of linguistic analysis (orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse) alongside VISION and AUDIO signals such as prosody, intonation, and color]

SLIDE 3

SLIDE 4

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

score( )

SLIDE 5

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ( )

SLIDE 6

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ( )

what’s a probability?

SLIDE 7

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ( )

what do we estimate?

Documents? Sentences? Words? Characters?

SLIDE 8

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ( )

what’s a word?

how to deal with morphology and orthography

SLIDE 9

Tree people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of an Shining Path attack today.

pθ( )

how do we estimate robustly?

SLIDE 10

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of an ISIS attack today.

pθ( )

how do we generalize?

SLIDE 11

Outline

• Probability review
• Words
• Defining Language Models
• Breaking & Fixing Language Models
• Evaluating Language Models

SLIDE 12

Outline

• Probability review
• Words
• Defining Language Models
• Breaking & Fixing Language Models
• Evaluating Language Models

SLIDE 13

Probability Takeaways

• Basic probability axioms and definitions
• Probabilistic independence
• Definition of joint probability
• Definition of conditional probability
• Bayes rule
• Probability chain rule

SLIDE 14

Kinds of Statistics

• Descriptive
• Confirmatory
• Predictive

The average grade on this assignment is 83.

SLIDE 15

Interpretations of Probability

• Past performance: 58% of the past 100 flips were heads
• Hypothetical performance: if I flipped the coin in many parallel universes…
• Subjective strength of belief: would pay up to 58 cents for a chance to win $1
• Output of some computable formula? p(heads) vs. q(heads)

SLIDE 16

Probabilities Measure Sets

[Venn diagram: all (known) outcomes involving the coin being flipped, containing the event "coin coming up heads"]

SLIDE 17

Probabilities Measure Sets

[Venn diagram: all (known) outcomes involving the coin being flipped; events: "coin coming up heads", "coin is ancient"]

SLIDE 18

Probabilities Measure Sets

[Venn diagram: all (known) outcomes involving the coin being flipped; events: "coin coming up heads", "coin is ancient", "defective minting process"]

SLIDE 19

Probabilities Measure Sets

[Venn diagram: all (known) outcomes involving the coin being flipped; events: "coin coming up heads", "coin is ancient", "defective minting process", "cafeteria serves egg salad"]

SLIDE 20

(Most) Probability Axioms

p(everything) = 1

SLIDE 21

(Most) Probability Axioms

p(everything) = 1
p(∅) = 0

SLIDE 22

(Most) Probability Axioms

p(everything) = 1
p(∅) = 0
p(A) ≤ p(B), when A ⊆ B

SLIDE 23

(Most) Probability Axioms

p(everything) = 1
p(∅) = 0
p(A) ≤ p(B), when A ⊆ B
p(A ∪ B) = p(A) + p(B), when A ∩ B = ∅

SLIDE 24

(Most) Probability Axioms

p(everything) = 1
p(∅) = 0
p(A) ≤ p(B), when A ⊆ B
p(A ∪ B) = p(A) + p(B), when A ∩ B = ∅

p(A ∪ B) ≠ p(A) + p(B) when A and B overlap

SLIDE 25

(Most) Probability Axioms

p(everything) = 1
p(∅) = 0
p(A) ≤ p(B), when A ⊆ B
p(A ∪ B) = p(A) + p(B), when A ∩ B = ∅

In general: p(A ∪ B) = p(A) + p(B) − p(A ∩ B)

SLIDE 26

Probabilities of Independent Events Multiply

p(ancient coin AND defective minting process)

SLIDE 27

Probabilities of Independent Events Multiply

p(ancient coin AND defective minting process) = p(ancient coin) ∗ p(defective minting process)

SLIDE 28

Probabilities of Independent Events Multiply

(a comma represents AND)

p(ancient coin, defective minting process) = p(ancient coin) ∗ p(defective minting process)

SLIDE 29

Joint Probabilities Are (Should Be) Symmetric

p(defective minting process, ancient coin) = p(defective minting process) ∗ p(ancient coin)

But the arguments to joint probabilities can have an order

SLIDE 30

Conditional Probabilities (Also) Measure Sets

p(heads | defective minting process)

SLIDE 31

Conditional Probabilities (Also) Measure Sets

p(heads | defective minting process) = p(heads AND defective minting process) / p(defective minting process)

SLIDE 32

Conditional Probabilities (Also) Measure Sets

p(heads | defective minting process) = p(heads AND defective minting process) / p(defective minting process)

defective process favors tails

SLIDE 33

Conditional Probabilities (Also) Measure Sets

p(heads | defective minting process) = p(heads AND defective minting process) / p(defective minting process)

defective process favors heads

SLIDE 34

Conditional Probabilities Are Probabilities

p(heads | egg salad) vs. p(heads | NOT egg salad)

SLIDE 35

Conditional Probabilities Are Probabilities

p(heads | egg salad) vs. p(heads | NOT egg salad)
p(heads | egg salad) vs. p(tails | egg salad)

SLIDE 36

Conditional Probabilities Are Probabilities

p(heads | egg salad) vs. p(heads | NOT egg salad)
p(heads | egg salad) vs. p(tails | egg salad)
p(heads | egg salad) vs. p(tails | NOT egg salad)

SLIDE 37

Bayes Rule

p(heads | defective minting process) = p(heads AND defective minting process) / p(defective minting process)

SLIDE 38

Bayes Rule

p(heads AND defective minting process) = p(heads | defective minting process) ∗ p(defective minting process)

SLIDE 39

Bayes Rule

p(heads AND defective minting process) = p(heads | defective minting process) ∗ p(defective minting process)

SLIDE 40

Bayes Rule

p(heads AND defective minting process) = p(defective minting process | heads) ∗ p(heads)

SLIDE 41

Bayes Rule

p(heads AND defective minting process) = p(defective minting process | heads) ∗ p(heads)
p(heads AND defective minting process) = p(heads | defective minting process) ∗ p(defective minting process)

SLIDE 42

Bayes Rule

p(heads | defective minting process) = p(defective minting process | heads) ∗ p(heads) / p(defective minting process)

SLIDE 43

Bayes Rule

p(Y | Z) = p(Z | Y) ∗ p(Y) / p(Z)

SLIDE 44

Bayes Rule

p(Y | Z) = p(Z | Y) ∗ p(Y) / p(Z)

p(Y | Z): posterior probability

SLIDE 45

Bayes Rule

p(Y | Z) = p(Z | Y) ∗ p(Y) / p(Z)

p(Y | Z): posterior probability
p(Z | Y): likelihood

SLIDE 46

Bayes Rule

p(Y | Z) = p(Z | Y) ∗ p(Y) / p(Z)

p(Y | Z): posterior probability
p(Z | Y): likelihood
p(Y): prior probability

SLIDE 47

Bayes Rule

p(Y | Z) = p(Z | Y) ∗ p(Y) / p(Z)

p(Y | Z): posterior probability
p(Z | Y): likelihood
p(Y): prior probability
p(Z): marginal likelihood (probability)
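To make the pieces concrete, here is a small numeric sketch of Bayes rule in Python; the coin numbers are invented for illustration and are not from the slides.

```python
# A toy numeric check of Bayes rule (made-up numbers): suppose 1% of coins have
# a defective minting process, a defective coin comes up heads 30% of the time,
# and a normal coin 50% of the time.
p_defect = 0.01                      # prior p(defective)
p_heads_given_defect = 0.30          # likelihood p(heads | defective)
p_heads_given_ok = 0.50              # likelihood p(heads | not defective)

# marginal likelihood p(heads), summing over both cases
p_heads = p_heads_given_defect * p_defect + p_heads_given_ok * (1 - p_defect)

# posterior p(defective | heads) via Bayes rule
p_defect_given_heads = p_heads_given_defect * p_defect / p_heads
print(p_defect_given_heads)  # ~0.006: seeing heads makes "defective" less likely
```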

SLIDE 48

Changing the Left

1 ≥ p(A)

SLIDE 49

Changing the Left

1 ≥ p(A) ≥ p(A, B)

SLIDE 50

Changing the Left

1 ≥ p(A) ≥ p(A, B) ≥ p(A, B, C)

SLIDE 51

Changing the Left

1 ≥ p(A) ≥ p(A, B) ≥ p(A, B, C) ≥ p(A, B, C, D)

SLIDE 52

Changing the Left

1 ≥ p(A) ≥ p(A, B) ≥ p(A, B, C) ≥ p(A, B, C, D) ≥ p(A, B, C, D, E)

(adding more events to a joint probability can only lower it)

SLIDE 53

Changing the Right

[Number line from 0 to 1 comparing p(A | B) with p(A)]

SLIDE 54

Changing the Right

[Number line from 0 to 1 comparing p(A | B) with p(A)]

SLIDE 55

Changing the Right

[Number line from 0 to 1 comparing p(A | B) with p(A)]

SLIDE 56

Changing the Right

Bias vs. Variance
• Lower bias: more specific to what we care about
• Higher variance: for a fixed number of observations, estimates become less reliable

SLIDE 57

Probability Chain Rule

p(y1, y2) = p(y1) ∗ p(y2 | y1)

Bayes rule

SLIDE 58

Probability Chain Rule

p(y1, y2, …, yT) = p(y1) p(y2 | y1) p(y3 | y1, y2) ⋯ p(yT | y1, …, yT−1)

SLIDE 59

Probability Chain Rule

p(y1, y2, …, yT) = p(y1) p(y2 | y1) p(y3 | y1, y2) ⋯ p(yT | y1, …, yT−1) = ∏_{j=1}^{T} p(yj | y1, …, yj−1)
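A minimal sketch of the chain rule in Python, with hypothetical per-step conditional probabilities standing in for a real model:

```python
import math

# step_probs[j] stands for p(y_j | y_1, ..., y_{j-1}); the values are made up.
step_probs = [0.2, 0.5, 0.1, 0.4]

# the joint probability is the product of the per-step conditionals
joint = 1.0
for p in step_probs:
    joint *= p
print(joint)  # p(y1, y2, y3, y4) = 0.2 * 0.5 * 0.1 * 0.4 = 0.004

# in practice we sum log-probabilities to avoid numerical underflow
log_joint = sum(math.log(p) for p in step_probs)
print(math.exp(log_joint))  # same value, computed stably
```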

SLIDE 60

Probability Takeaways

• Basic probability axioms and definitions
• Probabilistic independence
• Definition of joint probability
• Definition of conditional probability
• Bayes rule
• Probability chain rule

SLIDE 61

Outline

• Probability review
• Words
• Defining Language Models
• Breaking & Fixing Language Models
• Evaluating Language Models

SLIDE 62

What Are Words?

• Linguists don't agree
• (Human) language-dependent
• White-space separation is sometimes okay (for written English longform)
• Social media? Spoken vs. written? Other languages?

SLIDE 63

What Are Words?

bat

http://www.freepngimg.com/download/bat/9-2-bat-png-hd.png

SLIDE 64

What Are Words?

bats

http://www.freepngimg.com/download/bat/9-2-bat-png-hd.png

SLIDE 65

What Are Words?

Fledermaus ("flutter mouse", the German word for bat)

http://www.freepngimg.com/download/bat/9-2-bat-png-hd.png

SLIDE 66

What Are Words?

pişirdiler: "They cooked it."
pişmişlermişlerdi: "They had it cooked."

SLIDE 67

What Are Words?

my leg is hurting nasty ):

(the emoticon "):" is itself a token)

SLIDE 68

Examples of Text Normalization

• Segmenting or tokenizing words
• Normalizing word formats
• Segmenting sentences in running text

SLIDE 69

What Are Words? Tokens vs. Types

The film got a great opening and the film went on to become a hit .

Tokens: an instance of that type in running text.

  • The
  • film
  • got
  • a
  • great
  • opening
  • and
  • the
  • film
  • went
  • on
  • to
  • become
  • a
  • hit
  • .

Types: an element of the vocabulary.

  • The
  • film
  • got
  • a
  • great
  • opening
  • and
  • the
  • went
  • on
  • to
  • become
  • hit
  • .
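A small Python sketch of the token/type distinction, using the slide's example sentence and simple whitespace tokenization:

```python
from collections import Counter

sentence = "The film got a great opening and the film went on to become a hit ."
tokens = sentence.split()   # whitespace tokenization (okay for this example)
types = Counter(tokens)     # vocabulary: distinct word forms

print(len(tokens))          # 16 tokens
print(len(types))           # 14 types ("film" and "a" repeat; "The" != "the")
print(types["film"])        # 2
```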
SLIDE 70

Some Issues with Tokenization

• mph, MPH
• M.D., MD
• Baltimore's mayor
• I'm, won't
• state-of-the-art
• San Francisco

SLIDE 71

CaSE inSensitive?

• Replace all letters with their lowercase versions
• Can be useful for information retrieval (IR), machine translation, language modeling

cat vs. Cat (there are other ways to signify a sentence beginning)

SLIDE 72

CaSE inSensitive?

• Replace all letters with their lowercase versions
• Can be useful for information retrieval (IR), machine translation, language modeling
• But… case can be useful: sentiment analysis, machine translation, information extraction

cat vs. Cat (there are other ways to signify a sentence beginning)
US vs. us

SLIDE 73

cat ≟ cats

• Lemma: same stem, part of speech, rough word sense (cat and cats: same lemma)
• Word form: the fully inflected surface form (cat and cats: different word forms)

SLIDE 74

Lemmatization

Reduce inflections or variant forms to base form

am, are, is → be
car, cars, car's, cars' → car

the boy's cars are different colors → the boy car be different color
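As a concrete (if partial) illustration, NLTK's WordNet lemmatizer handles cases like these; this sketch assumes NLTK is installed and the wordnet data has been fetched via nltk.download("wordnet"):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))          # "car"  (default part of speech is noun)
print(lemmatizer.lemmatize("are", pos="v"))  # "be"   (verbs need pos="v")
print(lemmatizer.lemmatize("colors"))        # "color"
```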

SLIDE 75

Morphosyntax

• Morphemes: the small meaningful units that make up words
• Stems: the core meaning-bearing units
• Affixes: bits and pieces that adhere to stems

SLIDE 76

Morphosyntax

• Morphemes: the small meaningful units that make up words
• Stems: the core meaning-bearing units
• Affixes: bits and pieces that adhere to stems
• Inflectional: (they) look → (they) looked; (they) run → (they) ran
• Derivational: (a) run → running (of the Bulls); code → codeable

SLIDE 77

Morphosyntax

• Morphemes: the small meaningful units that make up words
• Stems: the core meaning-bearing units
• Affixes: bits and pieces that adhere to stems
• Inflectional: (they) look → (they) looked; (they) run → (they) ran
• Derivational: (a) run → running (of the Bulls); code → codeable

Syntax: contractions can rewrite and reorder a sentence
Baltimore's [mayor's {campaign}] → [{the campaign} of the mayor] of Baltimore

SLIDE 78

Words vs. Sentences

• ! and ? are relatively unambiguous
• The period "." is quite ambiguous: sentence boundary; abbreviations like Inc. or Dr.; numbers like .02% or 4.3

Solution: write rules, build a classifier (a sketch follows below)
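A minimal rule-based splitter along these lines; the regular expression and abbreviation list are illustrative assumptions, and a real system would use a trained classifier:

```python
import re

# split on ., !, ? followed by whitespace and an uppercase letter, but skip
# periods in a small abbreviation list; numbers like .02% or 4.3 never match
# because their periods are not followed by whitespace.
ABBREVS = {"Dr.", "Inc.", "Mr.", "Ms.", "U.S."}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:m.end()].strip()
        if candidate.split()[-1] in ABBREVS:  # likely an abbreviation, not a boundary
            continue
        sentences.append(candidate)
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith pays 4.3 dollars. He is happy."))
# ['Dr. Smith pays 4.3 dollars.', 'He is happy.']
```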

SLIDE 79

Outline

• Probability review
• Words
• Defining Language Models
• Breaking & Fixing Language Models
• Evaluating Language Models

SLIDE 80

Goal of Language Modeling

pθ( […text…] )

Learn a probabilistic model of text. Accomplished through observing text and updating model parameters to make the text more likely.

SLIDE 81

Goal of Language Modeling

pθ( […text…] )

Learn a probabilistic model of text. Accomplished through observing text and updating model parameters to make the text more likely.

0 ≤ pθ([…text…]) ≤ 1

Σ_{t : t is valid text} pθ(t) = 1

SLIDE 82

“The Unreasonable Effectiveness of Recurrent Neural Networks”

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

SLIDE 83

“The Unreasonable Effectiveness of Character- level Language Models” (and why RNNs are still cool)

http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

“The Unreasonable Effectiveness of Recurrent Neural Networks”

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

SLIDE 84

Simple Count-Based

p(item)

SLIDE 85

Simple Count-Based

p(item) ∝ count(item)

(∝ means "proportional to")

SLIDE 86

Simple Count-Based

p(item) ∝ count(item) = count(item) / Σ_{any other item y} count(y)

SLIDE 87

Simple Count-Based

p(item) ∝ count(item) = count(item) / Σ_{any other item y} count(y)

(the denominator is a normalization constant)

SLIDE 88

Simple Count-Based

p(item) ∝ count(item)

sequence of characters → pseudo-words
sequence of words → pseudo-phrases
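A sketch of such a count-based (maximum likelihood) unigram model over words:

```python
from collections import Counter

corpus = "the film got a great opening and the film went on to become a hit ."
counts = Counter(corpus.split())
total = sum(counts.values())   # the normalization constant

def p(word):
    """Relative frequency estimate: p(word) = count(word) / total tokens."""
    return counts[word] / total

print(p("film"))    # 2/16 = 0.125
print(p("dragon"))  # 0.0: unseen items get probability zero (a problem smoothing fixes)
```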

SLIDE 89

Shakespearian Sequences of Characters

SLIDE 90

Shakespearian Sequences of Words

SLIDE 91

Novel Words, Novel Sentences

• "Colorless green ideas sleep furiously" – Chomsky (1957)
• Let's observe and record all sentences with our big, bad supercomputer
• Red ideas? Read ideas?

SLIDE 92

Probability Chain Rule

p(y1, y2, …, yT) = p(y1) p(y2 | y1) p(y3 | y1, y2) ⋯ p(yT | y1, …, yT−1)

SLIDE 93

Probability Chain Rule

p(y1, y2, …, yT) = p(y1) p(y2 | y1) p(y3 | y1, y2) ⋯ p(yT | y1, …, yT−1) = ∏_{j=1}^{T} p(yj | y1, …, yj−1)

SLIDE 94

N-Grams

Maintaining an entire inventory over sentences could be too much to ask. Store "smaller" pieces?

p(Colorless green ideas sleep furiously)

SLIDE 95

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask. Store "smaller" pieces?

p(Colorless green ideas sleep furiously) = p(Colorless) *

SLIDE 96

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask. Store "smaller" pieces?

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) *

SLIDE 97

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask. Store "smaller" pieces?

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

SLIDE 98

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask. Store "smaller" pieces?

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

apply the chain rule

SLIDE 99

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask. Store "smaller" pieces?

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

apply the chain rule

SLIDE 100

N-Grams

p(furiously | Colorless green ideas sleep)

How much does "Colorless" influence the choice of "furiously"?
SLIDE 101

N-Grams

p(furiously | Colorless green ideas sleep)

How much does "Colorless" influence the choice of "furiously"?

Remove history and contextual info

SLIDE 102

N-Grams

p(furiously | Colorless green ideas sleep)

How much does "Colorless" influence the choice of "furiously"?

Remove history and contextual info:
p(furiously | Colorless green ideas sleep) ≈ p(furiously | ideas sleep)

SLIDE 103

N-Grams

p(furiously | Colorless green ideas sleep)

How much does "Colorless" influence the choice of "furiously"?

Remove history and contextual info:
p(furiously | Colorless green ideas sleep) ≈ p(furiously | ideas sleep)

SLIDE 104

N-Grams

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

SLIDE 105

N-Grams

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

SLIDE 106

Trigrams

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

SLIDE 107

Trigrams

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

SLIDE 108

Trigrams

p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

Consistent notation: Pad the left with <BOS> (beginning of sentence) symbols

SLIDE 109

Trigrams

p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) * p(<EOS> | sleep furiously)

• Consistent notation: pad the left with <BOS> (beginning of sentence) symbols
• Fully proper distribution: pad the right with a single <EOS> (end of sentence) symbol
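A short sketch of the padding and factorization; trigram_factors is a hypothetical helper name:

```python
def trigram_factors(sentence, n=3):
    tokens = ["<BOS>"] * (n - 1) + sentence.split() + ["<EOS>"]
    # each word is conditioned on exactly the n-1 preceding tokens
    return [(tuple(tokens[i - n + 1:i]), tokens[i])
            for i in range(n - 1, len(tokens))]

for history, word in trigram_factors("Colorless green ideas sleep furiously"):
    print(f"p({word} | {' '.join(history)})")
# p(Colorless | <BOS> <BOS>)
# p(green | <BOS> Colorless)
# p(ideas | Colorless green)
# p(sleep | green ideas)
# p(furiously | ideas sleep)
# p(<EOS> | sleep furiously)
```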

SLIDE 110

N-Gram Terminology

n = 1: unigram, history size (Markov order) 0, e.g. p(furiously)

SLIDE 111

N-Gram Terminology

n = 1: unigram, history size (Markov order) 0, e.g. p(furiously)
n = 2: bigram, history size 1, e.g. p(furiously | sleep)

SLIDE 112

N-Gram Terminology

n = 1: unigram, history size (Markov order) 0, e.g. p(furiously)
n = 2: bigram, history size 1, e.g. p(furiously | sleep)
n = 3: trigram (3-gram), history size 2, e.g. p(furiously | ideas sleep)

SLIDE 113

N-Gram Terminology

n = 1: unigram, history size (Markov order) 0, e.g. p(furiously)
n = 2: bigram, history size 1, e.g. p(furiously | sleep)
n = 3: trigram (3-gram), history size 2, e.g. p(furiously | ideas sleep)
n = 4: 4-gram, history size 3, e.g. p(furiously | green ideas sleep)
general n: n-gram, history size n−1, p(wi | wi−n+1 … wi−1)

SLIDE 114

N-Gram Probability

p(x1, x2, x3, ⋯, xT) = ∏_{j=1}^{T} p(xj | xj−N+1, ⋯, xj−1)

SLIDE 115

Count-Based N-Grams (Unigrams)

p(item) ∝ count(item)

SLIDE 116

Count-Based N-Grams (Unigrams)

p(z) ∝ count(z)

SLIDE 117

Count-Based N-Grams (Unigrams)

p(z) ∝ count(z) = count(z) / Σ_v count(v)

SLIDE 118

Count-Based N-Grams (Unigrams)

p(z) ∝ count(z) = count(z) / Σ_v count(v)

(z and v range over word types)

SLIDE 119

Count-Based N-Grams (Unigrams)

p(z) ∝ count(z) = count(z) / W

(z is a word type; W is the total number of tokens observed)

SLIDE 120

Count-Based N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) | Raw Count
The | 1
film | 2
got | 1
a | 2
great | 1
opening | 1
and | 1
the | 1
went | 1
on | 1
to | 1
become | 1
hit | 1
. | 1

SLIDE 121

Count-Based N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) | Raw Count | Normalization
The | 1 | 16
film | 2 | 16
got | 1 | 16
a | 2 | 16
great | 1 | 16
opening | 1 | 16
and | 1 | 16
the | 1 | 16
went | 1 | 16
on | 1 | 16
to | 1 | 16
become | 1 | 16
hit | 1 | 16
. | 1 | 16

SLIDE 122

Count-Based N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) | Raw Count | Normalization | Probability
The | 1 | 16 | 1/16
film | 2 | 16 | 1/8
got | 1 | 16 | 1/16
a | 2 | 16 | 1/8
great | 1 | 16 | 1/16
opening | 1 | 16 | 1/16
and | 1 | 16 | 1/16
the | 1 | 16 | 1/16
went | 1 | 16 | 1/16
on | 1 | 16 | 1/16
to | 1 | 16 | 1/16
become | 1 | 16 | 1/16
hit | 1 | 16 | 1/16
. | 1 | 16 | 1/16

SLIDE 123

Count-Based N-Grams (Trigrams)

p(z | x, y) ∝ count(x, y, z)

order matters in conditioning; order matters in the count

SLIDE 124

Count-Based N-Grams (Trigrams)

p(z | x, y) ∝ count(x, y, z)

order matters in conditioning; order matters in the count

count(x, y, z) ≠ count(x, z, y) ≠ count(y, x, z) ≠ …

SLIDE 125

Count-Based N-Grams (Trigrams)

p(z | x, y) ∝ count(x, y, z) = count(x, y, z) / Σ_v count(x, y, v)

SLIDE 126

Count-Based N-Grams (Trigrams)

The film got a great opening and the film went on to become a hit .

Context | Word (Type) | Raw Count | Normalization | Probability
The film | The | 0 | 1 | 0/1
The film | film | 0 | 1 | 0/1
The film | got | 1 | 1 | 1/1
The film | went | 0 | 1 | 0/1
…
a great | great | 0 | 1 | 0/1
a great | opening | 1 | 1 | 1/1
a great | and | 0 | 1 | 0/1
a great | the | 0 | 1 | 0/1
…

SLIDE 127

Count-Based N-Grams (Lowercased Trigrams)

the film got a great opening and the film went on to become a hit .

Context | Word (Type) | Raw Count | Normalization | Probability
the film | the | 0 | 2 | 0/2
the film | film | 0 | 2 | 0/2
the film | got | 1 | 2 | 1/2
the film | went | 1 | 2 | 1/2
…
a great | great | 0 | 1 | 0/1
a great | opening | 1 | 1 | 1/1
a great | and | 0 | 1 | 0/1
a great | the | 0 | 1 | 0/1
…

SLIDE 128

Outline

• Probability review
• Words
• Defining Language Models
• Breaking & Fixing Language Models
• Evaluating Language Models

SLIDE 129

Maximum Likelihood Estimates

• Maximizes the likelihood of the training set
• Do different corpora look the same?
• Low(er) bias, high(er) variance
• For large data: can actually do reasonably well

p(item) ∝ count(item)

SLIDE 130

n = 1

, , land of in , a teachers The , wilds the and gave a Etienne any two beginning without probably heavily that other useless the the a different . the able mines , unload into in foreign the the be either other Britain finally avoiding , for of have the cure , the Gutenberg-tm ; of being can as country in authority deviates as d seldom and They employed about from business marshal materials than in , they

SLIDE 131

n = 2

These varied with it to the civil wars , therefore , it did not for the company had the East India , the mechanical , the sum which were by barter , vol. i , and , conveniencies of all made to purchase a council of landlords , constitute a sum as an argument , having thus forced abroad , however , and influence in the one , or banker , will there was encouraged and more common trade to corrupt , profit , it ; but a master does not , twelfth year the consent that of volunteers and […] , the other hand , it certainly it very earnestly entreat both nations . In opulent nations in a revenue of four parts of production .

SLIDE 132

n = 3

His employer , if silver was regulated according to the temporary and occasional event .

What goods could bear the expense of defending themselves , than in the value of different sorts of goods , and placed at a much greater , there have been the effects of self-deception , this attention , but a very important ones , and which , having become of less than they ever were in this agreement for keeping up the business of weighing . After food , clothes , and a few months longer credit than is wanted , there must be sufficient to keep by him , are of such colonies to surmount . They facilitated the acquisition of the empire , both from the rents of land and labour of those pedantic pieces of silver which he can afford to take from the duty upon every quarter which they have a more equable distribution of employment .

SLIDE 133

n = 4

To buy in one market , in order to have it ; but the 8th of George III . The tendency of some of the great lords , gradually encouraged their villains to make upon the prices of corn , cattle , poultry , etc . Though it may , perhaps , in the mean time , that part of the governments of New England , the market , trade cannot always be transported to so great a number of seamen , not inferior to those of other European nations from any direct trade to America .

The farmer makes his profit by parting with it . But the government of that country below what it is in itself necessarily slow , uncertain , liable to be interrupted by the weather .

SLIDE 134

Maximum Likelihood Estimates

• Maximizes the likelihood of the training set
• Do different corpora look the same?
• For large data: can actually do reasonably well

p(item) ∝ count(item)

SLIDE 135

0s Are Not Your (Language Model's) Friend

p(item) ∝ count(item): count(item) = 0 → p(item) = 0

SLIDE 136

0s Are Not Your (Language Model's) Friend

• 0 probability → the item is impossible
• 0s annihilate: x ∗ y ∗ z ∗ 0 = 0
• Language is creative: new words keep appearing; existing words could appear in new contexts
• How much do you trust your data?

p(item) ∝ count(item): count(item) = 0 → p(item) = 0

SLIDE 137

Add-λ estimation

• Laplace smoothing, Lidstone smoothing
• Pretend we saw each word λ more times than we did
• Add λ to all the counts

SLIDE 138

Add-λ estimation

• Laplace smoothing, Lidstone smoothing
• Pretend we saw each word λ more times than we did
• Add λ to all the counts

p(z) ∝ count(z) + λ

SLIDE 139

Add-λ estimation

• Laplace smoothing, Lidstone smoothing
• Pretend we saw each word λ more times than we did
• Add λ to all the counts

p(z) ∝ count(z) + λ = (count(z) + λ) / Σ_v (count(v) + λ)

SLIDE 140

Add-λ estimation

• Laplace smoothing, Lidstone smoothing
• Pretend we saw each word λ more times than we did
• Add λ to all the counts

p(z) ∝ count(z) + λ = (count(z) + λ) / (W + Vλ)

(W: number of observed tokens; V: vocabulary size)
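A sketch of add-λ estimation for unigrams, reserving one <UNK> type so unseen words also get nonzero mass:

```python
from collections import Counter

corpus = "the film got a great opening and the film went on to become a hit ."
counts = Counter(corpus.split())
lam = 1.0                         # λ = 1 gives Laplace smoothing
vocab = set(counts) | {"<UNK>"}   # reserve one type for unseen words
W = sum(counts.values())          # number of observed tokens (16 here)
V = len(vocab)                    # vocabulary size (14 types + <UNK> = 15)

def p_add_lambda(word):
    if word not in vocab:
        word = "<UNK>"
    return (counts[word] + lam) / (W + V * lam)

print(p_add_lambda("film"))    # (2 + 1) / (16 + 15) ≈ 0.097
print(p_add_lambda("dragon"))  # (0 + 1) / (16 + 15) ≈ 0.032, no longer zero
```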

SLIDE 141

Add-λ N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) | Raw Count | Norm | Prob. | Add-λ Count | Add-λ Norm. | Add-λ Prob.
The | 1 | 16 | 1/16 | | |
film | 2 | 16 | 1/8 | | |
got | 1 | 16 | 1/16 | | |
a | 2 | 16 | 1/8 | | |
great | 1 | 16 | 1/16 | | |
opening | 1 | 16 | 1/16 | | |
and | 1 | 16 | 1/16 | | |
the | 1 | 16 | 1/16 | | |
went | 1 | 16 | 1/16 | | |
on | 1 | 16 | 1/16 | | |
to | 1 | 16 | 1/16 | | |
become | 1 | 16 | 1/16 | | |
hit | 1 | 16 | 1/16 | | |
. | 1 | 16 | 1/16 | | |

SLIDE 142

Add-1 N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) | Raw Count | Norm | Prob. | Add-1 Count | Add-1 Norm. | Add-1 Prob.
The | 1 | 16 | 1/16 | 2 | |
film | 2 | 16 | 1/8 | 3 | |
got | 1 | 16 | 1/16 | 2 | |
a | 2 | 16 | 1/8 | 3 | |
great | 1 | 16 | 1/16 | 2 | |
opening | 1 | 16 | 1/16 | 2 | |
and | 1 | 16 | 1/16 | 2 | |
the | 1 | 16 | 1/16 | 2 | |
went | 1 | 16 | 1/16 | 2 | |
on | 1 | 16 | 1/16 | 2 | |
to | 1 | 16 | 1/16 | 2 | |
become | 1 | 16 | 1/16 | 2 | |
hit | 1 | 16 | 1/16 | 2 | |
. | 1 | 16 | 1/16 | 2 | |

SLIDE 143

Add-1 N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) | Raw Count | Norm | Prob. | Add-1 Count | Add-1 Norm. | Add-1 Prob.
The | 1 | 16 | 1/16 | 2 | 16 + 14*1 = 30 |
film | 2 | 16 | 1/8 | 3 | 30 |
got | 1 | 16 | 1/16 | 2 | 30 |
a | 2 | 16 | 1/8 | 3 | 30 |
great | 1 | 16 | 1/16 | 2 | 30 |
opening | 1 | 16 | 1/16 | 2 | 30 |
and | 1 | 16 | 1/16 | 2 | 30 |
the | 1 | 16 | 1/16 | 2 | 30 |
went | 1 | 16 | 1/16 | 2 | 30 |
on | 1 | 16 | 1/16 | 2 | 30 |
to | 1 | 16 | 1/16 | 2 | 30 |
become | 1 | 16 | 1/16 | 2 | 30 |
hit | 1 | 16 | 1/16 | 2 | 30 |
. | 1 | 16 | 1/16 | 2 | 30 |

SLIDE 144

Add-1 N-Grams (Unigrams)

The film got a great opening and the film went on to become a hit .

Word (Type) | Raw Count | Norm | Prob. | Add-1 Count | Add-1 Norm. | Add-1 Prob.
The | 1 | 16 | 1/16 | 2 | 16 + 14*1 = 30 | 2/30 = 1/15
film | 2 | 16 | 1/8 | 3 | 30 | 3/30 = 1/10
got | 1 | 16 | 1/16 | 2 | 30 | 1/15
a | 2 | 16 | 1/8 | 3 | 30 | 1/10
great | 1 | 16 | 1/16 | 2 | 30 | 1/15
opening | 1 | 16 | 1/16 | 2 | 30 | 1/15
and | 1 | 16 | 1/16 | 2 | 30 | 1/15
the | 1 | 16 | 1/16 | 2 | 30 | 1/15
went | 1 | 16 | 1/16 | 2 | 30 | 1/15
on | 1 | 16 | 1/16 | 2 | 30 | 1/15
to | 1 | 16 | 1/16 | 2 | 30 | 1/15
become | 1 | 16 | 1/16 | 2 | 30 | 1/15
hit | 1 | 16 | 1/16 | 2 | 30 | 1/15
. | 1 | 16 | 1/16 | 2 | 30 | 1/15

SLIDE 145

Backoff and Interpolation

Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about

SLIDE 146

Backoff and Interpolation

Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about

Backoff: use the trigram if you have good evidence; otherwise bigram, otherwise unigram
SLIDE 147

Backoff and Interpolation

Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about

Backoff: use the trigram if you have good evidence; otherwise bigram, otherwise unigram

Interpolation: mix (average) unigram, bigram, trigram

SLIDE 148

Linear Interpolation

Simple interpolation:

p(z | y) = λ p2(z | y) + (1 − λ) p1(z), with 0 ≤ λ ≤ 1

SLIDE 149

Linear Interpolation

Simple interpolation:

p(z | y) = λ p2(z | y) + (1 − λ) p1(z), with 0 ≤ λ ≤ 1

Condition the λs on context:

p(z | x, y) = λ3(x, y) p3(z | x, y) + λ2(y) p2(z | y) + λ1 p1(z)
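A sketch of the context-independent version over tiny hypothetical trigram/bigram/unigram tables; the weights are made up, and in practice the λs are tuned on dev data (see the later slide on hyperparameters):

```python
# the λs must be non-negative and sum to 1 so the mixture stays a distribution
lambdas = (0.6, 0.3, 0.1)   # hypothetical weights

def p_interp(z, x, y, p3, p2, p1):
    l3, l2, l1 = lambdas
    return (l3 * p3.get((x, y, z), 0.0)
            + l2 * p2.get((y, z), 0.0)
            + l1 * p1.get(z, 0.0))

# tiny hypothetical probability tables:
p3 = {("green", "ideas", "sleep"): 0.5}
p2 = {("ideas", "sleep"): 0.2}
p1 = {"sleep": 0.01}
print(p_interp("sleep", "green", "ideas", p3, p2, p1))
# 0.6*0.5 + 0.3*0.2 + 0.1*0.01 = 0.361
```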

SLIDE 150

Backoff

Trust your statistics, up to a point:

p(z | x, y) ∝ p3(z | x, y)   if count(x, y, z) > 0
p(z | x, y) ∝ p2(z | y)      otherwise
SLIDE 151

Discounted Backoff

Trust your statistics, up to a point:

p(z | x, y) ∝ p3(z | x, y) − d      if count(x, y, z) > 0
p(z | x, y) ∝ γ(x, y) p2(z | y)     otherwise
SLIDE 152

Discounted Backoff

Trust your statistics, up to a point:

p(z | x, y) = p3(z | x, y) − d      if count(x, y, z) > 0
p(z | x, y) = γ(x, y) p2(z | y)     otherwise

d: discount constant; γ(x, y): context-dependent normalization constant
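A sketch of the backoff decision itself; computing γ(x, y) so that the distribution normalizes is the fiddly part, and here it is simply passed in (defaulting to 1.0, so this shows the structure of the rule rather than a properly normalized model):

```python
def p_backoff(z, x, y, p3, p2, discount=0.1, gamma=lambda x, y: 1.0):
    if (x, y, z) in p3:                       # count(x, y, z) > 0
        return p3[(x, y, z)] - discount       # trust the trigram, minus a discount
    return gamma(x, y) * p2.get((y, z), 0.0)  # otherwise back off to the bigram

# tiny hypothetical probability tables:
p3 = {("green", "ideas", "sleep"): 0.5}
p2 = {("ideas", "furiously"): 0.05}
print(p_backoff("sleep", "green", "ideas", p3, p2))      # 0.5 - 0.1 = 0.4
print(p_backoff("furiously", "green", "ideas", p3, p2))  # 1.0 * 0.05 = 0.05
```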

SLIDE 153

Setting Hyperparameters

Use a development corpus. Choose the λs to maximize the probability of the dev data:

– Fix the n-gram probabilities (on the training data)
– Then search for the λs that give the largest probability to the held-out set

[Training Data | Dev Data | Test Data]

SLIDE 154

Implementation: Unknown Words

Create an unknown word token <UNK>

Training:
1. Create a fixed lexicon L of size V
2. Change any word not in L to <UNK>
3. Train the LM as normal

Evaluation: use the <UNK> probabilities for any word not seen in training
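A sketch of this recipe; build_lexicon and apply_unk are hypothetical helper names:

```python
from collections import Counter

def build_lexicon(tokens, V):
    """Fixed lexicon L: the V most frequent types in the training data."""
    return {w for w, _ in Counter(tokens).most_common(V)}

def apply_unk(tokens, lexicon):
    """Rewrite any word not in L as <UNK>, then train the LM as normal."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

train = "the film got a great opening and the film went on to become a hit .".split()
lexicon = build_lexicon(train, V=5)
print(apply_unk(train, lexicon))
# words outside the top-5 lexicon all become <UNK>;
# at test time, unseen words reuse the <UNK> statistics
```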

SLIDE 155

Other Kinds of Smoothing

Interpolated (modified) Kneser-Ney
Idea: how "productive" is a context? How many different word types v appear in a context x, y?

Good-Turing
Partition words into classes of occurrence; smooth class statistics; properties of classes are likely to predict properties of other classes

Witten-Bell
Idea: every observed type was at some point novel; give an MLE prediction for a novel type occurring

SLIDE 156

Outline

• Probability review
• Words
• Defining Language Models
• Breaking & Fixing Language Models
• Evaluating Language Models

SLIDE 157

Evaluating Language Models

What is "correct"? What is working "well"?

SLIDE 158

Evaluating Language Models

What is "correct"? What is working "well"?

[Training Data | Dev Data | Test Data]

SLIDE 159

Evaluating Language Models

What is "correct"? What is working "well"?

Training Data: acquire primary statistics for learning model parameters
Dev Data: fine-tune any secondary (hyper)parameters
Test Data: perform the final evaluation

SLIDE 160

Evaluating Language Models

What is "correct"? What is working "well"?

Training Data: acquire primary statistics for learning model parameters
Dev Data: fine-tune any secondary (hyper)parameters
Test Data: perform the final evaluation

DO NOT ITERATE ON THE TEST DATA

SLIDE 161

Evaluating Language Models

What is "correct"? What is working "well"?

Extrinsic: evaluate the LM in a downstream task
• Test an MT, ASR, etc. system and see which LM does better
• Propagates & conflates errors

SLIDE 162

Evaluating Language Models

What is "correct"? What is working "well"?

Extrinsic: evaluate the LM in a downstream task
• Test an MT, ASR, etc. system and see which LM does better
• Propagates & conflates errors

Intrinsic: treat the LM as its own downstream task
• Use perplexity (from information theory)

SLIDE 163

Perplexity

Lower is better: lower perplexity → less surprised

More outcomes → more surprised
Fewer outcomes → less surprised

SLIDE 164

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = exp( −(1/N) Σ_{j=1}^{N} log p(xj | hj) )

hj: the n-gram history (n−1 items)

SLIDE 165

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = exp( −(1/N) Σ_{j=1}^{N} log p(xj | hj) )

• p(xj | hj): ≥ 0 and ≤ 1 (higher is better)

SLIDE 166

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = exp( −(1/N) Σ_{j=1}^{N} log p(xj | hj) )

• p(xj | hj): ≥ 0 and ≤ 1 (higher is better)
• log p(xj | hj): ≤ 0 (higher is better)

SLIDE 167

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = exp( −(1/N) Σ_{j=1}^{N} log p(xj | hj) )

• p(xj | hj): ≥ 0 and ≤ 1 (higher is better)
• log p(xj | hj): ≤ 0 (higher is better)
• Σ_j log p(xj | hj): ≤ 0 (higher is better)

SLIDE 168

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = exp( −(1/N) Σ_{j=1}^{N} log p(xj | hj) )

• p(xj | hj): ≥ 0 and ≤ 1 (higher is better)
• log p(xj | hj): ≤ 0 (higher is better)
• Σ_j log p(xj | hj): ≤ 0 (higher is better)
• −(1/N) Σ_j log p(xj | hj): ≥ 0 (lower is better)

SLIDE 169

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = exp( −(1/N) Σ_{j=1}^{N} log p(xj | hj) )

• p(xj | hj): ≥ 0 and ≤ 1 (higher is better)
• log p(xj | hj): ≤ 0 (higher is better)
• Σ_j log p(xj | hj): ≤ 0 (higher is better)
• −(1/N) Σ_j log p(xj | hj): ≥ 0 (lower is better)
• perplexity: ≥ 0 (lower is better)

SLIDE 170

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = exp( −(1/N) Σ_{j=1}^{N} log p(xj | hj) )

• p(xj | hj): ≥ 0 and ≤ 1 (higher is better)
• log p(xj | hj): ≤ 0 (higher is better)
• Σ_j log p(xj | hj): ≤ 0 (higher is better)
• −(1/N) Σ_j log p(xj | hj): ≥ 0 (lower is better)
• perplexity: ≥ 0 (lower is better)
• the base of exp and log must be the same

SLIDE 171

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = exp( −(1/N) Σ_{j=1}^{N} log p(xj | hj) ) = ( ∏_{j=1}^{N} 1 / p(xj | hj) )^(1/N)

weighted geometric average
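A sketch computing perplexity both ways, with hypothetical per-token probabilities; the two forms agree up to floating-point rounding:

```python
import math

# probs[j] stands for p(x_j | h_j) under the model being evaluated (made up here)
probs = [0.1, 0.25, 0.5, 0.05]
N = len(probs)

# form 1: exp of the average negative log-probability
ppl_exp = math.exp(-sum(math.log(p) for p in probs) / N)

# form 2: N-th root of the product of inverse probabilities (geometric average)
prod_inv = 1.0
for p in probs:
    prod_inv *= 1.0 / p
ppl_root = prod_inv ** (1.0 / N)

print(ppl_exp, ppl_root)  # identical values; lower is better
```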

SLIDE 172

Perplexity

Lower is better: lower perplexity → less surprised

perplexity = ( ∏_{j=1}^{N} 1 / p(xj | hj) )^(1/N)

471/671: Branching factor

SLIDE 173

Outline

• Probability review
• Words
• Defining Language Models
• Breaking & Fixing Language Models
• Evaluating Language Models