Machine translation: p(strong winds) > p(large winds) - PowerPoint PPT Presentation



SLIDE 5

▪ Machine translation

▪ p(strong winds) > p(large winds)

▪ Spell correction

▪ The office is about fifteen minuets from my house
▪ p(about fifteen minutes from) > p(about fifteen minuets from)

▪ Speech recognition

▪ p(I saw a van) >> p(eyes awe of an)

▪ Summarization, question answering, handwriting recognition, OCR, etc.

SLIDE 6

▪ We want to predict a sentence given acoustics:

  w* = argmax_w P(w | a)

[Diagram: source w → noisy channel → observed a → decoder → best w]

SLIDE 7

▪ The noisy-channel approach:

[Diagram: source w → noisy channel → observed a → decoder → best w]

  P(w): the prior, given by the language model, a distribution over sequences of words (sentences)
  P(a | w): the likelihood, given by the acoustic model (HMMs)
SLIDE 8

Decoder hypotheses:

the station signs are in deep in english
the stations signs are in deep in english
the station signs are in deep into english
the station 's signs are in deep in english
the station signs are in deep in the english
the station signs are indeed in english
the station 's signs are indeed in english
the station signs are indians in english
the station signs are indian in english
the stations signs are indians in english
the stations signs are indians and english

[Diagram: source P(w) → channel P(a|w) → observed a → decoder → best w]

Language Model: P(w)    Acoustic Model: P(a|w)

the station 's signs are in deep in english

  argmax_w P(w | a) = argmax_w P(a | w) P(w)
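The argmax is computed hypothesis by hypothesis: score each candidate sentence by log P(a|w) + log P(w) and keep the best. A minimal sketch in Java (the scores below are hypothetical, not from the slides):

```java
public class NoisyChannel {
    // Pick the hypothesis w maximizing log P(a|w) + log P(w),
    // i.e. argmax_w P(a|w) P(w) computed in log space.
    static int argmaxHypothesis(double[] logAcoustic, double[] logLm) {
        int best = 0;
        for (int i = 1; i < logAcoustic.length; i++) {
            if (logAcoustic[i] + logLm[i] > logAcoustic[best] + logLm[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical log scores for three candidate transcriptions.
        double[] logAcoustic = {-10.0, -9.5, -12.0};
        double[] logLm = {-8.0, -9.0, -4.0};
        System.out.println(argmaxHypothesis(logAcoustic, logLm)); // 2
    }
}
```

Note the language model can rescue a hypothesis with a worse acoustic score, which is exactly the point of the prior.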

SLIDE 9

[Diagram: source P(e) → channel P(f|e) → observed f → decoder → best e]

Language Model: P(e)    Translation Model: P(f|e)

  argmax_e P(e | f) = argmax_e P(f | e) P(e)

sent transmission: English
recovered transmission: French
recovered message: English'

SLIDE 10

Decoder hypotheses with scores:

  the station signs are in deep in english         14732
  the stations signs are in deep in english        14735
  the station signs are in deep into english       14739
  the station 's signs are in deep in english      14740
  the station signs are in deep in the english     14741
  the station signs are indeed in english          14757
  the station 's signs are indeed in english       14760
  the station signs are indians in english         14790
  the station signs are indian in english          14799
  the stations signs are indians in english        14807
  the stations signs are indians and english       14815
SLIDE 11

▪ A language model is a distribution over sequences of words (sentences)

▪ What's w? (closed vs open vocabulary)
▪ What's n? (must sum to one over all lengths)
▪ Can have rich structure or be linguistically naive

▪ Why language models?

▪ Usually the point is to assign high weights to plausible sentences (cf. acoustic confusions)
▪ This is not the same as modeling grammaticality

SLIDE 12

▪ Language models are distributions over sentences
▪ N-gram models are built from local conditional probabilities
▪ The methods we've seen are backed by corpus n-gram counts

SLIDE 26

  Training Data:  counts / parameters from here
  Held-Out Data:  hyperparameters from here
  Test Data:      evaluate here

SLIDE 27

▪ We often want to make estimates from sparse statistics:
▪ Smoothing flattens spiky distributions so they generalize better:
▪ Very important all over NLP, but easy to do badly

  Before smoothing:        After smoothing:
  P(w | denied the)        P(w | denied the)
    3   allegations          2.5 allegations
    2   reports              1.5 reports
    1   claims               0.5 claims
    1   request              0.5 request
                             2   other
    7   total                7   total
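One scheme that reproduces the smoothed numbers above is absolute discounting: subtract a fixed d = 0.5 from each of the four seen counts and reserve the freed mass (4 × 0.5 = 2) for other words. A sketch (the method name and the "<other>" key are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AbsoluteDiscounting {
    // Subtract a fixed discount d from every seen count; the freed
    // mass (d * number of seen types) is reserved for unseen words.
    static Map<String, Double> discount(Map<String, Integer> counts, double d) {
        Map<String, Double> smoothed = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            smoothed.put(e.getKey(), e.getValue() - d);
        }
        smoothed.put("<other>", d * counts.size());
        return smoothed;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("allegations", 3);
        counts.put("reports", 2);
        counts.put("claims", 1);
        counts.put("request", 1);
        // Reproduces the slide: 2.5, 1.5, 0.5, 0.5, and 2 for "other".
        System.out.println(discount(counts, 0.5));
    }
}
```

The total stays 7: discounting reallocates probability mass, it never creates or destroys it.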

SLIDE 36

The LAMBADA dataset

Context: "Why?" "I would have thought you'd find him rather dry," she said. "I don't know about that," said Gabriel. "He was a great craftsman," said Heather. "That he was," said Flannery.

Target sentence: "And Polish, to boot," said _______.
Target word: Gabriel

[Paperno et al. 2016]

SLIDE 37

Other Techniques?

▪ Lots of other techniques

▪ Maximum entropy LMs
▪ Neural network LMs (soon)
▪ Syntactic / grammar-structured LMs (later)

SLIDE 38

How to Build an LM

SLIDE 39

▪ Good LMs need lots of n-grams!

[Brants et al, 2007]

SLIDE 40

▪ Key function: map from n-grams to counts

  searching for the best      192593
  searching for the right      45805
  searching for the cheapest   44965
  searching for the perfect    43959
  searching for the truth      23165
  searching for the "          19086
  searching for the most       15512
  searching for the latest     12670
  searching for the next       10120
  searching for the lowest     10080
  searching for the name        8402
  searching for the finest      8171
  ...

SLIDE 41

https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html

SLIDE 42

  • 24 GB compressed
  • 6 DVDs

SLIDE 43

[Recap of Slide 8: source P(w) → channel P(a|w) → observed a → decoder → best w, with the decoder hypotheses and argmax_w P(w | a) = argmax_w P(a | w) P(w)]

SLIDE 44

hash(cat) = 2    c(cat) = 12
hash(the) = 2    c(the) = 87
hash(and) = 5    c(and) = 76
hash(dog) = 7    c(dog) = 11
hash(have) = 2   c(have) = ?

[Hash table with slots 1-7 storing (key, value) pairs: cat→12, the→87, and→76, dog→11; hash(have) = 2 collides with cat and the]

SLIDE 45

HashMap<String, Long> ngram_counts = new HashMap<>();
String ngram1 = "I have a car";
String ngram2 = "I have a cat";
ngram_counts.put(ngram1, 123L);
ngram_counts.put(ngram2, 333L);

SLIDE 46

HashMap<String[], Long> ngram_counts = new HashMap<>();
String[] ngram1 = {"I", "have", "a", "car"};
String[] ngram2 = {"I", "have", "a", "cat"};
ngram_counts.put(ngram1, 123L);
ngram_counts.put(ngram2, 333L);
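One caveat with the String[] version: Java arrays inherit identity-based hashCode/equals from Object, so two arrays containing the same words are different keys. A small demonstration:

```java
import java.util.HashMap;

public class ArrayKeyPitfall {
    public static void main(String[] args) {
        HashMap<String[], Long> ngram_counts = new HashMap<>();
        String[] ngram1 = {"I", "have", "a", "cat"};
        ngram_counts.put(ngram1, 333L);

        // A second array with identical contents is a *different* key:
        // arrays compare by identity, so this lookup misses.
        String[] ngram2 = {"I", "have", "a", "cat"};
        System.out.println(ngram_counts.get(ngram1)); // 333
        System.out.println(ngram_counts.get(ngram2)); // null
    }
}
```

Keys that compare by content (a List<String>, or the joined String of the previous slide) avoid this, at the memory cost the next slide quantifies.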

SLIDE 47

Per 3-gram, HashMap<String[], Long> ngram_counts costs at least:

  1 pointer   = 8 bytes
  1 Long      = 8 bytes (object header) + 8 bytes (long)
  1 Map.Entry = 8 bytes (object header) + 3 x 8 bytes (pointers)
  1 String[]  = 8 bytes (object header) + 3 x 8 bytes (pointers)
                ... at best, if Strings are canonicalized

  Total: > 88 bytes

  4 billion n-grams * 88 bytes = 352 GB

Obvious alternatives:

  • Sorted arrays
  • Open addressing

SLIDE 48

hash(cat) = 2   c(cat) = 12
hash(the) = 2   c(the) = 87
hash(and) = 5   c(and) = 76
hash(dog) = 7   c(dog) = 11

[Empty hash table with slots 1-7 and key/value columns]

SLIDE 49

hash(cat) = 2    c(cat) = 12
hash(the) = 2    c(the) = 87
hash(and) = 5    c(and) = 76
hash(dog) = 7    c(dog) = 11
hash(have) = 2   c(have) = ?

[Hash table, slots 1-7, filled with cat, the, and, dog and their counts; hash(have) = 2 collides]

SLIDE 50

hash(cat) = 2   c(cat) = 12
hash(the) = 2   c(the) = 87
hash(and) = 5   c(and) = 76
hash(dog) = 7   c(dog) = 11

[Hash table, slots 1-7, with collision chains extending into overflow slots 14, 15, ...]

SLIDE 51

▪ Closed address hashing

▪ Resolve collisions with chains
▪ Easier to understand but bigger

▪ Open address hashing

▪ Resolve collisions with probe sequences
▪ Smaller but easy to mess up

▪ Direct-address hashing

▪ No collision resolution
▪ Just eject previous entries
▪ Not suitable for core LM storage
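A minimal open-addressing sketch with linear probing (class and method names are mine, not from the slides):

```java
public class OpenAddressMap {
    // Open addressing: on a collision, probe forward to the next slot.
    private final String[] keys;
    private final long[] vals;

    public OpenAddressMap(int capacity) {
        keys = new String[capacity];
        vals = new long[capacity];
    }

    public void put(String key, long val) {
        int i = Math.floorMod(key.hashCode(), keys.length);
        // Linear probe sequence; assumes the table never fills up.
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        vals[i] = val;
    }

    public Long get(String key) {
        int i = Math.floorMod(key.hashCode(), keys.length);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return vals[i];
            i = (i + 1) % keys.length;
        }
        return null; // never stored
    }

    public static void main(String[] args) {
        OpenAddressMap m = new OpenAddressMap(8);
        m.put("the cat", 12L);
        m.put("the dog", 11L);
        System.out.println(m.get("the cat")); // 12
        System.out.println(m.get("a van"));   // null
    }
}
```

The "easy to mess up" part is real: deletion, resizing, and load-factor control all need care that this sketch omits.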

SLIDE 52

[Recap of Slide 47: HashMap<String[], Long> costs > 88 bytes per 3-gram at best; obvious alternatives are sorted arrays and open addressing]

SLIDE 53

n-gram: the cat laughed    count: 233

word ids: 7, 1, 15

SLIDE 54

Got 3 numbers under 2^20 to store? They fit in a primitive 64-bit long:

  20 bits     20 bits     20 bits
  7           1           15
  0...00111   0...00001   0...01111
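The three 20-bit fields can be packed and unpacked with shifts and masks; a sketch assuming every word id is below 2^20 (names are mine):

```java
public class NgramPack {
    private static final long MASK_20 = (1L << 20) - 1;

    // Pack three word ids (each < 2^20) into one primitive 64-bit long.
    static long pack(long id1, long id2, long id3) {
        return (id1 << 40) | (id2 << 20) | id3;
    }

    // slot 0 = first word, slot 2 = last word.
    static long unpack(long packed, int slot) {
        return (packed >>> (20 * (2 - slot))) & MASK_20;
    }

    public static void main(String[] args) {
        long key = pack(7, 1, 15); // "the cat laughed" as ids 7, 1, 15
        System.out.println(unpack(key, 0)); // 7
        System.out.println(unpack(key, 1)); // 1
        System.out.println(unpack(key, 2)); // 15
    }
}
```

The packed long can then serve directly as a sort key or a hash-table key, with no per-n-gram objects at all.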

SLIDE 55

n-gram: the cat laughed    count: 233

n-gram encoding: 15176595

  32 bytes → 8 bytes

SLIDE 56

[Recap of Slide 47: HashMap<String[], Long> costs > 88 bytes per 3-gram at best; obvious alternatives are sorted arrays and open addressing]

SLIDE 57

c(the) = 23135851162 < 2^35

35 bits represent integers between 0 and 2^35.

  n-gram encoding (60 bits): 15176595    count (35 bits): 233

SLIDE 58

  • 24 GB compressed
  • 6 DVDs

SLIDE 59

# unique counts = 770000 < 2^20

20 bits represent the ranks of all counts.

  n-gram encoding (60 bits): 15176595    rank (20 bits): 3

  rank  count
  1     1
  2     2
  3     51
  ...   ...
        233
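Storing a 20-bit rank instead of a 35-bit count needs only one small side table of the distinct count values; a sketch with a hypothetical four-entry table:

```java
import java.util.Arrays;

public class CountRanks {
    // Sorted table of distinct count values; index = rank.
    // With ~770,000 unique counts (< 2^20), a rank fits in 20 bits
    // even though the raw counts need up to 35.
    private final long[] countByRank;

    public CountRanks(long[] sortedDistinctCounts) {
        countByRank = sortedDistinctCounts;
    }

    public int rankOf(long count) {
        // Binary search the sorted table; assumes the count occurs.
        return Arrays.binarySearch(countByRank, count);
    }

    public long countOf(int rank) {
        return countByRank[rank];
    }

    public static void main(String[] args) {
        // Hypothetical table of distinct counts.
        CountRanks ranks = new CountRanks(new long[]{1, 2, 51, 233});
        System.out.println(ranks.rankOf(233)); // 3
        System.out.println(ranks.countOf(2));  // 51
    }
}
```

The per-n-gram record shrinks from 60+35 to 60+20 bits; the one shared table costs a negligible few megabytes.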

SLIDE 60

Count lookup pipeline: trigram → bigram → unigram, with a vocabulary mapping words to ids and a count DB.

N-gram encoding scheme:
  unigram: f(id) = id
  bigram:  f(id1, id2) = ?
  trigram: f(id1, id2, id3) = ?

SLIDE 64

[Many details from Pauls and Klein, 2011]

SLIDE 67

Compression

SLIDE 69

Encoding "9" [Elias, 75]:

  length in unary, then the number in binary:  000 1001
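The Elias gamma code writes a number's bit length in unary (as leading zeros) and then the number in binary, so frequent small counts take few bits; 9 = 1001 in binary becomes 000 1001. A sketch over bit strings (a real implementation would pack bits, not characters):

```java
public class EliasGamma {
    // Gamma code of n >= 1: (bit length - 1) zeros, then n in binary.
    static String encode(long n) {
        String binary = Long.toBinaryString(n);
        return "0".repeat(binary.length() - 1) + binary;
    }

    static long decode(String code) {
        int zeros = code.indexOf('1'); // unary length prefix
        return Long.parseLong(code.substring(zeros, 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        System.out.println(encode(9));         // 0001001
        System.out.println(decode("0001001")); // 9
        System.out.println(encode(1));         // 1
    }
}
```

Because the prefix announces the length, codes are self-delimiting and can be concatenated into one bit stream with no separators.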

SLIDE 70

Speed-Ups

SLIDE 72

LM can be more than 10x faster w/ direct-address caching

slide-73
SLIDE 73

▪ Simplest option: hash-and-hope

▪ Array of size K ~ N
▪ (optional) store hash of keys
▪ Store values in direct-address
▪ Collisions: store the max
▪ What kind of errors can there be?

▪ More complex options, like Bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
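A sketch of hash-and-hope as described: values in a direct-address array, collisions resolved by keeping the max, so the answer to the question above is that errors are one-sided: a query can only over-count, never under-count (class and method names are mine):

```java
public class HashAndHope {
    // Direct-address array of size K ~ N; no keys stored at all.
    private final long[] values;

    public HashAndHope(int capacity) {
        values = new long[capacity];
    }

    private int slot(String ngram) {
        return Math.floorMod(ngram.hashCode(), values.length);
    }

    // On collision, keep the max: the stored value for a slot is an
    // upper bound on the count of every n-gram hashing there.
    public void put(String ngram, long count) {
        int i = slot(ngram);
        values[i] = Math.max(values[i], count);
    }

    public long get(String ngram) {
        return values[slot(ngram)];
    }

    public static void main(String[] args) {
        HashAndHope counts = new HashAndHope(1 << 10);
        counts.put("the cat laughed", 233L);
        System.out.println(counts.get("the cat laughed")); // at least 233
    }
}
```

Optionally storing a small hash of each key turns most false hits into detectable misses at a few extra bits per entry.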