How to Build an LM

SLIDE 1
SLIDE 2

How to Build an LM

SLIDE 3

▪ Good LMs need lots of n-grams!

[Brants et al, 2007]

SLIDE 4

▪ Key function: map from n-grams to counts

…
searching for the best       192593
searching for the right       45805
searching for the cheapest    44965
searching for the perfect     43959
searching for the truth       23165
searching for the “           19086
searching for the most        15512
searching for the latest      12670
searching for the next        10120
searching for the lowest      10080
searching for the name         8402
searching for the finest       8171
…

SLIDE 5

https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html

SLIDE 6
  • 24GB compressed
  • 6 DVDs
SLIDE 7

c(cat) = 12     hash(cat) = 2
c(the) = 87     hash(the) = 2
c(and) = 76     hash(and) = 5
c(dog) = 11     hash(dog) = 7
c(have) = ?     hash(have) = 2

[Figure: hash table with slots 1 to 7 and key/value rows, holding cat 12, the 87, and 76, dog 11]

SLIDE 8

HashMap<String, Long> ngram_counts = new HashMap<>();
String ngram1 = "I have a car";
String ngram2 = "I have a cat";
ngram_counts.put(ngram1, 123L);
ngram_counts.put(ngram2, 333L);

SLIDE 9

HashMap<String[], Long> ngram_counts = new HashMap<>();
String[] ngram1 = {"I", "have", "a", "car"};
String[] ngram2 = {"I", "have", "a", "cat"};
ngram_counts.put(ngram1, 123L);
ngram_counts.put(ngram2, 333L);
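
One caveat with array keys is worth a quick sketch: Java arrays inherit identity-based `equals`/`hashCode` from `Object`, so a `HashMap<String[], Long>` only finds an entry when queried with the very same array object. The class and method names below are hypothetical, and the `List<String>` key is just one possible workaround, not something the slides prescribe.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

// Hypothetical demo: array keys are compared by identity, list keys by content.
class NgramKeyDemo {
    static Long lookupWithArrayKey() {
        HashMap<String[], Long> counts = new HashMap<>();
        counts.put(new String[] {"I", "have", "a", "car"}, 123L);
        // A second array with the same words is a different object, so this misses.
        return counts.get(new String[] {"I", "have", "a", "car"});
    }

    static Long lookupWithListKey() {
        HashMap<List<String>, Long> counts = new HashMap<>();
        counts.put(Arrays.asList("I", "have", "a", "car"), 123L);
        // Lists define equals/hashCode over their elements, so this hits.
        return counts.get(Arrays.asList("I", "have", "a", "car"));
    }

    public static void main(String[] args) {
        System.out.println(lookupWithArrayKey());
        System.out.println(lookupWithListKey());
    }
}
```

Either way, each entry still pays for several heap objects, which is the cost the next slide adds up.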

SLIDE 10

HashMap<String[], Long> ngram_counts;

Per 3-gram:

  • 1 Pointer = 8 bytes
  • 1 Map.Entry = 8 bytes (obj) + 3x8 bytes (pointers)
  • 1 Long = 8 bytes (obj) + 8 bytes (long)
  • 1 String[] = 8 bytes (obj) + 3x8 bytes (pointers)
  • … at best, if Strings are canonicalized

Total: > 88 bytes

Obvious alternatives:

  • Sorted arrays
  • Open addressing

4 billion ngrams * 88 bytes = 352 GB
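
The first alternative, sorted arrays, can be sketched with two parallel primitive arrays and a binary search. This is only an illustration: it assumes each n-gram has already been reduced to a single `long` key, and the class name `SortedNgramCounts` and the sample keys are made up.

```java
import java.util.Arrays;

// Hypothetical sketch: n-gram counts held in two parallel primitive arrays.
// Assumes every n-gram was already encoded as one long key.
class SortedNgramCounts {
    private final long[] keys;    // sorted ascending
    private final long[] counts;  // counts[i] belongs to keys[i]

    SortedNgramCounts(long[] sortedKeys, long[] counts) {
        this.keys = sortedKeys;
        this.counts = counts;
    }

    // Returns the stored count, or 0 if the key is absent.
    long count(long key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? counts[i] : 0L;
    }

    public static void main(String[] args) {
        SortedNgramCounts c = new SortedNgramCounts(
            new long[] {3L, 42L, 1000L}, new long[] {12L, 87L, 76L});
        System.out.println(c.count(42L));  // stored key
        System.out.println(c.count(7L));   // absent key
    }
}
```

In this layout an entry costs 16 bytes (one `long` key plus one `long` count) with no per-entry objects, at the price of O(log n) lookups and an index that is awkward to update.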

SLIDE 11

c(cat) = 12     hash(cat) = 2
c(the) = 87     hash(the) = 2
c(and) = 76     hash(and) = 5
c(dog) = 11     hash(dog) = 7

[Figure: hash table with slots 1 to 7 and key/value rows]

SLIDE 12

c(cat) = 12     hash(cat) = 2
c(the) = 87     hash(the) = 2
c(and) = 76     hash(and) = 5
c(dog) = 11     hash(dog) = 7
c(have) = ?     hash(have) = 2

[Figure: hash table with slots 1 to 7 and key/value rows; keys cat, the, and, dog; cell contents shown: 12, 87, 5, 7]

SLIDE 13

c(cat) = 12     hash(cat) = 2
c(the) = 87     hash(the) = 2
c(and) = 76     hash(and) = 5
c(dog) = 11     hash(dog) = 7

[Figure: hash table with key/value rows, slots 1 to 7 … 14, 15]

SLIDE 14

▪ Closed address hashing
  ▪ Resolve collisions with chains
  ▪ Easier to understand but bigger

▪ Open address hashing
  ▪ Resolve collisions with probe sequences
  ▪ Smaller but more complicated implementation

▪ Direct-address hashing
  ▪ No collision resolution
  ▪ Just eject previous entries
  ▪ Not suitable for core LM storage
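
To make the open-address option concrete, here is a minimal sketch using linear probing, the simplest probe sequence. It assumes non-negative `long` keys (so `-1` can mark an empty slot) and a table that never fills up; there is no resizing or deletion. The class name `OpenAddressCounts` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of an open-address hash map from long keys to long counts.
// Assumes keys are non-negative and the table is never completely full.
class OpenAddressCounts {
    private static final long EMPTY = -1L;
    private final long[] keys;
    private final long[] values;

    OpenAddressCounts(int capacity) {
        keys = new long[capacity];
        values = new long[capacity];
        Arrays.fill(keys, EMPTY);
    }

    // Probe sequence: start at the hashed slot, then try the next slot, wrapping around.
    private int findSlot(long key) {
        int i = Math.floorMod(Long.hashCode(key), keys.length);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length;
        }
        return i;
    }

    void put(long key, long value) {
        int i = findSlot(key);
        keys[i] = key;
        values[i] = value;
    }

    // Returns the stored count, or 0 if the key is absent.
    long get(long key) {
        int i = findSlot(key);
        return keys[i] == key ? values[i] : 0L;
    }

    public static void main(String[] args) {
        OpenAddressCounts t = new OpenAddressCounts(8);
        t.put(2L, 12L);
        t.put(10L, 87L);  // hashes to the same slot as 2 in a table of size 8
        System.out.println(t.get(2L) + " " + t.get(10L));
    }
}
```

Keys and values live in flat primitive arrays, so there are no chain nodes or boxed objects; the cost is the extra probing logic and the need to keep some slots free.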

SLIDE 15

HashMap<String[], Long> ngram_counts;

Per 3-gram:

  • 1 Pointer = 8 bytes
  • 1 Map.Entry = 8 bytes (obj) + 3x8 bytes (pointers)
  • 1 Long = 8 bytes (obj) + 8 bytes (long)
  • 1 String[] = 8 bytes (obj) + 3x8 bytes (pointers)
  • … at best, if Strings are canonicalized

Total: > 88 bytes

Obvious alternatives:

  • Sorted arrays
  • Open addressing

SLIDE 16

n-gram: the cat laughed
count: 233
word ids: 7 1 15

SLIDE 17

Got 3 numbers under 2^20 to store?

7            1            15
0…00111      0…00001      0…01111
20 bits      20 bits      20 bits

Fits in a primitive 64-bit long
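
The packing can be sketched with shifts and masks. The layout below (first id in the highest of the three 20-bit fields) is an assumption made for illustration, so the resulting number depends on that choice; `TrigramPacker` and its method names are hypothetical.

```java
// Hypothetical sketch: pack three word ids, each below 2^20, into one long.
// Assumed layout: id1 in bits 40..59, id2 in bits 20..39, id3 in bits 0..19.
class TrigramPacker {
    static final int BITS = 20;
    static final long MASK = (1L << BITS) - 1;

    private static void check(int id) {
        if (id < 0 || id > MASK) {
            throw new IllegalArgumentException("id out of 20-bit range: " + id);
        }
    }

    static long pack(int id1, int id2, int id3) {
        check(id1);
        check(id2);
        check(id3);
        return ((long) id1 << (2 * BITS)) | ((long) id2 << BITS) | (long) id3;
    }

    // position is 0, 1 or 2 for the first, second or third word id.
    static int unpack(long packed, int position) {
        int shift = (2 - position) * BITS;
        return (int) ((packed >>> shift) & MASK);
    }

    public static void main(String[] args) {
        long key = pack(7, 1, 15);
        System.out.println(unpack(key, 0) + " " + unpack(key, 1) + " " + unpack(key, 2));
    }
}
```

The three fields use 60 of the 64 bits, so the key is a plain primitive and needs no object header or pointers.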

SLIDE 18

n-gram: the cat laughed
count: 233
n-gram encoding: 15176595

32 bytes → 8 bytes

SLIDE 19

HashMap<String[], Long> ngram_counts;

Per 3-gram:

  • 1 Pointer = 8 bytes
  • 1 Map.Entry = 8 bytes (obj) + 3x8 bytes (pointers)
  • 1 Long = 8 bytes (obj) + 8 bytes (long)
  • 1 String[] = 8 bytes (obj) + 3x8 bytes (pointers)
  • … at best, if Strings are canonicalized

Total: > 88 bytes

Obvious alternatives:

  • Sorted arrays
  • Open addressing

SLIDE 20

c(the) = 23,135,851,162 < 2^35

35 bits to represent integers between 0 and 2^35

n-gram encoding: 15176595 (60 bits)
count: 233 (35 bits)

SLIDE 21
  • 24GB compressed
  • 6 DVDs
SLIDE 22

# unique counts = 770000 < 2^20

20 bits to represent ranks of all counts

n-gram encoding: 15176595 (60 bits)
rank: 3 (20 bits)

[Table: rank → count lookup, including rank 3 → count 233; other cells shown: 1, 1, 2, 2, 51]
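
The rank trick can be sketched as a sorted array of the distinct counts, where a count's rank is its index. This is an illustration under assumptions: ranks here start at 0, which may differ from the numbering on the slide, and `CountRanks` is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical sketch: store a small rank per n-gram and look the count up in one shared table.
class CountRanks {
    private final long[] distinctCounts;  // sorted ascending; index = rank (0-based)

    CountRanks(long[] allCounts) {
        distinctCounts = Arrays.stream(allCounts).distinct().sorted().toArray();
    }

    int rankOf(long count) {
        int r = Arrays.binarySearch(distinctCounts, count);
        if (r < 0) {
            throw new IllegalArgumentException("unknown count: " + count);
        }
        return r;
    }

    long countOf(int rank) {
        return distinctCounts[rank];
    }

    // Bits needed to store the largest rank.
    int bitsPerRank() {
        return 64 - Long.numberOfLeadingZeros(Math.max(1, distinctCounts.length - 1));
    }

    public static void main(String[] args) {
        CountRanks ranks = new CountRanks(new long[] {233L, 1L, 51L, 2L, 1L, 233L});
        System.out.println(ranks.countOf(ranks.rankOf(233L)));
    }
}
```

The lookup table is stored once, so its cost is shared across all n-grams while each entry shrinks from 35 bits of count to 20 bits of rank.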

SLIDE 23

Components: Vocabulary, Counts lookup, Count DB (unigram, bigram, trigram)

N-gram encoding scheme

unigram: f(id) = id
bigram: f(id1, id2) = ?
trigram: f(id1, id2, id3) = ?

SLIDE 24
SLIDE 25

▪ we’ll expand to more than 3-grams
▪ we’ll support vocabulary with 14M words

SLIDE 26
SLIDE 27

[Many details from Pauls and Klein, 2011]

SLIDE 28
SLIDE 29
SLIDE 30

Compression

SLIDE 31
SLIDE 32

Encoding “9”:

000 (length in unary) 1001 (number in binary)

[Elias, 75]

2.9 10
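
The “9” example above can be reproduced with a short sketch of this variable-length code: write one fewer zeros than the number has binary digits, then the binary digits themselves. Bits are kept in a `String` purely for readability; a real store would write into a bit stream. `EliasGamma` is a hypothetical class name.

```java
// Hypothetical sketch of the code shown for "9": unary length prefix, then binary digits.
class EliasGamma {
    static String encode(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        String binary = Integer.toBinaryString(n);
        StringBuilder out = new StringBuilder();
        // One zero for every binary digit after the leading 1.
        for (int i = 1; i < binary.length(); i++) {
            out.append('0');
        }
        return out.append(binary).toString();
    }

    // Decodes a single code word that starts at the beginning of bits.
    static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') {
            zeros++;
        }
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        System.out.println(encode(9));          // prints 0001001
        System.out.println(decode("0001001"));  // prints 9
    }
}
```

Small numbers get short codes (1 is a single bit), which is why the scheme pays off when most stored values are small.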

SLIDE 33

Speed-Ups

SLIDE 34
SLIDE 35

LM can be more than 10x faster w/ direct-address caching
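
A direct-address cache can be sketched as a fixed array in front of the slower lookup: each key maps to exactly one slot, and a new key simply ejects whatever was there. The wrapped lookup is passed in as a function; `DirectAddressCache`, its `misses` counter, and the toy lookup in `main` are all hypothetical.

```java
import java.util.function.LongToDoubleFunction;

// Hypothetical sketch: direct-address cache in front of a slow LM lookup.
class DirectAddressCache {
    private final long[] cachedKeys;
    private final double[] cachedValues;
    private final boolean[] filled;
    private final LongToDoubleFunction slowLookup;
    int misses = 0;  // how often the slow lookup was actually called

    DirectAddressCache(int size, LongToDoubleFunction slowLookup) {
        this.cachedKeys = new long[size];
        this.cachedValues = new double[size];
        this.filled = new boolean[size];
        this.slowLookup = slowLookup;
    }

    double get(long key) {
        int slot = Math.floorMod(Long.hashCode(key), cachedKeys.length);
        if (filled[slot] && cachedKeys[slot] == key) {
            return cachedValues[slot];  // hit: no probing, no chains
        }
        misses++;
        double value = slowLookup.applyAsDouble(key);
        // No collision resolution: overwrite (eject) the previous occupant.
        cachedKeys[slot] = key;
        cachedValues[slot] = value;
        filled[slot] = true;
        return value;
    }

    public static void main(String[] args) {
        DirectAddressCache cache = new DirectAddressCache(4, k -> k * 0.5);
        cache.get(3L);
        cache.get(3L);
        System.out.println(cache.misses);
    }
}
```

Because the full key is stored and checked, an ejection only costs a repeated slow lookup and never returns a wrong value, which is what makes this safe as a cache even though it is unsuitable as the core store.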

SLIDE 36

▪ Simplest option: hash-and-hope
  ▪ Array of size K ~ N
  ▪ (optional) store hash of keys
  ▪ Store values in direct-address
  ▪ Collisions: store the max
  ▪ What kind of errors can there be?

▪ More complex options, like bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
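
Hash-and-hope can be sketched in a few lines, which also makes its error modes visible. This version skips the optional stored key hash, so it keeps values only; `HashAndHope` is a hypothetical name and the keys are assumed to be already-encoded n-grams.

```java
// Hypothetical sketch of a lossy hash-and-hope count table: values only, no keys.
class HashAndHope {
    private final long[] values;

    HashAndHope(int size) {
        values = new long[size];
    }

    private int slot(long key) {
        return Math.floorMod(Long.hashCode(key), values.length);
    }

    // On a collision, keep the larger of the two counts.
    void put(long key, long count) {
        int s = slot(key);
        values[s] = Math.max(values[s], count);
    }

    // Returns whatever is stored in the key's slot, right or wrong.
    long get(long key) {
        return values[slot(key)];
    }

    public static void main(String[] args) {
        HashAndHope table = new HashAndHope(8);
        table.put(2L, 12L);
        table.put(10L, 87L);  // shares a slot with key 2 in a table of size 8
        System.out.println(table.get(2L));
    }
}
```

In this sketch the errors are one-sided: a key that collided with a more frequent key reads back the larger count, and a key that was never inserted reads back the count of whatever landed in its slot. Counts are never underestimated.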