
How to Build an LM

▪ Good LMs need lots of n-grams! [Brants et al, 2007]

▪ Key function: map from n-grams to counts

…
searching for the best 192593
searching for the right 45805
searching for the cheapest 44965
searching for the perfect 43959
searching for the truth 23165
searching for the “ 19086
searching for the most 15512
searching for the latest 12670
searching for the next 10120
searching for the lowest 10080
searching for the name 8402
searching for the finest 8171
…

https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html

● 24GB compressed
● 6 DVDs

Figure: hash table with slots 0 to 7.

c(cat) = 12, hash(cat) = 2
c(the) = 87, hash(the) = 2
c(and) = 76, hash(and) = 5
c(dog) = 11, hash(dog) = 7
c(have) = ?, hash(have) = 2

slot  key  value
2     cat  12
      the  87   (also hashes to slot 2)
5     and  76
7     dog  11

import java.util.HashMap;

HashMap<String, Long> ngram_counts = new HashMap<>();
String ngram1 = "I have a car";
String ngram2 = "I have a cat";
ngram_counts.put(ngram1, 123L);  // values are Long, so use long literals
ngram_counts.put(ngram2, 333L);

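import java.util.HashMap;

// Note: Java arrays hash by identity, so a lookup only succeeds with the same array instance.
HashMap<String[], Long> ngram_counts = new HashMap<>();
String[] ngram1 = {"I", "have", "a", "car"};
String[] ngram2 = {"I", "have", "a", "cat"};
ngram_counts.put(ngram1, 123L);
ngram_counts.put(ngram2, 333L);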

HashMap<String[], Long> ngram_counts;

Per 3-gram:
  1 Pointer = 8 bytes
  1 Map.Entry = 8 bytes (obj) + 3x8 bytes (pointers)
  1 Long = 8 bytes (obj) + 8 bytes (long)
  1 String[] = 8 bytes (obj) + 3x8 bytes (pointers)
  … at best Strings are canonicalized
Total: > 88 bytes

4 billion ngrams * 88 bytes = 352 GB

Obvious alternatives:
- Sorted arrays
- Open addressing
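The per-entry estimate can be checked by summing the four components listed for one 3-gram. This is only the slide's own arithmetic written out, assuming decimal gigabytes (1 GB = 10^9 bytes) and ignoring the String objects themselves, which is why the slide gives the total as a lower bound.

```latex
\begin{align*}
\text{bytes per 3-gram}
  &> \underbrace{8}_{\text{pointer}}
   + \underbrace{(8 + 3 \times 8)}_{\texttt{Map.Entry}}
   + \underbrace{(8 + 8)}_{\texttt{Long}}
   + \underbrace{(8 + 3 \times 8)}_{\texttt{String[]}}
   = 88 \\
\text{total}
  &> 4 \times 10^{9} \times 88 \text{ bytes}
   = 3.52 \times 10^{11} \text{ bytes}
   = 352 \text{ GB}
\end{align*}
```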

Figure: empty key/value table with slots 0 to 7, and the entries to insert:

c(cat) = 12, hash(cat) = 2
c(the) = 87, hash(the) = 2
c(and) = 76, hash(and) = 5
c(dog) = 11, hash(dog) = 7

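Figure: the same table after insertion. "the" collides with "cat" at slot 2 and lands in slot 3.

c(cat) = 12, hash(cat) = 2
c(the) = 87, hash(the) = 2
c(and) = 76, hash(and) = 5
c(dog) = 11, hash(dog) = 7
c(have) = ?, hash(have) = 2

slot  key  value
2     cat  12
3     the  87
5     and  76
7     dog  11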
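The 8-slot example can be reproduced with a small linear-probing table. This is a minimal sketch, not the lecture's implementation: the class name `LinearProbeTable`, its methods, and the hard-coded `toyHash` map (which simply replays the hash values shown in the figure) are all assumptions made for illustration. It needs Java 9 or later for `Map.of`, and it does not support deletion or resizing.

```java
import java.util.Map;
import java.util.function.ToIntFunction;

/** Toy open-addressing table with linear probing; all names here are hypothetical. */
final class LinearProbeTable {
    private final String[] keys;
    private final long[] values;
    private final ToIntFunction<String> hash;

    LinearProbeTable(int capacity, ToIntFunction<String> hash) {
        this.keys = new String[capacity];
        this.values = new long[capacity];
        this.hash = hash;
    }

    /** Walks the probe sequence; returns the slot of key, the first empty slot, or -1 if full. */
    private int probe(String key) {
        int start = Math.floorMod(hash.applyAsInt(key), keys.length);
        for (int i = 0; i < keys.length; i++) {
            int slot = (start + i) % keys.length;
            if (keys[slot] == null || keys[slot].equals(key)) {
                return slot;
            }
        }
        return -1;
    }

    void put(String key, long count) {
        int slot = probe(key);
        if (slot < 0) {
            throw new IllegalStateException("table is full");
        }
        keys[slot] = key;
        values[slot] = count;
    }

    /** Slot currently holding key, or -1 if the key is absent. */
    int slotOf(String key) {
        int slot = probe(key);
        return (slot >= 0 && keys[slot] != null) ? slot : -1;
    }

    /** Stored count, or 0 for an unseen key. */
    long get(String key) {
        int slot = slotOf(key);
        return slot >= 0 ? values[slot] : 0L;
    }

    /** Rebuilds the 8-slot example, with the figure's hash values hard-coded. */
    static LinearProbeTable slideExample() {
        Map<String, Integer> toyHash =
                Map.of("cat", 2, "the", 2, "and", 5, "dog", 7, "have", 2);
        LinearProbeTable table = new LinearProbeTable(8, toyHash::get);
        table.put("cat", 12);
        table.put("the", 87);
        table.put("and", 76);
        table.put("dog", 11);
        return table;
    }

    public static void main(String[] args) {
        LinearProbeTable table = slideExample();
        for (String w : new String[] {"cat", "the", "and", "dog", "have"}) {
            System.out.println(w + ": slot=" + table.slotOf(w) + " count=" + table.get(w));
        }
    }
}
```

Looking up "have" starts at slot 2, skips the two occupied slots, and stops at the empty slot 4, so the query reports the key as absent.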

Figure: empty key/value table grown to 16 slots (0 to 15), with the same entries to insert:

c(cat) = 12, hash(cat) = 2
c(the) = 87, hash(the) = 2
c(and) = 76, hash(and) = 5
c(dog) = 11, hash(dog) = 7

▪ Closed address hashing
  ▪ Resolve collisions with chains
  ▪ Easier to understand but bigger
▪ Open address hashing
  ▪ Resolve collisions with probe sequences
  ▪ Smaller but more complicated implementation
▪ Direct-address hashing
  ▪ No collision resolution
  ▪ Just eject previous entries
  ▪ Not suitable for core LM storage


n-gram        the  cat  laughed
word ids      7    1    15
n-gram count  233

Got 3 numbers under 2^20 to store?

7  → 0…00111 (20 bits)
1  → 0…00001 (20 bits)
15 → 0…01111 (20 bits)

Fits in a primitive 64-bit long

n-gram encoding: 15176595 = the cat laughed
n-gram count: 233
32 bytes → 8 bytes
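Packing three 20-bit word ids into one 64-bit long can be sketched with shifts and masks. This is an illustrative sketch only: the class name `NgramPacker`, the field order (first word in the highest bits), and the helper names are assumptions, so the numeric value it produces depends on that layout.

```java
/** Packs three word ids of at most 20 bits each into one long; names are hypothetical. */
final class NgramPacker {
    static final int BITS_PER_WORD = 20;
    static final long WORD_MASK = (1L << BITS_PER_WORD) - 1;

    private static void checkId(int id) {
        if (id < 0 || id > WORD_MASK) {
            throw new IllegalArgumentException("word id does not fit in 20 bits: " + id);
        }
    }

    /** Layout assumed here: w1 in bits 40..59, w2 in bits 20..39, w3 in bits 0..19. */
    static long pack(int w1, int w2, int w3) {
        checkId(w1);
        checkId(w2);
        checkId(w3);
        return ((long) w1 << (2 * BITS_PER_WORD))
                | ((long) w2 << BITS_PER_WORD)
                | (long) w3;
    }

    /** Inverse of pack: recovers the three word ids. */
    static int[] unpack(long packed) {
        return new int[] {
            (int) ((packed >>> (2 * BITS_PER_WORD)) & WORD_MASK),
            (int) ((packed >>> BITS_PER_WORD) & WORD_MASK),
            (int) (packed & WORD_MASK)
        };
    }

    public static void main(String[] args) {
        long key = pack(7, 1, 15); // word ids for "the cat laughed" in the example
        int[] ids = unpack(key);
        System.out.println(key + " -> " + ids[0] + " " + ids[1] + " " + ids[2]);
    }
}
```

The packed value uses 60 of the 64 bits, so the key is a primitive `long` with no object header or pointers.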


c(the) = 23,135,851,162 < 2^35
35 bits to represent integers between 0 and 2^35

n-gram encoding (60 bits): 15176595
count (35 bits): 233


# unique counts = 770000 < 2^20
20 bits to represent ranks of all counts

rank  count
0     1
1     2
2     51
3     233

n-gram encoding (60 bits): 15176595
rank (20 bits): 3
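The rank indirection can be sketched as a sorted array of the distinct count values, with each n-gram storing an index into that array instead of the count itself. This is a minimal sketch under assumptions: the class name `CountRanks` and its methods are hypothetical, and ranks are assigned in ascending count order to match the small table in the example; a real system might order ranks differently.

```java
import java.util.Arrays;

/** Maps counts to compact ranks and back; names and rank order are assumptions. */
final class CountRanks {
    private final long[] sortedUniqueCounts;

    CountRanks(long[] allCounts) {
        // Keep one copy of each distinct count, in ascending order.
        this.sortedUniqueCounts = Arrays.stream(allCounts).distinct().sorted().toArray();
    }

    /** Rank stored next to the n-gram encoding in place of the full count. */
    int rankOf(long count) {
        int idx = Arrays.binarySearch(sortedUniqueCounts, count);
        if (idx < 0) {
            throw new IllegalArgumentException("unknown count: " + count);
        }
        return idx;
    }

    /** Recovers the full count from a stored rank. */
    long countAt(int rank) {
        return sortedUniqueCounts[rank];
    }

    /** Bits needed to store any rank: ceil(log2(number of distinct counts)), at least 1. */
    int bitsPerRank() {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(sortedUniqueCounts.length - 1L));
    }

    public static void main(String[] args) {
        CountRanks ranks = new CountRanks(new long[] {233, 1, 51, 2, 233, 1});
        int r = ranks.rankOf(233);
        System.out.println("rank=" + r + " count=" + ranks.countAt(r)
                + " bits=" + ranks.bitsPerRank());
    }
}
```

With 770000 distinct counts, `bitsPerRank()` in this sketch evaluates to 20, which is the saving over a 35-bit raw count.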

Vocabulary

N-gram encoding scheme
  unigram: f(id) = id
  bigram: f(id_1, id_2) = ?
  trigram: f(id_1, id_2, id_3) = ?

Count DB
  unigram
  bigram
  trigram

Counts lookup

▪ we’ll expand to more than 3-grams
▪ we’ll support vocabulary with 14M words

[Many details from Pauls and Klein, 2011]

Compression

Encoding “9”: 000 1001
  000 = length in unary
  1001 = number in binary
2.9 10
[Elias, 75]
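The "length in unary, number in binary" layout is the Elias gamma code: a positive integer whose binary form has k bits is written as k-1 zeros followed by those k bits. The sketch below works on strings of '0' and '1' for readability rather than on packed bits; the class name `EliasGamma` is hypothetical, and `String.repeat` assumes Java 11 or later.

```java
/** Elias gamma coding over bit strings, for illustration only; names are hypothetical. */
final class EliasGamma {
    /** Encodes n >= 1 as (bitLength - 1) zeros followed by n in binary. */
    static String encode(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("Elias gamma codes positive integers only");
        }
        String binary = Long.toBinaryString(n);
        return "0".repeat(binary.length() - 1) + binary;
    }

    /** Decodes a single code word located at the start of bits. */
    static long decode(String bits) {
        int zeros = 0;
        while (zeros < bits.length() && bits.charAt(zeros) == '0') {
            zeros++;
        }
        int end = 2 * zeros + 1; // the unary prefix tells us how many binary digits follow
        if (end > bits.length()) {
            throw new IllegalArgumentException("truncated code word");
        }
        return Long.parseLong(bits.substring(zeros, end), 2);
    }

    public static void main(String[] args) {
        System.out.println(encode(9));         // prints 0001001
        System.out.println(decode("0001001")); // prints 9
    }
}
```

Small numbers get short codes (1 is the single bit "1"), which suits count ranks and offsets that are mostly small.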

Speed-Ups

LM can be more than 10x faster w/ direct-address caching
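A direct-address cache puts each key in exactly one slot and simply overwrites whatever was there, so a hit costs one array read and one comparison. The sketch below is an assumption-laden illustration, not the lecture's code: `DirectAddressCache`, the `slowLookup` callback standing in for the full LM query, and the bit-mixing function are all hypothetical choices.

```java
import java.util.function.LongToDoubleFunction;

/** One-slot-per-key cache that evicts on collision; all names are hypothetical. */
final class DirectAddressCache {
    private final long[] keys;
    private final double[] values;
    private final boolean[] used;
    private final LongToDoubleFunction slowLookup;
    private int misses = 0;

    DirectAddressCache(int size, LongToDoubleFunction slowLookup) {
        this.keys = new long[size];
        this.values = new double[size];
        this.used = new boolean[size];
        this.slowLookup = slowLookup;
    }

    /** Simple 64-bit mixer so nearby keys spread over the array (an arbitrary choice). */
    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        return x;
    }

    double get(long key) {
        int slot = (int) Math.floorMod(mix(key), (long) keys.length);
        if (used[slot] && keys[slot] == key) {
            return values[slot]; // hit: no probing, no chain to follow
        }
        misses++;
        double value = slowLookup.applyAsDouble(key);
        // No collision resolution: whatever was in the slot is ejected.
        keys[slot] = key;
        values[slot] = value;
        used[slot] = true;
        return value;
    }

    int misses() {
        return misses;
    }

    public static void main(String[] args) {
        DirectAddressCache cache = new DirectAddressCache(1024, k -> Math.log(k + 1.0));
        cache.get(42L);
        cache.get(42L);
        System.out.println("misses=" + cache.misses()); // prints misses=1
    }
}
```

Because the full key is stored and compared, the cache never returns a wrong value; a collision only costs a repeat of the slow lookup.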

▪ Simplest option: hash-and-hope
  ▪ Array of size K ~ N
  ▪ (optional) store hash of keys
  ▪ Store values in direct-address
  ▪ Collisions: store the max
  ▪ What kind of errors can there be?
▪ More complex options, like bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc
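The hash-and-hope option can be sketched as a bare array of counts with no keys stored at all. This is a minimal illustration under assumptions: the class name `HashAndHope` and the mixing function are hypothetical, and the optional key fingerprint from the list above is left out to keep the error behaviour easy to see.

```java
/** Approximate count store: values only, collisions keep the max; names are hypothetical. */
final class HashAndHope {
    private final long[] counts;

    HashAndHope(int k) {
        this.counts = new long[k];
    }

    private int slot(long key) {
        long x = key;
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL; // arbitrary mixing constant
        x ^= x >>> 33;
        return (int) Math.floorMod(x, (long) counts.length);
    }

    /** On collision, the larger of the old and new count survives. */
    void put(long key, long count) {
        int s = slot(key);
        counts[s] = Math.max(counts[s], count);
    }

    /** Returns whatever is in the key's slot; there is no way to detect a collision. */
    long get(long key) {
        return counts[slot(key)];
    }

    public static void main(String[] args) {
        HashAndHope table = new HashAndHope(1); // a single slot forces every key to collide
        table.put(1L, 12L);
        table.put(2L, 87L);
        System.out.println(table.get(1L)); // prints 87: the smaller count is overestimated
        System.out.println(table.get(3L)); // prints 87: an unseen key looks seen
    }
}
```

In this sketch the errors are one-sided: a stored key can be overestimated and an unseen key can receive a nonzero count, but a stored key is never underestimated. Storing a hash of the key alongside the value would catch most unseen-key lookups at the cost of extra bits per slot.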
