SLIDE 1

Introduction to Natural Language Processing

A course taught as B4M36NLP at Open Informatics by members of the Institute of Formal and Applied Linguistics.
Today: Week 2, lecture
Today’s topic: Language Modelling & The Noisy Channel Model
Today’s teacher: Jan Hajič

E-mail: hajic@ufal.mff.cuni.cz WWW: http://ufal.mff.cuni.cz/jan-hajic

SLIDE 2

The Noisy Channel

  • Prototypical case:

Input → The channel (adds noise) → Output (noisy)

0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...

  • Model: probability of error (noise):
  • Example: p(0|1) = .3 p(1|1) = .7 p(1|0) = .4 p(0|0) = .6
  • The Task:

known: the noisy output; want to know: the input (decoding)

SLIDE 3

Noisy Channel Applications

  • OCR

– straightforward: text → print (adds noise), scan → image

  • Handwriting recognition

– text → neurons, muscles (“noise”), scan/digitize → image

  • Speech recognition (dictation, commands, etc.)

– text → conversion to acoustic signal (“noise”) → acoustic waves

  • Machine Translation

– text in target language → translation (“noise”) → source language

  • Also: Part of Speech Tagging

– sequence of tags → selection of word forms → text

SLIDE 4

Noisy Channel: The Golden Rule of ...

OCR, ASR, HR, MT, ...

  • Recall:

p(A|B) = p(B|A) p(A) / p(B)   (Bayes formula)
A_best = argmax_A p(B|A) p(A)   (The Golden Rule)

  • p(B|A): the acoustic/image/translation/lexical model

– application-specific name – will explore later

  • p(A): the language model
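A minimal Python sketch of the Golden Rule used as a decoding procedure; the candidate set, channel model, and language model below are toy stand-ins for illustration, not anything defined on the slides:

    # Noisy-channel decoding sketch: A_best = argmax_A p(B|A) * p(A).
    def decode(observed, candidates, channel_model, language_model):
        """Return the candidate input A that maximizes p(B|A) * p(A)."""
        return max(candidates,
                   key=lambda a: channel_model(observed, a) * language_model(a))

    # Toy stand-ins for p(B|A) and p(A); a real system plugs in its own models.
    language_model = {"the": 0.6, "then": 0.3, "than": 0.1}.get      # p(A)
    channel_model = lambda b, a: 0.9 if b == a else 0.05             # p(B|A)

    print(decode("thn", ["the", "then", "than"], channel_model, language_model))
    # -> "the" (all candidates are equally "noisy", so the language model decides)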

SLIDE 5

The Perfect Language Model

  • Sequence of word forms [forget about tagging for the moment]
  • Notation: A ~ W = (w1,w2,w3,...,wd)
  • The big (modeling) question:

p(W) = ?

  • Well, we know (Bayes/chain rule):

p(W) = p(w1,w2,w3,...,wd) = p(w1) p(w2|w1) p(w3|w1,w2) ... p(wd|w1,w2,...,wd-1)

  • Not practical (even for short W: too many parameters)

SLIDE 6

Markov Chain

  • Unlimited memory (cf. previous foil):

– for wi, we know all its predecessors w1,w2,w3,...,wi-1

  • Limited memory:

– we disregard “too old” predecessors
– remember only k previous words: wi-k,wi-k+1,...,wi-1
– called “kth order Markov approximation”

  • + stationary character (no change over time):

p(W) ≅ ∏_{i=1..d} p(wi|wi-k,wi-k+1,...,wi-1), d = |W|

SLIDE 7

n-gram Language Models

  • (n-1)th order Markov approximation → n-gram LM:

p(W) =df ∏_{i=1..d} p(wi|wi-n+1,wi-n+2,...,wi-1)

  • In particular (assume vocabulary |V| = 60k):
  • 0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter
  • 1-gram LM: unigram model, p(wi), 6×10^4 parameters
  • 2-gram LM: bigram model, p(wi|wi-1), 3.6×10^9 parameters
  • 3-gram LM: trigram model, p(wi|wi-2,wi-1), 2.16×10^14 parameters

(wi: the prediction; wi-n+1,...,wi-1: the history)
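A one-line sanity check of these parameter counts (plain arithmetic, nothing model-specific):

    # Rough parameter counts for n-gram models over a 60k-word vocabulary.
    V = 60_000
    print(f"unigram: {V:.2e}")       # 6.00e+04  values p(w)
    print(f"bigram:  {V**2:.2e}")    # 3.60e+09  values p(w_i | w_{i-1})
    print(f"trigram: {V**3:.2e}")    # 2.16e+14  values p(w_i | w_{i-2}, w_{i-1})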

SLIDE 8

Maximum Likelihood Estimate

  • MLE: Relative Frequency...

– ...best predicts the data at hand (the “training data”)

  • Trigrams from Training Data T:

– count sequences of three words in T: c3(wi-2,wi-1,wi)

[NB: notation: just saying that the three words follow each other]

– count sequences of two words in T: c2(wi-1,wi):

  • either use c2(y,z) = ∑_w c3(y,z,w)
  • or count differently at the beginning (& end) of data!

p(wi|wi-2,wi-1) =est c3(wi-2,wi-1,wi) / c2(wi-2,wi-1)
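A minimal Python sketch of this estimate, assuming whitespace-tokenized training text and two <s> padding tokens at the start of the data; c2 is obtained here by summing c3 over the last word, as in the first option above:

    # Trigram MLE: p(w_i | w_{i-2}, w_{i-1}) = c3(w_{i-2}, w_{i-1}, w_i) / c2(w_{i-2}, w_{i-1})
    from collections import Counter

    def trigram_mle(tokens):
        """Relative-frequency trigram estimates from a token list."""
        padded = ["<s>", "<s>"] + tokens          # handle the beginning of data
        c3 = Counter(zip(padded, padded[1:], padded[2:]))
        c2 = Counter()                            # c2(y,z) = sum_w c3(y,z,w)
        for (u, v, w), n in c3.items():
            c2[(u, v)] += n
        return {(u, v, w): n / c2[(u, v)] for (u, v, w), n in c3.items()}

    p3 = trigram_mle("He can buy the can of soda .".split())
    print(p3[("He", "can", "buy")])               # 1.0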

SLIDE 9

LM: an Example

  • Training data:

<s> <s> He can buy the can of soda.
– Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125, p1(can) = .25
– Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5, p2(the|buy) = 1, ...
– Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
– Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 → Great?!
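A short sketch that recomputes these per-word entropies, assuming the period is its own token and each model is padded with (order − 1) sentence-start symbols so the numbers come out as on the slide:

    # Per-word entropies of the unigram/bigram/trigram MLE models on the toy data.
    from collections import Counter
    from math import log2

    words = "He can buy the can of soda .".split()

    def per_word_entropy(order):
        padded = ["<s>"] * (order - 1) + words
        grams = list(zip(*[padded[i:] for i in range(order)]))
        counts = Counter(grams)
        hist = Counter(g[:-1] for g in grams)   # history counts, summed over last word
        p = {g: counts[g] / hist[g[:-1]] for g in counts}
        # average -log2 p(w_i | history) over the 8 words of the training data
        return -sum(log2(p[g]) for g in grams) / len(words)

    for n in (1, 2, 3):
        print(f"H(p{n}) = {per_word_entropy(n):.2f}")   # 2.75, 0.25, 0.00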

SLIDE 10

LM: an Example (The Problem)

  • Cross-entropy:
  • S = <s> <s> It was the greatest buy of all.
  • Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:

– all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0.
– all bigram probabilities are 0.
– all trigram probabilities are 0.

  • We want: to make all (theoretically possible*) probabilities

non-zero.

*in fact, all: remember our graph from day 1?

SLIDE 11

LM Smoothing (And the EM Algorithm)

SLIDE 12

Why do we need Nonzero Probs?

  • To avoid infinite Cross Entropy:

– happens when an event is found in the test data which has not been seen in the training data
– H(p) = ∞ prevents comparing data with > 0 “errors”

  • To make the system more robust

– low count estimates:

  • they typically happen for “detailed” but relatively rare appearances

– high count estimates: reliable but less “detailed”

SLIDE 13

Eliminating the Zero Probabilities: Smoothing

  • Get new p’(w) (same Ω): almost p(w) but no zeros
  • Discount w for (some) p(w) > 0: new p’(w) < p(w)

wdiscounted (p(w) - p’(w)) = D

  • Distribute D to all w; p(w) = 0: new p’(w) > p(w)

– possibly also to other w with low p(w)

  • For some w (possibly): p’(w) = p(w)
  • Make sure wp’(w) = 1
  • There are many ways of smoothing

SLIDE 14

Smoothing by Adding 1

  • Simplest but not really usable:

– Predicting words w from a vocabulary V, training data T: p’(w|h) = (c(h,w) + 1) / (c(h) + |V|)

  • for non-conditional distributions: p’(w) = (c(w) + 1) / (|T| + |V|)

– Problem if |V| > c(h) (as is often the case; even >> c(h)!)

  • Example: Training data: <s> what is it what is small ? |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25^2 × .125^2 ≅ .001

p(it is flying.) = .125 × .25 × 0^2 = 0

  • p’(it) = .1, p’(what) = .15, p’(.) = .05
p’(what is it?) = .15^2 × .1^2 ≅ .0002

p’(it is flying.) = .1 × .15 × .05^2 ≅ .00004

SLIDE 15

Adding less than 1

  • Equally simple:

– Predicting words w from a vocabulary V, training data T: p’(w|h) = (c(h,w) + λ) / (c(h) + λ|V|), λ < 1

  • for non-conditional distributions: p’(w) = (c(w) + λ) / (|T| + λ|V|)
  • Example: Training data: <s> what is it what is small ? |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25^2 × .125^2 ≅ .001

p(it is flying.) = .125 × .25 × 0^2 = 0

  • Use λ = .1:
  • p’(it) ≅ .12, p’(what) ≅ .23, p’(.) ≅ .01
p’(what is it?) = .23^2 × .12^2 ≅ .0007

p’(it is flying.) = .12 × .23 × .01^2 ≅ .000003
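A minimal sketch covering this slide and the previous one: the same additive formula with λ = 1 (add-one) and λ = .1, reproducing the probabilities of the two test sentences.

    # Add-lambda smoothing of the unigram model: p'(w) = (c(w) + lam) / (|T| + lam*|V|).
    from collections import Counter
    from math import prod

    train = "<s> what is it what is small ?".split()                  # |T| = 8
    vocab = ["what", "is", "it", "small", "?", "<s>",
             "flying", "birds", "are", "a", "bird", "."]               # |V| = 12
    counts = Counter(train)

    def p_smoothed(w, lam):
        return (counts[w] + lam) / (len(train) + lam * len(vocab))

    def p_sentence(sentence, lam):
        return prod(p_smoothed(w, lam) for w in sentence.split())

    for lam in (1.0, 0.1):      # lam = 1: add-one; lam = 0.1: adding less than 1
        print(f"lam={lam}: p'(what is it ?) = {p_sentence('what is it ?', lam):.6f}, "
              f"p'(it is flying .) = {p_sentence('it is flying .', lam):.7f}")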

SLIDE 16

Smoothing by Combination: Linear Interpolation

  • Combine what?
  • distributions of various level of detail vs. reliability
  • n-gram models:
  • use (n-1)gram, (n-2)gram, ..., uniform

(trade-off: reliability vs. detail)

  • Simplest possible combination:

– sum of probabilities, normalize:

  • p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
  • p’(0|0) = .6, p’(1|0) = .4, p’(0|1) = .7, p’(1|1) = .3

SLIDE 17

Typical n-gram LM Smoothing

  • Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):

p’(wi|wi-2,wi-1) = λ3 p3(wi|wi-2,wi-1) + λ2 p2(wi|wi-1) + λ1 p1(wi) + λ0/|V|

  • Normalize:

λi > 0, ∑_{i=0..n} λi = 1 is sufficient (λ0 = 1 - ∑_{i=1..n} λi) (n=3)

  • Estimation using MLE:

– fix the p3, p2, p1 and |V| parameters as estimated from the training data
– then find such {λi} which minimizes the cross entropy (maximizes the probability of data): -(1/|D|) ∑_{i=1..|D|} log2(p’(wi|hi))
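A small sketch of the interpolated model and the cross-entropy objective; the component models p3, p2, p1 are assumed here to be plain dicts of MLE estimates keyed by their n-grams:

    # Linearly interpolated trigram:
    # p'(w | u, v) = l3*p3(w|u,v) + l2*p2(w|v) + l1*p1(w) + l0/|V|
    from math import log2

    def make_interpolated(p3, p2, p1, vocab_size, lambdas):
        """Build p'(w | u, v) from fixed component models and weights (l0..l3)."""
        l0, l1, l2, l3 = lambdas                 # non-negative, summing to 1
        def p_smooth(w, u, v):
            return (l3 * p3.get((u, v, w), 0.0)
                    + l2 * p2.get((v, w), 0.0)
                    + l1 * p1.get(w, 0.0)
                    + l0 / vocab_size)
        return p_smooth

    def cross_entropy(p_smooth, trigrams):
        """-(1/|D|) * sum_i log2 p'(w_i | h_i) over (u, v, w) triples of data D."""
        return -sum(log2(p_smooth(w, u, v)) for u, v, w in trigrams) / len(trigrams)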

SLIDE 18

Held-out (Cross-validation) Data

  • What data to use?

– try the training data T: but we will always get λ3 = 1

  • why? (let piT be an i-gram distribution estimated using relative frequencies from T)
  • minimizing HT(p’) over a vector λ, p’ = λ3 p3T + λ2 p2T + λ1 p1T + λ0/|V|

– remember: HT(p’) = H(p3T)+D(p3T||p’);

  • (p3T fixed → H(p3T) fixed, best)

– which p’ minimizes HT(p’)? ... a p’ for which D(p3T||p’) = 0
– ...and that’s p3T (because D(p||p) = 0, as we know).
– ...and certainly p’ = p3T if λ3 = 1 (maybe in some other cases, too).
– (p’ = 1×p3T + 0×p2T + 0×p1T + 0/|V|)

– thus: do not use the training data for estimation of λ!

  • must hold out part of the training data (heldout data, H):
  • ...call the remaining data the (true/raw) training data, T
  • the test data S (e.g., for comparison purposes): still different data!

SLIDE 19

The Formulas

  • Repeat: minimizing -(1/|H|) ∑_{i=1..|H|} log2(p’(wi|hi)) over λ

p’(wi| hi) = p’(wi| wi-2 ,wi-1) =  

 p3(wi| wi-2 ,wi-1) +

 

 p2(wi| wi-1) +    p1(wi) + 0/|V|

  • “Expected Counts (of lambdas)”, j = 0..3 (the E-step):

c(λj) = ∑_{i=1..|H|} λj pj(wi|hi) / p’(wi|hi)

  • “Next λ”, j = 0..3 (the M-step):

λj,next = c(λj) / ∑_{k=0..3} c(λk)


SLIDE 20

The (Smoothing) EM Algorithm

  • 1. Start with some λ, such that λj > 0 for all j ∈ 0..3.
  • 2. Compute the “Expected Counts” for each λj.
  • 3. Compute the new set of λj, using the “Next λ” formula.
  • 4. Start over at step 2, unless a termination condition is met.
  • Termination condition: convergence of λ.

– Simply set an ε, and finish if |λj - λj,next| < ε for each j (step 3).

  • Guaranteed to converge:

follows from Jensen’s inequality, plus a technical proof.
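A compact sketch of the whole loop, combining the formulas from the previous slide; component_probs is an assumed list of callables p_j(w|h) ordered to match λ0..λ3 (uniform first), and heldout is a list of (word, history) pairs:

    # Smoothing EM: E-step = expected counts of lambdas on held-out data,
    # M-step = renormalize them into the next lambda vector.
    def em_lambdas(component_probs, heldout, lambdas=(0.25, 0.25, 0.25, 0.25),
                   eps=1e-4):
        """heldout: list of (w, h) pairs; component_probs[j](w, h) = p_j(w | h)."""
        lambdas = list(lambdas)
        while True:
            counts = [0.0] * len(lambdas)
            for w, h in heldout:
                parts = [l * p(w, h) for l, p in zip(lambdas, component_probs)]
                p_smooth = sum(parts)                      # p'(w | h)
                for j, part in enumerate(parts):
                    counts[j] += part / p_smooth           # E-step: c(lambda_j)
            new = [c / sum(counts) for c in counts]        # M-step: next lambdas
            if max(abs(a - b) for a, b in zip(lambdas, new)) < eps:
                return new
            lambdas = new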

SLIDE 21

Remark on Linear Interpolation Smoothing

  • “Bucketed” smoothing:

– use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)

  • e.g. for h = (micrograms, per) we will have

λ(h) = (.999, .0009, .00009, .00001) (because “cubic” is the only word to follow...)

– actually: not a separate set for each history, but rather a set for “similar” histories (“buckets”): λ(b(h)), where b: V² → N (in the case of trigrams)
– b classifies histories according to their reliability (~ frequency)

SLIDE 22

Bucketed Smoothing: The Algorithm

  • First, determine the bucketing function b (use heldout!):

– decide in advance that you want e.g. 1000 buckets
– compute the total frequency of histories that should go into 1 bucket (fmax(b))
– gradually fill your buckets from the most frequent bigrams so that the sum of frequencies does not exceed fmax(b) (you might end up with slightly more than 1000 buckets)

  • Divide your heldout data according to buckets
  • Apply the previous algorithm to each bucket and its data
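A rough sketch of the bucket-filling step, assuming history_counts maps each bigram history to its frequency; the slide does not fix tie-breaking or the exact bucket count, so this is only one possible reading:

    # Build b(h): assign bigram histories to buckets so that each bucket's total
    # history frequency stays (roughly) under f_max, starting from the most frequent.
    def build_buckets(history_counts, n_buckets=1000):
        f_max = sum(history_counts.values()) / n_buckets
        buckets, current, mass = {}, 0, 0.0
        for h, f in sorted(history_counts.items(), key=lambda kv: -kv[1]):
            if mass > 0 and mass + f > f_max:
                current, mass = current + 1, 0.0    # start a new bucket
            buckets[h] = current
            mass += f
        return buckets                              # may end up with > n_buckets buckets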

SLIDE 23

Simple Example

  • Raw distribution (unigram only; smooth with uniform):

p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z

  • Heldout data: baby; use one set of λ’s (λ1: unigram, λ0: uniform)
  • Start with λ1 = .5:

p’(b) = .5 × .5 + .5/26 = .27
p’(a) = .5 × .25 + .5/26 = .14
p’(y) = .5 × 0 + .5/26 = .02
c(λ1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
c(λ0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28
Normalize: λ1,next = .68, λ0,next = .32.
Repeat from step 2 (recompute p’ first for efficient computation, then c(λi), ...).
Finish when the new lambdas are almost equal to the old ones (say, difference < 0.01).
