Natural Language Processing Spring 2017 Unit 1: Sequence Models



SLIDE 1

Natural Language Processing

Spring 2017

Liang Huang

Unit 1: Sequence Models

Lecture 4a: Probabilities and Estimations (required)

Lecture 4b: Weighted Finite-State Machines (optional)
SLIDE 2

Probabilities

  • experiment (e.g., “toss a coin 3 times”)
  • basic outcomes Ω (e.g., Ω = {HHH, HHT, HTH, ..., TTT})
  • event: some subset A of Ω (e.g., A = “heads twice”)
  • probability distribution
  • a function p from Ω to [0, 1]
  • ∑_{e ∈ Ω} p(e) = 1
  • probability of events (marginals)
  • p(A) = ∑_{e ∈ A} p(e)
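A minimal Python sketch of these definitions; the three-toss experiment matches the slide, but the fair-coin (uniform) distribution is an assumption for illustration:

```python
from itertools import product

# Sample space for "toss a coin 3 times": HHH, HHT, ..., TTT.
omega = [''.join(t) for t in product('HT', repeat=3)]

# A probability distribution: a function p from omega to [0, 1] summing to 1
# (uniform here, i.e., a fair coin).
p = {e: 1 / len(omega) for e in omega}
assert abs(sum(p.values()) - 1.0) < 1e-12

# Event A = "heads exactly twice"; p(A) is the sum of p(e) over e in A.
A = [e for e in omega if e.count('H') == 2]
print(sum(p[e] for e in A))  # 3/8 = 0.375
```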

SLIDE 3

Joint and Conditional Probs
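In symbols (the standard definitions):

  • joint probability: p(A, B) = p(A ∩ B)
  • conditional probability: p(A | B) = p(A, B) / p(B), defined when p(B) > 0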

SLIDE 4

Multiplication Rule
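In symbols (the standard multiplication and chain rules):

  • p(A, B) = p(A | B) p(B) = p(B | A) p(A)
  • chain rule: p(A1, ..., An) = p(A1) p(A2 | A1) ··· p(An | A1, ..., An−1)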

SLIDE 5

Independence

  • P(A, B) = P(A) P(B), or equivalently P(A) = P(A | B) (when P(B) > 0)
  • disjoint events are always dependent! P(A,B) = 0
  • unless one of them is “impossible”: P(A)=0
  • conditional independence: P(A, B | C) = P(A | C) P(B | C)
  • equivalently, P(A | C) = P(A | B, C)
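A quick numeric check of the definition on two fair dice; the particular events below are assumptions for illustration:

```python
from itertools import product
from fractions import Fraction

# Two fair six-sided dice: 36 equally likely basic outcomes.
omega = list(product(range(1, 7), repeat=2))
w = Fraction(1, 36)

def prob(event):
    return sum(w for e in omega if event(e))

A = lambda e: e[0] == 6          # first die shows 6
B = lambda e: e[0] + e[1] == 7   # the dice sum to 7

# P(A, B) == P(A) P(B): 1/36 == 1/6 * 1/6, so A and B are independent.
print(prob(lambda e: A(e) and B(e)) == prob(A) * prob(B))  # True
```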

SLIDE 6

Marginalization

  • compute marginal probs from joint/conditional probs
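In symbols, for a partition B1, ..., Bn of Ω:

  • p(A) = ∑_i p(A, Bi) = ∑_i p(A | Bi) p(Bi)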

SLIDE 7

Bayes’ Rules


alternative: Bayes’ rule by partition
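In symbols (standard forms):

  • Bayes’ rule: p(A | B) = p(B | A) p(A) / p(B)
  • by partition A1, ..., An of Ω: p(Ai | B) = p(B | Ai) p(Ai) / ∑_j p(B | Aj) p(Aj)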

SLIDE 8

Most Likely Event
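In symbols: the most likely outcome is e* = argmax_{e ∈ Ω} p(e).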

SLIDE 9

Most Likely Given ...
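In symbols: argmax_a p(a | B) = argmax_a p(a, B) / p(B) = argmax_a p(a, B), since p(B) does not depend on a.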

SLIDE 10

Estimating Probabilities

  • how to get probabilities for basic outcomes?
  • do experiments
  • count stuff
  • e.g. how often do people start a sentence with “the”?
  • P(A) = (# of sentences like “the ...” in the sample) / (# of all sentences in the sample)
  • P(A | B) = (count of A, B) / (count of B)
  • we will show that this is Maximum Likelihood Estimation
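A count-and-divide sketch in Python; the toy corpus is made up for illustration:

```python
# Relative-frequency ("count & divide") estimation on a made-up toy corpus.
corpus = [
    "the cat sat",
    "the dog barked",
    "a bird sang",
    "the end",
]

# P(A): fraction of sentences starting with "the".
print(sum(s.split()[0] == "the" for s in corpus) / len(corpus))  # 3/4 = 0.75

# P(A | B) = count(A, B) / count(B), e.g. P(next word = "cat" | word = "the").
words = [w for s in corpus for w in s.split()]
bigrams = [b for s in corpus for b in zip(s.split(), s.split()[1:])]
print(bigrams.count(("the", "cat")) / words.count("the"))  # 1/3
```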

SLIDE 11

Model

  • what is a MODEL?
  • a general theory of how the data is generated,
  • along with a set of parameter estimates
  • e.g., given these statistics
  • we can “guess” it’s generated by a 12-sided die
  • along with 11 free parameters p(1), p(2), ..., p(11)
  • alternatively, by two tosses of a single 6-sided die
  • along with 5 free parameters p(1), p(2), ..., p(5)
  • which is better given the data? which better explains the data?

argmax_m p(m | d) = argmax_m p(m) p(d | m)
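A sketch of comparing the two proposed models by likelihood; the observed counts below are hypothetical, not the slide’s actual statistics:

```python
import math
from itertools import product

# Hypothetical counts of observed values 2..12.
counts = {2: 3, 3: 6, 4: 9, 5: 11, 6: 14, 7: 17, 8: 13, 9: 11, 10: 8, 11: 5, 12: 3}

# Model A: one uniform 12-sided die (each observed value has p = 1/12).
ll_A = sum(n * math.log(1 / 12) for n in counts.values())

# Model B: the sum of two fair 6-sided dice.
p_sum = {}
for i, j in product(range(1, 7), repeat=2):
    p_sum[i + j] = p_sum.get(i + j, 0) + 1 / 36

ll_B = sum(n * math.log(p_sum[v]) for v, n in counts.items())
print(ll_A < ll_B)  # True: model B better explains this triangle-shaped data
```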

SLIDE 12

Maximum Likelihood Estimation

  • always maximize posterior: what’s the best m given d?
  • when do we use maximum likelihood estimation?
  • with uniform prior, same as likelihood (explains data)
  • argmax_m p(m | d) = argmax_m p(m) p(d | m) (Bayes; p(d) is constant in m)
  • = argmax_m p(d | m) when p(m) is uniform

SLIDE 13

How do we rigorously derive this?

  • assuming any p_m(H) = θ is possible, what’s the best θ?
  • e.g.: data is still H, H, T, H.
  • argmax_θ p(d | m; θ) = argmax_θ θ³ (1 − θ)
  • take the derivative, set it to zero: θ = 3/4.
  • works in the general case: θ = n / (n+m) (n heads, m tails)
  • this is why MLE is just count & divide in the discrete case
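Filling in the calculus step: d/dθ [θ³ (1 − θ)] = 3θ² − 4θ³ = θ² (3 − 4θ), which vanishes on (0, 1) only at θ = 3/4. A quick symbolic check (assuming sympy is available):

```python
import sympy

theta = sympy.symbols('theta', positive=True)
likelihood = theta**3 * (1 - theta)   # p(d | m; theta) for the data H, H, T, H
print(sympy.solve(sympy.diff(likelihood, theta), theta))  # [3/4]
```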

SLIDE 14

What if we have some prior?

  • what if we have an arbitrary prior
  • like p(θ) ∝ θ (1 − θ)
  • maximum a posteriori estimation (MAP)
  • MAP approaches MLE with infinite data
  • MAP = MLE + smoothing
  • this prior is just “extra two tosses, unbiased”
  • you can inject other priors, like “extra 4 tosses, 3 Hs”
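Concretely: with prior p(θ) ∝ θ (1 − θ) and data with n heads and m tails, the posterior is ∝ θ^(n+1) (1 − θ)^(m+1), so the MAP estimate is θ = (n + 1) / (n + m + 2): count-and-divide with one extra head and one extra tail, i.e., add-one smoothing.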

SLIDE 15

Probabilistic Finite-State Machines

  • adding probabilities into finite-state acceptors (FSAs)
  • FSA: a set of strings; WFSA: a distribution over strings

SLIDE 16

WFSA

  • normalization: transitions leaving each state sum up to 1
  • does it define a distribution over strings?
  • or a distribution over paths?
  • => a distribution over paths, which also induces a distribution over strings
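A sketch of the path/string distinction; the tiny machine below is made up, with two distinct paths accepting the same string:

```python
# A toy WFSA: arcs[state] = [(symbol, next_state, prob)]; the probabilities
# leaving each state sum to 1, so the machine defines a distribution over paths.
arcs = {
    0: [('a', 1, 0.6), ('a', 2, 0.4)],
    1: [('b', 3, 1.0)],
    2: [('b', 3, 1.0)],
}
final = {3}

def string_prob(x, state=0, i=0):
    """p(x): sum of path probabilities over all accepting paths labeled x."""
    if i == len(x):
        return 1.0 if state in final else 0.0
    return sum(w * string_prob(x, nxt, i + 1)
               for sym, nxt, w in arcs.get(state, []) if sym == x[i])

print(string_prob('ab'))  # 1.0: two paths (0.6 and 0.4) induce one string
```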

SLIDE 17

WFSTs

  • FST: a relation over strings (a set of string pairs)
  • WFST: a probabilistic relation over strings (a set of triples <s, t, p>: string pair <s, t> with probability p)
  • what does p represent?

SLIDE 18

Edit Distance as WFST

  • this is simplified edit distance
  • real edit distance as an example of WFST, but not PFST


costs: replacement 1, insertion 2, deletion 2

[figure: WFST for real edit distance; example arcs a:a/0, a:b/1, a:ε/2, b:b/0, ε:a/2, b:a/1, ...]
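A dynamic-programming sketch of the same weighted edit distance (replacement 1, insertion 2, deletion 2); this computes the cost directly rather than via the transducer:

```python
def edit_distance(s, t, sub=1, ins=2, dele=2):
    """Weighted edit distance with the costs shown above."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + dele,                              # delete s[i-1]
                          d[i][j - 1] + ins,                               # insert t[j-1]
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]) * sub)  # match / replace
    return d[m][n]

print(edit_distance("clara", "caca"))  # 3: the pair composed on a later slide
```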

SLIDE 19

Normalization

  • if the transitions leaving each state for each input symbol sum up to 1, then...

  • WFST defines conditional prob p(y|x) for x => y
  • what if we want to define a joint prob p(x, y) for x=>y?
  • what if we want p(x | y)?

SLIDE 20

Questions of WFSTs

  • given x, y, what is p(y|x) ?
  • for a given x, what’s the y that maximizes p(y|x) ?
  • for a given y, what’s the x that maximizes p(y|x) ?
  • for a given x, supply all output y w/ respective p(y|x)
  • for a given y, supply all input x w/ respective p(y|x)

SLIDE 21

Answer: Composition

  • p(z | x) = p(y | x) p(z | y) ???
  • = ∑_y p(y | x) p(z | y): have to sum over y
  • given y, z and x are independent in this cascade. Why?
  • how to build a composed WFST C out of WFSTs A, B?
  • again, like intersection
  • sum up the products
  • the (+, *) semiring
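A sketch of (epsilon-free) composition in this semiring; the arc-list representation and the toy machines are assumptions for illustration:

```python
from collections import defaultdict

def compose(A, B):
    """Compose WFSTs given as dicts: state -> [(in_sym, out_sym, next, w)].
    Arcs pair up on the shared middle symbol; weights multiply; parallel
    paths for the same string pair are later summed: the (+, *) semiring."""
    C = defaultdict(list)
    for qa, arcs_a in A.items():
        for qb, arcs_b in B.items():
            for x, y1, ra, w1 in arcs_a:
                for y2, z, rb, w2 in arcs_b:
                    if y1 == y2:           # A's output must match B's input
                        C[(qa, qb)].append((x, z, (ra, rb), w1 * w2))
    return dict(C)

# Toy cascade: A maps a->b or a->c; B maps both b and c to d.
A = {0: [('a', 'b', 1, 0.7), ('a', 'c', 1, 0.3)]}
B = {0: [('b', 'd', 1, 1.0), ('c', 'd', 1, 1.0)]}
# Two parallel a:d arcs whose weights sum to p(d | a) = 0.7 + 0.3 = 1.0.
print(compose(A, B)[(0, 0)])
```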

SLIDE 22

Example

SLIDE 23

Example


they use (min, +), we use (+, *)

from M. Mohri and J. Eisner

SLIDE 24

Example


they use (min, +), we use (+, *)

SLIDE 25

Example


they use (min, +), we use (+, *)

SLIDE 26

Given x, supply all output y


no longer normalized!

SLIDE 27

Given x, y, what’s p(y|x)

SLIDE 28

Given x, what’s max p(y|x)

SLIDE 29

Part-of-Speech Tagging Again

SLIDE 30

Part-of-Speech Tagging Again

SLIDE 31

Adding a Tag Bigram Model (again)


[figure: FST C, a POS bigram LM, added to the tagging cascade; labels: p(t...t | w...w), p(t...t), p(???), p(w...w); annotation: “wait, is that right (mathematically)?”]

SLIDE 32

Noisy-Channel Model

SLIDE 33

Noisy-Channel Model


[figure: noisy-channel cascade; source model p(t...t)]
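The decomposition being used (the standard noisy-channel form): argmax_{t...t} p(t...t | w...w) = argmax_{t...t} p(t...t) · p(w...w | t...t), i.e., a source model over tag sequences composed with a channel model of words given tags.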

SLIDE 34

Applications of Noisy-Channel

SLIDE 35

Example: Edit Distance


a:ε" ε" ε:a b:ε" ε" ε:b a:b b:a a:a b:b O(k) deletion arcs O(k) insertion arcs O(k) identity arcs

from J. Eisner

SLIDE 36

Example: Edit Distance


clara .o. [edit-distance transducer] .o. caca =

[figure: the composed edit-distance lattice for “clara” and “caca”]

Best path (by Dijkstra’s algorithm)

SLIDE 37

Max / Sum Probs

  • in a WFSA, which string x has the greatest p(x)?
  • graph search (shortest path) problem
  • Dijkstra; or Viterbi if the FSA is acyclic
  • does it work for an NFA?
  • best path is much easier than best string
  • you can determinize it (with exponential cost!)
  • popular work-around: n-best list crunching


[photos: Andrew Viterbi (b. 1935): Viterbi algorithm (1967), CDMA, Qualcomm; Edsger Dijkstra (1930-2002): “GOTO considered harmful”]
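A Viterbi best-path sketch on an acyclic WFSA; the machine is made up, and its states are assumed to be listed in topological order:

```python
# arcs[state] = [(symbol, next_state, prob)]
arcs = {
    0: [('a', 1, 0.6), ('b', 1, 0.4)],
    1: [('c', 2, 0.9), ('d', 2, 0.1)],
}
topo, final = [0, 1, 2], 2

best = {0: (1.0, [])}   # state -> (best path probability, its label sequence)
for q in topo:
    if q not in best:
        continue
    p, labels = best[q]
    for sym, r, w in arcs.get(q, []):
        if r not in best or p * w > best[r][0]:
            best[r] = (p * w, labels + [sym])

print(best[final])  # (0.54, ['a', 'c'])
```

Note this maximizes over paths; the best string would require summing the paths that share a label sequence, which is why it is the harder problem.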

SLIDE 38

Dijkstra 1959 vs. Viterbi 1967


that’s min. spanning tree!

Jarnik (1930) - Prim (1957) - Dijkstra (1959)


SLIDE 39

Dijkstra 1959 vs. Viterbi 1967


that’s shortest-path

Moore (1957) - Dijkstra (1959)

SLIDE 40

Dijkstra 1959 vs. Viterbi 1967


special case of dynamic programming (Bellman, 1957)

SLIDE 41

Sum Probs

  • what is p(x) for some particular x?
  • for a DFA, just follow x
  • for an NFA,
  • get a subgraph (by composition), then sum ??
  • acyclic => Viterbi
  • cyclic => compute strongly connected components
  • SCC-DAG cluster graph (cyclic locally, acyclic globally)
  • do the infinite sum (matrix inversion) locally, Viterbi globally

  • refer to extra readings on course website
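A sketch of the “infinite sum by matrix inversion” step on a made-up two-state cyclic component (numpy is an assumption about available tooling):

```python
import numpy as np

# A[i][j]: total one-step weight from state i to state j; state 0 has a
# self-loop, so there are infinitely many paths from 0 to 1.
A = np.array([[0.5, 0.3],
              [0.0, 0.0]])

# Sum over paths of every length: I + A + A^2 + ... = (I - A)^{-1},
# provided the geometric series converges.
closure = np.linalg.inv(np.eye(2) - A)
print(closure[0, 1])  # 0.3 / (1 - 0.5) = 0.6
```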
