

slide-1
SLIDE 1

Introduction to Latent Sequences & Expectation Maximization

CMSC 473/673 UMBC

slide-2
SLIDE 2

Recap from last time (and the first unit)…

slide-3
SLIDE 3

N-gram Language Models

predict the next word given some context…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) ∝ π‘‘π‘π‘£π‘œπ‘’(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1, π‘₯𝑗)

w_{i-3} w_{i-2} w_{i-1} w_i: compute beliefs about what is likely…
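A minimal sketch (not from the slides) of this count-based estimate on a tiny made-up corpus; the function name p_next and the toy sentences are illustrative only:

```python
from collections import Counter

# Toy corpus; in practice the counts come from a much larger training set.
tokens = "colorless green ideas sleep furiously . green ideas sleep soundly .".split()

counts_4 = Counter(zip(tokens, tokens[1:], tokens[2:], tokens[3:]))  # count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)
counts_3 = Counter(zip(tokens, tokens[1:], tokens[2:]))              # count(w_{i-3}, w_{i-2}, w_{i-1})

def p_next(word, context):
    """Relative-frequency estimate of p(w_i | w_{i-3}, w_{i-2}, w_{i-1}); no smoothing."""
    denom = counts_3[context]
    return counts_4[context + (word,)] / denom if denom else 0.0

print(p_next("furiously", ("green", "ideas", "sleep")))  # 0.5 in this toy corpus
```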

slide-4
SLIDE 4

Maxent Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i: compute beliefs about what is likely…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) = softmax(πœ„ β‹… 𝑔(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1, π‘₯𝑗))

slide-5
SLIDE 5

Neural Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i: compute beliefs about what is likely…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) = softmax(πœ„π‘₯𝑗 β‹… π’ˆ(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1))

create/use β€œdistributed representations”: embed each context word as e_{i-3}, e_{i-2}, e_{i-1}; combine these representations with a function C = f(Β·) (e.g., a matrix-vector product); score each candidate word via its vector e_w and output parameters ΞΈ_{w_i}

slide-6
SLIDE 6

(Some) Properties of Embeddings

Capture β€œlike” (similar) words
Capture relationships

Mikolov et al. (2013)

vector(β€˜king’) – vector(β€˜man’) + vector(β€˜woman’) β‰ˆ vector(β€˜queen’) vector(β€˜Paris’) - vector(β€˜France’) + vector(β€˜Italy’) β‰ˆ vector(β€˜Rome’)

slide-7
SLIDE 7

Learn more in:

  • Your project
  • Paper (673)
  • Other classes (478/678)

Four kinds of vector models

Sparse vector representations

1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:

2. Singular value decomposition / Latent Semantic Analysis
3. Neural-network-inspired models (skip-grams, CBOW)
4. Brown clusters

slide-8
SLIDE 8

Shared Intuition

Model the meaning of a word by β€œembedding” it in a vector space. The meaning of a word is a vector of numbers. Contrast: in many computational linguistic applications, word meaning is represented by a vocabulary index (β€œword number 545”) or the string itself.

slide-9
SLIDE 9

Intrinsic Evaluation: Cosine Similarity

Divide the dot product by the lengths of the two vectors: this is the cosine of the angle between them. Are the vectors parallel?

+1: vectors point in the same direction
 0: vectors are orthogonal
βˆ’1: vectors point in opposite directions
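A small sketch of the cosine computation described above, in plain Python:

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths (Euclidean norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))    # +1.0: same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))    #  0.0: orthogonal
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # -1.0: opposite directions
```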

slide-10
SLIDE 10

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

slide-11
SLIDE 11

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

slide-12
SLIDE 12

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

Tasks and classification (use Bayes rule!): posterior decoding vs. the noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling

slide-13
SLIDE 13

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

Tasks and classification (use Bayes rule!): posterior decoding vs. the noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling

Maximum entropy models: meanings of feature functions and weights; use for language modeling or conditional classification (β€œposterior in one go”); how to learn the weights: gradient descent

slide-14
SLIDE 14

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

Tasks and classification (use Bayes rule!): posterior decoding vs. the noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling

Maximum entropy models: meanings of feature functions and weights; use for language modeling or conditional classification (β€œposterior in one go”); how to learn the weights: gradient descent

Distributed representations & neural language models: what embeddings are and what their motivation is; a common way to evaluate: cosine similarity

slide-15
SLIDE 15

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

Tasks and classification (use Bayes rule!): posterior decoding vs. the noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling

Maximum entropy models: meanings of feature functions and weights; use for language modeling or conditional classification (β€œposterior in one go”); how to learn the weights: gradient descent

Distributed representations & neural language models: what embeddings are and what their motivation is; a common way to evaluate: cosine similarity

slide-16
SLIDE 16

LATENT SEQUENCES AND LATENT VARIABLE MODELS

slide-17
SLIDE 17

Is Language Modeling β€œLatent?”

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

slide-18
SLIDE 18

Is Language Modeling β€œLatent?” Most* of What We’ve Discussed: Not Really

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

these values are unknown but the generation process (explanation) is transparent

*Neural language modeling as an exception

slide-19
SLIDE 19

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Is Document Classification β€œLatent?”

slide-20
SLIDE 20

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Is Document Classification β€œLatent?” As We’ve Discussed

argmax_y ∏_j p(x_j | y) βˆ— p(y)

argmax_y [exp(ΞΈ β‹… f(x, y)) / Z(x)] βˆ— p(y)

argmax_y exp(ΞΈ β‹… f(x, y))

slide-21
SLIDE 21

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Is Document Classification β€œLatent?” As We’ve Discussed: Not Really

argmax_y ∏_j p(x_j | y) βˆ— p(y)

argmax_y [exp(ΞΈ β‹… f(x, y)) / Z(x)] βˆ— p(y)

argmax_y exp(ΞΈ β‹… f(x, y))

these values are unknown but the generation process (explanation) is transparent
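To make the first (NaΓ―ve Bayes) decision rule concrete, a toy sketch; the labels, words, and probabilities are invented for illustration:

```python
import math

# Hypothetical class-conditional word probabilities p(x_j | y) and priors p(y).
p_word = {
    "ATTACK": {"shot": 0.03, "wounded": 0.02, "mayor": 0.005},
    "OTHER":  {"shot": 0.002, "wounded": 0.001, "mayor": 0.004},
}
p_label = {"ATTACK": 0.3, "OTHER": 0.7}

def classify(words):
    """argmax_y p(y) * prod_j p(x_j | y), computed in log space to avoid underflow."""
    def score(y):
        return math.log(p_label[y]) + sum(math.log(p_word[y].get(w, 1e-6)) for w in words)
    return max(p_label, key=score)

print(classify(["shot", "wounded", "mayor"]))  # ATTACK
```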

slide-22
SLIDE 22

Ambiguity β†’ Part of Speech Tagging

British Left Waffles on Falkland Islands

(i)  British/Adjective   Left/Noun   Waffles/Verb
(ii) British/Noun   Left/Verb   Waffles/Noun

slide-23
SLIDE 23
(diagram: levels of linguistic structure above the observed text: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse)

Adapted from Jason Eisner, Noah Smith
slide-24
SLIDE 24

Adapted from Jason Eisner, Noah Smith

Latent Modeling

explain what you see/annotate (the observed text) with things β€œof importance” you don’t see: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse
slide-25
SLIDE 25

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

slide-26
SLIDE 26

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

slide-27
SLIDE 27

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)

(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

slide-28
SLIDE 28

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)
2. Produce a tag sequence for this sentence

(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

slide-29
SLIDE 29

Noisy Channel Model

Decode Rerank

π‘ž π‘Œ 𝑍) ∝ π‘ž 𝑍 π‘Œ) βˆ— π‘ž(π‘Œ)

possible (clean)

  • utput
  • bserved

(noisy) text translation/ decode model (clean) language model
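A toy sketch of the decode/rerank idea, scoring candidate clean outputs by channel model times language model as in the proportionality above; all candidates and numbers are made up:

```python
# Hypothetical candidate clean outputs for one observed noisy input, with
# made-up channel-model scores p(observed | clean) and language-model scores p(clean).
candidates = {
    "the cat is on the chair": {"channel": 0.020, "lm": 1.0e-3},
    "the cat is on the char":  {"channel": 0.025, "lm": 1.0e-6},
}

def decode(cands):
    """Pick argmax over candidates of p(observed | clean) * p(clean)."""
    return max(cands, key=lambda y: cands[y]["channel"] * cands[y]["lm"])

print(decode(candidates))  # the language model rescues the sensible candidate
```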

slide-30
SLIDE 30

Latent Sequence Model: Machine Translation

Decode Rerank

π‘ž π‘Œ 𝑍) ∝ π‘ž 𝑍 π‘Œ) βˆ— π‘ž(π‘Œ)

possible (clean)

  • utput
  • bserved

(noisy) text

translation/ decode model

(clean) language model

slide-31
SLIDE 31

Latent Sequence Model: Machine Translation

Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

slide-32
SLIDE 32

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

slide-33
SLIDE 33

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations!

slide-34
SLIDE 34

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations! How? Learn a β€œreverse” latent alignment model p(French words, alignments | English words)

slide-35
SLIDE 35

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations! How? Learn a β€œreverse” latent alignment model p(French words, alignments | English words) Alignment? Words can have different meaning/senses

slide-36
SLIDE 36

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations! How? Learn a β€œreverse” latent alignment model p(French words, alignments | English words)

π‘ž English French) ∝ π‘ž French English) βˆ— π‘ž(English)

Why Reverse? Alignment? Words can have different meaning/senses

slide-37
SLIDE 37

How to Learn With Latent Variables (Sequences)

Expectation Maximization

slide-38
SLIDE 38

Example: Unigram Language Modeling

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗

slide-39
SLIDE 39

Example: Unigram Language Modeling

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗

maximize (log-)likelihood to learn the probability parameters
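For this fully observed case, maximizing the (log-)likelihood reduces to relative-frequency counting; a tiny sketch with a made-up word list:

```python
from collections import Counter

words = "the cat sat on the mat".split()   # toy observed data
counts = Counter(words)
total = sum(counts.values())

# Maximum-likelihood unigram parameters: p(w) = count(w) / total.
p = {w: c / total for w, c in counts.items()}
print(p["the"])  # 2/6
```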

slide-40
SLIDE 40

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

add complexity to better explain what we see

slide-41
SLIDE 41

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

add complexity to better explain what we see

examples of latent classes z:

  • part of speech tag
  • topic (β€œsports” vs. β€œpolitics”)
slide-42
SLIDE 42

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

add complexity to better explain what we see

goal: maximize (log-)likelihood

slide-43
SLIDE 43

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values; we just see the words w

add complexity to better explain what we see

goal: maximize (log-)likelihood

slide-44
SLIDE 44

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values; we just see the words w

add complexity to better explain what we see

goal: maximize (log-)likelihood

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

slide-45
SLIDE 45

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values; we just see the words w

add complexity to better explain what we see

goal: maximize (log-)likelihood

if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :(
if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

slide-46
SLIDE 46

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values goal: maximize marginalized (log-)likelihood

slide-47
SLIDE 47

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values goal: maximize marginalized (log-)likelihood w

slide-48
SLIDE 48

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values goal: maximize marginalized (log-)likelihood w

z=1 & w z=2 & w z=3 & w z=4 & w

slide-49
SLIDE 49

Marginal(ized) Probability

p(w) = p(z = 1, w) + p(z = 2, w) + p(z = 3, w) + p(z = 4, w)

(diagram: the event w split into z=1 & w, z=2 & w, z=3 & w, z=4 & w)

(assuming z can take on only four possible values…)

slide-50
SLIDE 50

Marginal(ized) Probability

p(w) = p(z = 1, w) + p(z = 2, w) + p(z = 3, w) + p(z = 4, w) = Ξ£_{z=1}^{4} p(z, w)

(diagram: the event w split into z=1 & w, z=2 & w, z=3 & w, z=4 & w)

(assuming z can take on only four possible values…)

slide-51
SLIDE 51

Marginal(ized) Probability

p(w) = Ξ£_z p(z, w)

(diagram: the event w split into z=1 & w, z=2 & w, z=3 & w, z=4 & w)

slide-52
SLIDE 52

Marginal(ized) Probability

p(w) = Ξ£_z p(z, w) = Ξ£_z p(z) p(w | z)

(diagram: the event w split into z=1 & w, z=2 & w, z=3 & w, z=4 & w)

slide-53
SLIDE 53

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values goal: maximize marginalized (log-)likelihood w

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = ෍

𝑨1

π‘ž(𝑨1, π‘₯1) ෍

𝑨2

π‘ž(𝑨2, π‘₯2) β‹― ෍

𝑨𝑂

π‘ž(𝑨𝑂, π‘₯𝑂)

z=1 & w z=2 & w z=3 & w z=4 & w
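A small sketch of evaluating that marginalized (log-)likelihood for a two-class model; the class and word probabilities below are invented:

```python
import math

p_z = {1: 0.6, 2: 0.4}                                   # hypothetical p(z)
p_w_given_z = {                                          # hypothetical p(w | z)
    1: {"goal": 0.20, "team": 0.10, "vote": 0.01},
    2: {"goal": 0.05, "team": 0.02, "vote": 0.30},
}

def marginal_log_likelihood(words):
    """log p(w_1..w_N) = sum_j log sum_z p(z) p(w_j | z)."""
    return sum(math.log(sum(p_z[z] * p_w_given_z[z][w] for z in p_z))
               for w in words)

print(marginal_log_likelihood(["goal", "team", "vote"]))
```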

slide-54
SLIDE 54

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂

goal: maximize marginalized (log-)likelihood w

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = ෍

𝑨1

π‘ž(𝑨1, π‘₯1) ෍

𝑨2

π‘ž(𝑨2, π‘₯2) β‹― ෍

𝑨𝑂

π‘ž(𝑨𝑂, π‘₯𝑂)

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :(

slide-55
SLIDE 55

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

slide-56
SLIDE 56

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :(

slide-57
SLIDE 57

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :(

slide-58
SLIDE 58

Expectation Maximization (EM)

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty (compute expectations)
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts
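A minimal sketch of this loop for the class-based unigram model from the previous slides; the toy corpus and initialization are made up, and a real implementation would add smoothing and a convergence check:

```python
from collections import defaultdict

def em(corpus, p_z, p_w_given_z, iterations=20):
    """EM for the class-based unigram model: p(z, w) = p(z) p(w | z)."""
    vocab = set(corpus)
    for _ in range(iterations):
        # E-step: fractional counts, weighting each (z, w) event by the posterior p(z | w).
        count_z, count_zw = defaultdict(float), defaultdict(float)
        for w in corpus:
            norm = sum(p_z[z] * p_w_given_z[z][w] for z in p_z)
            for z in p_z:
                posterior = p_z[z] * p_w_given_z[z][w] / norm
                count_z[z] += posterior
                count_zw[(z, w)] += posterior
        # M-step: re-estimate the parameters from these expected counts.
        total = sum(count_z.values())
        p_z = {z: count_z[z] / total for z in count_z}
        p_w_given_z = {z: {w: count_zw[(z, w)] / count_z[z] for w in vocab} for z in count_z}
    return p_z, p_w_given_z

corpus = "goal team goal goal vote vote".split()
p_z_init = {1: 0.5, 2: 0.5}
p_w_init = {1: {"goal": 0.5, "team": 0.3, "vote": 0.2},
            2: {"goal": 0.2, "team": 0.1, "vote": 0.7}}
print(em(corpus, p_z_init, p_w_init))
```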

slide-59
SLIDE 59

Expectation Maximization (EM): E-step

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts

count(z_j, w_j)      p(z_j)

slide-60
SLIDE 60

Expectation Maximization (EM): E-step

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts

count(z_j, w_j)      p(z_j)

We’ve already seen this type of counting, when computing the gradient in maxent models.

slide-61
SLIDE 61

Expectation Maximization (EM): M-step

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts

p^(t+1)(z)  ←  estimated counts computed under p^(t)(z)

slide-62
SLIDE 62

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-63
SLIDE 63

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-64
SLIDE 64

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-65
SLIDE 65

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

p_{ΞΈ^(t)}(β‹… | w): posterior distribution      ΞΈ^(t): current parameters

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-66
SLIDE 66

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

ΞΈ^(t): current parameters      ΞΈ: new parameters      p_{ΞΈ^(t)}(β‹… | w): posterior distribution

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-67
SLIDE 67

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

E-step: count under uncertainty      M-step: maximize log-likelihood

ΞΈ^(t): current parameters      ΞΈ: new parameters      p_{ΞΈ^(t)}(β‹… | w): posterior distribution

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-68
SLIDE 68

Three Coins/Unigram With Class Example

Imagine three coins: flip the 1st coin (penny); if heads, flip the 2nd coin (dollar coin); if tails, flip the 3rd coin (dime)

slide-69
SLIDE 69

Three Coins/Unigram With Class Example

Imagine three coins: flip the 1st coin (penny); if heads, flip the 2nd coin (dollar coin); if tails, flip the 3rd coin (dime)

we only observe these second-coin flips (record the heads vs. tails outcome); we don’t observe the penny flip

slide-70
SLIDE 70

Three Coins/Unigram With Class Example

Imagine three coins: flip the 1st coin (penny); if heads, flip the 2nd coin (dollar coin); if tails, flip the 3rd coin (dime)

observed: a, b, e, etc.; β€œWe run the code” vs. β€œThe run failed”
unobserved: vowel or consonant? part of speech?

slide-71
SLIDE 71

Three Coins/Unigram With Class Example

Imagine three coins: flip the 1st coin (penny); if heads, flip the 2nd coin (dollar coin); if tails, flip the 3rd coin (dime)

p(heads) = Ξ»      p(tails) = 1 βˆ’ Ξ»      (penny)
p(heads) = Ξ³      p(tails) = 1 βˆ’ Ξ³      (dollar coin)
p(heads) = ψ      p(tails) = 1 βˆ’ Οˆ      (dime)

slide-72
SLIDE 72

Three Coins/Unigram With Class Example

Imagine three coins

p(heads) = Ξ»      p(tails) = 1 βˆ’ Ξ»      (penny)
p(heads) = Ξ³      p(tails) = 1 βˆ’ Ξ³      (dollar coin)
p(heads) = ψ      p(tails) = 1 βˆ’ Οˆ      (dime)

Three parameters to estimate: Ξ», Ξ³, and ψ

slide-73
SLIDE 73

Three Coins/Unigram With Class Example

If all flips were observed

penny:           H H T H T H
2nd/3rd coin:    H T H T T T

p(heads) = Ξ»      p(tails) = 1 βˆ’ Ξ»      (penny)
p(heads) = Ξ³      p(tails) = 1 βˆ’ Ξ³      (dollar coin)
p(heads) = ψ      p(tails) = 1 βˆ’ Οˆ      (dime)

slide-74
SLIDE 74

Three Coins/Unigram With Class Example

If all flips were observed

penny:           H H T H T H
2nd/3rd coin:    H T H T T T

Ξ»: p(heads) = 4/6      p(tails) = 2/6      (penny)
Ξ³: p(heads) = 1/4      p(tails) = 3/4      (dollar coin)
ψ: p(heads) = 1/2      p(tails) = 1/2      (dime)

slide-75
SLIDE 75

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6      p(tails) = .4      (penny)
p(heads) = .8          p(tails) = .2      (dollar coin)
p(heads) = .6          p(tails) = .4      (dime)

slide-76
SLIDE 76

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6, p(tails) = .4 (penny);   p(heads) = .8, p(tails) = .2 (dollar coin);   p(heads) = .6, p(tails) = .4 (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(heads & H) / p(H)

p(heads | observed item T) = p(heads & T) / p(T)

slide-77
SLIDE 77

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6, p(tails) = .4 (penny);   p(heads) = .8, p(tails) = .2 (dollar coin);   p(heads) = .6, p(tails) = .4 (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

(rewrite the joint using Bayes rule; the denominator p(H) is the marginal likelihood)

slide-78
SLIDE 78

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6, p(tails) = .4 (penny);   p(heads) = .8, p(tails) = .2 (dollar coin);   p(heads) = .6, p(tails) = .4 (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

p(H | heads) = .8      p(T | heads) = .2

slide-79
SLIDE 79

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6, p(tails) = .4 (penny);   p(heads) = .8, p(tails) = .2 (dollar coin);   p(heads) = .6, p(tails) = .4 (dime)

Use these values to compute posteriors:

p(H) = p(H | heads) βˆ— p(heads) + p(H | tails) βˆ— p(tails) = .8 βˆ— .6 + .6 βˆ— .4

p(heads | observed item H) = p(H | heads) p(heads) / p(H)      with p(H | heads) = .8, p(T | heads) = .2

slide-80
SLIDE 80

Three Coins/Unigram With Class Example

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

Use posteriors to update parameters

p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 βˆ— .6) / (.8 βˆ— .6 + .6 βˆ— .4) β‰ˆ 0.667

p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 βˆ— .6) / (.2 βˆ— .6 + .4 βˆ— .4) β‰ˆ 0.429

(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)

slide-81
SLIDE 81

Three Coins/Unigram With Class Example

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

Use posteriors to update parameters

fully observed setting:   p(heads) = (# heads from penny) / (# total flips of penny)

our setting (partially observed):   p(heads) = (# expected heads from penny) / (# total flips of penny)

p(heads | obs. H) = (.8 βˆ— .6) / (.8 βˆ— .6 + .6 βˆ— .4) β‰ˆ 0.667      p(heads | obs. T) = (.2 βˆ— .6) / (.2 βˆ— .6 + .4 βˆ— .4) β‰ˆ 0.429

(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)

slide-82
SLIDE 82

Three Coins/Unigram With Class Example

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

Use posteriors to update parameters

our setting (partially observed):

p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny) = 𝔼_{p^(t)}[# heads from penny] / (# total flips of penny)

p(heads | obs. H) = (.8 βˆ— .6) / (.8 βˆ— .6 + .6 βˆ— .4) β‰ˆ 0.667      p(heads | obs. T) = (.2 βˆ— .6) / (.2 βˆ— .6 + .4 βˆ— .4) β‰ˆ 0.429

slide-83
SLIDE 83

Three Coins/Unigram With Class Example

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

Use posteriors to update parameters

our setting (partially observed):

p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny)
              = 𝔼_{p^(t)}[# heads from penny] / (# total flips of penny)
              = (2 βˆ— p(heads | obs. H) + 4 βˆ— p(heads | obs. T)) / 6 β‰ˆ 0.508

p(heads | obs. H) = (.8 βˆ— .6) / (.8 βˆ— .6 + .6 βˆ— .4) β‰ˆ 0.667      p(heads | obs. T) = (.2 βˆ— .6) / (.2 βˆ— .6 + .4 βˆ— .4) β‰ˆ 0.429

slide-84
SLIDE 84

Expectation Maximization (EM)

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty (compute expectations)
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts

slide-85
SLIDE 85

Related to EM

Latent clustering:

  • K-means: https://www.csee.umbc.edu/courses/undergraduate/473/f19/kmeans/
  • Gaussian mixture modeling