
4CSLL5 Parameter Estimation (Supervised and Unsupervised)

Martin Emms, September 20, 2019

Outline

Parameter Estimation
  • Supervised Maximum Likelihood Estimation (MLE)
    • First scenario: (toss a 'coin' Z)^D
    • 2nd scenario: (toss Z; (then A or B)^10)^D

Supervised Maximum Likelihood Estimation (MLE) – First scenario: (toss a 'coin' Z)^D

Common-sense and relative frequency

  • Suppose a 2-sided 'coin' Z, one side labelled 'a', the other side labelled 'b'
  • P(Z = a): probability of giving 'a' when tossed – currently not known
  • P(Z = b): probability of giving 'b' when tossed – currently not known
  • Suppose you have data d recording 100 tosses of Z
    • if there were (50 a, 50 b) in d, 'common sense' says P(Z = a) = 50/100
    • if there were (30 a, 70 b) in d, 'common sense' says P(Z = a) = 30/100

  • i.e. you 'define' or 'estimate' the probability by the relative frequency
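
As a minimal sketch (not from the slides) of this relative-frequency estimate in Python, with a made-up toss sequence:

    # Relative-frequency ('common sense') estimate of P(Z = a) and P(Z = b)
    # from an observed toss sequence d (made-up data: 30 'a's and 70 'b's).
    d = ['a'] * 30 + ['b'] * 70

    count_a = d.count('a')
    count_b = d.count('b')

    est_p_a = count_a / (count_a + count_b)   # relative frequency of 'a' -> 0.3
    est_p_b = count_b / (count_a + count_b)   # relative frequency of 'b' -> 0.7

    print(est_p_a, est_p_b)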

Data likelihood

  • assuming the tosses of Z are all independent, we can work out the probability of the observed data d if Z's probabilities had particular values
  • let θa and θb stand for P(Z = a) and P(Z = b)
  • let #(a) be the number of 'a' outcomes in the sequence d
  • let #(b) be the number of 'b' outcomes in the sequence d
  • the probability of d, assuming the probability settings θa and θb, is

        p(d) = θa^#(a) × θb^#(b)                                        (1)

  • different settings of θa and θb will give different values for p(d); the following slides investigate this empirically
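
A small sketch of equation (1) in Python (the counts and the trial values of θa below are just examples):

    def p_of_d(theta_a, count_a, count_b):
        """Probability of the observed sequence d under P(Z=a)=theta_a,
        P(Z=b)=1-theta_a, as in equation (1)."""
        return (theta_a ** count_a) * ((1 - theta_a) ** count_b)

    # for the (50 a, 50 b) data, the theta_a = 0.5 setting gives a larger p(d)
    # than the theta_a = 0.3 setting
    print(p_of_d(0.5, 50, 50))
    print(p_of_d(0.3, 50, 50))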

p(d) for 50 a, 50 b

[figure: p(d) plotted against θa over (0, 1) for this data]

as θa is varied, the data probability p(d) varies; the max occurs at θa = 0.5, which is 50/(50 + 50)

p(d) for 30 a, 70 b

[figure: p(d; θa, θb) plotted against θa over (0, 1) for this data]

as θa is varied, the data probability p(d; θa, θb) varies; the max occurs at θa = 0.3, which is 30/(30 + 70)

p(d) for 70 a, 30 b

[figure: p(d; θa, θb) plotted against θa over (0, 1) for this data]

as θa is varied, the data probability p(d; θa, θb) varies; the max occurs at θa = 0.7, which is 70/(70 + 30)
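
A rough sketch of the empirical check behind these three cases: scan θa over a grid and record where p(d; θa) is largest.

    def argmax_theta(count_a, count_b, steps=1000):
        """Grid-search the theta_a in (0, 1) that maximises
        p(d; theta_a) = theta_a^#(a) * (1 - theta_a)^#(b)."""
        best_theta, best_p = None, -1.0
        for i in range(1, steps):
            theta = i / steps
            p = (theta ** count_a) * ((1 - theta) ** count_b)
            if p > best_p:
                best_theta, best_p = theta, p
        return best_theta

    print(argmax_theta(50, 50))   # ~0.5
    print(argmax_theta(30, 70))   # ~0.3
    print(argmax_theta(70, 30))   # ~0.7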


◮ in each case, it looks like the max of the data probability occurred at the value given by the relative frequency
◮ this suggests that in these cases:

  Max. Likelihood Estimator
  if you wanted to find the θa (and θb) that maximise the data probability, that is you want

        arg max_{θa, θb} p(d; θa, θb)

  then the relative frequencies would give the answer, that is

        θa = #(a) / (#(a) + #(b))        θb = #(b) / (#(a) + #(b))

◮ technically expressed as: the relative frequency is a maximum likelihood estimator of the parameters

  • On reflection, if you have to set parameters given data, it makes a lot of sense to set the parameters to whatever values make the data as likely as possible
  • the formula for p(d; θa, θb) is (1), repeated below:

        p(d; θa, θb) = θa^#(a) × θb^#(b)

  • and because θb = 1 − θa we can really write this in terms of just the parameter θa:

        p(d; θa) = θa^#(a) × (1 − θa)^#(b)

  • looking at some pictures suggested a formula for the value of θa that maximises this. Can we actually derive this formula?
  • Yes ⇒ take the log of this – the log-likelihood – and use calculus to maximise that w.r.t. θa – this turns out to be (relatively) easy

4CSLL5 Parameter Estimation (Supervised and Unsupervised) Supervised Maximum Likelihood Estimation(MLE) First scenario: (toss a ’coin’ Z)D

  • Define L(θa) as log(P(d; θa)). Then

        L(θa) = #(a) log θa + #(b) log(1 − θa)

  • we need to take the derivative w.r.t. θa and set it to 0:

        dL(θa)/dθa = #(a)/θa − #(b)/(1 − θa) = 0   ⟹   θa = #(a)/(#(a) + #(b)) = #(a)/100

  • so in this scenario of 100 tosses of Z, we have proven that the relative frequency is always going to be the maximum likelihood estimator
  • now we want to consider a slightly more complex scenario
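
One way to sanity-check this closed form is symbolically; a sketch using sympy (assuming it is installed), with example counts #(a) = 30, #(b) = 70:

    import sympy as sp

    theta = sp.symbols('theta', positive=True)
    na, nb = 30, 70                                    # example counts #(a), #(b)

    L = na * sp.log(theta) + nb * sp.log(1 - theta)    # log-likelihood L(theta_a)
    stationary = sp.solve(sp.diff(L, theta), theta)    # solve dL/dtheta = 0

    print(stationary)                                  # [3/10], i.e. #(a) / (#(a) + #(b))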

Supervised Maximum Likelihood Estimation (MLE) – 2nd scenario: (toss Z; (then A or B)^10)^D

a more complex scenario

  • suppose D repetitions of: toss disc Z to choose one of two coins A or B, then toss the chosen coin 10 times
  • suppose 9 repetitions gave the following data (transcribed again in the sketch below):

        d   Z   X: tosses of chosen coin   H count
        1   A   H H H H H H H H T T        (8H)
        2   B   T T H T T T H T T T        (2H)
        3   A   H T H H T H H H H T        (7H)
        4   A   H T H H H T H H H H        (8H)
        5   B   T T T T T T H T T T        (1H)
        6   A   H H T H H H H H H H        (9H)
        7   A   T H H T H H H H H T        (7H)
        8   A   H H H H H H T H H H        (9H)
        9   B   H H T T T T T H T T        (3H)

  • Let θa be Z's probability of giving A
  • Let θh|a be A's probability of giving H
  • Let θh|b be B's probability of giving H
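
A sketch of those 9 records as Python data, keeping only the chosen coin and the head count, which is all the later formulas need:

    # one (coin chosen by Z, number of heads in the 10 tosses) pair per record
    data = [
        ('A', 8), ('B', 2), ('A', 7), ('A', 8), ('B', 1),
        ('A', 9), ('A', 7), ('A', 9), ('B', 3),
    ]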


’common sense’ calculation of θa, θh|a and θh|b

  • for θa, we need (count of Z = A cases)/(count of all Z cases), i.e.

        est(θa) = [Σ_{d:Z=A} 1] / D = 6/9 ≈ 0.67                                      (2)

  • for θh|a, we need (count of H when A chosen)/(count of all tosses when A chosen), i.e.

        est(θh|a) = [Σ_{d:Z=A} #(d, h)] / [Σ_{d:Z=A} 10] = 48/60 = 4/5 = 0.8          (3)

  • for θh|b, we need (count of H when B chosen)/(count of all tosses when B chosen), i.e.

        est(θh|b) = [Σ_{d:Z=B} #(d, h)] / [Σ_{d:Z=B} 10] = 6/30 = 1/5 = 0.2           (4)
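
A sketch of (2)-(4) computed from the records above (same hypothetical `data` list as before):

    data = [('A', 8), ('B', 2), ('A', 7), ('A', 8), ('B', 1),
            ('A', 9), ('A', 7), ('A', 9), ('B', 3)]

    D = len(data)
    a_heads = [h for (z, h) in data if z == 'A']   # head counts of records where Z chose A
    b_heads = [h for (z, h) in data if z == 'B']   # head counts of records where Z chose B

    est_theta_a   = len(a_heads) / D                      # (2): 6/9
    est_theta_h_a = sum(a_heads) / (10 * len(a_heads))    # (3): 48/60 = 0.8
    est_theta_h_b = sum(b_heads) / (10 * len(b_heads))    # (4): 6/30  = 0.2

    print(est_theta_a, est_theta_h_a, est_theta_h_b)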

  • to make the comparison with the hidden-variable version which will come up later, it is worth noting that we can reformulate all the restricted sums Σ_{d:Z=A} Φ(d) as unrestricted sums if we put a so-called Kronecker-delta indicator function inside the sum: Σ_d δ(d, A) Φ(d), where δ(d, A) = 1 if datum d had Z = A, and is 0 otherwise

        est(θa)   = [Σ_d δ(d, A)] / D                                                 (5)

        est(θh|a) = [Σ_d δ(d, A) #(d, h)] / [Σ_d δ(d, A) 10]                          (6)

        est(θh|b) = [Σ_d δ(d, B) #(d, h)] / [Σ_d δ(d, B) 10]                          (7)
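
The same estimates written with an explicit indicator, mirroring (5)-(7); a sketch over the same hypothetical `data` list:

    data = [('A', 8), ('B', 2), ('A', 7), ('A', 8), ('B', 1),
            ('A', 9), ('A', 7), ('A', 9), ('B', 3)]

    def delta(z, label):
        """Kronecker-delta indicator: 1 if the datum's Z outcome is `label`, else 0."""
        return 1 if z == label else 0

    D = len(data)
    est_theta_a   = sum(delta(z, 'A') for (z, h) in data) / D                # (5)
    est_theta_h_a = (sum(delta(z, 'A') * h for (z, h) in data)
                     / sum(delta(z, 'A') * 10 for (z, h) in data))           # (6)
    est_theta_h_b = (sum(delta(z, 'B') * h for (z, h) in data)
                     / sum(delta(z, 'B') * 10 for (z, h) in data))           # (7)

    print(est_theta_a, est_theta_h_a, est_theta_h_b)   # same values as (2)-(4)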

  • it turns out that in this scenario too, the 'common-sense', relative-frequency answers are maximum likelihood estimators, i.e. values which maximise the probability of the data, and again it is (relatively) easy to show this by taking logs and using calculus
  • the formula for p(d; θa, θb, θh|a, θt|a, θh|b, θt|b) is

        p(d) = ∏_{d:Z=a} [θa × θh|a^#(d,h) × θt|a^#(d,t)] × ∏_{d:Z=b} [θb × θh|b^#(d,h) × θt|b^#(d,t)]

    and its log comes out as

        Σ_{d:Z=a} [log θa + #(d, h) log θh|a + #(d, t) log θt|a] + Σ_{d:Z=b} [log θb + #(d, h) log θh|b + #(d, t) log θt|b]

    call this L(θa, θh|a, θh|b)
  • L(θa, θh|a, θh|b) can be split into 3 separate terms, L(θa) + L(θh|a) + L(θh|b), concerning Z, A and B:

        L(θa)   = [Σ_{d:Z=a} 1] log θa + [Σ_{d:Z=b} 1] log(1 − θa)                      (8)

        L(θh|a) = [Σ_{d:Z=a} #(d, h)] log θh|a + [Σ_{d:Z=a} #(d, t)] log(1 − θh|a)      (9)

        L(θh|b) = [Σ_{d:Z=b} #(d, h)] log θh|b + [Σ_{d:Z=b} #(d, t)] log(1 − θh|b)      (10)

  • this means that when you take the derivatives of L(θa, θh|a, θh|b) w.r.t. θa, θh|a and θh|b, in each case you can just look at one of the above terms. They are all of the same form N log(p) + M log(1 − p), the same form as seen in the first simple scenario, and this has its maximum value at p = N/(N + M)


hence

    ∂L(θa)/∂θa = 0      ⟹   θa   = [Σ_{d:Z=a} 1] / ([Σ_{d:Z=a} 1] + [Σ_{d:Z=b} 1])

    ∂L(θh|a)/∂θh|a = 0  ⟹   θh|a = [Σ_{d:Z=a} #(d, h)] / ([Σ_{d:Z=a} #(d, h)] + [Σ_{d:Z=a} #(d, t)])

    ∂L(θh|b)/∂θh|b = 0  ⟹   θh|b = [Σ_{d:Z=b} #(d, h)] / ([Σ_{d:Z=b} #(d, h)] + [Σ_{d:Z=b} #(d, t)])

finally the denominators of these turn into D, Σ_{d:Z=a} 10 and Σ_{d:Z=b} 10 respectively, and so these are exactly the 'common sense' formulae we started with in (2), (3), (4)
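
As a final numeric sanity check (a sketch, not part of the slides), each separated term has the form N log(p) + M log(1 − p), so a 1-D grid search per parameter should land on the relative-frequency values 6/9, 48/60 and 6/30:

    import math

    data = [('A', 8), ('B', 2), ('A', 7), ('A', 8), ('B', 1),
            ('A', 9), ('A', 7), ('A', 9), ('B', 3)]

    # counts entering the three separated log-likelihood terms (8), (9), (10)
    n_A = sum(1 for z, _ in data if z == 'A')       # Σ_{d:Z=a} 1
    n_B = len(data) - n_A                           # Σ_{d:Z=b} 1
    h_A = sum(h for z, h in data if z == 'A')       # heads when A was chosen
    t_A = 10 * n_A - h_A                            # tails when A was chosen
    h_B = sum(h for z, h in data if z == 'B')
    t_B = 10 * n_B - h_B

    def argmax_NM(N, M, steps=10000):
        """Grid-maximise N*log(p) + M*log(1-p); should land near N/(N+M)."""
        return max((i / steps for i in range(1, steps)),
                   key=lambda p: N * math.log(p) + M * math.log(1 - p))

    print(argmax_NM(n_A, n_B))   # ~0.667 = 6/9
    print(argmax_NM(h_A, t_A))   # ~0.8   = 48/60
    print(argmax_NM(h_B, t_B))   # ~0.2   = 6/30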