4CSLL5 Parameter Estimation (Supervised and Unsupervised)



slide-1
SLIDE 1

4CSLL5 Parameter Estimation (Supervised and Unsupervised)

Martin Emms September 20, 2019

slide-2
SLIDE 2

Outline

Supervised Maximum Likelihood Estimation (MLE)
  • First scenario: (toss a 'coin' Z)^D
  • 2nd scenario: (toss Z; (then A or B)^10)^D

slide-3
SLIDE 3

Parameter Estimation

slide-4
SLIDE 4

Supervised Maximum Likelihood Estimation (MLE)
First scenario: (toss a 'coin' Z)^D

slide-11
SLIDE 11

Common-sense and relative frequency

  • Suppose a 2-sided 'coin' Z, one side labelled 'a', the other side labelled 'b'.
  • P(Z = a): probability of giving 'a' when tossed; currently not known.
  • P(Z = b): probability of giving 'b' when tossed; currently not known.
  • Suppose you have data d recording 100 tosses of Z.
  • If there were (50 a, 50 b) in d, 'common sense' says P(Z = a) = 50/100.
  • If there were (30 a, 70 b) in d, 'common sense' says P(Z = a) = 30/100.

That is, you 'define' or 'estimate' the probability by the relative frequency.
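The relative-frequency estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name `relative_frequencies` and the encoding of the toss record d as a list of 'a'/'b' strings are assumptions, not something given in the slides.

```python
from collections import Counter


def relative_frequencies(tosses):
    """Estimate each outcome's probability by its relative frequency."""
    counts = Counter(tosses)          # e.g. {'a': 30, 'b': 70}
    total = sum(counts.values())      # number of tosses recorded in d
    return {outcome: n / total for outcome, n in counts.items()}


# Hypothetical record d of 100 tosses: 30 'a' outcomes and 70 'b' outcomes.
d = ["a"] * 30 + ["b"] * 70
est = relative_frequencies(d)
print(est["a"], est["b"])
```

The order of the tosses plays no role here; only the counts of each outcome matter.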

slide-18
SLIDE 18

Data likelihood

  • Assuming the tosses of Z are all independent, we can work out the probability of the observed data d if Z's probabilities had particular values.
  • Let $\theta_a$ and $\theta_b$ stand for P(Z = a) and P(Z = b).
  • Let #(a) be the number of 'a' outcomes in the sequence d.
  • Let #(b) be the number of 'b' outcomes in the sequence d.
  • The probability of d, assuming the probability settings $\theta_a$ and $\theta_b$, is

\[ p(d) = \theta_a^{\#(a)} \times \theta_b^{\#(b)} \tag{1} \]

  • Different settings of $\theta_a$ and $\theta_b$ will give different values for p(d).
  • The following slides investigate this empirically.
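As a rough sketch of formula (1), the function below evaluates the data probability for given parameter settings and counts. The name `data_probability` and the two example settings compared are illustrative assumptions.

```python
def data_probability(theta_a, theta_b, count_a, count_b):
    """Formula (1): p(d) = theta_a^#(a) * theta_b^#(b) for independent tosses."""
    return theta_a ** count_a * theta_b ** count_b


# Hypothetical data d with 30 'a' and 70 'b' outcomes.
count_a, count_b = 30, 70

# Two candidate settings of (theta_a, theta_b) give different values of p(d).
p_fair = data_probability(0.5, 0.5, count_a, count_b)
p_skewed = data_probability(0.3, 0.7, count_a, count_b)
print(p_fair, p_skewed)
```

For this data the setting (0.3, 0.7) gives the larger value of p(d) of the two.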

slide-21
SLIDE 21

p(d) for 50 a, 50 b

[Figure: plot of p(d) (vertical axis) against $\theta_a$ (horizontal axis, 0.0 to 1.0)]

  • As $\theta_a$ is varied, the data probability p(d) varies.
  • The max occurs at $\theta_a = 0.5$, which is $\frac{50}{50 + 50}$.

slide-24
SLIDE 24

p(d) for 30 a, 70 b

[Figure: plot of p(d) (vertical axis) against $\theta_a$ (horizontal axis, 0.0 to 1.0)]

  • As $\theta_a$ is varied, the data probability $p(d; \theta_a, \theta_b)$ varies.
  • The max occurs at $\theta_a = 0.3$, which is $\frac{30}{30 + 70}$.

slide-27
SLIDE 27

p(d) for 70 a, 30 b

[Figure: plot of p(d) (vertical axis) against $\theta_a$ (horizontal axis, 0.0 to 1.0)]

  • As $\theta_a$ is varied, the data probability $p(d; \theta_a, \theta_b)$ varies.
  • The max occurs at $\theta_a = 0.7$, which is $\frac{70}{70 + 30}$.
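The empirical investigation in the three plots can be imitated with a simple grid scan. This is a sketch under assumptions: the helper names and the grid of 101 evenly spaced values of $\theta_a$ are choices made here, and $\theta_b$ is taken to be $1 - \theta_a$.

```python
def data_probability(theta_a, count_a, count_b):
    """p(d) with theta_b taken to be 1 - theta_a."""
    return theta_a ** count_a * (1.0 - theta_a) ** count_b


def best_theta_on_grid(count_a, count_b, steps=100):
    """Return the grid value of theta_a giving the largest p(d)."""
    grid = [i / steps for i in range(steps + 1)]   # 0.00, 0.01, ..., 1.00
    return max(grid, key=lambda t: data_probability(t, count_a, count_b))


# The three cases plotted on the slides.
for count_a, count_b in [(50, 50), (30, 70), (70, 30)]:
    print(count_a, count_b, best_theta_on_grid(count_a, count_b))
```

A grid scan only finds the best value among the points tried, so it suggests where the maximum lies rather than proving it.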

slide-32
SLIDE 32

Max. Likelihood Estimator

◮ In each case, it looks like the max of the data probability occurred at the value given by the relative frequency.

◮ This suggests that in these cases, if you wanted to find $\theta_a$ (and $\theta_b$) that maximise the data probability, that is, you want

\[ \arg\max_{\theta_a, \theta_b} \; p(d; \theta_a, \theta_b) \]

then the relative frequencies would give the answer, that is

\[ \theta_a = \frac{\#(a)}{\#(a) + \#(b)} \qquad \theta_b = \frac{\#(b)}{\#(a) + \#(b)} \]

◮ Technically expressed as: the relative frequency is a maximum likelihood estimator of the parameters.

slide-36
SLIDE 36

On reflection, if you have to set parameters given data, it makes a lot of sense to set the parameters to whatever values make the data as likely as possible.

The formula for $p(d; \theta_a, \theta_b)$ is (1), repeated below:

\[ p(d; \theta_a, \theta_b) = \theta_a^{\#(a)} \times \theta_b^{\#(b)} \]

and because $\theta_b = 1 - \theta_a$ we can really write this in terms of just the parameter $\theta_a$:

\[ p(d; \theta_a) = \theta_a^{\#(a)} \times (1 - \theta_a)^{\#(b)} \]

Looking at some pictures suggested a formula for the value of $\theta_a$ that maximises this. Can we actually derive this formula?

Yes ⇒ take the log of this (the log-likelihood) and use calculus to maximise that w.r.t. $\theta_a$. This turns out to be (relatively) easy.
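Because log is an increasing function, the value of $\theta_a$ that maximises the log of the data probability also maximises the data probability itself. The sketch below checks this on a grid, and also shows a practical side benefit that the slides do not discuss: for long toss sequences the raw product underflows in floating point, while its log stays finite. The function names and the example counts are assumptions.

```python
import math


def likelihood(theta_a, count_a, count_b):
    """p(d; theta_a) = theta_a^#(a) * (1 - theta_a)^#(b)."""
    return theta_a ** count_a * (1.0 - theta_a) ** count_b


def log_likelihood(theta_a, count_a, count_b):
    """log p(d; theta_a) = #(a) log(theta_a) + #(b) log(1 - theta_a)."""
    return count_a * math.log(theta_a) + count_b * math.log(1.0 - theta_a)


# Interior grid points only, since log(0) is undefined.
grid = [i / 100 for i in range(1, 100)]
count_a, count_b = 30, 70
best_by_p = max(grid, key=lambda t: likelihood(t, count_a, count_b))
best_by_log = max(grid, key=lambda t: log_likelihood(t, count_a, count_b))
print(best_by_p, best_by_log)

# With 2000 tosses the product is too small for a double, the log is not.
print(likelihood(0.5, 1000, 1000), log_likelihood(0.5, 1000, 1000))
```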

slide-42
SLIDE 42

Define $L(\theta_a)$ as $\log(p(d; \theta_a))$. Then you get

\[ L(\theta_a) = \#(a) \log \theta_a + \#(b) \log(1 - \theta_a) \]

We need to take the derivative w.r.t. $\theta_a$ and set it to 0, which is

\[ \frac{dL(\theta_a)}{d\theta_a} = \frac{\#(a)}{\theta_a} - \frac{\#(b)}{1 - \theta_a} = 0 \implies \theta_a = \frac{\#(a)}{\#(a) + \#(b)} = \frac{\#(a)}{100} \]

So in this scenario of 100 tosses of Z, we have proven that the relative frequency is always going to be the maximum likelihood estimator.

We now want to consider a slightly more complex scenario.
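The step from the zero-derivative condition to the closed form can be spelled out as follows. This is a worked version of the algebra, assuming $0 < \theta_a < 1$ and that both counts are positive; the second-derivative check at the end is an addition, not something stated on the slide.

```latex
\begin{align*}
\frac{\#(a)}{\theta_a} - \frac{\#(b)}{1 - \theta_a} &= 0 \\
\#(a)\,(1 - \theta_a) &= \#(b)\,\theta_a \\
\#(a) &= \bigl(\#(a) + \#(b)\bigr)\,\theta_a \\
\theta_a &= \frac{\#(a)}{\#(a) + \#(b)}
\end{align*}
% The stationary point is a maximum because the second derivative is negative:
\[
\frac{d^2 L(\theta_a)}{d\theta_a^2}
  = -\frac{\#(a)}{\theta_a^2} - \frac{\#(b)}{(1 - \theta_a)^2} < 0
\]
```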

slide-43
SLIDE 43

Supervised Maximum Likelihood Estimation (MLE)
2nd scenario: (toss Z; (then A or B)^10)^D

slide-48
SLIDE 48

A more complex scenario

Suppose D repetitions of: toss disc Z to choose one of two coins, A or B, then toss the chosen coin 10 times.

Suppose 9 repetitions gave

d   Z   X: tosses of chosen coin   H counts
1   A   H H H H H H H H T T        (8H)
2   B   T T H T T T H T T T        (2H)
3   A   H T H H T H H H H T        (7H)
4   A   H T H H H T H H H H        (8H)
5   B   T T T T T T H T T T        (1H)
6   A   H H T H H H H H H H        (9H)
7   A   T H H T H H H H H T        (7H)
8   A   H H H H H H T H H H        (9H)
9   B   H H T T T T T H T T        (3H)

  • Let $\theta_a$ be Z's probability of giving A.
  • Let $\theta_{h|a}$ be A's probability of giving H.
  • Let $\theta_{h|b}$ be B's probability of giving H.

slide-58
SLIDE 58

'Common sense' calculation of $\theta_a$, $\theta_{h|a}$ and $\theta_{h|b}$

For $\theta_a$, we need (count of Z = A cases)/(count of all Z cases), i.e.

\[ \mathrm{est}(\theta_a) = \frac{\sum_{d:Z=A} 1}{D} = \frac{6}{9} \approx 0.67 \tag{2} \]

For $\theta_{h|a}$, we need (count of H when A chosen)/(count of all tosses when A chosen), i.e.

\[ \mathrm{est}(\theta_{h|a}) = \frac{\sum_{d:Z=A} \#(d, h)}{\sum_{d:Z=A} 10} = \frac{48}{60} = \frac{4}{5} = 0.8 \tag{3} \]

For $\theta_{h|b}$, we need (count of H when B chosen)/(count of all tosses when B chosen), i.e.

\[ \mathrm{est}(\theta_{h|b}) = \frac{\sum_{d:Z=B} \#(d, h)}{\sum_{d:Z=B} 10} = \frac{6}{30} = \frac{1}{5} = 0.2 \tag{4} \]
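The three 'common sense' calculations can be reproduced from the 9-row data table with a short script. This is a sketch: the encoding of each datum as a `(coin, tosses)` pair and the function name `estimate_parameters` are assumptions made for illustration.

```python
# The 9 repetitions from the table: (outcome of Z, the 10 tosses of the chosen coin).
data = [
    ("A", "HHHHHHHHTT"),
    ("B", "TTHTTTHTTT"),
    ("A", "HTHHTHHHHT"),
    ("A", "HTHHHTHHHH"),
    ("B", "TTTTTTHTTT"),
    ("A", "HHTHHHHHHH"),
    ("A", "THHTHHHHHT"),
    ("A", "HHHHHHTHHH"),
    ("B", "HHTTTTTHTT"),
]


def estimate_parameters(data):
    """Relative-frequency estimates of theta_a, theta_h|a and theta_h|b."""
    def heads_rate(coin):
        heads = sum(x.count("H") for z, x in data if z == coin)
        tosses = sum(len(x) for z, x in data if z == coin)
        return heads / tosses

    theta_a = sum(1 for z, _ in data if z == "A") / len(data)
    return theta_a, heads_rate("A"), heads_rate("B")


theta_a, theta_h_a, theta_h_b = estimate_parameters(data)
print(theta_a, theta_h_a, theta_h_b)
```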

slide-60
SLIDE 60

To make the comparison with the hidden variable version, which will come up later, it's worth noting that we can formulate all the restricted sums $\sum_{d:Z=A} (\Phi(d))$ with unrestricted sums if we put a so-called Kronecker-delta indicator function inside the sum, $\sum_d (\delta(d, A) \Phi(d))$, where $\delta(d, A) = 1$ if datum d had Z = A, and is 0 otherwise.

\[ \mathrm{est}(\theta_a) = \frac{\sum_d \delta(d, A)}{D} \tag{5} \]

\[ \mathrm{est}(\theta_{h|a}) = \frac{\sum_d \delta(d, A)\, \#(d, h)}{\sum_d \delta(d, A)\, 10} \tag{6} \]

\[ \mathrm{est}(\theta_{h|b}) = \frac{\sum_d \delta(d, B)\, \#(d, h)}{\sum_d \delta(d, B)\, 10} \tag{7} \]
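Formulas (5) to (7) translate almost directly into code: every sum runs over all data items, and the indicator switches items on or off. In this sketch the data are stored compactly as the Z outcomes and H counts from the table; the names `zs`, `heads`, `TOSSES` and `delta` are assumptions.

```python
zs = "ABAABAAAB"                      # outcome of Z for data items 1..9
heads = [8, 2, 7, 8, 1, 9, 7, 9, 3]   # #(d, h) for each data item
TOSSES = 10                           # tosses of the chosen coin per item


def delta(z, coin):
    """Kronecker-delta indicator: 1 if the datum had Z = coin, else 0."""
    return 1 if z == coin else 0


D = len(zs)

# (5): unrestricted sum of indicators over all data items.
est_theta_a = sum(delta(z, "A") for z in zs) / D

# (6) and (7): indicator-weighted H counts over indicator-weighted toss counts.
est_theta_h_a = (sum(delta(z, "A") * h for z, h in zip(zs, heads))
                 / sum(delta(z, "A") * TOSSES for z in zs))
est_theta_h_b = (sum(delta(z, "B") * h for z, h in zip(zs, heads))
                 / sum(delta(z, "B") * TOSSES for z in zs))

print(est_theta_a, est_theta_h_a, est_theta_h_b)
```

Writing the sums this way means the indicator is the only place where the value of Z enters the calculation.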

slide-64
SLIDE 64

It turns out that in this scenario also, the 'common-sense', relative-frequency answers are maximum likelihood estimators, i.e. values which maximise the probability of the data, and again it is (relatively) easy to show this by taking logs and using calculus.

The formula for $p(d; \theta_a, \theta_b, \theta_{h|a}, \theta_{t|a}, \theta_{h|b}, \theta_{t|b})$ is

\[ p(d) = \prod_{d:Z=a} \left[ \theta_a \, \theta_{h|a}^{\#(d,h)} \, \theta_{t|a}^{\#(d,t)} \right] \prod_{d:Z=b} \left[ \theta_b \, \theta_{h|b}^{\#(d,h)} \, \theta_{t|b}^{\#(d,t)} \right] \]

and its log comes out as

\[ \sum_{d:Z=a} \left[ \log\theta_a + \#(d,h) \log\theta_{h|a} + \#(d,t) \log\theta_{t|a} \right] + \sum_{d:Z=b} \left[ \log\theta_b + \#(d,h) \log\theta_{h|b} + \#(d,t) \log\theta_{t|b} \right] \]

Call this $L(\theta_a, \theta_{h|a}, \theta_{h|b})$.
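The log of the data probability can be evaluated numerically for the 9-row data set. The sketch below compares its value at the relative-frequency estimates with its value at some nearby settings. The compact data encoding, the function name `log_likelihood` and the perturbation size 0.05 are assumptions; $\theta_b$, $\theta_{t|a}$ and $\theta_{t|b}$ are taken to be one minus the corresponding parameters.

```python
import math

zs = "ABAABAAAB"                      # outcome of Z for data items 1..9
heads = [8, 2, 7, 8, 1, 9, 7, 9, 3]   # #(d, h) for each data item
TOSSES = 10                           # so #(d, t) = TOSSES - #(d, h)


def log_likelihood(theta_a, theta_h_a, theta_h_b):
    """Sum over data items of the log of each item's probability."""
    total = 0.0
    for z, h in zip(zs, heads):
        t = TOSSES - h
        if z == "A":
            total += (math.log(theta_a)
                      + h * math.log(theta_h_a) + t * math.log(1 - theta_h_a))
        else:
            total += (math.log(1 - theta_a)
                      + h * math.log(theta_h_b) + t * math.log(1 - theta_h_b))
    return total


at_estimates = log_likelihood(6 / 9, 0.8, 0.2)

# Shift each parameter by -0.05, 0 or +0.05 and re-evaluate.
shifts = (-0.05, 0.0, 0.05)
nearby = [log_likelihood(6 / 9 + da, 0.8 + dh, 0.2 + db)
          for da in shifts for dh in shifts for db in shifts]
print(at_estimates, max(nearby))
```

None of the shifted settings gives a higher value than the relative-frequency estimates, which is consistent with (though not a proof of) their being the maximum likelihood values.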

slide-71
SLIDE 71
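\[ \sum_{d:Z=a} \left[ \log\theta_a + \#(d,h) \log\theta_{h|a} + \#(d,t) \log\theta_{t|a} \right] + \sum_{d:Z=b} \left[ \log\theta_b + \#(d,h) \log\theta_{h|b} + \#(d,t) \log\theta_{t|b} \right] \]

$L(\theta_a, \theta_{h|a}, \theta_{h|b})$, repeated above, can be split into 3 separate terms, $L(\theta_a) + L(\theta_{h|a}) + L(\theta_{h|b})$, concerning Z, A and B:

\[ L(\theta_a) = \Bigl[ \sum_{d:Z=a} 1 \Bigr] \log\theta_a + \Bigl[ \sum_{d:Z=b} 1 \Bigr] \log(1 - \theta_a) \tag{8} \]

\[ L(\theta_{h|a}) = \Bigl[ \sum_{d:Z=a} \#(d,h) \Bigr] \log\theta_{h|a} + \Bigl[ \sum_{d:Z=a} \#(d,t) \Bigr] \log(1 - \theta_{h|a}) \tag{9} \]

\[ L(\theta_{h|b}) = \Bigl[ \sum_{d:Z=b} \#(d,h) \Bigr] \log\theta_{h|b} + \Bigl[ \sum_{d:Z=b} \#(d,t) \Bigr] \log(1 - \theta_{h|b}) \tag{10} \]

This means that when you take the derivatives of $L(\theta_a, \theta_{h|a}, \theta_{h|b})$ w.r.t. $\theta_a$, $\theta_{h|a}$ and $\theta_{h|b}$, in each case you can just look at one of the above terms. They are all really of the same form, $N \log(p) + M \log(1 - p)$, the same form as seen in the first simple scenario, and it has its maximum value at $p = \frac{N}{N + M}$.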
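To make the common form concrete, the values of N and M for each of the three terms can be read off the 9-row data table (6 items with Z = A, 3 with Z = B; 48 heads and 12 tails from coin A; 6 heads and 24 tails from coin B). The layout below is an illustrative instantiation of (8), (9) and (10), not something shown on the slides.

```latex
\begin{align*}
L(\theta_a)     &= 6 \log\theta_a + 3 \log(1 - \theta_a)
  & N &= 6,  & M &= 3,  & \tfrac{N}{N+M} &= \tfrac{6}{9} \\
L(\theta_{h|a}) &= 48 \log\theta_{h|a} + 12 \log(1 - \theta_{h|a})
  & N &= 48, & M &= 12, & \tfrac{N}{N+M} &= \tfrac{48}{60} \\
L(\theta_{h|b}) &= 6 \log\theta_{h|b} + 24 \log(1 - \theta_{h|b})
  & N &= 6,  & M &= 24, & \tfrac{N}{N+M} &= \tfrac{6}{30}
\end{align*}
```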

slide-79
SLIDE 79

hence

\[ \frac{\partial L(\theta_a)}{\partial \theta_a} = 0 \implies \theta_a = \frac{\sum_{d:Z=a} 1}{\sum_{d:Z=a} 1 + \sum_{d:Z=b} 1} \]

\[ \frac{\partial L(\theta_{h|a})}{\partial \theta_{h|a}} = 0 \implies \theta_{h|a} = \frac{\sum_{d:Z=a} \#(d,h)}{\sum_{d:Z=a} \#(d,h) + \sum_{d:Z=a} \#(d,t)} \]

\[ \frac{\partial L(\theta_{h|b})}{\partial \theta_{h|b}} = 0 \implies \theta_{h|b} = \frac{\sum_{d:Z=b} \#(d,h)}{\sum_{d:Z=b} \#(d,h) + \sum_{d:Z=b} \#(d,t)} \]

Finally, the denominators of these turn into $D$, $\sum_{d:Z=a} 10$ and $\sum_{d:Z=b} 10$ respectively (every datum has either Z = a or Z = b, and every datum has $\#(d,h) + \#(d,t) = 10$ tosses), and so these are exactly the 'common sense' formulae we started with in (2), (3), (4).