COMP90051 Statistical Machine Learning
Semester 2, 2017. Lecturer: Trevor Cohn
23. PGM Statistical Inference


SLIDE 1

COMP90051 Statistical Machine Learning

  • 23. PGM Statistical Inference

Semester 2, 2017 Lecturer: Trevor Cohn

SLIDE 2

Statistical inference on PGMs

Learning from data – fitting probability tables to observations (e.g. as a frequentist; a Bayesian would just use probabilistic inference to update the prior to a posterior)

SLIDE 3

Where are we?

  • Representation of joint distributions

* PGMs encode conditional independence

  • Independence, d-separation
  • Probabilistic inference

* Computing other distributions from the joint
* Elimination, sampling algorithms

  • Statistical inference

* Learn parameters from data
SLIDE 4

Have PGM, Some observations, No tables…

[Figure: Bayes net over Boolean variables AS_i, HG_i, FA_i, HT_i, FG_i on a plate i=1..n. Three marginal tables (false ?, true ?) and two conditional tables, one indexed by parents (FA, HG) and one by (HT, FG), with every entry unknown (?).]

SLIDE 5

Fully-observed case is “easy”

  • Max-Likelihood Estimator (MLE) says

* If we observe all r.v.'s $\boldsymbol{Y}$ in a PGM, independently, $n$ times as $\boldsymbol{y}_j$
* Then maximise the full joint

$$\operatorname*{arg\,max}_{\theta \in \Theta} \prod_{j=1}^{n} \prod_{k} p\!\left(Y_k = y_{jk} \,\middle|\, Y_{\text{parents}(k)} = y_{j,\text{parents}(k)}\right)$$

  • Decomposes easily, leads to counts-based estimates

* Maximise the log-likelihood instead; the product becomes a sum of logs

$$\operatorname*{arg\,max}_{\theta \in \Theta} \sum_{j=1}^{n} \sum_{k} \log p\!\left(Y_k = y_{jk} \,\middle|\, Y_{\text{parents}(k)} = y_{j,\text{parents}(k)}\right)$$

* The one big maximisation of all parameters together decouples into small independent problems

  • Example is training a naïve Bayes classifier
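To make the counts-based estimation concrete, here is a minimal Python sketch of MLE by counting for a single probability table. The helper name `mle_cpt` and the tiny three-record dataset are illustrative assumptions, not from the lecture:

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Counts-based MLE of one table p(child | parents) from
    fully-observed Boolean records (each a dict of variable values).
    Hypothetical helper for illustration."""
    joint = Counter()   # counts of (parent values..., child value)
    margin = Counter()  # counts of parent values alone
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[pa + (record[child],)] += 1
        margin[pa] += 1
    # Each parent configuration is a small, independent estimation problem
    return {key: joint[key] / margin[key[:-1]] for key in joint}

# A made-up fully-observed sample over the deck's five variables
data = [
    {"AS": True,  "HG": False, "FA": True,  "HT": False, "FG": False},
    {"AS": False, "HG": True,  "FA": True,  "HT": True,  "FG": True},
    {"AS": True,  "HG": True,  "FA": False, "HT": True,  "FG": False},
]
print(mle_cpt(data, "HG", ["HT", "FG"]))  # p(HG | HT, FG) from counts
```

Each estimate is one divide per observed (parents, child) cell, which is exactly the decoupling the slide points to: the log-likelihood splits into a separate counting problem per table and per parent configuration.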


SLIDE 6

Example: Fully-observed case

[Figure: the same Bayes net on plate i=1..n, with its empty probability tables.]

The counts-based estimates are read directly off the data, e.g.

$$\hat{p}(FG=\text{true}) = \frac{\#\{\boldsymbol{y}_j : FG_j=\text{true}\}}{n}, \qquad \hat{p}(FG=\text{false}) = \frac{\#\{\boldsymbol{y}_j : FG_j=\text{false}\}}{n}$$

$$\hat{p}(HG=\text{true} \mid HT=\text{false}, FG=\text{false}) = \frac{\#\{\boldsymbol{y}_j : HG_j=\text{true},\, HT_j=\text{false},\, FG_j=\text{false}\}}{\#\{\boldsymbol{y}_j : HT_j=\text{false},\, FG_j=\text{false}\}}$$
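In code, these estimates are just calls to the counting sketch from the previous slide (names and data again hypothetical):

```python
marginal_fg = mle_cpt(data, "FG", [])       # the #{FG_j = v} / n estimates
cpt_hg = mle_cpt(data, "HG", ["HT", "FG"])  # p(HG | HT, FG) as ratios of counts
```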

SLIDE 7

Presence of unobserved variables is trickier

  • But most PGMs you'll encounter will have latent, or unobserved, variables

  • What happens to the MLE?

* Maximise likelihood of the observed data only
* Marginalise the full joint to get the desired “partial” joint

$$\operatorname*{arg\,max}_{\theta \in \Theta} \prod_{j=1}^{n} \sum_{\text{latent}_j} \prod_{k} p\!\left(Y_k = y_{jk} \,\middle|\, Y_{\text{parents}(k)} = y_{j,\text{parents}(k)}\right)$$

* This won't decouple – oh-no's!!
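To spell out why (a step added here for clarity), take logs as before:

$$\log \prod_{j=1}^{n} \sum_{\text{latent}_j} \prod_{k} p(\cdots) \;=\; \sum_{j=1}^{n} \log \sum_{\text{latent}_j} \prod_{k} p(\cdots)$$

The inner sum over latent configurations blocks the log from reaching the individual factors, so the objective no longer separates into one small counting problem per probability table.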


SLIDE 8

Can we reduce partially-observed to fully?

  • Rough idea

* If we had guesses for the missing variables
* We could employ MLE on the fully-observed data

  • With a bit more thought, we could alternate between

* Updating the missing data
* Updating the probability tables/parameters

  • This is the basis for training PGMs


SLIDE 9

Example: Partially-observed case

[Figure: the Bayes net on plate i=1..n. Across the n instances only AS_i (T,…,F) and FG_i (F,…,T) are observed; HG_i, FA_i, HT_i are missing (?), and every probability-table entry is still unknown (?).]

SLIDE 10

Example: Partially-observed case

[Figure: as on the previous slide, except one table is now filled directly from the data, the “observed marginal”: false 0.9 / true 0.1. All other entries remain unknown (?).]

SLIDE 11

Example: Partially-observed case

[Figure: “Seed”. Every remaining unknown entry is seeded at 0.5: the two unobserved marginals become false 0.5 / true 0.5 and both conditional tables are all 0.5; the observed marginal stays false 0.9 / true 0.1.]

SLIDE 12

Example: Partially-observed case

[Figure: “Missing data as expectation”. The missing values are completed under the current (seed) tables, giving full assignments for all five variables (e.g. HG_i = F,…,F; FA_i = T,…,T; HT_i = F,…,T); the tables themselves are unchanged.]

SLIDE 13

Example: Partially-observed case

[Figure: “MLE on fully-observed”. The tables are re-estimated from the completed data.
Marginals: false 0.7 / true 0.3; false 0.6 / true 0.4; observed marginal unchanged at false 0.9 / true 0.1.
CPT given (FA, HG), columns (f,f) (f,t) (t,f) (t,t): false 0.7 0.3 0.4 0.8; true 0.3 0.7 0.6 0.2.
CPT given (HT, FG), columns (f,f) (f,t) (t,f) (t,t): false 0.7 0.4 0.3 0.6; true 0.3 0.6 0.7 0.4.]
SLIDE 14

Example: Partially-observed case

[Figure: the completed data and re-estimated tables from the previous slide.]

  • Seed
  • Do until “convergence”:

* Fill missing as expectation
* MLE on fully-observed

SLIDE 15

Expectation-Maximisation Algorithm

  • Seed parameters randomly
  • Repeat until convergence:

* E-step: complete the unobserved data, not with point-estimate expectations but with posterior distributions given the current parameters (probabilistic inference)
* M-step: update parameters with MLE on the (now fully-observed) completed data
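To make the E/M alternation concrete, here is a self-contained toy sketch in Python, using a mixture of two biased coins rather than the lecture's Bayes net; the model, names, and data are illustrative assumptions:

```python
import random

def em_two_coins(heads, m, iters=100):
    """Toy EM: each record is the number of heads in m flips of a coin
    drawn (with unknown probability w) from coin 1, else coin 0."""
    # Seed parameters randomly
    w, p0, p1 = random.random(), random.random(), random.random()
    for _ in range(iters):
        # E-step: posterior responsibility that each record came from coin 1
        r = []
        for h in heads:
            l1 = w * p1**h * (1 - p1)**(m - h)
            l0 = (1 - w) * p0**h * (1 - p0)**(m - h)
            r.append(l1 / (l1 + l0))
        # M-step: MLE on the completed data, i.e. counts weighted by posteriors
        w = sum(r) / len(r)
        p1 = sum(ri * h for ri, h in zip(r, heads)) / (m * sum(r))
        p0 = sum((1 - ri) * h for ri, h in zip(r, heads)) / (m * len(r) - m * sum(r))
    return w, p0, p1

print(em_two_coins([9, 8, 2, 1, 9], m=10))
```

Note the E-step keeps the whole posterior r rather than rounding it to 0/1; that is the “posterior distributions, not point estimates” point on the slide.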

SLIDE 16

Déjà vu?

  • K-means clustering

* Randomly assign cluster centres
* Repeat:
  • Assign points to nearest cluster (a hard E-step)
  • Update cluster centres

  • EM learning

* Randomly seed parameters
* Repeat:
  • Expectations for missing variables (a soft E-step)
  • Update parameters via MLE

  • Hard vs soft E-step

* Hard: commit each point to its single nearest cluster
* Soft: assign each point a distribution over clusters (e.g., 10% C1, 20% C2, 70% C3); in general, posteriors for the missing variables given the observed data and current parameters
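The hard/soft contrast in one hypothetical snippet:

```python
# Hypothetical posterior over three clusters for one data point
posterior = {"C1": 0.1, "C2": 0.2, "C3": 0.7}

# Hard E-step (k-means style): commit to the single best cluster
hard_assignment = max(posterior, key=posterior.get)  # "C3"

# Soft E-step (EM style): keep the whole distribution; the M-step
# then uses these fractional weights instead of 0/1 counts
soft_assignment = posterior
```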

SLIDE 17

Summary

  • Statistical inference on PGMs

* What is it and why do we care?
* Straight MLE for fully-observed data
* The EM algorithm for mixed latent/observed data