SLIDE 1

Statistical Learning (part II)

October 28, 2008 CS 486/686 University of Waterloo

SLIDE 2

CS486/686 Lecture Slides (c) 2008 P. Poupart


Outline

  • Learning from incomplete data
    – EM algorithm
  • Reading: R&N Ch. 20.3
SLIDE 3

Incomplete data

  • So far…
    – Values of all attributes are known
    – Learning is relatively easy
  • But many real-world problems have hidden variables (a.k.a. latent variables)
    – Incomplete data
    – Values of some attributes are missing

SLIDE 4

Unsupervised Learning

  • Incomplete data → unsupervised learning
  • Examples:
    – Categorisation of stars by astronomers
    – Categorisation of species by anthropologists
    – Market segmentation for marketing
    – Pattern identification for fraud detection
    – Research in general!

SLIDE 5

Maximum Likelihood Learning

  • ML learning of Bayes net parameters:
    – For θV=true,pa(V)=v = Pr(V=true | pa(V)=v):

      θV=true,pa(V)=v = #[V=true, pa(V)=v] / (#[V=true, pa(V)=v] + #[V=false, pa(V)=v])

    – Assumes all attributes have values…
  • What if the values of some attributes are missing?
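With complete data, this counting rule is a few lines of code. A minimal sketch in Python (the `rain`/`wet` toy data and the function name are illustrative, not from the slides):

```python
from collections import Counter

def ml_estimate(records, child, parent):
    """Maximum-likelihood CPT entry from complete data:
    theta_v = #[child=True, parent=v] / (#[child=True, parent=v] + #[child=False, parent=v])."""
    counts = Counter((r[parent], r[child]) for r in records)
    theta = {}
    for v in {r[parent] for r in records}:
        pos = counts[(v, True)]   # #[child=True, parent=v]
        neg = counts[(v, False)]  # #[child=False, parent=v]
        theta[v] = pos / (pos + neg)
    return theta

# Toy complete data: child variable "wet", single parent "rain"
data = [
    {"rain": True, "wet": True},
    {"rain": True, "wet": True},
    {"rain": True, "wet": False},
    {"rain": False, "wet": False},
    {"rain": False, "wet": True},
]
print(ml_estimate(data, "wet", "rain"))
```

Each CPT entry is just a relative frequency, which is why learning is easy when no values are missing.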

SLIDE 6

“Naive” solutions for incomplete data

  • Solution #1: Ignore records with missing values
    – But what if all records have missing values (i.e., when a variable is hidden, no record has any value for it)?
  • Solution #2: Ignore hidden variables
    – Model may become significantly more complex!

SLIDE 7

Heart disease example

  • (a) With the hidden HeartDisease variable: simpler (i.e., fewer CPT parameters)
  • (b) Without it: complex (i.e., lots of CPT parameters)

  Number of CPT parameters per node (three-valued variables):

  (a) Smoking 2, Diet 2, Exercise 2, HeartDisease 54, Symptom 1: 6, Symptom 2: 6, Symptom 3: 6 — 78 total
  (b) Smoking 2, Diet 2, Exercise 2, Symptom 1: 54, Symptom 2: 162, Symptom 3: 486 — 708 total
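The per-node counts on this slide can be checked mechanically. A sketch assuming three-valued variables and that in (b) each symptom depends on the three root causes plus the earlier symptoms (the structure consistent with the counts 54, 162, 486):

```python
def cpt_params(num_values, parent_cards):
    """Free CPT parameters for one node: (num_values - 1)
    per joint assignment of its parents."""
    size = num_values - 1
    for c in parent_cards:
        size *= c
    return size

k = 3  # each variable takes 3 values

# (a) Symptoms depend on a hidden HeartDisease node
a = (3 * cpt_params(k, [])          # Smoking, Diet, Exercise: root nodes
     + cpt_params(k, [k, k, k])     # HeartDisease | Smoking, Diet, Exercise
     + 3 * cpt_params(k, [k]))      # each Symptom | HeartDisease

# (b) No hidden node: symptoms chain off the roots and each other
b = (3 * cpt_params(k, [])
     + cpt_params(k, [k, k, k])           # Symptom1 | S, D, E
     + cpt_params(k, [k, k, k, k])        # Symptom2 | S, D, E, Symptom1
     + cpt_params(k, [k, k, k, k, k]))    # Symptom3 | S, D, E, Symptom1, Symptom2

print(a, b)  # 78 708
```

Introducing the hidden variable cuts the parameter count by roughly a factor of nine here, which is the point of the example.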

SLIDE 8

“Direct” maximum likelihood

  • Solution #3: maximize the likelihood directly
    – Let Z be hidden and E observable
    – hML = argmaxh P(e|h)
          = argmaxh ΣZ P(e,Z|h)
          = argmaxh ΣZ Πi CPT(Vi)
          = argmaxh log ΣZ Πi CPT(Vi)
    – Problem: can’t push the log past the sum to linearize the product

SLIDE 9

Expectation-Maximization (EM)

  • Solution #4: EM algorithm
    – Intuition: if we knew the missing values, computing hML would be trivial
  • Guess hML
  • Iterate:
    – Expectation: based on hML, compute the expectation of the missing values
    – Maximization: based on the expected missing values, compute a new estimate of hML

SLIDE 10

Expectation-Maximization (EM)

  • More formally:
    – Approximate maximum likelihood
    – Iteratively compute:

      hi+1 = argmaxh ΣZ P(Z|hi,e) log P(e,Z|h)

      (the expectation over Z weighted by P(Z|hi,e) is the Expectation step; the argmaxh is the Maximization step)

SLIDE 11

Expectation-Maximization (EM)

  • Derivation
    – log P(e|h) = log [P(e,Z|h) / P(Z|e,h)]
                 = log P(e,Z|h) – log P(Z|e,h)
                 = ΣZ P(Z|e,h) log P(e,Z|h) – ΣZ P(Z|e,h) log P(Z|e,h)
                 ≥ ΣZ P(Z|e,h) log P(e,Z|h)
      (third line: take the expectation of both sides w.r.t. P(Z|e,h); the inequality holds because –ΣZ P(Z|e,h) log P(Z|e,h) is an entropy and hence non-negative)
  • EM finds a local maximum of ΣZ P(Z|e,h) log P(e,Z|h), which is a lower bound of log P(e|h)

SLIDE 12

Expectation-Maximization (EM)

  • The log inside the sum can linearize the product
    – hi+1 = argmaxh ΣZ P(Z|hi,e) log P(e,Z|h)
           = argmaxh ΣZ P(Z|hi,e) log Πj CPTj
           = argmaxh ΣZ P(Z|hi,e) Σj log CPTj
  • Monotonic improvement of the likelihood
    – P(e|hi+1) ≥ P(e|hi)

SLIDE 13

Candy Example

  • Suppose you buy two bags of candies of unknown type (e.g., flavour ratios)
  • You plan to eat sufficiently many candies of each bag to learn its type
  • Ignoring your plan, your roommate mixes both bags…
  • How can you learn the type of each bag despite the mixing?

SLIDE 14

Candy Example

  • “Bag” variable is hidden
SLIDE 15

Unsupervised Clustering

  • “Class” variable is hidden
  • Naïve Bayes model

  [Figure: (a) naïve Bayes model for the candy problem — hidden Bag node with children Flavor, Wrapper, and Holes, and CPT entries P(Bag=1) and P(F=cherry | B) for bags 1 and 2; (b) the same structure with a generic hidden Class node C and observables X]

SLIDE 16

Candy Example

  • Unknown parameters:
    – θi = P(Bag=i)
    – θFi = P(Flavour=cherry | Bag=i)
    – θWi = P(Wrapper=red | Bag=i)
    – θHi = P(Hole=yes | Bag=i)
  • When eating a candy:
    – F, W and H are observable
    – B is hidden

SLIDE 17

Candy Example

  • Let the true parameters be:
    – θ = 0.5, θF1 = θW1 = θH1 = 0.8, θF2 = θW2 = θH2 = 0.3
  • After eating 1000 candies:

                 W=green          W=red
                 H=0    H=1       H=0    H=1
      F=cherry    90    104        93    273
      F=lime     167     94       100     79

SLIDE 18

Candy Example

  • EM algorithm
  • Guess h0:
    – θ = 0.6, θF1 = θW1 = θH1 = 0.6, θF2 = θW2 = θH2 = 0.4
  • Alternate:
    – Expectation: expected # of candies in each bag
    – Maximization: new parameter estimates

SLIDE 19

Candy Example

  • Expectation: expected # of candies in each bag
    – #[Bag=i] = Σj P(B=i | fj, wj, hj)
    – Compute P(B=i | fj, wj, hj) by variable elimination (or any other inference algorithm)
  • Example:
    – #[Bag=1] = 612
    – #[Bag=2] = 388
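This E-step can be reproduced numerically. The sketch below uses the candy counts and the initial guess h0 from the slides, but computes the posterior P(B=1 | f, w, h) by direct enumeration rather than variable elimination (the two are equivalent on a network this small):

```python
# Candy counts from the slides, indexed by (flavour, wrapper, holes)
counts = {
    ("cherry", "red", 1): 273, ("cherry", "red", 0): 93,
    ("cherry", "green", 1): 104, ("cherry", "green", 0): 90,
    ("lime", "red", 1): 79, ("lime", "red", 0): 100,
    ("lime", "green", 1): 94, ("lime", "green", 0): 167,
}

def posterior_bag1(f, w, h, theta, t):
    """P(B=1 | f, w, h) in the naive Bayes model; t[i] is the shared
    value of theta_Fi = theta_Wi = theta_Hi under the guess h0."""
    def joint(bag):
        prior = theta if bag == 1 else 1 - theta
        pf = t[bag] if f == "cherry" else 1 - t[bag]
        pw = t[bag] if w == "red" else 1 - t[bag]
        ph = t[bag] if h == 1 else 1 - t[bag]
        return prior * pf * pw * ph
    return joint(1) / (joint(1) + joint(2))

# Initial guess h0: theta = 0.6, all conditional parameters 0.6 / 0.4
theta0, t0 = 0.6, {1: 0.6, 2: 0.4}

# #[Bag=1] = sum over candies of P(B=1 | f, w, h)
expected_bag1 = sum(n * posterior_bag1(f, w, h, theta0, t0)
                    for (f, w, h), n in counts.items())
print(round(expected_bag1))  # 612
```

Each candy contributes a fractional count to each bag in proportion to its posterior probability; summing these gives the expected bag sizes 612 and 388.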

SLIDE 20

Candy Example

  • Maximization: relative frequency of each bag
    – θ1 = 612/1000 = 0.612
    – θ2 = 388/1000 = 0.388

SLIDE 21

Candy Example

  • Expectation: expected # of cherry candies in each bag
    – #[B=i, F=cherry] = Σj:fj=cherry P(B=i | fj=cherry, wj, hj)
    – Compute P(B=i | fj=cherry, wj, hj) by variable elimination (or any other inference algorithm)
  • Maximization:
    – θF1 = #[B=1, F=cherry] / #[B=1] = 0.668
    – θF2 = #[B=2, F=cherry] / #[B=2] = 0.389
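Putting the E- and M-steps together gives the full EM loop for this model. A sketch, again computing expected counts by enumeration; the data and initial guess are from the slides, while the function and variable names are illustrative:

```python
import math

# Candy data from the slides: (flavour, wrapper, holes) -> count
counts = {
    ("cherry", "red", 1): 273, ("cherry", "red", 0): 93,
    ("cherry", "green", 1): 104, ("cherry", "green", 0): 90,
    ("lime", "red", 1): 79, ("lime", "red", 0): 100,
    ("lime", "green", 1): 94, ("lime", "green", 0): 167,
}
N = sum(counts.values())

def joint(bag, f, w, h, p):
    """P(B=bag, f, w, h) under parameters p = (theta, tF, tW, tH)."""
    theta, tF, tW, tH = p
    prior = theta if bag == 1 else 1 - theta
    pf = tF[bag] if f == "cherry" else 1 - tF[bag]
    pw = tW[bag] if w == "red" else 1 - tW[bag]
    ph = tH[bag] if h == 1 else 1 - tH[bag]
    return prior * pf * pw * ph

def log_likelihood(p):
    return sum(n * math.log(joint(1, f, w, h, p) + joint(2, f, w, h, p))
               for (f, w, h), n in counts.items())

def em_step(p):
    # E-step: expected counts via the posterior P(B | f, w, h)
    exp_b = {1: 0.0, 2: 0.0}  # expected #[Bag=i]
    exp_f = {1: 0.0, 2: 0.0}  # expected #[Bag=i, F=cherry]; similarly W and H
    exp_w = {1: 0.0, 2: 0.0}
    exp_h = {1: 0.0, 2: 0.0}
    for (f, w, h), n in counts.items():
        j1, j2 = joint(1, f, w, h, p), joint(2, f, w, h, p)
        for bag, post in ((1, j1 / (j1 + j2)), (2, j2 / (j1 + j2))):
            exp_b[bag] += n * post
            if f == "cherry": exp_f[bag] += n * post
            if w == "red": exp_w[bag] += n * post
            if h == 1: exp_h[bag] += n * post
    # M-step: relative frequencies of the expected counts
    theta = exp_b[1] / N
    tF = {b: exp_f[b] / exp_b[b] for b in (1, 2)}
    tW = {b: exp_w[b] / exp_b[b] for b in (1, 2)}
    tH = {b: exp_h[b] / exp_b[b] for b in (1, 2)}
    return (theta, tF, tW, tH)

p0 = (0.6, {1: 0.6, 2: 0.4}, {1: 0.6, 2: 0.4}, {1: 0.6, 2: 0.4})
p1 = em_step(p0)
print(round(p1[0], 3), round(p1[1][1], 3))  # 0.612 0.668

# Iterate further; the log-likelihood never decreases
ll, p = [log_likelihood(p0)], p0
for _ in range(10):
    p = em_step(p)
    ll.append(log_likelihood(p))
```

One iteration reproduces the slide's numbers (θ1 = 0.612, θF1 = 0.668), and the log-likelihood trace illustrates the monotonic improvement P(e|hi+1) ≥ P(e|hi) from the earlier slide.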

SLIDE 22

Candy Example

  [Figure: log-likelihood of the data vs. EM iteration number (x-axis 20–120, y-axis ticks 1975–2025); the log-likelihood increases monotonically across iterations and levels off]

SLIDE 23

Bayesian networks

  • EM algorithm for general Bayes nets
  • Expectation:
    – #[Vi=vij, Pa(Vi)=paik] = expected frequency
  • Maximization:
    – θvij,paik = #[Vi=vij, Pa(Vi)=paik] / #[Pa(Vi)=paik]
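The general M-step is just normalization of the expected counts. A minimal sketch, where the count dictionary and node names are hypothetical and the expected frequencies are assumed to come from some inference routine in the E-step:

```python
from collections import defaultdict

def m_step(expected_counts):
    """Generic M-step: turn expected counts #[Vi=vij, Pa(Vi)=paik]
    into CPT entries theta = #[Vi=vij, paik] / #[paik].
    expected_counts maps (node, value, parent_assignment) -> expected frequency."""
    parent_totals = defaultdict(float)
    for (node, value, pa), c in expected_counts.items():
        parent_totals[(node, pa)] += c  # #[Pa(Vi)=paik]
    return {(node, value, pa): c / parent_totals[(node, pa)]
            for (node, value, pa), c in expected_counts.items()}

# Hypothetical expected counts for one binary node V with one binary parent
ec = {("V", True, ("p=1",)): 30.0, ("V", False, ("p=1",)): 10.0,
      ("V", True, ("p=0",)): 12.0, ("V", False, ("p=0",)): 48.0}
cpt = m_step(ec)
print(cpt[("V", True, ("p=1",))])  # 0.75
```

This is the same relative-frequency rule as in the complete-data case, applied to fractional (expected) counts instead of observed ones.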

SLIDE 24

Next Class

  • Ensemble Learning
  • Reading: Russell and Norvig Sect. 18.4