Statistical Learning (II) [RN2] Sec 20.3 [RN3] Sec 20.3

CS 486/686 University of Waterloo Lecture 18: March 13, 2014

CS486/686 Lecture Slides (c) 2014 P. Poupart

Outline

  • Learning from incomplete data

– EM algorithm


Incomplete data

  • So far…

– Values of all attributes are known
– Learning is relatively easy

  • But many real-world problems have hidden variables (a.k.a. latent variables)

– Incomplete data
– Values of some attributes missing


Unsupervised Learning

  • Incomplete data → unsupervised learning
  • Examples:

– Categorisation of stars by astronomers
– Categorisation of species by anthropologists
– Market segmentation for marketing
– Pattern identification for fraud detection
– Research in general!


Maximum Likelihood Learning

  • ML learning of Bayes net parameters:

– For V=true,pa(V)=v = Pr(V=true|par(V) = v) – V=true,pa(V)=v = – Assumes all attributes have values…

  • What if values of some attributes are

missing?

#[V=true,pa(V)=v] #[V=true,pa(V)=v] + #[V=false,pa(V)=v]
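As a concrete illustration, here is a minimal Python sketch of this complete-data estimate. The record format (a list of dicts mapping variable names to values) and the function name are assumptions made for illustration, not something from the slides.

```python
from collections import defaultdict

def ml_estimate(records, child, parents):
    """ML estimate of theta_{child=true, pa(child)=v} = P(child=true | pa(child)=v)
    from complete records (each record: dict mapping variable name -> value)."""
    num = defaultdict(int)   # #[child=true, pa(child)=v]
    den = defaultdict(int)   # #[pa(child)=v] = #[true, v] + #[false, v]
    for r in records:
        v = tuple(r[p] for p in parents)
        den[v] += 1
        if r[child]:
            num[v] += 1
    return {v: num[v] / den[v] for v in den}

# Toy usage with made-up records:
records = [{"Smoking": True,  "HeartDisease": True},
           {"Smoking": True,  "HeartDisease": False},
           {"Smoking": False, "HeartDisease": False}]
print(ml_estimate(records, "HeartDisease", ["Smoking"]))   # {(True,): 0.5, (False,): 0.0}
```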


“Naive” solutions for incomplete data

  • Solution #1: Ignore records with missing values

– But what if all records are missing values (i.e., when a variable is hidden, none of the records have any value for that variable)?

  • Solution #2: Ignore hidden variables

– Model may become significantly more complex!


Heart disease example

  • (a) simpler (i.e., fewer CPT parameters)
  • (b) complex (i.e., lots of CPT parameters)

[Figure: two Bayes nets over Smoking, Diet, Exercise and Symptoms 1–3. In (a) a hidden HeartDisease variable sits between the lifestyle variables and the symptoms; in (b) HeartDisease is omitted, so the symptoms depend directly on the lifestyle variables (and on the earlier symptoms).]

Number of CPT parameters per node (the sketch below reproduces these counts):

           Smoking  Diet  Exercise  HeartDisease  Symptom1  Symptom2  Symptom3  Total
(a)           2       2      2          54            6         6         6       78
(b)           2       2      2           -           54       162       486      708
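The counts in the table can be reproduced in a few lines of Python. This is a sketch under the assumption (implied by the counts themselves) that every variable has three values; the helper name is made up.

```python
def cpt_params(num_values, parent_cards):
    """Free parameters in one CPT: (num_values - 1) per joint parent assignment."""
    rows = 1
    for c in parent_cards:
        rows *= c
    return (num_values - 1) * rows

# (a): hidden HeartDisease mediates between lifestyle variables and symptoms.
a = (3 * cpt_params(3, [])           # Smoking, Diet, Exercise (roots)
     + cpt_params(3, [3, 3, 3])      # HeartDisease | Smoking, Diet, Exercise -> 54
     + 3 * cpt_params(3, [3]))       # each Symptom | HeartDisease            -> 6 each

# (b): HeartDisease removed, so the symptoms need many more parents.
b = (3 * cpt_params(3, [])
     + cpt_params(3, [3, 3, 3])         # Symptom1 | Smoking, Diet, Exercise          -> 54
     + cpt_params(3, [3, 3, 3, 3])      # Symptom2 | ..., Symptom1                    -> 162
     + cpt_params(3, [3, 3, 3, 3, 3]))  # Symptom3 | ..., Symptom1, Symptom2          -> 486

print(a, b)   # 78 708
```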


“Direct” maximum likelihood

  • Solution #3: maximize the likelihood directly

– Let Z be the hidden variables and E the observable ones
– h_ML = argmax_h P(e|h)
       = argmax_h Σ_Z P(e,Z|h)
       = argmax_h Σ_Z Π_i CPT(V_i)
       = argmax_h log Σ_Z Π_i CPT(V_i)
– Problem: can't push the log past the sum to linearize the product


Expectation-Maximization (EM)

  • Solution #4: EM algorithm

– Intuition: if we knew the missing values, computing h_ML would be trivial

  • Guess h_ML
  • Iterate:

– Expectation: based on h_ML, compute the expectation of the missing values
– Maximization: based on the expected missing values, compute a new estimate of h_ML


Expectation-Maximization (EM)

  • More formally:

– Approximate maximum likelihood
– Iteratively compute:
  h_{i+1} = argmax_h Σ_Z P(Z|h_i,e) log P(e,Z|h)
  (the expectation over Z under P(Z|h_i,e) is the Expectation step; the argmax over h is the Maximization step; a generic code sketch follows below)
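The alternation can be written as a generic loop. The following is a minimal sketch with a hypothetical interface (the callback names and arguments are not from the slides); it simply iterates the update h_{i+1} = argmax_h Σ_Z P(Z|h_i,e) log P(e,Z|h) while monitoring log P(e|h).

```python
import math

def em(h0, posterior, argmax_expected_ll, log_joint, zs, iters=50, tol=1e-6):
    """Generic EM loop (hypothetical interface).
    posterior(h)          -> {z: P(z | e, h)} over the completions in zs
    argmax_expected_ll(q) -> the h maximizing sum_z q[z] * log P(e, z | h)
    log_joint(z, h)       -> log P(e, z | h), used only to monitor log P(e | h)."""
    h, prev_ll = h0, float("-inf")
    for _ in range(iters):
        q = posterior(h)                               # E-step: weights P(Z | e, h_i)
        h = argmax_expected_ll(q)                      # M-step: h_{i+1}
        ll = math.log(sum(math.exp(log_joint(z, h)) for z in zs))   # log P(e | h)
        if ll - prev_ll < tol:                         # improvement is monotone, so this
            break                                      # doubles as a convergence test
        prev_ll = ll
    return h
```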


Expectation-Maximization (EM)

  • Derivation

– log P(e|h) = log [ P(e,Z|h) / P(Z|e,h) ]
             = log P(e,Z|h) − log P(Z|e,h)   (true for any value of Z)
             = Σ_Z P(Z|e,h) log P(e,Z|h) − Σ_Z P(Z|e,h) log P(Z|e,h)   (taking the expectation of both sides under P(Z|e,h); the left-hand side does not depend on Z)
             ≥ Σ_Z P(Z|e,h) log P(e,Z|h)   (the dropped term is an entropy, hence non-negative)

  • EM finds a local maximum of Σ_Z P(Z|e,h) log P(e,Z|h), which is a lower bound of log P(e|h) (a quick numeric check follows below)
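A quick numeric check of the bound, using made-up values of P(e, Z|h) for a binary hidden variable:

```python
import math

p_ez = {0: 0.12, 1: 0.28}                     # made-up values of P(e, Z=z | h)
p_e = sum(p_ez.values())                      # P(e | h)
post = {z: p / p_e for z, p in p_ez.items()}  # P(Z=z | e, h)

bound = sum(post[z] * math.log(p_ez[z]) for z in p_ez)
print(math.log(p_e), bound)   # ~ -0.92 vs ~ -1.53: the bound sits below log P(e|h)
```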


Expectation-Maximization (EM)

  • With the log inside the sum, the product can be linearized:

– h_{i+1} = argmax_h Σ_Z P(Z|h_i,e) log P(e,Z|h)
          = argmax_h Σ_Z P(Z|h_i,e) log Π_j CPT_j
          = argmax_h Σ_Z P(Z|h_i,e) Σ_j log CPT_j

  • Monotonic improvement of the likelihood:

– P(e|h_{i+1}) ≥ P(e|h_i)


Expectation-Maximization (EM)

  • Objective: max_h Σ_Z P(Z|e,h) log P(e,Z|h)
  • Iterative approach:

h_{i+1} = argmax_h Σ_Z P(Z|e,h_i) log P(e,Z|h)

  • Convergence guaranteed:

h_∞ = argmax_h Σ_Z P(Z|e,h) log P(e,Z|h)

  • Monotonic improvement of likelihood:

P(e|h_{i+1}) ≥ P(e|h_i)


Optimization Step

  • For one data point e:

h_{i+1} = argmax_h Σ_Z P(Z|h_i,e) log P(e,Z|h)

  • For multiple data points:

h_{i+1} = argmax_h Σ_e n_e Σ_Z P(Z|h_i,e) log P(e,Z|h)
where n_e is the frequency of e in the dataset

  • Compare to ML for complete data:

h* = argmax_h Σ_d n_d log P(d|h)


Optimization Solution

  • Since d = ⟨z, e⟩ (a complete data point pairs hidden values z with evidence e)
  • Let n_d = n_e P(z|h_i,e), i.e., the expected frequency of the completed point
  • As in the complete-data case, the optimal parameters are obtained by setting the derivative to 0, which yields relative expected frequencies
  • E.g. θ_{V,pa(V)} = P(V|pa(V)) = n_{V,pa(V)} / n_{pa(V)} (a code sketch of this M-step follows below)


Candy Example

  • Suppose you buy two bags of candies of unknown type (e.g. flavour ratios)
  • You plan to eat sufficiently many candies of each bag to learn their type
  • Ignoring your plan, your roommate mixes both bags…
  • How can you learn the type of each bag despite the candies being mixed?


Candy Example

  • “Bag” variable is hidden


Unsupervised Clustering

  • “Class” variable is hidden
  • Naïve Bayes model

[Figure: (a) a generic naive Bayes model with hidden class C and observed attributes X; (b) the candy network, where the hidden Bag variable is the single parent of the observed Flavor, Wrapper and Holes variables, with CPT entries such as P(Bag=1) and P(F=cherry | B).]


Candy Example

  • Unknown Parameters:

– θ_i = P(Bag=i)
– θ_Fi = P(Flavour=cherry | Bag=i)
– θ_Wi = P(Wrapper=red | Bag=i)
– θ_Hi = P(Hole=yes | Bag=i)

  • When eating a candy:

– F, W and H are observable
– B is hidden


Candy Example

  • Let the true parameters be:

– θ = 0.5, θ_F1 = θ_W1 = θ_H1 = 0.8, θ_F2 = θ_W2 = θ_H2 = 0.3

  • After eating 1000 candies:

              W=red           W=green
              H=1     H=0     H=1     H=0
F=cherry      273     93      104     90
F=lime        79      100     94      167


Candy Example

  • EM algorithm
  • Guess h_0:

– θ = 0.6, θ_F1 = θ_W1 = θ_H1 = 0.6, θ_F2 = θ_W2 = θ_H2 = 0.4

  • Alternate:

– Expectation: expected # of candies in each bag
– Maximization: new parameter estimates


Candy Example

  • Expectation: expected # of candies in each bag

– #[Bag=i] = Σ_j P(B=i | f_j, w_j, h_j)
– Compute P(B=i | f_j, w_j, h_j) by variable elimination (or any other inference alg.)

  • Example (see the sketch below):

– #[Bag=1] = 612
– #[Bag=2] = 388
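These expected counts can be reproduced directly from the data table and h_0: for each of the eight observation patterns, compute the posterior P(B=i | f, w, h) by Bayes' rule (inference is trivial here, since Bag is the only hidden variable) and weight the pattern's count by it. A minimal sketch:

```python
# Counts from the table of 1000 candies, keyed by (cherry?, red wrapper?, holes?).
counts = {(1,1,1): 273, (1,1,0): 93, (1,0,1): 104, (1,0,0): 90,
          (0,1,1): 79,  (0,1,0): 100, (0,0,1): 94,  (0,0,0): 167}

# Initial guess h_0: theta = 0.6, theta_F1 = theta_W1 = theta_H1 = 0.6, the rest 0.4.
theta, tF, tW, tH = 0.6, (0.6, 0.4), (0.6, 0.4), (0.6, 0.4)

def joint(bag, f, w, h):
    """P(Bag=bag+1, F=f, W=w, H=h) under the current parameters."""
    p = theta if bag == 0 else 1 - theta
    for t, x in ((tF, f), (tW, w), (tH, h)):
        p *= t[bag] if x else 1 - t[bag]
    return p

expected = [0.0, 0.0]                       # expected #[Bag=1], #[Bag=2]
for obs, n in counts.items():
    pz = [joint(0, *obs), joint(1, *obs)]
    for i in range(2):
        expected[i] += n * pz[i] / sum(pz)  # n * P(B=i+1 | f, w, h)
print(expected)                             # roughly [612.4, 387.6]
```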


Candy Example

  • Maximization: relative frequency of each bag

– θ_1 = 612/1000 = 0.612
– θ_2 = 388/1000 = 0.388


Candy Example

  • Expectation: expected # of cherry candies in each bag

– #[B=i, F=cherry] = Σ_j P(B=i | f_j=cherry, w_j, h_j), where the sum runs over the candies j whose flavour is cherry
– Compute P(B=i | f_j=cherry, w_j, h_j) by variable elimination (or any other inference alg.)

  • Maximization (continued in the sketch below):

– θ_F1 = #[B=1, F=cherry] / #[B=1] = 0.668
– θ_F2 = #[B=2, F=cherry] / #[B=2] = 0.389
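Continuing the expected-count sketch above (it reuses counts, joint, and the expected bag counts stored in expected), the cherry counts and the flavour update follow the same pattern:

```python
exp_cherry = [0.0, 0.0]                      # expected #[B=i, F=cherry]
for (f, w, h), n in counts.items():
    if not f:
        continue                             # only cherry candies contribute
    pz = [joint(0, f, w, h), joint(1, f, w, h)]
    for i in range(2):
        exp_cherry[i] += n * pz[i] / sum(pz)

theta_F_new = [exp_cherry[i] / expected[i] for i in range(2)]
print(theta_F_new)                           # roughly [0.668, 0.389]
```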


Candy Example

[Figure: log-likelihood of the 1000 candies plotted against the EM iteration number. The log-likelihood axis spans roughly −2025 to −1975 and the iteration axis runs to about 120; the log-likelihood increases across iterations.]
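The curve can be reproduced by running the full EM loop on the candy data and printing log P(data | h_i) at every iteration. A self-contained sketch of exact EM for this naive Bayes model (the parameter layout is my own choice, not from the slides):

```python
import math

# Counts from the table of 1000 candies, keyed by (cherry?, red wrapper?, holes?).
counts = {(1,1,1): 273, (1,1,0): 93, (1,0,1): 104, (1,0,0): 90,
          (0,1,1): 79,  (0,1,0): 100, (0,0,1): 94,  (0,0,0): 167}

# Parameters: "pi" = P(Bag=1); "F", "W", "H" hold [P(attr=true | Bag=1), P(attr=true | Bag=2)].
params = {"pi": 0.6, "F": [0.6, 0.4], "W": [0.6, 0.4], "H": [0.6, 0.4]}

def joint(p, bag, f, w, h):
    """P(Bag=bag+1, F=f, W=w, H=h) under parameters p."""
    pr = p["pi"] if bag == 0 else 1 - p["pi"]
    for name, x in (("F", f), ("W", w), ("H", h)):
        t = p[name][bag]
        pr *= t if x else 1 - t
    return pr

for it in range(20):
    # Log-likelihood of the whole dataset under the current parameters.
    ll = sum(n * math.log(joint(params, 0, *c) + joint(params, 1, *c))
             for c, n in counts.items())
    print(it, round(ll, 2))            # should never decrease from one line to the next

    # E-step: expected counts for each bag and each (bag, attribute=true) pair.
    n_bag = [0.0, 0.0]
    n_true = {"F": [0.0, 0.0], "W": [0.0, 0.0], "H": [0.0, 0.0]}
    for (f, w, h), n in counts.items():
        pz = [joint(params, 0, f, w, h), joint(params, 1, f, w, h)]
        tot = sum(pz)
        for i in range(2):
            r = n * pz[i] / tot
            n_bag[i] += r
            for name, x in (("F", f), ("W", w), ("H", h)):
                if x:
                    n_true[name][i] += r

    # M-step: relative expected frequencies.
    params = {"pi": n_bag[0] / sum(n_bag),
              **{name: [n_true[name][i] / n_bag[i] for i in range(2)]
                 for name in ("F", "W", "H")}}
```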


Bayesian networks

  • EM algorithm for general Bayes nets
  • Expectation:

– #[V_i = v_ij, Pa(V_i) = pa_ik] = expected frequency

  • Maximization:

– θ_{v_ij, pa_ik} = #[V_i = v_ij, Pa(V_i) = pa_ik] / #[Pa(V_i) = pa_ik]  (see the sketch below)
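For a small discrete network this update can be implemented by brute-force enumeration of the hidden variables (a real implementation would use variable elimination for the E-step instead). The interface below, including the joint_prob callback, is a hypothetical sketch.

```python
from collections import defaultdict
from itertools import product

def em_step(records, hidden_vars, families, joint_prob):
    """One EM iteration for a discrete Bayes net (brute-force E-step by enumeration).
    records:     list of dicts with observed values (hidden variables absent)
    hidden_vars: dict name -> list of possible values
    families:    dict child -> tuple of parent names
    joint_prob:  function(full_assignment) -> joint probability under the current CPTs"""
    num = defaultdict(float)   # expected #[V_i = v, Pa(V_i) = pa]
    den = defaultdict(float)   # expected #[Pa(V_i) = pa]
    names = list(hidden_vars)
    for e in records:
        # E-step: posterior over joint completions z of the hidden variables.
        completions = []
        for vals in product(*(hidden_vars[n] for n in names)):
            full = {**e, **dict(zip(names, vals))}
            completions.append((full, joint_prob(full)))
        norm = sum(w for _, w in completions)
        for full, w in completions:
            w /= norm                      # P(z | e, h_i)
            for child, parents in families.items():
                pa = tuple(full[p] for p in parents)
                den[(child, pa)] += w
                num[(child, pa, full[child])] += w
    # M-step: relative expected frequencies for every CPT entry.
    return {(c, pa, v): num[(c, pa, v)] / den[(c, pa)]
            for (c, pa, v) in num}
```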