

SLIDE 1

Lecture 13

Statistical Learning

Marco Chiarandini

Department of Mathematics & Computer Science, University of Southern Denmark

Slides by Stuart Russell and Peter Norvig

SLIDE 2

Course Overview

✔ Introduction
  ✔ Artificial Intelligence
  ✔ Intelligent Agents

✔ Search
  ✔ Uninformed Search
  ✔ Heuristic Search

✔ Adversarial Search
  ✔ Minimax search
  ✔ Alpha-beta pruning

✔ Knowledge representation and Reasoning
  ✔ Propositional logic
  ✔ First order logic
  ✔ Inference

✔ Uncertain knowledge and Reasoning
  ✔ Probability and Bayesian approach
  ✔ Bayesian Networks
  ✔ Hidden Markov Chains
  ✔ Kalman Filters

✔ Learning
  ✔ Decision Trees
  • Maximum Likelihood
  • EM Algorithm
  • Learning Bayesian Networks
  • Neural Networks
  ✘ Support vector machines

SLIDE 3

Last Time

Decision Trees for classification

  • entropy, information measure

Performance evaluation

  • overfitting
  • cross validation
  • peeking
  • pruning

Extensions

  • Ensemble learning
  • boosting
  • bagging

SLIDE 4

Outline

♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
  – ML parameter learning with complete data
  – linear regression

SLIDE 5

Full Bayesian learning

View learning as Bayesian updating of a probability distribution over the hypothesis space.

H is the hypothesis variable, with values h_1, h_2, . . . and prior P(H).
d_j gives the outcome of the random variable D_j (the j-th observation); the training data are d = d_1, . . . , d_N.

Given the data so far, each hypothesis has a posterior probability:

    P(h_i | d) = α P(d | h_i) P(h_i)

where P(d | h_i) is called the likelihood.

Predictions use a likelihood-weighted average over the hypotheses:

    P(X | d) = Σ_i P(X | d, h_i) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)

No need to pick one best-guess hypothesis!

SLIDE 6

Example

Suppose there are five kinds of bags of candies:

  10% are h1: 100% cherry candies
  20% are h2: 75% cherry candies + 25% lime candies
  40% are h3: 50% cherry candies + 50% lime candies
  20% are h4: 25% cherry candies + 75% lime candies
  10% are h5: 100% lime candies

Then we observe candies drawn from some bag.
What kind of bag is it? What flavour will the next candy be?
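As a concrete illustration, here is a minimal Python sketch of this Bayesian updating for the candy bags. The priors and per-candy lime probabilities come from the five hypotheses above; the all-lime observation sequence is an assumption chosen to match the plots on the next two slides.

```python
# Full Bayesian learning for the candy-bag example.
# Priors P(h_i) and per-candy lime probabilities P(lime | h_i) from the slide.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.50, 0.75, 1.0]  # P(candy is lime | h_i)

def update(posterior, observation):
    """One update step: P(h_i | d) = alpha * P(d_j | h_i) * P(h_i | d_1..d_{j-1})."""
    lik = [p if observation == "lime" else 1.0 - p for p in p_lime]
    unnorm = [l * q for l, q in zip(lik, posterior)]
    alpha = 1.0 / sum(unnorm)          # normalizing constant
    return [alpha * u for u in unnorm]

posterior = list(priors)
for candy in ["lime"] * 10:            # assumed data: ten limes in a row
    posterior = update(posterior, candy)
    # Prediction by likelihood-weighted averaging over all hypotheses:
    p_next_lime = sum(p * q for p, q in zip(p_lime, posterior))
    print([round(q, 3) for q in posterior], round(p_next_lime, 3))
```

Running this reproduces the behaviour plotted on the next two slides: the posterior mass shifts toward h5 and the predicted lime probability rises toward 1.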

SLIDE 7

Posterior probability of hypotheses

[Figure: posterior probabilities P(h1 | d), . . . , P(h5 | d) as a function of the number of samples in d]

SLIDE 8

Prediction probability

[Figure: P(next candy is lime | d) as a function of the number of samples in d]

SLIDE 9

MAP approximation

Summing over the hypothesis space is often intractable (e.g., there are 2^(2^6) = 18,446,744,073,709,551,616 Boolean functions of 6 attributes).

Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_i | d), i.e., maximize P(d | h_i) P(h_i), or equivalently log P(d | h_i) + log P(h_i).

The log terms can be viewed as (the negative of) the number of bits to encode the data given the hypothesis plus the bits to encode the hypothesis. This is the basic idea of minimum description length (MDL) learning.

For deterministic hypotheses, P(d | h_i) is 1 if consistent, 0 otherwise
  ⇒ MAP = simplest consistent hypothesis
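A small sketch of MAP selection for the same candy setup; it reuses the priors and p_lime lists from the earlier sketch, and the three-lime data set is an assumption for illustration:

```python
import math

# MAP learning: choose the hypothesis maximizing log P(d | h_i) + log P(h_i).
def log_posterior(i, data):
    total = math.log(priors[i])
    for candy in data:
        p = p_lime[i] if candy == "lime" else 1.0 - p_lime[i]
        if p == 0.0:
            return float("-inf")       # hypothesis inconsistent with the data
        total += math.log(p)
    return total

data = ["lime"] * 3                    # assumed observations
h_map = max(range(len(priors)), key=lambda i: log_posterior(i, data))
print("h_MAP = h%d" % (h_map + 1))     # h5 after three limes
```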

SLIDE 10

ML approximation

For large data sets, the prior becomes irrelevant.

Maximum likelihood (ML) learning: choose h_ML maximizing P(d | h_i), i.e., simply get the best fit to the data. This is identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity).

ML is the “standard” (non-Bayesian) statistical learning method.
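In code, ML selection is the MAP sketch above with the log-prior term dropped; a minimal variant:

```python
# ML learning: as the MAP sketch, but without the log-prior term.
def log_likelihood(i, data):
    total = 0.0
    for candy in data:
        p = p_lime[i] if candy == "lime" else 1.0 - p_lime[i]
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

h_ml = max(range(len(p_lime)), key=lambda i: log_likelihood(i, data))
```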

SLIDE 11

ML parameter learning in Bayes nets

Bag from a new manufacturer; what is the fraction θ of cherry candies?

Any θ is possible: a continuum of hypotheses h_θ. θ is a parameter for this simple (binomial) family of models.

[Figure: one-node Bayes net with node Flavor and parameter P(F = cherry) = θ]

Suppose we unwrap N candies, getting c cherries and ℓ = N − c limes. These are i.i.d. (independent, identically distributed) observations, so

    P(d | h_θ) = ∏_{j=1}^{N} P(d_j | h_θ) = θ^c · (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:

    L(d | h_θ) = log P(d | h_θ) = Σ_{j=1}^{N} log P(d_j | h_θ) = c log θ + ℓ log(1 − θ)

    dL(d | h_θ)/dθ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!
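A minimal sketch of the resulting estimator. The Laplace (add-one) correction is a standard fix for the 0-count problem, not something the slide prescribes:

```python
def ml_theta(candies):
    """ML estimate: theta = c / N, the observed fraction of cherry candies."""
    c = sum(1 for x in candies if x == "cherry")
    return c / len(candies)

def laplace_theta(candies):
    """Add-one (Laplace) smoothing, a common fix for the 0-count problem:
    never returns exactly 0 or 1."""
    c = sum(1 for x in candies if x == "cherry")
    return (c + 1) / (len(candies) + 2)

print(ml_theta(["cherry", "cherry", "lime"]))       # 0.666...
print(laplace_theta(["cherry", "cherry", "lime"]))  # 0.6
```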

SLIDE 12

Multiple parameters

[Figure: two-node Bayes net, Flavor → Wrapper, with parameters P(F = cherry) = θ, P(W = red | F = cherry) = θ1, P(W = red | F = lime) = θ2]

The red/green wrapper depends probabilistically on the flavor.

Likelihood for, e.g., a cherry candy in a green wrapper:

    P(F = cherry, W = green | h_{θ,θ1,θ2})
        = P(F = cherry | h_{θ,θ1,θ2}) · P(W = green | F = cherry, h_{θ,θ1,θ2})
        = θ · (1 − θ1)

With N candies, of which rc are red-wrapped cherries, gc green-wrapped cherries, rℓ red-wrapped limes, and gℓ green-wrapped limes:

    P(d | h_{θ,θ1,θ2}) = θ^c (1 − θ)^ℓ · θ1^{rc} (1 − θ1)^{gc} · θ2^{rℓ} (1 − θ2)^{gℓ}

    L = [c log θ + ℓ log(1 − θ)]
      + [rc log θ1 + gc log(1 − θ1)]
      + [rℓ log θ2 + gℓ log(1 − θ2)]

SLIDE 13

Multiple parameters contd.

Derivatives of L contain only the relevant parameter:

    ∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0     ⇒   θ  = c/(c + ℓ)
    ∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0   ⇒   θ1 = rc/(rc + gc)
    ∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0   ⇒   θ2 = rℓ/(rℓ + gℓ)

With complete data, parameters can be learned separately.
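A minimal sketch of these estimators, assuming candies are given as (flavor, wrapper) pairs (the pair encoding is an assumption for illustration):

```python
def ml_params(candies):
    """Each parameter is estimated separately from its own counts,
    which is valid because the data are complete."""
    c  = sum(1 for f, w in candies if f == "cherry")
    l  = len(candies) - c
    rc = sum(1 for f, w in candies if f == "cherry" and w == "red")
    rl = sum(1 for f, w in candies if f == "lime" and w == "red")
    theta  = c / (c + l)   # theta  = c / (c + l)
    theta1 = rc / c        # theta1 = rc / (rc + gc), since rc + gc = c
    theta2 = rl / l        # theta2 = rl / (rl + gl), since rl + gl = l
    return theta, theta1, theta2

sample = [("cherry", "red"), ("cherry", "green"), ("lime", "green"), ("lime", "red")]
print(ml_params(sample))   # (0.5, 0.5, 0.5)
```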

SLIDE 14

Example: linear Gaussian model

[Figure: the density P(y | x) for the linear Gaussian model, and data points with the best linear fit]

Maximizing

    P(y | x) = (1 / (√(2π) σ)) · e^{−(y − (θ1x + θ2))² / (2σ²)}

w.r.t. θ1, θ2 is the same as minimizing

    E = Σ_{j=1}^{N} (yj − (θ1xj + θ2))²

That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance.
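A minimal closed-form least-squares sketch for one feature (function and variable names are illustrative):

```python
def fit_line(xs, ys):
    """Closed-form least squares for y ~ theta1*x + theta2, the ML fit under
    fixed-variance Gaussian noise: theta1 = cov(x, y) / var(x),
    theta2 = mean(y) - theta1 * mean(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    theta1 = cov / var
    theta2 = my - theta1 * mx
    return theta1, theta2

print(fit_line([0.0, 0.5, 1.0], [0.1, 0.6, 0.9]))  # about (0.8, 0.133)
```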

SLIDE 15

Summary

Full Bayesian learning gives the best possible predictions but is intractable.
MAP learning balances complexity with accuracy on the training data.
Maximum likelihood assumes a uniform prior; OK for large data sets.

1. Choose a parameterized family of models to describe the data
   (requires substantial insight and sometimes new models)
2. Write down the likelihood of the data as a function of the parameters
   (may require summing over hidden variables, i.e., inference)
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
   (may be hard/impossible; modern optimization techniques help)
