Probabilistic & Unsupervised Learning: Model selection, Hyperparameter optimisation, and Gaussian Processes

Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept of Computer Science, University College London
Learning model structure

◮ How many clusters in the data?
◮ How smooth should the function be?
◮ Is this input relevant to predicting that output?
◮ What is the order of a dynamical system?
◮ How many states in a hidden Markov model?
◮ How many auditory sources in the input?
◮ What structure underlies a sequence such as SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNG?
Model selection

Models (labelled by m) have parameters θm that specify the probability of the data: P(D|θm, m). If the model is known, learning θm means finding a posterior or a point estimate (ML, MAP, …). What if we need to learn the model too?

◮ We could combine the models into a single “supermodel”, with composite parameter (m, θm).
◮ ML learning will overfit: it favours the most flexible (nested) model with the most parameters, even if the data actually come from a simpler one.
◮ A density function on the composite parameter space (a union of manifolds of different dimensionalities) is difficult to define ⇒ MAP learning is ill-posed.
◮ The joint posterior is difficult to compute, since the dimension of the composite parameter varies [but Monte-Carlo methods may sample from such a posterior].

⇒ Separate model selection step:
P(θm, m|D) = P(θm|m, D) · P(m|D)
where P(θm|m, D) is the model-specific posterior and P(m|D) performs model selection.
Model complexity and overfitting: a simple example
[Figure: polynomial fits of order M = 0, 1, …, 7 to the same data set; higher orders fit the training points ever more closely.]
Model selection

Given models labelled by m with parameters θm, identify the “correct” model for data D. ML/MAP has no good answer: P(D|θm^ML, m) is always larger for more complex (nested) models.

Neyman-Pearson hypothesis testing
◮ For nested models. Starting with the simplest model (m = 1), compare (e.g. by likelihood-ratio test) the null hypothesis m to the alternative m + 1. Continue until m + 1 is rejected.
◮ Usually only valid asymptotically in the number of data.
◮ Conservative (N-P hypothesis tests are asymmetric).

Likelihood validation
◮ Partition the data into disjoint training and validation sets, D = Dtr ∪ Dvld. Choose the model with greatest P(Dvld|θm^ML, m), where θm^ML = argmax_θ P(Dtr|θ, m). [Or, better, greatest P(Dvld|Dtr, m).]
◮ May be biased towards simpler models; often high-variance.
◮ Cross-validation uses multiple partitions and averages the likelihoods (a code sketch follows below).

Bayesian model selection
◮ Choose the most likely model: argmax_m P(m|D).
◮ Principled from a probabilistic viewpoint (if the true model is in the set being considered), but sensitive to the assumed priors etc.
◮ Can use posterior probabilities to weight models for combined predictions (no need to select at all).
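As an illustration of the likelihood-validation approach above, here is a minimal sketch, assuming polynomial regression models of order m with Gaussian noise (all function and variable names are illustrative):

```python
import numpy as np

# Minimal sketch of likelihood validation for polynomial regression of
# order m, assuming Gaussian observation noise with ML variance.
def val_log_lik(x_tr, y_tr, x_vl, y_vl, m):
    Phi_tr, Phi_vl = np.vander(x_tr, m + 1), np.vander(x_vl, m + 1)
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)   # theta_ML on D_tr
    sigma2 = np.mean((y_tr - Phi_tr @ w) ** 2)          # ML noise variance
    resid = y_vl - Phi_vl @ w
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

# Choose the model with the greatest validation likelihood, e.g.:
# m_best = max(range(8), key=lambda m: val_log_lik(x_tr, y_tr, x_vl, y_vl, m))
```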
Bayesian model selection: some terminology
A model class m is a set of distributions parameterised by θm, e.g. the set of all possible mixtures of m Gaussians. The model implies both a prior over the parameters, P(θm|m), and a likelihood of the data given the parameters (which might require integrating out latent variables), P(D|θm, m).

The posterior distribution over parameters is
P(θm|D, m) = P(D|θm, m) P(θm|m) / P(D|m)

The marginal probability of the data under model class m is
P(D|m) = ∫_Θm P(D|θm, m) P(θm|m) dθm
(also called the Bayesian evidence for model m). The ratio of two marginal probabilities (or sometimes its log) is known as the Bayes factor:
P(D|m) / P(D|m′) = [P(m|D) / P(m′|D)] · [P(m′) / P(m)]
The Bayesian Occam’s razor
Occam’s Razor is a principle of scientific philosophy: of two explanations adequate to explain the same set of observations, the simpler should always be preferred. Bayesian inference formalises and automatically implements a form of Occam’s Razor.

Compare model classes m using their posterior probability given the data:
P(m|D) = P(D|m) P(m) / P(D),   with   P(D|m) = ∫_Θm P(D|θm, m) P(θm|m) dθm

P(D|m) is the probability that randomly selected parameter values from the model class would generate data set D. Model classes that are too simple are unlikely to generate the observed data set. Model classes that are too complex can generate many possible data sets, so again they are unlikely to generate that particular data set at random. Like Goldilocks, we favour a model that is just right.

[Figure: P(D|m) plotted over possible data sets D for model classes of different complexity; the observed data set D0 is most probable under a model class of intermediate complexity.]
Bayesian model comparison: Occam’s razor at work
[Figure: the polynomial fits for M = 0, …, 7 as before, together with the model evidence P(Y|M) for each order; the evidence selects an intermediate order.]
Conjugate-exponential families (recap)

Can we compute P(D|m)? … Sometimes.

Suppose P(D|θm, m) is a member of the exponential family:
P(D|θm, m) = ∏_{i=1}^N P(xi|θm, m) = ∏_{i=1}^N e^{s(xi)ᵀθm − A(θm)}

If our prior on θm is conjugate:
P(θm|m) = e^{spᵀθm − np A(θm)} / Z(sp, np)

then the joint is in the same family:
P(D, θm|m) = e^{(∑i s(xi) + sp)ᵀθm − (N + np)A(θm)} / Z(sp, np)

and so:
P(D|m) = ∫ dθm P(D, θm|m) = Z(∑i s(xi) + sp, N + np) / Z(sp, np)

But this is a special case. In general, we need to approximate…
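For a concrete conjugate case, a Beta-Bernoulli model gives the evidence in closed form as a ratio of normalisers; a minimal sketch, where the Beta function plays the role of Z:

```python
import numpy as np
from scipy.special import betaln

# Closed-form log evidence for a Bernoulli likelihood with a Beta(a, b)
# conjugate prior: log P(D|m) = log B(a + n1, b + n0) - log B(a, b).
def log_evidence_bernoulli(x, a=1.0, b=1.0):
    n1 = np.sum(x)             # sufficient statistic: number of ones
    n0 = len(x) - n1
    return betaln(a + n1, b + n0) - betaln(a, b)
```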
Practical Bayesian approaches

◮ Laplace approximation: approximate the posterior by a Gaussian centred at the maximum a posteriori parameter estimate.
◮ Bayesian Information Criterion (BIC): an asymptotic (N → ∞) approximation.
◮ Variational Bayes: a lower bound on the marginal probability. A biased estimate, but easy and fast, and often better than Laplace or BIC.
◮ Monte Carlo methods:
  ◮ (Annealed) importance sampling: estimate the evidence using samples θ(i) drawn from an arbitrary density f(θ):
    (1/S) ∑_i P(D|θ(i), m) P(θ(i)|m) / f(θ(i)) → ∫ dθ f(θ) [P(D, θ|m) / f(θ)] = P(D|m)
  ◮ “Reversible-jump” Markov chain Monte Carlo: sample from the posterior on the composite (m, θm); the number of samples for each m is ∝ P(m|D).
  ◮ Both are exact in the limit of infinite samples, but may have high variance with finite samples.

This is not an exhaustive list (Bethe approximations, expectation propagation, …). We will discuss Laplace and BIC now, leaving the rest for the second half of the course.
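A minimal sketch of the importance-sampling estimator on a toy 1-D Gaussian model (the proposal f and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(1.0, 1.0, size=20)           # toy data, N(theta, 1) likelihood

def log_joint(theta):                       # log P(D, theta | m), N(0,1) prior
    log_prior = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)
    log_lik = np.sum(-0.5 * (D[:, None] - theta)**2
                     - 0.5 * np.log(2 * np.pi), axis=0)
    return log_prior + log_lik

S, mu, s = 100_000, np.mean(D), 0.5         # proposal f = N(mu, s^2)
theta = rng.normal(mu, s, size=S)
log_f = -0.5 * ((theta - mu) / s)**2 - np.log(s * np.sqrt(2 * np.pi))
log_w = log_joint(theta) - log_f            # log importance weights
log_Z = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
```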
Laplace approximation

We want to find P(D|m) = ∫ P(D, θm|m) dθm.

As the data size N grows (relative to the parameter count d), θm becomes more constrained ⇒ P(D, θm|m) ∝ P(θm|D, m) becomes concentrated on the posterior mode θm∗.

Idea: approximate log P(D, θm|m) to second order around θm∗:

∫ P(D, θm|m) dθm = ∫ e^{log P(D, θm|m)} dθm
  = ∫ e^{log P(D, θm∗|m) + ∇log P(D, θm∗|m)·(θm − θm∗) + ½(θm − θm∗)ᵀ ∇²log P(D, θm∗|m) (θm − θm∗)} dθm

At the mode the gradient term vanishes (∇log P(D, θm∗|m) = 0), and writing ∇²log P(D, θm∗|m) = −A:

  = ∫ P(D, θm∗|m) e^{−½(θm − θm∗)ᵀ A (θm − θm∗)} dθm
  = P(D|θm∗, m) P(θm∗|m) (2π)^{d/2} |A|^{−1/2}

A = −∇² log P(D, θm∗|m) is the negative Hessian of log P(D, θ|m) evaluated at θm∗.
This is equivalent to approximating the posterior by a Gaussian: an approximation which is asymptotically correct.
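A minimal numerical sketch of the Laplace evidence approximation on a toy 1-D model, using a finite-difference Hessian (the model and all names are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

D = np.random.default_rng(1).normal(1.0, 1.0, size=50)   # toy data

def neg_log_joint(theta):                 # -log P(D, theta | m), N(0,1) prior
    th = np.atleast_1d(theta)[0]
    log_prior = -0.5 * th**2 - 0.5 * np.log(2 * np.pi)
    log_lik = np.sum(-0.5 * (D - th)**2 - 0.5 * np.log(2 * np.pi))
    return -(log_prior + log_lik)

theta_star = minimize(neg_log_joint, x0=[0.0]).x[0]      # posterior mode
eps, d = 1e-3, 1                          # finite-difference step, dimension
A = (neg_log_joint(theta_star + eps) - 2 * neg_log_joint(theta_star)
     + neg_log_joint(theta_star - eps)) / eps**2
log_evidence = (-neg_log_joint(theta_star)               # log P(D, theta*|m)
                + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(A))
```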
Bayesian Information Criterion (BIC)
BIC can be obtained from the Laplace approximation:
log P(D|m) ≈ log P(θm∗|m) + log P(D|θm∗, m) + (d/2) log 2π − ½ log |A|

We have A = −∇² log P(D, θ∗|m) = −∇² log P(D|θ∗, m) − ∇² log P(θ∗|m). So as the number of iid data N → ∞, A grows as N·A0 + constant, for a fixed matrix A0
⇒ log |A| → log |N·A0| = log(N^d |A0|) = d log N + log |A0|.

Retaining only the terms that grow with N, we get:
log P(D|m) ≈ log P(D|θm∗, m) − (d/2) log N

Properties:
◮ Quick and easy to compute.
◮ Does not depend on the prior.
◮ We can use the ML estimate of θ instead of the MAP estimate (they coincide as N → ∞).
◮ Related to the Minimum Description Length (MDL) criterion.
◮ Assumes that in the large-sample limit all the parameters are well-determined (i.e. the model is identifiable; otherwise d should be the number of well-determined parameters).
◮ Neglects multiple modes (e.g. permutations in a MoG).
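A minimal sketch of BIC-based order selection for the polynomial example (illustrative names; Gaussian noise with ML variance assumed):

```python
import numpy as np

def bic_score(x, y, M):
    Phi = np.vander(x, M + 1)                 # polynomial features, order M
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    N = len(y)
    sigma2 = np.mean((y - Phi @ w) ** 2)      # ML noise variance
    log_lik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    d = M + 2                                 # M+1 weights plus sigma^2
    return log_lik - 0.5 * d * np.log(N)      # log P(D|theta_ML) - d/2 log N
```

The selected order is then the argmax of bic_score over M.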
Hyperparameters and Evidence optimisation
In some cases, we need to choose between a family of continuously parameterised models:
P(D|η) = ∫ P(D|θ) P(θ|η) dθ,   where η are the hyperparameters.

This choice can be made by ascending the gradient of:
◮ the exact evidence (if tractable);
◮ the approximated evidence (Laplace, EP, Bethe, …);
◮ a free-energy bound on the evidence (Variational Bayes);
or by placing a hyperprior on the hyperparameters η and sampling from the posterior
P(η|D) = P(D|η) P(η) / P(D)
using Markov chain Monte Carlo sampling.
Evidence optimisation in linear regression
Consider simple linear regression:
yi ∼ N(wᵀxi, σ²),   w ∼ N(0, C)

◮ Maximise the evidence
P(y1 … yN|x1 … xN, C, σ²) = ∫ P(y1 … yN|x1 … xN, w, σ²) P(w|C) dw
to find optimal values of C and σ².
◮ Compute the posterior P(w|y1 … yN, x1 … xN, C, σ²) given these optimal values.
The evidence for linear regression
◮ The posterior on w is normal: Σw = (XXᵀ/σ² + C⁻¹)⁻¹;  w̄ = Σw X Yᵀ/σ².
Note: X is a matrix whose columns are the input vectors, and Y is a row vector of the corresponding observed outputs.

◮ The evidence, E(C, σ²) = ∫ P(Y|X, w, σ²) P(w|C) dw, is given by:
E(C, σ²) = √( |2πΣw| / (|2πσ²I| |2πC|) ) exp[ −½ Y (I/σ² − XᵀΣwX/σ⁴) Yᵀ ]

◮ For optimisation, general forms for the gradients are available. If θ is a parameter in C:
∂/∂θ log E(C, σ²) = ½ Tr[ (C − Σw − w̄w̄ᵀ) ∂C⁻¹/∂θ ]
∂/∂σ² log E(C, σ²) = (1/2σ²) [ −N + Tr(I − ΣwC⁻¹) + (1/σ²)(Y − w̄ᵀX)(Y − w̄ᵀX)ᵀ ]
Automatic Relevance Determination
The most common form of evidence optimisation for regression (due to MacKay and Neal) takes C⁻¹ = diag(α) (i.e. wi ∼ N(0, 1/αi)) and then optimises the precisions {αi}.

Setting the gradients to zero and solving gives the fixed-point updates:
αi^new = (1 − αi[Σw]ii) / w̄i²
(σ²)^new = (Y − w̄ᵀX)(Y − w̄ᵀX)ᵀ / (N − ∑i (1 − αi[Σw]ii))

During optimisation each αi meets one of two fates (see the sketch below):
◮ αi → ∞ ⇒ wi = 0: irrelevant input xi.
◮ αi finite ⇒ wi = argmax P(wi|X, Y, αi): relevant input xi.

This procedure, Automatic Relevance Determination (ARD), yields sparse solutions that improve on ML regression (cf. L1-regularised regression, or LASSO). Evidence optimisation is also called maximum marginal likelihood or ML-II (type-II maximum likelihood).
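A minimal sketch of the ARD fixed-point updates, assuming X stores inputs as columns, Y is a row vector of outputs, and a cap on the precisions for numerical safety:

```python
import numpy as np

def ard(X, Y, n_iter=100):
    d, N = X.shape
    alpha, sigma2 = np.ones(d), np.var(Y)     # initial precisions and noise
    for _ in range(n_iter):
        Sigma = np.linalg.inv(X @ X.T / sigma2 + np.diag(alpha))
        w_bar = Sigma @ X @ Y / sigma2        # posterior mean of w
        gamma = 1 - alpha * np.diag(Sigma)    # well-determinedness of each w_i
        alpha = np.minimum(gamma / w_bar**2, 1e12)   # alpha_i -> inf: prune
        resid = Y - w_bar @ X
        sigma2 = resid @ resid / (N - gamma.sum())
    return w_bar, alpha, sigma2
```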
Prediction averaging
Model: yi ∼ N(wᵀxi, σ²),   w ∼ N(0, τ²I)

Linear regression predicts the output y for an input vector x by y ∼ N(wᵀx, σ²). The posterior over w is Gaussian with covariance Σw = (XXᵀ/σ² + I/τ²)⁻¹ and mean w̄ = Σw X Yᵀ/σ² (where X is the matrix whose columns are the input vectors, and Y is the row vector of outputs).

Given a new input vector x′, the predicted output y′ is (integrating out w):
y′|x′ ∼ N(w̄ᵀx′, x′ᵀΣwx′ + σ²)
The additional variance term x′ᵀΣwx′ comes from the posterior uncertainty in w. (A sketch of this computation follows below.)
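A minimal sketch of this averaged prediction (X has inputs as columns, Y is the row vector of outputs; names are illustrative):

```python
import numpy as np

def posterior_predict(X, Y, x_new, sigma2=0.1, tau2=1.0):
    d = X.shape[0]
    Sigma_w = np.linalg.inv(X @ X.T / sigma2 + np.eye(d) / tau2)
    w_bar = Sigma_w @ X @ Y / sigma2          # posterior mean of w
    mean = w_bar @ x_new
    var = x_new @ Sigma_w @ x_new + sigma2    # posterior + noise variance
    return mean, var
```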
Marginalised linear regression
Integrate out w: the joint distribution of y1, …, yN given x1, …, xN is Gaussian,
Yᵀ|X ∼ N(0, τ²XᵀX + σ²I)
The means and covariances are:
E[yi] = E[wᵀxi] = 0ᵀxi = 0
Var[yi] = E[(xiᵀw)(wᵀxi)] + σ² = τ²xiᵀxi + σ²
Cov[yi, yj] = E[(xiᵀw)(wᵀxj)] = τ²xiᵀxj   (i ≠ j)

That is, the covariance matrix has off-diagonal entries τ²xiᵀxj and diagonal entries τ²xiᵀxi + σ².
Predictions with marginalised regression
Now include the test input vector x′ and test output y′:
[Yᵀ; y′] | X, x′ ∼ N( 0, [ τ²XᵀX + σ²I   τ²Xᵀx′ ;  τ²x′ᵀX   τ²x′ᵀx′ + σ² ] )

We can find y′|Y by the standard multivariate Gaussian conditioning result:
[a; b] ∼ N( 0, [A  C; Cᵀ  B] ) ⇒ b|a ∼ N( CᵀA⁻¹a, B − CᵀA⁻¹C )

So
y′|Y, X, x′ ∼ N( τ²x′ᵀX(τ²XᵀX + σ²I)⁻¹Yᵀ,  τ²x′ᵀx′ + σ² − τ²x′ᵀX(τ²XᵀX + σ²I)⁻¹τ²Xᵀx′ )
            = N( (1/σ²) x′ᵀΣXYᵀ,  x′ᵀΣx′ + σ² ),   with Σ = (XXᵀ/σ² + I/τ²)⁻¹

◮ This is the same answer as obtained by integrating with respect to the posterior over w (a numerical check follows below).
◮ The evidence P(Y|X) is just the probability under the joint Gaussian; it also reduces to the expression found previously.
◮ Thus, Bayesian linear regression can be derived from a joint, parameter-free distribution on all the outputs conditioned on all the inputs.
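A minimal numerical check, on random toy data, that the weight-space and marginalised predictions coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 10
X, Y, x_new = rng.normal(size=(d, N)), rng.normal(size=N), rng.normal(size=d)
sigma2, tau2 = 0.1, 1.0

# Weight-space mean: (1/sigma^2) x'^T Sigma X Y^T
Sigma = np.linalg.inv(X @ X.T / sigma2 + np.eye(d) / tau2)
mean_w = x_new @ Sigma @ X @ Y / sigma2

# Function-space mean: tau^2 x'^T X (tau^2 X^T X + sigma^2 I)^{-1} Y^T
K = tau2 * X.T @ X + sigma2 * np.eye(N)
mean_f = tau2 * x_new @ X @ np.linalg.solve(K, Y)

assert np.allclose(mean_w, mean_f)        # identical, by the Woodbury identity
```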
Nonlinear regression
Model: yi ∼ N(wᵀφ(xi), σ²),   w ∼ N(0, τ²I)

We can also introduce a nonlinear mapping x → φ(x). Each element of φ(x) is a (nonlinear) feature extracted from x; there may be many more features than elements of x.

The regression function f(x) = wᵀφ(x) is nonlinear, but the outputs Y are still jointly Gaussian:
Yᵀ|X ∼ N(0N, τ²ΦᵀΦ + σ²IN)
where the ith column of the matrix Φ is φ(xi).

Proceeding as before, the predictive distribution over y′ for a test input x′ is:
y′|x′, Y, X ∼ N( τ²φ(x′)ᵀΦK⁻¹Yᵀ,  τ²φ(x′)ᵀφ(x′) + σ² − τ⁴φ(x′)ᵀΦK⁻¹Φᵀφ(x′) ),   K = τ²ΦᵀΦ + σ²I
The covariance kernel
Yᵀ|X ∼ N(0N, τ²ΦᵀΦ + σ²IN)

The covariance of the output vector Y plays a central role in the development of the theory of Gaussian processes. Define the covariance kernel function K : 𝒳 × 𝒳 → ℝ such that, if x, x′ ∈ 𝒳 are two input vectors with corresponding outputs y, y′, then
K(x, x′) = Cov[y, y′] = E[yy′] − E[y]E[y′]
In the nonlinear regression example we have K(x, x′) = τ²φ(x)ᵀφ(x′) + σ²δ_{x=x′}.

The covariance kernel has two properties:
◮ Symmetric: K(x, x′) = K(x′, x) for all x, x′.
◮ Positive semidefinite: the matrix [K(xi, xj)] formed from any finite set of input vectors x1, …, xN is positive semidefinite.

Theorem: a covariance kernel K : 𝒳 × 𝒳 → ℝ is symmetric and positive semidefinite if and only if there is a feature map φ : 𝒳 → ℋ such that
K(x, x′) = φ(x)ᵀφ(x′)
The feature space ℋ can potentially be infinite-dimensional.
Regression using the covariance kernel
For nonlinear regression, all operations depended on K(x, x′) rather than explicitly on φ(x). So we can define the joint distribution in terms of K directly, implicitly using a (potentially infinite-dimensional) feature map φ(x):
Y|X, K ∼ N(0N, K(X, X))
where the (i, j) entry of the covariance matrix K(X, X) is K(xi, xj). This is called the kernel trick.

Prediction: compute the predictive distribution of y′ conditioned on Y:
y′|x′, X, Y, K ∼ N( K(x′, X)K(X, X)⁻¹Yᵀ [mean],  K(x′, x′) − K(x′, X)K(X, X)⁻¹K(X, x′) [variance] )

Evidence: this is just the Gaussian likelihood:
P(Y|X, K) = |2πK(X, X)|^{−1/2} e^{−½ Y K(X,X)⁻¹ Yᵀ}

Evidence optimisation: the covariance kernel K often has parameters, and these can be optimised by gradient ascent on log P(Y|X, K). (A sketch of kernel-trick prediction follows below.)
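A minimal sketch of kernel-trick prediction with a squared-exponential kernel, with the noise σ² added on the training diagonal to match the kernel definition above (all names are illustrative):

```python
import numpy as np

def se_kernel(A, B, theta=1.0, eta=1.0):      # A: d x Na, B: d x Nb
    d2 = (np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :] - 2 * A.T @ B)
    return theta**2 * np.exp(-d2 / (2 * eta**2))

def kernel_predict(X, Y, x_new, sigma2=0.1):
    N = X.shape[1]
    K = se_kernel(X, X) + sigma2 * np.eye(N)  # K(X, X) including noise
    k_star = se_kernel(X, x_new[:, None])[:, 0]
    mean = k_star @ np.linalg.solve(K, Y)
    var = (se_kernel(x_new[:, None], x_new[:, None])[0, 0] + sigma2
           - k_star @ np.linalg.solve(K, k_star))
    return mean, var
```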
The Gaussian process
A Gaussian process (GP) is a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions. In our regression setting, for each input vector x we have an output f(x). Given X = [x1, …, xN], the joint distribution of the outputs F = [f(x1), …, f(xN)] is:
F|X, K ∼ N(0, K(X, X))
Thus the random function f(x) (as a collection of random variables, one f(x) for each x) is a Gaussian process.

In general, a Gaussian process is parameterised by a mean function m(x) and a covariance kernel K(x, x′), and we write
f(·) ∼ GP(m(·), K(·, ·))

Posterior Gaussian process: on observing X and F, the conditional joint distribution of F′ = [f(x′1), …, f(x′M)] at another set of input vectors x′1, …, x′M is still Gaussian:
F′|X′, X, F, K ∼ N( K(X′, X)K(X, X)⁻¹Fᵀ,  K(X′, X′) − K(X′, X)K(X, X)⁻¹K(X, X′) )
Thus the posterior over functions, f(·)|X, F, is still a Gaussian process.
Regression with Gaussian processes
We seek to learn the function that maps inputs x1, …, xN to outputs y1, …, yN. Instead of assuming a specific form, we consider a random function drawn from a GP prior: f(·) ∼ GP(0, K(·, ·)). Any function is possible (no restriction on support), but some are (much) more likely a priori.

Observations yi are usually taken to be noisy versions of the latent f(xi): yi|xi, f(·) ∼ N(f(xi), σ²).

Evidence: given by the multivariate Gaussian likelihood:
P(Y|X) = |2π(K(X, X) + σ²I)|^{−1/2} e^{−½ Y(K(X,X)+σ²I)⁻¹Yᵀ}

Posterior: also a GP:
f(·)|X, Y ∼ GP( K(·, X)(K(X, X) + σ²I)⁻¹Yᵀ,  K(·, ·) − K(·, X)(K(X, X) + σ²I)⁻¹K(X, ·) )

Predictions: posterior on f, plus observation noise:
y′|X, Y, x′ ∼ N( E[f(x′)|X, Y],  Var[f(x′)|X, Y] + σ² )

Evidence optimisation: gradient ascent on log P(Y|X) (a sketch follows below).
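A minimal sketch of the GP log evidence for noisy 1-D observations with a squared-exponential kernel, suitable for hyperparameter tuning by grid search or gradient ascent (names and defaults are illustrative):

```python
import numpy as np

def gp_log_evidence(x, y, theta=1.0, eta=1.0, sigma2=0.1):
    N = len(x)
    K = theta**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * eta**2))
    L = np.linalg.cholesky(K + sigma2 * np.eye(N))    # K(X,X) + sigma^2 I
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                  # -1/2 Y (K + s^2 I)^-1 Y^T
            - np.sum(np.log(np.diag(L)))      # -1/2 log |K + s^2 I|
            - 0.5 * N * np.log(2 * np.pi))
```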
Samples from a Gaussian process
We can draw sample functions from a GP by fixing a set of input vectors x1, …, xN and drawing a sample f(x1), …, f(xN) from the corresponding multivariate Gaussian; this can then be plotted (see the sketch after this list).

[Figure: example functions drawn from prior and posterior GPs.]

Another approach is to sample sequentially:
◮ sample f(x1) first,
◮ then f(x2)|f(x1),
◮ and in general f(xn)|f(x1), …, f(xn−1) for n = 1, 2, ….
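A minimal sketch of the first approach, drawing sample functions from a squared-exponential GP prior on a 1-D grid (the jitter term is an assumption for numerical stability):

```python
import numpy as np

xs = np.linspace(-5, 5, 200)                        # grid of input points
K = np.exp(-(xs[:, None] - xs[None, :])**2 / 2.0)   # SE kernel, eta = 1
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(xs)))  # jitter for stability
samples = L @ np.random.default_rng(0).normal(size=(len(xs), 3))
# Each column of `samples` is one function drawn from GP(0, K); plot vs xs.
```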
Sample from a 2D Gaussian process

[Figure: a sample function drawn from a Gaussian process with two-dimensional inputs.]
Examples of covariance kernels
◮ Polynomial:
K(x, x′) = (1 + xᵀx′)^m,   m = 1, 2, …
[Plot: sample functions (output y against input x) for m = 1, 2, 3.]

◮ Squared-exponential (or exponentiated-quadratic):
K(x, x′) = θ² e^{−‖x − x′‖²/(2η²)}
[Plot: sample functions for η = 1, 2, 4.]

◮ Periodic (exp-sine):
K(x, x′) = θ² e^{−2 sin²(π(x − x′)/τ)/η²}
[Plot: sample functions for (τ = 1, η = 2), (τ = 2, η = 1), (τ = 3, η = 0.5).]

◮ Rational quadratic:
K(x, x′) = (1 + ‖x − x′‖²/(2αη²))^{−α},   α > 0
[Plot: sample functions.]

Minimal code sketches of these kernels follow below.
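For reference, minimal sketches of the four kernels above for scalar inputs (parameter defaults are illustrative):

```python
import numpy as np

def k_poly(x, xp, m=2):                       # polynomial
    return (1 + x * xp) ** m

def k_se(x, xp, theta=1.0, eta=1.0):          # squared-exponential
    return theta**2 * np.exp(-(x - xp)**2 / (2 * eta**2))

def k_periodic(x, xp, theta=1.0, tau=1.0, eta=1.0):   # exp-sine
    return theta**2 * np.exp(-2 * np.sin(np.pi * (x - xp) / tau)**2 / eta**2)

def k_rq(x, xp, alpha=1.0, eta=1.0):          # rational quadratic
    return (1 + (x - xp)**2 / (2 * alpha * eta**2)) ** (-alpha)
```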