COMS 4721: Machine Learning for Data Science, Lecture 24 (4/25/2017)


  1. COMS 4721: Machine Learning for Data Science Lecture 24, 4/25/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University

  2. MODEL SELECTION

  3. MODEL SELECTION The model selection problem We've seen how model parameters often need to be set in advance, and discussed how this can be done using cross-validation. Another type of model selection problem is learning model order. Model order: the complexity of a class of models. ◮ Gaussian mixture model: How many Gaussians? ◮ Matrix factorization: What rank? ◮ Hidden Markov models: How many states? In each of these problems, we can't simply look at the log-likelihood, because a more complex model can always fit the data better.

  4. MODEL SELECTION Model Order We will discuss two methods for selecting an "appropriate" complexity of the model. This assumes a good model type was chosen to begin with.

  5. EXAMPLE: MAXIMUM LIKELIHOOD Notation We write L for the log-likelihood of a parameter under a model p(x|θ): x_i ~iid~ p(x|θ)  ⟺  L = Σ_{i=1}^N log p(x_i|θ). The maximum likelihood solution is θ_ML = arg max_θ L. Example: How many clusters? (wrong way) The parameters θ could be those of a GMM. We could find θ_ML for different numbers of clusters and pick the one with the largest L. Problem: We can perfectly fit the data by putting each observation in its own cluster, then shrinking the variance of each Gaussian to zero.
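The failure mode on this slide is easy to reproduce numerically. Below is a minimal sketch, not taken from the course materials, that fits GMMs with more and more components using scikit-learn's GaussianMixture and prints the training log-likelihood; the toy data set and the candidate component counts are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two well-separated 2D Gaussian clusters (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# The training log-likelihood keeps improving as components are added
# (up to EM local optima), so "pick the K with the largest L" never says stop.
for K in [1, 2, 4, 8, 16, 32]:
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X)
    print(f"K={K:2d}  avg. training log-likelihood = {gmm.score(X):.3f}")
```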

  6. NUMBER OF PARAMETERS The general problem ◮ Models with more degrees of freedom are more prone to overfitting. ◮ The number of degrees of freedom is roughly the number of scalar parameters, K. ◮ By increasing K (e.g., by increasing the number of clusters, the rank, or the number of states), the model gains more degrees of freedom. Some common solutions ◮ Stability: Bootstrap sample the data, learn a model, and calculate the likelihood on the original data set. Repeat and pick the best model. ◮ Bayesian nonparametric methods: Each possible value of K is assigned a prior probability. The posterior learns the best K. ◮ Penalization approaches: A penalty term makes adding parameters expensive. It must be overcome by a greater improvement in likelihood.
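The "Stability" bullet above can be made concrete. The sketch below is one plausible reading of that recipe (bootstrap resample, fit, score on the original data, repeat), not code from the lecture; stability_score, its arguments, and the choice of a GMM as the model are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def stability_score(X, K, n_boot=10, seed=0):
    """Fit a K-component GMM to bootstrap resamples of X and average each
    fit's log-likelihood evaluated on the original data set."""
    rng = np.random.default_rng(seed)
    scores = []
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample
        gmm = GaussianMixture(n_components=K, random_state=b).fit(X[idx])
        scores.append(gmm.score(X))                      # score on original data
    return np.mean(scores)

# Candidate model orders; pick the K whose bootstrap fits do best on the
# original data, e.g. scores = {K: stability_score(X, K) for K in range(1, 6)}
```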

  7. PENALIZING MODEL COMPLEXITY General form Define a penalty function on the number of model parameters. Instead of maximizing L, minimize −L and add the defined penalty. Two popular penalties are: ◮ Akaike information criterion (AIC): −L + K ◮ Bayesian information criterion (BIC): −L + (1/2) K ln N When (1/2) ln N > 1, BIC encourages a simpler model than AIC (this happens when N ≥ 8). Example: For NMF of an M_1 × M_2 matrix with a rank-R factorization, the penalties are AIC → (M_1 + M_2)R and BIC → (1/2)(M_1 + M_2)R ln(M_1 M_2).
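A hedged sketch of how these two penalties might be computed in code, using the lecture's convention (−L + K and −L + (1/2)K ln N, rather than the common textbook scaling 2K − 2L and K ln N − 2L, which differs only by a factor of two and has the same minimizer). The function names and the illustrative NMF sizes below are assumptions, not part of the slides.

```python
import numpy as np

def aic(log_lik, K):
    """AIC on the lecture's scale: -L + K (smaller is better)."""
    return -log_lik + K

def bic(log_lik, K, N):
    """BIC on the lecture's scale: -L + (1/2) K ln N (smaller is better)."""
    return -log_lik + 0.5 * K * np.log(N)

# NMF example from the slide: factorizing an M1 x M2 matrix at rank R uses
# K = (M1 + M2) * R scalar parameters and N = M1 * M2 observed entries,
# so the BIC penalty is 0.5 * (M1 + M2) * R * ln(M1 * M2).
M1, M2, R = 1000, 500, 20        # illustrative sizes
K, N = (M1 + M2) * R, M1 * M2
print("AIC penalty:", K, " BIC penalty:", 0.5 * K * np.log(N))
```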

  8. EXAMPLE OF AIC OUTPUT

  9. EXAMPLE: AIC VS BIC ON AN HMM Notice: ◮ The likelihood is always improving. ◮ Only compare the locations of the AIC and BIC minima, not their values.

  10. DERIVATION OF BIC

  11. AIC AND BIC Recall the two penalties: ◮ Akaike information criterion (AIC): −L + K ◮ Bayesian information criterion (BIC): −L + (1/2) K ln N Algorithmically, there is no extra work required: 1. Find the ML solution of each candidate model and calculate L. 2. Add the AIC or BIC penalty to get a score useful for picking a model. Q: Where do these penalties come from? Currently they seem arbitrary. A: We will derive BIC next. AIC also has a theoretical motivation, but we will not discuss that derivation.
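In practice both steps are often already packaged. For example, scikit-learn's GaussianMixture exposes aic(X) and bic(X) methods (computed on the 2K − 2L and K ln N − 2L scale, which does not change which K wins). A short sketch of the selection loop on an invented toy data set:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data drawn from three clusters (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 1, size=(150, 2)),
               rng.normal((6, 0), 1, size=(150, 2)),
               rng.normal((0, 6), 1, size=(150, 2))])

candidates = list(range(1, 9))
models = [GaussianMixture(n_components=K, random_state=0).fit(X) for K in candidates]

# Step 1 (ML fit) is done above; step 2 is just adding the penalty and comparing.
best_aic = min(candidates, key=lambda K: models[K - 1].aic(X))
best_bic = min(candidates, key=lambda K: models[K - 1].bic(X))
print("AIC picks K =", best_aic, "; BIC picks K =", best_bic)
```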

  12. DERIVING THE BIC Imagine we have r candidate models, M_1, ..., M_r. For example, r HMMs, each having a different number of states. We also have data D = {x_1, ..., x_N}. We want the posterior of each M_i: p(M_i|D) = p(D|M_i) p(M_i) / Σ_j p(D|M_j) p(M_j). If we assume a uniform prior distribution on models, then because the denominator is constant in M_i, we can pick M̂ = arg max_{M_i} ln p(D|M_i) = ln ∫ p(D|θ, M_i) p(θ|M_i) dθ. We're choosing the model with the largest marginal likelihood of the data, obtained by integrating out all parameters of the model. This integral is usually not solvable in closed form.
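For toy models the marginal likelihood on this slide can be checked directly. The sketch below, which is not part of the lecture, uses a Bernoulli model with a uniform prior, where ln p(D|M) has the closed form ln B(k+1, N−k+1), and compares it with a naive Monte Carlo average of the likelihood over prior draws; all variable names and sizes are illustrative.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
x = rng.random(50) < 0.3                     # toy coin-flip data
k, n = int(x.sum()), len(x)

# Model M: x_i ~ Bernoulli(theta) with prior theta ~ Uniform(0, 1).
# Exact marginal likelihood: ∫ theta^k (1-theta)^(n-k) dtheta = B(k+1, n-k+1).
exact = betaln(k + 1, n - k + 1)

# Naive Monte Carlo: average the likelihood over draws from the prior.
theta = rng.uniform(0, 1, size=200_000)
log_lik = k * np.log(theta) + (n - k) * np.log1p(-theta)
mc = np.log(np.mean(np.exp(log_lik)))        # fine here; underflows for large n
print(f"exact ln p(D|M) = {exact:.3f}   Monte Carlo estimate = {mc:.3f}")
```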

  13. DERIVING THE BIC We will see how the BIC arises from the approximation M̂ = arg max_{M_i} ln p(D|M_i) ≈ arg max_{M_i} ln p(D|θ_ML, M_i) − (1/2) K ln N. Step 1: Recognize that the difficulty is with the integral ln p(D|M_i) = ln ∫ p(D|θ) p(θ) dθ. M_i determines p(D|θ) and p(θ); we will suppress this conditioning. Step 2: Approximate this integral using a second-order Taylor expansion.

  14. DERIVING THE BIC 1. We want to calculate ln p(D|M) = ln ∫ p(D|θ) p(θ) dθ = ln ∫ exp{ln p(D|θ)} p(θ) dθ. 2. We use a second-order Taylor expansion of ln p(D|θ) at the point θ_ML: ln p(D|θ) ≈ ln p(D|θ_ML) + (θ − θ_ML)ᵀ ∇ ln p(D|θ_ML) + (1/2)(θ − θ_ML)ᵀ ∇² ln p(D|θ_ML) (θ − θ_ML), where the gradient term ∇ ln p(D|θ_ML) = 0 at the maximum, and we define −J(θ_ML) = ∇² ln p(D|θ_ML). 3. Approximate p(θ) as uniform and plug this approximation back in: ln p(D|M) ≈ ln p(D|θ_ML) + ln ∫ exp{ −(1/2)(θ − θ_ML)ᵀ J(θ_ML)(θ − θ_ML) } dθ.

  15. DERIVING THE BIC Observation: The integral is the normalizing constant of a Gaussian, ∫ exp{ −(1/2)(θ − θ_ML)ᵀ J(θ_ML)(θ − θ_ML) } dθ = ( (2π)^K / |J(θ_ML)| )^{1/2}. Remember the definition −J(θ_ML) = ∇² ln p(D|θ_ML) =(a)= N · (1/N) Σ_{i=1}^N ∇² ln p(x_i|θ_ML), where the average (1/N) Σ_i ∇² ln p(x_i|θ_ML) converges as N increases. Equality (a) is by the i.i.d. model assumption made at the beginning of the lecture.
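The convergence claim on this slide can be checked numerically for a one-parameter example. The sketch below (an illustration, not from the slides) uses a Bernoulli model, where ∇² ln p(x|θ) = −x/θ² − (1−x)/(1−θ)², and shows the per-observation average of the Hessians settling near minus the Fisher information, −1/(θ(1−θ)), as N grows.

```python
import numpy as np

# For a Bernoulli model, d²/dθ² ln p(x|θ) = -x/θ² - (1-x)/(1-θ)².
# The slide's point: the per-observation average of these Hessians at θ_ML
# settles down as N grows, so |J(θ_ML)| grows like N times an O(1) factor.
rng = np.random.default_rng(0)
for N in [100, 1_000, 10_000, 100_000]:
    x = (rng.random(N) < 0.3).astype(float)
    theta_ml = x.mean()
    hess = -x / theta_ml**2 - (1 - x) / (1 - theta_ml)**2
    print(f"N={N:6d}   (1/N) sum of Hessians = {hess.mean():.4f}")
# The averages approach -1/(θ(1-θ)) ≈ -4.76, i.e., minus the Fisher information.
```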

  16. DERIVING THE BIC 4. Plugging this in, ln p(D|M) ≈ ln p(D|θ_ML) + ln ( (2π)^K / |J(θ_ML)| )^{1/2}, and |J(θ_ML)| = N^K · | (1/N) Σ_{i=1}^N ∇² ln p(x_i|θ_ML) |. Therefore we arrive at the BIC, ln p(D|M) ≈ ln p(D|θ_ML) − (1/2) K ln N + (something not growing with N), where the last part is an O(1) term, so we ignore it.
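As a sanity check on the final approximation, the sketch below (illustrative, not from the course) compares the exact ln p(D|M) for a Bernoulli model with a uniform prior (K = 1, exact value ln B(k+1, N−k+1)) against ln p(D|θ_ML) − (1/2) K ln N for increasing N; the difference is exactly the O(1) term that the BIC ignores.

```python
import numpy as np
from scipy.special import betaln, xlogy

# Bernoulli model with a uniform prior on theta: K = 1 parameter and the
# exact log marginal likelihood is ln B(k+1, N-k+1).
rng = np.random.default_rng(0)
for N in [50, 500, 5_000, 50_000]:
    x = rng.random(N) < 0.3
    k = int(x.sum())
    exact = betaln(k + 1, N - k + 1)
    theta_ml = k / N
    log_lik = xlogy(k, theta_ml) + xlogy(N - k, 1 - theta_ml)
    bic_approx = log_lik - 0.5 * 1 * np.log(N)   # ln p(D|θ_ML) - (1/2) K ln N
    print(f"N={N:6d}  exact={exact:10.2f}  BIC approx={bic_approx:10.2f}  "
          f"gap={exact - bic_approx:5.2f}")
# The gap is the O(1) term: it stays near a constant while both quantities
# grow in magnitude with N, which is why the BIC drops it.
```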

  17. SOME NEXT STEPS

  18. ICML SESSIONS (SUBSET) The International Conference on Machine Learning (ICML) is a major ML conference. Many of the session titles should look familiar: ◮ Bayesian Optimization and Gaussian Processes ◮ PCA and Subspace Models ◮ Supervised Learning ◮ Matrix Completion and Graphs ◮ Clustering and Nonparametrics ◮ Active Learning ◮ Clustering ◮ Boosting and Ensemble Methods ◮ Matrix Factorization I & II ◮ Kernel Methods I & II ◮ Topic Models ◮ Time Series and Sequences ◮ etc.

  19. ICML SESSIONS (SUBSET) Other sessions might not look so familiar: ◮ Reinforcement Learning I & II ◮ Bandits I & II ◮ Optimization I, II & III ◮ Bayesian Nonparametrics I & II ◮ Online Learning I & II ◮ Graphical Models I & II ◮ Neural Networks and Deep Learning I & II ◮ Metric Learning and Feature Selection ◮ etc. Many of these topics are taught in advanced machine learning courses at Columbia in the CS, Statistics, IEOR and EE departments.
