Bayesian logistic regression

1. Deterministic Approximations
   Iain Murray, http://iainmurray.net/
   Bayesian logistic regression: already covered in the lectures on classification.
   Laplace and variational approximations: I will review Murphy pp. 256–259 on the board. Similar material is in MacKay, Ch. 41, pp. 492–503 (§41.4 uses non-examinable MCMC methods): http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

   Posterior distributions
   p(θ | D, M) = P(D | θ) p(θ) / P(D | M)
   E.g., logistic regression: prior p(θ = w) = N(w; 0, σ²I), likelihood P(D | θ = w) = ∏_n σ(z^(n) w^T x^(n)), with labels z^(n) ∈ ±1.
   We have to integrate a large product of non-linear functions. Goals: summarize the posterior in a simple form, and estimate the model evidence P(D | M).

   Non-Gaussian example
   Prior p(w) ∝ N(w; 0, 1); posterior p(w | D) ∝ N(w; 0, 1) σ(10 − 20w). (Figure: both densities plotted for w from −4 to 4; a short code sketch of this example appears after the list.)

2. Posterior after 500 datapoints
   N = 500 labels generated with w = 1 at inputs x^(n) ~ N(0, 10²).
   Prior p(w) ∝ N(w; 0, 1); posterior p(w | D) ∝ N(w; 0, 1) ∏_{n=1}^{500} σ(w x^(n) z^(n)).
   (Figures: the posterior over w from −4 to 4, and the same plot with a Gaussian fit overlaid.)

   Gaussian approximations
   For a finite parameter vector θ, P(θ | lots of data) is often nearly Gaussian around the mode. We need to identify which Gaussian it is: its mean and covariance.

   Laplace Approximation
   Find the posterior mode (MAP estimate) θ* using your favourite gradient-based optimizer:
   θ* = arg max_θ [ log P(D | θ) + log P(θ) ].
   Define an 'energy': E(θ) = −log P(θ | D) = −log P(D | θ) − log P(θ) + log P(D).
   The log posterior doesn't need to be normalized: constants disappear from the derivatives and second derivatives.

   Laplace details
   The matrix of second derivatives is called the Hessian:
   H_ij = −∂² log P(θ | D) / ∂θ_i ∂θ_j, evaluated at θ = θ*.
   Because ∇_θ E is zero at θ* (a turning point), a Taylor expansion gives
   E(θ* + δ) ≈ E(θ*) + (1/2) δ^T H δ.
   Doing the same expansion for a Gaussian around its mean and matching terms identifies the Laplace approximation:
   P(θ | D) ≈ N(θ; θ*, H⁻¹).
   (A code sketch of this procedure for the 1-D model above appears after the list.)

3. Laplace picture (figure)
   Curvature and mode match. We can normalize the Gaussian. The height at the mode won't match exactly!

   Laplace problems
   Weird densities won't work well: we only locally match one mode, and the mode may not have much mass, or may have misleading curvature. In high dimensions the mode may be flat in some direction → an ill-conditioned Hessian.
   The Laplace approximation is also used to approximate the model likelihood (AKA 'evidence' or 'marginal likelihood'):
   P(D) = P(D | θ) P(θ) / P(θ | D) ≈ P(D | θ*) P(θ*) / N(θ*; θ*, H⁻¹) = P(D | θ*) P(θ*) |2πH⁻¹|^(1/2).
   (See the short evidence computation after the list.)

   Other Gaussian approximations
   We can match a Gaussian in other ways than derivatives at the mode. An accurate approximation with a Gaussian may not be possible, but capturing the posterior width is better than only fitting a point estimate.

   Variational methods
   Goal: fit a target distribution (e.g., a parameter posterior). Define:
   — a family of possible distributions q(θ);
   — a 'variational objective' (which says 'how well does q match?').
   Optimize the objective: fit the parameters of q(θ), e.g., the mean and covariance of a Gaussian.

4. Kullback–Leibler divergence
   D_KL(p || q) = ∫ p(θ) log [ p(θ) / q(θ) ] dθ
   D_KL(p || q) ≥ 0, and is minimized (at zero) when q(θ) = p(θ).
   Information theory (non-examinable for MLPR): the KL divergence is the average storage wasted by a compression system that uses model q instead of the true distribution p.

   Minimizing D_KL(p || q)
   Select a family: q(θ) = N(θ; µ, Σ). Minimizing D_KL(p || q) then matches the mean and covariance of p. (Figure: example over θ from −4 to 4; a numerical version appears after the list.)
   Optimizing D_KL(p || q) tends to be hard: even for a Gaussian q, how do we get the mean and covariance of p? MCMC? And the answer may not be what you want: see Murphy Fig. 21.1.

   Considering D_KL(q || p)
   D_KL(q || p) = −∫ q(θ) log p(θ | D) dθ + ∫ q(θ) log q(θ) dθ,
   where the second term is the negative entropy, −H(q). The two terms say:
   1. "Don't put probability mass on implausible parameters."
   2. q wants to be spread out: high entropy.
   (H is the standard symbol for entropy. Nothing to do with a Hessian, also H; sorry!)

5. Usual variational methods
   Most variational methods in machine learning minimize D_KL(q || p):
   — all parameters sampled from q are plausible;
   — we know how to do it!
   (There are other variational principles.)

   D_KL(q || p): fitting the posterior
   Fit q to p(θ | D) = p(D | θ) p(θ) / p(D). Substituting into the KL divergence gives a spray of terms:
   D_KL(q || p) = E_q[log q(θ)] − E_q[log p(D | θ)] − E_q[log p(θ)] + log p(D).
   Minimize the sum of the first three terms, J(q). The last term, log p(D), is the model evidence and is usually intractable, but
   D_KL(q || p) ≥ 0  ⇒  log p(D) ≥ −J(q),
   so we optimize a lower bound on the log marginal likelihood.

   D_KL(q || p): optimization
   The literature is full of clever (non-examinable) iterative ways to optimize D_KL(q || p); q is not always Gaussian. Can we use standard optimizers? The hardest term to evaluate is
   E_q[log p(D | θ)] = Σ_{n=1}^N E_q[log p(x_n | θ)],
   a sum of possibly simple integrals. Stochastic gradient descent is an option. (A sketch that fits a Gaussian q this way for the logistic-regression posterior, using 1-D quadrature for the expectation, appears after the list.)

   Summary
   Laplace approximation:
   — straightforward to apply;
   — 2nd derivatives → certainty of parameters;
   — an incremental improvement on the MAP estimate.
   Variational methods:
   — fit variational parameters of q (not θ!);
   — usually KL(q || p); compare with KL(p || q);
   — bound the marginal/model likelihood ('the evidence').
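The following is a minimal NumPy sketch (not from the slides) of the non-Gaussian example in item 1: it evaluates the unnormalized posterior p(w | D) ∝ N(w; 0, 1) σ(10 − 20w) on a grid and normalizes it numerically. The grid range and resolution are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Unnormalized posterior from the toy example: p(w|D) ∝ N(w; 0, 1) σ(10 - 20 w)
w = np.linspace(-4, 4, 1001)          # grid covering the plot range on the slide
prior = np.exp(-0.5 * w**2) / np.sqrt(2 * np.pi)
post_unnorm = prior * sigmoid(10 - 20 * w)

# Normalize numerically on the grid; only possible because w is one-dimensional
post = post_unnorm / np.trapz(post_unnorm, w)
```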
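Next, a sketch of the Laplace approximation for the 1-D logistic-regression posterior of item 2. The data-generation details (random seed, sampling labels from σ(w_true x)) are my assumptions about the setup the slide describes; the energy, gradient, and Hessian follow the formulas in item 2.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Assumed data setup: N = 500 labels z in {-1, +1} generated with w_true = 1
# at inputs x ~ N(0, 10^2), as described on the slide.
rng = np.random.default_rng(0)
N, w_true = 500, 1.0
x = 10.0 * rng.standard_normal(N)
z = np.where(rng.random(N) < sigmoid(w_true * x), 1, -1)

def neg_log_post(wv):
    # Energy E(w) = -log N(w; 0, 1) - sum_n log σ(w x_n z_n), up to a constant
    w = wv[0]
    a = w * x * z
    return 0.5 * w**2 + np.sum(np.logaddexp(0.0, -a))   # -log σ(a) = log(1 + e^{-a})

def neg_log_post_grad(wv):
    w = wv[0]
    a = w * x * z
    return np.array([w - np.sum((1.0 - sigmoid(a)) * x * z)])

# MAP estimate with a gradient-based optimizer
res = minimize(neg_log_post, x0=np.array([0.0]), jac=neg_log_post_grad,
               method="L-BFGS-B")
w_map = res.x[0]

# Hessian of the energy at the mode (prior precision + likelihood curvature);
# the Laplace fit is N(w; w_map, 1/H)
a = w_map * x * z
H = 1.0 + np.sum(sigmoid(a) * (1.0 - sigmoid(a)) * x**2)
laplace_var = 1.0 / H
```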
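Continuing that sketch (it reuses x, z, w_map, and H), the evidence formula from item 3, P(D) ≈ P(D | θ*) P(θ*) |2πH⁻¹|^(1/2), becomes on the log scale:

```python
# Laplace estimate of the log evidence:
#   log P(D) ≈ log P(D|w*) + log p(w*) + 0.5 log|2π H^{-1}|
a_map = w_map * x * z
log_lik_map = -np.sum(np.logaddexp(0.0, -a_map))            # log P(D | w*)
log_prior_map = -0.5 * w_map**2 - 0.5 * np.log(2 * np.pi)   # log N(w*; 0, 1)
log_evidence = log_lik_map + log_prior_map + 0.5 * np.log(2 * np.pi / H)
```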
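A numerical illustration (again my sketch, not from the slides) of item 4: reusing the grid w and the normalized posterior post from the first sketch, the Gaussian that minimizes D_KL(p || q) is obtained by matching the mean and variance of p, and the divergence itself can be evaluated on the grid.

```python
# Gaussian q minimizing D_KL(p||q): match the mean and variance of p,
# here computed numerically on the grid from the first sketch.
mean_p = np.trapz(w * post, w)
var_p = np.trapz((w - mean_p)**2 * post, w)
q = np.exp(-0.5 * (w - mean_p)**2 / var_p) / np.sqrt(2 * np.pi * var_p)

# D_KL(p||q) = ∫ p(w) log[p(w)/q(w)] dw, evaluated with the trapezoid rule
kl_pq = np.trapz(post * (np.log(post) - np.log(q)), w)
```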
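Finally, a sketch of fitting q(w) = N(w; m, s²) by minimizing D_KL(q || p), i.e. minimizing J(q), for the posterior from the Laplace sketch (it reuses x, z, w_map, laplace_var, and log_evidence from above). Instead of the stochastic-gradient estimator mentioned in item 5, the expectation E_q[log p(D | w) + log p(w)] is computed here by Gauss–Hermite quadrature, which is only feasible because w is one-dimensional; the optimizer, node count, and initialization are my choices.

```python
from numpy.polynomial.hermite import hermgauss
from scipy.optimize import minimize

t_nodes, t_weights = hermgauss(30)   # nodes/weights for ∫ e^{-t^2} f(t) dt

def log_joint(ws):
    # log p(D|w) + log p(w) for an array of w values
    a = ws[:, None] * (x * z)[None, :]
    log_lik = -np.sum(np.logaddexp(0.0, -a), axis=1)
    log_prior = -0.5 * ws**2 - 0.5 * np.log(2 * np.pi)
    return log_lik + log_prior

def J(params):
    # J(q) = E_q[log q] - E_q[log p(D|w)] - E_q[log p(w)] for q(w) = N(w; m, s^2)
    m, log_s = params
    s = np.exp(log_s)
    w_nodes = m + np.sqrt(2.0) * s * t_nodes          # change of variables for E_q[.]
    expected_log_joint = np.sum(t_weights * log_joint(w_nodes)) / np.sqrt(np.pi)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)   # -E_q[log q]
    return -(expected_log_joint + entropy)

# Initialize at the Laplace fit and refine the variational parameters (m, log s)
res_v = minimize(J, x0=[w_map, 0.5 * np.log(laplace_var)], method="Nelder-Mead")
m_v, s_v = res_v.x[0], np.exp(res_v.x[1])

# -res_v.fun is the maximized lower bound on log p(D); compare it with log_evidence
print(f"Laplace:     mean {w_map:.3f}, sd {np.sqrt(laplace_var):.3f}, log-evidence {log_evidence:.2f}")
print(f"Variational: mean {m_v:.3f}, sd {s_v:.3f}, ELBO {-res_v.fun:.2f}")
```

For this unimodal, log-concave 1-D posterior the variational and Laplace fits typically end up very close; the differences the slides highlight matter more for skewed, constrained, or multimodal posteriors.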
