SLIDE 1

Shrinkage

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Maximilian Kasy

Department of Economics, Harvard University

1 / 37

SLIDE 2

Shrinkage

Agenda

◮ 6 equivalent representations of the posterior mean in the Normal-Normal model.

◮ Gaussian process priors for regression functions.

◮ Reproducing Kernel Hilbert Spaces and splines.

◮ Applications from my own work, to

  • 1. Optimal treatment assignment in experiments.
  • 2. Optimal insurance and taxation.

2 / 37

SLIDE 3

Shrinkage

Takeaways for this part of class

◮ In a Normal means model with a Normal prior, there are a number of equivalent ways to think about regularization.

◮ Posterior mean, penalized least squares, shrinkage, etc.

◮ We can extend from estimation of means to estimation of functions using Gaussian process priors.

◮ Gaussian process priors yield the same function estimates as penalized least squares regressions.

◮ Theoretical tool: Reproducing kernel Hilbert spaces.

◮ Special case: Spline regression.

3 / 37

SLIDE 4

Shrinkage Normal posterior means – equivalent representations

Normal posterior means – equivalent representations Setup

◮ $\theta \in \mathbb{R}^k$

◮ $X \mid \theta \sim N(\theta, I_k)$

◮ Loss

  $L(\hat{\theta}, \theta) = \sum_i (\hat{\theta}_i - \theta_i)^2$

◮ Prior

  $\theta \sim N(0, C)$

4 / 37

SLIDE 5

Shrinkage Normal posterior means – equivalent representations

6 equivalent representations of the posterior mean

  • 1. Minimizer of weighted average risk
  • 2. Minimizer of posterior expected loss
  • 3. Posterior expectation
  • 4. Posterior best linear predictor
  • 5. Penalized least squares estimator
  • 6. Shrinkage estimator

5 / 37

SLIDE 6

Shrinkage Normal posterior means – equivalent representations

1) Minimizer of weighted average risk

◮ Minimize weighted average risk (= Bayes risk),

◮ averaging loss $L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$ over both

  • 1. the sampling distribution $f_{X|\theta}$, and
  • 2. values of $\theta$, weighted using the decision weights (prior) $\pi_\theta$.

◮ Formally,

  $\hat{\theta}(\cdot) = \operatorname{argmin}_{t(\cdot)} \int E_\theta\!\left[L(t(X), \theta)\right] \, d\pi(\theta).$

6 / 37

SLIDE 7

Shrinkage Normal posterior means – equivalent representations

2) Minimizer of posterior expected loss

◮ Minimize posterior expected loss,

◮ averaging loss $L(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$ over

  • 1. just the posterior distribution $\pi_{\theta|X}$.

◮ Formally,

  $\hat{\theta}(x) = \operatorname{argmin}_{t} \int L(t, \theta) \, d\pi_{\theta|X}(\theta \mid x).$

7 / 37

SLIDE 8

Shrinkage Normal posterior means – equivalent representations

3 and 4) Posterior expectation and posterior best linear predictor

◮ Note that

  $\begin{pmatrix} X \\ \theta \end{pmatrix} \sim N\!\left(0, \begin{pmatrix} C + I & C \\ C & C \end{pmatrix}\right).$

◮ Posterior expectation:

  $\hat{\theta} = E[\theta \mid X].$

◮ Posterior best linear predictor:

  $\hat{\theta} = E^*[\theta \mid X] = C \cdot (C + I)^{-1} \cdot X.$

8 / 37

SLIDE 9

Shrinkage Normal posterior means – equivalent representations

5) Penalization

◮ Minimize

  • 1. the sum of squared residuals,
  • 2. plus a quadratic penalty term.

◮ Formally,

  $\hat{\theta} = \operatorname{argmin}_{t} \sum_{i=1}^{n} (X_i - t_i)^2 + \|t\|^2,$

◮ where

  $\|t\|^2 = t' C^{-1} t.$

9 / 37

SLIDE 10

Shrinkage Normal posterior means – equivalent representations

6) Shrinkage

◮ Diagonalize C: Find

  • 1. an orthonormal matrix U of eigenvectors, and
  • 2. a diagonal matrix D of eigenvalues, so that

  $C = U D U'.$

◮ Change of coordinates, using U:

  $\tilde{X} = U' X, \qquad \tilde{\theta} = U' \theta.$

◮ Componentwise shrinkage in the new coordinates:

  $\hat{\tilde{\theta}}_i = \frac{d_i}{d_i + 1} \, \tilde{X}_i. \qquad (1)$

10 / 37
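The following is a minimal numerical sketch (not part of the original slides) checking that representations 4)–6) coincide. The prior covariance C and the draw of X below are arbitrary illustrative choices.

```python
import numpy as np

# Sketch: verify numerically that the posterior BLP, penalized least squares,
# and componentwise shrinkage in the eigenbasis of C all agree.
rng = np.random.default_rng(0)
k = 5
A = rng.normal(size=(k, k))
C = A @ A.T + np.eye(k)          # illustrative prior covariance (positive definite)
theta = rng.multivariate_normal(np.zeros(k), C)
X = theta + rng.normal(size=k)   # X | theta ~ N(theta, I_k)

# 4) Posterior best linear predictor: C (C + I)^{-1} X
blp = C @ np.linalg.solve(C + np.eye(k), X)

# 5) Penalized least squares: argmin_t ||X - t||^2 + t' C^{-1} t
#    First order condition: (I + C^{-1}) t = X
pls = np.linalg.solve(np.eye(k) + np.linalg.inv(C), X)

# 6) Componentwise shrinkage d_i / (d_i + 1) in the eigenbasis of C
d, U = np.linalg.eigh(C)         # C = U diag(d) U'
shrink = U @ (d / (d + 1) * (U.T @ X))

print(np.allclose(blp, pls), np.allclose(blp, shrink))  # True True
```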

SLIDE 11

Shrinkage Normal posterior means – equivalent representations

Practice problem

Show that these 6 objects are all equivalent to each other.

11 / 37

SLIDE 12

Shrinkage Normal posterior means – equivalent representations

Solution (sketch)

  • 1. Minimizer of weighted average risk = minimizer of posterior expected loss: See decision slides.

  • 2. Minimizer of posterior expected loss = posterior expectation:
    ◮ First order condition for the quadratic loss function,
    ◮ pull the derivative inside,
    ◮ and switch the order of integration.

  • 3. Posterior expectation = posterior best linear predictor:
    ◮ X and θ are jointly Normal,
    ◮ conditional expectations for multivariate Normals are linear.

  • 4. Posterior expectation ⇒ penalized least squares:
    ◮ The posterior is symmetric and unimodal ⇒ the posterior mean is the posterior mode.
    ◮ Posterior mode = maximizer of the posterior log-likelihood = maximizer of the joint log-likelihood,
    ◮ since the denominator $f_X$ does not depend on θ.

12 / 37

SLIDE 13

Shrinkage Normal posterior means – equivalent representations

Solution (sketch) continued

  • 5. Penalized least squares ⇒ posterior expectation:
    ◮ Any penalty of the form $t' A t$, for A symmetric positive definite,
    ◮ corresponds to the log of a Normal prior $\theta \sim N(0, A^{-1})$.

  • 6. Componentwise shrinkage = posterior best linear predictor:
    ◮ The change of coordinates turns $\hat{\theta} = C \cdot (C + I)^{-1} \cdot X$ into $\hat{\tilde{\theta}} = D \cdot (D + I)^{-1} \cdot \tilde{X}$.
    ◮ Diagonality implies $D \cdot (D + I)^{-1} = \operatorname{diag}\!\left(\tfrac{d_i}{d_i + 1}\right)$.

13 / 37

SLIDE 14

Shrinkage Gaussian process regression

Gaussian processes for machine learning
Machine Learning ⇔ metrics dictionary

machine learning        metrics
supervised learning     regression
features                regressors
weights                 coefficients
bias                    intercept

14 / 37

SLIDE 15

Shrinkage Gaussian process regression

Gaussian prior for linear regression

◮ Normal linear regression model:

◮ Suppose we observe n i.i.d. draws of $(Y_i, X_i)$, where $Y_i$ is real valued and $X_i$ is a k-vector.

◮ $Y_i = X_i \cdot \beta + \varepsilon_i$

◮ $\varepsilon_i \mid X, \beta \sim N(0, \sigma^2)$

◮ $\beta \mid X \sim N(0, \Omega)$ (prior)

◮ Note: we will leave conditioning on X implicit in the following slides.

15 / 37

SLIDE 16

Shrinkage Gaussian process regression

Practice problem (“weight space view”)

◮ Find the posterior expectation of β.

◮ Hints:

  • 1. The posterior expectation is the maximum a posteriori.
  • 2. The log likelihood takes a penalized least squares form.

◮ Find the posterior expectation of $x \cdot \beta$ for some (non-random) point x.

16 / 37

SLIDE 17

Shrinkage Gaussian process regression

Solution

◮ Joint log likelihood of Y, β:

  $\log(f_{Y,\beta}) = \log(f_{Y|\beta}) + \log(f_\beta)$
  $= \text{const.} - \frac{1}{2\sigma^2} \sum_i (Y_i - X_i \beta)^2 - \frac{1}{2} \beta' \Omega^{-1} \beta.$

◮ First order condition for the maximum a posteriori:

  $0 = \frac{\partial \log f_{Y,\beta}}{\partial \beta} = \frac{1}{\sigma^2} \sum_i (Y_i - X_i \beta) \cdot X_i - \beta' \Omega^{-1}.$

  $\Rightarrow \quad \hat{\beta} = \left(\sum_i X_i' X_i + \sigma^2 \Omega^{-1}\right)^{-1} \cdot \sum_i X_i' Y_i.$

◮ Thus

  $E[x \cdot \beta \mid Y] = x \cdot \hat{\beta} = x \cdot \left(X' X + \sigma^2 \Omega^{-1}\right)^{-1} \cdot X' Y.$

17 / 37
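A small numpy sketch of the weight-space formula $\hat{\beta} = (X'X + \sigma^2 \Omega^{-1})^{-1} X'Y$. The simulated data, Ω, and σ² below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Weight-space view: posterior mean of beta in the Normal linear model.
# The data-generating choices below are illustrative.
rng = np.random.default_rng(1)
n, k, sigma2 = 50, 3, 1.0
Omega = np.eye(k)                         # prior covariance of beta
X = rng.normal(size=(n, k))
beta = rng.multivariate_normal(np.zeros(k), Omega)
Y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)

# beta_hat = (X'X + sigma^2 Omega^{-1})^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X + sigma2 * np.linalg.inv(Omega), X.T @ Y)

x = np.array([1.0, 0.5, -0.5])            # a (non-random) evaluation point
print(x @ beta_hat)                       # posterior expectation of x . beta
```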

SLIDE 18

Shrinkage Gaussian process regression

◮ The previous derivation required inverting a k × k matrix.
◮ We can instead do prediction inverting an n × n matrix.
◮ n might be smaller than k if there are many “features.”
◮ This will lead to a “function space view” of prediction.

Practice problem (“kernel trick”)

◮ Find the posterior expectation of $f(x) = E[Y \mid X = x] = x \cdot \beta$.

◮ Wait, didn’t we just do that?

◮ Hints:

  • 1. Start by figuring out the variance / covariance matrix of $(x \cdot \beta, Y)$.
  • 2. Then deduce the best linear predictor of $x \cdot \beta$ given Y.

18 / 37

SLIDE 19

Shrinkage Gaussian process regression

Solution

◮ The joint distribution of $(x \cdot \beta, Y)$ is given by

  $\begin{pmatrix} x \cdot \beta \\ Y \end{pmatrix} \sim N\!\left(0, \begin{pmatrix} x \Omega x' & x \Omega X' \\ X \Omega x' & X \Omega X' + \sigma^2 I_n \end{pmatrix}\right).$

◮ Denote $C = X \Omega X'$ and $c(x) = x \Omega X'$.

◮ Then

  $E[x \cdot \beta \mid Y] = c(x) \cdot \left(C + \sigma^2 I_n\right)^{-1} \cdot Y.$

◮ Contrast with the previous representation:

  $E[x \cdot \beta \mid Y] = x \cdot \left(X' X + \sigma^2 \Omega^{-1}\right)^{-1} \cdot X' Y.$

19 / 37
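A quick numerical check (again with illustrative simulated data) that the weight-space and function-space (“kernel trick”) formulas give the same prediction.

```python
import numpy as np

# Check that the function-space formula c(x)(C + sigma^2 I)^{-1} Y, with
# C = X Omega X' and c(x) = x Omega X', matches the weight-space formula.
rng = np.random.default_rng(1)
n, k, sigma2 = 50, 3, 1.0
Omega = np.eye(k)
X = rng.normal(size=(n, k))
Y = X @ rng.normal(size=k) + rng.normal(size=n)
x = np.array([1.0, 0.5, -0.5])

# Weight space: invert a k x k matrix
weight_space = x @ np.linalg.solve(X.T @ X + sigma2 * np.linalg.inv(Omega), X.T @ Y)

# Function space: invert an n x n matrix
C = X @ Omega @ X.T
c_x = x @ Omega @ X.T
function_space = c_x @ np.linalg.solve(C + sigma2 * np.eye(n), Y)

print(np.allclose(weight_space, function_space))  # True
```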

SLIDE 20

Shrinkage Gaussian process regression

General GP regression

◮ Suppose we observe n i.i.d. draws of (Yi,Xi), where Yi is real

valued and Xi is a k vector.

◮ $Y_i = f(X_i) + \varepsilon_i$

◮ $\varepsilon_i \mid X, f(\cdot) \sim N(0, \sigma^2)$

◮ Prior: f is distributed according to a Gaussian process, $f \mid X \sim GP(0, C)$, where C is a covariance kernel, $\operatorname{Cov}(f(x), f(x') \mid X) = C(x, x')$.

◮ We will again leave conditioning on X implicit in following slides.

20 / 37

SLIDE 21

Shrinkage Gaussian process regression

Practice problem

◮ Find the posterior expectation of f(x).
◮ Hints:

  • 1. Start by figuring out the variance / covariance matrix of (f(x),Y).
  • 2. Then deduce the best linear predictor of f(x) given Y.

21 / 37

SLIDE 22

Shrinkage Gaussian process regression

Solution

◮ The joint distribution of $(f(x), Y)$ is given by

  $\begin{pmatrix} f(x) \\ Y \end{pmatrix} \sim N\!\left(0, \begin{pmatrix} C(x,x) & c(x)' \\ c(x) & C + \sigma^2 I_n \end{pmatrix}\right),$

where

◮ $c(x)$ is the n-vector with entries $C(x, X_i)$,

◮ and C is the n × n matrix with entries $C_{i,j} = C(X_i, X_j)$.

◮ Then, as before,

  $E[f(x) \mid Y] = c(x) \cdot \left(C + \sigma^2 I_n\right)^{-1} \cdot Y.$

◮ Read:

  $\hat{f}(\cdot) = E[f(\cdot) \mid Y]$

◮ is a linear combination of the functions $C(\cdot, X_i)$

◮ with weights $\left(C + \sigma^2 I_n\right)^{-1} \cdot Y.$

22 / 37
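A minimal GP-regression sketch of $E[f(x) \mid Y] = c(x)(C + \sigma^2 I_n)^{-1} Y$, assuming the squared-exponential kernel introduced on the next slide; the kernel hyperparameters and data are illustrative.

```python
import numpy as np

# Posterior mean of a GP regression. Kernel, hyperparameters, and data are
# illustrative assumptions, not from the slides.
def kernel(x1, x2, l=0.5, tau2=1.0):
    """Squared-exponential kernel tau^2 exp(-||x1 - x2||^2 / (2 l))."""
    return tau2 * np.exp(-0.5 * (x1 - x2) ** 2 / l)

rng = np.random.default_rng(2)
n, sigma2 = 30, 0.1
X = np.sort(rng.uniform(0, 3, size=n))
Y = np.sin(2 * X) + np.sqrt(sigma2) * rng.normal(size=n)

C = kernel(X[:, None], X[None, :])              # n x n matrix C_ij = C(X_i, X_j)
weights = np.linalg.solve(C + sigma2 * np.eye(n), Y)

def f_hat(x):
    """Posterior mean: a linear combination of C(., X_i) with the weights above."""
    return kernel(x, X) @ weights

print(f_hat(1.5))
```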

SLIDE 23

Shrinkage Gaussian process regression

Hyperparameters and marginal likelihood

◮ Usually, the covariance kernel C(·,·) depends on hyperparameters η.

◮ Example: squared exponential kernel with $\eta = (l, \tau^2)$ (length-scale l, variance $\tau^2$):

  $C(x, x') = \tau^2 \cdot \exp\!\left(-\tfrac{1}{2l} \|x - x'\|^2\right).$

◮ Following the empirical Bayes paradigm, we can estimate η by maximizing the marginal log likelihood:

  $\hat{\eta} = \operatorname{argmax}_{\eta} \; -\tfrac{1}{2} \log\left|\det(C_\eta + \sigma^2 I)\right| - \tfrac{1}{2} Y' (C_\eta + \sigma^2 I)^{-1} Y.$

◮ Alternatively, we could choose η using cross-validation or Stein’s unbiased risk estimate.

23 / 37
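A sketch of the empirical Bayes step via grid search over the marginal log likelihood. The grid, the data, and the fixed σ² are illustrative assumptions; a numerical optimizer would work equally well.

```python
import numpy as np

# Empirical Bayes: pick eta = (l, tau2) maximizing
#   -1/2 log det(C_eta + sigma^2 I) - 1/2 Y'(C_eta + sigma^2 I)^{-1} Y.
rng = np.random.default_rng(3)
n, sigma2 = 30, 0.1
X = np.sort(rng.uniform(0, 3, size=n))
Y = np.sin(2 * X) + np.sqrt(sigma2) * rng.normal(size=n)

def marginal_loglik(l, tau2):
    C = tau2 * np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / l)
    K = C + sigma2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * logdet - 0.5 * Y @ np.linalg.solve(K, Y)

# Illustrative grid over hyperparameters
grid = [(l, tau2) for l in (0.1, 0.3, 1.0, 3.0) for tau2 in (0.3, 1.0, 3.0)]
l_hat, tau2_hat = max(grid, key=lambda eta: marginal_loglik(*eta))
print(l_hat, tau2_hat)
```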

SLIDE 24

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Splines and Reproducing Kernel Hilbert Spaces

◮ Penalized least squares: For some (semi-)norm $\|f\|$,

  $\hat{f} = \operatorname{argmin}_{f} \sum_i (Y_i - f(X_i))^2 + \lambda \|f\|^2.$

◮ Leading case: Splines, e.g.,

  $\hat{f} = \operatorname{argmin}_{f} \sum_i (Y_i - f(X_i))^2 + \lambda \int f''(x)^2 \, dx.$

◮ Can we think of penalized regressions in terms of a prior?

◮ If so, what is the prior distribution?

24 / 37

SLIDE 25

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

The finite dimensional case

◮ Consider the finite dimensional analog to penalized regression:

  $\hat{\theta} = \operatorname{argmin}_{t} \sum_{i=1}^{n} (X_i - t_i)^2 + \|t\|_C^2,$

where

  $\|t\|_C^2 = t' C^{-1} t.$

◮ We saw before that this is the posterior mean when

◮ $X \mid \theta \sim N(\theta, I_k)$,

◮ $\theta \sim N(0, C)$.

25 / 37

SLIDE 26

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

The reproducing property

◮ The norm $\|t\|_C$ corresponds to the inner product $\langle t, s\rangle_C = t' C^{-1} s$.

◮ Let $C_i = (C_{i1}, \ldots, C_{ik})'$.

◮ Then, for any vector y, $\langle C_i, y\rangle_C = y_i$.

Practice problem

Verify this.

26 / 37
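A one-line numerical check of the finite-dimensional reproducing property; the matrix C and vector y below are arbitrary illustrative choices.

```python
import numpy as np

# Finite-dimensional reproducing property: <C_i, y>_C = C_i' C^{-1} y = y_i,
# since C_i' C^{-1} is the i-th row of C C^{-1} = I.
rng = np.random.default_rng(4)
k = 4
A = rng.normal(size=(k, k))
C = A @ A.T + np.eye(k)                      # illustrative positive definite C
y = rng.normal(size=k)

i = 2
C_i = C[i]                                   # i-th row (= column, C symmetric)
print(np.isclose(C_i @ np.linalg.solve(C, y), y[i]))  # True
```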

SLIDE 27

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Reproducing kernel Hilbert spaces

◮ Now consider a general Hilbert space of functions equipped with an inner product $\langle \cdot, \cdot \rangle$ and corresponding norm $\|\cdot\|$,

◮ such that for all x there exists an $M_x$ such that for all f,

  $|f(x)| \le M_x \cdot \|f\|.$

◮ Read: “Function evaluation is continuous with respect to the norm $\|\cdot\|$.”

◮ Hilbert spaces with this property are called reproducing kernel

Hilbert spaces (RKHS).

◮ Note that L2 spaces are not RKHS in general!

27 / 37

SLIDE 28

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

The reproducing kernel

◮ Riesz representation theorem:

  For every continuous linear functional L on a Hilbert space $\mathcal{H}$, there exists a $g_L \in \mathcal{H}$ such that for all $f \in \mathcal{H}$, $L(f) = \langle g_L, f \rangle$.

◮ Applied to function evaluation on an RKHS:

  $f(x) = \langle C_x, f \rangle.$

◮ Define the reproducing kernel:

  $C(x_1, x_2) = \langle C_{x_1}, C_{x_2} \rangle.$

◮ By construction:

  $C(x_1, x_2) = C_{x_1}(x_2) = C_{x_2}(x_1).$

28 / 37

SLIDE 29

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Practice problem

◮ Show that C(·,·) is positive semi-definite, i.e., for any $(x_1, \ldots, x_k)$ and $(a_1, \ldots, a_k)$,

  $\sum_{i,j} a_i a_j C(x_i, x_j) \ge 0.$

◮ Given a positive definite kernel C(·,·),

construct a corresponding Hilbert space.

29 / 37

SLIDE 30

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Solution

◮ Positive definiteness:

  $\sum_{i,j} a_i a_j C(x_i, x_j) = \sum_{i,j} a_i a_j \langle C_{x_i}, C_{x_j} \rangle = \Big\langle \sum_i a_i C_{x_i}, \sum_j a_j C_{x_j} \Big\rangle = \Big\| \sum_i a_i C_{x_i} \Big\|^2 \ge 0.$

◮ Construction of the Hilbert space: Take linear combinations of the functions C(x,·) (and their limits), with inner product

  $\Big\langle \sum_i a_i C(x_i, \cdot), \sum_j b_j C(y_j, \cdot) \Big\rangle_C = \sum_{i,j} a_i b_j C(x_i, y_j).$

30 / 37

SLIDE 31

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

◮ Kolmogorov consistency theorem:

For a positive definite kernel C(·,·) we can always define a corresponding prior f ∼ GP(0,C).

◮ Recap:

  ◮ For each regression penalty,
  ◮ when function evaluation is continuous w.r.t. the penalty norm,
  ◮ there exists a corresponding prior.

◮ Next:

  ◮ The solution to the penalized regression problem
  ◮ is the posterior mean for this prior.

31 / 37

SLIDE 32

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Solution to penalized regression

◮ Let $\hat{f}$ be the solution to the penalized regression

  $\hat{f} = \operatorname{argmin}_{f} \sum_i (Y_i - f(X_i))^2 + \lambda \|f\|_C^2.$

Practice problem

◮ Show that the solution to the penalized regression has the form

  $\hat{f}(x) = c(x) \cdot (C + n\lambda I)^{-1} \cdot Y,$

  where $C_{ij} = C(X_i, X_j)$ and $c(x) = (C(X_1, x), \ldots, C(X_n, x))$.

◮ Hints:

  ◮ Write $\hat{f}(\cdot) = \sum_i a_i \cdot C(X_i, \cdot) + \rho(\cdot)$,
  ◮ where ρ is orthogonal to $C(X_i, \cdot)$ for all i.
  ◮ Show that ρ = 0.
  ◮ Solve the resulting least squares problem in $a_1, \ldots, a_n$.

32 / 37

SLIDE 33

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Solution

◮ Using the reproducing property, the objective can be written as

  $\sum_i (Y_i - f(X_i))^2 + \lambda \|f\|_C^2$
  $= \sum_i \big(Y_i - \langle C(X_i, \cdot), f \rangle\big)^2 + \lambda \|f\|_C^2$
  $= \sum_i \Big(Y_i - \Big\langle C(X_i, \cdot), \sum_j a_j \cdot C(X_j, \cdot) + \rho \Big\rangle\Big)^2 + \lambda \Big\| \sum_i a_i \cdot C(X_i, \cdot) + \rho \Big\|_C^2$
  $= \sum_i \Big(Y_i - \sum_j a_j \cdot C(X_i, X_j)\Big)^2 + \lambda \Big( \sum_{i,j} a_i a_j C(X_i, X_j) + \|\rho\|_C^2 \Big)$
  $= \|Y - C \cdot a\|^2 + \lambda \big(a' C a + \|\rho\|_C^2\big).$

◮ Given a, this is minimized by setting ρ = 0.

◮ Now solve the quadratic program using first order conditions.

33 / 37
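A sketch of the penalized-regression (kernel ridge) solution using the formula from the practice problem; the kernel, data, and λ are the same kind of illustrative choices as in the earlier GP sketch.

```python
import numpy as np

# Penalized regression solution from the practice problem:
#   f_hat(x) = c(x) (C + n*lambda*I)^{-1} Y,
# with C_ij = C(X_i, X_j) and c(x) = (C(X_1, x), ..., C(X_n, x)).
rng = np.random.default_rng(2)
n, lam = 30, 0.01
X = np.sort(rng.uniform(0, 3, size=n))
Y = np.sin(2 * X) + 0.3 * rng.normal(size=n)

def kernel(x1, x2, l=0.5, tau2=1.0):
    return tau2 * np.exp(-0.5 * (x1 - x2) ** 2 / l)

C = kernel(X[:, None], X[None, :])
a = np.linalg.solve(C + n * lam * np.eye(n), Y)     # coefficients on C(X_i, .)

def f_hat(x):
    return kernel(x, X) @ a

# Note: this coincides with the GP posterior mean from the earlier sketch
# when n * lam equals the noise variance sigma^2.
print(f_hat(1.5))
```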

SLIDE 34

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Splines

◮ Now what about the spline penalty $\int f''(x)^2 \, dx$?

◮ Is function evaluation continuous for this norm?

◮ Yes, if we restrict to functions such that $f(0) = f'(0) = 0$.

◮ The penalty is a semi-norm that equals 0 for all linear functions.

◮ It corresponds to the GP prior with

  $C(x_1, x_2) = \frac{x_1 x_2^2}{2} - \frac{x_2^3}{6} \quad \text{for } x_2 \le x_1.$

◮ This is in fact the covariance of integrated Brownian motion!

34 / 37
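A sketch of the spline connection: the GP posterior mean under the integrated-Brownian-motion kernel $C(x_1, x_2) = x_1 x_2^2 / 2 - x_2^3 / 6$ (for $x_2 \le x_1$), on illustrative data in [0, 1]. This imposes f(0) = f'(0) = 0 and omits the unpenalized linear part discussed on the next slide.

```python
import numpy as np

# Covariance kernel of integrated Brownian motion:
#   C(x1, x2) = x1 * x2^2 / 2 - x2^3 / 6   for x2 <= x1 (symmetric otherwise).
def ibm_kernel(x1, x2):
    lo, hi = np.minimum(x1, x2), np.maximum(x1, x2)
    return hi * lo ** 2 / 2 - lo ** 3 / 6

rng = np.random.default_rng(5)
n, sigma2 = 25, 0.01                         # illustrative data and noise level
X = np.sort(rng.uniform(0, 1, size=n))
Y = np.sin(4 * X) + np.sqrt(sigma2) * rng.normal(size=n)

C = ibm_kernel(X[:, None], X[None, :])
weights = np.linalg.solve(C + sigma2 * np.eye(n), Y)

def f_hat(x):
    """Posterior mean under the prior with f(0) = f'(0) = 0; the spline
    estimator additionally allows an unpenalized linear term (next slide)."""
    return ibm_kernel(x, X) @ weights

print(f_hat(0.5))
```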

SLIDE 35

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Practice problem

Verify that C is indeed the reproducing kernel for the inner product

  $\langle f, g \rangle = \int_0^1 f''(x) \, g''(x) \, dx.$

◮ Takeaway: Spline regression is equivalent to the limit of a posterior mean where the prior is such that $f(x) = A_0 + A_1 \cdot x + g(x)$, where $g \sim GP(0, C)$ and $A \sim N(0, v \cdot I)$, as $v \to \infty$.

35 / 37

SLIDE 36

Shrinkage Splines and Reproducing Kernel Hilbert Spaces

Solution

◮ Have to show: $\langle C_x, g \rangle = g(x)$.

◮ Plug in the definition of $C_x$.

◮ Last 2 steps: use integration by parts, and use $g(0) = g'(0) = 0$.

◮ This yields:

  $\langle C_x, g \rangle = \int_0^1 C_x''(y) \, g''(y) \, dy$
  $= \int_0^x \left(\frac{x y^2}{2} - \frac{y^3}{6}\right)'' g''(y) \, dy + \int_x^1 \left(\frac{y x^2}{2} - \frac{x^3}{6}\right)'' g''(y) \, dy$
  $= \int_0^x (x - y) \, g''(y) \, dy$
  $= x \cdot (g'(x) - g'(0)) + \int_0^x g'(y) \, dy - \big(y \, g'(y)\big)\Big|_{y=0}^{x}$
  $= g(x).$

36 / 37

SLIDE 37

Shrinkage References

References

◮ Gaussian process priors:

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, chapter 2.

◮ Splines and Reproducing Kernel Hilbert Spaces

Wahba, G. (1990). Spline models for observational data, volume 59. Society for Industrial Mathematics, chapter 1.

37 / 37