High Dimensional Predictive Inference Workshop on Current Trends - PowerPoint PPT Presentation

High Dimensional Predictive Inference Workshop on Current Trends and Challenges in Model Selection and Related Areas Vienna, Austria July 2008 Ed George The Wharton School (joint work with L. Brown, F. Liang, and X. Xu)

1. Estimating a Normal Mean: A Brief History
• Observe $X \mid \mu \sim N_p(\mu, I)$ and estimate $\mu$ by $\hat\mu$ under $R_Q(\mu, \hat\mu) = E_\mu \|\hat\mu(X) - \mu\|^2$
• $\hat\mu_{MLE}(X) = X$ is the MLE, best invariant and minimax with constant risk
• Shocking Fact: $\hat\mu_{MLE}$ is inadmissible when $p \ge 3$. (Stein 1956)
• Bayes rules are a good place to look for improvements
• For a prior $\pi(\mu)$, the Bayes rule $\hat\mu_\pi(X) = E_\pi(\mu \mid X)$ minimizes $E_\pi R_Q(\mu, \hat\mu)$
• Remark: The (formal) Bayes rule under $\pi_U(\mu) \equiv 1$ is $\hat\mu_U(X) \equiv \hat\mu_{MLE}(X) = X$

• $\hat\mu_H(X)$, the Bayes rule under the Harmonic prior $\pi_H(\mu) = \|\mu\|^{-(p-2)}$, dominates $\hat\mu_U$ when $p \ge 3$. (Stein 1974)
• $\hat\mu_a(X)$, the Bayes rule under $\pi_a(\mu)$ where $\mu \mid s \sim N_p(0, sI)$, $s \sim (1+s)^{a-2}$, dominates $\hat\mu_U$ and is proper Bayes when $p = 5$ and $a \in [.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$. (Strawderman 1971)
• A Unifying Phenomenon: These domination results can be attributed to properties of the marginal distribution of $X$ under $\pi_H$ and $\pi_a$.

• The Bayes rule under $\pi(\mu)$ can be expressed as $\hat\mu_\pi(X) = E_\pi(\mu \mid X) = X + \nabla \log m_\pi(X)$, where $m_\pi(X) \propto \int e^{-\|X-\mu\|^2/2}\, \pi(\mu)\, d\mu$ is the marginal of $X$ under $\pi(\mu)$, and $\nabla = \left(\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_p}\right)'$. (Brown 1971)
• The risk improvement of $\hat\mu_\pi(X)$ over $\hat\mu_U(X)$ can be expressed as
$R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) = E_\mu\left[ \|\nabla \log m_\pi(X)\|^2 - 2\, \frac{\nabla^2 m_\pi(X)}{m_\pi(X)} \right] = E_\mu\left[ -4\, \nabla^2 \sqrt{m_\pi(X)} \,\Big/\, \sqrt{m_\pi(X)} \right]$,
where $\nabla^2 = \sum_i \frac{\partial^2}{\partial x_i^2}$. (Stein 1974, 1981)
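Brown's representation can be checked numerically in a case where everything is available in closed form. The sketch below assumes a conjugate prior $\mu \sim N_p(0, \tau^2 I)$, which is an illustrative choice and not one of the priors discussed in the talk; the names `log_marginal`, `numerical_gradient` and `tau2` are hypothetical. Under this prior the marginal is $N_p(0, (1+\tau^2) I)$ and the posterior mean is $\frac{\tau^2}{1+\tau^2} X$, so the two routes to the Bayes rule should agree.

```python
import numpy as np

def log_marginal(x, tau2):
    """log m_pi(x) for X | mu ~ N_p(mu, I) under the illustrative prior mu ~ N_p(0, tau2 * I)."""
    p = x.size
    var = 1.0 + tau2  # marginal variance of each coordinate of X
    return -0.5 * p * np.log(2.0 * np.pi * var) - 0.5 * np.dot(x, x) / var

def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at the point x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

def bayes_rule_via_marginal(x, tau2):
    """Brown's representation: X + grad log m_pi(X), with the gradient taken numerically."""
    return x + numerical_gradient(lambda z: log_marginal(z, tau2), x)

def posterior_mean(x, tau2):
    """Closed-form posterior mean under the conjugate normal prior."""
    return tau2 / (1.0 + tau2) * x

x = np.array([1.5, -0.3, 2.0, 0.7])
tau2 = 2.0
brown = bayes_rule_via_marginal(x, tau2)
direct = posterior_mean(x, tau2)
print(brown)
print(direct)
```

The two printed vectors should match up to finite-difference error, since $\nabla \log m_\pi(x) = -x/(1+\tau^2)$ here.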

• That $\hat\mu_H(X)$ dominates $\hat\mu_U$ when $p \ge 3$ follows from the fact that the marginal $m_\pi(X)$ under $\pi_H$ is superharmonic, i.e. $\nabla^2 m_\pi(X) \le 0$
• That $\hat\mu_a(X)$ dominates $\hat\mu_U$ when $p \ge 5$ (and conditions on $a$) follows from the fact that the square root of the marginal under $\pi_a$ is superharmonic, i.e. $\nabla^2 \sqrt{m_\pi(X)} \le 0$ (Fourdrinier, Strawderman and Wells 1998)
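The relation between the two superharmonicity conditions can be made explicit with a short calculation. The derivation below is a sketch using only the chain rule, for a positive, twice-differentiable marginal $m = m_\pi$; it also shows why the two forms of Stein's risk expression coincide.

```latex
\nabla \sqrt{m} = \frac{\nabla m}{2\sqrt{m}},
\qquad
\nabla^2 \sqrt{m} = \frac{\nabla^2 m}{2\sqrt{m}} - \frac{\|\nabla m\|^2}{4\, m^{3/2}}
\quad\Longrightarrow\quad
\frac{\nabla^2 \sqrt{m}}{\sqrt{m}}
  = \frac{1}{2}\,\frac{\nabla^2 m}{m} - \frac{1}{4}\,\|\nabla \log m\|^2 .
```

Multiplying by $-4$ gives $\|\nabla \log m\|^2 - 2\,\nabla^2 m / m$. Since the subtracted term is nonnegative, $\nabla^2 m \le 0$ implies $\nabla^2 \sqrt{m} \le 0$, so superharmonicity of $\sqrt{m}$ is the weaker of the two conditions.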

3. Bayes Rules for the Prediction Problem
• For a prior $\pi(\mu)$, the Bayes rule $p_\pi(y \mid x) = \int p(y \mid \mu)\, \pi(\mu \mid x)\, d\mu = E_\pi[\, p(y \mid \mu) \mid X \,]$ minimizes $\int R_{KL}(\mu, q)\, \pi(\mu)\, d\mu$ (Aitchison 1975)
• Let $p_U(y \mid x)$ denote the Bayes rule under $\pi_U(\mu) \equiv 1$
• $p_U(y \mid x)$ dominates $p(y \mid \hat\mu = x)$, the naive "plug-in" predictive distribution (Aitchison 1975)
• $p_U(y \mid x)$ is best invariant and minimax with constant risk (Murray 1977, Ng 1980, Barron and Liang 2003)
• Shocking Fact: $p_U(y \mid x)$ is inadmissible when $p \ge 3$
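The domination of the plug-in rule by $p_U$ can be illustrated with closed-form risks. The sketch below assumes the usual setup $X \mid \mu \sim N_p(\mu, v_x I)$, $Y \mid \mu \sim N_p(\mu, v_y I)$ with KL loss $\int p(y \mid \mu) \log \frac{p(y \mid \mu)}{q(y \mid x)}\, dy$; this setup is not spelled out in the text above and is stated here as an assumption. It relies on the standard facts that the plug-in rule is $N_p(x, v_y I)$ and $p_U(y \mid x)$ is $N_p(x, (v_x + v_y) I)$. The function name is hypothetical.

```python
import numpy as np

def kl_risk_normal_predictive(p, v_x, v_y, c):
    """KL risk of the predictive density N_p(x, c I).

    Assumes X | mu ~ N_p(mu, v_x I) and Y | mu ~ N_p(mu, v_y I).
    The risk is E_x KL( N(mu, v_y I) || N(x, c I) ) and does not depend on mu.
    """
    return 0.5 * p * (np.log(c / v_y) + (v_x + v_y) / c - 1.0)

p, v_x, v_y = 5, 1.0, 0.5

# Plug-in rule: predictive variance v_y (ignores the uncertainty in x).
risk_plug_in = kl_risk_normal_predictive(p, v_x, v_y, c=v_y)

# Uniform-prior Bayes rule: predictive variance v_x + v_y.
risk_uniform = kl_risk_normal_predictive(p, v_x, v_y, c=v_x + v_y)

# Scan predictive variances to see which one minimises the risk on a grid.
grid = np.linspace(0.1, 5.0, 4901)
best_c = grid[np.argmin(kl_risk_normal_predictive(p, v_x, v_y, grid))]

print(risk_plug_in, risk_uniform, best_c)
```

In this family the risks are $\frac{p}{2}\frac{v_x}{v_y}$ for the plug-in rule and $\frac{p}{2}\log\left(1 + \frac{v_x}{v_y}\right)$ for $p_U$, and the grid minimiser should sit at $c = v_x + v_y$.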

• $p_H(y \mid x)$, the Bayes rule under the Harmonic prior $\pi_H(\mu) = \|\mu\|^{-(p-2)}$, dominates $p_U(y \mid x)$ when $p \ge 3$. (Komaki 2001)
• $p_a(y \mid x)$, the Bayes rule under $\pi_a(\mu)$ where $\mu \mid s \sim N_p(0, s v_0 I)$, $s \sim (1+s)^{a-2}$, dominates $p_U(y \mid x)$ and is proper Bayes when $v_x \le v_0$ and when $p = 5$ and $a \in [.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$. (Liang 2002)
• Main Question: Are these domination results attributable to the properties of $m_\pi$?

4. A Key Representation for $p_\pi(y \mid x)$
• Let $m_\pi(x; v_x)$ denote the marginal of $X \mid \mu \sim N_p(\mu, v_x I)$ under $\pi(\mu)$.
• Lemma: The Bayes rule $p_\pi(y \mid x)$ can be expressed as
$p_\pi(y \mid x) = \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)}\, p_U(y \mid x)$,
where $W = \frac{v_y X + v_x Y}{v_x + v_y} \sim N_p(\mu, v_w I)$, so that $v_w = \frac{v_x v_y}{v_x + v_y}$.
• Using this, the risk improvement can be expressed as
$R_{KL}(\mu, p_U) - R_{KL}(\mu, p_\pi) = \iint p_{v_x}(x \mid \mu)\, p_{v_y}(y \mid \mu) \log \frac{p_\pi(y \mid x)}{p_U(y \mid x)}\, dx\, dy = E_{\mu, v_w} \log m_\pi(W; v_w) - E_{\mu, v_x} \log m_\pi(X; v_x)$
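The Lemma can be sanity-checked in a case where both sides are explicit. The sketch below assumes a conjugate prior $\mu \sim N_p(0, \tau^2 I)$ (an illustrative choice, not a prior from the talk), for which $m_\pi(z; v)$ is $N_p(0, (v + \tau^2) I)$, and uses the standard fact that $p_U(y \mid x)$ is $N_p(x, (v_x + v_y) I)$. It also uses $v_w = v_x v_y / (v_x + v_y)$, the variance of $W$. All function names are hypothetical.

```python
import numpy as np

def iso_normal_logpdf(z, mean, var):
    """Log density of N_p(mean, var * I) evaluated at z."""
    diff = z - mean
    return -0.5 * z.size * np.log(2.0 * np.pi * var) - 0.5 * np.dot(diff, diff) / var

def log_marginal(z, v, tau2):
    """log m_pi(z; v) under the illustrative prior mu ~ N_p(0, tau2 * I)."""
    return iso_normal_logpdf(z, np.zeros_like(z), v + tau2)

def log_p_uniform(y, x, v_x, v_y):
    """log p_U(y | x): the uniform-prior predictive density N_p(x, (v_x + v_y) I)."""
    return iso_normal_logpdf(y, x, v_x + v_y)

def log_p_bayes_direct(y, x, v_x, v_y, tau2):
    """log p_pi(y | x) computed directly from the conjugate posterior of mu."""
    shrink = tau2 / (tau2 + v_x)          # posterior mean is shrink * x
    post_var = tau2 * v_x / (tau2 + v_x)  # posterior variance of each coordinate
    return iso_normal_logpdf(y, shrink * x, v_y + post_var)

def log_p_bayes_lemma(y, x, v_x, v_y, tau2):
    """log p_pi(y | x) computed through the marginal-ratio representation."""
    w = (v_y * x + v_x * y) / (v_x + v_y)
    v_w = v_x * v_y / (v_x + v_y)
    return (log_marginal(w, v_w, tau2) - log_marginal(x, v_x, tau2)
            + log_p_uniform(y, x, v_x, v_y))

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
v_x, v_y, tau2 = 1.0, 0.5, 2.0
direct = log_p_bayes_direct(y, x, v_x, v_y, tau2)
via_lemma = log_p_bayes_lemma(y, x, v_x, v_y, tau2)
print(direct, via_lemma)
```

The two log densities should agree to floating-point accuracy for any `x`, `y` and any positive `v_x`, `v_y`, `tau2`.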

5. An Analogue of Stein's Unbiased Estimate of Risk
• Theorem:
$\frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v) = E_{\mu, v}\left[ \frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)} - \frac{1}{2} \|\nabla \log m_\pi(Z; v)\|^2 \right] = E_{\mu, v}\left[ 2\, \nabla^2 \sqrt{m_\pi(Z; v)} \,\Big/\, \sqrt{m_\pi(Z; v)} \right]$
• Proof relies on using the heat equation $\frac{\partial}{\partial v} m_\pi(z; v) = \frac{1}{2} \nabla^2 m_\pi(z; v)$, Brown's representation and Stein's Lemma.
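The Theorem can be checked numerically when both sides have closed forms. The sketch below assumes a conjugate prior $\mu \sim N_p(0, \tau^2 I)$ (illustrative only), so that $m_\pi(z; v)$ is $N_p(0, (v + \tau^2) I)$, $\nabla \log m_\pi = -z/(v + \tau^2)$ and $\nabla^2 m_\pi / m_\pi = -p/(v + \tau^2) + \|z\|^2/(v + \tau^2)^2$. The left side is approximated by a central finite difference in $v$. Function names are hypothetical.

```python
import numpy as np

def expected_log_marginal(mu, v, tau2):
    """E_{mu,v} log m_pi(Z; v) for Z ~ N_p(mu, v I) and prior mu ~ N_p(0, tau2 I)."""
    p = mu.size
    s = v + tau2                            # marginal variance per coordinate
    second_moment = np.dot(mu, mu) + p * v  # E ||Z||^2
    return -0.5 * p * np.log(2.0 * np.pi * s) - 0.5 * second_moment / s

def expected_unbiased_term(mu, v, tau2):
    """E_{mu,v}[ lap(m)/m - 0.5 * ||grad log m||^2 ] in closed form for this prior."""
    p = mu.size
    s = v + tau2
    second_moment = np.dot(mu, mu) + p * v
    # lap(m)/m = -p/s + ||z||^2 / s^2 and ||grad log m||^2 = ||z||^2 / s^2
    return -p / s + 0.5 * second_moment / s**2

mu = np.array([1.0, -2.0, 0.5])
v, tau2, h = 1.0, 2.0, 1e-5

# Left side: derivative in v of the expected log marginal, by central differences.
lhs = (expected_log_marginal(mu, v + h, tau2)
       - expected_log_marginal(mu, v - h, tau2)) / (2.0 * h)
# Right side: expectation of the unbiased estimate.
rhs = expected_unbiased_term(mu, v, tau2)
print(lhs, rhs)
```

This only confirms the identity for one convenient prior; the Theorem itself covers general $\pi$.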

6. General Conditions for Minimax Prediction
• Let $m_\pi(z; v)$ be the marginal distribution of $Z \mid \mu \sim N_p(\mu, vI)$ under $\pi(\mu)$.
• Theorem: If $m_\pi(z; v)$ is finite for all $z$, then $p_\pi(y \mid x)$ will be minimax if either of the following hold:
(i) $\sqrt{m_\pi(z; v)}$ is superharmonic
(ii) $m_\pi(z; v)$ is superharmonic
• Corollary: If $m_\pi(z; v)$ is finite for all $z$, then $p_\pi(y \mid x)$ will be minimax if $\pi(\mu)$ is superharmonic
• $p_\pi(y \mid x)$ will dominate $p_U(y \mid x)$ in the above results if the superharmonicity is strict on some interval.

7. An Explicit Connection Between the Two Problems
• Comparing Stein's unbiased quadratic risk expression with our unbiased KL risk expression reveals
$R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) = -2 \left[ \frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v) \right]_{v=1}$
• Combining this with our previous KL risk difference expression reveals a fascinating connection
$R_{KL}(\mu, p_U) - R_{KL}(\mu, p_\pi) = \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left[ R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) \right]_v dv$
• Ultimately it is this connection that yields the similar conditions for minimaxity and domination in both problems. Can we go further?
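The integral connection between the KL and quadratic risk differences can be verified numerically for a prior where everything is explicit. The sketch below assumes a conjugate prior $\mu \sim N_p(0, \tau^2 I)$ (illustrative; this prior is not minimax, but the identity does not require minimaxity), takes $v_w = v_x v_y / (v_x + v_y)$, and reads $[\,\cdot\,]_v$ as the quadratic risk difference when $X \sim N_p(\mu, vI)$. The integral is approximated with a hand-written trapezoid rule. Function names are hypothetical.

```python
import numpy as np

def expected_log_marginal(mu, v, tau2):
    """E_{mu,v} log m_pi(Z; v) for Z ~ N_p(mu, v I) and prior mu ~ N_p(0, tau2 I)."""
    p = mu.size
    s = v + tau2
    return -0.5 * p * np.log(2.0 * np.pi * s) - 0.5 * (np.dot(mu, mu) + p * v) / s

def quadratic_risk_gain(mu, v, tau2):
    """R_Q(mu, mu_U) - R_Q(mu, mu_pi) when X ~ N_p(mu, v I); v may be an array."""
    p = mu.size
    shrink = tau2 / (tau2 + v)  # Bayes rule is shrink * X
    risk_uniform = p * v
    risk_bayes = shrink**2 * p * v + (1.0 - shrink)**2 * np.dot(mu, mu)
    return risk_uniform - risk_bayes

def kl_gain_direct(mu, v_x, v_y, tau2):
    """R_KL(mu, p_U) - R_KL(mu, p_pi) from the difference of expected log marginals."""
    v_w = v_x * v_y / (v_x + v_y)
    return expected_log_marginal(mu, v_w, tau2) - expected_log_marginal(mu, v_x, tau2)

def kl_gain_via_integral(mu, v_x, v_y, tau2, n_grid=20001):
    """Same quantity via (1/2) * integral over [v_w, v_x] of quadratic gain / v^2."""
    v_w = v_x * v_y / (v_x + v_y)
    grid = np.linspace(v_w, v_x, n_grid)
    integrand = quadratic_risk_gain(mu, grid, tau2) / grid**2
    step = grid[1] - grid[0]
    integral = step * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return 0.5 * integral

mu = np.array([1.0, -2.0, 0.5])
v_x, v_y, tau2 = 1.0, 0.5, 2.0
direct = kl_gain_direct(mu, v_x, v_y, tau2)
integral = kl_gain_via_integral(mu, v_x, v_y, tau2)
print(direct, integral)
```

Agreement here is up to quadrature error only; a finer grid tightens it.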

8. Sufficient Conditions for Admissibility
• Let $B_{KL}(\pi, q) \equiv E_\pi[R_{KL}(\mu, q)]$ be the average KL risk of $q(y \mid x)$ under $\pi$.
• Theorem (Blyth's Method): If there is a sequence of finite nonnegative measures $\pi_n$ satisfying $\pi_n(\{\mu : \|\mu\| \le 1\}) \ge 1$ such that $B_{KL}(\pi_n, q) - B_{KL}(\pi_n, p_{\pi_n}) \to 0$, then $q$ is admissible.
• Theorem: For any two Bayes rules $p_\pi$ and $p_{\pi_n}$,
$B_{KL}(\pi_n, p_\pi) - B_{KL}(\pi_n, p_{\pi_n}) = \frac{1}{2} \int_{v_w}^{v_x} \frac{1}{v^2} \left[ B_Q(\pi_n, \hat\mu_\pi) - B_Q(\pi_n, \hat\mu_{\pi_n}) \right]_v dv$,
where $B_Q(\pi, \hat\mu)$ is the average quadratic risk of $\hat\mu$ under $\pi$.
• Using the explicit construction of $\pi_n(\mu)$ from Brown and Hwang (1984), we obtain tail behavior conditions that prove admissibility of $p_U(y \mid x)$ when $p \le 2$, and admissibility of $p_H(y \mid x)$ when $p \ge 3$.

9. A Complete Class Theorem
• Theorem: In the KL risk problem, all the admissible procedures are Bayes or formal Bayes procedures.
• Our proof uses the weak* topology from $L^\infty$ to $L^1$ to define convergence on the action space, which is the set of all proper densities on $R^p$.
• A Sketch of the Proof:
(i) All the admissible procedures are non-randomized.
(ii) For any admissible procedure $p(\cdot \mid x)$, there exists a sequence of priors $\pi_i(\mu)$ such that $p_{\pi_i}(\cdot \mid x) \to p(\cdot \mid x)$ weak* for a.e. $x$.
(iii) We can find a subsequence $\{\pi_{i''}\}$ and a limit prior $\pi$ such that $p_{\pi_{i''}}(\cdot \mid x) \to p_\pi(\cdot \mid x)$ weak* for a.e. $x$. Therefore, $p(\cdot \mid x) = p_\pi(\cdot \mid x)$ for a.e. $x$, i.e. $p(\cdot \mid x)$ is a Bayes or a formal Bayes rule.

10. Predictive Estimation for Linear Regression
• Observe $X_{m \times 1} = A_{m \times p}\, \beta_{p \times 1} + \varepsilon_{m \times 1}$ and predict $Y_{n \times 1} = B_{n \times p}\, \beta_{p \times 1} + \tau_{n \times 1}$
– $\varepsilon \sim N_m(0, I_m)$ is independent of $\tau \sim N_n(0, I_n)$
– $\mathrm{rank}(A'A) = p$
• Given a prior $\pi$ on $\beta$, the Bayes procedure $p^L_\pi(y \mid x)$ is
$p^L_\pi(y \mid x) = \frac{\int p(x \mid A\beta)\, p(y \mid B\beta)\, \pi(\beta)\, d\beta}{\int p(x \mid A\beta)\, \pi(\beta)\, d\beta}$
• The Bayes procedure $p^L_U(y \mid x)$ under the uniform prior $\pi_U \equiv 1$ is minimax with constant risk

11. The Key Marginal Representation
• For any prior $\pi$,
$p^L_\pi(y \mid x) = \frac{m_\pi(\hat\beta_{x,y}, (C'C)^{-1})}{m_\pi(\hat\beta_x, (A'A)^{-1})}\, p^L_U(y \mid x)$,
where $C_{(m+n) \times p} = (A', B')'$ and
$\hat\beta_x = (A'A)^{-1} A' x \sim N_p(\beta, (A'A)^{-1})$
$\hat\beta_{x,y} = (C'C)^{-1} C' (x', y')' \sim N_p(\beta, (C'C)^{-1})$
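The regression version of the marginal representation can be checked on simulated inputs. The sketch below assumes a conjugate prior $\beta \sim N_p(0, \tau^2 I)$ (illustrative only), so that $m_\pi(z, V)$ is the $N_p(0, V + \tau^2 I)$ density, and uses the standard result that the uniform-prior predictive $p^L_U(y \mid x)$ is $N_n(B\hat\beta_x, I_n + B(A'A)^{-1}B')$. The design matrices, dimensions and helper names are hypothetical.

```python
import numpy as np

def mvn_logpdf(z, mean, cov):
    """Log density of a multivariate normal N(mean, cov) evaluated at z."""
    d = z - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (z.size * np.log(2.0 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

rng = np.random.default_rng(1)
m, n, p, tau2 = 8, 3, 2, 1.5
A = rng.normal(size=(m, p))  # design for the observed x (full column rank almost surely)
B = rng.normal(size=(n, p))  # design for the future y
x = rng.normal(size=m)
y = rng.normal(size=n)

# Least-squares estimates from x alone and from (x, y) jointly.
AtA_inv = np.linalg.inv(A.T @ A)
C = np.vstack([A, B])
CtC_inv = np.linalg.inv(C.T @ C)
beta_x = AtA_inv @ A.T @ x
beta_xy = CtC_inv @ C.T @ np.concatenate([x, y])

# Uniform-prior predictive density p^L_U(y | x).
log_p_uniform = mvn_logpdf(y, B @ beta_x, np.eye(n) + B @ AtA_inv @ B.T)

# Direct Bayes predictive under the conjugate prior beta ~ N_p(0, tau2 I).
post_cov = np.linalg.inv(A.T @ A + np.eye(p) / tau2)
post_mean = post_cov @ A.T @ x
log_p_direct = mvn_logpdf(y, B @ post_mean, np.eye(n) + B @ post_cov @ B.T)

def log_marginal(z, V):
    """log m_pi(z, V): marginal of Z | beta ~ N_p(beta, V) under the conjugate prior."""
    return mvn_logpdf(z, np.zeros(p), V + tau2 * np.eye(p))

# Same predictive density via the ratio of marginals times p^L_U.
log_p_via_marginals = (log_marginal(beta_xy, CtC_inv)
                       - log_marginal(beta_x, AtA_inv) + log_p_uniform)
print(log_p_direct, log_p_via_marginals)
```

With $m = n = p = 1$ and $A = B = 1$ this reduces to the identity-covariance case with $v_x = v_y = 1$ and $v_w = 1/2$.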

12. Risk Improvement over $p^L_U(y \mid x)$
• Here the difference between the KL risks of $p^L_U(y \mid x)$ and $p^L_\pi(y \mid x)$ can be expressed as
$R_{KL}(\beta, p^L_U) - R_{KL}(\beta, p^L_\pi) = E_{\beta, (C'C)^{-1}} \log m_\pi(\hat\beta_{x,y}; (C'C)^{-1}) - E_{\beta, (A'A)^{-1}} \log m_\pi(\hat\beta_x; (A'A)^{-1})$
• Minimaxity of $p^L_\pi(y \mid x)$ is here obtained when $\frac{\partial}{\partial \omega} E_{\mu, V_\omega} \log m_\pi(Z; V_\omega) < 0$, where $V_\omega \equiv \omega (A'A)^{-1} + (1 - \omega)(C'C)^{-1}$
• This leads to weighted superharmonic conditions on $m_\pi$ and $\pi$ for minimaxity.

13. Minimax Shrinkage Towards 0
• Our Lemma representation $p_H(y \mid x) = \frac{m_H(w; v_w)}{m_H(x; v_x)}\, p_U(y \mid x)$ shows how $p_H(y \mid x)$ "shrinks $p_U(y \mid x)$ towards 0" by an adaptive multiplicative factor
• The following figure illustrates how this shrinkage occurs for various values of $x$.
