SLIDE 1

M-Estimation under High-Dimensional Asymptotics

DLD, Andrea Montanari 2014-05-01

SLIDE 2

“An out-of-the-park grand-slam home run”∗

Annals of Mathematical Statistics, 1964

∗ Richard Olshen

SLIDE 3

[image-only slide]

SLIDE 4

M-estimation Basics

Location model: Yᵢ = θ + Zᵢ, i = 1, …, n. Errors: Zᵢ ∼ F, not necessarily Gaussian.

“Loss” function ρ(t), e.g. t², |t|, −log f(t), …

(M)   min_θ ∑_{i=1}^{n} ρ(Yᵢ − θ)

Asymptotic distribution: √n(θ̂ₙ − θ) ⇒_D N(0, V), n → ∞.

Asymptotic variance, with ψ = ρ′:

V(ψ, F) = ∫ ψ² dF / (∫ ψ′ dF)²

Information bound: V(ψ, F) ≥ 1/I(F).
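To make the formula concrete, here is a minimal numerical sketch (assuming NumPy/SciPy are available; the Huber threshold λ = 1.345 and the choice F = N(0, 1) are illustrative, not from the slides). It computes the location M-estimate by direct minimization and evaluates V(ψ, F) = ∫ψ²dF / (∫ψ′dF)² by Monte Carlo.

```python
import numpy as np
from scipy.optimize import minimize_scalar

lam = 1.345  # illustrative Huber threshold

def rho(t):  # Huber loss: quadratic near zero, linear in the tails
    return np.where(np.abs(t) <= lam, t**2 / 2, lam * np.abs(t) - lam**2 / 2)

def psi(t):  # score psi = rho'
    return np.clip(t, -lam, lam)

rng = np.random.default_rng(0)
n, theta_true = 10_000, 2.0
Y = theta_true + rng.standard_normal(n)   # location model with F = N(0, 1)

# (M): minimize the empirical risk over the scalar theta
theta_hat = minimize_scalar(lambda th: rho(Y - th).sum()).x

# Monte Carlo for A = integral psi^2 dF and B = integral psi' dF
Z = rng.standard_normal(10**6)
A = np.mean(psi(Z)**2)
B = np.mean(np.abs(Z) <= lam)             # psi' = 1 on |z| < lam, else 0
print(theta_hat, A / B**2)                # the estimate, and V(psi, F)
```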

SLIDE 5

The One-Step Viewpoint

[Facsimile of the first page of: P. J. Bickel, “One-Step Huber Estimates in the Linear Model,” Journal of the American Statistical Association, June 1975, Vol. 70, No. 350 (Theory and Methods), 428. Abstract: “Simple ‘one-step’ versions of Huber’s (M) estimates for the linear model are introduced. Some relevant Monte Carlo results obtained in the Princeton project [1] are singled out and discussed. The large sample behavior of these procedures is examined under very mild regularity conditions.”]

JASA 1975

SLIDE 6

Regression M-estimation: the One-Step Viewpoint

Regression model: Yᵢ = Xᵢ′θ + Zᵢ, Zᵢ ∼ᵢᵢd F, i = 1, …, n.

Objective function of (M): R(ϑ) = ∑_{i=1}^{n} ρ(Yᵢ − Xᵢ′ϑ)

(M)   min_ϑ R(ϑ)

One-step estimate: with θ̃ₙ any √n-consistent estimate of θ,

θ̂₁ = θ̃ₙ − [Hess R|_{θ̃ₙ}]⁻¹ ∇R|_{θ̃ₙ}.

Effectiveness: with θ̂ the true solution of the (M)-equation,

E(θ̂₁ − θ̂)(θ̂₁ − θ̂)′ = o(n⁻¹).
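A minimal sketch of the one-step construction (NumPy only; the Huber score with λ = 1.345, the least-squares initializer, and t-distributed errors are illustrative assumptions): one Newton step on R(ϑ) from a √n-consistent starting point.

```python
import numpy as np

lam = 1.345                                         # illustrative threshold
psi = lambda t: np.clip(t, -lam, lam)               # rho'
dpsi = lambda t: (np.abs(t) <= lam).astype(float)   # rho'' (a.e.)

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.standard_normal((n, p))
theta = rng.standard_normal(p)
Y = X @ theta + rng.standard_t(df=3, size=n)        # heavy-tailed errors

theta_tilde = np.linalg.lstsq(X, Y, rcond=None)[0]  # sqrt(n)-consistent start
r = Y - X @ theta_tilde                             # residuals at theta_tilde
grad = -X.T @ psi(r)                                # gradient of R
hess = X.T @ (dpsi(r)[:, None] * X)                 # Hessian of R
theta_one_step = theta_tilde - np.linalg.solve(hess, grad)
```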

SLIDE 7

Driving Idea of Classical Asymptotics

The M-estimate is asymptotically equivalent to a single step of Newton’s method for finding a zero of ∇R, starting at the true underlying parameter. This goes back to Fisher’s ‘Method of Scoring’ for MLE.

SLIDE 8

Derivation of Asymptotic Variance Formula

Approximation to the one-step:

θ̂₁ = θ + (1/B(ψ, F)) (X′X)⁻¹ X′ (ψ(Zᵢ)) + o_p(n⁻¹/²),   where B(ψ, F) = ∫ ψ′ dF.

Observe that

Var((X′X)⁻¹ X′ (ψ(Zᵢ))) ∼ (X′X)⁻¹ A(ψ, F),   where A(ψ, F) = ∫ ψ² dF.

Hence, if Xᵢ,ⱼ ∼ N(0, 1/n),

Var(θ̂ᵢ − θᵢ) → A(ψ, F)/B(ψ, F)² = V(ψ, F).

SLIDE 9

Asymptotics for Regression, I

SLIDE 10

Asymptotics for Regression, II

PJ Huber, Annals of Statistics 1973

SLIDE 11

[image-only slide]

SLIDE 12

[image-only slide]

SLIDE 13

On robust regression with high-dimensional predictors

Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu

Department of Statistics, University of California, Berkeley, CA 94720; and Department of Statistics and Applied Probability, Faculty of Science, National University of Singapore, 119077

Contributed by Peter J. Bickel, April 25, 2013 (sent for review March 1, 2012)

We study regression M-estimates in the setting where p, the number of covariates, and n, the number of observations, are both large, but p ≤ n. We find an exact stochastic representation for the distribution of β̂ = argmin_{β∈ℝᵖ} ∑_{i=1}^{n} ρ(Yᵢ − Xᵢ′β) at fixed p and n under various assumptions on the objective function ρ and our statistical model. A scalar random variable whose deterministic limit r_ρ(κ) can be studied when p/n → κ > 0 plays a central role in this representation. We discover a nonlinear system of two deterministic equations that characterizes r_ρ(κ). Interestingly, the system shows that r_ρ(κ) depends on ρ through proximal mappings of ρ as well as various aspects of the statistical model underlying our study. Several surprising results emerge. In particular, we show that, when p/n is large enough, least squares becomes preferable to least absolute deviations for double-exponential errors.

prox function | high-dimensional statistics | concentration of measure

(While the paper was under review, we have managed to obtain rigorous proofs for many of our assertions. They will be presented elsewhere because they are very long and technical.) We give several results for covariates that are Gaussian or derived from Gaussian but present grounds that the behavior holds much more generally, the key being concentration of certain quadratic forms involving the vectors of covariates. We also investigate the sensitivity of our results to the geometry of the design matrix. [Further results with different designs can be found in our work (5).] We find that (i) estimates of coordinates and contrasts that have coefficients independent of the observed covariates continue to be unbiased and asymptotically normal; and (ii) as in the fixed p case, this happens at scale n⁻¹/², at least when the minimal and maximal eigenvalues of the covariance of the predictors stay bounded away from 0 and ∞, respectively. These findings are obtained by (i) using leave-one-out perturbation arguments both for the data units and predictors; (ii) …

SLIDE 14

Bickel, Yu, El Karoui

SLIDE 15

High-Dimensional Asymptotics (HDA)

◮ n, pₙ → ∞.
◮ Xᵢ,ⱼ ∼ᵢᵢd N(0, 1/n).
◮ pₙ⁻¹ ‖θ₀,ₙ‖₂² → τ₀².
◮ Y = Xθ₀ + Z.
◮ n/pₙ → δ ∈ (1, ∞).
◮ Meaning of δ: “# of observations per parameter to be estimated”.

SLIDE 16

Emergent Phenomena under HDA

Classical setting (random design, p fixed, n → ∞):
var(θ̂ᵢ) → V(ψ, F), n → ∞.

HDA setting, for n/pₙ → δ ∈ (1, ∞):
◮ Effective score Ψ̃ = Ψ̃_{δ,ψ,F} (to be described…)
◮ Effective error distribution F̃ = F ⋆ N(0, τ∞²), with extra Gaussian noise τ∞ = τ∞(δ, ψ, F).
◮ Asymptotic variance under HDA: var(θ̂ᵢ) → V(Ψ̃, F̃), n, pₙ → ∞.
◮ Classical correspondence: Ψ̃_{δ,ψ,F} → ψ and F̃_{δ,ψ,F} → F as δ → ∞.

SLIDE 17

Immediate Implications

1. Classical formulas for confidence statements about M-estimates are overoptimistic under high-dimensional asymptotics (even dramatically so).

2. Maximum likelihood estimates are inefficient under high-dimensional asymptotics: the score ψ = (−log f_W)′ does not yield an efficient estimator.

3. The usual Fisher information bound is not attainable, as I(F̃) < I(F).

SLIDE 18

Our Paper: http://arxiv.org/pdf/1310.7320.pdf

Sample implication of DLD & Montanari (2013):

Corollary. For an M-estimator under HDA with errors Zᵢ ∼ᵢᵢd F:

lim_{n→∞} Aveᵢ Var(θ̂ᵢ) ≥ (1/(1 − 1/δ)) · (1/I(F))

◮ The effect of the HDA parameter δ ∈ (1, ∞) is evident.

SLIDE 19

[image-only slide]

SLIDE 20

Regularized ρ

Definition

ρ: ℝ → ℝ is smooth if it is C¹ with an absolutely continuous derivative ψ = ρ′ having a bounded a.e. derivative ψ′.

◮ Excludes ℓ₁: ρ_{ℓ₁}(z) = |z|.
◮ Allows Huber: ρ_H(z) = z²/2 for |z| ≤ λ, and λ|z| − λ²/2 for |z| > λ.

Regularized ρ:

ρ_b(z) ≡ min_{x∈ℝ} { b ρ(x) + ½(x − z)² },   (1)

the min-convolution of ρ with a squared loss.

SLIDE 21

Regularized Score Function

Regularized score function: Ψ(z; b) = ρ_b′(z).

Example: the Huber loss ρ_H(z; λ) has score function ψ(z; λ) = min(max(−λ, z), λ), and

Ψ(z; b) = b ψ( z/(1 + b); λ ).

Ψ is ‘like’ ψ, but with central slope ‖Ψ′(·; b)‖_∞ = b/(1 + b) < 1.

Effective score: for a special choice b∗, Ψ̃ = Ψ(·; b∗).
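A minimal numerical check of this closed form (NumPy/SciPy; λ = 3 and b = 0.5 are illustrative choices): differencing the min-convolution (1) of the Huber loss reproduces Ψ(z; b) = b ψ(z/(1 + b); λ).

```python
import numpy as np
from scipy.optimize import minimize_scalar

lam, b = 3.0, 0.5  # illustrative parameters

def rho_H(x):      # Huber loss
    return np.where(np.abs(x) <= lam, x**2 / 2, lam * np.abs(x) - lam**2 / 2)

def rho_b(z):      # regularized rho: the min-convolution in (1)
    return minimize_scalar(lambda x: b * rho_H(x) + (x - z)**2 / 2).fun

psi = lambda t: np.clip(t, -lam, lam)   # Huber score

z = np.linspace(-10, 10, 201)
eps = 1e-5
Psi_numeric = np.array([(rho_b(v + eps) - rho_b(v - eps)) / (2 * eps) for v in z])
Psi_closed = b * psi(z / (1 + b))
print(np.max(np.abs(Psi_numeric - Psi_closed)))   # agreement up to ~1e-6
```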

SLIDE 22

Approximate Message Passing Algorithm

Initialization: θ⁰ = 0.

Adjusted residuals:
Rᵗ = Y − Xθᵗ + Ψ(Rᵗ⁻¹; bₜ₋₁)   (2)

Effective score: choose bₜ > 0 achieving empirical average slope p/n ∈ (0, 1):
p/n = (1/n) ∑_{i=1}^{n} Ψ′(Rᵗᵢ; bₜ)   (3)

Scoring: apply the effective score function Ψ(Rᵗ; bₜ):
θᵗ⁺¹ = θᵗ + δ XᵀΨ(Rᵗ; bₜ)   (4)
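A minimal sketch of the iteration (2)-(4) for the Huber loss (NumPy/SciPy; the sizes, λ = 3, and the contaminated-normal errors mirror the running example on the later slides, while solving (3) with a scalar root-finder is an implementation choice, not something the slides prescribe):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
n, p, lam = 1000, 200, 3.0
delta = n / p
X = rng.standard_normal((n, p)) / np.sqrt(n)   # X_ij ~ N(0, 1/n)
theta0 = 6.0 * rng.standard_normal(p)          # ||theta0||_2 ~ 6 sqrt(p)
Z = np.where(rng.random(n) < 0.05, 10.0, rng.standard_normal(n))  # CN(0.05, 10)
Y = X @ theta0 + Z

psi = lambda t: np.clip(t, -lam, lam)          # Huber score
Psi = lambda z, b: b * psi(z / (1 + b))        # regularized score
dPsi = lambda z, b: (b / (1 + b)) * (np.abs(z / (1 + b)) <= lam)

theta, R = np.zeros(p), Y.copy()               # theta^0 = 0, so R^0 = Y
for t in range(20):
    # (3): pick b_t so the empirical average slope equals p/n = 1/delta
    b = brentq(lambda s: dPsi(R, s).mean() - 1 / delta, 1e-6, 1e3)
    PsiR = Psi(R, b)
    theta = theta + delta * (X.T @ PsiR)       # (4): scoring step
    R = Y - X @ theta + PsiR                   # (2): adjusted residuals
print(np.mean((theta - theta0)**2))            # per-coordinate MSE
```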

SLIDE 23

Determining Regularization Parameter bₜ

Example: ρ = ρ_H(·; 3), F = 0.95 N(0, 1) + 0.05 δ₁₀.

[Figure, two panels. Left, “Determination of b₃”: the empirical average slope as a function of b. Right, “History of bₜ”: bₜ across AMP iterations t.]

SLIDE 24

Connection of AMP to M-estimation

Lemma

Let (θ̂∗, R∗, b∗) be a fixed point of the AMP iteration (2), (3), (4) with b∗ > 0. Then θ̂∗ is a minimizer of the problem (M). Vice versa, any minimizer θ̂∗ of the problem (M) corresponds to an AMP fixed point of the form (θ̂∗, R∗, b∗).

SLIDE 25

Example:

◮ Size: n = 1000, p = 200, so δ = 5.
◮ Truth: θ₀ random, with ‖θ₀‖₂ = 6√p.
◮ Errors: F = CN(0.05, 10), i.e. F = 0.95Φ + 0.05H₁₀, where H_x denotes a unit mass at x.
◮ Loss: ρ = ρ_H(z; λ) with λ = 3.
◮ Iterations: run AMP for 20 iterations.
◮ Comparison: use CVX to obtain the Huber estimator directly.

SLIDE 26

Convergence of AMP in Example

[Figure, two panels. Left, “Convergence of AMP θ̂ᵗ to θ₀”: RMSE(θ̂ᵗ, θ₀) against AMP iteration t, for AMP and the (M)-estimate. Right, “Convergence of AMP θ̂ᵗ to θ̂”: RMSE(θ̂ᵗ, θ̂) against t.]

SLIDE 27

Contrast AMP w/ Traditional Scoring

Traditional residuals: Zᵢ = Yᵢ − Xᵢ′θ̃.

Traditional scoring:
θ̂₁ = θ̃ + [ (1/n) ∑_{i=1}^{n} ψ′(Zᵢ) ]⁻¹ (X′X)⁻¹ X′ (ψ(Zᵢ))   (5)

AMP:
θᵗ⁺¹ = θᵗ + δ n⁻¹ · X′ Ψ(Rᵗ; bₜ)

Object        | Scoring            | AMP
Average slope | ∑ᵢ ψ′(Zᵢ)/n        | ∑ᵢ Ψ′(Rᵗᵢ; bₜ)/n ≡ δ⁻¹
Gram          | (X′X)⁻¹            | 1
Residual      | traditional (Zᵢ)   | adjusted (Rᵗᵢ)
Score         | ψ(·)               | Ψ(·; bₜ)
Basepoint     | true θ             | current iterate
Iterations    | 1                  | t ≥ 1

SLIDE 28

Basic contrasts between HDA and Classical Asymptotics

Under the high-dimensional asymptotic, δ ∈ (1, ∞):

◮ The heuristic “expand around θ” is incorrect; it understates the true uncertainty.
◮ The heuristic “errors equal the residuals” is incorrect; the residuals are noisier.
◮ The heuristic “take one step” is incorrect; one must take many steps.

SLIDE 29

Extra Variance at Initialization of AMP

◮ Initialize with θ̂⁰ = 0, R⁻¹ = 0.
◮ Initial residual: R⁰ = Y − Xθ̂⁰ = Z + X(θ₀ − θ̂⁰).
◮ The terms Z and X(θ₀ − θ̂⁰) are independent.
◮ X(θ₀ − θ̂⁰) is Gaussian with variance τ₀² = ‖θ₀ − θ̂⁰‖₂²/n = MSE(θ̂⁰, θ₀)/δ:

Var(R⁰ᵢ) = Var(Z) + Var((X(θ₀ − θ̂⁰))ᵢ) = Var(Z) + MSE(θ̂⁰, θ₀)/δ.

◮ Extra Gaussian noise: τ₀² = MSE(θ̂⁰, θ₀)/δ.
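A quick simulation check of this variance decomposition (NumPy; the sizes are illustrative): with θ̂⁰ = 0, the entries of the initial residual should carry variance Var(Z) + τ₀².

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5000, 1000                             # delta = 5
X = rng.standard_normal((n, p)) / np.sqrt(n)  # X_ij ~ N(0, 1/n)
theta0 = rng.standard_normal(p)
Z = rng.standard_normal(n)                    # Var(Z) = 1
R0 = Z + X @ theta0                           # initial residual, theta^0 = 0

tau0_sq = np.sum(theta0**2) / n               # ||theta0 - theta^0||^2 / n
print(R0.var(), 1.0 + tau0_sq)                # empirical vs Var(Z) + tau_0^2
```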

SLIDE 30

Evolution of Extra Variance at later AMP iterations

Pretend Rᵗ ≈ Z + τₜW, with W an independent standard normal.

Define the variance map
V(τ², b; δ, F) = δ E{ Ψ(Z + τW; b)² },

and the slope parameter b = b(τ; δ, F): the smallest solution b ≥ 0 of
1/δ = E{ Ψ′(Z + τW; b) }.   (6)

Definition

The State Evolution dynamical system {τₜ²}ₜ≥₀ starts at τ₀² ∈ ℝ≥₀ and evolves by
τₜ₊₁² = V(τₜ², b(τₜ)) = V(τₜ², b(τₜ; δ, F); δ, F).   (7)
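A minimal Monte Carlo sketch of (6)-(7) (NumPy/SciPy; δ = 5, Huber λ = 3, and CN(0.05, 10) errors as in the running example; the sample size and root-finding bracket are implementation choices):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)
delta, lam, m = 5.0, 3.0, 10**6
Z = np.where(rng.random(m) < 0.05, 10.0, rng.standard_normal(m))  # Z ~ F
W = rng.standard_normal(m)                                        # W ~ N(0,1)

psi = lambda t: np.clip(t, -lam, lam)
Psi = lambda z, b: b * psi(z / (1 + b))
dPsi = lambda z, b: (b / (1 + b)) * (np.abs(z / (1 + b)) <= lam)

def b_of_tau(tau):  # smallest b >= 0 solving the slope equation (6)
    return brentq(lambda s: dPsi(Z + tau * W, s).mean() - 1 / delta, 1e-8, 1e3)

tau2 = 4.0          # initial tau_0^2
for t in range(10):
    b = b_of_tau(np.sqrt(tau2))
    tau2 = delta * np.mean(Psi(Z + np.sqrt(tau2) * W, b)**2)      # map (7)
    print(t, tau2)  # decreases toward the fixed point tau_*^2
```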

SLIDE 31

Variance Map, in running example

[Figure: “Variance map V(τ²), Contaminated Normal”: output variance V(τ²) plotted against input variance τ₀², with the line y = x; the curves cross at approximately (0.472, 0.472).]

F = 0.95 N(0, 1) + 0.05 H₁₀, ρ_H with λ = 3.

SLIDE 32

State Evolution, in running example

[Figure: “Dynamics of τₜ²”: cobweb iteration of the variance map V(τ²) against the line y = x, passing through the successive values 2.056, 0.819, 0.545, 0.486.]

F = 0.95 N(0, 1) + 0.05 H₁₀, ρ_H with λ = 3.

SLIDE 33

Operating Characteristics from State Evolution

Definition

The state evolution formalism assigns predictions E{·|S} under states S = (τ, b, δ, F), where W ∼ N(0, 1) and Z ∼ F is independent of W:

◮ Functions ξ of θ̂ − θ₀. Predict p⁻¹ ∑_{i≤p} ξ(θ̂ᵢ − θ₀,ᵢ) by
E(ξ(θ̂ − ϑ)|S) ≡ E{ ξ(√δ τ W) }.

◮ Functions ξ₂ of residual and error. Predict n⁻¹ ∑_{i=1}^{n} ξ₂(Rᵢ, Zᵢ) by
E(ξ₂(R, Z)|S) ≡ E{ ξ₂(Z + τW, Z) }.

SLIDE 34

Example of State Evolution Predictions

◮ MSE at iteration t. Let Sₜ = (τₜ, b(τₜ), δ, F) denote the state of AMP at iteration t, and predict
MSE(θ̂ᵗ, θ₀) ≈ E((ϑ̂ − ϑ)²|Sₜ) = E{ (√δ τₜ W)² } = δτₜ².

◮ MSE at convergence. With τ∗ > 0 the limit of τₜ, let S∗ = (τ∗, b(τ∗), δ, F) denote the equilibrium state of AMP:
MSE(θ̂∗, θ₀) ≈ E((ϑ̂ − ϑ)²|S∗) = E{ (√δ τ∗ W)² } = δτ∗².

◮ Ordinary residuals Y − Xθ̂∗ at AMP convergence. Setting η(z; b) = z − Ψ(z; b),
Y − Xθ̂∗ ⇒_D η(Z + τ∗W; b∗).

SLIDE 35

[Figure: experimental means from 10 simulations compared with State Evolution predictions, under CN(0.05, 10) with Huber ψ, λ = 3. Upper left: τ̂ₜ = ‖θᵗ − θ₀‖₂/√n. Upper right: b̂ₜ. Lower left: MSE, mean squared error. Lower right: MAE, mean absolute error. Blue ‘+’ symbols: empirical means of AMP observables. Green curve: theoretical predictions by SE.]

SLIDE 36

Lower Bounds on State Evolution

Lemma. Suppose that F has a well-defined Fisher information I(F). Then, for any t > 0,

τₜ² ≥ 1/(δ I(F)).

Lemma (uses Barron-Madiman, 2007).

I(F ⋆ N(0, τ²)) ≤ I(F)/(1 + τ² I(F)).

Corollary. For t > k,

τₜ² ≥ (1 + 1/δ + 1/δ² + ··· + 1/δᵏ)/(δ I(F)).

Corollary. For every accumulation point τ∗ of State Evolution,

τ∗² ≥ (1/(δ − 1)) · (1/I(F)).

Corollary. For an M-estimator under HDA with errors Zᵢ ∼ᵢᵢd F:

lim_{n→∞} Var(θ̂ᵢ) ≥ (1/(1 − 1/δ)) · (1/I(F)).

SLIDE 37

Correctness of State Evolution 1

Basic assumptions:
A1 The discrepancy function ρ is convex and smooth.
A2 The matrices {X(n)}ₙ have entries ∼ᵢᵢd N(0, 1/n).
A3 θ₀ and θ̂⁰ = 0 are deterministic sequences such that AMSE(θ̂⁰, θ₀) = δτ₀².
A4 F has a finite second moment.

Terminology:
Let {τₜ²}ₜ≥₀ denote the state evolution sequence with initial condition τ₀².
Let {θ̂ᵗ, Rᵗ}ₜ≥₀ be the AMP trajectory with parameters bₜ.

Definition: A function ξ: ℝᵏ → ℝ is pseudo-Lipschitz if there exists L < ∞ such that, for all x, y ∈ ℝᵏ,
|ξ(x) − ξ(y)| ≤ L(1 + ‖x‖₂ + ‖y‖₂) ‖x − y‖₂.

SLIDE 38

Correctness of State Evolution 2

Theorem. Assume A1-A4. Let ξ: ℝ → ℝ and ξ₂: ℝ × ℝ → ℝ be pseudo-Lipschitz functions. Then, for any t > 0, we have, for W ∼ N(0, 1) independent of Z ∼ F:

lim_{n→∞} (1/p) ∑_{i=1}^{p} ξ(θ̂ᵗᵢ − θ₀,ᵢ) =_{a.s.} E{ ξ(√δ τₜ W) },   (8)

lim_{n→∞} (1/n) ∑_{i=1}^{n} ξ₂(Rᵗᵢ, Zᵢ) =_{a.s.} E{ ξ₂(Z + τₜW, Z) }.   (9)

SLIDE 39

Corollary of Correctness

Under HDA, one can define

AMSE(θ̂ᵗ, θ₀) =_{a.s.} lim_{n,p→∞} ‖θ̂ᵗ − θ₀‖₂²/p,

and then

AMSE(θ̂ᵗ, θ₀) = δτₜ²   and   lim_{t→∞} AMSE(θ̂ᵗ, θ₀) = δτ∗².

SLIDE 40

Convergence of AMP to M-Estimator, 2

Theorem. Assume A1-A4, and that ρ is strongly convex and δ > 1. Let (τ∗, b∗) be a solution of the two equations

τ² = δ E{ Ψ(Z + τW; b)² },   (10)
1/δ = E{ Ψ′(Z + τW; b) }.   (11)

Assume that AMSE(θ̂⁰, θ₀) = δτ∗². Then

lim_{t→∞} AMSE(θ̂ᵗ, θ̂) = 0.   (12)

SLIDE 41

Main Result: Asymptotic Variance Formula under High-Dimensional Asymptotics

Corollary. Assume the setting of the previous Theorem. Then

lim_{n,p→∞} Aveᵢ∈[p] Var(θ̂ᵢ) =_{a.s.} V(Ψ̃, F̃),   (13)

where the effective score is Ψ̃(·) = Ψ(·; b∗), while the effective noise distribution is F̃ = F ⋆ N(0, τ∗²). Here (τ∗, b∗) is the unique solution of equations (10)-(11).

SLIDE 42

Driving Idea of High-Dimensional Asymptotics, 1

◮ The M-estimate is asymptotically equivalent to the limit of many steps of AMP.
◮ The first step of AMP contains extra Gaussian noise.
◮ The extra Gaussian noise is caused by errors in the initial parameter estimate.
◮ The extra Gaussian noise declines with AMP iterations but does not go away completely; it declines to the level defined by the fixed-point equations of state evolution.
◮ The extra Gaussian noise in the M-estimate is characterized by the fixed-point equations of state evolution.

SLIDE 43

Driving Idea of High-Dimensional Asymptotics, 2

Classically, the M-estimate propagates the true errors through the score:

θ̂ = θ₀ + (1/B(ψ, F)) (X′X)⁻¹ X′ (ψ(Zᵢ)) + o_p(n⁻¹/²).

Under HDA, it propagates noisier effective errors through the effective score:

θ̂ = θ₀ + (1/B(Ψ̃, F̃)) · n⁻¹ · X′ (Ψ̃(Zᵢ + τ∗Wᵢ)) + o_p(n⁻¹/²),

where Wᵢ ⊥⊥ Zᵢ, both are iid, and Ψ̃ is the effective score.

SLIDE 44

How is Correctness of State Evolution Proved, 1?

Simply apply an existing paper: Bayati, Mohsen, and Andrea Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Transactions on Information Theory 57.2 (2011): 764-785.

Bayati-Montanari 2011 was designed for a seemingly different problem, the asymptotics of the Lasso in the p > n case:

min ‖Ỹ − X̃β‖₂²/2 + λ‖β‖₁

The setting was compressed sensing, where X̃ᵢ,ⱼ ∼ᵢᵢd N(0, 1/n) and n/pₙ → δ ∈ (0, 1).

The formalism of AMP and State Evolution was introduced and developed in:
DLD, Arian Maleki, and Andrea Montanari, “Message-passing algorithms for compressed sensing,” PNAS 106.45 (2009): 18914-18919.
DLD, Arian Maleki, and Andrea Montanari, “The noise-sensitivity phase transition in compressed sensing,” IEEE Transactions on Information Theory 57.10 (2011): 6920-6941.

These papers systematically understood and used the ‘extra Gaussian noise’ property of high-dimensional asymptotics.

The generality of the Bayati-Montanari treatment easily accommodated M-estimation.

SLIDE 45

Iterative Thresholding

Soft threshold nonlinearity: η(x; λ) = (|x| − λ)₊ · sgn(x).

Iterative solution βᵗ, t = 0, 1, 2, 3, …, with β⁰ = 0:

zᵗ = Ỹ − X̃βᵗ
βᵗ⁺¹ = η(βᵗ + X̃∗zᵗ; λₜ)

A heuristic to solve the LASSO: min_β ‖Ỹ − X̃β‖₂²/2 + λ‖β‖₁.
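A minimal sketch of the iteration (NumPy; the sizes, sparse truth, and fixed threshold are illustrative; a step size 1/L with L = ‖X̃‖₂², which the display above elides, is included so the iteration provably converges):

```python
import numpy as np

def eta(x, lam):  # soft threshold nonlinearity
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(5)
n, p, lam = 200, 400, 0.1                      # p > n
Xt = rng.standard_normal((n, p)) / np.sqrt(n)
beta0 = np.zeros(p); beta0[:20] = 3.0          # sparse truth
Yt = Xt @ beta0 + 0.1 * rng.standard_normal(n)

L = np.linalg.norm(Xt, 2)**2                   # Lipschitz constant of the gradient
beta = np.zeros(p)
for t in range(500):
    z = Yt - Xt @ beta                         # residual
    beta = eta(beta + Xt.T @ z / L, lam / L)   # gradient step + threshold
print(np.count_nonzero(beta))
```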

SLIDE 46

Approximate Message Passing (AMP) Iterative Thresholding

First-order Approximate Message Passing (AMP) algorithm:

zᵗ = Ỹ − X̃βᵗ + (1/δ) zᵗ⁻¹ ⟨η′ₜ₋₁(X̃∗zᵗ⁻¹ + βᵗ⁻¹)⟩
βᵗ⁺¹ = ηₜ(βᵗ + X̃∗zᵗ),   where η′ₜ(s) = (∂/∂s) ηₜ(s).

Feature: essentially the same cost as iterative soft thresholding. If η is soft thresholding,

(1/δ) ⟨η′ₜ₋₁(X̃∗zᵗ⁻¹ + βᵗ⁻¹)⟩ = ‖βᵗ‖₀/n.
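For comparison with the iterative-thresholding sketch above, a minimal AMP version (NumPy; the same illustrative data generation, and a fixed threshold λₜ ≡ λ rather than an adaptive schedule): the only change is the Onsager correction added to the residual.

```python
import numpy as np

def eta(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(6)
n, p, lam = 200, 400, 1.0
Xt = rng.standard_normal((n, p)) / np.sqrt(n)
beta0 = np.zeros(p); beta0[:20] = 3.0
Yt = Xt @ beta0 + 0.1 * rng.standard_normal(n)

beta, z = np.zeros(p), Yt.copy()
for t in range(30):
    onsager = (np.count_nonzero(beta) / n) * z   # (1/delta) <eta'> z^{t-1}
    z = Yt - Xt @ beta + onsager                 # adjusted residual
    beta = eta(beta + Xt.T @ z, lam)             # thresholding step
print(np.count_nonzero(beta))
```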

SLIDE 47

How is Correctness of State Evolution Proved, 2?

Central recursion in Bayati-Montanari 2011, with hᵗ, qᵗ ∈ ℝᴺ and zᵗ, mᵗ ∈ ℝⁿ; initial condition q⁰, m⁻¹ = 0:

hᵗ⁺¹ = A∗mᵗ − ξₜqᵗ,   mᵗ = gₜ(bᵗ, w)
bᵗ = Aqᵗ − λₜmᵗ⁻¹,   qᵗ = fₜ(hᵗ, x₀)

Reaction coefficients: ξₜ = ⟨g′ₜ(bᵗ, w)⟩; λₜ = (1/δ)⟨f′ₜ(hᵗ, x₀)⟩.

State Evolution:

τₜ² = E{ gₜ(σₜZ, W)² };   σₜ² = (1/δ) E{ fₜ(τₜ₋₁Z, X₀)² },

where W ∼ F_W and X₀ ∼ F_{X₀}.

SLIDE 48

Term-by-term correspondence between the centered recursion of DLD-Montanari 2013 and the recursion analyzed in Bayati-Montanari 2011:

ϑᵗ⁺¹ = δXᵀΨ(W + Sᵗ; bₜ) + qₜϑᵗ:   ϑᵗ⁺¹ | Xᵀ | δΨ(W + Sᵗ; bₜ) | qₜ | ϑᵗ
hᵗ⁺¹ = A∗mᵗ − ξₜqᵗ:   hᵗ⁺¹ | A∗ | mᵗ | ξₜ | −qᵗ

Sᵗ = −Xϑᵗ + Ψ(W + Sᵗ⁻¹; bₜ₋₁):   Sᵗ | X | −ϑᵗ | 1 | Ψ(W + Sᵗ⁻¹; bₜ₋₁)
bᵗ = Aqᵗ − λₜmᵗ⁻¹:   bᵗ | A | qᵗ | −λₜ | mᵗ⁻¹

Table: Correspondences between terms in the centered recursions of DLD-Montanari 2013 and the recursions analyzed in Bayati-Montanari 2011.

We get an exact correspondence between the two systems, provided we identify δΨ(W + Sᵗ; bₜ) with mᵗ = gₜ(bᵗ; w) and −δhᵗ with fₜ(hᵗ). One has, in particular, that λₜ = (1/δ)⟨f′ₜ(hᵗ)⟩ = −1, and that ξₜ = ⟨g′ₜ(bᵗ, w)⟩ = δ⟨Ψ′(W + Sᵗ; bₜ)⟩ = qₜ.

SLIDE 49

Full Circle, 1

◮ Why did work on sparse signal recovery solve a problem in robust regression? Duality of sparsity and robustness:
◮ an explicit link between solutions of the Lasso with p > n and of Huber with p < n;
◮ an identity between estimating a sparsely nonzero vector and uncovering outliers in a linear relation.

SLIDE 50

(Lasso_λ)   min_{β∈ℝ^p̃} ½‖Ỹ − X̃β‖₂² + λ ∑_{i=1}^{p̃} |βᵢ|   (14)

(Huber_λ)   min_{ϑ∈ℝᵖ} ∑_{i=1}^{n} ρ_H(Yᵢ − ⟨Xᵢ, ϑ⟩; λ)   (15)

Let X̃ be a matrix with orthonormal rows such that X̃X = 0, i.e.

null(X̃) = image(X),   (16)

finally, set Ỹ = X̃Y.

Proposition. With problem instances (Y, X) and (Ỹ, X̃) related as above, the optimal values of the Lasso problem (Lasso_λ) and the Huber problem (Huber_λ) are identical. The solutions of the two problems are in one-to-one relation; in particular,

θ̂ = (XᵀX)⁻¹Xᵀ(Y − β̂).   (17)

(Numerous references: e.g. Art Owen/IPOD & earlier.)
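A minimal numerical check of the Proposition (NumPy/SciPy; small sizes, λ = 1, and a handful of planted outliers are illustrative): solve the Huber problem directly, solve the Lasso instance built from (16), and compare through (17).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, p, lam = 40, 5, 1.0
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)
Y[:4] += 8.0                                   # plant a few outliers

# Huber problem (15), solved directly
rho_H = lambda r: np.where(np.abs(r) <= lam, r**2 / 2, lam * np.abs(r) - lam**2 / 2)
theta_huber = minimize(lambda th: rho_H(Y - X @ th).sum(), np.zeros(p)).x

# X-tilde: orthonormal rows spanning the orthocomplement of image(X), eq. (16)
Q, _ = np.linalg.qr(X, mode='complete')
Xt = Q[:, p:].T                                # (n - p) x n, with Xt @ X = 0
Yt = Xt @ Y

# Lasso problem (14) via iterative soft thresholding
# (unit step is valid here: the rows of Xt are orthonormal, so ||Xt||_2 = 1)
eta = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
beta = np.zeros(n)
for _ in range(5000):
    beta = eta(beta + Xt.T @ (Yt - Xt @ beta), lam)

theta_lasso = np.linalg.solve(X.T @ X, X.T @ (Y - beta))   # map back via (17)
print(np.max(np.abs(theta_huber - theta_lasso)))           # small difference
```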

SLIDE 51

Full Circle, 3

◮ Isometry between the performance of:
◮ Lasso in the ε-sparse regression problem, p > n;
◮ Huber (M)-estimation with ε-contaminated data, p < n.

SLIDE 52

Two Scalar Minimax problems

◮ Huber (1964) minimax problem:

v(ε) = min_λ sup_H { V(ψ_λ, F) : F = (1 − ε)Φ + εH }

Here ψ_λ is the Huber ψ capped at λ, and Φ is N(0, 1).

◮ Minimax MSE under sparse means (DLD & Johnstone, 1992):

m(ε) = inf_λ sup_H { E_F(η_λ(X + Z) − X)² : X ∼ (1 − ε)ν₀ + εH }

Here η_λ is soft thresholding at λ, and ν₀ is the point mass at zero.

SLIDE 53

Full Circle, 4 (w/ Montanari, Johnstone)

◮ HDA regression with a fraction ε of contaminated errors, δ = n/p:

V(ε, δ) ≡ min_λ max_H lim_{n→∞} Aveᵢ Var(θ̂ᵢ)

V(ε, δ) = +∞ if v(ε) > δ;   V(ε, δ) = v(ε)/(1 − v(ε)/δ) if v(ε) < δ.

◮ HDA sparse regression, δ = n/p < 1, with
Ỹ = X̃β₀ + σZ̃, Z̃ ∼ᵢᵢd N(0, 1), X̃ᵢ,ⱼ ∼ᵢᵢd N(0, n⁻¹), β₀ ε-sparse:

M(ε, δ) ≡ sup_{σ>0} min_λ max_{‖β₀‖₀/n ≤ ε} lim_{n→∞} MSE(β̂_λ, β₀)/σ²

M(ε, δ) = +∞ if m(ε) > δ;   M(ε, δ) = m(ε)/(1 − m(ε)/δ) if m(ε) < δ.

SLIDE 54

Conclusions

◮ High-dimensional asymptotics imposes extra Gaussian noise in estimation, not seen classically.
◮ Approximate Message Passing: a new algorithm to understand and analyse.
◮ State Evolution: a new type of analysis to obtain properties of estimates.
◮ New phenomena become visible, e.g. previously unknown phase transitions in (M)-estimation.

Alternate approach: N. El Karoui, arXiv:1311.2445