Regularization Parameter Estimation for Least Squares: Using the χ²-curve



  1. Regularization Parameter Estimation for Least Squares: Using the χ²-curve. Rosemary Renaut, Jodi Mead. Supported by NSF, Arizona State and Boise State. Harrachov, August 2007.

  2. Outline: Introduction; Methods; Examples; Chi-squared Method (Background, Algorithm, Single Variable Newton Method, Extension for General D: Generalized Tikhonov); Results; Conclusions; References.

  3. Regularized Least Squares for Ax = b
  - Ill-posed system: A ∈ R^{m×n}, b ∈ R^m, x ∈ R^n.
  - Generalized Tikhonov regularization with operator D on x:
    x̂ = argmin J(x) = argmin { ‖Ax − b‖²_{W_b} + ‖D(x − x_0)‖²_{W_x} }.  (1)
    Assume N(A) ∩ N(D) = ∅.
  - Statistically, W_b is the inverse covariance matrix for the data b.
  - Standard choice: W_x = λ² I, with λ the unknown penalty parameter.
  - Focus: how do we find λ?
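As a concrete illustration of (1), the minimizer can be obtained from the regularized normal equations. The following is a minimal NumPy sketch, not the authors' code; the function name and the assumption that W_b and W_x are supplied as SPD weight matrices are illustrative.

```python
import numpy as np

def generalized_tikhonov(A, b, D, W_b, W_x, x0):
    """Minimize ||A x - b||^2_{W_b} + ||D (x - x0)||^2_{W_x}.

    Solves the regularized normal equations
    (A^T W_b A + D^T W_x D) x = A^T W_b b + D^T W_x D x0.
    """
    lhs = A.T @ W_b @ A + D.T @ W_x @ D
    rhs = A.T @ W_b @ b + D.T @ W_x @ (D @ x0)
    return np.linalg.solve(lhs, rhs)
```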

  4. Standard Methods I: L-curve — find the corner
  1. Let r(λ) = (A(λ) − I_m) b, where the influence matrix is A(λ) = A (A^T W_b A + λ² D^T D)^{-1} A^T W_b. Plot log(‖Dx‖) against log(‖r(λ)‖) and find the corner.
  2. Trades off the two contributions.
  3. Expensive: requires a range of λ values.
  4. The GSVD makes the calculations efficient.
  5. Uses no statistical information. The curve may have no corner.
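A minimal sketch of generating L-curve points for the standard form (D = I, W_b = I); the λ grid, the function name, and the omission of corner detection are simplifications, not part of the slides.

```python
import numpy as np

def l_curve_points(A, b, lambdas):
    """Return (log ||x(lam)||, log ||A x(lam) - b||) for each lambda,
    i.e. the points plotted to locate the L-curve corner."""
    n = A.shape[1]
    pts = []
    for lam in lambdas:
        x = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T @ b)
        pts.append((np.log(np.linalg.norm(x)),
                    np.log(np.linalg.norm(A @ x - b))))
    return np.array(pts)

# Example: scan a logarithmic grid of lambda values (an arbitrary choice).
# pts = l_curve_points(A, b, np.logspace(-6, 2, 50))
```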

  5. Standard Methods II: Generalized Cross-Validation (GCV)
  1. Minimizes the GCV function
     ‖b − A x(λ)‖²_{W_b} / [trace(I_m − A(W_x))]²,  W_x = λ² I_n,
     which estimates the predictive risk.
  2. Expensive: requires a range of λ values.
  3. The GSVD makes the calculations efficient.
  4. Uses statistical information. The GCV function may have multiple minima or be nearly flat.
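A minimal sketch of evaluating the GCV function for the standard form (W_b = I, D = I); the dense solve is for clarity only, whereas the slides recommend an SVD/GSVD-based implementation for efficiency.

```python
import numpy as np

def gcv(A, b, lam):
    """GCV function ||b - A x(lam)||^2 / [trace(I - A(lam))]^2
    for standard-form Tikhonov (W_b = I, D = I)."""
    m, n = A.shape
    K = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T)  # (A^T A + lam^2 I)^{-1} A^T
    A_lam = A @ K                                            # influence matrix A(lam)
    r = A_lam @ b - b
    return (r @ r) / (m - np.trace(A_lam))**2
```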

  6. Standard Methods III: Unbiased Predictive Risk Estimation (UPRE)
  1. Minimizes the expected value of the predictive risk, i.e. minimizes the UPRE function
     ‖b − A x(λ)‖²_{W_b} + 2 trace(A(W_x)) − m.
  2. Expensive: requires a range of λ values.
  3. The GSVD makes the calculations efficient.
  4. Uses statistical information.
  5. A minimum is needed.
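The corresponding sketch of the UPRE function, under the same simplifying assumptions as the GCV sketch above (W_b = I, D = I, dense solve, unit noise variance).

```python
import numpy as np

def upre(A, b, lam):
    """UPRE function ||b - A x(lam)||^2 + 2 trace(A(lam)) - m
    for standard-form Tikhonov (W_b = I, D = I)."""
    m, n = A.shape
    K = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T)
    A_lam = A @ K                       # influence matrix A(lam)
    r = A_lam @ b - b
    return r @ r + 2.0 * np.trace(A_lam) - m
```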

  7. An Illustrative Example: phillips Fredholm integral equation (Hansen), 10% error
  1. Add noise to b.
  2. Standard deviation σ_{b_i} = 0.01 |b_i| + 0.1 b_max.
  3. Covariance matrix C_b = σ_b² I_m = W_b^{-1}.
  4. σ_b² is the average of the σ_{b_i}².
  5. In the figure, − is the original b and ∗ the noisy data.
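A minimal sketch of this noise model; the random seed and function name are arbitrary, and b is assumed to come from the phillips test problem (e.g. Hansen's Regularization Tools).

```python
import numpy as np

def add_noise(b, rng=None):
    """Perturb b with sigma_i = 0.01*|b_i| + 0.1*max|b| Gaussian noise
    and return the noisy data plus the average variance used in C_b."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = 0.01 * np.abs(b) + 0.1 * np.abs(b).max()
    b_noisy = b + sigma * rng.standard_normal(b.size)
    sigma2_avg = np.mean(sigma**2)          # C_b = sigma2_avg * I_m
    return b_noisy, sigma2_avg
```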

  8. An Illustrative Example: phillips Fredholm integral equation (Hansen), comparison with the new method
  1. Add noise to b.
  2. Standard deviation σ_{b_i} = 0.01 |b_i| + 0.1 b_max.
  3. Covariance matrix C_b = σ_b² I_m = W_b^{-1}.
  4. σ_b² is the average of the σ_{b_i}².
  5. − is the original b and ∗ the noisy data.
  6. Each method gives a different solution: o is the L-curve solution.
  7. + is the reference solution.

  9. General Result: Tikhonov (D = I). The cost functional at its minimum is a χ² random variable.
  Theorem (Rao 1973, Tarantola, Mead 2007). Let
  J(x) = (b − Ax)^T C_b^{-1} (b − Ax) + (x − x_0)^T C_x^{-1} (x − x_0),
  where
  - x and b are stochastic (they need not be normal),
  - the components of r = b − A x_0 are iid,
  - the matrices C_b = W_b^{-1} and C_x = W_x^{-1} are SPD.
  Then, for large m, the minimum value of J is a random variable that follows a χ² distribution with m degrees of freedom.

  10. Implication: find W_x such that J is a χ² random variable
  - The theorem implies m − √(2m) z_{α/2} < J(x̂) < m + √(2m) z_{α/2} for the (1 − α) confidence interval, with x̂ the solution.
  - Equivalently, when D = I,
    m − √(2m) z_{α/2} < r^T (A C_x A^T + C_b)^{-1} r < m + √(2m) z_{α/2}.
  - Having found W_x, the posterior inverse covariance matrix is W̃_x = A^T W_b A + W_x. Note that W_x is completely general.
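A minimal sketch of testing this interval for a given C_x; the function name and the use of scipy.stats.norm for z_{α/2} are the only ingredients not stated on the slide.

```python
import numpy as np
from scipy.stats import norm

def chi2_interval_holds(r, A, C_x, C_b, alpha=0.05):
    """Check m - sqrt(2m) z_{alpha/2} < r^T (A C_x A^T + C_b)^{-1} r
    < m + sqrt(2m) z_{alpha/2} for the residual r = b - A x0."""
    m = r.size
    J = r @ np.linalg.solve(A @ C_x @ A.T + C_b, r)
    half_width = np.sqrt(2.0 * m) * norm.ppf(1.0 - alpha / 2.0)
    return m - half_width < J < m + half_width
```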

  11. A New Algorithm for Estimating the Model Covariance
  Algorithm (Mead 07). Given the confidence interval parameter α, the initial residual r = b − A x_0, and an estimate of the data covariance C_b, find L_x which solves the nonlinear optimization:
  Minimize ‖L_x L_x^T‖²_F
  Subject to m − √(2m) z_{α/2} < r^T (A L_x L_x^T A^T + C_b)^{-1} r < m + √(2m) z_{α/2},
  with A L_x L_x^T A^T + C_b well-conditioned.
  This is expensive.
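One way to prototype this optimization is with a general-purpose constrained solver. The sketch below parametrizes L_x by its lower triangle and uses SLSQP; the parametrization, the solver choice, and the omission of the conditioning constraint are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def estimate_Lx(A, r, C_b, alpha=0.05):
    """Sketch: minimize ||L L^T||_F^2 over lower-triangular L subject to the
    chi^2 interval constraint on r^T (A L L^T A^T + C_b)^{-1} r."""
    m, n = A.shape
    half_width = np.sqrt(2.0 * m) * norm.ppf(1.0 - alpha / 2.0)
    tril = np.tril_indices(n)

    def unpack(v):
        L = np.zeros((n, n))
        L[tril] = v
        return L

    def objective(v):
        L = unpack(v)
        return np.linalg.norm(L @ L.T, 'fro')**2

    def chi2_value(v):
        L = unpack(v)
        return r @ np.linalg.solve(A @ L @ L.T @ A.T + C_b, r)

    cons = [{'type': 'ineq', 'fun': lambda v: chi2_value(v) - (m - half_width)},
            {'type': 'ineq', 'fun': lambda v: (m + half_width) - chi2_value(v)}]
    v0 = np.eye(n)[tril]                       # start from L = I
    res = minimize(objective, v0, method='SLSQP', constraints=cons)
    return unpack(res.x)
```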

  12. Single Variable Approach: seek an efficient, practical algorithm
  1. Let W_x = σ_x^{-2} I, where the regularization parameter is λ = 1/σ_x.
  2. Use the SVD U_b Σ_b V_b^T = W_b^{1/2} A, with singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_p, and define s = U_b^T W_b^{1/2} r.
  3. Find σ_x such that
     m − √(2m) z_{α/2} < s^T diag( 1/(σ_x² σ_i² + 1) ) s < m + √(2m) z_{α/2}.
  4. Equivalently, find σ_x² such that
     F(σ_x) = s^T diag( 1/(1 + σ_x² σ_i²) ) s − m = 0.
  Scalar root finding: Newton's method.
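A minimal sketch of this single-variable χ² method; a bracketing root finder (brentq) is used here in place of the Newton iteration described on the slides, and the bracket upper bound is an arbitrary assumption.

```python
import numpy as np
from scipy.optimize import brentq

def chi2_sigma_x(A, b, x0, W_b_half, sigma_max=1e8):
    """Find sigma_x with F(sigma_x) = s^T diag(1/(sigma_x^2 sigma_i^2 + 1)) s - m = 0,
    then return (sigma_x, lambda) with lambda = 1/sigma_x.  D = I assumed."""
    r = b - A @ x0
    U, sv, _ = np.linalg.svd(W_b_half @ A, full_matrices=True)
    s = U.T @ (W_b_half @ r)
    sv_full = np.zeros(s.size)
    sv_full[:sv.size] = sv              # singular values, padded with zeros
    m = b.size

    def F(sigma_x):
        return np.sum(s**2 / (sv_full**2 * sigma_x**2 + 1.0)) - m

    # F is monotone decreasing; if F(0) < 0 no positive root exists (cf. slide 15).
    sigma_x = brentq(F, 0.0, sigma_max)
    return sigma_x, 1.0 / sigma_x
```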

  13. Extension to Generalized Tikhonov. Define
  x_GTik = argmin J_D(x) = argmin { ‖Ax − b‖²_{W_b} + ‖D(x − x_0)‖²_{W_x} }.  (2)
  Theorem. For large m, the minimum value of J_D is a random variable which follows a χ² distribution with m − n + p degrees of freedom.
  Proof. Use the generalized singular value decomposition of the pair (W_b^{1/2} A, W_x^{1/2} D).
  Goal: find W_x such that J_D is χ² with m − n + p degrees of freedom.

  14. Newton Root Finding, W_x = σ_x^{-2} I_p
  - GSVD of the pair (W_b^{1/2} A, D):
    W_b^{1/2} A = U [Υ; 0_{(m−n)×n}] X^T,  D = V [M, 0_{p×(n−p)}] X^T.
  - γ_i are the generalized singular values.
  - m̃ = m − n + p − Σ_{i=1}^p δ_{γ_i,0} s_i² − Σ_{i=n+1}^m s_i².
  - s̃_i = s_i / (γ_i² σ_x² + 1), i = 1, ..., p.
  - t_i = s̃_i γ_i.
  Solve F = 0, where F(σ_x) = s^T s̃ − m̃ and F′(σ_x) = −2 σ_x ‖t‖²₂.
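A minimal sketch of the Newton iteration on F, assuming the generalized singular values γ_i and the rotated residual components s_i (i = 1, ..., p) have already been obtained from a GSVD, with the constant contributions folded into m̃; the safeguard on negative steps and the stopping rule are assumptions.

```python
import numpy as np

def newton_sigma_x(gamma, s, m_tilde, sigma0=1.0, tol=1e-10, maxit=50):
    """Newton iteration for F(sigma) = sum_i s_i^2/(gamma_i^2 sigma^2 + 1) - m_tilde,
    using the closed-form derivative F'(sigma) = -2 sigma ||t||^2,
    with t_i = gamma_i s_i / (gamma_i^2 sigma^2 + 1)."""
    sigma = sigma0
    for _ in range(maxit):
        denom = gamma**2 * sigma**2 + 1.0
        F = np.sum(s**2 / denom) - m_tilde
        t = gamma * s / denom
        dF = -2.0 * sigma * np.sum(t**2)
        sigma_new = sigma - F / dF
        if sigma_new <= 0.0:                 # simple safeguard: halve instead
            sigma_new = 0.5 * sigma
        if abs(sigma_new - sigma) <= tol * abs(sigma):
            return sigma_new
        sigma = sigma_new
    return sigma
```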

  15. Observations
  - Initialization: GCV, UPRE, L-curve, and the χ² method all use the GSVD (or SVD).
  - The algorithm is cheap compared to GCV, UPRE, and the L-curve.
  - F is monotone decreasing (and even in σ_x).
  - Either a solution exists and is unique for positive σ, or no solution exists (F(0) < 0).

  16. Relationship to the Discrepancy Principle
  - The discrepancy principle can also be implemented by a Newton method.
  - It finds σ such that the regularized residual satisfies
    σ_b² = (1/m) ‖b − A x(σ)‖²₂.  (3)
  - In our notation this is
    Σ_{i=1}^p ( 1/(γ_i² σ² + 1) )² s_i² + Σ_{i=n+1}^m s_i² = m.  (4)
  - Similar to the χ² condition, but note that the weight in the first sum is squared in this case.
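For comparison, a minimal sketch of the discrepancy principle (3) for the standard form (W_b = I, D = I), again solved with a bracketing root finder rather than Newton; the bracket and function name are assumptions.

```python
import numpy as np
from scipy.optimize import brentq

def discrepancy_lambda(A, b, sigma_b, lam_max=1e8):
    """Choose lambda so that ||b - A x(lambda)||^2 / m = sigma_b^2."""
    m = A.shape[0]
    U, sv, _ = np.linalg.svd(A, full_matrices=True)
    beta = U.T @ b

    def G(lam):
        f = np.ones(m)                            # residual filter factors
        f[:sv.size] = lam**2 / (sv**2 + lam**2)
        return np.sum((f * beta)**2) / m - sigma_b**2

    return brentq(G, 0.0, lam_max)                # G is monotone increasing in lambda
```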

  17. Some Solutions: with no prior information x_0. The figures show solutions and error bars, with statistical information (C_b = diag(σ_{b_i}²)) and with no statistical information. The solution is smoothed.

  18. Some Generalized Tikhonov Solutions: first order derivative operator D. Figures show the cases C_b = diag(σ_{b_i}²) and no statistical information.

  19. Some Generalized Tikhonov Solutions: with prior x_0; the solution is not smoothed. Figures show the cases C_b = diag(σ_{b_i}²) and no statistical information.

  20. Some Generalized Tikhonov Solutions: x_0 = 0, exponential noise. Figures show the cases C_b = diag(σ_{b_i}²) and no statistical information.

  21. Newton's Method converges in 5–10 iterations

    l  cb  mean      std
    0  1   8.23e+00  6.64e-01
    0  2   8.31e+00  9.80e-01
    0  3   8.06e+00  1.06e+00
    1  1   4.92e+00  5.10e-01
    1  2   1.00e+01  1.16e+00
    1  3   1.00e+01  1.19e+00
    2  1   5.01e+00  8.90e-01
    2  2   8.29e+00  1.48e+00
    2  3   8.38e+00  1.50e+00

  Table: Convergence characteristics (mean and standard deviation of the iteration count k) for problem phillips with n = 40 over 500 runs.

  22. Newton's Method converges in 5–10 iterations

    l  cb  mean      std
    0  1   6.84e+00  1.28e+00
    0  2   8.81e+00  1.36e+00
    0  3   8.72e+00  1.46e+00
    1  1   6.05e+00  1.30e+00
    1  2   7.40e+00  7.68e-01
    1  3   7.17e+00  8.12e-01
    2  1   6.01e+00  1.40e+00
    2  2   7.28e+00  8.22e-01
    2  3   7.33e+00  8.66e-01

  Table: Convergence characteristics (mean and standard deviation of the iteration count k) for problem blur with n = 36 over 500 runs.

  23. Estimating the Error and Predictive Risk: Error

    l  cb  χ² mean   L-curve mean  GCV mean  UPRE mean
    0  2   4.37e-03  4.39e-03      4.21e-03  4.22e-03
    0  3   4.32e-03  4.42e-03      4.21e-03  4.22e-03
    1  2   4.35e-03  5.17e-03      4.30e-03  4.30e-03
    1  3   4.39e-03  5.05e-03      4.38e-03  4.37e-03
    2  2   4.50e-03  6.68e-03      4.39e-03  4.56e-03
    2  3   4.37e-03  6.66e-03      4.43e-03  4.54e-03

  Table: Error characteristics for problem phillips with n = 60 over 500 runs with error-contaminated x_0; relative errors larger than 0.009 removed. The results are comparable.

  24. Estimating the Error and Predictive Risk: Risk

    l  cb  χ² mean   L-curve mean  GCV mean  UPRE mean
    0  2   3.78e-02  5.22e-02      3.15e-02  2.92e-02
    0  3   3.88e-02  5.10e-02      2.97e-02  2.90e-02
    1  2   3.94e-02  5.71e-02      3.02e-02  2.74e-02
    1  3   1.10e-01  5.90e-02      3.27e-02  2.79e-02
    2  2   3.41e-02  6.00e-02      3.35e-02  3.79e-02
    2  3   3.61e-02  5.98e-02      3.35e-02  3.82e-02

  Table: Predictive risk characteristics for problem phillips with n = 60 over 500 runs. The χ² method does not give the best estimate of the risk.

  25. Estimating the Error and Predictive Risk: error histogram for normal noise on the right-hand side, first order derivative operator, C_b = σ² I.

  26. Estimating the Error and Predictive Risk: error histogram for exponential noise on the right-hand side, first order derivative operator, C_b = σ² I.

  27. Conclusions
  - The χ² Newton algorithm is cost effective.
  - It performs as well as (or better than) GCV and UPRE when statistical information is available.
  - It should be the method of choice when statistical information is provided.
  - The method can be adapted to find W_b if W_x is provided.
