Condition estimation of linear algebraic equations and its application to feature selection
Joab Winkler
Department of Computer Science, The University of Sheffield, Sheffield, United Kingdom
The Institute of High Performance Computing A*
1. Introduction: mathematical background
2. Regression
3. Condition numbers and regularisation: the effective condition number, the discrete Picard condition, Tikhonov regularisation, componentwise condition numbers
4. Feature selection: feature selection and condition estimation
5. Summary
Introduction
Several problems require the prediction of the output of a physical system for which the sample size n is much smaller than the dimension of the data p:
- Chemometrics
- Brain imaging
- Genomics
- Gene selection from microarray data
- Text analysis
The condition n < p implies that there are many models that satisfy the given data, and important issues therefore arise:
- Which model from this infinite set of models should be chosen?
- What is the criterion that should be used for this selection?
- Can the selection be generic, that is, not problem dependent, such that prior information is not required?
Mathematical background
These problems yield an equation of the form

Ax = b + ε,  A ∈ R^{m×n}, b ∈ R^m, x ∈ R^n,

where m < n, rank A = m and ε is the noise. The least squares minimisation of ε leads to the normal equation A^T A x = A^T b, whose solution is

x_soln = A†b = V S† U^T b,

where the superscript † denotes the pseudo-inverse. The solution can be written as

x_soln = x_ln + x_0,  x_ln = V [S_1^{-1} U^T b ; 0_{n−m}],  x_0 = V [0_m ; r],

where x_ln is the minimum norm solution, x_0 lies in the null space of A, r ∈ R^{n−m} is arbitrary, and S = [S_1  0_{m,n−m}].
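This decomposition can be computed directly from the SVD. The following is a minimal sketch in Python/NumPy (the function name and the optional vector r are illustrative assumptions, not part of the slides):

```python
# Minimal sketch (NumPy assumed): minimum-norm solution x_ln and a null-space
# component x_0 of Ax = b for A in R^{m x n} with m < n and rank A = m.
import numpy as np

def min_norm_and_null_space(A, b, r=None):
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=True)   # A = U S V^T
    V = Vt.T
    x_ln = V[:, :m] @ (U.T @ b / s)                   # x_ln = V [S_1^{-1} U^T b ; 0]
    r = np.zeros(n - m) if r is None else np.asarray(r)
    x_0 = V[:, m:] @ r                                # x_0 = V [0 ; r], so A x_0 = 0
    return x_ln, x_0                                  # any x_soln = x_ln + x_0 satisfies A^T A x = A^T b
```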
The solution x_ln is unsatisfactory for two reasons:
- Prediction accuracy: This solution may have low bias and high variance. Prediction accuracy can sometimes be improved by reducing, or setting to zero, some coefficients of x_ln.
- Interpretation: It is usually desirable to choose the most important components of x_ln that characterise the physical system being considered.
Methods that are used to overcome these problems:
- Ridge regression: The magnitude of the components of x_ln is reduced continuously. It is more stable than subset selection, but it does not set any components to zero and thus it does not yield a sparse model that can be easily interpreted.
- Subset selection: Components of x_ln are deleted in discrete steps. The models are strongly dependent on the components that are deleted because the elimination procedure is discrete, and a small change in the data can cause a large change in the selected model, which reduces the prediction accuracy.
Ridge regression (Tikhonov regularisation): The sensitivity of the solution x_ln to perturbations in b can be reduced by a constraint on the magnitude of the solution x_reg:

x_reg = arg min_x { (Ax − b)^T (Ax − b) + λ‖x‖_2^2 },  λ > 0,

and thus

(A^T A + λI) x_reg = A^T b,  λ > 0.

The lasso ('least absolute shrinkage and selection operator'): The lasso retains and combines the advantages of ridge regression (stability) and subset selection (sparsity):

x_lasso = arg min_x (Ax − b)^T (Ax − b)  subject to  ‖x‖_1 ≤ t,

which can also be written as

x_lasso = arg min_x { (Ax − b)^T (Ax − b) + λ‖x‖_1 },  λ > 0.
The elastic net: This method is an improvement on the lasso and it combines L1 and L2 regularisation:

x_elastic = arg min_x { (Ax − b)^T (Ax − b) + λ_1‖x‖_1 + λ_2‖x‖_2^2 },  λ_1, λ_2 > 0.

The solutions from Tikhonov regularisation, the lasso and the elastic net reduce the sensitivity of the least norm solution x_ln to perturbations in b, but there are differences between these forms of regularisation.
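For reference, a minimal sketch of the three regularised solutions using scikit-learn (an assumed tool, not part of the slides); scikit-learn's alpha plays the role of λ, and its elastic net is parametrised by alpha and l1_ratio rather than separate λ_1 and λ_2:

```python
# Sketch only: Tikhonov (ridge), lasso and elastic net fits with scikit-learn.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))               # m < n: fewer samples than features
x_true = np.zeros(200); x_true[:5] = 1.0         # sparse ground truth (illustrative)
b = A @ x_true + 1e-3 * rng.standard_normal(50)

x_reg     = Ridge(alpha=1e-2, fit_intercept=False).fit(A, b).coef_
x_lasso   = Lasso(alpha=1e-2, fit_intercept=False).fit(A, b).coef_
x_elastic = ElasticNet(alpha=1e-2, l1_ratio=0.5, fit_intercept=False).fit(A, b).coef_

# Ridge leaves the coefficients non-zero; the lasso and elastic net set many to zero.
print(np.count_nonzero(x_reg), np.count_nonzero(x_lasso), np.count_nonzero(x_elastic))
```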
Compare Tikhonov regularisation, the lasso and the elastic net.
Tikhonov regularisation:
- imposes a Gaussian prior on the parameters of the model;
- does not impose sparsity on x_reg;
- the solution x_reg has a closed form expression.
The lasso:
- imposes a Laplacian prior on the parameters of the model;
- favours sparse solutions because some coefficients of x_lasso are set to zero, and the sparsity of x_lasso increases as λ increases;
- the solution x_lasso does not have a closed form expression and quadratic programming is required for its computation.
The elastic net:
- the sparsity of x_elastic is similar to the sparsity of x_lasso;
- the solution x_elastic favours a model in which strongly correlated predictors are usually either all included, or all excluded;
- the solution x_elastic is much better than x_lasso for some problems.
A regularised solution (Tikhonov, the lasso and the elastic net) is stable with respect to perturbations in b, but several points arise:
- Is regularisation always required when the data b are corrupted by noise?
- Must specific conditions on A and b be satisfied in order that regularisation is imposed only when it is required?
- What are the consequences of applying regularisation when it is not required?
- If regularisation is required, then r_method = ‖x_ln − x_method‖ ≠ 0, method = {reg, lasso, elastic}. Can bounds be imposed on r_reg, r_lasso and r_elastic, such that these errors induced by regularisation are quantified?
The answers to these questions are most easily obtained if Tikhonov regularisation is considered because the constraint in the 2-norm lends itself naturally to the SVD.
Regression
The use of regularisation is usually justified for three reasons:
- It reduces or eliminates over-fitting in regression.
- It reduces the sensitivity of the regression curve to noise in the data.
- It imposes a unique solution in feature selection, Ax = b, where A ∈ R^{m×n}, b ∈ R^m, x ∈ R^n, m < n and rank A = m.
But there are well-defined problems for which regularisation must not be used because it causes a large degradation in the solution, and thus: Can a quantitative test be established such that regularisation is used only when it is required? Regression provides a good example of the correct use, and the incorrect use, of regularisation.
Example 1. Consider the points (x_i, y_i), i = 1, ..., 100, where the independent variables x_i are not uniformly distributed in the interval I = [1, 20], and the dependent variables y_i are given by

y_i = Σ_{k=1}^{33} a_k exp( −(x_i − d_k)^2 / (2σ_d^2) ),  i = 1, ..., 100,

where the centres d_k of the 33 basis functions are uniformly distributed in I and σ_d = 1.35. Consider two sets of data points, y = y_1 and y = y_2, and the perturbations δy_1 and δy_2,

δy_1, δy_2 ∼ N(μ = 0, σ^2 = 25 × 10^{−8}),

with

‖δy_1‖/‖y_1‖ = 3.41 × 10^{−6}  and  ‖δy_2‖/‖y_2‖ = 8.27 × 10^{−4}.
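A sketch of the design matrix in this example follows; the exact sample points x_i and coefficients a_k are not given in the slides, so random values on the stated interval are used as assumptions:

```python
# Sketch of the Gaussian basis design matrix A in Example 1 (sample points assumed).
import numpy as np

m, n, sigma_d = 100, 33, 1.35
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1.0, 20.0, m))     # non-uniform sample points in I = [1, 20]
d = np.linspace(1.0, 20.0, n)              # uniformly spaced centres of the 33 basis functions

# A[i, k] = exp(-(x_i - d_k)^2 / (2 sigma_d^2)),  so that y = A a
A = np.exp(-(x[:, None] - d[None, :])**2 / (2.0 * sigma_d**2))
print(np.linalg.cond(A))                   # typically very large; the slides quote kappa(A) = 5.11e8 for their data
```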
Figure: The exact curve (top), and the coefficients a_i (bottom) for the exact data y = y_1 and the perturbed data y = y_1 + δy_1.
Figure: The exact curve (top), and the coefficients log_10 |a_i| (bottom) for the exact data y = y_2 and the perturbed data y = y_2 + δy_2.
The interpolated curve is unstable for the data set y = y_1, and stable for the data set y = y_2. The coefficient matrix A ∈ R^{100×33} is the same for y = y_1 and y = y_2, and its condition number is κ(A) = 5.11 × 10^8. Thus:
- The presence of noise in the vector b, where Ax = b, does not imply that x is sensitive to changes in b.
- The condition κ(A) ≫ 1 does not imply that the equation Ax = b is ill-conditioned.
Tikhonov regularisation yields a very good result for y = y_1 (numerically stable, with a small error between the theoretically exact solution and the regularised solution), but an unsatisfactory result for y = y_2 (a very large error between the theoretically exact solution and the regularised solution).
Condition numbers and regularisation
The 2-norm condition number of A ∈ R^{m×n} is

κ(A) = s_1 / s_p,  p = min(m, n),

where s_i, i = 1, ..., p, are the singular values of A and rank A = p. The condition number κ(A) cannot be a measure of the stability of Ax = b because it is independent of b. It is necessary to develop a measure of stability that is a function of A and b. This leads to:
- A refined normwise condition number, the effective condition number, which is a function of A and b.
- Componentwise condition numbers: one condition number for each component of x.
The effective condition number
The effective condition number η(A, b) of A^T A x = A^T b, A ∈ R^{m×n}, m ≥ n, is a refined normwise condition number.

Theorem 1. Let the relative errors Δx and Δb be

Δx = ‖δx‖/‖x‖  and  Δb = ‖δb‖/‖b‖.

The effective condition number η(A, b) of A^T A x = A^T b is equal to the maximum value of the ratio of Δx to Δb with respect to all perturbations δb ∈ R^m:

η(A, b) = max_{δb ∈ R^m} Δx/Δb = (1/s_n) ‖c‖/‖S†c‖,

where A = U S V^T is the SVD of A and c = U^T b.

Proof. It follows from A^T A x = A^T b that x = V S† U^T b = V S† c, and thus

‖δx‖ = ‖V S† U^T δb‖ = ‖S† δc‖ ≤ ‖δc‖/s_n.

Division of both sides of this inequality by ‖x‖ yields

‖δx‖/‖x‖ ≤ (1/s_n)(‖δc‖/‖c‖)(‖c‖/‖x‖) = (1/s_n)(‖δb‖/‖b‖)(‖c‖/‖x‖),

and thus

η(A, b) = max_{δb ∈ R^m} Δx/Δb = (1/s_n) ‖c‖/‖S†c‖ = (1/s_n) ‖b‖/‖S†U^Tb‖.
The minimum value of η(A, b) occurs when c = e_n (e_i is the i-th unit basis vector). If some conditions on b are satisfied, the maximum value of η(A, b) occurs when c = e_1, and thus

1 ≤ η(A, b) ≤ s_1/s_n = κ(A).

More generally, since b = Uc:
- η(A, b) ≈ 1 if the dominant components of b lie along the columns u_i of U that are defined by large values of i.
- η(A, b) ≈ κ(A) if the dominant components of b lie along the columns u_i of U that are defined by small values of i.
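A minimal sketch of the effective condition number, computed from the thin SVD (NumPy assumed; the function name is illustrative):

```python
# eta(A, b) = (1 / s_n) * ||c|| / ||S^+ c||, with c = U^T b; assumes m >= n and full rank.
import numpy as np

def effective_condition_number(A, b):
    U, s, _ = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U S V^T
    c = U.T @ b                                       # c = U^T b
    return np.linalg.norm(c) / (s[-1] * np.linalg.norm(c / s))

# 1 <= eta(A, b) <= kappa(A) = s[0] / s[-1].
```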
The discrete Picard condition
Recall that

η(A, b) = max_{δb ∈ R^m} Δx/Δb = (1/s_n) ‖c‖/‖S†c‖,  c = U^T b,

and consider the ratio |c_i|/s_i, i = 1, ..., n.
Figure: Three forms for the ratio |c_i|/s_i: (a) monotonically decreasing, (b) constant and (c) monotonically increasing.
- If |c_i|/s_i → 0 as i → n, then the constants |c_i| decay to zero faster than the singular values s_i. The condition |c_i|/s_i → 0 as i → n is called the discrete Picard condition. If this condition is satisfied, then η(A, b) ≈ κ(A) and thus Ax = b is ill-conditioned; Tikhonov regularisation can be used to yield a well-conditioned solution x.
- If |c_i|/s_i ≈ 1, i = 1, ..., n, then η(A, b) ≈ κ(A)/√n. Tikhonov regularisation cannot be used to yield a well-conditioned solution x.
- If |c_{i+1}| ≫ |c_i|, i = 1, ..., n−1, then η(A, b) ≈ 1 and regularisation must not be applied.
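The ratios |c_i|/s_i that distinguish these three cases can be inspected directly; a minimal sketch (NumPy assumed, with a toy Hilbert-matrix example rather than the data of Example 1):

```python
# Sketch: compute the ratios |c_i| / s_i used to test the discrete Picard condition.
import numpy as np

def picard_ratios(A, b):
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    c = U.T @ b                        # c = U^T b
    return np.abs(c) / s               # decay towards zero <=> discrete Picard condition satisfied

# Toy usage with an ill-conditioned Hilbert matrix (illustrative only).
n = 12
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)
b = A @ np.ones(n)
print(picard_ratios(A, b))             # inspect whether the ratios decay as i increases
```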
Computation of the discrete Picard condition. The important term for the evaluation of the discrete Picard condition is

‖x‖ = ‖A†b‖ = ‖S†c‖.

If noise is present, then the square of the magnitude of the perturbed solution is

‖x + δx‖^2 = ‖A†(b + δb)‖^2 = ‖S†(c + δc)‖^2 = Σ_{i=1}^{n} ((c_i + δc_i)/s_i)^2.

Assume the magnitude of the perturbation |δc_i| is approximately constant, such that

|δc_i| ≪ |c_i|,  i = 1, ..., p−1,
|δc_i| ≈ |c_p|,  i = p,
|δc_i| ≫ |c_i|,  i = p+1, ..., n.
It follows that if the discrete Picard condition is satisfied,

|c_i + δc_i|/s_i ≈ |c_i|/s_i,  i = 1, ..., p−1,
|c_i + δc_i|/s_i ≈ |c_p + δc_p|/s_p,  i = p,
|c_i + δc_i|/s_i ≈ |δc_i|/s_i,  i = p+1, ..., n.
Figure: The ratios (a) |c_i|/s_i, (b) |δc_i|/s_i and (c) |c_i + δc_i|/s_i if the discrete Picard condition is satisfied.
- If the discrete Picard condition, |c_i|/s_i → 0 as i → n, is satisfied, the function |c_i + δc_i|/s_i cannot be computed because it is sensitive to noise.
- A similar result occurs if |c_i| ≈ s_i, i = 1, ..., n, because |c_i + δc_i|/s_i ≈ |δc_i|/s_i, i = p+1, ..., n.
- If |c_{i+1}| ≫ |c_i|, i = 1, ..., n−1, computational problems do not occur because |c_i + δc_i|/s_i ≈ |c_i|/s_i, i = 1, ..., n.
Summary (discrete Picard condition). The form of the term |c_i|/s_i defines the stability of Ax = b.
- If |c_i|/s_i → 0 as i → n, then Ax = b is ill-conditioned. The dominant components of b lie along the columns u_i of U that are defined by small values of i (large singular values). Tikhonov regularisation enables a well-conditioned solution to be computed.
- If |c_{i+1}| ≫ |c_i|, i = 1, ..., n−1, then Ax = b is well-conditioned and |c_i|/s_i is numerically stable. The dominant components of b lie along the columns u_i of U that are defined by large values of i (small singular values).
Example 2. Consider the regression problem in Example 1.
- The data points y = y_1 yield an ill-conditioned equation A^T A x = A^T b. The effective condition number is η(A, b) = 4.61 × 10^8.
- The data points y = y_2 yield a well-conditioned equation A^T A x = A^T b. The effective condition number is η(A, b) = 7.94.
Figure: The ratio log_10 |c_i|/s_i for the exact data y = y_1 and the ratio log_10 |c_i + δc_i|/s_i for the perturbed data y = y_1 + δy_1.
Figure: The ratio log_10 |c_i|/s_i for the exact data y = y_2 and the ratio log_10 |c_i + δc_i|/s_i for the perturbed data y = y_2 + δy_2.
Tikhonov regularisation
If the discrete Picard condition is satisfied, the equation A^T A x = A^T b is ill-conditioned and Tikhonov regularisation requires the solution x(λ) of

(A^T A + λI) x(λ) = A^T b,

and thus

x(λ) = V (S^T S + λI)^{−1} S^T U^T b,  λ ≥ 0.

What is the error between the regularised solution (λ > 0) and the exact solution (λ = 0) if:
- the discrete Picard condition is satisfied,
- the discrete Picard condition is not satisfied?
Assume that λ* is the optimal value of the regularisation parameter λ.
- If the discrete Picard condition, |c_i|/s_i → 0 as i → n, is satisfied, the solution x(0) is dominated by the large singular values and it is independent of the small singular values. The error is small,

  ‖x(λ*) − x(0)‖/‖x(0)‖ ≈ λ*/(λ* + s_1^2) ≪ 1,

  because Tikhonov regularisation filters out the small singular values.
- If |c_i| ≈ s_i, i = 1, ..., n, the error is smaller because all the singular values contribute to x(0),

  ‖x(λ*) − x(0)‖/‖x(0)‖ ≈ ((n − p)/n)^{1/2}.

  Tikhonov regularisation filters out the components of x(0) associated with the small singular values, but the components associated with the large singular values are not affected.
- If |c_{i+1}|/s_{i+1} ≫ |c_i|/s_i, i = 1, ..., n−1, the error is large,

  ‖x(λ*) − x(0)‖/‖x(0)‖ ≈ 1,  ‖x(λ*)‖ ≈ 0,

  because x(0) is dominated by the small singular values, all of which are filtered out by Tikhonov regularisation.
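A minimal sketch of x(λ) computed with SVD filter factors, and of the relative error ‖x(λ*) − x(0)‖/‖x(0)‖ discussed in the three cases above (NumPy assumed; a value of λ* must be supplied, for example from a standard parameter-choice method):

```python
# Sketch: x(lambda) = V (S^T S + lambda I)^{-1} S^T U^T b via SVD filter factors.
import numpy as np

def tikhonov_svd(A, b, lam):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    c = U.T @ b
    f = s / (s**2 + lam)              # filter factors; lam = 0 gives x(0) = A^+ b
    return Vt.T @ (f * c)

def regularisation_error(A, b, lam_star):
    x0 = tikhonov_svd(A, b, 0.0)      # unregularised solution x(0)
    return np.linalg.norm(tikhonov_svd(A, b, lam_star) - x0) / np.linalg.norm(x0)
```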
Componentwise condition numbers
The condition number κ(A) and the effective condition number η(A, b) are defined in the normwise sense. Condition numbers defined in the componentwise sense are more refined: a condition number is assigned to each component of x, where A^T A x = A^T b. The condition number ρ(A, b, x_j) of the component x_j of x is defined as

ρ(A, b, x_j) = max_{δb ∈ R^m} Δx_j/Δb,  j = 1, ..., n,

where Δx_j = |δx_j|/|x_j| and Δb = ‖δb‖/‖b‖.
Theorem 2. The condition number ρ(A, b, x_j) of the component x_j, j = 1, ..., n, of A^T A x = A^T b is

ρ(A, b, x_j) = ‖e_j^T A†‖ ‖b‖ / |x_j| = ‖e_j^T A†‖ ‖b‖ / |e_j^T A† b| = 1/cos γ_j > 1,

where γ_j is the angle between b and the j-th row of A†.

Proof. If e_j is the j-th unit basis vector, then a change δb in b causes a change δx_j in x_j,

δx_j = e_j^T δx = e_j^T A† δb,  j = 1, ..., n,

and thus

Δx_j = |δx_j|/|x_j| ≤ ‖e_j^T A†‖ ‖δb‖ / |x_j|,  j = 1, ..., n,

and the result follows.
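A minimal sketch of the componentwise condition numbers of Theorem 2 (NumPy assumed; full column rank so that A†b is the solution of the normal equations):

```python
# rho(A, b, x_j) = ||e_j^T A^+|| ||b|| / |e_j^T A^+ b|, one value per component x_j.
import numpy as np

def componentwise_condition_numbers(A, b):
    A_pinv = np.linalg.pinv(A)                   # pseudo-inverse A^+
    x = A_pinv @ b                               # x_j = e_j^T A^+ b
    row_norms = np.linalg.norm(A_pinv, axis=1)   # ||e_j^T A^+|| for each j
    return row_norms * np.linalg.norm(b) / np.abs(x)
```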
Example 3. Consider the regression problem in Example 1. The data points y = y_1 yield an ill-conditioned equation A^T A x = A^T b, and the data points y = y_2 yield a well-conditioned equation A^T A x = A^T b.
Figure: The effective condition number and the componentwise condition numbers, on a logarithmic scale, for y = y_1.
Figure: The effective condition number and the componentwise condition numbers for y = y_2.
Feature selection
Many problems in feature selection yield the equation

Ax = b + ε,  A ∈ R^{m×n}, b ∈ R^m, x ∈ R^n,

where m < n, rank A = m and ε is the noise in the data b. The normal equations are

(A^T A) x = A^T b.

This equation has an infinite number of solutions x_soln,

x_soln = V [S_1^{−1} U^T b ; 0_{n−m}] + V [0_m ; r] = x_ln + x_0,

where x_ln is the solution of minimum norm, x_0 lies in the null space of A and is orthogonal to x_ln, r ∈ R^{n−m} is arbitrary, and

A = U S V^T = U [S_1  0_{m,n−m}] V^T,  S_1 = diag{s_i}_{i=1}^{m}.
The most important features (components of x_soln) of the system are usually sought and a sparse solution is therefore desired. Use the n − m degrees of freedom in r in the null space vector x_0 to obtain a sparse and regularised solution x_soln,

x_soln = V [S_1^{−1} U^T b ; 0_{n−m}] + V [0_m ; r] = x_ln + x_0.

The solutions x_lasso and x_elastic ignore the vector x_0. The vector r is chosen such that if the d components k_1, k_2, ..., k_d of x_ln have large condition numbers, then these d components of the solution x_soln are equal to zero. The lasso and the elastic net yield sparse solutions from constrained minimisation problems, but they are approximate solutions of the normal equations

(A^T A) x = A^T b.

The need, or otherwise, for regularisation, and the errors between (a) the solutions x_soln and x_lasso, and (b) the solutions x_soln and x_elastic, are not considered.
Feature selection and condition estimation
Since

x_soln = x_ln + x_0 = x_ln + V [0_m ; r] = x_ln + [V_1  V_2] [0_m ; r],

it follows that

x_soln = x_ln + V_2 r,  V_2 ∈ R^{n×(n−m)}.

By definition, the components of x_ln that have the d largest condition numbers, 1 ≤ d ≤ n − m, satisfy

x_{ln,k} + x_{0,k} = 0,  k = k_1, k_2, ..., k_d,

that is, these components of x_soln are set to zero, thereby inducing sparsity in x_soln. The vector r is chosen to satisfy these d equations.
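A minimal sketch of this null-space construction (NumPy assumed; the componentwise condition numbers of Theorem 2 are used to pick the d indices k_1, ..., k_d, and the d linear equations in r are solved by least squares):

```python
# Sketch: zero the d worst-conditioned components of x_soln = x_ln + V_2 r (d <= n - m).
import numpy as np

def sparse_null_space_solution(A, b, d):
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    V = Vt.T
    x_ln = V[:, :m] @ (U.T @ b / s)              # minimum-norm solution
    V2 = V[:, m:]                                # orthonormal basis of the null space of A

    # componentwise condition numbers of x_ln (Theorem 2)
    rho = np.linalg.norm(np.linalg.pinv(A), axis=1) * np.linalg.norm(b) / np.abs(x_ln)
    k = np.argsort(rho)[-d:]                     # indices k_1, ..., k_d of the d largest

    # choose r so that x_ln[k] + (V_2 r)[k] = 0, i.e. V2[k, :] r = -x_ln[k]
    r, *_ = np.linalg.lstsq(V2[k, :], -x_ln[k], rcond=None)
    x_soln = x_ln + V2 @ r                       # A x_soln = A x_ln; x_soln[k] ≈ 0 when V2[k, :] has full row rank
    return x_soln, k
```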
Summary
- A large condition number of A, κ(A) ≫ 1, does not imply that the equation Ax = b is ill-conditioned.
- The effective condition number η(A, b) is a better measure of the stability of Ax = b because it is a function of A and b; it should be compared with κ(A), which is a function of A only.
- If the discrete Picard condition is satisfied, then η(A, b) ≈ κ(A) and Tikhonov regularisation yields a stable solution whose error is small.
- More refined measures of the stability of Ax = b are the componentwise condition numbers, which measure the stability of each component of x due to a perturbation in b.
- The equation Ax = b for feature selection has an infinite number of solutions. The lasso and elastic net apply regularisation to the least squares problem and yield only approximate solutions of the normal equations; the null space of A can instead be used to set the worst-conditioned components of the solution to zero, which induces sparsity.