SLIDE 1

Sparse Inverse Covariance Estimation Using Quadratic Approximation

Inderjit S. Dhillon, Dept. of Computer Science, UT Austin. MLSLP Symposium, Portland, Oregon, Sept 14, 2012.

Joint work with C. Hsieh, M. Sustik and P. Ravikumar


SLIDE 2

Inverse Covariance Estimation

Given: n i.i.d. samples {y_1, ..., y_n}, y_i ∼ N(µ, Σ). Goal: estimate the inverse covariance Θ = Σ⁻¹. The sample mean and covariance are defined by

µ̂ = (1/n) ∑_{i=1}^n y_i  and  S = (1/n) ∑_{i=1}^n (y_i − µ̂)(y_i − µ̂)ᵀ.

Given the n samples, the likelihood is

P(y_1, ..., y_n; µ̂, Θ) ∝ ∏_{i=1}^n (det Θ)^{1/2} exp(−(1/2)(y_i − µ̂)ᵀ Θ (y_i − µ̂))
                        = (det Θ)^{n/2} exp(−(1/2) ∑_{i=1}^n (y_i − µ̂)ᵀ Θ (y_i − µ̂)).
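As a quick sanity check of these estimators, a minimal NumPy sketch (the variable names are ours, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 5
    Y = rng.standard_normal((n, p))   # n i.i.d. samples y_1, ..., y_n as rows

    mu_hat = Y.mean(axis=0)           # sample mean (1/n) * sum_i y_i
    R = Y - mu_hat                    # centered samples
    S = (R.T @ R) / n                 # sample covariance (1/n) * sum_i (y_i - mu)(y_i - mu)^T

    # np.cov uses the 1/(n-1) convention, so rescale before comparing.
    assert np.allclose(S, np.cov(Y, rowvar=False) * (n - 1) / n)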

SLIDE 3

Inverse Covariance Estimation

The log-likelihood can be written as

log P(y_1, ..., y_n; µ̂, Θ) = (n/2) log det Θ − (n/2) tr(ΘS) + constant.

The maximum likelihood estimator of Θ is

Θ = arg min_{X≻0} {−log det X + tr(SX)}.

In high dimensions (n < p), the sample covariance matrix S is singular. We also want Θ to be sparse.
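The MLE objective −log det X + tr(SX) is best evaluated through a Cholesky factorization rather than a determinant; a small sketch (our own helper, assuming X ≻ 0):

    import numpy as np

    def mle_objective(X, S):
        # -log det X + tr(SX) for symmetric positive definite X;
        # log det X = 2 * sum(log diag(L)) where X = L L^T.
        L = np.linalg.cholesky(X)   # raises LinAlgError if X is not PD
        return -2.0 * np.log(np.diag(L)).sum() + np.trace(S @ X)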

SLIDE 4

Structure for Gaussian Markov Random Field

The nonzero pattern of Θ is important: each Gaussian distribution can be represented by a pairwise Gaussian Markov Random Field (GMRF). Conditional independence is reflected as zeros in Θ: Θ_ij = 0 ⇔ y_i and y_j are conditionally independent given the other variables. In a GMRF G = (V, E), each node corresponds to a variable, and each edge corresponds to a non-zero entry in Θ.

SLIDE 5

Examples

An example – chain graph: y_j = ϕ y_{j−1} + N(0, 1), as in the simulation sketch below. Real-world example: a graphical model that reveals the relationships between Senators (figure from Banerjee et al., 2008).
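To see the chain structure appear as a (near-)tridiagonal Θ, a small simulation sketch (phi, n, and p are our choices):

    import numpy as np

    rng = np.random.default_rng(0)
    p, n, phi = 50, 20000, 0.5

    # Simulate y_j = phi * y_{j-1} + N(0, 1); each row is one independent chain.
    Y = np.zeros((n, p))
    Y[:, 0] = rng.standard_normal(n)
    for j in range(1, p):
        Y[:, j] = phi * Y[:, j - 1] + rng.standard_normal(n)

    # The inverse sample covariance is approximately tridiagonal:
    Theta = np.linalg.inv(np.cov(Y, rowvar=False))
    print(np.abs(np.triu(Theta, k=2)).max())   # small: entries beyond one off-diagonal vanish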

SLIDE 6

Prior Work

  • COVSEL: block coordinate descent with an interior point solver for each block (Banerjee et al., 2007).
  • Glasso: block coordinate descent with a coordinate descent solver for each block (Friedman et al., 2007).
  • VSM: Nesterov's smooth optimization algorithm (Lu, 2009).
  • PSM: projected subgradient method (Duchi et al., 2008).
  • SINCO: greedy coordinate descent (Scheinberg and Rish, 2009).
  • ALM: alternating linearization method (Scheinberg et al., 2010).
  • IPM: inexact interior point method (Li and Toh, 2010).
  • PQN: projected quasi-Newton method on the dual problem (Schmidt et al., 2009).

SLIDE 7

ℓ1-regularized covariance selection

A sparse inverse covariance matrix is preferred – add ℓ1 regularization to promote sparsity. The resulting optimization problem is

Θ = arg min_{X≻0} {−log det X + tr(SX) + λ‖X‖₁} = arg min_{X≻0} f(X),

where ‖X‖₁ = ∑_{i,j=1}^p |X_ij|. The regularization parameter λ > 0 controls the sparsity. This extends to a more general regularization term: ‖Λ ∘ X‖₁ = ∑_{i,j=1}^p λ_ij |X_ij|.
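Building on the slide-3 sketch, the regularized objective f(X) is then (again our own helper):

    import numpy as np

    def objective(X, S, lam):
        # f(X) = -log det X + tr(SX) + lam * ||X||_1 (elementwise l1 norm)
        L = np.linalg.cholesky(X)
        return -2.0 * np.log(np.diag(L)).sum() + np.trace(S @ X) + lam * np.abs(X).sum()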

SLIDE 8

Second Order Method

Newton's method for a twice-differentiable function: x ← x − η(∇²f(x))⁻¹∇f(x). However, the sparse inverse covariance estimation objective f(X) = −log det X + tr(SX) + λ‖X‖₁ is not differentiable. Most current solvers are first-order methods: block coordinate descent (Glasso), projected subgradient descent (PSM), greedy coordinate descent (SINCO), and the alternating linearization method (ALM).

SLIDE 9

Quadratic Approximation

Write the objective as f(X) = g(X) + h(X), where g(X) = −log det X + tr(SX) and h(X) = λ‖X‖₁. Here g(X) is twice differentiable, while h(X) is convex but non-differentiable, so we can only form a quadratic approximation of g(X). The quadratic approximation of g(Xt + Δ) is

ḡXt(Δ) = tr((S − Wt)Δ) + (1/2) tr(WtΔWtΔ) − log det Xt + tr(SXt),   where Wt = (Xt)⁻¹.

Note that tr(WtΔWtΔ) = vec(Δ)ᵀ(Wt ⊗ Wt) vec(Δ); a numeric check of this identity follows.
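A numeric check of the Kronecker identity on random symmetric matrices (all names are ours; this only verifies the algebra):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 4
    A = rng.standard_normal((p, p))
    W = A @ A.T + p * np.eye(p)        # stand-in for W_t = (X_t)^{-1}, symmetric PD
    B = rng.standard_normal((p, p))
    Delta = B + B.T                    # symmetric perturbation

    lhs = np.trace(W @ Delta @ W @ Delta)
    rhs = Delta.flatten() @ np.kron(W, W) @ Delta.flatten()
    assert np.isclose(lhs, rhs)        # tr(W D W D) = vec(D)^T (W kron W) vec(D)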

SLIDE 10

Descent Direction

Define the generalized Newton direction

D = arg min_Δ { ḡXt(Δ) + λ‖Xt + Δ‖₁ },

where ḡXt(Δ) = tr((S − Wt)Δ) + (1/2) tr(WtΔWtΔ) is the quadratic model of g(Xt + Δ) from the previous slide. This can be rewritten as a Lasso-type problem with p(p + 1)/2 variables:

(1/2) vec(Δ)ᵀ(Wt ⊗ Wt) vec(Δ) + vec(S − Wt)ᵀ vec(Δ) + λ‖vec(Δ)‖₁.

Coordinate descent is efficient at solving Lasso-type problems.

SLIDE 11

Coordinate Descent Updates

Use cyclic coordinate descent to solve arg min_Δ { ḡXt(Δ) + λ‖Xt + Δ‖₁ }:

Generate a sequence D₁, D₂, ..., where Dᵢ is obtained from Dᵢ₋₁ by changing a single variable; variables are visited in cyclic order.

A naive approach has an update cost of O(p²), because ∇ij ḡ(Δ) = ((Wt ⊗ Wt) vec(Δ) + vec(S − Wt))ij. Next we show how to reduce the cost from O(p²) to O(p).

SLIDE 12

Coordinate Descent Updates

Each coordinate descent update solves

µ̄ = arg min_µ ḡ(D + µ(eᵢeⱼᵀ + eⱼeᵢᵀ)) + 2λ|Xij + Dij + µ|,   Dij ← Dij + µ̄.

The one-variable problem simplifies to

(1/2)(Wij² + Wii Wjj)µ² + (Sij − Wij + wᵢᵀDwⱼ)µ + λ|Xij + Dij + µ|.

This is a quadratic form with ℓ1 regularization – soft-thresholding gives the exact solution.

SLIDE 13

Efficient solution of one-variable problem

Introduce a = Wij² + Wii Wjj, b = Sij − Wij + wᵢᵀDwⱼ, and c = Xij + Dij. Then the minimum is achieved at

µ = −c + S(c − b/a, λ/a),

where S(z, r) = sign(z) max{|z| − r, 0} is the soft-thresholding function. The main cost is computing wᵢᵀDwⱼ: direct computation requires O(p²) flops. Instead, we maintain U = DW after each coordinate update and compute wᵢᵀuⱼ – only O(p) flops per update.
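The closed form can be sanity-checked against a brute-force grid (the numbers below are arbitrary; the O(p) trick with U = DW appears in the full sketch after slide 20):

    import numpy as np

    def soft_threshold(z, r):
        return np.sign(z) * max(abs(z) - r, 0.0)

    def one_variable_min(a, b, c, lam):
        # argmin_mu 0.5*a*mu^2 + b*mu + lam*|c + mu|, the slide's closed form
        return -c + soft_threshold(c - b / a, lam / a)

    a, b, c, lam = 2.0, -0.7, 0.3, 0.5
    grid = np.linspace(-5, 5, 2_000_001)
    obj = 0.5 * a * grid**2 + b * grid + lam * np.abs(c + grid)
    assert abs(grid[np.argmin(obj)] - one_variable_min(a, b, c, lam)) < 1e-5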

SLIDE 14

Line Search

Adopt Armijo's rule – try step-sizes α ∈ {β⁰, β¹, β², ...} until Xt + αDt:

  • 1. is positive definite, and
  • 2. satisfies the sufficient decrease condition f(Xt + αDt) ≤ f(Xt) + ασΔt, where Δt = tr(∇g(Xt)Dt) + λ‖Xt + Dt‖₁ − λ‖Xt‖₁.

Both conditions can be checked with a Cholesky factorization – O(p³) flops per line-search iteration. Can possibly do better by using Lanczos [K.-C. Toh].
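A sketch of this line search (our helper; f is the objective sketch from slide 7, and grad_g = S − Wt):

    import numpy as np

    def line_search(f, X, D, grad_g, lam, beta=0.5, sigma=1e-3, max_tries=30):
        # Armijo rule: try alpha in {1, beta, beta^2, ...} until X + alpha*D
        # (1) is positive definite and (2) sufficiently decreases f.
        f_old = f(X)
        delta = (grad_g * D).sum() + lam * (np.abs(X + D).sum() - np.abs(X).sum())
        alpha = 1.0
        for _ in range(max_tries):
            X_new = X + alpha * D
            try:
                np.linalg.cholesky(X_new)   # condition 1, and reusable for log det
                if f(X_new) <= f_old + alpha * sigma * delta:
                    return alpha, X_new     # condition 2 holds
            except np.linalg.LinAlgError:
                pass                        # not PD; shrink the step
            alpha *= beta
        raise RuntimeError("no acceptable step size found")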

SLIDE 15

Free and Fixed Set — Motivation

Recall the cost of finding the descent direction: O(p²) variables, each update costs O(p) flops → O(p³) flops per sweep. Our goal: reduce the number of variables from O(p²) to ‖Xt‖₀, the number of nonzeros in Xt. ‖Xt‖₀ can be much smaller than O(p²), since a suitable λ gives a sparse solution. Our strategy: before solving for the Newton direction, guess which variables need to be updated.

SLIDE 16

Free and Fixed Sets

(Xt)ij belongs to the fixed set if and only if |∇ij g(Xt)| < λ and (Xt)ij = 0. The remaining variables constitute the free set. We then perform the coordinate descent updates only on the free set.
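Selecting the free set is a two-line mask in NumPy (a sketch; grad_g = S − Wt as before):

    import numpy as np

    def free_mask(X, grad_g, lam):
        # (i, j) is fixed iff |grad_ij g(X)| < lam and X_ij = 0; the rest is free.
        fixed = (np.abs(grad_g) < lam) & (X == 0)
        return ~fixed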

SLIDE 17

Size of free set

In practice, the size of the free set is small. Take the hereditarybc dataset as an example: p = 1869, so the number of variables is p² ≈ 3.49 million. The size of the free set drops to about 20,000 by the end.

SLIDE 18

Block-diagonal Structure

Recently, (Mazumder and Hastie, 2012) and (Witten et al., 2011) proposed a block decomposition approach. Consider the thresholded covariance matrix E with Eij = max(|Sij| − λ, 0). When E is block-diagonal, the solution is also block-diagonal:

E = diag(E₁, E₂, ..., E_k)  ⟹  Θ* = diag(Θ₁*, Θ₂*, ..., Θ_k*).

Based on this approach, the original problem can be decomposed into k independent sub-problems.
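The blocks can be recovered from the thresholded matrix with one connected-components pass; a sketch using scipy (our choice of tooling):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def blocks(S, lam):
        # Link nodes i, j iff |S_ij| > lam; each connected component of this
        # graph is one independent sub-problem of the decomposition.
        E = (np.abs(S) > lam).astype(int)
        np.fill_diagonal(E, 1)
        n_blocks, labels = connected_components(csr_matrix(E), directed=False)
        return n_blocks, labels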

SLIDE 19

Block-diagonal Structure for “Free”

Our method automatically discovers the block-diagonal structure too. Key observation: off-diagonal blocks are always in the fixed set. Recall the definition of the fixed set: |∇ij g(Xt)| < λ and (Xt)ij = 0. For (i, j) in an off-diagonal block:

  • 1. Initialize from the identity matrix, so (X₀)ij = 0.
  • 2. ∇ij g(Xt) = Sij − ((Xt)⁻¹)ij = Sij.
  • 3. Eij = max(|Sij| − λ, 0) = 0 implies |∇ij g(Xt)| < λ, so (i, j) is always in the fixed set.

Off-diagonal blocks are always 0, so QUIC gets the speedup for free.

SLIDE 20

Final Algorithm

QUIC: QUadratic approximation for sparse Inverse Covariance estimation.

Input: empirical covariance matrix S, scalar λ, initial X₀. For t = 0, 1, ...:

  • 1. Compute Wt = (Xt)⁻¹.
  • 2. Form the second-order approximation ḡXt(Δ) to g around Xt.
  • 3. Partition the variables into free and fixed sets.
  • 4. Use coordinate descent to find the descent direction Dt = arg min_Δ f̄Xt(Xt + Δ) over the free set (a Lasso problem).
  • 5. Use Armijo-rule based step-size selection to find α such that Xt+1 = Xt + αDt is positive definite and the objective sufficiently decreases.

A runnable sketch of this loop follows.
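A minimal end-to-end sketch of the loop (ours, not the released QUIC code; fixed iteration counts stand in for a real convergence test):

    import numpy as np

    def _soft(z, r):
        return np.sign(z) * np.maximum(np.abs(z) - r, 0.0)

    def _f(X, S, lam):   # f(X), as on slide 7
        L = np.linalg.cholesky(X)
        return -2.0 * np.log(np.diag(L)).sum() + np.trace(S @ X) + lam * np.abs(X).sum()

    def quic(S, lam, n_outer=10, n_sweeps=3, beta=0.5, sigma=1e-3):
        p = S.shape[0]
        X = np.eye(p)                                  # identity start, as on slide 19
        for _ in range(n_outer):
            W = np.linalg.inv(X)
            free = (np.abs(S - W) >= lam) | (X != 0)   # slide 16's free set
            D = np.zeros((p, p))
            U = np.zeros((p, p))                       # maintain U = D W (slide 13)
            for _ in range(n_sweeps):                  # coordinate descent sweeps
                for i in range(p):
                    for j in range(i, p):
                        if not free[i, j]:
                            continue
                        a = W[i, i] ** 2 if i == j else W[i, j] ** 2 + W[i, i] * W[j, j]
                        b = S[i, j] - W[i, j] + W[i, :] @ U[:, j]
                        c = X[i, j] + D[i, j]
                        mu = -c + _soft(c - b / a, lam / a)
                        if mu == 0.0:
                            continue
                        D[i, j] += mu
                        U[i, :] += mu * W[j, :]
                        if i != j:                     # keep D (and U = D W) symmetric
                            D[j, i] += mu
                            U[j, :] += mu * W[i, :]
            # Armijo line search, as on slide 14.
            f_old = _f(X, S, lam)
            delta = ((S - W) * D).sum() + lam * (np.abs(X + D).sum() - np.abs(X).sum())
            alpha = 1.0
            while alpha > 1e-10:
                try:
                    X_new = X + alpha * D
                    np.linalg.cholesky(X_new)          # positive definiteness check
                    if _f(X_new, S, lam) <= f_old + alpha * sigma * delta:
                        X = X_new
                        break
                except np.linalg.LinAlgError:
                    pass
                alpha *= beta
        return X

For example, quic(S, 0.2) on the sample covariance from slide 2 should return a sparse positive definite estimate of Θ.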

SLIDE 21

Methods included in our comparisons

  • QUIC: proposed method.
  • ALM: alternating linearization method (Scheinberg et al., 2010).
  • Glasso: block coordinate descent (Friedman et al., 2007).
  • PSM: projected subgradient method (Duchi et al., 2008).
  • SINCO: greedy coordinate descent (Scheinberg and Rish, 2009).
  • IPM: inexact interior point method (Li and Toh, 2010).

SLIDE 22

Senate dataset

US Senate voting records from the 109th Congress (2004-2006): 100 Senators (p = 100) and 542 bill votes (each +1 or −1). We solve the sparse inverse covariance problem on this data.

Figure from Banerjee et al., 2008.

SLIDE 23

Synthetic datasets

We generate the following two types of graph structures for the GMRF:

Chain graphs: the ground-truth inverse covariance matrix Σ⁻¹ is set to Σ⁻¹_{i,i−1} = −0.5 and Σ⁻¹_{i,i} = 1.25.

Graphs with random sparsity structures: first generate a sparse matrix U with nonzero elements equal to ±1, set Σ⁻¹ = UᵀU, then add a diagonal term to ensure Σ⁻¹ is positive definite. We control the number of nonzeros in U so that the resulting Σ⁻¹ has approximately 10p nonzero elements.
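The random-structure recipe as a sketch (p, the sparsity level, and the unit diagonal shift are our choices, only to illustrate the steps):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 200
    nnz = 5 * p                       # our sparsity level for U

    # Step 1: sparse U with nonzero entries +/- 1.
    U = np.zeros((p, p))
    idx = rng.choice(p * p, size=nnz, replace=False)
    U.flat[idx] = rng.choice([-1.0, 1.0], size=nnz)

    # Steps 2-3: Theta = U^T U, plus a diagonal term to make it positive definite.
    Theta = U.T @ U + np.eye(p)

    # Draw the n = p/2 samples used in the experiments.
    n = p // 2
    Y = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=n)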

SLIDE 24

Experimental settings

Test under two values of λ: one that recovers the correct number of nonzeros, and one that recovers 5 times that number. For each distribution we draw n = p/2 i.i.d. samples as input. We report the time for each algorithm to reach an ε-accurate solution: f(Xt) − f(X*) < ε f(X*). A * indicates the run time exceeded 30,000 seconds (8.3 hours).

SLIDE 25

Results for Synthetic datasets

Dataset setting (pattern, p, λ, ε) and time in seconds per method:

pattern   p      λ      ε       QUIC    ALM     Glasso   PSM     IPM     Sinco
chain     1000   0.4    10^-2   0.30    18.89   23.28    15.59   86.32   120.0
chain     1000   0.4    10^-6   2.26    41.85   45.1     34.91   151.2   520.8
chain     10000  0.4    10^-2   216.7   13820   *        8450    *       *
chain     10000  0.4    10^-6   986.6   28190   *        19251   *       *
random    1000   0.12   10^-2   0.52    42.34   10.31    20.16   71.62   60.75
random    1000   0.12   10^-6   1.2     28250   20.43    59.89   116.7   683.3
random    1000   0.075  10^-2   1.17    65.64   17.96    23.53   78.27   576.0
random    1000   0.075  10^-6   6.87    *       60.61    91.7    145.8   4449
random    10000  0.08   10^-2   337.7   26270   21298    *       *       *
random    10000  0.08   10^-6   1125    *       *        *       *       *
random    10000  0.04   10^-2   803.5   *       *        *       *       *
random    10000  0.04   10^-6   2951    *       *        *       *       *

SLIDE 26

Real datasets

Figure: comparison of algorithms on real datasets – (a) time for Estrogen, p = 692; (b) time for hereditarybc, p = 1,869. The results show QUIC converges faster than the other methods.

SLIDE 27

Conclusions

We proposed a quadratic approximation method for sparse inverse covariance learning (QUIC). Three key ingredients:

  • Exploiting the structure of the Hessian – we did this in the context of coordinate descent; Nocedal and colleagues (2012) have recently developed other methods that exploit Hessian structure, e.g., Newton-CG.
  • An Armijo-type step-size rule.
  • Division of the variables into free and fixed sets.

Initial paper: "Sparse Inverse Covariance Matrix Estimation using Quadratic Approximation", NIPS, 2011. Journal version coming soon. Question: how can we solve problems with 100,000 variables? Answer: QUIC-2.

SLIDE 28

References

[1] C.-J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. NIPS, 2011.
[2] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-like methods for sparse inverse covariance estimation. Optimization Online, 2012.
[3] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 2008.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.
[5] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. UAI, 2008.
[6] L. Li and K.-C. Toh. An inexact interior point method for ℓ1-regularized sparse covariance selection. Mathematical Programming Computation, 2010.
[7] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. NIPS, 2010.
[8] K. Scheinberg and I. Rish. Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach. Machine Learning and Knowledge Discovery in Databases, 2010.
[9] Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM J. Optim., 2009.
[10] M. Schmidt, E. van den Berg, M. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: a limited-memory projected quasi-Newton algorithm. AISTATS, 2009.