

SLIDE 1

Bayesian estimation of sparse precision matrices

Subhashis Ghoshal, North Carolina State University. CANSSI–SAMSI Workshop: Geometric, Topological and Graphical Model Methods in Statistics. Fields Institute, Toronto, Canada, May 22–23, 2014.

Based on collaborations with Sayantan Banerjee


SLIDE 2

Estimation of Large Precision Matrix

Consider multivariate Gaussian data X ∼ Np(0, Σ), and let Ω = Σ−1 be the precision matrix. If the p variables are represented as the vertices of a graph G, then the absence of an edge between two vertices j and j′, which means conditional independence of Xj and Xj′ given the other variables, is equivalent to ωjj′ = 0. A graph can therefore be used to control sparsity in Ω.
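As a quick illustration (my own sketch, not from the talk), a zero off-diagonal entry of Ω is exactly a zero partial correlation, hence conditional independence, here for a small hypothetical chain graph:

```python
# Zeros in the precision matrix encode conditional independence,
# even though the covariance matrix itself is dense.
import numpy as np

# Hypothetical 4-node chain graph 1-2-3-4: only adjacent pairs interact.
Omega = np.array([[2.0, 0.6, 0.0, 0.0],
                  [0.6, 2.0, 0.6, 0.0],
                  [0.0, 0.6, 2.0, 0.6],
                  [0.0, 0.0, 0.6, 2.0]])
Sigma = np.linalg.inv(Omega)   # dense, despite the sparse Omega

# Partial correlation of (j, j') given the rest:
#   -omega_jj' / sqrt(omega_jj * omega_j'j')
d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(Sigma[0, 2] != 0)         # True: marginally dependent
print(partial_corr[0, 2] == 0)  # True: conditionally independent given the rest
```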


SLIDE 3

Graphical Lasso for Sparse Precision Matrix

Developed in various papers: Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Banerjee et al. (2008), Friedman, Hastie and Tibshirani (2008). Maximize

$$\log\det\Omega - \operatorname{tr}(S\Omega) - \lambda\|\Omega\|_1$$

subject to positive definite Ω, where $S = n^{-1}\sum_{i=1}^{n} X_i X_i'$ is the sample covariance matrix.

Computation is doable in O(p³) steps by the R package glasso. Faster algorithms are possible assuming some special structure.
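A minimal sketch of the same optimization in Python; the use of scikit-learn's GraphicalLasso is my substitution (the slide references the R package glasso), and note that sklearn penalizes only the off-diagonal entries:

```python
# l1-penalized Gaussian likelihood for a sparse precision matrix.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# lambda ~ sqrt(log p / n), anticipating the tuning on the next slide
fit = GraphicalLasso(alpha=np.sqrt(np.log(p) / n)).fit(X)
Omega_hat = fit.precision_
print(np.count_nonzero(np.abs(Omega_hat) > 1e-8))  # size of the estimated support
```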


SLIDE 4

Convergence rate of Graphical Lasso

Convergence rate studied by Rothman et al. (2008). If $\lambda \asymp \sqrt{(\log p)/n}$, the convergence rate in the Frobenius (aka Euclidean) norm is

$$\sqrt{\frac{(p+s)\log p}{n}},$$

where s is the number of non-zero off-diagonal entries.
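A rough numerical illustration (my own sketch, with a hypothetical tridiagonal truth): fitting the graphical lasso with λ ≍ √((log p)/n) at increasing sample sizes, the ratio of the Frobenius error to √((p + s) log p / n) should stay bounded:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
p = 30
# Tridiagonal truth: s = 2(p - 1) non-zero off-diagonal entries.
Omega0 = (np.eye(p) + np.diag(0.4 * np.ones(p - 1), 1)
                    + np.diag(0.4 * np.ones(p - 1), -1))
Sigma0 = np.linalg.inv(Omega0)
s = 2 * (p - 1)

for n in (100, 400, 1600):
    X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
    fit = GraphicalLasso(alpha=np.sqrt(np.log(p) / n), max_iter=200).fit(X)
    err = np.linalg.norm(fit.precision_ - Omega0)   # Frobenius norm
    rate = np.sqrt((p + s) * np.log(p) / n)
    print(n, round(err, 3), round(err / rate, 3))   # ratio should stay bounded
```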


SLIDE 5

Bayesian Graphical Lasso

Wang (2012): put independent exponential priors on the diagonal entries and Laplace priors on the off-diagonals, subject to a positive definiteness restriction.

$$\Omega \sim P:\qquad \omega_{ii} \overset{iid}{\sim} \frac{\lambda}{2}\, e^{-\lambda\omega_{ii}/2},\ \omega_{ii} > 0, \qquad \omega_{ij} \overset{ind}{\sim} \frac{\lambda}{2}\, e^{-\lambda|\omega_{ij}|},\ i \ne j,$$

then condition: Ω ∼ P | Ω ∈ M⁺, the set of positive definite matrices.

The posterior mode is the graphical lasso. The full posterior is easily computable by MCMC. However, this is not a really sparse prior: the posterior sits on non-sparse matrices, and hence cannot converge near the truth in high dimensions.
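A sketch of the (unnormalized) log prior density under the reading of the slides above; the function name and the positive-definiteness check via Cholesky are my own choices:

```python
import numpy as np

def log_prior_blasso(Omega, lam):
    """Unnormalized log density of the Wang (2012)-type prior; -inf off the PD cone."""
    try:
        np.linalg.cholesky(Omega)        # positive definiteness check
    except np.linalg.LinAlgError:
        return -np.inf
    off = Omega[np.triu_indices_from(Omega, k=1)]   # each unordered pair once
    return -lam * np.sum(np.abs(off)) - (lam / 2.0) * np.sum(np.diag(Omega))
```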


SLIDE 6

(Really sparse) Bayesian Graphical Lasso

Real sparsity can be introduced by an extra point mass at zero for the off-diagonal entries. With the edge indicators $\gamma_{ij} = \mathbb{1}\{(i,j) \in E\}$ collected in Γ,

$$p(\Omega \mid \Gamma) \propto \prod_{\gamma_{ij}=1} e^{-\lambda|\omega_{ij}|} \prod_{i=1}^{p} e^{-\lambda\omega_{ii}/2}\; \mathbb{1}_{M^+}(\Omega),$$

$$p(\Gamma) \propto q^{\#E}(1-q)^{\binom{p}{2}-\#E}\; \mathbb{1}\{\#E \le R\}.$$

The maximum model size R has a Poisson-like tail.
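A sketch of the corresponding log prior for the really sparse version; Γ is taken to be a boolean adjacency matrix and the truncation at R edges is implemented as a hard indicator (both representation choices are mine):

```python
import numpy as np
from math import comb, log

def log_sparse_prior(Omega, Gamma, lam, q, R):
    """Gamma: boolean (p, p) adjacency matrix; -inf for inadmissible pairs."""
    p = Omega.shape[0]
    iu = np.triu_indices(p, k=1)
    off, edges = Omega[iu], Gamma[iu].astype(bool)
    n_edges = int(edges.sum())
    if n_edges > R:                       # maximum model size restriction
        return -np.inf
    if np.any(np.abs(off[~edges]) > 0):   # point mass at zero off the graph
        return -np.inf
    try:
        np.linalg.cholesky(Omega)         # positive definiteness check
    except np.linalg.LinAlgError:
        return -np.inf
    log_p_omega = (-lam * np.sum(np.abs(off[edges]))
                   - (lam / 2.0) * np.sum(np.diag(Omega)))
    log_p_gamma = n_edges * log(q) + (comb(p, 2) - n_edges) * log(1 - q)
    return log_p_omega + log_p_gamma
```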


SLIDE 7

Posterior distribution

$$p(\Omega, \Gamma \mid X) \propto p(X \mid \Omega, \Gamma)\, p(\Omega \mid \Gamma)\, p(\Gamma) = \{\det(\Omega)\}^{n/2} \exp\Big\{-\frac{n}{2}\operatorname{tr}(\hat\Sigma\Omega)\Big\} \times \prod_{\gamma_{ij}=1} e^{-\lambda|\omega_{ij}|} \prod_{i=1}^{p} e^{-\lambda\omega_{ii}/2} \times q^{\#E}(1-q)^{\binom{p}{2}-\#E}.$$

Model posterior probabilities:

$$p(\Gamma \mid X) \propto \int_{\Omega \in M^+} \exp\Big\{\frac{n}{2} h(\Omega)\Big\} \prod_{(i,j) \in V_\Gamma} d\omega_{ij},$$

where

$$h(\Omega) = \log\det(\Omega) - \operatorname{tr}(\hat\Sigma\Omega) - \frac{2\lambda}{n}\sum_{\gamma_{ij}=1} |\omega_{ij}| - \frac{\lambda}{n}\sum_{i=1}^{p} \omega_{ii}.$$
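As a sketch (my own transcription, not from the talk), h(Ω) can be written directly in Python; Sigma_hat is the sample covariance matrix and Gamma a boolean adjacency matrix as before:

```python
# h(Omega) above: exp((n/2) h(Omega)) is the integrand whose integral over a
# model gives p(Gamma | X) up to proportionality.
import numpy as np

def h(Omega, Sigma_hat, Gamma, lam, n):
    iu = np.triu_indices_from(Omega, k=1)     # each unordered pair once
    on_graph = Gamma[iu].astype(bool)
    off = Omega[iu][on_graph]                 # off-diagonal entries on the graph
    return (np.linalg.slogdet(Omega)[1]       # log det(Omega)
            - np.trace(Sigma_hat @ Omega)
            - (2 * lam / n) * np.sum(np.abs(off))
            - (lam / n) * np.sum(np.diag(Omega)))
```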


SLIDE 8

Issues

Computation becomes a challenge. Traditional MCMC/RJMCMC are too slow. What can we say about posterior convergence rates?


SLIDE 9

Posterior Convergence Rate

Theorem. Let $X_1, \ldots, X_n \overset{iid}{\sim} N_p(0, \Omega_0^{-1})$, with the true precision matrix satisfying

$$\Omega_0 \in \mathcal{U}(s, \epsilon_0) = \Big\{\Omega : \#\{(i,j) : \omega_{ij} \ne 0,\ i \ne j\} \le s,\ 0 < \epsilon_0 \le \min_j \operatorname{eig}_j(\Omega) \le \max_j \operatorname{eig}_j(\Omega) \le \epsilon_0^{-1} < \infty\Big\}.$$

Then for some M > 0, the posterior probability $P(\|\Omega - \Omega_0\|_2 > M\epsilon_n \mid X) \to 0$, for $\epsilon_n = n^{-1/2}(p+s)^{1/2}(\log p)^{1/2}$, where $\|\cdot\|_2$ stands for the Frobenius (Euclidean) norm.


SLIDE 10

Steps in Proving Posterior Convergence Rate

The Frobenius distance is comparable with the Hellinger distance between $N_p(0, \Omega^{-1})$ and $N_p(0, \Omega'^{-1})$, with the square root of their Kullback-Leibler (KL) divergence, and with $\sum_{j=1}^{p} |d_j - 1|^2$, the squared Euclidean norm of the eigenvalues $d_1, \ldots, d_p$ of $\Omega^{-1/2}\Omega'\Omega^{-1/2}$ centered by 1. If either the Hellinger or the Frobenius distance is small, all the $d_j$ are uniformly close to 1, allowing a Taylor expansion.

Use the general theory of posterior convergence rates [Ghosal, Ghosh and van der Vaart (2000)]: bound the Hellinger entropy of a "sieve" carrying at least $1 - e^{-bn\epsilon_n^2}$ prior probability by $n\epsilon_n^2$, and ensure that the prior probability of the $\epsilon_n$-size KL neighborhood of the true density is at least $e^{-n\epsilon_n^2}$.

In view of the equivalence of distances, it suffices to bound the entropy and obtain prior concentration in terms of the Frobenius norm.
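A small numerical check of this comparability (my own sketch, for an arbitrary well-conditioned precision matrix and a small perturbation):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)     # an arbitrary positive definite precision
Omega2 = Omega + 0.01 * np.eye(p)   # a small perturbation of it

Sig, Sig2 = np.linalg.inv(Omega), np.linalg.inv(Omega2)
# Closed-form KL( N(0, Sig) || N(0, Sig2) )
kl = 0.5 * (np.trace(np.linalg.solve(Sig2, Sig)) - p
            + np.linalg.slogdet(Sig2)[1] - np.linalg.slogdet(Sig)[1])
frob2 = np.linalg.norm(Omega2 - Omega) ** 2       # squared Frobenius distance

# Eigenvalues d_j of Omega^{-1/2} Omega2 Omega^{-1/2} via a Cholesky factor
L = np.linalg.cholesky(Omega)
Linv = np.linalg.inv(L)
d = np.linalg.eigvalsh(Linv @ Omega2 @ Linv.T)
print(kl, frob2, np.sum((d - 1.0) ** 2))          # all of the same small order
```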


SLIDE 11

Steps (contd.)

Define the sieve $\mathcal{P}_n$ so that the maximum number of edges is $\bar r < \binom{p}{2}/2$ and each entry is at most L in absolute value, where $\bar r$ and L are to be determined. The metric entropy is bounded by $\log\big[\bar r\,(L/\epsilon_n)^{\bar r}\binom{p}{2}^{\bar r}\big]$, which we need to make $\le n\epsilon_n^2$.

Choose $L \in [bn\epsilon_n^2,\, bn\epsilon_n^2 + 1]$ to ensure that $\binom{p}{2}\exp(-L) \le \exp(-b' n\epsilon_n^2)$; note the tail probability is $\le e^{-bn\epsilon_n^2}$.

The requirement on $\bar r$ becomes $\log\bar r + \bar r\log p + \bar r\log(1/\epsilon_n) + \bar r\log(n\epsilon_n^2) \asymp n\epsilon_n^2$.

$P(\mathcal{P}_n^c) \le P(R > \bar r) + \exp(-b_3 n\epsilon_n^2)$ holds if $\bar r$ is like $n\epsilon_n^2/\log n$ under the Poisson tail condition.

Bounding the KL divergences by $\sum_{j=1}^{p} |d_j - 1|^2$, it suffices to lower bound $P\{\max_j |d_j - 1| < \epsilon_n/p\} = P\{\|\Omega - \Omega_0\|_\infty < \epsilon_n/p\} \ge (c'\epsilon_n/p)^{p+s}$, using "independence". This requires $(p+s)(\log p + \log(1/\epsilon_n)) \asymp n\epsilon_n^2$, giving the convergence rate $\epsilon_n = n^{-1/2}(p+s)^{1/2}(\log p)^{1/2}$.


SLIDE 12

Approximate Posterior Model Probabilities

We use the Laplace approximation: in each submodel, expand the log posterior density around the posterior mode and evaluate the resulting normal integrals analytically. The posterior mode is the graphical lasso restricted to the submodel. To find the Laplace approximation, we need to calculate the Hessian.
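A one-dimensional caricature of the idea (my own sketch, with an arbitrary smooth log-density): ∫ e^{n f(x)} dx ≈ e^{n f(x*)} √(2π/(n|f''(x*)|)) when expanded around the mode x*:

```python
import numpy as np
from scipy.integrate import quad

n = 50.0
f = lambda x: -0.5 * x**2 + 0.1 * x**3 / (1 + x**2)  # arbitrary smooth log-density shape
exact, _ = quad(lambda x: np.exp(n * f(x)), -5, 5)

# Locate the mode on a grid and use a finite-difference second derivative
xs = np.linspace(-2, 2, 200001)
xstar = xs[np.argmax(f(xs))]
eps = 1e-4
f2 = (f(xstar + eps) - 2 * f(xstar) + f(xstar - eps)) / eps**2
laplace = np.exp(n * f(xstar)) * np.sqrt(2 * np.pi / (n * abs(f2)))
print(exact, laplace)   # close for moderately large n
```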


SLIDE 13

Hessian

Let $U = \Omega - \Omega^*$, where $\Omega^*$ is the graphical lasso solution in the submodel. Then

$$p(\Gamma \mid X) \propto \exp\{n h(\Omega^*)/2\}\, \{\det(\Omega^*)\}^{-n/2} \int_{U + \Omega^* \in M^+} \exp\{n g(U)/2\}\, dU,$$

where

$$g(U) = \log\det(U + \Omega^*) - \operatorname{tr}(\hat\Sigma U) - \frac{2\lambda}{n}\sum_{\gamma_{ij}=1} \big(|u_{ij} + \omega^*_{ij}| - |\omega^*_{ij}|\big) - \frac{\lambda}{n}\sum_{i=1}^{p} u_{ii}.$$

The Hessian of $g(U)$ is the $\#V_\Gamma \times \#V_\Gamma$ matrix $H_{U+\Omega^*}$ with $\{(i,j),(l,m)\}$th entry

$$-\operatorname{tr}\big\{(U + \Omega^*)^{-1} E_{(i,j)} (U + \Omega^*)^{-1} E_{(l,m)}\big\},$$

where $E_{(i,j)}$ is a binary matrix with 1 only at the $(i,j)$th and $(j,i)$th locations.
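A direct construction of this Hessian (my own sketch; edges is the list of index pairs in V_Γ, with diagonal positions passed as pairs (i, i)):

```python
import numpy as np

def hessian(Omega, edges):
    """Hessian of g at U + Omega* = Omega; edges lists the pairs in V_Gamma."""
    W = np.linalg.inv(Omega)
    p = Omega.shape[0]
    def E(i, j):
        M = np.zeros((p, p))
        M[i, j] = M[j, i] = 1.0          # binary symmetric "edge" matrix
        return M
    H = np.empty((len(edges), len(edges)))
    for a, (i, j) in enumerate(edges):
        for b, (l, m) in enumerate(edges):
            H[a, b] = -np.trace(W @ E(i, j) @ W @ E(l, m))
    return H
```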


SLIDE 14

Laplace approximation

$$p^*(\Gamma \mid X) \propto C_\Gamma \exp\{n h(\Omega^*)/2\}\{\det(\Omega^*)\}^{-n/2} \exp\{n g(0)/2\} \times (2\pi)^{\#V_\Gamma/2}(n/2)^{-\#V_\Gamma/2}\Big[\det\Big\{-\frac{\partial^2 g(U)}{\partial U\,\partial U^T}\Big|_0\Big\}\Big]^{-1/2} = C_\Gamma \exp\{n h(\Omega^*)/2\}(2\pi)^{\#V_\Gamma/2}(n/2)^{-\#V_\Gamma/2}\{\det(-H_{\Omega^*})\}^{-1/2},$$

since $g(0) = \log\det(\Omega^*)$. The approximation is meaningful (i.e., differentiability holds) only if all the graphical lasso estimates of the off-diagonal elements corresponding to the graph generated by Γ are non-zero; such models are called "regular models". The other models are non-regular models.
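Putting the last two slides together, a sketch (my own) of the log Laplace weight of a regular model; h_star is h(Ω*) from the earlier sketch and H the Hessian at the mode, while the prior factor C_Γ and all model-independent constants are dropped:

```python
import numpy as np

def log_laplace_weight(h_star, H, n):
    """Log of the Laplace approximation to p(Gamma | X), up to the prior
    factor C_Gamma and model-independent constants.
    h_star = h(Omega*); H = Hessian of g at 0 (negative definite)."""
    k = H.shape[0]                          # number of free parameters #V_Gamma
    sign, logdet = np.linalg.slogdet(-H)    # det(-H) > 0 at the mode
    return ((n / 2) * h_star
            + (k / 2) * np.log(2 * np.pi)
            - (k / 2) * np.log(n / 2)
            - 0.5 * logdet)
```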


SLIDE 15

Ignorability of non-regular models

For a given non-regular submodel Γ, define its regular counterpart Γ′ to be the model obtained by removing the edges whose graphical lasso solution is zero. Then, by construction, the graphical lasso solutions corresponding to the two models are identical.

Theorem. If q < 1/2, then the posterior probability of a non-regular model Γ is always less than that of its regular counterpart Γ′.


SLIDE 16

Accuracy of Laplace approximation

Theorem. The error in the Laplace approximation of the posterior probability of a graphical model structure is asymptotically small if $(p+s)^2\epsilon_n \to 0$, where $\epsilon_n$ is the posterior convergence rate; that is, the error in the Laplace approximation tends to zero if $n^{-1/2}(p+s)^{5/2}(\log p)^{1/2} \to 0$.

The proof uses the following bound: if the sparsity is s, then with probability tending to one, the remainder term in the expansion of $h(\Omega)$ around $\Omega^*$ is bounded by

$$(p+s)\,\|\Omega - \Omega^*\|_2^2\,\big(C_1\|\Omega - \Omega^*\|_2 + C_2\|\Omega - \Omega^*\|_2^2\big)/2.$$


SLIDE 17

How to test the method?

So far tested on: the AR(1) model, $\sigma_{ij} = 0.7^{|i-j|}$, and the AR(2) model, $\omega_{ii} = 1$, $\omega_{i,i-1} = \omega_{i-1,i} = 0.5$, $\omega_{i,i-2} = \omega_{i-2,i} = 0.25$. We compute the so-called median probability model, $\{j : P(X_j \text{ included in model} \mid \text{data}) \ge 1/2\}$, based only on the regular models within Hamming distance 1 (there are O(p) such models instead of $2^p$). We monitor specificity, sensitivity and the Matthews correlation coefficient:

$$\mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{SE} = \frac{TP}{TP + FN}, \qquad \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$
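A sketch of the three scores computed over the upper triangle of the true and estimated adjacency matrices (counts are cast to float so the MCC denominator cannot overflow for large p):

```python
import numpy as np

def edge_scores(G_hat, G_true):
    """SP, SE, MCC for a recovered edge set versus the true one."""
    iu = np.triu_indices_from(G_true, k=1)   # undirected: use the upper triangle
    e, t = G_hat[iu].astype(bool), G_true[iu].astype(bool)
    TP = float(np.sum(e & t)); TN = float(np.sum(~e & ~t))
    FP = float(np.sum(e & ~t)); FN = float(np.sum(~e & t))
    SP = TN / (TN + FP)
    SE = TP / (TP + FN)
    MCC = (TP * TN - FP * FN) / np.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return SP, SE, MCC
```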


SLIDE 18

How is the method working?

Model   p     n = 100                                         n = 200
              SP             SE             MCC               SP             SE             MCC
AR(1)   30    0.977 (0.003)  0.941 (0.019)  0.831 (0.015)     0.986 (0.002)  0.996 (0.003)  0.907 (0.014)
        50    0.987 (0.002)  0.953 (0.013)  0.841 (0.010)     0.991 (0.001)  0.992 (0.004)  0.903 (0.008)
        100   0.977 (0.001)  0.875 (0.026)  0.724 (0.019)     0.961 (0.006)  0.867 (0.034)  0.739 (0.028)
        500   0.909 (0.004)  0.585 (0.026)  0.310 (0.012)     0.953 (0.006)  0.761 (0.019)  0.541 (0.019)
AR(2)   30    0.975 (0.003)  0.470 (0.014)  0.546 (0.013)     0.987 (0.002)  0.495 (0.008)  0.617 (0.008)
        50    0.983 (0.001)  0.462 (0.013)  0.541 (0.011)     0.993 (0.001)  0.489 (0.005)  0.629 (0.007)
        100   0.943 (0.003)  0.460 (0.015)  0.383 (0.010)     0.938 (0.008)  0.453 (0.005)  0.438 (0.007)
        500   0.781 (0.005)  0.383 (0.077)  0.104 (0.010)     0.831 (0.004)  0.434 (0.014)  0.183 (0.007)


SLIDE 19

How is the method working? (contd.)

Model   p     n = 100                                         n = 200
              SP             SE             MCC               SP             SE             MCC
Block   30    1.000 (0.000)  0.423 (0.047)  0.831 (0.037)     1.000 (0.000)  0.429 (0.060)  0.524 (0.049)
        50    1.000 (0.000)  0.381 (0.044)  0.481 (0.040)     1.000 (0.000)  0.402 (0.036)  0.502 (0.030)
        100   1.000 (0.000)  0.330 (0.021)  0.445 (0.017)     1.000 (0.000)  0.349 (0.027)  0.460 (0.021)
Star    30    0.947 (0.004)  0.289 (0.038)  0.228 (0.036)     0.995 (0.001)  0.210 (0.032)  0.378 (0.041)
        50    0.945 (0.003)  0.492 (0.034)  0.332 (0.025)     0.993 (0.000)  0.475 (0.034)  0.585 (0.024)
        100   0.990 (0.000)  1.000 (0.000)  0.827 (0.004)     0.988 (0.000)  1.000 (0.000)  0.792 (0.008)
Circle  30    0.733 (0.004)  1.000 (0.000)  0.399 (0.003)     0.719 (0.005)  1.000 (0.000)  0.388 (0.004)
        50    0.831 (0.003)  1.000 (0.000)  0.409 (0.003)     0.833 (0.002)  1.000 (0.000)  0.411 (0.003)
        100   0.891 (0.001)  1.000 (0.000)  0.378 (0.002)     0.903 (0.008)  1.000 (0.000)  0.399 (0.002)


SLIDE 20

In the real world

We analyze closing prices of 452 stocks from the S&P 500 index during January 1, 2003 to January 1, 2008. The stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors: "Health Care", "Materials", "Industrials", "Consumer Staples", "Consumer Discretionary", "Utilities", "Information Technology", "Financials", "Energy", "Telecommunication Services". Denoting by Ytj the closing price of the jth stock on day t, we construct the 1257 × 452 data matrix S with entries stj = log(Y(t+1)j / Ytj), t = 1, . . . , 1257, j = 1, . . . , 452, standardized to have mean zero and standard deviation one. We find the median probability model. The following color-coded graphs show the interrelationships.
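A sketch of this construction, with simulated prices standing in for the S&P 500 data (which is not reproduced here); T = 1258 trading days of prices give 1257 daily log-returns:

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder geometric random-walk prices; shape (1258, 452)
prices = np.exp(np.cumsum(0.01 * rng.standard_normal((1258, 452)), axis=0))

S = np.log(prices[1:] / prices[:-1])       # s_tj = log(Y_{t+1,j} / Y_{t,j})
S = (S - S.mean(axis=0)) / S.std(axis=0)   # standardize each stock's returns
print(S.shape)                             # (1257, 452)
```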


SLIDE 21

Figure: Graphical structure of the median probability model selected by the Bayesian graphical structure learning method.


SLIDE 22

Figure: Subgraph corresponding to the sectors "Utilities" [red] and "Information Technology" [violet].


SLIDE 23

Figure: Subgraph corresponding to the sectors "Financials" [blue] and "Energy" [violet].


SLIDE 24

Thank you
