  1. High-dimensional graphical model selection: Practical and information-theoretic limits. Martin Wainwright, Departments of Statistics and EECS, UC Berkeley, California, USA. Based on joint work with: John Lafferty (CMU), Pradeep Ravikumar (UC Berkeley), and Prasad Santhanam (University of Hawaii). Supported by grants from the National Science Foundation and a Sloan Foundation Fellowship.

  2. Introduction
  • classical asymptotic theory of statistical inference:
    – number of observations n → +∞
    – model dimension p stays fixed
  • not suitable for many modern applications:
    – {images, signals, systems, networks} frequently large (p ≈ 10^3 − 10^8)
    – function/surface estimation: enforces limit p → +∞
    – interesting consequences: might have p = Θ(n) or even p ≫ n
  • curse of dimensionality: frequently impossible to obtain consistent procedures unless p/n → 0
  • can be saved by a lower effective dimensionality, due to some form of complexity constraint:
    – sparse vectors
    – {sparse, structured, low-rank} matrices
    – structured regression functions
    – graphical models (Markov random fields)

  3. What are graphical models?
  • Markov random field: random vector (X_1, ..., X_p) with distribution factoring according to a graph G = (V, E)
  • Hammersley-Clifford theorem: (X_1, ..., X_p) being Markov w.r.t. G implies the factorization
      P(x_1, ..., x_p) ∝ exp{ θ_A(x_A) + θ_B(x_B) + θ_C(x_C) + θ_D(x_D) }
    [Figure: example graph whose maximal cliques are labeled A, B, C, D.]
  • studied/used in various fields: spatial statistics, language modeling, computational biology, computer vision, statistical physics, ...

  4. Graphical model selection
  • let G = (V, E) be an undirected graph on p = |V| vertices
  • pairwise Markov random field: family of probability distributions
      P(x_1, ..., x_p; θ) = (1/Z(θ)) exp{ Σ_{(s,t) ∈ E} ⟨θ_st, φ_st(x_s, x_t)⟩ }
  • given n independent and identically distributed (i.i.d.) samples of X = (X_1, ..., X_p), identify the underlying graph structure
  • complexity constraint: restrict to the subset G_{d,p} of graphs with maximum degree d
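
  To make the pairwise-MRF definition concrete, here is a minimal brute-force sketch (not from the talk), specialized to a tiny Ising model on a 4-cycle: binary variables x_s ∈ {−1, +1}, potentials φ_st(x_s, x_t) = x_s x_t, and a common edge weight of 0.8 chosen arbitrarily for illustration. Enumerating all 2^p configurations gives the partition function Z(θ) exactly, which is only feasible for very small p.

    import itertools
    import numpy as np

    # Tiny Ising model on a 4-cycle: edge weights theta_st (values are illustrative).
    edges = {(0, 1): 0.8, (1, 2): 0.8, (2, 3): 0.8, (0, 3): 0.8}
    p = 4

    def unnormalized(x):
        # exp{ sum_{(s,t) in E} theta_st * x_s * x_t }
        return np.exp(sum(th * x[s] * x[t] for (s, t), th in edges.items()))

    states = list(itertools.product([-1, 1], repeat=p))
    Z = sum(unnormalized(x) for x in states)            # partition function Z(theta)
    probs = {x: unnormalized(x) / Z for x in states}

    print(f"Z(theta) = {Z:.3f}")
    print(f"P(+1, +1, +1, +1) = {probs[(1, 1, 1, 1)]:.4f}")
    print("total probability:", round(sum(probs.values()), 6))  # sanity check: 1.0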

  5. Illustration: Voting behavior of US senators
  Graphical model fit to the voting records of US senators (Banerjee, El Ghaoui & d'Aspremont, 2008).

  6. Some issues in high-dimensional inference
  Consider some fixed loss function, and a fixed level δ of error.
  Limitations of tractable algorithms: given particular (polynomial-time) algorithms,
  • for what sample sizes n do they succeed/fail to achieve error δ?
  • given a collection of methods, when does more computation reduce the minimum # samples needed?
  Information-theoretic limitations: data collection as communication from nature → statistician
  • what are the fundamental limitations of the problem (Shannon capacity)?
  • when are known (polynomial-time) methods optimal?
  • when are there gaps between polynomial-time methods and optimal methods?

  7. Previous/on-going work on graph selection
  • exact solution for trees (Chow & Liu, 1967)
  • local testing-based approaches (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
  • methods for Gaussian MRFs:
    – ℓ1-regularized neighborhood regression (e.g., Meinshausen & Buhlmann, 2005; Wainwright, 2006; Zhao, 2006)
    – ℓ1-regularized log-determinant (e.g., Yuan & Lin, 2006; d'Aspremont et al., 2007; Friedman, 2008; Ravikumar et al., 2008)
  • methods for discrete MRFs:
    – neighborhood-based search method (Bresler, Mossel & Sly, 2008)
    – ℓ1-regularized logistic regression (Ravikumar et al., 2006, 2008)
    – pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)
  • information-theoretic approaches:
    – information-theoretic limitations (Santhanam & Wainwright, 2008)

  8. Markov property and neighborhood structure
  • Markov properties encode neighborhood structure:
      (X_r | X_{V \ r}) =_d (X_r | X_{N(r)})
      [left: condition on full graph; right: condition on Markov blanket N(r)]
    [Figure: node X_r with Markov blanket N(r) = {s, t, u, v, w}, i.e. neighbors X_s, X_t, X_u, X_v, X_w.]
  • basis of the pseudolikelihood method (Besag, 1974)
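
  A short derivation (not on the slide, but a standard fact about Ising models) of why the Markov-blanket property leads to the logistic regressions of the next slide: specializing the pairwise MRF of slide 4 to binary variables x_s ∈ {−1, +1} with φ_st(x_s, x_t) = x_s x_t and no external field, the conditional distribution of X_r depends only on its neighbors and is exactly a logistic model:

    P(x_r \mid x_{V \setminus r})
      = \frac{\exp\!\big(x_r \sum_{t \in N(r)} \theta_{rt} x_t\big)}
             {\exp\!\big(\sum_{t \in N(r)} \theta_{rt} x_t\big) + \exp\!\big(-\sum_{t \in N(r)} \theta_{rt} x_t\big)}
    \quad\Longrightarrow\quad
    P(X_r = +1 \mid x_{V \setminus r})
      = \frac{1}{1 + \exp\!\big(-2 \sum_{t \in N(r)} \theta_{rt} x_t\big)}.

  Estimating N(r) therefore reduces to a sparse logistic regression of X_r on the remaining variables, which is exactly the method formalized on the next slide.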

  9. Practical method via neighborhood regression
  Observation: Recovering the graph G is equivalent to recovering the neighborhood set N(r) for every r ∈ V.
  Method: Given n i.i.d. samples {X^(1), ..., X^(n)}, perform logistic regression of each node X_r on X_{\r} := {X_t, t ≠ r} to estimate the neighborhood structure N̂(r).
  1. For each node r ∈ V, perform ℓ1-regularized logistic regression of X_r on the remaining variables X_{\r}:
       θ̂[r] := argmin_{θ ∈ R^{p−1}} { (1/n) Σ_{i=1}^n f(θ; X^(i)_{\r})  +  ρ_n ||θ||_1 }
                                       [logistic likelihood]            [regularization]
  2. Estimate the local neighborhood N̂(r) as the support (non-zero entries) of the regression vector θ̂[r].
  3. Combine the neighborhood estimates in a consistent manner (AND or OR rule).
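
  Below is a self-contained toy sketch of steps 1-3 (not code from the talk). The true graph is taken to be a chain so that exact i.i.d. sampling is easy; the per-node ℓ1-regularized logistic regressions use scikit-learn, and the choice ρ_n ∝ √(log p / n) together with the mapping C = 1/(n ρ_n) to scikit-learn's parameterization are illustrative rather than the constants prescribed by the theory.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sample_chain_ising(n, p, theta=0.6, seed=0):
        # Exact i.i.d. samples from a chain Ising model P(x) ~ exp(theta * sum_t x_t x_{t+1}):
        # X_1 is uniform on {-1,+1}, and X_{t+1} equals X_t with probability
        # exp(theta) / (exp(theta) + exp(-theta)).
        rng = np.random.default_rng(seed)
        X = np.empty((n, p))
        X[:, 0] = rng.choice([-1.0, 1.0], size=n)
        p_same = np.exp(theta) / (np.exp(theta) + np.exp(-theta))
        for t in range(1, p):
            same = rng.random(n) < p_same
            X[:, t] = np.where(same, X[:, t - 1], -X[:, t - 1])
        return X

    def neighborhood_selection(X, rho):
        n, p = X.shape
        neighbors = []
        for r in range(p):
            y = X[:, r]                        # node r is the response
            Z = np.delete(X, r, axis=1)        # remaining variables X_{\r}
            clf = LogisticRegression(penalty="l1", solver="liblinear",
                                     C=1.0 / (n * rho), fit_intercept=False)
            clf.fit(Z, y)
            support = np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-6)
            others = np.delete(np.arange(p), r)
            neighbors.append(set(others[support]))   # estimated N(r)
        # AND rule: keep edge (r, s) only if each node is in the other's estimate
        return {(r, s) for r in range(p) for s in neighbors[r]
                if r < s and r in neighbors[s]}

    X = sample_chain_ising(n=4000, p=12)
    rho = 0.5 * np.sqrt(np.log(12) / 4000)
    print(sorted(neighborhood_selection(X, rho)))    # ideally the chain edges (0,1), (1,2), ...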

  10. High-dimensional analysis
  • classical analysis: dimension p fixed, sample size n → +∞
  • high-dimensional analysis: allow the dimension p, sample size n, and maximum degree d to increase at arbitrary rates
  • take n i.i.d. samples from the MRF defined by G_{p,d}
  • study probability of success as a function of the three parameters:
      Success(n, p, d) = P[Method recovers graph G_{p,d} from n samples]
  • theory is non-asymptotic: explicit probabilities for finite (n, p, d)

  11. Empirical behavior: Unrescaled plots
  [Figure: star graph, linear fraction of neighbors; probability of success vs. number of samples (0 to 600), for p = 64, 100, 225.]
  Plots of success probability versus raw sample size n.

  12. Empirical behavior: Appropriately rescaled
  [Figure: star graph, linear fraction of neighbors; probability of success vs. control parameter (0 to 2), for p = 64, 100, 225.]
  Plots of success probability versus control parameter T_LR(n, p, d).

  13. Sufficient conditions for consistent model selection
  • graph sequences G_{p,d} = (V, E) with p vertices and maximum degree d
  • draw n i.i.d. samples, and analyze the probability of success indexed by (n, p, d)

  Theorem (RavWaiLaf06, RavWaiLaf08): If the rescaled sample size satisfies
      T_LR(n, p, d) := n / (d^3 log p) > T*_crit
  and the regularization parameter satisfies ρ_n ≥ c_1 τ √(log p / n), then with probability greater than 1 − 2 exp(−c_2 (τ − 2) log p) → 1:
  (a) For each node r ∈ V, the ℓ1-regularized logistic convex program has a unique solution. (Non-trivial since p ≫ n ⟹ the program is not strictly convex.)
  (b) The estimated sign neighborhood N̂_±(r) correctly excludes all edges not in the true neighborhood.
  (c) If θ_min ≥ c_3 τ √(d^2 log p / n), the method selects the correct signed neighborhood.
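
  A small numerical illustration of the scalings in the theorem (not from the talk): the minimum sample size implied by T_LR(n, p, d) > T*_crit and the corresponding regularization level ρ_n. The constants T*_crit, c_1, and τ are unspecified on the slide, so the values below are placeholders.

    import math

    def min_samples(p, d, T_crit=1.0):
        # Smallest n with T_LR(n, p, d) = n / (d^3 log p) > T_crit (placeholder constant).
        return math.ceil(T_crit * d ** 3 * math.log(p))

    def regularization(n, p, c1=1.0, tau=2.5):
        # rho_n >= c1 * tau * sqrt(log p / n), with placeholder c1 and tau.
        return c1 * tau * math.sqrt(math.log(p) / n)

    for p, d in [(64, 3), (100, 3), (225, 5)]:
        n = min_samples(p, d)
        print(f"p={p:4d}  d={d}  n >~ {n:6d}  rho_n ~ {regularization(n, p):.3f}")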

  14. Some challenges in distinguishing graphs
  [Figure: two small graphs on nodes A, B, C, D illustrating "guilt by association" and "hidden interactions".]
  Conditions on the Fisher information matrix Q* = E[∇² f(θ*; X)]:
  A1. Bounded eigenspectra: λ(Q*_SS) ∈ [C_min, C_max].
  A2. Mutual incoherence: there exists ν ∈ (0, 1] such that
        |||Q*_{S^c S} (Q*_SS)^{-1}|||_{∞,∞} ≤ 1 − ν,
      where |||A|||_{∞,∞} := max_i Σ_j |A_ij|.
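
  A minimal numerical check of A1 and A2 (not from the talk), with a synthetic positive-definite matrix standing in for the Fisher information Q* and an assumed support set S; the value ν = 0.25 is arbitrary.

    import numpy as np

    def check_conditions(Q, S, nu=0.25):
        S = np.asarray(S)
        Sc = np.setdiff1d(np.arange(Q.shape[0]), S)
        Q_SS = Q[np.ix_(S, S)]
        Q_ScS = Q[np.ix_(Sc, S)]
        eigs = np.linalg.eigvalsh(Q_SS)                 # A1: eigenspectrum of Q*_SS
        M = Q_ScS @ np.linalg.inv(Q_SS)                 # Q*_{S^c S} (Q*_SS)^{-1}
        incoherence = np.max(np.abs(M).sum(axis=1))     # ||| . |||_{inf,inf} norm
        return eigs.min(), eigs.max(), incoherence, incoherence <= 1 - nu

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    Q = A @ A.T / 8 + 0.5 * np.eye(8)                   # synthetic stand-in for Q*
    C_min, C_max, alpha, ok = check_conditions(Q, S=[0, 1, 2])
    print(f"C_min={C_min:.3f}  C_max={C_max:.3f}  incoherence={alpha:.3f}  A2 holds: {ok}")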

  15. Proof sketch: Primal-dual certificate
  • proof technique, not a practical algorithm!
  • construct a candidate primal-dual pair (θ̂, ẑ) ∈ R^{p−1} × R^{p−1}:
  (A) For a fixed node r with S = N(r), solve the restricted program
        θ̂ = argmin_{θ ∈ R^{p−1}, θ_{S^c} = 0} { (1/n) Σ_{i=1}^n f(θ; X^(i)_{\r}) + ρ_n ||θ||_1 },
      thereby obtaining the candidate solution θ̂ = (θ̂_S, 0_{S^c}).
  (B) Choose ẑ_S ∈ R^{|S|} as an element of the subdifferential ∂||θ̂_S||_1.
  (C) Using the optimality conditions of the original convex program, solve for ẑ_{S^c} and check whether or not strict dual feasibility |ẑ_j| < 1 holds for all j ∈ S^c.
  Lemma: The full convex program recovers the neighborhood ⟺ the primal-dual witness succeeds.
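
  A numerical sketch of steps (A)-(C) (not from the talk). It assumes binary ±1 data with the response generated from a sparse logistic model, solves the restricted ℓ1-logistic program with scikit-learn (the mapping C = 1/(n ρ_n) is approximate), and then checks strict dual feasibility on S^c via the stationarity condition ∇f(θ̂) + ρ_n ẑ = 0.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def witness_check(X_rest, y, S, rho):
        n, pm1 = X_rest.shape
        # (A) restricted program: l1-logistic regression over the columns in S only
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (n * rho), fit_intercept=False)
        clf.fit(X_rest[:, S], y)
        theta = np.zeros(pm1)
        theta[S] = clf.coef_.ravel()                      # candidate (theta_S, 0_{S^c})
        # (B) subgradient on S (a valid element when theta_S has no zero entries)
        z_S = np.sign(theta[S])
        # (C) stationarity of the full program: grad f(theta) + rho * z = 0,
        # so z_j = -grad_j / rho for j in S^c; check strict dual feasibility |z_j| < 1.
        y01 = (y > 0).astype(float)
        grad = X_rest.T @ (sigmoid(X_rest @ theta) - y01) / n
        z = -grad / rho
        Sc = np.setdiff1d(np.arange(pm1), S)
        return np.max(np.abs(z[Sc])), theta

    rng = np.random.default_rng(1)
    n, pm1 = 2000, 20
    X_rest = rng.choice([-1.0, 1.0], size=(n, pm1))
    logits = 2.4 * X_rest[:, 0] + 2.4 * X_rest[:, 1]       # true support S = {0, 1}
    y = np.where(rng.random(n) < sigmoid(logits), 1.0, -1.0)
    rho = 1.5 * np.sqrt(np.log(pm1 + 1) / n)
    max_dual, theta = witness_check(X_rest, y, S=[0, 1], rho=rho)
    print(f"max |z_j| over S^c = {max_dual:.3f}  (witness succeeds if < 1)")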

  16. Information-theoretic limits on graph selection
  • thus far: have exhibited a particular polynomial-time method that can recover the structure if n > Ω(d^3 log(p − d))
  • but... is this a "good" result?
  • are there polynomial-time methods that can do better?
  • information theory can answer the question: is there an exponential-time method that can do better? (Santhanam & Wainwright, 2008)

  17. Graph selection as channel coding
  • graphical model selection is an unorthodox channel coding problem:
    [Diagram: G → channel P(X | G) → observations X^(1), ..., X^(n)]
  • nature sends G ∈ G_{d,p} := {graphs on p vertices, max. degree d}
  • decoding problem: use the observations {X^(1), ..., X^(n)} to correctly distinguish the "codeword" G
  • channel capacity for graph decoding: balance between
    – log number of models: log |M(p, d)| = Θ(pd log(p/d))
    – relative distinguishability of different models
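
  A back-of-envelope companion to the capacity picture above (not from the talk; constants are ignored throughout): the model count grows as log |G_{d,p}| = Θ(pd log(p/d)), while each sample of p binary variables can carry at most p log 2 nats of information, which already forces n to grow at least like d log(p/d).

    import math

    def log_num_models(p, d):
        # Order-of-magnitude model count from this slide: log|G_{d,p}| = Theta(p d log(p/d)).
        return p * d * math.log(p / d)

    def crude_sample_lower_bound(p, d):
        # Each sample of p binary variables carries at most p*log(2) nats,
        # so distinguishing all models needs at least log|G_{d,p}| / (p log 2) samples.
        return log_num_models(p, d) / (p * math.log(2))

    for p, d in [(64, 3), (225, 5), (1000, 10)]:
        print(f"p={p:5d} d={d:3d}  log|G| ~ {log_num_models(p, d):10.1f}  n >~ {crude_sample_lower_bound(p, d):7.1f}")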

  18. Necessary conditions for graph recovery
  • take Ising models P_θ(G) from the family G_{d,p}(λ, ω):
    – graphs with p nodes and max. degree d
    – parameters |θ_st| ≥ λ for all edges (s, t)
    – maximum neighborhood weight ω = max_{s ∈ V} Σ_{t ∈ N(s)} |θ_st|
  • take n i.i.d. observations, and study the probability of success in terms of (n, p, d)

  Theorem (Santhanam & Wainwright, 2008): If the sample size is bounded as
      n ≤ max{ log p / (2 λ tanh(λ)),  (d/8) log(p/(8d)),  exp(ω/2) λ d log p / (16 sinh(λ)) },
  then the probability of error of any algorithm over G_{d,p}(λ, ω) is at least 1/2.
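
  A small numerical comparison (not from the talk) of the weak-coupling term log p / (2 λ tanh λ) from the necessary condition above against the d^3 log p scaling sufficient for ℓ1-regularized logistic regression (slide 13); constants in the sufficient condition are set to 1 for illustration.

    import math

    def necessary_weak_coupling(p, lam):
        # First term of the necessary bound: log p / (2 * lambda * tanh(lambda)).
        return math.log(p) / (2 * lam * math.tanh(lam))

    def sufficient_l1_scaling(p, d):
        # Sufficient-condition scaling n ~ d^3 log p from slide 13 (constant set to 1).
        return d ** 3 * math.log(p)

    for p, d, lam in [(100, 3, 0.1), (100, 3, 0.5), (1000, 5, 0.1)]:
        lo = necessary_weak_coupling(p, lam)
        hi = sufficient_l1_scaling(p, d)
        print(f"p={p:5d} d={d} lambda={lam:.2f}   necessary >~ {lo:8.1f}   sufficient ~ {hi:8.1f}")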
