
SLIDE 1

High-dimensional graphical model selection: Practical and information-theoretic limits

Martin Wainwright, Departments of Statistics and EECS, UC Berkeley, California, USA

Based on joint work with: John Lafferty (CMU), Pradeep Ravikumar (UC Berkeley), and Prasad Santhanam (University of Hawaii). Supported by grants from the National Science Foundation and a Sloan Foundation Fellowship.

SLIDE 2

Introduction

classical asymptotic theory of statistical inference:
– number of observations n → +∞
– model dimension p stays fixed

not suitable for many modern applications:
– { images, signals, systems, networks } are frequently large (p ≈ 10^3 to 10^8)
– function/surface estimation: enforces the limit p → +∞
– interesting consequences: might have p = Θ(n) or even p ≫ n

curse of dimensionality: frequently impossible to obtain consistent procedures unless p/n → 0

can be saved by a lower effective dimensionality, due to some form of complexity constraint:
– sparse vectors
– {sparse, structured, low-rank} matrices
– structured regression functions
– graphical models (Markov random fields)

SLIDE 3

What are graphical models?

Markov random field: a random vector (X1, …, Xp) whose distribution factors according to a graph G = (V, E)

[Figure: example graph with maximal cliques A, B, C, D]

Hammersley-Clifford theorem: (X1, …, Xp) being Markov w.r.t. G implies the factorization

P(x1, …, xp) ∝ exp{ θA(xA) + θB(xB) + θC(xC) + θD(xD) }

studied/used in various fields: spatial statistics, language modeling, computational biology, computer vision, statistical physics, ...

SLIDE 4

Graphical model selection

let G = (V, E) be an undirected graph on p = |V| vertices

pairwise Markov random field: the family of probability distributions

P(x1, …, xp; θ) = (1/Z(θ)) exp{ Σ_{(s,t)∈E} ⟨θst, φst(xs, xt)⟩ }

given n independent and identically distributed (i.i.d.) samples of X = (X1, …, Xp), identify the underlying graph structure

complexity constraint: restrict to the subset Gd,p of graphs with maximum degree d
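For concreteness, here is a minimal Python sketch of this family in the binary Ising special case, where φst(xs, xt) = xs·xt (the case studied on the information-theoretic slides below); the brute-force computation of Z(θ) and all names are illustrative only:

```python
import itertools
import numpy as np

def ising_log_potential(x, theta):
    """Unnormalized log-probability of x in {-1,+1}^p under a pairwise
    Ising model: sum over edges (s,t) of theta[s,t] * x[s] * x[t]."""
    # theta is a symmetric (p, p) matrix; theta[s, t] != 0 iff (s, t) in E.
    return 0.5 * x @ theta @ x  # 0.5 because each edge is counted twice

def ising_distribution(theta):
    """Exact distribution over {-1,+1}^p by enumeration, i.e. dividing by
    Z(theta); feasible only for small p, which is the point: the graph must
    be learned from samples, not from the normalized distribution."""
    p = theta.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=p)))
    log_w = np.array([ising_log_potential(x, theta) for x in states])
    w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    return states, w / w.sum()

# Example: a 4-cycle on p = 4 nodes (max. degree d = 2), edge weights 0.5.
p = 4
theta = np.zeros((p, p))
for s, t in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    theta[s, t] = theta[t, s] = 0.5
states, probs = ising_distribution(theta)
print(probs.sum())  # 1.0
```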

SLIDE 5

Illustration: Voting behavior of US senators

Graphical model fit to the voting records of US senators (Banerjee, El Ghaoui, & d'Aspremont, 2008)

SLIDE 6

Some issues in high-dimensional inference

Consider a fixed loss function and a fixed error level δ.

Limitations of tractable algorithms:
– given particular (polynomial-time) algorithms, for what sample sizes n do they succeed/fail to achieve error δ?
– given a collection of methods, when does more computation reduce the minimum number of samples needed?

Information-theoretic limitations: view data collection as communication from nature → statistician:
– what are the fundamental limits of the problem (its Shannon capacity)?
– when are known (polynomial-time) methods optimal?
– when are there gaps between polynomial-time methods and optimal methods?

SLIDE 7

Previous/on-going work on graph selection

exact solution for trees (Chow & Liu, 1967)

local testing-based approaches (e.g., Spirtes et al., 2000; Kalisch & Bühlmann, 2008)

methods for Gaussian MRFs:
– ℓ1-regularized neighborhood regression (e.g., Meinshausen & Bühlmann, 2005; Wainwright, 2006; Zhao, 2006)
– ℓ1-regularized log-determinant (e.g., Yuan & Lin, 2006; d'Aspremont et al., 2007; Friedman, 2008; Ravikumar et al., 2008)

methods for discrete MRFs:
– neighborhood-based search method (Bresler, Mossel & Sly, 2008)
– ℓ1-regularized logistic regression (Ravikumar et al., 2006, 2008)

information-theoretic approaches:
– pseudolikelihood and BIC criterion (Csiszár & Talata, 2006)
– information-theoretic limitations (Santhanam & Wainwright, 2008)

SLIDE 8

Markov property and neighborhood structure

Markov properties encode neighborhood structure:

(Xr | X_{V∖r}) =_d (Xr | X_{N(r)}),

i.e., conditioning on all remaining variables is equivalent in distribution to conditioning on the Markov blanket N(r).

[Figure: node Xr with Markov blanket N(r) = {s, t, u, v, w}; conditioning on the full graph vs. on the Markov blanket]

basis of pseudolikelihood method

(Besag, 1974)

SLIDE 9

Practical method via neighborhood regression

Observation: recovering the graph G is equivalent to recovering the neighborhood set N(r) for all r ∈ V.

Method: given n i.i.d. samples {X(1), …, X(n)}, perform logistic regression of each node Xr on X\r := {Xt, t ≠ r} to estimate the neighborhood structure N̂(r):

1. For each node r ∈ V, perform ℓ1-regularized logistic regression of Xr on the remaining variables X\r:

θ̂[r] := arg min_{θ ∈ R^(p−1)} { (1/n) Σ_{i=1}^n f(θ; X(i)\r) + ρn ‖θ‖1 },

where the first term is the logistic likelihood and the second the ℓ1 regularization.

2. Estimate the local neighborhood N̂(r) as the support (non-zero entries) of the regression vector θ̂[r].

3. Combine the neighborhood estimates in a consistent manner (AND or OR rule), as in the sketch below.
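A minimal sketch of steps 1-3 in Python, with scikit-learn's ℓ1-penalized logistic regression standing in for the paper's solver; the ±1 data coding, the zero-tolerance 1e-8, and the mapping from ρn to sklearn's C parameter are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_graph(X, rho_n, rule="AND"):
    """Estimate a graph by l1-regularized neighborhood logistic regression.

    X: (n, p) array of samples coded in {-1, +1}; rho_n: regularization
    level (the theory suggests rho_n proportional to sqrt(log(p)/n)).
    Returns a (p, p) boolean adjacency-matrix estimate."""
    n, p = X.shape
    nbhd = np.zeros((p, p), dtype=bool)
    for r in range(p):
        y = (X[:, r] > 0).astype(int)        # step 1: node r is the response
        Z = np.delete(X, r, axis=1)          # remaining variables X_{\r}
        # sklearn minimizes C * sum(losses) + ||w||_1, so rho_n = 1/(C*n).
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (n * rho_n)).fit(Z, y)
        support = np.abs(clf.coef_.ravel()) > 1e-8  # step 2: support of theta
        cols = np.delete(np.arange(p), r)    # map back to original indices
        nbhd[r, cols[support]] = True
    # step 3: symmetrize the directed neighborhood estimates
    return nbhd & nbhd.T if rule == "AND" else nbhd | nbhd.T
```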

SLIDE 10

High-dimensional analysis

classical analysis: dimension p fixed, sample size n → +∞

high-dimensional analysis: allow the dimension p, the sample size n, and the maximum degree d to increase at arbitrary rates

take n i.i.d. samples from the MRF defined by Gd,p, and study the probability of success as a function of all three parameters:

Success(n, p, d) = P[Method recovers graph Gd,p from n samples]

the theory is non-asymptotic: it gives explicit probabilities for finite (n, p, d)

SLIDE 11

Empirical behavior: Unrescaled plots

[Figure: success probability vs. raw sample size n for a star graph with a linear fraction of neighbors; curves for p = 64, 100, 225]

Plots of success probability versus raw sample size n.

SLIDE 12

Empirical behavior: Appropriately rescaled

[Figure: success probability vs. control parameter for a star graph with a linear fraction of neighbors; curves for p = 64, 100, 225]

Plots of success probability versus the control parameter T_LR(n, p, d).
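For reference, a small sketch of the rescaling behind these plots, using the control parameter T_LR(n, p, d) = n/(d³ log p) defined on the next slide; the (p, d) pairs are illustrative stand-ins for the experimental settings:

```python
import numpy as np

def control_parameter(n, p, d):
    """Rescaled sample size T_LR(n, p, d) = n / (d^3 * log p)."""
    return n / (d ** 3 * np.log(p))

# Plotting success probability against T_LR instead of the raw sample
# size n makes curves for different problem sizes (p, d) comparable.
n = np.array([100, 200, 300, 400, 500, 600])
for p, d in [(64, 4), (100, 5), (225, 7)]:   # illustrative (p, d) pairs
    print(p, np.round(control_parameter(n, p, d), 3))
```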

SLIDE 13

Sufficient conditions for consistent model selection

consider graph sequences Gd,p = (V, E) with p vertices and maximum degree d; draw n i.i.d. samples, and analyze the probability of success indexed by (n, p, d)

Theorem (RavWaiLaf06, RavWaiLaf08): Suppose the rescaled sample size satisfies

T_LR(n, p, d) := n / (d³ log p) > T*crit

and the regularization parameter satisfies ρn ≥ c1 τ √(log p / n). Then with probability greater than 1 − 2 exp(−c2 (τ − 2) log p) → 1:

(a) For each node r ∈ V, the ℓ1-regularized logistic regression program has a unique solution. (Non-trivial, since p ≫ n means the program is not strictly convex.)

(b) The estimated sign neighborhood N̂±(r) correctly excludes all edges not in the true neighborhood.

(c) If in addition θmin ≥ c3 τ √(d² log p / n), the method selects the correct signed neighborhood.
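To make the scalings concrete, a tiny sketch evaluating the theorem's requirements for given (p, d); the theorem does not pin down T*crit, c1, or τ numerically, so the constants below are placeholders:

```python
import math

def min_samples(p, d, T_crit=1.0):
    """Smallest n with T_LR(n, p, d) = n / (d**3 * log p) > T_crit."""
    return math.floor(T_crit * d ** 3 * math.log(p)) + 1

def reg_parameter(n, p, c1=1.0, tau=3.0):
    """Regularization level rho_n = c1 * tau * sqrt(log(p) / n)."""
    return c1 * tau * math.sqrt(math.log(p) / n)

# The d**3 factor dominates: doubling the degree d multiplies the
# required sample size by 8, while p enters only logarithmically.
for p, d in [(100, 3), (10000, 3), (10000, 6)]:
    n = min_samples(p, d)
    print(f"p={p:>5}, d={d}: n > {n:>5}, rho_n ~ {reg_parameter(n, p):.3f}")
```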

SLIDE 14

Some challenges in distinguishing graphs

[Figure: example graphs illustrating two failure modes: "guilt by association" and hidden interactions]

Conditions on the Fisher information matrix Q* = E[∇²f(θ*; X)]:

A1. Bounded eigenspectrum: λ(Q*_SS) ∈ [Cmin, Cmax].

A2. Mutual incoherence: there exists a ν ∈ (0, 1] such that

|||Q*_{ScS} (Q*_SS)^(−1)|||_{∞,∞} ≤ 1 − ν, where |||A|||_{∞,∞} := max_i Σ_j |Aij|.

SLIDE 15

Proof sketch: Primal-dual certificate

construct a candidate primal-dual pair (θ̂, ẑ) ∈ R^(p−1) × R^(p−1)

this is a proof technique, not a practical algorithm!

(A) For a fixed node r with S = N(r), solve the restricted program

θ̂ = arg min_{θ ∈ R^(p−1), θ_Sc = 0} { (1/n) Σ_{i=1}^n f(θ; X(i)\r) + ρn ‖θ‖1 },

thereby obtaining the candidate solution θ̂ = (θ̂_S, 0_Sc).

(B) Choose ẑ_S ∈ R^|S| as an element of the subdifferential ∂‖θ̂_S‖1.

(C) Using the optimality conditions of the original convex program, solve for ẑ_Sc and check whether strict dual feasibility |ẑ_j| < 1 holds for all j ∈ Sc.

Lemma: the full convex program recovers the neighborhood if and only if the primal-dual witness succeeds.
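Below is a numerical sketch of the witness construction (A)-(C) for a single node, assuming ±1-coded responses and knowledge of the true support S; scipy's Nelder-Mead is a crude stand-in for an exact solver of the restricted program, and every name here is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def logistic_loss_grad(theta, Z, y):
    """Average logistic loss and gradient for responses y in {-1, +1}."""
    margins = y * (Z @ theta)
    loss = np.mean(np.logaddexp(0.0, -margins))
    grad = -(Z.T @ (y / (1.0 + np.exp(margins)))) / len(y)
    return loss, grad

def primal_dual_witness(Z, y, S, rho_n):
    """Solve the program restricted to support S, then test strict dual
    feasibility on the complement S^c (steps (A)-(C) of the proof sketch)."""
    S = np.asarray(S)
    # (A) restricted l1-penalized program, coordinates outside S fixed at 0
    obj = lambda t: logistic_loss_grad(t, Z[:, S], y)[0] + rho_n * np.abs(t).sum()
    theta_S = minimize(obj, np.zeros(len(S)), method="Nelder-Mead").x
    theta = np.zeros(Z.shape[1])
    theta[S] = theta_S
    # (B) a subgradient of the l1 norm at theta_S is its sign vector
    z_S = np.sign(theta_S)
    # (C) stationarity of the full program, grad + rho_n * z = 0, determines
    # z on S^c; the witness succeeds iff |z_j| < 1 strictly for all j in S^c
    _, grad = logistic_loss_grad(theta, Z, y)
    z = -grad / rho_n
    Sc = np.setdiff1d(np.arange(Z.shape[1]), S)
    return bool(np.max(np.abs(z[Sc])) < 1.0)
```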

SLIDE 16

Information-theoretic limits on graph selection

thus far: we have exhibited a particular polynomial-time method that recovers the structure once n = Ω(d³ log(p − d))

but... is this a "good" result? Are there polynomial-time methods that can do better?

information theory can answer a stronger question: is there any method, even with exponential complexity, that can do better?

(Santhanam & Wainwright, 2008)

SLIDE 17

Graph selection as channel coding

graphical model selection is an unorthodox channel coding problem: nature sends G ∈ Gd,p := { graphs on p vertices, max. degree d }

[Channel diagram: G → P(X | G) → X(1), …, X(n)]

decoding problem: use the observations {X(1), …, X(n)} to correctly distinguish the "codeword" G

channel capacity for graph decoding: a balance between
– the log number of models: log |M(p, d)| = Θ(pd log(p/d))
– the relative distinguishability of different models

SLIDE 18

Necessary conditions for graph recovery

take Ising models Pθ(G) from the class Gd,p(λ, ω):
– graphs with p nodes and maximum degree d
– parameters |θst| ≥ λ for all edges (s, t)
– maximum neighborhood weight ω = max_{s∈V} Σ_{t∈N(s)} |θst|

take n i.i.d. observations, and study the probability of success in terms of (n, p, d)

Theorem (necessary conditions): If the sample size satisfies

n ≤ max{ log p / (2λ tanh(λ)), exp(ω/2) λ d log(pd) / (16 sinh(λ)), (d/8) log(p/(8d)) },

then the probability of error of any algorithm over Gd,p(λ, ω) is at least 1/2.

(Santhanam & W., 2008)
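A small sketch evaluating the three competing terms of this necessary condition (as reconstructed above); treat the constants and example values as illustrative:

```python
import math

def necessary_samples(p, d, lam, omega):
    """max of the three lower-bound terms; any algorithm given fewer
    samples than this errs with probability at least 1/2."""
    t1 = math.log(p) / (2 * lam * math.tanh(lam))
    t2 = math.exp(omega / 2) * lam * d * math.log(p * d) / (16 * math.sinh(lam))
    t3 = (d / 8) * math.log(p / (8 * d))
    return max(t1, t2, t3)

# Weak couplings lam = 1/d (so omega can stay bounded) give the
# Omega(d^2 log p) regime discussed on the next slide.
p, d = 1000, 10
print(round(necessary_samples(p, d, lam=1.0 / d, omega=1.0), 1))
```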

SLIDE 19

Some consequences

note that the neighborhood weight ω = max_{s∈V} Σ_{t∈N(s)} |θst| is at least dλ

hence, any method needs at least

n > exp(dλ/2) λ d log(pd) / (16 sinh(λ))

samples

if λ = O(1/d), then at least n > log p / λ² = Ω(d² log p) samples are needed

ℓ1-regularized logistic regression (LR) is therefore order-optimal for constant degrees d

for d tending to infinity, there is a gap between optimal methods and ℓ1:
– any method requires n = Ω(d² log p) samples
– the LR method is guaranteed to work with n = Ω(d³ log p) samples

SLIDE 20

Geometric intuition underlying proofs

[Figure: the true model surrounded by classes of near-by (D1) and far-away (D2) alternatives]

Error probability is controlled by two competing quantities: the log number of competing models versus how distinguishable the models are.

Model type      Log # models     Distance scaling
Near-by         log p            c2 θ²
Intermediate    d log p          sinh(θ) / (θ exp(θd))
Far-away        pd log(p/d)      c2 p

SLIDE 21

Summary and open questions

ℓ1-regularized regression to select neighborhoods: succeeds with sample size

n > c1/θmin² + c2 d³ log p

any method (including those with exponential complexity) fails for

n < c3/θmin² + c4 d² log p

some extensions:
– non-binary MRFs via block-structured regularization schemes
– non-i.i.d. sampling models
– other performance metrics (e.g., a fraction (1 − δ) of edges correct)

broader issue: what are the optimal trade-offs between statistical and computational efficiency?

SLIDE 22

Some papers

Ravikumar, P., Wainwright, M. J., and Lafferty, J. (2008). High-dimensional Ising model selection using ℓ1-regularized logistic regression. Appeared at the NIPS Conference (2006); to appear in the Annals of Statistics.

Santhanam, P., and Wainwright, M. J. (2008). Information-theoretic limitations of high-dimensional graphical model selection. Presented at the International Symposium on Information Theory.

Wainwright, M. J. (2006). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. To appear in IEEE Transactions on Information Theory.

Wainwright, M. J. (2007). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. Technical report, Department of Statistics, UC Berkeley, January 2007.
