Learning the Structure of Mixed Graphical Models
Jason Lee with Trevor Hastie, Michael Saunders, Yuekai Sun, and Jonathan Taylor Institute of Computational & Mathematical Engineering Stanford University June 26th, 2014
◮ Pairwise MRF:

p(y) = \frac{1}{Z(\Theta)} \exp\Big( \sum_{r,j} \phi_{rj}(y_r, y_j) \Big)
◮ Multivariate Gaussian distribution (Gaussian MRF):

p(x) = \frac{1}{Z(\Theta)} \exp\Big( -\frac{1}{2} \sum_{s=1}^p \sum_{t=1}^p \beta_{st} x_s x_t + \sum_{s=1}^p \alpha_s x_s \Big)
◮ Want a simple joint distribution on p continuous variables and q discrete (categorical) variables.
◮ The joint distribution of p Gaussian variables is multivariate Gaussian.
◮ The joint distribution of q discrete variables is a pairwise MRF.
◮ Conditional distributions can be estimated via (generalized) linear regression.
◮ What about the potential term between a continuous variable x_s and a discrete variable y_j?
p(x, y; \Theta) = \frac{1}{Z(\Theta)} \exp\Big( \sum_{s=1}^p \sum_{t=1}^p -\tfrac{1}{2}\beta_{st} x_s x_t + \sum_{s=1}^p \alpha_s x_s + \sum_{s=1}^p \sum_{j=1}^q \rho_{sj}(y_j) x_s + \sum_{j=1}^q \sum_{r=1}^q \phi_{rj}(y_r, y_j) \Big)
◮ Pairwise model with 3 types of potentials: discrete-discrete, continuous-discrete, and continuous-continuous. It thus has O((p + q)^2) parameters; a code sketch of the unnormalized log-density follows below.
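To make the parameterization concrete, here is a minimal numpy sketch of the unnormalized log-density. The container formats (`B` for the matrix of \beta_{st}, `alpha`, nested lists `rho` and `phi` for the discrete potentials) are hypothetical bookkeeping choices, not part of the model:

```python
import numpy as np

def unnormalized_log_density(x, y, B, alpha, rho, phi):
    """log of the exp(...) term in p(x, y; Theta); Z(Theta) is omitted.

    x: (p,) continuous values; y: (q,) integer state indices.
    B[s, t] = beta_st; alpha[s] = alpha_s;
    rho[s][j] is a length-L_j vector with rho[s][j][y_j] = rho_sj(y_j);
    phi[r][j] is an L_r x L_j table with phi[r][j][y_r, y_j] = phi_rj(y_r, y_j).
    """
    p, q = len(x), len(y)
    val = -0.5 * x @ B @ x + alpha @ x   # continuous-continuous and linear terms
    for s in range(p):                   # continuous-discrete potentials
        for j in range(q):
            val += rho[s][j][y[j]] * x[s]
    for r in range(q):                   # discrete-discrete potentials,
        for j in range(r + 1, q):        # summed over unordered pairs r < j
            val += phi[r][j][y[r], y[j]]
    return val
```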
◮ p(x|y) is Gaussian with \Sigma = B^{-1} and \mu = B^{-1}\big(\alpha + \sum_j \rho_j(y_j)\big), where \rho_j(y_j) denotes the vector with entries \rho_{sj}(y_j).
◮ The conditional distribution of x has the same covariance regardless of the values taken by the discrete variables y; the mean depends additively on the values of the discrete variables y.
◮ Special case of Lauritzen’s mixed graphical model.
◮ Lauritzen proposed the conditional Gaussian model.
◮ Fellinghauer et al. (2011) use random forests to fit the conditional distributions; this is tailored for mixed models.
◮ Cheng, Levina, and Zhu (2013) generalize the model to include higher-order interactions.
◮ Yang et al. (2014) and Shizhe Chen, Witten, and Shojaie (2014) generalize beyond Gaussian and categorical variables.
Outline: Parameter Learning, Structure Learning, Experimental Results
◮ Log-likelihood: \ell(\Theta) = \sum_i \log p(x^i; \Theta). Its derivative is \hat{T}(x, y) - E_{p(\Theta)}[T(x, y)], where T are the sufficient statistics; the expectation term is hard to compute.
◮ Log-pseudolikelihood: \ell_{PL}(\Theta) = \sum_i \sum_s \log p(x^i_s \mid x^i_{\setminus s}; \Theta).
◮ The pseudolikelihood is an asymptotically consistent approximation to the likelihood, obtained by replacing the joint density with the product of the conditional distributions.
◮ The partition function cancels out in the conditional distributions, so gradients of the log-pseudolikelihood are cheap to compute.
For a discrete variable y_r with L_r states, its conditional distribution is multinomial, as used in (multiclass) logistic regression. Whenever a discrete variable is a predictor, each of its levels contributes an additive effect; continuous variables contribute linear effects.
p(y_r \mid y_{\setminus r}, x; \Theta) = \frac{\exp\big( \sum_s \rho_{sr}(y_r) x_s + \sum_{j \neq r} \phi_{rj}(y_r, y_j) \big)}{\sum_{l=1}^{L_r} \exp\big( \sum_s \rho_{sr}(l) x_s + \sum_{j \neq r} \phi_{rj}(l, y_j) \big)}

This has the form of a multiclass logistic regression:

p(y_r = k \mid z) = \frac{\exp(w_k^T z)}{\sum_{l=1}^{L_r} \exp(w_l^T z)}
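In code, this conditional is just a softmax over the L_r states. A minimal sketch, reusing the hypothetical `rho`/`phi` containers from the earlier sketch:

```python
import numpy as np

def discrete_conditional(r, y, x, rho, phi, L):
    """p(y_r = k | y_{\\r}, x) for k = 0..L[r]-1: a multiclass logistic model.

    Assumes phi[r][j] is available for every ordered pair (phi[j][r].T if
    only r < j is stored)."""
    logits = np.zeros(L[r])
    for k in range(L[r]):
        z = sum(rho[s][r][k] * x[s] for s in range(len(x)))  # linear effects of x
        z += sum(phi[r][j][k, y[j]]                          # additive discrete effects
                 for j in range(len(y)) if j != r)
        logits[k] = z
    logits -= logits.max()   # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()       # softmax; the partition function cancels
```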
The conditional distribution of a continuous variable x_s given all other variables is Gaussian, with a linear regression model for the mean:

p(x_s \mid x_{\setminus s}, y; \Theta) = \frac{\sqrt{\beta_{ss}}}{\sqrt{2\pi}} \exp\left( -\frac{\beta_{ss}}{2} \left( x_s - \frac{\alpha_s + \sum_j \rho_{sj}(y_j) - \sum_{t \neq s} \beta_{st} x_t}{\beta_{ss}} \right)^2 \right)

This can be expressed as a linear regression:

E(x_s \mid z_1, \ldots, z_p) = \alpha^T z = \alpha_0 + \sum_j z_j \alpha_j \quad (1)

p(x_s \mid z_1, \ldots, z_p) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (x_s - \alpha^T z)^2 \right) \quad (2)
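Given the hypothetical parameter containers from the earlier sketches, the conditional mean and variance can be read off directly from this formula; a minimal sketch:

```python
def continuous_conditional(s, x, y, B, alpha, rho):
    """Mean and variance of p(x_s | x_{\\s}, y): a linear regression in x and y."""
    mean = alpha[s] + sum(rho[s][j][y[j]] for j in range(len(y)))  # additive discrete effects
    mean -= sum(B[s, t] * x[t] for t in range(len(x)) if t != s)   # linear continuous effects
    mean /= B[s, s]
    var = 1.0 / B[s, s]   # sigma^2 = 1/beta_ss, independent of y
    return mean, var
```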
Neighborhood selection / separate regressions:
◮ Each node maximizes its own conditional likelihood p(x_s \mid x_{\setminus s}). Intuitively, this should behave similarly to the pseudolikelihood, since the pseudolikelihood jointly minimizes \sum_s -\log p(x_s \mid x_{\setminus s}).
◮ This has twice the number of parameters of the pseudolikelihood/likelihood, because the separate regressions do not enforce symmetry.
◮ Easily distributed.
Maximum likelihood:
◮ Believed to be more statistically efficient.
◮ Computationally intractable.
Outline: Parameter Learning, Structure Learning, Experimental Results
◮ Lack of an edge (u, v) means X_u \perp X_v \mid X_{\setminus u,v} (X_u and X_v are conditionally independent).
◮ This means the corresponding parameter block \beta_{st}, \rho_{sj}, or \phi_{rj} is 0.
◮ Each parameter block is a different size: continuous-continuous edges are scalars, continuous-discrete edges are vectors, and discrete-discrete edges are tables.
Figure: Estimated structure over 10 nodes. \beta_{st} edges shown in red, \rho_{sj} edges shown in blue, and \phi_{rj} edges shown in a third color.
\min_\Theta\ \ell_{PL}(\Theta) + \lambda \Big( \sum_{s,t} w_{st} \|\beta_{st}\| + \sum_{s,j} w_{sj} \|\rho_{sj}\| + \sum_{r,j} w_{rj} \|\phi_{rj}\| \Big)
◮ Each edge group is of a different size and has a different distribution, so we need a different penalty weight for each group.
◮ By the KKT conditions, a group is non-zero iff \big\| \frac{\partial \ell_{PL}}{\partial \theta_g} \big\| > \lambda w_g. Thus we choose weights w_g \propto E_0 \big\| \frac{\partial \ell_{PL}}{\partial \theta_g} \big\|, so that all groups are on an equal footing.
◮ Composite objective f(x) = g(x) + h(x): \min_\Theta \ell_{PL}(\Theta) + \lambda \big( \sum_{s,t} \|\beta_{st}\| + \sum_{s,j} \|\rho_{sj}\| + \sum_{r,j} \|\phi_{rj}\| \big), with g = \ell_{PL} smooth and h the non-smooth group penalty.
◮ First-order methods: proximal gradient and its variants, which have similar convergence properties to their smooth counterparts (sublinear convergence rate, and linear convergence rate under strong convexity). A sketch follows below.
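The group penalty's proximal operator is blockwise soft-thresholding in closed form, which is what makes proximal gradient applicable. A minimal sketch, where the flattened parameter vector and the `groups` index lists are hypothetical bookkeeping:

```python
import numpy as np

def group_soft_threshold(v, t):
    """prox of t * ||.||_2 on one group: shrink the block's norm by t."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

def proximal_gradient(grad_g, x0, groups, lam, step, n_iter=1000):
    """Minimize g(x) + lam * sum_g ||x_g||_2; `groups` lists the index
    array of each edge group (scalar, vector, or flattened table)."""
    x = x0.copy()
    for _ in range(n_iter):
        z = x - step * grad_g(x)   # gradient step on the smooth part
        for g in groups:
            z[g] = group_soft_threshold(z[g], step * lam)  # prox step per group
        x = z
    return x
```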
◮ Second-order methods: model the smooth part g(x) with a quadratic model. (Proximal gradient corresponds to a linear model of the smooth function g(x).)
◮ Build a quadratic model about the iterate x_k and solve it as a subproblem:

x^+ = \arg\min_u\ g(x) + \nabla g(x)^T (u - x) + \frac{1}{2t} (u - x)^T H (u - x) + h(u)

Algorithm 1: A generic proximal Newton-type method
Require: starting point x_0 \in \operatorname{dom} f
1: repeat
2:   Choose an approximation to the Hessian H_k.
3:   Solve the subproblem for a search direction: \Delta x_k \leftarrow \arg\min_d\ \nabla g(x_k)^T d + \frac{1}{2} d^T H_k d + h(x_k + d).
4:   Select t_k with a backtracking line search.
5:   Update: x_{k+1} \leftarrow x_k + t_k \Delta x_k.
6: until stopping conditions are satisfied.
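A minimal Python rendering of Algorithm 1 (a sketch under simplifying assumptions, not the PNOPT implementation; `prox_solve` stands for any subproblem solver, and the sufficient-decrease constant 0.25 is one conventional choice):

```python
import numpy as np

def proximal_newton(g, h, grad_g, hess_g, prox_solve, x0, tol=1e-8):
    """Generic proximal Newton-type method (Algorithm 1), as a sketch.

    prox_solve(x, grad, H) returns argmin_d grad@d + 0.5*d@H@d + h(x+d).
    """
    x = x0.copy()
    while True:
        grad, H = grad_g(x), hess_g(x)   # step 2: exact Hessian or quasi-Newton
        dx = prox_solve(x, grad, H)      # step 3: search direction
        if np.linalg.norm(dx) < tol:     # step 6: stopping condition
            return x
        # predicted decrease of the composite objective (nonpositive)
        delta = grad @ dx + h(x + dx) - h(x)
        t = 1.0                          # step 4: backtracking line search
        while g(x + t * dx) + h(x + t * dx) > g(x) + h(x) + 0.25 * t * delta:
            t *= 0.5
        x = x + t * dx                   # step 5: update
```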
Definition (Scaled proximal mappings). Let h be a convex function and H a positive definite matrix. The scaled proximal mapping of h at x is defined to be

\operatorname{prox}_h^H(x) = \arg\min_y\ h(y) + \frac{1}{2} \|y - x\|_H^2.

The proximal Newton update is

x_{k+1} = \operatorname{prox}_h^{H_k}\big( x_k - H_k^{-1} \nabla g(x_k) \big),

whereas the proximal gradient update is

x_{k+1} = \operatorname{prox}_{h/L}\big( x_k - \tfrac{1}{L} \nabla g(x_k) \big).
Traces back to:
◮ Projected Newton-type methods
◮ Cost-approximation methods
Popular methods tailored to specific problems:
◮ glmnet: lasso and elastic-net regularized generalized linear models
◮ LIBLINEAR: ℓ1-regularized logistic regression
◮ QUIC: sparse inverse covariance estimation
◮ Theoretical analysis shows that this converges quadratically with the exact Hessian and superlinearly with BFGS (Lee, Sun, and Saunders 2012).
◮ Empirical results on the structure learning problem confirm this.
◮ If we solve the subproblems with first-order methods, we only need the proximal operator of the non-smooth h(u), so the method is very general.
◮ The method lets you choose how to solve the subproblem, and it comes with a stopping criterion that preserves the convergence rate.
◮ PNOPT package: www.stanford.edu/group/SOL/software/pnopt
Special case of a more general model selection consistency theorem.

Theorem (Lee, Sun, and Taylor 2013)
1. \|\hat\Theta - \Theta^\star\|_F \lesssim \sqrt{\frac{|A| \log(p+q)}{n}}
2. \hat\Theta_g = 0 for g \in I.

Here |A| is the number of active edges and I is the set of inactive edges. The main assumption is a generalized irrepresentable condition.
Outline: Parameter Learning, Structure Learning, Experimental Results
Figure: Probability of correct edge recovery vs. sample size n, for ML and PL (p + q = 20). Blue nodes are continuous variables, red nodes are binary variables, and the orange, green, and dark blue lines represent the 3 types of edges. Results are averaged over 100 trials.
◮ The survey dataset we consider consists of 11 variables, of which 2 are continuous and 9 are discrete: age (continuous), log-wage (continuous), year (7 states), sex (2 states), marital status (5 states), race (4 states), education level (5 states), geographic region (9 states), job class (2 states), health (2 states), and health insurance (2 states).
◮ All evaluations are done using a holdout test set of size 100,000 for the survey experiments.
◮ The regularization parameter \lambda is varied over the interval [5 \times 10^{-5}, 0.7] at 50 points equispaced on a log scale for all experiments.
Figure: Separate regression vs. pseudolikelihood, n = 100. Negative log pseudolikelihood vs. regularization parameter (log scale); panels: full model, age, logwage, year, sex, marital, race, education, region, jobclass, health, health ins.; curves: Separate, Joint.
Figure: Separate regression vs. pseudolikelihood, n = 10000. Negative log pseudolikelihood vs. regularization parameter (log scale); panels: full model, age, logwage, year, sex, marital, race, education, region, jobclass, health, health ins.; curves: Separate, Joint.
◮ We originally motivated the pseudolikelihood as a computational surrogate for the likelihood.
◮ The pseudolikelihood is consistent.
◮ For small models, we can compute maximum likelihood estimates and compare them against the pseudolikelihood.
Figure: Maximum likelihood vs. pseudolikelihood, n = 100, 500, 1000, with the regularization parameter on a log scale. The y-axis for the top row is the negative log pseudolikelihood; the y-axis for the bottom row is the negative log likelihood. Pseudolikelihood outperforms maximum likelihood across all the experiments.
◮ We expect PL to do better when evaluated on test negative log PL, and ML to do better when evaluated on test negative log likelihood.
◮ Asymptotic theory also suggests that ML is better.
◮ However, that theory does not apply to misspecified models or the finite-sample regime.
◮ Defined a new pairwise graphical model over Gaussian and discrete variables.
◮ Used the pseudolikelihood for tractable inference.
◮ Used group sparsity to enforce an edge-sparse graphical model.
◮ Developed a fast learning method using proximal Newton, with a theoretical analysis of the proximal Newton algorithm.
◮ Theoretical analysis in the high-dimensional regime for general exponential families.
\Delta x_k = \arg\min_d\ \nabla g(x_k)^T d + \frac{1}{2} d^T H_k d + h(x_k + d) = \arg\min_d\ \hat{g}_k(x_k + d) + h(x_k + d)

Usually, we must use an iterative method to solve this subproblem.
◮ Use proximal gradient or coordinate descent on the subproblem (a sketch follows below).
◮ A gradient/coordinate descent iteration on the subproblem is much cheaper than a gradient iteration on the original function f, since it does not require a pass over the data. By solving the subproblem, we use each gradient evaluation more efficiently than gradient descent does.
◮ H_k is commonly an L-BFGS approximation, so computing a gradient of the subproblem takes O(Lp), whereas a gradient of the original function takes O(np). The subproblem is independent of n.
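A sketch of such a subproblem solver, running proximal gradient on the quadratic model; each inner iteration only multiplies by H_k, so the data never enters. The step-size rule and the `prox_h(v, t)` interface (the proximal map of h with step t) are illustrative assumptions:

```python
import numpy as np

def prox_solve(x, grad, H, prox_h, n_inner=50):
    """Approximately minimize grad@d + 0.5*d@H@d + h(x + d) via proximal gradient."""
    M = np.linalg.norm(H, 2)   # step size 1/M from the model's largest eigenvalue
    d = np.zeros_like(x)
    for _ in range(n_inner):
        inner_grad = grad + H @ d                    # gradient of the quadratic model at x + d
        u = prox_h(x + d - inner_grad / M, 1.0 / M)  # prox step in the original variable
        d = u - x
    return d
```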
Main idea: there is no need to solve the subproblem exactly; we only need a good enough search direction.
◮ We solve the subproblem approximately with an iterative method, terminating (sometimes very) early.
◮ The number of outer iterations may increase, but the computational expense per iteration is smaller.
◮ Many practical implementations use inexact search directions.
We should solve the subproblem more precisely when:
◮ x_k is close to the optimum, since the method converges quadratically in this regime.
◮ \hat{g}_k + h is a good approximation to f in the vicinity of x_k (meaning H_k has captured the curvature in g), since then minimizing the subproblem also minimizes f.
For regular Newton's method, the most common stopping condition is \|\nabla \hat{g}_k(x_k + \Delta x_k)\| \leq \eta_k \|\nabla g(x_k)\|. Analogously, we stop solving the subproblem when

\|G_{(\hat{g}_k + h)/M}(x_k + \Delta x_k)\| \leq \eta_k \|G_{f/M}(x_k)\|,

where G_{f/M} denotes the composite gradient step on f with step size 1/M. Choose \eta_k based on how well G_{\hat{g}_k + h} approximates G_f:

\eta_k \sim \frac{\|G_{(\hat{g}_{k-1} + h)/M}(x_k) - G_{f/M}(x_k)\|}{\|G_{f/M}(x_{k-1})\|}

This \eta_k is small when:
◮ G_{f/M} is small, so x_k is close to the optimum.
◮ G_{\hat{g}+h} - G_f \approx 0, meaning H_k is accurately capturing the curvature of g.
◮ The inexact proximal Newton method converges superlinearly with the previous stopping criterion and this choice of \eta_k.
◮ In practice, the stopping criterion works extremely well: it uses approximately the same number of outer iterations as solving the subproblem exactly, but spends much less time on the subproblems. (A sketch of the gradient mapping and forcing term follows below.)
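A sketch of the two ingredients of this criterion, the composite gradient step G_{f/M} and the forcing term \eta_k; the `prox_h(v, t)` interface and argument names are hypothetical:

```python
import numpy as np

def grad_mapping(x, grad_at_x, prox_h, M):
    """Composite gradient step G_{f/M}(x) = M * (x - prox_{h/M}(x - grad f(x)/M))."""
    return M * (x - prox_h(x - grad_at_x / M, 1.0 / M))

def forcing_term(G_model_prev_at_xk, G_f_at_xk, G_f_at_prev_x):
    """eta_k from the rule above: small when the previous quadratic model's
    gradient mapping agreed with the true one."""
    return np.linalg.norm(G_model_prev_at_xk - G_f_at_xk) / np.linalg.norm(G_f_at_prev_x)
```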
Sparse inverse covariance:

\min_\Theta\ -\log\det(\Theta) + \operatorname{tr}(S\Theta) + \lambda \|\Theta\|_1

◮ S is the sample covariance, which estimates the population covariance \Sigma: S = \frac{1}{n} \sum_{i=1}^n (x^i - \hat\mu)(x^i - \hat\mu)^T.
◮ S is not of full rank when n < p, so S^{-1} doesn't exist.
◮ The graphical lasso is a good estimator of \Sigma^{-1}.
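A minimal numpy sketch of this objective; the function names are hypothetical, and penalizing the diagonal of \Theta (as written above) is one common convention:

```python
import numpy as np

def sample_covariance(X):
    """S = (1/n) * sum_i (x_i - mu)(x_i - mu)^T; rank-deficient when n < p."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

def graphical_lasso_objective(Theta, S, lam):
    """-logdet(Theta) + tr(S @ Theta) + lam * ||Theta||_1."""
    sign, logdet = np.linalg.slogdet(Theta)   # stable log-determinant
    assert sign > 0, "Theta must be positive definite"
    return -logdet + np.trace(S @ Theta) + lam * np.abs(Theta).sum()
```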
Figure: Proximal BFGS method with three subproblem stopping conditions (Estrogen dataset p = 682)
[Panels: relative suboptimality vs. function evaluations and vs. time (sec); curves: adaptive, maxIter = 10, exact.]
Figure: Leukemia dataset p = 1255
[Panels: relative suboptimality vs. function evaluations and vs. time (sec); curves: adaptive, maxIter = 10, exact.]
Sparse logistic regression:
◮ Training data: x^{(1)}, \ldots, x^{(n)} with labels y^{(1)}, \ldots, y^{(n)} \in \{-1, 1\}.
◮ We fit a sparse logistic model to this data:

\min_w\ \frac{1}{n} \sum_{i=1}^n \log\big( 1 + \exp(-y_i w^T x_i) \big) + \lambda \|w\|_1
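A sketch of the smooth part and its gradient, which is all a proximal method needs beyond soft-thresholding; the `logaddexp` form avoids overflow:

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """(1/n) sum_i log(1 + exp(-y_i w^T x_i)) + lam * ||w||_1, with y_i in {-1, +1}."""
    margins = y * (X @ w)
    return np.logaddexp(0.0, -margins).mean() + lam * np.abs(w).sum()

def smooth_gradient(w, X, y):
    """Gradient of the smooth part only; the l1 term is handled by its prox."""
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))   # = exp(-m) / (1 + exp(-m))
    return -(X.T @ (y * sigma)) / len(y)
```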
Figure: Proximal L-BFGS method vs. FISTA and SpaRSA (gisette dataset, n = 5000, p = 6000 and dense)
[Panels: relative suboptimality vs. function evaluations and vs. time (sec); curves: FISTA, SpaRSA, PN.]
Figure: rcv1 dataset, n = 47,000, p = 542,000, and 40 million nonzeros
[Panels: relative suboptimality vs. function evaluations and vs. time (sec); curves: FISTA, SpaRSA, PN.]
\min_\theta\ -\sum_{r,j} \theta_{rj}(x_r, x_j) + \log Z(\theta) + \lambda \sum_{r,j} \|\theta_{rj}\|_F
Figure: Markov random field structure learning
[Panels: log(f − f*) vs. iteration and vs. time (sec); curves: Fista, AT, PN100, PN15, SpaRSA.]
Proximal Newton-type methods:
◮ converge rapidly near the optimal solution, and can produce a solution of high accuracy;
◮ are insensitive to the choice of coordinate system and to the condition number of the level sets of the objective;
◮ are suited to problems where g and \nabla g are expensive to evaluate compared to h and \operatorname{prox}_h; this is the case when g(x) is a loss function and computing the gradient requires a pass over the data;
◮ "more efficiently use" each gradient evaluation of g(x).