Nonparametric Bayes tensor factorizations for big data
David Dunson, Department of Statistical Science, Duke University
Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082

Outline: Motivation; Conditional tensor factorizations; Some properties - heuristic & otherwise; Computation & applications; Generalizations
Motivating setting - high dimensional predictors
◮ Routine to encounter massive-dimensional prediction & variable selection problems
◮ We have y ∈ Y & x = (x1, . . . , xp)′ ∈ X
◮ Unreasonable to assume linearity or additivity in motivating applications - e.g., epidemiology, genomics, neurosciences
◮ Goal: nonparametric approaches that accommodate large p, small n, allow interactions, scale computationally to big p
Gaussian processes with variable selection
◮ For Y = ℝ & X ⊂ ℝ^p, one approach lets
  y_i = μ(x_i) + ε_i,  ε_i ∼ N(0, σ²),
where μ : X → ℝ is an unknown regression function
◮ Following Zou et al. (2010) & others,
  μ ∼ GP(m, c),  c(x, x′) = φ exp{ −∑_{j=1}^{p} α_j (x_j − x′_j)² },
with mixture priors placed on the α_j's
◮ Zou et al. (2010) show good empirical results
◮ Bhattacharya, Pati & Dunson (2011) - minimax adaptive rates
Issues & alternatives
◮ Mean regression & computation challenging
◮ Difficult computationally beyond conditionally Gaussian homoscedastic case
◮ Density regression interesting as variance & shape of the response distribution often change with x
◮ Initial focus: classification from many categorical predictors
◮ Approach generalizes directly to arbitrary Y and X
Classification & conditional probability tensors
◮ Suppose Y ∈ {1, . . . , d0} & X_j ∈ {1, . . . , d_j}, j = 1, . . . , p
◮ The classification function or conditional probability is
  Pr(Y = y | X1 = x1, . . . , Xp = xp) = P(y | x1, . . . , xp)
◮ This classification function can be structured as a d0 × d1 × · · · × dp tensor
◮ Let P_{d1,...,dp}(d0) denote the set of all possible conditional probability tensors
◮ P ∈ P_{d1,...,dp}(d0) implies P(y | x1, . . . , xp) ≥ 0 for all y, x1, . . . , xp & ∑_{y=1}^{d0} P(y | x1, . . . , xp) = 1
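As a concrete check, membership in the set above is easy to verify numerically. A minimal sketch (function name ours), storing P as a NumPy array with the response on the first axis:

```python
import numpy as np

def is_conditional_prob_tensor(P, atol=1e-8):
    """Check membership in P_{d1,...,dp}(d0): P has shape
    (d0, d1, ..., dp), entries are nonnegative, and summing over
    the response axis gives 1 for every predictor combination."""
    P = np.asarray(P)
    nonneg = np.all(P >= -atol)
    sums_to_one = np.allclose(P.sum(axis=0), 1.0, atol=atol)
    return nonneg and sums_to_one

# A random valid tensor: normalize along the response axis.
rng = np.random.default_rng(0)
P = rng.random((2, 3, 3, 4))
P /= P.sum(axis=0, keepdims=True)
assert is_conditional_prob_tensor(P)
```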
Tensor factorizations
◮ P = big tensor & data will be very sparse
◮ If P were a matrix, we might think of the SVD
◮ We can instead consider a tensor factorization
◮ Common approach is PARAFAC - sum of rank one tensors
◮ Tucker factorizations express a d1 × · · · × dp tensor A = {a_{c1···cp}} as
  a_{c1···cp} = ∑_{h1=1}^{d1} · · · ∑_{hp=1}^{dp} g_{h1···hp} ∏_{j=1}^{p} u^{(j)}_{hj cj},
where G = {g_{h1···hp}} is a core tensor
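The Tucker factorization above reduces to a single einsum. A minimal sketch for p = 3 modes (all dimensions and names are ours, for illustration):

```python
import numpy as np

# Build a d1 x d2 x d3 tensor from a Tucker factorization with core G
# (k1 x k2 x k3) and mode factors U1 (k1 x d1), U2 (k2 x d2), U3 (k3 x d3):
#   a_{c1 c2 c3} = sum_{h1,h2,h3} g_{h1 h2 h3} u1_{h1 c1} u2_{h2 c2} u3_{h3 c3}
rng = np.random.default_rng(1)
k1, k2, k3 = 2, 3, 2
d1, d2, d3 = 4, 5, 3
G = rng.random((k1, k2, k3))
U1, U2, U3 = rng.random((k1, d1)), rng.random((k2, d2)), rng.random((k3, d3))

A = np.einsum('abc,ai,bj,ck->ijk', G, U1, U2, U3)

# Check one entry against the explicit triple sum.
c = (1, 2, 0)
explicit = sum(G[h1, h2, h3] * U1[h1, c[0]] * U2[h2, c[1]] * U3[h3, c[2]]
               for h1 in range(k1) for h2 in range(k2) for h3 in range(k3))
assert np.isclose(A[c], explicit)
```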
Our factorization (with Yun Yang)
◮ Our proposed nonparametric model for the conditional probability:
  P(y | x1, . . . , xp) = ∑_{h1=1}^{k1} · · · ∑_{hp=1}^{kp} λ_{h1h2...hp}(y) ∏_{j=1}^{p} π^{(j)}_{hj}(xj)   (1)
◮ Tucker factorization of the conditional probability P
◮ To be a valid conditional probability, the parameters are subject to
  ∑_{c=1}^{d0} λ_{h1h2...hp}(c) = 1, for any (h1, h2, . . . , hp),
  ∑_{h=1}^{kj} π^{(j)}_h(xj) = 1, for any possible pair (j, xj).   (2)
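A small sketch (our notation, p = 2 for concreteness) showing that drawing λ and the π^{(j)}'s from Dirichlet distributions, so the constraints (2) hold by construction, yields a valid conditional probability tensor via (1):

```python
import numpy as np

rng = np.random.default_rng(2)
d0, p = 3, 2                      # response levels, # predictors
d = [4, 5]                        # levels of X1, X2
k = [2, 3]                        # Tucker ranks k1, k2

# Core lambda: a Dirichlet over y for every (h1, h2), so each
# lambda_{h1 h2}(.) sums to 1 (first constraint in (2)).
lam = rng.dirichlet(np.ones(d0), size=(k[0], k[1]))               # (k1, k2, d0)
# pi^{(j)}: for each level x_j, a probability vector over h_j
# (second constraint in (2)).
pi = [rng.dirichlet(np.ones(k[j]), size=d[j]) for j in range(p)]  # (d_j, k_j)

# P(y | x1, x2) via the factorization (1).
P = np.einsum('aby,ia,jb->yij', lam, pi[0], pi[1])
assert np.allclose(P.sum(axis=0), 1.0)    # valid conditional probabilities
```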
Comments on proposed factorization
◮ kj = 1 corresponds to exclusion of the jth feature
◮ By placing a prior on kj, can induce variable selection & learning of the dimension of the factorization
◮ Representation is many-to-one and the parameters in the factorization cannot be uniquely identified
◮ Not a barrier to Bayesian inference - we don't care about the parameters in the factorization
◮ We want to do variable selection, prediction & inference on predictor effects
Theoretical support
The following theorem formalizes the flexibility:
Theorem
Every d0 × d1 × d2 × · · · × dp conditional probability tensor P ∈ P_{d1,...,dp}(d0) can be decomposed as (1), with 1 ≤ kj ≤ dj for j = 1, . . . , p. Furthermore, λ_{h1h2...hp}(y) and π^{(j)}_{hj}(xj) can be chosen to be nonnegative and to satisfy the constraints (2).
Latent variable representation
◮ Simplify representation by introducing p latent class indicators z1, . . . , zp for X1, . . . , Xp
◮ Conditional independence of Y and (X1, . . . , Xp) given (z1, . . . , zp)
◮ The model can be written as
  Y_i | z_{i1}, . . . , z_{ip} ∼ Mult({1, . . . , d0}, λ_{z_{i1},...,z_{ip}}),
  z_{ij} | X_{ij} = x_j ∼ Mult({1, . . . , kj}, π^{(j)}_1(xj), . . . , π^{(j)}_{kj}(xj)),
◮ Useful computationally & provides some insight into the model
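The latent-class representation can be checked by simulation. In this sketch (p = 1 and binary Y for simplicity, all names ours), drawing Y through the latent class z reproduces the marginal conditional probability from the factorization:

```python
import numpy as np

rng = np.random.default_rng(3)
d0, dx, k1 = 2, 3, 2                         # binary Y, one 3-level predictor
lam = rng.dirichlet(np.ones(d0), size=k1)    # lambda_h(y), shape (k1, d0)
pi = rng.dirichlet(np.ones(k1), size=dx)     # pi^{(1)}_h(x), shape (dx, k1)

x = 1                                        # condition on X = level 1
P_marg = pi[x] @ lam                         # sum_h pi_h(x) * lambda_h(.)

# Simulate through the latent class z, as in the hierarchical form above.
n = 200_000
z = rng.choice(k1, size=n, p=pi[x])          # z | X = x ~ Mult(pi(x))
y = (rng.random(n) < lam[z, 1]).astype(int)  # Y | z ~ Mult(lambda_z), binary
assert abs(y.mean() - P_marg[1]) < 0.01      # marginalizing z recovers (1)
```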
Prior specification & hierarchical model
◮ Conditional likelihood of response is
  (Y_i | z_{i1}, . . . , z_{ip}, Λ) ∼ Multinomial({1, . . . , d0}, λ_{z_{i1},...,z_{ip}})
◮ Conditional likelihood of latent class variables is
  (z_{ij} | X_{ij} = x_j, π) ∼ Multinomial({1, . . . , kj}, π^{(j)}_1(xj), . . . , π^{(j)}_{kj}(xj))
◮ Prior on core tensor:
  λ_{h1,...,hp} = (λ_{h1,...,hp}(1), . . . , λ_{h1,...,hp}(d0)) ∼ Diri(1/d0, . . . , 1/d0)
◮ Prior on independent rank one components:
  (π^{(j)}_1(xj), . . . , π^{(j)}_{kj}(xj)) ∼ Diri(1/kj, . . . , 1/kj)
Prior on predictor inclusion/tensor rank
◮ For the jth dimension, we choose the simple prior
  P(kj = 1) = 1 − r/p,  P(kj = k) = r/{(dj − 1)p}, k = 2, . . . , dj,
where dj = # levels of covariate Xj
◮ r = expected # important features, r̄ = specified maximum number of features
◮ Effective prior on the kj's is
  P(k1 = l1, . . . , kp = lp) ∝ P(k1 = l1) · · · P(kp = lp) I_{♯{j : lj > 1} ≤ r̄}(l1, . . . , lp),
where I_A(·) is the indicator function for set A
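Under our reading of this prior (P(kj = 1) = 1 − r/p, with the remaining r/p mass split evenly over k = 2, . . . , dj; this split is an assumption), the per-dimension pmf is a one-liner:

```python
import numpy as np

def prior_kj(dj, p, r):
    """Prior pmf over k_j in {1,...,dj}: P(k_j = 1) = 1 - r/p and the
    remaining mass r/p spread evenly over k = 2,...,dj (our reading)."""
    pmf = np.full(dj, r / ((dj - 1) * p))
    pmf[0] = 1 - r / p
    return pmf

pmf = prior_kj(dj=4, p=600, r=5.0)
assert np.isclose(pmf.sum(), 1.0)
assert np.isclose(pmf[0], 1 - 5.0 / 600)   # each feature excluded w.h.p.
```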
Properties - Bias-Variance Tradeoff
◮ Extreme data sparsity - vast majority of combinations of
Y , X1, . . . , Xp not observed
◮ Critical to include sparsity assumptions - even if such
assumptions do not hold, massively reduces the variance
◮ Discard predictors having small impact & parameters having
small values
◮ Makes the problem tractable & may lead to good MSE
Illustrative example
◮ Binary Y & p binary covariates Xj ∈ {−1, 1}, j = 1, . . . , p
◮ The true model can be expressed in the form [β ∈ (0, 1)]
  P(Y = 1 | X1 = x1, . . . , Xp = xp) = 1/2 + (β/2²) x1 + · · · + (β/2^{p+1}) xp.
Effect of Xj decreases exponentially as j increases from 1 to p.
◮ Natural strategy: estimate P(Y = 1 | X1 = x1, . . . , Xp = xp) by sample frequencies over the first k covariates,
  ♯{i : yi = 1, x1i = x1, . . . , xki = xk} / ♯{i : x1i = x1, . . . , xki = xk},
& ignore the remaining p − k covariates.
◮ Suppose we have n = 2^l (k ≤ l ≪ p) observations, with one in each cell of combinations of X1, . . . , Xl.
MSE analysis
◮ Mean square error (MSE) can be expressed as
  MSE = ∑_{h1,...,hp} E[ P(Y = 1 | X1 = h1, . . . , Xp = hp) − P̂(Y = 1 | X1 = h1, . . . , Xk = hk) ]² = Bias² + Var.
◮ The squared bias is
  Bias² = ∑_{h1,...,hp} [ P(Y = 1 | X1 = h1, . . . , Xp = hp) − E P̂(Y = 1 | X1 = h1, . . . , Xk = hk) ]²
        = β² 2^{k+1} ∑_{i=1}^{2^{p−k−1}} ( (2i − 1)/2^{p+1} )² = (β²/3)(2^{p−2k−2} − 2^{−p−2}).
MSE analysis (continued)
◮ Finally we obtain the variance as
  Var = ∑_{h1,...,hp} Var P̂(Y = 1 | X1 = h1, . . . , Xk = hk)
      = 2^{p−k+1} ∑_{i=1}^{2^{k−1}} (1/2^{l−k}) (1/2 + ((2i − 1)/2^{k+1})β)(1/2 − ((2i − 1)/2^{k+1})β)
      = (1/3)[ (3 − β²)2^{p+k−l−2} + β² 2^{p−k−l−2} ].
◮ Since there are 2^p cells, the average MSE for each cell equals
  (1/3)[ (3 − β²)2^{k−l−2} + β² 2^{−k−l−2} + β² 2^{−2k−2} − β² 2^{−2p−2} ].
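The closed forms above can be checked numerically. This sketch (our code, assuming the derivation as reconstructed here) verifies Bias² and Var by direct summation and locates the k minimizing the average per-cell MSE:

```python
import numpy as np

def bias2_sum(p, k, beta):
    # sum over all 2^p cells of (sum_{j>k} beta*h_j/2^{j+1})^2; cross terms
    # cancel over the sign patterns, leaving 2^p * beta^2 * sum_j 4^{-(j+1)}
    return 2**p * beta**2 * sum(4.0**-(j + 1) for j in range(k + 1, p + 1))

def bias2_closed(p, k, beta):
    return beta**2 / 3 * (2.0**(p - 2*k - 2) - 2.0**(-p - 2))

def var_sum(p, k, l, beta):
    i = np.arange(1, 2**(k - 1) + 1)
    per_group = (0.5 + (2*i - 1) * beta / 2**(k + 1)) * \
                (0.5 - (2*i - 1) * beta / 2**(k + 1)) / 2**(l - k)
    return 2**(p - k + 1) * per_group.sum()

def var_closed(p, k, l, beta):
    return ((3 - beta**2) * 2.0**(p + k - l - 2)
            + beta**2 * 2.0**(p - k - l - 2)) / 3

p, l, beta = 12, 9, 0.8
for k in range(1, l + 1):
    assert np.isclose(bias2_sum(p, k, beta), bias2_closed(p, k, beta))
    assert np.isclose(var_sum(p, k, l, beta), var_closed(p, k, l, beta))

# Average per-cell MSE and its minimizer, compared with k ~ l/3.
amse = [(bias2_closed(p, k, beta) + var_closed(p, k, l, beta)) / 2**p
        for k in range(1, l + 1)]
print('optimal k:', 1 + int(np.argmin(amse)), 'vs l/3 =', l / 3)
```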
Implications of MSE analysis
◮ # predictors p has little impact on selection of k
◮ k ≤ l & so the second term is small compared to the 1st & 3rd terms
◮ Average MSE attains its minimum at k ≈ l/3 = log2(n)/3
◮ True model is not sparse & all the predictors impact the conditional probability
◮ But the optimal # predictors only depends on the log sample size
Borrowing of information
◮ Critical feature of our model is borrowing across cells
◮ Letting w_{h1,...,hp}(x1, . . . , xp) = ∏_j π^{(j)}_{hj}(xj), our model is
  P(Y = y | X1 = x1, . . . , Xp = xp) = ∑_{h1,...,hp} w_{h1,...,hp}(x1, . . . , xp) λ_{h1...hp}(y),
with ∑_{h1,...,hp} w_{h1,...,hp}(x1, . . . , xp) = 1.
◮ View λ_{h1...hp}(y) as the frequency of Y = y in cell X1 = h1, . . . , Xp = hp
◮ We have a kernel estimate that borrows information via a weighted avg of cell freqs
Illustrative example
◮ One covariate X ∈ {1, . . . , m} with Y ∈ {0, 1} & P_j = P(Y = 1 | X = j)
◮ Naive estimate:
  P̂_j = k_j/n_j = ♯{i : yi = 1, xi = j}/♯{i : xi = j} = sample freqs
◮ Alternatively, consider a kernel estimate indexed by 0 ≤ c ≤ 1/(m − 1):
  P̃_j = {1 − (m − 1)c} P̂_j + c ∑_{k≠j} P̂_k, j = 1, . . . , m.
◮ Use squared error loss to compare these estimators
◮ Use squared error loss to compare these estimators
MSE for illustrative example
◮ E{L(P̂, P)} = ∑_{j=1}^m E(P̂_j − P_j)² = ∑_{j=1}^m P_j(1 − P_j)/n_j.
◮ E{L(P̃, P)} = ∑_{j=1}^m E(P̃_j − P_j)² is a function of c with minimum at
  c0 = (1/m) E{L(P̂, P)} / [ E{L(P̂, P)} + (1/(m − 1)) ∑_{i<j} (P_i − P_j)² ] ∈ [0, 1/(m − 1)].
◮ When the P_j's are similar, the estimate P̃ can reduce the risk to as little as 1/m of the risk of estimating each P_j separately.
◮ If the P_j's are not similar, P̃ can still reduce the risk considerably when the cell counts {n_j} are small.
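A quick Monte Carlo illustration (our implementation of the formulas above, with made-up P_j's) of the risk reduction when the P_j's are similar:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_j = 8, 20                                  # m cells, n_j obs per cell
P = np.full(m, 0.5) + rng.normal(0, 0.02, m)    # similar P_j's

# Exact risk of the naive estimator and the optimal shrinkage weight c0.
risk_naive = np.sum(P * (1 - P) / n_j)
c0 = (1 / m) * risk_naive / (
    risk_naive + np.sum([(P[i] - P[j])**2
                         for i in range(m) for j in range(i + 1, m)]) / (m - 1))

# Monte Carlo risk of the shrinkage estimator P-tilde at c = c0.
reps = 20_000
k = rng.binomial(n_j, P, size=(reps, m))
P_hat = k / n_j
P_tilde = (1 - (m - 1) * c0) * P_hat \
    + c0 * (P_hat.sum(axis=1, keepdims=True) - P_hat)
risk_tilde = np.mean(np.sum((P_tilde - P)**2, axis=1))
assert risk_tilde < risk_naive        # borrowing strength reduces the risk
```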
Setting & assumptions
◮ Data y^n & X^n for n subjects with p_n ≫ n (large p, small n)
◮ Assume dj = d for simplicity in exposition
◮ Putting the true model P0 in our tensor form, assume
  Assumption A. ∑_{j=1}^{p_n} max_{x_j} ∑_{h_j=2}^{d} π^{(j)}_{h_j}(x_j) < ∞.
This is a near sparsity restriction on P0.
◮ Additionally assume the true conditional probabilities are strictly positive:
  Assumption B. P0(y|x) ≥ ε0 for any x, y, for some ε0 > 0.
Posterior convergence theorem
◮ x1, . . . , xn independent from unknown Gn on {1, . . . , d}^{p_n}
◮ Let εn be a sequence with εn → 0, nεn² → ∞ and ∑_n exp(−nεn²) < ∞.
◮ Assume the following conditions hold: (i) r̄n log pn ≺ nεn², (ii) r̄n d^{r̄n} log(r̄n/εn) ≺ nεn², (iii) rn/pn → 0 as n → ∞, and (iv) there exists a sequence of models γn with size r̄n such that
  ∑_{j∉γn} max_{x_j} ∑_{h_j=2}^{d} π^{(j)}_{h_j}(x_j) ≺ εn².
◮ Denote
  d(P, P0) = ∫ ∑_{y=1}^{d0} | P(y|x1, . . . , xp) − P0(y|x1, . . . , xp) | Gn(dx1, . . . , dxp),
then Πn( P : d(P, P0) ≥ Mεn | y^n, X^n ) → 0 a.s. P0^n.
Implications of theorem
◮ Posterior convergence rate can be very close to n^{−1/2} for appropriate hyperparameter choices
◮ For any α ∈ (0, 1), εn = n^{−(1−α)/2} log n satisfies the conditions
◮ rn ≺ r̄n ≺ log n (# important predictors scales w/ log n)
◮ pn ≺ exp(n^α) (# candidate predictors exponential in n)
◮ There exists a sequence of models γn with size r̄n such that
  ∑_{j∉γn} max_{x_j} ∑_{h_j=2}^{d} π^{(j)}_{h_j}(x_j) ≺ n^{α−1} log² n.
◮ Use rn = log_d(n), r̄n = 2 rn as default values for the prior in applications
Posterior computation
◮ Conditionally on {kj}, simple Gibbs sampler - Dirichlet & multinomial conditionals
◮ # components to update is ∏_{j=1}^p kj - can blow up with p, but all but a small number of the kj = 1
◮ To update {kj} we could use RJMCMC, but this doesn't scale very well computationally
◮ Two stage algorithm: (i) SSVS to estimate kj - acceptance probs use approximated conditional marginal likelihoods; (ii) conditionally on {k̂j}, run Gibbs
◮ Scales efficiently & excellent performance in cases we have considered
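A heavily simplified sketch of the stage-(ii) Gibbs sampler for fixed {kj} (p = 2 predictors, binary Y; all variable names and sizes are ours), cycling through the multinomial and Dirichlet full conditionals:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d0 = 300, 2
d, k = [3, 3], [2, 2]
X = rng.integers(0, 3, size=(n, 2))
Y = rng.integers(0, 2, size=n)

# Initialize from the priors: Diri(1/d0,...) core, Diri(1/kj,...) components.
lam = rng.dirichlet(np.ones(d0), size=(k[0], k[1]))
pi = [rng.dirichlet(np.ones(kj), size=dj) for kj, dj in zip(k, d)]
z = np.zeros((n, 2), dtype=int)

for sweep in range(50):
    # (a) latent classes z_ij from their multinomial full conditionals
    for i in range(n):
        for j in range(2):
            other = z[i, 1 - j]
            lam_slice = lam[:, other] if j == 0 else lam[other, :]
            w = pi[j][X[i, j]] * lam_slice[:, Y[i]]
            z[i, j] = rng.choice(k[j], p=w / w.sum())
    # (b) core tensor lambda from Dirichlet full conditionals
    for h1 in range(k[0]):
        for h2 in range(k[1]):
            counts = np.bincount(Y[(z[:, 0] == h1) & (z[:, 1] == h2)],
                                 minlength=d0)
            lam[h1, h2] = rng.dirichlet(1 / d0 + counts)
    # (c) components pi^{(j)} from Dirichlet full conditionals
    for j in range(2):
        for x in range(d[j]):
            counts = np.bincount(z[X[:, j] == x, j], minlength=k[j])
            pi[j][x] = rng.dirichlet(1 / k[j] + counts)
```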
Simulation study
◮ N = 2,000 instances, p = 600 covariates Xj ∈ {1, . . . , 4} & binary response Y
◮ True model: 3 important predictors X9, X11 and X13
◮ Generate P(Y = 1 | X9 = x9, X11 = x11, X13 = x13) independently for each combination of (x9, x11, x13)
◮ To obtain optimal misclassification rate ∼ 15%, generated probabilities as f(U) = U²/{U² + (1 − U)²}, U ∼ Unif(0, 1)
◮ n training samples & N − n testing
◮ Training - n ∈ {200, 400, 600, 800}, with 10 random training-test splits. Apply our approach to each split.
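A sketch of the synthetic-data generator as we read the description above (0-based column indices and names are ours):

```python
import numpy as np

# p = 600 four-level covariates, only X9, X11, X13 matter, and each of
# the 4^3 = 64 cells gets P(Y = 1 | cell) = f(U) with U ~ Unif(0, 1).
rng = np.random.default_rng(6)
N, p = 2000, 600
X = rng.integers(0, 4, size=(N, p))              # levels coded 0..3

f = lambda u: u**2 / (u**2 + (1 - u)**2)
cell_prob = f(rng.random((4, 4, 4)))             # one prob per (x9, x11, x13)

probs = cell_prob[X[:, 8], X[:, 10], X[:, 12]]   # 0-based columns 8, 10, 12
Y = (rng.random(N) < probs).astype(int)

# Bayes (optimal) misclassification rate for this draw of cell_prob,
# which should come out around 0.15 as stated above.
bayes = np.mean(np.minimum(probs, 1 - probs))
print(f'optimal misclassification rate ~ {bayes:.3f}')
```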
Simulation results - misclassification rate
Table: Testing results for synthetic data example. RF: random forests; TF: our tensor factorization model.

training size                    200     400     600     800
aMSE of TF                       0.144   0.042   0.024   0.010
Misclassification rate of TF     0.503   0.288   0.189   0.168
Misclassification rate of RF     0.496   0.482   0.471   0.472

aMSE = (1/4^p) ∑_{x1,...,xp} [ P(Y = 1 | x1, . . . , xp) − P̂(Y = 1 | x1, . . . , xp) ]².
Simulation results - variable selection performance
Table: Columns 2-4 = inclusion probs of the 9th, 11th, 13th predictors. Col 5 = max inclusion prob across the remaining predictors. Col 6 = average inclusion prob across the remaining predictors. Quantities are averages over 10 trials.

training size    9       11      13      Max     Average
200              0.092   0.041   0.063   0.161   0.002
400              0.816   0.820   0.808   0.013   0.000
600              1.000   1.000   1.000   0.000   0.000
800              1.000   1.000   1.000   0.000   0.000
Application data sets
1. Promoter gene sequences: A, C, G, T nucleotides at p = 57 positions for N = 106 sequences & binary response (promoter or not). 5-fold CV - n = 85 training & N − n = 21 test samples in each split.
2. Splice-junction gene sequences: A, C, G, T nucleotides at p = 60 positions for N = 3,175 sequences. Response classes: exon/intron boundary (EI), intron/exon boundary (IE) or neither (N). Training sizes n ∈ {200, 2540}.
3. Single Photon Emission Computed Tomography (SPECT): cardiac patients normal/abnormal. 267 SPECT images & 22 binary features. Previously divided - n = 80 training & N − n = 187 test.
Results
Table: RF: random forests, NN: neural networks, SVM: support vector machine, BART: Bayesian additive regression trees, TF: our tensor factorization model. Misclassification rates are displayed.

Data               CART    RF      NN      LASSO   SVM     BART    TF
Promoter (n=85)    0.236   0.066   0.170   0.075   0.151   0.113   0.066
Splice (n=200)     0.161   0.122   0.226   0.141   0.286   -       0.112
Splice (n=2540)    0.059   0.046   0.165   0.123   0.059   -       0.058
SPECT (n=80)       0.312   0.235   0.278   0.277   0.246   0.225   0.198

◮ At worst comparable classification performance, with RF the best of the competitors
◮ Particularly good relative performance as n decreases & p increases
Variable selection - interpretability & parsimony
◮ Additional advantages in terms of variable selection
◮ In promoter data, selected nucleotides at the 15th, 16th, 17th, and 39th positions
◮ In splice data, the 28th, 29th, 30th, 31st, 32nd and 35th positions are selected
◮ In SPECT data, the 11th, 13th and 16th predictors are selected
◮ In each case, obtained excellent classification performance based on a small subset of the predictors
Generalization - conditional distribution modeling
◮ Generalization: conditional distribution estimation
  f(y|x) = ∑_{h=1}^{k} ∑_{h1=1}^{k1} · · · ∑_{hp=1}^{kp} π_{hh1···hp}(x) K(y; θ_{hh1···hp}),
◮ {π_{hh1···hp}(x)} = core probability tensor of predictor-dependent weights on a multiway array of kernels
◮ Motivated by the above conditional tensor factorization for classification, let
  π_{hh1···hp}(x) = π_h ∏_{j=1}^{p} π^{(j)}_{hj}(xj).
Linear Tucker density regression
◮ Letting x ∈ X = [0, 1]^p and ψj ∈ [0, 1], choose
  π^{(j)}_1(xj) = 1 − xjψj,  π^{(j)}_2(xj) = xjψj,  kj = 2, j = 1, . . . , p,
◮ Model linearly interpolates but otherwise is extremely flexible
◮ In the simple case in which p = 1 & Gaussian kernel, we have
  f(y|x) = ∑_{h=1}^{k} π_h [ (1 − xψ) N(y; μ_{h1}, τ_{h1}^{−1}) + xψ N(y; μ_{h2}, τ_{h2}^{−1}) ],
◮ Induces the linear mean regression model
  E(y|x) = ( ∑_{h=1}^{k} π_h μ_{h1} ) + ( ∑_{h=1}^{k} π_h ψ(μ_{h2} − μ_{h1}) ) x = β0 + β1 x,
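A numerical check (our notation, p = 1) that the mixture with these linear weights induces exactly the stated linear mean:

```python
import numpy as np

rng = np.random.default_rng(7)
kk = 4
pi_h = rng.dirichlet(np.ones(kk))              # mixture weights pi_h
mu = rng.normal(0, 2, size=(kk, 2))            # kernel means mu_h1, mu_h2
psi = 0.7

def mean_y_given_x(x):
    # mixture mean: weight (1 - x*psi) on kernel 1, x*psi on kernel 2
    return np.sum(pi_h * ((1 - x * psi) * mu[:, 0] + x * psi * mu[:, 1]))

# beta0 and beta1 as given above
beta0 = np.sum(pi_h * mu[:, 0])
beta1 = np.sum(pi_h * psi * (mu[:, 1] - mu[:, 0]))
for x in np.linspace(0, 1, 5):
    assert np.isclose(mean_y_given_x(x), beta0 + beta1 * x)
```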
Linear Tucker density regression - comments
◮ Different quantiles of f(y|x) change linearly with x, but with slopes that vary
◮ If k = 1 and τ_{h1} = τ_{h2} = τ_h, obtain simple normal linear regression
◮ Density changes linearly from f(y|x = 0) = ∑_h π_h N(y; μ_{h1}, τ_{h1}^{−1}) to f(y|x = 1) = ∑_h π_h N(y; μ_{h2}, τ_{h2}^{−1}) as x increases
◮ As p increases, still interpolate linearly but accommodate interactions
◮ Posterior computation: single stage Gibbs sampler (multinomial, Dirichlet, normal-gamma, Bernoulli, beta)
Joint tensor factorizations
◮ Focused in this talk on conditional Tucker factorizations
◮ Can also use probabilistic tensor factorizations for joint modeling
◮ Very useful for huge sparse contingency table analysis
◮ Same ideas provide a type of multivariate generalization of current Bayes discrete mixtures
◮ Instead of a single cluster index, multiple dependent cluster indices underlying each type of data
References
◮ Banerjee, A., Murray, J. and Dunson, D.B. (2012). Nonparametric Bayes infinite tensor factorization priors.
◮ Bhattacharya, A. and Dunson, D.B. (2012). Simplex factor models for multivariate unordered categorical data. JASA, 107, 362-377.
◮ Bhattacharya, A. and Dunson, D.B. (2012). Nonparametric Bayes testing of associations in high-dimensional categorical data. Almost done!
◮ Dunson, D.B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. JASA, 104, 1042-1051.
◮ Ghosh, A. and Dunson, D.B. (2012). Conditional distribution from high-dimensional interacting predictors. In progress.
◮ Yang, Y. and Dunson, D.B. (2012). Bayesian conditional tensor factorizations for high-dimensional classification. arXiv.