Nonparametric Bayes tensor factorizations for big data


  1. Nonparametric Bayes tensor factorizations for big data
     David Dunson, Department of Statistical Science, Duke University
     Funded by NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082

  2. Motivation
     Conditional tensor factorizations
     Some properties - heuristic & otherwise
     Computation & applications
     Generalizations

  3. Motivating setting - high dimensional predictors
     ◮ Routine to encounter massive-dimensional prediction & variable selection problems
     ◮ We have y ∈ Y & x = (x_1, ..., x_p)' ∈ X
     ◮ Unreasonable to assume linearity or additivity in motivating applications - e.g., epidemiology, genomics, neurosciences
     ◮ Goal: nonparametric approaches that accommodate large p, small n, allow interactions, & scale computationally to big p

  4. Gaussian processes with variable selection
     ◮ For Y = ℜ & X ⊂ ℜ^p, one approach lets
         $y_i = \mu(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)$,
       where μ : X → ℜ is an unknown regression function
     ◮ Following Zou et al. (2010) & others,
         $\mu \sim \mathrm{GP}(m, c), \qquad c(x, x') = \phi \exp\Big( - \sum_{j=1}^{p} \alpha_j (x_j - x_j')^2 \Big)$,
       with mixture priors placed on the α_j's
     ◮ Zou et al. (2010) show good empirical results
     ◮ Bhattacharya, Pati & Dunson (2011) - minimax adaptive rates
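A minimal numpy sketch (illustrative, not part of the slides) of this covariance; `ard_cov`, `phi`, and `alpha` are hypothetical names for φ and the α_j's:

```python
import numpy as np

def ard_cov(X1, X2, phi, alpha):
    """Squared-exponential covariance with one inverse length-scale per predictor:
    c(x, x') = phi * exp(-sum_j alpha_j * (x_j - x'_j)^2).
    Driving alpha_j to 0 effectively removes predictor j (variable selection)."""
    diff = X1[:, None, :] - X2[None, :, :]      # pairwise differences, shape (n1, n2, p)
    d2 = np.sum(alpha * diff**2, axis=-1)       # weighted squared distances, shape (n1, n2)
    return phi * np.exp(-d2)

# Toy usage: 5 points with p = 3 predictors; the third predictor is effectively excluded
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = ard_cov(X, X, phi=1.0, alpha=np.array([1.0, 0.5, 0.0]))
print(K.shape, np.allclose(K, K.T))             # (5, 5) True
```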

  5. Issues & alternatives
     ◮ Mean regression & computation challenging
     ◮ Difficult computationally beyond the conditionally Gaussian homoscedastic case
     ◮ Density regression interesting, as the variance & shape of the response distribution often change with x
     ◮ Initial focus: classification from many categorical predictors
     ◮ Approach generalizes directly to arbitrary Y and X

  6. Classification & conditional probability tensors
     ◮ Suppose Y ∈ {1, ..., d_0} & X_j ∈ {1, ..., d_j}, j = 1, ..., p
     ◮ The classification function or conditional probability is
         Pr(Y = y | X_1 = x_1, ..., X_p = x_p) = P(y | x_1, ..., x_p)
     ◮ This classification function can be structured as a d_0 × d_1 × ⋯ × d_p tensor
     ◮ Let P_{d_1,...,d_p}(d_0) denote the set of all possible conditional probability tensors
     ◮ P ∈ P_{d_1,...,d_p}(d_0) implies P(y | x_1, ..., x_p) ≥ 0 for all y, x_1, ..., x_p & $\sum_{y=1}^{d_0} P(y \mid x_1, \ldots, x_p) = 1$
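To make the tensor view concrete, here is a small numpy sketch (illustrative, not from the slides) that stores P(y | x_1, ..., x_p) as a d_0 × d_1 × ⋯ × d_p array and checks the defining constraints; the dimensions are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d = 3, (2, 4, 2)                      # d0 response levels, p = 3 categorical predictors

# Build an arbitrary valid conditional probability tensor: for every (x1, x2, x3),
# the d0 entries P(y | x1, x2, x3) are nonnegative and sum to one.
P = rng.gamma(1.0, size=(d0,) + d)
P /= P.sum(axis=0, keepdims=True)         # normalize over the response dimension

print(P.shape)                            # (3, 2, 4, 2)
print(np.allclose(P.sum(axis=0), 1.0))    # True: sum_y P(y | x) = 1 for every cell
print(P[:, 0, 2, 1])                      # class probabilities at one predictor cell (0-based indices)
```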

  7. Tensor factorizations
     ◮ P = big tensor & data will be very sparse
     ◮ If P were a matrix, we might think of the SVD
     ◮ We can instead consider a tensor factorization
     ◮ A common approach is PARAFAC - a sum of rank one tensors
     ◮ Tucker factorizations express a d_1 × ⋯ × d_p tensor A = {a_{c_1 ⋯ c_p}} as
         $a_{c_1 \cdots c_p} = \sum_{h_1=1}^{d_1} \cdots \sum_{h_p=1}^{d_p} g_{h_1 \cdots h_p} \prod_{j=1}^{p} u^{(j)}_{h_j c_j}$,
       where G = {g_{h_1 ⋯ h_p}} is a core tensor and the u^{(j)} are mode-specific factor matrices
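A brief numpy sketch (an illustration, not the authors' code) of a Tucker reconstruction for p = 3; `G` and `U` play the roles of the core tensor G and the mode matrices u^{(j)}:

```python
import numpy as np

rng = np.random.default_rng(2)
d = (3, 4, 2)                 # tensor dimensions d1, d2, d3 (p = 3)
k = (2, 3, 2)                 # core dimensions

G = rng.normal(size=k)                                    # core tensor g_{h1 h2 h3}
U = [rng.normal(size=(k[j], d[j])) for j in range(3)]     # mode matrices u^{(j)}_{h_j c_j}

# a_{c1 c2 c3} = sum_{h1,h2,h3} g_{h1 h2 h3} u^{(1)}_{h1 c1} u^{(2)}_{h2 c2} u^{(3)}_{h3 c3}
A = np.einsum('xyz,xa,yb,zc->abc', G, U[0], U[1], U[2])
print(A.shape)                                            # (3, 4, 2)
```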

  8. Our factorization (with Yun Yang)
     ◮ Our proposed nonparametric model for the conditional probability:
         $P(y \mid x_1, \ldots, x_p) = \sum_{h_1=1}^{k_1} \cdots \sum_{h_p=1}^{k_p} \lambda_{h_1 h_2 \ldots h_p}(y) \prod_{j=1}^{p} \pi^{(j)}_{h_j}(x_j)$   (1)
     ◮ Tucker factorization of the conditional probability P
     ◮ To be a valid conditional probability, the parameters are subject to
         $\sum_{c=1}^{d_0} \lambda_{h_1 h_2 \ldots h_p}(c) = 1$ for any (h_1, h_2, ..., h_p),
         $\sum_{h=1}^{k_j} \pi^{(j)}_{h}(x_j) = 1$ for any possible pair (j, x_j).   (2)
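A toy numpy sketch (not from the slides) evaluating the factorization (1) for p = 2; `lam` and `pi` stand in for λ and π^{(j)}, and all dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
d0, d, k = 2, (3, 4), (2, 3)          # binary response, p = 2 predictors, latent dims k_j

# Core probabilities: lam[h1, h2, :] is a probability vector over y (sums to one).
lam = rng.dirichlet(np.ones(d0), size=k)                   # shape (k1, k2, d0)
# Mixture weights: pi[j][x_j, :] is a probability vector over h_j = 1..k_j.
pi = [rng.dirichlet(np.ones(k[j]), size=d[j]) for j in range(2)]

def cond_prob(x):
    """P(y | x) = sum_{h1,h2} lam_{h1 h2}(y) * pi^{(1)}_{h1}(x1) * pi^{(2)}_{h2}(x2)."""
    w = np.einsum('a,b->ab', pi[0][x[0]], pi[1][x[1]])     # weights over (h1, h2), sum to 1
    return np.einsum('ab,aby->y', w, lam)

p_y = cond_prob((1, 3))
print(p_y, p_y.sum())                                      # a valid probability vector, sums to 1
```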

  9. Comments on proposed factorization
     ◮ k_j = 1 corresponds to exclusion of the j-th feature
     ◮ By placing a prior on k_j, we can induce variable selection & learn the dimension of the factorization
     ◮ The representation is many-to-one and the parameters in the factorization cannot be uniquely identified
     ◮ This does not present a barrier to Bayesian inference - we don't care about the parameters in the factorization
     ◮ We want to do variable selection, prediction & inferences on predictor effects

  10. Theoretical support
      The following theorem formalizes the flexibility:
      Theorem. Every d_0 × d_1 × d_2 × ⋯ × d_p conditional probability tensor P ∈ P_{d_1,...,d_p}(d_0) can be decomposed as (1), with 1 ≤ k_j ≤ d_j for j = 1, ..., p. Furthermore, λ_{h_1 h_2 ... h_p}(y) and π^{(j)}_{h_j}(x_j) can be chosen to be nonnegative and to satisfy the constraints (2).

  11. Latent variable representation
      ◮ Simplify the representation by introducing p latent class indicators z_1, ..., z_p for X_1, ..., X_p
      ◮ Conditional independence of Y and (X_1, ..., X_p) given (z_1, ..., z_p)
      ◮ The model can be written as
          $Y_i \mid z_{i1}, \ldots, z_{ip} \sim \mathrm{Mult}\big(\{1, \ldots, d_0\}, \lambda_{z_{i1}, \ldots, z_{ip}}\big)$,
          $z_{ij} \mid X_{ij} = x_j \sim \mathrm{Mult}\big(\{1, \ldots, k_j\}, \pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)\big)$
      ◮ Useful computationally & provides some insight into the model
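A minimal generative sketch (a toy illustration under assumed dimensions, not the authors' code) of this latent-class representation: draw z_ij given x_j, then draw Y from the selected core cell:

```python
import numpy as np

rng = np.random.default_rng(4)
d0, d, k = 2, (3, 4), (2, 3)
lam = rng.dirichlet(np.ones(d0), size=k)                      # core cell probabilities
pi = [rng.dirichlet(np.ones(k[j]), size=d[j]) for j in range(2)]

def sample_y(x):
    """Draw z_j ~ Mult(pi^{(j)}(x_j)) for each predictor, then Y ~ Mult(lam[z_1, z_2]).
    Marginalizing out (z_1, z_2) recovers the factorization (1)."""
    z = tuple(rng.choice(k[j], p=pi[j][x[j]]) for j in range(2))
    y = rng.choice(d0, p=lam[z])
    return y, z

print(sample_y((1, 3)))         # one draw of (y, (z_1, z_2)); indices are 0-based
```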

  12. Prior specification & hierarchical model
      ◮ Conditional likelihood of the response:
          $(Y_i \mid z_{i1}, \ldots, z_{ip}, \Lambda) \sim \mathrm{Multinomial}\big(\{1, \ldots, d_0\}, \lambda_{z_{i1}, \ldots, z_{ip}}\big)$
      ◮ Conditional likelihood of the latent class variables:
          $(z_{ij} \mid X_{ij} = x_j, \pi) \sim \mathrm{Multinomial}\big(\{1, \ldots, k_j\}, \pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)\big)$
      ◮ Prior on the core tensor:
          $\lambda_{h_1, \ldots, h_p} = \big(\lambda_{h_1, \ldots, h_p}(1), \ldots, \lambda_{h_1, \ldots, h_p}(d_0)\big) \sim \mathrm{Diri}(1/d_0, \ldots, 1/d_0)$
      ◮ Prior on the independent rank-one components:
          $\big(\pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)\big) \sim \mathrm{Diri}(1/k_j, \ldots, 1/k_j)$
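A short sketch (illustrative only; dimensions are arbitrary) of one draw from these priors and the conditional probability tensor it induces:

```python
import numpy as np

rng = np.random.default_rng(5)
d0, d, k = 2, (3, 4), (2, 3)

# One prior draw: Dirichlet(1/d0,...,1/d0) on each core cell and
# Dirichlet(1/k_j,...,1/k_j) on each weight vector.
lam = rng.dirichlet(np.full(d0, 1.0 / d0), size=k)                      # (k1, k2, d0)
pi = [rng.dirichlet(np.full(k[j], 1.0 / k[j]), size=d[j]) for j in range(2)]

# Induced prior draw of the full conditional probability tensor P(y | x1, x2)
P = np.einsum('xa,yb,abc->cxy', pi[0], pi[1], lam)                      # (d0, d1, d2)
print(np.allclose(P.sum(axis=0), 1.0))                                  # True: valid in every cell
```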

  13. Prior on predictor inclusion/tensor rank
      ◮ For the j-th dimension, we choose the simple prior
          $P(k_j = 1) = 1 - \frac{r}{p}, \qquad P(k_j = k) = \frac{r}{(d_j - 1)\,p}, \quad k = 2, \ldots, d_j$,
        where d_j = # levels of covariate X_j
      ◮ r = expected # important features, $\bar r$ = specified maximum number of features
      ◮ The effective prior on the k_j's is
          $P(k_1 = l_1, \ldots, k_p = l_p) = P(k_1 = l_1) \cdots P(k_p = l_p)\, I_{\{\#\{j : l_j > 1\} \le \bar r\}}(l_1, \ldots, l_p)$,
        where I_A(·) is the indicator function for set A
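A small sketch of the marginal prior on k_j, assuming the form reconstructed above, P(k_j = 1) = 1 − r/p and P(k_j = k) = r/((d_j − 1)p); the values of p, r, and d_j are made up:

```python
import numpy as np

def kj_prior_pmf(d_j, r, p):
    """Prior on the number of latent classes for predictor j:
    P(k_j = 1) = 1 - r/p and P(k_j = k) = r / ((d_j - 1) p) for k = 2, ..., d_j."""
    pmf = np.full(d_j, r / ((d_j - 1) * p))
    pmf[0] = 1.0 - r / p
    return pmf

# Example: p = 1000 candidate predictors, r = 5 expected important ones, d_j = 4 levels
pmf = kj_prior_pmf(d_j=4, r=5, p=1000)
print(pmf, pmf.sum())    # most mass on k_j = 1 (exclusion), the rest spread over 2..d_j; sums to 1
```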

  14. Properties - Bias-Variance Tradeoff
      ◮ Extreme data sparsity - the vast majority of combinations of Y, X_1, ..., X_p are not observed
      ◮ Critical to include sparsity assumptions - even if such assumptions do not hold, they massively reduce the variance
      ◮ Discard predictors having small impact & parameters having small values
      ◮ Makes the problem tractable & may lead to good MSE

  15. Illustrative example
      ◮ Binary Y & p binary covariates X_j ∈ {−1, 1}, j = 1, ..., p
      ◮ The true model can be expressed in the form [β ∈ (0, 1)]
          $P(Y = 1 \mid X_1 = x_1, \ldots, X_p = x_p) = \frac{1}{2} + \frac{\beta}{2^2} x_1 + \cdots + \frac{\beta}{2^{p+1}} x_p$.
        The effect of X_j decreases exponentially as j increases from 1 to p.
      ◮ Natural strategy: estimate P(Y = 1 | X_1 = x_1, ..., X_p = x_p) by sample frequencies over the first k covariates,
          $\frac{\#\{i : y_i = 1, x_{1i} = x_1, \ldots, x_{ki} = x_k\}}{\#\{i : x_{1i} = x_1, \ldots, x_{ki} = x_k\}}$,
        & ignore the remaining p − k covariates.
      ◮ Suppose we have n = 2^l (k ≤ l ≪ p) observations, with one in each cell of combinations of X_1, ..., X_l.
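A quick simulation sketch of this setup (illustrative; it uses a random design rather than the balanced one-observation-per-cell design assumed on the slide), pairing the true model with the first-k-covariates frequency estimator:

```python
import numpy as np

rng = np.random.default_rng(6)
p, l, k, beta = 12, 10, 3, 0.8          # p covariates, n = 2**l observations, keep the first k

def true_prob(x):
    """P(Y = 1 | x) = 1/2 + sum_j beta * x_j / 2**(j+1), with x_j in {-1, +1}."""
    j = np.arange(1, p + 1)
    return 0.5 + np.sum(beta * x / 2.0 ** (j + 1), axis=-1)

# Random design: x_1,...,x_p drawn uniformly; responses drawn from the true model
X = rng.choice([-1, 1], size=(2 ** l, p))
Y = rng.binomial(1, true_prob(X))

def freq_estimate(x_first_k):
    """Sample-frequency estimate of P(Y = 1 | X_1,...,X_k), ignoring the other covariates."""
    mask = np.all(X[:, :k] == x_first_k, axis=1)
    return Y[mask].mean() if mask.any() else 0.5

print(freq_estimate(np.array([1, -1, 1])))
```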

  16. MSE analysis
      ◮ The mean squared error (MSE) can be expressed as
          $\mathrm{MSE} = \sum_{h_1, \ldots, h_p} E\Big[ P(Y = 1 \mid X_1 = h_1, \ldots, X_p = h_p) - \hat P(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k) \Big]^2 = \mathrm{Bias}^2 + \mathrm{Var}$.
      ◮ The squared bias is
          $\mathrm{Bias}^2 = \sum_{h_1, \ldots, h_p} \Big[ P(Y = 1 \mid X_1 = h_1, \ldots, X_p = h_p) - E \hat P(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k) \Big]^2 = 2^{k+1} \Big( \frac{\beta}{2^{p+1}} \Big)^2 \sum_{i=1}^{2^{p-k-1}} (2i - 1)^2 = \frac{\beta^2}{3} \big( 2^{p-2k-2} - 2^{-p-2} \big)$.

  17. MSE analysis (continued)
      ◮ Finally, we obtain the variance as
          $\mathrm{Var} = \sum_{h_1, \ldots, h_p} \mathrm{Var}\, \hat P(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k) = 2^{p-k+1} \sum_{i=1}^{2^{k-1}} \frac{2^k}{2^l} \Big( \frac{1}{2} + \frac{(2i - 1)\beta}{2^{k+1}} \Big) \Big( \frac{1}{2} - \frac{(2i - 1)\beta}{2^{k+1}} \Big) = \frac{1}{3} \big[ (3 - \beta^2) 2^{p+k-l-2} + \beta^2 2^{p-k-l-2} \big]$.
      ◮ Since there are 2^p cells, the average MSE for each cell equals
          $\frac{1}{3} \big[ (3 - \beta^2) 2^{k-l-2} + \beta^2 2^{-k-l-2} + \beta^2 2^{-2k-2} - \beta^2 2^{-2p-2} \big]$.

  18. Implications of MSE analysis
      ◮ The number of predictors p has little impact on the selection of k
      ◮ k ≤ l, so the second term is small compared to the 1st & 3rd terms
      ◮ The average MSE attains its minimum at k ≈ l/3 = log_2(n)/3 (see the numerical check below)
      ◮ The true model is not sparse & all the predictors impact the conditional probability
      ◮ But the optimal # of predictors depends only on the log sample size
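A numerical check of the last point, evaluating the reconstructed average-MSE formula over k for arbitrary parameter values; the minimizer should sit near log_2(n)/3:

```python
import numpy as np

def avg_mse(k, p, l, beta):
    """Per-cell average MSE from the bias-variance calculation on the previous slides."""
    return ((3 - beta**2) * 2.0 ** (k - l - 2)
            + beta**2 * 2.0 ** (-k - l - 2)
            + beta**2 * 2.0 ** (-2 * k - 2)
            - beta**2 * 2.0 ** (-2 * p - 2)) / 3.0

p, l, beta = 50, 18, 0.9                     # n = 2**18 observations
k = np.arange(1, l + 1)
mse = avg_mse(k, p, l, beta)
print(k[np.argmin(mse)], l / 3)              # the optimal k sits near l/3 = log2(n)/3
```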

  19. Borrowing of information
      ◮ A critical feature of our model is borrowing across cells
      ◮ Letting $w_{h_1, \ldots, h_p}(x_1, \ldots, x_p) = \prod_j \pi^{(j)}_{h_j}(x_j)$, our model is
          $P(Y = y \mid X_1 = x_1, \ldots, X_p = x_p) = \sum_{h_1, \ldots, h_p} w_{h_1, \ldots, h_p}(x_1, \ldots, x_p)\, \lambda_{h_1 \ldots h_p}(y)$,
        with $\sum_{h_1, \ldots, h_p} w_{h_1, \ldots, h_p}(x_1, \ldots, x_p) = 1$.
      ◮ View λ_{h_1 ... h_p}(y) as the frequency of Y = y in cell X_1 = h_1, ..., X_p = h_p
      ◮ We then have a kernel estimate that borrows information via a weighted average of cell frequencies

  20. Illustrative example
      ◮ One covariate X ∈ {1, ..., m} with Y ∈ {0, 1} & P_j = P(Y = 1 | X = j)
      ◮ Naive estimate: $\hat P_j = k_j / n_j = \#\{i : y_i = 1, x_i = j\} / \#\{i : x_i = j\}$ = sample frequencies
      ◮ Alternatively, consider a kernel estimate indexed by 0 ≤ c ≤ 1/(m − 1):
          $\tilde P_j = \{1 - (m - 1)c\} \hat P_j + c \sum_{k \ne j} \hat P_k, \quad j = 1, \ldots, m$.
      ◮ Use squared error loss to compare these estimators (a small simulation sketch follows)
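A small Monte Carlo sketch (illustrative, not from the slides) comparing the naive and kernel estimates under squared error loss; m, the per-cell sample size, and c are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n_per, reps = 6, 10, 2000
P_true = rng.uniform(0.3, 0.7, size=m)          # true P(Y = 1 | X = j), j = 1, ..., m
c = 0.3 / (m - 1)                               # shrinkage weight, 0 <= c <= 1/(m - 1)

sq_err_naive = sq_err_kernel = 0.0
for _ in range(reps):
    counts = rng.binomial(n_per, P_true)        # n_j = n_per observations at each level j
    P_hat = counts / n_per                      # naive sample frequencies
    # Kernel estimate: shrink each cell toward the frequencies of the other cells
    P_tilde = (1 - (m - 1) * c) * P_hat + c * (P_hat.sum() - P_hat)
    sq_err_naive += np.sum((P_hat - P_true) ** 2)
    sq_err_kernel += np.sum((P_tilde - P_true) ** 2)

# With small per-cell counts, the variance reduction from borrowing usually outweighs
# the shrinkage bias, so the kernel estimate tends to incur the smaller average loss here.
print(sq_err_naive / reps, sq_err_kernel / reps)
```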
