Self-reflective Multi-task Gaussian Process


  1. Self-reflective Multi-task Gaussian Process. Kohei Hayashi 1, Takashi Takenouchi 1, Ryota Tomioka 2, Hisashi Kashima 2. 1: Graduate School of Information Science, Nara Institute of Science and Technology; 2: Department of Mathematical Informatics, The University of Tokyo. July 2nd, 2011.

  2. Multi-task learning: problem definition
  • tasks and data points are correlated
  • Goal: predict the unobserved entries of $X$ from the observed ones (and any additional information)


  4. Gaussian process for multi-task learning
  Idea: capture the correlations by measuring similarities between the responses.
  Multi-task GP [Bonilla+ AISTATS'07] [Yu+ NIPS'07] separately measures task/data-point similarity by using additional information.

  5. Challenges
  1. Good similarity measurement
     • additional information may not be enough to capture the correlations ⇒ inaccurate prediction
  2. Computational complexity
     • inverse of the Gram matrix: not practical for large-scale datasets

  6. Our contributions
  Propose a new framework for multi-task learning:
  • Self-measuring similarities: measure similarities by the observed responses themselves
  • Efficient, exact learning algorithm: ~10 min for a 1000 × 1000 matrix
  Apply it to a recommender system:
  • Outperforms existing methods

  7. Model

  8. Simple linear model
  Consider a linear Gaussian model: $x_{ik} = \mathbf{w}^\top \boldsymbol{\xi}_{ik} + \varepsilon_{ik}, \quad (i,k) \in \mathcal{I}$
  • $\mathbf{w} \in \mathbb{R}^K$: weight parameter
  • $\boldsymbol{\xi}_{ik} \in \mathbb{R}^K$: latent feature vector of $x_{ik}$
  • $\varepsilon_{ik} \sim \mathcal{N}(0, \sigma^2)$: observation noise
  • $\mathcal{I}$: index set of observed elements

  9. Bilinear assumption
  Assume that $\boldsymbol{\xi}_{ik}$ is decomposed into $\boldsymbol{\psi}_i$ and $\boldsymbol{\phi}_k$:
  $\mathbf{w}^\top \boldsymbol{\xi}_{ik} = \mathbf{w}^\top (\boldsymbol{\phi}_k \otimes \boldsymbol{\psi}_i) = \boldsymbol{\psi}_i^\top W \boldsymbol{\phi}_k$
  • $\boldsymbol{\psi}_i \in \mathbb{R}^{K_1}$: $i$-th row-dependent feature
  • $\boldsymbol{\phi}_k \in \mathbb{R}^{K_2}$: $k$-th column-dependent feature
  • $W \in \mathbb{R}^{K_1 \times K_2}$: weight parameter ($\mathrm{vec}\, W = \mathbf{w}$)
  • $K = K_1 K_2$
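The identity on this slide is the standard vec/Kronecker relation $\boldsymbol{\psi}_i^\top W \boldsymbol{\phi}_k = (\boldsymbol{\phi}_k \otimes \boldsymbol{\psi}_i)^\top \mathrm{vec}\, W$ (column-stacking vec). A minimal NumPy check, with all dimensions chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
K1, K2 = 3, 4
psi = rng.normal(size=K1)        # row-dependent feature psi_i
phi = rng.normal(size=K2)        # column-dependent feature phi_k
W = rng.normal(size=(K1, K2))    # weight matrix
w = W.flatten(order="F")         # vec W (column stacking), so K = K1 * K2

primal = w @ np.kron(phi, psi)   # w^T (phi_k ⊗ psi_i)
bilinear = psi @ W @ phi         # psi_i^T W phi_k
assert np.isclose(primal, bilinear)
```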

  10. Now $\boldsymbol{\psi}_i$ and $\boldsymbol{\phi}_k$ are given by feature functions: $\boldsymbol{\psi}_i = \boldsymbol{\psi}(\mathbf{x}_{i:})$, $\boldsymbol{\phi}_k = \boldsymbol{\phi}(\mathbf{x}_{:k})$
  • $\mathbf{x}_{i:} \in \mathbb{R}^{D_1}$: $i$-th row vector of $X$
  • $\mathbf{x}_{:k} \in \mathbb{R}^{D_2}$: $k$-th column vector of $X$

  11. Kernel representation
  $\hat{x}^{\mathrm{pred}}_{ik} = \mathbf{w}^\top (\boldsymbol{\phi}(\mathbf{x}_{:k}) \otimes \boldsymbol{\psi}(\mathbf{x}_{i:}))$ (primal)
  $\hat{x}^{\mathrm{pred}}_{ik} = \sum_{(j,l) \in \mathcal{I}} \beta_{jl}\, k(\{\mathbf{x}_{i:}, \mathbf{x}_{:k}\}, \{\mathbf{x}_{j:}, \mathbf{x}_{:l}\})$ (dual)
  Self-measuring kernel (similarity):
  $k(\{\mathbf{x}_{i:}, \mathbf{x}_{:k}\}, \{\mathbf{x}_{j:}, \mathbf{x}_{:l}\}) = k_\psi(\mathbf{x}_{i:}, \mathbf{x}_{j:})\, k_\phi(\mathbf{x}_{:k}, \mathbf{x}_{:l}) = \langle \boldsymbol{\psi}(\mathbf{x}_{i:}), \boldsymbol{\psi}(\mathbf{x}_{j:}) \rangle \langle \boldsymbol{\phi}(\mathbf{x}_{:k}), \boldsymbol{\phi}(\mathbf{x}_{:l}) \rangle$
  In words: $\mathrm{sim}(x_{ik}, x_{jl}) = \mathrm{sim}(\mathbf{x}_{i:}, \mathbf{x}_{j:}) \times \mathrm{sim}(\mathbf{x}_{:k}, \mathbf{x}_{:l})$
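As an illustration only (the function names and the toy matrix below are not from the paper), the self-measuring similarity between entries $(i,k)$ and $(j,l)$ can be computed like this, using the RBF kernel that slide 19 later adopts:

```python
import numpy as np

def rbf(x, y, lam=1.0):
    # Slide 19's kernel: k(x, x') = exp(-lambda * ||x - x'||^2)
    return np.exp(-lam * np.sum((x - y) ** 2))

def self_measuring_k(X, i, k, j, l, lam=1.0):
    # sim(x_ik, x_jl) = sim(row i, row j) * sim(col k, col l),
    # both similarities measured on the response matrix X itself
    return rbf(X[i, :], X[j, :], lam) * rbf(X[:, k], X[:, l], lam)

X = np.random.default_rng(0).normal(size=(6, 5))   # toy fully observed X
print(self_measuring_k(X, i=0, k=1, j=2, l=3))
```

Missing values are ignored here; the next slide deals with them via latent variables.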

  12. Latent variables for missing values
  $\hat{x}^{\mathrm{pred}}_{ik} = \sum_{(j,l) \in \mathcal{I}} \beta_{jl}\, k(\{\mathbf{x}_{i:}, \mathbf{x}_{:k}\}, \{\mathbf{x}_{j:}, \mathbf{x}_{:l}\})$
  How to compute $k(\cdot, \cdot)$ with missing values? Introduce latent variables $Z$:
  $\mathbf{x}_{i:} = (1, \cdot, 3, 4, \cdot)^\top \Rightarrow (1, z_{i2}, 3, 4, z_{i5})^\top$
  EM-like iterative estimation:
  • initialize $Z^0$ by the data mean
  • estimate $z^t_{ik} = \hat{x}^{\mathrm{pred}}_{ik}(Z^{t-1})$ for $t = 1, 2, \ldots$
  • early stopping with a validation set
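A rough sketch of this loop; `predict_all` is a hypothetical stand-in for one full fit-and-predict pass of the model, and the validation split drives the early stopping:

```python
import numpy as np

def impute(X, is_observed, predict_all, val_idx, x_val, max_iter=50):
    """EM-like imputation from slide 12.
    X: response matrix with NaNs at missing entries,
    is_observed: boolean mask of observed entries,
    predict_all(Z): predictions for all entries given the completed matrix Z,
    val_idx: (rows, cols) index arrays of held-out entries, x_val: their values."""
    Z = np.where(is_observed, X, np.nanmean(X))   # initialize Z^0 by the data mean
    best_rmse, best_Z = np.inf, Z
    for t in range(max_iter):
        pred = predict_all(Z)                     # x_pred(Z^{t-1}) for every entry
        Z = np.where(is_observed, X, pred)        # update only the missing entries
        rmse = np.sqrt(np.mean((pred[val_idx] - x_val) ** 2))
        if rmse >= best_rmse:                     # early stopping on validation set
            return best_Z
        best_rmse, best_Z = rmse, Z
    return best_Z
```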

  13. Use of additional information
  We can exploit additional information $S = (\mathbf{s}_1, \ldots, \mathbf{s}_{D_1})$ and $T = (\mathbf{t}_1, \ldots, \mathbf{t}_{D_2})$ by combining it with the self-measuring similarity, e.g.
  $k_\psi(\cdot, \cdot) = k(\mathbf{x}_{i:}, \mathbf{x}_{j:})\, k(\mathbf{s}_i, \mathbf{s}_j)$, $\quad k_\phi(\cdot, \cdot) = k(\mathbf{x}_{:k}, \mathbf{x}_{:l})\, k(\mathbf{t}_k, \mathbf{t}_l)$

  14. Optimization


  16. Strategy
  The $L_2$-norm regularized least-squares solution: $\hat{\boldsymbol{\beta}} = K^{-1} \mathbf{x}_{\mathcal{I}}$
  • $K = \Omega \otimes \Sigma + \sigma^2 I$: Gram matrix
  • $\mathbf{x}_{\mathcal{I}} \in \mathbb{R}^M$: observed elements of $X$
  • $M = |\mathcal{I}|$: number of observations
  Naïve approach: compute $K^{-1}$ directly
  • $O(M^3)$ time and $O(M^2)$ space: too expensive
  Instead, solve $\mathbf{x}_{\mathcal{I}} = K \boldsymbol{\beta}$ by conjugate gradient with the vec-trick
  • $O(M^2)$ time and $O(M)$ space
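A sketch of the CG solver for the fully observed case, where the vec-trick $(\Omega \otimes \Sigma)\,\mathrm{vec}(V) = \mathrm{vec}(\Sigma V \Omega^\top)$ gives a matrix-free multiply by $K$; the slide's actual method also restricts $K$ to the observed indices $\mathcal{I}$, which this toy version omits:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, m, sigma2 = 30, 20, 0.1
A = rng.normal(size=(n, n)); Sigma = A @ A.T + n * np.eye(n)   # SPD row Gram
B = rng.normal(size=(m, m)); Omega = B @ B.T + m * np.eye(m)   # SPD column Gram
x = rng.normal(size=n * m)                # vec of the response matrix (column-stacked)

def matvec(v):
    # (Omega ⊗ Sigma + sigma^2 I) v, via vec(Sigma V Omega^T) with V = unvec(v)
    V = v.reshape((n, m), order="F")
    return (Sigma @ V @ Omega.T).ravel(order="F") + sigma2 * v

K = LinearOperator((n * m, n * m), matvec=matvec)
beta, info = cg(K, x)                     # solve K beta = x without materializing K
assert info == 0
```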

  17. Experiment (updated)

  18. Dataset
  Dataset: MovieLens 100k
  • 1682 movies × 943 users
  • $x_{ik} \in \{1, \ldots, 5\}$: the rating of the $i$-th movie by the $k$-th user
  • # observations: 100,000
    • 86,040 for training
    • 4,530 for validation (early stopping)
    • 9,430 for testing
  • additional information
    • user-specific features: age, gender, ...
    • movie-specific features: genre, release date, ...

  19. Settings
  • RBF kernel: $k(\mathbf{x}, \mathbf{x}') = \exp(-\lambda \|\mathbf{x} - \mathbf{x}'\|^2)$
  • hyper-parameters $\{\sigma^2, \lambda\}$: 3-fold CV

  Method         | k_psi                       | k_phi
  Multi-task GP  | k(s_i, s_j)                 | k(t_k, t_l)
  Self-measuring | k(x_i:, x_j:)               | k(x_:k, x_:l)
  Product        | k(x_i:, x_j:) k(s_i, s_j)   | k(x_:k, x_:l) k(t_k, t_l)
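The three rows of this table differ only in which Gram matrices are multiplied together. A sketch that builds all three variants (the shapes and data below are toy placeholders):

```python
import numpy as np

def rbf_gram(A, lam=1.0):
    # Gram matrix of k(x, x') = exp(-lambda * ||x - x'||^2) between rows of A
    sq = np.sum(A ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0)
    return np.exp(-lam * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))    # responses (rows x columns)
S = rng.normal(size=(6, 3))    # row-side additional information
T = rng.normal(size=(5, 2))    # column-side additional information

K_psi = {"Multi-task GP": rbf_gram(S),                   # side information only
         "Self-measuring": rbf_gram(X),                  # rows of X themselves
         "Product": rbf_gram(X) * rbf_gram(S)}           # elementwise product of both
K_phi = {"Multi-task GP": rbf_gram(T),
         "Self-measuring": rbf_gram(X.T),
         "Product": rbf_gram(X.T) * rbf_gram(T)}
```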

  20. Results

  Method               | RMSE   | Time
  Matrix Factorization | 0.9345 | 1m 38s
  Multi-task GP        | 1.0517 | 7m 01s
  Self-measuring       | 0.9308 | 16m 22s
  Product              | 0.9256 | 18m 25s

  • Product achieves the best score at http://mlcomp.org/datasets/341


  22. Conclusion
  1. Proposed a kernel-based method for multi-task learning problems
     • self-measuring similarity
     • an efficient algorithm using the CG method
  2. Applied it to a recommender system
     • outperformed existing methods on the MovieLens 100k dataset
  Questions?
