SLIDE 1

Self-reflective Multi-task Gaussian Process

Kohei Hayashi¹, Takashi Takenouchi¹, Ryota Tomioka², Hisashi Kashima²

¹ Graduate School of Information Science, Nara Institute of Science and Technology
² Department of Mathematical Informatics, The University of Tokyo

July 2nd, 2011

SLIDE 2

Multi-task learning: problem definition

  • tasks and data points are correlated

Goal: predict the unobserved responses from the observed responses and the additional information

SLIDE 4

Gaussian process for multi-task learning

Idea: capture the correlations by measuring similarities between the responses.

Multi-task GP [Bonilla+ AISTATS'07] [Yu+ NIPS'07]: separately measures task and data-point similarities using additional information

SLIDE 5

Challenges

1. Good similarity measurement

  • additional information may not be enough to capture the correlations ⇒ inaccurate prediction

2. Computational complexity

  • inverting the Gram matrix is not practical for large-scale datasets

SLIDE 6

Our contributions

Propose a new framework for multi-task learning:

  • Self-measuring similarity: measures similarities using the observed responses themselves
  • Efficient, exact learning algorithm: ∼10 min for a 1000 × 1000 matrix

Apply it to a recommender system:

  • Outperforms existing methods

SLIDE 7

Model

SLIDE 8

Simple linear model

Consider a linear Gaussian model

  xik = w⊤ξik + εik,   (i, k) ∈ I

  • w ∈ R^K: weight parameter
  • ξik ∈ R^K: latent feature vector of xik
  • εik ∼ N(0, σ²): observation noise
  • I: index set of the observed elements
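For concreteness, here is a minimal numpy sketch of this generative model; the dimension K, the number of observed elements, and the noise level are arbitrary demo values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 6, 100              # feature dimension and # observed elements (demo values)
sigma2 = 0.1               # noise variance σ²

w = rng.normal(size=K)                           # weight parameter w ∈ R^K
xi = rng.normal(size=(M, K))                     # latent features ξik, one per element
eps = rng.normal(scale=np.sqrt(sigma2), size=M)  # εik ~ N(0, σ²)
x = xi @ w + eps                                 # xik = w⊤ξik + εik, (i, k) ∈ I
```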

SLIDE 9

Bilinear assumption

Assume that ξik is decomposed into ψi and φk:

  w⊤ξik = w⊤(φk ⊗ ψi) = ψi⊤ W φk

  • ψi ∈ R^K1: i-th row-dependent feature
  • φk ∈ R^K2: k-th column-dependent feature
  • W ∈ R^K1×K2: weight parameter (vec W = w)
  • K = K1 K2
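The bilinear identity is easy to sanity-check numerically. A small sketch, using a column-major vec so that vec W stacks the columns of W:

```python
import numpy as np

rng = np.random.default_rng(0)
K1, K2 = 3, 4
psi = rng.normal(size=K1)        # row-dependent feature ψi
phi = rng.normal(size=K2)        # column-dependent feature φk
W = rng.normal(size=(K1, K2))    # weight matrix
w = W.flatten(order="F")         # vec W = w (column-major stacking)

lhs = w @ np.kron(phi, psi)      # w⊤(φk ⊗ ψi)
rhs = psi @ W @ phi              # ψi⊤ W φk
assert np.isclose(lhs, rhs)
```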

SLIDE 10

Now ψi and φk are given by feature functions: ψi = ψ(xi:), φk = φ(x:k)

  • xi: ∈ R^D2: i-th row vector of X
  • x:k ∈ R^D1: k-th column vector of X

SLIDE 11

Kernel representation

  x^pred_ik = ŵ⊤(φ(x:k) ⊗ ψ(xi:))                        (primal)
            = ∑_(j,l)∈I β̂_jl k({xi:, x:k}, {xj:, x:l})    (dual)

Self-measuring kernel (similarity)

  k({xi:, x:k}, {xj:, x:l}) = kψ(xi:, xj:) kφ(x:k, x:l)
                            = ⟨ψ(xi:), ψ(xj:)⟩ ⟨φ(x:k), φ(x:l)⟩

i.e. sim(xik, xjl) = sim(xi:, xj:) × sim(x:k, x:l)
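As an illustration, here is a minimal sketch of the self-measuring kernel with identity feature maps ψ and φ, so that each factor reduces to a plain inner product; any other row/column kernel (e.g. the RBF kernel used in the experiments) would slot in the same way:

```python
import numpy as np

def self_measuring_kernel(X, i, k, j, l):
    """k({xi:, x:k}, {xj:, x:l}) = kψ(xi:, xj:) · kφ(x:k, x:l),
    with ψ and φ taken as identity feature maps for simplicity
    (missing entries are assumed already filled in; see the next slide)."""
    k_psi = X[i, :] @ X[j, :]   # row (data-point) similarity kψ
    k_phi = X[:, k] @ X[:, l]   # column (task) similarity kφ
    return k_psi * k_phi
```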

SLIDE 12

Latent variables for missing values

  x^pred_ik = ∑_(j,l)∈I β̂_jl k({xi:, x:k}, {xj:, x:l})

How to compute k(·, ·) with missing values?

  • introduce latent variables Z:

  xi: = (1, ·, 3, 4, ·)⊤ ⇒ (1, zi2, 3, 4, zi5)⊤

EM-like iterative estimation

  • initialize Z^0 by the data mean
  • estimate z^t_ik = x^pred_ik(Z^{t−1}) for t = 1, 2, . . .
  • early stopping with a validation set
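A minimal sketch of this loop, assuming a predict(Z) callable that returns the model's predictions x^pred(Z) for the whole matrix; the helper names are mine, not the paper's:

```python
import numpy as np

def em_like_impute(X, obs_mask, predict, val_idx=None, x_val=None, max_iters=50):
    """Iteratively re-estimate latent variables Z for the missing entries."""
    Z = np.where(obs_mask, X, X[obs_mask].mean())  # initialize Z^0 by the data mean
    best_rmse, best_Z = np.inf, Z.copy()
    for t in range(1, max_iters + 1):
        pred = predict(Z)                          # x^pred(Z^{t-1})
        Z = np.where(obs_mask, X, pred)            # z^t_ik = x^pred_ik(Z^{t-1})
        if val_idx is not None:                    # early stopping on a validation set
            rmse = np.sqrt(np.mean((pred[val_idx] - x_val) ** 2))
            if rmse > best_rmse:
                break
            best_rmse, best_Z = rmse, Z.copy()
    return best_Z if val_idx is not None else Z
```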

SLIDE 13

Use of additional information

We can exploit additional information S = (s1, . . . , sD1) and T = (t1, . . . , tD2) by combining it with the self-measuring similarity, e.g.

  kψ(·, ·) = k(xi:, xj:) k(si, sj),   kφ(·, ·) = k(x:k, x:l) k(tk, tl)
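Concretely, the combination is an element-wise product of kernel factors. A tiny sketch (the function names are placeholders, not from the paper):

```python
def combined_row_kernel(X, S, i, j, k_self, k_side):
    """kψ(xi:, xj:) = k(xi:, xj:) · k(si, sj): self-measuring row similarity
    multiplied by the similarity of the additional row features si, sj."""
    return k_self(X[i, :], X[j, :]) * k_side(S[i], S[j])
```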

SLIDE 14

Optimization

SLIDE 16

Strategy

L2-regularized least-squares solution: β̂ = K⁻¹ xI

  • K = Ω ⊗ Σ + σ²I: Gram matrix
  • xI ∈ R^M: observed elements of X
  • M = |I|: # observations

Naïve approach: compute K⁻¹

  • O(M^3) time and O(M^2) space
  • too expensive

Solve xI = Kβ by conjugate gradient with the vec-trick

  • O(M^(3/2)) time and O(M) space
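A minimal sketch of the idea, assuming the row-similarity Gram matrix Σ and the column-similarity Gram matrix Ω are precomputed dense arrays; CG only ever sees a matrix-vector product, so the M × M Gram matrix is never formed:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_dual(Sigma, Omega, rows, cols, x_obs, sigma2):
    """Solve (Ω ⊗ Σ + σ²I) β = xI by conjugate gradient.

    rows, cols: arrays indexing the observed entries (i, k) ∈ I.
    """
    D1, D2 = Sigma.shape[0], Omega.shape[0]
    M = len(x_obs)

    def matvec(beta):
        V = np.zeros((D1, D2))
        V[rows, cols] = beta                  # scatter β onto the D1 × D2 grid
        U = Sigma @ V @ Omega                 # vec-trick: (Ω ⊗ Σ) vec V = vec(Σ V Ω), Ω symmetric
        return U[rows, cols] + sigma2 * beta  # gather back + σ²I term

    beta, info = cg(LinearOperator((M, M), matvec=matvec), x_obs)
    return beta
```

Each CG iteration then costs two dense matrix products via the vec-trick instead of a multiply with an explicit M × M matrix, which is what keeps the memory footprint at O(M).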

SLIDE 17

Experiment (updated)

SLIDE 18

Dataset

Dataset: MovieLens 100k

  • 1682 movies × 943 users
  • xik ∈ {1, . . . , 5}: rating of the i-th movie by the k-th user
  • # observations: 100,000
    • 86,040 for training
    • 4,530 for validation (early stopping)
    • 9,430 for testing
  • additional information
    • user-specific features: age, gender, ...
    • movie-specific features: genre, release date, ...
SLIDE 19

Settings

  • RBF kernel: k(x, x′) = exp(−λ ||x − x′||²)
  • hyper-parameters {σ², λ}: chosen by 3-fold CV

                   kψ                       kφ
  Multi-task GP    k(si, sj)                k(tk, tl)
  Self-measuring   k(xi:, xj:)              k(x:k, x:l)
  Product          k(xi:, xj:) k(si, sj)    k(x:k, x:l) k(tk, tl)
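The RBF kernel above is a one-liner; a minimal sketch, where λ is the bandwidth hyper-parameter selected by CV:

```python
import numpy as np

def rbf(x, y, lam):
    # k(x, x') = exp(−λ ||x − x'||²)
    return np.exp(-lam * np.sum((x - y) ** 2))
```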

SLIDE 20

Results

  Method                 RMSE     time
  Matrix Factorization   0.9345   1m38s
  Multi-task GP          1.0517   7m01s
  Self-measuring         0.9308   16m22s
  Product                0.9256   18m25s

  • The best score on http://mlcomp.org/datasets/341

SLIDE 21

Conclusion

1. Proposed a kernel-based method for multi-task learning problems
  • self-measuring similarity
  • an efficient algorithm using the CG method

2. Applied it to a recommender system
  • outperformed existing methods on the MovieLens 100k dataset

SLIDE 22

Questions?
