Collaborative Deep Learning and Its Variants for Recommender Systems



SLIDE 1

Collaborative Deep Learning and Its Variants for Recommender Systems


Hao Wang

Joint work with Naiyan Wang, Xingjian Shi, and Dit-Yan Yeung

SLIDE 2

Recommender Systems

Rating matrix: given the observed preferences, predict the missing entries (matrix completion).

SLIDE 3

Recommender Systems with Content

Content information: Plots, directors, actors, etc.

SLIDE 4

Modeling the Content Information

  • Handcrafted features
  • Automatically learned features
  • Automatically learned features adapted for ratings

SLIDE 5

Modeling the Content Information

  • 1. Powerful features for content information → deep learning
  • 2. Feedback from rating information → non-i.i.d. collaborative deep learning

SLIDE 6

Deep Learning

  • Stacked denoising autoencoders
  • Convolutional neural networks
  • Recurrent neural networks

Typically for i.i.d. data

SLIDE 7

Modeling the Content Information

  • 1. Powerful features for content information → deep learning
  • 2. Feedback from rating information → non-i.i.d. collaborative deep learning (CDL)

SLIDE 8

Contribution

Collaborative deep learning:
  • deep learning for non-i.i.d. data
  • joint representation learning and collaborative filtering

SLIDE 9

Contribution

Collaborative deep learning Complex target: * beyond targets like classification and regression * to complete a low-rank matrix

SLIDE 10

Contribution

Collaborative deep learning Complex target First hierarchical Bayesian models for deep hybrid recommender system

SLIDE 11

Related Work


  • Not hybrid methods (ratings only)

RBM (single layer, Salakhutdinov et al., 2007) I-RBM/U-RBM (Georgiev et al., 2013)

  • Not using Bayesian modeling for joint learning

DeepMusic (van den Oord et al., 2013) HLDBN (Wang et al., 2014)

SLIDE 12

Stacked Denoising Autoencoders (SDAE)

An SDAE takes a corrupted input and learns to reconstruct the clean input [ Vincent et al. 2010 ].
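As a sketch of the idea, the snippet below implements a minimal single-layer denoising autoencoder in numpy with masking noise and tied weights; the layer sizes, learning rate, and corruption level are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((5, 8))                       # toy "clean" bag-of-words input
W = rng.normal(scale=0.1, size=(8, 4))       # tied weights: encoder W, decoder W.T
b_h, b_o = np.zeros(4), np.zeros(8)

def reconstruct(Xin):
    return sigmoid(sigmoid(Xin @ W + b_h) @ W.T + b_o)

err_before = np.mean((reconstruct(X) - X) ** 2)
for _ in range(500):
    Xc = X * (rng.random(X.shape) >= 0.3)    # masking (denoising) corruption
    H = sigmoid(Xc @ W + b_h)                # hidden code of the corrupted input
    Rhat = sigmoid(H @ W.T + b_o)            # reconstruction targets the *clean* input
    G = (Rhat - X) * Rhat * (1 - Rhat)       # output-layer delta (squared error)
    GH = (G @ W) * H * (1 - H)               # hidden-layer delta
    W -= 0.1 * (Xc.T @ GH + G.T @ H)         # tied-weight gradient
    b_h -= 0.1 * GH.sum(0)
    b_o -= 0.1 * G.sum(0)
err_after = np.mean((reconstruct(X) - X) ** 2)
```

Stacking several such layers and training them greedily gives the "stacked" part of the SDAE.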

SLIDE 13

Probabilistic Matrix Factorization (PMF)

Graphical model: Generative process: Objective function if using MAP:

latent vector of item j latent vector of user i rating of item j from user i

Notation:

[ Salakhutdinov et al. 2008 ]
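The MAP objective referenced on this slide appeared as an equation image; for reference, the standard PMF formulation from Salakhutdinov et al. 2008, with $u_i$ and $v_j$ the user and item latent vectors and $I_{ij}=1$ for observed ratings, is:

```latex
\min_{U,V}\; \frac{1}{2}\sum_{i,j} I_{ij}\,\bigl(R_{ij} - u_i^{\top} v_j\bigr)^2
+ \frac{\lambda_u}{2}\sum_i \lVert u_i \rVert^2
+ \frac{\lambda_v}{2}\sum_j \lVert v_j \rVert^2
```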

SLIDE 14

Probabilistic SDAE

Generalized SDAE Graphical model: Generative process:

corrupted input clean input weights and biases

Notation:

SLIDE 15

Collaborative Deep Learning (CDL)

Graphical model: Collaborative deep learning SDAE

corrupted input clean input weights and biases content representation rating of item j from user i latent vector of item j latent vector of user i

Notation:

Two-way interaction

  • More powerful representation
  • Infer missing ratings from content
  • Infer missing content from ratings

SLIDE 16

A Principled Probabilistic Framework

The framework separates a perception component (perception variables) from a task-specific component (task variables), connected by hinge variables [ Wang et al. TKDE 2016 ].

SLIDE 17

CDL with Two Components

Graphical model: Collaborative deep learning SDAE

corrupted input clean input weights and biases content representation rating of item j from user i latent vector of item j latent vector of user i

Notation:

Two-way interaction

  • More powerful representation
  • Infer missing ratings from content
  • Infer missing content from ratings

SLIDE 18

Collaborative Deep Learning

Neural network representation for degenerated CDL

SLIDE 19

Collaborative Deep Learning

Information flows from ratings to content

SLIDE 20

Collaborative Deep Learning

Information flows from content to ratings

SLIDE 21

Collaborative Deep Learning

Representation learning <-> recommendation

SLIDE 22

Learning

Maximizing the posterior probability is equivalent to maximizing the joint log-likelihood.

SLIDE 23

Learning

Prior (regularization) for user latent vectors, weights, and biases

SLIDE 24

Learning

Generating item latent vectors from content representation with Gaussian offset

SLIDE 25

Learning

‘Generating’ clean input from the output of probabilistic SDAE with Gaussian offset

SLIDE 26

Learning

Generating the input of Layer l from the output of Layer l-1 with Gaussian offset

SLIDE 27

Learning

The last term measures the error of the predicted ratings.

SLIDE 28

Learning

If λs (the precision of the Gaussian layers in the probabilistic SDAE) goes to infinity, those layers become deterministic and the likelihood simplifies accordingly.
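The Learning slides above describe the terms one by one; assembling them (in the λs → ∞ case, following the notation of the CDL paper [ Wang et al. KDD 2015 ], so this is a reconstruction rather than the slide's original equation image) gives the joint log-likelihood being maximized:

```latex
\mathcal{L} = -\frac{\lambda_u}{2}\sum_i \lVert u_i\rVert^2
  -\frac{\lambda_w}{2}\sum_l \bigl(\lVert W_l\rVert_F^2 + \lVert b_l\rVert^2\bigr)
  -\frac{\lambda_v}{2}\sum_j \bigl\lVert v_j - f_e(X_{0,j*}, W^+)^{\top}\bigr\rVert^2
  -\frac{\lambda_n}{2}\sum_j \bigl\lVert f_r(X_{0,j*}, W^+) - X_{c,j*}\bigr\rVert^2
  -\sum_{i,j}\frac{C_{ij}}{2}\bigl(R_{ij} - u_i^{\top} v_j\bigr)^2
```

Here $f_e$ is the encoder (content representation), $f_r$ the full reconstruction, and $C_{ij}$ the confidence on rating $R_{ij}$.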

SLIDE 29

Update Rules

For U and V, use block coordinate descent: For W and b, use a modified version of backpropagation:
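The block coordinate descent for U and V can be sketched as alternating ridge-style solves; this is a minimal toy version in numpy (dimensions, confidences a/b, and the variable name theta for the SDAE content representation are illustrative assumptions), where each v_j is pulled toward its content representation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 6, 3                                # users, items, latent dim
R = (rng.random((m, n)) < 0.4).astype(float)     # toy implicit 0/1 ratings
theta = rng.random((n, k))                       # content representations (from the SDAE)
U, V = rng.random((m, k)), rng.random((n, k))
lam_u, lam_v = 0.1, 1.0
a, b = 1.0, 0.01                                 # confidences for observed/unobserved entries

def objective(U, V):
    C = np.where(R > 0, a, b)
    fit = 0.5 * np.sum(C * (R - U @ V.T) ** 2)
    return fit + 0.5 * lam_u * np.sum(U ** 2) + 0.5 * lam_v * np.sum((V - theta) ** 2)

obj_before = objective(U, V)
for _ in range(10):
    for i in range(m):                           # fix V, solve exactly for each u_i
        Ci = np.diag(np.where(R[i] > 0, a, b))
        U[i] = np.linalg.solve(V.T @ Ci @ V + lam_u * np.eye(k), V.T @ Ci @ R[i])
    for j in range(n):                           # fix U, solve for each v_j (pulled toward theta_j)
        Cj = np.diag(np.where(R[:, j] > 0, a, b))
        V[j] = np.linalg.solve(U.T @ Cj @ U + lam_v * np.eye(k),
                               U.T @ Cj @ R[:, j] + lam_v * theta[j])
obj_after = objective(U, V)
```

Because each block solve is exact, the objective is non-increasing across sweeps; W and b are then updated by backpropagation with theta held as the target, as the slide notes.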

SLIDE 30

Datasets

Content information: titles and abstracts (citeulike-a [ Wang et al. KDD 2011 ]); titles and abstracts (citeulike-t [ Wang et al. IJCAI 2013 ]); movie plots (Netflix).

SLIDE 31

Evaluation Metrics

Recall: Mean Average Precision (mAP):

Higher recall and mAP indicate better recommendation performance
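For concreteness, the two metrics can be computed as follows (recall_at_m and average_precision are hypothetical helper names; mAP is the mean of the per-user average precisions):

```python
def recall_at_m(ranked, relevant, m):
    """Fraction of the user's relevant items that appear in the top-M list."""
    hits = sum(1 for item in ranked[:m] if item in relevant)
    return hits / len(relevant)

def average_precision(ranked, relevant, cutoff=500):
    """Mean of precision@k over the ranks k where a relevant item appears."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked[:cutoff], start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / min(len(relevant), cutoff) if relevant else 0.0

ranked = ["a", "b", "c", "d", "e"]   # recommendation list for one user
relevant = {"b", "e"}                # items the user actually liked
r3 = recall_at_m(ranked, relevant, 3)        # "b" is the only hit in the top 3 -> 0.5
ap = average_precision(ranked, relevant)     # (1/2 + 2/5) / 2 = 0.45
```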

SLIDE 32

Recall@M

citeulike-t, sparse setting citeulike-t, dense setting Netflix, sparse setting Netflix, dense setting

When the ratings are very sparse: When the ratings are dense:

SLIDE 33

Mean Average Precision (mAP)

Following van den Oord et al. 2013 exactly, we set the cutoff point at 500 for each user. CDL achieves a relative performance boost of about 50%.

SLIDE 34

Example User

Moonstruck True Romance

Romance Movies

Precision: 20% VS 30%

SLIDE 35

Example User

Johnny English American Beauty

Action & Drama Movies

Precision: 20% VS 50%

SLIDE 36

Example User

Precision: 50% VS 90%

SLIDE 37

Summary: Collaborative Deep Learning

  • Non-i.i.d. (collaborative) deep learning
  • With a complex target
  • First hierarchical Bayesian models for a hybrid deep recommender system
  • Significantly advances the state of the art

SLIDE 38

Marginalized CDL [ Li et al., CIKM 2015 ]

Both CDL and marginalized CDL combine a transformation to latent factors with a reconstruction error term (figure comparing the two objectives omitted).

SLIDE 39

Collaborative Deep Ranking [ Ying et al., PAKDD 2016 ]

SLIDE 40

Generative Process: Collaborative Deep Ranking

SLIDE 41

CDL Variants


More details in http://wanghao.in/CDL.htm

SLIDE 42

Beyond Bag-of-Words: Documents as Sequences

Motivation:

  • A more natural way: take in one word at a time and model documents as sequences
  • Jointly model preferences and sequence generation under the BDL framework

“Collaborative recurrent autoencoder: recommend while learning to fill in the blanks” [ Wang et al., NIPS 2016a ]

SLIDE 43

Beyond Bag-of-Words: Documents as Sequences

Main Idea:

  • Joint learning in the BDL framework
  • Wildcard denoising for robust representation

“Collaborative recurrent autoencoder: recommend while learning to fill in the blanks” [ Wang et al., NIPS 2016a ]

SLIDE 44

Wildcard Denoising

Sentence: “This is a great idea.”

Direct denoising drops words (e.g. “this a great idea”), so the encoder/decoder RNNs see wrong transitions between the remaining words. Wildcard denoising instead replaces a dropped word with a <wc> token (e.g. “this <wc> a great idea”), preserving the sequence structure.
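The two corruption schemes can be sketched as follows (function names, corruption rate, and seeding are illustrative, not the paper's implementation):

```python
import random

def wildcard_denoise(tokens, rate=0.3, wildcard="<wc>", seed=0):
    """Replace a random subset of tokens with a wildcard instead of dropping them,
    so the sequence keeps its length and word-to-word transitions."""
    rng = random.Random(seed)
    return [wildcard if rng.random() < rate else t for t in tokens]

def direct_denoise(tokens, rate=0.3, seed=0):
    """Drop tokens outright; the remaining words form transitions that never
    occur in the clean text."""
    rng = random.Random(seed)
    return [t for t in tokens if rng.random() >= rate]

sent = "this is a great idea".split()
wc = wildcard_denoise(sent)   # same length as sent, some tokens replaced by <wc>
dd = direct_denoise(sent)     # possibly shorter than sent
```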

SLIDE 45

Documents as Sequences

Main Idea:

  • Joint learning in the BDL framework
  • Wildcard denoising for robust representation
  • Beta-pooling for variable-length sequences

“Collaborative recurrent autoencoder: recommend while learning to fill in the blanks” [ Wang et al., NIPS 2016a ]

SLIDE 46

Is a Variable-Length Weight Vector Possible?

Sequences of length 8, 6, and 4 require weight vectors of matching lengths. [ Wang et al., NIPS 2016a ]

SLIDE 47

Variable-Length Weight Vector with Beta Distributions

Multiply the 8 length-3 vectors by a length-8 weight vector (0.08, 0.18, 0.22, 0.16, 0.21, 0.10, 0.04, 0.01) to pool them into one single vector.

Use the area under the beta distribution to define the weights! [ Wang et al., NIPS 2016a ]

SLIDE 48

Variable-Length Weight Vector with Beta Distributions

Multiply the 6 length-3 vectors by a length-6 weight vector (0.13, 0.27, 0.28, 0.20, 0.10, 0.02) to pool them into one single vector.

Use the area under the beta distribution to define the weights! [ Wang et al., NIPS 2016a ]
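One way to realize beta-pooling, assuming fixed shape parameters a and b (the function names and the numerical integration are illustrative, and in the paper the shape parameters can be learned), is to take weight i as the area under the Beta(a, b) density over the i-th of L equal sub-intervals of [0, 1], so any sequence length L yields a weight vector summing to about 1:

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density, normalized via the gamma function."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

def beta_pool_weights(length, a=2.0, b=2.0, steps=200):
    """Weight i = area under Beta(a, b) over [i/length, (i+1)/length],
    approximated with the trapezoidal rule."""
    weights = []
    for i in range(length):
        lo, hi = i / length, (i + 1) / length
        xs = [lo + (hi - lo) * t / steps for t in range(steps + 1)]
        area = sum((beta_pdf(xs[t], a, b) + beta_pdf(xs[t + 1], a, b)) / 2
                   for t in range(steps)) * (hi - lo) / steps
        weights.append(area)
    return weights

def beta_pool(vectors, a=2.0, b=2.0):
    """Pool a variable-length list of equal-size vectors into one single vector."""
    w = beta_pool_weights(len(vectors), a, b)
    dim = len(vectors[0])
    return [sum(w[i] * vectors[i][d] for i in range(len(vectors))) for d in range(dim)]

w8 = beta_pool_weights(8)
pooled = beta_pool([[1.0, 0.0]] * 8)
```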

SLIDE 49

Graphical Model: Collaborative Recurrent Autoencoder

  • Joint learning in the BDL framework
  • Wildcard denoising for robust representation
  • Beta-pooling for variable-length sequences

Perception component + task-specific component [ Wang et al., NIPS 2016a ]

SLIDE 50

Incorporating Relational Information


[ Wang et al. AAAI 2017 ] [ Wang et al. AAAI 2015 ]

SLIDE 51

Probabilistic SDAE

Generalized SDAE Graphical model: Generative process:

corrupted input clean input weights and biases

Notation:

SLIDE 52

Relational SDAE: Graphical Model

corrupted input clean input adjacency matrix

Notation:

SLIDE 53

Relational SDAE: Two Components

Perception Component Task-Specific Component

SLIDE 54

Relational SDAE: Generative Process

SLIDE 55

Relational SDAE: Generative Process

SLIDE 56

Multi-Relational SDAE: Graphical Model

corrupted input clean input adjacency matrix

Notation:

Product of Q+1 Gaussians

Multiple networks: citation networks co-author networks

SLIDE 57

Relational SDAE: Objective Function

Network A → relational matrix S; relational matrix S → middle-layer representations.
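A common way to write such a relational term (the paper's exact parameterization may differ, and the weight λ_r here is an assumption) is via the graph Laplacian: λ_r · tr(XᵀLX) with L = D − S penalizes linked items whose middle-layer representations differ, since tr(XᵀLX) = ½ Σᵢⱼ Sᵢⱼ‖xᵢ − xⱼ‖². A small numeric check of that identity:

```python
import numpy as np

# Toy adjacency matrix S over 4 items and their middle-layer representations X.
S = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.2, 0.8],
              [0.7, 0.3]])

L = np.diag(S.sum(axis=1)) - S          # graph Laplacian of the network
penalty = np.trace(X.T @ L @ X)         # relational smoothness penalty

# Equivalent pairwise form: 0.5 * sum_ij S_ij * ||x_i - x_j||^2
pairwise = 0.5 * sum(S[i, j] * np.sum((X[i] - X[j]) ** 2)
                     for i in range(4) for j in range(4))
```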

SLIDE 58

Update Rules

SLIDE 59

From Representation to Tag Recommendation

SLIDE 60

Algorithm

SLIDE 61

Datasets

SLIDE 62

Sparse Setting, citeulike-a

SLIDE 63

Case Study 1: Tagging Scientific Articles

Precision: 10% VS 60%

SLIDE 64

Case Study 2: Tagging Movies (SDAE)

Precision: 30% VS 60%

SLIDE 65

Case Study 2: Tagging Movies (RSDAE)

The tag does not appear in the tag lists of movies linked to ‘E.T. the Extra-Terrestrial’, making it very difficult to discover.

SLIDE 66

Relational SDAE as Deep Relational Topic Models

BDL-based topic models capture the topic hierarchy and inter-document relations, with a perception component and a task-specific component, unified into a probabilistic relational model for relational deep learning [ Wang et al. 2015 (AAAI) ].

SLIDE 67

(Recap) Relational SDAE: Two Components

Perception Component Task-Specific Component

SLIDE 68

Using Relational Information as Observations

Probabilistic SDAE combined with modeling relations among nodes [ Wang et al. 2017 (AAAI) ]

SLIDE 69

Be ‘Bayesian’ in Collaborative Deep Learning

SLIDE 70

Be Bayesian in BDL

Motivation:

  • Uncertainty estimation for reinforcement learning, active learning, etc.
  • Robustness to insufficient data and noise
  • More accurate prediction

“Natural-Parameter Networks: A Class of Probabilistic Neural Networks”

SLIDE 71

Be Bayesian in BDL

What We Want:

  • Solvable via backpropagation
  • Sampling-free during both training and testing
  • Intuitive and easy to implement

“Natural-Parameter Networks: A Class of Probabilistic Neural Networks”

SLIDE 72

Weights/Neurons as Distributions

Neural networks treat weights/neurons as points; natural-parameter networks treat weights/neurons as distributions.
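For Gaussian NPNs, moments can be propagated in closed form rather than by sampling. The snippet below shows the core identity for a single weight-input product of independent Gaussians (a simplified piece of what a full NPN linear layer does, not the paper's complete algorithm; the function name is illustrative):

```python
def gaussian_product_moments(w_mean, w_var, x_mean, x_var):
    """Mean and variance of w*x for independent Gaussians w and x:
    E[wx] = E[w]E[x];  Var(wx) = Vw*Vx + Vw*E[x]^2 + E[w]^2*Vx."""
    mean = w_mean * x_mean
    var = w_var * x_var + w_var * x_mean ** 2 + w_mean ** 2 * x_var
    return mean, var

# Weight w ~ N(1.0, 0.5), input x ~ N(2.0, 0.25)
m, v = gaussian_product_moments(1.0, 0.5, 2.0, 0.25)
```

Summing such moments over the inputs of a unit propagates a distribution through the layer without any sampling, which is what makes training and testing sampling-free.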

SLIDE 73

Take-home Messages

  • Probabilistic graphical models for formulating both representation learning and inference/reasoning components
  • Learnable representation serving as a bridge
  • Tight, two-way interaction is crucial

SLIDE 74

Thanks! Q&A
