Training Deep AutoEncoders for Collaborative Filtering (PowerPoint presentation)



SLIDE 1

Oleksii Kuchaiev & Boris Ginsburg

Training Deep AutoEncoders for Collaborative Filtering

SLIDE 2

Motivation

Personalized recommendations

SLIDE 3

Key points (spoiler alert)

1. Deep autoencoder for collaborative filtering
   • Improves generalization
2. Right activation function (SELU, ELU, LeakyReLU) enables deep architectures
   • No layer-wise pre-training or skip connections
3. Heavy use of dropout
4. Dense re-feeding for faster and better training
5. Beats other models on time-split Netflix data (RMSE of 0.9099 vs 0.9224)
6. Code (PyTorch-based): https://github.com/NVIDIA/DeepRecommender

Oleksii Kuchaiev and Boris Ginsburg, "Training Deep AutoEncoders for Collaborative Filtering", arXiv preprint arXiv:1708.01715 (2017).

SLIDE 4

Autoencoders & collaborative filtering

• Effects of the activation types
• Overfitting the data
• Going deeper
• Dropout
• Dense re-feeding
• Conclusions


SLIDE 5

Collaborative filtering

Rating prediction

[Figure: the rating matrix R (m users × n items), approximated as the product of two rank-r factor matrices (r hidden factors); most entries of R are unobserved.]

One of the most popular approaches: Alternating Least Squares (ALS)

R(i,j) = k iff user i gave item j rating k
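The ALS idea can be sketched on a toy rating matrix (NumPy sketch; variable names, hyperparameters and the toy data are illustrative, not from the paper):

```python
import numpy as np

def als(R, rank=2, n_iters=20, reg=0.1):
    """Alternating Least Squares on a masked rating matrix.
    R: (m, n) array with 0 marking unrated entries."""
    m, n = R.shape
    mask = R != 0
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, rank))  # user factors
    V = rng.normal(scale=0.1, size=(n, rank))  # item factors
    I = reg * np.eye(rank)
    for _ in range(n_iters):
        # Fix V: solve a small ridge regression per user over that user's rated items
        for i in range(m):
            Vi = V[mask[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + I, Vi.T @ R[i, mask[i]])
        # Fix U: same per item
        for j in range(n):
            Uj = U[mask[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + I, Uj.T @ R[mask[:, j], j])
    return U, V

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
U, V = als(R)
pred = U @ V.T  # dense predictions, including the unrated 0 cells
```

Each inner solve is a tiny ridge regression; the deck's contribution is replacing this bilinear model with a deep nonlinear autoencoder.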

SLIDE 6

Autoencoder

Deep learning tool of choice for dimensionality reduction

Encoder (2 layers):
e1 = g(W_e1 x + c1)
z = e2 = W_e2 e1 + c2    (z = encoder(x), the encoding)

Decoder (2 layers):
f1 = g(W_f1 z + c3)
y = f2 = W_f2 f1 + c4    (y = decoder(z), the reconstruction of x)

so y = decoder(encoder(x)).

An autoencoder can be thought of as a generalization of PCA. It is "constrained" if the decoder weights are the transpose of the encoder weights, and "de-noising" if noise is added to x.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
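A minimal forward-pass sketch of such a 2-layer-encoder / 2-layer-decoder autoencoder (NumPy; layer sizes, weight names and the ELU stand-in for g are illustrative; the authors' actual implementation is PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d1, d2 = 1000, 128, 32   # input and hidden dimensions (illustrative)

def g(v):
    # ELU as a stand-in for the activation g
    return np.where(v > 0, v, np.expm1(v))

# Encoder and decoder weights (small random init)
W_e1, c1 = rng.normal(scale=0.01, size=(d1, n_items)), np.zeros(d1)
W_e2, c2 = rng.normal(scale=0.01, size=(d2, d1)), np.zeros(d2)
W_f1, c3 = rng.normal(scale=0.01, size=(d1, d2)), np.zeros(d1)
W_f2, c4 = rng.normal(scale=0.01, size=(n_items, d1)), np.zeros(n_items)

def encoder(x):
    e1 = g(W_e1 @ x + c1)
    return W_e2 @ e1 + c2         # z, the low-dimensional code

def decoder(z):
    f1 = g(W_f1 @ z + c3)
    return W_f2 @ f1 + c4         # y, reconstruction of x

x = rng.normal(size=n_items)
y = decoder(encoder(x))           # y = decoder(encoder(x))
```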

SLIDE 7

AutoEncoders for recommendations

User (item) based

Sedhain, Suvash, et al. "Autorec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.

[Figure: a (very) sparse rating vector r is encoded into z and decoded into a dense prediction y.]

Masked Mean Squared Error
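The masked loss only scores items the user actually rated; a minimal sketch (NumPy; names and example values are illustrative):

```python
import numpy as np

def masked_mse(r, y):
    """Mean squared error over rated items only; r uses 0 for 'not rated'."""
    mask = r != 0
    return np.mean((r[mask] - y[mask]) ** 2)

r = np.array([0., 3., 0., 5., 1.])   # sparse ratings
y = np.array([2., 3., 4., 4., 1.])   # dense reconstruction
loss = masked_mse(r, y)              # only indices 1, 3, 4 contribute
```

The unrated positions (here indices 0 and 2) carry no gradient, so the model is never penalized for its predictions on items the user has not seen.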

SLIDE 8

Dataset

Time split to predict future ratings

Netflix prize training data set

Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
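A time-based split like this one (train on the past, evaluate on future ratings) can be sketched as follows; the tuple layout, dates and cutoff are illustrative, not the actual Netflix split:

```python
from datetime import date

# (user, item, rating, date) tuples; field layout is illustrative
ratings = [
    (1, 10, 5.0, date(2005, 3, 1)),
    (1, 11, 3.0, date(2005, 9, 15)),
    (2, 10, 4.0, date(2006, 1, 2)),
    (2, 12, 2.0, date(2006, 6, 30)),
]
cutoff = date(2005, 12, 31)

# Train on everything up to the cutoff, predict the future
train = [r for r in ratings if r[3] <= cutoff]
test = [r for r in ratings if r[3] > cutoff]
```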

SLIDE 9

Benchmark

Netflix prize training data set

RRN: Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
I-AR, U-AR: Sedhain, Suvash, et al. "Autorec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.
PMF: Mnih, Andriy, and Ruslan R. Salakhutdinov. "Probabilistic matrix factorization." Advances in neural information processing systems. 2008.

Masked mean squared error (s_j = actual rating, z_j = prediction; only rated items, s_j ≠ 0, contribute):

MMSE = Σ_{j: s_j ≠ 0} (s_j − z_j)² / Σ_{j: s_j ≠ 0} 1,    RMSE = √MMSE

SLIDE 10

Autoencoders & collaborative filtering

• Effects of the activation types
• Overfitting the data
• Going deeper
• Dropout
• Dense re-feeding
• Conclusions


SLIDE 11

Activation function matters

  • We found that on this task ELU, SELU and LeakyReLU perform much better than sigmoid, ReLU, ReLU6, tanh and Swish
  • Apparently important: (a) a non-zero negative part, (b) an unbounded positive part

[Figure: training RMSE per mini-batch (x-axis: iteration, y-axis: RMSE). All lines correspond to a 4-layer autoencoder (2-layer encoder, 2-layer decoder) with hidden dimension 128; line colors correspond to different activation functions.]
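The two properties can be read off the activation definitions themselves (NumPy sketch; the SELU constants are the standard ones from Klambauer et al., 2017):

```python
import numpy as np

def relu(x):
    # zero negative part, unbounded positive part
    return np.maximum(0.0, x)

def lrelu(x, a=0.01):
    # non-zero (linear) negative part, unbounded positive part
    return np.where(x > 0, x, a * x)

def elu(x, a=1.0):
    # smooth non-zero negative part saturating at -a, unbounded positive part
    return np.where(x > 0, x, a * np.expm1(x))

def selu(x):
    # scaled ELU; lambda and alpha chosen for self-normalization
    lam, a = 1.0507009873554805, 1.6732632423543772
    return lam * np.where(x > 0, x, a * np.expm1(x))
```

ReLU, ReLU6, sigmoid and tanh each violate at least one of the two properties (zero negative part, or a bounded positive part), matching the deck's observation that they train poorly here.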

SLIDE 12

Autoencoders & collaborative filtering

• Effects of the activation types
• Overfitting the data
• Going deeper
• Dropout
• Dense re-feeding
• Conclusions


SLIDE 13

Overfit your data

Wide layers generalize poorly

[Figure: training and evaluation RMSE per epoch for single-hidden-layer autoencoders (e1 = g(W_e1 x + c1), y = W_e2 e1 + c2) with hidden size d = 128, 256, 512, 1024. Wider layers fit the training data better, but evaluation RMSE stays above 1.1 on Netflix full.]

SLIDE 14

Autoencoders & collaborative filtering

• Effects of the activation types
• Overfitting the data
• Going deeper
• Dropout
• Dense re-feeding
• Conclusions


SLIDE 15

Deeper models

Generalize better

[Figure: deep autoencoder schematic.] No layer-wise pre-training necessary!

SLIDE 16

Autoencoders & collaborative filtering

• Effects of the activation types
• Overfitting the data
• Going deeper
• Dropout
• Dense re-feeding
• Conclusions


SLIDE 17

Dropout

Helps wider models generalize

[Figure: evaluation RMSE per epoch for drop probabilities 0.0, 0.5, 0.65 and 0.8, applied to the 1024-unit coding layer of a 512-512-1024-512-512 autoencoder.]
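Dropout at the coding layer can be sketched as inverted dropout (NumPy; names and sizes are illustrative):

```python
import numpy as np

def dropout(z, p_drop, rng, train=True):
    """Inverted dropout: zero units with probability p_drop,
    rescale survivors by 1/(1-p_drop) so the expectation is unchanged."""
    if not train or p_drop == 0.0:
        return z
    keep = rng.random(z.shape) >= p_drop
    return z * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
z = np.ones(10_000)               # stand-in for the coding-layer activations
zd = dropout(z, p_drop=0.8, rng=rng)
```

With p_drop = 0.8 only about 20% of coding-layer units survive each step, and the rescaling means no correction is needed at evaluation time (train=False simply returns z).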

SLIDE 18

Autoencoders & collaborative filtering

• Effects of the activation types
• Overfitting the data
• Going deeper
• Dropout
• Dense re-feeding
• Conclusions


SLIDE 19

Dense re-feeding

Intuition: idealized scenario

Imagine a perfect f: for every rated item, f(x)_j = x_j (∀ j: x_j ≠ 0), and if the user later rates a new item k with rating r, then f(x)_k = r. Such an f predicts all present and future ratings, so f(f(x)) = f(x); by induction, f(x) should be a fixed point of f for every valid x. Note that x is sparse but f(x) is dense, and for x most of the loss is masked.

SLIDE 20

Dense re-feeding

Attempt to enforce fixed point constraint

[Figure: (very) sparse x → dense f(x); dense f(x) → dense f(f(x)).]

• Update with real data x (masked loss on f(x))
• Update with synthetic dense data f(x) (loss on f(f(x)) against f(x))
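One way to sketch the two updates, using a single linear map as a stand-in for the autoencoder f (everything here, the model, learning rate, and hand-written gradients, is illustrative, not the authors' PyTorch implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.normal(scale=0.1, size=(n, n))   # stand-in single-layer "autoencoder"

def f(x):
    return W @ x

def masked_mse_grad(x_in, target, mask):
    """Gradient of the masked MSE w.r.t. W (target treated as constant)."""
    err = (f(x_in) - target) * mask
    return np.outer(err, x_in) * (2.0 / mask.sum())

def dense_refeed_step(x, lr=0.01):
    global W
    mask = (x != 0).astype(float)
    # 1) update on the real sparse example: loss masked to rated items
    W -= lr * masked_mse_grad(x, x, mask)
    # 2) re-feed: treat the dense output f(x) as a new, fully observed example
    y = f(x)
    W -= lr * masked_mse_grad(y, y, np.ones(n))

x = np.array([0., 3., 0., 5., 1., 0.])
for _ in range(200):
    dense_refeed_step(x)
```

The second update pushes f toward the fixed-point property f(f(x)) = f(x) from the previous slide, and, unlike the first, produces gradient signal at every output unit.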

SLIDE 21

Dense re-feeding

Together with bigger LR improves generalization

[Figure: evaluation RMSE per epoch for Baseline, Baseline LR 0.005, Baseline RF (dense re-feeding), and Baseline LR 0.005 + RF.]

SLIDE 22

Results

Netflix time split data

RRN: Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
I-AR, U-AR: Sedhain, Suvash, et al. "Autorec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.

DeepRec is our 6-layer model.

SLIDE 23

Conclusions

1. Autoencoders can replace ALS and be competitive with other methods
2. Deeper models generalize better
   • No layer-wise pre-training is necessary
3. Right activation function enables deep architectures
   • A non-zero negative part is important
   • So is an unbounded positive part
4. Heavy use of dropout is needed for wider models
5. Dense re-feeding further improves generalization

SLIDE 24

Oleksii Kuchaiev and Boris Ginsburg, "Training Deep AutoEncoders for Collaborative Filtering", arXiv preprint arXiv:1708.01715 (2017).

Code, docs and tutorial: https://github.com/NVIDIA/DeepRecommender