Training Deep AutoEncoders for Collaborative Filtering
Oleksii Kuchaiev & Boris Ginsburg
Motivation
Personalized recommendations
Key points (spoiler alert)
1. Deep autoencoder for collaborative filtering
   - Improves generalization
2. The right activation function (SELU, ELU, LeakyReLU) enables deep architectures
   - No layer-wise pre-training or skip connections needed
3. Heavy use of dropout
4. Dense re-feeding for faster and better training
5. Beats other models on time-split Netflix data (RMSE 0.9099 vs. 0.9224)
6. Code (PyTorch-based): https://github.com/NVIDIA/DeepRecommender
Oleksii Kuchaiev and Boris Ginsburg, "Training Deep AutoEncoders for Collaborative Filtering", arXiv preprint arXiv:1708.01715 (2017).
Agenda
- Autoencoders & collaborative filtering
- Effects of the activation types
- Overfitting the data
- Going deeper
- Dropout
- Dense re-feeding
- Conclusions
Collaborative filtering
Rating prediction
R is the m × n rating matrix: R(i,j) = k iff user i gave item j rating k. Most entries are unknown, so R is very sparse.
Matrix factorization approximates R ≈ U × V, where U is m × r, V is r × n, and r is the number of hidden factors.
One of the most popular approaches: Alternating Least Squares (ALS).
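To make the factorization concrete, here is a minimal PyTorch sketch of the R ≈ U × V objective on a made-up toy matrix. It minimizes the squared error over observed entries with gradient descent rather than ALS, purely for illustration; the data and hyperparameters are invented.

```python
import torch

# Toy rating matrix; 0 means "unrated" (made-up example data).
R = torch.tensor([[3., 0., 5.],
                  [0., 1., 3.],
                  [4., 5., 0.]])
mask = (R != 0).float()

m, n, r = R.shape[0], R.shape[1], 2   # r hidden factors
U = torch.randn(m, r, requires_grad=True)
V = torch.randn(r, n, requires_grad=True)

opt = torch.optim.Adam([U, V], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    # Squared error over observed (non-zero) entries only.
    loss = (((U @ V - R) * mask) ** 2).sum() / mask.sum()
    loss.backward()
    opt.step()

print(U @ V)  # dense predictions, including the previously unrated entries
```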
Autoencoder
Deep learning tool of choice for dimensionality reduction
Encoder: e₁ = g(W₁ᵉ x + b₁), z = e₂ = g(W₂ᵉ e₁ + b₂)
Decoder: f₁ = g(W₁ᵈ z + c₁), y = f₂ = W₂ᵈ f₁ + c₂

z = encoder(x) is the encoding; y = decoder(z) = decoder(encoder(x)) is the reconstruction of x.

An autoencoder can be thought of as a generalization of PCA. It is "constrained" if the decoder weights are the transpose of the encoder weights, and "de-noising" if noise is added to x.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
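For concreteness, a minimal PyTorch sketch of this two-layer encoder / two-layer decoder. The layer sizes, the SELU choice, and the linear output layer are illustrative assumptions, not the paper's exact configuration:

```python
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_items, hidden=512, latent=128):
        super().__init__()
        # Encoder: x -> e1 -> z
        self.encoder = nn.Sequential(
            nn.Linear(n_items, hidden), nn.SELU(),
            nn.Linear(hidden, latent), nn.SELU(),
        )
        # Decoder: z -> f1 -> y (output kept linear here, since ratings are scores)
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.SELU(),
            nn.Linear(hidden, n_items),
        )

    def forward(self, x):
        z = self.encoder(x)       # encoding
        return self.decoder(z)    # reconstruction y
```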
AutoEncoders for recommendations
User-based (or item-based)
Sedhain, Suvash, et al. "Autorec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.
[Figure: a (very) sparse rating vector r is encoded into z and decoded into a dense reconstruction y.]
Masked Mean Squared Error
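Only entries the user actually rated contribute to the loss. A sketch of such a masked MSE in PyTorch (the function and tensor names are mine):

```python
import torch

def masked_mse(pred, target):
    # Only rated entries (non-zero targets) contribute to the loss.
    mask = (target != 0).float()
    num_rated = mask.sum().clamp(min=1.0)   # avoid division by zero
    return (((pred - target) * mask) ** 2).sum() / num_rated
```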
Dataset
Time split to predict future ratings
Netflix Prize training data set
Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
Benchmark
Netflix Prize training data set
RRN: Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
I-AR, U-AR: Sedhain, Suvash, et al. "Autorec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.
PMF: Mnih, Andriy, and Ruslan R. Salakhutdinov. "Probabilistic matrix factorization." Advances in Neural Information Processing Systems. 2008.
RMSE = √( Σ_{s_j ≠ 0} (s_j − z_j)² / Σ_{s_j ≠ 0} 1 )

where s_j are the actual ratings, z_j the predictions, and both sums run over rated entries only.
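This is just the square root of the masked MSE above, so with the earlier sketch the evaluation metric is one line:

```python
rmse = masked_mse(pred, target).sqrt()   # error over rated entries only
```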
Activation function matters
- We found that on this task ELU, SELU, and LeakyReLU perform much better than sigmoid, ReLU, ReLU6, tanh, and Swish.
- Apparently important: (a) a non-zero negative part and (b) an unbounded positive part.
[Figure: training RMSE per mini-batch (x-axis: iteration, y-axis: RMSE). All lines correspond to a 4-layer autoencoder (2-layer encoder, 2-layer decoder) with hidden dimension 128; line colors correspond to different activation functions.]
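Reproducing this comparison amounts to swapping one module per layer; the mapping below is my own helper, not the paper's code (nn.SiLU is PyTorch's name for the Swish function):

```python
import torch.nn as nn

# Hypothetical name-to-module helper for the activations compared above.
ACTIVATIONS = {
    "selu": nn.SELU,
    "elu": nn.ELU,
    "lrelu": nn.LeakyReLU,
    "relu": nn.ReLU,
    "relu6": nn.ReLU6,
    "tanh": nn.Tanh,
    "sigmoid": nn.Sigmoid,
    "swish": nn.SiLU,   # SiLU(x) = x * sigmoid(x), i.e. Swish
}

def make_layer(n_in, n_out, act="selu"):
    return nn.Sequential(nn.Linear(n_in, n_out), ACTIVATIONS[act]())
```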
Overfit your data
Wide layers generalize poorly
[Figures: training and evaluation RMSE vs. epoch for shallow autoencoders with hidden dimension d ∈ {128, 256, 512, 1024}.]
Evaluation RMSE > 1.1 on Netflix Full.
Deeper models
Generalize better
[Diagram: stacking additional encoder/decoder layers.]
No layer-wise pre-training necessary!
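A sketch of building such deeper stacks from a list of layer sizes (a generic constructor of my own, not the DeepRecommender code; the sizes in the example are illustrative):

```python
import torch.nn as nn

def build_autoencoder(sizes, act=nn.SELU):
    """sizes = [n_items, h1, ..., latent]; the decoder mirrors the encoder."""
    enc = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        enc += [nn.Linear(n_in, n_out), act()]
    rev = sizes[::-1]
    dec = []
    for i, (n_in, n_out) in enumerate(zip(rev, rev[1:])):
        dec.append(nn.Linear(n_in, n_out))
        if i < len(rev) - 2:              # keep the final output layer linear
            dec.append(act())
    return nn.Sequential(*enc), nn.Sequential(*dec)

n_items = 17_770   # number of movies in the Netflix Prize data
# A 6-layer model: 3-layer encoder + mirrored 3-layer decoder.
encoder, decoder = build_autoencoder([n_items, 512, 512, 1024])
```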
Dropout
Helps wider models generalize
[Figure: evaluation RMSE vs. epoch for drop probabilities 0.0, 0.5, 0.65, and 0.8. Architecture: 512-512-1024 encoder with dropout applied to the 1024-dimensional encoding, followed by a 512-512 decoder.]
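As the diagram suggests, dropout sits on the bottleneck encoding rather than after every layer; a hedged sketch of that placement (layer sizes from the slide, the rest is my reading):

```python
import torch.nn as nn

class DropoutAutoEncoder(nn.Module):
    def __init__(self, n_items, drop_prob=0.8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_items, 512), nn.SELU(),
            nn.Linear(512, 512), nn.SELU(),
            nn.Linear(512, 1024), nn.SELU(),
        )
        self.drop = nn.Dropout(p=drop_prob)   # heavy dropout on the encoding only
        self.decoder = nn.Sequential(
            nn.Linear(1024, 512), nn.SELU(),
            nn.Linear(512, 512), nn.SELU(),
            nn.Linear(512, n_items),
        )

    def forward(self, x):
        return self.decoder(self.drop(self.encoder(x)))
```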
Dense re-feeding
Intuition: idealized scenario
Imagine perfect f
Suppose f is perfect: f(x)ⱼ = xⱼ for every rated item (xⱼ ≠ 0), and f(x) correctly predicts all of the user's future ratings. If the user later rates a new item k with rating r, creating a new sparse vector x′, then f(x)ₖ = r and f(x) = f(x′). By induction, f(f(x)) = f(x).
Note that x is sparse but f(x) is dense, and for x most of the loss is masked. Thus f(x) should be a fixed point of f for every valid x.
Dense re-feeding
An attempt to enforce the fixed-point constraint
1. Given a (very) sparse x, compute the dense f(x) and update the weights using the masked loss against the real data x.
2. Treat f(x) as a new, fully dense example: compute f(f(x)) and update again, this time against the synthetic data f(x).
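A sketch of one such training step, reusing the masked_mse helper from above (this is my reading of the procedure, not the reference implementation; in particular, the first pass's output serves as both input and target for the second update):

```python
def train_step(model, opt, x):
    # Pass 1: regular update on the sparse rating vector x (masked loss).
    opt.zero_grad()
    y = model(x)
    loss = masked_mse(y, x)
    loss.backward()
    opt.step()

    # Pass 2 (dense re-feeding): feed the dense output y back in as both the
    # new input and the new target. Since y is dense, (almost) nothing is
    # masked out, pushing f toward the fixed point f(y) = y.
    y = y.detach()
    opt.zero_grad()
    loss_rf = masked_mse(model(y), y)
    loss_rf.backward()
    opt.step()
    return loss.item(), loss_rf.item()
```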
Dense re-feeding
Combined with a bigger learning rate, it improves generalization
[Figure: evaluation RMSE vs. epoch for four variants: Baseline, Baseline LR 0.005, Baseline + RF (re-feeding), and Baseline LR 0.005 + RF.]
Results
Netflix time-split data
[Table: test RMSE by model. DeepRec, our 6-layer model, achieves 0.9099 vs. 0.9224 for the best baseline.]
RRN: Wu, Chao-Yuan, et al. "Recurrent recommender networks." Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017.
I-AR, U-AR: Sedhain, Suvash, et al. "Autorec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.
Conclusions
1. Autoencoders can replace ALS and are competitive with other methods.
2. Deeper models generalize better.
   - No layer-wise pre-training is necessary.
3. The right activation function enables deep architectures.
   - Important: a non-zero negative part and an unbounded positive part.
4. Heavy use of dropout is needed for wider models.
5. Dense re-feeding further improves generalization.
Oleksii Kuchaiev and Boris Ginsburg, "Training Deep AutoEncoders for Collaborative Filtering", arXiv preprint arXiv:1708.01715 (2017).