Midterm Review + Ensemble Methods + Recommender Systems
Matt Gormley, 10-601 Introduction to Machine Learning, Lecture 21 (Nov. 4, 2019)


slide-1
SLIDE 1

Midterm Review +

Ensemble Methods + Recommender Systems

1

10-601 Introduction to Machine Learning

Matt Gormley Lecture 21

  • Nov. 4, 2019

Machine Learning Department School of Computer Science Carnegie Mellon University

slide-2
SLIDE 2

Reminders

  • Homework 6: Information Theory / Generative Models
– Out: Fri, Oct. 25
– Due: Fri, Nov. 8 at 11:59pm

  • Midterm Exam 2
– Thu, Nov. 14, 6:30pm – 8:00pm
– more details announced on Piazza

  • Homework 7: HMMs
– Out: Fri, Nov. 8
– Due: Sun, Nov. 24 at 11:59pm

  • Today’s In-Class Poll
– http://p21.mlcourse.org

2

slide-3
SLIDE 3

MIDTERM EXAM LOGISTICS

3

slide-4
SLIDE 4

Midterm Exam

  • Time / Location

– Time: Evening Exam, Thu, Nov. 14, 6:30pm – 8:00pm
– Room: We will contact each student individually with your room assignment. The rooms are not based on section.
– Seats: There will be assigned seats. Please arrive early.
– Please watch Piazza carefully for announcements regarding room / seat assignments.

  • Logistics

– Covered material: Lecture 9 – Lecture 19 (95%), Lecture 1 – 8 (5%)
– Format of questions:

  • Multiple choice
  • True / False (with justification)
  • Derivations
  • Short answers
  • Interpreting figures
  • Implementing algorithms on paper

– No electronic devices
– You are allowed to bring one 8½ x 11 sheet of notes (front and back)

4

slide-5
SLIDE 5

Midterm Exam

  • How to Prepare

– Attend the midterm review lecture (right now!)
– Review prior year’s exam and solutions (we’ll post them)
– Review this year’s homework problems
– Consider whether you have achieved the “learning objectives” for each lecture / section

5

slide-6
SLIDE 6

Midterm Exam

  • Advice (for during the exam)

– Solve the easy problems first (e.g. multiple choice before derivations)

  • if a problem seems extremely complicated, you’re likely missing something

– Don’t leave any answer blank!
– If you make an assumption, write it down
– If you look at a question and don’t know the answer:

  • we probably haven’t told you the answer
  • but we’ve told you enough to work it out
  • imagine arguing for some answer and see if you like it

6

slide-7
SLIDE 7

Topics for Midterm 1

  • Foundations

– Probability, Linear Algebra, Geometry, Calculus
– Optimization

  • Important Concepts

– Overfitting
– Experimental Design

  • Classification

– Decision Tree
– KNN
– Perceptron

  • Regression

– Linear Regression

7

slide-8
SLIDE 8

Topics for Midterm 2

  • Classification

– Binary Logistic Regression
– Multinomial Logistic Regression

  • Important Concepts

– Regularization
– Feature Engineering

  • Feature Learning

– Neural Networks
– Basic NN Architectures
– Backpropagation

  • Reinforcement Learning

– Value Iteration
– Policy Iteration
– Q-Learning
– Deep Q-Learning

  • Learning Theory

– Information Theory

8

slide-9
SLIDE 9

SAMPLE QUESTIONS

9

slide-10
SLIDE 10

Sample Questions

10

3.2 Logistic regression

Given a training set {(x_i, y_i), i = 1, …, n} where x_i ∈ R^d is a feature vector and y_i ∈ {0, 1} is a binary label, we want to find the parameters ŵ that maximize the likelihood for the training set, assuming a parametric model of the form

    p(y = 1 | x; w) = 1 / (1 + exp(−wᵀx)).

The conditional log likelihood of the training set is

    ℓ(w) = Σ_{i=1}^n [ y_i log p(y_i | x_i; w) + (1 − y_i) log(1 − p(y_i | x_i; w)) ],

and the gradient is

    ∇ℓ(w) = Σ_{i=1}^n (y_i − p(y_i = 1 | x_i; w)) x_i.

(b) [5 pts.] What is the form of the classifier output by logistic regression?

(c) [2 pts.] Extra Credit: Consider the case with binary features, i.e., x ∈ {0, 1}^d ⊂ R^d, where feature x_1 is rare and happens to appear in the training set with only label 1. What is ŵ_1? Is the gradient ever zero for any finite w? Why is it important to include a regularization term to control the norm of ŵ?
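The stated gradient is easy to sanity-check numerically. A minimal sketch on synthetic data (not the exam's data; the dimensions and random seed are arbitrary choices), comparing the analytic gradient to a central finite difference of ℓ(w):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # Conditional log likelihood l(w) from the problem statement.
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    # grad l(w) = sum_i (y_i - p(y=1 | x_i; w)) x_i
    return X.T @ (y - sigmoid(X @ w))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)

# Central finite-difference check, coordinate by coordinate.
eps = 1e-6
num = np.array([(log_likelihood(w + eps * e, X, y)
                 - log_likelihood(w - eps * e, X, y)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(num, gradient(w, X, y), atol=1e-4)
```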

slide-11
SLIDE 11

Sample Questions

11

2.1 Train and test errors

In this problem, we will see how you can debug a classifier by looking at its train and test errors. Consider a classifier trained till convergence on some training data Dtrain, and tested on a separate test set Dtest. You look at the test error, and find that it is very high. You then compute the training error and find that it is close to 0.

  • 1. [4 pts] Which of the following is expected to help? Select all that apply.

(a) Increase the training data size.
(b) Decrease the training data size.
(c) Increase model complexity (for example, if your classifier is an SVM, use a more complex kernel; or if it is a decision tree, increase the depth).
(d) Decrease model complexity.
(e) Train on a combination of Dtrain and Dtest and test on Dtest.
(f) Conclude that Machine Learning does not work.

slide-12
SLIDE 12

Sample Questions

12

2.1 Train and test errors

In this problem, we will see how you can debug a classifier by looking at its train and test errors. Consider a classifier trained till convergence on some training data Dtrain, and tested on a separate test set Dtest. You look at the test error, and find that it is very high. You then compute the training error and find that it is close to 0.

  • 4. [1 pt] Say you plot the train and test errors as a function of the model complexity. Which of the following two plots is your plot expected to look like?

(a) (b)

slide-13
SLIDE 13

Sample Questions

14

[Figure (b): a two-layer neural network with inputs x1, x2, hidden units h1, h2, and output y; weights w11, w21, w12, w22 from the inputs to the hidden units and w31, w32 from the hidden units to the output. Figure (a): a 2-D dataset over x1, x2 (both axes 1–5) with groups S1, S2, and S3.]

Can the neural network in Figure (b) correctly classify the dataset given in Figure (a)?

Neural Networks

slide-14
SLIDE 14

Sample Questions

15

[Figure (b): the same two-layer neural network architecture, with weights w11, w21, w12, w22 into hidden units h1, h2 and w31, w32 into output y.]

Apply the backpropagation algorithm to obtain the partial derivative of the mean-squared error of y with the true value y* with respect to the weight w22, assuming a sigmoid nonlinear activation function for the hidden layer.

Neural Networks
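A numeric sketch of this question, under an assumed wiring of the figure (h2 = sigmoid(w12·x1 + w22·x2) and a linear output y = w31·h1 + w32·h2; the slide's diagram may assign the weight subscripts differently). It checks the chain-rule derivative of E = (y − y*)² with respect to w22 against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, x):
    # Hypothetical wiring: h_j = sigmoid(w1j*x1 + w2j*x2), y = w31*h1 + w32*h2.
    h1 = sigmoid(w["w11"] * x[0] + w["w21"] * x[1])
    h2 = sigmoid(w["w12"] * x[0] + w["w22"] * x[1])
    return w["w31"] * h1 + w["w32"] * h2

def dE_dw22(w, x, y_star):
    # Chain rule: E = (y - y*)^2, y depends on w22 only through h2,
    # and sigmoid'(a) = h2 * (1 - h2).
    h2 = sigmoid(w["w12"] * x[0] + w["w22"] * x[1])
    y = forward(w, x)
    return 2 * (y - y_star) * w["w32"] * h2 * (1 - h2) * x[1]

w = dict(w11=0.1, w21=-0.2, w12=0.3, w22=0.4, w31=0.5, w32=-0.6)
x, y_star = (1.0, 2.0), 1.0

# Finite-difference check on w22 only.
eps = 1e-6
wp, wm = dict(w, w22=w["w22"] + eps), dict(w, w22=w["w22"] - eps)
num = ((forward(wp, x) - y_star) ** 2 - (forward(wm, x) - y_star) ** 2) / (2 * eps)
assert np.isclose(num, dE_dw22(w, x, y_star))
```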

slide-15
SLIDE 15

Sample Questions

16

7.1 Reinforcement Learning

  • 4. (1 point) True or False: Value iteration is better at balancing exploration and exploitation compared with policy iteration.  True / False

  • 3. (1 point) Please select one statement that is true for reinforcement learning and supervised learning.
– Reinforcement learning is a kind of supervised learning problem because you can treat the reward and next state as the label and each state, action pair as the training data.
– Reinforcement learning differs from supervised learning because it has a temporal structure in the learning process, whereas, in supervised learning, the prediction of a data point does not affect the data you would see in the future.

slide-16
SLIDE 16

Sample Questions

17

7.1 Reinforcement Learning

[Figure: a grid MDP with the R(s,a) values 2, 2, 4, 4, 8, 4, 8 shown on the arrows.]

  • 1. For the R(s,a) values shown on the arrows below, what is the corresponding optimal policy? Assume the discount factor is 0.1.

  • 2. For the R(s,a) values shown on the arrows below, which are the corresponding V*(s) values? Assume the discount factor is 0.1.

  • 3. For the R(s,a) values shown on the arrows below, which are the corresponding Q*(s,a) values? Assume the discount factor is 0.1.

slide-17
SLIDE 17

Example: Robot Localization

18

Figure from Tom Mitchell


slide-18
SLIDE 18

ML Big Picture

24

Learning Paradigms: What data is available and when? What form of prediction?

  • supervised learning
  • unsupervised learning
  • semi-supervised learning
  • reinforcement learning
  • active learning
  • imitation learning
  • domain adaptation
  • online learning
  • density estimation
  • recommender systems
  • feature learning
  • manifold learning
  • dimensionality reduction
  • ensemble learning
  • distant supervision
  • hyperparameter optimization

Problem Formulation: What is the structure of our output prediction?

  • boolean: Binary Classification
  • categorical: Multiclass Classification
  • ordinal: Ordinal Classification
  • real: Regression
  • ordering: Ranking
  • multiple discrete: Structured Prediction
  • multiple continuous (e.g. dynamical systems)
  • both discrete & cont. (e.g. mixed graphical models)

Theoretical Foundations: What principles guide learning?

  • probabilistic
  • information theoretic
  • evolutionary search
  • ML as optimization

Facets of Building ML Systems: How to build systems that are robust, efficient, adaptive, effective?

1. Data prep
2. Model selection
3. Training (optimization / search)
4. Hyperparameter tuning on validation data
5. (Blind) Assessment on test data

Big Ideas in ML: Which are the ideas driving development of the field?

  • inductive bias
  • generalization / overfitting
  • bias-variance decomposition
  • generative vs. discriminative
  • deep nets, graphical models
  • PAC learning
  • distant rewards

Application Areas (key challenges?): NLP, Speech, Computer Vision, Robotics, Medicine, Search

slide-19
SLIDE 19

Outline for Today

We’ll talk about two distinct topics:

  • 1. Ensemble Methods: combine or learn multiple classifiers into one (i.e. a family of algorithms)

  • 2. Recommender Systems: produce recommendations of what a user will like (i.e. the solution to a particular type of task)

We’ll use a prominent example of a recommender system (the Netflix Prize) to motivate both topics…

25

slide-20
SLIDE 20

RECOMMENDER SYSTEMS

26

slide-21
SLIDE 21

Recommender Systems

A Common Challenge:

– Assume you’re a company selling items of some sort: movies, songs, products, etc.
– The company collects millions of ratings from users of their items
– To maximize profit / user happiness, you want to recommend items that users are likely to want

27

slide-22
SLIDE 22

Recommender Systems

28

slide-23
SLIDE 23

Recommender Systems

29

slide-24
SLIDE 24

Recommender Systems

30

slide-25
SLIDE 25

Recommender Systems

31

Problem Setup

  • 500,000 users
  • 20,000 movies
  • 100 million ratings
  • Goal: To obtain lower root mean squared error (RMSE) than Netflix’s existing system on 3 million held-out ratings
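The evaluation metric named above is a one-liner; a minimal sketch (the rating values here are made up for illustration):

```python
import math

def rmse(predictions, ratings):
    """Root mean squared error over held-out (prediction, rating) pairs."""
    se = sum((p - r) ** 2 for p, r in zip(predictions, ratings))
    return math.sqrt(se / len(ratings))

# e.g. predicting 3 and 4 stars when the true ratings were 4 and 2
assert abs(rmse([3.0, 4.0], [4.0, 2.0]) - math.sqrt(2.5)) < 1e-12
```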

slide-26
SLIDE 26

ENSEMBLE METHODS

32

slide-27
SLIDE 27

Recommender Systems

33

Top performing systems were ensembles

slide-28
SLIDE 28

Weighted Majority Algorithm

  • Given: pool A of binary classifiers (that you know nothing about)

  • Data: stream of examples (i.e. online learning setting)

  • Goal: design a new learner that uses the predictions of the pool to make new predictions

  • Algorithm:

– Initially weight all classifiers equally
– Receive a training example and predict the (weighted) majority vote of the classifiers in the pool
– Down-weight classifiers that contribute to a mistake by a factor of β

34

(Littlestone & Warmuth, 1994)
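The three steps above can be sketched directly in Python (labels in {−1, +1}; β = 0.5 is an arbitrary choice, and ties are broken toward +1):

```python
def weighted_majority(pool_predictions, labels, beta=0.5):
    """Weighted Majority Algorithm (Littlestone & Warmuth, 1994) sketch.

    pool_predictions: one list per example, holding each pool classifier's
    {-1,+1} prediction; labels: true {-1,+1} labels, revealed after each
    prediction (online setting).
    """
    n = len(pool_predictions[0])
    w = [1.0] * n                       # initially weight all classifiers equally
    mistakes = 0
    for preds, y in zip(pool_predictions, labels):
        vote = sum(wi * p for wi, p in zip(w, preds))
        y_hat = 1 if vote >= 0 else -1  # (weighted) majority vote
        mistakes += (y_hat != y)
        # down-weight every classifier that voted for the wrong label
        w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return mistakes, w
```

For example, a pool of one always-correct and one always-wrong classifier: `weighted_majority([[1, -1]] * 3, [1, 1, 1])` returns 0 mistakes and weights `[1.0, 0.125]`.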

slide-29
SLIDE 29

Weighted Majority Algorithm

36

(Littlestone & Warmuth, 1994)

slide-30
SLIDE 30

Weighted Majority Algorithm

37

(Littlestone & Warmuth, 1994)

This is a “mistake bound” of the variety we saw for the Perceptron algorithm

slide-31
SLIDE 31

ADABOOST

38

slide-32
SLIDE 32

Comparison

Weighted Majority Algorithm

  • an example of an ensemble method
  • assumes the classifiers are learned ahead of time
  • only learns the (majority vote) weight for each classifier

AdaBoost

  • an example of a boosting method
  • simultaneously learns:
– the classifiers themselves
– the (majority vote) weight for each classifier

39

slide-33
SLIDE 33

D1

weak classifiers = vertical or horizontal half-planes

AdaBoost: Toy Example

42

Slide from Schapire NIPS Tutorial

slide-34
SLIDE 34

h1: ε1 = 0.30, α1 = 0.42 → D2

AdaBoost: Toy Example

43

Slide from Schapire NIPS Tutorial

slide-35
SLIDE 35

h2: ε2 = 0.21, α2 = 0.65 → D3

AdaBoost: Toy Example

44

Slide from Schapire NIPS Tutorial

slide-36
SLIDE 36

h3: ε3 = 0.14, α3 = 0.92

AdaBoost: Toy Example

45

Slide from Schapire NIPS Tutorial

slide-37
SLIDE 37

H_final = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )

AdaBoost: Toy Example

46

Slide from Schapire NIPS Tutorial

slide-38
SLIDE 38

AdaBoost

47

Given: (x1, y1), …, (xm, ym) where xi ∈ X, yi ∈ {−1, +1}.
Initialize D1(i) = 1/m.
For t = 1, …, T:
  Train weak learner using distribution Dt.
  Get weak hypothesis ht : X → {−1, +1} with error εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ].
  Choose αt = ½ ln( (1 − εt) / εt ).
  Update:
    D_{t+1}(i) = (Dt(i) / Zt) · e^{−αt}  if ht(xi) = yi
    D_{t+1}(i) = (Dt(i) / Zt) · e^{αt}   if ht(xi) ≠ yi
  i.e. D_{t+1}(i) = Dt(i) exp(−αt yi ht(xi)) / Zt,
  where Zt is a normalization factor (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis: H(x) = sign( Σ_{t=1}^T αt ht(x) ).

Algorithm from (Freund & Schapire, 1999)
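A compact sketch of the algorithm above, using decision-stump weak learners (the stump learner is an illustrative choice, not part of the original pseudocode):

```python
import numpy as np

def adaboost(X, y, T=10):
    """AdaBoost (Freund & Schapire, 1999) sketch; X is (m, d), y in {-1, +1}."""
    m, d = X.shape
    D = np.full(m, 1.0 / m)                # D_1(i) = 1/m
    hyps, alphas = [], []
    for _ in range(T):
        # "weak learner": exhaustively pick the stump with lowest weighted error
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] >= thr, 1, -1)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        eps, j, thr, sign, pred = best
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # alpha_t
        D = D * np.exp(-alpha * y * pred)   # reweight examples
        D /= D.sum()                        # Z_t normalization
        hyps.append((j, thr, sign))
        alphas.append(alpha)

    def H(Xq):                              # final hypothesis: sign of weighted vote
        votes = sum(a * s * np.where(Xq[:, j] >= t, 1, -1)
                    for a, (j, t, s) in zip(alphas, hyps))
        return np.sign(votes)
    return H
```

On a trivially separable 1-D dataset (e.g. `X = [[0],[1],[2],[3]]`, `y = [-1,-1,1,1]`), the returned `H` classifies the training set perfectly.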

slide-39
SLIDE 39

AdaBoost

49

Figure from (Freund & Schapire, 1999)

Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset, as reported by Schapire et al. [41]. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting. The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: the cumulative distribution of margins of the training examples after 5, 100, and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden), and solid curves, respectively.

slide-40
SLIDE 40

Learning Objectives

Ensemble Methods / Boosting. You should be able to…

  • 1. Implement the Weighted Majority Algorithm
  • 2. Implement AdaBoost
  • 3. Distinguish what is learned in the Weighted Majority Algorithm vs. AdaBoost
  • 4. Contrast the theoretical result for the Weighted Majority Algorithm to that of Perceptron
  • 5. Explain a surprisingly common empirical result regarding AdaBoost train/test curves

50

slide-41
SLIDE 41

Outline

  • Recommender Systems

– Content Filtering
– Collaborative Filtering (CF)
– CF: Neighborhood Methods
– CF: Latent Factor Methods

  • Matrix Factorization

– Background: Low-rank Factorizations
– Residual matrix
– Unconstrained Matrix Factorization

  • Optimization problem
  • Gradient Descent, SGD, Alternating Least Squares
  • User/item bias terms (matrix trick)

– Singular Value Decomposition (SVD)
– Non-negative Matrix Factorization

51

slide-42
SLIDE 42

RECOMMENDER SYSTEMS

52

slide-43
SLIDE 43

Recommender Systems

57

Problem Setup

  • 500,000 users
  • 20,000 movies
  • 100 million ratings
  • Goal: To obtain lower root mean squared error (RMSE) than Netflix’s existing system on 3 million held-out ratings

slide-44
SLIDE 44

Recommender Systems

58

slide-45
SLIDE 45

Recommender Systems

  • Setup:

– Items: movies, songs, products, etc. (often many thousands)
– Users: watchers, listeners, purchasers, etc. (often many millions)
– Feedback: 5-star ratings, not clicking ‘next’, purchases, etc.

  • Key Assumptions:

– Can represent ratings numerically as a user/item matrix
– Users only rate a small number of items (the matrix is sparse)

59

(Example user/item matrix; blank = unrated)

           Doctor Strange   Star Trek: Beyond   Zootopia
Alice            1                                  5
Bob              3                  4
Charlie          3                  5                2

slide-46
SLIDE 46

Two Types of Recommender Systems

Content Filtering

  • Example: Pandora.com music recommendations (Music Genome Project)
  • Con: Assumes access to side information about items (e.g. properties of a song)
  • Pro: Got a new item to add? No problem, just be sure to include the side information

Collaborative Filtering

  • Example: Netflix movie recommendations
  • Pro: Does not assume access to side information about items (e.g. does not need to know about movie genres)
  • Con: Does not work on new items that have no ratings

60

slide-47
SLIDE 47

COLLABORATIVE FILTERING

62

slide-48
SLIDE 48

Collaborative Filtering

  • Everyday Examples of Collaborative Filtering...

– Bestseller lists
– Top 40 music lists
– The “recent returns” shelf at the library
– Unmarked but well-used paths thru the woods
– The printer room at work
– “Read any good books lately?”
– …

  • Common insight: personal tastes are correlated

– If Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y
– especially (perhaps) if Bob knows Alice

63

Slide from William Cohen

slide-49
SLIDE 49

Two Types of Collaborative Filtering

  • 1. Neighborhood Methods
  • 2. Latent Factor Methods

64

Figures from Koren et al. (2009)

slide-50
SLIDE 50

Two Types of Collaborative Filtering

  • 1. Neighborhood Methods

65

In the figure, assume that a green line indicates the movie was watched.

Algorithm:
1. Find neighbors based on similarity of movie preferences
2. Recommend movies that those neighbors watched

Figures from Koren et al. (2009)
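The two-step algorithm above can be sketched over a binary watched matrix. This uses raw co-watch counts as the similarity for brevity; a real system would normalize (e.g. cosine similarity), and the matrix below is made up for illustration:

```python
import numpy as np

def recommend(watched, user, k=2, n_rec=3):
    """Neighborhood-method sketch: `watched` is a binary user x movie
    matrix (1 = watched, like the green lines in the figure)."""
    sims = watched @ watched[user]           # co-watch counts as similarity
    sims[user] = -1                          # exclude the user themself
    neighbors = np.argsort(sims)[::-1][:k]   # step 1: k most similar users
    scores = watched[neighbors].sum(axis=0)  # step 2: what the neighbors watched
    scores[watched[user] == 1] = -1          # don't re-recommend seen movies
    return np.argsort(scores)[::-1][:n_rec]

watched = np.array([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 0, 1, 1]])
print(recommend(watched, user=0, k=2, n_rec=2))  # movies 2 and 3 for user 0
```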

slide-51
SLIDE 51

Two Types of Collaborative Filtering

  • 2. Latent Factor Methods

66

Figures from Koren et al. (2009)

  • Assume that both movies and users live in some low-dimensional space describing their properties

  • Recommend a movie based on its proximity to the user in the latent space

  • Example Algorithm: Matrix Factorization

slide-52
SLIDE 52

MATRIX FACTORIZATION

67

slide-53
SLIDE 53

Recommending Movies

Question: Applied to the Netflix Prize problem, which of the following methods always requires side information about the users and movies? Select all that apply.

A. collaborative filtering
B. latent factor methods
C. ensemble methods
D. content filtering
E. neighborhood methods
F. recommender systems

68

Answer:

slide-54
SLIDE 54

Matrix Factorization

  • Many different ways of factorizing a matrix
  • We’ll consider three:

1. Unconstrained Matrix Factorization
2. Singular Value Decomposition
3. Non-negative Matrix Factorization

  • MF is just another example of a common recipe:

1. define a model
2. define an objective function
3. optimize with SGD

69

slide-55
SLIDE 55

Matrix Factorization

Whiteboard

– Background: Low-rank Factorizations – Residual matrix

71

slide-56
SLIDE 56

Example: MF for Netflix Problem

72

Figures from Aggarwal (2016)

[Figure: (a) an example rank-2 factorization R ≈ UVᵀ of a 7 × 6 ratings matrix over the movies Nero, Julius Caesar, Cleopatra, Sleepless in Seattle, Pretty Woman, and Casablanca, with latent factors History and Romance; (b) the residual matrix E = R − UVᵀ.]

slide-57
SLIDE 57

Regression vs. Collaborative Filtering

73

[Figure: in regression, the data matrix has a demarcation between training rows and test rows, and between independent variables and the dependent variable; in collaborative filtering, there is no demarcation between training and test rows, nor between dependent and independent variables.]

Figures from Aggarwal (2016)

Regression Collaborative Filtering

slide-58
SLIDE 58

UNCONSTRAINED MATRIX FACTORIZATION

74

slide-59
SLIDE 59

Unconstrained Matrix Factorization

Whiteboard

– Optimization problem – SGD – SGD with Regularization – Alternating Least Squares – User/item bias terms (matrix trick)

75

slide-60
SLIDE 60

Unconstrained Matrix Factorization

SGD for UMF:

76

slide-61
SLIDE 61

Unconstrained Matrix Factorization

SGD for UMF:

77

slide-62
SLIDE 62

Unconstrained Matrix Factorization

Alternating Least Squares (ALS) for UMF:

78
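The ALS scheme named above can be sketched as follows: holding one factor fixed, each row of the other factor is a small ridge-regression solve over that user's (or item's) observed entries. This is a minimal implementation under assumed notation V ≈ WH with penalty λ, not the whiteboard's exact derivation:

```python
import numpy as np

def als(V, mask, r=2, lam=0.1, iters=20):
    """Alternating Least Squares sketch for V ~ W H, with observed
    entries given by `mask` (1 = rating observed)."""
    n_users, n_items = V.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n_users, r))
    H = rng.normal(scale=0.1, size=(r, n_items))
    I = np.eye(r)
    for _ in range(iters):
        # fix H, solve a ridge regression for each user's row of W
        for u in range(n_users):
            obs = mask[u] == 1
            A = H[:, obs] @ H[:, obs].T + lam * I
            W[u] = np.linalg.solve(A, H[:, obs] @ V[u, obs])
        # fix W, solve for each item's column of H
        for i in range(n_items):
            obs = mask[:, i] == 1
            A = W[obs].T @ W[obs] + lam * I
            H[:, i] = np.linalg.solve(A, W[obs].T @ V[obs, i])
    return W, H
```

Each inner solve is exact, so the objective decreases monotonically across sweeps.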

slide-63
SLIDE 63

Unconstrained Matrix Factorization

In-Class Exercise: Derive a block coordinate descent algorithm for the Unconstrained Matrix Factorization problem.

80

  • User vectors: u ∈ R^r
  • Item vectors: i ∈ R^r
  • Rating prediction: v̂_ui = uᵀi

  • Set of non-missing entries: Z
  • Objective: Σ_{(u,i)∈Z} (v_ui − uᵀi)²

slide-64
SLIDE 64

Matrix Factorization

  • User vectors: (W_{u∗})ᵀ ∈ R^r
  • Item vectors: H_{∗i} ∈ R^r
  • Rating prediction: V_ui = W_{u∗} H_{∗i} = [WH]_{ui}

(with matrices)

81

Figures from Koren et al. (2009) and Gemulla et al. (2011)

slide-65
SLIDE 65
  • User vectors: u ∈ R^r
  • Item vectors: i ∈ R^r
  • Rating prediction: v̂_ui = uᵀi

(with vectors)

82

Figures from Koren et al. (2009)

slide-66
SLIDE 66

Matrix Factorization

  • Set of non-missing entries: Z
  • Objective: min Σ_{(u,i)∈Z} (v_ui − uᵀi)²

(with vectors)

83

Figures from Koren et al. (2009)

slide-67
SLIDE 67

Matrix Factorization

  • Regularized Objective:
    min Σ_{(u,i)∈Z} (v_ui − uᵀi)² + λ( Σ_i ‖i‖² + Σ_u ‖u‖² )
  • SGD update for random (u, i):

(with vectors)

84

Figures from Koren et al. (2009)

slide-68
SLIDE 68

Matrix Factorization

  • Regularized Objective:
    min Σ_{(u,i)∈Z} (v_ui − uᵀi)² + λ( Σ_i ‖i‖² + Σ_u ‖u‖² )
  • SGD update for random (u, i):
    e_ui ← v_ui − uᵀi
    u ← u + γ(e_ui i − λu)
    i ← i + γ(e_ui u − λi)

(with vectors)

85

Figures from Koren et al. (2009)
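The SGD updates above translate directly into code. A minimal sketch (the hyperparameters and the list-of-triples rating format are illustrative choices; both vectors are updated from the same old values, as on the slide):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, r=2, gamma=0.01, lam=0.1, epochs=100):
    """SGD sketch for regularized matrix factorization. `ratings` is a
    list of (u, i, v_ui) triples; implements the slide's updates:
        e_ui <- v_ui - u.T i
        u    <- u + gamma * (e_ui * i - lam * u)
        i    <- i + gamma * (e_ui * u - lam * i)
    """
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, r))   # user vectors
    I = rng.normal(scale=0.1, size=(n_items, r))   # item vectors
    for _ in range(epochs):
        for u, i, v in ratings:
            e = v - U[u] @ I[i]
            # tuple assignment so both updates use the OLD u and i
            U[u], I[i] = (U[u] + gamma * (e * I[i] - lam * U[u]),
                          I[i] + gamma * (e * U[u] - lam * I[i]))
    return U, I
```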

slide-69
SLIDE 69

Matrix Factorization

  • User vectors: (W_{u∗})ᵀ ∈ R^r
  • Item vectors: H_{∗i} ∈ R^r
  • Rating prediction: V_ui = W_{u∗} H_{∗i} = [WH]_{ui}

(with matrices)

86

Figures from Koren et al. (2009) and Gemulla et al. (2011)

slide-70
SLIDE 70

Matrix Factorization

  • SGD

87

Figures from Koren et al. (2009) Figure from Gemulla et al. (2011)

(with matrices)

[Figure from Gemulla et al. (2011): “Matrix factorization as SGD: why does this work?”, annotating the step size in the update rule.]

slide-71
SLIDE 71

Matrix Factorization

88

Figure 3: The first two vectors from a matrix decomposition of the Netflix Prize data. Selected movies are placed at the appropriate spot based on their factor vectors in two dimensions. The plot reveals distinct genres, including clusters of movies with strong female leads, fraternity humor, and quirky independent films.

Figure from Koren et al. (2009)

Example Factors

slide-72
SLIDE 72

Matrix Factorization

89

ALS = alternating least squares

Comparison of Optimization Algorithms

Figure from Gemulla et al. (2011)

slide-73
SLIDE 73

SVD FOR COLLABORATIVE FILTERING

90

slide-74
SLIDE 74

Singular Value Decomposition for Collaborative Filtering

92

Theorem: If R is fully observed and there is no regularization, the optimal UVᵀ from SVD equals the optimal UVᵀ from Unconstrained MF.
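This is the Eckart–Young result: with R fully observed, the best rank-r fit in squared error is the truncated SVD, and its error is exactly the sum of the discarded squared singular values. A quick numerical check (random R, r = 2 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(6, 5))
U, s, Vt = np.linalg.svd(R, full_matrices=False)

r = 2
R_hat = U[:, :r] * s[:r] @ Vt[:r]      # rank-r truncated SVD

# squared Frobenius error = sum of the discarded squared singular values
err = np.linalg.norm(R - R_hat) ** 2
assert np.isclose(err, np.sum(s[r:] ** 2))
```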

slide-75
SLIDE 75

NON-NEGATIVE MATRIX FACTORIZATION

93

slide-76
SLIDE 76

Implicit Feedback Datasets

  • What information does a five-star rating contain?
  • Implicit Feedback Datasets:

– In many settings, users don’t have a way of expressing dislike for an item (e.g. can’t provide negative ratings)
– The only mechanism for feedback is to “like” something

  • Examples:

– Facebook has a “Like” button, but no “Dislike” button
– Google’s “+1” button
– Pinterest pins
– Purchasing an item on Amazon indicates a preference for it, but there are many reasons you might not purchase an item (besides dislike)
– Search engines collect click data but don’t have a clear mechanism for observing dislike of a webpage

94

Examples from Aggarwal (2016)

slide-77
SLIDE 77

Non-negative Matrix Factorization

Constrained Optimization Problem: minimize ‖R − UVᵀ‖²_F subject to U ≥ 0, V ≥ 0

Multiplicative Updates: a simple iterative algorithm for solving the problem above; each step just involves multiplying a few entries together
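One standard instantiation of the multiplicative updates mentioned above is Lee & Seung's rule: each entry of the factors is multiplied by a non-negative ratio, so non-negativity is preserved automatically. A sketch (the W/H naming and the small epsilon guard are my choices; R must be non-negative):

```python
import numpy as np

def nmf(R, r=2, iters=200):
    """Lee & Seung multiplicative updates for min ||R - W H||_F^2
    subject to W >= 0, H >= 0."""
    rng = np.random.default_rng(0)
    n, m = R.shape
    W = rng.random((n, r)) + 0.1
    H = rng.random((r, m)) + 0.1
    eps = 1e-12                         # guard against division by zero
    for _ in range(iters):
        # each update multiplies entries by a non-negative ratio,
        # so W and H stay non-negative throughout
        H *= (W.T @ R) / (W.T @ W @ H + eps)
        W *= (R @ H.T) / (W @ H @ H.T + eps)
    return W, H
```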

slide-78
SLIDE 78

105

slide-79
SLIDE 79

Summary

  • Recommender systems solve many real-world (large-scale) problems

  • Collaborative filtering by Matrix Factorization (MF) is an efficient and effective approach

  • MF is just another example of a common recipe:

1. define a model
2. define an objective function
3. optimize with SGD

106