Web Mining and Recommender Systems: Supervised Learning – Regression (PowerPoint PPT Presentation)



SLIDE 1

Web Mining and Recommender Systems

Supervised learning – Regression

SLIDE 2

Learning Goals

  • Introduce the concept of Supervised Learning
  • Understand the components (inputs and outputs) of supervised learning problems
  • Introduce linear regression, one of the simplest forms of supervised learning

SLIDE 3

What is supervised learning? Supervised learning is the process of trying to infer, from labeled data, the underlying function that produced the labels associated with the data.

SLIDE 4

What is supervised learning? Given labeled training data of the form {(x_1, y_1), ..., (x_n, y_n)}, infer the function f such that f(x_i) ≈ y_i

SLIDE 5

Example Suppose we want to build a movie recommender

e.g. which of these films will I rate highest?

SLIDE 6

Example Q: What are the labels? A: ratings that others have given to each movie, and that I have given to other movies
SLIDE 7

Example Q: What is the data? A: features about the movie and the users who evaluated it

Movie features: genre, actors, rating, length, etc. User features: age, gender, location, etc.

SLIDE 8

Example Movie recommendation: f(user, movie) = the user's rating of the movie

SLIDE 9

Solution 1 Design a system based on prior knowledge, e.g.

def prediction(user, movie):
    if user['age'] <= 14:
        if movie['mpaa_rating'] == 'G':
            return 5.0
        else:
            return 1.0
    elif user['age'] <= 18:
        if movie['mpaa_rating'] == 'PG':
            return 5.0
        # ... etc.

Is this supervised learning?

SLIDE 10

Solution 2

Identify words that I frequently mention in my social media posts, and recommend movies whose plot synopses use similar types of language

Plot synopsis Social media posts

argmax similarity(synopsis, post)

Is this supervised learning?

SLIDE 11

Solution 3 Identify which attributes (e.g. actors, genres) are associated with positive ratings. Recommend movies that exhibit those attributes.

Is this supervised learning?

SLIDE 12

Solution 1 (design a system based on prior knowledge)

Disadvantages:

  • Depends on possibly false assumptions about how users relate to items
  • Cannot adapt to new data/information

Advantages:

  • Requires no data!
SLIDE 13

Solution 2 (identify similarity between wall posts and synopses)

Disadvantages:

  • Depends on possibly false assumptions about how users relate to items
  • May not be adaptable to new settings

Advantages:

  • Requires data, but does not require labeled data

SLIDE 14

Solution 3 (identify attributes that are associated with positive ratings)

Disadvantages:

  • Requires a (possibly large) dataset of movies with labeled ratings

Advantages:

  • Directly optimizes a measure we care about (predicting ratings)
  • Easy to adapt to new settings and data
SLIDE 15

Supervised versus unsupervised learning Learning approaches attempt to model data in order to solve a problem

Unsupervised learning approaches find patterns/relationships/structure in data, but are not optimized to solve a particular predictive task

Supervised learning aims to directly model the relationship between input and output variables, so that the output variables can be predicted accurately given the input
SLIDE 16

Regression Regression is one of the simplest supervised learning approaches for learning relationships between input variables (features) and output variables (predictions)

SLIDE 17

Linear regression Linear regression assumes a predictor of the form

X · theta = y (or, if you prefer, theta · x_i = y_i for each data point)

where X is a matrix of features (data), theta is a vector of unknowns (which features are relevant), and y is a vector of outputs (labels)
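Spelling out the model and its least-squares solution (a standard reconstruction, since the deck's equations were images):

```latex
% predictor: each row x_i of X is one data point's feature vector
y = X\theta
% least-squares estimate, assuming X^\top X is invertible
\theta = \operatorname*{arg\,min}_{\theta} \lVert y - X\theta \rVert_2^2
       = (X^\top X)^{-1} X^\top y
```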

SLIDE 18

Motivation: height vs. weight

[Scatter plot: weight (40–120 kg) vs. height (130–200 cm)]

Q: Can we find a line that (approximately) fits the data?

SLIDE 19

Motivation: height vs. weight

Q: Can we find a line that (approximately) fits the data?

  • If we can find such a line, we can use it to make predictions

(i.e., estimate a person's weight given their height)

  • How do we formulate the problem of finding a line?
  • If no line will fit the data exactly, how to approximate?
  • What is the "best" line?
SLIDE 20

Recap: equation for a line

What is the formula describing the line?

[Scatter plot: weight (40–120 kg) vs. height (130–200 cm)]

SLIDE 21

Recap: equation for a line (plane)

What about in more dimensions?

[Scatter plot: weight (40–120 kg) vs. height (130–200 cm)]

SLIDE 22

Recap: equation for a line as an inner product

What about in more dimensions?

[Scatter plot: weight (40–120 kg) vs. height (130–200 cm)]

SLIDE 23

SLIDE 24

Linear regression Linear regression assumes a predictor of the form X · theta = y

Q: Solve for theta
A: theta = (X^T X)^(-1) X^T y

SLIDE 25

Linear regression Linear regression assumes a predictor of the form X · theta = y

Q: Solve for theta
A: theta = (X^T X)^(-1) X^T y

SLIDE 26

Learning Outcomes

  • Explained Supervised Learning problems in terms of data, labels, and features
  • Explained how regression can be set up in terms of lines (or hyperplanes) of best fit

SLIDE 27

Web Mining and Recommender Systems

Worked Example – Regression

SLIDE 28

Learning Goals

  • Work through an example of a regression problem
  • Introduce some simple feature engineering strategies

SLIDE 29

Linear regression Linear regression assumes a predictor of the form X · theta = y

Q: Solve for theta
A: theta = (X^T X)^(-1) X^T y

SLIDE 30

Linear regression Linear regression assumes a predictor of the form X · theta = y

Q: Solve for theta
A: theta = (X^T X)^(-1) X^T y

SLIDE 31

Example 1 How do preferences toward certain beers vary with age?

SLIDE 32

Example 1

Beers: Ratings/reviews: User profiles:

SLIDE 33

Example 1

50,000 reviews are available on http://cseweb.ucsd.edu/classes/fa19/cse258-a/data/beer_50000.json (see course webpage)

SLIDE 34

Example 1 How do preferences toward certain beers vary with age? How about ABV? Real-valued features

(code for all examples is on the course webpage)
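The actual code is on the course webpage; as a sketch of the "real-valued features" fit, here is a least-squares regression of rating on ABV with numpy. The field names ('beer/ABV', 'review/overall') and the four toy reviews are assumptions standing in for the real 50,000-review JSON.

```python
import numpy as np

# Toy stand-in for the 50,000-review JSON; the field names are
# assumptions -- check the actual dataset on the course webpage.
reviews = [
    {'beer/ABV': 4.2, 'review/overall': 3.0},
    {'beer/ABV': 5.0, 'review/overall': 3.5},
    {'beer/ABV': 7.2, 'review/overall': 4.0},
    {'beer/ABV': 9.1, 'review/overall': 4.5},
]

# Feature vector [1, ABV]: the constant 1 gives the model an offset term
X = np.array([[1.0, r['beer/ABV']] for r in reviews])
y = np.array([r['review/overall'] for r in reviews])

# Fit theta by least squares
theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
```

Here theta[0] is the offset and theta[1] is how much the predicted rating changes per unit of ABV.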

SLIDE 35

Example 1 What is the interpretation of: Real-valued features

(code for all examples is on the course webpage)

SLIDE 36

Example 2 How do beer preferences vary as a function of gender? Categorical features

(code for all examples is on the course webpage)

SLIDE 37

Example 2

E.g. How does rating vary with gender?

[Plot: rating (1–5 stars) vs. gender]

SLIDE 38

Example 2

[Plot: rating (1–5 stars) for male vs. female]

theta_0 is the (predicted/average) rating for males; theta_1 is how much higher females rate than males (in this case a negative number). We're really still fitting a line though!

SLIDE 39

Exercise How would you build a feature to represent the month, and the impact it has on people’s rating behavior?

SLIDE 40

Learning Outcomes

  • Worked through a simple regression problem
  • Began some simple feature engineering with binary features

SLIDE 41

Web Mining and Recommender Systems

Regression – Feature Transforms & Worked Example

SLIDE 42

Learning Goals

  • Work through a real example of a regression problem
  • Discuss the topic of feature engineering in more depth

SLIDE 43

Regression Regression is one of the simplest supervised learning approaches for learning relationships between input variables (features) and output variables (predictions)

SLIDE 44

Linear regression Linear regression assumes a predictor of the form

X · theta = y (or, if you prefer, theta · x_i = y_i for each data point)

where X is a matrix of features (data), theta is a vector of unknowns (which features are relevant), and y is a vector of outputs (labels)

SLIDE 45

Linear regression Linear regression assumes a predictor of the form X · theta = y

Q: Solve for theta
A: theta = (X^T X)^(-1) X^T y

SLIDE 46

Example

Beers: Ratings/reviews: User profiles:

SLIDE 47

Example How do preferences toward certain beers vary with age? How about ABV? Real-valued features

(code for all examples on course webpage)

SLIDE 48

Example: Polynomial functions

What about something like ABV^2?

  • Note that this is perfectly straightforward: the model still takes the form X · theta = y
  • We just need to use the feature vector x = [1, ABV, ABV^2, ABV^3]
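A sketch of this in Python, reusing the beer example (the ABV values and ratings below are made up):

```python
import numpy as np

def polynomial_features(abv, degree=3):
    """Return [1, ABV, ABV^2, ..., ABV^degree] for one data point."""
    return [abv ** d for d in range(degree + 1)]

# The model is non-linear in ABV but still linear in theta,
# so ordinary least squares applies unchanged
X = np.array([polynomial_features(a) for a in [4.2, 5.0, 7.2, 9.1]])
y = np.array([3.0, 3.5, 4.0, 4.5])
theta = np.linalg.lstsq(X, y, rcond=None)[0]
```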

SLIDE 49

Fitting complex functions

Note that we can use the same approach to fit arbitrary functions of the features! E.g.:

  • We can perform arbitrary combinations of the features, and the model will still be linear in the parameters (theta)

SLIDE 50

Fitting complex functions

The same approach would not work if we wanted to transform the parameters:

  • The linear models we've seen so far do not support these types of transformations (i.e., they need to be linear in their parameters)
  • There are alternative models that support non-linear transformations of parameters, e.g. neural networks

SLIDE 51

Learning Outcomes

  • Worked through a real regression example
  • Explained how to use more complex feature transforms to fit (e.g.) polynomials with regression algorithms

SLIDE 52

Web Mining and Recommender Systems

Regression – Categorical Features

SLIDE 53

Learning Goals

  • Explain how to use categorical features within regression algorithms

SLIDE 54

Example How do beer preferences vary as a function of gender? Categorical features

(code for all examples is on the course webpage)

SLIDE 55

Example

E.g. How does rating vary with gender?

[Plot: rating (1–5 stars) vs. gender]

SLIDE 56

Example

[Plot: rating (1–5 stars) for male vs. female]

theta_0 is the (predicted/average) rating for males; theta_1 is how much higher females rate than males (in this case a negative number). We're really still fitting a line though!

SLIDE 57

Motivating examples

What if we had more than two values?

(e.g. {"male", "female", "other", "not specified"}) Could we apply the same approach?

gender = 0 if "male", 1 if "female", 2 if "other", 3 if "not specified"

prediction = theta_0 + theta_1 × gender, i.e. theta_0 if male; theta_0 + theta_1 if female; theta_0 + 2·theta_1 if other; theta_0 + 3·theta_1 if not specified

SLIDE 58

Motivating examples

What if we had more than two values?

(e.g. {"male", "female", "other", "not specified"})

[Plot: rating (1–5 stars) for the categories male, female, other, not specified]

SLIDE 59

Motivating examples

  • This model is valid, but won't be very effective
  • It assumes that the difference between "male" and "female" must be equivalent to the difference between "female" and "other"
  • But there's no reason this should be the case!

[Plot: rating (1–5 stars) for the categories male, female, other, not specified]

SLIDE 60

Motivating examples

E.g. it could not capture a function like:

[Plot: a rating pattern over the four categories (male, female, other, not specified) that a single-slope model cannot capture]

SLIDE 61

Motivating examples

Instead we need something like: theta_0 if male; theta_0 + theta_1 if female; theta_0 + theta_2 if other; theta_0 + theta_3 if not specified

SLIDE 62

Motivating examples

This is equivalent to: prediction = theta_0 + theta · feature, where
feature = [1, 0, 0] for "female"
feature = [0, 1, 0] for "other"
feature = [0, 0, 1] for "not specified"
(and feature = [0, 0, 0] for "male")

SLIDE 63

Concept: One-hot encodings

feature = [1, 0, 0] for "female"
feature = [0, 1, 0] for "other"
feature = [0, 0, 1] for "not specified"

  • This type of encoding is called a one-hot encoding (because we have a feature vector with only a single "1" entry)
  • Note that to capture 4 possible categories, we only need three dimensions (a dimension for "male" would be redundant)
  • This approach can be used to capture a variety of categorical feature types, as well as objects that belong to multiple categories
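As a sketch, this encoding is easy to build in Python (the function name and category order here are my own choices, not from the slides):

```python
def one_hot_gender(gender):
    # 'male' is the reference category: it maps to all zeros,
    # so its rating is captured by the offset theta_0 alone
    categories = ['female', 'other', 'not specified']
    feat = [0] * len(categories)
    if gender in categories:
        feat[categories.index(gender)] = 1
    return [1] + feat  # prepend the constant (offset) feature

one_hot_gender('female')  # [1, 1, 0, 0]
one_hot_gender('male')    # [1, 0, 0, 0]
```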

SLIDE 64

Linearly dependent features

SLIDE 65

Linearly dependent features

SLIDE 66

Learning Outcomes

  • Showed how to use categorical features within regression algorithms
  • Introduced the concept of a "one-hot" encoding
  • Discussed linear dependence of features

SLIDE 67

Web Mining and Recommender Systems

Regression – Temporal Features

SLIDE 68

Learning Goals

  • Explain how to use temporal features within regression algorithms

SLIDE 69

Example How would you build a feature to represent the month, and the impact it has on people’s rating behavior?

SLIDE 70

Motivating examples

E.g. How do ratings vary with time?

[Plot: rating (1–5 stars) vs. time]

SLIDE 71

Motivating examples

E.g. How do ratings vary with time?

  • In principle this picture looks okay (compared to our previous example on categorical features) – we're predicting a real-valued quantity from real-valued data (assuming we convert the date string to a number)
  • So, what would happen if we tried to train a predictor based on (e.g.) the month of the year?

SLIDE 72

Motivating examples

E.g. How do ratings vary with time?

  • Let's start with a simple feature representation, e.g. map the month name to a month number: Jan = [0], Feb = [1], Mar = [2], etc., and fit a line over the month number

SLIDE 73

Motivating examples

The model we’d learn might look something like:

[Plot: a line fit over month numbers (Jan = 0 … Dec = 11); y-axis: rating, 1–5 stars]

SLIDE 74

Motivating examples

[Plot: the same line over two years of month numbers; y-axis: rating, 1–5 stars]

This seems fine, but what happens if we look at multiple years?

SLIDE 75

Modeling temporal data

  • This representation implies that the model would "wrap around" on December 31 to its January 1st value
  • This type of "sawtooth" pattern probably isn't very realistic

SLIDE 76

Modeling temporal data

[Plot: the sawtooth fit over two years of month numbers; y-axis: rating, 1–5 stars]

What might be a more realistic shape?

SLIDE 77

Modeling temporal data

Fitting some periodic function like a sine wave would be a valid solution, but is difficult to get right, and fairly inflexible

  • Also, it's not a linear model
  • Q: What's a class of functions that we can use to capture a more flexible variety of shapes?
  • A: Piecewise functions!

SLIDE 78

Concept: Fitting piecewise functions

We’d like to fit a function like the following:

[Plot: a piecewise-constant rating function over months Jan–Dec; y-axis: rating, 1–5 stars]

SLIDE 79

Fitting piecewise functions

In fact this is very easy, even for a linear model! This function looks like:

prediction = theta_0 + theta_1 × (1 if it's Feb, 0 otherwise) + theta_2 × (1 if it's Mar, 0 otherwise) + ... + theta_11 × (1 if it's Dec, 0 otherwise)

  • Note that we don't need a feature for January
  • i.e., theta_0 captures the January value, theta_1 captures the difference between February and January, etc.

SLIDE 80

Fitting piecewise functions

Or equivalently, we'd have features as follows, where:

x = [1,1,0,0,0,0,0,0,0,0,0,0] if February
    [1,0,1,0,0,0,0,0,0,0,0,0] if March
    [1,0,0,1,0,0,0,0,0,0,0,0] if April
    ...
    [1,0,0,0,0,0,0,0,0,0,0,1] if December
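A small Python helper (the names are my own, hypothetical choices) that produces exactly these feature vectors:

```python
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def month_feature(month):
    """One-hot month encoding with January as the reference category:
    [1, isFeb, isMar, ..., isDec]."""
    feat = [1] + [0] * 11        # constant feature + 11 month indicators
    idx = MONTHS.index(month)
    if idx > 0:                  # January needs no indicator of its own
        feat[idx] = 1
    return feat

month_feature('Feb')  # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```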

SLIDE 81

Fitting piecewise functions

Note that this is still a form of one-hot encoding, just like we saw in the “categorical features” example

  • This type of feature is very flexible, as it can handle complex shapes, periodicity, etc.
  • We could easily increase (or decrease) the resolution to a week, or an entire season, rather than a month, depending on how fine-grained our data was

SLIDE 82

Concept: Combining one-hot encodings

We can also extend this by combining several one-hot encodings together, where:

x1 = [1,1,0,0,0,0,0,0,0,0,0,0] if February
     [1,0,1,0,0,0,0,0,0,0,0,0] if March
     [1,0,0,1,0,0,0,0,0,0,0,0] if April
     ...
     [1,0,0,0,0,0,0,0,0,0,0,1] if December

x2 = [1,0,0,0,0,0] if Tuesday
     [0,1,0,0,0,0] if Wednesday
     [0,0,1,0,0,0] if Thursday
     ...

SLIDE 83

What does the data actually look like? Season vs. rating (overall)

SLIDE 84

Learning Outcomes

  • Explained how to use temporal features within regression algorithms
  • Showed how to use one-hot encodings to capture trends in periodic data

SLIDE 85

Web Mining and Recommender Systems

Regression Diagnostics

SLIDE 86

Learning Goals

  • Show how to evaluate regression algorithms

SLIDE 87

Today: Regression diagnostics

Mean-squared error (MSE): (1/n) sum_i (y_i - f(x_i))^2

SLIDE 88

Regression diagnostics Q: Why MSE (and not mean-absolute-error or something else)?

SLIDE 89

Regression diagnostics

SLIDE 90

Regression diagnostics

SLIDE 91

Regression diagnostics Coefficient of determination Q: How low does the MSE have to be before it’s “low enough”? A: It depends! The MSE is proportional to the variance of the data

SLIDE 92

Regression diagnostics Coefficient of determination (R^2 statistic)

Mean: ybar = (1/n) sum_i y_i
Variance: Var(y) = (1/n) sum_i (y_i - ybar)^2
MSE: (1/n) sum_i (y_i - f(x_i))^2

SLIDE 93

Regression diagnostics Coefficient of determination (R^2 statistic)

Mean: ybar = (1/n) sum_i y_i
Variance: Var(y) = (1/n) sum_i (y_i - ybar)^2
MSE: (1/n) sum_i (y_i - f(x_i))^2

SLIDE 94

Regression diagnostics Coefficient of determination (R^2 statistic)

FVU(f) = 1: trivial predictor
FVU(f) = 0: perfect predictor

(FVU = fraction of variance unexplained)
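These diagnostics are a few lines of numpy; the toy labels below are made up just to show the two extremes:

```python
import numpy as np

def mse(y, y_pred):
    """Mean-squared error."""
    return np.mean((y - y_pred) ** 2)

def r2(y, y_pred):
    """R^2 = 1 - FVU, where FVU = MSE / variance of the labels."""
    return 1 - mse(y, y_pred) / np.var(y)

y = np.array([3.0, 3.5, 4.0, 4.5])
perfect = r2(y, y)                          # perfect predictor -> 1.0
trivial = r2(y, np.full(len(y), y.mean()))  # predicting the mean -> 0.0
```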

SLIDE 95

Regression diagnostics Coefficient of determination (R^2 statistic)

R^2 = 0: trivial predictor
R^2 = 1: perfect predictor

SLIDE 96

Learning Outcomes

  • Showed how to evaluate regression algorithms
  • Introduced the Mean Squared Error and R^2 coefficient
  • Explained the relationship between the MSE and the variance

SLIDE 97

Web Mining and Recommender Systems

Overfitting

SLIDE 98

Learning Goals

  • Introduce the concepts of overfitting and regularization

SLIDE 99

Overfitting Q: But can't we get an R^2 of 1 (MSE of 0) just by throwing in enough random features? A: Yes! This is why MSE and R^2 should always be evaluated on data that wasn't used to train the model. A good model is one that generalizes to new data.

SLIDE 100

Overfitting When a model performs well on training data but doesn't generalize, we are said to be overfitting
SLIDE 101

Overfitting When a model performs well on training data but doesn't generalize, we are said to be overfitting

Q: What can be done to avoid overfitting?
SLIDE 102

Occam’s razor

“Among competing hypotheses, the one with the fewest assumptions should be selected”

SLIDE 103

Occam’s razor

“hypothesis”

Q: What is a “complex” versus a “simple” hypothesis?

SLIDE 104

SLIDE 105

Occam’s razor A1: A “simple” model is one where theta has few non-zero parameters

(only a few features are relevant)

A2: A “simple” model is one where theta is almost uniform

(few features are significantly more relevant than others)

SLIDE 106

Occam’s razor

A1: A "simple" model is one where theta has few non-zero parameters (||theta||_1 is small)
A2: A "simple" model is one where theta is almost uniform (||theta||_2 is small)

SLIDE 107

“Proof”

SLIDE 108

Regularization Regularization is the process of penalizing model complexity during training:

arg min_theta MSE + lambda × (model complexity), e.g. (1/n) sum_i (y_i - x_i · theta)^2 + lambda ||theta||_2^2 (l2)

SLIDE 109

Regularization Regularization is the process of penalizing model complexity during training

How much should we trade-off accuracy versus complexity?

SLIDE 110

Optimizing the (regularized) model

  • Could look for a closed-form solution as we did before
  • Or, we can try to solve using gradient descent
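A sketch of the closed-form route with numpy: the minimizer of the l2-regularized objective is theta = (X^T X + lambda·I)^(-1) X^T y. Note this simple version also penalizes the offset term, which libraries often exclude; the tiny example data is made up.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form minimizer of ||y - X theta||^2 + lam * ||theta||^2:
    theta = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data lying exactly on the line y = 1 + x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
theta_ls = ridge_closed_form(X, y, lam=0.0)   # ordinary least squares
theta_reg = ridge_closed_form(X, y, lam=1.0)  # shrunk toward zero
```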

SLIDE 111

Optimizing the (regularized) model Gradient descent:

  1. Initialize theta at random
  2. While (not converged): theta := theta - alpha · f'(theta)

All sorts of annoying issues:

  • How to initialize theta?
  • How to determine when the process has converged?
  • How to set the step size alpha?

These aren't really the point of this class though
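A minimal gradient-descent sketch for the l2-regularized least-squares objective. The zero initialization, fixed iteration budget, and step size are simple stand-in choices, not the only valid ones, and the toy data is made up.

```python
import numpy as np

def gradient_descent(X, y, lam=1.0, alpha=0.01, iters=1000):
    """Minimize ||y - X theta||^2 + lam * ||theta||^2 by gradient descent."""
    theta = np.zeros(X.shape[1])   # one simple initialization choice
    for _ in range(iters):         # fixed budget instead of a convergence test
        # gradient of the objective with respect to theta
        grad = 2 * X.T @ (X @ theta - y) + 2 * lam * theta
        theta -= alpha * grad
    return theta

# Toy data lying exactly on the line y = 1 + x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])
theta = gradient_descent(X, y, lam=0.0)  # approaches the exact fit [1, 1]
```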

SLIDE 112

Optimizing the (regularized) model

SLIDE 113

Optimizing the (regularized) model Gradient descent in scipy: code on course webpage

(see also “ridge regression” in the “sklearn” module)

SLIDE 114

Learning Outcomes

  • Introduced the concepts of overfitting and regularization
  • Showed how to regularize models using the l1 and l2 norms
  • (very briefly) touched on gradient descent

SLIDE 115

Web Mining and Recommender Systems

Model Selection & Summary

SLIDE 116

Learning Goals

  • Discuss model selection and validation sets
  • Summarize our discussion on regression

SLIDE 117

Model selection

How much should we trade-off accuracy versus complexity?

Each value of lambda generates a different model. Q: How do we select which one is the best?

SLIDE 118

Model selection How to select which model is best?

A1: The one with the lowest training error?
A2: The one with the lowest test error?

We need a third sample of the data that is not used for training or testing

SLIDE 119

Model selection A validation set is constructed to “tune” the model’s parameters

  • Training set: used to optimize the model's parameters
  • Test set: used to report how well we expect the model to perform on unseen data
  • Validation set: used to tune any model parameters that are not directly optimized
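One way to sketch the three-way split in Python (the 60/20/20 proportions and function name are my choices, not prescribed by the lecture):

```python
import random

def split_dataset(data, seed=0):
    """Shuffle and split into train/validation/test (60/20/20)."""
    data = list(data)
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    n_train = int(0.6 * len(data))
    n_valid = int(0.2 * len(data))
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_dataset(range(100))
```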

SLIDE 120

Model selection A few “theorems” about training, validation, and test sets

  • The training error increases as lambda increases
  • The validation and test error are at least as large as the training error (assuming infinitely large random partitions)
  • The validation/test error will usually have a "sweet spot" between under- and over-fitting

SLIDE 121

Model selection

SLIDE 122

Summary: Regression

  • Linear regression and least-squares
  • (a little bit of) feature design
  • Overfitting and regularization
  • Gradient descent
  • Training, validation, and testing
  • Model selection