

SLIDE 1

Overview

CS 446

SLIDE 2

What is machine learning?

◮ Machine learning: study of computational mechanisms that “learn” from data to make predictions and decisions.

1 / 21

SLIDE 6

Application 1: image classification

◮ Birdwatcher takes photos of birds, organizes by species.
◮ Goal: automatically recognize bird species in new photos.

Indigo bunting

◮ Why ML: variation in lighting, occlusions, morphology.

2 / 21

SLIDE 12

Application 2: recommender system

◮ Netflix users watch movies and provide ratings.
◮ Goal: predict user’s rating of unwatched movie.
◮ (Real goal: keep users paying customers.)
◮ (Real effect: reinforce stereotypes found in the data?)

[figure: rating matrix, and a latent-factor embedding of movies along “geared toward males/females” and “serious/escapist” axes, placing titles such as Braveheart, The Princess Diaries, Lethal Weapon, and The Color Purple. Image credit: Koren, Bell, and Volinsky, 2009.]

◮ Why ML: can go far by representing users as normalized view count vectors, and correlating them.

3 / 21
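The “represent users as normalized view count vectors, and correlate them” idea can be sketched in a few lines of numpy. This is a minimal illustration, not the slide’s method in full; the view counts and movie indices below are made up:

```python
import numpy as np

# Each row: one user's view counts over 4 movies (hypothetical data).
views = np.array([
    [5.0, 0.0, 3.0, 0.0],   # user 0
    [4.0, 0.0, 2.0, 1.0],   # user 1: similar tastes to user 0
    [0.0, 6.0, 0.0, 4.0],   # user 2: different tastes
])

# Normalize each user's vector to unit length, then correlate via dot products.
unit = views / np.linalg.norm(views, axis=1, keepdims=True)
similarity = unit @ unit.T   # cosine similarity between every pair of users

# Users 0 and 1 should correlate more strongly than users 0 and 2.
print(similarity[0, 1] > similarity[0, 2])
```

Similar users can then be used to predict ratings of unwatched movies (e.g., by averaging the ratings of the most similar users).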

SLIDE 16

Application 3: machine translation

◮ Linguists provide translations of all English language books into French, sentence-by-sentence.
◮ Goal: translate any English sentence into French. Note: the text-to-speech is via ML (recurrent network).
◮ Why ML? Not only can we avoid the tedium of endless grammar rules, we can also hope to capture idiom and other nuance.

4 / 21

SLIDE 21

Application 4: chess

◮ Chess enthusiasts construct a large corpus of chess games.
◮ Goal: for any board position, determine probability that each possible move leads to a win.
◮ Why ML? Magical “interpolation” between various positions.
◮ Note: it can even learn via self-play!

5 / 21
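One naive baseline for “probability that each possible move leads to a win” is simply counting outcomes over (position, move) pairs in the corpus. A toy sketch with made-up positions and moves:

```python
from collections import defaultdict

# Toy corpus: each game is a list of (position, move) pairs plus a result
# (1 = the moving side won, 0 = lost). Positions/moves are hypothetical strings.
games = [
    ([("start", "e4"), ("mid1", "Nf3")], 1),
    ([("start", "e4")], 1),
    ([("start", "d4")], 0),
    ([("start", "e4"), ("mid1", "Bc4")], 0),
]

# Count wins and totals for every (position, move) pair seen in the corpus.
wins = defaultdict(int)
totals = defaultdict(int)
for moves, result in games:
    for position, move in moves:
        totals[(position, move)] += 1
        wins[(position, move)] += result

def win_probability(position, move):
    """Empirical P(win | position, move); None if the pair was never seen."""
    key = (position, move)
    if totals[key] == 0:
        return None
    return wins[key] / totals[key]

print(win_probability("start", "e4"))  # 2 wins out of 3 games with this move
```

This lookup table returns None for any position it has never seen; the “magical interpolation” is exactly what a learned model adds over such counting.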

SLIDE 30

Interlude: technical background

Math.
◮ Linear algebra (e.g., null spaces; eigendecomposition; SVD. . . ).
◮ Basic probability and statistics (e.g., variance of a random variable).
◮ Multivariable calculus (e.g., gradient of ‖Aw − b‖² with respect to w).
◮ Basic proof writing (e.g., prove A⊤A is positive semi-definite).

Software.
◮ python3. It’s slow, but often computation will be inside fast libraries.
◮ numpy, an easy-to-use numeric library.
◮ pytorch, a numeric library with gpu support, auto-differentiation, and deep learning helpers.

My opinion. pytorch is one of the nicest libraries I’ve ever used, for anything. I use it for much more than deep learning.

6 / 21
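The multivariable-calculus example above has the closed form ∇‖Aw − b‖² = 2A⊤(Aw − b), and it is easy to sanity-check numerically (the matrix sizes and random seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
w = rng.standard_normal(3)

def f(w):
    # Squared residual norm ||Aw - b||^2.
    r = A @ w - b
    return r @ r

# Closed-form gradient: 2 A^T (Aw - b).
grad = 2 * A.T @ (A @ w - b)

# Central finite-difference check of each coordinate.
eps = 1e-6
numeric = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad, numeric, atol=1e-4))
```

This gradient-checking pattern is worth keeping around; pytorch’s auto-differentiation does the same kind of verification for you via `torch.autograd.gradcheck`.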

SLIDE 31

python3, pytorch, numpy

>>> import numpy
>>> import torch
>>> 3 / 2
1.5
>>> 3 // 2
1
>>> A = torch.randn(5,5)
>>> b = torch.randn(5,1)
>>> x = torch.gels(b, A)[0]
>>> (A @ x - b).norm()
tensor(3.7985e-06)
>>> x = numpy.linalg.lstsq(A, b)[0]
>>> (A @ torch.tensor(x) - b).norm()
tensor(1.1999e-06)

7 / 21

SLIDE 32

pytorch on gpu is easy

>>> import torch
>>> A = torch.randn(5, 5)
>>> b = torch.randn(5)
>>> (A @ b).norm()
tensor(4.7746)
>>> device = torch.device("cuda:0")
>>> (A.to(device) @ b.to(device)).norm()
tensor(4.7746, device='cuda:0')
>>> (A.to(device) @ b).norm()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.FloatTensor for argument #2 'vec'

Note. Homeworks will be graded in a gpu-free container!

8 / 21

SLIDE 33

Homework 0

◮ Homework 0 is posted on the class webpage.
◮ It is a sanity check of basic (math and coding) background.
◮ It is due Tuesday, January 22.
◮ It has two gradescope components: hw0 is multiple choice, hw0code is coding.

9 / 21

SLIDE 37

Basic setting: supervised learning

Training data: labeled examples (x1, y1), (x2, y2), . . . , (xn, yn), where
◮ each input xi is a machine-readable description of an instance (e.g., image, sentence), and
◮ each corresponding label yi is an annotation relevant to the task, typically not easy to obtain automatically.

Goal: learn a function f̂ from labeled examples, that accurately “predicts” the labels of new (previously unseen) inputs.

[diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label]

10 / 21

SLIDE 41

Learn a function?

The learned function f̂ might look like the following:

1: if age ≥ 40 then
2:   if genre = western then
3:     return 4.3
4:   else if release date > 1998 then
5:     return 2.5
6:   else
7:     . . .
8:   end if
9: else if · · · then
10:   . . .
11: end if

Want machine to figure out these if-else clauses from the data, rather than hand-code them ourselves. (This type of decision rule is called a decision tree.) (How to fit: use one of many recursive splitting heuristics.)

11 / 21
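The core of such a splitting heuristic can be sketched in miniature: choose the single threshold on one feature that minimizes squared error of the two resulting constant predictions. The data below are made up, and a real fitter recurses on both sides and handles many features:

```python
def best_split(xs, ys):
    """Find the threshold t minimizing squared error when predicting
    mean(y) separately on {x <= t} and {x > t}. Returns (t, error)."""
    def sq_err(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = sq_err(left) + sq_err(right)
        if err < best[1]:
            best = (t, err)
    return best

# Hypothetical data: ratings jump at age 40, echoing the slide's first clause.
ages = [25, 30, 35, 45, 50, 55]
ratings = [2.5, 2.4, 2.6, 4.3, 4.2, 4.4]
print(best_split(ages, ratings))
```

Recursing on each side of the chosen threshold, and stopping when a side is small or pure, yields the nested if-else structure above.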

SLIDE 45

Example 2: 1-nn

◮ Suppose we are given red and blue data. . .
◮ . . . and label new points according to the closest existing point.

[figure: red and blue points in [−1, 1]², with the resulting 1-nn decision regions]

◮ This is a 1-nearest-neighbor (1-nn) classifier.
◮ The k-nn classifier takes majority vote of the k nearest data points.

12 / 21
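A k-nn classifier fits in a few lines of numpy. This is a minimal brute-force sketch with made-up points; real implementations use spatial data structures rather than computing every distance:

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=1):
    """Label `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_x - query, axis=1)   # distance to every point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = train_y[nearest]
    # Majority vote (ties broken toward the smaller label).
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical "red" (0) and "blue" (1) points.
train_x = np.array([[-0.5, -0.5], [-0.6, -0.4], [0.5, 0.5], [0.4, 0.6]])
train_y = np.array([0, 0, 1, 1])

print(knn_predict(train_x, train_y, np.array([-0.4, -0.5]), k=1))  # red
print(knn_predict(train_x, train_y, np.array([0.5, 0.4]), k=3))    # blue
```

Note there is no “fitting” at all: the training data itself is the classifier.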

slide-46
SLIDE 46

Example 3: linear classifier

◮ Classify points with x → sgn(a⊤x).

1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00

13 / 21

slide-47
SLIDE 47

Example 3: linear classifier

◮ Classify points with x → sgn(a⊤x).

1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00

  • 7.500
  • 5.000
  • 2.500

0.000 2.500 5.000 7.500

◮ This gives a contour plot: depicting f −1(c) = {x ∈ R2 : f(x) = c} for a few c.

13 / 21

slide-48
SLIDE 48

Example 3: linear classifier

◮ Classify points with x → sgn(a⊤x).

[figure: linear decision boundary over [−1, 1]², with contour lines of a⊤x at levels −7.500, −5.000, −2.500, 0.000, 2.500, 5.000, 7.500]

◮ This gives a contour plot: depicting f⁻¹(c) = {x ∈ R² : f(x) = c} for a few c.
◮ How to fit: convex optimization.

13 / 21
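One concrete instance of “how to fit: convex optimization” is gradient descent on the logistic loss, which is convex in a. A numpy sketch on made-up separable data (the learning rate, iteration count, and data are arbitrary):

```python
import numpy as np

# Hypothetical linearly separable data: label = sgn(x1 + x2).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] + X[:, 1])

# Gradient descent on the convex logistic loss
#   L(a) = mean(log(1 + exp(-y * (X @ a)))).
a = np.zeros(2)
lr = 0.5
for _ in range(500):
    margins = y * (X @ a)
    # dL/da = mean over examples of -y * x * sigmoid(-margin).
    grad = -(y[:, None] * X * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    a -= lr * grad

preds = np.sign(X @ a)
print((preds == y).mean())  # training accuracy
```

Because the loss is convex, gradient descent reliably drives it down; no such guarantee holds for the networks on the next slides.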

SLIDE 51

Example 4: 2-layer ReLU network

◮ Classify points with x → sgn(A2σr(A1x + b1) + b2), where A1 ∈ R16×2, b1 ∈ R16, A2 ∈ R1×16, b2 ∈ R1, and σr(z) = max{0, z} is the Rectified Linear Unit (ReLU).

[figure: nonlinear decision boundary over [−1, 1]², with contour lines of the network output at levels −12.000, −9.000, −6.000, −3.000, 0.000, 3.000]

◮ Here once again is the contour plot: f⁻¹(c) = {x ∈ R² : f(x) = c} for a few c.
◮ How to fit: use convex optimization tools, without knowing why.

14 / 21
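The classifier above, written out directly in numpy. The weights here are random rather than fitted (fitting would adjust them by gradient descent), so this only shows the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
A1, b1 = rng.standard_normal((16, 2)), rng.standard_normal(16)  # hidden layer
A2, b2 = rng.standard_normal((1, 16)), rng.standard_normal(1)   # output layer

def relu(z):
    return np.maximum(0, z)   # sigma_r(z) = max{0, z}, applied elementwise

def classify(x):
    """x -> sgn(A2 relu(A1 x + b1) + b2) for a single 2-d point x."""
    return np.sign(A2 @ relu(A1 @ x + b1) + b2)[0]

print(classify(np.array([0.3, -0.7])))  # either +1.0 or -1.0
```

In pytorch the same architecture is `torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))`, with the sgn applied at prediction time.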

SLIDE 53

Other examples

Other examples of supervised learning (learn f : X → Y given pairs (x, y) ∈ X × Y ).
◮ Support Vector Machines (SVM), least squares, naive Bayes, AdaBoost, . . .

Other types of machine learning.
◮ Unsupervised learning: find structure in some examples x1, . . . , xn (there are no labels!).
◮ Time series modeling: the label of xi also depends on (xi−1, xi−2, . . . ).
◮ Reinforcement learning: the machine learning method makes decisions, not just predictions: the outputs affect future inputs. (E.g., controlling a robot.)

15 / 21

SLIDE 59

Why is machine learning challenging?

From the examples, supervised learning (fit f̂ : X → Y to (x1, y1), . . . , (xn, yn)) consists of: pick a model and fitting algorithm; feed them data. Not easy.
◮ We might overfit the training set.
◮ We might pick a bad model.
◮ The fitting method might be fickle.
◮ Data may be hard to obtain.

16 / 21

SLIDE 60

Overfitting

Let’s fit a polynomial. What’s the degree? Truth: y = 0 · x + ξ, ξ ∼ Gaussian. Red: seen data. Green: unseen data.

[figure: degree-k polynomial fits over [0, 1]; as k grows, the red (seen) points are fit ever more tightly while the curve oscillates wildly between them]

Risks Rk of the degree-k fit on seen and unseen data:

  k    seen Rk       unseen Rk
  1    0.00692304    0.00897457
  2    0.00690718    0.00870077
  3    0.0064944     0.00998528
  4    0.0062397     0.013063
  5    0.00582684    0.00975194
  6    0.00571136    0.0142185
  7    0.00565266    0.0282631
  8    0.00564127    0.0440347
  9    0.00541878    0.42463
  10   0.00540813    0.682619
  12   0.00503499    40.6796
  16   0.00312787    30910.6
  21   0.0026458     1.24795e+07
  27   0.00262184    233052
  34   0.00259486    3.97751e+10
  38   0.00242094    1.02285e+13

Overfitting: fitting training (seen) data, not fitting testing (unseen) data. One cure: “regularization”.

17 / 21
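This experiment is easy to reproduce in spirit with numpy. A sketch, not the slide’s exact data: the noise level, sample sizes, seed, and degrees below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Truth: y = 0*x + Gaussian noise; draw seen (train) and unseen (test) samples.
def sample(n):
    x = rng.uniform(0, 1, n)
    return x, 0 * x + 0.1 * rng.standard_normal(n)

x_seen, y_seen = sample(30)
x_unseen, y_unseen = sample(30)

def risk(coeffs, x, y):
    """Mean squared error of the polynomial with these coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

seen_risk, unseen_risk = {}, {}
for degree in [1, 4, 8, 12]:
    coeffs = np.polyfit(x_seen, y_seen, degree)   # least-squares polynomial fit
    seen_risk[degree] = risk(coeffs, x_seen, y_seen)
    unseen_risk[degree] = risk(coeffs, x_unseen, y_unseen)
    print(degree, seen_risk[degree], unseen_risk[degree])
```

Since degree-k polynomials nest inside degree-(k+1) polynomials, the seen risk can only shrink as the degree grows; the unseen risk is what eventually blows up.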

SLIDE 76

Choosing an appropriate model

Should we use 1-nn or a 2-layer ReLU network?

[figure: 1-nn decision regions (left) vs. 2-layer ReLU network contours (right) on the same data]

18 / 21

SLIDE 81

The “throw data at deep network” / pytorch paradigm

Here’s one approach to machine learning.

1. Throw more data, fitting time, and neural network parameters at a problem until training error becomes near zero.
2. Shrink/regularize the model to reduce test error.

Does this work?

1. For some well-studied problems, yes.
2. For other problems, and/or if you lack resources, no.

19 / 21

SLIDE 82

This course (CS 446 MJT)

Topics

◮ An overview of standard supervised machine learning methods.
◮ Some mathematical intuition, motivation, and background.
◮ A few topics in unsupervised learning.

Course website http://mjt.cs.illinois.edu/courses/ml-s19/

◮ Schedule, homeworks, lecture slides, academic integrity, etc.

Course staff

◮ Instructor: Matus Telgarsky.
◮ Teaching assistants, office hours, online forum: see course website.
◮ Many thanks: to Daniel Hsu @ Columbia; this class is based upon his.

20 / 21

SLIDE 83

Key takeaways

1. Examples of machine learning problems and why they are challenging.
2. Course information.
3. Homework 0 due next Tuesday (January 22)!

21 / 21