SLIDE 1

Supervised Learning II

Cameron Allen csal@brown.edu

Fall 2019

SLIDE 2

Machine Learning

Subfield of AI concerned with learning from data. Broadly, using:

  • Experience
  • To Improve Performance
  • On Some Task

(Tom Mitchell, 1997)

SLIDE 3

Supervised Learning

Input (training data):

  • inputs X = {x1, …, xn}
  • labels Y = {y1, …, yn}

Learn to predict new labels. Given a new x, what is y?

SLIDE 4

Supervised Learning

Input (training data):

  • inputs X = {x1, …, xn}
  • labels Y = {y1, …, yn}

Learn to predict new labels. Given a new x, what is y?

“Not Hotdog”, SeeFood Technologies Inc.

SLIDE 5

Supervised Learning

Input (training data):

  • inputs X = {x1, …, xn}
  • labels Y = {y1, …, yn}

Learn to predict new labels. Given a new x, what is y?

SLIDE 6

Supervised Learning

Formal definition. Given training data:

  • inputs X = {x1, …, xn}
  • labels Y = {y1, …, yn}

Produce a decision function f : X → Y that minimizes the error:

  ∑_{i} err(f(xi), yi)
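To make the definition concrete, here is a minimal sketch (hypothetical data and a hand-written decision function, with 0/1 loss standing in for err):

import numpy as np

# Hypothetical training data: 1-D inputs with binary labels.
X = np.array([0.5, 1.2, 3.0, 3.8, 4.1])
Y = np.array([0, 0, 1, 1, 1])

def f(x):
    """A candidate decision function f : X -> Y (here: threshold at 2.5)."""
    return int(x > 2.5)

def err(y_pred, y_true):
    """0/1 loss, one simple choice of error function."""
    return int(y_pred != y_true)

# Total training error: sum_i err(f(x_i), y_i)
total_error = sum(err(f(x), y) for x, y in zip(X, Y))
print(total_error)  # 0 for this data and this threshold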

SLIDE 7

Neural Networks

A single unit computes σ(w · x + c): this is just logistic regression.
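A minimal numpy sketch of such a unit (the weights w and offset c below are made-up values; σ is the logistic sigmoid):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters for a 2-input unit.
w = np.array([1.5, -2.0])
c = 0.5

def predict_proba(x):
    """Logistic regression: sigma(w . x + c), a probability in (0, 1)."""
    return sigmoid(np.dot(w, x) + c)

print(predict_proba(np.array([1.0, 0.2])))  # ~ sigmoid(1.6) ~ 0.83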

SLIDE 8

Deep Neural Networks

[Diagram: a deep network with inputs x1, x2, hidden layers h11, h12, h13 through hn1, hn2, hn3, and two outputs.]

SLIDE 9

Nonparametric Methods

Most ML methods are parametric:

  • Characterized by setting a few parameters.
  • y = f(x, w)

Alternative approach:

  • Let the data speak for itself.
  • No finite-sized parameter vector.
  • Usually more interesting decision boundaries.
SLIDE 10

K-Nearest Neighbors

Given training data:

  • X = {x1, …, xn}, Y = {y1, …, yn}
  • a distance metric D(xi, xj)

For a new data point xnew:

  • find the k nearest points in X (measured via D)
  • set ynew to the majority label among them
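A brute-force sketch of this procedure (Euclidean distance standing in for D; function and variable names are ours):

import numpy as np
from collections import Counter

def knn_classify(X, Y, x_new, k=3):
    """Return the majority label among the k nearest training points."""
    # D(x_i, x_new): Euclidean distance to every training point.
    dists = np.linalg.norm(X - x_new, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest points
    votes = Counter(Y[i] for i in nearest)   # count labels among them
    return votes.most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
Y = np.array([0, 0, 1, 1])
print(knn_classify(X, Y, np.array([0.95, 1.0]), k=3))  # -> 1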

SLIDE 11

K-Nearest Neighbors

[Plot: labeled training points in the plane.]

SLIDE 12

K-Nearest Neighbors

What does the decision boundary look like? What if k = 1?

[Plot: the same labeled training points.]

SLIDE 13

K-Nearest Neighbors

Properties:

  • No learning phase.
  • Must store all the data.
  • O(log n) computation per query (with a spatial index); grows with the data.

Decision boundary:

  • any function, given enough data.

Classic trade-off: memory and compute time for flexibility.

SLIDE 14

Applications

  • Fraud detection
  • Internet advertising
  • Friend or link prediction
  • Sentiment analysis
  • Face recognition
  • Spam filtering
SLIDE 15

Applications

MNIST Data Set:

  • Training set: 60k digits
  • Test set: 10k digits

SLIDE 16

Classification vs. Regression

If the set of labels Y is discrete:

  • Classification
  • Minimize number of errors

If Y is real-valued:

  • Regression
  • Minimize sum squared error

Let’s look at regression.

SLIDE 17

Regression with Decision Trees

Start with decision trees with real-valued inputs.

[Tree: if a > 3.1, then y = 1; otherwise test b < 0.6: if true, y = 2; if false, y = 1.]

SLIDE 18

Regression with Decision Trees

… now real-valued outputs.

[Tree: if a > 3.1, then y = 0.6; otherwise test b < 0.6: if true, y = 0.3; if false, y = 1.1.]

SLIDE 19

Regression with Decision Trees

Training procedure: fix a depth, k.

  • If k = 1, fit the average of the labels.
  • If k > 1: consider all variables to split on, find the split that minimizes SSE, and recurse with depth k-1.

What happens if k = N?
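A rough sketch of this greedy recursive procedure (splits chosen to minimize SSE; function and variable names are ours):

import numpy as np

def fit_tree(X, y, depth):
    """Greedily fit a regression tree of the given depth (SSE criterion)."""
    if depth == 1 or len(y) <= 1:
        return {"leaf": y.mean()}           # depth 1: just fit the average
    best = None
    for j in range(X.shape[1]):             # consider every variable...
        for t in np.unique(X[:, j]):        # ...and every candidate threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:
        return {"leaf": y.mean()}
    _, j, t = best
    mask = X[:, j] <= t
    return {"var": j, "thresh": t,           # recurse with depth - 1
            "left": fit_tree(X[mask], y[mask], depth - 1),
            "right": fit_tree(X[~mask], y[~mask], depth - 1)}

def predict(node, x):
    if "leaf" in node:
        return node["leaf"]
    branch = "left" if x[node["var"]] <= node["thresh"] else "right"
    return predict(node[branch], x)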

SLIDE 20

Regression with Decision Trees

(via scikit-learn docs)

SLIDE 21

Linear Regression

Alternatively, write an explicit equation for the prediction. Recall the Perceptron: f(x) = sign(w · x + c), where w is the gradient (slope) and c is the offset.

If x = [x(1), … , x(n)]:

  • Create an n-d line
  • Slope for each x(i)
  • Constant offset


SLIDE 22

Linear Regression

Directly represent f as a linear function:

  f(x, w) = w · x + c

What can be represented this way?

[Figure: a linear function over inputs x1, x2 predicting y.]

SLIDE 23

Linear Regression

How to train? Given inputs:

  • x = [x1, …, xn] (each xi is a vector whose first element is 1)
  • y = [y1, …, yn] (each yi is a real number)

Define an error function: minimize the summed squared error

  ∑_{i=1}^{n} (w · xi − yi)²

SLIDE 24

Linear Regression

The usual story:

  • Set the derivative of the error function to zero.

  d/dw ∑_{i=1}^{n} (w · xi − yi)² = 0

  2 ∑_{i=1}^{n} (w · xi − yi) xiᵀ = 0

  (∑_{i=1}^{n} xiᵀ xi) w = ∑_{i=1}^{n} xiᵀ yi

  w = A⁻¹ b,  where  A = ∑_{i=1}^{n} xiᵀ xi  (a matrix)  and  b = ∑_{i=1}^{n} xiᵀ yi  (a vector)
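A numpy sketch of this closed-form solution (the constant 1 is prepended to each input, as on the previous slide; the data below is made up):

import numpy as np

# Made-up 1-D data, roughly y = 2x + 1 with noise.
x_raw = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Each x_i is a vector whose first element is 1 (the offset term).
X = np.column_stack([np.ones_like(x_raw), x_raw])

# A = sum_i x_i^T x_i,  b = sum_i x_i^T y_i  (computed here via matrix products).
A = X.T @ X
b = X.T @ y
w = np.linalg.solve(A, b)   # w = A^{-1} b, solved without forming an explicit inverse
print(w)                    # approximately [offset, slope] ~ [1.0, 2.0]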

SLIDE 25

Polynomial Regression

More powerful: polynomials in the state variables.

  • 1st order: Φ(x) = [1, x, y, xy]
  • 2nd order: Φ(x) = [1, x, y, xy, x², y², x²y, y²x, x²y²]
  • Predict yi = w · Φ(xi)
  • What can be represented?

SLIDE 26

Polynomial Regression

As before, set the derivative to zero:

  d/dw ∑_{i=1}^{n} (w · Φ(xi) − yi)² = 0

  w = A⁻¹ b,  with  A = ∑_{i=1}^{n} Φ(xi)ᵀ Φ(xi)  and  b = ∑_{i=1}^{n} Φ(xi)ᵀ yi
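The same solve with a feature map Φ; a sketch using 2nd-order features of a single input variable (a simplification of the slide's two-variable example, with made-up data):

import numpy as np

def phi(x):
    """2nd-order polynomial features of a scalar input: [1, x, x^2]."""
    return np.array([1.0, x, x * x])

# Made-up data lying exactly on y = 0.5 x^2 - x + 2.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
ys = 0.5 * xs**2 - xs + 2.0

Phi = np.vstack([phi(x) for x in xs])    # rows are Phi(x_i)
A = Phi.T @ Phi                          # A = sum_i Phi(x_i)^T Phi(x_i)
b = Phi.T @ ys                           # b = sum_i Phi(x_i)^T y_i
w = np.linalg.solve(A, b)
print(w)                                 # ~ [2.0, -1.0, 0.5]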

SLIDE 27

Polynomial Regression

(wikipedia)

SLIDE 28

Overfitting

SLIDE 29

Overfitting

SLIDE 30

Ridge Regression

A characteristic of overfitting:

  • Very large weights.

Modify the objective function to discourage this:

  min_w ∑_{i=1}^{n} (w · xi − yi)² + λ‖w‖²
        (error term)                (regularization term)

  w = (AᵀA + ΛᵀΛ)⁻¹ Aᵀ b

Here A is the matrix whose rows are the xi, b is the vector of labels, and Λ = √λ · I (so ΛᵀΛ = λI).
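A numpy sketch of the regularized solve, using Λ = √λ · I (a standard choice; the data and λ below are made up):

import numpy as np

# Made-up design matrix A (rows are inputs x_i, first column = 1) and labels b.
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 3.1, 4.9, 7.2])

lam = 0.1
Lam = np.sqrt(lam) * np.eye(A.shape[1])   # Lambda = sqrt(lambda) * I

# w = (A^T A + Lambda^T Lambda)^{-1} A^T b
w = np.linalg.solve(A.T @ A + Lam.T @ Lam, A.T @ b)
print(w)   # slightly shrunk compared to the unregularized solution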

SLIDE 31

Neural Network Regression

Recall the single sigmoid unit, σ(w · x + c), used for classification.

SLIDE 32

Neural Network Regression

[Diagram: input layer x1, x2; hidden layer h1, h2, h3; output layer with two units.]
SLIDE 33

Neural Network Regression

[Diagram: the same network, annotated with the value computed at each node.]

Feed forward, with x1, x2 ∈ [0, 1]:

  h1 = σ(w^{h1}_1 x1 + w^{h1}_2 x2 + w^{h1}_3)
  h2 = σ(w^{h2}_1 x1 + w^{h2}_2 x2 + w^{h2}_3)
  h3 = σ(w^{h3}_1 x1 + w^{h3}_2 x2 + w^{h3}_3)

  o1 = σ(w^{o1}_1 h1 + w^{o1}_2 h2 + w^{o1}_3 h3 + w^{o1}_4)
  o2 = σ(w^{o2}_1 h1 + w^{o2}_2 h2 + w^{o2}_3 h3 + w^{o2}_4)

SLIDE 34

Neural Network Regression

A neural network is just a parametrized function: y = f(x, w). How to train it?

Write down an error function: (yi − f(xi, w))². Minimize it (w.r.t. w)!

There is no closed-form solution to gradient = 0. Hence, stochastic gradient descent:

  • Compute d/dw (yi − f(xi, w))² for one sample at a time.
  • Descend: step in the direction of the negative gradient.
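A minimal numpy sketch of stochastic gradient descent for a tiny one-hidden-layer regression network (the architecture, step size, data, and initialization are our own choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Made-up regression data: y = sin(x) on [0, pi].
X = rng.uniform(0.0, np.pi, size=(200, 1))
Y = np.sin(X[:, 0])

# One hidden layer (sigmoid units), linear output: f(x, w).
W1, b1 = rng.normal(0, 1, (8, 1)), np.zeros(8)
W2, b2 = rng.normal(0, 1, 8), 0.0
lr = 0.05

for epoch in range(200):
    for i in rng.permutation(len(X)):
        x, y = X[i], Y[i]
        # Forward pass.
        h = sigmoid(W1 @ x + b1)
        y_hat = W2 @ h + b2
        # Gradient of (y_hat - y)^2 via the chain rule.
        d_out = 2.0 * (y_hat - y)
        dW2, db2 = d_out * h, d_out
        dh = d_out * W2
        dz = dh * h * (1.0 - h)
        dW1, db1 = np.outer(dz, x), dz
        # Descend: step against the gradient.
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

# Training error after SGD.
print(np.mean((np.array([W2 @ sigmoid(W1 @ x + b1) + b2 for x in X]) - Y) ** 2))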

SLIDE 35

Image Colorization

(Zhang, Isola, Efros, 2016)

SLIDE 36

Nonparametric Regression

Most ML methods are parametric:

  • Characterized by setting a few parameters.
  • y = f(x, w)

Alternative approach:

  • Let the data speak for itself.
  • No finite-sized parameter vector.
  • Usually more interesting decision boundaries.
SLIDE 37

Nonparametric Regression

What’s the regression equivalent of k-Nearest Neighbors?

Given training data:

  • X = {x1, …, xn}, Y = {y1, …, yn}
  • a distance metric D(xi, xj)

For a new data point xnew:

  • find the k nearest points in X (measured via D)
  • set ynew to the average of their yi labels, weighted by D
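A sketch of this weighted-average version (inverse-distance weights are one common reading of “weighted by D”; names are ours):

import numpy as np

def knn_regress(X, Y, x_new, k=3, eps=1e-8):
    """Predict y_new as the distance-weighted average of the k nearest labels."""
    dists = np.linalg.norm(X - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)    # closer points count more
    return np.average(Y[nearest], weights=weights)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_regress(X, Y, np.array([1.6]), k=2))  # between 1.0 and 4.0, closer to 4.0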

SLIDE 38

Nonparametric Regression

As k increases, f gets smoother.

SLIDE 39

Gaussian Processes

SLIDE 40

Applications

Model and predict variations in pH, clay, and sand content in topsoil (Gonzalez et al., 2007).