SLIDE 1

Machine Learning Basics Lecture 2: Linear Classification

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Review: machine learning basics

SLIDE 3

Math formulation

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Find y = f(x) ∈ 𝓗 that minimizes the empirical loss

\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)

  • s.t. the expected loss is small

L(f) = \mathbb{E}_{(x, y) \sim D}\left[ l(f, x, y) \right]
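
To make the empirical loss concrete, here is a minimal Python sketch that averages a pointwise loss over the n training examples; the squared loss and the toy data are illustrative assumptions, not from the slides.

```python
import numpy as np

def empirical_loss(f, X, y, loss):
    """Average the pointwise loss l(f, x_i, y_i) over the n training examples."""
    return np.mean([loss(f, x_i, y_i) for x_i, y_i in zip(X, y)])

# Toy example: a linear predictor f(x) = w^T x with the squared (l2) loss.
squared_loss = lambda f, x, y: (f(x) - y) ** 2
w = np.array([1.0, -2.0])
f = lambda x: w @ x

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # n = 3 examples, d = 2 features
y = np.array([1.0, -2.0, -1.0])

print(empirical_loss(f, X, y, squared_loss))  # 0.0 here, since y_i = w^T x_i exactly
```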

SLIDE 4

Machine learning 1-2-3

  • Collect data and extract features
  • Build model: choose hypothesis class 𝓗 and loss function l
  • Optimization: minimize the empirical loss
SLIDE 5

Machine learning 1-2-3

  • Collect data and extract features
  • Build model: choose hypothesis class 𝓗 and loss function l
  • Optimization: minimize the empirical loss

Experience

Prior knowledge

SLIDE 6

Example: Linear regression

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Find f_w(x) = w^T x that minimizes the empirical loss

\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2

l2 loss

Linear model 𝓗

SLIDE 7

Why π‘š2 loss

  • Why not choose another loss?
  • l1 loss, hinge loss, exponential loss, …
  • Empirical: easy to optimize
  • For the linear case there is a closed-form solution: w = (X^T X)^{-1} X^T y (see the sketch after this list)
  • Theoretical: a way to encode prior knowledge

Questions:

  • What kind of prior knowledge?
  • Principled way to derive the loss?
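
Below is a minimal numpy sketch of that closed-form solution; the toy design matrix X and targets y are made up for illustration.

```python
import numpy as np

# Toy data: n = 4 examples with d = 2 features (illustrative values only).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.5, 7.0, 11.0])

# Closed-form least-squares solution: w = (X^T X)^{-1} X^T y.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice, np.linalg.lstsq is preferred for numerical stability.
w_stable, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, w_stable)
```
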
SLIDE 8

Maximum Likelihood Estimation

SLIDE 9

Maximum Likelihood Estimation (MLE)

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Let {P_ΞΈ(x, y) : ΞΈ ∈ Θ} be a family of distributions indexed by ΞΈ
  • We would like to pick ΞΈ so that P_ΞΈ(x, y) fits the data well
SLIDE 10

Maximum Likelihood Estimation (MLE)

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Let {P_ΞΈ(x, y) : ΞΈ ∈ Θ} be a family of distributions indexed by ΞΈ
  • β€œFitness” of ΞΈ to one data point (x_i, y_i):

\mathrm{likelihood}(\theta; x_i, y_i) := P_\theta(x_i, y_i)

SLIDE 11

Maximum Likelihood Estimation (MLE)

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Let {P_ΞΈ(x, y) : ΞΈ ∈ Θ} be a family of distributions indexed by ΞΈ
  • β€œFitness” of ΞΈ to the i.i.d. data points {(x_i, y_i)}:

\mathrm{likelihood}(\theta; \{x_i, y_i\}) := P_\theta(\{x_i, y_i\}) = \prod_i P_\theta(x_i, y_i)

SLIDE 12

Maximum Likelihood Estimation (MLE)

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Let {P_ΞΈ(x, y) : ΞΈ ∈ Θ} be a family of distributions indexed by ΞΈ
  • MLE: maximize the β€œfitness” of ΞΈ to the i.i.d. data points {(x_i, y_i)}

\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} \prod_i P_\theta(x_i, y_i)

SLIDE 13

Maximum Likelihood Estimation (MLE)

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Let {P_ΞΈ(x, y) : ΞΈ ∈ Θ} be a family of distributions indexed by ΞΈ
  • MLE: maximize the β€œfitness” of ΞΈ to the i.i.d. data points {(x_i, y_i)}

\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} \log \Big[ \prod_i P_\theta(x_i, y_i) \Big] = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)

SLIDE 14

Maximum Likelihood Estimation (MLE)

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Let {P_ΞΈ(x, y) : ΞΈ ∈ Θ} be a family of distributions indexed by ΞΈ
  • MLE: negative log-likelihood loss

\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)

l(P_\theta, x_i, y_i) = -\log P_\theta(x_i, y_i), \qquad \hat{L}(P_\theta) = -\sum_i \log P_\theta(x_i, y_i)

SLIDE 15

MLE: conditional log-likelihood

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Let {P_ΞΈ(y | x) : ΞΈ ∈ Θ} be a family of distributions indexed by ΞΈ
  • MLE: negative conditional log-likelihood loss

\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i | x_i)

l(P_\theta, x_i, y_i) = -\log P_\theta(y_i | x_i), \qquad \hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i | x_i)

Only care about predicting y from x; do not care about p(x)

SLIDE 16

MLE: conditional log-likelihood

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Let {P_ΞΈ(y | x) : ΞΈ ∈ Θ} be a family of distributions indexed by ΞΈ
  • MLE: negative conditional log-likelihood loss

\theta_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} \sum_i \log P_\theta(y_i | x_i)

l(P_\theta, x_i, y_i) = -\log P_\theta(y_i | x_i), \qquad \hat{L}(P_\theta) = -\sum_i \log P_\theta(y_i | x_i)

P(y|x): discriminative; P(x,y): generative

SLIDE 17

Example: π‘š2 loss

  • Given training data 𝑦𝑗, 𝑧𝑗 : 1 ≀ 𝑗 ≀ π‘œ i.i.d. from distribution 𝐸
  • Find 𝑔

πœ„ 𝑦 that minimizes ΰ· 

𝑀 𝑔

πœ„ = 1 π‘œ σ𝑗=1 π‘œ

𝑔

πœ„(𝑦𝑗) βˆ’ 𝑧𝑗 2

SLIDE 18

Example: π‘š2 loss

  • Given training data 𝑦𝑗, 𝑧𝑗 : 1 ≀ 𝑗 ≀ π‘œ i.i.d. from distribution 𝐸
  • Find 𝑔

πœ„ 𝑦 that minimizes ΰ· 

𝑀 𝑔

πœ„ = 1 π‘œ σ𝑗=1 π‘œ

𝑔

πœ„(𝑦𝑗) βˆ’ 𝑧𝑗 2

  • Define π‘„πœ„ 𝑧 𝑦 = Normal 𝑧; 𝑔

πœ„ 𝑦 , 𝜏2

  • log(π‘„πœ„ 𝑧𝑗|𝑦𝑗 ) =

βˆ’1 2𝜏2 (𝑔 πœ„ 𝑦𝑗

βˆ’ 𝑧𝑗)2βˆ’log(𝜏) βˆ’

1 2 log(2𝜌)

  • πœ„π‘π‘€ = argminθ∈Θ

1 π‘œ σ𝑗=1 π‘œ

𝑔

πœ„(𝑦𝑗) βˆ’ 𝑧𝑗 2

π‘š2 loss: Normal + MLE

SLIDE 19

Linear classification

SLIDE 20

Example 1: image classification

(Example images: classify each scene as indoor or outdoor)

SLIDE 21

Example 2: Spam detection

             #"$"   #"Mr."   #"sale"   …   Spam?
Email 1        2       1        1           Yes
Email 2        1                            No
Email 3        1       1        1           Yes
…
Email n                                     No
New email      1                            ??

SLIDE 22

Why classification

  • Classification: a kind of summary
  • Easy to interpret
  • Easy to use for making decisions
SLIDE 23

Linear classification

π‘₯π‘ˆπ‘¦ = 0 Class 1 Class 0 π‘₯ π‘₯π‘ˆπ‘¦ > 0 π‘₯π‘ˆπ‘¦ < 0

SLIDE 24

Linear classification: natural attempt

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Hypothesis f_w(x) = w^T x
      • y = 1 if w^T x > 0
      • y = 0 if w^T x < 0
  • Prediction: y = step(f_w(x)) = step(w^T x)

Linear model 𝓗

SLIDE 25

Linear classification: natural attempt

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Find f_w(x) = w^T x to minimize the empirical loss

\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left[ \mathrm{step}(w^T x_i) \ne y_i \right]

  • Drawback: difficult to optimize
  • NP-hard in the worst case

0-1 loss

SLIDE 26

Linear classification: simple approach

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Find f_w(x) = w^T x that minimizes the empirical loss

\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2

Reduce to linear regression; ignore the fact that y ∈ {0, 1}
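
A minimal sketch of this simple approach; the toy data, the appended bias feature, and the 0.5 decision threshold for {0, 1} targets are assumptions, not taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class data in 2D, labels y in {0, 1} (illustrative values only).
X0 = rng.normal(loc=[-1.0, -1.0], scale=0.5, size=(50, 2))
X1 = rng.normal(loc=[+1.0, +1.0], scale=0.5, size=(50, 2))
X = np.vstack([X0, X1])
X = np.hstack([X, np.ones((100, 1))])        # append a constant feature for the bias
y = np.concatenate([np.zeros(50), np.ones(50)])

# Fit w by ordinary least squares on the 0/1 labels, ignoring that y is binary.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classify by thresholding the regression output (0.5 is a common convention
# for {0, 1} targets; the threshold is an assumption, not from the slide).
y_hat = (X @ w > 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```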

SLIDE 27

Linear classification: simple approach

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

Drawback: not robust to β€œoutliers”

SLIDE 28

Compare the two

𝑧 = π‘₯π‘ˆπ‘¦

π‘₯π‘ˆπ‘¦ 𝑧

𝑧 = step(π‘₯π‘ˆπ‘¦)

SLIDE 29

Between the two

  • Prediction bounded in [0, 1]
  • Smooth
  • Sigmoid:

\sigma(a) = \frac{1}{1 + \exp(-a)}

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

SLIDE 30

Linear classification: sigmoid prediction

  • Squash the output of the linear function

Sigmoid π‘₯π‘ˆπ‘¦ = 𝜏 π‘₯π‘ˆπ‘¦ = 1 1 + exp(βˆ’π‘₯π‘ˆπ‘¦)

  • Find π‘₯ that minimizes ΰ· 

𝑀 𝑔

π‘₯ = 1 π‘œ σ𝑗=1 π‘œ

𝜏(π‘₯π‘ˆπ‘¦π‘—) βˆ’ 𝑧𝑗 2

SLIDE 31

Linear classification: logistic regression

  • Squash the output of the linear function

Sigmoid π‘₯π‘ˆπ‘¦ = 𝜏 π‘₯π‘ˆπ‘¦ = 1 1 + exp(βˆ’π‘₯π‘ˆπ‘¦)

  • A better approach: Interpret as a probability

𝑄

π‘₯(𝑧 = 1|𝑦) = 𝜏 π‘₯π‘ˆπ‘¦ =

1 1 + exp(βˆ’π‘₯π‘ˆπ‘¦) 𝑄

π‘₯ 𝑧 = 0 𝑦 = 1 βˆ’ 𝑄 π‘₯ 𝑧 = 1 𝑦 = 1 βˆ’ 𝜏 π‘₯π‘ˆπ‘¦

SLIDE 32

Linear classification: logistic regression

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Find w that minimizes

\hat{L}(w) = -\frac{1}{n} \sum_{i=1}^{n} \log P_w(y_i | x_i)

\hat{L}(w) = -\frac{1}{n} \sum_{i:\, y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{i:\, y_i = 0} \log\left[ 1 - \sigma(w^T x_i) \right]

Logistic regression: MLE with sigmoid

SLIDE 33

Linear classification: logistic regression

  • Given training data (x_i, y_i), 1 ≀ i ≀ n, i.i.d. from distribution D
  • Find w that minimizes

\hat{L}(w) = -\frac{1}{n} \sum_{i:\, y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n} \sum_{i:\, y_i = 0} \log\left[ 1 - \sigma(w^T x_i) \right]

No closed-form solution; need to use gradient descent
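
A minimal gradient-descent sketch of this objective; the toy data, the step size, and the iteration count are assumptions chosen for illustration, not part of the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_nll(w, X, y):
    """Average negative conditional log-likelihood for labels y in {0, 1}."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_grad(w, X, y):
    """Gradient of the average negative log-likelihood: (1/n) X^T (sigmoid(Xw) - y)."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Toy data: two Gaussian blobs with labels 0 and 1, plus a constant bias feature.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
X = np.hstack([X, np.ones((100, 1))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(X.shape[1])
for _ in range(500):                  # plain gradient descent with a fixed step size
    w -= 0.5 * logistic_grad(w, X, y)

print("final NLL:", logistic_nll(w, X, y))
print("training accuracy:", ((sigmoid(X @ w) > 0.5) == y).mean())
```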

SLIDE 34

Properties of sigmoid function

  • Bounded

\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)

  • Symmetric

1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)

  • Gradient

\sigma'(a) = \frac{\exp(-a)}{\left( 1 + \exp(-a) \right)^2} = \sigma(a)\left( 1 - \sigma(a) \right)
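
A quick numerical check of the gradient identity (the test points and the finite-difference step are arbitrary choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 11)      # arbitrary test points
eps = 1e-6

numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)   # central differences
analytic = sigmoid(a) * (1 - sigmoid(a))                      # sigma'(a) = sigma(a)(1 - sigma(a))

print(np.max(np.abs(numeric - analytic)))   # tiny (~1e-10): the identity checks out
```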