SLIDE 1

Machine Learning Basics Lecture 1: Linear Regression

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Machine learning basics

SLIDE 3

What is machine learning?

  • “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
  • ------ Machine Learning, Tom Mitchell, 1997
SLIDE 4

Example 1: image classification

Task: determine if the image is indoor or outdoor
Performance measure: probability of misclassification

SLIDE 5

Example 1: image classification

[Figure: example images labeled “indoor” and “outdoor”]

Experience/Data: images with labels

SLIDE 6

Example 1: image classification

  • A few terminologies
  • Training data: the images given for learning
  • Test data: the images to be classified
  • Binary classification: classify into two classes
SLIDE 7

Example 1: image classification (multi-class)

ImageNet figure borrowed from vision.stanford.edu

SLIDE 8

Example 2: clustering images

Task: partition the images into 2 groups
Performance: similarities within groups
Data: a set of images

SLIDE 9

Example 2: clustering images

  • A few terminologies
  • Unlabeled data vs labeled data
  • Supervised learning vs unsupervised learning
SLIDE 10

Math formulation

[Figure: an indoor image; extract features to get a color histogram over the red, green, and blue channels]

Feature vector: $x_i$    Label: $y_i$ = indoor

SLIDE 11

Math formulation

[Figure: an outdoor image; extract features to get a color histogram over the red, green, and blue channels]

Feature vector: $x_j$    Label: $y_j$ = outdoor

SLIDE 12

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$
  • Find $y = f(x)$ using training data
  • s.t. $f$ correct on test data

What kind of functions?

SLIDE 13

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$
  • Find $y = f(x) \in \mathcal{H}$ using training data
  • s.t. $f$ correct on test data

Hypothesis class

SLIDE 14

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$
  • Find $y = f(x) \in \mathcal{H}$ using training data
  • s.t. $f$ correct on test data

Connection between training data and test data?

SLIDE 15

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ using training data
  • s.t. $f$ correct on test data, i.i.d. from distribution $D$

They have the same distribution. (i.i.d.: independent and identically distributed)

SLIDE 16

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ using training data
  • s.t. $f$ correct on test data, i.i.d. from distribution $D$

What kind of performance measure?

SLIDE 17

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ using training data
  • s.t. the expected loss is small

$L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$

Various loss functions

SLIDE 18

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ using training data
  • s.t. the expected loss is small

$L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$

  • Examples of loss functions (a short numeric sketch follows):
  • 0-1 loss: $l(f, x, y) = \mathbb{I}[f(x) \ne y]$ and $L(f) = \Pr[f(x) \ne y]$
  • $l_2$ loss: $l(f, x, y) = (f(x) - y)^2$ and $L(f) = \mathbb{E}(f(x) - y)^2$
SLIDE 19

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ using training data
  • s.t. the expected loss is small

$L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$

How to use?

SLIDE 20

Math formulation

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $y = f(x) \in \mathcal{H}$ that minimizes
    $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$
  • s.t. the expected loss is small
    $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$

Empirical loss
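As a sketch, the empirical loss is just the per-example loss averaged over the training set; below, both the candidate predictor $f$ and the data are made up for illustration:

```python
import numpy as np

# Hypothetical training set: n = 4 examples with 1-D features.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.1, 0.9, 2.2, 2.8])

def f(x):
    """A candidate predictor from the hypothesis class; here f(x) = x."""
    return x

# Empirical l2 loss: L_hat(f) = (1/n) * sum_i (f(x_i) - y_i)^2
empirical_loss = np.mean((f(x_train) - y_train) ** 2)
print(empirical_loss)  # 0.025
```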

SLIDE 21

Machine learning 1-2-3

  • Collect data and extract features
  • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$
  • Optimization: minimize the empirical loss
SLIDE 22

Wait…

  • Why handcraft the feature vector $x$?
  • Can use prior knowledge to design suitable features
  • Can the computer learn the features from the raw images?
  • Learning features directly from the raw images: Representation Learning
  • Deep Learning ⊆ Representation Learning ⊆ Machine Learning ⊆ Artificial Intelligence

SLIDE 23

Wait…

  • Does Machine Learning 1-2-3 include all approaches?
  • It includes many, but not all
  • Our current focus will be Machine Learning 1-2-3
SLIDE 24

Example: Stock Market Prediction

[Figure: stock prices of Orange, MacroHard, and Ackermann, 2013–2016. Disclaimer: synthetic data / in another parallel universe]

Sliding window over time: serves as input $x$; non-i.i.d.

SLIDE 25

Linear regression

SLIDE 26

Real data: Prostate Cancer by Stamey et al. (1989)

Figure borrowed from The Elements of Statistical Learning

$y$: prostate specific antigen
$(x_1, \dots, x_8)$: clinical measures

SLIDE 27

Linear regression

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes
    $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$

$l_2$ loss; also called mean square error

Hypothesis class $\mathcal{H}$

SLIDE 28

Linear regression: optimization

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $f_w(x) = w^T x$ that minimizes
    $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
  • Let $X$ be a matrix whose $i$-th row is $x_i^T$, and let $y$ be the vector $(y_1, \dots, y_n)^T$

$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2 = \frac{1}{n} \| X w - y \|_2^2$
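A quick numeric check of this identity on made-up data ($X$ is $n \times d$ with rows $x_i^T$; all values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))   # i-th row is x_i^T
y = rng.normal(size=n)
w = rng.normal(size=d)

# Sum form: (1/n) * sum_i (w^T x_i - y_i)^2
loss_sum = sum((w @ X[i] - y[i]) ** 2 for i in range(n)) / n

# Vectorized form: (1/n) * ||Xw - y||_2^2
loss_vec = np.sum((X @ w - y) ** 2) / n

print(np.isclose(loss_sum, loss_vec))  # True
```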

SLIDE 29

Linear regression: optimization

  • Set the gradient to 0 to get the minimizer:

$\nabla_w \hat{L}(f_w) = \nabla_w \frac{1}{n} \| X w - y \|_2^2 = 0$

$\nabla_w [(X w - y)^T (X w - y)] = 0$

$\nabla_w [w^T X^T X w - 2 w^T X^T y + y^T y] = 0$

$2 X^T X w - 2 X^T y = 0$

$w = (X^T X)^{-1} X^T y$

SLIDE 30

Linear regression: optimization

  • Algebraic view of the minimizer
  • If $X$ is invertible, just solve $X w = y$ and get $w = X^{-1} y$
  • But typically $X$ is a tall matrix ($n > d$), so $X w = y$ generally has no exact solution

$X w = y \;\Rightarrow\; X^T X w = X^T y$

Normal equation: $w = (X^T X)^{-1} X^T y$
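In practice, a dedicated least-squares solver is usually preferred for tall $X$, since it avoids the conditioning issues of forming $X^T X$; a sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))  # tall matrix: 100 rows, 4 columns
y = rng.normal(size=100)

# np.linalg.lstsq minimizes ||Xw - y||_2 directly (via SVD) and
# returns the same minimizer as the normal equation here.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_lstsq, w_normal))  # True
```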

SLIDE 31

Linear regression with bias

  • Given training data $(x_i, y_i)$: $1 \le i \le n$, i.i.d. from distribution $D$
  • Find $f_{w,b}(x) = w^T x + b$ to minimize the loss
  • Reduce to the case without bias:
  • Let $w' = [w; b]$ and $x' = [x; 1]$
  • Then $f_{w,b}(x) = w^T x + b = (w')^T (x')$

Bias term
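A sketch of this reduction on synthetic data: append a constant-1 column to $X$, solve as before, and read the bias off the last coordinate of $w'$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + 5.0  # true bias b = 5

# Bias trick: x' = [x; 1], so the last entry of w' plays the role of b.
X_prime = np.hstack([X, np.ones((n, 1))])
w_prime = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ y)

w, b = w_prime[:-1], w_prime[-1]
print(w, b)  # w close to [1, -1, 2], b close to 5.0
```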