BBM406 Fundamentals of Machine Learning
Lecture 2: Machine Learning by Examples, Nearest Neighbor Classifier


SLIDE 1

BBM406 Fundamentals of Machine Learning

Lecture 2: Machine Learning by Examples, Nearest Neighbor Classifier

Aykut Erdem // Hacettepe University // Fall 2019

photo: @rewardyfahmi // Unsplash

SLIDE 2

When Do We Use Machine Learning?

ML is used when:

  • Human expertise does not exist (navigating on Mars)
  • Humans can’t explain their expertise (speech recognition)
  • Models must be customized (personalized medicine)
  • Models are based on huge amounts of data (genomics)

slide based on Ethem Alpaydin

SLIDE 3

A classic example of a task that requires machine learning: It is very hard to say what makes a 2.

slide by Geoffrey Hinton

SLIDE 4

Machine Learning (by examples)

SLIDE 5

Pose Estimation

slide by Alex Smola

SLIDE 6

Collaborative Filtering

Amazon books. Don’t mix preferences on Netflix!

slide by Alex Smola

SLIDE 7

Collaborative Filtering

Should be careful

SLIDE 8

Imitation Learning in Games

Avatar learns from your behavior

Black & White, Lionhead Studios

slide by Alex Smola

SLIDE 9

Reinforcement Learning

https://www.youtube.com/watch?v=lleRKHsJBJ0

slide by Alex Smola

SLIDE 10

Reinforcement Learning

https://www.youtube.com/watch?v=5iZlrBqDYPM

SLIDE 11

Spam Filtering

ham vs. spam

slide by Alex Smola

SLIDE 12

Cheque Reading

segment image, then recognize handwriting

slide by Alex Smola

SLIDE 13

Image Layout

  • Raw set of images from several cameras
  • Joint layout based on image similarity

slide by Alex Smola

SLIDE 14

Search Ads

why these ads?

slide by Alex Smola

SLIDE 15

Self-Driving Cars

Image: https://medium.com/waymo/simulation-how-one-flashing-yellow-light-turns-into-thousands-of-hours-of-experience-a7a1cb475565

SLIDE 16

Speech Recognition

Given an audio waveform, robustly extract & recognize any spoken words.

  • Statistical models can be used to
    – provide greater robustness to noise
    – adapt to the accents of different speakers
    – learn from training data

SLIDE 17

Natural Language Processing

"I need to hide a body" → noun, verb, preposition, …

SLIDE 18

Face Detection

Yang et al., From Facial Parts Responses to Face Detection: A Deep Learning Approach, ICCV 2015

SLIDE 19

Scene Labeling via Deep Learning

[Farabet et al., ICML 2012; PAMI 2013]

slide by Eric Eaton

SLIDE 20

Topic Models of Text Documents

slide by Eric Sudderth

SLIDE 21

Genomics: group individuals by genetic similarity

(figure axes: individuals × genes)

slide by Daphne Koller

SLIDE 22

Learning - revisited

(diagram: data + prior knowledge → Learning → knowledge)

slide by Stuart Russell

SLIDE 23

Learning - revisited

(diagram: data + prior knowledge → Learning → knowledge)

slide by Stuart Russell

SLIDE 24

Programming with Data

  • Want adaptive, robust, and fault-tolerant systems
  • Rule-based implementation ("IF x THEN DO y") is (often)
    – difficult (for the programmer)
    – brittle (can miss many edge cases)
    – a nightmare to maintain explicitly
    – not very effective in practice (e.g. OCR)
  • Usually easy to obtain examples of what we want
    – Collect many pairs (xi, yi)
    – Estimate a function f such that f(xi) = yi (supervised learning; see the sketch below)
    – Detect patterns in the data (unsupervised learning)

slide by Mehryar Mohri
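
As a concrete illustration of the recipe above (a minimal sketch with hypothetical toy data, not code from the slides):

```python
# "Programming with data": instead of hand-coding IF x THEN DO y rules,
# collect (x, y) pairs and estimate a function f with f(xi) = yi.

pairs = [(0.0, "low"), (1.0, "low"), (9.0, "high"), (10.0, "high")]  # toy examples

def f(x):
    """Predict y for a new x from the closest remembered example."""
    _, label = min(pairs, key=lambda p: abs(p[0] - x))
    return label

assert f(0.5) == "low"   # generalizes from nearby examples
assert f(8.0) == "high"
```

The learning rule here is a trivial nearest-example lookup; any estimator of f could stand in its place.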

SLIDE 25

Objectives of Machine Learning

  • Algorithms: design of efficient, accurate, and general learning algorithms to
    – deal with large-scale problems
    – make accurate predictions (on unseen examples)
    – handle a variety of different learning problems
  • Theoretical questions:
    – What can be learned? Under what conditions?
    – What learning guarantees can be given?
    – What is the algorithmic complexity?

slide by Mehryar Mohri

SLIDE 26

Definitions and Terminology

  • Example: an object, an instance of the data used.
  • Features: the set of attributes, often represented as a vector, associated with an example (e.g., height and weight for gender prediction).
  • Labels: in classification, the category associated with an object (e.g., positive or negative in binary classification); in regression, a real value.
  • Training data: the data used to train the learning algorithm (often labeled data).

slide by Mehryar Mohri

SLIDE 27

Definitions and Terminology (cont’d.)

  • Test data: the data used to evaluate the learning algorithm (unlabeled data).
  • Unsupervised learning: no labeled data.
  • Supervised learning: uses labeled data.
  • Weakly or semi-supervised learning: intermediate scenarios.
  • Reinforcement learning: rewards from a sequence of actions.

slide by Mehryar Mohri

SLIDE 28

Supervised Learning

slide by Alex Smola

SLIDE 29

Supervised Learning

  • Binary classification: given x, find y in {-1, 1}
  • Multicategory classification: given x, find y in {1, ..., k}
  • Regression: given x, find y in R (or R^d)
  • Sequence annotation: given a sequence x1 ... xl, find y1 ... yl
  • Hierarchical categorization (ontology): given x, find a point in the hierarchy of y (e.g. a tree)
  • Prediction: given xt and yt-1 ... y1, find yt

In each case we learn y = f(x), often with a loss l(y, f(x)).

slide by Alex Smola
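
To make the loss l(y, f(x)) concrete (a small sketch of two standard choices; these particular functions are illustrations, not taken from the slides):

```python
def zero_one_loss(y, f_x):
    """0/1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 0.0 if y == f_x else 1.0

def squared_loss(y, f_x):
    """Squared loss for regression: penalizes deviations quadratically."""
    return (y - f_x) ** 2

print(zero_one_loss(+1, -1))   # 1.0 (misclassified)
print(squared_loss(2.0, 1.5))  # 0.25
```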

SLIDE 30

Binary Classification

slide by Alex Smola

SLIDE 31

Multiclass Classification + Annotation

slide by Alex Smola

SLIDE 32

Regression

(figure panels: linear vs. nonlinear fits)

slide by Alex Smola

SLIDE 33

Sequence Annotation

Given a sequence: gene finding, speech recognition, activity segmentation, named entities

slide by Alex Smola

SLIDE 34

Ontology

(examples: webpages, genes)

slide by Alex Smola

SLIDE 35

Prediction

tomorrow’s stock price

slide by Alex Smola

SLIDE 36

Unsupervised Learning

slide by Alex Smola

SLIDE 37

Unsupervised Learning

  • Given data x, ask a good question ... about x or about a model for x
  • Clustering: find a set of prototypes representing the data
  • Principal components: find a subspace representing the data
  • Sequence analysis: find a latent causal sequence for the observations
    – sequence segmentation
    – Hidden Markov Model (discrete state)
    – Kalman filter (continuous state)
  • Hierarchical representations
  • Independent components / dictionary learning: find a (small) set of factors per observation
  • Novelty detection: find the odd one out

slide by Alex Smola

SLIDE 38

Clustering

  • Documents
  • Users
  • Webpages
  • Diseases
  • Pictures
  • Vehicles
  • ...

slide by Alex Smola

SLIDE 39

Principal Components

Variance component model to account for sample structure in genome-wide association studies, Nature Genetics 2010

slide by Alex Smola

SLIDE 40

Hierarchical Grouping

slide by Alex Smola

SLIDE 41

Independent Components

find them automatically

slide by Alex Smola

SLIDE 42

Novelty detection

typical vs. atypical

slide by Alex Smola

SLIDE 43

Important challenges in ML

  • How important is the actual learning algorithm and its tuning?
  • Simple versus complex algorithms
  • Overfitting
  • Model selection
  • Regularization

SLIDE 44

Your 1st Classifier: Nearest Neighbor Classifier

SLIDE 45

Concept Learning

  • Definition: acquire an operational definition of a general category of objects given positive and negative training examples.
  • Also called binary classification or binary supervised learning.

slide by Thorsten Joachims

SLIDE 46

Concept Learning Example

  • Instance Space X: set of all possible objects describable by attributes (often called features).
  • Concept c: subset of objects from X (c is unknown).
  • Target Function f: characteristic function indicating membership in c based on the attributes (i.e. the label) (f is unknown).
  • Training Data S: set of instances labeled with the target function.

Attribute values: correct (complete, partial, guessing), color (yes, no), original (yes, no), presentation (clear, unclear, cryptic), binder (yes, no).

  #  correct   color  original  presentation  binder  A+
  1  complete  yes    yes       clear         no      yes
  2  complete  no     yes       clear         no      yes
  3  partial   yes    no        unclear       no      no
  4  complete  yes    yes       clear         yes     yes

slide by Thorsten Joachims

SLIDE 47

Concept Learning as Learning a Binary Function

  • Task
    – Learn (to imitate) a function f : X → {+1, -1}
  • Training Examples
    – The learning algorithm is given the correct value of the function for particular inputs → training examples.
    – An example is a pair (x, y), where x is the input and y = f(x) is the output of the target function applied to x.
  • Goal
    – Find a function h : X → {+1, -1} that approximates f : X → {+1, -1} as well as possible.

slide by Thorsten Joachims

SLIDE 48

Supervised Learning

  • Task
    – Learn (to imitate) a function f : X → Y
  • Training Examples
    – The learning algorithm is given the correct value of the function for particular inputs → training examples.
    – An example is a pair (x, f(x)), where x is the input and y = f(x) is the output of the target function applied to x.
  • Goal
    – Find a function h : X → Y that approximates f : X → Y as well as possible.

slide by Thorsten Joachims

SLIDE 49

Supervised / Inductive Learning

  • Given: examples of a function (x, f(x))
  • Predict f(x) for new examples x
    – Discrete f(x): classification
    – Continuous f(x): regression
    – f(x) = Probability(x): probability estimation

slide by Thorsten Joachims

SLIDE 50

Image Classification: a core task in Computer Vision

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 51

The problem: semantic gap

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 52

Challenges: Viewpoint Variation

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 53

Challenges: Illumination

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 54

Challenges: Deformation

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 55

Challenges: Occlusion

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 56

Challenges: Background clutter

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 57

Challenges: Intraclass variation

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 58

An image classifier

Unlike, e.g., sorting a list of numbers, there is no obvious way to hard-code the algorithm for recognizing a cat, or other classes.

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 59

Attempts have been made

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 60

Data-driven approach:
  1. Collect a dataset of images and labels
  2. Use machine learning to train an image classifier
  3. Evaluate the classifier on a withheld set of test images

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 61

First classifier: Nearest Neighbor Classifier

  • Remember all training images and their labels
  • Predict the label of the most similar training image

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 62

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 63

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 64

How do we compare the images? What is the distance metric?

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 65

Nearest Neighbor classifier

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
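
For reference, the comparison used on the following slides is the L1 (Manhattan) distance between images, summing over all pixel positions p (standard definition):

$$d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right|$$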

SLIDE 66

Nearest Neighbor classifier: remember the training data

SLIDE 67

Nearest Neighbor classifier: for every test image,
  • find the nearest training image with the L1 distance
  • predict the label of that nearest training image

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
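
In code, this procedure looks roughly as follows (a numpy sketch in the spirit of the CS231n slides; the transcript omits the code shown on the slide, so this is a reconstruction):

```python
import numpy as np

class NearestNeighbor:
    """Nearest Neighbor classifier with L1 distance.
    X is N x D, one flattened image per row; y holds the N labels."""

    def train(self, X, y):
        # training just memorizes all the data
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # L1 distance from the i-th test image to every training image
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            nearest = np.argmin(distances)  # index of the most similar image
            Ypred[i] = self.ytr[nearest]    # predict its label
        return Ypred
```

Note that train() is O(1) while predict() scans all N training images, which is exactly why classification speed grows with training-set size (next two slides).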

SLIDE 68

Nearest Neighbor classifier. Q: how does the classification speed depend on the size of the training data?

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 69

Nearest Neighbor classifier. Q: how does the classification speed depend on the size of the training data? Linearly :(

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 70

Aside: Approximate Nearest Neighbor (find approximate nearest neighbors quickly)

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 71

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 72

k-Nearest Neighbor

find the k nearest images, have them vote on the label

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 73

K-Nearest Neighbor (kNN)

  • Given: training data {(y1, z1), …, (yn, zn)}
    – Attribute vectors: yj ∈ Y
    – Labels: zj ∈ Z
  • Parameters:
    – Similarity function: L : Y × Y → R
    – Number of nearest neighbors to consider: k
  • Prediction rule
    – New example y′
    – k-nearest neighbors: the k training examples with largest L(yj, y′)
    – Predict the majority label among these k neighbors

slide by Thorsten Joachims
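
A minimal implementation matching this definition (my own sketch; it takes the similarity L to be the negative L1 distance):

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_z, x_new, k=4):
    """Predict the label of x_new by majority vote among the k training
    examples with the largest similarity L (here: negative L1 distance)."""
    similarity = -np.sum(np.abs(train_X - x_new), axis=1)  # L(y_j, y')
    top_k = np.argsort(similarity)[-k:]                    # k most similar
    votes = Counter(train_z[j] for j in top_k)
    return votes.most_common(1)[0][0]

# toy usage (hypothetical data): two 2-D clusters labeled -1 / +1
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_z = np.array([-1, -1, +1, +1])
print(knn_predict(train_X, train_z, np.array([0.2, 0.1]), k=3))  # -1
```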

SLIDE 74

1-Nearest Neighbor

slide by Thorsten Joachims

SLIDE 75

4-Nearest Neighbors

slide by Thorsten Joachims

SLIDE 76

4-Nearest Neighbors Sign

slide by Thorsten Joachims

SLIDE 77

4-Nearest Neighbors Sign

For binary classification problems, why is it a good idea to use an odd value of k?

slide by Thorsten Joachims

SLIDE 78

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 79

We will talk about this later!

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 80

If we get more data

  • 1-Nearest Neighbor
    – Converges to the perfect solution if there is clear separation
    – Twice the minimal error rate, 2p(1-p), for noisy problems (see the short derivation below)
  • k-Nearest Neighbor
    – Converges to the perfect solution if there is clear separation (but needs more data)
    – Converges to the minimal error min(p, 1-p) for noisy problems as k increases

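Where the 2p(1-p) figure comes from (a one-line sketch under the standard assumption that each label is independently flipped with noise probability p): in the large-data limit the nearest neighbor lies essentially at the query point, so 1-NN errs exactly when one of the two labels (the query's or the neighbor's) is flipped and the other is not:

$$P(\text{error}) = p(1-p) + (1-p)p = 2p(1-p)$$
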
SLIDE 81

Demo

SLIDE 82

Weighted K-Nearest Neighbor

  • Given: training data {(y1, z1), …, (yn, zn)}
    – Attribute vectors: yj ∈ Y
    – Target attribute: zj ∈ Z
  • Parameters:
    – Similarity function: L : Y × Y → R
    – Number of nearest neighbors to consider: k
  • Prediction rule
    – New example y′
    – k-nearest neighbors: the k training examples with largest L(yj, y′)
    – Each neighbor votes with weight L(yj, y′); predict the label with the largest total weight (see the sketch below)
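
A minimal weighted-vote sketch (my own illustration; it assumes a nonnegative similarity, here a Gaussian kernel of the L2 distance with a hypothetical bandwidth parameter, so that votes can be weighted sensibly):

```python
import numpy as np

def weighted_knn_predict(train_X, train_z, x_new, k=4, bandwidth=1.0):
    """Weighted kNN: the k most similar training examples vote on the
    label, each with weight equal to its similarity to x_new."""
    # Gaussian similarity L(y_j, y') = exp(-||y_j - y'||^2 / (2 b^2))
    sq_dist = np.sum((train_X - x_new) ** 2, axis=1)
    similarity = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    top_k = np.argsort(similarity)[-k:]
    # total similarity mass per label; predict the heaviest label
    totals = {lab: similarity[top_k][train_z[top_k] == lab].sum()
              for lab in np.unique(train_z)}
    return max(totals, key=totals.get)

train_X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
train_z = np.array([-1, -1, +1, +1])
print(weighted_knn_predict(train_X, train_z, np.array([0.3, 0.3]), k=3))  # -1
```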

SLIDE 83

More Nearest Neighbors in Visual Data

SLIDE 84

Where in the World? [Hays & Efros, CVPR 2008]

A nearest neighbor recognition example

slide by James Hays

SLIDE 85

Where in the World? [Hays & Efros, CVPR 2008]

slide by James Hays

SLIDE 86

Where in the World? [Hays & Efros, CVPR 2008]

slide by James Hays

SLIDE 87

6+ million geotagged photos by 109,788 photographers, annotated by Flickr users

slide by James Hays

SLIDE 88

6+ million geotagged photos by 109,788 photographers, annotated by Flickr users

slide by James Hays

SLIDE 89

slide by James Hays

SLIDE 90

Scene Matches

slide by James Hays

SLIDE 91

slide by James Hays

SLIDE 92

Scene Matches

slide by James Hays

SLIDE 93

slide by James Hays

SLIDE 94

Scene Matches

slide by James Hays

SLIDE 95

slide by James Hays

SLIDE 96

The Importance of Data

slide by James Hays

SLIDE 97

Scene Completion [Hays & Efros, SIGGRAPH 2007]

slide by James Hays

SLIDE 98

… 200 total

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 99

Context Matching

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 100

Graph cut + Poisson blending

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 101

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 102

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 103

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 104

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 105

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 106

Hays and Efros, SIGGRAPH 2007; slide by James Hays

SLIDE 107

Weighted K-NN for Regression

  • Given: training data {(y1, z1), …, (yn, zn)}
    – Attribute vectors: yj ∈ Y
    – Target attribute: zj ∈ R
  • Parameters:
    – Similarity function: L : Y × Y → R
    – Number of nearest neighbors to consider: k
  • Prediction rule
    – New example y′
    – k-nearest neighbors: the k training examples with largest L(yj, y′)
    – Predict the similarity-weighted average of the neighbors' target values (see the sketch below)

slide by Thorsten Joachims
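
A matching sketch for the regression case (my own illustration; it assumes the prediction is the similarity-weighted average of the k neighbors' target values, reusing the Gaussian similarity from the previous sketch):

```python
import numpy as np

def weighted_knn_regress(train_X, train_z, x_new, k=4, bandwidth=1.0):
    """Weighted kNN regression: similarity-weighted average of the
    target values z of the k most similar training examples."""
    sq_dist = np.sum((train_X - x_new) ** 2, axis=1)
    similarity = np.exp(-sq_dist / (2.0 * bandwidth ** 2))  # L(y_j, y')
    top_k = np.argsort(similarity)[-k:]
    w = similarity[top_k]
    return float(np.sum(w * train_z[top_k]) / np.sum(w))

# toy usage (hypothetical 1-D data): noisy samples of z = 2y
train_X = np.array([[0.0], [1.0], [2.0], [3.0]])
train_z = np.array([0.1, 2.0, 3.9, 6.1])
print(weighted_knn_regress(train_X, train_z, np.array([1.5]), k=2))  # ~2.95
```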

SLIDE 108

Collaborative Filtering

slide by Thorsten Joachims

SLIDE 109

Overview of Nearest Neighbors

  • Very simple method
  • Retains all the training data
  • Can be slow in testing
    – Finding the NN in high dimensions is slow
  • Metrics are very important
  • Good baseline

slide by Rob Fergus

SLIDE 110

Next Class: Linear Regression and Least Squares