

slide-1
SLIDE 1

Diagnostics & Kernel Methods Visualized

Ondřej Bojar

April 3, 2019

NPFL104 Machine Learning Methods

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (unless otherwise stated)

slide-2
SLIDE 2

Diagnostics

slide-3
SLIDE 3

Outline

  • Motivation for principled analysis.
  • Ideas for visualization.
  • Bias vs. Variance.
  • Optimizer vs. Objective function issue.
  • a.k.a. Search error vs. Modelling error.
  • Error analysis, Ablative analysis.

Slides based on:

  • the Stanford ML Lecture 11: http://www.youtube.com/watch?v=sQ8T9b-uGVE
  • http://scott.fortmann-roe.com/docs/BiasVariance.html
  • All errors are Ondřej’s fault.

1/110

slide-4
SLIDE 4

Motivation: Debugging ML

Some ML system does not perform sufficiently well. You can consider random improvements:

  • Getting more training examples.
  • Reduce the set of features.
  • Enlarge the set of features.
  • Use different features.
  • Run the optimizer for some more iterations.
  • Choose a different optimization algorithm.
  • Use a different regularization term or constant value.
  • Try another learning algorithm (SVM).

… some may be fixing problems you don’t have.

2/110

slide-5
SLIDE 5

Principled Analysis

First figure out what’s going on.

  • Overfitting vs. Underfitting?
  • Search error vs. Modelling error?
  • Complex system: Find the most problematic component.

Trivial but vital:

  • Visualize the data. (Plot or view frequent patterns.)
  • Start with simple things.

3/110

slide-6
SLIDE 6

Data Visualization

  • Data visualization is extremely useful.
  • Always plot the data when working with a new dataset.
  • This is an inherent part of hw_my_dataset.

  • Choose one way of visualizing the data to give a quick overview of it.
  • Suggested gradual steps on the following slides.

(Python source on the seminar web page.) An excellent resource: https://matplotlib.org/gallery.html
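A minimal sketch of such an overview plot (the activity/heart-rate numbers below are made up for illustration; they are not the seminar dataset):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical dataset: 1000 samples of (activity intensity, heart rate).
rng = np.random.default_rng(0)
intensity = rng.uniform(0, 30, 1000)
heart_rate = 60 + 4 * intensity + rng.normal(0, 10, 1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(intensity, heart_rate, s=5)        # raw relation at a glance
ax1.set(xlabel="intensity", ylabel="heart rate")
ax2.hist(heart_rate, bins=30, density=True)    # marginal distribution
ax2.set(xlabel="heart rate")
fig.savefig("overview.png")
```

One scatter plus one histogram is usually enough for the first look; everything else in the matplotlib gallery can wait.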

4/110

slide-7
SLIDE 7

Scatter Plot of Random Values

(figure: scatter plot of random values)

5/110

slide-8
SLIDE 8

Scatter Plot of Activity – Heart Rate

(figure: scatter plot, activity vs. heart rate)

6/110

slide-9
SLIDE 9

Box Plot of Activity – Heart Rate

(figure: box plot of heart rate per activity, activities 1–25)

7/110

slide-10
SLIDE 10

Gaussian Function

(figure: the Gaussian density on [−3, 3])

8/110

slide-11
SLIDE 11

Histogram of Random Values

(figure: histogram of random values; legend: hist)

9/110

slide-12
SLIDE 12

Histogram and Gaussian Function (not Fit!)

(figure: histogram with the Gaussian density overlaid; legend: gauss, hist)

10/110

slide-13
SLIDE 13

Histogram of Heart Rate when Lying

(figure: histogram; legend: heart-rate hist. for activity 1)

11/110

slide-14
SLIDE 14

Hists of HR when Lying, Walking, Running

(figure: overlaid histograms; legend: heart-rate hist. for activities 1, 3, 5)

12/110

slide-15
SLIDE 15

Least Squares for Linear Fit

(figure: noisy data with a linear fit; legend: data, initial, optimized)
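The optimized fit on this slide is an ordinary least-squares line; a sketch with synthetic data standing in for the heart-rate measurements:

```python
import numpy as np

# Synthetic, roughly linear data with Gaussian noise (true slope 2.0).
rng = np.random.default_rng(1)
x = rng.uniform(60, 130, 50)
y = 2.0 * x - 60 + rng.normal(0, 8, 50)

# np.polyfit minimises the sum of squared residuals
# sum_i (y_i - (w1 * x_i + w0))^2 over w1, w0.
w1, w0 = np.polyfit(x, y, deg=1)
mse = np.mean((y - (w1 * x + w0)) ** 2)
print(w1, w0, mse)   # slope near 2, intercept near -60
```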

13/110

slide-16
SLIDE 16

Cubic Fit of a Histogram

(figure: histogram data with a cubic fit; legend: data, initial, optimized)

14/110

slide-17
SLIDE 17

Bias vs. Variance

High Variance = Overfitting:

  • the model has too many parameters.

High Bias = Underfitting:

  • the model is too rigid.

Consider:

  • What is the effect of each of those on training error?
  • Will more training data help?
  • Sketch the shape of learning curves for each of those:
  • for the test error.
  • for the training error.

15/110

slide-18
SLIDE 18

Fit sin(x) with poly, orders: 0,1,2,3

16/110

slide-19
SLIDE 19

Fit sin(x) with poly, orders: 3,5,7

17/110

slide-20
SLIDE 20

Fit sin(x) with poly, orders: 7,9,10

18/110

slide-21
SLIDE 21

Fit sin(x) with poly, orders: 10,11,12

19/110

slide-22
SLIDE 22

Fit sin(x) with poly, orders: 12,13,14
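The behaviour in these plots can be reproduced numerically: the training error of a polynomial fit of sin(x) shrinks monotonically with the degree until the fit starts chasing the noise (a sketch; 20 synthetic noisy samples):

```python
import numpy as np
from numpy.polynomial import Polynomial

# 20 noisy samples of sin(x); higher-degree polynomials drive the
# *training* error down -- eventually fitting the noise, not the signal.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 20))
y = np.sin(x) + rng.normal(0, 0.1, 20)

errs = {}
for degree in (1, 3, 7, 9):
    p = Polynomial.fit(x, y, deg=degree)   # rescales x internally; well-conditioned
    errs[degree] = np.mean((p(x) - y) ** 2)
    print(degree, round(errs[degree], 4))
```

Plotting the corresponding test error on held-out points would show it rising again at high degrees, which is the overfitting half of the trade-off.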

20/110

slide-23
SLIDE 23

Bias-Variance Trade-off

The expected error Err(x) of a learner f̂_D trained on varying datasets D, evaluated at a fixed test point x with observed values Y = f(x) + ε (noise variance σ²), can be decomposed as:

Err(x) = E[(Y − f̂_D(x))²]
Err(x) = (E[f̂_D(x)] − f(x))² + E[(f̂_D(x) − E[f̂_D(x)])²] + σ²
Err(x) = Bias² + Variance + Noise

  • Bias: how much the average prediction E[f̂_D(x)] differs from the true value f(x).
  • Variance: how much a particular prediction f̂_D(x) differs from the average prediction E[f̂_D(x)], on average over datasets D.

More: http://scott.fortmann-roe.com/docs/BiasVariance.html Derivation: see slides by Cohen
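The decomposition can also be checked empirically: draw many training sets D, fit a deliberately too-rigid model, and compare Bias² + Variance + Noise against the directly measured expected error. A sketch (the target f(x) = x² and the linear fit are arbitrary choices for illustration):

```python
import numpy as np

# Fit a line to f(x) = x^2: a high-bias model. At a fixed test point x0,
# Bias^2 + Variance + Noise should match the mean squared error measured
# directly over many independent training sets D.
rng = np.random.default_rng(3)
f = lambda x: x ** 2
x0, sigma = 1.0, 0.1
preds, sq_errs = [], []
for _ in range(2000):                        # 2000 independent training sets D
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    w = np.polyfit(x, y, 1)                  # the rigid (linear) model
    yhat = np.polyval(w, x0)
    preds.append(yhat)
    sq_errs.append((f(x0) + rng.normal(0, sigma) - yhat) ** 2)
preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print(round(bias2 + variance + sigma ** 2, 4), round(np.mean(sq_errs), 4))
```

For this underfitting model the bias term dominates the variance term, as the trade-off predicts.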

21/110

slide-24
SLIDE 24

Bias-Variance Trade-off

Picture from: http://scott.fortmann-roe.com/docs/BiasVariance.html 22/110

slide-25
SLIDE 25

Diagnosing Bias vs. Variance from Learning Curves

See the slides by Andrew Ng, plots on slide 7 and 8.

23/110

slide-26
SLIDE 26

Search vs. Modelling Error

Search Error:

  • the optimizer fails to find the best parameters
  • … a problem with the optimizer.

Modelling Error:

  • the best parameters do not lead to the best performance.
  • … a problem with the objective function.

Consider:

  • Will more iterations help?
  • When can two learners help to diagnose the problem?

24/110

slide-27
SLIDE 27

Diagnosing Search vs. Modelling Error

See the slides by Andrew Ng, slide 14.

25/110

slide-28
SLIDE 28

Complex Systems

Error Analysis:

  • Compares the best possible vs. current accuracy.
  • Provide more and more golden truth data as part of the input.
  • Find the component where the jump in accuracy is the highest.

Ablative Analysis:

  • Compares some baseline vs. current accuracy.
  • Switch off more and more components.
  • Find the component where the loss in accuracy is the highest.

26/110

slide-29
SLIDE 29

Kernel Illustrations

slide-30
SLIDE 30

Outline for Kernel Illustrations

  • Regularization parameter C in SVM.
  • Linear Kernel: k(x, y) = x ⋅ y
  • The gain from higher dimensionality.
  • Polynomial Kernel: k(x, y) = (γ x ⋅ y + coef0)^degree
  • RBF Kernel: k(x, y) = exp(−γ ‖x − y‖²); γ > 0

… including their parameters

  • Cross-validation Heatmap
  • Multi-class SVM
  • For the PAMAP-easy dataset.
  • Regularization parameters.
  • Inseparable classes.

Based on http://scikit-learn.org/stable/modules/svm.html and other scikit-learn demos.

27/110

slide-31
SLIDE 31

Regularization (C) in linear SVM

k(x, y) = x ⋅ y (linear kernel = no kernel)

The parameter C in (linear) SVM:

  • sets the weight of the sum of slack variables.
  • serves as a regularization parameter.
  • controls the number of support vectors.

  C      Penalty for errors   Points considered   Margin   Bias   Variance
  Low    Low                  Many                Wide     High   Low
  High   High                 Few                 Narrow   Low    High

Think C for VarianCe.
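The margin/support-vector column of the table can be verified with scikit-learn (two synthetic Gaussian blobs, made up for this sketch):

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs. Low C -> wide margin, many support
# vectors inside it; high C -> narrow margin, fewer support vectors.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

counts = {}
for C in (0.1, 1.0, 100.0):
    counts[C] = SVC(kernel="linear", C=C).fit(X, y).support_vectors_.shape[0]
    print(C, counts[C])
```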

28/110

slide-32
SLIDE 32

SVM Linear C=0.1

29/110

slide-33
SLIDE 33

SVM Linear C=0.2

30/110

slide-34
SLIDE 34

SVM Linear C=0.5

31/110

slide-35
SLIDE 35

SVM Linear C=1

32/110

slide-36
SLIDE 36

SVM Linear C=5

33/110

slide-37
SLIDE 37

SVM Linear C=10

34/110

slide-38
SLIDE 38

SVM Linear C=20

35/110

slide-39
SLIDE 39

SVM Linear C=50

36/110

slide-40
SLIDE 40

SVM Linear C=100

37/110

slide-41
SLIDE 41

Benefitting from Higher Dimensionality

  • Classifiers generally do linear separation.
  • It can be very difficult to come up with features that allow for linear separation.

The trick: map the coordinates to another space where separation is possible.

38/110

slide-42
SLIDE 42

Kernel Function to a Higher Dimension

k(x, z) = xz + x²z² (1)

Picture from https://en.wikipedia.org/wiki/Kernel_method 39/110
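Equation (1) is exactly the inner product of the explicit map φ(x) = (x, x²); a small sketch (made-up 1-D data) shows data becoming linearly separable after the mapping:

```python
import numpy as np
from sklearn.svm import SVC

# k(x, z) = xz + x^2 z^2 is the dot product of phi(x) = (x, x^2).
phi = lambda x: np.column_stack([x, x ** 2])
a, b = 2.0, 3.0
assert np.isclose(a * b + a**2 * b**2, phi(np.array([a]))[0] @ phi(np.array([b]))[0])

# Class 1 lies on both sides of class 0: no single threshold separates
# them in 1-D, but on the x^2 axis the classes split perfectly.
x = np.array([-3.0, -2.5, 2.5, 3.0, -0.5, 0.0, 0.5, 1.0])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
clf = SVC(kernel="linear", C=10.0).fit(phi(x), y)
print(clf.score(phi(x), y))   # separable in the mapped space
```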

slide-43
SLIDE 43

Deep NNs: Kernels on Steroids

Slides from ?). 40/110

slide-44
SLIDE 44

Deep NNs: Kernels on Steroids

Slides from ?). 41/110

slide-45
SLIDE 45

Deep NNs: Kernels on Steroids

Slides from ?). 42/110

slide-46
SLIDE 46

Deep NNs: Kernels on Steroids

Slides from ?). 43/110

slide-47
SLIDE 47

Deep NNs: Kernels on Steroids

Slides from ?). 44/110

slide-48
SLIDE 48

Polynomial Kernel

k(x, y) = (γ x ⋅ y + coef0)^degree
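A sketch of what the degree buys (synthetic ring data; a circular boundary is representable at degree 2 but not at degree 1):

```python
import numpy as np
from sklearn.svm import SVC

# Class 1 inside a circle, class 0 outside: a quadratic decision boundary.
rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, (300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5).astype(int)

accs = {}
for degree in (1, 2):
    clf = SVC(kernel="poly", degree=degree, gamma=1.0, coef0=1.0, C=100.0)
    accs[degree] = clf.fit(X, y).score(X, y)   # training accuracy
    print(degree, round(accs[degree], 3))
```

Degree 1 is a plain linear separator and cannot cut out the circle; degree 2 contains the squared features and fits the ring almost exactly.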

45/110

slide-49
SLIDE 49

SVM Poly (degree 1)

46/110

slide-50
SLIDE 50

SVM Poly (degree 2)

47/110

slide-51
SLIDE 51

SVM Poly (degree 3)

48/110

slide-52
SLIDE 52

SVM Poly (degree 4)

49/110

slide-53
SLIDE 53

SVM Poly (degree 5)

50/110

slide-54
SLIDE 54

SVM Poly (degree 6)

51/110

slide-55
SLIDE 55

SVM Poly (degree 7)

52/110

slide-56
SLIDE 56

SVM Poly (degree 8)

53/110

slide-57
SLIDE 57

SVM Poly (degree 9)

54/110

slide-58
SLIDE 58

SVM Poly (degree 3, gamma 0.05)

55/110

slide-59
SLIDE 59

SVM Poly (degree 3, gamma 0.1)

56/110

slide-60
SLIDE 60

SVM Poly (degree 3, gamma 0.2)

57/110

slide-61
SLIDE 61

SVM Poly (degree 3, gamma 0.5)

58/110

slide-62
SLIDE 62

SVM Poly (degree 3, gamma 0.7)

59/110

slide-63
SLIDE 63

SVM Poly (degree 3, gamma 1)

60/110

slide-64
SLIDE 64

SVM Poly (degree 3, gamma 2)

61/110

slide-65
SLIDE 65

SVM Poly (d=3, g=0.5, coef=-2.0)

62/110

slide-66
SLIDE 66

SVM Poly (d=3, g=0.5, coef=-1.0)

63/110

slide-67
SLIDE 67

SVM Poly (d=3, g=0.5, coef=-0.50)

64/110

slide-68
SLIDE 68

SVM Poly (d=3, g=0.5, coef=0)

65/110

slide-69
SLIDE 69

SVM Poly (d=3, g=0.5, coef=0.5)

66/110

slide-70
SLIDE 70

SVM Poly (d=3, g=0.5, coef=1)

67/110

slide-71
SLIDE 71

SVM Poly (d=3, g=0.5, coef=2)

68/110

slide-72
SLIDE 72

RBF Kernels

k(x, y) = exp(−γ ‖x − y‖²); γ > 0

(figure: plots of exp(−γx²) for γ = 0.5, 1, 2, 3)

  • Each training point creates its bell.
  • Overall shape is the sum of the bells.
  • Kind of “all nearest neighbours”.

… Totally “flips” the space:

  • The axes are now closeness to other objects.
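The "sum of bells" view is literal in scikit-learn: the decision function of an RBF SVM can be reassembled by hand from the support vectors (a sketch on synthetic ring data):

```python
import numpy as np
from sklearn.svm import SVC

# k(x, y) = exp(-gamma * ||x - y||^2): one bell per support vector.
rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, (300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5).astype(int)

gamma = 2.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Reassemble the decision function at one point as a weighted sum of bells
# centred on the support vectors, plus the bias term.
d = np.linalg.norm(X[0] - clf.support_vectors_, axis=1)
manual = np.exp(-gamma * d ** 2) @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.isclose(manual, clf.decision_function(X[:1])[0]))
```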

69/110

slide-73
SLIDE 73

RBF Kernel Parameters

  C      Decision surface   Model     Bias   Variance
  Low    Smooth             Simple    High   Low
  High   Peaked             Complex   Low    High

  gamma   Affected points
  Low     can be far from training examples
  High    must be close to training examples

  • Does higher gamma lead to higher variance?
  • Choice critical for SVM performance.
  • Advised to use GridSearchCV for C and gamma:
  • exponentially spaced probes
  • wide range
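The advised search can be sketched in a few lines (synthetic data; the ranges are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Exponentially spaced probes over a wide range for both C and gamma.
param_grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-4, 1, 6)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
# search.cv_results_["mean_test_score"].reshape(6, 6) is the heatmap data.
```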

70/110

slide-74
SLIDE 74

SVM RBF (C=0.05, gamma=2)

71/110

slide-75
SLIDE 75

SVM RBF (C=0.1, gamma=2)

72/110

slide-76
SLIDE 76

SVM RBF (C=0.2, gamma=2)

73/110

slide-77
SLIDE 77

SVM RBF (C=0.5, gamma=2)

74/110

slide-78
SLIDE 78

SVM RBF (C=0.6, gamma=2)

75/110

slide-79
SLIDE 79

SVM RBF (C=0.7, gamma=2)

76/110

slide-80
SLIDE 80

SVM RBF (C=1, gamma=2)

77/110

slide-81
SLIDE 81

SVM RBF (C=2, gamma=2)

78/110

slide-82
SLIDE 82

SVM RBF (C=1, gamma=2)

79/110

slide-83
SLIDE 83

SVM RBF (C=0.5, gamma=2)

80/110

slide-84
SLIDE 84

SVM RBF (C=0.5, gamma=5)

81/110

slide-85
SLIDE 85

SVM RBF (C=0.5, gamma=10)

82/110

slide-86
SLIDE 86

SVM RBF (C=0.5, gamma=5)

83/110

slide-87
SLIDE 87

SVM RBF (C=0.5, gamma=2)

84/110

slide-88
SLIDE 88

SVM RBF (C=0.5, gamma=1)

85/110

slide-89
SLIDE 89

SVM RBF (C=0.5, gamma=0.7)

86/110

slide-90
SLIDE 90

SVM RBF (C=0.5, gamma=0.5)

87/110

slide-91
SLIDE 91

SVM RBF (C=0.5, gamma=0.2)

88/110

slide-92
SLIDE 92

SVM RBF (C=0.5, gamma=0.1)

89/110

slide-93
SLIDE 93

SVM RBF (C=0.5, gamma=0.05)

90/110

slide-94
SLIDE 94

Cross-validation Heatmap

http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

91/110

slide-95
SLIDE 95

Multi-class SVM

Two implementations in scikit-learn:

  • SVC: one-against-one
  • n(n − 1)/2 classifiers constructed
  • supports various kernels, incl. custom ones
  • LinearSVC: one-vs-the-rest
  • n classifiers trained
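The two schemes can be seen directly in the fitted objects (iris, n = 3 classes):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)            # 3 classes, 4 features

# SVC: one-against-one -> n*(n-1)/2 = 3 pairwise classifiers.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print(ovo.decision_function(X[:1]).shape)    # one column per class pair

# LinearSVC: one-vs-the-rest -> n = 3 classifiers, one weight vector each.
ovr = LinearSVC(max_iter=10000).fit(X, y)
print(ovr.coef_.shape)                       # classes x features
```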

92/110

slide-96
SLIDE 96

PAMAP-easy Training Data

(figure: scatter plot of the PAMAP-easy training data; legend: activity ids 1–24)

93/110

slide-97
SLIDE 97

Default View (every 200)

94/110

slide-98
SLIDE 98

Default View (every 300)

95/110

slide-99
SLIDE 99

Default View (every 400)

96/110

slide-100
SLIDE 100

Regularization C=0.5

97/110

slide-101
SLIDE 101

Regularization C=1

98/110

slide-102
SLIDE 102

Regularization C=5

99/110

slide-103
SLIDE 103

Regularization C=10

100/110

slide-104
SLIDE 104

Regularization C=20

101/110

slide-105
SLIDE 105

Regularization C=50

102/110

slide-106
SLIDE 106

Regularization C=500

103/110

slide-107
SLIDE 107

Regularization C=5000

104/110

slide-108
SLIDE 108

Inseparable classes 12,13 (every 200)

105/110

slide-109
SLIDE 109

Inseparable classes 12,13 (every 100)

106/110

slide-110
SLIDE 110

Inseparable classes 12,13 (every 80)

107/110

slide-111
SLIDE 111

Inseparable classes 12,13 (every 60)

108/110

slide-112
SLIDE 112

Inseparable classes 12,13 (every 55)

109/110

slide-113
SLIDE 113

Summary

Diagnostics:

  • Visualization comes first: quick and easy but very effective.
  • Principled diagnostics:
  • Bias vs. Variance.
  • Optimizer (Search) Error vs. Objective function (Modelling) Error.
  • Error Analysis vs. Ablative Analysis

Kernels and the Effects of Hyperparameters, Visualized:

  • Linear, Polynomial and RBF Kernels.
  • Cross-validation for the choice of hyperparameters.
  • Multi-class SVM.

For hw_gridsearch see the web.

110/110