Outils Statistiques pour Data Science
Part I: Supervised Learning
Massih-Reza Amini, Université Grenoble Alpes, Laboratoire d'Informatique de Grenoble
Massih-Reza.Amini@imag.fr
Organization
❑ Automatic Classification (Massih-Reza Amini, Georgios Balikas)
❑ Clustering (Massih-Reza Amini, Georgios Balikas)
❑ Document representation and indexing (Massih-Reza Amini, Georgios Balikas)
❑ Latent topic discovery (Marianne Clausel, Georgios Balikas)
❑ Visualization (Marianne Clausel, Georgios Balikas)
Massih-Reza.Amini@imag.fr Introduction to Data-Science
Learning and Inference

The process of inference is done in three steps:
1. Observe a phenomenon,
2. Construct a model of the phenomenon,
3. Make predictions.

❑ These steps are involved in more or less all natural sciences! "All that is necessary to reduce the whole of nature to laws similar to those which Newton discovered with the aid of calculus is to have a sufficient number of observations and a mathematics that is complex enough." (Marquis de Condorcet, 1785)
❑ The aim of learning is to automate this process,
❑ The aim of learning theory is to formalize this process.
Induction vs. deduction
❑ Induction is the process of deriving general principles from particular facts or instances.
❑ Deduction, on the other hand, is the process of reasoning in which a conclusion follows necessarily from the stated premises; it is an inference from the general to the specific. This is how mathematicians prove theorems from axioms.
Pattern recognition
If we consider the context of supervised learning for pattern recognition:
❑ The data consist of pairs of examples (vector representation of an observation, class label),
❑ Class labels are often Y = {1, . . . , K} with K large (but in learning theory we consider the binary classification case Y = {−1, +1}),
❑ The learning algorithm constructs an association between the vector representation of an observation and its class label,
❑ Aim: make few errors on unseen examples.
Pattern recognition (Example)

IRIS classification, Ronald Fisher (1936): Iris Setosa, Iris Versicolor, Iris Virginica
Pattern recognition (Example)

❑ The first step is to formalize the perception of the flowers through relevant common characteristics, which constitute the features of their vector representations.
❑ This usually requires expert knowledge.
Pattern recognition (Example)

If observations come from a field of irises:
❑ The constitution of vectorised observations and their associated labels is generally time consuming.
❑ Many studies now focus on representation learning using deep neural networks.
❑ Second step: learning then translates into the search for a function that maps vectorised observations (inputs) to their associated outputs.
Pattern recognition
0. Training set
1. Vector representation
2. Find the separators
3. New examples
4. Predict the labels of the new examples
Approximation - Interpolation

It is always possible to construct a function that exactly fits the data. Is it reasonable?
Occam's razor

Idea: search for regularities (or repetitions) in the observed phenomenon; generalization is done from the past observations to the new future ones ⇒ take the simplest model. But how do we measure simplicity?
1. Number of constants,
2. Number of parameters,
3. ...
Basic Hypotheses
Two types of hypotheses:
❑ Past observations are related to the future ones → the phenomenon is stationary,
❑ Observations are independently generated from a source → notion of independence.
Aims
→ How can one make predictions from past data? What are the hypotheses?
❑ Give a formal definition of learning, generalization, overfitting,
❑ Characterize the performance of learning algorithms,
❑ Construct better algorithms.
Probabilistic model
Relations between the past and future observations:
❑ Independence: each new observation provides maximal individual information,
❑ Identically distributed: observations provide information on the phenomenon which generates them.
Formally
We consider an input space X ⊆ Rd and an output space Y.
Assumption: example pairs (x, y) ∈ X × Y are identically and independently distributed (i.i.d.) with respect to an unknown but fixed probability distribution D.
Samples: we observe a sequence of m pairs of examples (xi, yi) generated i.i.d. from D.
Aim: construct a prediction function f : X → Y which predicts an output y for a given new x with a minimum probability of error.
Supervised Learning
❑ Discriminant models directly find a classification function f : X → Y from a given class of functions F;
❑ The function found should be the one having the lowest probability of error
R(f) = E(x,y)∼D [L(f(x), y)] = ∫X×Y L(f(x), y) dD(x, y)
where L : Y × Y → R+ is a risk function. The risk function usually considered in classification is the misclassification error:
∀(x, y); L(f(x), y) = [[f(x) ≠ y]]
where [[π]] equals 1 if the predicate π is true and 0 otherwise.
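As a small illustration of the 0/1 loss, here is a minimal sketch in C (in the style of the perceptron program later in these slides; all names are ours), computing the empirical misclassification error of a set of predictions:

/* 0/1 loss: 1 if the prediction differs from the label, 0 otherwise */
static int zero_one_loss(int y_pred, int y_true) {
    return y_pred != y_true;
}

/* Empirical misclassification error over m labeled examples,
   labels and predictions in {-1, +1} */
double empirical_risk(const int *y_pred, const int *y_true, long m) {
    long i, errors = 0;
    for (i = 0; i < m; i++)
        errors += zero_one_loss(y_pred[i], y_true[i]);
    return (double)errors / (double)m;
}

This is the quantity whose expectation under D is the true risk R(f); the slides that follow study when minimizing it over a class F is a sound strategy.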
Empirical risk minimization (ERM) principle
❑ As the probability distribution D is unknown, the analytic form of the true risk cannot be derived, so the prediction function cannot be found by minimizing R(f) directly.
❑ Empirical risk minimization (ERM) principle: find f by minimizing the unbiased estimator of R on a given training set S = (xi, yi), i = 1, . . . , m:
ˆRm(f, S) = (1/m) ∑_{i=1}^{m} L(f(xi), yi)
❑ However, without restricting the class of functions this is not the right way of proceeding (Occam's razor) ...
ERM principle, problem
Suppose that the input dimension is d = 1, let the input space X be the interval [a, b] ⊂ R where a and b are real values such that a < b, and suppose that the output space is {−1, +1}. Moreover, suppose that the distribution D generating the examples (x, y) is the uniform distribution over [a, b] × {−1}. Consider now a learning algorithm which minimizes the empirical risk by choosing a function in the class F = {f : [a, b] → {−1, +1}} (also denoted F = {−1, +1}^[a,b]) in the following way: after seeing a training set S = {(x1, y1), . . . , (xm, ym)}, the algorithm outputs the prediction function fS such that
fS(x) = −1 if x ∈ {x1, . . . , xm}, and +1 otherwise.
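The pathological classifier above can be sketched in a few lines of C (our own illustration): it memorizes the training points exactly, so its empirical risk is 0, yet it predicts +1 on every unseen point of [a, b], where the true label is always −1.

/* The "memorizing" classifier f_S of the slide: -1 exactly on the
   training points, +1 everywhere else (exact floating-point match,
   as in the formal definition). */
int f_S(double x, const double *xs, long m) {
    long i;
    for (i = 0; i < m; i++)
        if (xs[i] == x)   /* x is one of the training points */
            return -1;
    return +1;
}

Running it on any training sample shows zero empirical error together with a wrong answer on every other point, which is exactly the failure of unrestricted ERM discussed next.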
Consistency of the ERM principle
❑ For the above problem, the learned classifier has an empirical risk equal to 0, and that for any given training set. However, as the classifier makes an error over the entire infinite set [a, b] except on a finite training set (of measure zero), its generalization error is always equal to 1.
❑ So the question is: in which cases is the ERM principle likely to generate a general learning rule? ⇒ The answer lies in a statistical notion called consistency.
Consistency of the ERM principle (2)
This concept specifies two conditions that a learning algorithm has to fulfil, namely:
(a) the algorithm must return a prediction function whose empirical error reflects its generalization error as the size of the training set tends to infinity:
∀ϵ > 0, lim_{m→∞} P(|ˆL(fS, S) − L(fS)| > ϵ) = 0, denoted ˆL(fS, S) →P L(fS)
(b) in the asymptotic case, the algorithm must find the function which minimises the generalization error in the considered function class:
ˆL(fS, S) →P inf_{g∈F} L(g)
Consistency of the ERM principle (3)
These two conditions imply that the empirical error ˆL(fS, S) of the prediction function fS found by the learning algorithm over a training set S converges in probability both to its generalization error L(fS) and to inf_{g∈F} L(g).
[Figure: empirical risk and true risk converging as the sample size grows.]
Study the consistency of the ERM principle
The fundamental result of learning theory [?, theorem 2.1, p. 38] concerning the consistency of the ERM principle exhibits another relation, involving the supremum over the function class in the form of a one-sided uniform convergence, which stipulates that the ERM principle is consistent if and only if:
∀ϵ > 0, lim_{m→∞} P( sup_{f∈F} [L(f) − ˆL(f, S)] > ϵ ) = 0
Study the consistency of the ERM principle
❑ A direct implication of this result is a uniform bound over the generalization error of all prediction functions f ∈ F learned on a training set S of size m, which writes:
∀δ ∈ ]0, 1], P( ∀f ∈ F, L(f) − ˆL(f, S) ≤ C(F, m, δ) ) ≥ 1 − δ
where C depends on the size of the function class, the size of the training set, and the desired precision δ ∈ ]0, 1].
There are different ways to measure the size of a function class; the measure commonly used is called the complexity or the capacity of the function class.
Usual binary classification models
Perceptron [?]
[Diagram: input signals x1, . . . , xd with synaptic weights w1, . . . , wd and bias w0, summed by Σ and passed through a threshold H̄(·) to produce the output.]
❑ Linear prediction function hw : Rd → R, x ↦ ⟨w̄, x⟩ + w0
Perceptron [?]
❑ Linear prediction function hw : Rd → R, x ↦ ⟨w̄, x⟩ + w0
❑ Find the parameters w = (w̄, w0) by minimising the distance of the misclassified examples to the decision boundary.
[Diagram: hyperplane hw(x) = ⟨w̄, x⟩ + w0 = 0 with normal vector w̄; the distance of a point x to the hyperplane is |hw(x)| / ||w̄||.]
Learning Perceptron parameters
❑ Objective function, over the set I of misclassified examples:
ˆL(w) = − ∑_{i′∈I} yi′ (⟨w̄, xi′⟩ + w0)
❑ Derivatives with respect to the parameters:
∂ˆL(w)/∂w0 = − ∑_{i′∈I} yi′,   ∇ˆL(w̄) = − ∑_{i′∈I} yi′ xi′
❑ Perceptron: on-line parameter updates. For each (x, y), if y(⟨w̄, x⟩ + w0) ≤ 0 then
(w0, w̄) ← (w0, w̄) + η (y, y x)
Graphical depiction of the online update rule
[Figure: a misclassified example (x3, −1) pulls the weight vector, w(t+1) = w(t) − x3, rotating the separator towards a correct classification.]
Perceptron (algorithm)
Algorithm 1 The perceptron algorithm
1: Input: training set S = {(xi, yi) | i ∈ {1, . . . , m}}, learning rate η > 0, maximum number of iterations T
2: Initialize the weights w(0) ← 0
3: t ← 0
4: repeat
5:   Choose randomly an example (x(t), y(t)) ∈ S
6:   if y(t) ⟨w(t), x(t)⟩ ≤ 0 then
7:     w0(t+1) ← w0(t) + η y(t)
8:     w̄(t+1) ← w̄(t) + η y(t) x(t)
9:   end if
10:  t ← t + 1
11: until t > T
☞ But do these updates converge?
Perceptron (convergence)
[?] showed that:
❑ if there exists a weight vector w̄∗ such that ∀i ∈ {1, . . . , m}, yi ⟨w̄∗, xi⟩ > 0,
❑ then, denoting ρ = min_{i∈{1,...,m}} yi ⟨w̄∗/||w̄∗||, xi⟩,
❑ and R = max_{i∈{1,...,m}} ||xi||,
❑ and taking w̄(0) = 0, η = 1,
❑ we have a bound on the maximum number of updates ℓ:
ℓ ≤ ⌊(R/ρ)²⌋
Proof of convergence
1. Suppose that all the examples in the training set lie within a hypersphere of radius R (i.e. ∀xi ∈ S, ||xi|| ≤ R). Further, initialise the weight vector to the null vector (w(0) = 0) and the learning rate to η = 1. Show that after ℓ updates the norm of the current weight vector satisfies:
||w(ℓ)||² ≤ ℓ × R²  (1)
Hint: consider ||w(ℓ)||² as ||w(ℓ) − w(0)||².
2. Under the same conditions as in the previous question, show that after ℓ updates of the weight vector we have:
⟨w∗/||w∗||, w(ℓ)⟩ ≥ ℓ × ρ  (2)
3. Deduce from inequalities (1) and (2) that the number of updates ℓ is bounded by ℓ ≤ ⌊(R/ρ)²⌋, where ⌊x⌋ denotes the floor function (this result is due to Novikoff, 1962).
Perceptron Program
#include "defs.h"

/* Online perceptron: T random draws with learning rate eta.
   X is indexed X[1..m][1..d], Y in {-1,+1}; w[0] is the bias. */
void perceptron(double **X, double *Y, double *w, long int m,
                long int d, double eta, long int T)
{
    long int i, j, t = 0;
    double ProdScal;

    // Initialisation of the weight vector
    for (j = 0; j <= d; j++)
        w[j] = 0.0;

    while (t < T) {
        // Draw a training example at random
        i = (rand() % m) + 1;
        // Compute the score <w,x> + w0
        for (ProdScal = w[0], j = 1; j <= d; j++)
            ProdScal += w[j] * X[i][j];
        // Update on a misclassified (or boundary) example
        if (Y[i] * ProdScal <= 0.0) {
            w[0] += eta * Y[i];
            for (j = 1; j <= d; j++)
                w[j] += eta * Y[i] * X[i][j];
        }
        t++;
    }
}

source: http://ama.liglab.fr/~amini/Perceptron/
ADAptive LInear NEuron [?]
❑ ADAptive LInear NEuron
❑ Linear prediction function: hw : X → R, x ↦ ⟨w̄, x⟩ + w0
❑ Find parameters that minimise a convex upper bound of the empirical 0/1 loss:
ˆL(w) = (1/m) ∑_{i=1}^{m} (yi − hw(xi))²
❑ Update rule: stochastic gradient descent with a learning rate η > 0:
∀(x, y), (w0, w̄) ← (w0, w̄) + η (y − hw(x)) (1, x)  (3)
Adaline
❑ ADAptive LInear NEuron
❑ Linear prediction function: hw : X → R, x ↦ ⟨w̄, x⟩ + w0

Algorithm 2 The Adaline algorithm
1: Input: training set S = {(xi, yi) | i ∈ {1, . . . , m}}, learning rate η > 0, maximum number of iterations T
2: Initialize the weights w(0) ← 0
3: t ← 0
4: repeat
5:   Choose randomly an example (x(t), y(t)) ∈ S
6:   w0(t+1) ← w0(t) + η (y(t) − hw(x(t)))
7:   w̄(t+1) ← w̄(t) + η (y(t) − hw(x(t))) x(t)
8:   t ← t + 1
9: until t > T
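The algorithm above can be sketched in C, in the style of the perceptron program of these slides (a minimal illustration of our own; the 0-based data layout is an assumption):

#include <stdlib.h>

/* Adaline: stochastic gradient steps on the squared loss (y - h_w(x))^2.
   X is indexed X[0..m-1][0..d-1], Y in {-1,+1}; w[0] is the bias,
   w[1..d] the weights. */
void adaline(double **X, double *Y, double *w, long m, long d,
             double eta, long T)
{
    long i, j, t;
    double h;

    for (j = 0; j <= d; j++)
        w[j] = 0.0;

    for (t = 0; t < T; t++) {
        i = rand() % m;                     /* draw a random example */
        for (h = w[0], j = 1; j <= d; j++)  /* h_w(x) = <w,x> + w0 */
            h += w[j] * X[i][j - 1];
        w[0] += eta * (Y[i] - h);           /* bias update */
        for (j = 1; j <= d; j++)            /* weight update */
            w[j] += eta * (Y[i] - h) * X[i][j - 1];
    }
}

Unlike the perceptron, the update fires on every draw, not only on misclassified examples, since the squared loss also penalises correct predictions whose score differs from ±1.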
Formal models
[Diagram: linear unit with constant input x0 = 1 carrying weight w0; inputs x1, . . . , xd with weights w1, . . . , wd feed a summation Σ producing hw(x).]
Perceptron vs Adaline
[Figure: decision boundaries found by the perceptron and by Adaline on the same two-class sample.]
Logistic regression: generative models
Each example x is supposed to be generated by a mixture model with parameters Θ:
P(x | Θ) = ∑_{k=1}^{K} P(y = k) P(x | y = k, Θ)
Logistic regression: generative models
❑ The aim is then to find the parameters Θ for which the model best explains the observations,
❑ This is done by maximizing the log-likelihood of the data S = {(xi, yi); i ∈ {1, . . . , m}}:
L(Θ) = ln ∏_{i=1}^{m} P(xi | Θ)
❑ Classical density functions are Gaussian densities:
P(x | y = k, Θ) = 1 / ((2π)^{d/2} |Σk|^{1/2}) exp(−½ (x − µk)⊤ Σk⁻¹ (x − µk))
Logistic regression: generative models
❑ Once the parameters Θ are estimated, the generative model can be used for classification by applying the Bayes rule:
∀x; y∗ = argmax_k P(y = k | x) ∝ argmax_k P(y = k) P(x | y = k, Θ)
❑ Problem: in most real-life applications the distributional assumption over the data does not hold,
❑ The logistic regression model makes no assumption except that
ln [P(y = 1 | x) / P(y = 0 | x)] = ⟨w̄, x⟩ + w0
Logistic regression
❑ Logistic regression has been proposed to model the posterior probability of classes via the logistic function:
P(y = 1 | x) = 1 / (1 + e^{−⟨w̄,x⟩−w0}) = gw(x)
P(y = 0 | x) = 1 − P(y = 1 | x) = 1 / (1 + e^{⟨w̄,x⟩+w0}) = 1 − gw(x)
P(y | x) = (gw(x))^y (1 − gw(x))^{1−y}
[Figure: the logistic function 1/(1 + exp(−⟨w,x⟩)) plotted against ⟨w,x⟩, rising from 0 to 1 through 0.5 at the origin.]
Logistic regression
❑ For
g : R → ]0, 1[, x ↦ 1 / (1 + e^{−x})
we have
g′(x) = ∂g/∂x = g(x)(1 − g(x))
❑ The model parameters w are found by maximizing the complete log-likelihood which, assuming that the m training examples are generated independently, writes:
L = ln ∏_{i=1}^{m} P(xi, yi) = ln ∏_{i=1}^{m} P(yi | xi) + ln ∏_{i=1}^{m} P(xi) ≈ ∑_{i=1}^{m} ln[(gw(xi))^{yi} (1 − gw(xi))^{1−yi}]
(the term ln ∏ P(xi) does not depend on w and can be dropped from the maximization).
Logistic regression: link with the ERM principle

❑ If we consider the function hw : x ↦ ⟨w̄, x⟩ + w0, the maximization of the log-likelihood L is equivalent to the minimization of the empirical logistic loss in the case where ∀i, yi ∈ {−1, +1}:
ˆL(w) = (1/m) ∑_{i=1}^{m} ln(1 + e^{−yi hw(xi)})
❑ The minimization can be carried out with usual convex optimization techniques (e.g. conjugate gradient or quasi-Newton methods).
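As a simpler alternative to conjugate gradient, the logistic loss can also be minimised by plain stochastic gradient steps, sketched here in C (our own illustration, consistent in style with the other programs of these slides; the data layout is an assumption):

#include <math.h>

/* Sigmoid g(z) = 1 / (1 + exp(-z)) */
static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* Gradient steps on the empirical logistic loss
   (1/m) sum_i log(1 + exp(-y_i (<w,x_i> + w0))), labels in {-1,+1}.
   Updates are applied example by example (stochastic-style passes).
   X[i][j] with j in 0..d-1; w[0] is the bias, w[1..d] the weights. */
void logreg_fit(double **X, double *Y, double *w, long m, long d,
                double eta, long T)
{
    long i, j, t;
    for (j = 0; j <= d; j++) w[j] = 0.0;
    for (t = 0; t < T; t++) {
        for (i = 0; i < m; i++) {
            double h = w[0];
            for (j = 1; j <= d; j++) h += w[j] * X[i][j - 1];
            /* g(-y h) is the derivative factor of log(1 + e^{-y h}) */
            double c = eta * Y[i] * sigmoid(-Y[i] * h) / (double)m;
            w[0] += c;
            for (j = 1; j <= d; j++) w[j] += c * X[i][j - 1];
        }
    }
}

Since the loss is convex in w, any of the mentioned optimizers reaches the same minimum; this sketch just trades speed for simplicity.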
Adaline vs Logistic regression
[Figure: Adaline and logistic regression compared on the same data; the logistic output models P(y | x) with the decision threshold at 0.5.]
ADAptive BOOSTing [?]
❑ The AdaBoost algorithm generates a set of weak learners and combines them by a weighted majority vote in order to produce an efficient final classifier.
❑ Each weak classifier is trained sequentially so as to take into account the classification errors of the previous classifier:
☞ This is done by assigning weights to the training examples and, at each iteration, increasing the weights of those the current classifier misclassifies.
☞ In this way the new classifier focuses on the hard examples that were misclassified by the previous classifier.
AdaBoost, algorithm
Algorithm 3 The AdaBoost algorithm
1: Training set S = {(xi, yi) | i ∈ {1, . . . , m}}
2: Initialize the distribution over the examples: ∀i ∈ {1, . . . , m}, D1(i) = 1/m
3: T, the maximum number of iterations (or classifiers to be combined)
4: for t = 1, . . . , T do
5:   Train a weak classifier ft : X → {−1, +1} using the distribution Dt
6:   Set ϵt = ∑_{i: ft(xi) ≠ yi} Dt(i)
7:   Choose αt = ½ ln((1 − ϵt)/ϵt)
8:   Update the distribution of weights: ∀i ∈ {1, . . . , m}, Dt+1(i) = Dt(i) e^{−αt yi ft(xi)} / Zt, where Zt = ∑_{i=1}^{m} Dt(i) e^{−αt yi ft(xi)}
9: end for
10: The final classifier: ∀x, F(x) = sign(∑_{t=1}^{T} αt ft(x))

source: http://ama.liglab.fr/~amini/RankBoost/
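One round of steps 5-8 can be sketched in C (our own illustration; to keep it short, the weak learner is a fixed 1-d decision stump f(x) = sign(x − θ) rather than one trained on Dt, and we assume 0 < ϵt < 1/2):

#include <math.h>

/* Decision stump used as a stand-in weak learner */
static int stump(double x, double theta) { return x >= theta ? +1 : -1; }

/* One AdaBoost round: given the weights D over m examples and a stump
   theta, compute eps_t and alpha_t and update D in place. Returns
   alpha_t. Assumes 0 < eps_t < 1/2 (no guard for a perfect stump). */
double adaboost_round(const double *x, const int *y, double *D, long m,
                      double theta)
{
    long i;
    double eps = 0.0, alpha, Z = 0.0;
    for (i = 0; i < m; i++)                 /* weighted error eps_t */
        if (stump(x[i], theta) != y[i]) eps += D[i];
    alpha = 0.5 * log((1.0 - eps) / eps);   /* combination weight */
    for (i = 0; i < m; i++) {               /* reweight the examples */
        D[i] *= exp(-alpha * y[i] * stump(x[i], theta));
        Z += D[i];
    }
    for (i = 0; i < m; i++) D[i] /= Z;      /* normalise: D_{t+1} */
    return alpha;
}

A well-known consequence of this choice of αt, visible in the test below, is that after the update the misclassified examples carry exactly half of the total mass.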
How to sample using a distribution Dt
[Figure: rejection sampling; a draw (U, V) is accepted when V falls below Dt(U).]
❑ Choose uniformly at random an index U ∈ {1, . . . , m} and a real value V ∈ [0, max_{i∈{1,...,m}} Dt(i)]; if Dt(U) > V then accept the example (xU, yU), otherwise reject and redraw.
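This rejection scheme fits in a few lines of C (a sketch of our own; 0-based indices, and Dmax is the precomputed maximum of D):

#include <stdlib.h>

/* Rejection sampling from a discrete distribution D over {0, ..., m-1}:
   draw a uniform index U and a uniform level V in [0, Dmax];
   accept U when D[U] > V, otherwise redraw. */
long sample_index(const double *D, long m, double Dmax)
{
    for (;;) {
        long U = rand() % m;
        double V = Dmax * rand() / (double)RAND_MAX;
        if (D[U] > V)
            return U;
    }
}

Accepted indices come out with probability proportional to D(i), which is what the boosting round needs to resample a training set according to Dt.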
AdaBoost, geometry interpretation
[Figure: three successive weak separators with weights α1 = 0.5, α2 = 0.1, α3 = 0.75, and the resulting combined decision boundary.]
Proof of convergence
1. Denoting ∀x, H(x) = ∑_{t=1}^{T} αt ft(x) and F(x) = sign(H(x)), show that
(1/m) ∑_{i=1}^{m} [[yi ≠ F(xi)]] ≤ (1/m) ∑_{i=1}^{m} e^{−yi H(xi)}
2. Deduce that
(1/m) ∑_{i=1}^{m} e^{−yi H(xi)} = ∑_{i=1}^{m} Z1 D2(i) ∏_{t>1} e^{−yi αt ft(xi)}
and
(1/m) ∑_{i=1}^{m} e^{−yi H(xi)} = ∏_{t=1}^{T} Zt  (4)
Homework
3. The minimization of (4) is carried out by minimizing each of its terms. Using the definition of ϵt, show that:
∀t, Zt = ϵt e^{αt} + (1 − ϵt) e^{−αt}
4. Further show that the minimum of the normalisation term with respect to the combination weight αt is reached for αt = ½ ln((1 − ϵt)/ϵt)
5. Setting γt = ½ − ϵt, and when ϵt < ½, show that
∀t, Zt = √(1 − 4γt²) ≤ e^{−2γt²}
6. Finally show that the empirical misclassification error decreases exponentially to 0:
(1/m) ∑_{i=1}^{m} [[yi ≠ F(xi)]] ≤ ∏_{t=1}^{T} Zt ≤ e^{−2 ∑_{t=1}^{T} γt²}
Unconstrained convex optimization
Common convex upper bounds for the misclassification error

[Figure: common convex upper bounds of the 0/1 loss.]
Property
❑ The learning problem casts into an easier unconstrained convex optimization problem.
❑ Consider the Taylor expansion of the objective function around its minimiser w∗, where ∇ˆL(w∗) = 0:
ˆL(w) = ˆL(w∗) + (w − w∗)⊤ ∇ˆL(w∗) + ½ (w − w∗)⊤ H (w − w∗) + o(||w − w∗||²)
❑ The Hessian matrix H is symmetric and, from Schwarz's theorem, its eigenvectors (vi), i = 1, . . . , d, form an orthonormal basis:
∀(i, j) ∈ {1, . . . , d}², H vi = λi vi, and vi⊤ vj = 1 if i = j, 0 otherwise.
Property (2)
❑ Every weight vector w − w∗ can be uniquely decomposed in this basis:
w − w∗ = ∑_{i=1}^{d} qi vi
❑ That is to say:
ˆL(w) = ˆL(w∗) + ½ ∑_{i=1}^{d} λi qi²
❑ Furthermore the Hessian matrix is positive semi-definite, by the definition of the global minimum:
(w − w∗)⊤ H (w − w∗) = ∑_{i=1}^{d} λi qi² = 2(ˆL(w) − ˆL(w∗)) ≥ 0
All the eigenvalues of H are then non-negative.
Property (3)
❑ This implies that the level lines of ˆL, defined as the sets of weight points for which ˆL is constant, are ellipses.
Gradient descent algorithm [?]
❑ The gradient descent algorithm is an iterative algorithm that updates the weight vector at each step:
∀t ∈ N, w(t+1) = w(t) − η ∇ˆL(w(t))
where η > 0 is the learning rate.
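On the quadratic objective of the previous slides the update rule is easy to demonstrate; here is a minimal sketch in C (our own illustration, with the lambdas playing the role of the Hessian eigenvalues):

/* Gradient descent on the separable quadratic
   L(w) = 0.5 * sum_i lambda_i * w_i^2, whose gradient component is
   lambda_i * w_i, so each coordinate contracts by (1 - eta*lambda_i)
   per step. */
void gradient_descent(const double *lambda, double *w, int d,
                      double eta, int T)
{
    int i, t;
    for (t = 0; t < T; t++)
        for (i = 0; i < d; i++)
            w[i] -= eta * lambda[i] * w[i];  /* w <- w - eta * grad */
}

Each coordinate evolves as w_i(t) = (1 − ηλi)^t w_i(0), which is exactly the convergence analysis carried out on the next slide.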
Convergence of the gradient descent algorithm
❑ Take the decomposition of any vector w − w∗ in the orthonormal basis (vi), i = 1, . . . , d, formed by the eigenvectors of the Hessian matrix:
∇ˆL(w) = ∑_{i=1}^{d} qi λi vi
❑ Let w(t) be the weight vector obtained from w(t−1) after applying the gradient descent rule:
w(t) − w(t−1) = ∑_{i=1}^{d} (qi(t) − qi(t−1)) vi = −η ∇ˆL(w(t−1)) = −η ∑_{i=1}^{d} qi(t−1) λi vi
❑ So ∀i ∈ {1, . . . , d}, qi(t) = (1 − ηλi)^t qi(0), and the algorithm converges if η < 1/(2λmax).
Consistency of the ERM principle
A uniform generalization error bound
❑ As part of the study of the consistency of the ERM principle, we now establish a uniform bound on the generalization error of a learned function as a function of its empirical error over a training set.
❑ We cannot reach this result using the same development as previously.
❑ This is mainly due to the fact that, since the learned function fS has knowledge of the training data S = {(xi, yi); i ∈ {1, . . . , m}}, the random variables Xi = (1/m) L(fS(xi), yi), i ∈ {1, . . . , m}, involved in the estimation of the empirical error of fS on S, are all dependent on each other.
⇒ Indeed, if we change one example of the training set, the selected function fS will also change, as well as the instantaneous errors of all the other examples.
Rademacher complexity [?]
❑ In the derivation of uniform generalization error bounds, different capacity measures of the class of functions have been proposed. Among them, the Rademacher complexity allows an accurate estimate of the capacity of a class of functions, and it depends on the training sample.
❑ The empirical Rademacher complexity estimates the richness of a function class F by measuring the degree to which the class is able to fit random noise on a training set S = {(x1, y1), . . . , (xm, ym)} of size m generated i.i.d. with respect to a probability distribution D.
Rademacher complexity
❑ This complexity is estimated through Rademacher variables σ = (σ1, . . . , σm)⊤, independent discrete random variables taking values in {−1, +1} with equal probability 1/2, i.e. ∀i ∈ {1, . . . , m}, P(σi = −1) = P(σi = +1) = 1/2. The empirical Rademacher complexity is defined as:
ˆRm(F, S) = (2/m) Eσ [ sup_{f∈F} | ∑_{i=1}^{m} σi f(xi) |  given x1, . . . , xm ]
❑ Furthermore, we define the Rademacher complexity of the class F independently of a given training set by:
Rm(F) = E_{S∼Dm} ˆRm(F, S) = (2/m) E_{S,σ} [ sup_{f∈F} | ∑_{i=1}^{m} σi f(xi) | ]
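The expectation over σ can be approximated by Monte Carlo for any finite function class; here is a sketch in C (entirely our own illustration: the class of 1-d threshold functions f_θ(x) = sign(x − θ) over a grid of thetas is an assumption, and the fixed-size sigma buffer limits m to 64):

#include <stdlib.h>
#include <math.h>

/* Monte Carlo estimate of the empirical Rademacher complexity
   (2/m) E_sigma sup_f |sum_i sigma_i f(x_i)| for a toy class of
   threshold functions f_theta(x) = sign(x - theta). */
double rademacher_est(const double *x, long m, const double *thetas,
                      long n_f, long n_trials)
{
    double total = 0.0;
    long trial, i, k;
    for (trial = 0; trial < n_trials; trial++) {
        double best = 0.0;
        int sigma[64];                /* sketch assumes m <= 64 */
        /* draw one Rademacher vector sigma in {-1,+1}^m */
        for (i = 0; i < m; i++)
            sigma[i] = (rand() % 2) ? +1 : -1;
        for (k = 0; k < n_f; k++) {   /* sup over the finite class */
            double corr = 0.0;
            for (i = 0; i < m; i++)
                corr += sigma[i] * (x[i] >= thetas[k] ? +1.0 : -1.0);
            if (fabs(corr) > best) best = fabs(corr);
        }
        total += best;
    }
    return (2.0 / m) * (total / n_trials);
}

A richer class tracks the random signs more closely and yields a larger estimate, which is the intuition behind using this quantity as a capacity measure.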
A uniform generalization error bound
Theorem (Generalization bound with the Rademacher complexity). Let X ⊆ Rd be a vector space and Y = {−1, +1} an output space. Suppose that the pairs of examples (x, y) ∈ X × Y are generated i.i.d. with respect to the probability distribution D. Let F be a class of functions taking values in Y and L : Y × Y → [0, 1] a given instantaneous loss. Then for all δ ∈ ]0, 1], with probability at least 1 − δ, the following inequality holds:
∀f ∈ F, L(f) ≤ ˆL(f, S) + Rm(L ◦ F) + √(ln(1/δ) / (2m))  (5)
and also, with probability at least 1 − δ:
L(f) ≤ ˆL(f, S) + ˆRm(L ◦ F, S) + 3 √(ln(2/δ) / (2m))  (6)
where L ◦ F = {(x, y) ↦ L(f(x), y) | f ∈ F}.
A uniform generalization error bound (1)
1. Link the supremum of L(f) − ˆL(f, S) over F with its expectation.
The study of this bound is achieved by linking the supremum appearing in the right-hand side of the above inequality with its expectation, through a powerful tool developed for empirical processes by [?] and known as the theorem of bounded differences:
Let I ⊆ R be a real interval, and X1, . . . , Xm be m independent random variables taking values in I. Let Φ : Im → R be such that ∀i ∈ {1, . . . , m}, ∃ci ∈ R for which the following inequality holds for any (x1, . . . , xm) ∈ Im and ∀x′ ∈ I:
|Φ(x1, . . . , xi−1, xi, xi+1, . . . , xm) − Φ(x1, . . . , xi−1, x′, xi+1, . . . , xm)| ≤ ci
We then have:
∀ϵ > 0, P(Φ(X1, . . . , Xm) − E[Φ] > ϵ) ≤ e^{−2ϵ² / ∑_{i=1}^{m} ci²}
A uniform generalization error bound (1)
1. Link the supremum of L(f) − ˆL(f, S) over F with its expectation.
Consider the function Φ : S ↦ sup_{f∈F} [L(f) − ˆL(f, S)]. The McDiarmid inequality can then be applied to Φ with ci = 1/m for all i, thus:
∀ϵ > 0, P( sup_{f∈F} [L(f) − ˆL(f, S)] − ES sup_{f∈F} [L(f) − ˆL(f, S)] > ϵ ) ≤ e^{−2mϵ²}
A uniform generalization error bound (2)
2. Bound ES sup_{f∈F} [L(f) − ˆL(f, S)] with respect to Rm(L ◦ F).
This is a symmetrisation step: it consists in introducing a second, virtual sample S′, also generated i.i.d. with respect to Dm, into ES sup_{f∈F} [L(f) − ˆL(f, S)].
→ ES sup_{f∈F} [L(f) − ˆL(f, S)] = ES sup_{f∈F} ES′ [ˆL(f, S′) − ˆL(f, S)] ≤ ES ES′ sup_{f∈F} [ˆL(f, S′) − ˆL(f, S)]
→ On the other hand,
ES ES′ sup_{f∈F} [ˆL(f, S′) − ˆL(f, S)] = ES ES′ Eσ sup_{f∈F} [ (1/m) ∑_{i=1}^{m} σi (L(f(x′i), y′i) − L(f(xi), yi)) ]
A uniform generalization error bound (2)
2. Bound ES sup_{f∈F} [L(f) − ˆL(f, S)] with respect to Rm(L ◦ F). By applying the triangle inequality to the supremum it comes:
ES ES′ Eσ sup_{f∈F} [ (1/m) ∑_{i=1}^{m} σi (L(f(x′i), y′i) − L(f(xi), yi)) ]
≤ ES ES′ Eσ sup_{f∈F} (1/m) ∑_{i=1}^{m} σi L(f(x′i), y′i) + ES ES′ Eσ sup_{f∈F} (1/m) ∑_{i=1}^{m} (−σi) L(f(xi), yi)
Finally, as σi and −σi have the same distribution for every i, we have:
ES ES′ sup_{f∈F} [ˆL(f, S′) − ˆL(f, S)] ≤ 2 ES Eσ sup_{f∈F} (1/m) ∑_{i=1}^{m} σi L(f(xi), yi) ≤ Rm(L ◦ F)  (7)
A uniform generalization error bound (2)
2. Bound ES sup_{f∈F} [L(f) − ˆL(f, S)] with respect to Rm(L ◦ F). Summarizing the results obtained so far, we have:
1. ∀f ∈ F, ∀S, L(f) − ˆL(f, S) ≤ sup_{f∈F} [L(f) − ˆL(f, S)]
2. ∀ϵ > 0, P( sup_{f∈F} [L(f) − ˆL(f, S)] − ES sup_{f∈F} [L(f) − ˆL(f, S)] > ϵ ) ≤ e^{−2mϵ²}
3. ES sup_{f∈F} [L(f) − ˆL(f, S)] ≤ Rm(L ◦ F)
The first point of the theorem (Eq. 5) is obtained by solving the equation e^{−2mϵ²} = δ for ϵ.
A uniform generalization error bound (3)
3. Bound Rm(L ◦ F) with respect to ˆRm(L ◦ F, S).
→ Apply the McDiarmid inequality to the function Φ : S ↦ ˆRm(L ◦ F, S):
∀ϵ > 0, P(Rm(L ◦ F) > ˆRm(L ◦ F, S) + ϵ) ≤ e^{−mϵ²/2}
Thus, for δ/2 = e^{−mϵ²/2}, we have with probability at least 1 − δ/2:
Rm(L ◦ F) ≤ ˆRm(L ◦ F, S) + 2 √(ln(2/δ) / (2m))
From the first point (Eq. 5) of the theorem, we also have with probability at least 1 − δ/2:
∀f ∈ F, ∀S, L(f) ≤ ˆL(f, S) + Rm(L ◦ F) + √(ln(2/δ) / (2m))
The second point (Eq. 6) of the theorem is then obtained by combining the two previous results using the union bound.
Structural Risk Minimization
[Figure: as the complexity of the class grows, the empirical error decreases while the bound (empirical error + complexity term) is minimised at an intermediate capacity.]
Structural Risk Minimization (2)
Image from: http://www.svms.org/srm/
Regularization
❑ Find a predictor by minimising the empirical risk with an added penalty for the size of the model,
❑ A simple approach consists in choosing a large class of functions F, defining on F a regularizer, typically a norm ||f||, then minimizing the regularized empirical risk:
f̂ = argmin_{f∈F} ˆRm(f, S) + γ ||f||², where γ is a hyperparameter
❑ The hyperparameter, or regularisation parameter, allows choosing the right trade-off between fit and complexity.
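For a linear predictor with the squared loss, the regularized objective is easy to write down explicitly; here is a minimal sketch in C (our own illustration; the bias term is omitted for brevity):

/* Regularized empirical risk of a linear predictor under the squared
   loss: R_hat(w) = (1/m) sum_i (y_i - <w, x_i>)^2 + gamma * ||w||^2.
   X[i][j] with j in 0..d-1. */
double regularized_risk(double **X, const double *Y, const double *w,
                        long m, long d, double gamma)
{
    long i, j;
    double risk = 0.0, norm2 = 0.0;
    for (i = 0; i < m; i++) {
        double h = 0.0, e;
        for (j = 0; j < d; j++) h += w[j] * X[i][j];   /* <w, x_i> */
        e = Y[i] - h;
        risk += e * e;
    }
    for (j = 0; j < d; j++) norm2 += w[j] * w[j];      /* ||w||^2 */
    return risk / m + gamma * norm2;
}

With gamma = 0 this is plain ERM; increasing gamma trades empirical fit against the size of w, which is exactly the compromise the cross-validation slide tunes.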
K-fold cross validation
❑ Create a K-fold partition of the dataset,
❑ For each of the K experiments, use K − 1 folds for training and a different fold for testing; for K = 4:
Crossval. 1: Train on folds 2, 3, 4; Test on fold 1
Crossval. 2: Train on folds 1, 3, 4; Test on fold 2
Crossval. 3: Train on folds 1, 2, 4; Test on fold 3
Crossval. 4: Train on folds 1, 2, 3; Test on fold 4
❑ The chosen value of the hyperparameter is the value γk for which the testing performance over the folds is the highest.
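The partition itself can be sketched in a few lines of C (our own illustration; examples are assigned to folds round-robin by index, whereas real code would shuffle the data first):

/* K-fold cross-validation split: for fold k (0-based), mark each of
   the m examples as test (1) if it falls in fold k, train (0)
   otherwise. */
void kfold_mask(int *is_test, long m, int K, int k)
{
    long i;
    for (i = 0; i < m; i++)
        is_test[i] = (i % K) == k;
}

Iterating k over 0, . . . , K−1 then yields the K train/test experiments of the figure, with each example used exactly once as a test point.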
In summary
❑ For induction, we should control the capacity of the class of functions.
❑ The study of the consistency of the ERM principle led to the second fundamental principle of machine learning, called structural risk minimization (SRM).
❑ Learning is a compromise between a low empirical risk and a high capacity of the class of functions in use.
References
Massih-Reza Amini. Apprentissage Machine : de la théorie à la pratique. Éditions Eyrolles, 2015.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141:148–188, 1989.
V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
References
A.B. Novikoff. On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12:615–622, 1962.
F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1986.
R.E. Schapire. Theoretical views of boosting and applications. In Proceedings of the 10th International Conference on Algorithmic Learning Theory, pages 13–25, 1999.
References
B. Widrow and M. Hoff. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, 4:96–104, 1960.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1998.