Outils Statistiques pour Data Science
Part I: Supervised Learning
Massih-Reza Amini, Université Grenoble Alpes, Laboratoire d'Informatique de Grenoble
Massih-Reza.Amini@imag.fr
Organization
❑ Automatic Classification (Massih-Reza Amini, Georgios Balikas)
❑ Clustering (Massih-Reza Amini, Georgios Balikas)
❑ Document representation and indexing (Massih-Reza Amini, Georgios Balikas)
❑ Latent topic discovery (Marianne Clausel, Georgios Balikas)
❑ Visualization (Marianne Clausel, Georgios Balikas)
Massih-Reza.Amini@imag.fr Introduction to Data-Science
Learning and Inference

The process of inference is done in three steps:
1. Observe a phenomenon,
2. Construct a model of the phenomenon,
3. Make predictions.

❑ These steps are involved in more or less all natural sciences! "All that is necessary to reduce the whole of nature to laws similar to those which Newton discovered with the aid of calculus is to have a sufficient number of observations and a mathematics that is complex enough." (Marquis de Condorcet, 1785)
❑ The aim of learning is to automate this process,
❑ The aim of learning theory is to formalize this process.
Induction vs. deduction
❑ Induction is the process of deriving general principles from particular facts or instances.
❑ Deduction, on the other hand, is the process of reasoning in which a conclusion follows necessarily from the stated premises; it is an inference from the general to the specific. This is how mathematicians prove theorems from axioms.
Pattern recognition
If we consider the context of supervised learning for pattern recognition:
❑ The data consist of pairs of examples (vector representation of an observation, class label),
❑ Class labels are often Y = {1, . . . , K} with K large (but in learning theory we consider the binary classification case Y = {−1, +1}),
❑ The learning algorithm constructs an association between the vector representation of an observation and its class label,
❑ Aim: make few errors on unseen examples.
Pattern recognition (Example)

IRIS classification, Ronald Fisher (1936): Iris Setosa, Iris Versicolor, Iris Virginica
Pattern recognition (Example)

❑ The first step is to formalize the perception of the flowers through relevant common characteristics, which constitute the features of their vector representations.
❑ This usually requires expert knowledge.
Pattern recognition (Example)

If observations come from a field of irises:
❑ The constitution of vectorised observations and their associated labels is generally time consuming.
❑ Many studies now focus on representation learning using deep neural networks.
❑ Second step: learning then translates into the search for a function that maps vectorised observations (inputs) to their associated outputs.
Pattern recognition
0. Training set
1. Vector representation
2. Find the separators
3. New examples
4. Predict the labels of the new examples
Approximation - Interpolation

It is always possible to construct a function that exactly fits the data. Is it reasonable?
Occam's razor

Idea: search for regularities (or repetitions) in the observed phenomenon; generalization is done from the past observations to the new future ones ⇒ take the simplest model. But how do we measure simplicity?
1. Number of constants,
2. Number of parameters,
3. ...
Basic Hypotheses
Two types of hypotheses:
❑ Past observations are related to the future ones → the phenomenon is stationary,
❑ Observations are independently generated from a source → notion of independence.
Aims
→ How can one make predictions from past data? What are the hypotheses?
❑ Give a formal definition of learning, generalization, overfitting,
❑ Characterize the performance of learning algorithms,
❑ Construct better algorithms.
Probabilistic model
Relations between the past and future observations:
❑ Independence: each new observation provides maximal individual information,
❑ Identically distributed: observations provide information on the phenomenon which generates them.
Formally
We consider an input space X ⊆ Rd and an output space Y.
Assumption: example pairs (x, y) ∈ X × Y are identically and independently distributed (i.i.d.) with respect to an unknown but fixed probability distribution D.
Samples: we observe a sequence of m pairs of examples (xi, yi) generated i.i.d. from D.
Aim: construct a prediction function f : X → Y which predicts an output y for a given new x with a minimum probability of error.
Supervised Learning
❑ Discriminant models directly find a classification function f : X → Y from a given class of functions F;
❑ The function found should be the one having the lowest probability of error
R(f) = E(x,y)∼D [L(f(x), y)] = ∫X×Y L(f(x), y) dD(x, y)
where L : Y × Y → R+ is a risk function. The risk function usually considered in classification is the misclassification error:
∀(x, y); L(f(x), y) = [[f(x) ≠ y]]
where [[π]] equals 1 if the predicate π is true and 0 otherwise.
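As a small illustration of the 0/1 loss, here is a minimal sketch in C (in the style of the perceptron program later in these slides; all names are ours), computing the empirical misclassification error of a set of predictions:

/* 0/1 loss: 1 if the prediction differs from the label, 0 otherwise */
static int zero_one_loss(int y_pred, int y_true) {
    return y_pred != y_true;
}

/* Empirical misclassification error over m labeled examples,
   labels and predictions in {-1, +1} */
double empirical_risk(const int *y_pred, const int *y_true, long m) {
    long i, errors = 0;
    for (i = 0; i < m; i++)
        errors += zero_one_loss(y_pred[i], y_true[i]);
    return (double)errors / (double)m;
}

This is the quantity whose expectation under D is the true risk R(f); the slides that follow study when minimizing it over a class F is a sound strategy.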
Empirical risk minimization (ERM) principle
❑ As the probability distribution D is unknown, the analytic form of the true risk cannot be derived, so the prediction function cannot be found by minimizing R(f) directly.
❑ Empirical risk minimization (ERM) principle: find f by minimizing the unbiased estimator of R on a given training set S = (xi, yi), i = 1, . . . , m:
ˆRm(f, S) = (1/m) ∑_{i=1}^{m} L(f(xi), yi)
❑ However, without restricting the class of functions this is not the right way of proceeding (Occam's razor) ...
ERM principle, problem
Suppose that the input dimension is d = 1, let the input space X be the interval [a, b] ⊂ R where a and b are real values such that a < b, and suppose that the output space is {−1, +1}. Moreover, suppose that the distribution D generating the examples (x, y) is the uniform distribution over [a, b] × {−1}. Consider now a learning algorithm which minimizes the empirical risk by choosing a function in the class F = {f : [a, b] → {−1, +1}} (also denoted F = {−1, +1}^[a,b]) in the following way: after seeing a training set S = {(x1, y1), . . . , (xm, ym)}, the algorithm outputs the prediction function fS such that
fS(x) = −1 if x ∈ {x1, . . . , xm}, and +1 otherwise.
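The pathological classifier above can be sketched in a few lines of C (our own illustration): it memorizes the training points exactly, so its empirical risk is 0, yet it predicts +1 on every unseen point of [a, b], where the true label is always −1.

/* The "memorizing" classifier f_S of the slide: -1 exactly on the
   training points, +1 everywhere else (exact floating-point match,
   as in the formal definition). */
int f_S(double x, const double *xs, long m) {
    long i;
    for (i = 0; i < m; i++)
        if (xs[i] == x)   /* x is one of the training points */
            return -1;
    return +1;
}

Running it on any training sample shows zero empirical error together with a wrong answer on every other point, which is exactly the failure of unrestricted ERM discussed next.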
Consistency of the ERM principle
❑ For the above problem, the learned classifier has an empirical risk equal to 0, and that for any given training set. However, as the classifier makes an error over the entire infinite set [a, b] except on a finite training set (of measure zero), its generalization error is always equal to 1.
❑ So the question is: in which cases is the ERM principle likely to generate a general learning rule? ⇒ The answer lies in a statistical notion called consistency.
Consistency of the ERM principle (2)
This concept specifies two conditions that a learning algorithm has to fulfil, namely:
(a) the algorithm must return a prediction function whose empirical error reflects its generalization error as the size of the training set tends to infinity:
∀ϵ > 0, lim_{m→∞} P(|ˆL(fS, S) − L(fS)| > ϵ) = 0, denoted ˆL(fS, S) →P L(fS)
(b) in the asymptotic case, the algorithm must find the function which minimises the generalization error in the considered function class:
ˆL(fS, S) →P inf_{g∈F} L(g)
Consistency of the ERM principle (3)
These two conditions imply that the empirical error ˆL(fS, S) of the prediction function fS found by the learning algorithm over a training set S converges in probability both to its generalization error L(fS) and to inf_{g∈F} L(g).
[Figure: empirical risk and true risk converging as the sample size grows.]
Study the consistency of the ERM principle
The fundamental result of learning theory [?, theorem 2.1, p. 38] concerning the consistency of the ERM principle exhibits another relation, involving the supremum over the function class in the form of a one-sided uniform convergence, which stipulates that the ERM principle is consistent if and only if:
∀ϵ > 0, lim_{m→∞} P( sup_{f∈F} [L(f) − ˆL(f, S)] > ϵ ) = 0
Study the consistency of the ERM principle
❑ A direct implication of this result is a uniform bound over the generalization error of all prediction functions f ∈ F learned on a training set S of size m, which writes:
∀δ ∈ ]0, 1], P( ∀f ∈ F, L(f) − ˆL(f, S) ≤ C(F, m, δ) ) ≥ 1 − δ
where C depends on the size of the function class, the size of the training set, and the desired precision δ ∈ ]0, 1].
There are different ways to measure the size of a function class; the measure commonly used is called the complexity or the capacity of the function class.
Usual binary classification models
Perceptron [?]
[Diagram: input signals x1, . . . , xd with synaptic weights w1, . . . , wd and bias w0, summed by Σ and passed through a threshold H̄(·) to produce the output.]
❑ Linear prediction function hw : Rd → R, x ↦ ⟨w̄, x⟩ + w0
Perceptron [?]
❑ Linear prediction function hw : Rd → R, x ↦ ⟨w̄, x⟩ + w0
❑ Find the parameters w = (w̄, w0) by minimising the distance of the misclassified examples to the decision boundary.
[Diagram: hyperplane hw(x) = ⟨w̄, x⟩ + w0 = 0 with normal vector w̄; the distance of a point x to the hyperplane is |hw(x)| / ||w̄||.]
Learning Perceptron parameters
❑ Objective function, over the set I of misclassified examples:
ˆL(w) = − ∑_{i′∈I} yi′ (⟨w̄, xi′⟩ + w0)
❑ Derivatives with respect to the parameters:
∂ˆL(w)/∂w0 = − ∑_{i′∈I} yi′,   ∇ˆL(w̄) = − ∑_{i′∈I} yi′ xi′
❑ Perceptron: on-line parameter updates. For each (x, y), if y(⟨w̄, x⟩ + w0) ≤ 0 then
(w0, w̄) ← (w0, w̄) + η (y, y x)
Graphical depiction of the online update rule
[Figure: a misclassified example (x3, −1) pulls the weight vector, w(t+1) = w(t) − x3, rotating the separator towards a correct classification.]
Perceptron (algorithm)
Algorithm 1 The perceptron algorithm
1: Input: training set S = {(xi, yi) | i ∈ {1, . . . , m}}, learning rate η > 0, maximum number of iterations T
2: Initialize the weights w(0) ← 0
3: t ← 0
4: repeat
5:   Choose randomly an example (x(t), y(t)) ∈ S
6:   if y(t) ⟨w(t), x(t)⟩ ≤ 0 then
7:     w0(t+1) ← w0(t) + η y(t)
8:     w̄(t+1) ← w̄(t) + η y(t) x(t)
9:   end if
10:  t ← t + 1
11: until t > T
☞ But do these updates converge?
Perceptron (convergence)
[?] showed that:
❑ if there exists a weight vector w̄∗ such that ∀i ∈ {1, . . . , m}, yi ⟨w̄∗, xi⟩ > 0,
❑ then, denoting ρ = min_{i∈{1,...,m}} yi ⟨w̄∗/||w̄∗||, xi⟩,
❑ and R = max_{i∈{1,...,m}} ||xi||,
❑ and taking w̄(0) = 0, η = 1,
❑ we have a bound on the maximum number of updates ℓ:
ℓ ≤ ⌊(R/ρ)²⌋
Proof of convergence
1. Suppose that all the examples in the training set lie within a hypersphere of radius R (i.e. ∀xi ∈ S, ||xi|| ≤ R). Further, initialise the weight vector to the null vector (w(0) = 0) and the learning rate to η = 1. Show that after ℓ updates the norm of the current weight vector satisfies:
||w(ℓ)||² ≤ ℓ × R²  (1)
Hint: consider ||w(ℓ)||² as ||w(ℓ) − w(0)||².
2. Under the same conditions as in the previous question, show that after ℓ updates of the weight vector we have:
⟨w∗/||w∗||, w(ℓ)⟩ ≥ ℓ × ρ  (2)
3. Deduce from inequalities (1) and (2) that the number of updates ℓ is bounded by ℓ ≤ ⌊(R/ρ)²⌋, where ⌊x⌋ denotes the floor function (this result is due to Novikoff, 1962).
Perceptron Program
#include "defs.h"

/* Online perceptron: T random draws with learning rate eta.
   X is indexed X[1..m][1..d], Y in {-1,+1}; w[0] is the bias. */
void perceptron(double **X, double *Y, double *w, long int m,
                long int d, double eta, long int T)
{
    long int i, j, t = 0;
    double ProdScal;

    // Initialisation of the weight vector
    for (j = 0; j <= d; j++)
        w[j] = 0.0;

    while (t < T) {
        // Draw a training example at random
        i = (rand() % m) + 1;
        // Compute the score <w,x> + w0
        for (ProdScal = w[0], j = 1; j <= d; j++)
            ProdScal += w[j] * X[i][j];
        // Update on a misclassified (or boundary) example
        if (Y[i] * ProdScal <= 0.0) {
            w[0] += eta * Y[i];
            for (j = 1; j <= d; j++)
                w[j] += eta * Y[i] * X[i][j];
        }
        t++;
    }
}

source: http://ama.liglab.fr/~amini/Perceptron/
ADAptive LInear NEuron [?]
❑ ADAptive LInear NEuron
❑ Linear prediction function: hw : X → R, x ↦ ⟨w̄, x⟩ + w0
❑ Find parameters that minimise a convex upper bound of the empirical 0/1 loss:
ˆL(w) = (1/m) ∑_{i=1}^{m} (yi − hw(xi))²
❑ Update rule: stochastic gradient descent with a learning rate η > 0:
∀(x, y), (w0, w̄) ← (w0, w̄) + η (y − hw(x)) (1, x)  (3)
Adaline
❑ ADAptive LInear NEuron
❑ Linear prediction function: hw : X → R, x ↦ ⟨w̄, x⟩ + w0

Algorithm 2 The Adaline algorithm
1: Input: training set S = {(xi, yi) | i ∈ {1, . . . , m}}, learning rate η > 0, maximum number of iterations T
2: Initialize the weights w(0) ← 0
3: t ← 0
4: repeat
5:   Choose randomly an example (x(t), y(t)) ∈ S
6:   w0(t+1) ← w0(t) + η (y(t) − hw(x(t)))
7:   w̄(t+1) ← w̄(t) + η (y(t) − hw(x(t))) x(t)
8:   t ← t + 1
9: until t > T
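The algorithm above can be sketched in C, in the style of the perceptron program of these slides (a minimal illustration of our own; the 0-based data layout is an assumption):

#include <stdlib.h>

/* Adaline: stochastic gradient steps on the squared loss (y - h_w(x))^2.
   X is indexed X[0..m-1][0..d-1], Y in {-1,+1}; w[0] is the bias,
   w[1..d] the weights. */
void adaline(double **X, double *Y, double *w, long m, long d,
             double eta, long T)
{
    long i, j, t;
    double h;

    for (j = 0; j <= d; j++)
        w[j] = 0.0;

    for (t = 0; t < T; t++) {
        i = rand() % m;                     /* draw a random example */
        for (h = w[0], j = 1; j <= d; j++)  /* h_w(x) = <w,x> + w0 */
            h += w[j] * X[i][j - 1];
        w[0] += eta * (Y[i] - h);           /* bias update */
        for (j = 1; j <= d; j++)            /* weight update */
            w[j] += eta * (Y[i] - h) * X[i][j - 1];
    }
}

Unlike the perceptron, the update fires on every draw, not only on misclassified examples, since the squared loss also penalises correct predictions whose score differs from ±1.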
Formal models
[Diagram: linear unit with constant input x0 = 1 carrying weight w0; inputs x1, . . . , xd with weights w1, . . . , wd feed a summation Σ producing hw(x).]
Perceptron vs Adaline
[Figure: decision boundaries found by the perceptron and by Adaline on the same two-class sample.]
Logistic regression: generative models
Each example x is supposed to be generated by a mixture model with parameters Θ:
P(x | Θ) = ∑_{k=1}^{K} P(y = k) P(x | y = k, Θ)
Logistic regression: generative models
❑ The aim is then to find the parameters Θ for which the model best explains the observations,
❑ This is done by maximizing the log-likelihood of the data S = {(xi, yi); i ∈ {1, . . . , m}}:
L(Θ) = ln ∏_{i=1}^{m} P(xi | Θ)
❑ Classical density functions are Gaussian densities:
P(x | y = k, Θ) = 1 / ((2π)^{d/2} |Σk|^{1/2}) exp(−½ (x − µk)⊤ Σk⁻¹ (x − µk))
Logistic regression: generative models
❑ Once the parameters Θ are estimated, the generative model can be used for classification by applying the Bayes rule:
∀x; y∗ = argmax_k P(y = k | x) ∝ argmax_k P(y = k) P(x | y = k, Θ)
❑ Problem: in most real-life applications the distributional assumption over the data does not hold,
❑ The logistic regression model makes no assumption except that
ln [P(y = 1 | x) / P(y = 0 | x)] = ⟨w̄, x⟩ + w0
Logistic regression
❑ Logistic regression has been proposed to model the posterior probability of classes via the logistic function:
P(y = 1 | x) = 1 / (1 + e^{−⟨w̄,x⟩−w0}) = gw(x)
P(y = 0 | x) = 1 − P(y = 1 | x) = 1 / (1 + e^{⟨w̄,x⟩+w0}) = 1 − gw(x)
P(y | x) = (gw(x))^y (1 − gw(x))^{1−y}
[Figure: the logistic function 1/(1 + exp(−⟨w,x⟩)) plotted against ⟨w,x⟩, rising from 0 to 1 through 0.5 at the origin.]
Logistic regression
❑ For
g : R → ]0, 1[, x ↦ 1 / (1 + e^{−x})
we have
g′(x) = ∂g/∂x = g(x)(1 − g(x))
❑ The model parameters w are found by maximizing the complete log-likelihood which, assuming that the m training examples are generated independently, writes:
L = ln ∏_{i=1}^{m} P(xi, yi) = ln ∏_{i=1}^{m} P(yi | xi) + ln ∏_{i=1}^{m} P(xi) ≈ ∑_{i=1}^{m} ln[(gw(xi))^{yi} (1 − gw(xi))^{1−yi}]
(the term ln ∏ P(xi) does not depend on w and can be dropped from the maximization).
Logistic regression: link with the ERM principle

❑ If we consider the function hw : x ↦ ⟨w̄, x⟩ + w0, the maximization of the log-likelihood L is equivalent to the minimization of the empirical logistic loss in the case where ∀i, yi ∈ {−1, +1}:
ˆL(w) = (1/m) ∑_{i=1}^{m} ln(1 + e^{−yi hw(xi)})
❑ The minimization can be carried out with usual convex optimization techniques (e.g. conjugate gradient or quasi-Newton methods).
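As a simpler alternative to conjugate gradient, the logistic loss can also be minimised by plain stochastic gradient steps, sketched here in C (our own illustration, consistent in style with the other programs of these slides; the data layout is an assumption):

#include <math.h>

/* Sigmoid g(z) = 1 / (1 + exp(-z)) */
static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* Gradient steps on the empirical logistic loss
   (1/m) sum_i log(1 + exp(-y_i (<w,x_i> + w0))), labels in {-1,+1}.
   Updates are applied example by example (stochastic-style passes).
   X[i][j] with j in 0..d-1; w[0] is the bias, w[1..d] the weights. */
void logreg_fit(double **X, double *Y, double *w, long m, long d,
                double eta, long T)
{
    long i, j, t;
    for (j = 0; j <= d; j++) w[j] = 0.0;
    for (t = 0; t < T; t++) {
        for (i = 0; i < m; i++) {
            double h = w[0];
            for (j = 1; j <= d; j++) h += w[j] * X[i][j - 1];
            /* g(-y h) is the derivative factor of log(1 + e^{-y h}) */
            double c = eta * Y[i] * sigmoid(-Y[i] * h) / (double)m;
            w[0] += c;
            for (j = 1; j <= d; j++) w[j] += c * X[i][j - 1];
        }
    }
}

Since the loss is convex in w, any of the mentioned optimizers reaches the same minimum; this sketch just trades speed for simplicity.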
Adaline vs Logistic regression
[Figure: Adaline and logistic regression compared on the same data; the logistic output models P(y | x) with the decision threshold at 0.5.]
ADAptive BOOSTing [?]
❑ The AdaBoost algorithm generates a set of weak learners and combines them by a weighted majority vote in order to produce an efficient final classifier.
❑ Each weak classifier is trained sequentially so as to take into account the classification errors of the previous classifier:
☞ This is done by assigning weights to the training examples and, at each iteration, increasing the weights of those the current classifier misclassifies.
☞ In this way the new classifier focuses on the hard examples that were misclassified by the previous classifier.
AdaBoost, algorithm
Algorithm 3 The AdaBoost algorithm
1: Training set S = {(xi, yi) | i ∈ {1, . . . , m}}
2: Initialize the distribution over the examples: ∀i ∈ {1, . . . , m}, D1(i) = 1/m
3: T, the maximum number of iterations (or classifiers to be combined)
4: for t = 1, . . . , T do
5:   Train a weak classifier ft : X → {−1, +1} using the distribution Dt
6:   Set ϵt = ∑_{i: ft(xi) ≠ yi} Dt(i)
7:   Choose αt = ½ ln((1 − ϵt)/ϵt)
8:   Update the distribution of weights: ∀i ∈ {1, . . . , m}, Dt+1(i) = Dt(i) e^{−αt yi ft(xi)} / Zt, where Zt = ∑_{i=1}^{m} Dt(i) e^{−αt yi ft(xi)}
9: end for
10: The final classifier: ∀x, F(x) = sign(∑_{t=1}^{T} αt ft(x))

source: http://ama.liglab.fr/~amini/RankBoost/
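One round of steps 5-8 can be sketched in C (our own illustration; to keep it short, the weak learner is a fixed 1-d decision stump f(x) = sign(x − θ) rather than one trained on Dt, and we assume 0 < ϵt < 1/2):

#include <math.h>

/* Decision stump used as a stand-in weak learner */
static int stump(double x, double theta) { return x >= theta ? +1 : -1; }

/* One AdaBoost round: given the weights D over m examples and a stump
   theta, compute eps_t and alpha_t and update D in place. Returns
   alpha_t. Assumes 0 < eps_t < 1/2 (no guard for a perfect stump). */
double adaboost_round(const double *x, const int *y, double *D, long m,
                      double theta)
{
    long i;
    double eps = 0.0, alpha, Z = 0.0;
    for (i = 0; i < m; i++)                 /* weighted error eps_t */
        if (stump(x[i], theta) != y[i]) eps += D[i];
    alpha = 0.5 * log((1.0 - eps) / eps);   /* combination weight */
    for (i = 0; i < m; i++) {               /* reweight the examples */
        D[i] *= exp(-alpha * y[i] * stump(x[i], theta));
        Z += D[i];
    }
    for (i = 0; i < m; i++) D[i] /= Z;      /* normalise: D_{t+1} */
    return alpha;
}

A well-known consequence of this choice of αt, visible in the test below, is that after the update the misclassified examples carry exactly half of the total mass.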
How to sample using a distribution Dt
[Figure: rejection sampling; a draw (U, V) is accepted when V falls below Dt(U).]
❑ Choose uniformly at random an index U ∈ {1, . . . , m} and a real value V ∈ [0, max_{i∈{1,...,m}} Dt(i)]; if Dt(U) > V then accept the example (xU, yU), otherwise reject and redraw.
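This rejection scheme fits in a few lines of C (a sketch of our own; 0-based indices, and Dmax is the precomputed maximum of D):

#include <stdlib.h>

/* Rejection sampling from a discrete distribution D over {0, ..., m-1}:
   draw a uniform index U and a uniform level V in [0, Dmax];
   accept U when D[U] > V, otherwise redraw. */
long sample_index(const double *D, long m, double Dmax)
{
    for (;;) {
        long U = rand() % m;
        double V = Dmax * rand() / (double)RAND_MAX;
        if (D[U] > V)
            return U;
    }
}

Accepted indices come out with probability proportional to D(i), which is what the boosting round needs to resample a training set according to Dt.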
AdaBoost, geometry interpretation
[Figure: three successive weak separators with weights α1 = 0.5, α2 = 0.1, α3 = 0.75, and the resulting combined decision boundary.]
Proof of convergence
1. Denoting ∀x, H(x) = ∑_{t=1}^{T} αt ft(x) and F(x) = sign(H(x)), show that
(1/m) ∑_{i=1}^{m} [[yi ≠ F(xi)]] ≤ (1/m) ∑_{i=1}^{m} e^{−yi H(xi)}
2. Deduce that
(1/m) ∑_{i=1}^{m} e^{−yi H(xi)} = ∑_{i=1}^{m} Z1 D2(i) ∏_{t>1} e^{−yi αt ft(xi)}
and
(1/m) ∑_{i=1}^{m} e^{−yi H(xi)} = ∏_{t=1}^{T} Zt  (4)
Homework
3. The minimization of (4) is carried out by minimizing each of its terms. Using the definition of ϵt, show that:
∀t, Zt = ϵt e^{αt} + (1 − ϵt) e^{−αt}
4. Further show that the minimum of the normalisation term with respect to the combination weight αt is reached for αt = ½ ln((1 − ϵt)/ϵt)
5. Setting γt = ½ − ϵt, and when ϵt < ½, show that
∀t, Zt = √(1 − 4γt²) ≤ e^{−2γt²}
6. Finally show that the empirical misclassification error decreases exponentially to 0:
(1/m) ∑_{i=1}^{m} [[yi ≠ F(xi)]] ≤ ∏_{t=1}^{T} Zt ≤ e^{−2 ∑_{t=1}^{T} γt²}
Unconstrained convex optimization
Common convex upper bounds for the misclassification error

[Figure: common convex upper bounds of the 0/1 loss.]
Property
❑ The learning problem casts into an easier unconstrained convex optimization problem.
❑ Consider the Taylor expansion of the objective function around its minimiser w∗, where ∇ˆL(w∗) = 0:
ˆL(w) = ˆL(w∗) + (w − w∗)⊤ ∇ˆL(w∗) + ½ (w − w∗)⊤ H (w − w∗) + o(||w − w∗||²)
❑ The Hessian matrix H is symmetric and, from Schwarz's theorem, its eigenvectors (vi), i = 1, . . . , d, form an orthonormal basis:
∀(i, j) ∈ {1, . . . , d}², H vi = λi vi, and vi⊤ vj = 1 if i = j, 0 otherwise.
Property (2)
❑ Every weight vector w − w∗ can be uniquely decomposed in this basis:
w − w∗ = ∑_{i=1}^{d} qi vi
❑ That is to say:
ˆL(w) = ˆL(w∗) + ½ ∑_{i=1}^{d} λi qi²
❑ Furthermore the Hessian matrix is positive semi-definite, by the definition of the global minimum:
(w − w∗)⊤ H (w − w∗) = ∑_{i=1}^{d} λi qi² = 2(ˆL(w) − ˆL(w∗)) ≥ 0
All the eigenvalues of H are then non-negative.
Property (3)
❑ This implies that the level lines of ˆL, defined as the sets of weight points for which ˆL is constant, are ellipses.
Gradient descent algorithm [?]
❑ The gradient descent algorithm is an iterative algorithm that updates the weight vector at each step:
∀t ∈ N, w(t+1) = w(t) − η ∇ˆL(w(t))
where η > 0 is the learning rate.
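On the quadratic objective of the previous slides the update rule is easy to demonstrate; here is a minimal sketch in C (our own illustration, with the lambdas playing the role of the Hessian eigenvalues):

/* Gradient descent on the separable quadratic
   L(w) = 0.5 * sum_i lambda_i * w_i^2, whose gradient component is
   lambda_i * w_i, so each coordinate contracts by (1 - eta*lambda_i)
   per step. */
void gradient_descent(const double *lambda, double *w, int d,
                      double eta, int T)
{
    int i, t;
    for (t = 0; t < T; t++)
        for (i = 0; i < d; i++)
            w[i] -= eta * lambda[i] * w[i];  /* w <- w - eta * grad */
}

Each coordinate evolves as w_i(t) = (1 − ηλi)^t w_i(0), which is exactly the convergence analysis carried out on the next slide.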
Convergence of the gradient descent algorithm
❑ Take the decomposition of any vector w − w∗ in the orthonormal basis (vi), i = 1, . . . , d, formed by the eigenvectors of the Hessian matrix:
∇ˆL(w) = ∑_{i=1}^{d} qi λi vi
❑ Let w(t) be the weight vector obtained from w(t−1) after applying the gradient descent rule:
w(t) − w(t−1) = ∑_{i=1}^{d} (qi(t) − qi(t−1)) vi = −η ∇ˆL(w(t−1)) = −η ∑_{i=1}^{d} qi(t−1) λi vi
❑ So ∀i ∈ {1, . . . , d}, qi(t) = (1 − ηλi)^t qi(0), and the algorithm converges if η < 1/(2λmax).
Consistency of the ERM principle
A uniform generalization error bound
❑ As part of the study of the consistency of the ERM principle, we now establish a uniform bound on the generalization error of a learned function as a function of its empirical error over a training set.
❑ We cannot reach this result using the same development as previously.
❑ This is mainly due to the fact that, since the learned function fS has knowledge of the training data S = {(xi, yi); i ∈ {1, . . . , m}}, the random variables Xi = (1/m) L(fS(xi), yi), i ∈ {1, . . . , m}, involved in the estimation of the empirical error of fS on S, are all dependent on each other.
⇒ Indeed, if we change one example of the training set, the selected function fS will also change, as well as the instantaneous errors of all the other examples.
Rademacher complexity [?]
❑ In the derivation of uniform generalization error bounds, different capacity measures of the class of functions have been proposed. Among them, the Rademacher complexity allows an accurate estimate of the capacity of a class of functions, and it depends on the training sample.
❑ The empirical Rademacher complexity estimates the richness of a function class F by measuring the degree to which the class is able to fit random noise on a training set S = {(x1, y1), . . . , (xm, ym)} of size m generated i.i.d. with respect to a probability distribution D.
Rademacher complexity
❑ This complexity is estimated through Rademacher variables σ = (σ1, . . . , σm)⊤, independent discrete random variables taking values in {−1, +1} with equal probability 1/2, i.e. ∀i ∈ {1, . . . , m}, P(σi = −1) = P(σi = +1) = 1/2. The empirical Rademacher complexity is defined as:
ˆRm(F, S) = (2/m) Eσ [ sup_{f∈F} | ∑_{i=1}^{m} σi f(xi) |  given x1, . . . , xm ]
❑ Furthermore, we define the Rademacher complexity of the class F independently of a given training set by:
Rm(F) = E_{S∼Dm} ˆRm(F, S) = (2/m) E_{S,σ} [ sup_{f∈F} | ∑_{i=1}^{m} σi f(xi) | ]
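The expectation over σ can be approximated by Monte Carlo for any finite function class; here is a sketch in C (entirely our own illustration: the class of 1-d threshold functions f_θ(x) = sign(x − θ) over a grid of thetas is an assumption, and the fixed-size sigma buffer limits m to 64):

#include <stdlib.h>
#include <math.h>

/* Monte Carlo estimate of the empirical Rademacher complexity
   (2/m) E_sigma sup_f |sum_i sigma_i f(x_i)| for a toy class of
   threshold functions f_theta(x) = sign(x - theta). */
double rademacher_est(const double *x, long m, const double *thetas,
                      long n_f, long n_trials)
{
    double total = 0.0;
    long trial, i, k;
    for (trial = 0; trial < n_trials; trial++) {
        double best = 0.0;
        int sigma[64];                /* sketch assumes m <= 64 */
        /* draw one Rademacher vector sigma in {-1,+1}^m */
        for (i = 0; i < m; i++)
            sigma[i] = (rand() % 2) ? +1 : -1;
        for (k = 0; k < n_f; k++) {   /* sup over the finite class */
            double corr = 0.0;
            for (i = 0; i < m; i++)
                corr += sigma[i] * (x[i] >= thetas[k] ? +1.0 : -1.0);
            if (fabs(corr) > best) best = fabs(corr);
        }
        total += best;
    }
    return (2.0 / m) * (total / n_trials);
}

A richer class tracks the random signs more closely and yields a larger estimate, which is the intuition behind using this quantity as a capacity measure.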
A uniform generalization error bound
Theorem (Generalization bound with the Rademacher complexity). Let X ⊆ Rd be a vector space and Y = {−1, +1} an output space. Suppose that the pairs of examples (x, y) ∈ X × Y are generated i.i.d. with respect to the probability distribution D. Let F be a class of functions taking values in Y and L : Y × Y → [0, 1] a given instantaneous loss. Then for all δ ∈ ]0, 1], with probability at least 1 − δ, the following inequality holds:
∀f ∈ F, L(f) ≤ ˆL(f, S) + Rm(L ◦ F) + √(ln(1/δ) / (2m))  (5)
and also, with probability at least 1 − δ:
L(f) ≤ ˆL(f, S) + ˆRm(L ◦ F, S) + 3 √(ln(2/δ) / (2m))  (6)
where L ◦ F = {(x, y) ↦ L(f(x), y) | f ∈ F}.
A uniform generalization error bound (1)
1. Link the supremum of L(f) − ˆL(f, S) over F with its expectation.
The study of this bound is achieved by linking the supremum appearing in the right-hand side of the above inequality with its expectation, through a powerful tool developed for empirical processes by [?] and known as the theorem of bounded differences:
Let I ⊆ R be a real interval, and X1, . . . , Xm be m independent random variables taking values in I. Let Φ : Im → R be such that ∀i ∈ {1, . . . , m}, ∃ci ∈ R for which the following inequality holds for any (x1, . . . , xm) ∈ Im and ∀x′ ∈ I:
|Φ(x1, . . . , xi−1, xi, xi+1, . . . , xm) − Φ(x1, . . . , xi−1, x′, xi+1, . . . , xm)| ≤ ci
We then have:
∀ϵ > 0, P(Φ(X1, . . . , Xm) − E[Φ] > ϵ) ≤ e^{−2ϵ² / ∑_{i=1}^{m} ci²}
A uniform generalization error bound (1)
1. Link the supremum of L(f) − ˆL(f, S) over F with its expectation.
Consider the function Φ : S ↦ sup_{f∈F} [L(f) − ˆL(f, S)]. The McDiarmid inequality can then be applied to Φ with ci = 1/m for all i, thus:
∀ϵ > 0, P( sup_{f∈F} [L(f) − ˆL(f, S)] − ES sup_{f∈F} [L(f) − ˆL(f, S)] > ϵ ) ≤ e^{−2mϵ²}
A uniform generalization error bound (2)
2. Bound ES sup_{f∈F} [L(f) − ˆL(f, S)] with respect to Rm(L ◦ F).
This is a symmetrisation step: it consists in introducing a second, virtual sample S′, also generated i.i.d. with respect to Dm, into ES sup_{f∈F} [L(f) − ˆL(f, S)].
→ ES sup_{f∈F} [L(f) − ˆL(f, S)] = ES sup_{f∈F} ES′ [ˆL(f, S′) − ˆL(f, S)] ≤ ES ES′ sup_{f∈F} [ˆL(f, S′) − ˆL(f, S)]
→ On the other hand,
ES ES′ sup_{f∈F} [ˆL(f, S′) − ˆL(f, S)] = ES ES′ Eσ sup_{f∈F} [ (1/m) ∑_{i=1}^{m} σi (L(f(x′i), y′i) − L(f(xi), yi)) ]
A uniform generalization error bound (2)
2. Bound ES sup_{f∈F} [L(f) − ˆL(f, S)] with respect to Rm(L ◦ F). By applying the triangle inequality to the supremum it comes:
ES ES′ Eσ sup_{f∈F} [ (1/m) ∑_{i=1}^{m} σi (L(f(x′i), y′i) − L(f(xi), yi)) ]
≤ ES ES′ Eσ sup_{f∈F} (1/m) ∑_{i=1}^{m} σi L(f(x′i), y′i) + ES ES′ Eσ sup_{f∈F} (1/m) ∑_{i=1}^{m} (−σi) L(f(xi), yi)
Finally, as σi and −σi have the same distribution for every i, we have:
ES ES′ sup_{f∈F} [ˆL(f, S′) − ˆL(f, S)] ≤ 2 ES Eσ sup_{f∈F} (1/m) ∑_{i=1}^{m} σi L(f(xi), yi) ≤ Rm(L ◦ F)  (7)
A uniform generalization error bound (2)
2. Bound ES sup_{f∈F} [L(f) − ˆL(f, S)] with respect to Rm(L ◦ F). Summarizing the results obtained so far, we have:
1. ∀f ∈ F, ∀S, L(f) − ˆL(f, S) ≤ sup_{f∈F} [L(f) − ˆL(f, S)]
2. ∀ϵ > 0, P( sup_{f∈F} [L(f) − ˆL(f, S)] − ES sup_{f∈F} [L(f) − ˆL(f, S)] > ϵ ) ≤ e^{−2mϵ²}
3. ES sup_{f∈F} [L(f) − ˆL(f, S)] ≤ Rm(L ◦ F)
The first point of the theorem (Eq. 5) is obtained by solving the equation e^{−2mϵ²} = δ for ϵ.
A uniform generalization error bound (3)
3. Bound Rm(L ◦ F) with respect to ˆRm(L ◦ F, S).
→ Apply the McDiarmid inequality to the function Φ : S ↦ ˆRm(L ◦ F, S):
∀ϵ > 0, P(Rm(L ◦ F) > ˆRm(L ◦ F, S) + ϵ) ≤ e^{−mϵ²/2}
Thus, for δ/2 = e^{−mϵ²/2}, we have with probability at least 1 − δ/2:
Rm(L ◦ F) ≤ ˆRm(L ◦ F, S) + 2 √(ln(2/δ) / (2m))
From the first point (Eq. 5) of the theorem, we also have with probability at least 1 − δ/2:
∀f ∈ F, ∀S, L(f) ≤ ˆL(f, S) + Rm(L ◦ F) + √(ln(2/δ) / (2m))
The second point (Eq. 6) of the theorem is then obtained by combining the two previous results using the union bound.
Structural Risk Minimization
[Figure: as the complexity of the class grows, the empirical error decreases while the bound (empirical error + complexity term) is minimised at an intermediate capacity.]
Structural Risk Minimization (2)
Image from: http://www.svms.org/srm/
Regularization
❑ Find a predictor by minimising the empirical risk with an added penalty for the size of the model,
❑ A simple approach consists in choosing a large class of functions F, defining on F a regularizer, typically a norm ||f||, then minimizing the regularized empirical risk:
f̂ = argmin_{f∈F} ˆRm(f, S) + γ ||f||², where γ is a hyperparameter
❑ The hyperparameter, or regularisation parameter, allows choosing the right trade-off between fit and complexity.
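For a linear predictor with the squared loss, the regularized objective is easy to write down explicitly; here is a minimal sketch in C (our own illustration; the bias term is omitted for brevity):

/* Regularized empirical risk of a linear predictor under the squared
   loss: R_hat(w) = (1/m) sum_i (y_i - <w, x_i>)^2 + gamma * ||w||^2.
   X[i][j] with j in 0..d-1. */
double regularized_risk(double **X, const double *Y, const double *w,
                        long m, long d, double gamma)
{
    long i, j;
    double risk = 0.0, norm2 = 0.0;
    for (i = 0; i < m; i++) {
        double h = 0.0, e;
        for (j = 0; j < d; j++) h += w[j] * X[i][j];   /* <w, x_i> */
        e = Y[i] - h;
        risk += e * e;
    }
    for (j = 0; j < d; j++) norm2 += w[j] * w[j];      /* ||w||^2 */
    return risk / m + gamma * norm2;
}

With gamma = 0 this is plain ERM; increasing gamma trades empirical fit against the size of w, which is exactly the compromise the cross-validation slide tunes.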
K-fold cross validation
❑ Create a K-fold partition of the dataset,
❑ For each of the K experiments, use K − 1 folds for training and a different fold for testing; for K = 4:
Crossval. 1: Train on folds 2, 3, 4; Test on fold 1
Crossval. 2: Train on folds 1, 3, 4; Test on fold 2
Crossval. 3: Train on folds 1, 2, 4; Test on fold 3
Crossval. 4: Train on folds 1, 2, 3; Test on fold 4
❑ The chosen value of the hyperparameter is the value γk for which the testing performance over the folds is the highest.
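The partition itself can be sketched in a few lines of C (our own illustration; examples are assigned to folds round-robin by index, whereas real code would shuffle the data first):

/* K-fold cross-validation split: for fold k (0-based), mark each of
   the m examples as test (1) if it falls in fold k, train (0)
   otherwise. */
void kfold_mask(int *is_test, long m, int K, int k)
{
    long i;
    for (i = 0; i < m; i++)
        is_test[i] = (i % K) == k;
}

Iterating k over 0, . . . , K−1 then yields the K train/test experiments of the figure, with each example used exactly once as a test point.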
In summary
❑ For induction, we should control the capacity of the class of functions.
❑ The study of the consistency of the ERM principle led to the second fundamental principle of machine learning, called structural risk minimization (SRM).
❑ Learning is a compromise between a low empirical risk and a high capacity of the class of functions in use.
References
Massih-Reza Amini. Apprentissage Machine : de la théorie à la pratique. Éditions Eyrolles, 2015.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141:148–188, 1989.
V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
References
A.B. Novikoff. On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12:615–622, 1962.
F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1986.
R.E. Schapire. Theoretical views of boosting and applications. In Proceedings of the 10th International Conference on Algorithmic Learning Theory, pages 13–25, 1999.
References
B. Widrow and M. Hoff. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, 4:96–104, 1960.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1998.