MIT 9.520/6.860, Fall 2019 Statistical Learning Theory and Applications Class 02: Statistical Learning Setting
Lorenzo Rosasco
Learning from examples
◮ Machine Learning deals with systems that are trained from data rather than being explicitly programmed.
◮ Here we describe the framework considered in statistical learning theory.
All starts with DATA
◮ Supervised: {(x_1, y_1), . . . , (x_n, y_n)}.
◮ Unsupervised: {x_1, . . . , x_m}.
◮ Semi-supervised: {(x_1, y_1), . . . , (x_n, y_n)} ∪ {x_1, . . . , x_m}.
Supervised learning

Problem: given S_n = (x_1, y_1), . . . , (x_n, y_n), find f such that f(x_new) ≈ y_new.
The supervised learning problem
◮ X × Y probability space, with measure P.
◮ ℓ : Y × Y → [0, ∞), measurable loss function.

Define the expected risk:

L(f) = E_{(x,y)∼P}[ℓ(y, f(x))].

Problem: Solve

min_{f : X→Y} L(f),

given only S_n = (x_1, y_1), . . . , (x_n, y_n) ∼ P^n, i.e. n i.i.d. samples w.r.t. P, fixed but unknown.
Data space
Input space
X input space:
◮ Linear spaces, e.g.
  – vectors,
  – functions,
  – matrices/operators.
◮ “Structured” spaces, e.g.
  – strings,
  – probability distributions,
  – graphs.
Output space
Y output space:
◮ Linear spaces, e.g.
  – Y = R, regression,
  – Y = R^T, multitask regression,
  – Y a Hilbert space, functional regression.
◮ “Structured” spaces, e.g.
  – Y = {−1, 1}, classification,
  – Y = {1, . . . , T}, multicategory classification,
  – strings,
  – probability distributions,
  – graphs.
Probability distribution
Reflects the uncertainty and stochasticity of the learning problem:

P(x, y) = P_X(x) P(y|x),

◮ P_X marginal distribution on X,
◮ P(y|x) conditional distribution on Y given x ∈ X.
Conditional distribution and noise
[Figure: samples (x_1, y_1), . . . , (x_5, y_5) scattered around the graph of the target function f∗.]
Regression
y_i = f∗(x_i) + ǫ_i.

◮ f∗ : X → Y a fixed function,
◮ ǫ_1, . . . , ǫ_n zero-mean random variables, ǫ_i ∼ N(0, σ),
◮ x_1, . . . , x_n random, so that P(y|x) = N(f∗(x), σ).
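To make the model concrete, here is a minimal Python sketch of this data-generating process; the particular f∗, the marginal P_X and the parameter values below are arbitrary illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    # A fixed (in practice unknown) target function; an arbitrary choice here.
    return np.sin(2 * np.pi * x)

n, sigma = 50, 0.2
x = rng.uniform(-1.0, 1.0, size=n)     # x_1, ..., x_n drawn from P_X (here uniform)
eps = rng.normal(0.0, sigma, size=n)   # zero-mean Gaussian noise eps_i ~ N(0, sigma)
y = f_star(x) + eps                    # y_i = f*(x_i) + eps_i, i.e. P(y|x) = N(f*(x), sigma)
```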
Conditional distribution and misclassification
Classification
P(y|x) = {P(1|x), P(−1|x)}.

Noise in classification: overlap between the classes, ∆_δ.

[Figure: the conditional probabilities P(1|x) and P(−1|x), overlapping in a region ∆_δ.]
Marginal distribution and sampling
P_X takes into account uneven sampling of the input space.
Marginal distribution, densities and manifolds
p(x) = dP_X(x)/dx   ⇒   p(x) = dP_X(x)/dvol(x)
Loss functions
ℓ : Y × Y → [0, ∞)
◮ Cost of predicting f(x) in place of y.
◮ Measures the pointwise error ℓ(y, f(x)).
◮ Part of the problem definition, since L(f) = E_{(x,y)∼P}[ℓ(y, f(x))].

Note: sometimes it is useful to consider losses of the form ℓ : Y × G → [0, ∞) for some space G, e.g. G = R.
Loss for regression
ℓ(y, y′) = V(y − y′), V : R → [0, ∞).
◮ Square loss: ℓ(y, y′) = (y − y′)^2.
◮ Absolute loss: ℓ(y, y′) = |y − y′|.
◮ ǫ-insensitive: ℓ(y, y′) = max(|y − y′| − ǫ, 0).
[Figure: square, absolute and ǫ-insensitive losses plotted as functions of y − y′.]
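As a quick reference, a minimal Python sketch of these three losses (the value of ǫ below is an arbitrary illustrative choice):

```python
import numpy as np

def square_loss(y, y_pred):
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    return np.abs(y - y_pred)

def eps_insensitive_loss(y, y_pred, eps=0.1):
    # Zero inside a tube of width eps around the target, linear outside.
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)
```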
Loss for classification
ℓ(y, y′) = V(−yy′), V : R → [0, ∞).
◮ 0–1 loss: ℓ(y, y′) = Θ(−yy′), with Θ(a) = 1 if a ≥ 0 and 0 otherwise.
◮ Square loss: ℓ(y, y′) = (1 − yy′)^2.
◮ Hinge loss: ℓ(y, y′) = max(1 − yy′, 0).
◮ Logistic loss: ℓ(y, y′) = log(1 + exp(−yy′)).
[Figure: 0–1, square, hinge and logistic losses plotted as functions of the margin yy′.]
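And a minimal Python sketch of the classification losses, written as functions of the label y ∈ {−1, 1} and a (possibly real-valued) prediction y′:

```python
import numpy as np

def zero_one_loss(y, y_pred):
    # Theta(-y * y') with Theta(a) = 1 for a >= 0: error whenever the margin y*y' <= 0.
    return (y * y_pred <= 0).astype(float)

def square_loss(y, y_pred):
    return (1 - y * y_pred) ** 2

def hinge_loss(y, y_pred):
    return np.maximum(1 - y * y_pred, 0.0)

def logistic_loss(y, y_pred):
    return np.log(1 + np.exp(-y * y_pred))
```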
Loss function for structured prediction
Loss specific to each learning task, e.g.
◮ Multiclass: square loss, weighted square loss, logistic loss, ...
◮ Multitask: weighted square loss, absolute loss, ...
◮ ...
Expected risk
L(f) = E_{(x,y)∼P}[ℓ(y, f(x))] = ∫ ℓ(y, f(x)) dP(x, y),
with f ∈ F, F = {f : X → Y | f measurable}.

Example
Y = {−1, +1}, ℓ(y, f(x)) = Θ(−y f(x))¹, and
L(f) = P({(x, y) ∈ X × Y | f(x) ≠ y}).

¹ Θ(a) = 1 if a ≥ 0, and 0 otherwise.
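Since P is fixed but unknown, L(f) cannot be evaluated directly; with i.i.d. samples it is naturally estimated by the empirical average (1/n) Σ_i ℓ(y_i, f(x_i)). A minimal Python sketch for the 0–1 loss example above, where the data and the classifier are placeholder illustrative choices:

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    # (1/n) * sum_i loss(y_i, f(x_i)): a sample-based estimate of L(f).
    return np.mean(loss(y, f(X)))

zero_one = lambda y, y_pred: (y * y_pred <= 0).astype(float)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))                      # placeholder inputs
y = np.sign(X[:, 0] + 0.5 * rng.standard_normal(200))  # placeholder labels in {-1, 1}
f = lambda X: np.sign(X[:, 0])                         # placeholder classifier

# For the 0-1 loss, the empirical risk is the fraction of misclassified points.
print(empirical_risk(f, X, y, zero_one))
```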
Target function
f_P = argmin_{f∈F} L(f),

can be derived for many loss functions. Writing

L(f) = ∫ L_x(f(x)) dP_X(x),   with inner risk   L_x(a) = ∫ ℓ(y, a) dP(y|x),

it is possible to show that:
◮ inf_{f∈F} L(f) = ∫ inf_{a∈Y} L_x(a) dP_X(x),
◮ minimizers of L(f) can be derived “pointwise” from the inner risk L_x(f(x)),
◮ measurability of this pointwise definition of f_P can be ensured.
Target functions in regression
f_P(x) = argmin_{a∈R} L_x(a).

◮ Square loss: f_P(x) = ∫ y dP(y|x).
◮ Absolute loss: f_P(x) = median(P(y|x)), where median(p(·)) = ȳ such that
  ∫_{−∞}^{ȳ} dp(t) = ∫_{ȳ}^{+∞} dp(t).
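For instance, the square-loss entry follows from a one-line computation on the inner risk (a standard argument, sketched here for completeness):

```latex
L_x(a) = \int (y - a)^2 \, dP(y|x), \qquad
\frac{d}{da} L_x(a) = -2 \int (y - a) \, dP(y|x) = 0
\;\Longleftrightarrow\; a = \int y \, dP(y|x),
```

and since L_x(·) is convex in a, this stationary point is the minimizer, i.e. f_P(x) = E[y|x].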
Target functions in classification
◮ Misclassification loss: f_P(x) = sign(P(1|x) − P(−1|x)).
◮ Square loss: f_P(x) = P(1|x) − P(−1|x).
◮ Logistic loss: f_P(x) = log( P(1|x) / P(−1|x) ).
◮ Hinge loss: f_P(x) = sign(P(1|x) − P(−1|x)).
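These pointwise minimizers can be checked numerically: for a fixed value of P(1|x), minimize the inner risk L_x(a) = P(1|x) ℓ(1, a) + P(−1|x) ℓ(−1, a) over a grid of candidate values a. A minimal Python sketch (the value of P(1|x) and the grid are arbitrary illustrative choices):

```python
import numpy as np

p1 = 0.7                       # P(1|x); then P(-1|x) = 1 - p1
a = np.linspace(-3, 3, 6001)   # candidate values for f(x)

losses = {
    "square":   lambda y, t: (1 - y * t) ** 2,
    "hinge":    lambda y, t: np.maximum(1 - y * t, 0.0),
    "logistic": lambda y, t: np.log(1 + np.exp(-y * t)),
}

for name, loss in losses.items():
    inner_risk = p1 * loss(1, a) + (1 - p1) * loss(-1, a)
    print(name, a[np.argmin(inner_risk)])

# Expected (up to grid resolution): square -> 2*p1 - 1 = 0.4,
# hinge -> sign(2*p1 - 1) = 1.0, logistic -> log(p1 / (1 - p1)) ≈ 0.847.
```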
Different loss, different target
◮ Each loss function defines a different optimal target function. Learning enters the picture when the latter is impossible or hard to compute (as in simulations).
◮ As we will see in the following, loss functions also differ in terms of the computations they induce.
Learning algorithms
Solve

min_{f∈F} L(f),

given only S_n = (x_1, y_1), . . . , (x_n, y_n) ∼ P^n.

Learning algorithm: S_n → f̂_n = f̂_{S_n}.
f̂_n estimates f_P given the observed examples S_n.

How to measure the error of an estimate?
Excess risk
Excess risk:

L(f̂_n) − min_{f∈F} L(f).

Consistency: for any ǫ > 0,

lim_{n→∞} P( L(f̂_n) − min_{f∈F} L(f) > ǫ ) = 0.
Other forms of consistency
Consistency in expectation:

lim_{n→∞} E[ L(f̂_n) − min_{f∈F} L(f) ] = 0.

Consistency almost surely:

P( lim_{n→∞} ( L(f̂_n) − min_{f∈F} L(f) ) = 0 ) = 1.

Note: the different notions of consistency correspond to different notions of convergence for random variables: weak (in probability), in expectation, and almost sure.
Sample complexity, tail bounds and error bounds
◮ Sample complexity: for any ǫ > 0, δ ∈ (0, 1], when n ≥ n_{P,F}(ǫ, δ),

  P( L(f̂_n) − min_{f∈F} L(f) ≥ ǫ ) ≤ δ.

◮ Tail bounds: for any ǫ > 0, n ∈ N,

  P( L(f̂_n) − min_{f∈F} L(f) ≥ ǫ ) ≤ δ_{P,F}(n, ǫ).

◮ Error bounds: for any δ ∈ (0, 1], n ∈ N,

  P( L(f̂_n) − min_{f∈F} L(f) ≤ ǫ_{P,F}(n, δ) ) ≥ 1 − δ.
No free-lunch theorem
A good algorithm should have small sample complexity for many distributions P.

No free lunch
Is it possible to have an algorithm with small (finite) sample complexity for all problems? The no-free-lunch theorem provides a negative answer: given any algorithm, there exists a problem for which its learning performance is arbitrarily bad.
Algorithm design: complexity and regularization
The design of most algorithms proceeds as follows:
◮ Pick a (possibly large) class of functions H, ideally such that
  min_{f∈H} L(f) = min_{f∈F} L(f).
◮ Define a procedure A_γ(S_n) = f̂_γ ∈ H to explore the space H.
Bias and variance
Let f_γ be the solution obtained with an infinite number of examples.
Key error decomposition
L(f̂_γ) − min_{f∈H} L(f) = [ L(f̂_γ) − L(f_γ) ] (variance) + [ L(f_γ) − min_{f∈H} L(f) ] (bias).

A small bias leads to a good data fit; a high variance to possible instability.
ERM and structural risk minimization
A classical example. Consider (H_γ)_γ such that H_1 ⊂ H_2 ⊂ . . . ⊂ H_γ ⊂ . . . ⊂ H. Then, let

f̂_γ = argmin_{f∈H_γ} (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i)).

Example: H_γ are the functions f(x) = w⊤x (or f(x) = w⊤Φ(x)) such that ‖w‖ ≤ γ.
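A minimal Python sketch of this constrained ERM for linear functions f(x) = w⊤x with the square loss, using projected gradient descent; the optimizer, step size and data below are arbitrary illustrative choices, not prescribed by the slides:

```python
import numpy as np

def constrained_erm(X, y, gamma, step=0.1, iters=1000):
    """Minimize (1/n) * sum_i (y_i - <w, x_i>)^2 subject to ||w|| <= gamma,
    via projected gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = -(2.0 / n) * X.T @ (y - X @ w)   # gradient of the empirical square loss
        w = w - step * grad
        norm = np.linalg.norm(w)
        if norm > gamma:                        # project back onto the ball ||w|| <= gamma
            w *= gamma / norm
    return w

# Smaller gamma corresponds to a smaller hypothesis space H_gamma.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(constrained_erm(X, y, gamma=1.0))
```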
Beyond constrained ERM
In this course we will see other algorithm design principles:
◮ Penalization
◮ Stochastic gradient descent
◮ Implicit regularization
◮ Regularization by projection
Beyond supervised learning
◮ Z probability space, with measure P.
◮ H a set.
◮ ℓ : Z × H → [0, ∞), measurable loss function.

Problem: Solve

min_{h∈H} E_{z∼P}[ℓ(z, h)],

given only S_n = z_1, . . . , z_n ∼ P^n, i.e. n i.i.d. samples w.r.t. P, fixed but unknown.

◮ H is part of the definition of the problem.
◮ The above setting covers, for example, many unsupervised learning problems as well as decision theory problems (aka the general learning setting).