SLIDE 1
Scalable Non-Parametric Statistical Estimation
Aymeric Dieuleveut
ENS Paris, INRIA
February 6, 2017
SLIDE 2–7 (incremental build)
Statistics: statistical model, performance measure, estimator; convergence measured as F(#obs).
Optimization: minimize a given function; algorithm focused; scales with dimension and number of observations; convergence measured as F(#iter).
Accurate & Efficient: scalable estimators with optimal statistical properties.
Non-parametric regression: square loss, Tikhonov regularization.
Stochastic algorithms: first-order methods, few passes over the data.
Non-parametric Stochastic Approximation, Annals of Statistics, 2015.
SLIDE 8–16 (incremental build)
Non-parametric Stochastic Approximation with Large Step Sizes (1/2)
Aymeric Dieuleveut & Francis Bach, Annals of Statistics, 2015.

Random-design least-squares regression: ε(f) := E_(X,Y) [(f(X) − Y)²].
Minimization within a reproducing kernel Hilbert space H: min_{f ∈ H} ε(f).
(x_i, y_i) i.i.d. observations.
Sequence of estimators f_t ∈ H, updated after each observation, using unbiased gradients of the loss function:
f_{t+1} = f_t − γ_t (f_t(x_t) − y_t) K_{x_t},
where K is the kernel and K_x = K(x, ·).
This is stochastic approximation. Results depend on assumptions on:
◮ the Gaussian complexity of the unit ball of the kernel space,
◮ the smoothness in H of the optimal predictor f∗(X) = E[Y | X].
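The kernel stochastic-gradient recursion above can be simulated directly by tracking the expansion coefficients of f_t on the observed points. The sketch below (Gaussian kernel, constant step size, Polyak–Ruppert averaging of the iterates) is illustrative only; the step size and bandwidth are arbitrary choices, not values from the paper.

```python
import numpy as np

def kernel_sgd(xs, ys, gamma=0.5, bandwidth=0.2):
    """One pass of stochastic gradient in the RKHS:
        f_{t+1} = f_t - gamma * (f_t(x_t) - y_t) * K(x_t, .)
    Since f_t = sum_{s < t} c_s K(x_s, .), it suffices to track the
    coefficients c_s.  Returns the coefficients of the averaged iterate."""
    n = len(xs)
    kern = lambda a, b: np.exp(-(a - b) ** 2 / (2 * bandwidth ** 2))
    coef = np.zeros(n)   # coefficients of the current iterate f_t
    avg = np.zeros(n)    # coefficients of the running average of iterates
    for t in range(n):
        f_xt = coef[:t] @ kern(xs[:t], xs[t])   # evaluate f_t(x_t)
        coef[t] = -gamma * (f_xt - ys[t])       # new coefficient on K(x_t, .)
        avg += (coef - avg) / (t + 1)           # online Polyak-Ruppert average
    return avg

def predict(avg, xs, x_new, bandwidth=0.2):
    k = np.exp(-(xs[:, None] - x_new[None, :]) ** 2 / (2 * bandwidth ** 2))
    return avg @ k

# Toy run: learn f*(x) = sin(2 pi x) from noisy samples in a single pass.
rng = np.random.default_rng(0)
n = 1000
xs = rng.uniform(0, 1, n)
ys = np.sin(2 * np.pi * xs) + 0.1 * rng.standard_normal(n)
avg = kernel_sgd(xs, ys)
x_grid = np.linspace(0, 1, 100)
mse = float(np.mean((predict(avg, xs, x_grid) - np.sin(2 * np.pi * x_grid)) ** 2))
```

Each update costs O(t) kernel evaluations, so one pass is O(n²): this is the scalability pressure behind the talk's focus on efficient estimators.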
SLIDE 17–23 (incremental build)
Non-parametric Stochastic Approximation with Large Step Sizes (2/2)
Aymeric Dieuleveut & Francis Bach, Annals of Statistics, 2015.

[Figure: the RKHS H drawn inside L²_ρX, with the optimal predictor f∗ placed either inside H (well-specified) or outside H (mis-specified).]

Theorem: the averaged, unregularized least-mean-squares algorithm with large step sizes achieves the statistically optimal rate of convergence.
◮ Recovers the finite-dimensional situation, with rate O(σ²d / n).
◮ Optimal rates in the well-specified regime and in some mis-specified situations.
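In the finite-dimensional special case, the O(σ²d/n) behavior can be checked numerically with constant-step averaged least-mean-squares. A minimal sketch, assuming Gaussian design with identity covariance and an arbitrary step size γ = 0.05:

```python
import numpy as np

def averaged_lms(X, y, gamma):
    """Constant-step least-mean-squares with Polyak-Ruppert averaging:
        theta_{t+1} = theta_t - gamma * (x_t . theta_t - y_t) * x_t,
    returning theta_bar_n, the average of the iterates."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(n):
        theta = theta - gamma * (X[t] @ theta - y[t]) * X[t]
        theta_bar += (theta - theta_bar) / (t + 1)
    return theta_bar

rng = np.random.default_rng(1)
d, sigma = 5, 0.5
theta_star = rng.standard_normal(d)

def excess_risk(n):
    X = rng.standard_normal((n, d))
    y = X @ theta_star + sigma * rng.standard_normal(n)
    theta_bar = averaged_lms(X, y, gamma=0.05)
    # With identity design covariance, the excess risk is |theta_bar - theta*|^2.
    return float(np.sum((theta_bar - theta_star) ** 2))

err = excess_risk(2000)   # compare with sigma^2 * d / n ~ 6e-4
```

Note that the step size stays constant rather than decaying; averaging is what provides the statistical optimality.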
SLIDE 24–25 (outline revisited)
Faster Rates for Least-Squares Regression, technical report, 2016.
SLIDE 26–29 (incremental build)
Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
Aymeric Dieuleveut, Nicolas Flammarion & Francis Bach, technical report, 2016.

Classical tradeoff: a bias term and a variance term appear.
◮ The bias measures the difficulty of forgetting the initial condition.
◮ The variance is linked to the statistical hardness of the problem.
Lower bounds:
◮ An optimal first-order algorithm forgets initial conditions at rate Ω(‖θ_0 − θ∗‖² / t²).
◮ Optimal statistical estimation is Ω(σ²d / n).
◮ A single pass over the data means t = n.
New algorithm, based on Nesterov acceleration, achieving both optimal terms:
E[ε(θ̄_n) − ε(θ∗)] ≤ L‖θ_0 − θ∗‖² / n² + σ²d / n.
It also improves the convergence rate for mis-specified non-parametric regression.
SLIDE 30–31 (outline revisited)
Adaptation to the smoothness for learning in kernel spaces.
SLIDE 32–37 (incremental build)
Density estimation: shape constraint (log-concave), MLE.
Non-smooth optimization: new ideas.
Scalable MLE algorithm in high dimension?