SLIDE 1

Scalable Non-Parametric Statistical Estimation

Aymeric DIEULEVEUT

ENS Paris, INRIA

February 6, 2017

SLIDE 7

Statistics: statistical model, performance measure, estimator. Convergence: F(#obs).

Optimization: minimize a given function; algorithm focused; scales with dimension and observations. Convergence: F(#iter).

Accurate & Efficient: scalable estimators with optimal statistical properties.

Non-parametric regression: square loss, Tikhonov regularization. Stochastic algorithms: first-order methods, few passes on the data. Non-parametric Stochastic Approximation, AOS, 2015.

SLIDE 16

Non-parametric Stochastic Approximation with large step sizes (1/2)

Aymeric Dieuleveut & Francis Bach, in the Annals of Statistics, 2015.

Random design least-squares regression: ε(f) := E_{(X,Y)}[(f(X) − Y)²].

Minimization within a reproducing kernel Hilbert space H: min_{f∈H} ε(f).

(x_i, y_i) i.i.d. observations. Sequence of estimators f_t ∈ H, updated after each observation using unbiased gradients of the loss function:

f_{t+1} = f_t − γ_t (f_t(x_t) − y_t) K_{x_t},

where K is the kernel and K_x = K(x, ·). This is stochastic approximation. The rates depend on assumptions on:

◮ the Gaussian complexity of the unit ball of the kernel space,
◮ the smoothness in H of the optimal predictor f∗(X) = E[Y | X].
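The update rule above lends itself to a short sketch. Since each step adds a multiple of K_{x_t}, the iterate stays a finite kernel expansion f_t = Σ_{i<t} α_i K_{x_i}, so the recursion reduces to setting one coefficient per observation. A minimal illustration, assuming a Gaussian (RBF) kernel and 1-D synthetic data; the function names, bandwidth, and step size are my choices, not the talk's:

```python
import numpy as np

def rbf(x, y, bw):
    # Gaussian (RBF) kernel K(x, y) = exp(-|x - y|^2 / (2 bw^2)), 1-D inputs
    return np.exp(-((x - y) ** 2) / (2 * bw ** 2))

def kernel_sgd(xs, ys, gamma, bw):
    """One pass of unregularized kernel SGD with iterate averaging.

    f_t = sum_{i<t} alpha[i] * K(x_i, .), and the update
    f_{t+1} = f_t - gamma * (f_t(x_t) - y_t) * K_{x_t}
    just appends the coefficient alpha[t] = -gamma * (f_t(x_t) - y_t).
    """
    n = len(xs)
    alpha = np.zeros(n)       # coefficients of the current iterate f_t
    alpha_bar = np.zeros(n)   # coefficients of the averaged iterate
    for t in range(n):
        # evaluate f_t(x_t) from the coefficients set so far
        pred = alpha[:t] @ rbf(xs[:t], xs[t], bw) if t > 0 else 0.0
        alpha[t] = -gamma * (pred - ys[t])
        alpha_bar += alpha / n   # running average of f_1, ..., f_n
    return alpha_bar

def predict(alpha, xs, x_new, bw):
    # evaluate the kernel expansion at a new point
    return alpha @ rbf(xs, x_new, bw)
```

Averaging the iterates, rather than keeping the last one, is what permits the large constant step sizes of the theorem.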

SLIDE 23

Non-parametric Stochastic Approximation with large step sizes (2/2)

Aymeric Dieuleveut & Francis Bach, in the Annals of Statistics, 2015.

[Figure: position of the optimal predictor f∗ relative to the RKHS H inside L²_ρX, in the well-specified and mis-specified settings.]

Theorem: the averaged, unregularized least-mean-squares algorithm with large step sizes achieves the statistically optimal rate of convergence.

Recovers the finite-dimensional situation with rate O(σ²d/n).

Optimal rates both in the well-specified regime and in some mis-specified situations.
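The finite-dimensional O(σ²d/n) rate can be illustrated with a small simulation of averaged least-mean-squares on a synthetic Gaussian design. A sketch under my own assumptions (well-specified linear model, identity covariance, illustrative parameter values):

```python
import numpy as np

def averaged_lms(X, y, gamma):
    """Constant-step LMS, theta_{t+1} = theta_t - gamma * (x_t . theta_t - y_t) * x_t,
    with Polyak-Ruppert averaging of the iterates."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(n):
        theta = theta - gamma * (X[t] @ theta - y[t]) * X[t]
        theta_bar += theta / n
    return theta_bar

# well-specified problem: y = <x, theta*> + sigma * noise, x ~ N(0, I_d)
rng = np.random.default_rng(0)
d, n, sigma = 10, 20000, 0.5
theta_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ theta_star + sigma * rng.normal(size=n)

theta_bar = averaged_lms(X, y, gamma=1.0 / (4 * d))  # large constant step, order 1/R^2
excess = np.sum((theta_bar - theta_star) ** 2)       # equals the excess risk here (E[xx^T] = I)
# `excess` should be on the order of sigma**2 * d / n
```

The last iterate keeps fluctuating at a level set by the step size; the averaged iterate is what reaches the statistical level σ²d/n.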

SLIDE 25

Statistics: statistical model, performance measure, estimator. Convergence: F(#obs).

Optimization: minimize a given function; algorithm focused; scales with dimension and observations. Convergence: F(#iter).

Accurate & Efficient: scalable estimators with optimal statistical properties.

Non-parametric regression: square loss, Tikhonov regularization. Stochastic algorithms: first-order methods, few passes on the data. Non-parametric Stochastic Approximation, AOS, 2015. Faster Rates for Least-Squares Regression, Tech. report, 2016.
SLIDE 29

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

Aymeric Dieuleveut, Nicolas Flammarion & Francis Bach, Technical report, 2016.

Classical tradeoff: a bias term and a variance term appear.

◮ The bias measures the hardness of forgetting the initial condition.
◮ The variance is linked to the statistical hardness of the problem.

Lower bounds:

◮ An optimal first-order algorithm forgets initial conditions as Ω(‖θ₀ − θ∗‖²/t²).
◮ Optimal statistical estimation is Ω(σ²d/n).
◮ Single pass over the data: t = n.

New algorithm, based on Nesterov acceleration, attaining both optimal terms:

E[ε(θ̄_n)] − ε(θ∗) ≤ L‖θ₀ − θ∗‖²/n² + σ²d/n.

Improves the convergence rate for mis-specified non-parametric regression.

SLIDE 31

Statistics: statistical model, performance measure, estimator. Convergence: F(#obs).

Optimization: minimize a given function; algorithm focused; scales with dimension and observations. Convergence: F(#iter).

Accurate & Efficient: scalable estimators with optimal statistical properties.

Non-parametric regression: square loss, Tikhonov regularization. Stochastic algorithms: first-order methods, few passes on the data. Non-parametric Stochastic Approximation, AOS, 2015. Faster Rates for Least-Squares Regression, Tech. report, 2016.

Adaptation to the smoothness for learning in kernel spaces.

SLIDE 37

Statistics: statistical model, performance measure, estimator. Convergence: F(#obs).

Optimization: minimize a given function; algorithm focused; scales with dimension and observations. Convergence: F(#iter).

Accurate & Efficient: scalable estimators with optimal statistical properties.

Density estimation: shape constraint (log-concave), MLE. Non-smooth optimization. New ideas.

Scalable MLE algorithm in high dimension? Online algorithm?