Machine learning theory for time series
Exponential inequalities for nonstationary Markov chains

Pierre Alquier, CIMFAV seminar, January 16, 2019



Outline

1. Short introduction to machine learning theory
2. Machine learning and time series
   • Machine learning & stationary time series
   • Nonstationary Markov chains

Generic machine learning problem

Main ingredients:

• Observations: (X1, Y1), (X2, Y2), ..., (Xn, Yn)
  → usually i.i.d. from an unknown distribution P.
• A restricted set of predictors (fθ, θ ∈ Θ)
  → fθ(X) is meant to predict Y.
• A loss function ℓ
  → ℓ(y′ − y) is the loss incurred by predicting y′ while the truth is y.
• The risk R(θ)
  → R(θ) = E_{(X,Y)∼P}[ℓ(fθ(X) − Y)]. Not observable.
• An empirical proxy r(θ) for R(θ)
  → for example r(θ) = (1/n) ∑_{i=1}^n ℓ(fθ(Xi) − Yi).
• The empirical risk minimizer θ̂
  → θ̂ = argmin_{θ∈Θ} r(θ).
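To make these ingredients concrete, here is a minimal Python sketch of empirical risk minimization over a finite set of predictors. The simulated data, the grid of slopes and the absolute loss are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated observations (X_i, Y_i); the learner does not know P.
n = 500
X = rng.uniform(-1.0, 1.0, size=n)
Y = 2.0 * X + 0.1 * rng.standard_normal(n)

# Restricted set of predictors: f_theta(x) = theta * x, theta on a finite grid.
Theta = np.linspace(-5.0, 5.0, 101)

def r(theta):
    """Empirical risk r(theta) = (1/n) * sum_i loss(f_theta(X_i) - Y_i), absolute loss."""
    return np.mean(np.abs(theta * X - Y))

# Empirical risk minimizer theta_hat = argmin over the finite grid.
risks = np.array([r(theta) for theta in Theta])
theta_hat = Theta[np.argmin(risks)]
print(f"theta_hat = {theta_hat:.2f}, r(theta_hat) = {risks.min():.4f}")
```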

Sub-gamma random variables

Definition. T is said to be sub-gamma iff there exist (v, w) such that, for all k ≥ 2,

  E[|T|^k] ≤ (k!/2) v w^{k−2}.

Examples:
• T ∼ Γ(a, b): holds with (v, w) = (ab², b).
• any Z with P(|Z| ≥ t) ≤ P(|T| ≥ t).

Bernstein's inequality

Theorem. Let T1, ..., Tn be i.i.d. and (v, w)-sub-gamma random variables. Then, for all ζ ∈ (0, 1/w),

  E exp( ζ ∑_{i=1}^n [Ti − E Ti] ) ≤ exp( nvζ² / (2(1 − wζ)) ).

Consequence in ML: put Ti = −ℓ(fθ(Xi) − Yi) and assume Ti is (v, w)-sub-gamma. Note that (1/n) ∑_{i=1}^n (Ti − E Ti) = R(θ) − r(θ). Then, for any s ∈ (0, n/w), by the Chernoff argument,

  P( R(θ) − r(θ) > t ) = P( exp[s(R(θ) − r(θ))] > exp(st) )
                       ≤ E exp[ s(R(θ) − r(θ)) − st ]
                       = E exp[ (s/n) ∑_{i=1}^n (Ti − E Ti) − st ]
                       ≤ exp( vs² / (2(n − ws)) − st ).

Applying the same bound to r(θ) − R(θ) gives

  P( |R(θ) − r(θ)| > t ) ≤ 2 exp( vs² / (2(n − ws)) − st ).
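As a numerical companion, the sketch below evaluates the right-hand side 2 exp( vs²/(2(n − ws)) − st ) and optimizes it over s on a grid; the values of n, v, w and t are arbitrary assumptions, chosen only to show how the bound shrinks with n.

```python
import numpy as np

def bernstein_tail_bound(n, v, w, t, num_s=10_000):
    """Best bound min_s 2*exp(v*s^2/(2*(n - w*s)) - s*t) over s in (0, n/w)."""
    s = np.linspace(1e-6, (n / w) * (1 - 1e-6), num_s)
    log_bound = v * s**2 / (2.0 * (n - w * s)) - s * t
    return 2.0 * np.exp(log_bound.min())

# Illustrative sub-gamma parameters (v, w) and deviation level t (assumptions).
for n in [100, 1_000, 10_000]:
    print(n, bernstein_tail_bound(n=n, v=1.0, w=0.5, t=0.1))
```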

Finite set of predictors and union bound

For a finite Θ, the union bound gives

  P( ∃θ ∈ Θ : |R(θ) − r(θ)| > t ) = P( ⋃_{θ∈Θ} { |R(θ) − r(θ)| ≥ t } )
                                  ≤ ∑_{θ∈Θ} P( |R(θ) − r(θ)| ≥ t )
                                  ≤ 2|Θ| exp( vs² / (2(n − ws)) − st ).

On the complement of this event, for any θ0 ∈ Θ,

  R(θ̂) ≤ r(θ̂) + t ≤ r(θ0) + t ≤ R(θ0) + 2t

(the middle step because θ̂ minimizes r), hence R(θ̂) ≤ min_{θ∈Θ} R(θ) + 2t. Choosing t so that 2|Θ| exp( vs²/(2(n − ws)) − st ) = α yields

  R(θ̂) ≤ min_{θ∈Θ} R(θ) + vs/(n − ws) + (2/s) log(2|Θ|/α),

and for s = [n/(2w)] ∧ √( (n/v) log(2|Θ|/α) ) we obtain:

Theorem. With probability at least 1 − α,

  R(θ̂) ≤ min_{θ∈Θ} R(θ) + 2 √( v log(2|Θ|/α) / n ) ∨ ( 2w log(2|Θ|/α) / n ).

In particular, when α is not ridiculously small (so that the square-root term dominates), with probability at least 1 − α,

  R(θ̂) ≤ min_{θ∈Θ} R(θ) + 2 √( v log(2|Θ|/α) / n ).
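The resulting rate is easy to tabulate. A small sketch, assuming illustrative values of v, |Θ| and α:

```python
import numpy as np

def finite_class_bound(n, v, card_Theta, alpha):
    """Dominant term 2*sqrt(v*log(2|Theta|/alpha)/n) of the excess-risk bound."""
    return 2.0 * np.sqrt(v * np.log(2.0 * card_Theta / alpha) / n)

for n in [100, 1_000, 10_000]:
    print(n, round(finite_class_bound(n, v=1.0, card_Theta=1_000, alpha=0.05), 4))
```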

Infinite parameter set

Θ compact ⇒ there exists a finite ε-net Θ(ε) such that ∀θ ∈ Θ, ∃θ′ ∈ Θ(ε) with δ(θ, θ′) ≤ ε.

Figure – Illustration of a covering by ε-balls (image from Wikipedia).

Example: in the finite-dimensional case there is a Θ(ε) with |Θ(ε)| ≲ 1/ε^d.

Assume θ ↦ ℓ(fθ(X) − Y) is a.s. L-Lipschitz w.r.t. δ(·, ·). Applying the previous theorem to Θ(ε),

  R(θ̂) ≤ min_{θ∈Θ} R(θ) + 2Lε + 2 √( v log(2|Θ(ε)|/α) / n )
        ≤ min_{θ∈Θ} R(θ) + 2Lε + 2 √( vd log(2/(εα)) / n ).

Taking ε = √(d/n), we obtain:

Theorem. When the loss is L-Lipschitz, with probability at least 1 − α,

  R(θ̂) ≤ min_{θ∈Θ} R(θ) + √(d/n) ( 2L + 2 √( v log(2n) ) ).
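For intuition, here is a small sketch that builds the grid ε-net of [0,1]^d behind the |Θ(ε)| ≲ ε^{−d} count and checks the covering property on random points; the cube domain and the sup-norm metric are assumptions made for illustration.

```python
import itertools
import numpy as np

def grid_eps_net(d, eps):
    """Grid with step 2*eps: an eps-net of [0,1]^d for the sup-norm distance."""
    centers = np.arange(eps, 1.0 + 2 * eps, 2 * eps)  # centers of cells of width 2*eps
    return np.array(list(itertools.product(centers, repeat=d)))

d, eps = 2, 0.05
net = grid_eps_net(d, eps)
print(f"|Theta(eps)| = {len(net)} (of order eps^-d = {eps**-d:.0f})")

# Covering check: every random theta has a net point within eps (sup-norm).
rng = np.random.default_rng(1)
thetas = rng.uniform(0.0, 1.0, size=(1000, d))
dists = np.abs(thetas[:, None, :] - net[None, :, :]).max(axis=2).min(axis=1)
assert dists.max() <= eps + 1e-12
print("max distance to the net:", dists.max())
```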

Model selection

Now consider models Θ1, ..., ΘM with estimators θ̂1, ..., θ̂M and thresholds t1, ..., tM such that

  P( ∃m, ∃θ ∈ Θm : |R(θ) − r(θ)| > tm ) ≤ α.

Define m̂ = argmin_m [ r(θ̂m) + tm ]. Similar derivations lead to:

Theorem. With probability at least 1 − α,

  R(θ̂_m̂) ≤ min_{1≤m≤M} [ min_{θ∈Θm} R(θ) + √(dm/n) ( 2L + 2 √( v log(2nM) ) ) ].
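A minimal sketch of the rule m̂ = argmin_m [r(θ̂m) + tm], using nested grids of increasing size as the models Θ1, ..., ΘM and a threshold tm of the same √(log/n) form as above; the data-generating process, the loss and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.sin(3.0 * X) + 0.1 * rng.standard_normal(n)

v, alpha = 1.0, 0.05
M = 6
# Models Theta_1, ..., Theta_M: grids of increasing cardinality.
models = [np.linspace(-3.0, 3.0, 2**m + 1) for m in range(1, M + 1)]

def erm(grid):
    """ERM of f_theta(x) = theta * x over a finite grid, absolute loss."""
    risks = np.array([np.mean(np.abs(theta * X - Y)) for theta in grid])
    return grid[np.argmin(risks)], risks.min()

# t_m calibrated so the union over all M models keeps total probability alpha.
results = []
for m, grid in enumerate(models, start=1):
    theta_m, r_m = erm(grid)
    t_m = np.sqrt(v * np.log(2.0 * len(grid) * M / alpha) / n)
    results.append((r_m + t_m, m, theta_m))

_, m_hat, theta_hat = min(results)
print(f"selected model m_hat = {m_hat}, theta_hat = {theta_hat:.3f}")
```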

Going further

Improvements, extensions...
• removing the log(n) factor by a refinement of the ε-net structure.
• faster rates: √(d/n) becomes d/n thanks to a better analysis of v under the Bernstein condition.
• relaxing the sub-gamma assumption.
• a more flexible way to measure the complexity of Θ: PAC-Bayesian bounds.

Machine learning and time series

Extension to time series

Machine learning has been studied for time series with various techniques. Asymptotic study in the mixing case:

  I. Steinwart, D. Hush, C. Scovel. Learning from dependent observations. Journal of Multivariate Analysis, 2009.

In order to extend the previous (non-asymptotic) approach to non-independent observations, exponential inequalities (Hoeffding, Bernstein, etc.) are required. These inequalities require some assumption on the dependence of the series: Markov, mixing, weak dependence, martingale, ...

An example on Markov chains

Let F : X × Y → X and let (Xt)_{t≥1} be the Markov chain

  Xt = F(Xt−1, εt).

Assume that, for some ρ ∈ [0, 1) and C > 0,

  E[ d(F(x, ε1), F(x′, ε1)) ] ≤ ρ d(x, x′),
  d(F(x, y), F(x, y′)) ≤ C δ(y, y′).

Objective

Xt = F(Xt−1, εt). In this case we study one-step-ahead prediction:

  r(θ) = (1/(n−1)) ∑_{i=2}^n ℓ(fθ(Xi−1) − Xi),   R(θ) = E[ℓ(fθ(X1) − X2)].

Define θ̂ = argmin_{θ∈Θ} r(θ).
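A sketch of this one-step-ahead ERM on a simulated chain Xt = F(Xt−1, εt); here F(x, ε) = 0.7·tanh(x) + ε, which satisfies the contraction assumption with ρ = 0.7 for the absolute-value metric (and C = 1), and the linear predictors fθ(x) = θx are an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate X_t = F(X_{t-1}, eps_t) with F(x, e) = 0.7*tanh(x) + e (contraction rho = 0.7).
n = 2000
X = np.zeros(n)
for t in range(1, n):
    X[t] = 0.7 * np.tanh(X[t - 1]) + 0.2 * rng.standard_normal()

# One-step-ahead empirical risk r(theta) = (1/(n-1)) * sum_{i=2}^n |f_theta(X_{i-1}) - X_i|.
Theta = np.linspace(-1.0, 1.0, 201)

def r(theta):
    return np.mean(np.abs(theta * X[:-1] - X[1:]))

theta_hat = Theta[np.argmin([r(theta) for theta in Theta])]
print(f"theta_hat = {theta_hat:.2f}")  # roughly 0.7, since tanh(x) ~ x for small x
```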

Notations in DF15

Define

  G_{X1}(x) = ∫ d(x, x′) P_{X1}(dx′),   G_ε(y) = ∫ C δ(y, y′) P_ε(dy′).

Assumption: for any k ≥ 2,

  E[ G_{X1}(X1)^k ] ≤ (k!/2) V1 M^{k−2}   and   E[ G_ε(ε)^k ] ≤ (k!/2) V2 M^{k−2}.

Define

  V = (V1 + V2) / (1 − ρ)²,   δ = M / (1 − ρ).
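To see what these quantities look like in a concrete case, the sketch below estimates G_{X1}(0) and G_ε(0) by Monte Carlo for a stationary Gaussian AR(1), where d = δ = |·| and C = 1; the chain parameters are assumptions for illustration, and the Bernstein constants (V1, V2, M) would still have to be derived analytically.

```python
import numpy as np

rng = np.random.default_rng(4)
a, sigma = 0.7, 1.0  # AR(1): X_t = a*X_{t-1} + eps_t, eps_t ~ N(0, sigma^2)

# Stationary law of X_1 is N(0, sigma^2/(1 - a^2)); here C = 1 and d = delta = |.|.
N = 100_000
X1 = rng.normal(0.0, sigma / np.sqrt(1.0 - a**2), size=N)
eps = rng.normal(0.0, sigma, size=N)

G_X1_at_0 = np.mean(np.abs(0.0 - X1))    # G_{X1}(0) = E|0 - X'|
G_eps_at_0 = np.mean(np.abs(0.0 - eps))  # G_eps(0) = C * E|0 - eps'|
print(f"G_X1(0) ~ {G_X1_at_0:.3f}, G_eps(0) ~ {G_eps_at_0:.3f}")
```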

Dedecker and Fan's inequality

Theorem (Dedecker & Fan, 2015). Consider a separately Lipschitz function f : X^n → R, i.e.

  |f(x1, ..., xn) − f(x′1, ..., x′n)| ≤ ∑_{t=1}^n d(xt, x′t).

Then, for any s ∈ [0, δ⁻¹),

  E[ e^{±s{f(X1,...,Xn) − E[f(X1,...,Xn)]}} ] ≤ exp( (n−1)s²V / (2(1 − sδ)) ).

Consequences of DF15 for prediction

Take

  f(X1, ..., Xn) = (1/L) ∑_{i=2}^n ℓ(fθ(Xi−1) − Xi).

Then, for any 0 ≤ s < (n−1)/(L(1+ρ)δ),

  P( |R(θ) − r(θ)| > t ) ≤ 2 exp( s²(1+ρ)²L²V / (2(n−1) − 2s(1+ρ)δL) − st ).

Learning theorem for Markov chains

Assume |Θ(ε)| ≤ ε^{−d}.

Theorem. As soon as n ≥ 1 + 4δ²d log(Ln)/V we have, with probability at least 1 − α,

  R(θ̂) ≤ inf_{θ∈Θ} R(θ) + C1 √( d log(Ln) / (n−1) ) + C2 log(4/α)/√(n−1) + C3/n,

where C1 = 4(1+ρ)L√V, C2 = 2(1+ρ)L√V + 2δ and C3 = 3[Gε(0) + G_{X1}(0)]/(1−ρ) + V/(2δ).

Other works on ML & TS

Study of Xt = F(εt; Xt−1, Xt−2, ...) in

  P. Alquier, O. Wintenberger. Model selection for weakly dependent time series forecasting. Bernoulli, 2012.

based on Rio's version of Hoeffding's inequality:

  E. Rio. Inégalités de Hoeffding pour les fonctions lipschitziennes de suites dépendantes. CRAS, 2000.

Rates in √(d/n).

Fast rates

Rates in d/n for quadratic ℓ in

  P. Alquier, X. Li, O. Wintenberger. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modeling, 2013.

based on Samson's version of Bernstein's inequality for ϕ-mixing processes:

  P.-M. Samson. Concentration of measure inequalities for Markov chains and ϕ-mixing processes. The Annals of Probability, 2000.

Online prediction approach

The online prediction approach provides tools to aggregate predictors without stochastic assumptions on the data.

  C. Giraud, F. Roueff, A. Sanchez-Perez. Aggregation of predictors for nonstationary sub-linear processes and online adaptive forecasting of time varying autoregressive processes. The Annals of Statistics, 2015.

takes advantage of this approach to predict time-varying AR processes.


Nonstationary Markov chains

  P. Alquier, P. Doukhan, X. Fan. Exponential inequalities for nonstationary Markov chains. Preprint arXiv:1808.08811, 2018.

Non-stationary Markov chains

We now assume

  Xt = Ft(Xt−1, εt),
  sup_t E[ d(Ft(x, ε1), Ft(x′, ε1)) ] ≤ ρ d(x, x′),
  sup_t d(Ft(x, y), Ft(x, y′)) ≤ C δ(y, y′).

Example 1: time-varying AR(1)

  Xt = at Xt−1 + εt,   sup_t |at| ≤ ρ.

Example 2: T-periodic AR(1)

  Xt = a_{t[T]} Xt−1 + εt,   max_{1≤t≤T} |at| ≤ ρ,

where t[T] denotes t modulo T.

Example: T-periodic AR(1)

4-periodic AR(1), (a1, a2, a3, a4) = (0.8, 0.5, 0.9, −0.7).

Figure – Simulated data.   Figure – Autocorrelations.
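The two figures can be reproduced in a few lines; the coefficients are those of the slide, while the noise law and the sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
a = [0.8, 0.5, 0.9, -0.7]  # (a_1, a_2, a_3, a_4) from the slide

# Simulate the 4-periodic AR(1): X_t = a_{t[4]} * X_{t-1} + eps_t.
n = 1000
X = np.zeros(n)
for t in range(1, n):
    X[t] = a[(t - 1) % 4] * X[t - 1] + rng.standard_normal()

# Sample autocorrelations at lags 0..12.
Xc = X - X.mean()
acf = np.array([np.dot(Xc[: n - k], Xc[k:]) / np.dot(Xc, Xc) for k in range(13)])
print(np.round(acf, 2))
```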

Bernstein's inequality

Theorem (ADF18). Assume that, for any k ≥ 2,

  E[ G_{X1}(X1)^k ] ≤ (k!/2) V1 M^{k−2}   and   E[ G_ε(ε)^k ] ≤ (k!/2) V2 M^{k−2}.

Consider a separately Lipschitz function f : X^n → R. For any s ∈ [0, δ⁻¹),

  E[ e^{±s{f(X1,...,Xn) − E[f(X1,...,Xn)]}} ] ≤ exp( (n−1)s²V / (2(1 − sδ)) ).

Problem: estimation of the (best) period

From now on, assume that

  Xt = f*_t(Xt−1) + εt

(not necessarily periodic, but we hope so). Let (fθ, θ ∈ Θ) be a set of predictors X → X and define H(ε) = log |Θ(ε)|. Put, for any T and θ1:T = (θ1, ..., θT) ∈ Θ^T:

  rn(θ1:T) = (1/(n−1)) ∑_{i=2}^n ℓ(fθ_{i[T]}(Xi−1) − Xi),

  R(θ1:T) = E[rn(θ1:T)] ∈ [ (1/T) ∑_{t=1}^T E[ℓ(fθ_{t[T]}(Xt−1) − Xt)] ± C0 T/(n−1) ]

if actually f*_t = f*_{t[T]}, where C0 = L(1+ρ) [ Gε(0)/(1−ρ) + G_{X1}(0) ].

Estimators

Estimation for a given period T:

  θ̂1:T = (θ̂1, ..., θ̂T) = argmin_{θ1:T = (θ1,...,θT)} rn(θ1:T).

Period selection:

  T̂ = argmin_{1≤T≤Tmax} [ rn(θ̂1:T) + (C1/2) √( T H(1/(Ln)) / (n−1) ) ],

where C1 is as in the stationary case: C1 = 4(1+ρ)L√V.
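A sketch of this period-selection rule on simulated data: for each candidate T we fit one AR coefficient per phase by least squares (an illustrative stand-in for ERM over a predictor family), and we penalize by c√(T/(n−1)) with a hand-picked constant c, anticipating the remark after the next theorem that the theoretical constants are too large in practice.

```python
import numpy as np

rng = np.random.default_rng(6)
a_true = [0.8, 0.5, 0.9, -0.7]
n = 2000
X = np.zeros(n)
for t in range(1, n):
    X[t] = a_true[(t - 1) % 4] * X[t - 1] + 0.5 * rng.standard_normal()

def risk_for_period(T):
    """Fit one AR(1) coefficient per phase modulo T by least squares; return mean squared residual."""
    num, den = np.zeros(T), np.zeros(T)
    for i in range(1, n):
        p = (i - 1) % T
        num[p] += X[i - 1] * X[i]
        den[p] += X[i - 1] ** 2
    a_hat = num / np.maximum(den, 1e-12)
    resid = [(X[i] - a_hat[(i - 1) % T] * X[i - 1]) ** 2 for i in range(1, n)]
    return float(np.mean(resid))

c, T_max = 2.0, 12  # penalty constant c is a hand-picked assumption
scores = {T: risk_for_period(T) + c * np.sqrt(T / (n - 1)) for T in range(1, T_max + 1)}
T_hat = min(scores, key=scores.get)
print("selected period T_hat =", T_hat)  # the true period here is 4
```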

Analysis of the estimators

Theorem (ADF18). As soon as n ≥ 1 + 4δ²Tmax H(1/(Ln))/V, with probability at least 1 − α,

  R(θ̂_{1:T̂}) ≤ inf_{1≤T≤Tmax} inf_{θ1:T ∈ Θ^T} [ R(θ1:T) + C1 √( T H(1/(Ln)) / (n−1) ) + C2 log(4Tmax/α)/√(n−1) + C3/n ].

In practice, C1, C2 and C3 are too large and ρ is not known anyway, so we recommend using the slope heuristic here.

Slope heuristic

Figure – Empirical risk as a function of T.
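One way to implement the slope heuristic, sketched under the assumption that the empirical risk decreases roughly linearly in the penalty shape √(T/(n−1)) for large T: estimate that slope on the largest candidates, then select T using twice the estimated slope as penalty. This is a schematic recipe, not the calibrated procedure of the paper; the input risks below are made-up numbers shaped like the figure, and in practice they would be the rn(θ̂1:T) from the previous sketch.

```python
import numpy as np

def slope_heuristic(risks, pens, n_tail=5):
    """Fit risk ~ kappa * pen on the largest models, then penalize with 2*|kappa|."""
    r_tail, p_tail = np.asarray(risks[-n_tail:]), np.asarray(pens[-n_tail:])
    kappa = np.polyfit(p_tail, r_tail, deg=1)[0]  # fitted slope (expected negative)
    scores = np.asarray(risks) + 2.0 * abs(kappa) * np.asarray(pens)
    return int(np.argmin(scores)) + 1  # candidate periods are 1, 2, ...

# Made-up risk curve: monotone decrease until the true period, then flat.
n = 2000
pens = [np.sqrt(T / (n - 1)) for T in range(1, 13)]
risks = [0.40, 0.38, 0.37, 0.26, 0.26, 0.26, 0.26, 0.255, 0.255, 0.25, 0.25, 0.25]
print("T_hat =", slope_heuristic(risks, pens))
```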

Thank you!