

SLIDE 1

Consistent Change-point Detection with Kernels

Damien Garreau¹, Sylvain Arlot²

¹Inria, DI ENS   ²Université Paris-Sud, Laboratoire de Mathématiques d'Orsay

April 6, 2016

SLIDE 2

An example: shot detection in a movie

[Figure: a real-valued signal computed frame by frame from a movie (roughly 1400 frames); abrupt changes correspond to shot boundaries.]

SLIDE 3

An example: shot detection, cont.

[Figure: the same frame-by-frame signal, with the shot boundaries to detect marked.]

SLIDE 4

Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypotheses
  Dimension selection
  Localization of the change-points
Conclusion

SLIDE 5

Goals

We want to:

◮ detect abrupt changes in the distribution of the data;
◮ deal with interesting (structured) data: each point can be a curve, a graph, a histogram, a persistence diagram...

SLIDE 6

The change-point problem

◮ X is an arbitrary (measurable) set, n < +∞, and X_1, …, X_n ∈ X is a sequence of independent random variables.
◮ For every i ∈ {1, …, n}, P_{X_i} denotes the distribution of X_i.

The change-point problem can be formalized as follows:

◮ given (X_i)_{1≤i≤n}, find the locations of the abrupt changes in the sequence P_{X_1}, …, P_{X_n}.
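As a concrete (hypothetical) instance of this setting with X = R, here is a minimal NumPy sketch; the segment lengths and means are chosen to match the example of slide 8:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: n = 100, with distribution changes at 50 and 70,
# i.e. tau_star = (0, 50, 70, 100) as on slide 8.
segments = [(50, 0.0), (20, 1.5), (30, -0.5)]   # (length, mean) of each segment
X = np.concatenate([mu + rng.normal(size=m) for m, mu in segments])

# The change-point problem: recover the locations 50 and 70 from X alone.
```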

SLIDE 7

Notations

◮ For any D ∈ {1, …, n+1}, the set of sequences of D − 1 change-points is defined by

    T_n^D := { (τ_0, …, τ_D) ∈ N^{D+1} : 0 = τ_0 < τ_1 < ⋯ < τ_D = n }.

◮ τ_1, …, τ_{D−1} are the change-points; a segmentation τ splits {1, …, n} into D_τ segments.
◮ τ⋆ denotes the true segmentation and D⋆ = D_{τ⋆} the true number of segments (hence D⋆ − 1 true change-points).

SLIDE 8

In pictures

Here, X = R, D⋆ = 3, and τ ⋆ = (0, 50, 70, 100).

[Figure: the sequence X_t, 1 ≤ t ≤ 100, plotted against time, with the segment boundaries t_0, t_1, t_2, t_3 = 0, 50, 70, 100 marked.]

SLIDE 9

In pictures, cont.

It is not an easy problem:

[Figure: the same sequence, without the change-points marked. Where are they?]

SLIDE 10

Summary

◮ With a finite sample size, it is not easy to recover the true change-points in the presence of noise.
◮ When X = R^d and the changes occur in the first moments of the distribution, the problem has already received considerable attention, cf. [Basseville and Nikiforov, 1993].
◮ Kernel change-point detection can tackle more subtle changes and less conventional data.

SLIDE 11

Plan

Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypotheses
  Dimension selection
  Localization of the change-points
Conclusion

SLIDE 12

Kernels: a quick reminder

◮ Let k : X × X → R be a positive semi-definite kernel:
◮ k is a measurable function such that, for all x_1, …, x_m ∈ X, the matrix (k(x_i, x_j))_{1≤i,j≤m} is positive semi-definite. Think "inner product".
◮ Examples include:
  ◮ the linear kernel k(x, y) = ⟨x, y⟩,
  ◮ the Gaussian kernel k(x, y) = exp( −‖x − y‖² / (2h²) ),
  ◮ the histogram kernel k(x, y) = Σ_{k=1}^{p} min(x_k, y_k),
  ◮ …
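For illustration, here are minimal NumPy sketches of these three kernels (the function names are mine; the bandwidth h and the histogram length p are left to the caller):

```python
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = <x, y>
    return float(np.dot(x, y))

def gaussian_kernel(x, y, h=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 h^2)), h being the bandwidth
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * h**2)))

def histogram_kernel(x, y):
    # k(x, y) = sum_k min(x_k, y_k), for nonnegative histograms x and y
    return float(np.minimum(x, y).sum())
```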

SLIDE 13

The kernel least-squares criterion

◮ Intuition: the least-squares criterion

    R̂_n(τ) := (1/n) Σ_{ℓ=1}^{D} Σ_{i=τ_{ℓ−1}+1}^{τ_ℓ} ( X_i − X̄_{τ_{ℓ−1}+1, τ_ℓ} )²,

  where X̄_{a,b} denotes the empirical mean of X_a, …, X_b.

◮ Define

    R̂_n(τ) := (1/n) Σ_{i=1}^{n} k(X_i, X_i) − (1/n) Σ_{ℓ=1}^{D} [ 1/(τ_ℓ − τ_{ℓ−1}) ] Σ_{i=τ_{ℓ−1}+1}^{τ_ℓ} Σ_{j=τ_{ℓ−1}+1}^{τ_ℓ} k(X_i, X_j).

◮ This is just a kernelized version: the two definitions coincide when X = R and k(x, y) = xy.
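A direct sketch of this kernelized criterion, computed from the Gram matrix; the helper name and the 0-indexing convention are mine:

```python
import numpy as np

def kernel_criterion(K, tau):
    """R_hat_n(tau) for a Gram matrix K[i, j] = k(X_i, X_j) (0-indexed)
    and a segmentation tau = (0, tau_1, ..., n) as on slide 7."""
    n = K.shape[0]
    value = np.trace(K) / n
    for a, b in zip(tau[:-1], tau[1:]):        # segment {a+1, ..., b} <-> rows a:b
        value -= K[a:b, a:b].sum() / ((b - a) * n)
    return value
```

With the linear kernel on X = R, this reduces to the within-segment sum of squares divided by n, as stated above.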

SLIDE 14

Most important slide of the talk

We investigate the properties of

  • τ ∈ arg min

τ∈T n

  • least−squares

criterion

  • Rn (τ)

+ pen(τ)

  • penalty

function

  • ,

where pen is a function increasing with Dτ.

14
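Below is a sketch of how this penalized criterion can be minimized in practice: a standard dynamic program computes the best R̂_n(τ) for each dimension D, and the penalty then selects D. It illustrates the display above under my own naming and indexing conventions; it is not the authors' reference implementation.

```python
import numpy as np

def best_cost_per_dimension(K, Dmax):
    """For each D = 1, ..., Dmax, the minimal R_hat_n(tau) over tau in T_n^D,
    by the standard dynamic program over the last change-point.
    (Naive O(n^4) cost table: fine as a sketch, not as production code.)"""
    n = K.shape[0]
    diag = np.diag(K)
    cost = np.full((n + 1, n + 1), np.inf)
    for a in range(n):                        # segment {a+1, ..., b} <-> rows a:b
        for b in range(a + 1, n + 1):
            cost[a, b] = diag[a:b].sum() - K[a:b, a:b].sum() / (b - a)
    V = np.full((Dmax + 1, n + 1), np.inf)    # V[D, t]: best cost of {1,...,t} in D segments
    V[0, 0] = 0.0
    for D in range(1, Dmax + 1):
        for t in range(D, n + 1):
            V[D, t] = np.min(V[D - 1, :t] + cost[:t, t])
    return V[1:, n] / n                       # R_hat_n for D = 1, ..., Dmax

def select_dimension(rhat, pen):
    """argmin over D of R_hat_n(D) + pen(D), as in the display above;
    pen maps a vector of dimensions to penalty values."""
    D = np.arange(1, len(rhat) + 1)
    return int(D[np.argmin(rhat + pen(D))])
```

For instance, `pen = lambda D: 5 * D * np.log(n) / n` mimics a penalty increasing linearly in D_τ; the constants here are placeholders, not the calibrated ones from the theory below.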

SLIDE 15

Constant mean and variance

Constant mean and variance: the distribution of X_i is chosen among B(0.5), N(0.5, 0.25), and Γ(1, 0.5), all three with mean 0.5 and variance 0.25.

[Figure: a sample sequence of length 1000 from this model (courtesy of [Arlot et al., 2012]).]
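A sketch of how such a sequence can be simulated; the segment layout is made up, and I read the parameters as Bernoulli(0.5), N(mean 0.5, variance 0.25), and Gamma(shape 1, scale 0.5), which indeed share mean 0.5 and variance 0.25:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three distributions with identical mean (0.5) and variance (0.25):
draws = [
    lambda m: rng.binomial(1, 0.5, size=m) * 1.0,        # B(0.5)
    lambda m: rng.normal(0.5, 0.5, size=m),              # N(0.5, 0.25): std = 0.5
    lambda m: rng.gamma(shape=1.0, scale=0.5, size=m),   # Gamma(1, 0.5)
]
# Hypothetical segment layout: which distribution generates each block
layout = [(0, 300), (1, 400), (2, 300)]
X = np.concatenate([draws[k](m) for k, m in layout])
```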

SLIDE 16

Constant mean and variance, cont.

[Figure: for each of the three kernels (linear, Hermite, Gaussian), the frequency of selected change-points as a function of position (courtesy of [Arlot et al., 2012]).]

SLIDE 17

Plan

Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypotheses
  Dimension selection
  Localization of the change-points
Conclusion

SLIDE 18

More notations

◮ Along with the kernel k comes a reproducing kernel Hilbert space H, endowed with an inner product ⟨·, ·⟩_H.
◮ There exists a mapping Φ : X → H such that, for any x, y ∈ X, k(x, y) = ⟨Φ(x), Φ(y)⟩_H.
◮ The algorithm looks for breaks in the "mean" of Y_i := Φ(X_i) ∈ H.
◮ Whenever possible, define μ⋆_i as the mean of Y_i; it satisfies

    ∀ g ∈ H, ⟨μ⋆_i, g⟩_H = E[g(X_i)] = E[⟨Y_i, g⟩_H].

◮ We write Y_i = μ⋆_i + ε_i.
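μ⋆_i is the kernel mean embedding of P_{X_i}. Taking g = Φ(x) = k(x, ·) in the display above gives ⟨μ⋆_i, Φ(x)⟩_H = E[k(X_i, x)], whose empirical counterpart is a one-liner (a sketch; the helper name is mine):

```python
import numpy as np

def empirical_mean_embedding_eval(sample, x, kernel):
    # <mu_hat, Phi(x)>_H = (1/m) sum_i k(X_i, x): the empirical counterpart of
    # <mu*_i, g>_H = E[g(X_i)] applied to g = Phi(x) = k(x, .)
    return float(np.mean([kernel(xi, x) for xi in sample]))
```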

SLIDE 19

Hypotheses

◮ H is separable.
◮ Bounded data/kernel:

    ∃ M ∈ (0, +∞), ∀ 1 ≤ i ≤ n, k(X_i, X_i) ≤ M².    (Db)

◮ Finite variance:

    ∀ 1 ≤ i ≤ n, v_i := E[ ‖ε_i‖²_H ] ≤ V < +∞.    (V)

Under (Db), an oracle inequality has been proven → see [Arlot et al., 2012] for the result.

SLIDE 20

Dimension selection, light version

◮ Assume that (Db) holds;
◮ suppose that pen(·) is "large enough";
◮ suppose that ∆² × Γ is "large enough", where

    ∆ := inf_{μ⋆_i ≠ μ⋆_{i+1}} ‖ μ⋆_i − μ⋆_{i+1} ‖_H

  is the size of the smallest jump in H, and Γ depends only on the geometry of τ⋆;
◮ then, with high probability, D_τ̂ = D⋆.

If k is characteristic, we recover all the changes in (P_{X_i})_i.
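Each jump ‖μ⋆_i − μ⋆_{i+1}‖_H is the maximum mean discrepancy (MMD) between consecutive distributions; here is a sketch of the standard plug-in estimate from Gram blocks (naming and block convention are mine):

```python
import numpy as np

def squared_mmd(K_xx, K_yy, K_xy):
    # ||mu_P - mu_Q||_H^2 = E k(X, X') + E k(Y, Y') - 2 E k(X, Y),
    # estimated from the Gram blocks of two samples; Delta is the smallest
    # such distance over consecutive segments.
    return K_xx.mean() + K_yy.mean() - 2 * K_xy.mean()
```

In practice K_xx, K_yy, K_xy would be the Gram blocks of two consecutive segments of the data.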

SLIDE 21

Dimension selection

Theorem

Let y be a positive number. Assume that (Db) holds and that

    ∀ τ ∈ T_n, pen(τ) = (C D_τ M² / n) ( 1 + √(2 (4 + y + log n)) )²,

with C ≥ (2D⋆ + 1)(5 + y + log D⋆). Suppose that

    ∆² × Γ ≥ (C D⋆ M² / n) ( y + log(n / D⋆) ).

Then

    P( D_τ̂ = D⋆ ) ≥ 1 − e^{−y}.

SLIDE 22

Distance between segmentations

◮ We only compare segmentations with the same number of segments D⋆.
◮ Several distances are possible; they are equivalent under assumptions on

    Λ_τ := (1/n) min_{λ ∈ τ} |λ|,

  the normalized length of the smallest segment of τ.
◮ We focus on

    d_∞(τ¹, τ²) := max_{1 ≤ i ≤ D⋆ − 1} | τ¹_i − τ²_i |.
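A one-liner sketch of this distance, for segmentations given as full vectors (0, τ_1, …, n) as on slide 7:

```python
def d_infty(tau1, tau2):
    """d_inf between two segmentations with the same number of segments:
    maximum deviation over the interior change-points."""
    assert len(tau1) == len(tau2), "same number of segments required"
    return max(abs(a - b) for a, b in zip(tau1[1:-1], tau2[1:-1]))
```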

SLIDE 23

Localization of the change-points, light version

◮ Assume that D⋆ is known and that (V) holds.
◮ Take δ_n > 0, and choose τ̂ in

    T_n^{D⋆}(δ_n) := { τ ∈ T_n : Λ_τ ≥ δ_n, D_τ = D⋆ }.

◮ Then, for any 0 < x < Λ_{τ⋆},

    P( (1/n) d_∞(τ̂, τ⋆) ≥ x ) ≲ (V / (nx)) ( 1/δ_n + 1/x ).

◮ This goes to 0 whenever δ_n → 0 and nδ_n → +∞.

SLIDE 24

Localization of the change-points

Theorem

Assume that (V) holds. Take δ_n > 0 and choose

    τ̂ ∈ argmin_{τ ∈ T_n^{D⋆}(δ_n)} R̂_n(τ).

Suppose that δ_n ≤ Λ_{τ⋆}. Then, for any 0 < x ≤ Λ_{τ⋆},

    P( (1/n) d_∞(τ̂, τ⋆) ≥ x ) ≲ (V D⋆ / (n x ∆²)) ( 1/δ_n + (D⋆)³ / x ).

For instance, take δ_n = n^{−1/2}: then (1/n) d_∞(τ̂, τ⋆) = o_P(n^{−1/2}).
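As a toy end-to-end check of this statement (a sketch under my own conventions: Gaussian kernel with arbitrary bandwidth, D⋆ known, exhaustive search instead of dynamic programming, and the Λ_τ ≥ δ_n constraint ignored):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Toy data with tau_star = (0, 50, 70, 100), as on slide 8.
X = np.concatenate([0.0 + rng.normal(size=50),
                    1.5 + rng.normal(size=20),
                    -0.5 + rng.normal(size=30)])
n, h = len(X), 1.0
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * h**2))   # Gaussian Gram matrix
tr = np.trace(K) / n

def crit(tau):
    # R_hat_n(tau), as on slide 13
    return tr - sum(K[a:b, a:b].sum() / ((b - a) * n)
                    for a, b in zip(tau[:-1], tau[1:]))

# D* = 3 assumed known: exhaustive minimization over the two change-points
tau_hat = min(((0, t1, t2, n) for t1, t2 in combinations(range(1, n), 2)), key=crit)
d_inf = max(abs(a - b) for a, b in zip(tau_hat[1:-1], (50, 70)))
print(tau_hat, d_inf)   # with high probability, d_inf is small
```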

SLIDE 25

Plan

Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypotheses
  Dimension selection
  Localization of the change-points
Conclusion

SLIDE 26

Take-away message

◮ A kernelized version of the change-point detection procedure of [Lebarbier, 2005].
◮ Detection of changes in the distribution, not only in the first moments.
◮ Makes it possible to deal with structured data more efficiently.
◮ Under reasonable assumptions and for a class of penalty functions,
  ◮ an oracle inequality is available,
  ◮ the procedure is consistent,
  ◮ it recovers the true locations of the change-points.

SLIDE 27

Future work

◮ Exchange the hypotheses and still prove our results (in progress);
◮ tackle dependency structures within the X_i's, as in [Lavielle and Moulines, 2000];
◮ learn how to choose the kernel;
◮ find interesting data!

SLIDE 28

Thank you for your attention!


SLIDE 29

References I

Arlot, S., Celisse, A., and Harchaoui, Z. (2012). Kernel change-point detection. arXiv preprint arXiv:1202.3878.

Basseville, M. and Nikiforov, I. (1993). Detection of Abrupt Changes: Theory and Application. Prentice Hall, Englewood Cliffs.

Lavielle, M. and Moulines, E. (2000). Least-squares estimation of an unknown number of shifts in a time series. Journal of Time Series Analysis, 21(1):33–59.

Lebarbier, É. (2005). Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85:717–736.

SLIDE 30

Bonus: elements of proof (dimension selection)

Main idea: for every τ ∈ T_n such that D_τ ≠ D⋆, w.h.p.,

    R̂(τ) + pen(τ) > R̂(τ⋆) + pen(τ⋆).

Since τ̂ minimizes R̂(·) + pen(·), D_τ̂ = D⋆ w.h.p.

SLIDE 31

Bonus: elements of proof (dimension selection)

Same main idea; decompose the criterion as

    R̂(τ) = (1/n) ‖μ⋆ − μ⋆_τ‖² + (2/n) ⟨μ⋆ − μ⋆_τ, ε⟩ − (1/n) ‖Π_τ ε‖² + (1/n) ‖ε‖²,

where μ⋆_τ is the projection of μ⋆ onto the subset of H^n "constant on the segments of τ", i.e.,

    F_τ := { f ∈ H^n : f_{τ_{ℓ−1}+1} = ⋯ = f_{τ_ℓ}, ∀ 1 ≤ ℓ ≤ D_τ },

and Π_τ denotes the orthogonal projection onto F_τ.

SLIDE 32

Bonus: elements of proof (dimension selection)

With this decomposition, and since the term (1/n)‖ε‖² is common to all τ, we are reduced to showing that if D_τ ≠ D⋆, w.h.p.,

    (1/n) ‖μ⋆ − μ⋆_τ‖² + (2/n) ⟨μ⋆ − μ⋆_τ, ε⟩ + (1/n) ‖Π_{τ⋆} ε‖² − (1/n) ‖Π_τ ε‖² > pen(τ⋆) − pen(τ).

SLIDE 33

Elements of proof, cont.

We control each term, for every τ, with probability 1 − e^{−x}:

◮ the linear term: | ⟨μ⋆ − μ⋆_τ, ε⟩ | ≲ θ ‖μ⋆ − μ⋆_τ‖² + (1/θ) M² x, for any θ > 0;
◮ the quadratic term: | ‖Π_τ ε‖² − E[ ‖Π_τ ε‖² ] | ≲ ( x + √(x D_τ) ) M²;
◮ pen(τ) − pen(τ⋆), via technical lemmas.

SLIDE 34

Elements of proof, cont.

By a union bound,

    P( ⋂_{τ ∈ T_n} Ω_τ ) ≥ 1 − 4 Σ_{τ ∈ T_n} e^{−x_τ}.

Recall that #T_n^d = (n−1 choose d−1) ≤ (ne/d)^d. Choosing x_τ := 4d + dy + d log(n/d) for τ with D_τ = d,

    Σ_{τ ∈ T_n} e^{−x_τ} ≤ Σ_{d=1}^{n} exp( d + d log n − d log d − 4d − dy − d log n + d log d )
                        = Σ_{d=1}^{n} exp( −d(3 + y) )
                        = Σ_{d=1}^{n} ( e^{−3−y} )^d
                        = e^{−3−y} · ( 1 − (e^{−3−y})^n ) / ( 1 − e^{−3−y} )
                        ≤ e^{−y} / 4.

This step fails without the boundedness assumption (Db).

SLIDE 35

Bonus: a word about the concentration result

Lemma

For every x > 0, with probability ≥ 1 − e^{−x},

    | ‖Π_τ ε‖² − E[ ‖Π_τ ε‖² ] | ≤ (14 M² / 3) ( x + 2 √(2 x D_τ) ).

Proof.

Write ‖Π_τ ε‖² as Σ_{1 ≤ ℓ ≤ D_τ} T_ℓ, a sum of independent random variables, where

    T_ℓ := ( 1 / (τ_ℓ − τ_{ℓ−1}) ) ‖ Σ_{j=τ_{ℓ−1}+1}^{τ_ℓ} ε_j ‖²,

and apply Bernstein's inequality. The tricky part is to check that the moment conditions for Bernstein's inequality are satisfied. Idea: write E[T_ℓ^q] as an integral depending upon P( ‖ Σ_{j=τ_{ℓ−1}+1}^{τ_ℓ} ε_j ‖ ≥ y ), for which Pinelis–Sakhanenko's inequality gives an upper bound.