

SLIDE 1

Consistent Change-point Detection with Kernels

Damien Garreau¹, Sylvain Arlot²

¹Inria, DI ENS   ²Université Paris-Sud, Laboratoire de Mathématiques d'Orsay

April 6, 2016

SLIDE 2

An example: shot detection in a movie

[Figure: a real-valued signal computed frame by frame from a movie (roughly 1400 frames); abrupt changes correspond to shot boundaries.]

SLIDE 3

An example: shot detection, cont.

[Figure: the same frame-by-frame signal, with the shot boundaries to detect marked.]

SLIDE 4

Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypotheses
  Dimension selection
  Localization of the change-points
Conclusion

SLIDE 5

Goals

We want to:

◮ detect abrupt changes in the distribution of the data;
◮ deal with interesting (structured) data: each point can be a curve, a graph, a histogram, a persistence diagram...

SLIDE 6

The change-point problem

◮ X is an arbitrary (measurable) set, n < +∞, and X_1, …, X_n ∈ X is a sequence of independent random variables.
◮ For every i ∈ {1, …, n}, P_{X_i} denotes the distribution of X_i.

The change-point problem can be formalized as follows:

◮ given (X_i)_{1≤i≤n}, find the locations of the abrupt changes in the sequence P_{X_1}, …, P_{X_n}.
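As a concrete (hypothetical) instance of this setting with X = R, here is a minimal NumPy sketch; the segment lengths and means are chosen to match the example of slide 8:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: n = 100, with distribution changes at 50 and 70,
# i.e. tau_star = (0, 50, 70, 100) as on slide 8.
segments = [(50, 0.0), (20, 1.5), (30, -0.5)]   # (length, mean) of each segment
X = np.concatenate([mu + rng.normal(size=m) for m, mu in segments])

# The change-point problem: recover the locations 50 and 70 from X alone.
```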

SLIDE 7

Notations

◮ For any D ∈ {1, …, n+1}, the set of sequences of D − 1 change-points is defined by

    T_n^D := { (τ_0, …, τ_D) ∈ N^{D+1} : 0 = τ_0 < τ_1 < ⋯ < τ_D = n }.

◮ τ_1, …, τ_{D−1} are the change-points; a segmentation τ splits {1, …, n} into D_τ segments.
◮ τ⋆ denotes the true segmentation and D⋆ = D_{τ⋆} the true number of segments (hence D⋆ − 1 true change-points).

SLIDE 8

In pictures

Here, X = R, D⋆ = 3, and τ ⋆ = (0, 50, 70, 100).

[Figure: the sequence X_t, 1 ≤ t ≤ 100, plotted against time, with the segment boundaries t_0, t_1, t_2, t_3 = 0, 50, 70, 100 marked.]

SLIDE 9

In pictures, cont.

It is not an easy problem:

[Figure: the same sequence, without the change-points marked. Where are they?]

SLIDE 10

Summary

◮ With a finite sample size, it is not easy to recover the true change-points in the presence of noise.
◮ When X = R^d and the changes occur in the first moments of the distribution, the problem has already received considerable attention, cf. [Basseville and Nikiforov, 1993].
◮ Kernel change-point detection can tackle more subtle changes and less conventional data.

SLIDE 11

Plan

Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypotheses
  Dimension selection
  Localization of the change-points
Conclusion

SLIDE 12

Kernels: a quick reminder

◮ Let k : X × X → R be a positive semi-definite kernel:
◮ k is a measurable function such that, for all x_1, …, x_m ∈ X, the matrix (k(x_i, x_j))_{1≤i,j≤m} is positive semi-definite. Think "inner product".
◮ Examples include:
  ◮ the linear kernel k(x, y) = ⟨x, y⟩,
  ◮ the Gaussian kernel k(x, y) = exp( −‖x − y‖² / (2h²) ),
  ◮ the histogram kernel k(x, y) = Σ_{k=1}^{p} min(x_k, y_k),
  ◮ …
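For illustration, here are minimal NumPy sketches of these three kernels (the function names are mine; the bandwidth h and the histogram length p are left to the caller):

```python
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = <x, y>
    return float(np.dot(x, y))

def gaussian_kernel(x, y, h=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 h^2)), h being the bandwidth
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * h**2)))

def histogram_kernel(x, y):
    # k(x, y) = sum_k min(x_k, y_k), for nonnegative histograms x and y
    return float(np.minimum(x, y).sum())
```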

SLIDE 13

The kernel least-squares criterion

◮ Intuition: the least-squares criterion

    R̂_n(τ) := (1/n) Σ_{ℓ=1}^{D} Σ_{i=τ_{ℓ−1}+1}^{τ_ℓ} ( X_i − X̄_{τ_{ℓ−1}+1, τ_ℓ} )²,

  where X̄_{a,b} denotes the empirical mean of X_a, …, X_b.

◮ Define

    R̂_n(τ) := (1/n) Σ_{i=1}^{n} k(X_i, X_i) − (1/n) Σ_{ℓ=1}^{D} [ 1/(τ_ℓ − τ_{ℓ−1}) ] Σ_{i=τ_{ℓ−1}+1}^{τ_ℓ} Σ_{j=τ_{ℓ−1}+1}^{τ_ℓ} k(X_i, X_j).

◮ This is just a kernelized version: the two definitions coincide when X = R and k(x, y) = xy.
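A direct sketch of this kernelized criterion, computed from the Gram matrix; the helper name and the 0-indexing convention are mine:

```python
import numpy as np

def kernel_criterion(K, tau):
    """R_hat_n(tau) for a Gram matrix K[i, j] = k(X_i, X_j) (0-indexed)
    and a segmentation tau = (0, tau_1, ..., n) as on slide 7."""
    n = K.shape[0]
    value = np.trace(K) / n
    for a, b in zip(tau[:-1], tau[1:]):        # segment {a+1, ..., b} <-> rows a:b
        value -= K[a:b, a:b].sum() / ((b - a) * n)
    return value
```

With the linear kernel on X = R, this reduces to the within-segment sum of squares divided by n, as stated above.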

SLIDE 14

Most important slide of the talk

We investigate the properties of

  • τ ∈ arg min

τ∈T n

  • least−squares

criterion

  • Rn (τ)

+ pen(τ)

  • penalty

function

  • ,

where pen is a function increasing with Dτ.

14
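Below is a sketch of how this penalized criterion can be minimized in practice: a standard dynamic program computes the best R̂_n(τ) for each dimension D, and the penalty then selects D. It illustrates the display above under my own naming and indexing conventions; it is not the authors' reference implementation.

```python
import numpy as np

def best_cost_per_dimension(K, Dmax):
    """For each D = 1, ..., Dmax, the minimal R_hat_n(tau) over tau in T_n^D,
    by the standard dynamic program over the last change-point.
    (Naive O(n^4) cost table: fine as a sketch, not as production code.)"""
    n = K.shape[0]
    diag = np.diag(K)
    cost = np.full((n + 1, n + 1), np.inf)
    for a in range(n):                        # segment {a+1, ..., b} <-> rows a:b
        for b in range(a + 1, n + 1):
            cost[a, b] = diag[a:b].sum() - K[a:b, a:b].sum() / (b - a)
    V = np.full((Dmax + 1, n + 1), np.inf)    # V[D, t]: best cost of {1,...,t} in D segments
    V[0, 0] = 0.0
    for D in range(1, Dmax + 1):
        for t in range(D, n + 1):
            V[D, t] = np.min(V[D - 1, :t] + cost[:t, t])
    return V[1:, n] / n                       # R_hat_n for D = 1, ..., Dmax

def select_dimension(rhat, pen):
    """argmin over D of R_hat_n(D) + pen(D), as in the display above;
    pen maps a vector of dimensions to penalty values."""
    D = np.arange(1, len(rhat) + 1)
    return int(D[np.argmin(rhat + pen(D))])
```

For instance, `pen = lambda D: 5 * D * np.log(n) / n` mimics a penalty increasing linearly in D_τ; the constants here are placeholders, not the calibrated ones from the theory below.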

SLIDE 15

Constant mean and variance

Constant mean and variance: the distribution of X_i is chosen among B(0.5), N(0.5, 0.25), and Γ(1, 0.5), all three with mean 0.5 and variance 0.25.

[Figure: a sample sequence of length 1000 from this model (courtesy of [Arlot et al., 2012]).]
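A sketch of how such a sequence can be simulated; the segment layout is made up, and I read the parameters as Bernoulli(0.5), N(mean 0.5, variance 0.25), and Gamma(shape 1, scale 0.5), which indeed share mean 0.5 and variance 0.25:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three distributions with identical mean (0.5) and variance (0.25):
draws = [
    lambda m: rng.binomial(1, 0.5, size=m) * 1.0,        # B(0.5)
    lambda m: rng.normal(0.5, 0.5, size=m),              # N(0.5, 0.25): std = 0.5
    lambda m: rng.gamma(shape=1.0, scale=0.5, size=m),   # Gamma(1, 0.5)
]
# Hypothetical segment layout: which distribution generates each block
layout = [(0, 300), (1, 400), (2, 300)]
X = np.concatenate([draws[k](m) for k, m in layout])
```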

SLIDE 16

Constant mean and variance, cont.

[Figure: for each of the three kernels (linear, Hermite, Gaussian), the frequency of selected change-points as a function of position (courtesy of [Arlot et al., 2012]).]

SLIDE 17

Plan

Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypotheses
  Dimension selection
  Localization of the change-points
Conclusion

SLIDE 18

More notations

◮ Along with the kernel k comes a reproducing kernel Hilbert space H, endowed with an inner product ⟨·, ·⟩_H.
◮ There exists a mapping Φ : X → H such that, for any x, y ∈ X, k(x, y) = ⟨Φ(x), Φ(y)⟩_H.
◮ The algorithm looks for breaks in the "mean" of Y_i := Φ(X_i) ∈ H.
◮ Whenever possible, define μ⋆_i as the mean of Y_i; it satisfies

    ∀ g ∈ H, ⟨μ⋆_i, g⟩_H = E[g(X_i)] = E[⟨Y_i, g⟩_H].

◮ We write Y_i = μ⋆_i + ε_i.
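μ⋆_i is the kernel mean embedding of P_{X_i}. Taking g = Φ(x) = k(x, ·) in the display above gives ⟨μ⋆_i, Φ(x)⟩_H = E[k(X_i, x)], whose empirical counterpart is a one-liner (a sketch; the helper name is mine):

```python
import numpy as np

def empirical_mean_embedding_eval(sample, x, kernel):
    # <mu_hat, Phi(x)>_H = (1/m) sum_i k(X_i, x): the empirical counterpart of
    # <mu*_i, g>_H = E[g(X_i)] applied to g = Phi(x) = k(x, .)
    return float(np.mean([kernel(xi, x) for xi in sample]))
```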

SLIDE 19

Hypotheses

◮ H is separable.
◮ Bounded data/kernel:

    ∃ M ∈ (0, +∞), ∀ 1 ≤ i ≤ n, k(X_i, X_i) ≤ M².    (Db)

◮ Finite variance:

    ∀ 1 ≤ i ≤ n, v_i := E[ ‖ε_i‖²_H ] ≤ V < +∞.    (V)

Under (Db), an oracle inequality has been proven → see [Arlot et al., 2012] for the result.

SLIDE 20

Dimension selection, light version

◮ Assume that (Db) holds;
◮ suppose that pen(·) is "large enough";
◮ suppose that ∆² × Γ is "large enough", where

    ∆ := inf_{μ⋆_i ≠ μ⋆_{i+1}} ‖ μ⋆_i − μ⋆_{i+1} ‖_H

  is the size of the smallest jump in H, and Γ depends only on the geometry of τ⋆;
◮ then, with high probability, D_τ̂ = D⋆.

If k is characteristic, we recover all the changes in (P_{X_i})_i.
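Each jump ‖μ⋆_i − μ⋆_{i+1}‖_H is the maximum mean discrepancy (MMD) between consecutive distributions; here is a sketch of the standard plug-in estimate from Gram blocks (naming and block convention are mine):

```python
import numpy as np

def squared_mmd(K_xx, K_yy, K_xy):
    # ||mu_P - mu_Q||_H^2 = E k(X, X') + E k(Y, Y') - 2 E k(X, Y),
    # estimated from the Gram blocks of two samples; Delta is the smallest
    # such distance over consecutive segments.
    return K_xx.mean() + K_yy.mean() - 2 * K_xy.mean()
```

In practice K_xx, K_yy, K_xy would be the Gram blocks of two consecutive segments of the data.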

SLIDE 21

Dimension selection

Theorem

Let y be a positive number. Assume that (Db) holds and that

    ∀ τ ∈ T_n, pen(τ) = (C D_τ M² / n) ( 1 + √(2 (4 + y + log n)) )²,

with C ≥ (2D⋆ + 1)(5 + y + log D⋆). Suppose that

    ∆² × Γ ≥ (C D⋆ M² / n) ( y + log(n / D⋆) ).

Then

    P( D_τ̂ = D⋆ ) ≥ 1 − e^{−y}.

SLIDE 22

Distance between segmentations

◮ We only compare segmentations with the same number of segments D⋆.
◮ Several distances are possible; they are equivalent under assumptions on

    Λ_τ := (1/n) min_{λ ∈ τ} |λ|,

  the normalized length of the smallest segment of τ.
◮ We focus on

    d_∞(τ¹, τ²) := max_{1 ≤ i ≤ D⋆ − 1} | τ¹_i − τ²_i |.
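A one-liner sketch of this distance, for segmentations given as full vectors (0, τ_1, …, n) as on slide 7:

```python
def d_infty(tau1, tau2):
    """d_inf between two segmentations with the same number of segments:
    maximum deviation over the interior change-points."""
    assert len(tau1) == len(tau2), "same number of segments required"
    return max(abs(a - b) for a, b in zip(tau1[1:-1], tau2[1:-1]))
```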

SLIDE 23

Localization of the change-points, light version

◮ Assume that D⋆ is known and that (V) holds.
◮ Take δ_n > 0, and choose τ̂ in

    T_n^{D⋆}(δ_n) := { τ ∈ T_n : Λ_τ ≥ δ_n, D_τ = D⋆ }.

◮ Then, for any 0 < x < Λ_{τ⋆},

    P( (1/n) d_∞(τ̂, τ⋆) ≥ x ) ≲ (V / (nx)) ( 1/δ_n + 1/x ).

◮ This goes to 0 whenever δ_n → 0 and nδ_n → +∞.

SLIDE 24

Localization of the change-points

Theorem

Assume that (V) holds. Take δ_n > 0 and choose

    τ̂ ∈ argmin_{τ ∈ T_n^{D⋆}(δ_n)} R̂_n(τ).

Suppose that δ_n ≤ Λ_{τ⋆}. Then, for any 0 < x ≤ Λ_{τ⋆},

    P( (1/n) d_∞(τ̂, τ⋆) ≥ x ) ≲ (V D⋆ / (n x ∆²)) ( 1/δ_n + (D⋆)³ / x ).

For instance, take δ_n = n^{−1/2}: then (1/n) d_∞(τ̂, τ⋆) = o_P(n^{−1/2}).
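As a toy end-to-end check of this statement (a sketch under my own conventions: Gaussian kernel with arbitrary bandwidth, D⋆ known, exhaustive search instead of dynamic programming, and the Λ_τ ≥ δ_n constraint ignored):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Toy data with tau_star = (0, 50, 70, 100), as on slide 8.
X = np.concatenate([0.0 + rng.normal(size=50),
                    1.5 + rng.normal(size=20),
                    -0.5 + rng.normal(size=30)])
n, h = len(X), 1.0
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * h**2))   # Gaussian Gram matrix
tr = np.trace(K) / n

def crit(tau):
    # R_hat_n(tau), as on slide 13
    return tr - sum(K[a:b, a:b].sum() / ((b - a) * n)
                    for a, b in zip(tau[:-1], tau[1:]))

# D* = 3 assumed known: exhaustive minimization over the two change-points
tau_hat = min(((0, t1, t2, n) for t1, t2 in combinations(range(1, n), 2)), key=crit)
d_inf = max(abs(a - b) for a, b in zip(tau_hat[1:-1], (50, 70)))
print(tau_hat, d_inf)   # with high probability, d_inf is small
```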

SLIDE 25

Plan

Introduction
  Overview
  The change-point problem
Algorithm
  Kernel change-point algorithm
  Experimental results
Theoretical results
  Hypotheses
  Dimension selection
  Localization of the change-points
Conclusion

SLIDE 26

Take-away message

◮ A kernelized version of the change-point detection procedure of [Lebarbier, 2005].
◮ Detection of changes in the distribution, not only in the first moments.
◮ Makes it possible to deal with structured data more efficiently.
◮ Under reasonable assumptions and for a class of penalty functions,
  ◮ an oracle inequality is available,
  ◮ the procedure is consistent,
  ◮ it recovers the true locations of the change-points.

SLIDE 27

Future work

◮ Exchange the hypotheses and still prove our results (in progress);
◮ tackle dependency structures within the X_i's, as in [Lavielle and Moulines, 2000];
◮ learn how to choose the kernel;
◮ find interesting data!

SLIDE 28

Thank you for your attention!


SLIDE 29

References I

Arlot, S., Celisse, A., and Harchaoui, Z. (2012). Kernel change-point detection. arXiv preprint arXiv:1202.3878.

Basseville, M. and Nikiforov, I. (1993). Detection of Abrupt Changes: Theory and Application. Prentice Hall, Englewood Cliffs.

Lavielle, M. and Moulines, E. (2000). Least-squares estimation of an unknown number of shifts in a time series. Journal of Time Series Analysis, 21(1):33–59.

Lebarbier, É. (2005). Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Processing, 85:717–736.

SLIDE 30

Bonus: elements of proof (dimension selection)

Main idea: for every τ ∈ T_n such that D_τ ≠ D⋆, w.h.p.,

    R̂(τ) + pen(τ) > R̂(τ⋆) + pen(τ⋆).

Since τ̂ minimizes R̂(·) + pen(·), D_τ̂ = D⋆ w.h.p.

SLIDE 31

Bonus: elements of proof (dimension selection)

Same main idea; decompose the criterion as

    R̂(τ) = (1/n) ‖μ⋆ − μ⋆_τ‖² + (2/n) ⟨μ⋆ − μ⋆_τ, ε⟩ − (1/n) ‖Π_τ ε‖² + (1/n) ‖ε‖²,

where μ⋆_τ is the projection of μ⋆ onto the subset of H^n "constant on the segments of τ", i.e.,

    F_τ := { f ∈ H^n : f_{τ_{ℓ−1}+1} = ⋯ = f_{τ_ℓ}, ∀ 1 ≤ ℓ ≤ D_τ },

and Π_τ denotes the orthogonal projection onto F_τ.

SLIDE 32

Bonus: elements of proof (dimension selection)

With this decomposition, and since the term (1/n)‖ε‖² is common to all τ, we are reduced to showing that if D_τ ≠ D⋆, w.h.p.,

    (1/n) ‖μ⋆ − μ⋆_τ‖² + (2/n) ⟨μ⋆ − μ⋆_τ, ε⟩ + (1/n) ‖Π_{τ⋆} ε‖² − (1/n) ‖Π_τ ε‖² > pen(τ⋆) − pen(τ).

SLIDE 33

Elements of proof, cont.

We control each term, for every τ, with probability 1 − e^{−x}:

◮ the linear term: | ⟨μ⋆ − μ⋆_τ, ε⟩ | ≲ θ ‖μ⋆ − μ⋆_τ‖² + (1/θ) M² x, for any θ > 0;
◮ the quadratic term: | ‖Π_τ ε‖² − E[ ‖Π_τ ε‖² ] | ≲ ( x + √(x D_τ) ) M²;
◮ pen(τ) − pen(τ⋆), via technical lemmas.

SLIDE 34

Elements of proof, cont.

By a union bound,

    P( ⋂_{τ ∈ T_n} Ω_τ ) ≥ 1 − 4 Σ_{τ ∈ T_n} e^{−x_τ}.

Recall that #T_n^d = (n−1 choose d−1) ≤ (ne/d)^d. Choosing x_τ := 4d + dy + d log(n/d) for τ with D_τ = d,

    Σ_{τ ∈ T_n} e^{−x_τ} ≤ Σ_{d=1}^{n} exp( d + d log n − d log d − 4d − dy − d log n + d log d )
                        = Σ_{d=1}^{n} exp( −d(3 + y) )
                        = Σ_{d=1}^{n} ( e^{−3−y} )^d
                        = e^{−3−y} · ( 1 − (e^{−3−y})^n ) / ( 1 − e^{−3−y} )
                        ≤ e^{−y} / 4.

This step fails without the boundedness assumption (Db).

SLIDE 35

Bonus: a word about the concentration result

Lemma

For every x > 0, with probability ≥ 1 − e^{−x},

    | ‖Π_τ ε‖² − E[ ‖Π_τ ε‖² ] | ≤ (14 M² / 3) ( x + 2 √(2 x D_τ) ).

Proof.

Write ‖Π_τ ε‖² as Σ_{1 ≤ ℓ ≤ D_τ} T_ℓ, a sum of independent random variables, where

    T_ℓ := ( 1 / (τ_ℓ − τ_{ℓ−1}) ) ‖ Σ_{j=τ_{ℓ−1}+1}^{τ_ℓ} ε_j ‖²,

and apply Bernstein's inequality. The tricky part is to check that the moment conditions for Bernstein's inequality are satisfied. Idea: write E[T_ℓ^q] as an integral depending upon P( ‖ Σ_{j=τ_{ℓ−1}+1}^{τ_ℓ} ε_j ‖ ≥ y ), for which Pinelis–Sakhanenko's inequality gives an upper bound.