SLIDE 1

Change detection in multi-dimensional datasets and time series

Andrea De Simone

andrea.desimone@sissa.it

Univ. Camerino, 2019-02-26

[DS, Jacques – arXiv:1807.06038]

SLIDE 2

Outline

1. Two-Sample Test: Intro & Motivation
2. Nearest Neighbors Two-Sample Test (NN2ST)
3. Gaussian Examples
4. Outlook: Time Series Data

SLIDE 3

Two-Sample Test

Two sets:

  • Trial: $T \equiv \{x_1, \ldots, x_{N_T}\} \overset{\text{iid}}{\sim} p_T$
  • Benchmark: $B \equiv \{x'_1, \ldots, x'_{N_B}\} \overset{\text{iid}}{\sim} p_B$

$x_i, x'_i \in \mathbb{R}^D$; $p_B$, $p_T$ unknown.

[Figure: scatter plots of the Benchmark Sample and the Trial Sample in the (x1, x2) plane]

SLIDE 4

Two-Sample Test

« Are B, T drawn from the same probability distribution? »

easy...

SLIDE 5

Two-Sample Test

« Are B, T drawn from the same probability distribution? »

... hard!

SLIDE 6

Two-Sample Test

Why is it important?

  • detect departures from benchmark
  • find anomalous points (outliers)
  • check if observed data are compatible with expectations
  • detect changes in underlying distributions
  • detect events/shifts in time series in real time

SLIDE 7

Two-Sample Test

Desiderata for a statistical test

(1) model-independent: no assumption about the underlying physical model to interpret data → more general
(2) non-parametric: compare two samples as a whole (not just their means, etc.) → fewer assumptions, no maximum-likelihood estimation
(3) un-binned: high-dim feature space partitioned without rectangular bins → retain full multi-dim info of data

SLIDE 8

Two-Sample Test

Recipe

(1) Density Estimator → reconstruct PDF from samples
(2) Test Statistic (TS) → “measure distance” between PDFs
(3) TS distribution → associate probabilities to TS under the null hypothesis $H_0: p_B = p_T$
(4) p-value → if $p < \alpha$ then reject $H_0$

Let’s build the Nearest Neighbors Two-Sample Test (NN2ST)

SLIDE 9
1. Density Estimator

Find PDF estimators $\hat p_B$, $\hat p_T$, e.g. based on the density of points:

$$\hat p_{B,T}(x) = \frac{\rho_{B,T}(x)}{N_{B,T}}$$

Divide space in square bins?

  ✓ easy
  ✓ can use simple statistics (e.g. χ²)
  ✗ hard/slow/impossible in high-D

Need an un-binned, multi-variate approach: Nearest Neighbors!

[Schilling 1986, Henze 1988] [Wang et al. 2005-2006, Perez-Cruz 2008]
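To see why rectangular bins stop working as the dimension grows, a back-of-the-envelope count may help. The sketch below assumes a hypothetical regular grid with the same number of bins along every axis; the function name and the choice of 10 bins per axis are illustrative only, and the sample size of 20 000 is borrowed from the Gaussian examples later in the talk.

```python
def mean_points_per_bin(n_points: int, bins_per_axis: int, dim: int) -> float:
    """Average occupancy of a regular grid with bins_per_axis**dim cells."""
    n_bins = bins_per_axis ** dim  # grows exponentially with the dimension
    return n_points / n_bins


for dim in (1, 2, 5, 10):
    occupancy = mean_points_per_bin(20_000, 10, dim)
    print(f"D = {dim:2d}: {10 ** dim:>14,d} bins, {occupancy:.2e} points per bin")
```

Once the average occupancy drops far below one point per bin, almost every bin is empty and bin-based statistics such as χ² carry little information.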


SLIDE 13
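1. Density Estimator

[Figure: query point x_j drawn in both T and B, with its Kth-NN distances r_{j,T} and r_{j,B}]

  • Fix integer K.
  • Choose query point $x_j$ in T and draw it in B.
  • Find the distance $r_{j,B}$ of the Kth-NN of $x_j$ in B.
  • Find the distance $r_{j,T}$ of the Kth-NN of $x_j$ in T.
  • Estimate PDFs:

$$\hat p_B(x_j) = \frac{K}{N_B}\,\frac{1}{\omega_D\, r_{j,B}^D}, \qquad \hat p_T(x_j) = \frac{K}{N_T - 1}\,\frac{1}{\omega_D\, r_{j,T}^D}$$

where $\omega_D$ is the volume of the unit ball in $\mathbb{R}^D$.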
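The steps of this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the neighbor search uses `scipy.spatial.cKDTree` with Euclidean distance as an assumed choice, and the samples are assumed to contain no duplicate points (a zero distance would give an infinite density).

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln


def kth_nn_distance(points, queries, k):
    """Euclidean distance from each query to its k-th nearest neighbor in points."""
    dist, _ = cKDTree(points).query(queries, k=k)
    # query returns shape (n,) for k == 1 and (n, k) otherwise; keep the last column
    return dist.reshape(len(queries), -1)[:, -1]


def knn_density_estimates(T, B, K=5):
    """Return (p_T_hat, p_B_hat) evaluated at every point x_j of the trial sample T."""
    T = np.asarray(T, dtype=float)
    B = np.asarray(B, dtype=float)
    N_T, D = T.shape
    N_B = B.shape[0]
    # log-volume of the unit ball in R^D: omega_D = pi^(D/2) / Gamma(D/2 + 1)
    log_omega_D = 0.5 * D * np.log(np.pi) - gammaln(0.5 * D + 1.0)
    r_B = kth_nn_distance(B, T, K)
    # Inside T the closest point to x_j is x_j itself (distance 0), so ask for K + 1
    r_T = kth_nn_distance(T, T, K + 1)
    log_p_B = np.log(K / N_B) - log_omega_D - D * np.log(r_B)
    log_p_T = np.log(K / (N_T - 1)) - log_omega_D - D * np.log(r_T)
    return np.exp(log_p_T), np.exp(log_p_B)
```

Working with logarithms until the last line avoids overflow of $r^D$ in high dimension; the factor $N_T - 1$ reflects that the query point itself is excluded from its own neighbors in T.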

SLIDE 14
2. Test Statistic

  • Measure the “distance” between 2 PDFs
  • Define Test Statistic (to detect under-/over-densities):

$$TS(T) \equiv \frac{1}{N_T} \sum_{j=1}^{N_T} \log \frac{\hat p_T(x_j)}{\hat p_B(x_j)}$$

  • From NN-estimated PDFs:

$$TS(T) = \frac{D}{N_T} \sum_{j=1}^{N_T} \log \frac{r_{j,B}}{r_{j,T}} + \log \frac{N_B}{N_T - 1}$$

  • Related to Kullback-Leibler divergence as: $TS(T) = \hat D_{KL}(\hat p_T \| \hat p_B)$, with

$$D_{KL}(p \| q) \equiv \int_{\mathbb{R}^D} p(x) \log \frac{p(x)}{q(x)}\, dx$$

  • Theorem: this estimator converges to $D_{KL}(p_T \| p_B)$ in the large sample limit

[Wang et al. – 2005, 2006]

SLIDE 15
3. Test Statistic Distribution

How is TS distributed? Permutation test!

  • Assume $p_B = p_T$. Union set $U = T \cup B$.
  • Random reshuffle of $U$, then split it into new samples $\tilde T$, $\tilde B$ of sizes $N_T$, $N_B$.
  • Compute the test statistic $TS_n$ on the reshuffled pair $(\tilde B, \tilde T)$.
  • Repeat many times.

Distribution of TS under $H_0$: $f(TS|H_0) \leftarrow \{TS_n\}$

[asymptotically normal with zero mean]
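The reshuffling loop can be sketched independently of the statistic being used. In the sketch below `permutation_null` is a hypothetical helper that accepts any callable `statistic(T, B)`; the `mean_shift` function is only a placeholder so that the example runs on its own, and it is not the NN2ST statistic.

```python
import numpy as np


def permutation_null(T, B, statistic, n_perm=1000, seed=None):
    """Sample the statistic under H0 by reshuffling the pooled set U = T ∪ B."""
    rng = np.random.default_rng(seed)
    U = np.concatenate([T, B], axis=0)
    n_T = len(T)
    ts_null = np.empty(n_perm)
    for n in range(n_perm):
        shuffled = U[rng.permutation(len(U))]
        # First N_T reshuffled points play the trial role, the rest the benchmark role
        T_tilde, B_tilde = shuffled[:n_T], shuffled[n_T:]
        ts_null[n] = statistic(T_tilde, B_tilde)
    return ts_null


def mean_shift(T, B):
    """Placeholder statistic (NOT the NN2ST one): distance between sample means."""
    return float(np.linalg.norm(T.mean(axis=0) - B.mean(axis=0)))


rng = np.random.default_rng(1)
ts_null = permutation_null(rng.normal(size=(200, 3)), rng.normal(size=(300, 3)),
                           mean_shift, n_perm=200, seed=2)
print(ts_null.mean(), ts_null.std(ddof=1))
```

Each iteration recomputes the statistic from scratch, which is why the permutation step dominates the running time when the statistic itself requires nearest-neighbor searches.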

SLIDE 16
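4. p-value

  • Find $\hat\mu$, $\hat\sigma^2$: mean, variance of $f(TS|H_0)$
  • Standardize the TS:

$$TS \to TS' \equiv \frac{TS - \hat\mu}{\hat\sigma}$$

  • $TS'$ distributed according to $f'(TS'|H_0) = \hat\sigma\, f(\hat\mu + \hat\sigma\, TS'|H_0)$
  • Two-sided p-value:

$$p = 2 \int_{|TS'_{\rm obs}|}^{\infty} f'(TS'|H_0)\, dTS'$$

[Figure: distribution $f'(TS'|H_0)$ with the two tails beyond $\pm|TS'_{\rm obs}|$ shaded as the p-value]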
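A short sketch of the standardization and of the tail integral is given below. It makes one explicit assumption that goes beyond the slide: the standardized null distribution $f'$ is replaced by a standard normal, motivated by the remark that the permutation distribution is asymptotically normal. The function name is hypothetical.

```python
import numpy as np
from scipy.stats import norm


def two_sided_p_value(ts_obs, ts_null):
    """Standardize the observed TS with the permutation sample and return (TS', p)."""
    ts_null = np.asarray(ts_null, dtype=float)
    mu_hat = ts_null.mean()
    sigma_hat = ts_null.std(ddof=1)
    ts_prime = (ts_obs - mu_hat) / sigma_hat
    # Assumption: f'(TS'|H0) is approximated by a standard normal density,
    # so the tail integral above |TS'_obs| is the normal survival function.
    p = 2.0 * norm.sf(abs(ts_prime))
    return ts_prime, p


ts_prime, p = two_sided_p_value(0.031, [-0.004, 0.001, 0.006, -0.002, 0.003, -0.005])
print(ts_prime, p)
```

The normal approximation also allows p-values far smaller than 1/N_perm, which a plain count of permutations exceeding the observed value cannot resolve.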

SLIDE 17

NN2ST: Summary

INPUT:

  • Trial sample: $T \equiv \{x_1, \ldots, x_{N_T}\} \overset{\text{iid}}{\sim} p_T$
  • Benchmark sample: $B \equiv \{x'_1, \ldots, x'_{N_B}\} \overset{\text{iid}}{\sim} p_B$
  • $K$: number of nearest neighbors
  • $N_{\rm perm}$: number of permutations

$x_i, x'_i \in \mathbb{R}^D$; $p_B$, $p_T$ unknown.

OUTPUT:

  • p-value of the null hypothesis $H_0: p_B = p_T$

[check compatibility between 2 samples] [detect changes in underlying distributions]

SLIDE 18

NN2ST: Summary

[Diagram: Benchmark sample and Trial sample → K-NN density ratio estimation → Test Statistic (TS_obs) → permutation test → TS distribution → p-value (tails beyond ±|TS_obs|)]

Python code: github.com/de-simone/NN2ST

[DS, Jacques – arXiv:1807.06038]

SLIDE 19

NN2ST: Summary

  ✓ general, model-independent
  ✓ solid math foundations
  ✓ fast, no optimization
  ✓ sensitive to unspecified signals
  ✗ need to run for each sample pair
  ✗ permutation test is bottleneck

SLIDE 20

NN2ST on Gaussian Samples

Random samples from D-dimensional Gaussians:

$$p_B = \mathcal{N}(\mu_B, \Sigma_B), \qquad p_T = \mathcal{N}(\mu_T, \Sigma_T).$$

$$D = 2, \quad \mu_B = \begin{pmatrix} 1.0 \\ 1.0 \end{pmatrix}, \quad \mu_T = \begin{pmatrix} 1.2 \\ 1.2 \end{pmatrix}, \quad \Sigma_B = \Sigma_T = I_2.$$

[Figure: TS versus N_B (from 10^2 to 10^7) for K = 3 and K = 20]

Convergence to exact KL divergence
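For reference, the exact value that the estimator should approach can be computed from the standard closed-form KL divergence between two multivariate Gaussians. The helper below is a sketch with a hypothetical name; the parameters are the ones quoted on this slide.

```python
import numpy as np


def kl_gaussians(mu_p, cov_p, mu_q, cov_q):
    """Closed-form D_KL( N(mu_p, cov_p) || N(mu_q, cov_q) )."""
    mu_p, mu_q = np.asarray(mu_p, float), np.asarray(mu_q, float)
    cov_p, cov_q = np.asarray(cov_p, float), np.asarray(cov_q, float)
    D = mu_p.size
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff
                  - D + logdet_q - logdet_p)


# Slide parameters: mu_B = (1.0, 1.0), mu_T = (1.2, 1.2), unit covariances
kl_exact = kl_gaussians([1.2, 1.2], np.eye(2), [1.0, 1.0], np.eye(2))
print(kl_exact)  # 0.5 * (0.2**2 + 0.2**2) = 0.04, up to floating-point rounding
```

With equal unit covariances the divergence reduces to half the squared distance between the means, so the direction of the divergence does not matter in this particular example.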

SLIDE 21

NN2ST on Gaussian Samples

Dataset   µ            Σ
B         $1_D$        $I_D$
T_G0      $1_D$        $I_D$
T_G1      $1.12_D$     $I_D$
T_G2      $1_D$        $\begin{pmatrix} 0.95 & 0.1 \\ 0.1 & 0.8 \end{pmatrix} \oplus I_{D-2}$
T_G3      $1.15_D$     $I_D$

$N_B = N_T = 20\,000$, $K = 5$, $N_{\rm perm} = 1\,000$

[Figure: p-value versus N_B (from 10^3 to 10^6), p-value axis from 10^-89 to 10^-5, with the Z = 5 threshold marked]

more data, more power

[Figure: p-value versus dimension D (from 2 to 10), p-value axis from 10^-28 to 1, with the Z = 5 threshold marked]

Curves: TG0, TG1, TG2, TG3

higher D, more power

SLIDE 22

Outlook: time series data

[Caveat Emptor: very preliminary!]

Real-time detection of changes in data streams: variation in the underlying mechanism generating data.

T, B samples: windows of time series data, ending at discrete times t, t′:

$$T_t = \{x_{t-N+1}, \ldots, x_t\}, \qquad B_{t'} = \{x_{t'-N+1}, \ldots, x_{t'}\}, \qquad (N_B = N_T \equiv N).$$

Trial window sliding forward with time. Benchmark window anchored or rolling.

  • anchored B window: $t' = N \to B_{t'} = \{x_1, \ldots, x_N\}$. Captures cumulative changes over time.
  • adjacent windows: $t' = t - N \to B_{t'} = \{x_{t-2N+1}, \ldots, x_{t-N}\}$. Captures “rate of change” at current time.
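The two window schemes translate into simple array slices. The sketch below uses a hypothetical helper `windows`; it converts the 1-based time index of the slide to 0-based NumPy slicing, and it additionally assumes t ≥ 2N so that trial and benchmark windows never overlap.

```python
import numpy as np


def windows(x, t, N, mode="adjacent"):
    """Return (trial, benchmark) windows for a series x at 1-based time t.

    trial     T_t = {x_{t-N+1}, ..., x_t}
    anchored  B   = {x_1, ..., x_N}
    adjacent  B   = {x_{t-2N+1}, ..., x_{t-N}}
    """
    if t < 2 * N or t > len(x):
        raise ValueError("need 2N <= t <= len(x)")  # assumption: no window overlap
    trial = x[t - N:t]
    if mode == "anchored":
        benchmark = x[:N]
    elif mode == "adjacent":
        benchmark = x[t - 2 * N:t - N]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return trial, benchmark


x = np.arange(1, 11)  # toy series whose values equal their 1-based time index
print(windows(x, t=8, N=3, mode="adjacent"))
print(windows(x, t=8, N=3, mode="anchored"))
```

For multi-dimensional features the same slices apply to the first axis of an array of shape (time, D), and each (trial, benchmark) pair can then be passed to the two-sample test.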


SLIDE 24

Outlook: time series data

[Figure: adjacent vs. anchored windows]

SLIDE 25

Outlook: time series data

  • Feature space can be high-dimensional: prices (OHLC), prices of related markets, indicators, volumes, ...
  • Reduce false alarms with persistence factor γ (∼ 1%): H0 rejected γ · N times in a row → detected change in market conditions
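The persistence rule amounts to counting consecutive rejections of H0. The following is a minimal sketch under stated assumptions: `first_persistent_rejection` is a hypothetical name, one p-value is assumed to be available per time step, and the required run length is taken as γ · N rounded up to an integer (at least 1), a rounding choice the slide does not specify.

```python
import math


def first_persistent_rejection(p_values, alpha, gamma, N):
    """Index of the step that completes the first run of ceil(gamma * N) rejections.

    Returns None if H0 is never rejected that many times in a row.
    """
    needed = max(1, math.ceil(gamma * N))  # assumption: round gamma * N upwards
    run = 0
    for step, p in enumerate(p_values):
        run = run + 1 if p < alpha else 0  # any non-rejection resets the streak
        if run >= needed:
            return step
    return None


p_stream = [0.40, 0.01, 0.30, 0.02, 0.01, 0.03, 0.20]
print(first_persistent_rejection(p_stream, alpha=0.05, gamma=0.5, N=6))
```

Isolated rejections, such as the second entry of the toy stream, are ignored; only a sustained sequence raises an alarm, which trades detection delay for fewer false alarms.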

SLIDE 26

Take-Home Messages

(1) Proposed a new statistical test: NN2ST
(2) Model-independent and suitable for high-D data
(3) Excellent results on static datasets
(4) Promising applications for change detection in time series data
