
Change detection in multi-dimensional datasets and time series - PowerPoint PPT Presentation



  1. Change detection in multi-dimensional datasets and time series Andrea De Simone andrea.desimone@sissa.it Univ. Camerino, 2019-02-26 [DS, Jacques – arXiv:1807.06038]

  2. Outline
     1 Two-Sample Test: Intro & Motivation
     2 Nearest Neighbors Two-Sample Test (NN2ST)
     3 Gaussian Examples
     4 Outlook: Time Series Data

  3. Two-Sample Test
     Two sets:
       Trial: $T \equiv \{x_1, \dots, x_{N_T}\} \overset{\text{iid}}{\sim} p_T$
       Benchmark: $B \equiv \{x'_1, \dots, x'_{N_B}\} \overset{\text{iid}}{\sim} p_B$
     with $x_i, x'_i \in \mathbb{R}^D$ and $p_B$, $p_T$ unknown.
     [Figure: scatter plots of the Benchmark Sample and the Trial Sample in the $(x_1, x_2)$ plane]

  4. Two-Sample Test (sets $T$ and $B$ as defined on slide 3)
     « Are $B$, $T$ drawn from the same probability distribution? »
     [Figure label: easy...]

  5. Two-Sample Test (sets $T$ and $B$ as defined on slide 3)
     « Are $B$, $T$ drawn from the same probability distribution? »
     [Figure label: ... hard!]

  6. Two-Sample Test: why is it important?
     • detect departures from a benchmark
     • find anomalous points (outliers)
     • check whether observed data are compatible with expectations
     • detect changes in underlying distributions
     • detect events/shifts in time series in real time

  7. Two-Sample Test: desiderata for a statistical test
     (1) model-independent: no assumption about the underlying physical model used to interpret the data → more general
     (2) non-parametric: compare two samples as a whole (not just their means, etc.) → fewer assumptions, no maximum-likelihood estimation
     (3) un-binned: high-dimensional feature space partitioned without rectangular bins → retain the full multi-dimensional information of the data

  8. Two-Sample Test: recipe
     (1) Density estimator → reconstruct PDF from samples
     (2) Test statistic (TS) → “measure distance” between PDFs
     (3) TS distribution → associate probabilities to TS under the null hypothesis $H_0: p_B = p_T$
     (4) p-value → if $p < \alpha$ then reject $H_0$
     Let’s build the Nearest Neighbors Two-Sample Test (NN2ST).

  9. 1. Density Estimator
     Divide space in square bins?
       ✓ easy
       ✓ can use simple statistics (e.g. $\chi^2$)
       ✗ hard/slow/impossible in high $D$
     Need an un-binned, multivariate approach.
     Find PDF estimators $\hat p_B$, $\hat p_T$, e.g. based on densities of points: $\hat p_{B,T}(x) = \rho_{B,T}(x) / N_{B,T}$
     Nearest Neighbors! [Schilling 1986, Henze 1988] [Wang et al. 2005-2006, Perez-Cruz 2008]

  13. 1. Density Estimator
      • Fix an integer $K$.
      • Choose a query point $x_j$ in $T$ and draw it in $B$.
      • Find the distance $r_{j,B}$ from $x_j$ to its $K$th nearest neighbor (NN) in $B$.
      • Find the distance $r_{j,T}$ from $x_j$ to its $K$th NN in $T$.
      • Estimate the PDFs:
        $\hat p_B(x_j) = \frac{K}{N_B} \frac{1}{\omega_D\, r_{j,B}^D}$,
        $\hat p_T(x_j) = \frac{K}{N_T - 1} \frac{1}{\omega_D\, r_{j,T}^D}$
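
The K-NN density estimates on this slide can be sketched in a few lines of NumPy. This is an illustrative brute-force version, not the code from the NN2ST repository. It assumes that $\omega_D$ denotes the volume of the unit ball in $\mathbb{R}^D$, that distances are Euclidean, and that the query point itself is skipped when searching in $T$ (which is what the $N_T - 1$ suggests). The function names are hypothetical.

```python
import math
import numpy as np

def unit_ball_volume(D):
    """Volume of the unit ball in R^D (assumed meaning of omega_D)."""
    return math.pi ** (D / 2) / math.gamma(D / 2 + 1)

def kth_nn_distance(queries, sample, K, exclude_self=False):
    """Euclidean distance from each query point to its K-th nearest neighbor in `sample`."""
    diff = queries[:, None, :] - sample[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # shape (n_queries, n_sample)
    dist.sort(axis=1)
    # If the queries are the sample itself, column 0 holds the zero self-distance.
    return dist[:, K] if exclude_self else dist[:, K - 1]

def knn_densities(B, T, K):
    """Estimate p_B and p_T at every trial point x_j in T."""
    N_B, D = B.shape
    N_T = T.shape[0]
    omega_D = unit_ball_volume(D)
    r_B = kth_nn_distance(T, B, K)                     # r_{j,B}
    r_T = kth_nn_distance(T, T, K, exclude_self=True)  # r_{j,T}
    p_B = K / (N_B * omega_D * r_B ** D)
    p_T = K / ((N_T - 1) * omega_D * r_T ** D)
    return p_B, p_T

# Example on synthetic data (sample sizes chosen only for illustration).
rng = np.random.default_rng(0)
B = rng.normal(1.0, 1.0, size=(500, 2))
T = rng.normal(1.0, 1.0, size=(400, 2))
p_B, p_T = knn_densities(B, T, K=5)
```

The full pairwise distance matrix costs $O(N_T N_B)$ memory, so a tree-based neighbor search would be the natural replacement for large samples.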

  14. 2. Test Statistic
      • Measure the “distance” between two PDFs.
      • Define the test statistic (to detect under-/over-densities):
        $\mathrm{TS}(T) \equiv \frac{1}{N_T} \sum_{j=1}^{N_T} \log \frac{\hat p_T(x_j)}{\hat p_B(x_j)}$
      • From the NN-estimated PDFs:
        $\mathrm{TS}(T) = \frac{D}{N_T} \sum_{j=1}^{N_T} \log \frac{r_{j,B}}{r_{j,T}} + \log \frac{N_B}{N_T - 1}$
      • Related to the Kullback-Leibler divergence: $\mathrm{TS}(T) = \hat D_{\mathrm{KL}}(\hat p_T \,\|\, \hat p_B)$, where
        $D_{\mathrm{KL}}(p \,\|\, q) \equiv \int_{\mathbb{R}^D} p(x) \log \frac{p(x)}{q(x)}\, dx$
      • Theorem: this estimator converges to $D_{\mathrm{KL}}(p_T \,\|\, p_B)$ in the large-sample limit [Wang et al. 2005, 2006]
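
As a quick check of the step from the density-ratio definition to the distance-ratio form, the sketch below evaluates both on toy numbers and confirms that $K$ and $\omega_D$ cancel. The distances, $K$, $N_B$ and the value used for $\omega_D$ are made-up illustrative inputs, and the function names are hypothetical.

```python
import math
import numpy as np

def nn_test_statistic(r_B, r_T, D, N_B):
    """TS = (D / N_T) * sum_j log(r_jB / r_jT) + log(N_B / (N_T - 1))."""
    r_B = np.asarray(r_B, dtype=float)
    r_T = np.asarray(r_T, dtype=float)
    N_T = r_T.size
    return D * np.mean(np.log(r_B / r_T)) + math.log(N_B / (N_T - 1))

def ts_from_densities(p_B, p_T):
    """TS = (1 / N_T) * sum_j log(p_T(x_j) / p_B(x_j))."""
    p_B = np.asarray(p_B, dtype=float)
    p_T = np.asarray(p_T, dtype=float)
    return np.mean(np.log(p_T / p_B))

# Toy inputs: three trial points in D = 2 with arbitrary K-th NN distances.
D, K, N_B = 2, 3, 4
omega_D = math.pi                  # unit-disk area; any positive constant cancels
r_B = np.array([0.5, 0.8, 1.1])    # r_{j,B}
r_T = np.array([0.4, 1.0, 0.9])    # r_{j,T}
N_T = r_T.size

p_B = K / (N_B * omega_D * r_B ** D)
p_T = K / ((N_T - 1) * omega_D * r_T ** D)

ts_direct = ts_from_densities(p_B, p_T)
ts_closed = nn_test_statistic(r_B, r_T, D, N_B)
```

Because only the ratio $r_{j,B} / r_{j,T}$ enters, the statistic never needs the densities themselves.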

  15. 3. Test Statistic Distribution
      How is TS distributed? Permutation test!
      • Assume $p_B = p_T$ and form the union set $U = T \cup B$.
      • Randomly reshuffle $U$ and split it into $(\tilde B, \tilde T)$.
      • Compute the test statistic $\mathrm{TS}_n$ on $(\tilde B, \tilde T)$.
      • Repeat many times.
      Distribution of TS under $H_0$: $f(\mathrm{TS} \mid H_0) \leftarrow \{\mathrm{TS}_n\}$ [asymptotically normal with zero mean]
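
The permutation procedure above might look as follows in NumPy. This is a sketch under stated assumptions rather than the reference implementation: the reshuffled union is split back into sets of the original sizes $N_B$ and $N_T$, neighbors are found by brute force with Euclidean distances, and the data contain no duplicate points (a zero distance would make the logarithm diverge). All function names are hypothetical.

```python
import numpy as np

def kth_nn_distance(queries, sample, K, skip_self=False):
    """Distance from each query to its K-th nearest neighbor in `sample`."""
    d = np.sqrt(((queries[:, None, :] - sample[None, :, :]) ** 2).sum(axis=-1))
    d.sort(axis=1)
    return d[:, K] if skip_self else d[:, K - 1]

def nn_ts(B, T, K):
    """Distance-ratio form of the NN test statistic."""
    N_B, D = B.shape
    N_T = T.shape[0]
    r_B = kth_nn_distance(T, B, K)
    r_T = kth_nn_distance(T, T, K, skip_self=True)
    return D * np.mean(np.log(r_B / r_T)) + np.log(N_B / (N_T - 1))

def permutation_null(B, T, K, n_perm, rng):
    """Sample the TS distribution under H0 by reshuffling the union U."""
    U = np.vstack([B, T])
    N_B = B.shape[0]
    ts = np.empty(n_perm)
    for n in range(n_perm):
        idx = rng.permutation(U.shape[0])
        B_tilde, T_tilde = U[idx[:N_B]], U[idx[N_B:]]
        ts[n] = nn_ts(B_tilde, T_tilde, K)
    return ts

# Example: both samples drawn from the same 2-D Gaussian.
rng = np.random.default_rng(0)
B = rng.normal(size=(200, 2))
T = rng.normal(size=(200, 2))
null_ts = permutation_null(B, T, K=5, n_perm=100, rng=rng)
ts_obs = nn_ts(B, T, K=5)
```

Each permutation repeats the full neighbor search, which is why the permutation loop dominates the running time.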

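  16. 4. p-value
      • Find $\hat\mu$, $\hat\sigma$: mean and standard deviation of $f(\mathrm{TS} \mid H_0)$.
      • Standardize the TS: $\mathrm{TS} \to \mathrm{TS}' \equiv \frac{\mathrm{TS} - \hat\mu}{\hat\sigma}$.
      • $\mathrm{TS}'$ is distributed according to $f'(\mathrm{TS}' \mid H_0) = \hat\sigma\, f(\hat\mu + \hat\sigma\, \mathrm{TS}' \mid H_0)$.
      • Two-sided p-value: $p = 2 \int_{|\mathrm{TS}'_{\rm obs}|}^{\infty} f'(\mathrm{TS}' \mid H_0)\, d\mathrm{TS}'$
      [Figure: TS distribution with $-|\mathrm{TS}_{\rm obs}|$ and $|\mathrm{TS}_{\rm obs}|$ marked and the p-value indicated]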
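
The standardization and the two-sided p-value can be illustrated with a small helper. Two variants are sketched: an empirical one that uses the permutation values directly as $f'$, and a Gaussian one that relies on the asymptotic normality of TS under $H_0$. The function name, the `gaussian` switch, and the toy null values are assumptions made for this example.

```python
import math
import numpy as np

def two_sided_p_value(ts_obs, null_ts, gaussian=False):
    """Return (standardized observed TS, two-sided p-value)."""
    null_ts = np.asarray(null_ts, dtype=float)
    mu_hat = null_ts.mean()
    sigma_hat = null_ts.std(ddof=1)        # sample standard deviation
    z_obs = (ts_obs - mu_hat) / sigma_hat  # TS'_obs
    if gaussian:
        # For a standard normal f', 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
        return z_obs, math.erfc(abs(z_obs) / math.sqrt(2))
    z_null = (null_ts - mu_hat) / sigma_hat
    # Twice the empirical upper-tail mass beyond |TS'_obs|, capped at 1.
    p = 2.0 * np.mean(z_null >= abs(z_obs))
    return z_obs, min(1.0, float(p))

# Toy null sample standing in for the permutation values {TS_n}.
null_ts = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
z_emp, p_emp = two_sided_p_value(1.5, null_ts)
z_gau, p_gau = two_sided_p_value(1.5, null_ts, gaussian=True)
```

The empirical variant cannot resolve p-values below roughly $2 / N_{\rm perm}$, so very small p-values require a tail model such as the Gaussian one.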

  17. NN2ST: Summary
      INPUT:
        Trial sample: $T \equiv \{x_1, \dots, x_{N_T}\} \overset{\text{iid}}{\sim} p_T$
        Benchmark sample: $B \equiv \{x'_1, \dots, x'_{N_B}\} \overset{\text{iid}}{\sim} p_B$
        with $x_i, x'_i \in \mathbb{R}^D$ and $p_B$, $p_T$ unknown
        $K$: number of nearest neighbors
        $N_{\rm perm}$: number of permutations
      OUTPUT:
        p-value of the null hypothesis $H_0: p_B = p_T$
        [check compatibility between 2 samples] [detect changes in underlying distributions]

  18. NN2ST: Summary
      [Diagram: Benchmark sample and Trial sample → K-NN density ratio estimation → Test Statistic $\mathrm{TS}_{\rm obs}$; permutation test → TS distribution, with $-|\mathrm{TS}_{\rm obs}|$ and $|\mathrm{TS}_{\rm obs}|$ marked → p-value]
      Python code: github.com/de-simone/NN2ST [DS, Jacques – arXiv:1807.06038]

  19. NN2ST: Summary
      ✓ general, model-independent
      ✓ solid math foundations
      ✓ fast, no optimization
      ✓ sensitive to unspecified signals
      ✗ need to run for each sample pair
      ✗ permutation test is the bottleneck

  20. NN2ST on Gaussian Samples
      Random samples from $D$-dimensional Gaussians with $D = 2$:
        $p_B = \mathcal{N}(\mu_B, \Sigma_B)$, $p_T = \mathcal{N}(\mu_T, \Sigma_T)$,
        $\mu_B = \begin{pmatrix} 1.0 \\ 1.0 \end{pmatrix}$, $\mu_T = \begin{pmatrix} 1.2 \\ 1.2 \end{pmatrix}$, $\Sigma_B = \Sigma_T = I_2$.
      [Figure: TS vs. $N_B$ for $K = 3$ and $K = 20$, showing convergence to the exact KL divergence]
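
For reference, the exact KL divergence that the TS should approach in this example follows from the standard closed form for two Gaussians with equal covariance. The worked value below is computed from the parameters on the slide and is not read off the plot.

```latex
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_T, \Sigma) \,\|\, \mathcal{N}(\mu_B, \Sigma)\big)
  = \tfrac{1}{2} (\mu_T - \mu_B)^{\top} \Sigma^{-1} (\mu_T - \mu_B),
\qquad
\Sigma = I_2:\quad
D_{\mathrm{KL}} = \tfrac{1}{2}\left(0.2^2 + 0.2^2\right) = 0.04 .
```

With equal covariances the expression is symmetric in $\mu_T$ and $\mu_B$, so the direction of the divergence does not matter here.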

  21. NN2ST on Gaussian Samples
      Dataset    $\mu$       $\Sigma$
      $B$        $1_D$       $I_D$
      $T_{G0}$   $1_D$       $I_D$
      $T_{G1}$   $1.12_D$    $I_D$
      $T_{G2}$   $1_D$       $\begin{pmatrix} 0.95 & 0.1 & 0 \\ 0.1 & 0.8 & 0 \\ 0 & 0 & I_{D-2} \end{pmatrix}$
      $T_{G3}$   $1.15_D$    $I_D$
      Parameters: $N_B = N_T = 20\,000$, $K = 5$, $N_{\rm perm} = 1\,000$.
      [Figure: two panels of p-values with the $Z = 5$ level marked and legend $T_{G0}$, $T_{G1}$, $T_{G2}$, $T_{G3}$: p-value vs. $N_B$ (more data, more power) and p-value vs. dimension $D$ (higher $D$, more power)]

  22. Outlook: time series data [Caveat emptor: very preliminary!]
      Real-time detection of changes in data streams: variation in the underlying mechanism generating the data.
      $T$, $B$ samples: windows of time series data, ending at discrete times $t$, $t'$:
        $T_t = \{x_{t-N+1}, \dots, x_t\}$, $B_{t'} = \{x_{t'-N+1}, \dots, x_{t'}\}$, with $N_B = N_T \equiv N$.
      The Trial window slides forward with time. The Benchmark window is anchored or rolling.
      • anchored $B$ window: $t' = N \to B_{t'} = \{x_1, \dots, x_N\}$. Captures cumulative changes over time.
      • adjacent windows: $t' = t - N \to B_{t'} = \{x_{t-2N+1}, \dots, x_{t-N}\}$. Captures the “rate of change” at the current time.
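
The window definitions translate directly into array slices. The sketch below keeps the 1-based time index $t$ of the slide and maps it onto 0-based Python indexing. The function names are hypothetical, and the bounds checks are an added assumption about how invalid $t$ should be handled.

```python
import numpy as np

def trial_window(x, t, N):
    """T_t = {x_{t-N+1}, ..., x_t} for a 1-based time index t."""
    if not N <= t <= len(x):
        raise ValueError("need N <= t <= len(x)")
    return x[t - N:t]

def anchored_benchmark(x, N):
    """Anchored window, t' = N: B = {x_1, ..., x_N}."""
    return x[:N]

def adjacent_benchmark(x, t, N):
    """Adjacent window, t' = t - N: B = {x_{t-2N+1}, ..., x_{t-N}}."""
    if not 2 * N <= t <= len(x):
        raise ValueError("need 2N <= t <= len(x)")
    return x[t - 2 * N:t - N]

# Series whose values equal their 1-based time index, so slices are easy to read.
x = np.arange(1, 11)
N, t = 3, 8
T_t = trial_window(x, t, N)            # x_6, x_7, x_8
B_anch = anchored_benchmark(x, N)      # x_1, x_2, x_3
B_adj = adjacent_benchmark(x, t, N)    # x_3, x_4, x_5
```

For multivariate series the same slices apply along the first axis of an array of shape (time, D).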

  23. Outlook: time series data

  24. Outlook: time series data (adjacent vs. anchored windows)

  25. Outlook: time series data
      • Feature space can be high-dimensional: prices (OHLC), prices of related markets, indicators, volumes, ...
      • Reduce false alarms with a persistence factor $\gamma$ ($\sim 1\%$): $H_0$ rejected $\gamma \cdot N$ times in a row → detected change in market conditions
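
One possible reading of the persistence rule is sketched below: a change is flagged only once $H_0$ has been rejected at $\lceil \gamma N \rceil$ consecutive time steps. Rounding up, the significance level `alpha`, and the function name are assumptions made for this example. The slide does not specify them.

```python
import math

def first_detection(p_values, alpha, gamma, N):
    """Index of the step that completes ceil(gamma * N) consecutive rejections, or None."""
    needed = max(1, math.ceil(gamma * N))
    run = 0
    for i, p in enumerate(p_values):
        # A rejection extends the current run; any non-rejection resets it.
        run = run + 1 if p < alpha else 0
        if run >= needed:
            return i
    return None

# Made-up p-value stream: an isolated rejection at index 1 is ignored,
# while the run at indices 3, 4, 5 triggers a detection.
p_stream = [0.5, 0.001, 0.2, 0.001, 0.002, 0.003, 0.4]
hit = first_detection(p_stream, alpha=0.01, gamma=0.01, N=250)
```

With `gamma=0.01` and `N=250` the rule requires three rejections in a row, so a single spurious rejection does not raise an alarm.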

  26. Take-Home Messages
      (1) Proposed a new statistical test: NN2ST
      (2) Model-independent and suitable for high-$D$ data
      (3) Excellent results on static datasets
      (4) Promising applications for change detection in time series data
