Compression and Similarity Indexing for Time Series: Master's Thesis


  1. CHAIR PROF. BÖHM Compression and Similarity Indexing for Time Series Master’s Thesis Marco Neumann | 19th of August 2016 KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association www.kit.edu

  2. Outline: 1 Google n-gram data, 2 Clean-up, 3 Similarity, 4 Baseline, 5 CASINO TIMES, 6 Final Words. (Slide footer: Marco Neumann – CASINO TIMES, 19th of August 2016)

  3. Google n-gram data

  4. Public Data Set

  5. Information Provided by the Data Set: similarities = hints for a common cause. Warning: similarity ≠ causality.

  6. Current problems: "similarity" is not precisely defined; data is "big" (for interactive analysis); choosing possible candidates is subject to frame confirmation bias; slow manual analysis.

  7. Goals: exact description of "similarity"; enabling interactive nearest neighbor queries; design & evaluation of a baseline; design & evaluation of an own approach.

  8. Clean-up

  9. Steps: (1) string filtering: OCR errors, numbers, word classes; (2) string normalization: NFKC Unicode normalization, lowercase; (3) word normalization: stemming, lemmatisation; (4) pruning: rare words, only last 256 years.
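
The string normalization and pruning steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the thesis pipeline: the function names, the `isalpha` filter rule, and the `min_count` parameter are assumptions, and stemming, lemmatisation, OCR-error handling, and the year cut-off are left out.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """String normalization: NFKC Unicode normalization, then lowercase."""
    return unicodedata.normalize("NFKC", token).lower()


def keep_token(token: str) -> bool:
    """String filtering (assumed rule): keep purely alphabetic tokens only.

    This drops numbers and tokens that carry a word-class suffix
    such as "fish_NOUN".
    """
    return token.isalpha()


def clean(counts: dict, min_count: int) -> dict:
    """Normalize and filter tokens, merge counts of tokens that collapse
    to the same string, then prune rare words."""
    merged = {}
    for token, count in counts.items():
        norm = normalize_token(token)
        if keep_token(norm):
            merged[norm] = merged.get(norm, 0) + count
    return {t: c for t, c in merged.items() if c >= min_count}
```

For example, `clean({"Fish": 3, "1984": 10}, min_count=2)` keeps only `fish`. Merging after normalization is what shrinks the vocabulary: spelling variants that differ only in case or Unicode form end up as one time series.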

  10. Results: 1-grams: ≈ 800 000; 2-grams: ≈ 6 400 000.

  11. Similarity

  12. Input Data

  13. Normalization

  14. (Smooth) Gradients

  15. DTW: similar structure, but sometimes slightly off ⇒ use Dynamic Time Warping (DTW), limited by a Sakoe-Chiba band of radius r. (VLDB, 2002, Exact Indexing of Dynamic Time Warping; copying is by permission of the Very Large Data Base Endowment.)
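
A minimal sketch of band-constrained DTW may help here. It assumes squared point differences and a final square root, which is one common convention and not necessarily the one used in the thesis; `dtw_distance` and its parameter `r` (the band radius) are illustrative names.

```python
import math


def dtw_distance(a, b, r):
    """DTW distance between sequences a and b.

    Only cells with |i - j| <= r are visited (Sakoe-Chiba band).
    """
    n, m = len(a), len(b)
    if abs(n - m) > r:
        return math.inf  # the band cannot reach the final cell
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor paths
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])
```

With `r = 0` this reduces to the Euclidean distance. A larger radius lets a peak in one series match a slightly shifted peak in the other, at the price of more cells to fill.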

  16. Final Order: pre-calculation: (1) log(x + 1), (2) Gauss-smoothing using σ, (3) gradient calculation; on demand: (4) DTW with warping radius of r.
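
The first three steps of this order (log transform, Gauss-smoothing, gradient) can be sketched as below. The kernel truncation at three standard deviations, the edge handling by repeating border values, and the use of plain first differences as the gradient are assumptions made for the sketch; `sigma` stands for the smoothing width.

```python
import math


def gaussian_kernel(sigma, truncate=3.0):
    """Normalized Gaussian weights, cut off at truncate * sigma."""
    radius = max(1, int(truncate * sigma + 0.5))
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma):
    """Gaussian smoothing; the series is extended by repeating its edge values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(i + k - radius, 0), n - 1)  # clamp to valid range
            acc += w * values[idx]
        out.append(acc)
    return out


def gradient(values):
    """First differences between consecutive points."""
    return [b - a for a, b in zip(values, values[1:])]


def preprocess(counts, sigma):
    """log(x + 1), then smoothing, then gradient."""
    logged = [math.log(x + 1) for x in counts]
    return gradient(smooth(logged, sigma))
```

The result of `preprocess` is what a band-constrained DTW would then compare at query time; it is one element shorter than the input because of the differencing.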

  17. Sanity Check

  18. Examples of Philosophic Institute

  19. Baseline

  20. R-tree-based index (VLDB, 2002, Exact Indexing of Dynamic Time Warping; copying is by permission of the Very Large Data Base Endowment.)
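
The cited paper builds its index on a lower bound for band-constrained DTW that is computed from an upper and lower envelope around the query, commonly called LB_Keogh. The sketch below shows only that bound for two equal-length sequences. The R-tree itself and the dimensionality reduction used in the paper are omitted, and the function names are illustrative, not taken from the thesis.

```python
import math


def envelope(q, r):
    """Upper and lower envelope of q for a band of radius r."""
    n = len(q)
    upper = [max(q[max(0, i - r): min(n, i + r + 1)]) for i in range(n)]
    lower = [min(q[max(0, i - r): min(n, i + r + 1)]) for i in range(n)]
    return upper, lower


def lb_keogh(q, c, r):
    """Lower bound on the band-constrained DTW distance between q and c.

    Assumes len(q) == len(c). Only the parts of c that leave the
    envelope of q contribute to the bound.
    """
    upper, lower = envelope(q, r)
    total = 0.0
    for x, u, l in zip(c, upper, lower):
        if x > u:
            total += (x - u) ** 2
        elif x < l:
            total += (x - l) ** 2
    return math.sqrt(total)
```

Because the bound never exceeds the true distance, a candidate whose bound is already worse than the best match found so far can be skipped without running DTW on it.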

  21. Index Inefficiency

  22. Performance

  23. CASINO TIMES

  24. Goals: primary: use normal hardware; slow pre-processing, fast search; enable subrange queries w/o re-indexing. Secondary: compress data; speed up nn queries using an index.

  25. Wavelet decomposition
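
The slide text does not say which wavelet is used. As an illustration only, the sketch below assumes the Haar wavelet with unnormalized averages and differences, and a series whose length is a power of two.

```python
def haar_decompose(values):
    """Return (overall average, detail coefficients from coarsest to finest)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(values)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # half the difference of each pair is the detail at this level
        details.insert(0, [(a - b) / 2 for a, b in pairs])
        # the pair averages form the next, coarser approximation
        approx = [(a + b) / 2 for a, b in pairs]
    return approx[0], details


def haar_reconstruct(average, details):
    """Invert haar_decompose."""
    approx = [average]
    for level in details:
        approx = [v for a, d in zip(approx, level) for v in (a + d, a - d)]
    return approx
```

Each detail coefficient has two children on the next finer level, so the coefficients of one time series form a binary tree. Under this assumption, any contiguous block of coefficients on one level corresponds to a time subrange of the series.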

  26. Information Merging: search similar subtrees (of different time series); greedy method, node-by-node; process one whole tree at a time; merge node if: same children and difference of coefficients is small (= compression error is below threshold).
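
The merge rule can be illustrated with a small node store. Everything here is a hypothetical sketch of the stated rule (same children, small coefficient difference, greedy, one tree at a time); the class `NodeStore`, the tuple-based tree format, and the first-match policy are assumptions, not the thesis data structure.

```python
class NodeStore:
    """Shared pool of tree nodes; similar nodes are stored only once."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.nodes = []        # node id -> (coefficient, children ids)
        self.by_children = {}  # children ids -> ids of nodes with those children

    def add(self, coefficient, children=()):
        """Return the id of a close-enough existing node, or of a new one."""
        candidates = self.by_children.setdefault(children, [])
        for node_id in candidates:  # greedy: take the first match
            if abs(self.nodes[node_id][0] - coefficient) <= self.threshold:
                return node_id
        self.nodes.append((coefficient, children))
        candidates.append(len(self.nodes) - 1)
        return len(self.nodes) - 1


def insert_tree(store, tree):
    """Insert one whole tree bottom-up; tree = (coefficient, [child trees])."""
    coefficient, child_trees = tree
    children = tuple(insert_tree(store, child) for child in child_trees)
    return store.add(coefficient, children)
```

Inserting `(2.0, [(1.0, []), (-1.0, [])])` and then `(2.05, [(1.02, []), (-0.97, [])])` into `NodeStore(0.1)` yields the same root id both times, so the second tree costs no extra storage. Since a merged node replaces a coefficient by a nearby one, the threshold directly bounds the per-node compression error.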

  27. Example

  28. Example (zoomed)

  29. Weakness

  30. Failed Improvements: merge entire subtrees (same index structure); merge entire subtrees (FLANN); DB seeding; information / subtree pruning; drop time constraint for leaves; DTW for leaves; random boosting.

  31. Final Words

  32. Conclusion: foundation for future research (starting collaboration with Prof. Dr. Sanders): definition of similarity; fast baseline algorithm; knowledge about tree-like methods ⇒ not promising.

  33. Possible Ideas: locality-preserving hashing; compression using: IEEE-half floating point; non-IEEE data types (e.g. A-law and μ-law); general purpose compression of chunks (e.g. snappy, lz4, gzip, xz, brotli); static/dynamic downsampling; time series encoding using functions (e.g. cubic splines) + patching.
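
Three of these ideas can be tried with the Python standard library alone. The sketch below is only a starting point for experiments: it uses mu-law companding (the usual companion of A-law) with the telephony value 255 as an assumed parameter, IEEE half-precision packing via `struct`, and `zlib` as a stand-in for the general purpose compressors named on the slide.

```python
import math
import struct
import zlib

MU = 255.0  # common telephony value; an assumption for this sketch


def mu_law_encode(x):
    """Compand x in [-1, 1] with the mu-law curve; the result is in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_decode(y):
    """Invert mu_law_encode."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def pack_half(values):
    """Store values as IEEE 754 half-precision floats (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def unpack_half(blob):
    """Read back the floats written by pack_half."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


def compress_chunk(blob):
    """General purpose compression of one chunk (DEFLATE via zlib)."""
    return zlib.compress(blob, 9)
```

Companding spends more of the available precision on small values, which suits data with many low counts. Half precision and chunk compression can be combined, for instance `compress_chunk(pack_half(values))`; whether the resulting error is acceptable for similarity search would have to be measured.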

  34. Thanks: Dr.-Ing. Martin Schäler; Prof. Dr.-Ing. Klemens Böhm; IPD IT team; Philosophic Friends; Miguel Angel Meza Martínez.

  35. References I. Title picture: "Casino Royale" by Rebecca Siegel, CC BY 2013, https://www.flickr.com/photos/grongar/8704148177/
