Compression and Similarity Indexing for Time Series: Master's Thesis

CHAIR PROF. BÖHM Compression and Similarity Indexing for Time Series Master’s Thesis Marco Neumann | 19th of August 2016 KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association www.kit.edu

Outline
1 Google n-gram data
2 Clean-up
3 Similarity
4 Baseline
5 CASINO TIMES
6 Final Words

Google n-gram data

Public Data Set

Information Provided by the Data Set
Similarities = hints for common cause
Warning: similarity ≠ causality

Current problems
"similarity" is not precisely defined
data is "big" for interactive analysis
choosing possible candidates is subject to frame confirmation bias
slow manual analysis

Goals
exact description of "similarity"
allow interactive nearest neighbor queries
design & evaluation of a baseline
design & evaluation of an own approach

Clean-up

Steps
1 string filtering: OCR errors, numbers, word classes
2 string normalization: NFKC Unicode normalization, lowercase
3 word normalization: stemming, lemmatisation
4 pruning: rare words, only last 256 years
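The string filtering and string normalization steps on this slide can be sketched as follows. This is a minimal illustration, not the thesis implementation: the helper names (`normalize_token`, `keep_token`, `clean`) are hypothetical, and the digit-based filter is only an assumed stand-in for the "numbers" filter. Stemming, lemmatisation, and pruning are left out.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Apply NFKC Unicode normalization, then lowercase the token."""
    return unicodedata.normalize("NFKC", token).lower()


def keep_token(token: str) -> bool:
    """Assumed filter rule: drop every token that contains a digit."""
    return not any(ch.isdigit() for ch in token)


def clean(tokens):
    """Filter first, then normalize the tokens that survive."""
    return [normalize_token(t) for t in tokens if keep_token(t)]


# "\uff34" is a fullwidth "T", "\ufb01" is the "fi" ligature.
print(clean(["\uff34ime", "Series", "1984", "\ufb01nal"]))
```

NFKC folds compatibility characters such as fullwidth letters and ligatures into their plain forms, which is why two visually similar spellings end up as the same key.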

Results
1-grams: ≈ 800 000
2-grams: ≈ 6 400 000

Similarity

Input Data

Normalization

(Smooth) Gradients

DTW
Similar structure, but sometimes slightly off ⇒ use Dynamic Time Warping (DTW), limited by a Sakoe-Chiba band of radius r
VLDB, 2002, Exact Indexing of Dynamic Time Warping; copying is by permission of the Very Large Data Base Endowment.
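A band-limited DTW can be sketched with the classic dynamic program. The code below is an illustration under assumptions: the squared point-wise cost and the final square root are common choices but are not stated on the slide, and the function name `dtw_band` and parameter `r` (the band radius) are hypothetical.

```python
import math


def dtw_band(a, b, r):
    """DTW distance between sequences a and b, where cell (i, j) is only
    allowed if |i - j| <= r (Sakoe-Chiba band of radius r)."""
    n, m = len(a), len(b)
    inf = math.inf
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - r)
        hi = min(m, i + r)
        for j in range(lo, hi + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])


a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
print(dtw_band(a, b, 0))  # radius 0 degenerates to the Euclidean distance
print(dtw_band(a, b, 1))  # radius 1 absorbs the one-step shift
```

With radius 0 the two shifted peaks are compared point by point; with radius 1 the warping path can realign them, which is the "similar structure, but slightly off" case.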

Final Order
1 log(x + 1)
2 Gauss-smoothing using σ
3 gradient calculation
4 DTW with warping radius of r
Split into a pre-calculation stage and an on-demand stage.
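The first three steps of this order (log transform, Gauss-smoothing, gradient calculation) can be sketched in plain Python. Several details are assumptions rather than facts from the slides: the kernel is truncated at three times the smoothing width, the borders are handled by renormalizing the kernel, and the gradient is a forward difference. The function names and the `sigma` parameter name are hypothetical.

```python
import math


def log_transform(xs):
    """Step 1: log(x + 1), which keeps zero counts at zero."""
    return [math.log(x + 1.0) for x in xs]


def gauss_smooth(xs, sigma):
    """Step 2: smooth with a truncated Gaussian kernel (sigma > 0).
    Weights are renormalized at the borders (an assumed choice)."""
    half = max(1, int(3 * sigma))
    out = []
    for i in range(len(xs)):
        num = 0.0
        den = 0.0
        for k in range(-half, half + 1):
            j = i + k
            if 0 <= j < len(xs):
                w = math.exp(-(k * k) / (2.0 * sigma * sigma))
                num += w * xs[j]
                den += w
        out.append(num / den)
    return out


def gradient(xs):
    """Step 3: forward differences; the result is one element shorter."""
    return [xs[i + 1] - xs[i] for i in range(len(xs) - 1)]


def preprocess(xs, sigma):
    """Chain steps 1 to 3; the result would then be compared with DTW."""
    return gradient(gauss_smooth(log_transform(xs), sigma))


print(preprocess([0, 1, 3, 7, 15, 31, 63, 127], sigma=1.0))
```

Because the output depends only on the series itself, it can be computed once per series, while the pairwise DTW comparison depends on the query.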

Sanity Check

Examples of Philosophic Institute

Baseline

R-tree-based index
VLDB, 2002, Exact Indexing of Dynamic Time Warping; copying is by permission of the Very Large Data Base Endowment.
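The cited paper builds its index on a cheap lower bound of the band-limited DTW distance, usually called LB_Keogh. The sketch below shows only that lower bound, not the R-tree itself, and it is not taken from the slides: the function names are hypothetical, and the squared cost with a final square root is assumed so that the bound matches a DTW defined the same way.

```python
import math


def envelope(q, r):
    """Upper and lower envelope of q over a sliding window of radius r."""
    upper, lower = [], []
    for i in range(len(q)):
        window = q[max(0, i - r): i + r + 1]
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower


def lb_keogh(q, c, r):
    """Lower bound for the DTW distance between q and c (equal length),
    with the warping path limited to a band of radius r."""
    upper, lower = envelope(q, r)
    total = 0.0
    for ci, u, l in zip(c, upper, lower):
        if ci > u:
            total += (ci - u) ** 2   # candidate sticks out above the envelope
        elif ci < l:
            total += (ci - l) ** 2   # candidate sticks out below the envelope
    return math.sqrt(total)


q = [0, 0, 1, 2, 1, 0, 0]
print(lb_keogh(q, [0, 1, 2, 1, 0, 0, 0], 1))
print(lb_keogh(q, [5, 5, 5, 5, 5, 5, 5], 1))
```

A candidate whose lower bound already exceeds the best distance found so far can be discarded without running the full DTW computation.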

Index Inefficiency

Performance

CASINO TIMES

Goals
primary: use normal hardware; slow pre-processing, fast search; enable subrange queries without re-indexing
secondary: compress data; speed up nearest neighbor queries using an index

Wavelet decomposition
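The slide does not name the wavelet, so the following sketch assumes the simplest case, an unnormalized (averaging) Haar decomposition, purely to illustrate how a series turns into a tree of coefficients. The function names are hypothetical.

```python
def haar_decompose(xs):
    """Return [overall average] followed by detail coefficients, coarsest
    level first. Assumes len(xs) is a power of two."""
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    approx = [float(x) for x in xs]
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        avg = [(approx[i] + approx[i + 1]) / 2.0 for i in pairs]
        det = [(approx[i] - approx[i + 1]) / 2.0 for i in pairs]
        details = det + details   # coarser levels end up in front
        approx = avg
    return approx + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose level by level."""
    approx = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        det = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        nxt = []
        for a, d in zip(approx, det):
            nxt.extend([a + d, a - d])
        approx = nxt
    return approx


coeffs = haar_decompose([4, 2, 6, 8])
print(coeffs)
print(haar_reconstruct(coeffs))
```

Each detail coefficient covers a dyadic time range, so the coefficients form a binary tree whose nodes or subtrees can be compared across different time series.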

Information Merging
search similar subtrees (of different time series)
merge node if: same children; difference of coefficients is small (= compression error is below threshold)
node-by-node greedy method
process one whole tree at a time

Example

Example (zoomed)

Weakness

Failed Improvements
merge entire subtrees (same index structure)
merge entire subtrees (FLANN)
random boosting
DTW for leaves
drop time constraint for leaves
DB seeding
information / subtree pruning

Final Words

Conclusion
foundation for future research²:
definition of similarity
fast baseline algorithm
knowledge about tree-like methods ⇒ not promising
² starting collaboration with Prof. Dr. Sanders

Possible Ideas
compression using:
IEEE-half floating point
non-IEEE data types (e.g. A-law and μ-law)
general purpose compression of chunks (e.g. snappy, lz4, gzip, xz, brotli)
static/dynamic downsampling
locality-preserving hashing
time series encoding using functions (e.g. cubic splines)
+ patching
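Three of these ideas are easy to try with the Python standard library: half-precision storage, mu-law companding, and general purpose compression of chunks. The sketch below is only an illustration of what they mean, not an evaluated design: `zlib` (DEFLATE, the algorithm behind gzip) stands in for the listed compressors, mu = 255 is an assumed parameter, and all function names are hypothetical.

```python
import math
import struct
import zlib


def to_half_bytes(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def from_half_bytes(data):
    """Unpack bytes produced by to_half_bytes."""
    return list(struct.unpack(f"<{len(data) // 2}e", data))


def mu_law_encode(x, mu=255.0):
    """Compand x in [-1, 1]: finer resolution near zero, coarser near +-1."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_decode(y, mu=255.0):
    """Inverse of mu_law_encode."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def compress_chunk(values):
    """Half-precision packing followed by general purpose compression."""
    return zlib.compress(to_half_bytes(values))


def decompress_chunk(blob):
    """Inverse of compress_chunk (up to half-precision rounding)."""
    return from_half_bytes(zlib.decompress(blob))


chunk = [0.5, 0.25, -1.0, 0.0] * 64
blob = compress_chunk(chunk)
print(len(chunk) * 8, "bytes as doubles ->", len(blob), "bytes compressed")
```

Half precision and mu-law are lossy, so they trade accuracy for size, whereas the general purpose compressor on top is lossless with respect to the packed bytes.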

Thanks
Dr.-Ing. Martin Schäler
Prof. Dr.-Ing. Klemens Böhm
IPD IT team
Philosophic Friends
Miguel Angel Meza Martínez

References I
Title picture: cb 2013 „Casino Royale“ by Rebecca Siegel https://www.flickr.com/photos/grongar/8704148177/
[1] Rakesh Agrawal et al. „Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases“. In: Proceedings of the 21th International Conference on Very Large Data Bases. VLDB ’95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1995, pp. 490–501. ISBN: 1-55860-379-4.
[2] N. Ahmed, T. Natarajan, and K. R. Rao. „Discrete Cosine Transform“. In: IEEE Transactions on Computers C-23.1 (Jan. 1974), pp. 90–93. ISSN: 0018-9340. DOI: 10.1109/T-C.1974.223784.
[3] R. J. Alcock et al. „Time-series similarity queries employing a feature-based approach“. In: 7th Hellenic Conference on Informatics, Ioannina. 1999, pp. 27–29.
[4] Lutz Bornmann and Rüdiger Mutz. „Growth rates of modern science: A bibliometric analysis“. In: CoRR abs/1402.4578 (2014). URL: http://arxiv.org/abs/1402.4578.
[5] Kaushik Chakrabarti et al. „Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases“. In: ACM Trans. Database Syst. 27.2 (June 2002), pp. 188–228. ISSN: 0362-5915. DOI: 10.1145/568518.568520.
