
SLIDE 1

27-Apr-17 1

Riding the Big IoT Data Wave

Complex Analytics for IoT Data Series

Themis Palpanas

Paris Descartes University

Telecom Paristech Paris, April 2017

References


Themis Palpanas - Telecom Paristech, Apr 2017

papers

▫ ADS: The Adaptive Data Series Index. VLDBJ 2016

 http://www.mi.parisdescartes.fr/~themisp/publications/vldbj16-ads.pdf

▫ Big Sequence Management: A Glimpse on the Past, the Present, and the Future. LNCS, 2016

 http://www.mi.parisdescartes.fr/~themisp/publications/sofsem16-bisem.pdf

▫ Query Workloads for Data-Series Indexes. KDD 2015

 http://www.mi.parisdescartes.fr/~themisp/publications/kdd15-bends.pdf

▫ RINSE: Interactive Data Series Exploration. VLDB 2015

 http://www.mi.parisdescartes.fr/~themisp/publications/vldb15-rinse.pdf

▫ Indexing for Interactive Exploration of Big Data Series. SIGMOD 2014

 http://www.mi.parisdescartes.fr/~themisp/publications/sigmod14-ads.pdf

▫ Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+. KAIS 2014

 http://www.mi.parisdescartes.fr/~themisp/publications/kais14-isax2plus.pdf

▫ iSAX 2.0: Indexing and Mining One Billion Time Series. ICDM 2010

 http://www.mi.parisdescartes.fr/~themisp/publications/icdm10-billiontimeseries.pdf

code and datasets

▫ http://www.mi.parisdescartes.fr/~themisp/isax2plus/

data series toolbox

▫ https://github.com/zoumpatianos/DSStat

demo

▫ http://daslab.seas.harvard.edu/rinse/


Acknowledgements

Michele Linardi
Anna Gogolou
Botao Peng
Karima Echihabi

Paris Descartes University

Alessandro Camerra

University of Trento

Stratos Idreos
Kostas Zoumpatianos

Harvard University

Yin Lou
Johannes Gehrke

Cornell University

Jin Shieh
Eamonn Keogh

University of California at Riverside


Executive Summary

data collected at unprecedented rates
they enable data-driven scientific discovery
lots of these data are sequences

▫ it takes days to weeks to analyze big sequence collections

our work: analyze big sequences in minutes/seconds


Data series

Sequence of points ordered along some dimension

Figure: values v1, v2, … plotted against positions x1, x2, …, xn of the sequence dimension, which can be Time or Position

GTCAATGGCCAGGATATTAGAACAGTACTCTGTGAACCCTATTTATGGTGGCACCCCTTAGACTAA GATAACACAGGGAGCAAGAGGTTGACAGGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAG AGAAGTGCTAAGTCTCCTTTCTAAGGCACATGATGGATTCAAGGGAAAGCCACATTTGACTAAAGC CCAAGGGATTGTTGCTTCTAATCCGATTTCTTGGCAGAAGATATTACAAACTAAGAGTCAGATTAA TATGTGGGTGCCAAAATAAATAAACAAATAATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAA CTCCTCCACAGCTTGCTACCGAGGCAGAACCGGTTGAAACTGAAATGCATCCGCCGCCAGAGGATC TGTAAAAGAGAGGTTGTTACGAAACTGGCAACTGCCAACCAAAGTCCACCAATGGACAAGCAAAA AAGAGCACTCATCTCATGCTCCCAAGGATCAACCTTCCCAGAGTTTTCACTTAAGTGGCCACCAAG CCAGTTGTCAATCCAGGGCTTTGGACTGAAATCTAGGGCTTCATCCGCTACCTCAGAGTGTCTTCT ATTTCTTCCAGCCAGTGACAAATACAACAAACATCTGAGATGTTTTAGCTATAAATCCTTTACAATT GTTATTTATGTCTTAACTTTTGTTATACCTGGAAAAGTAGGGGAAACAATAAGAACATACTGTCTT GGCCAAGCATCCAAGGTTAAATGAGTTATGGAAATTCATTTGGGAGCCAAGACATTGCACGTGGT TATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCATCAGTTGTTCTTGGCCAAAAGAGCAGAAT CAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTGCAGGGACAAGTCTGCAAGATGAGCATT GAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCAGGCACTTACAAGAGCCCAGGTTGTTGC CATGTTTGTTTTTGCAACTTGTCTATTTAAAGAGATTTGGGCAATGGCCAGGATATTAGAACAGTA CTCTGTGAACCCTATTTATGGTAGCACCCCTTAGACTAAGATAACACAGGGAGCAAGAGGTTGACA GGAAAGCCAGGGGAGCAGGGAAGCCTCCTGTAAAGAGAGAAGTGCTAAGTCTCCTTTCTAAGGCA CATGATGGATCAAGGGAAAGTCACATTTGACTAAAGCCCAAGGGATTGTTGCTTCTAATCCGATTC TTGGCAGAAGATATTGCAAACTAAGAGTCAGATTAATATGTGGGTGCCAAAATAAATAAACAAATA ATTGAATAATCCCTGGAGGTTTAAGTGAGGAGAAACTCCTCCACACTTGCTACCGAGGCAGAACCG GTTGAAACTGAAATGCACCCGCTGCCAGATTTATTAGTCACCCAAGCATGTATTTTGCATGTCCAT CAGTTGTTCTTGGCCAAAAGAACAGAATCAATGAGCCGCTGCAGATGCAGACATAGCAGCCCCTTG CAGGAACAAGTCTGCAAGATGAGCATTGAAGAGGATGCACAAGCCCGGTAGCCCGGGAAATGGCA GGCACTTACAAGAGCCCAGGTTGTTGCCATGTTTGTTTTTGCAACTTGTCTTTTAAACAGATTTGA


Schinnerer et al.


Telecommunications


analysis of call activity patterns

▫ Telecom Italia

Figures: clustermap of incoming calls time series; average number of calls for 5 smallest clusters; call activity for Easter Monday (x-axis: Time)

Home Networks


temporal usage behavior analysis of home networks

▫ Portugal Telecom

Figures: clustering based on user activity patterns; (previously unknown) frequent behavior pattern (x-axis: Time)


Operation Health Monitoring


Road Tunnel Monitoring and Control



Lamp levels typically statically determined, ignoring environmental variations
Overprovisioned to meet the regulations
Problems: wasted energy and potential security hazard
Idea: place wireless sensors along tunnel, adjust lamps to actual conditions
Eliminate overprovisioning, account for environmental variations

Figure annotation: stop distance

Tunnel length of 630 m, 2 separate 2-lane carriageways, ~28,000 vehicles/day, 90 WSN nodes
Full, operational system

Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco. Practical Data Prediction for Real-World Wireless Sensor Networks. TKDE 27(8), 2015

Usman Raza, Alessandro Camerra, Amy L. Murphy, Themis Palpanas, Gian Pietro Picco. What Does Model-Driven Data Acquisition Really Achieve in Wireless Sensor Networks? PerCom 2012. BEST PAPER


Data-Driven Data Acquisition (DDDA)

Typical WSN System

Sink gathers all sensor readings of the WSN. Advantage: precise

DDDA/WSN System

Sink predicts sensor readings of the WSN. Advantage: less traffic


What Does Data-Driven Data Acquisition Really Achieve?

▫ DDDA well studied in database community
▫ Depending on scenario, 99% less traffic generated
▫ Does 99% less traffic mean 99% more lifetime?
▫ Current studies look only at the application layer
▫ Full network stack influences lifetime

Figure: network stack (Hardware, MAC, Routing, Application), shown without and with DDDA

Derivative-Based Prediction (DBP)

Figure: sensor value over time, with labels for the learning phase, the prediction phase, the edge points, and δ

▫ DBP: a linear model
▫ Easy to calculate the model
▫ Easy to decide if the sensed data fit the model

Publications: TKDE’15, PERCOM’12, SPRINGER’12

Tolerance for Prediction Error

Centralized control algorithm requires only approximate values
Tolerances: sensed values may temporarily exceed these values without requiring new model generation
Value tolerance: maximum of a percentage and an absolute value
Addresses inherent sensor error and variations of low light levels
Time tolerance: in terms of sampling periods
Lamps are adjusted gradually

Figure: sensor value over time, annotated with the value tolerance and the time tolerance
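The linear model and the two tolerances described above can be sketched as follows. This is a minimal Python illustration, not the deployed implementation: the function names, the choice to anchor the model at the mean of the trailing edge points, and the reading of the percentage as relative to the predicted value are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def fit_dbp(window: Sequence[float], edge: int) -> Tuple[float, float]:
    """Fit a line to a learning window using only its edge points.

    Returns (base, slope): base is the mean of the last `edge` samples,
    slope is the per-sample change between the two edge-point means.
    """
    if edge < 1 or len(window) < 2 * edge:
        raise ValueError("window must hold at least 2 * edge samples")
    first = sum(window[:edge]) / edge
    last = sum(window[-edge:]) / edge
    slope = (last - first) / (len(window) - edge)
    return last, slope


def first_violation(readings: Sequence[float], base: float, slope: float,
                    rel_tol: float, abs_tol: float, time_tol: int) -> Optional[int]:
    """Return the 1-based step at which a new model is needed, or None.

    A reading is out of tolerance when it deviates from the prediction by
    more than max(rel_tol * |prediction|, abs_tol). A new model is needed
    only after more than `time_tol` consecutive out-of-tolerance readings.
    """
    misses = 0
    for step, actual in enumerate(readings, start=1):
        predicted = base + slope * step
        allowed = max(rel_tol * abs(predicted), abs_tol)
        if abs(actual - predicted) > allowed:
            misses += 1
            if misses > time_tol:
                return step
        else:
            misses = 0
    return None
```

A node would call `fit_dbp` once per learning phase and then run `first_violation` on its own readings, transmitting a new model only when a step is returned.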

Testing the DBP Model

▫ Test scenario
▫ 250 m long tunnel, 40 TMote equivalent nodes, 1 gateway
▫ 47 days, samples every 30 seconds, 5.4 million measurements
▫ DBP: parameters established by lighting engineers

Figure: tunnel deployment with 40 WSN nodes, sink/gateway, and lamps

DBP Results: Excellent reduction in data traffic

▫ 99.1% of messages suppressed at the nodes

Figure: Model Corrections in 5-minute intervals

Fewer than 10 model transmissions every 5 min. [instead of 400 samples]
Most models generated during daylight hours
Few corrections required at night

Comparison to other models and with alternate parameters in the paper
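The sample counts quoted for the test scenario can be cross-checked with a little arithmetic. The sketch below only recombines the figures stated on the slides (40 nodes, one sample every 30 seconds, 47 days, 5-minute intervals); the constant and function names are made up for the example.

```python
NODES = 40                 # TMote equivalent nodes in the test tunnel
SAMPLE_PERIOD_S = 30       # one sample every 30 seconds
DAYS = 47                  # duration of the test


def samples_per_interval(interval_s: int) -> int:
    """Samples produced by all nodes in one interval of the given length."""
    return NODES * (interval_s // SAMPLE_PERIOD_S)


def total_samples() -> int:
    """Samples produced by all nodes over the whole test."""
    return samples_per_interval(DAYS * 24 * 60 * 60)


print(samples_per_interval(5 * 60))  # samples per 5-minute interval
print(total_samples())               # samples over 47 days
```

The first value matches the "400 samples" per 5 minutes, and the second is roughly the 5.4 million measurements reported.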

DBP Results: Excellent data approximation

Infrequent models yield data very close to the actual readings
Deviations from tolerance correspond to model changes
Value tolerance magnitude calculated based on actual light values

Massive Data Series Collections

Human Genome project: 130 TB
NASA’s Solar Observatory: 1.5 TB per day
Planned Large Synoptic Survey Telescope: ~30 TB per night
data center and services monitoring: 2B data series, 4M points/sec

What do we want to do with them?

simple query answering

▫ select some data series
▫ select values in time interval
▫ select values in some range
▫ combinations of those


What do we want to do with them?

complex analytics

Similarity Search

Classification
Clustering
Outlier Detection
Frequent Pattern Mining

Figure: similarity search, where a query is compared against a sequence collection and the similar sequences are returned

Distance measures for similarity search: Euclidean and Dynamic Time Warping (DTW)

$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$D_{dtw}(X, Y) = f(n, m), \qquad f(i, j) = |x_i - y_j| + \min \{ f(i, j-1),\ f(i-1, j),\ f(i-1, j-1) \}$$
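The two distance measures named on this slide, Euclidean and Dynamic Time Warping, can be sketched in a few lines of Python. This is an illustrative version under stated assumptions: the DTW local cost is taken to be the absolute difference, no warping window is applied, and the function names are made up for the example.

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two series of equal length."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw(x: Sequence[float], y: Sequence[float]) -> float:
    """Unconstrained DTW distance via the dynamic-programming recurrence."""
    n, m = len(x), len(y)
    # f[i][j] holds the cost of the best alignment of x[:i] with y[:j].
    f = [[math.inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            f[i][j] = cost + min(f[i][j - 1], f[i - 1][j], f[i - 1][j - 1])
    return f[n][m]
```

Unlike the Euclidean distance, `dtw` accepts series of different lengths, for example `dtw([1, 2, 3], [1, 2, 2, 3])`, because one point may be aligned with several points of the other series.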


HARD, because of very high dimensionality: each data series has 100s-1000s of points!
even HARDER, because of very large size: millions to billions of data series (multi-TBs)!


Query answering process

Figure: Raw data pass through the Data Loading Procedure into the Data Series Database/Indexing Data (data-to-query time); Queries pass through the Query Answering Procedure to produce Answers (query answering time)

these times are big!

Similarity Search via Serial Scan
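A serial scan answers a nearest neighbor query by computing the distance between the query and every series in the collection. The sketch below assumes an in-memory list of equal-length series and Euclidean distance; the function name is made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def nearest_neighbor_scan(query: Sequence[float],
                          collection: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, distance) of the series closest to the query."""
    best_index, best_dist = -1, math.inf
    for index, series in enumerate(collection):
        dist = math.sqrt(sum((q - s) ** 2 for q, s in zip(query, series)))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist
```

Every query touches the whole collection, which is why the cost grows with the dataset size regardless of how selective the query is.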

Similarity Search via Indexing


Traditional Approaches

answer nearest neighbor queries on a 1TB dataset:

▫ serial scan takes 45 minutes/query
▫ a data series index can reduce querying time
▫ but building the index takes too long!
▫ indexing a 1TB dataset takes days

complex analytics in hours/days…

Query answering process

data-to-query time, query answering time

we have proposed the state-of-the-art solutions for both problems!

Our Approach: ADS+


complex analytics in minutes/seconds!

Outline

background

▫ SAX representation
▫ iSAX representation
▫ iSAX index

proposed solution

▫ bulk loading
▫ splitting policy
▫ adaptive solution

experimental evaluation, case studies
conclusions, future work, new challenges


SAX Representation

Symbolic Aggregate approXimation (SAX)

▫ (1) Represent data series T of length n with w segments using Piecewise Aggregate Approximation (PAA)

   ▫ T typically normalized to μ = 0, σ = 1
   ▫ PAA(T,w) = $\bar{T} = \bar{t}_1, \ldots, \bar{t}_w$ where $\bar{t}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1)+1}^{\frac{n}{w} i} T_j$

▫ (2) Discretize into a vector of symbols

   ▫ Breakpoints map to small alphabet a of symbols

Figure: a data series T of 16 points, its approximation PAA(T,4), and iSAX(T,4,4) with the value axis divided into regions labeled 00, 01, 10, 11
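The two SAX steps, PAA followed by discretization, can be sketched as below. This is an illustrative version under assumptions: breakpoints are taken as equiprobable quantiles of the standard normal distribution, symbols are numbered from 0 for the lowest region, n is assumed divisible by w, and the function names are made up for the example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev
from typing import List, Sequence


def znorm(series: Sequence[float]) -> List[float]:
    """Normalize a series to mean 0 and standard deviation 1."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(v - mu) / sigma for v in series]


def paa(series: Sequence[float], w: int) -> List[float]:
    """Piecewise Aggregate Approximation: the mean of each of w segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes n is divisible by w")
    seg = n // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]


def breakpoints(cardinality: int) -> List[float]:
    """Cut points splitting N(0, 1) into `cardinality` equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / cardinality) for k in range(1, cardinality)]


def sax(series: Sequence[float], w: int, cardinality: int) -> List[int]:
    """SAX word of a series: one symbol (region number) per PAA segment."""
    cuts = breakpoints(cardinality)
    return [bisect_right(cuts, v) for v in paa(znorm(series), w)]
```

For instance, `sax([-2, -2, -1, -1, 1, 1, 2, 2], w=4, cardinality=4)` maps the four segment means to four different regions, from the lowest to the highest.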


iSAX Representation

iSAX offers a bit-aware, quantized, multi-resolution representation with variable granularity

iSAX(T,4,8) = { 6, 6, 3, 0 } = { 110, 110, 011, 000 }
iSAX(T,4,4) = { 3, 3, 1, 0 } = { 11, 11, 01, 00 }
iSAX(T,4,2) = { 1, 1, 0, 0 } = { 1, 1, 0, 0 }
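The multi-resolution property can be illustrated with bit shifts: assuming cardinalities are powers of two, a symbol at a coarser cardinality is the bit prefix of the same symbol at a finer one. The helper names below are made up for this sketch.

```python
from typing import List, Sequence


def demote(symbols: Sequence[int], bits_from: int, bits_to: int) -> List[int]:
    """Reduce each symbol from bits_from bits to bits_to bits (keep the prefix)."""
    if bits_to > bits_from:
        raise ValueError("can only move to a coarser resolution")
    shift = bits_from - bits_to
    return [s >> shift for s in symbols]


def as_bits(symbols: Sequence[int], bits: int) -> List[str]:
    """Render each symbol as a zero-padded binary string."""
    return [format(s, f"0{bits}b") for s in symbols]


word8 = [6, 6, 3, 0]                      # cardinality 8, i.e. 3 bits per symbol
print(as_bits(word8, 3))                  # 3-bit view
print(as_bits(demote(word8, 3, 2), 2))    # cardinality 4
print(as_bits(demote(word8, 3, 1), 1))    # cardinality 2
```

Because coarser words are derived without touching the raw series, words of different cardinalities can be compared inside the same index.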

iSAX Index

non-balanced tree-based index with non-overlapping regions, and controlled fan-out rate

▫ base cardinality b (optional), segments w, threshold th
▫ hierarchically subdivides SAX space until num. entries ≤ th

Figure (e.g., th=4, w=4, b=1): a leaf {1 1 1 0} holding 4 series receives Insert: {1 1 1 0}; it is split into children {1 1 10 0} and {1 1 11 0}
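The insert-and-split behavior of such an index can be sketched as follows. This is a toy version under several assumptions: series are already summarized as SAX words stored at 8 bits per segment, and a full leaf is split by adding one bit to the segment that currently uses the fewest bits (the first one on ties), which is a simple stand-in and not necessarily the policy iSAX uses. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MAX_BITS = 8  # assumed resolution at which full SAX words are kept

Word = Tuple[int, ...]               # one symbol per segment, MAX_BITS bits each
Key = Tuple[Tuple[int, int], ...]    # per segment: (symbol prefix, bits used)


def key_for(word: Word, bits: Tuple[int, ...]) -> Key:
    """Truncate each symbol of a full-resolution word to the given bit counts."""
    return tuple((s >> (MAX_BITS - b), b) for s, b in zip(word, bits))


@dataclass
class Node:
    bits: Tuple[int, ...]                          # bits per segment for this node
    entries: List[Word] = field(default_factory=list)
    children: Dict[Key, "Node"] = field(default_factory=dict)
    child_bits: Optional[Tuple[int, ...]] = None   # set once the node has split


class ISaxIndexSketch:
    def __init__(self, w: int, th: int, b: int = 1) -> None:
        self.th = th
        self.base_bits = (b,) * w
        self.first_level: Dict[Key, Node] = {}

    def insert(self, word: Word) -> None:
        key = key_for(word, self.base_bits)
        node = self.first_level.setdefault(key, Node(self.base_bits))
        while node.child_bits is not None:         # descend through internal nodes
            child_key = key_for(word, node.child_bits)
            node = node.children.setdefault(child_key, Node(node.child_bits))
        node.entries.append(word)
        if len(node.entries) > self.th:
            self._split(node)

    def _split(self, node: Node) -> None:
        refinable = [i for i, b in enumerate(node.bits) if b < MAX_BITS]
        if not refinable:
            return                                 # cannot refine; keep an oversized leaf
        seg = min(refinable, key=lambda i: node.bits[i])
        node.child_bits = node.bits[:seg] + (node.bits[seg] + 1,) + node.bits[seg + 1:]
        for word in node.entries:                  # redistribute entries to children
            child_key = key_for(word, node.child_bits)
            node.children.setdefault(child_key, Node(node.child_bits)).entries.append(word)
        node.entries = []
        for child in node.children.values():       # a child may still be too full
            if len(child.entries) > self.th:
                self._split(child)
```

With th=4, w=4, b=1, inserting a fifth word into a leaf that already holds four turns the leaf into an internal node with two children, one per value of the newly added bit.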

Approximate Search

▫ Match iSAX representation at each level

Exact Search

▫ Leverage approximate search
▫ Prune search space using a lower bounding distance
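The pruning step relies on a distance between the query and an index region that never exceeds the true distance to any series in that region. The slides do not spell the function out; the sketch below assumes the commonly used lower bound between a PAA query and an iSAX word, with breakpoints taken as standard normal quantiles. The function names are made up for the example.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def region(symbol: int, bits: int) -> Tuple[float, float]:
    """(lower, upper) breakpoints of a symbol at cardinality 2**bits."""
    card = 1 << bits
    nd = NormalDist()
    lower = -math.inf if symbol == 0 else nd.inv_cdf(symbol / card)
    upper = math.inf if symbol == card - 1 else nd.inv_cdf((symbol + 1) / card)
    return lower, upper


def mindist_paa_isax(query_paa: Sequence[float],
                     word: Sequence[Tuple[int, int]], n: int) -> float:
    """Lower bound on the Euclidean distance between a query of length n
    (given by its PAA) and any series summarized by `word`.

    `word` holds one (symbol, bits) pair per segment.
    """
    w = len(query_paa)
    total = 0.0
    for value, (symbol, bits) in zip(query_paa, word):
        lower, upper = region(symbol, bits)
        if value < lower:
            total += (lower - value) ** 2
        elif value > upper:
            total += (value - upper) ** 2
        # a value inside the region contributes nothing
    return math.sqrt(n / w) * math.sqrt(total)
```

During exact search, a node whose lower bound is not smaller than the best distance found so far (for example, the one returned by approximate search) can be skipped without reading its series.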


iSAX Index: Shortcomings & Challenges

this is a wonderful index!
… but why does it take soooo long to build for huge datasets?
because iSAX implements

▫ a naive node splitting policy

 may lead to ineffective splits and additional disk I/O

▫ no bulk loading strategy

 does not use available main memory to reduce disk I/O


iSAX 2.0 Bulk Loading Algorithm

design principles:

▫ take advantage of available main memory
▫ maximize sequential disk accesses

intuition for proposed solution:

▫ for each leaf node, collect as many data series that belong to it as possible before materializing the leaf node

 the raw values of data series in leaf nodes are written to disk


iterate between two phases (till all data series are indexed):

▫ Phase 1

   ▫ read data series and group them according to first-level nodes
   ▫ use all available main memory

▫ Phase 2

   ▫ grow index by processing the subtree rooted at each one of the first-level nodes, one at a time
   ▫ flush leaf node contents to disk using sequential accesses
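The alternation between the two phases can be sketched as a simple loop. This toy version makes strong simplifications: the memory budget is counted in number of series, "disk" is a dictionary, and Phase 2 only appends each group to its first-level node instead of growing and splitting a subtree. All names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Sequence, Tuple

Series = Sequence[float]


def bulk_load(stream: Iterable[Series],
              first_level_key: Callable[[Series], Hashable],
              memory_budget: int) -> Tuple[Dict[Hashable, List[Series]], int]:
    """Return (simulated disk contents per first-level node, number of flush rounds)."""
    disk: Dict[Hashable, List[Series]] = defaultdict(list)
    buffers: Dict[Hashable, List[Series]] = defaultdict(list)
    buffered = 0
    rounds = 0

    def flush() -> None:
        nonlocal buffered, rounds
        # Phase 2: handle one first-level node at a time, writing its
        # buffered series together (stands in for sequential leaf writes).
        for key in sorted(buffers):
            disk[key].extend(buffers[key])
        buffers.clear()
        buffered = 0
        rounds += 1

    # Phase 1: read series and group them by first-level node until the
    # memory budget is used up.
    for series in stream:
        buffers[first_level_key(series)].append(series)
        buffered += 1
        if buffered >= memory_budget:
            flush()
    if buffered:
        flush()
    return dict(disk), rounds
```

The larger the memory budget, the more series of the same node are written together, which is the intuition behind collecting as many series as possible before materializing a leaf.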

Publications

ICDM‘10

Figure (bulk loading walkthrough): index with root R, internal node I1, and leaves L1, L2, L3, L4 on disk; First Buffer Layer (FBL) and Leaf Buffer Layer (LBL) in main memory

▫ insert new ds: incoming data series are collected in the FBL
▫ no limit in the size of FBLs!
▫ LBLs have same size as leaf nodes!
▫ no extra memory needed!
▫ mainly sequential writes!

Experimental Evaluation Bulk Loading

Figure: Time To Build (hours) vs. N° Data Series Indexed (100M to 1B), comparing iSAX-BufferTree, iSAX, and iSAX 2.0

1 Billion data series indexed in 16 days: 72% less time
indexing time per data series: 0.001 sec


iSAX2+

intuition for proposed solution:

▫ iSAX grows fast at the beginning of bulk loading, its shape stabilizing well before the end of the process
▫ several data series end up in leaf nodes that never need to split
▫ implement lazy splitting:

   ▫ move raw data to leaf node the first time
   ▫ if leaf node splits, do not move raw data until the end of index building process

Publications

KAIS‘14

Experimental Evaluation Bulk Loading

1 Billion data series indexed in 10 days: 82% less time than iSAX
indexing time per data series: 0.8 milliseconds

Figure: Time to Build (hours) vs. Time Series Indexed (100M to 1B), comparing iSAX 2.0, iSAX2+, and iSAX 2.0 Clustered
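The per-series indexing times follow directly from the total build times. The sketch below redoes that arithmetic for the two reported results (16 days for iSAX 2.0, 10 days for iSAX2+, 1 billion series each) and derives the build time implied for the baseline; it assumes both percentages are relative to the original iSAX, which the slides state explicitly only for iSAX2+. The function names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
N_SERIES = 1_000_000_000


def per_series_ms(days: float) -> float:
    """Average indexing time per data series, in milliseconds."""
    return days * SECONDS_PER_DAY / N_SERIES * 1000


def implied_baseline_days(days: float, reduction: float) -> float:
    """Build time of the baseline if `days` is `reduction` less than it."""
    return days / (1 - reduction)


print(per_series_ms(16))                 # iSAX 2.0, about 1.4 ms (reported as 0.001 sec)
print(per_series_ms(10))                 # iSAX2+, about 0.86 ms (reported as 0.8 ms)
print(implied_baseline_days(16, 0.72))   # roughly 57 days
print(implied_baseline_days(10, 0.82))   # roughly 56 days
```

Both reductions point to a baseline build time of about eight weeks, so the two reported percentages are consistent with each other.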


Drawback of iSAX2+

cannot start answering queries until entire index is built!


Adaptive Data Series Index: ADS+

novel paradigm for building a data series index

▫ do not build entire index and then answer queries
▫ start answering queries by building the part of the index needed by those queries

still guarantee correct answers


intuition for proposed solution

▫ build the iSAX index using the iSAX representations

▫ just like iSAX2+

▫ but start with a large leaf size

▫ minimize initial cost

▫ postpone leaf materialization to query time

▫ only materialize (at query time) leaves needed by queries

▫ parts that are queried more are refined more

▫ use smaller leaf sizes (reduced leaf materialization and query answering costs)
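The idea of postponing leaf materialization to query time can be illustrated with a toy index. This sketch is not ADS+ itself: it keeps only summaries at build time, splits a leaf by extending a word prefix by one symbol (a stand-in for iSAX refinement), answers approximate queries only, and uses an in-memory list in place of the raw file. All names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

Series = Sequence[float]
Word = Tuple[int, ...]


class LazyLeafIndex:
    """Toy adaptive index: summaries at build time, raw data only on demand."""

    def __init__(self, raw: List[Series], to_word: Callable[[Series], Word],
                 query_leaf_size: int) -> None:
        self.raw = raw                                   # stands in for the raw file
        self.to_word = to_word
        self.query_leaf_size = query_leaf_size
        self.words = [to_word(s) for s in raw]           # summaries kept in memory
        self.leaves: Dict[Word, List[int]] = {(): list(range(len(raw)))}
        self.internal: Set[Word] = set()
        self.loaded: Dict[Word, List[Series]] = {}       # materialized leaves

    def _split(self, prefix: Word) -> None:
        depth = len(prefix) + 1
        for pos in self.leaves.pop(prefix):
            self.leaves.setdefault(self.words[pos][:depth], []).append(pos)
        self.internal.add(prefix)

    def query(self, series: Series) -> Optional[Series]:
        """Approximate nearest neighbor from the leaf the query falls into."""
        word = self.to_word(series)
        prefix: Word = ()
        while True:
            positions = self.leaves.get(prefix)
            if positions is None:
                if prefix not in self.internal:
                    return None                          # nothing indexed in this region
                prefix = word[:len(prefix) + 1]          # descend one level
            elif len(positions) > self.query_leaf_size and len(prefix) < len(word):
                self._split(prefix)                      # adaptive split at query time
            else:
                break
        if not positions:
            return None
        if prefix not in self.loaded:                    # materialize only this leaf
            self.loaded[prefix] = [self.raw[p] for p in positions]
        return min(self.loaded[prefix],
                   key=lambda s: sum((a - b) ** 2 for a, b in zip(s, series)))
```

After a query, only the leaf that the query reached has been refined and loaded; regions that are never queried keep their large, unmaterialized leaves.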


Publications: SIGMOD ‘14, VLDBJ ‘16

[Figure sequence: ADS+ index construction and query answering. Each frame shows the raw data, the index tree (ROOT, internal nodes I1, I2, I3, leaves L1 to L5), the First Buffer Layer (FBL) and the Leaf Buffer Layer (LBL), split between DISK and RAM. Frame captions, in order:]

1. Start building an index with only the iSAX representations

2. Read the data series one by one from the raw file

3. Convert them to iSAX

4. Store only iSAX in memory (64 times smaller, ~1%)

5. Discard raw data and keep pointer to raw file

6. Continue loading data until we run out of memory

7. Expand each sub-tree and move data to LBL

8. We flush the data to the disk to free up memory

9. (no caption) The leaves are labeled PARTIAL

10. Query #1 arrives

11. Query #1 reaches a leaf that is TOO BIG!

12. Adaptive split: create a smaller leaf

13. Load data in LBL and answer the query

14. The queried leaf is now labeled FULL; we spill to the disk when we run out of memory
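
The loading steps in the figure sequence can be sketched in Python as follows. This is plain SAX with a fixed alphabet of 4 symbols rather than full iSAX (which allows a different cardinality per symbol), and the in-memory `io.BytesIO` buffer stands in for the raw file. The series length (256 float32 values), the number of segments (16) and one byte per symbol are assumptions chosen for illustration; under them a summary is 64 times smaller than the raw series, which matches the ratio quoted in the captions.

```python
import bisect
import io
import math
import statistics
import struct

# Breakpoints that cut a standard normal distribution into 4 equiprobable regions
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
SERIES_LEN = 256                             # assumed number of points per series
SEGMENTS = 16                                # assumed number of SAX segments
RECORD = struct.Struct(f"<{SERIES_LEN}f")    # one raw series stored as float32


def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    assert len(series) % segments == 0
    width = len(series) // segments
    return [sum(series[k * width:(k + 1) * width]) / width for k in range(segments)]


def sax_word(series, segments=SEGMENTS):
    """Z-normalize, reduce with PAA, then map each segment mean to a symbol."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0     # avoid dividing by zero
    z = [(x - mu) / sigma for x in series]
    return [bisect.bisect_right(BREAKPOINTS, v) for v in paa(z, segments)]


def build_summaries(raw):
    """Read series one by one; keep only (SAX word, offset into the raw file)."""
    summaries = []
    while True:
        offset = raw.tell()
        chunk = raw.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        summaries.append((tuple(sax_word(RECORD.unpack(chunk))), offset))
        # the raw values are dropped here; the offset is the pointer back to them
    return summaries


# A tiny stand-in for the raw file: three synthetic series
raw = io.BytesIO()
for k in range(3):
    raw.write(RECORD.pack(*[math.sin(0.1 * t + k) for t in range(SERIES_LEN)]))
raw.seek(0)

summaries = build_summaries(raw)
ratio = RECORD.size / SEGMENTS    # raw bytes per series / bytes per one-byte-symbol word
print(len(summaries), ratio)
```

Keeping the file offset next to each word is what allows the raw values to be fetched later, when a query actually needs them.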

SLIDE 60

Experimental Evaluation

iSAX 2.0 needs more than 35 hours to answer 100K queries
ADS+ answers 100K queries in less than 5 hours

7x faster

Comparison to multi-dimensional indices

1-3 orders of magnitude faster than multi-dimensional indexing methods

measure data-to-query time (just index 1 billion data-series)

[Figure: indexing time in minutes, log scale from 1 to 10000; bar annotations: 6.6x faster, 40x faster, 130x faster, 1000x faster]

SLIDE 61

Demo

http://www.mi.parisdescartes.fr/~themisp/rinse/

Publications

VLDB‘15

Related Work

significant amount of work in data series indexing

▫ e.g., TS-Tree [Assent et al. ‘08], iSAX [Shieh & Keogh ‘08]

none of these approaches

▫ considered bulk loading

▫ examined more than 1 Million data series

several studies for index bulk loading

▫ merge-based approaches assume data is pre-clustered [Choubey et al. ‘99]

▫ buffering-based approaches work only for balanced indices [Arge et al. ‘02] [Van den Bercken & Seeger ‘01] [Soisalon-Soininen & Widmayer ‘03]

Adaptive indexing/file reorganization for column stores

▫ Database cracking [Idreos et al. ‘07], raw file cracking [Idreos et al. ‘11]

SLIDE 62

Distribution/Parallelization/Cloud?

discussion so far assumed a single core

▫ focus on efficient resource utilization

▫ squeeze the most out of a single core

▫ produce scalable solutions at lowest possible cost

  ▪ also suitable for analysts without access to, or expertise with, clusters

SLIDE 63

Distribution/Parallelization/Cloud?

further scale-up and scale-out possible!

▫ techniques inherently parallelizable

  ▪ across cores, across machines

▫ more involved solutions required when optimizing for energy

  ▪ minimize total work

[Figure: parallelized data series index over the data series collection, showing compute nodes/threads 1, 2, …, n, the First Buffer Layers, the Leaf Buffer Layers, and the subset of the collection that contains the answer]
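
One simple way to picture "across cores, across machines" is to partition the collection, let each worker find its local best match, and merge the results. The Python sketch below is an assumption-laden illustration, not the scheme from the slides: it scans raw series instead of using an index, the function names are hypothetical, and threads are used only to show the structure (CPU-bound Python threads do not run in parallel, so processes or separate machines would take this role in practice).

```python
import math
from concurrent.futures import ThreadPoolExecutor


def local_best(partition, query):
    """Return (distance, series_id) of the closest series in one partition."""
    return min((math.dist(series, query), sid) for sid, series in partition)


def parallel_nearest(collection, query, workers=4):
    """Split the collection across workers and merge their local answers."""
    items = list(enumerate(collection))
    partitions = [items[w::workers] for w in range(workers)]
    partitions = [p for p in partitions if p]        # drop empty partitions
    with ThreadPoolExecutor(max_workers=workers) as pool:
        candidates = list(pool.map(local_best, partitions, [query] * len(partitions)))
    return min(candidates)                           # global best = best of local bests


collection = [[float(i), float(i % 7)] for i in range(100)]
distance, best_id = parallel_nearest(collection, [42.2, 0.1])
print(best_id, distance)
```

Each worker does work proportional to its partition, so total work stays the same as a sequential scan; reducing total work, as the energy bullet asks, would require pruning rather than only partitioning.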

SLIDE 64

Conclusions

proposed iSAX 2.0, iSAX 2.0 Clustered, iSAX2+, ADS+

▫ indexing for very large data series collections

  ▪ code and datasets: http://www.mi.parisdescartes.fr/~themisp/isax2plus/

▫ current state of the art

experimentally validated proposed approach

▫ first published experiments with 1 Billion data series

case studies in diverse domains exhibit usefulness of approach

▫ for the first time enable pain-free analysis of existing, vast collections of data series

What Next?

new challenge: index and mine 10 billion data series

SLIDE 65

What Next?

infrastructure monitoring

▫ Facebook wants to manage 4B data series, 12M new values/sec

SLIDE 67

The Road Ahead

“enable practitioners and non-expert users to easily and efficiently manage and analyze massive data series collections”

The Road Ahead

Big Sequence Management System

▫ general purpose data series management system

SLIDE 69

Current and Past Collaborations



Infrastructure Monitoring

▫ analysis and mining of hardware and software infrastructure for health monitoring

▫ with Facebook, EDF, Safran

Human Behavior Patterns

▫ identification of different social groups, and analysis of their macro- and micro-patterns of behavior

▫ with IBM Research, Telecom Italia

Human Brain Activity

▫ analysis of fully-detailed neurobiological data for explaining brain functions

▫ with ICM

Green Manufacturing

▫ analysis and optimization of manufacturing processes for energy savings

▫ with SAP, Intel, Volvo, Infineon

eCrime

▫ identification of fraudulent activities related to the telecommunication industry

▫ with Telecom Italia, Vodafone, Wind

World Sentiments and Opinions

▫ analysis of aggregate sentiment for different social groups, role of media in public sentiment

▫ with Qatar Computing Research Institute, and Hewlett-Packard Labs