Automatic Sequential Pattern Mining in Data Streams


slide-1
SLIDE 1

Automatic Sequential Pattern Mining in Data Streams

Koki Kawabata*, Yasuko Matsubara & Yasushi Sakurai

ISIR-AIRC, Osaka University

*Supported by SIGIR Student Travel Grants

slide-2
SLIDE 2

Motivation

Given: time-evolving data streams

  • e.g., IoT sensors/Web click logs
  • contain multiple patterns

CIKM2019 Sakurai Lab. K.Kawabata et al. 2

slide-3
SLIDE 3

Motivation

Given: time-evolving data streams

  • e.g., IoT sensors/Web click logs
  • contain multiple patterns

Answer the following questions:

  • 1. What kind of patterns?
  • 2. How many patterns?
  • 3. When do patterns change?


slide-4
SLIDE 4

Motivation

Given: time-evolving data streams

  • e.g., IoT sensors/Web click logs
  • contain multiple patterns

Requirements:

  • Incremental: we cannot access all historical data
  • Automatic: the number of patterns is unknown in advance, and no parameter tuning is required

slide-5
SLIDE 5

Motivation

Given: time-evolving data streams

  • e.g., IoT sensors/Web click logs
  • contain multiple patterns

Requirements:

  • Incremental: we cannot access all historical data
  • Automatic: the number of patterns is unknown in advance, and no parameter tuning is required

StreamScope: automatic & incremental approach

slide-6
SLIDE 6

Demo movie


slide-7
SLIDE 7

Demo movie


#1: Arm curl #2: Rowing

slide-8
SLIDE 8

Demo movie


slide-9
SLIDE 9

Demo movie

#1: Arm curl #2: Rowing #3: Intervals #4: Side raise #5: Push up

slide-10
SLIDE 10

Outline

  • 1. Motivation
  • 2. Problem definition
  • 3. Model
  • 4. Streaming Algorithm
  • 5. Experiments
  • 6. Conclusions


slide-11
SLIDE 11

Problem definition

[Figure: sensor values (0 to 1) over time (0 to 5000) for L-arm, R-arm, L-leg, R-leg, with segments labeled by regimes 1 to 4]

Given:

  • Data stream X

Find:

  • 1. Segment: S
  • 2. Regime: Θ
  • 3. Segment-membership: F

slide-12
SLIDE 12

Problem definition

Data stream: set of d-dimensional vectors

[Figure: sensor values (0 to 1) over time (0 to 5000) for L-arm, R-arm, L-leg, R-leg]

X = {x_1, …, x_t}

Given: d = 4

slide-13
SLIDE 13

Problem definition

Segment: start/end positions of each pattern

[Figure: the same stream divided into eight segments s_1, …, s_8]

S = {s_1, …, s_m}

Hidden: m = 8

slide-14
SLIDE 14

Problem definition

Regime: segment groups

[Figure: the same stream with its segments grouped into regimes 1 to 4]

Θ = {θ_1, …, θ_r, Φ}

Hidden: r = 4

slide-15
SLIDE 15

Problem definition

Segment-membership: regime-assignment

[Figure: the same stream with each segment assigned to one of regimes 1 to 4]

F = {f(1), …, f(m)}

Hidden: F = {1, 1, 2, 2, 2, 4, 4, 3}

e.g., f(3) = 4

slide-16
SLIDE 16

Problem definition

Given: d-dimensional data stream X = {x_1, …, x_t}

Find: compact description C = {m, r, S, Θ, F} of X

  • m segments S
  • r regimes Θ
  • segment-membership F
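
As a rough illustration of what such a compact description holds, here is a minimal sketch. The class name `CompactDescription`, the field names, the segment boundaries, and the membership values are all hypothetical; the slides only name the five components.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CompactDescription:
    """Hypothetical container for the five components of the description."""
    segments: List[Tuple[int, int]]  # (start, end) positions, end exclusive
    regimes: List[dict]              # one parameter record per regime
    membership: List[int]            # 1-based regime id of each segment

    @property
    def m(self) -> int:
        """Number of segments."""
        return len(self.segments)

    @property
    def r(self) -> int:
        """Number of regimes."""
        return len(self.regimes)

    def regime_at(self, t: int) -> int:
        """Return the regime id of the segment covering time tick t."""
        for (start, end), regime_id in zip(self.segments, self.membership):
            if start <= t < end:
                return regime_id
        raise ValueError(f"time tick {t} is not covered by any segment")


# Made-up toy values: 8 segments over 5000 ticks, assigned to 4 regimes.
bounds = [0, 600, 1200, 1900, 2500, 3100, 3800, 4400, 5000]
desc = CompactDescription(
    segments=list(zip(bounds[:-1], bounds[1:])),
    regimes=[{} for _ in range(4)],
    membership=[1, 2, 4, 3, 1, 2, 4, 3],
)
print(desc.m, desc.r, desc.regime_at(1500))  # 8 4 4
```

Storing only boundaries and regime ids, rather than the raw stream, is what keeps the description compact.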


slide-17
SLIDE 17

Outline

  • 1. Motivation
  • 2. Problem definition
  • 3. Model
  • 4. Streaming Algorithm
  • 5. Experiments
  • 6. Conclusions


slide-18
SLIDE 18

Proposed model

Goal: find compact description C in a streaming setting


Challenges:

  • Q1. How can we represent regimes?

Idea (1): Hierarchical probabilistic model

  • Q2. How can we decide # of segments/regimes?

Idea (2): Model description cost

slide-19
SLIDE 19

Idea (1): hierarchical probabilistic model

  • Q. How to describe patterns?

Idea: HMM-based probabilistic model

  • ‘within-regime’ transitions: a hidden Markov model θ = {π, A, B}
  • ‘across-regime’ transitions: regime transition matrix Φ = {φ_ij} (i, j = 1, …, r)

[Figure: a data stream over time t modeled as a sequence of regimes (stand, run, walk); each regime θ_i is an HMM with state transitions such as a_11]
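
To make the within-regime model concrete, the sketch below scores a window of observations under two candidate regimes with the standard scaled forward algorithm. It assumes one-dimensional Gaussian emissions, which the slides do not specify, and every parameter value is made up for illustration.

```python
import math
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian (an assumed emission model)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def hmm_log_likelihood(
    xs: Sequence[float],
    pi: Sequence[float],
    A: Sequence[Sequence[float]],
    means: Sequence[float],
    variances: Sequence[float],
) -> float:
    """Scaled forward algorithm: returns ln P(xs | theta) for one HMM."""
    k = len(pi)
    alpha: List[float] = []
    log_lik = 0.0
    for t, x in enumerate(xs):
        emit = [gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        if t == 0:
            alpha = [pi[j] * emit[j] for j in range(k)]
        else:
            alpha = [sum(alpha[i] * A[i][j] for i in range(k)) * emit[j] for j in range(k)]
        scale = sum(alpha)  # probability of x_t given the earlier observations
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


# Two hypothetical regimes: a one-state "stand" and a two-state "run".
theta_stand = {"pi": [1.0], "A": [[1.0]], "means": [0.0], "variances": [1.0]}
theta_run = {
    "pi": [0.5, 0.5],
    "A": [[0.9, 0.1], [0.1, 0.9]],
    "means": [4.0, 6.0],
    "variances": [1.0, 1.0],
}
window = [4.1, 3.9, 5.8, 6.2]
ll_stand = hmm_log_likelihood(window, **theta_stand)
ll_run = hmm_log_likelihood(window, **theta_run)
print(ll_run > ll_stand)  # True: the window fits the "run" regime better
```

Comparing such likelihoods across regimes is the basic signal a regime-assignment step can use.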

slide-20
SLIDE 20

Idea (1): hierarchical probabilistic model

Full model Θ = {θ_1, …, θ_r, Φ}

Single HMM parameters: θ_i = {π_i, A_i, B_i}

[Figure: regimes stand, run, walk; a regime θ_i is a three-state HMM (state1, state2, state3) with transition probabilities a_11, a_12, a_13, a_21, a_22, a_23, a_31, a_32, a_33]

slide-21
SLIDE 21

Idea (1): hierarchical probabilistic model

Full model Θ = {θ_1, …, θ_r, Φ}

Single HMM parameters: θ_i = {π_i, A_i, B_i}

Regime transition matrix: Φ = {φ_ij} (i, j = 1, …, r)

[Figure: regimes stand, run, walk connected by regime transition probabilities φ_11, φ_12, …, φ_33; a regime θ_i is a three-state HMM (state1, state2, state3) with transition probabilities a_11, …, a_33]

slide-22
SLIDE 22

Idea (2): Incremental encoding scheme

  • Q. How to decide # of segments/regimes?

Idea: Minimum description length (MDL)

  • Minimize the total description cost of a data stream
  • Update ‘optimal’ # of segments/regimes
slide-23
SLIDE 23

Idea (2): Incremental encoding scheme

Idea: Minimize total encoding cost

min ( CostM(C) + CostC(X|C) )

  • CostM(C): model cost (good compression)
  • CostC(X|C): coding cost (good description)

[Figure: CostM, CostC, and total cost CostT plotted against the number of regimes and segments (# of r, m), from 1 to 10]
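
The trade-off can be shown with a few lines of code. This is only a sketch: the helper names `total_cost` and `pick_best` are hypothetical, and the cost values are invented to mimic the shape of the curves (model cost grows with the number of components, coding cost shrinks).

```python
from typing import Dict, Tuple


def total_cost(model_cost: float, coding_cost: float) -> float:
    """Total cost = model cost + coding cost, both in bits."""
    return model_cost + coding_cost


def pick_best(candidates: Dict[int, Tuple[float, float]]) -> int:
    """Return the component count whose total cost is smallest."""
    return min(candidates, key=lambda k: total_cost(*candidates[k]))


# Made-up (model cost, coding cost) pairs per number of components.
candidates = {
    1: (100.0, 5000.0),
    2: (200.0, 3000.0),
    3: (300.0, 1500.0),
    4: (400.0, 1450.0),
    5: (500.0, 1440.0),
}
best = pick_best(candidates)
print(best, total_cost(*candidates[best]))  # 3 1800.0
```

Past three components the coding cost barely improves, so the extra model cost is not worth paying.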

slide-24
SLIDE 24
Idea (2): Incremental encoding scheme

  • Q. How many new components does C need?

Keep compact!

[Figure: given the current window X^c, how many new components does C need: a regime? a segment? a state?]

slide-25
SLIDE 25
Idea (2): Incremental encoding scheme

  • Q. How many new components does C need?

Keep compact!

[Figure: given the current window X^c, how many new components does C need: a regime? a segment? a state?]

Details in paper

slide-26
SLIDE 26

Outline

  • 1. Motivation
  • 2. Problem definition
  • 3. Model
  • 4. Streaming Algorithm
  • 5. Experiments
  • 6. Conclusions


slide-27
SLIDE 27

Streaming algorithms

  • Algorithms

StreamScope (main): optimize/update parameter set C

  • 1. SegmentAssignment: identify regime transitions & segments
  • 2. RegimeGeneration: estimate new regimes θ

slide-28
SLIDE 28

StreamScope

  • Overview

[Figure: data stream X over time t →, the current window X^c, and the description C]

  • 1. Keep current window X^c:
  • The latest segment, s_m
  • New observations, x_t, …

X^c = s_m ∪ x_t

slide-29
SLIDE 29

StreamScope

  • Overview

[Figure: data stream X over time t →, the current window X^c, and the description C]

  • 1. Keep current window X^c:
  • The latest segment, s_m
  • New observations, x_t, …

X^c = s_m ∪ x_t

  • 2. Update model set C
  • Minimize ΔCostT(X^c|C)

Increase segments? (SegmentAssignment) vs. Increase states/regimes? (RegimeGeneration)

slide-30
SLIDE 30

StreamScope

  • Overview

[Figure: data stream X over time t →, the current window X^c, and the description C]

  • 1. Keep current window X^c:
  • The latest segment, s_m
  • New observations, x_t, …

X^c = s_m ∪ x_t

  • 2. Update model set C
  • Minimize ΔCostT(X^c|C)

Increase segments? (SegmentAssignment) vs. Increase states/regimes? (RegimeGeneration)

  • 3. Update X^c
  • If pattern has changed

slide-31
SLIDE 31
  • 1. SegmentAssignment

Given:

  • Observation x_t
  • Model parameter set Θ = {θ_1, …, θ_r, Φ}

Find:

  • Optimal cut point between regimes: m, S, F
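
One way to find such cut points is a Viterbi-style dynamic program over regimes. The sketch below is a simplification, not the StreamScope procedure: each regime is reduced to a single Gaussian (an assumption made here), the regime transition probabilities are made up, and the function name `assign_segments` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log-density of a 1-D Gaussian (an assumed emission model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def assign_segments(
    xs: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
    phi: Sequence[Sequence[float]],
) -> Tuple[List[int], List[int]]:
    """Viterbi over regimes: returns (regime per tick, cut points)."""
    r = len(means)
    log_phi = [[math.log(p) for p in row] for row in phi]  # entries must be > 0
    # Uniform prior over the starting regime.
    score = [log_gauss(xs[0], means[j], variances[j]) - math.log(r) for j in range(r)]
    back: List[List[int]] = []
    for x in xs[1:]:
        prev: List[int] = []
        new_score: List[float] = []
        for j in range(r):
            best_i = max(range(r), key=lambda i: score[i] + log_phi[i][j])
            prev.append(best_i)
            new_score.append(
                score[best_i] + log_phi[best_i][j] + log_gauss(x, means[j], variances[j])
            )
        back.append(prev)
        score = new_score
    # Backtrack from the best final regime.
    path = [max(range(r), key=lambda j: score[j])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    path.reverse()
    cuts = [t for t in range(1, len(path)) if path[t] != path[t - 1]]
    return path, cuts


xs = [0.1, -0.2, 0.0, 5.1, 4.9, 5.2]
path, cuts = assign_segments(
    xs, means=[0.0, 5.0], variances=[1.0, 1.0], phi=[[0.9, 0.1], [0.1, 0.9]]
)
print(path, cuts)  # [0, 0, 0, 1, 1, 1] [3]
```

The off-diagonal entries of the transition matrix act as a switching penalty, so a regime change is reported only when the data clearly favor it.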


slide-32
SLIDE 32
  • 1. SegmentAssignment

Overview

Dynamic programming algorithm to compute P(x_t|Θ)

[Figure: a sequence of seven observations along time t → aligned against regimes θ_1, θ_2, θ_3, with a regime transition probability φ]

slide-33
SLIDE 33
  • 1. SegmentAssignment

Overview

Dynamic programming algorithm to compute P(x_t|Θ)

Keep all candidate cut points: L = {L_1, L_2, …}

[Figure: a sequence of seven observations along time t → aligned against regimes θ_1, θ_2, θ_3; possible regime switches ("switch??") are kept as candidates, e.g., {2, 3} and {4, 2}]

slide-34
SLIDE 34
  • 1. SegmentAssignment

Overview

Dynamic programming algorithm to compute P(x_t|Θ)

Keep all candidate cut points: L = {L_1, L_2, …}

γ-guarantee: γ ∝ mean(s)

[Figure: a sequence of seven observations along time t → aligned against regimes θ_1, θ_2, θ_3, with the candidate {2, 3} and an interval of length γ]

slide-35
SLIDE 35
  • 2. RegimeGeneration

Given:

  • Current window X^c

Find:

  • New regimes: parameter set {m, r, S, Θ, F} for X^c

slide-36
SLIDE 36
  • 2. RegimeGeneration
  • 1. Two phase iterative approach
  • Phase1: split segments into 2 groups
  • Phase2: update 2 model parameters

[Figure: Phase 1 splits the segments of X^c into two groups, S1 and S2; Phase 2 updates the model parameters θ_1, θ_2, Φ]

slide-37
SLIDE 37
  • 2. RegimeGeneration
  • 1. Two phase iterative approach
  • Phase1: split segments into 2 groups
  • Phase2: update 2 model parameters
  • 2. Recursively split new regimes
  • While total cost can be reduced
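
The alternation between the two phases can be sketched as follows. This is a stand-in under strong assumptions: each of the two models is reduced to a single mean, fit is measured by squared error, and the recursive splitting and cost check are left out. The function name `two_phase_split` and the data are hypothetical.

```python
from typing import List, Sequence, Tuple


def two_phase_split(
    segments: Sequence[Sequence[float]],
    init_means: Sequence[float],
    max_iter: int = 20,
) -> Tuple[List[int], List[float]]:
    """Alternate group assignment and parameter updates for two models."""
    means = list(init_means)
    groups: List[int] = []
    for _ in range(max_iter):
        # Phase 1: put each segment into the group whose model fits it better.
        new_groups = [
            min((0, 1), key=lambda g: sum((x - means[g]) ** 2 for x in seg))
            for seg in segments
        ]
        if new_groups == groups:
            break  # assignments are stable
        groups = new_groups
        # Phase 2: re-estimate each model from the segments assigned to it.
        for g in (0, 1):
            values = [x for seg, lab in zip(segments, groups) if lab == g for x in seg]
            if values:
                means[g] = sum(values) / len(values)
    return groups, means


segments = [[0.1, 0.0, -0.1], [5.0, 5.2], [0.2, 0.1], [4.8, 5.1, 4.9]]
groups, means = two_phase_split(segments, init_means=(0.0, 1.0))
print(groups)  # [0, 1, 0, 1]
```

In a fuller version, a split would be kept only if it lowers the total description cost, and each new group could then be split again in the same way.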

[Figure: X^c is first split into regimes θ_1 and θ_2, then recursively into θ_1, θ_2, θ_3, ⋯]

slide-38
SLIDE 38

Outline

  • 1. Motivation
  • 2. Problem definition
  • 3. Model
  • 4. Streaming Algorithm
  • 5. Experiments
  • 6. Conclusions


slide-39
SLIDE 39

Experiments

  • We answer the following questions:

  • Q1. Effectiveness: How successful is it in discovering patterns?
  • Q2. Accuracy: How well does it find cut-points & regimes?
  • Q3. Scalability: How does it scale in terms of time & memory consumption?

slide-40
SLIDE 40
  • Q1. Effectiveness - #Mocap
  • MoCap sensor stream

[Figure: MoCap sensor values (0 to 1) over time (0 to 5000) for L-arm, R-arm, L-leg, R-leg; the patterns are not yet identified]

slide-41
SLIDE 41
  • Q1. Effectiveness - #Mocap
  • MoCap sensor stream

[Figure: MoCap sensor values (0 to 1) over time (0 to 5000) for L-arm, R-arm, L-leg, R-leg, segmented into regimes 1 to 4]

#1 Going straight #2 Stretching arms #3 Stretching right arm #4 Stretching left arm

StreamScope can find intuitive patterns automatically

slide-42
SLIDE 42
  • Q1. Effectiveness - #Bicycle
  • Bicycle dataset

[Figure: acceleration stream (X, Y, Z) from the bicycle dataset]

slide-43
SLIDE 43
  • Q1. Effectiveness - #Bicycle
  • Bicycle dataset

[Figure: bicycle dataset segmented into five regimes, #1 to #5]

slide-44
SLIDE 44
  • Q2. Accuracy
  • Segmentation accuracy (higher is better)

[Figure: bar chart of Macro-F1 score (axis from 0.5 to 1) on #Mocap, #Bicycle, #Workout for StreamScope, AutoPlait, TICC-2, TICC-4, TICC-8, pHMM]

Good accuracy compared with other methods
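
For reference, a macro-averaged F1 score can be computed as below. The slides do not say how labels were matched for the segmentation task, so this sketch simply applies the standard per-class definition to made-up per-tick labels.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: per-class F1 is 1.0, 2/3, and 0.8.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.8222
```

Macro averaging weights every class equally, so a rare regime that is missed lowers the score as much as a frequent one.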

slide-45
SLIDE 45
  • Q2. Accuracy
  • Clustering accuracy (higher is better)

[Figure: bar chart of clustering accuracy (axis from 0.5 to 1) on #Mocap, #Bicycle, #Workout for StreamScope, AutoPlait, TICC-2, TICC-4, TICC-8, pHMM]

Good accuracy compared with other methods

slide-46
SLIDE 46
  • Q3. Scalability
  • Wall clock time vs. stream length

The complexity is independent of the data length

[Figure: wall clock time (s), log scale from 10^-4 to 10^4, vs. stream length (1 to 4, ×10^4 time ticks) for StreamScope, AutoPlait, TICC, pHMM; annotated "100x"]

slide-47
SLIDE 47
  • Q3. Scalability
  • Memory space vs. stream length

The complexity is independent of the data length

[Figure: memory space (bytes), log scale from 10^4 to 10^6, vs. stream length (1 to 4, ×10^4 time ticks) for StreamScope against an O(n) reference; annotated "100x"]

slide-48
SLIDE 48

Conclusions

StreamScope has the following advantages:

  • Effective: find optimal segments/regimes
  • Adaptive: automatic and incremental
  • Scalable: it does not depend on data length

slide-49
SLIDE 49

Thank you !
