An Efficient Algorithm for SPM with CP

slide-1
SLIDE 1

An Efficient Algorithm for SPM with CP

  • J. Aoga (1), T. Guns (2), P. Schaus (1)

(1) UCLouvain, (2) KULeuven, Belgium

ECML PKDD 2016, Riva del Garda, Italy, 19–23/09/2016

Problem of Sequential Pattern Mining (SPM)

Aoga et al., An Efficient Algorithm for SPM with CP, ECML PKDD 2016

  • Sequence Database (SDB)

Client1 : Milk Coffee Sugar Coffee Sugar
Client2 : Coffee Milk Coffee Sugar
Client3 : Milk Coffee
Client4 : Coffee Sugar Egg

  • Sequence : <Milk Coffee Sugar Coffee Sugar>

slide-2
SLIDE 2

  • Subsequence : <Coffee Sugar> is a subsequence of <Milk Coffee Sugar Coffee Sugar>
  • Support (<Coffee Sugar>) = 3 (it occurs in the sequences of Client1, Client2 and Client4)

Problem : Find all subsequences with support ≥ Given Threshold
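The definitions above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `is_subsequence` and `support` are made up here, and the database is the four-client example from the slides.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the first match, which enforces the order.
    return all(item in it for item in pattern)


def support(pattern: Sequence[str], sdb: List[Sequence[str]]) -> int:
    """Number of sequences of the database that contain `pattern`."""
    return sum(1 for sequence in sdb if is_subsequence(pattern, sequence))


SDB = [
    ["Milk", "Coffee", "Sugar", "Coffee", "Sugar"],  # Client1
    ["Coffee", "Milk", "Coffee", "Sugar"],           # Client2
    ["Milk", "Coffee"],                              # Client3
    ["Coffee", "Sugar", "Egg"],                      # Client4
]

print(support(["Coffee", "Sugar"], SDB))  # 3, the value given on the slide
```

A pattern is frequent when this count reaches the given threshold.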

Related Work

Timeline Specialised Methods (axis years: 1995, 1996, 2000, 2001, 2005)

  • Apriori-All [1]
  • GSP : HashTree
  • [c]SPADE [18] : Vertical SDB
  • PrefixSpan [10] : Prefix Projection
  • Lapin-Spam [16] : Last Position idea

Timeline CP-Based Methods (axis years: 2012, 2015, 2016)

  • SatEms [3] : SAT-Based
  • CPSM [8] : One Prop./Seq.
  • PP [6] : Global Prop.
  • GapSeq [5] : Gap constraint
  • PPIC : ?

slide-3
SLIDE 3



CP : Filtering + DFSearch

Variables : P1, P2, P3, P4, P5

Domains : each Pi ranges over {Milk, Coffee, Sugar, Egg, 𝝑}, where 𝝑 is the padding symbol that fills the positions after the end of the pattern

Search tree : P1 = Coffee | P1 != Coffee; under P1 = Milk : P2 = Sugar | P2 != Sugar

Frequent Pattern Found : P1P2…PL = MS𝝑𝝑𝝑

slide-4
SLIDE 4

CP : Filtering + DFSearch

Constraint Store over P1P2P3P4P5

Constraints : RegExpr, Support counting

Fix-Point Algorithm (this is the main bottleneck) :

repeat
    select a constraint c
    if c is OK wrt the domain store
        apply filtering algorithm of c // i.e. remove impossible values
    else
        return FAIL
until domain store did not change
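The fix-point loop above can be sketched in Python as follows. This is a toy version under stated assumptions, not the solver's actual propagation engine: the propagator interface (a function that filters the domains in place and returns False on failure), the two example propagators and the name `EPS` for the padding symbol are all invented for illustration.

```python
from typing import Callable, Dict, List, Set

Domains = Dict[str, Set[str]]
# Hypothetical propagator interface: filter `domains` in place and
# return False when the constraint can no longer be satisfied (FAIL).
Propagator = Callable[[Domains], bool]

EPS = "eps"  # stand-in name for the padding symbol of the slides


def fix_point(domains: Domains, propagators: List[Propagator]) -> bool:
    """Run all propagators until no domain changes; False means FAIL."""
    changed = True
    while changed:
        changed = False
        for propagate in propagators:
            sizes_before = {var: len(dom) for var, dom in domains.items()}
            if not propagate(domains):
                return False
            sizes_after = {var: len(dom) for var, dom in domains.items()}
            if sizes_after != sizes_before:  # filtering only ever removes values
                changed = True
    return True


def remove_value(var: str, value: str) -> Propagator:
    """Toy constraint: `var` must differ from `value`."""
    def propagate(domains: Domains) -> bool:
        domains[var].discard(value)
        return bool(domains[var])  # an empty domain is a failure
    return propagate


def padding_propagates(first: str, second: str) -> Propagator:
    """Toy constraint: once `first` is padding, `second` must be padding too."""
    def propagate(domains: Domains) -> bool:
        if domains[first] == {EPS}:
            domains[second] &= {EPS}
        return bool(domains[second])
    return propagate


domains = {"P1": {"Milk", EPS}, "P2": {"Coffee", EPS}, "P3": {"Sugar", EPS}}
ok = fix_point(domains, [
    padding_propagates("P2", "P3"),
    padding_propagates("P1", "P2"),
    remove_value("P1", "Milk"),
])
print(ok, domains)
```

The propagators are deliberately listed in an unhelpful order, so several rounds are needed before the domain store stops changing. That repeated re-execution is the cost the slide points at.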

Improvements of Literature (1/4)

  • 1. Memory and DFS improvement. How to store and restore databases in the DFSearch? => Reversible vectors making use of trailing techniques.
  • 2. Support count improvement. Visit only the last position of each symbol after the start position. [weakness 2]
  • 3. Sequence visited improvement. Visit a sequence only if the current start position is less than the last position of the prefix. [weakness 3]
  • 4. Pruning improvement. Remove infrequent items only from the domain Di+1 of Pi+1. [weakness 1]

slide-5
SLIDE 5

[Figure: the SDB (Client1 : Milk Coffee Sugar Coffee Sugar; Client2 : Coffee Milk Coffee Sugar; Client3 : Milk Coffee; Client4 : Coffee Sugar Egg) with sequence ids 1 to 4. The item supports are being counted, starting with M. Domains: P1 ranges over {Milk, Coffee, Sugar, Egg}; P2 to P5 range over {Milk, Coffee, Sugar, Egg, 𝝑}.]

slide-6
SLIDE 6

Supports : M : 3, C : 4, S : 3, E : 1

Given Threshold = 3 (75%)

[Figure: Egg has support 1, below the threshold, so it is removed from the domain of P1, which becomes {Milk, Coffee, Sugar}. P2 to P5 still range over {Milk, Coffee, Sugar, Egg, 𝝑}.]

slide-7
SLIDE 7

[Figure: a vector of (Seq., Pos.) pairs with cells 0 to 13. Initially start=0 and size=4, one entry per sequence. The search then assigns P1 = Milk and the projection of the SDB on the prefix M begins.]

slide-8
SLIDE 8

[Figure: after P1 = Milk, only the suffixes that follow M in sequences 1, 2 and 3 remain. The pairs (Seq., Pos.) = (1,1), (2,2), (3,1) are written after the first four entries of the vector, giving start=4, size=3. The previous state (start=0, size=4) is pushed on the TrailStack.]
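The projection step shown in the figure can be reproduced with a short Python sketch. The helper names `project` and `item_supports` are hypothetical, and a pair (Seq., Pos.) is read here as "the prefix has been matched up to position Pos. of sequence Seq.", which is an interpretation of the figure rather than something the slides state.

```python
from collections import Counter
from typing import Dict, List, Tuple

SDB: Dict[int, List[str]] = {
    1: ["Milk", "Coffee", "Sugar", "Coffee", "Sugar"],
    2: ["Coffee", "Milk", "Coffee", "Sugar"],
    3: ["Milk", "Coffee"],
    4: ["Coffee", "Sugar", "Egg"],
}

Pair = Tuple[int, int]  # (sequence id, 1-based position of the last matched item; 0 = none)


def project(sdb: Dict[int, List[str]], pairs: List[Pair], item: str) -> List[Pair]:
    """Keep the sequences whose remaining suffix contains `item`, at its first match."""
    projected = []
    for sid, pos in pairs:
        seq = sdb[sid]
        for j in range(pos, len(seq)):  # 0-based index j is 1-based position j + 1
            if seq[j] == item:
                projected.append((sid, j + 1))
                break
    return projected


def item_supports(sdb: Dict[int, List[str]], pairs: List[Pair]) -> Counter:
    """Support of each item in the suffixes; an item counts once per sequence."""
    supports: Counter = Counter()
    for sid, pos in pairs:
        supports.update(set(sdb[sid][pos:]))
    return supports


root = [(sid, 0) for sid in SDB]
after_m = project(SDB, root, "Milk")
after_mc = project(SDB, after_m, "Coffee")
print(after_m)   # [(1, 1), (2, 2), (3, 1)]
print(after_mc)  # [(1, 2), (2, 3), (3, 2)]
```

Only the (Seq., Pos.) pairs are stored; the sequences themselves are never copied, which is what makes the projected database cheap to keep during the search.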

slide-9
SLIDE 9

[Figure: supports displayed for the database projected on M : M : 0, C : 3, S : 2, E : 1. With threshold 3, the domain of P2 is reduced to {Coffee, 𝝑}. After P2 = Coffee, the supports displayed for the database projected on M C are M : 0, C : 1, S : 2, E : 1. The vector is still at start=4, size=3, with (start=0, size=4) on the TrailStack.]

slide-10
SLIDE 10

[Figure: after P2 = Coffee, the pairs (1,2), (2,3), (3,2) are appended to the vector, giving start=7, size=3. The TrailStack now holds (start=0, size=4) and start=4. No item reaches the threshold any more, so P3 = P4 = P5 = 𝝑 and the pattern Milk Coffee 𝝑 𝝑 𝝑 is complete. On backtrack, the top of the TrailStack is popped and the vector is restored to start=4, size=3.]

slide-11
SLIDE 11

[Figure: a second backtrack restores start=0, size=4 and empties the TrailStack. The search then moves to the branch P1 = Coffee. The pairs (1,2), (2,1), (3,2), (4,1) overwrite the vector from cell 4 on, giving start=4, size=4, and start=0 is pushed on the TrailStack again.]
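The walkthrough above can be condensed into a small Python sketch of a reversible vector. It is an illustration under assumptions, not the OscaR data structure: the class name `ReversiblePSDB` and its methods are hypothetical, and the trail is a plain list of (start, size) pairs.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (sequence id, position)


class ReversiblePSDB:
    """One shared vector of pairs; only `start` and `size` are saved and restored."""

    def __init__(self, initial: List[Pair]) -> None:
        self.pairs: List[Pair] = list(initial)
        self.start = 0
        self.size = len(self.pairs)
        self.trail: List[Tuple[int, int]] = []  # the TrailStack of the slides

    def push_state(self) -> None:
        """Save the current window before branching."""
        self.trail.append((self.start, self.size))

    def extend(self, new_pairs: List[Pair]) -> None:
        """Write the projected pairs right after the current window."""
        new_start = self.start + self.size
        # Overwrites stale cells left by abandoned branches, appends otherwise.
        self.pairs[new_start:new_start + len(new_pairs)] = new_pairs
        self.start, self.size = new_start, len(new_pairs)

    def pop_state(self) -> None:
        """Backtrack: restore the window in O(1); no pair is copied or erased."""
        self.start, self.size = self.trail.pop()

    def current(self) -> List[Pair]:
        return self.pairs[self.start:self.start + self.size]


psdb = ReversiblePSDB([(1, 0), (2, 0), (3, 0), (4, 0)])
psdb.push_state()
psdb.extend([(1, 1), (2, 2), (3, 1)])          # P1 = Milk   -> start=4, size=3
psdb.push_state()
psdb.extend([(1, 2), (2, 3), (3, 2)])          # P2 = Coffee -> start=7, size=3
psdb.pop_state()                               # backtrack   -> start=4, size=3
psdb.pop_state()                               # backtrack   -> start=0, size=4
psdb.push_state()
psdb.extend([(1, 2), (2, 1), (3, 2), (4, 1)])  # P1 = Coffee -> start=4, size=4
print(psdb.start, psdb.size, psdb.current())
```

Backtracking never touches the stored pairs: cells past the current window simply become stale and are overwritten by the next branch, which matches the leftover entries visible in the last figure.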

slide-12
SLIDE 12
Improvements of Literature (2/4)

  • 1. Memory and DFS improvement. How to store and restore databases in the DFSearch? => Reversible vectors making use of trailing techniques.
  • 2. Support count improvement. How to compute support efficiently? Visit only the last position of each symbol after the start position.
  • 3. Sequence visited improvement. Visit a sequence only if the current start position is less than the last position of the prefix. [weakness 3]
  • 4. Pruning improvement. Remove infrequent items only from the domain Di+1 of Pi+1. [weakness 1]

  • How to compute items support in SDB?

Often many repeating symbols: cache for each symbol its 'last position' and only iterate over those (O(m) vs O(n))

[Figure: one sequence with positions 1 to 16 over the symbols A, B, C, D, E, with startPos=5.]

★ Last Position List = [(B,16),(C,12),(A,11),(D,2),(E,0)]

★ Identify items which exist in each sequence of SDB and increase items support value. With startPos=5, the items found are B, C and A: D (last position 2) and E (last position 0) do not occur after startPos.
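A minimal Python sketch of the last-position idea follows. The sequence below is made up so that it reproduces the Last Position List of the slide (the sequence drawn on the slide did not survive extraction), and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def last_position_list(sequence: Sequence[str], alphabet: Sequence[str]) -> List[Tuple[str, int]]:
    """(item, last 1-based position) pairs sorted by decreasing position; 0 = absent."""
    last: Dict[str, int] = {item: 0 for item in alphabet}
    for pos, item in enumerate(sequence, start=1):
        last[item] = pos
    return sorted(last.items(), key=lambda pair: pair[1], reverse=True)


def items_after(last_positions: List[Tuple[str, int]], start_pos: int) -> List[str]:
    """Items occurring after `start_pos`, found without scanning the sequence."""
    found = []
    for item, last in last_positions:
        if last <= start_pos:
            break  # the list is sorted, so no later entry can qualify
        found.append(item)
    return found


# Assumed sequence, chosen only to match the list shown on the slide.
sequence = list("ADACABAAACACBBBB")
lpl = last_position_list(sequence, "ABCDE")
print(lpl)                  # [('B', 16), ('C', 12), ('A', 11), ('D', 2), ('E', 0)]
print(items_after(lpl, 5))  # ['B', 'C', 'A']
```

The loop visits at most one entry per distinct symbol and stops early, instead of walking over every remaining position of the sequence.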

slide-13
SLIDE 13
Improvements of Literature (3/4)

  • 1. Memory and DFS improvement. How to store and restore databases in the DFSearch? => Reversible vectors making use of trailing techniques.
  • 2. Support count improvement. How to compute items support efficiently? Visit only the last position of each symbol after the start position.
  • 3. Sequence projection improvement. P1P2P3P4P5 has just been extended to EP2P3P4P5. Can we decide in O(1) if an embedding cannot be extended?
  • 4. Pruning improvement. Remove infrequent items only from the domain Di+1 of Pi+1. [weakness 1]

  • 1. Which sequences contain at least one E ?

Naive Solution: Scan all items in each sequence. O(n) check per sequence.

Can we do better? YES. For each sequence, precompute a map from symbol to last position in the sequence. If the sequence is extended by 'E', verify that lastPos[sid][E] is larger than startPos. O(1) check per sequence.

slide-14
SLIDE 14

  • 1. Which sequences contain at least one E ?

Check lastPos[sid][E] > startPos for each sequence, with startPos = 0 :

  • Sequence 1 : 0 > 0, no
  • Sequence 2 : 0 > 0, no
  • Sequence 3 : 0 > 0, no
  • Sequence 4 : 3 > 0, yes

The naive solution scans all items, an O(n) check per sequence. The precomputed map from symbol to last position gives an O(1) check per sequence.
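The four checks above can be reproduced with a small Python sketch. The names `build_last_pos` and `may_extend` are hypothetical; positions are 1-based and 0 means the symbol does not occur, as in the slide.

```python
from typing import Dict, List

SDB: Dict[int, List[str]] = {
    1: ["Milk", "Coffee", "Sugar", "Coffee", "Sugar"],
    2: ["Coffee", "Milk", "Coffee", "Sugar"],
    3: ["Milk", "Coffee"],
    4: ["Coffee", "Sugar", "Egg"],
}


def build_last_pos(sdb: Dict[int, List[str]]) -> Dict[int, Dict[str, int]]:
    """Precompute, once, the last 1-based position of every symbol in every sequence."""
    # Later occurrences overwrite earlier ones, so the last position is kept.
    return {
        sid: {item: pos for pos, item in enumerate(seq, start=1)}
        for sid, seq in sdb.items()
    }


def may_extend(last_pos: Dict[int, Dict[str, int]], sid: int, item: str, start_pos: int) -> bool:
    """O(1) test: does `item` occur after `start_pos` in sequence `sid`?"""
    return last_pos[sid].get(item, 0) > start_pos


last_pos = build_last_pos(SDB)
print([may_extend(last_pos, sid, "Egg", 0) for sid in SDB])  # [False, False, False, True]
```

The map is built in a single pass over the database before the search starts, so each later check is a dictionary lookup and one comparison.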

slide-15
SLIDE 15
Improvements of Literature (4/4)

  • 1. Memory and DFS improvement. How to store and restore databases in the DFSearch? => Reversible vectors making use of trailing techniques.
  • 2. Support count improvement. How to compute items support efficiently? Visit only the last position of each symbol after the start position.
  • 3. Sequence visited improvement. P1P2P3P4P5 has just been extended to EP2P3P4P5. Can we decide in O(1) if an embedding cannot be extended?
  • 4. Pruning improvement. Remove infrequent items only from the domain Di+1 of Pi+1.
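The pruning step of item 4 can be sketched in Python as below. The function name `prune_domain` and the string `EPS` standing for the padding symbol are assumptions made for this example; the supports are the ones of the running example.

```python
from typing import Dict, Set

EPS = "eps"  # stand-in name for the padding symbol of the slides


def prune_domain(domain: Set[str], supports: Dict[str, int], threshold: int) -> Set[str]:
    """Keep the padding symbol and the items whose support reaches the threshold."""
    return {v for v in domain if v == EPS or supports.get(v, 0) >= threshold}


# Root of the search: only the domain of P1 is filtered.
d1 = prune_domain({"Milk", "Coffee", "Sugar", "Egg"},
                  {"Milk": 3, "Coffee": 4, "Sugar": 3, "Egg": 1}, threshold=3)
print(sorted(d1))  # ['Coffee', 'Milk', 'Sugar']

# After P1 = Milk: only the domain of P2 is filtered, using the projected supports.
d2 = prune_domain({"Milk", "Coffee", "Sugar", "Egg", EPS},
                  {"Coffee": 3, "Sugar": 2}, threshold=3)
print(sorted(d2))  # ['Coffee', 'eps']
```

Only the next variable is filtered at each step, since the supports depend on the prefix fixed so far and are recomputed after every assignment.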

Experiments

Three global constraints implemented in Scala in OscaR :

  • PPIC : memory + support counting + sequence visited + pruning improvements
  • PPDC : same as PPIC, but support counting is different
  • PPmixed : mix of PPIC and PPDC, based on a heuristic

OscaR : www.oscarlib.org

Results : http://sites.uclouvain.be/cp4dm/spm/

Compared with CP-Based methods

[Figure: PPIC vs GapSeq. Time limit = 3600s (1H)]

slide-16
SLIDE 16

Compared with Specialized methods

[Figure: PPIC vs cSPADE. Time limit = 3600s (1H)]

Handling of different additional constraints

  • Minimum and maximum pattern size constraint
  • Item constraint
  • Regular expression constraint

[Figure (a): LEVIATHAN, minsup=4%. Axes: Size (nItem), Time (s). Series: GapSeq.size, cSPADE.size, PPIC.size]

[Figure (b): Data200k, RE10. Axis: Minfreq (seq). Series: PP.RE10, PPIC.RE10]

Conclusion

  • Integrates ideas of both SPM and CP in a global constraint for a CP solver (generic and flexible);
  • For the first time, a CP-based approach outperforms both the other CP-based approaches and the specialized methods;
  • Memory-efficient, trail-based, backtracking-aware data structures really speed up the DFSearch;
  • Compatible with a number of other constraints, such as minimum and maximum length constraints, regular expression constraints, etc.
  • Future work: use the generality of the framework for more constraints and other sequence mining settings (multi-objective, information theory based, interactive, ...)
  • Code and apps are open : http://sites.uclouvain.be/cp4dm/spm/

slide-17
SLIDE 17
Improvements of Literature (1/4)

  • 1. Memory and DFS improvement. How to store and restore databases in the DFSearch? => Reversible vectors making use of trailing techniques.
  • 2. Support count improvement. Visit only the last position of each symbol after the start position. [weakness 2]
  • 3. Sequence visited improvement. Visit a sequence only if the current start position is less than the last position of the prefix. [weakness 3]
  • 4. Pruning improvement. Remove infrequent items only from the domain Di+1 of Pi+1. [weakness 1]

[Figure: the SDB (Client1 : Milk Coffee Sugar Coffee Sugar; Client2 : Coffee Milk Coffee Sugar; Client3 : Milk Coffee; Client4 : Coffee Sugar Egg) with sequence ids 1 to 4. Domains: P1 ranges over {Milk, Coffee, Sugar, Egg}; P2 to P5 range over {Milk, Coffee, Sugar, Egg, 𝝑}.]

slide-18
SLIDE 18

Supports : M : 3, C : 4, S : 3, E : 1

Given Threshold = 3 (75%)

[Figure: the item supports are counted over the SDB; the domains of P1 to P5 are still unfiltered.]

slide-19
SLIDE 19

[Figure: Egg (support 1) is removed from the domain of P1, which becomes {Milk, Coffee, Sugar}. A vector of (Seq., Pos.) pairs with cells 0 to 13 is introduced, with start=0 and size=4. The search then assigns P1 = Milk.]

slide-20
SLIDE 20

[Figure: after P1 = Milk, only the suffixes that follow M in sequences 1, 2 and 3 remain. The pairs (Seq., Pos.) = (1,1), (2,2), (3,1) are written after the first four entries of the vector, giving start=4, size=3.]

slide-21
SLIDE 21

[Figure: the previous state (start=0, size=4) is pushed on the TrailStack. Supports displayed for the database projected on M : M : 0, C : 3, S : 2, E : 1, so the domain of P2 is reduced to {Coffee, 𝝑}. After P2 = Coffee, the supports displayed for the database projected on M C are M : 0, C : 1, S : 2, E : 1.]

slide-22
SLIDE 22

[Figure: after P2 = Coffee, the pairs (1,2), (2,3), (3,2) are appended to the vector, giving start=7, size=3. The TrailStack holds (start=0, size=4) and start=4. P3 = P4 = P5 = 𝝑 completes the pattern Milk Coffee 𝝑 𝝑 𝝑. On backtrack, the vector is restored to start=4, size=3.]

slide-23
SLIDE 23

[Figure, animation frames, continued: backtracking to the root.]

  • After the backtrack to start=4, size=3, P2 takes the value 𝝑, so P2 to P5 are all 𝝑.

  • A second backtrack restores start=0, size=4 (the full SDB). The vector still holds the stale entries (1,1), (2,2), (3,1), (1,2), (2,3), (3,2).

  • The search continues with the next value for P1: Coffee.

slide-24
SLIDE 24

[Figure, final animation frame: P1 = Coffee.]

  • The projected block for <Coffee> is written at start=4, size=4, with (Seq., Pos.) entries (1,2), (2,1), (3,2), (4,1). It overwrites the stale entries left by the Milk branch; the two stale entries beyond it, (2,3) and (3,2), remain in the vector.
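The trailing idea illustrated by the animation can be sketched in a few lines of Python. This is only an illustration under assumptions, not the OscaR implementation: the class name `TrailedProjection` and its methods are hypothetical, and the trail here stores only the `(start, size)` pairs. The usage at the end replays the steps of the figure (Milk, then Coffee, two backtracks, then Coffee as first item).

```python
class TrailedProjection:
    """Hypothetical sketch: projected databases kept in one shared vector.

    Only (start, size) is saved on the trail; the entries themselves are
    never copied or erased on backtrack.
    """

    def __init__(self, n_sequences):
        # Root block: every sequence id (1-based), nothing matched yet.
        self.entries = [(sid, 0) for sid in range(1, n_sequences + 1)]
        self.start = 0
        self.size = n_sequences
        self.trail = []  # stack of saved (start, size) pairs

    def push(self, block):
        """Save the current (start, size) and write `block` right above it."""
        self.trail.append((self.start, self.size))
        new_start = self.start + self.size
        # Overwrites stale entries from earlier branches, or grows the vector.
        self.entries[new_start:new_start + len(block)] = block
        self.start, self.size = new_start, len(block)

    def backtrack(self):
        """Restore the previous (start, size) in O(1)."""
        self.start, self.size = self.trail.pop()

    def current(self):
        """The (Seq., Pos.) entries of the current projected database."""
        return self.entries[self.start:self.start + self.size]


proj = TrailedProjection(4)
proj.push([(1, 1), (2, 2), (3, 1)])          # prefix <Milk>
proj.push([(1, 2), (2, 3), (3, 2)])          # prefix <Milk, Coffee>
proj.backtrack()                             # back to start=4, size=3
proj.backtrack()                             # back to start=0, size=4
proj.push([(1, 2), (2, 1), (3, 2), (4, 1)])  # prefix <Coffee>
print(proj.start, proj.size, proj.current())
```

The design choice worth noting is that a backtrack costs a single pop regardless of the size of the projected database, which is what makes the depth-first search cheap to undo.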

  • 1. Memory and DFS improvement. How to store and restore databases in the DFSearch? => Reversible vectors that make use of trailing techniques.

  • 2. Support count improvement. How to compute supports efficiently? Visit only the last position of each symbol after the start position.

  • 3. Sequence visited improvement. Visit a sequence only if the current start position is less than the last position of the prefix. [weakness 3]

  • 4. Pruning improvement. Remove infrequent items only from the domain Di+1 of Pi+1. [weakness 1]

Improvements of Literature (2/4)

  • How to compute item supports in the SDB?

Symbols often repeat many times: cache, for each symbol, its 'last position' and iterate only over those (O(m) vs O(n)).

[Figure: an example sequence with positions 1 to 16 and the marker startPos = 5.]

★ Last Position List = [(B,16),(C,12),(A,11),(D,2),(E,0)]

★ Identify the items that still occur after startPos in each sequence of the SDB (here A, C and B) and increase their support values.
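A minimal Python sketch of this support counting follows, assuming 1-based positions and a list sorted by decreasing last position so that the scan can stop at the first symbol whose last occurrence is not after startPos. The function names are hypothetical, and single letters stand for Milk, Coffee, Sugar and Egg.

```python
def build_last_position_list(sequence):
    """Return [(item, last position)] sorted by decreasing last position."""
    last = {}
    for pos, item in enumerate(sequence, start=1):
        last[item] = pos  # later occurrences overwrite earlier ones
    return sorted(last.items(), key=lambda kv: kv[1], reverse=True)


def items_after(last_position_list, start_pos):
    """Items that occur strictly after start_pos."""
    found = []
    for item, last_pos in last_position_list:
        if last_pos <= start_pos:
            break  # every remaining symbol ends even earlier
        found.append(item)
    return found


def count_supports(last_position_lists, start_positions):
    """Add 1 to an item's support for each sequence where it still occurs."""
    supports = {}
    for lpl, start in zip(last_position_lists, start_positions):
        for item in items_after(lpl, start):
            supports[item] = supports.get(item, 0) + 1
    return supports


# The list shown on the slide, with startPos = 5.
slide_list = [("B", 16), ("C", 12), ("A", 11), ("D", 2), ("E", 0)]
after_five = items_after(slide_list, 5)

# The example SDB of the talk.
sdb = [list("MCSCS"), list("CMCS"), list("MC"), list("CSEC")]
lists = [build_last_position_list(seq) for seq in sdb]
root_supports = count_supports(lists, [0, 0, 0, 0])
# Projection on <M>: sequences 1 to 3, matched at positions 1, 2 and 1.
milk_supports = count_supports(lists[:3], [1, 2, 1])
print(after_five, root_supports, milk_supports)
```

Each sequence contributes at most one loop iteration per distinct symbol, instead of one per position.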

slide-25
SLIDE 25


  • 1. Memory and DFS improvement. How to store and restore databases in the DFSearch? => Reversible vectors that make use of trailing techniques.

  • 2. Support count improvement. How to compute item supports efficiently? Visit only the last position of each symbol after the start position.

  • 3. Sequence projection improvement. P1P2P3P4P5 has just been extended to EP2P3P4P5. Can we decide in O(1) whether an embedding cannot be extended?

  • 4. Pruning improvement. Remove infrequent items only from the domain Di+1 of Pi+1. [weakness 1]

Improvements of Literature (3/4)

  • 1. Which sequences contain at least one E?

[Figure: the four sequences of the SDB, numbered 1 to 4.]

Naive solution: scan all items in each sequence, an O(n) check per sequence.

slide-26
SLIDE 26

Can we do better? Yes. For each sequence, precompute a map from each symbol to its last position in the sequence. When the pattern is extended by 'E', verify that lastPos[sid][E] is larger than startPos: an O(1) check per sequence.

slide-27
SLIDE 27

Checking lastPos[sid][E] > startPos on the four sequences:

  • Sequence 1: 0 > 0, fails

  • Sequence 2: 0 > 0, fails

  • Sequence 3: 0 > 0, fails

  • Sequence 4: 3 > 0, holds

Only sequence 4 needs to be visited.
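The O(1) test can be sketched as follows in Python. The helper names `build_last_pos` and `can_extend` are hypothetical; positions are 1-based, a missing symbol is treated as last position 0, and `sid` is a 0-based list index here (the slides number sequences from 1).

```python
def build_last_pos(sdb):
    """One dict per sequence, mapping each symbol to its last position."""
    # In a dict comprehension, later occurrences overwrite earlier ones.
    return [{item: pos for pos, item in enumerate(seq, start=1)} for seq in sdb]


def can_extend(last_pos, sid, item, start_pos):
    """O(1) check: does `item` still occur after start_pos in sequence sid?"""
    return last_pos[sid].get(item, 0) > start_pos


# Example SDB of the talk (M = Milk, C = Coffee, S = Sugar, E = Egg).
sdb = [list("MCSCS"), list("CMCS"), list("MC"), list("CSEC")]
last_pos = build_last_pos(sdb)

# Which sequences can be extended by E from startPos = 0?
checks = [can_extend(last_pos, sid, "E", 0) for sid in range(len(sdb))]
print(checks)
```

The map is built once, before the search starts, so the per-sequence cost during the search is a single lookup and comparison.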

  • 1. Memory and DFS improvement. How to store and restore databases in the DFSearch? => Reversible vectors that make use of trailing techniques.

  • 2. Support count improvement. How to compute item supports efficiently? Visit only the last position of each symbol after the start position.

  • 3. Sequence visited improvement. P1P2P3P4P5 has just been extended to EP2P3P4P5. Can we decide in O(1) whether an embedding cannot be extended?

  • 4. Pruning improvement. Remove infrequent items only from the domain Di+1 of Pi+1.

Improvements of Literature (4/4)
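As an illustration of the pruning step, the Python sketch below filters only the domain of the next pattern variable, keeping the empty symbol and the items whose support in the current projected database reaches the threshold. The function name, the `EPSILON` marker and the set representation of a domain are assumptions made for this example, not the solver's API.

```python
EPSILON = "eps"  # hypothetical stand-in for the empty symbol of the slides


def prune_next_domain(domain, supports, min_support):
    """Values kept in the domain of P(i+1) after extending the prefix."""
    return {
        value
        for value in domain
        if value == EPSILON or supports.get(value, 0) >= min_support
    }


# Domain of P2 and the supports shown on the slides after P1 = Milk.
domain_p2 = {"Milk", "Coffee", "Sugar", "Egg", EPSILON}
supports_after_milk = {"Milk": 0, "Coffee": 3, "Sugar": 2, "Egg": 1}
pruned = prune_next_domain(domain_p2, supports_after_milk, min_support=3)
print(sorted(pruned))
```

With a threshold of 3, only Coffee and the empty symbol remain, which matches the reduced domain of P2 shown in the animation.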

Experiments

Three global constraints implemented in Scala in OscaR:

  • PPIC: memory + support counting + sequence visited + pruning improvements

  • PPDC: same as PPIC, but with a different support counting

  • PPmixed: a mix of PPIC and PPDC based on a heuristic

OSCAR

www.oscarlib.org

Results

http://sites.uclouvain.be/cp4dm/spm/


slide-28
SLIDE 28

Compared with CP-based methods

[Figure: PPIC vs. GapSeq; time limit = 3600 s (1 h).]

Compared with specialized methods

[Figure: PPIC vs. cSPADE; time limit = 3600 s (1 h).]

[Figure (a): LEVIATHAN, minsup = 4%; axes Size (nItem) and Time (s); series GapSeq.size, cSPADE.size, PPIC.size.]

[Figure (b): Data200k, RE10; axis Minfreq (seq); series PP.RE10, PPIC.RE10.]

  • Minimum and maximum pattern size constraint
  • Item constraint
  • Regular expression constraint

Handling of different additional constraints


  • Integrates ideas from both SPM and CP in a global constraint for a CP solver (generic and flexible);

  • For the first time, a CP-based approach outperforms both existing CP-based approaches and specialized methods;

  • Memory-efficient, trail-based, backtracking-aware data structures speed up the search in DFSearch;

  • Compatible with a number of other constraints, such as minimum and maximum length constraints, regular expression constraints, etc.;

  • Future work: use the generality of the framework for more constraints and other sequence mining settings (multi-objective, information theory based, interactive, ...);

  • Code and apps are open: http://sites.uclouvain.be/cp4dm/spm/

Conclusion

slide-29
SLIDE 29