
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth - PowerPoint PPT Presentation



  1. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth
     Authors: Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu
     Presenter: Wojciech Stach

     Outline
     - Mining Sequential Patterns
       - Problem statement
       - Definitions & examples
       - Strategies
     - PrefixSpan algorithm
       - Motivation
       - Definitions & examples
       - Algorithm
       - Example
       - Performance study
       - Conclusions

     Sequential Pattern Mining
     - Given
       - a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items
       - a user-specified min_support threshold
     - Find all the frequent subsequences, i.e. the subsequences whose occurrence frequency in the set of sequences is no less than min_support

       id   Sequence
       10   <a(abc)(ac)d(cf)>
       20   <(ad)c(bc)(ae)>
       30   <(ef)(ab)(df)cb>
       40   <eg(af)cbc>

     - <a(abc)(ac)d(cf)> has 5 elements and 9 items, so it is a 9-sequence
     - <a(abc)(ac)d(cf)> = <a(cba)(ac)d(cf)> (item order inside an element does not matter)
     - <a(abc)(ac)d(cf)> ≠ <a(ac)(abc)d(cf)> (element order does matter)

     Sequential Pattern Mining: solution
     - min_support = 2 on the database above
     - Solution: 53 frequent subsequences

       <a> <aa> <ab> <a(bc)> <a(bc)a> <aba> <abc> <(ab)> <(ab)c> <(ab)d> <(ab)f> <(ab)dc> <ac> <aca> <acb> <acc> <ad> <adc> <af>
       <b> <ba> <bc> <(bc)> <(bc)a> <bd> <bdc> <bf>
       <c> <ca> <cb> <cc>
       <d> <db> <dc> <dcb>
       <e> <ea> <eab> <eac> <eacb> <eb> <ebc> <ec> <ecb> <ef> <efb> <efc> <efcb>
       <f> <fb> <fbc> <fc> <fcb>
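To make the notation on the problem-statement slides concrete, here is a minimal sketch that parses the angle-bracket notation into a list of item sets. The `parse` helper and the assumption that every item is a single character are ours for illustration; they are not part of the slides.

```python
def parse(text):
    """Parse e.g. '<a(abc)(ac)d(cf)>' into a list of frozensets, one per element.

    Assumption: every item is a single character, as in the slide examples.
    """
    body = text.strip()[1:-1]  # drop the enclosing '<' and '>'
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            end = body.index(")", i)
            elements.append(frozenset(body[i + 1:end]))
            i = end + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


s = parse("<a(abc)(ac)d(cf)>")
num_elements = len(s)
num_items = sum(len(e) for e in s)  # the "length" of the sequence
print(num_elements, num_items)      # 5 9

# Item order inside an element is irrelevant; element order is not.
print(s == parse("<a(cba)(ac)d(cf)>"))  # True
print(s == parse("<a(ac)(abc)d(cf)>"))  # False
```

Using `frozenset` for elements is one way to get the "unordered inside, ordered outside" behaviour for free; other representations (sorted tuples, bitmasks) would work as well.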

  2. Subsequence vs. super sequence
     - Given two sequences α = <a_1 a_2 … a_n> and β = <b_1 b_2 … b_m>
     - α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j_1 < j_2 < … < j_n ≤ m such that a_1 ⊆ b_{j_1}, a_2 ⊆ b_{j_2}, …, a_n ⊆ b_{j_n}
     - β is then a super sequence of α
     - Example with β = <a(abc)(ac)d(cf)>
       - subsequences of β: α_1 = <aa(ac)d(c)>, α_2 = <(ac)(ac)d(cf)>, α_3 = <ac>
       - not subsequences of β: α_4 = <df(cf)>, α_5 = <(cf)d>, α_6 = <(abc)dcf>

     Sequence Support Count
     - A sequence database is a set of tuples <sid, s>
     - A tuple <sid, s> is said to contain a sequence α if α is a subsequence of s, i.e., α ⊆ s
     - The support of a sequence α is the number of tuples containing α

       id   Sequence
       10   <a(abc)(ac)d(cf)>
       20   <(ad)c(bc)(ae)>
       30   <(ef)(ab)(df)cb>
       40   <eg(af)cbc>

     - α_1 = <a>: support(α_1) = 4
     - α_2 = <ac>: support(α_2) = 4
     - α_3 = <(ab)c>: support(α_3) = 2

     Strategies
     - Apriori-property based
       - AprioriSome (1995)
       - AprioriAll (1995)
       - DynamicSome (1995)
       - GSP (1996)
     - Regular expression constraints
       - SPIRIT (1999)
     - Data projection based
       - FreeSpan (2000)
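The subsequence and support definitions can be checked mechanically. The sketch below is one possible reading of them, assuming single-character items; the names `parse`, `is_subsequence`, `support` and `DATABASE` are hypothetical helpers introduced here, not something the slides define.

```python
def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a list of frozensets (single-character items assumed)."""
    body = text.strip()[1:-1]
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            end = body.index(")", i)
            elements.append(frozenset(body[i + 1:end]))
            i = end + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def is_subsequence(alpha, beta):
    """True if each element of alpha is contained, in order, in a distinct element of beta."""
    j = 0
    for element in alpha:
        # Greedily take the earliest remaining element of beta that contains this one.
        while j < len(beta) and not element <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support(alpha, database):
    """Number of tuples <sid, s> whose sequence s contains alpha."""
    return sum(is_subsequence(alpha, s) for s in database.values())


DATABASE = {
    10: parse("<a(abc)(ac)d(cf)>"),
    20: parse("<(ad)c(bc)(ae)>"),
    30: parse("<(ef)(ab)(df)cb>"),
    40: parse("<eg(af)cbc>"),
}

for text in ("<a>", "<ac>", "<(ab)c>"):
    print(text, support(parse(text), DATABASE))  # supports 4, 4 and 2

beta = DATABASE[10]
for text in ("<aa(ac)d(c)>", "<(ac)(ac)d(cf)>", "<ac>", "<df(cf)>", "<(cf)d>", "<(abc)dcf>"):
    print(text, is_subsequence(parse(text), beta))
```

Greedy earliest matching is sufficient here: if any valid assignment of positions j_1 < … < j_n exists, choosing the earliest possible position for each element never rules out a later match.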

  3. Motivation and Background
     - Shortcomings of Apriori-like approaches
       - Potentially huge set of candidate sequences
       - Multiple scans of databases
       - Difficulty mining long sequential patterns
     - FreeSpan (Frequent pattern-projected Sequential pattern mining): a pattern growth method
       - The general idea is to use frequent items to recursively project sequence databases into smaller projected databases and to grow subsequence fragments in each projected database
     - PrefixSpan (Prefix-projected Sequential pattern mining)
       - Fewer projections and quickly shrinking sequences

     Prefix
     - Given two sequences α = <a_1 a_2 … a_n> and β = <b_1 b_2 … b_m>, m ≤ n
     - Sequence β is called a prefix of α if and only if:
       - b_i = a_i for i ≤ m-1;
       - b_m ⊆ a_m;
       - all the items in (a_m - b_m) are alphabetically after those in b_m
     - Example with α = <a(abc)(ac)d(cf)>
       - β = <a(abc)a> is a prefix of α
       - β = <a(abc)c> is not a prefix of α (the leftover item a is alphabetically before c)

     Projection
     - Given sequences α and β such that β is a subsequence of α
     - A subsequence α' of sequence α is called a projection of α w.r.t. prefix β if and only if:
       - α' has prefix β;
       - there exists no proper super-sequence α'' of α' such that α'' is a subsequence of α and also has prefix β
     - Example: α = <a(abc)(ac)d(cf)>, β = <(bc)a>, α' = <(bc)(ac)d(cf)>

     Postfix
     - Let α' = <a_1 a_2 … a_n> be the projection of α w.r.t. prefix β = <a_1 a_2 … a_{m-1} a'_m> (m ≤ n)
     - Sequence γ = <a''_m a_{m+1} … a_n> is called the postfix of α w.r.t. prefix β, denoted γ = α / β, where a''_m = (a_m - a'_m)
     - We also denote α = β⋅γ
     - Example: α' = <a(abc)(ac)d(cf)>, β = <a(abc)a>, γ = <(_c)d(cf)>
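The prefix, projection and postfix definitions can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: `postfix` uses the leftmost match of β in α, and all helper names (`parse`, `show`, `is_prefix`, `postfix`) as well as the single-character-item assumption are ours.

```python
def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a list of frozensets (single-character items assumed)."""
    body = text.strip()[1:-1]
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            end = body.index(")", i)
            elements.append(frozenset(body[i + 1:end]))
            i = end + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def show(elements, partial=None):
    """Render elements in slide notation; 'partial' becomes a leading (_x) element."""
    def one(e):
        items = "".join(sorted(e))
        return items if len(items) == 1 else "(" + items + ")"
    head = "(_" + "".join(sorted(partial)) + ")" if partial else ""
    return "<" + head + "".join(one(e) for e in elements) + ">"


def is_prefix(beta, alpha):
    """The three prefix conditions from the slide."""
    m = len(beta)
    if m == 0:
        return True  # assumption: the empty sequence is a prefix of everything
    if m > len(alpha) or beta[:m - 1] != alpha[:m - 1]:
        return False
    b_m, a_m = beta[-1], alpha[m - 1]
    return b_m <= a_m and all(item > max(b_m) for item in a_m - b_m)


def postfix(alpha, beta):
    """Return (leftover items of the last matched element, remaining elements).

    Simplification: beta (non-empty) is matched at its leftmost occurrence in alpha.
    Returns None if beta does not occur in alpha.
    """
    j = -1
    for element in beta:
        j += 1
        while j < len(alpha) and not element <= alpha[j]:
            j += 1
        if j == len(alpha):
            return None
    leftover = frozenset(x for x in alpha[j] if x > max(beta[-1]))
    return leftover, alpha[j + 1:]


alpha = parse("<a(abc)(ac)d(cf)>")
print(is_prefix(parse("<a(abc)a>"), alpha))  # True
print(is_prefix(parse("<a(abc)c>"), alpha))  # False

leftover, tail = postfix(alpha, parse("<a(abc)a>"))
print(show(tail, leftover))                  # <(_c)d(cf)>

beta = parse("<(bc)a>")
leftover, tail = postfix(alpha, beta)
projection = beta[:-1] + [beta[-1] | leftover] + tail  # alpha' = beta . gamma
print(show(projection))                      # <(bc)(ac)d(cf)>
```

The pair returned by `postfix` keeps the `(_c)`-style leftover separate from the remaining elements, which mirrors the way the slides write postfixes.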

  4. PrefixSpan - Algorithm
     - Input: a sequence database S and the minimum support threshold min_sup
     - Output: the complete set of sequential patterns
     - Method: call PrefixSpan(<>, 0, S)
     - Subroutine PrefixSpan(α, l, S|α)
     - Parameters:
       - α: a sequential pattern;
       - l: the length of α;
       - S|α: the α-projected database if α ≠ <>; otherwise, the sequence database S

     PrefixSpan - Algorithm (2)
     - Method
       1. Scan S|α once and find the set of frequent items b such that:
          a) b can be assembled to the last element of α to form a sequential pattern; or
          b) <b> can be appended to α to form a sequential pattern.
       2. For each frequent item b, append it to α to form a sequential pattern α', and output α'.
       3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').

     PrefixSpan - Example

       id   Sequence
       10   <a(abc)(ac)d(cf)>
       20   <(ad)c(bc)(ae)>
       30   <(ef)(ab)(df)cb>
       40   <eg(af)cbc>

       min_support = 2

     - Step 1. Find length-1 sequential patterns

       item    <a>  <b>  <c>  <d>  <e>  <f>  <g>
       count    4    4    4    3    3    3    1

     - Step 2. Divide the search space by prefix

       Prefix   Projected database
       <a>      <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
       <b>      <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
       <c>      <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
       <d>      <(cf)>, <c(bc)(ae)>, <(_f)cb>
       <e>      <(_f)(ab)(df)cb>, <(af)cbc>
       <f>      <(ab)(df)cb>, <cbc>

     PrefixSpan - Example (2)
     - Step 3. Find subsets of sequential patterns, shown for prefix <d>
       - <d>-projected database: <(cf)>, <c(bc)(ae)>, <(_f)cb>

         item    <a>  <b>  <c>  <d>  <e>  <(_e)>  <f>  <(_f)>
         count    1    2    3    0    1     0      1     1

       - Frequent extensions: <db> and <dc>
       - <db>-projected database: <(_c)>
       - <dc>-projected database: <(bc)>, <b>, with counts <b> 2 and <c> 1, giving <dcb>
       - <dcb>-projected database: <>
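The three method steps can be sketched as a short recursive function. This is a hedged illustration rather than the authors' code: instead of materialising postfixes it stores, per sequence, two match positions (closer in spirit to pseudo-projection), and the names `parse`, `show`, `prefixspan` and `DATABASE` are hypothetical. Items are assumed to be single characters.

```python
from collections import defaultdict


def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a list of frozensets (single-character items assumed)."""
    body = text.strip()[1:-1]
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            end = body.index(")", i)
            elements.append(frozenset(body[i + 1:end]))
            i = end + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def show(pattern):
    """Render a pattern (tuple of item tuples) in slide notation."""
    return "<" + "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in pattern) + ">"


def prefixspan(database, min_support):
    """Return {pattern: support}; a pattern is a tuple of item tuples."""
    patterns = {}

    def grow(pattern, entries):
        # Each entry is (sequence, prev_end, end): the earliest positions at which
        # pattern[:-1] and the whole pattern finish matching (-1 if nothing matched yet).
        last = frozenset(pattern[-1]) if pattern else None
        appended = defaultdict(list)   # case (b): <b> starts a new element
        assembled = defaultdict(list)  # case (a): b joins the last element
        for seq, prev_end, end in entries:
            seen = set()
            for pos in range(end + 1, len(seq)):
                for item in seq[pos] - seen:
                    seen.add(item)
                    appended[item].append((seq, end, pos))
            if last is None:
                continue
            seen = set()
            for pos in range(prev_end + 1, len(seq)):
                if last <= seq[pos]:
                    for item in seq[pos] - seen:
                        if item > max(last):  # keep items inside an element sorted
                            seen.add(item)
                            assembled[item].append((seq, prev_end, pos))
        candidates = [(pattern + ((b,),), e) for b, e in appended.items()]
        candidates += [(pattern[:-1] + (pattern[-1] + (b,),), e) for b, e in assembled.items()]
        for grown, projected in candidates:
            if len(projected) >= min_support:  # one entry per supporting sequence
                patterns[grown] = len(projected)
                grow(grown, projected)

    grow((), [(seq, -1, -1) for seq in database])
    return patterns


DATABASE = [parse(s) for s in
            ("<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>")]
found = prefixspan(DATABASE, 2)
print(len(found))  # 53
print(sorted((show(p), n) for p, n in found.items() if p[0] == ("d",)))
```

On the four-sequence example with min_support = 2 this sketch reproduces the 53 patterns listed on the solution slide; the patterns with prefix <d> come out as <d>, <db>, <dc> and <dcb>.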

  5. PrefixSpan - characteristics
     - No candidate sequence needs to be generated by PrefixSpan
     - Projected databases keep shrinking
     - The major cost of PrefixSpan is the construction of projected databases
     - How to reduce this cost? Different projection methods:
       - Bi-level projection: reduces the number and the size of projected databases
       - Pseudo-projection: reduces the cost of projection when the projected database can be held in main memory

     Bi-level Projection

       id   Sequence
       10   <a(abc)(ac)d(cf)>
       20   <(ad)c(bc)(ae)>
       30   <(ef)(ab)(df)cb>
       40   <eg(af)cbc>

       min_support = 2

     - Scan to get the length-1 sequences
     - Construct a triangular matrix (S-matrix) instead of projected databases for each length-1 pattern; it yields ALL length-2 sequential patterns

            a         b         c         d         e         f
       a    2
       b    (4,2,2)   1
       c    (4,2,1)   (3,3,2)   3
       d    (2,1,1)   (2,2,0)   (1,3,0)   0
       e    (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
       f    (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1

     - Reading the entry in row c, column a: Support(<ac>) = 4, Support(<ca>) = 2, Support(<(ac)>) = 1
     - Reading the diagonal entry for c: Support(<cc>) = 3

     Bi-level projection (2)
     - For each length-2 sequential pattern α, construct the α-projected database and find the frequent items
     - Construct the corresponding S-matrix
     - Example for <ab>
       - <ab>-projected database: <(_c)(ac)(cf)>, <(_c)a>, <c>

         item    a   b   c   (_c)   d   (_d)   e   (_e)   f   (_f)
         count   2   0   2    2     0    1     0    0     1    0

       - S-matrix for the <ab>-projected database:

                 a         c         (_c)
         a       0
         c       (1,0,1)   1
         (_c)    (φ,2,φ)   (φ,1,φ)   φ

       - Resulting patterns: <aba>, <abc>, <a(bc)>, and then <a(bc)a>

     Bi-level projection (3) - optimization
     - "Do we need to include every item in a postfix in the projected databases?"
     - NO! Items are pruned from projected databases by 3-way Apriori checking
       - <ac> is not frequent: c can be excluded from construction of the <ab>-projected database, since any super-sequence of <ac> can never be a sequential pattern
       - <a(bd)> is not frequent: to construct the <a(bc)>-projected database, sequence <a(bcde)df> should be projected to <(_e)df> instead of <(_de)df>
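The S-matrix on the bi-level projection slide can be reproduced by brute force. The sketch below counts each length-2 candidate directly with a subsequence test, which is simpler than (and not the same as) the single-scan construction the slides describe; `parse`, `is_subsequence`, `support` and `s_matrix` are hypothetical helper names, and items are assumed to be single characters.

```python
from itertools import combinations


def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a list of frozensets (single-character items assumed)."""
    body = text.strip()[1:-1]
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            end = body.index(")", i)
            elements.append(frozenset(body[i + 1:end]))
            i = end + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def is_subsequence(alpha, beta):
    """Greedy check that alpha's elements map, in order, into elements of beta."""
    j = 0
    for element in alpha:
        while j < len(beta) and not element <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support(alpha, database):
    return sum(is_subsequence(alpha, s) for s in database)


def s_matrix(database, items):
    """Lower-triangular S-matrix as a dict keyed by (row, column).

    matrix[x, x] = support(<xx>)
    matrix[y, x] = (support(<xy>), support(<yx>), support(<(xy)>)) for x < y
    """
    matrix = {}
    for x in items:
        matrix[x, x] = support([frozenset(x)] * 2, database)
    for x, y in combinations(sorted(items), 2):
        fx, fy = frozenset(x), frozenset(y)
        matrix[y, x] = (support([fx, fy], database),
                        support([fy, fx], database),
                        support([fx | fy], database))
    return matrix


DATABASE = [parse(s) for s in
            ("<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>")]
FREQUENT_ITEMS = "abcdef"  # the length-1 patterns for min_support = 2

matrix = s_matrix(DATABASE, FREQUENT_ITEMS)
for row in FREQUENT_ITEMS:
    print(row, [matrix[row, col] for col in FREQUENT_ITEMS if col <= row])
```

Any entry whose count reaches min_support is a length-2 sequential pattern; for example `matrix["c", "a"]` is `(4, 2, 1)`, so <ac> and <ca> are frequent while <(ac)> is not.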
