SLIDE 1

Not Every Pattern Is Interesting!

  • Trivial patterns

– Pregnant → Female, 100% confidence

  • Misleading patterns

– Play basketball → eat cereal [support 40%, confidence 66.7%]

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4) 1

            Basketball  Not basketball  Sum (row)
Cereal            2000            1750       3750
Not cereal        1000             250       1250
Sum (col.)        3000            2000       5000
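The misleading pattern can be checked directly from these counts: the rule's confidence (66.7%) is below the overall fraction of cereal eaters (75%). A minimal sketch in Python:

```python
# Counts from the contingency table above
n = 5000
n_basketball = 3000        # column sum: plays basketball
n_cereal = 3750            # row sum: eats cereal
n_both = 2000              # plays basketball and eats cereal

support = n_both / n                   # P(basketball, cereal)
confidence = n_both / n_basketball     # P(cereal | basketball)
base_rate = n_cereal / n               # P(cereal) -- higher than the confidence!

print(f"{support:.1%} {confidence:.1%} {base_rate:.1%}")  # 40.0% 66.7% 75.0%
```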

slide-2
SLIDE 2

Evaluation Criteria

  • Objective interestingness measures

– Examples: support, patterns formed by mutually independent items
– Domain independent

  • Subjective measures

– Examples: domain knowledge, templates/constraints


slide-3
SLIDE 3

Jian Pei: CMPT 741/459 Frequent Pattern Mining (4) 3

Correlation and Lift

  • P(B|A)/P(B) is called the lift of rule A → B
  • Play basketball → eat cereal (lift: 0.89)
  • Play basketball → not eat cereal (lift: 1.33)

corr_{A,B} = P(A ∪ B) / ( P(A)P(B) ) = P(AB) / ( P(A)P(B) )

            Basketball  Not basketball  Sum (row)
Cereal            2000            1750       3750
Not cereal        1000             250       1250
Sum (col.)        3000            2000       5000

Contingency table

          B      ¬B
A        f11    f10    f1+
¬A       f01    f00    f0+
         f+1    f+0      N
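The two lift values quoted above can be recomputed from the contingency table; a small sketch:

```python
# lift(A -> B) = P(AB) / (P(A) P(B)), from raw counts and the total n.
def lift(n_ab, n_a, n_b, n):
    return (n_ab / n) / ((n_a / n) * (n_b / n))

n = 5000
print(round(lift(2000, 3000, 3750, n), 2))  # basketball -> cereal: 0.89
print(round(lift(1000, 3000, 1250, n), 2))  # basketball -> not cereal: 1.33
```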

SLIDE 4

Property of Lift

  • If A and B are independent, lift = 1
  • If A and B are positively correlated, lift > 1
  • If A and B are negatively correlated, lift < 1
  • Limitation: lift is sensitive to P(A) and P(B)


{p, q}:          p     ¬p                {r, s}:          r     ¬r
q              880     50    930         s               20     50     70
¬q              50     20     70         ¬s              50    880    930
               930     70   1000                         70    930   1000

lift(p, q) < lift(r, s)!
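The inequality can be verified numerically: {p, q} co-occur 880 times out of 1000 yet get a smaller lift than {r, s}, which co-occur only 20 times, because lift divides by the marginals:

```python
# Lift is sensitive to P(A) and P(B): frequent itemsets get diluted lifts.
def lift(n_ab, n_a, n_b, n):
    return (n_ab / n) / ((n_a / n) * (n_b / n))

lift_pq = lift(880, 930, 930, 1000)
lift_rs = lift(20, 70, 70, 1000)
print(round(lift_pq, 4), round(lift_rs, 4))  # 1.0175 4.0816
assert lift_pq < lift_rs
```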

SLIDE 5

Leverage

  • The difference between the observed and expected joint probability of XY, assuming X and Y are independent
  • An “absolute” measure of the surprisingness of a rule
– Should be used together with lift


leverage(X → Y ) = P(XY ) − P(X)P(Y )
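Applied to the earlier basketball/cereal table, the formula gives a small negative leverage; a minimal sketch:

```python
# Leverage of basketball -> cereal from the earlier 2x2 counts.
n = 5000
p_xy, p_x, p_y = 2000 / n, 3000 / n, 3750 / n

leverage = p_xy - p_x * p_y
print(round(leverage, 4))  # -0.05: negatively correlated
```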

SLIDE 6

Conviction

  • The expected error of a rule
  • Considers not only the joint distribution of X and Y


conv(X → Y) = P(X)P(Ȳ) / P(XȲ) = 1 / lift(X → Ȳ)
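For the basketball/cereal counts the formula works out as follows; a minimal sketch:

```python
# Conviction of basketball -> cereal: the rate of X-without-Y expected under
# independence, relative to the observed rate (= 1 / lift(X -> not-Y)).
n = 5000
p_x, p_not_y, p_x_not_y = 3000 / n, 1250 / n, 1000 / n

conviction = (p_x * p_not_y) / p_x_not_y
print(round(conviction, 2))  # 0.75 (< 1: the rule does worse than independence)
```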

SLIDE 7

Odds Ratio


  • odds(Y | X) = ( P(XY)/P(X) ) / ( P(XȲ)/P(X) ) = P(XY) / P(XȲ)
  • odds(Y | X̄) = ( P(X̄Y)/P(X̄) ) / ( P(X̄Ȳ)/P(X̄) ) = P(X̄Y) / P(X̄Ȳ)
  • oddsratio(X → Y) = odds(Y | X) / odds(Y | X̄) = ( P(XY) · P(X̄Ȳ) ) / ( P(XȲ) · P(X̄Y) )
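On the basketball/cereal table the probabilities cancel, leaving a ratio of raw counts; a minimal sketch:

```python
# Odds ratio from the 2x2 counts; the n's cancel, leaving (f11 f00)/(f10 f01).
f11, f10 = 2000, 1000   # basketball: cereal / not cereal
f01, f00 = 1750, 250    # not basketball: cereal / not cereal

odds_ratio = (f11 * f00) / (f10 * f01)
print(round(odds_ratio, 4))  # 0.2857 (< 1: negative association)
```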

SLIDE 8

χ²

  • Suppose attribute A has c distinct values a1, …, ac, and attribute B has r distinct values b1, …, br
  • The χ² value (the Pearson χ² statistic) is
– oij and eij are the observed and expected frequencies, respectively, of the joint event (ai, bj)


χ² = Σ_{i=1}^{c} Σ_{j=1}^{r} (o_ij − e_ij)² / e_ij

SLIDE 9

Example


            Basketball  Not basketball  Sum (row)
Cereal            2000            1750       3750
Not cereal        1000             250       1250
Sum (col.)        3000            2000       5000

  • The χ² value is greater than 1
  • count(basketball, cereal) = 2000 < expectation (2250) → playing basketball and eating cereal are negatively correlated

χ² = (2000 − 2250)²/2250 + (1750 − 1500)²/1500 + (1000 − 750)²/750 + (250 − 500)²/500 = 277.8
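The same value can be computed generically, deriving the expected counts from the marginals; a minimal sketch:

```python
# Pearson chi-square for the basketball/cereal table: compare observed cell
# counts with the counts expected under independence (row_sum * col_sum / n).
observed = [[2000, 1750],   # cereal:     basketball, not basketball
            [1000, 250]]    # not cereal: basketball, not basketball
n = 5000
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]

chi2 = sum((observed[i][j] - row_sums[i] * col_sums[j] / n) ** 2
           / (row_sums[i] * col_sums[j] / n)
           for i in range(2) for j in range(2))
print(round(chi2, 1))  # 277.8
```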

SLIDE 10

Φ-coefficient

  • –1: if A and B are perfectly negatively correlated
  • 1: if A and B are perfectly positively correlated
  • 0: if A and B are statistically independent
  • Drawback: the Φ-coefficient puts the same weight on co-occurrence and co-absence


φ = ( P(AB)P(ĀB̄) − P(ĀB)P(AB̄) ) / √( P(A)P(B)P(Ā)P(B̄) )
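Evaluated on the basketball (A) and cereal (B) table, the formula gives a moderately negative value; a minimal sketch:

```python
import math

# Phi-coefficient for basketball (A) and cereal (B) from the 2x2 counts.
n = 5000
p_ab, p_a_nb = 2000 / n, 1000 / n      # P(AB), P(A, not B)
p_na_b, p_na_nb = 1750 / n, 250 / n    # P(not A, B), P(not A, not B)
p_a, p_b = 3000 / n, 3750 / n

phi = (p_ab * p_na_nb - p_na_b * p_a_nb) / math.sqrt(
    p_a * p_b * (1 - p_a) * (1 - p_b))
print(round(phi, 4))  # -0.2357 (negative correlation)
```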

SLIDE 11

IS Measure

  • Biased toward frequent co-occurrence
  • Equivalent to cosine similarity for binary variables (bit vectors)
  • Geometric mean of the confidences of the rules between a pair of binary random variables
  • Drawback: the value depends on P(A) and P(B)
– Similar drawback in lift


IS(A, B) = √( lift(A, B) · P(A, B) ) = P(A, B) / √( P(A)P(B) )

IS(A, B) = √( (P(A, B)/P(A)) · (P(A, B)/P(B)) ) = √( conf(A → B) · conf(B → A) )
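Both forms agree numerically on the basketball/cereal counts; a minimal sketch:

```python
import math

# IS (cosine) measure for basketball (A) and cereal (B); equals the
# geometric mean of the two rule confidences.
n = 5000
p_ab, p_a, p_b = 2000 / n, 3000 / n, 3750 / n

is_measure = p_ab / math.sqrt(p_a * p_b)
geo_mean = math.sqrt((p_ab / p_a) * (p_ab / p_b))
print(round(is_measure, 4))  # 0.5963
assert abs(is_measure - geo_mean) < 1e-12
```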

SLIDE 12

More Measures

  • All confidence: min{ P(A|B), P(B|A) }
  • Max confidence: max{ P(A|B), P(B|A) }
  • The Kulczynski measure: ½ (P(A|B) + P(B|A))
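These three measures are one-liners over the two rule confidences; a minimal sketch on the basketball/cereal counts:

```python
# All-confidence, max-confidence, and Kulczynski for basketball/cereal.
n_ab, n_a, n_b = 2000, 3000, 3750
conf_ab = n_ab / n_a    # P(B|A)
conf_ba = n_ab / n_b    # P(A|B)

all_conf = min(conf_ab, conf_ba)
max_conf = max(conf_ab, conf_ba)
kulc = (conf_ab + conf_ba) / 2
print(round(all_conf, 3), round(max_conf, 3), round(kulc, 3))  # 0.533 0.667 0.6
```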


SLIDE 13

Comparing Measures

            Milk    No Milk   Sum (row)
Coffee      m, c    ~m, c     c
No Coffee   m, ~c   ~m, ~c    ~c
Sum (col.)  m       ~m        Σ

Contingency table. Transaction databases and their contingency tables:

Data Set      mc     m~c     ~mc    ~m~c     χ²   lift  all conf.  max conf.  Kulc.  cosine
D1        10,000   1,000   1,000 100,000  90557   9.26       0.91       0.91   0.91    0.91
D2        10,000   1,000   1,000     100      0   1.00       0.91       0.91   0.91    0.91
D3           100   1,000   1,000 100,000    670   8.44       0.09       0.09   0.09    0.09
D4         1,000   1,000   1,000 100,000  24740  25.75       0.5        0.5    0.5     0.5
D5         1,000     100  10,000 100,000   8173   9.18       0.09       0.91   0.5     0.29
D6         1,000      10 100,000 100,000    965   1.97       0.01       0.99   0.5     0.10

χ² and lift do not perform well on those data sets, since they are sensitive to ~m~c
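The sensitivity is easy to reproduce: D1 and D2 above differ only in the null-transaction count ~m~c, yet lift swings from 9.26 to 1.0 while the null-invariant Kulczynski measure stays put. A small sketch:

```python
# Recompute lift and Kulczynski for D1 and D2: only ~m~c differs.
def measures(mc, m_nc, nm_c, nm_nc):
    n = mc + m_nc + nm_c + nm_nc
    p_mc, p_m, p_c = mc / n, (mc + m_nc) / n, (mc + nm_c) / n
    lift = p_mc / (p_m * p_c)
    kulc = (mc / (mc + m_nc) + mc / (mc + nm_c)) / 2
    return round(lift, 2), round(kulc, 2)

print(measures(10_000, 1_000, 1_000, 100_000))  # D1: (9.26, 0.91)
print(measures(10_000, 1_000, 1_000, 100))      # D2: (1.0, 0.91)
```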

SLIDE 14

Imbalance Ratio

  • Assesses the imbalance of two itemsets A and B in rule implications


IR(A, B) = |P(A) − P(B)| / ( P(A) + P(B) − P(A ∪ B) )
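On the basketball/cereal counts (reading P(A ∪ B) as the joint support of the combined itemset, as in the slide's notation), the ratio is small; a minimal sketch:

```python
# Imbalance ratio for basketball (A) and cereal (B).
p_a, p_b, p_ab = 0.6, 0.75, 0.4   # P(A), P(B), joint support

ir = abs(p_a - p_b) / (p_a + p_b - p_ab)
print(round(ir, 4))  # 0.1579: the two itemsets are fairly balanced
```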

SLIDE 15

Properties of Measures

  • Symmetry: is M(A → B) = M(B → A)?
  • Null-transaction dependence (null addition invariance): is ~A~B used in the measure?
  • Inversion invariance: the value does not change if f11 and f10 are exchanged with f00 and f01, respectively
  • Scaling invariance: does the measure remain unchanged if the contingency table [f11, f10, f01, f00] is changed to [k1k3f11, k2k3f10, k1k4f01, k2k4f00]?


SLIDE 16

Measuring 3 Random Variables

  • 3-dimensional contingency table
  • For a k-itemset {i1, i2, …, ik}, the condition for statistical independence is


c:           b       ¬b                ¬c:          b       ¬b
a          f111     f101    f1+1       a          f110     f100    f1+0
¬a         f011     f001    f0+1       ¬a         f010     f000    f0+0
           f+11     f+01    f++1                  f+10     f+00    f++0

f_{i1 i2 ··· ik} = ( f_{i1 + ··· +} · f_{+ i2 ··· +} ··· f_{+ + ··· ik} ) / N^{k−1}

SLIDE 17

Measuring More Random Variables

  • Some measures, such as lift and statistical independence, can be extended


I = ( N^{k−1} · f_{i1 i2 ··· ik} ) / ( f_{i1 + ··· +} · f_{+ i2 ··· +} ··· f_{+ ··· + ik} )

PS = f_{i1 i2 ··· ik} / N − ( f_{i1 + ··· +} · f_{+ i2 ··· +} ··· f_{+ ··· + ik} ) / N^k

SLIDE 18

Simpson’s Paradox

  • A trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data
– Also known as the Yule–Simpson effect
– Often encountered in social-science and medical-science statistics
– Particularly confounding when frequency data are unduly given causal interpretations


SLIDE 19

Kidney Stone Treatment Example

  • Which treatment, A or B, is better?


              Treatment A        Treatment B
Small stones  G1: 81/87 = 93%    G2: 234/270 = 87%
Large stones  G3: 192/263 = 73%  G4: 55/80 = 69%
Overall       273/350 = 78%      289/350 = 83%
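The paradox in this table can be checked in exact arithmetic: Treatment A wins within each stone-size group, yet Treatment B wins in the aggregate. A minimal sketch:

```python
from fractions import Fraction

# Kidney-stone data from the table above, in exact fractions.
a_small, b_small = Fraction(81, 87), Fraction(234, 270)
a_large, b_large = Fraction(192, 263), Fraction(55, 80)
a_overall = Fraction(81 + 192, 87 + 263)    # 273/350
b_overall = Fraction(234 + 55, 270 + 80)    # 289/350

assert a_small > b_small and a_large > b_large   # A better in each group
assert a_overall < b_overall                     # B better in aggregate
print(round(float(a_overall), 3), round(float(b_overall), 3))  # 0.78 0.826
```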

SLIDE 20

Berkeley Gender Bias Case


Overall:
         Applicants  Admitted
Men            8442       44%
Women          4321       35%

By department:
Department  Men: Applicants  Admitted   Women: Applicants  Admitted
A                      825       62%                  108       82%
B                      560       63%                   25       68%
C                      325       37%                  593       34%
D                      417       33%                  375       35%
E                      191       28%                  393       24%
F                      272        6%                  341        7%

SLIDE 21

Fisher Exact Test

  • Directly test whether a rule X → Y is productive by comparing its confidence with those of its generalizations W → Y, where W is a subset of X
– Let X = W ∪ Z


W        Y      Not Y
Z        a      b        a + b
Not Z    c      d        c + d
         a + c  b + d    sup(W)

a = sup(WZY) = sup(XY),  b = sup(WZȲ) = sup(XȲ)
c = sup(WZ̄Y),  d = sup(WZ̄Ȳ)

SLIDE 22

Marginals

  • Row marginals
  • Column marginals


W        Y      Not Y
Z        a      b        a + b
Not Z    c      d        c + d
         a + c  b + d    sup(W)

a + b = sup(WZ) = sup(X),  c + d = sup(WZ̄)
a + c = sup(WY),  b + d = sup(WȲ)

  • oddsratio = ( (a/(a+b)) / (b/(a+b)) ) / ( (c/(c+d)) / (d/(c+d)) ) = ad / bc

SLIDE 23

Hypothesis

  • H0: Z and Y are independent given W
– X → Y is not productive given W → Y

  • If Z and Y are independent, then


a = (a + b)(a + c)/n,  b = (a + b)(b + d)/n

W        Y      Not Y
Z        a      b        a + b
Not Z    c      d        c + d
         a + c  b + d    sup(W)

c = (c + d)(a + c)/n,  d = (c + d)(b + d)/n

  • oddsratio = ad/bc = 1

SLIDE 24

Relation between a and b, c, d

  • Assumption: the row and column marginals are fixed
  • The value of a uniquely determines b, c, and d


W        Y      Not Y
Z        a      b        a + b
Not Z    c      d        c + d
         a + c  b + d    sup(W)

SLIDE 25

Probability Mass Function of a

  • The probability mass function of observing the value of a in the contingency table is given by the hypergeometric distribution
– The probability of choosing s successes in t trials using sampling without replacement from a finite population of size T that has S successes in total


P(s | t, S, T) = [ (S choose s) · (T − S choose t − s) ] / (T choose t)

SLIDE 26

Probability Mass Function of a

  • An occurrence of Z – a success
  • T = sup(W) = n
  • W always occurs, so the total number of successes = sup(Z|W) → S = a + b, t = a + c


P(a | a + c, a + b, n) = [ (a+b choose a) · (n−(a+b) choose (a+c)−a) ] / (n choose a+c)
                       = [ (a+b choose a) · (c+d choose c) ] / (n choose a+c)
                       = (a + b)! (c + d)! (a + c)! (b + d)! / ( n! a! b! c! d! )
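Both forms of this pmf agree numerically; a minimal sketch, using a toy table (a, b, c, d) = (8, 2, 3, 7) assumed purely for illustration:

```python
from math import comb, factorial

# The slide's pmf in binomial-coefficient and factorial form.
def p_table(a, b, c, d):
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def p_table_fact(a, b, c, d):
    n = a + b + c + d
    num = (factorial(a + b) * factorial(c + d)
           * factorial(a + c) * factorial(b + d))
    den = (factorial(n) * factorial(a) * factorial(b)
           * factorial(c) * factorial(d))
    return num / den

print(p_table(8, 2, 3, 7))
assert abs(p_table(8, 2, 3, 7) - p_table_fact(8, 2, 3, 7)) < 1e-12
```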

SLIDE 27

Calculating the p-value

  • Assuming that the null hypothesis is true, the p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed
  • If the p-value is very small (e.g., 0.01), the null hypothesis can be rejected


p-value(a) = Σ_{i=0}^{min(b,c)} P(a + i | a + c, a + b, n)
           = Σ_{i=0}^{min(b,c)} (a + b)! (c + d)! (a + c)! (b + d)! / ( n! (a + i)! (b − i)! (c − i)! (d + i)! )
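The sum above walks through the tables at least as extreme as the observed a, with all marginals fixed. A minimal sketch, on the same toy counts (8, 2, 3, 7) assumed for illustration:

```python
from math import comb

# One-sided Fisher exact p-value, following the slide's sum.
def fisher_p_value(a, b, c, d):
    n = a + b + c + d
    p = 0.0
    for i in range(min(b, c) + 1):
        ai, bi, ci, di = a + i, b - i, c - i, d + i   # marginals preserved
        p += comb(ai + bi, ai) * comb(ci + di, ci) / comb(n, ai + ci)
    return p

p = fisher_p_value(8, 2, 3, 7)
print(round(p, 4))  # 0.0349
```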

SLIDE 28

Permutation (Randomization) Test

  • Determine the distribution of a given test statistic by randomly modifying the observed data several times to obtain a random sample of data sets
– The modified data sets are used for significance testing
  • Compute the empirical probability mass function (EPMF)
  • Generate the empirical cumulative distribution function


SLIDE 29

Compute p-value on Statistics

  • The empirical cumulative distribution function


F̂(x) = P̂(Θ ≤ x) = (1/k) Σ_{i=1}^{k} I(θ_i ≤ x)

p-value(θ) = 1 − F̂(θ)
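The empirical p-value is just the fraction of randomized statistics exceeding the observed one. A minimal sketch, with Gaussian draws standing in for the statistics computed on randomized data sets:

```python
import random

# Empirical p-value: p = 1 - F_hat(theta_obs).
def empirical_p_value(theta_obs, theta_samples):
    return sum(t > theta_obs for t in theta_samples) / len(theta_samples)

rng = random.Random(0)
samples = [rng.gauss(0, 1) for _ in range(10_000)]  # stand-in null statistics
print(empirical_p_value(1.96, samples))  # close to the Gaussian tail, 0.025
```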

SLIDE 30

Swap Randomization

  • In a permutation test, what characteristics should be preserved by the permutation?
  • Swap randomization keeps the column and row marginals invariant
– The support of each item does not change
– The length of each transaction does not change
  • Swap two items in two transactions
  • Conduct a certain number of swaps to make up a new data set
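The swap step can be sketched as follows: move an item found only in the first transaction to the second, and vice versa, which leaves both item supports and transaction lengths unchanged. A minimal sketch on a toy database:

```python
import random

# Swap randomization: repeated item swaps that preserve all marginals.
def swap_randomize(db, n_swaps, rng):
    db = [set(t) for t in db]          # work on a copy
    for _ in range(n_swaps):
        t1, t2 = rng.sample(range(len(db)), 2)
        only1 = sorted(db[t1] - db[t2])   # items in t1 but not t2
        only2 = sorted(db[t2] - db[t1])   # items in t2 but not t1
        if not only1 or not only2:
            continue
        i, j = rng.choice(only1), rng.choice(only2)
        db[t1].remove(i); db[t1].add(j)
        db[t2].remove(j); db[t2].add(i)
    return db

rng = random.Random(42)
db = [{"a", "b", "c"}, {"b", "d"}, {"a", "d"}, {"c", "d"}]
new_db = swap_randomize(db, 100, rng)
print(sorted(sorted(t) for t in new_db))
```

Each transaction keeps its length and each item keeps its support, so any pattern-mining statistic recomputed on `new_db` is a draw from the null model with fixed marginals.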


SLIDE 31

Bootstrap Sampling

  • A transaction database is just a sample from a larger population
– What is the frequency (or range of possible frequencies) of X in the underlying population?
  • Given a test assessment statistic θ, how can we infer the confidence interval for the possible values of θ at a desired confidence level α?
  • Bootstrap sampling: sampling with replacement


SLIDE 32

Calculating Statistic Range


F̂(x) = P̂(Θ ≤ x) = (1/k) Σ_{i=1}^{k} I(θ_i ≤ x)

Let v_(1−α)/2 = F̂⁻¹((1 − α)/2) and v_(1+α)/2 = F̂⁻¹((1 + α)/2). Then

P(Θ ∈ [v_(1−α)/2, v_(1+α)/2]) = F̂(v_(1+α)/2) − F̂(v_(1−α)/2) = (1 + α)/2 − (1 − α)/2 = α

Thus, the α confidence interval for the test statistic Θ is [v_(1−α)/2, v_(1+α)/2].
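The derivation above translates into a few lines of code: resample with replacement, recompute the statistic, and read off the two empirical quantiles. A minimal sketch for the support of an itemset, on toy data assumed for illustration:

```python
import random

# Bootstrap confidence interval for the support of an itemset.
def bootstrap_ci(db, itemset, k=2000, alpha=0.90, rng=None):
    rng = rng or random.Random()
    itemset = set(itemset)
    stats = sorted(
        sum(itemset <= t for t in rng.choices(db, k=len(db))) / len(db)
        for _ in range(k))                        # sampling with replacement
    lo = stats[int((1 - alpha) / 2 * k)]          # (1-alpha)/2 quantile
    hi = stats[min(int((1 + alpha) / 2 * k), k - 1)]
    return lo, hi

db = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"b"}, {"a", "b"}] * 20
lo, hi = bootstrap_ci(db, {"a", "b"}, rng=random.Random(1))
print(lo, hi)  # an interval around the observed support 0.6
```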

SLIDE 33

From Itemsets to Sequences

  • Itemsets: combinations of items, no temporal order
  • Temporal order is important in many situations
– Time-series databases and sequence databases
– Frequent patterns → (frequent) sequential patterns
  • Applications of sequential pattern mining
– Customer shopping sequences:
  • First buy a computer, then an iPod, and then a digital camera, within 3 months
– Medical treatment, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures

SLIDE 34

What Is Sequential Pattern Mining?

  • Given a set of sequences, find the complete set of frequent subsequences

A sequence database:

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>. An element may contain a set of items; items within an element are unordered and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
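The subsequence relation used here can be checked with a greedy scan: each element of the candidate must map, in order, to a superset element of the sequence. A minimal sketch:

```python
# Subsequence test for sequences of elements (itemsets).
def is_subsequence(s, t):
    i = 0
    for element in t:
        if i < len(s) and s[i] <= element:   # s[i] is a subset of element
            i += 1
    return i == len(s)

seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                        # <a(bc)dc>
print(is_subsequence(sub, seq))  # True
```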

SLIDE 35

Challenges in Seq Pat Mining

  • A huge number of possible sequential patterns are hidden in databases
  • A mining algorithm should
– Find the complete set of patterns satisfying the minimum support (frequency) threshold
– Be highly efficient and scalable, involving only a small number of database scans
– Be able to incorporate various kinds of user-specific constraints

SLIDE 36

Apriori Property of Seq Patterns

  • Apriori property in sequential patterns
– If a sequence S is infrequent, then none of the super-sequences of S is frequent
– E.g., <hb> is infrequent → so are <hab> and <(ah)b>

Given support threshold min_sup = 2

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>

SLIDE 37

GSP

  • GSP (Generalized Sequential Pattern) mining
  • Outline of the method
– Initially, every item in DB is a candidate of length 1
– For each level (i.e., sequences of length k) do
  • Scan database to collect support count for each candidate sequence
  • Generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
– Repeat until no frequent sequence or no candidate can be found
  • Major strength: candidate pruning by Apriori

SLIDE 38

Finding Len-1 Seq Patterns

  • Initial candidates
– <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once
– Count support for candidates

min_sup = 2

Cand  Sup
<a>     3
<b>     5
<c>     4
<d>     3
<e>     3
<f>     2
<g>     1
<h>     1

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
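The support column above can be reproduced by counting, for each item, the number of sequences containing it in any element. A minimal sketch:

```python
from collections import Counter

# Item supports over the slide's sequence database (one count per sequence).
db = [
    [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],                      # 10
    [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],                 # 20
    [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],               # 30
    [{"b", "e"}, {"c", "e"}, {"d"}],                             # 40
    [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],   # 50
]
support = Counter()
for seq in db:
    for item in set().union(*seq):
        support[item] += 1

min_sup = 2
patterns = sorted(i for i, s in support.items() if s >= min_sup)
print(patterns)       # ['a', 'b', 'c', 'd', 'e', 'f']
print(support["b"])   # 5
```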

SLIDE 39

Generating Length-2 Candidates

Sequence-extension candidates (36):
        <a>   <b>   <c>   <d>   <e>   <f>
<a>    <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>    <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>    <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>    <da>  <db>  <dc>  <dd>  <de>  <df>
<e>    <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>    <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

Element candidates (15):
        <a>     <b>     <c>     <d>     <e>     <f>
<a>            <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                    <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                            <(cd)>  <(ce)>  <(cf)>
<d>                                    <(de)>  <(df)>
<e>                                            <(ef)>
<f>

51 length-2 candidates

Without the Apriori property, 8×8 + 8×7/2 = 92 candidates

Apriori prunes 44.57% of the candidates
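The 51 = 36 + 15 count falls out of generating ordered pairs plus unordered element pairs over the six frequent items. A minimal sketch:

```python
from itertools import combinations, product

# Length-2 candidates from the six frequent items.
freq_items = ["a", "b", "c", "d", "e", "f"]
seq_cands = [f"<{x}{y}>" for x, y in product(freq_items, repeat=2)]
elem_cands = [f"<({x}{y})>" for x, y in combinations(freq_items, 2)]

print(len(seq_cands) + len(elem_cands))  # 51 = 36 + 15
```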

SLIDE 40

Finding Len-2 Seq Patterns

  • Scan database one more time, collect support count for each length-2 candidate
  • There are 19 length-2 candidates which pass the minimum support threshold
– They are length-2 sequential patterns

SLIDE 41

Generating Length-3 Candidates and Finding Length-3 Patterns

  • Generate length-3 candidates
– Self-join length-2 sequential patterns
  • <ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
  • <(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate
– 46 candidates are generated
  • Find length-3 sequential patterns
– Scan database once more, collect support counts for candidates
– 19 out of 46 candidates pass the support threshold

SLIDE 42

The GSP Mining Process

1st scan: 8 cand. → 6 length-1 seq. pat.  (<a> <b> <c> <d> <e> <f> <g> <h>)
2nd scan: 51 cand. → 19 length-2 seq. pat.; 10 cand. not in DB at all  (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>)
3rd scan: 46 cand. → 19 length-3 seq. pat.; 20 cand. not in DB at all  (<abb> <aab> <aba> <baa> <bab> …)
4th scan: 8 cand. → 6 length-4 seq. pat.  (<abba> <(bd)bc> …)
5th scan: 1 cand. → 1 length-5 seq. pat.  (<(bd)cba>)

The remaining candidates either cannot pass the support threshold or do not appear in DB at all.

min_sup = 2

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>

SLIDE 43

The GSP Algorithm

  • Take sequences of the form <x> as length-1 candidates
  • Scan database once, find F1, the set of length-1 sequential patterns
  • Let k = 1; while Fk is not empty do
– Form Ck+1, the set of length-(k+1) candidates, from Fk
– If Ck+1 is not empty, scan database once, find Fk+1, the set of length-(k+1) sequential patterns
– Let k = k + 1
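The loop above can be sketched in a few lines for the simplified case where every element is a single item (element-type candidates omitted), on a toy database assumed for illustration:

```python
from itertools import product

def is_subseq(s, t):
    """Is string s a subsequence of string t?"""
    it = iter(t)
    return all(ch in it for ch in s)

def gsp(db, min_sup):
    sup = lambda p: sum(is_subseq(p, t) for t in db)
    freq = {}
    k_patterns = [i for i in sorted({c for t in db for c in t})
                  if sup(i) >= min_sup]
    while k_patterns:
        freq.update({p: sup(p) for p in k_patterns})
        # Apriori join: p and q overlap on all but their end items
        cands = {p + q[-1] for p, q in product(k_patterns, repeat=2)
                 if p[1:] == q[:-1]}
        k_patterns = [c for c in cands if sup(c) >= min_sup]
    return freq

db = ["abcb", "bcab", "abc", "bca"]   # toy single-item-element sequences
patterns = gsp(db, min_sup=3)
print(sorted(p for p in patterns if len(p) >= 2))  # ['ab', 'bc']
```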

SLIDE 44

Bottlenecks of GSP

  • A huge set of candidates
– 1,000 frequent length-1 sequences generate 1,499,500 length-2 candidates!
  • Multiple scans of database in mining
  • Real challenge: mining long sequential patterns
– An exponential number of short candidates
– A length-100 sequential pattern needs ~10^30 candidate sequences!

1000 × 1000 + (1000 × 999)/2 = 1,499,500

Σ_{i=1}^{100} (100 choose i) = 2^100 − 1 ≈ 10^30
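Both counts are quick to verify; a minimal sketch:

```python
from math import comb

# Checking the candidate-count arithmetic above.
len2 = 1000 * 1000 + 1000 * 999 // 2            # length-2 candidates
total = sum(comb(100, i) for i in range(1, 101))

print(len2)                 # 1499500
print(total == 2**100 - 1)  # True: about 1.3e30 candidate sequences
```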

SLIDE 45

FreeSpan: Freq Pat-projected Sequential Pattern Mining

  • The itemset of a sequential pattern must be frequent
– Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
– Mine each projected database to find its patterns

f_list: b:5, c:4, a:3, d:3, e:3, f:2

All sequential patterns can be divided into 6 subsets:
  • Seq. pat. containing item f
  • Those containing e but no f
  • Those containing d but no e nor f
  • Those containing a but no d, e or f
  • Those containing c but no a, d, e or f
  • Those containing only item b

Sequence Database SDB
<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>

SLIDE 46

From FreeSpan to PrefixSpan

  • FreeSpan:
– Projection-based: no candidate sequence needs to be generated
– But projection can be performed at any point in a sequence, and the projected sequences may not shrink much
  • PrefixSpan:
– Projection-based
– But only prefix-based projection: fewer projections and quickly shrinking sequences
SLIDE 47

Prefix and Suffix (Projection)

  • <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
  • Given sequence <a(abc)(ac)d(cf)>:

Prefix  Suffix (Prefix-Based Projection)
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
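The table rows can be reproduced with a small projection helper for the single-item extension case (element extensions like <a(ab)> are omitted from this sketch): find the first element containing the item, keep the items listed after it marked with "_", then the rest of the sequence.

```python
# Prefix-based projection w.r.t. one item.
def project(seq, item):
    for k, element in enumerate(seq):
        if item in element:
            rest = [x for x in element if x > item]   # items listed after it
            head = [tuple(["_"] + rest)] if rest else []
            return head + list(seq[k + 1:])
    return None

seq = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]  # <a(abc)(ac)d(cf)>
suffix_a = project(seq, "a")          # <(abc)(ac)d(cf)>
print(project(suffix_a, "a"))         # <(_bc)(ac)d(cf)>
print(project(suffix_a, "b"))         # <(_c)(ac)d(cf)>
```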

SLIDE 48

Mining Sequential Patterns by Prefix Projections

  • Step 1: find length-1 sequential patterns
– <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 subsets:
– The ones having prefix <a>
– The ones having prefix <b>
– …
– The ones having prefix <f>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

SLIDE 49

Finding Seq. Pat. with Prefix <a>

  • Only need to consider projections w.r.t. <a>
– <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
– Further partition into 6 subsets
  • Having prefix <aa>;
  • …
  • Having prefix <af>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

SLIDE 50

Completeness of PrefixSpan

SDB:
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a>:
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    Having prefix <aa>: <aa>-projected database …
    …
    Having prefix <af>: <af>-projected database …
Having prefix <b>: <b>-projected database …
Having prefix <c>, …, <f>: …

SLIDE 51

Efficiency of PrefixSpan

  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected databases
– Can be improved by bi-level projections
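The whole scheme fits in a short recursion for the simplified case where every element is a single item (strings of items; element extensions omitted), on a toy database assumed for illustration:

```python
from collections import Counter

# A compact PrefixSpan sketch: grow frequent prefixes from projected
# databases instead of generating candidates.
def prefixspan(db, min_sup, prefix="", out=None):
    out = {} if out is None else out
    counts = Counter()
    for s in db:
        counts.update(set(s))          # each sequence counts an item once
    for item, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        out[prefix + item] = sup
        # Project: keep the part of each sequence after its first `item`
        projected = [s[s.index(item) + 1:] for s in db if item in s]
        prefixspan(projected, min_sup, prefix + item, out)
    return out

db = ["abcb", "bcab", "abc", "bca"]   # toy single-item-element sequences
patterns = prefixspan(db, min_sup=3)
print(sorted(patterns))  # ['a', 'ab', 'b', 'bc', 'c']
```

Note how each recursive call works on a strictly shorter projected database, which is the source of PrefixSpan's efficiency claimed above.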

SLIDE 52

Effectiveness

  • Redundancy due to anti-monotonicity
– {<abcd>} leads to 15 sequential patterns of the same support
– Closed sequential patterns and sequential generators
  • Constraints on sequential patterns
– Gap
– Length
– More sophisticated, application-oriented constraints


SLIDE 53

Sequences and Partial Orders

Sequential patterns:
CHK → MMK → MORT → RESP
CHK → MMK → MORT → BROK
CHK → RRSP → MORT → RESP
CHK → RRSP → MORT → BROK


SLIDE 54

Why Frequent Orders?

  • Frequent orders capture more thorough information than sequential patterns
  • Many important applications
– Bioinformatics: order-preserving clustering of microarray data
– Web mining and market basket analysis: modeling customer purchase behaviors
– Network management and intrusion detection: frequent routing paths, signatures for intrusions
– Preference-based services: partial orders from ranking data


SLIDE 55

Why Is Mining Orders Difficult?

  • Use sequential patterns to assemble frequent partial orders?
– One frequent closed partial order may summarize a few sequential patterns
– Assembling can be costly

Sequential patterns:
CHK → MMK → MORT → RESP
CHK → MMK → MORT → BROK
CHK → RRSP → MORT → RESP
CHK → RRSP → MORT → BROK


SLIDE 56

Model

  • A sequence s induces a full order R1; if R1 ⊇ R2, where R2 is a partial order, then R1 is said to support R2
  • The support of a partial order R in a sequence database is the number of sequences supporting R in the database
  • An order R is closed if there exists no R' ⊃ R with sup(R) = sup(R')
  • Given a minimum support threshold, order R is a frequent closed partial order if it is closed and passes the support threshold


SLIDE 57

Ideas

  • Depth-first search to generate frequent closed partial orders in transitive reduction
– Transitive reduction is a succinct representation of partial orders
  • Pruning infrequent items, edges and partial orders
  • Pruning forbidden edges
  • Extracting transitive reductions of frequent partial orders directly


SLIDE 58

Interesting Orders


SLIDE 59

To-Do List

  • Read Section 6.3
