Efficiently Mining Long Patterns from Databases (Roberto Bayardo)



SLIDE 1

1 of 22

Efficiently Mining Long Patterns from Databases

Roberto Bayardo IBM Almaden Research Center

SLIDE 2

2 of 22

The Problem

Current flock of algorithms for mining frequent itemsets in databases:

  • Use (almost exclusively) subset-infrequency pruning
  • An itemset can be frequent only if all its subsets are frequent
  • Example: Apriori will check eggs&bread&butter only after eggs&bread, eggs&butter, and bread&butter are known to be frequent
  • Scale exponentially (in time and space) in the length of the longest frequent itemset
  • Complexity becomes problematic on many data-sets outside the domain of market-basket analysis
  • Several classification benchmarks [Bayardo 97]
  • Census data [Brin et al., 97]
SLIDE 3

3 of 22

Talk Overview

  • Show how to incorporate superset-frequency based pruning into a search for maximal frequent itemsets
  • If an itemset is known to be frequent, then so are its subsets
  • Define a technique for lower-bounding the frequency of an itemset using known frequencies of its proper subsets
  • Incorporate frequency-lower-bounding into the maximal frequent-itemset finding algorithm (producing Max-Miner) as well as Apriori (producing Apriori-LB)
  • Experimental evaluation
  • Conclusion & Future Work
SLIDE 4

4 of 22

Some Quick Definitions

We are focusing on the problem of finding maximal frequent itemsets in transactional databases.

  • A transaction is a database entity composed of a set of items, e.g. the supermarket items purchased by a customer during a shopping visit.
  • The support of a set of items (or itemset) is the number of transactions in the database that contain it.
  • An itemset is frequent if its support exceeds a user-defined threshold (minsup). Otherwise it is infrequent.
  • An itemset is maximal frequent if no superset of it is frequent.
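These definitions are small enough to check by brute force. A minimal Python sketch (function names `support` and `maximal_frequent` are mine, and minsup is treated as an inclusive threshold, i.e. support >= minsup, as in the paper):

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions (sets of items) containing every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def maximal_frequent(transactions, minsup):
    """Brute force: enumerate all itemsets, keep the frequent ones,
    then drop any itemset that has a frequent proper superset."""
    items = sorted(set().union(*transactions))
    frequent = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if support(c, transactions) >= minsup]
    return [f for f in frequent if not any(f < g for g in frequent)]
```

With transactions {bread, butter}, {bread, butter, eggs}, {butter, eggs} and minsup = 2, the maximal frequent itemsets are {bread, butter} and {butter, eggs}; exponential enumeration like this is exactly what Max-Miner avoids.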

SLIDE 5

5 of 22

Pruning with Superset-Frequency

Some previous work has investigated the idea in the context of identifying maximal frequent itemsets in data:

  • Gunopulos et al. [ICDT-97]
  • Memory-resident data limitation
  • Evaluated primarily an incomplete algorithm
  • Zaki [KDD-97]
  • Superset-frequency pruning limited in its application
  • Does not scale to long frequent itemsets
  • Lin & Kedem [EDBT-98]
  • Concurrent proposal
  • Uses NP-hard candidate generation scheme
SLIDE 6

6 of 22

My Approach

  • Explicitly formulate the search for frequent itemsets as a tree search (instead of lattice search) problem.
  • Use both superset-frequency and subset-infrequency to prune branches and nodes of the tree.
  • Dynamically reorganize the search tree to (heuristically) maximize pruning effectiveness.

SLIDE 7

7 of 22

Set-Enumeration Tree Search

  • Impose an ordering on the set of items.
  • Root node is the empty set.
  • Children of a node are formed by appending an item that follows all existing node items in the item ordering.
  • Each and every itemset is enumerated exactly once.
  • Key to efficient search: pruning strategies applied to remove nodes and sub-trees from consideration.

[Set-enumeration tree over the ordered items 1 < 2 < 3 < 4:]

{}
├─ 1 ──┬─ 1,2 ──┬─ 1,2,3 ── 1,2,3,4
│      │        └─ 1,2,4
│      ├─ 1,3 ──── 1,3,4
│      └─ 1,4
├─ 2 ──┬─ 2,3 ──── 2,3,4
│      └─ 2,4
├─ 3 ──── 3,4
└─ 4
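The enumeration rule on this slide codes up directly; a minimal recursive sketch (function name `se_tree` is mine):

```python
def se_tree(head, tail, out):
    """Depth-first walk of the set-enumeration tree: each child extends
    `head` with one tail item and inherits only the items after it in
    the ordering, so every itemset is visited exactly once."""
    out.append(head)
    for i, item in enumerate(tail):
        se_tree(head + [item], tail[i + 1:], out)
    return out
```

`se_tree([], [1, 2, 3, 4], [])` visits all 16 subsets of {1, 2, 3, 4}, each exactly once, in the depth-first order of the tree on this slide.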

SLIDE 8

8 of 22

Node Representation

  • To facilitate pruning and other optimizations, we represent each node g in the SE-tree as a candidate group consisting of:
  • The itemset represented by the node, called the head and denoted h(g).
  • The set of viable items that can be appended to the head to form the node’s children, called the tail and denoted t(g).
  • By “computing the support” of a candidate group g, I mean computing the support of not only h(g), but also:
  • h(g) ∪ {i} for all i ∈ t(g)
  • h(g) ∪ t(g) (called the long itemset of a candidate group).

SLIDE 9

9 of 22

Example

At node g where h(g) = {1,2} and t(g) = {3,4,5}:

  • Compute the support of {1,2,3}, {1,2,4}, {1,2,5}.
  • Used for subset-infrequency based pruning.
  • For example, if {1,2,4} is infrequent, then 4 is not viable.
  • Children of a node need only inherit viable tail items.
  • Compute the support of {1,2,3,4,5}.
  • Used for superset-frequency based pruning.
  • For example, if {1,2,3,4,5} is frequent, then so are all other children of g.

SLIDE 10

10 of 22

Algorithm (Max-Miner)

  • C initialized to contain one group with an empty head.
  • M initialized to empty.
  • While C non-empty:
  • Compute the support of all candidate groups in C.
  • For each g ∈ C with a long itemset h(g) ∪ t(g) that is frequent, put it in M.
  • For every other g ∈ C, generate children of g. If g has no children, then put h(g) in M.
  • Let C contain the newly generated children.
  • Remove sets in M with supersets, and return M.

SLIDE 11

11 of 22

Generating Children

To generate children of a candidate group g:

  • Remove any tail item i from t(g) if h(g) ∪ {i} is infrequent.
  • Impose a new order on the remaining tail items.
  • For each remaining tail item i, generate a child g' with:
  • h(g') = h(g) ∪ {i}
  • t(g') = { j | j follows i in t(g) }
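This child-generation step, with the increasing-support reordering of the later Item Ordering slide folded in, might look as follows (the function name and the `sup` callback are my own):

```python
def generate_children(head, tail, sup, minsup):
    """Children of candidate group (head, tail): drop infrequent
    extensions, reorder the survivors by increasing support relative
    to the head, and let each child inherit only the items after it."""
    viable = [i for i in tail if sup(head + (i,)) >= minsup]
    viable.sort(key=lambda i: sup(head + (i,)))
    return [(head + (viable[k],), tuple(viable[k + 1:]))
            for k in range(len(viable))]
```

Note that the last child in the new order always gets an empty tail, mirroring the example on the next slide.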

SLIDE 12

12 of 22

Example

h(g) = {1,2}, t(g) = {3,4,5,6}:

  • h(g1) = {1,2,3}, t(g1) = {4,5,6}
  • h(g2) = {1,2,4}, t(g2) = {5,6}
  • h(g3) = {1,2,5}, t(g3) = {6}
  • h(g4) = {1,2,6}, t(g4) = {}

SLIDE 13

13 of 22

Item Ordering

  • Goal: Maximize pruning effectiveness.
  • Strategy: Order tail items in increasing order of support relative to the head, sup(h(g) ∪ {i}).
  • Forces candidate groups with long tails to have heads with low support.
  • Forces most-frequent items to appear more frequently in the tails of candidate groups.
  • This is a critical optimization!

SLIDE 14

14 of 22

Support Lower-Bounding

  • Idea is to use the support information provided by an itemset’s proper subsets to lower-bound its support.
  • If the itemset’s support can be lower-bounded above minsup, then it is known to be frequent without requiring database access.
  • Support lower-bounding can be used to avoid overhead associated with computing the support of many candidate itemsets.

SLIDE 15

15 of 22

Support Lower-bounding: Theory

  • Definition: drop(Is, j) = sup(Is) − sup(Is ∪ {j})
  • Note that for Is ⊆ I (hence Is ∪ {j} ⊆ I ∪ {j}), drop(Is, j) is an upper-bound on drop(I, j).
  • Note that sup(I ∪ {j}) = sup(I) − drop(I, j).
  • Theorem: sup(I) − drop(Is, j) is a lower-bound on the support of I ∪ {j}.
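Numerically, the bound is just arithmetic on supports that are already known; a tiny sketch (helper names are mine):

```python
def drop(sup_Is, sup_Is_j):
    """drop(Is, j) = sup(Is) - sup(Is ∪ {j}):
    transactions lost when j is added to Is."""
    return sup_Is - sup_Is_j

def support_lower_bound(sup_I, sup_Is, sup_Is_j):
    """Lower bound on sup(I ∪ {j}) from a subset Is of I:
    sup(I) - drop(Is, j) <= sup(I ∪ {j})."""
    return sup_I - drop(sup_Is, sup_Is_j)
```

For example, in a database where sup({butter}) = 3, sup({butter, eggs}) = 2, and sup({bread, butter}) = 2, taking I = {bread, butter}, Is = {butter}, j = eggs gives the bound 2 − (3 − 2) = 1 on sup({bread, butter, eggs}), with no database access.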

SLIDE 16

16 of 22

Exploiting Support Lower-Bounds

To generate children of a candidate group g:

  • Remove any tail item i from t(g) if h(g) ∪ {i} is infrequent.
  • Impose a new order on the remaining tail items.
  • For each remaining tail item i in increasing item order, generate a child g' with:
  • h(g') = h(g) ∪ {i}
  • t(g') = { j | j follows i in t(g) }
  • If Compute-Lower-Bound(h(g') ∪ t(g')) >= minsup, then return h(g') ∪ t(g') to be put in M.

SLIDE 17

17 of 22

Lower-bounding in Apriori

  • Modify Apriori so that it only computes the support of candidate itemsets that were not found frequent through lower-bounding.

  • We call the resulting algorithm Apriori-LB.
SLIDE 18

18 of 22

Results: Census Data

[Chart: CPU Time (sec) vs. Support (%) on census data, comparing Max-Miner, Apriori-LB, and Apriori.]

SLIDE 19

19 of 22

Scaling

(external slide)

SLIDE 20

20 of 22

DB Passes

[Chart: DB passes vs. length of longest pattern for the census*, chess, connect-4, splice, mushroom, and retail data-sets.]

SLIDE 21

21 of 22

Conclusions

  • Long maximal frequent itemsets can be efficiently mined from large data-sets.
  • Key idea: superset-frequency based pruning applied heuristically throughout the search.
  • Support lower-bounding is effective at substantially reducing the candidate groups considered by Max-Miner.
  • Support lower-bounding is effective at reducing the candidate itemsets checked against the database in Apriori.

SLIDE 22

22 of 22

Future Work

  • Integrating additional constraints into the search:
  • Association rule confidence
  • Rule “interestingness” measures
  • Goal: Be able to mine association rules instead of maximal-frequent itemsets from long-pattern data.

  • Apply ideas to mining other patterns
  • sequential patterns
  • frequent episodes