Altobelli B. Mantuan and Leandro A. F. Fernandes - PowerPoint PPT Presentation

SLIDE 1

Acknowledgments Graphics Processing Research Laboratory at IC-UFF, Brazil prograf.ic.uff.br

Altobelli B. Mantuan and Leandro A. F. Fernandes

{amantuan, laffernandes}@ic.uff.br

SLIDE 2

Itemset mining

  • Frequent pattern / Itemset: a set of one or more items that occurs frequently in a dataset

▪ k-itemset: X = {x1, …, xk}

  • Finding all the itemsets

▪ combinatorial problem

Items bought

Beer, Nuts, Diaper
Beer, Coffee, Diaper
Beer, Diaper, Eggs
Nuts, Eggs, Milk
Nuts, Coffee, Diaper, Eggs, Beer

Example: Itemset = {Beer, Diaper}, Support = 4
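
The support count in the example can be checked with a minimal Python sketch. The transactions are the five rows of the "Items bought" table; the function name `support` and the variable names are illustrative choices, not part of the original slides. The last lines show why finding all itemsets is combinatorial: the number of candidate itemsets grows as 2^k - 1 for k distinct items.

```python
# Transactions copied from the "Items bought" table on the slide.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Beer"},
]


def support(itemset, transactions):
    """Absolute support: number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


beer_diaper_support = support({"Beer", "Diaper"}, transactions)
print(beer_diaper_support)  # 4, matching the slide

# Every non-empty subset of the distinct items is a candidate itemset.
distinct_items = set().union(*transactions)
candidate_count = 2 ** len(distinct_items) - 1
print(len(distinct_items), candidate_count)  # 6 items, 63 candidates
```

Even this toy dataset with 6 distinct items has 63 candidate itemsets, which is why mining algorithms need a way to prune the search space.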

SLIDE 3

Itemset mining procedure

  • Existing solutions for generating frequent itemsets

▪ Threshold parameters

  • Difficult to perceive the influence of the parameter

▪ Number of itemsets retrieved
▪ Search space

  • A large area of the search space is unnecessarily explored

It would be interesting to use other information to reduce the search space for generating itemsets

SLIDE 4

Flowchart - SCIM

SLIDE 5

Contributions

  • The spatial contextualization of items for mining interesting itemsets in transactional databases
  • A procedure for clustering items in the Solution Space of the Dual Scaling mapping
  • A procedure for generating closed itemsets based on spatial contextualization

SLIDE 6

Summary

  • Dual Scaling
  • Overlapping clustering procedure
  • SC-close procedure
  • Results
  • Conclusion

SLIDE 7

SLIDE 8

Dual Scaling

Nishisato, Shizuhiko. (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis.

Subject × Stimulus incidence table: subjects 1 to 15 (rows), six 1s per row, over the stimulus columns Low BP, Aver BP, High BP; Rare Migr., Occa. Migr., Frequ. Migr.; Young, Middle Age, Old; Low Anxiety, Mid Anxiety, High Anxiety; Light, Average, Heavy; Short, Medium, Tall.

  • The more often a set of stimuli occurs together in the database, the smaller the distance between those stimuli
  • The lower the frequency of a stimulus, the farther from the origin it is positioned

SLIDE 9

Dual Scaling – Working example

Subject = Transaction, Stimulus = Item

Transaction  Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
1            1       0       0       1       0       0
2            1       0       0       1       0       0
3            0       0       1       1       0       0
4            0       0       1       0       1       0
5            0       0       1       0       1       0
6            0       0       1       0       1       0
7            0       0       1       0       1       0
8            0       0       1       0       1       0
9            0       0       1       0       0       1
10           0       1       0       0       1       0
11           0       1       0       0       0       1
12           0       1       0       0       0       1
13           0       1       0       0       0       1
14           0       1       0       1       0       0
15           0       1       0       0       0       1

  • Transactions are not mapped to the solution space
  • Using 𝜒-square distance for pairs of items

𝜒-square distance matrix for pairs of items

Item  1         2         3         4         5         6
1     -         16.6052   20.5072   3.2338    21.9519   18.4428
2     16.6052   -         10.9090   14.0740   10.4710   1.7637
3     20.5072   10.9090   -         17.5109   1.8114    9.7269
4     3.2338    14.0740   17.5109   -         21.4269   17.5219
5     21.9519   10.4710   1.8114    21.4269   -         10.8555
6     18.4428   1.7637    9.7269    17.5219   10.8555   -

Reference distance is the distance of each item to the origin of the space

Item                1        2        3        4        5        6
Reference distance  9.7467   3.2046   3.7234   8.2790   4.6087   3.7871
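
The two properties stated for Dual Scaling can be checked against the numbers of this working example. The sketch below does not compute the Dual Scaling mapping itself; it only copies the transaction table, the pairwise distances, and the reference distances from the slide, and compares them with simple counts. All variable and function names are illustrative.

```python
# Transaction-by-item table copied from the slide (15 transactions, 6 items).
rows = [
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0], [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0], [0, 0, 1, 0, 1, 0], [0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0], [0, 0, 1, 0, 1, 0], [0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0], [0, 1, 0, 0, 0, 1], [0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1], [0, 1, 0, 1, 0, 0], [0, 1, 0, 0, 0, 1],
]

# Pairwise item distances copied from the slide (diagonal set to 0.0 here).
dist = [
    [0.0, 16.6052, 20.5072, 3.2338, 21.9519, 18.4428],
    [16.6052, 0.0, 10.9090, 14.0740, 10.4710, 1.7637],
    [20.5072, 10.9090, 0.0, 17.5109, 1.8114, 9.7269],
    [3.2338, 14.0740, 17.5109, 0.0, 21.4269, 17.5219],
    [21.9519, 10.4710, 1.8114, 21.4269, 0.0, 10.8555],
    [18.4428, 1.7637, 9.7269, 17.5219, 10.8555, 0.0],
]

# Reference distances (item to origin) copied from the slide.
reference = [9.7467, 3.2046, 3.7234, 8.2790, 4.6087, 3.7871]

n_items = 6


def cooccurrence(i, j):
    """Number of transactions that contain both item i and item j."""
    return sum(r[i] * r[j] for r in rows)


most_cooccurring = []
nearest = []
for i in range(n_items):
    others = [j for j in range(n_items) if j != i]
    most_cooccurring.append(max(others, key=lambda j: cooccurrence(i, j)) + 1)
    nearest.append(min(others, key=lambda j: dist[i][j]) + 1)

frequency = [sum(r[i] for r in rows) for i in range(n_items)]

print(most_cooccurring)  # [4, 6, 5, 1, 3, 2]
print(nearest)           # [4, 6, 5, 1, 3, 2]
print(frequency)         # [2, 6, 7, 4, 6, 5]
```

For every item, the partner it co-occurs with most often is also its nearest neighbor in the distance matrix, and the least frequent item (item 1, in 2 transactions) has the largest reference distance (9.7467).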

SLIDE 10

SLIDE 11

Overlapping clustering procedure

  • 𝜒-square distance between pairs of points
  • Each item point 𝑥𝑖 is the center of cluster 𝐶𝑖
  • Parameter 𝑑𝑟 is used to define 𝑐𝑟𝑖

𝜒-square distance matrix

      X1        X2        X3        X4        X5        X6
X1    -         16.6052   20.5072   3.2338    21.9519   18.4428
X2    16.6052   -         10.9090   14.0740   10.4710   1.7637
X3    20.5072   10.9090   -         17.5109   1.8114    9.7269
X4    3.2338    14.0740   17.5109   -         21.4269   17.5219
X5    21.9519   10.4710   1.8114    21.4269   -         10.8555
X6    18.4428   1.7637    9.7269    17.5219   10.8555   -
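
The idea of one cluster per item, with clusters allowed to overlap, can be illustrated on the distance matrix above. The sketch below is a simplification under stated assumptions: it uses a single fixed radius for every cluster, whereas the slides derive each cluster's extent from a parameter whose formula is not given here. The function `overlapping_clusters` and the radius value 10.5 are hypothetical and chosen only to make the overlap visible.

```python
labels = ["X1", "X2", "X3", "X4", "X5", "X6"]

# Pairwise distances copied from the slide (diagonal set to 0.0 here).
dist = [
    [0.0, 16.6052, 20.5072, 3.2338, 21.9519, 18.4428],
    [16.6052, 0.0, 10.9090, 14.0740, 10.4710, 1.7637],
    [20.5072, 10.9090, 0.0, 17.5109, 1.8114, 9.7269],
    [3.2338, 14.0740, 17.5109, 0.0, 21.4269, 17.5219],
    [21.9519, 10.4710, 1.8114, 21.4269, 0.0, 10.8555],
    [18.4428, 1.7637, 9.7269, 17.5219, 10.8555, 0.0],
]


def overlapping_clusters(dist, radius):
    """One cluster per item: the center plus every item within `radius`.

    Assumption: a single fixed radius, which is NOT the rule from the
    slides; it only illustrates center-based overlapping clusters.
    """
    n = len(dist)
    return [
        {j for j in range(n) if j == i or dist[i][j] <= radius}
        for i in range(n)
    ]


clusters = overlapping_clusters(dist, radius=10.5)  # hypothetical radius
for i, members in enumerate(clusters):
    print(labels[i], sorted(labels[j] for j in members))
```

With this radius, X5 belongs to the clusters centered at X2, X3, and X5, so the clusters overlap rather than partition the items.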

SLIDE 12

SLIDE 13

SC-Close

  • SC-Close uses the compact vertical structure FP-tree

▪ Clusters reduce the search space on the FP-tree

  • SC-Close uses a CFI-tree to report only closed itemsets
  • Itemset formation rule

▪ 1. The itemset must belong to the cluster coverage
▪ 2. At least one item from the itemset must belong to the minimum coverage
▪ 3. Condition 2 is ignored if the minimum coverage only includes the center point
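
The three conditions of the itemset formation rule can be read as a single predicate. The sketch below is one possible reading, not the SC-Close implementation: it assumes the cluster coverage and the minimum coverage are plain sets of items, and that "only includes the center point" means the minimum coverage equals the set containing just the center. The function name and the example sets are hypothetical.

```python
def satisfies_formation_rule(itemset, center, cluster_coverage, minimum_coverage):
    """Check the three conditions of the itemset formation rule.

    Assumption: coverages are modeled as plain sets of item labels.
    """
    items = set(itemset)
    cluster_coverage = set(cluster_coverage)
    minimum_coverage = set(minimum_coverage)

    # Condition 1: every item of the itemset lies in the cluster coverage.
    if not items <= cluster_coverage:
        return False

    # Condition 3: condition 2 is skipped when the minimum coverage
    # contains only the center point.
    if minimum_coverage == {center}:
        return True

    # Condition 2: at least one item lies in the minimum coverage.
    return bool(items & minimum_coverage)


# Hypothetical cluster centered at X2.
coverage = {"X2", "X5", "X6"}
minimum = {"X2", "X6"}

print(satisfies_formation_rule({"X5", "X6"}, "X2", coverage, minimum))  # True
print(satisfies_formation_rule({"X5"}, "X2", coverage, minimum))        # False
print(satisfies_formation_rule({"X3", "X5"}, "X2", coverage, minimum))  # False
print(satisfies_formation_rule({"X5"}, "X2", coverage, {"X2"}))         # True
```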

SLIDE 14

Results

  • Using 11 databases from LUCS-KDD
  • Compared algorithms

▪ SCIM, FPClose, Slim, and TopPI

  • Metrics

▪ Mean All-Confidence, MDL, and processing time
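
As a reminder of what the first metric measures, the sketch below computes all-confidence using its usual definition: the support of the itemset divided by the largest support among its individual items. It reuses the small "Items bought" example from the itemset mining slide. Averaging over a list of patterns is assumed here to be what "Mean All-Confidence" refers to, and the two patterns are picked only for illustration.

```python
# Toy transactions from the "Items bought" example.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Beer"},
]


def support(itemset):
    """Number of transactions containing every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def all_confidence(itemset):
    """Support of the itemset over the largest single-item support."""
    return support(itemset) / max(support({item}) for item in itemset)


# Illustrative patterns, not output of any of the compared algorithms.
patterns = [{"Beer", "Diaper"}, {"Nuts", "Eggs"}]
mean_all_confidence = sum(all_confidence(p) for p in patterns) / len(patterns)

print(all_confidence({"Beer", "Diaper"}))  # 1.0
print(all_confidence({"Nuts", "Eggs"}))    # 2/3
print(mean_all_confidence)                 # 5/6
```

A value of 1.0 means the items of the pattern never occur without each other in the data.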

SLIDE 15

Mean All-Confidence

SLIDE 16

Mean All-Confidence

SLIDE 17

MDL and processing time

  • MDL metric: the relative total compressed size (L%) achieved by the set of patterns retrieved by each algorithm
  • Processing time: Slim and TopPI ran in a single thread

Database (n, q, m)
  Technique         # Patterns  Time (s)   L%

Letter recognition (n = 20,000, q = 16, m = 80)
  Slim              1,231       34.30      31.89
  TopPI k = 7       447         0.62       73.58
  SCIM dr = 0.03    89          0.48       75.28

mFeat (n = 2,000, q = 240, m = 1,648)
  Slim              5,121       10,053.93  35.82
  TopPI k = 3       3,567       1.35       72.89
  SCIM dr = 0.00    11,943      3,351.34   56.00

Wine (n = 178, q = 13, m = 65)
  FPClose           13,169      0.42       112.52
  Slim              55          0.22       76.79
  TopPI k = 2       68          0.25       90.22
  SCIM dr = 0.03    60          0.04       88.92

Page blocks (n = 5,473, q = 10, m = 41)
  FPClose           714         0.20       3.90
  Slim              40          0.19       3.84
  TopPI k = 1       31          0.29       83.09
  SCIM dr = 0.00    54          0.13       38.23

Pen digits (n = 10,992, q = 16, m = 79)
  Slim              1,220       45.23      38.67
  TopPI k = 7       401         0.55       76.14
  SCIM dr = 0.04    148         0.37       76.51

Waveform (n = 5,000, q = 21, m = 98)
  Slim              717         8.49       38.84
  TopPI k = 9       734         0.44       77.53
  SCIM dr = 0.00    724         0.39       72.35

Ecoli (n = 336, q = 7, m = 26)
  FPClose           530         0.12       53.59
  Slim              25          0.16       40.15
  TopPI k = 3       60          0.25       68.67
  SCIM dr = 0.02    66          0.03       59.50

Connect-4 (n = 67,557, q = 42, m = 126)
  Slim              1,506       88.79      12.12
  TopPI k = 8       965         2.87       67.39
  SCIM dr = 0.00    1,002       9.98       54.28

Tic-tac-toe (n = 958, q = 9, m = 27)
  FPClose           42,684      2.98       145.81
  Slim              125         0.22       53.19
  TopPI k = 3       250         0.31       86.27
  SCIM dr = 0.02    330         0.06       91.32

Led7 (n = 3,200, q = 7, m = 14)
  FPClose           1,936       0.20       25.77
  Slim              78          0.16       25.34
  TopPI k = 3       67          0.30       69.06
  SCIM dr = 0.02    15          0.04       70.54

Pima (n = 768, q = 8, m = 36)
  FPClose           1,608       0.17       41.47
  Slim              55          0.21       31.35
  TopPI k = 3       717         0.31       54.57
  SCIM dr = 0.02    522         0.06       42.38

SLIDE 18

Conclusions

  • We showed that spatial contextualization can be used to guide closed itemset generation
  • We provided an unsupervised clustering heuristic with cluster overlapping
  • We presented a procedure to reduce the search space in the FP-tree for the generation of closed itemsets

SLIDE 19

prograf.ic.uff.br

Thank you!


Project webpage