prograf.ic.uff.br
Altobelli B. Mantuan and Leandro A. F. Fernandes
{amantuan, laffernandes}@ic.uff.br
Acknowledgments
Graphics Processing Research Laboratory at IC-UFF, Brazil
Itemset mining
- Frequent pattern / Itemset: a set of one or more items that occurs frequently in a dataset
▪ k-itemset: X = {x1, …, xk}
- Finding all the itemsets
▪ Combinatorial problem
Items bought
Beer, Nuts, Diaper
Beer, Coffee, Diaper
Beer, Diaper, Eggs
Nuts, Eggs, Milk
Nuts, Coffee, Diaper, Eggs, Beer
Example: Itemset = {Beer, Diaper}, Support = 4
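The support count in the example can be checked with a short sketch. The names `TRANSACTIONS` and `support` are illustrative and not part of the presentation; support is taken here as the number of transactions that contain every item of the itemset.

```python
# The five transactions from the "Items bought" table.
TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Beer"},
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


print(support({"Beer", "Diaper"}, TRANSACTIONS))  # 4
```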
Itemset mining procedure
- Existing solutions for generating frequent itemsets
▪ Threshold parameters
- Difficult to perceive the influence of the parameter
▪ Number of itemsets retrieved
▪ Search space
- A large area of the search space is unnecessarily explored
It would be interesting to use other information to reduce the search space for generating itemsets
Flowchart - SCIM
Contributions
- The spatial contextualization of items for mining
interesting itemsets in transactional databases
- A procedure for clustering items in the solution space
of the Dual Scaling mapping
- A procedure for generating closed itemsets based on
spatial contextualization
Summary
- Dual Scaling
- Overlapping clustering procedure
- SC-close procedure
- Results
- Conclusion
Dual Scaling
Nishisato, Shizuhiko. (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis.
Subject × Stimulus table: subjects 1 to 15 as rows, 18 stimuli as columns, six 1s per row.
Stimuli, by group:
Low BP, Aver BP, High BP
Rare Migr., Occa. Migr., Frequ. Migr.
Young, Middle Age, Old
Low Anxiety, Mid Anxiety, High Anxiety
Light, Average, Heavy
Short, Medium, Tall
- The more often a set of stimuli occurs together in the
database, the smaller the distance between those stimuli
- The lower the frequency of a stimulus, the farther from the origin
it is positioned
Dual Scaling – Working example
             Item
Transaction  1  2  3  4  5  6
1            1  0  0  1  0  0
2            1  0  0  1  0  0
3            0  0  1  1  0  0
4            0  0  1  0  1  0
5            0  0  1  0  1  0
6            0  0  1  0  1  0
7            0  0  1  0  1  0
8            0  0  1  0  1  0
9            0  0  1  0  0  1
10           0  1  0  0  1  0
11           0  1  0  0  0  1
12           0  1  0  0  0  1
13           0  1  0  0  0  1
14           0  1  0  1  0  0
15           0  1  0  0  0  1
Subject = Transaction
Stimulus = Item
- Transactions are not mapped to the solution space
- Using 𝜒-square distance for pairs of items
𝜒-square distance matrix for pairs of items
Item        1        2        3        4        5        6
1           -  16.6052  20.5072   3.2338  21.9519  18.4428
2     16.6052        -  10.9090  14.0740  10.4710   1.7637
3     20.5072  10.9090        -  17.5109   1.8114   9.7269
4      3.2338  14.0740  17.5109        -  21.4269  17.5219
5     21.9519  10.4710   1.8114  21.4269        -  10.8555
6     18.4428   1.7637   9.7269  17.5219  10.8555        -
Reference distance: the distance of each item to the origin of the space
Item           1       2       3       4       5       6
Distance  9.7467  3.2046  3.7234  8.2790  4.6087  3.7871
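The two qualitative properties of the mapping can be illustrated on the working example with a minimal sketch. It assumes the textbook chi-square distance between item (column) profiles from correspondence analysis, with the average profile playing the role of the origin. This is a stand-in for the Dual Scaling solution, so it is not expected to reproduce the values on the slide; all names (`ROWS`, `column_profiles`, `chi_square_distance`) are hypothetical.

```python
from math import sqrt

# Binary transaction-by-item table from the working example (15 x 6).
ROWS = [
    (1, 0, 0, 1, 0, 0), (1, 0, 0, 1, 0, 0), (0, 0, 1, 1, 0, 0),
    (0, 0, 1, 0, 1, 0), (0, 0, 1, 0, 1, 0), (0, 0, 1, 0, 1, 0),
    (0, 0, 1, 0, 1, 0), (0, 0, 1, 0, 1, 0), (0, 0, 1, 0, 0, 1),
    (0, 1, 0, 0, 1, 0), (0, 1, 0, 0, 0, 1), (0, 1, 0, 0, 0, 1),
    (0, 1, 0, 0, 0, 1), (0, 1, 0, 1, 0, 0), (0, 1, 0, 0, 0, 1),
]


def column_profiles(rows):
    """Divide each item's column by that item's total count."""
    n_items = len(rows[0])
    totals = [sum(r[j] for r in rows) for j in range(n_items)]
    return [[r[j] / totals[j] for r in rows] for j in range(n_items)]


def chi_square_distance(p, q, row_masses):
    """Chi-square distance between two profiles, weighted by row masses."""
    return sqrt(sum((a - b) ** 2 / m for a, b, m in zip(p, q, row_masses)))


grand_total = sum(sum(r) for r in ROWS)
row_masses = [sum(r) / grand_total for r in ROWS]
profiles = column_profiles(ROWS)

# Assumption: the average profile (the row masses) stands in for the origin.
to_origin = [chi_square_distance(p, row_masses, row_masses) for p in profiles]

# Items 3 and 5 co-occur five times; items 1 and 5 never co-occur.
d_3_5 = chi_square_distance(profiles[2], profiles[4], row_masses)
d_1_5 = chi_square_distance(profiles[0], profiles[4], row_masses)
print(d_3_5 < d_1_5)            # co-occurring items are closer
print(to_origin.index(max(to_origin)) + 1)  # rarest item is farthest
```

In this sketch the rarest item (item 1, two occurrences) ends up farthest from the origin and the most frequent item (item 3, seven occurrences) ends up closest, which is the same ordering of extremes as in the reference distances above.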
Overlapping clustering procedure
- 𝜒-square distance between pairs of points
- Each item point 𝑥𝑖 is the center of cluster 𝐶𝑖
- Parameter 𝑑𝑟 is used to define 𝑐𝑟𝑖
𝜒-square distance matrix
           X1       X2       X3       X4       X5       X6
X1          -  16.6052  20.5072   3.2338  21.9519  18.4428
X2    16.6052        -  10.9090  14.0740  10.4710   1.7637
X3    20.5072  10.9090        -  17.5109   1.8114   9.7269
X4     3.2338  14.0740  17.5109        -  21.4269  17.5219
X5    21.9519  10.4710   1.8114  21.4269        -  10.8555
X6    18.4428   1.7637   9.7269  17.5219  10.8555        -
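The idea of one cluster per item, with clusters allowed to overlap, can be sketched on the distance matrix above. This is only an illustration: the single global `radius` is a hypothetical simplification, since the slides do not state how the parameter turns into each cluster's extent, and the diagonal is assumed to be 0.

```python
# Distance matrix from the slide; the diagonal (an item to itself) is assumed 0.
DIST = [
    [0.0, 16.6052, 20.5072, 3.2338, 21.9519, 18.4428],
    [16.6052, 0.0, 10.9090, 14.0740, 10.4710, 1.7637],
    [20.5072, 10.9090, 0.0, 17.5109, 1.8114, 9.7269],
    [3.2338, 14.0740, 17.5109, 0.0, 21.4269, 17.5219],
    [21.9519, 10.4710, 1.8114, 21.4269, 0.0, 10.8555],
    [18.4428, 1.7637, 9.7269, 17.5219, 10.8555, 0.0],
]


def overlapping_clusters(dist, radius):
    """Build one cluster per item: the center plus every item within `radius`."""
    n = len(dist)
    return {
        center + 1: {j + 1 for j in range(n) if dist[center][j] <= radius}
        for center in range(n)
    }


# Hypothetical radius chosen only to make the overlap visible.
clusters = overlapping_clusters(DIST, radius=10.0)
for center, members in clusters.items():
    print(f"X{center}: {sorted(members)}")
```

With this radius, item 6 belongs to the clusters centered on X2, X3, and X6, which is the kind of overlap the procedure allows.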
SC-Close
- SC-Close uses the compact vertical structure FP-tree
▪ Clusters reduce the search space on the FP-tree
- SC-Close uses a CFI-tree to report only closed itemsets
- Itemset formation rule
1. The itemset must belong to the cluster coverage
2. At least one item from the itemset must belong to the minimum coverage
3. Condition 2 is ignored if the minimum coverage only includes the center point
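The three conditions of the formation rule can be written as a small predicate. This is a sketch under assumptions: "belongs to the cluster coverage" is read as "every item lies inside the coverage", and the function and argument names are hypothetical, not taken from SC-Close.

```python
def satisfies_formation_rule(itemset, coverage, minimum_coverage, center):
    """Check a candidate itemset against the three conditions for one cluster."""
    items = set(itemset)
    # Condition 1: every item lies inside the cluster coverage (assumed reading).
    if not items <= set(coverage):
        return False
    # Condition 3: condition 2 is skipped when the minimum coverage
    # contains only the center point.
    if set(minimum_coverage) == {center}:
        return True
    # Condition 2: at least one item lies inside the minimum coverage.
    return bool(items & set(minimum_coverage))


# Hypothetical cluster centered on item 3.
print(satisfies_formation_rule({5, 6}, {3, 5, 6}, {3, 5}, 3))  # True
print(satisfies_formation_rule({6}, {3, 5, 6}, {3, 5}, 3))     # False
print(satisfies_formation_rule({6}, {3, 5, 6}, {3}, 3))        # True
```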
Results
- Using 11 databases from LUCS-KDD
- Compared algorithms
▪ SCIM, FPClose, Slim, and TopPI
- Metrics
▪ Mean All-Confidence, MDL, and processing time
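As a reminder of what the first metric measures, the sketch below uses the usual definition of all-confidence: the support of the itemset divided by the largest support among its single items. It assumes the presentation uses this definition, and it reuses the five "Items bought" transactions as toy data; the function names are illustrative.

```python
# Toy data: the five transactions from the "Items bought" example.
TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Beer"},
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def all_confidence(itemset, transactions):
    """Support of the itemset over the largest single-item support."""
    largest = max(support({item}, transactions) for item in itemset)
    return support(itemset, transactions) / largest


def mean_all_confidence(patterns, transactions):
    """Average all-confidence over a set of retrieved patterns."""
    return sum(all_confidence(p, transactions) for p in patterns) / len(patterns)


patterns = [{"Beer", "Diaper"}, {"Nuts", "Eggs"}]
print(all_confidence({"Beer", "Diaper"}, TRANSACTIONS))  # 1.0
print(mean_all_confidence(patterns, TRANSACTIONS))
```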
Mean All-Confidence
MDL and processing time
MDL metric: the relative total compressed size (L%) achieved by the set of patterns retrieved by each algorithm
Processing time: Slim and TopPI were run in a single thread
Database / Technique  # Patterns    Time (s)       L%
Letter recognition (n = 20,000, q = 16, m = 80)
  Slim                     1,231       34.30    31.89
  TopPI k = 7                447        0.62    73.58
  SCIM dr = 0.03              89        0.48    75.28
mFeat (n = 2,000, q = 240, m = 1,648)
  Slim                     5,121   10,053.93    35.82
  TopPI k = 3              3,567        1.35    72.89
  SCIM dr = 0.00          11,943    3,351.34    56.00
Wine (n = 178, q = 13, m = 65)
  FPClose                 13,169        0.42   112.52
  Slim                        55        0.22    76.79
  TopPI k = 2                 68        0.25    90.22
  SCIM dr = 0.03              60        0.04    88.92
Page blocks (n = 5,473, q = 10, m = 41)
  FPClose                    714        0.20     3.90
  Slim                        40        0.19     3.84
  TopPI k = 1                 31        0.29    83.09
  SCIM dr = 0.00              54        0.13    38.23
Pen digits (n = 10,992, q = 16, m = 79)
  Slim                     1,220       45.23    38.67
  TopPI k = 7                401        0.55    76.14
  SCIM dr = 0.04             148        0.37    76.51
Waveform (n = 5,000, q = 21, m = 98)
  Slim                       717        8.49    38.84
  TopPI k = 9                734        0.44    77.53
  SCIM dr = 0.00             724        0.39    72.35
Ecoli (n = 336, q = 7, m = 26)
  FPClose                    530        0.12    53.59
  Slim                        25        0.16    40.15
  TopPI k = 3                 60        0.25    68.67
  SCIM dr = 0.02              66        0.03    59.50
Connect-4 (n = 67,557, q = 42, m = 126)
  Slim                     1,506       88.79    12.12
  TopPI k = 8                965        2.87    67.39
  SCIM dr = 0.00           1,002        9.98    54.28
Tic-tac-toe (n = 958, q = 9, m = 27)
  FPClose                 42,684        2.98   145.81
  Slim                       125        0.22    53.19
  TopPI k = 3                250        0.31    86.27
  SCIM dr = 0.02             330        0.06    91.32
Led7 (n = 3,200, q = 7, m = 14)
  FPClose                  1,936        0.20    25.77
  Slim                        78        0.16    25.34
  TopPI k = 3                 67        0.30    69.06
  SCIM dr = 0.02              15        0.04    70.54
Pima (n = 768, q = 8, m = 36)
  FPClose                  1,608        0.17    41.47
  Slim                        55        0.21    31.35
  TopPI k = 3                717        0.31    54.57
  SCIM dr = 0.02             522        0.06    42.38
Conclusions
- We showed that spatial contextualization can be used
to guide closed itemset generation
- We provided an unsupervised clustering heuristic
with cluster overlapping
- We presented a procedure to reduce the search
space in the FP-tree for the generation of closed itemsets
prograf.ic.uff.br
Thank you!
Project webpage