Depth-First Non-Derivable Itemset Mining

Toon Calders∗ (University of Antwerp, Belgium, toon.calders@ua.ac.be)
Bart Goethals† (HIIT-BRU, University of Helsinki, Finland, bart.goethals@cs.helsinki.fi)

Abstract

Mining frequent itemsets is one of the main problems in data mining. Much effort went into developing efficient and scalable algorithms for this problem. When the support threshold is set too low, however, or when the data is highly correlated, the number of frequent itemsets can become too large, independently of the algorithm used. Therefore, it is often more interesting to mine a reduced collection of interesting itemsets, i.e., a condensed representation. Recently, in this context, the non-derivable itemsets were proposed as an important class of itemsets. An itemset is called derivable when its support is completely determined by the supports of its subsets. As such, derivable itemsets represent redundant information and can be pruned from the collection of frequent itemsets. It was shown both theoretically and experimentally that the collection of non-derivable frequent itemsets is in general much smaller than the complete set of frequent itemsets. A breadth-first, Apriori-based algorithm, called NDI, to find all non-derivable itemsets was proposed. In this paper we present a depth-first algorithm, dfNDI, based on Eclat, for mining the non-derivable itemsets. dfNDI is evaluated on real-life datasets, and experiments show that dfNDI outperforms NDI by an order of magnitude.

1 Introduction

Since its introduction in 1993 by Agrawal et al. [3], the frequent itemset mining problem has received a great deal of attention. Within the past decade, hundreds of research papers have been published presenting new algorithms or improvements on existing algorithms to solve this mining problem more efficiently.

The problem can be stated as follows. We are given a set of items I, and an itemset I ⊆ I is some set of items. A transaction over I is a couple T = (tid, I), where tid is the transaction identifier and I is an itemset. A transaction T = (tid, I) is said to support an itemset X ⊆ I if X ⊆ I. A transaction database D over I is a set of transactions over I. We omit I whenever it is clear from the context. The cover of an itemset X in D consists of the set of transaction identifiers of transactions in D that support X: cover(X, D) := {tid | (tid, I) ∈ D, X ⊆ I}.

∗ Postdoctoral Fellow of the Fund for Scientific Research - Flanders (Belgium) (F.W.O. - Vlaanderen).
† Current affiliation: University of Antwerp, Belgium.

The support of an itemset X in D is the number of transactions in the cover of X in D: support(X, D) := |cover(X, D)|. An itemset is called frequent in D if its support in D exceeds the minimal support threshold σ. D and σ are omitted when they are clear from the context. The goal is now to find all frequent itemsets, given a database and a minimal support threshold.

Recent studies on frequent itemset mining algorithms resulted in significant performance improvements: a first milestone was the introduction of the breadth-first Apriori-algorithm [4]. In the case that a slightly compressed form of the database fits into main memory, even more efficient, depth-first algorithms such as Eclat [18, 23] and FP-growth [12] were developed. However, independently of the chosen algorithm, if the minimal support threshold is set too low, or if the data is highly correlated, the number of frequent itemsets itself can be prohibitively large. No matter how efficient an algorithm is, if the number of frequent itemsets is too large, mining all of them becomes impossible.

To overcome this problem, several proposals have recently been made to construct a condensed representation [15] of the frequent itemsets, instead of mining all frequent itemsets. A condensed representation is a sub-collection of all frequent itemsets that still contains all information. The most well-known example of a condensed representation is the collection of closed sets [5, 7, 16, 17, 20]. The closure cl(I) of an itemset I is the largest superset of I such that supp(cl(I)) = supp(I); a set I is closed if cl(I) = I. In the closed-sets representation, only the frequent closed sets are stored. This representation still contains all information of the frequent itemsets, because for every set I it holds that supp(I) = max{supp(C) | I ⊆ C, cl(C) = C}.

Another important class of itemsets in the context of condensed representations are the non-derivable itemsets [10]. An itemset is called derivable when its support is completely determined by the supports of its subsets. As such, derivable itemsets represent redundant information and can be pruned from the collection of frequent itemsets. Whether or not an itemset is derivable can be checked by computing bounds on its support; in [10], a method based on the inclusion-exclusion principle is used.
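To make these definitions concrete, the following Python snippet (an illustration added here, not part of the original paper) implements cover and support on the toy database that will be used in Example 1 of Section 2:

    # Toy transaction database (the database D of Example 1 below),
    # stored as a mapping tid -> itemset.
    DB = {
        1: {"a", "b", "c", "d"}, 2: {"a", "b", "c"}, 3: {"a", "b", "d", "e"},
        4: {"c", "e"},           5: {"b", "d", "e"}, 6: {"a", "b", "e"},
        7: {"a", "c", "e"},      8: {"a", "d", "e"}, 9: {"b", "c", "e"},
        10: {"b", "d", "e"},
    }

    def cover(itemset, db=DB):
        """cover(X, D) := {tid | (tid, I) in D, X subset-of I}."""
        return {tid for tid, items in db.items() if set(itemset) <= items}

    def support(itemset, db=DB):
        """support(X, D) := |cover(X, D)|."""
        return len(cover(itemset, db))

    assert cover("be") == {3, 5, 6, 9, 10}
    assert support("ab") == 4            # transactions 1, 2, 3 and 6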

It was shown both theoretically and experimentally that the collection of non-derivable frequent itemsets is in general much more concise than the complete set of frequent itemsets. It was proven that, for a given database D, all sets of length more than log2(|D|) + 1 are derivable. Hence, especially in databases with numerous items but few transactions, the number of non-derivable itemsets is guaranteed to be relatively small compared to the number of frequent itemsets. Many biological datasets, e.g., gene expression datasets, are typical examples of such databases. In experiments in [10], it was shown empirically that the number of non-derivable itemsets is in general orders of magnitude smaller than the number of frequent itemsets. In most experiments, the number of non-derivable itemsets was even smaller than the number of closed sets.

In [10], a breadth-first, Apriori-based algorithm NDI to find all non-derivable itemsets was proposed. Due to the relatively small number of non-derivable itemsets, the NDI-algorithm almost always outperforms mining all frequent itemsets, independently of the algorithm [8]. When we look at the time and space required by the NDI-algorithm as a function of its output size, however, its performance is far below that of state-of-the-art frequent itemset mining algorithms. The low efficiency of NDI comes mainly from the fact that it is a breadth-first generate-and-test algorithm: all candidates of a certain level need to be processed simultaneously, and the support tests involve repeated scans over the complete database.

In contrast, in the case of mining all frequent itemsets, depth-first algorithms have been shown to perform much better: they have far less costly candidate generation phases and do not require scanning the complete database over and over again. Furthermore, item reordering techniques can be used to avoid the generation of too many candidates. Unfortunately, depth-first algorithms essentially cannot perform the so-called Apriori-test, that is, test whether all subsets of a generated candidate are known to be frequent, as most of them are simply not generated yet. Nevertheless, the supports of the subsets of a given set are exactly what is needed in order to determine whether an itemset is derivable or not. In this paper, we tackle this problem and present a depth-first algorithm, dfNDI, for mining non-derivable itemsets. The dfNDI-algorithm is based on Eclat and the diffset technique as introduced by Zaki et al. [18, 19]. As such, dfNDI combines the efficiency of depth-first itemset mining algorithms with the significantly lower number of non-derivable itemsets, resulting in an efficient algorithm for mining non-derivable itemsets.

The organization of the paper is as follows. In Section 2, the non-derivable itemsets and the level-wise NDI algorithm are revisited. In Section 3, we shortly describe the Eclat-algorithm on which our dfNDI-algorithm is based; special attention is paid to item reordering techniques. Section 4 introduces the new algorithm dfNDI in depth, which is experimentally evaluated and compared to the level-wise NDI in Section 5.

2 Non-Derivable Itemsets

In this section we revisit the non-derivable itemsets introduced in [10]. In [10], rules were given to derive bounds

on the support of an itemset I if the supports of all strict subsets of I are known.

2.1 Deduction Rules

Let a generalized itemset be a conjunction of items and negations of items. For example, G = {a, b, c̄, d̄} is a generalized itemset; it is contained in every transaction that contains a and b, but neither c nor d. Formally, a transaction (tid, I) contains the generalized itemset G = X ∪ Ȳ if X ⊆ I and I ∩ Y = ∅. The support of a generalized itemset G in a database D is the number of transactions of D that contain G. We say that a generalized itemset G = X ∪ Ȳ is based on the itemset I if I = X ∪ Y. From the well-known inclusion-exclusion principle [11], we know that, for a given generalized itemset G = X ∪ Ȳ based on I,

    supp(G) = Σ_{X ⊆ J ⊆ I} (−1)^{|J \ X|} supp(J).

Hence, supp(I) equals

    Σ_{X ⊆ J ⊂ I} (−1)^{|I \ J|+1} supp(J) + (−1)^{|Y|} supp(G).        (2.1)

Notice that, depending on the sign of (−1)^{|Y|} supp(G), the term

    δX(I) := Σ_{X ⊆ J ⊂ I} (−1)^{|I \ J|+1} supp(J)

is a lower (|Y| even) or an upper (|Y| odd) approximation of the support of I. In [10], this observation was used to compute lower and upper bounds on the support of an itemset I. For each set I, let lI (uI) denote the lower (upper) bound we can derive using the deduction rules. That is,

    lI = max{δX(I) | X ⊆ I, |I \ X| even},
    uI = min{δX(I) | X ⊆ I, |I \ X| odd}.

Since we need the supports of all strict subsets of I to compute the bounds, lI and uI clearly depend on the underlying database.

EXAMPLE 1. Consider the following database D:

    TID | Items
    ----+------------
      1 | a, b, c, d
      2 | a, b, c
      3 | a, b, d, e
      4 | c, e
      5 | b, d, e
      6 | a, b, e
      7 | a, c, e
      8 | a, d, e
      9 | b, c, e
     10 | b, d, e
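The deduction rules translate directly into code. The following Python sketch (ours, for illustration; it reuses DB, cover and support from the snippet in Section 1) computes δX(I) and the bounds [lI, uI] by brute force, and reproduces the interval [1, 2] for abc that is derived next:

    from itertools import combinations

    def delta(X, I):
        """delta_X(I) = sum over X <= J < I of (-1)^(|I minus J|+1) * supp(J)."""
        X, I = set(X), set(I)
        total = 0
        extra = sorted(I - X)
        for r in range(len(extra)):            # r < |I \ X| keeps J a strict subset
            for c in combinations(extra, r):
                J = X | set(c)
                total += (-1) ** (len(I - J) + 1) * support(J)
        return total

    def bounds(I):
        """l_I = max over |I \ X| even, u_I = min over |I \ X| odd."""
        I = set(I)
        lower, upper = [], []
        for r in range(len(I) + 1):
            for X in combinations(sorted(I), r):
                (lower if (len(I) - r) % 2 == 0 else upper).append(delta(X, I))
        return max(lower), min(upper)

    assert bounds("abc") == (1, 2)             # supp(abc) must lie in [1, 2]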


Some deduction rules for abc are the following:

    supp(abc) ≥ supp(ab) + supp(ac) − supp(a) = 1
    supp(abc) ≤ supp(ab) + supp(ac) + supp(bc) − supp(a) − supp(b) − supp(c) + supp({}) = 2

Hence, based on the supports of the subsets of abc, we can deduce that supp(abc) lies in the interval [1, 2]. In this paper, we will use Equation (2.1) to compute the support of I based on the supports of its subsets and the support of the generalized itemset G.

EXAMPLE 2. Some equalities for the itemset abc:

    supp(abc) = supp(ab) − supp(abc̄)
    supp(abc) = (supp(ab) + supp(ac) + supp(bc) − supp(a) − supp(b) − supp(c) + supp({})) − supp(āb̄c̄)   ✷

Notice incidentally that the complexity of the term δX(I) depends on the cardinality of Y = I \ X: the larger Y is, the more complex the computation of the support of I based on X ∪ Ȳ becomes. For example, for abcd,

    δabc(abcd) = supp(abc)
    δa(abcd) = supp(abc) + supp(abd) + supp(acd) − supp(ab) − supp(ac) − supp(ad) + supp(a).

In general, the number of terms in δX(I) is 2^{|I \ X|} − 1. Therefore, whenever the algorithm has the choice between two generalized itemsets, we will always choose the one with the fewest negations, as it results in a more efficient calculation of the support of I.

2.2 Condensed Representation

Based on the deduction rules, it is possible to generate a summary of the set of frequent itemsets. Indeed, if lI = uI, then supp(I, D) = lI = uI, and hence, we do not need to store I in the representation. Such a set I will be called a derivable itemset (DI); all other itemsets are called non-derivable itemsets (NDIs). Based on this principle, the following condensed representation was introduced in [10]:

    NDI(D, σ) := {I | supp(I, D) ≥ σ, lI ≠ uI}.

The experiments presented in Section 5 show that the collection of non-derivable itemsets is much more concise than the complete collection of frequent itemsets, and often even more concise than other condensed representations. For a discussion of the relation between NDI and the other condensed representations, we refer to [9].

2.3 The NDI Algorithm

In [10], in a slightly different form, the following theorem was proven:


THEOREM 2.1. [10] (Monotonicity) Let I ⊆ J be itemsets. If supp(I) = δX(I), then, for all X′ such that X ⊆ X′ ⊆ X ∪ (J \ I), supp(J) = δX′(J). Hence, if I is derivable, then J is derivable as well.

Based on this theorem, a level-wise, Apriori-like algorithm was given in [10]. In fact, the NDI-algorithm corresponds largely to a constrained mining algorithm with non-derivability as an anti-monotone constraint. In the candidate generation phase of Apriori, in addition to the monotonicity check, the lower and upper bounds on the supports of the candidate itemsets are computed. Such a check is possible, since in Apriori a set I can only become a candidate after all its strict subsets have been counted. The candidate itemsets that have an upper bound below the minimal support threshold are pruned, because they cannot be frequent. The itemsets whose lower bound equals their upper bound are pruned since they are derivable. Because of Theorem 2.1, we know that the supersets of a derivable itemset are derivable as well, and hence, a derivable itemset can be pruned in the same way as an infrequent itemset. Furthermore, from Theorem 2.1, we can derive that if, for a set I, supp(I) = lI or supp(I) = uI, then all strict supersets of I are derivable. These properties lead straightforwardly to the level-wise algorithm NDI given in [10].

3 The Eclat Algorithm

In this section we describe the Eclat-algorithm, since our dfNDI-algorithm is based on it. Eclat was the first successful algorithm proposed to generate all frequent itemsets in a depth-first manner [18, 23]. Later, several other depth-first algorithms were proposed [1, 2, 13]. Given a transaction database D and a minimal support


threshold σ, denote the set of all frequent itemsets with the same prefix I ⊆ I by F[I](D, σ). Eclat recursively generates for every item i ∈ I the set F[{i}](D, σ). (Note that F[{}](D, σ) = ∪_{i ∈ I} F[{i}](D, σ) contains all frequent itemsets.)

For the sake of simplicity and presentation, we assume that all items that occur in the transaction database are frequent. In practice, all frequent items can be computed during an initial scan over the database, after which all infrequent items are ignored.

In order to load the database into main memory, Eclat transforms this database into its vertical format: instead of explicitly listing all transactions, each item is stored together with its cover (also called its tidlist). In this way, the support of an itemset X can easily be computed by intersecting the covers of any two subsets Y, Z ⊆ X such that Y ∪ Z = X.

The Eclat algorithm is given as Algorithm 1. Note that each set I ∪ {i, j} whose support is computed at line 7 of the algorithm represents a candidate itemset. Since the algorithm does not fully exploit the monotonicity property, but generates a candidate itemset based on only two of its subsets, the number of candidate itemsets that is generated is much larger than in a breadth-first approach such as Apriori. As a comparison: Eclat essentially generates candidate itemsets using only the join step of Apriori, since the itemsets necessary for the prune step are not available.

Algorithm 1 Eclat
Input: D, σ, I ⊆ I
Output: F[I](D, σ)
 1: F[I] := {}
 2: for all i ∈ I occurring in D do
 3:   F[I] := F[I] ∪ {I ∪ {i}}
 4:   // Create Di
 5:   Di := {}
 6:   for all j ∈ I occurring in D such that j > i do
 7:     C := cover({i}) ∩ cover({j})
 8:     if |C| ≥ σ then
 9:       Di := Di ∪ {(j, C)}
10:     end if
11:   end for
12:   // Depth-first recursion
13:   Compute F[I ∪ {i}](Di, σ)
14:   F[I] := F[I] ∪ F[I ∪ {i}]
15: end for

EXAMPLE 3. Let the minimal support σ be 2. We continue working on D as given in Example 1. As illustrated in Figure 1, Eclat starts by transforming the database into its vertical format. Then, recursively, the conditional databases are formed; the arcs indicate the recursion. The tree is traversed depth-first, from left to right. For example, the database Dad is formed from Da by intersecting the tid-list of d with the tid-lists of all items that come after d in Da. The tid-lists with the item between brackets (e.g., the tid-list for d in Dc) indicate lists that are computed but not included in the conditional database because the item is infrequent. At any time, only the parents of the conditional database being constructed are kept in memory; once a conditional database is no longer needed, it is removed from memory. For example, when Dad is constructed, the databases Da and D are in memory.

[Figure 1: Eclat traversal. The tree of vertical tid-lists is not reproduced here.]

A technique that is regularly used is to reorder the items in support-ascending order. In Eclat, such reordering can be performed at every recursion step, between line 11 and line 12 of the algorithm. The effect of such a reordering is threefold:

(1) The number of candidate itemsets that is generated is reduced. The generation step of Eclat is comparable to the join step of Apriori: a set a1 . . . ak is generated if both a1 . . . ak−1 and a1 . . . ak−2 ak are frequent. By reordering, the generating sets tend to have lower support, which results in fewer candidates. (2) A second effect of the generating sets tending to have lower support is that their tid-lists are smaller. (3) At a certain depth d, the covers of at most all k-itemsets with the same (k − 1)-prefix are stored in main memory, with k ≤ d. Because of the item reordering, this number is kept small. Experimental evaluation in earlier work has shown that reordering the items results in significant performance gains.

EXAMPLE 4. In Figure 2, we illustrate the effect of item reordering on the dataset given in Example 1, with the same support threshold σ = 2. Compared to Example 3, the tree is more balanced, the conditional databases are smaller, and three fewer candidates are generated.

[Figure 2: Eclat traversal with reordered items. The tree of tid-lists is not reproduced here.]
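For concreteness, the following minimal Python rendering of Algorithm 1 (our sketch, not the authors' implementation) applies the support-ascending reordering at every recursion step; it reuses cover from the snippet in Section 1:

    def eclat(vertical, sigma, prefix=()):
        """vertical maps each item to its cover (a set of tids); returns a
        dict {itemset: support} of all frequent itemsets extending prefix."""
        frequent = {}
        # Reorder the items ascending w.r.t. support at every recursion step.
        items = sorted(vertical, key=lambda i: len(vertical[i]))
        for pos, i in enumerate(items):
            itemset = prefix + (i,)
            frequent[itemset] = len(vertical[i])
            conditional = {}                      # the conditional database D_i
            for j in items[pos + 1:]:
                C = vertical[i] & vertical[j]     # line 7 of Algorithm 1
                if len(C) >= sigma:
                    conditional[j] = C
            frequent.update(eclat(conditional, sigma, itemset))
        return frequent

    vertical = {item: cover(item) for item in "abcde"}
    assert eclat(vertical, sigma=2)[("b", "e")] == 5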

3.1 Diffsets

Recently, Zaki proposed a new approach to efficiently compute the support of an itemset using the vertical database layout [19]. Instead of storing the cover of a k-itemset I, the difference between the cover of I and the cover of the (k − 1)-prefix of I is stored, denoted the diffset of I. To compute the support of I, we simply subtract the size of the diffset from the support of its (k − 1)-prefix. This support can be provided as a parameter within the recursive function calls of the algorithm.


The diffset of an itemset I ∪ {i, j}, given the two diffsets of its subsets I ∪ {i} and I ∪ {j}, with i < j, is computed as follows:

    diffset(I ∪ {i, j}) := diffset(I ∪ {j}) \ diffset(I ∪ {i}).

This technique has experimentally been shown to result in significant performance improvements of the algorithm, now designated dEclat [19]. The original database is still stored in the original vertical database layout. Notice incidentally that, with item reordering, supp(I ∪ {i}) ≤ supp(I ∪ {j}), and hence diffset(I ∪ {i}) is larger than diffset(I ∪ {j}). Thus, to form diffset(I ∪ {i, j}), the larger diffset is subtracted from the smaller, resulting in smaller diffsets. This argument, together with the three effects of reordering pointed out before, makes reordering a very effective optimization.
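The diffset bookkeeping is equally compact in code. The sketch below (ours, following the recursion of [19] as summarized above) carries the supports through the recursion; at the top level, the diffset of {i} w.r.t. the empty prefix is the set of transactions not containing i:

    def declat(diffsets, supports, sigma, prefix=()):
        """diffsets[i] is diffset(prefix + (i,)); supports[i] is its support.
        Returns a dict {itemset: support} of all frequent extensions of prefix."""
        frequent = {}
        items = sorted(diffsets, key=lambda i: supports[i])   # support-ascending
        for pos, i in enumerate(items):
            itemset = prefix + (i,)
            frequent[itemset] = supports[i]
            child_diff, child_supp = {}, {}
            for j in items[pos + 1:]:
                # diffset(I+{i,j}) = diffset(I+{j}) \ diffset(I+{i})
                d = diffsets[j] - diffsets[i]
                # supp = support of the (k-1)-prefix minus the diffset size
                s = supports[i] - len(d)
                if s >= sigma:
                    child_diff[j], child_supp[j] = d, s
            frequent.update(declat(child_diff, child_supp, sigma, itemset))
        return frequent

    all_tids = set(DB)
    top_diff = {i: all_tids - cover(i) for i in "abcde"}
    top_supp = {i: len(cover(i)) for i in "abcde"}
    assert declat(top_diff, top_supp, sigma=2)[("b", "e")] == 5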

EXAMPLE 5. In Figure 3, we illustrate the dEclat algorithm with item reordering on the dataset given in Example 1, with the same support threshold σ = 2. The diffset of, for example, dab (the entry b in the conditional database Dda) is formed by subtracting the diffset of da from the diffset of db. Notice that the items are ordered ascending w.r.t. their support, not w.r.t. the size of their diffset. The support of dab is computed as the support of da (3) minus the size of its diffset (1), which gives a support of 2.

[Figure 3: dEclat traversal with reordered items. The tree of diffsets is not reproduced here.]

4 The dfNDI Algorithm

In this section we describe a depth-first algorithm for mining all frequent non-derivable itemsets. The dfNDI algorithm combines the ideas behind the Eclat algorithm, the diffsets, and the deduction of supports into one hybrid algorithm. The construction of the dfNDI-algorithm is based on the following principles:


1. Just like in Eclat, tidlists (and diffsets) will be used to compute the support of itemsets, and conditional databases are formed recursively. Hence, computing the support of an itemset does not, unlike in the breadth-first version, require a scan over the complete database.

2. There is one problem, however: to compute δX(I), the supports of all sets J such that X ⊆ J ⊂ I must be known. Since Eclat is a depth-first algorithm, many of these supports are not available. This problem can be solved by changing the order in which the search space is traversed: we can keep a depth-first approach and still have the property that all subsets of a set I are handled before I itself.

3. Because we need the supports of already found sets to compute bounds for their supersets, we maintain the found frequent non-derivable itemsets in a specialized structure that allows fast lookup. Since the number of non-derivable itemsets is in general much lower than the number of frequent itemsets, the memory requirements of this specialized storage are modest; they are comparable to the amount of memory the ChARM algorithm uses to store all found frequent closed itemsets [22].

4. The deduction rules allow us to extend the diffsets to tidlists of arbitrary generalized itemsets. That is, if we want to determine the support of a set I, we can use the cover of any of the generalized itemsets based on I. Then, from the support of the generalized itemset X ∪ Ȳ and from δX(I), the support of I itself can be computed. This flexibility allows us to choose a generalized itemset that has minimal cover size.

In the rest of this section we concentrate on these four points. Then, we combine them and give the pseudo-code of dfNDI, which is illustrated with an example.


4.1 Order of Search Space Traversal

We show next how we can guarantee in the Eclat-algorithm that, for all sets I, the supports of the subsets of I are computed before I itself is handled. In Eclat, the conditional databases are computed in the following order (see Example 3):

    D → Da → Dab → Dabc → Dabcd → Dabd → Dac → Dacd → Dad → Db → Dbc → Dbcd → Dbd → Dc → Dcd → Dd

Since the support of a set a1 . . . an is computed as the cardinality of the tidlist of an in the conditional database Da1...an−1, the supports of the itemsets are computed in the following order:

    {a, b, c, d, e}, {ab, ac, ad, ae}, {abc, abd, abe}, {abcd, abce}, {abcde}, {abde}, {acd, ace}, {acde}, {ade}, {bc, bd, be}, {bcd, bce}, {bcde}, {bde}, {cd, ce}, {cde}, {de}

Hence, for example, when the support of abcd is computed, the supports of bc, bd, cd, acd, and bcd have not been counted yet. Therefore, when the search space is explored in this way, all rules for abcd that use one of these five sets cannot be evaluated.

An important observation now is that for every set I, all subsets either occur on the recursion path to I, or after I. For example, the support of abcd is computed in Dabc. The supports of the subsets of abcd are either computed on the recursive path (a, b, c, d in D; ab, ac, ad in Da; abc, abd in Dab) or after abcd (acd in Dac, which is constructed after the recursion below Dab; bc, bd in Db, which comes after the recursion below Da has ended; and cd in Dc, which comes after the recursion below Db has ended).

We can view the recursive structure between the conditional databases in Eclat as a tree in which the children are ordered lexicographically, as illustrated in Figure 4. Eclat traverses this tree depth-first, from left to right; that is, in pre-order. Let the node associated with an itemset I be the node of the conditional database in which the support of I is computed. The observation can now be restated as follows: the nodes of the subsets of I are either on the path from D to the node of I, or in a branch that comes after I, never in a branch that comes before the branch of I.

Hence, we can change the order as follows: the same tree is traversed, still depth-first, but from right to left. We will call this order the reverse pre-order. The numbers in the nodes of the tree in Figure 4 indicate the reverse pre-order.

Figure 4: Recursive structure between the conditional databases (the number in each node gives its rank in the reverse pre-order):

    D (1)
      Da (9)
        Dab (13)
          Dabc (15)
            Dabcd (16)
          Dabd (14)
        Dac (11)
          Dacd (12)
        Dad (10)
      Db (5)
        Dbc (7)
          Dbcd (8)
        Dbd (6)
      Dc (3)
        Dcd (4)
      Dd (2)

In this way, the path from D to the node of a set I remains the same, but all branches that come after the branch of I are now handled before the branch of I. As such, in reverse pre-order, all subsets of I are handled before I itself.

Besides enabling a depth-first algorithm for mining non-derivable itemsets, the reverse pre-order has other applications as well: a similar technique can be used when mining downward closed collections of itemsets in a depth-first manner while preserving full pruning capabilities. It must be noted, however, that this ability comes at the cost of storing all itemsets found in the collection. Hence, the usefulness of this technique can be compromised when the downward closed collection becomes too large. For such large collections it is probably better to sacrifice part of the pruning in order to avoid storing the generated itemsets. For small collections, though, the reverse pre-order is very useful; many condensed representations, such as the non-derivable itemsets, are typical examples of relatively small downward closed collections of itemsets.

Notice also that the reverse pre-order has no repercussions on performance or memory usage; the same covers are computed, and the same covers are in memory simultaneously; only the order in which they are generated differs.
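The reverse pre-order is easy to state in code. This short sketch (ours) enumerates the conditional databases of Figure 4 and checks the subsets-first property:

    def reverse_preorder(items, prefix=()):
        """Yield the nodes (conditional-database prefixes) of the
        lexicographic recursion tree: node first, children right-to-left."""
        yield prefix
        for pos in range(len(items) - 1, -1, -1):      # rightmost child first
            yield from reverse_preorder(items[pos + 1:], prefix + (items[pos],))

    order = list(reverse_preorder(tuple("abcd")))
    # order starts (), (d,), (c,), (c,d), (b,), ... matching Figure 4.
    assert order.index(("a", "b")) > order.index(("a",))  # path to Dab comes first
    assert order.index(("a", "b")) > order.index(("b",))  # Db (node of bc) before Dab (node of abc)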

With this traversal order, we are thus guaranteed that all subsets of I are handled before I itself. Another very important remark is that this guarantee remains even when, at every step in the recursion, the items in the conditional databases are reordered according to their support. Hence, we can still apply the important item reordering techniques.

EXAMPLE 6. Consider the tree of Example 4 that was constructed with item reordering. The numbers in the nodes indicate the reverse pre-order. The reason that the same relation between a set and its subsets still applies can be seen as follows: the itemset abde is counted in the conditional database Ddab as the cardinality of the tidlist of e. Hence, the order in which the items of abde were selected is dabe. That is: (a) in the database D, d came before a, b and e; (b) in Dd, a came before b and e; and (c) in Dda, b came before e.


The annotated recursion tree of Example 4 (the number in each node gives its rank in the reverse pre-order):

    D (1)
      Dc (9)
        Dca (11)
        Dcb (10)
      Dd (5)
        Dda (7)
          Ddab (8)
        Ddb (6)
      Da (3)
        Dab (4)
      Db (2)

Since the items in a conditional database are processed backwards in the recursion: (a) a, b, d, e are in D, and ab, ae, be, abe are handled before Dd is constructed; (b) ad, bd and de are in Dd, and bde is handled before Dda is constructed; and (c) ade and abd are in Dda.

In general, let I be an itemset whose elements are chosen in the order i1 . . . in, and let J be a strict subset of I. If J corresponds to a prefix of i1 . . . in, then J is computed on the recursive path to Di1...in−1. Otherwise, let ij be the first item of i1 . . . in that is not in J. Then, in the conditional database Di1...ij−1, all items in J \ {i1, . . . , ij−1} come strictly after ij, and hence, the node for J is to the right of the node for I and is thus visited before I in the reverse pre-order.

4.2 Storing the Found NDIs

The frequent non-derivable itemsets are stored together with their supports in a structure that allows fast lookup of the supports. In our implementation, an itemset trie was used for this purpose. In the extreme case that the number of non-derivable itemsets becomes too large to be maintained in main memory, a condensed representation [15] can be used as well; for example, only the closed itemsets [16] could be stored. Finally, remark that it is not uncommon for an algorithm to maintain itemsets in main memory: most closed itemset mining algorithms need to store the found frequent closed sets in main memory (e.g., Charm [21]).
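A minimal version of such an itemset trie (our illustration; the paper does not detail the structure further) could look as follows:

    class ItemsetTrie:
        """Maps itemsets (stored as sorted item paths) to their supports."""
        def __init__(self):
            self.children = {}
            self.support = None

        def insert(self, itemset, support):
            node = self
            for item in sorted(itemset):
                node = node.children.setdefault(item, ItemsetTrie())
            node.support = support

        def lookup(self, itemset):
            node = self
            for item in sorted(itemset):
                node = node.children.get(item)
                if node is None:
                    return None
            return node.support

    ndis = ItemsetTrie()
    ndis.insert({"a", "b"}, 4)
    assert ndis.lookup({"b", "a"}) == 4     # lookup is order-insensitive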

4.3 Generalizing the Diffsets

In Eclat and dEclat, the conditional database DI∪{i} is generated from the database DI if this database contains item i. All items j > i that are in DI will be in the conditional database DI∪{i}. In Eclat, the tidlist of j in DI∪{i} is computed by intersecting the tidlists of i and j in DI. In dEclat, not the tidlists but the diffsets are maintained in the conditional databases; the diffset of j in DI∪{i} is computed by subtracting the diffset of i in DI from the diffset of j in DI. Hence, in the generalized-itemset notation, in Eclat the conditional database DI∪{i} contains, for all items j > i, the pairs (j, cover(I ∪ {i, j})), whereas dEclat maintains the pairs (j, cover(I ∪ {i, j̄})). In dfNDI, we will use a similar procedure.

First of all, the diffsets are extended to covers of arbitrary generalized itemsets. That is, not only covers of the type cover(I ∪ {i, j}) and cover(I ∪ {i, j̄}) will be considered, but also cover(X ∪ Ȳ) for any generalized itemset X ∪ Ȳ based on I ∪ {i, j}. Secondly, the choice of which type of cover to use is not static in dfNDI: in contrast to Eclat, which always uses cover(I ∪ {i, j}), and dEclat, which always uses cover(I ∪ {i, j̄}), dfNDI postpones this choice to run-time. At run-time, both covers are computed, and the one with minimal size is chosen. In this way it is guaranteed that the size of the covers at least halves from D to Di. The calculation of both covers can be done with minimal overhead in the same iteration: when iterating over the cover of {i}, the set of all tids that are not in cover({i}) ∩ cover({j}) is exactly cover({i}) \ cover({j}).

Let D be a database that contains tidlists of both items and negations of items, and let l be the list associated with item i; that is, l is either i or ī. The procedure used in dfNDI to construct the conditional database Dl from D is given in Figure 5, and is applied recursively.

Input: D, σ, l
Output: Dl
 1: for all k occurring in D after l do
 2:   // k is either j or j̄
 3:   C[k] := cover({l}) ∩ cover({k})
 4:   C[k̄] := cover({l}) \ cover({k})
 5:   if {i, j} is σ-frequent then
 6:     if |C[j]| ≤ |C[j̄]| then
 7:       Dl := Dl ∪ {(j, C[j])}
 8:     else
 9:       Dl := Dl ∪ {(j̄, C[j̄])}
10:     end if
11:   end if
12: end for

Figure 5: Recursive construction of Dl in dfNDI.

Suppose that we have recursively constructed the database DG, with G = X ∪ Ȳ a generalized itemset. For every item i (resp. negated item ī) in DG, the cardinality of its tidlist is in fact the support of the generalized itemset X ∪ {i} ∪ Ȳ (resp. X ∪ Ȳ ∪ {ī}). The support test in line 5 is then performed using the deduction rules, as explained in Section 2. For example, suppose l = i and k = j̄, and let J = X ∪ Y ∪ {i, j}. The value δX∪{i}(J) is computed (using the stored supports), and from this value and the size of the cover of X ∪ {i} ∪ Ȳ ∪ {j̄}, the support of J can be found:

    supp(J) = δX∪{i}(J) + (−1)^{|Y ∪ {j}|} supp(X ∪ {i} ∪ Ȳ ∪ {j̄}).

Notice also that if C[j] (resp. C[j̄]) is empty, then supp(G ∪ {i, j}) = 0 (resp. supp(G ∪ {i, j̄}) = 0). From Theorem 2.1, we can derive that in that situation every superset of X ∪ Y ∪ {i, j} is a derivable itemset.
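In code, the heart of Figure 5 is the choice between an intersection and a set difference. The following Python sketch is a hypothetical rendering of this step; the pair representation (item, negated) for literals and the callback derived_support, which stands in for the Equation (2.1) computation from |C[j]| and the stored subset supports, are our own choices, not the paper's:

    def build_conditional(db, l_cover, sigma, derived_support):
        """Construct D_l from db = {(item, negated): cover}, where l_cover
        is the cover of the literal l chosen one level up."""
        conditional = {}
        for (j, negated), k_cover in db.items():
            inter = l_cover & k_cover        # line 3 of Figure 5: C[k]
            diff = l_cover - k_cover         # line 4 of Figure 5: C[k-bar]
            # Orient the two covers as C[j] (item kept) and C[j-bar] (negated).
            c_j, c_jbar = (diff, inter) if negated else (inter, diff)
            if derived_support(j, len(c_j)) >= sigma:    # the sigma-frequency test
                if len(c_j) <= len(c_jbar):              # keep the smaller cover
                    conditional[(j, False)] = c_j
                else:
                    conditional[(j, True)] = c_jbar
        return conditional

    # At the top level (G = {}), |C[j]| is exactly supp({i, j}), so the
    # identity function is a valid stand-in for derived_support:
    D0 = {("b", False): cover("b"), ("e", False): cover("e")}
    Da = build_conditional(D0, cover("a"), sigma=2,
                           derived_support=lambda j, n_pos: n_pos)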


EXAMPLE 7. Consider a small database D over the items a, b, c, d, together with the conditional databases Da and Dab̄. [The vertical tidlist tables for D, Da, and Dab̄ are not reproduced here.] Da contains b̄ because cover(a) \ cover(b) is smaller than cover(a) ∩ cover(b). Dab̄ contains d̄ since cover(b̄) \ cover(d) is smaller than cover(b̄) ∩ cover(d) in Da. The support of abc is counted via the cover of c in Dab̄. Since this cover contains one element, supp(ab̄c) = 1, and hence, supp(abc) = supp(ac) − supp(ab̄c) = 2 − 1 = 1. The support of ab is found in the trie that maintains the found frequent non-derivable itemsets. Notice that the cover of d̄ in Dab̄ is empty. Therefore, supp(ab̄d̄) = 0, and thus, every superset of abd is derivable.

4.4 The Full dfNDI Algorithm

The combination of the techniques discussed in this section gives the dfNDI-algorithm described in Algorithm 2. In the recursion, the generalized itemset G that corresponds to the items or negated items chosen up to that point must be passed on. Hence, the set of all frequent NDIs is given by dfNDI(D, σ, {}). We assume that D is given in vertical format and that all items in D are frequent.

As already mentioned, reordering the items in the different conditional databases does not fundamentally change the algorithm. With reordering, we are still guaranteed that all subsets of an itemset I are handled before I itself. Therefore, in the experiments, we will use item reordering techniques to speed up the computation. There are different interesting orderings possible:

1. Ascending w.r.t. support: the advantages of this ordering are that the number of generated candidates is reduced and that the tree is more balanced. Since lower and upper bounds on the support are computed for every candidate, generating fewer candidates can be very interesting.

2. Ascending w.r.t. size of the cover: with this ordering, the sizes of the covers will in general be smaller, since the covers of the first items in the order are used more often than the covers of the items later in the ordering.

Depending on the application, either the first or the second ordering will be better. Which one is best probably depends on the selectivity of non-derivability versus the selectivity of frequency: if most sets are pruned due to non-derivability, the second ordering will be more interesting; if frequency is the main pruner, the first ordering is more interesting.

Algorithm 2 dfNDI
Input: D, σ, G = X ∪ Ȳ
Output: dfNDI(D, σ, G)
 1: dfNDI := {}
 2: for all l that occur in D, ordered descending do
 3:   // l = i or l = ī, for an item i
 4:   dfNDI := dfNDI ∪ {X ∪ Y ∪ {i}}
 5:   // Create Dl
 6:   Dl := {}
 7:   for all k occurring in D after l do
 8:     // k = j or k = j̄, for an item j
 9:     // Let J = X ∪ Y ∪ {i, j}
10:     Compute bounds [l, u] on the support of J
11:     if l ≠ u and u ≥ σ then
12:       // J is an NDI
13:       Store J in the separate trie
14:       // Compute the support of J
15:       C[k] := cover({l}) ∩ cover({k})
16:       C[k̄] := cover({l}) \ cover({k})
17:       // |C[j]| is supp(X ∪ Ȳ ∪ {l, j})
18:       Compute supp(J) based on the support of its subsets and on |C[j]|
19:       if supp(J) ≥ σ and supp(J) ≠ l and supp(J) ≠ u then
20:         if |C[j]| ≤ |C[j̄]| then
21:           Dl := Dl ∪ {(j, C[j])}
22:         else
23:           Dl := Dl ∪ {(j̄, C[j̄])}
24:         end if
25:       end if
26:     end if
27:   end for
28:   // Depth-first recursion
29:   Compute dfNDI(Dl, σ, G ∪ {l})
30:   dfNDI := dfNDI ∪ dfNDI(Dl, σ, G ∪ {l})
31: end for
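As a small sanity check of the support derivation used at line 18 (a snippet of ours, applied to the database of Example 1 with the helpers defined in the earlier sketches), the support of abc can be derived from supp(ac) and the cover of the generalized itemset ab̄c:

    # supp(abc) = delta_ac(abc) + (-1)^1 * supp(a b-bar c),
    # where delta_ac(abc) = supp(ac):
    c_abc_neg = (cover("a") - cover("b")) & cover("c")   # cover of a b-bar c
    assert support("abc") == support("ac") - len(c_abc_neg)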


There are many different variants of the dfNDI algorithm possible; depending on the situation, one variant may be better than another.

1. Ordering the items in the conditional databases, as described above.

2. Since computing the bounds can be very costly, we want to avoid this work as much as possible. Hence, in situations where frequency is highly selective, it is better to switch the support test and the non-derivability test. That is, lines 10 and 11 in Algorithm 2 are moved to after the if-test on line 19, and the test supp(J) ≠ l, u is moved from the if-test on line 19 to after the calculation of the bounds. Hence, bounds are only computed for itemsets that are frequent.

3. Another possibility to reduce the amount of work spent on the calculation of the bounds is to compute not all bounds, but only a limited number. A similar approach already turned out to be quite useful in the context of the breadth-first NDI-algorithm [10]. There, the depth of a rule based on X ∪ Ȳ was defined as |Y|; the lower the depth of a rule, the less complex it is. Instead of computing all rules, only the rules up to a limited depth can be used. In practice it turned out that, most of the time, using rules up to depth 3 does not affect the precision; put differently, most of the work is done by the rules of limited depth.

EXAMPLE 8. We illustrate the dfNDI algorithm on the database of Example 3. Every conditional database is ordered ascending w.r.t. the size of the cover. The tid-lists with the item between brackets indicate lists that are computed but not included in the conditional database, because the itemset I associated with the item is either (a) infrequent (e.g., d in Dc), or (b) such that supp(I) = lI or supp(I) = uI. In case (b), I is a frequent NDI, but all supersets of I are derivable; the items in case (b) are indicated in bold.

We start with the database D. Because of the reverse pre-order that is used, the database Db is constructed first. Only item e comes after b in D. The lower and upper bounds

on be are computed: lbe = supp(b) + supp(e) − supp(∅) = 5 and ube = supp(b) = 7. Hence, be is not derivable. Both cover(b) ∩ cover(e) and cover(b) \ cover(e) are constructed; |cover(b) ∩ cover(e)| = 5, and hence, supp(be) = 5. Since supp(be) = lbe, all supersets of be must be derivable.

Next, Da is constructed. For ab, the following lower and upper bounds are computed: lab = 3, uab = 6. The two covers cover({a}) ∩ cover({b}) and cover({a}) \ cover({b}) are computed. From the sizes of these covers, the support of ab is computed: supp(ab) = 4. Because the support differs from both the lower and the upper bound, b̄ will be in Da; since cover({a}) \ cover({b}) was the smaller of the two covers, (b̄, cover({a}) \ cover({b})) is added. Then the item e is handled. The bounds for ae are lae = 4 and uae = 6; thus, ae is non-derivable. The two covers for ae are computed and the support is derived: supp(ae) = 4. Because supp(ae) = lae, item e will not be added to Da, because all supersets of ae must be derivable. This procedure continues until all items are handled. In the construction of Dde, it turns out that ade is derivable, since δd(ade) = 2 and δ∅(ade) = 2, and hence lade = uade = 2.
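These numbers can be verified mechanically with the bounds sketch from Section 2 (our snippet; the database is the one of Example 1):

    assert bounds("be") == (5, 7) and support("be") == 5    # supp equals l_be
    assert bounds("ab") == (3, 6) and support("ab") == 4
    assert bounds("ae") == (4, 6) and support("ae") == 4    # supp equals l_ae
    assert bounds("ade") == (2, 2)                          # ade is derivable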

[The resulting tree of conditional databases is not reproduced here.]

The collection of frequent non-derivable itemsets in this example is: {∅, a, b, c, d, e, ab, ac, ad, ae, bc, bd, be, ce, de, abc, abd}.

5 Experiments

The experiments were performed on a 1.5GHz Pentium IV PC with 1GB of main memory. To empirically evaluate the proposed dfNDI-algorithm, we performed several tests on the datasets summarized in the following table. For each dataset, the table shows the number of transactions, the number of items, and the average transaction length.

    Dataset    # trans.   # items   Avg. length
    BMS-POS    515 597    1 656     6.53
    Chess      3 196      76        37
    Pumsb      49 046     2 112     74
    Mushroom   8 124      120       23

These datasets are all well-known benchmarks for frequent itemset mining. The BMS-POS dataset contains click-stream data from a small dot-com company that no longer exists; it was donated to the research community by Blue Martini Software. The Pumsb dataset is based on census data, the Mushroom dataset contains characteristics of different species of mushrooms, and the Chess dataset contains different game configurations. The Pumsb dataset is available in the UCI KDD Repository [14], and the Mushroom and Chess datasets can be found in the UCI Machine Learning Repository [6].

Obviously, as the number of non-derivable frequent itemsets is significantly smaller than the total number of frequent itemsets, it would be unfair to compare the


performance of the proposed algorithm with any other normal frequent itemset mining algorithm. Indeed, as soon as the threshold is small enough or the data is highly correlated, it is well known that traditional techniques fail, simply due to the massive number of frequent sets that is produced, rather than to any inherent algorithmic characteristic of the algorithms. Therefore, we merely compare the performance of dfNDI with NDI. Also note that a thorough comparison of the sizes of different condensed representations has already been presented in [10] and will not be repeated here.

In our experiments, we compared the NDI algorithm with three different versions of the dfNDI algorithm. The first version, 'dfNDI', is the regular one as described in this paper. The second version, 'dfNDI - negations', does not allow the use of covers of generalized itemsets, and hence stores the full cover of every candidate itemset and always computes the covers by intersecting the covers of two subsets. The third version, 'dfNDI + support order', is the dfNDI algorithm in which items are dynamically reordered in every recursion step according to their support, while the regular dfNDI algorithm orders the items according to the size of the cover of the generalized itemset each item represents. All experiments were performed for varying minimum support thresholds. The results can be seen in Figures 6 and 7.

First, we compared the time performance of these four algorithms. As expected, dfNDI performs much better than NDI. Using the smallest cover of a generalized itemset based on the current itemset also has a significant effect, which is especially visible on the pumsb dataset. This is of course mainly due to the faster set intersection and set difference operations on the smaller covers. Ordering the items according to the size of the covers also proves to be better than ordering according to the support of the items, although the difference is never very large.

In the second experiment, we compared the memory usage of the four algorithms. As can be seen, the NDI algorithm needs the least amount of memory. Of course, this is expected, as dfNDI also stores parts of the database in main memory. Fortunately, the optimization of storing only the smallest cover of each generalized itemset indeed results in reduced memory usage. Again, the ordering based on the cover size results in slightly better memory usage.

6 Conclusion

In this paper, we presented a new itemset search space traversal strategy, which allows depth-first itemset mining algorithms to exploit the frequency knowledge of all subsets of a given candidate itemset.

Second, we presented a depth-first non-derivable itemset mining algorithm, which uses this traversal, as it needs, for every generated candidate itemset, the supports of all of its subsets. From the viewpoint of condensed representations of frequent sets, the non-derivable itemset collection has already been shown to be superior to the other representations in almost all cases.

Third, we generalized the diffset technique and store the cover of an itemset containing several negated items. These covers are typically much smaller than the usual covers of regular itemsets, and hence allow faster set intersection and set difference operations.

The resulting algorithm, dfNDI, thus inherits the positive characteristics of several different techniques, allowing fast discovery of a reasonable number of frequent itemsets that are not derivable from their subsets. These claims are supported by several experiments on real-life datasets.

References

[1] R. Agarwal, C. Aggarwal, and V. Prasad. Depth first generation of long patterns. In R. Ramakrishnan, S. Stolfo, R. Bayardo, Jr., and I. Parsa, editors, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 108-118. ACM Press, 2000.
[2] R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 61(3):350-371, March 2001.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, volume 22(2) of SIGMOD Record, pages 207-216. ACM Press, 1993.
[4] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. VLDB Int. Conf. Very Large Data Bases, pages 487-499, Santiago, Chile, 1994.
[5] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. SIGKDD Explorations, 2(2):66-75, 2000.
[6] C. Blake and C. Merz. The UCI Repository of machine learning databases [http://www.ics.uci.edu/∼mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.
[7] J.-F. Boulicaut and A. Bykowski. Frequent closures as a concise representation for binary data mining. In Proc. PaKDD Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pages 62-73, 2000.
[8] T. Calders. Deducing bounds on the support of itemsets. In Database Technologies for Data Mining, chapter ?, pages ?-?. Springer, 2003.
[9] T. Calders and B. Goethals. Minimal k-free representations of frequent sets. In Proc. PKDD Int. Conf. Principles of Data Mining and Knowledge Discovery, pages 71-82, 2002.
[10] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Proc. PKDD Int. Conf. Principles of Data Mining and Knowledge Discovery, pages 74-85. Springer, 2002.
[11] J. Galambos and I. Simonelli. Bonferroni-type Inequalities with Applications. Springer, 1996.


Figure 6: Performance comparison (runtime in seconds vs. minimum support) of NDI, dfNDI + support order, dfNDI - negations, and dfNDI on (a) BMS-POS, (b) chess, (c) mushroom, and (d) pumsb. [Plots not reproduced here.]

[12] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD Int. Conf. Management of Data, pages 1-12, Dallas, TX, 2000.
[13] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 2003. To appear.
[14] S. Hettich and S. D. Bay. The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science, 1999.
[15] H. Mannila and H. Toivonen. Multiple uses of frequent sets and condensed representations. In Proc. KDD Int. Conf. Knowledge Discovery in Databases, 1996.
[16] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. ICDT Int. Conf. Database Theory, pages 398-416, 1999.
[17] J. Pei, J. Han, and R. Mao. Closet: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Dallas, TX, 2000.
[18] M. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, May/June 2000.
[19] M. Zaki and K. Gouda. Fast vertical mining using diffsets. In L. Getoor, T. E. Senator, P. Domingos, and C. Faloutsos, editors, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2003.
[20] M. Zaki and C. Hsiao. ChARM: An efficient algorithm for closed association rule mining. Technical Report 99-10, Computer Science, Rensselaer Polytechnic Institute, 1999.
[21] M. Zaki and C. Hsiao. ChARM: An efficient algorithm for closed association rule mining. In Proc. SIAM Int. Conf. on Data Mining, 2002.
[22] M. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In R. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani, editors, Proceedings of the Second SIAM International Conference on Data Mining, 2002.
[23] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In D. Heckerman, H. Mannila, and D. Pregibon, editors, Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 283-286. AAAI Press, 1997.

Figure 7: Memory usage comparison (size in bytes vs. minimum support) of NDI, dfNDI + support order, dfNDI - negations, and dfNDI on (a) BMS-POS, (b) chess, (c) mushroom, and (d) pumsb. [Plots not reproduced here.]
