
Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark
Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang
School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
23 June 2019

Content
- Motivation
- Problem Statement & Preliminaries
- High Utility Pattern Mining, Sequential Algorithms, Frameworks
- Our Mining Approach
- New Parallel Algorithm Based on Spark
- Experimental Evaluation
- Conclusion and Future Work
- References

Motivation
- High Utility Pattern (HUP) Mining vs Frequent Pattern (FP) Mining
  - HUP: utility = user's interest + statistical significance
  - FP: support = statistical significance only
- HUP mining is much harder than FP mining
  - Anti-monotonicity holds for FP: support of a pattern ≤ support of its sub-pattern
  - Anti-monotonicity does not hold for HUP: the utility of a pattern may be higher or lower than the utility of its sub-pattern
- Parallelization to deal with the hardness of mining big data

Problem Statement & Preliminaries: High Utility Pattern Mining
- What products purchased together have high profits?
- The utility of a set of products = the profits of those products in the transactions containing them, depending on quantity and price/cost
- FP: What products are frequently purchased together?

Utility table
Item (I)   Utility (U)
a          1
b          2
c          1
d          5
...        ...

Shopping transactions
Tid   Items
t1    b:1, c:2, d:1, g:1
t2    a:4, b:1, c:3, d:1, e:1
t3    a:4, c:2, d:1
t4    c:2, e:1, f:1
...   ...
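The utility definition above can be tried on the rows that the slide actually shows. The following is a minimal sketch, not the authors' code: it uses only the four listed transactions and the four listed unit utilities (a, b, c, d), so patterns containing e, f, or g are out of scope, and the function names `pattern_utility` and `support` are chosen here for illustration.

```python
# Per-unit utility of each item, copied from the utility table.
# Items e, f, g are elided in the table, so they are deliberately absent.
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5}

# Item -> purchased quantity, for the four transactions shown on the slide.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}


def pattern_utility(pattern):
    """Sum quantity * unit utility of the pattern's items over every
    transaction that contains the whole pattern."""
    total = 0
    for items in TRANSACTIONS.values():
        if all(item in items for item in pattern):
            total += sum(items[item] * UNIT_UTILITY[item] for item in pattern)
    return total


def support(pattern):
    """Number of shown transactions that contain the whole pattern."""
    return sum(
        1 for items in TRANSACTIONS.values()
        if all(item in items for item in pattern)
    )


if __name__ == "__main__":
    for pattern in (["d"], ["c", "d"], ["a"], ["a", "b"]):
        print(pattern, "utility:", pattern_utility(pattern),
              "support:", support(pattern))
```

On these rows, extending {d} to {c, d} raises the utility while extending {a} to {a, b} lowers it, whereas support never increases when a pattern grows. This is the lack of anti-monotonicity that makes HUP mining harder than FP mining.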

Problem Statement & Preliminaries: Well-known Sequential Mining Algorithms

Algorithm   References        Search Strategy     Candidates   Pruning Strategy
TwoPhase    [1] KDD           Breadth (Apriori)   With         TWU
CTU-PROL    [3] PAKDD         Breadth (Apriori)   With         TWU
IHUP        [5] TKDE          Depth (FP-Growth)   With         TWU
UPGrowth    [6] KDD, TKDE     Depth (FP-Growth)   With         TWU
D2HUP       [7] ICDM, TKDE    Depth (OP)          Without      Tight bound
HUI-Miner   [8] CIKM          Depth (Eclat)       Without      Tight bounds

Problem Statement & Preliminaries: Spark / MapReduce Framework
Distributed Computing Frameworks [9,10]
- Data are distributed over a cluster
  - One split on one node
  - Represented as <key, value> pairs: input, output, and interim results
- Processing by a series of jobs
  - A job is dispatched to where a data split resides, and executed in parallel
  - A job is defined by a mapper and a reducer, and executed in two phases
- Resilient Distributed Dataset (RDD): memory based
  - Transformations / Actions on RDD
[Figure: cluster of servers (nodes) with a master and slaves]
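The mapper/reducer model described above can be illustrated without a cluster. The sketch below is a single-process simulation in plain Python, not Spark or Hadoop code: `run_job`, `quantity_mapper`, and `sum_reducer` are hypothetical names, and the two "splits" simply divide the four shopping transactions from the earlier slide.

```python
from collections import defaultdict


def run_job(splits, mapper, reducer):
    """Simulate one job: map every record to <key, value> pairs,
    group the interim pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for split in splits:  # on a real cluster each split is processed on its own node
        for record in split:
            for key, value in mapper(record):
                grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# Two data splits, each holding <tid, items> records.
SPLITS = [
    [("t1", {"b": 1, "c": 2, "d": 1, "g": 1}),
     ("t2", {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1})],
    [("t3", {"a": 4, "c": 2, "d": 1}),
     ("t4", {"c": 2, "e": 1, "f": 1})],
]


def quantity_mapper(record):
    """Emit one <item, quantity> pair per item of a transaction."""
    _tid, items = record
    return list(items.items())


def sum_reducer(_key, values):
    """Add up all values that share a key."""
    return sum(values)


total_quantity = run_job(SPLITS, quantity_mapper, sum_reducer)

if __name__ == "__main__":
    print(total_quantity)
```

In Spark the same shape would typically be expressed as transformations on an RDD followed by an action, with interim results kept in memory rather than written out between jobs.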

Our Mining Approach: Breadth-First Search, Improved Utility Lists
- Breadth-First Search
  - adapting HUI-Miner, which is derived from Eclat and is Depth-First
- Improved vertical data structure: UtilityList
  - Ordering items, e, c, b, a, d, in ascending order of transaction utilities
  - <{e}, UL({e})>
  - <{b}, UL({b})>
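To make the vertical structure concrete, here is a hedged sketch of building HUI-Miner-style utility lists, where each entry is (tid, iutil, rutil): iutil is the item's utility in that transaction and rutil is the utility of the items that follow it in the processing order. Two assumptions are made here: because the unit utility of e is elided on the earlier slide, only items a, b, c, d are kept, and the order c, b, a, d is the slide's order with e dropped. The fourth field of the improved list (piutil) is not modelled.

```python
# Unit utilities from the utility table; e, f, g are elided there and skipped here.
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5}

# Assumption: the slide's order e, c, b, a, d restricted to items with known utility.
ORDER = ["c", "b", "a", "d"]

TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}


def build_utility_lists(transactions, unit_utility, order):
    """Return item -> list of (tid, iutil, rutil) entries."""
    rank = {item: position for position, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid, items in transactions.items():
        # Keep items with a known utility, sorted by the processing order.
        kept = sorted((item for item in items if item in rank), key=rank.get)
        utils = [items[item] * unit_utility[item] for item in kept]
        remaining = sum(utils)
        for item, util in zip(kept, utils):
            remaining -= util  # utility of the items that come after this one
            lists[item].append((tid, util, remaining))
    return lists


UTILITY_LISTS = build_utility_lists(TRANSACTIONS, UNIT_UTILITY, ORDER)

if __name__ == "__main__":
    for item in ORDER:
        print(item, UTILITY_LISTS[item])
```

Summing the iutil column gives the item's utility, and iutil plus rutil summed over the list gives an upper bound on the utility of any extension of the item, which is the basis for pruning in HUI-Miner-like algorithms.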

Our Mining Approach (cont): Join Utility Lists
- Mining high utility patterns by joining UtilityLists
  - two k-patterns: <{e,b}, UL({e,b})> and <{e,a}, UL({e,a})>
  - (k+1)-pattern: <{e,b,a}, UL({e,b,a})>
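The join can be sketched with the standard HUI-Miner construction, which may differ in detail from the improved lists of Phps. Entries are (tid, iutil, rutil); two patterns that share a prefix are joined on tid, and the prefix utility is subtracted once so it is not counted twice. Because the utility of e is not given on the earlier slide, item c plays the role of the shared prefix here, and the input lists are hypothetical values derived from the four shown transactions restricted to items a, b, c, d.

```python
# Utility lists of single items as (tid, iutil, rutil), assuming the order c, b, a, d.
UL_C = [("t1", 2, 7), ("t2", 3, 11), ("t3", 2, 9), ("t4", 2, 0)]
UL_B = [("t1", 2, 5), ("t2", 2, 9)]
UL_A = [("t2", 4, 5), ("t3", 4, 5)]


def join(ul_px, ul_py, ul_prefix=None):
    """Join the utility lists of patterns Px and Py that share prefix P.

    For every tid present in both lists:
      iutil = iutil(Px) + iutil(Py) - iutil(P)
      rutil = rutil(Py)
    Pass ul_prefix=None when joining single items (empty prefix).
    """
    prefix_iutil = {tid: iutil for tid, iutil, _ in (ul_prefix or [])}
    py_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in ul_py}
    joined = []
    for tid, iutil_x, _ in ul_px:
        if tid in py_by_tid:
            iutil_y, rutil_y = py_by_tid[tid]
            joined.append(
                (tid, iutil_x + iutil_y - prefix_iutil.get(tid, 0), rutil_y)
            )
    return joined


UL_CB = join(UL_C, UL_B)           # 2-pattern {c, b}
UL_CA = join(UL_C, UL_A)           # 2-pattern {c, a}
UL_CBA = join(UL_CB, UL_CA, UL_C)  # 3-pattern {c, b, a}, shared prefix {c}

if __name__ == "__main__":
    print("UL({c,b})   =", UL_CB)
    print("UL({c,a})   =", UL_CA)
    print("UL({c,b,a}) =", UL_CBA)
    print("utility of {c,b,a} =", sum(iutil for _, iutil, _ in UL_CBA))
```

The only transaction containing c, b, and a is t2, where their utilities are 3, 2, and 4, so the joined entry carries iutil 9 and the remaining utility of d, which is 5.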

New Parallel Algorithm Based on Spark: Three Phases
Phps: Parallel high utility pattern mining based on Spark

Phase I, key-value pairs:
  <i, (u(i,tid), u(t,tid))>
  <i, twu(i)>

Phase II, key-value pairs:
  <i, (tid, iutil, rutil)>
  <i, List(tid, iutil, rutil, piutil)>
  <i, (List(·, ·, ·, ·), iutilSum, rutilSum)>

Phase III, key-value pairs:
  <P_{k-1}, UL(P_{k-1})>
  <P_{k-2}, (P_{k-1}, UL(P_{k-1}))>
  <P_k, List(·, ·, ·, ·)>
  <P_k, UL(P_k)>
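Phase I ends with one twu(i) value per item. As a rough, single-process illustration (not the Phps implementation), the sketch below computes the transaction-weighted utility of each item, meaning the sum of the utilities of all transactions that contain it, and then drops items whose TWU falls below a threshold. It assumes the four transactions and four unit utilities shown on the earlier slide; items e, f, g are ignored because their utilities are elided, so the numbers are not comparable with the item ordering given in the slides. The names `phase_one_twu` and `promising_items` and the threshold 30 are hypothetical.

```python
from collections import Counter

UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5}

TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}


def transaction_utility(items):
    """Utility of a whole transaction, over items with a known unit utility."""
    return sum(
        quantity * UNIT_UTILITY[item]
        for item, quantity in items.items()
        if item in UNIT_UTILITY
    )


def phase_one_twu(transactions):
    """Map each transaction to <item, transaction utility> pairs and
    reduce them by key with a sum, giving <item, twu(item)>."""
    twu = Counter()
    for items in transactions.values():
        tu = transaction_utility(items)
        for item in items:
            if item in UNIT_UTILITY:
                twu[item] += tu
    return dict(twu)


def promising_items(twu, min_util):
    """Items whose TWU reaches min_util; the others cannot occur in any
    high utility pattern and can be discarded."""
    return sorted(item for item, value in twu.items() if value >= min_util)


TWU = phase_one_twu(TRANSACTIONS)

if __name__ == "__main__":
    print(TWU)
    print(promising_items(TWU, 30))
```

Unlike utility itself, TWU is anti-monotone, which is why it is usable as a pruning bound before the utility lists are built.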

Experimental Evaluation
- 2 algorithms
  - Phps: our algorithm
  - PhpMR: the competitor
- 4 datasets

Dataset        #Items   #Trans.     Avg. Trans. Len.
Chess          76       3,196       37
WebView-1      497      59,602      2.5
T10DI6N1KD1M   1,000    933,493     10
Chainstore     46,086   1,112,949   7.2

Experimental Evaluation: Running time with changing minUtil
[Figure: running time (s) of Phps and PhpMR versus minutil (%) on (a) Chess, (b) WebView, (c) T10DI6N1KD1M, (d) Chainstore]

Experimental Evaluation: Running time with each iteration
[Figure: running time (s) of Phps and PhpMR per pass of iteration on (a) Chess, minutil = 37%; (b) Chainstore, minutil = 0.5%]

Conclusion and Future Work
Conclusion
- Phps: a parallel Eclat-like algorithm based on Spark
  - An improved vertical data structure
  - A three-phase parallel mining framework
  - An efficient algorithm
Future Work
- Hybrid search: BF + DF
- More pruning in Phase I (filtering irrelevant items)
- Algorithms parallelizing D2HUP
- Algorithms on new parallel programming frameworks

References
[1] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proceedings of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD, 2005, p253-262.
[2] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets. Data & Knowledge Engineering, 2008, 64(1): 198-217.
[3] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets. In Proceedings of PAKDD 2008, 2008, p554-561.
[4] J. W. Han, J. Pei, Y. W. Yin, et al. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, p1-12.
[5] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, et al. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, 2009, p1708-1721.
[6] V. S. Tseng, C.-W. Wu, B.-E. Shie, et al. UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD, 2010, p253-262.
[7] J. Liu, K. Wang, and B. Fung. Direct Discovery of High Utility Itemsets without Candidate Generation. In IEEE 12th International Conference on Data Mining, 2012, p101-109.
[8] M. Liu, J. Qu. Mining high utility itemsets without candidate generation. In Proceedings of CIKM 2012, 2012, p55-64.
[9] Matei Zaharia. An architecture for fast and general data processing on large clusters. Technical Report No. UCB/EECS-2014-12, University of California at Berkeley.
[10] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004, p137-150.

Thank You! Questions?

IEEE DSC 2019 - IEEE International Conference on Data Science in Cyberspace
BDMC 2019 - Big Data Mining for Cyberspace
23 June, 2019, 8:30 - 9:30
Workshop Chair: Zhaoquan Gu and Jing Qiu
http://www.ieee-dsc.org/2019/
