 
Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark
Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang
Zhejiang Gongshang University, Hangzhou 310018, China
School of Information and Electronic Engineering, Zhejiang Gongshang University
23 June 2019

Content
- Motivation
- Problem Statement & Preliminaries
  - High Utility Pattern Mining, Sequential Algorithms, Frameworks
- Our Mining Approach
- New Parallel Algorithm Based on Spark
- Experimental Evaluation
- Conclusion and Future Work
- References

Motivation
- High Utility Pattern (HUP) mining vs. Frequent Pattern (FP) mining
  - HUP: utility = user's interest + statistical significance
  - FP: support = statistical significance only
- HUP mining is much harder than FP mining
  - Anti-monotonicity holds for FP: support of a pattern ≤ support of its sub-pattern
  - Anti-monotonicity does not hold for HUP: the utility of a pattern may be higher or lower than the utility of its sub-pattern
- Parallelization to deal with the hardness of mining big data

Problem Statement & Preliminaries: High Utility Pattern Mining
- HUP: What products purchased together have high profits?
  - The utility of a set of products = the profits of those products in the transactions containing them, depending on quantity and price/cost
- FP: What products are frequently purchased together?

Utility table
Item (I)   Utility (U)
a          1
b          2
c          1
d          5
...        ...

Shopping transactions
Tid   Items
t1    b:1, c:2, d:1, g:1
t2    a:4, b:1, c:3, d:1, e:1
t3    a:4, c:2, d:1
t4    c:2, e:1, f:1
...   ...

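The definition of utility can be illustrated with a minimal Python sketch. It uses only the four transactions and the four unit utilities visible in the tables on this slide (the elided rows are not guessed), so the totals hold for this excerpt only. The names `UNIT_UTILITY`, `TRANSACTIONS`, and `utility` are hypothetical and chosen for illustration.

```python
# Unit utilities visible on the slide (items e, f, g are elided there, so
# this sketch only evaluates itemsets built from a, b, c, d).
UNIT_UTILITY = {"a": 1, "b": 2, "c": 1, "d": 5}

# Transactions visible on the slide: tid -> {item: purchased quantity}.
TRANSACTIONS = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}


def utility(itemset, transactions=TRANSACTIONS, unit=UNIT_UTILITY):
    """Sum quantity * unit utility of the itemset's items over every
    transaction that contains the whole itemset."""
    total = 0
    for quantities in transactions.values():
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * unit[item] for item in itemset)
    return total


u_a = utility({"a"})              # t2: 4, t3: 4
u_ad = utility({"a", "d"})        # t2: 4 + 5, t3: 4 + 5
u_abd = utility({"a", "b", "d"})  # t2 only: 4 + 2 + 5
```

On this excerpt, `u_a` is 8, `u_ad` is 18, and `u_abd` is 11: extending a pattern can raise or lower its utility, which is why the support-style anti-monotone pruning of FP mining does not carry over directly.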
Problem Statement & Preliminaries: Well-known Sequential Mining Algorithms

Algorithm   References        Search strategy      Candidates   Pruning strategy
TwoPhase    [1] KDD           Breadth (Apriori)    With         TWU
CTU-PROL    [3] PAKDD         Breadth (Apriori)    With         TWU
IHUP        [5] TKDE          Depth (FP-Growth)    With         TWU
UPGrowth    [6] KDD, TKDE     Depth (FP-Growth)    With         TWU
D2HUP       [7] ICDM, TKDE    Depth (OP)           Without      Tight bounds
HUI-Miner   [8] CIKM          Depth (Eclat)        Without      Tight bounds

Problem Statement & Preliminaries: Spark / MapReduce Framework
Distributed Computing Frameworks [9,10]
- Data are distributed over a cluster of servers (nodes): a master and slaves
  - One split on one node
  - Represented as <key, value> pairs: input, output, and interim results
- Processing by a series of jobs
  - A job is dispatched to where a data split resides, and executed in parallel
  - A job is defined by a mapper and a reducer, and executed in two phases
- Resilient Distributed Dataset (RDD): memory based
  - Transformations / Actions on RDD

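The mapper/reducer model over <key, value> pairs can be sketched in plain Python. This is a single-process simulation for illustration, not Spark or Hadoop code; `run_job`, `item_mapper`, and `count_reducer` are hypothetical names, and the two splits reuse the item sets of the four transactions shown earlier in the deck.

```python
from collections import defaultdict


def run_job(splits, mapper, reducer):
    """Simulate one job: map every record of every split, group the
    emitted pairs by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for split in splits:            # on a cluster, each split is mapped on its own node
        for record in split:
            for key, value in mapper(record):
                groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def item_mapper(record):
    """Emit <item, 1> for every item of a <tid, items> record."""
    _tid, items = record
    return [(item, 1) for item in items]


def count_reducer(_key, values):
    """Sum the counts collected for one item."""
    return sum(values)


# Two splits, as if the data were stored on two nodes.
splits = [
    [("t1", ["b", "c", "d", "g"]), ("t2", ["a", "b", "c", "d", "e"])],
    [("t3", ["a", "c", "d"]), ("t4", ["c", "e", "f"])],
]

support = run_job(splits, item_mapper, count_reducer)
```

Here `support` maps each item to the number of transactions containing it (for example, `c` appears in all four). In Spark the same shape is typically expressed as a transformation that emits pairs followed by a per-key aggregation on an RDD.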
Our Mining Approach: Breadth-First Search, Improved Utility Lists
- Breadth-first search, adapting HUI-Miner (derived from Eclat), which is depth-first
- Improved vertical data structure: UtilityList
- Items are ordered e, c, b, a, d, in ascending order of transaction utility
- Each item is paired with its utility list, e.g. ⟨{e}, UL({e})⟩ and ⟨{b}, UL({b})⟩

Our Mining Approach (cont): Join Utility Lists
- Mining high utility patterns by joining UtilityLists
- Two k-patterns, ⟨{e,b}, UL({e,b})⟩ and ⟨{e,a}, UL({e,a})⟩, are joined into the (k+1)-pattern ⟨{e,b,a}, UL({e,b,a})⟩

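The join of two utility lists can be sketched as follows. This is a hedged illustration, not the authors' code: entries carry the four fields (tid, iutil, rutil, piutil) named elsewhere in these slides, but the slides do not define piutil, so the sketch assumes it is the utility of the pattern's prefix (the pattern without its last item) in that transaction. All numbers are invented; `Entry` and `join` are hypothetical names.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    tid: str
    iutil: int   # utility of the pattern in this transaction
    rutil: int   # utility of the items ordered after the pattern's last item
    piutil: int  # ASSUMPTION: utility of the pattern's prefix in this transaction


def join(ul_px: List[Entry], ul_py: List[Entry]) -> List[Entry]:
    """Join the lists of two patterns Px and Py that share the prefix P
    (x ordered before y) into the list of the pattern Pxy."""
    by_tid = {entry.tid: entry for entry in ul_py}
    joined = []
    for ex in ul_px:
        ey = by_tid.get(ex.tid)
        if ey is None:
            continue  # the transaction does not contain both patterns
        joined.append(Entry(
            tid=ex.tid,
            iutil=ex.iutil + ey.iutil - ex.piutil,  # the prefix is counted only once
            rutil=ey.rutil,                         # items after y, the later item
            piutil=ex.iutil,                        # the prefix of Pxy is Px
        ))
    return joined


# Invented utility lists for {e,b} and {e,a} under the order e, c, b, a, d.
ul_eb = [Entry("T1", 5, 9, 3), Entry("T2", 7, 5, 3), Entry("T3", 8, 1, 6)]
ul_ea = [Entry("T1", 7, 5, 3), Entry("T3", 7, 0, 6)]

ul_eba = join(ul_eb, ul_ea)
pattern_utility = sum(entry.iutil for entry in ul_eba)
```

With these invented numbers, `ul_eba` has entries for T1 and T3 only, and `pattern_utility` is 18. Under the stated assumption, carrying the prefix utility in each entry means the join needs no separate lookup of the prefix's own utility list.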
New Parallel Algorithm Based on Spark: Three phases
Phps: Parallel high utility pattern mining based on Spark
- Phase I: ⟨i, (u(i,tid), u(t,tid))⟩ → ⟨i, twu(i)⟩
- Phase II: ⟨i, (tid, iutil, rutil)⟩ → ⟨i, List(tid, iutil, rutil, piutil)⟩ → ⟨i, (List(·,·,·,·), iutilSum, rutilSum)⟩
- Phase III: ⟨P_{k-1}, UL(P_{k-1})⟩ → ⟨P_{k-2}, (P_{k-1}, UL(P_{k-1}))⟩ → ⟨P_k, List(·,·,·,·)⟩ → ⟨P_k, UL(P_k)⟩

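Phases I and II can be sketched sequentially in plain Python. This is an illustration under assumptions rather than the Phps implementation: it is not Spark code, the toy database is invented, items whose TWU falls below the threshold are dropped before the lists are built (as in HUI-Miner), ties in TWU are broken alphabetically, and the piutil field is omitted. `phase1_twu` and `phase2_utility_lists` are hypothetical names.

```python
from collections import defaultdict

# Invented toy database: tid -> {item: utility of that item in the transaction}.
DB = {
    "T1": {"a": 4, "b": 2, "d": 5},
    "T2": {"b": 4, "c": 1},
    "T3": {"a": 1, "c": 2, "d": 10},
}


def phase1_twu(db):
    """Phase I: emit <item, transaction utility> per item, then sum per item."""
    twu = defaultdict(int)
    for items in db.values():
        transaction_utility = sum(items.values())
        for item in items:
            twu[item] += transaction_utility
    return dict(twu)


def phase2_utility_lists(db, twu, min_util):
    """Phase II: keep items with TWU >= min_util, order them by ascending
    TWU, and build one list of (tid, iutil, rutil) per kept item."""
    order = sorted((i for i in twu if twu[i] >= min_util),
                   key=lambda i: (twu[i], i))
    rank = {item: position for position, item in enumerate(order)}
    lists = {item: [] for item in order}
    for tid in sorted(db):
        kept = sorted((i for i in db[tid] if i in rank), key=rank.get)
        remaining = sum(db[tid][i] for i in kept)
        for item in kept:
            remaining -= db[tid][item]  # utility of the items ordered after this one
            lists[item].append((tid, db[tid][item], remaining))
    return order, lists


twu = phase1_twu(DB)
order, lists = phase2_utility_lists(DB, twu, min_util=17)
```

On this toy input, item `b` has a TWU of 16 and is dropped, the remaining order is c, a, d, and `lists["c"]` is `[("T2", 1, 0), ("T3", 2, 11)]`. Phase III would then repeatedly join the lists of patterns that share a prefix, level by level.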
Experimental Evaluation
- 2 algorithms
  - Phps: our algorithm
  - PhpMR: the competitor
- 4 datasets

Dataset        #Items   #Trans.     Avg. trans. length
Chess              76       3,196   37
WebView-1         497      59,602   2.5
T10DI6N1KD1M    1,000     933,493   10
Chainstore     46,086   1,112,949   7.2

Experimental Evaluation: Running time with changing minUtil
[Figure: running time (s) of Phps and PhpMR versus minutil (%). Panels: (a) Chess, minutil 30-50%; (b) WebView, minutil 2.5-4%; (c) T10DI6N1KD1M, minutil 2-5%; (d) Chainstore, minutil 0.2-1%.]

Experimental Evaluation: Running time with each iteration
[Figure: running time (s) of Phps and PhpMR per pass of iteration. Panels: (a) Chess, minutil = 37%, passes 1-8; (b) Chainstore, minutil = 0.5%, passes 1-4.]

Conclusion and Future Work
Conclusion
- Phps: a parallel Eclat-like algorithm based on Spark
  - An improved vertical data structure
  - A three-phase parallel mining framework
  - An efficient algorithm
Future Work
- Hybrid search: BF + DF
- More pruning in Phase I (filtering irrelevant items)
- Algorithms parallelizing D2HUP
- Algorithms on new parallel programming frameworks

References
[1] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proceedings of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD, 2005, p253-262.
[2] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets. Data & Knowledge Engineering, 2008, 64(1): 198-217.
[3] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets. In Proceedings of PAKDD 2008, 2008, p554-561.
[4] J. W. Han, J. Pei, Y. W. Yin, et al. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, p1-12.
[5] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, et al. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, 2009, p1708-1721.
[6] V. S. Tseng, C.-W. Wu, B.-E. Shie, et al. UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD, 2010, p253-262.
[7] J. Liu, K. Wang, and B. Fung. Direct Discovery of High Utility Itemsets without Candidate Generation. In IEEE 12th International Conference on Data Mining, 2012, p101-109.
[8] M. Liu, J. Qu. Mining high utility itemsets without candidate generation. In Proceedings of CIKM 2012, 2012, p55-64.
[9] Matei Zaharia. An architecture for fast and general data processing on large clusters. Technical Report No. UCB/EECS-2014-12, University of California at Berkeley.
[10] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004, p137-150.

Thank You! Questions?
IEEE DSC 2019 - IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE IN CYBERSPACE BDMC 2019 - BIG DATA MINING FOR CYBERSPACE 23 June, 2019 8:30 - 9:30 Workshop Chair: Zhaoquan Gu and Jing Qiu http://www.ieee-dsc.org/2019/