
Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark

Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang Zhejiang Gongshang University, Hangzhou 310018, China 23


  1. Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark. Junqiang Liu, Rong Zhao, Xiangcai Yang, Yong Zhang, Xiaoning Jiang. School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China. 23 June 2019

  2. Content
     - Motivation
     - Problem Statement & Preliminaries: High Utility Pattern Mining, Sequential Algorithms, Frameworks
     - Our Mining Approach
     - New Parallel Algorithm Based on Spark
     - Experimental Evaluation
     - Conclusion and Future Work
     - References

  3. Motivation
     - High Utility Pattern (HUP) mining vs. Frequent Pattern (FP) mining
       - HUP: utility = user's interest + statistical significance
       - FP: support = statistical significance only
     - HUP mining is much harder than FP mining
       - Anti-monotonicity holds for FP: support of a pattern ≤ support of its sub-pattern
       - Anti-monotonicity does not hold for HUP: the utility of a pattern may be lower or higher than the utility of its sub-pattern
     - Parallelization to deal with the hardness of mining big data

  4. Problem Statement & Preliminaries: High Utility Pattern Mining
     - HUP: What products purchased together have high profits?
     - The utility of a set of products = the profits of those products in the transactions containing them, depending on quantity and price/cost
     - FP: What products are frequently purchased together?

     Utility table

     | Item (I) | Utility (U) |
     | a        | 1           |
     | b        | 2           |
     | c        | 1           |
     | d        | 5           |
     | ...      | ...         |

     Shopping transactions (item:quantity)

     | Tid | Items                   |
     | t1  | b:1, c:2, d:1, g:1      |
     | t2  | a:4, b:1, c:3, d:1, e:1 |
     | t3  | a:4, c:2, d:1           |
     | t4  | c:2, e:1, f:1           |
     | ... | ...                     |
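The utility definition above, and the contrast with support from the Motivation slide, can be checked on the rows visible in the two tables. This is a minimal sketch, not the authors' code: the helper names `support` and `utility` are made up for illustration, only the four visible transactions are used, and because the unit utilities of e, f, and g are elided on the slide, the sketch only evaluates itemsets over a, b, c, and d.

```python
from typing import Dict, Iterable

# Unit utilities visible on the slide. Entries for e, f, g are elided there,
# so they are deliberately left out (assumption: we never query them).
UNIT_UTILITY: Dict[str, int] = {"a": 1, "b": 2, "c": 1, "d": 5}

# The four transactions visible on the slide, as item -> purchased quantity.
TRANSACTIONS: Dict[str, Dict[str, int]] = {
    "t1": {"b": 1, "c": 2, "d": 1, "g": 1},
    "t2": {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    "t3": {"a": 4, "c": 2, "d": 1},
    "t4": {"c": 2, "e": 1, "f": 1},
}


def support(itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item of the itemset."""
    items = list(itemset)
    return sum(1 for t in TRANSACTIONS.values() if all(i in t for i in items))


def utility(itemset: Iterable[str]) -> int:
    """Sum of quantity * unit utility over all transactions containing the itemset."""
    items = list(itemset)
    total = 0
    for t in TRANSACTIONS.values():
        if all(i in t for i in items):
            total += sum(t[i] * UNIT_UTILITY[i] for i in items)
    return total


# Support can only shrink when an item is added ...
print(support(["c"]), support(["c", "d"]))
# ... but utility can go up ({c} -> {c, d}) or down ({a} -> {a, b}).
print(utility(["c"]), utility(["c", "d"]))
print(utility(["a"]), utility(["a", "b"]))
```

On these rows, extending {c} to {c, d} raises the utility while extending {a} to {a, b} lowers it, which is why a support-style pruning rule cannot be applied to utility directly.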

  5. Problem Statement & Preliminaries: Well-known Sequential Mining Algorithms

     | Algorithm | References     | Search Strategy   | Candidates | Pruning Strategy |
     | TwoPhase  | [1] KDD        | Breadth (Apriori) | With       | TWU              |
     | CTU-PROL  | [3] PAKDD      | Breadth (Apriori) | With       | TWU              |
     | IHUP      | [5] TKDE       | Depth (FP-Growth) | With       | TWU              |
     | UPGrowth  | [6] KDD, TKDE  | Depth (FP-Growth) | With       | TWU              |
     | D2HUP     | [7] ICDM, TKDE | Depth (OP)        | Without    | Tight bound      |
     | HUI-Miner | [8] CIKM       | Depth (Eclat)     | Without    | Tight bounds     |

  6. Problem Statement & Preliminaries: Distributed Computing Frameworks, Spark / MapReduce [9,10]
     - Data are distributed over a cluster
       - One split on one node
       - Represented as <key, value> pairs: input, output, and interim results
     - Processing by a series of jobs
       - A job is dispatched to where a data split resides, and executed in parallel
       - A job is defined by a mapper and a reducer, and executed in two phases
     - Resilient Distributed Dataset (RDD): memory based
       - Transformations / Actions on RDD
     - Figure: a cluster of servers (nodes) with a master and slaves
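The mapper/reducer job model described above can be imitated on a single machine to show the data flow. The sketch below is plain Python, not Spark or Hadoop code: `run_job`, `mapper`, and `reducer` are hypothetical names, the "splits" are just two in-memory lists, and the job simply counts how many of the four transactions visible on slide 4 contain each item.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


def run_job(
    splits: List[List[Set[str]]],
    mapper: Callable[[Set[str]], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Simulate one job: a map phase over every split, then a reduce phase per key."""
    interim: Dict[str, List[int]] = defaultdict(list)
    # Map phase: on a real cluster each split would be processed on its own node.
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                interim[key].append(value)  # interim <key, value> pairs
    # Reduce phase: all values that share a key are combined into one result.
    return {key: reducer(key, values) for key, values in interim.items()}


def mapper(items: Set[str]) -> Iterable[Tuple[str, int]]:
    """Emit <item, 1> for every item of a transaction."""
    return [(item, 1) for item in items]


def reducer(key: str, values: List[int]) -> int:
    """Sum the counts collected for one item."""
    return sum(values)


# Two splits holding the item sets of the four visible transactions.
splits = [
    [{"b", "c", "d", "g"}, {"a", "b", "c", "d", "e"}],
    [{"a", "c", "d"}, {"c", "e", "f"}],
]
counts = run_job(splits, mapper, reducer)
print(sorted(counts.items()))
```

In PySpark the same shape is usually written with the RDD methods `flatMap` and `reduceByKey`, with the difference that the interim results can stay in memory between jobs.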

  7. Our Mining Approach: Breadth-First Search, Improved Utility Lists
     - Breadth-First Search
       - adapting HUI-Miner, which is derived from Eclat and is Depth-First
     - Improved vertical data structure: UtilityList
       - Ordering items, e, c, b, a, d, in ascending transaction utilities
       - <{e}, UL({e})>, <{b}, UL({b})>
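To make the vertical structure concrete, here is a minimal sketch of the classic HUI-Miner utility list, where each entry is (tid, iutil, rutil): the utility of the item in the transaction and the utility of the items that follow it in the processing order. It does not reproduce the improved UtilityList from the slides. The data are an assumption: the visible transactions from slide 4 with f and g dropped and a made-up unit utility of 3 for e, already multiplied out to per-item utilities. Only the item order e, c, b, a, d is taken from the slide.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, int, int]  # (tid, iutil, rutil)

# Hypothetical per-transaction item utilities (quantity * unit utility).
# The value 3 for item e is an assumption; the slide elides it.
ITEM_UTILS: Dict[str, Dict[str, int]] = {
    "t1": {"b": 2, "c": 2, "d": 5},
    "t2": {"a": 4, "b": 2, "c": 3, "d": 5, "e": 3},
    "t3": {"a": 4, "c": 2, "d": 5},
    "t4": {"c": 2, "e": 3},
}
ORDER: List[str] = ["e", "c", "b", "a", "d"]  # processing order from the slide


def build_utility_lists(
    transactions: Dict[str, Dict[str, int]], order: List[str]
) -> Dict[str, List[Entry]]:
    """Build one utility list per single item in a single pass over the data."""
    rank = {item: pos for pos, item in enumerate(order)}
    lists: Dict[str, List[Entry]] = {item: [] for item in order}
    for tid, utils in transactions.items():
        # Sort the transaction's items by the global processing order.
        ordered = sorted((i for i in utils if i in rank), key=rank.__getitem__)
        remaining = sum(utils[i] for i in ordered)
        for item in ordered:
            remaining -= utils[item]  # utility of the items after this one
            lists[item].append((tid, utils[item], remaining))
    return lists


utility_lists = build_utility_lists(ITEM_UTILS, ORDER)
print(utility_lists["e"])
print(utility_lists["b"])
```

The sum of iutil over a list is the exact utility of the pattern, and the sum of iutil + rutil is the bound HUI-Miner uses to decide whether any extension of the pattern can still be high utility.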

  8. Our Mining Approach (cont): Join Utility Lists
     - Mining high utility patterns by joining UtilityLists
       - two k-patterns: <{e,b}, UL({e,b})>, <{e,a}, UL({e,a})>
       - (k+1)-pattern: <{e,b,a}, UL({e,b,a})>
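The join step can be sketched with the standard HUI-Miner rule: for every transaction shared by the two k-patterns, the new iutil is the sum of both iutil values minus the iutil of their common prefix (which would otherwise be counted twice), and the new rutil is taken from the second pattern. The function name `join` and the literal lists are illustrative assumptions (they follow from a toy dataset in which e has a made-up utility of 3), and the sketch uses plain (tid, iutil, rutil) entries rather than the improved structure from the slides.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, int, int]  # (tid, iutil, rutil)


def join(
    ul_px: List[Entry],
    ul_py: List[Entry],
    ul_prefix: Optional[List[Entry]] = None,
) -> List[Entry]:
    """Join the utility lists of two patterns Px and Py that share the prefix P.

    Pass ul_prefix=None when joining two single items (empty prefix).
    """
    py_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in ul_py}
    prefix_iutil = {tid: iutil for tid, iutil, _ in (ul_prefix or [])}
    joined: List[Entry] = []
    for tid, iutil_x, _ in ul_px:
        if tid in py_by_tid:  # the transaction contains both patterns
            iutil_y, rutil_y = py_by_tid[tid]
            # Subtract the prefix utility, which both iutil values include.
            joined.append((tid, iutil_x + iutil_y - prefix_iutil.get(tid, 0), rutil_y))
    return joined


# Hypothetical utility lists for the patterns named on the slide.
ul_e: List[Entry] = [("t2", 3, 14), ("t4", 3, 2)]
ul_b: List[Entry] = [("t1", 2, 5), ("t2", 2, 9)]
ul_a: List[Entry] = [("t2", 4, 5), ("t3", 4, 5)]

ul_eb = join(ul_e, ul_b)          # two 1-patterns -> {e,b}
ul_ea = join(ul_e, ul_a)          # two 1-patterns -> {e,a}
ul_eba = join(ul_eb, ul_ea, ul_e)  # two 2-patterns with prefix {e} -> {e,b,a}
print(ul_eb, ul_ea, ul_eba)
```

Because each join only needs the two input lists and the prefix list, all joins of one level are independent of each other, which is what makes a level-wise (breadth-first) search easy to distribute.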

  9. New Parallel Algorithm Based on Spark: Three Phases
     - Phps: Parallel high utility pattern mining based on Spark
     - <key, value> records shown in the three-phase diagram:
       - Phase I: <i, (u(i,tid), u(t,tid))>, <i, twu(i)>
       - Phase II: <i, (tid, iutil, rutil)>, <i, List(tid, iutil, rutil, piutil)>, <i, (List(·, ·, ·, ·), iutilSum, rutilSum)>
       - Phase III: <P_k, UL(P_k)>, <P_k, List(·, ·, ·, ·)>, <P_{k-2}, (P_{k-1}, UL(P_{k-1}))>, <P_{k-1}, UL(P_{k-1})>
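The Phase I records suggest a map step that emits <i, (u(i,tid), u(t,tid))> for every item of every transaction and a reduce step that sums the transaction utilities into twu(i). The sketch below follows that reading in plain Python; it is an interpretation of the diagram, not the Phps implementation, and `phase1_twu`, `promising_items`, and the toy data (with a made-up utility of 3 for e) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical per-transaction item utilities u(i, tid).
ITEM_UTILS: Dict[str, Dict[str, int]] = {
    "t1": {"b": 2, "c": 2, "d": 5},
    "t2": {"a": 4, "b": 2, "c": 3, "d": 5, "e": 3},
    "t3": {"a": 4, "c": 2, "d": 5},
    "t4": {"c": 2, "e": 3},
}


def phase1_twu(transactions: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Compute the transaction-weighted utility twu(i) of every item."""
    # Map: emit <i, (u(i,tid), u(t,tid))> for each item of each transaction.
    emitted: List[Tuple[str, Tuple[int, int]]] = []
    for utils in transactions.values():
        transaction_utility = sum(utils.values())  # u(t, tid)
        for item, item_utility in utils.items():
            emitted.append((item, (item_utility, transaction_utility)))
    # Reduce: sum the transaction utilities per item to get <i, twu(i)>.
    twu: Dict[str, int] = defaultdict(int)
    for item, (_, transaction_utility) in emitted:
        twu[item] += transaction_utility
    return dict(twu)


def promising_items(twu: Dict[str, int], min_util: int) -> List[str]:
    """Keep items whose TWU reaches min_util, in ascending TWU order."""
    kept = [item for item, value in twu.items() if value >= min_util]
    return sorted(kept, key=lambda item: (twu[item], item))


twu = phase1_twu(ITEM_UTILS)
print(twu)
print(promising_items(twu, min_util=27))
```

Since twu(i) is an upper bound on the utility of any pattern containing i, items below the threshold can be dropped before the utility lists are built. The resulting order differs from the e, c, b, a, d order on slide 7 only because the toy utilities here are invented.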

  10. Experimental Evaluation
      - 2 algorithms
        - Phps: our algorithm
        - PhpMR: the competitor
      - 4 datasets

      | Dataset      | #Items | #Trans.   | Avg. trans. length |
      | Chess        | 76     | 3,196     | 37                 |
      | WebView-1    | 497    | 59,602    | 2.5                |
      | T10DI6N1KD1M | 1,000  | 933,493   | 10                 |
      | Chainstore   | 46,086 | 1,112,949 | 7.2                |

  11. Experimental Evaluation: Running time with changing minUtil
      - Figure: running time (s) of Phps and PhpMR versus minutil (%) on (a) Chess (minutil 30-50%), (b) WebView (2.5-4%), (c) T10DI6N1KD1M (2-5%), and (d) Chainstore (0.2-1%).

  12. Experimental Evaluation: Running time with each iteration
      - Figure: running time (s) of Phps and PhpMR per pass of iteration on (a) Chess with minutil = 37% (passes 1-8) and (b) Chainstore with minutil = 0.5% (passes 1-4).

  13. Conclusion and Future Work
      - Conclusion
        - Phps: a parallel Eclat-like algorithm based on Spark
        - An improved vertical data structure
        - A three-phase parallel mining framework
        - An efficient algorithm
      - Future Work
        - Hybrid search: BF + DF
        - More pruning in Phase I (filtering irrelevant items)
        - Algorithms parallelizing D2HUP
        - Algorithms on new parallel programming frameworks

  14. References
      [1] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proceedings of the Utility-Based Data Mining Workshop in conjunction with the 11th ACM SIGKDD, 2005, p253-262.
      [2] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets. Data & Knowledge Engineering, 2008, 64(1): 198-217.
      [3] A. Erwin, R. P. Gopalan, and N. R. Achuthan. Efficient mining of high utility itemsets from large datasets. In Proceedings of PAKDD 2008, 2008, p554-561.
      [4] J. W. Han, J. Pei, Y. W. Yin, et al. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, p1-12.
      [5] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, et al. Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering, 2009, p1708-1721.
      [6] V. S. Tseng, C.-W. Wu, B.-E. Shie, et al. UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD, 2010, p253-262.
      [7] J. Liu, K. Wang, and B. Fung. Direct discovery of high utility itemsets without candidate generation. In IEEE 12th International Conference on Data Mining, 2012, p101-109.
      [8] M. Liu, J. Qu. Mining high utility itemsets without candidate generation. In Proceedings of CIKM 2012, 2012, p55-64.
      [9] Matei Zaharia. An architecture for fast and general data processing on large clusters. Technical Report No. UCB/EECS-2014-12, University of California at Berkeley.
      [10] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004, p137-150.

  15. Thank you! Questions?

  16. IEEE DSC 2019: IEEE International Conference on Data Science in Cyberspace. BDMC 2019: Big Data Mining for Cyberspace. 23 June 2019, 8:30-9:30. Workshop chairs: Zhaoquan Gu and Jing Qiu. http://www.ieee-dsc.org/2019/
