smartAdvisor Jan Neerbek Alexandra Institute Agenda d60: A - PowerPoint PPT Presentation

Case study: d60 Raptor smartAdvisor Jan Neerbek Alexandra Institute

Agenda · d60: A cloud/data mining case · Cloud · Data Mining · Market Basket Analysis · Large data sets · Our solution

Alexandra Institute · The Alexandra Institute is a non-profit company that works with application-oriented IT research. The focus is pervasive computing, and we activate the business potential of our members and customers through research-based, user-driven innovation.

The case: d60 · Danish company · A similar-products recommendation engine · d60 was outgrowing their servers (late 2010) · They saw potential in moving to Azure

The setup [Diagram labels: Product Recommendations, Internet, Webshops] · Log shopping patterns · Do data mining

The cloud potential · Elasticity · No upfront server cost · Cheaper licenses · Faster calculations

Challenges · No SQL Server Analysis Services (SSAS) · Small compute nodes · Partitioned database (50 GB) · SQL Server ingress/egress access is slow

The cloud [Diagram: seven compute nodes]

The cloud and services [Diagram: compute nodes together with a data layer service and a messaging service]

Data layer service · Application-specific (schema/layout) · SQL, table or other · Easily becomes a bottleneck · Can be difficult to scale

Messaging service · Task queues · Standard data structure · Built-in ordering (FIFO) · Can be scaled · Good for asynchronous messages
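As a rough illustration of the task-queue idea, the sketch below uses Python's in-process `queue.Queue` as a stand-in for a cloud messaging service. The function name `run_workers` and the squaring handler are hypothetical and not from the talk; a real deployment would use a hosted queue rather than threads in one process.

```python
import queue
import threading

def run_workers(tasks, handler, num_workers=3):
    """Put tasks on a FIFO queue and let worker threads drain it asynchronously."""
    q = queue.Queue()              # FIFO ordering is built in
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = q.get()
            if task is None:       # sentinel: no more work for this worker
                q.task_done()
                return
            out = handler(task)
            with lock:
                results.append(out)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        q.put(task)
    for _ in threads:
        q.put(None)                # one sentinel per worker
    q.join()                       # wait until every message has been handled
    for t in threads:
        t.join()
    return results

squares = run_workers(range(10), lambda x: x * x)
```

Scaling out in this model means raising `num_workers`; the producers do not need to know how many consumers exist.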

Data mining · Data mining is the use of automated data analysis techniques to uncover relationships among data items. · Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals. [about.com/wikipedia.org]

Market basket analysis · Customer1: Avocado, Milk, Butter, Potatoes · Customer2: Milk, Diapers, Avocado, Beer · Customer3: Beef, Lemons, Beer, Chips · Customer4: Cereal, Beer, Beef, Diapers

Market basket analysis · Customer1: Avocado, Milk, Butter, Potatoes · Customer2: Milk, Diapers, Avocado, Beer · Customer3: Beef, Lemons, Beer, Chips · Customer4: Cereal, Beer, Beef, Diapers · Itemset (Diapers, Beer) occurs in 50% of the baskets · Frequency threshold parameter · Find as many frequent itemsets as possible
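To make the frequency threshold concrete, here is a minimal brute-force sketch. It assumes the flattened basket table reads as four customers with four items each (a reading that reproduces the 50% figure for Diapers and Beer). The function `frequent_itemsets` and its `max_size` cap are illustrative choices, not the talk's method.

```python
from collections import Counter
from itertools import combinations

# Assumed reading of the slide's basket table.
BASKETS = {
    "Customer1": {"Avocado", "Milk", "Butter", "Potatoes"},
    "Customer2": {"Milk", "Diapers", "Avocado", "Beer"},
    "Customer3": {"Beef", "Lemons", "Beer", "Chips"},
    "Customer4": {"Cereal", "Beer", "Beef", "Diapers"},
}

def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets up to max_size meeting the threshold."""
    n = len(baskets)
    counts = Counter()
    for items in baskets.values():
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

frequent = frequent_itemsets(BASKETS, min_support=0.5)
```

Enumerating every combination like this grows exponentially with basket size, which is why algorithms such as FP-growth exist.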

Market basket analysis · Popular, effective algorithm: FP-growth · Based on the FP-tree data structure · Requires all data in near-memory · Most research in distributed models has been for cluster setups

Building the FP-tree (extends the prefix-tree structure) · Customer1: Avocado, Milk, Butter, Potatoes · Tree: Avocado → Butter → Milk → Potatoes

Building the FP-tree · Customer2: Avocado, Milk, Diapers, Beer · Tree so far: Avocado → Butter → Milk → Potatoes

Building the FP-tree · Customer2: Avocado, Milk, Diapers, Beer · Tree: Avocado → Butter → Milk → Potatoes; Avocado → Beer → Diapers → Milk

Building the FP-tree · Tree: Avocado → Butter → Milk → Potatoes; Avocado → Beer → Diapers → Milk; Beef → Beer → Chips → Lemon; Beef → Beer → Cereal → Diapers
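The tree-building slides can be sketched as a counted prefix tree. This is a minimal sketch: the alphabetical item order is an assumption inferred from the paths on the slides (FP-trees are more commonly ordered by descending item frequency), and the names `Node`, `build_tree` and `paths` are illustrative.

```python
class Node:
    """One FP-tree node: an item, how many transactions pass through it, and its children."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_tree(transactions):
    """Insert each transaction as a path from the root, sharing common prefixes."""
    root = Node()
    for items in transactions:
        node = root
        for item in sorted(items):          # assumption: alphabetical item order
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of (item, count) pairs."""
    if not node.children and prefix:
        yield prefix
    for child in node.children.values():
        yield from paths(child, prefix + ((child.item, child.count),))

TRANSACTIONS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]
tree = build_tree(TRANSACTIONS)
all_paths = list(paths(tree))
```

Shared prefixes are stored once with a count, which is where the compression comes from: here the two baskets starting with Avocado share one node, as do the two starting with Beef, Beer.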

FP-growth · Grows the frequent itemsets, recursively

    FP-growth(FP-tree tree) {
        …
        for-each (item in tree) {
            count = CountOccur(tree, item);
            if (IsFrequent(count)) {
                OutputSet(item);
                sub = tree.GetTree(tree, item);
                FP-growth(sub);
            }
        }
    }
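A runnable sketch of the same recursion shape follows. To stay short it represents each (sub-)tree as a plain list of transactions rather than a compressed FP-tree, so it is pattern growth over projected data, not a full FP-growth implementation. The name `fp_growth` and the alphabetical ordering used to avoid duplicate itemsets are assumptions.

```python
from collections import Counter

def fp_growth(transactions, min_count, suffix=(), out=None):
    """Recursively grow frequent itemsets; returns {frozenset(itemset): count}."""
    if out is None:
        out = {}
    # CountOccur: how many transactions in this projection contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    for item in sorted(counts):
        count = counts[item]
        if count < min_count:               # IsFrequent
            continue
        itemset = (item,) + suffix
        out[frozenset(itemset)] = count     # OutputSet
        # GetTree: keep transactions containing `item`, restricted to items
        # ordered before it, so each itemset is generated exactly once.
        sub = [[i for i in t if i < item] for t in transactions if item in t]
        fp_growth(sub, min_count, itemset, out)
    return out

TRANSACTIONS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]
result = fp_growth(TRANSACTIONS, min_count=2)
```

With a threshold of two baskets (50%), the result contains five single items and three pairs, including Diapers and Beer.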

FP-growth algorithm · Divide and Conquer · Traverse tree [Diagram: the full FP-tree]

FP-growth algorithm · Divide and Conquer · Generate sub-trees [Diagram: the full FP-tree]

FP-growth algorithm · Divide and Conquer · Call recursively [Diagram: the full FP-tree next to a sub-tree with nodes Avocado, Butter, Beer, Diapers]

FP-growth algorithm · Memory usage · The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory

Distributed Shared Memory? [Diagram: five CPUs, each with its own memory, connected over a network to form shared memory] · To add nodes is to add memory · Works best in tightly coupled setups with low-latency, high-speed networks

FP-growth algorithm · Memory usage · The FP-tree does not fit in local memory; what to do? · Emulate Distributed Shared Memory · Optimize your data structures · Buy more RAM · Get a good idea

Get a good idea · Database scans are serial and can be distributed · The list of items used in the recursive calls uniquely determines what part of the data we are looking at
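One way to read the second point: any node that holds transactions can recompute the relevant part of the data from the item list alone, so only the item list needs to travel between nodes. The helper `project` below is a hypothetical name for that selection step, a sketch rather than the talk's implementation.

```python
def project(transactions, item_list):
    """Select the transactions that contain every item in item_list."""
    needed = set(item_list)
    return [t for t in transactions if needed <= set(t)]

TRANSACTIONS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]
selected = project(TRANSACTIONS, ["Diapers", "Beer"])
```

Because the selection is a single pass over the data, it can be run independently on each partition of the transactions.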

Get a good idea [Diagram: the full FP-tree next to a sub-tree with nodes Avocado, Butter, Beer, Diapers]

Get a good idea [Diagram: sub-trees labeled with the item lists (Butter, Milk), (Milk) and (Diapers, Milk)] · These are postfix paths

Buckets · Use postfix paths for messaging · Working with buckets [Diagram axes: Transactions, Items]
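A minimal sketch of working with buckets, assuming a bucket is simply a slice of the transactions and that counts for a given postfix are computed per bucket and then summed. The round-robin split and the names `make_buckets`, `count_in_bucket` and `merge` are illustrative assumptions; the talk does not specify how buckets are formed.

```python
from collections import Counter

def make_buckets(transactions, num_buckets):
    """Split transactions round-robin into num_buckets lists."""
    buckets = [[] for _ in range(num_buckets)]
    for i, t in enumerate(transactions):
        buckets[i % num_buckets].append(t)
    return buckets

def count_in_bucket(bucket, postfix):
    """Count the other items in this bucket's transactions that contain the postfix."""
    needed = set(postfix)
    counts = Counter()
    for t in bucket:
        items = set(t)
        if needed <= items:
            counts.update(items - needed)
    return counts

def merge(partials):
    """Sum per-bucket counts into one global count."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

TRANSACTIONS = [
    ["Avocado", "Milk", "Butter", "Potatoes"],
    ["Milk", "Diapers", "Avocado", "Beer"],
    ["Beef", "Lemons", "Beer", "Chips"],
    ["Cereal", "Beer", "Beef", "Diapers"],
]
buckets = make_buckets(TRANSACTIONS, 2)
total = merge(count_in_bucket(b, ("Beer",)) for b in buckets)
```

The message that triggers the counting is just the postfix, here `("Beer",)`, so messages stay small regardless of how large the buckets are.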

FP-growth revisited

    FP-growth(FP-tree tree) {                      // replaced with postfix
        …
        for-each (item in tree) {                  // done in parallel
            count = CountOccur(tree, item);        // done in parallel
            if (IsFrequent(count)) {
                OutputSet(item);
                sub = tree.GetTree(tree, item);    // done in parallel
                FP-growth(sub);
            }
        }
    }

Communication [Diagram: four nodes around the data layer]

Revised Communication [Diagram: four nodes around a message queue (MQ) and the data layer]

Running FP-growth · Distribute buckets · Count items (with postfix size=n) · Collect counts (per postfix) · Call recursive · Standard FP-growth
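The steps above can be simulated in a single process. In this sketch a `deque` stands in for the message queue, each message is a postfix, and the "recursive call" becomes posting a new message. The final hand-off to standard FP-growth for small sub-problems is left out, and the alphabetical ordering and the name `message_driven_fp_growth` are assumptions.

```python
from collections import Counter, deque

def message_driven_fp_growth(buckets, min_count):
    """Return {postfix tuple: count} for all frequent itemsets."""
    frequent = {}
    messages = deque([()])                  # start with the empty postfix
    while messages:
        postfix = messages.popleft()        # FIFO, as in a task queue
        needed = set(postfix)
        total = Counter()
        for bucket in buckets:              # each bucket could be counted on a different node
            for t in bucket:
                items = set(t)
                if needed <= items:
                    # Only extend with items ordered before the postfix head,
                    # so every itemset is produced exactly once.
                    total.update(i for i in items if not postfix or i < postfix[0])
        for item, count in total.items():   # collect counts for this postfix
            if count >= min_count:
                new_postfix = (item,) + postfix
                frequent[new_postfix] = count
                messages.append(new_postfix)   # the recursive call is a new message
    return frequent

BUCKETS = [
    [["Avocado", "Milk", "Butter", "Potatoes"], ["Beef", "Lemons", "Beer", "Chips"]],
    [["Milk", "Diapers", "Avocado", "Beer"], ["Cereal", "Beer", "Beef", "Diapers"]],
]
found = message_driven_fp_growth(BUCKETS, min_count=2)
```

No tree is ever shared between workers here: the only state passed around is the postfix and the per-bucket counts.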

Collecting what we have learned · Message-driven work, using a message queue · Peer-to-peer for intermediate results · Distribute data for scalability (buckets) · Small messages (list of items) · Allows us to distribute FP-growth

Advantages · Configurable work sizes · Good distribution of work · Robust against computer failure · Fast!

So what about performance? [Chart: total node time (hh:mm:ss, axis from 00:00:00 to 04:30:00) for message-driven FP-growth and FP-growth, at x-axis values 1, 2, 4 and 8]

Thank you!
