

  1. Unsupervised Machine Learning and Data Mining
 DS 5230 / DS 4420 - Fall 2018
 Midterm Review, Jan-Willem van de Meent

  2. Review: Frequent Itemsets

  3. Frequent Itemsets
 • Items = {milk, coke, pepsi, beer, juice}
 • Baskets:
   B1 = {m, c, b}      B2 = {m, p, j}      B3 = {m, b}        B4 = {c, j}
   B5 = {m, c, b}      B6 = {m, c, b, j}   B7 = {c, b, j}     B8 = {b, c}
 • Frequent itemsets (σ(X) ≥ 3): {m}:5, {c}:6, {b}:6, {j}:4,
   {m,c}:3, {m,b}:4, {c,b}:5, {c,j}:3, {m,c,b}:3
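
These counts are easy to verify by brute force; a minimal Python sketch (the threshold of 3 matches the slide, the variable names are my own):

    from itertools import combinations
    from collections import Counter

    baskets = [
        {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
        {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
    ]
    min_support = 3

    # Count every itemset of size 1..3 that occurs in some basket.
    counts = Counter()
    for basket in baskets:
        for k in range(1, 4):
            for itemset in combinations(sorted(basket), k):
                counts[itemset] += 1

    frequent = {s: n for s, n in counts.items() if n >= min_support}
    print(frequent)  # recovers {m}:5, {c}:6, {b}:6, {j}:4, ..., {m,c,b}:3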

  4. Example: Confidence and Interest
 B1 = {m, c, b}      B2 = {m, p, j}      B3 = {m, b}        B4 = {c, j}
 B5 = {m, c, b}      B6 = {m, c, b, j}   B7 = {c, b, j}     B8 = {b, c}
 • Association rule: {m} → b
 • Confidence: c({m} → b) = σ({m, b}) / σ({m}) = 4/5
 • Interest factor: Lift = c(A → B) / s(B) = (4/5) / (6/8) = 16/15 ≈ 1.07
 • Item b appears in 6/8 of the baskets, so the lift is close to 1:
   the rule is not very interesting!
 adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
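
The same numbers in code (a sketch; here support is the fraction of baskets containing an itemset):

    baskets = [
        {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
        {"m", "c", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
    ]

    def support(itemset):
        # Fraction of baskets that contain every item in `itemset`.
        return sum(itemset <= b for b in baskets) / len(baskets)

    # Rule {m} -> b
    confidence = support({"m", "b"}) / support({"m"})  # (4/8) / (5/8) = 4/5
    lift = confidence / support({"b"})                 # (4/5) / (6/8) = 16/15
    print(confidence, lift)                            # 0.8  1.0666...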

  5. Apriori – Overview
 [Diagram: All items → C1 → (count the items, filter) → L1 → construct all
 pairs of items from L1 → C2 → (count the pairs, filter) → L2 → construct all
 pairs of sets that differ by 1 element → C3 → ...]
 1. Set k = 0
 2. Define C1 as all size-1 item sets
 3. While Ck+1 is not empty:
 4.   Set k = k + 1
 5.   Scan DB to determine the subset Lk ⊆ Ck with support ≥ s (I/O limited)
 6.   Construct candidates Ck+1 by combining sets in Lk that differ by 1 element (memory limited)
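
A minimal sketch of this loop in Python, assuming the baskets fit in memory (the slides describe scanning a DB; this version also omits the subset-pruning refinement):

    from itertools import combinations

    def apriori(baskets, min_support):
        # C1: all size-1 itemsets that occur in the data.
        candidates = {frozenset([item]) for basket in baskets for item in basket}
        frequent = {}
        while candidates:
            # I/O-limited step: scan the data, keep Lk = candidates with support >= s.
            level = {c: sum(c <= b for b in baskets) for c in candidates}
            level = {c: n for c, n in level.items() if n >= min_support}
            frequent.update(level)
            # Memory-limited step: combine sets in Lk that differ by one element.
            candidates = {a | b for a, b in combinations(level, 2)
                          if len(a | b) == len(a) + 1}
        return frequent

    baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
               {"m","c","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
    print(apriori(baskets, min_support=3))  # same frequent itemsets as slide 3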

  6. FP-Growth – Intuition
 • Apriori requires one pass for each k
 • Can we find all frequent itemsets in fewer passes over the data?
 FP-Growth Algorithm:
 • Pass 1: Count items with support ≥ s
 • Sort frequent items in descending order according to count
 • Pass 2: Store all frequent itemsets in a frequent pattern tree (FP-tree)
 • Mine patterns from the FP-tree
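
A rough sketch of the two passes that build the tree (mining is omitted; the Node layout is my own, and real implementations also keep header-table links between nodes of the same item):

    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fp_tree(baskets, min_support):
        # Pass 1: count items, keep the frequent ones, rank by descending count.
        counts = Counter(item for basket in baskets for item in basket)
        order = {item: rank for rank, (item, n) in enumerate(counts.most_common())
                 if n >= min_support}
        # Pass 2: insert each basket's frequent items, in that fixed order.
        root = Node(None, None)
        for basket in baskets:
            node = root
            for item in sorted((i for i in basket if i in order), key=order.get):
                node = node.children.setdefault(item, Node(item, node))
                node.count += 1
        return root

    baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
               {"m","c","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
    root = build_fp_tree(baskets, min_support=3)
    print({item: child.count for item, child in root.children.items()})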

  7. FP-Growth vs Apriori
 Advantages of FP-Growth
 • Only 2 passes over the dataset
 • Stores a "compact" version of the dataset
 • No candidate generation
 • Faster than Apriori
 Disadvantages of FP-Growth
 • The FP-tree may not be "compact" enough to fit in memory
 Used in practice: PFP (a distributed version of FP-Growth)

  8. Review: ML Basics

  9. What is Similarity? Can be hard to define, but we know it when we see it.

  10. Similarity Metrics in Machine Learning
 Regression: Similar points x and x' should have similar function values f(x) and f(x')
 Dimensionality Reduction: Reduce the dimension of points x and x' in a manner that preserves similarities
 Clustering: Similar points x and x' should have the same cluster assignments z and z'

  11. Distance Norms
 Euclidean Distance: d(x, y) = sqrt( Σ_{i=1}^{k} (x_i − y_i)^2 )
 Manhattan Distance: d(x, y) = Σ_{i=1}^{k} |x_i − y_i|
 Minkowski Distance: d(x, y) = ( Σ_{i=1}^{k} |x_i − y_i|^q )^{1/q}
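
All three are the same formula with different exponents; a minimal sketch:

    def minkowski(x, y, q):
        # q = 1 gives Manhattan distance, q = 2 gives Euclidean distance.
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

    x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    print(minkowski(x, y, q=1))  # Manhattan: 7.0
    print(minkowski(x, y, q=2))  # Euclidean: 5.0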

  12. Sensitivity to Scaling

  13. Normalization Strategies
 Min-Max: X' = (X − X_min) / (X_max − X_min), or for a target range [a, b]:
   X' = a + (X − X_min)(b − a) / (X_max − X_min)
 Z-score: for X ∼ N(µ, σ), X' = (X − µ) / σ
 Scaling: X' = X / X_max
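
In code (a sketch with NumPy; the example vector and the [a, b] target range are my own):

    import numpy as np

    X = np.array([2.0, 4.0, 6.0, 10.0])

    min_max = (X - X.min()) / (X.max() - X.min())                   # onto [0, 1]
    a, b = -1.0, 1.0
    min_max_ab = a + (X - X.min()) * (b - a) / (X.max() - X.min())  # onto [a, b]
    z_score = (X - X.mean()) / X.std()                              # zero mean, unit std
    scaled = X / X.max()                                            # largest value becomes 1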

  14. Curse of Dimensionality
 [Plot: distance (edge length) needed to capture a given fraction of the volume
 of a unit cube, for dimensions p = 1, 2, 3, 10; x-axis: Fraction of Volume,
 y-axis: Distance, with annotations at 3%, 9%, 30%]
 Implication: Estimating similarities is difficult for high-dimensional data
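
Assuming the plot is the usual unit-hypercube picture, the curve is one line of math: capturing a fraction f of the volume in dimension p requires a subcube of edge length f^(1/p), which approaches 1 as p grows:

    # Edge length needed to capture a fraction `f` of a unit hypercube's volume.
    for p in (1, 2, 3, 10):
        for f in (0.01, 0.10, 0.30):
            print(f"p={p:2d}, fraction={f:.2f} -> edge length {f ** (1 / p):.2f}")
    # In 10 dimensions, even 1% of the volume spans ~63% of the range in every
    # coordinate, so "local" neighborhoods stop being local.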

  15. Review: Probability

  16. Bayes' Rule
 Posterior = Likelihood × Prior / Evidence: p(θ | x) = p(x | θ) p(θ) / p(x)
 Sum Rule: p(x) = Σ_y p(x, y)
 Product Rule: p(x, y) = p(y | x) p(x)
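
A tiny numeric check of all three rules (the prior and likelihood values are my own illustration):

    # Binary hypothesis h, observation x = "test is positive".
    p_h = 0.01                        # prior p(h = 1)
    p_x_given_h = {1: 0.95, 0: 0.05}  # likelihood p(x | h)

    # Product rule p(x, h) = p(x | h) p(h); sum rule p(x) = sum over h of p(x, h).
    p_x = sum(p_x_given_h[h] * (p_h if h else 1.0 - p_h) for h in (0, 1))

    # Bayes' rule: posterior = likelihood * prior / evidence.
    posterior = p_x_given_h[1] * p_h / p_x
    print(posterior)  # ~ 0.161: a positive test still leaves h = 1 unlikely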

  17. Expected Values
 X is a random variable with density p(x):
 E[f(X)] = ∫ f(x) p(x) dx
 Statistics: E[f(X)] (distribution implied by X)
 Machine Learning: E_{p(x)}[f(x)] (explicitly define distribution)
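
In practice expectations are often approximated by sample averages; a minimal Monte Carlo sketch (the choices p(x) = N(0, 1) and f(x) = x^2 are mine):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples x_n ~ p(x)

    # E_{p(x)}[f(x)] ~ (1/N) * sum_n f(x_n); here E[x^2] = Var[X] = 1.
    print(np.mean(x ** 2))  # close to 1.0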
