
Carnegie Mellon Univ., Dept. of Computer Science: Data Warehousing / Data Mining - PowerPoint PPT Presentation



  1. Faloutsos & Pavlo, CMU SCS 15-415/615

     Slide 1: Carnegie Mellon Univ., Dept. of Computer Science
       15-415/615 – DB Applications
       Data Warehousing / Data Mining (R&G, ch 25 and 26)
       C. Faloutsos and A. Pavlo

     Slide 2: Data mining - detailed outline
       • Problem
       • Getting the data: Data Warehouses, DataCubes, OLAP
       • Supervised learning: decision trees
       • Unsupervised learning
         – association rules

     Slide 3: Problem
       Given: multiple data sources
       Find: patterns (classifiers, rules, clusters, outliers...)
       [Figure: data sources PGH, NY and SF; tables sales(p-id, c-id, date, $price) and customers(c-id, age, income, ...)]

     Slide 4: Data Warehousing
       First step: collect the data in a single place (= Data Warehouse)
       • How?
       • How often?
       • How about discrepancies / non-homogeneities?

  2. Slide 5: Data Warehousing
       First step: collect the data in a single place (= Data Warehouse)
       • How? A: Triggers/Materialized views
       • How often? A: [Art!]
       • How about discrepancies / non-homogeneities? A: Wrappers/Mediators

     Slide 6: Data Warehousing
       Step 2: collect counts. (DataCubes/OLAP) Eg.:

     Slide 7: OLAP
       Problem: "is it true that shirts in large sizes sell better in dark colors?"

       sales:
       c-id   p-id    Size   Color   $
       C10    Shirt   L      Blue    30
       C10    Pants   XL     Red     50
       C20    Shirt   XL     White   20
       ...

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47

     Slide 8: DataCubes
       'color', 'size': DIMENSIONS
       'count': MEASURE

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47

       [Figure: lattice of group-bys: φ; size; color; color, size]

  3. Slides 9 to 12: DataCubes (the same slide content appears four times)
       'color', 'size': DIMENSIONS
       'count': MEASURE

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47

       [Figure: lattice of group-bys: φ; size; color; color, size]

  4. Slide 13: DataCubes
       'color', 'size': DIMENSIONS
       'count': MEASURE

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47

       [Figure: lattice of group-bys: φ; size; color; color, size. The whole structure is labeled "DataCube".]

     Slide 14: DataCubes
       SQL query to generate DataCube:
       • Naively (and painfully):

           select size, color, count(*)
           from sales where p-id = 'shirt'
           group by size, color

           select size, count(*)
           from sales where p-id = 'shirt'
           group by size

           ...

     Slide 15: DataCubes
       SQL query to generate DataCube:
       • with 'cube by' keyword:

           select size, color, count(*)
           from sales where p-id = 'shirt'
           cube by size, color

     Slide 16: DataCubes
       DataCube issues:
       Q1: How to store them (and/or materialize portions on demand)
       Q2: Which operations to allow
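     The difference between the naive queries and 'cube by' can be sketched in a few lines. The example below is written in Python rather than SQL and emulates 'cube by size, color' by running one group-by per subset of the dimensions, which is what the naive approach spells out by hand. The sales rows are made up to match the counts in the DataCube table; the function name `cube_by` and the use of 'all' as the marker for an aggregated-away dimension are illustrative assumptions, not part of the slides' SQL.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shirt sales, one (color, size) tuple per sale, expanded from
# the counts in the DataCube table (Red: 20/3/5, Blue: 3/3/8, Gray: 0/0/5).
COUNTS = {
    ("Red", "S"): 20, ("Red", "M"): 3, ("Red", "L"): 5,
    ("Blue", "S"): 3, ("Blue", "M"): 3, ("Blue", "L"): 8,
    ("Gray", "L"): 5,
}
SALES = [key for key, n in COUNTS.items() for _ in range(n)]


def cube_by(rows, n_dims):
    """Count rows for every subset of the dimensions (2**n_dims group-bys)."""
    cube = Counter()
    for k in range(n_dims + 1):
        for kept in combinations(range(n_dims), k):
            for row in rows:
                # Dimensions not in `kept` are aggregated away, shown as 'all'.
                key = tuple(row[i] if i in kept else "all" for i in range(n_dims))
                cube[key] += 1
    return cube


cube = cube_by(SALES, n_dims=2)
print(cube[("all", "all")])  # grand total
print(cube[("Red", "all")])  # all red shirts, any size
print(cube[("all", "L")])    # all large shirts, any color
```

     With two dimensions this already means four group-bys; with d dimensions it is 2^d, which is why a single 'cube by' clause is so much more convenient than writing the queries out.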

  5. Slide 17: DataCubes
       DataCube issues:
       Q1: How to store them (and/or materialize portions on demand)
           A: ROLAP/MOLAP
       Q2: Which operations to allow
           A: roll-up, drill down, slice, dice
       [More details: book by Han+Kamber]

     Slide 18: DataCubes
       Q1: How to store a dataCube?

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47

     Slide 19: DataCubes
       Q1: How to store a dataCube?
       A1: Relational (R-OLAP)

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47

       Color   Size    count
       'all'   'all'   47
       Blue    'all'   14
       Blue    M       3
       …

     Slide 20: DataCubes
       Q1: How to store a dataCube?
       A2: Multi-dimensional (M-OLAP)
       A3: Hybrid (H-OLAP)

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47
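     As a rough illustration of the two storage answers, the sketch below holds the same DataCube once as a dense array (M-OLAP style) and once as (Color, Size, count) rows with 'all' markers (R-OLAP style). The variable and function names are hypothetical, and dropping zero-count cells from the relational form is an assumption made here to show why sparse cubes can favour rows over arrays.

```python
COLORS = ["Red", "Blue", "Gray", "all"]
SIZES = ["S", "M", "L", "all"]

# M-OLAP style: a dense array; the last row and column hold the totals.
molap = [
    [20, 3, 5, 28],
    [3, 3, 8, 14],
    [0, 0, 5, 5],
    [23, 6, 18, 47],
]


def molap_lookup(color, size):
    """Direct positional access into the array."""
    return molap[COLORS.index(color)][SIZES.index(size)]


# R-OLAP style: one (color, size, count) row per non-empty cell.
rolap = [
    (color, size, molap[i][j])
    for i, color in enumerate(COLORS)
    for j, size in enumerate(SIZES)
    if molap[i][j] != 0  # assumption: empty cells are simply not stored
]


def rolap_lookup(color, size):
    """Scan the rows, as a relational engine without an index would."""
    for c, s, count in rolap:
        if (c, s) == (color, size):
            return count
    return 0  # no row means no sales


print(molap_lookup("Blue", "M"), rolap_lookup("Blue", "M"))
print(len(rolap), "rows stored instead of", len(COLORS) * len(SIZES), "array cells")
```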

  6. Slide 21: DataCubes
       Pros/Cons:
       ROLAP strong points: (DSS, Metacube)

     Slide 22: DataCubes
       Pros/Cons:
       ROLAP strong points: (DSS, Metacube)
       • use existing RDBMS technology
       • scale up better with dimensionality

     Slide 23: DataCubes
       Pros/Cons:
       MOLAP strong points: (EssBase/hyperion.com)
       • faster indexing (careful with: high-dimensionality; sparseness)
       HOLAP: (MS SQL server OLAP services)
       • detail data in ROLAP; summaries in MOLAP

     Slide 24: DataCubes
       Q1: How to store a dataCube
       Q2: What operations should we support?

  7. Slide 25: DataCubes
       Q2: What operations should we support?

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47

       [Figure: lattice of group-bys: φ; size; color; color, size]

     Slide 26: DataCubes
       Q2: What operations should we support?
       Roll-up
       [Same table and lattice as slide 25]

     Slide 27: DataCubes
       Q2: What operations should we support?
       Drill-down
       [Same table and lattice as slide 25]

     Slide 28: DataCubes
       Q2: What operations should we support?
       Slice
       [Same table and lattice as slide 25]

  8. Slide 29: DataCubes
       Q2: What operations should we support?
       Dice

       C / S   S    M    L    TOT
       Red     20   3    5    28
       Blue    3    3    8    14
       Gray    0    0    5    5
       TOT     23   6    18   47

       [Figure: lattice of group-bys: φ; size; color; color, size]

     Slide 30: DataCubes
       Q2: What operations should we support?
       • Roll-up
       • Drill-down
       • Slice
       • Dice
       • (Pivot/rotate; drill-across; drill-through
       • top N
       • moving averages, etc)

     Slide 31: D/W - OLAP - Conclusions
       • D/W: copy (summarized) data + analyze
       • OLAP - concepts:
         – DataCube
         – R/M/H-OLAP servers
         – 'dimensions'; 'measures'

     Slide 32: Outline
       • Problem
       • Getting the data: Data Warehouses, DataCubes, OLAP
       • Supervised learning: decision trees
       • Unsupervised learning
         – association rules
         – (clustering)
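     The slides show the four main operations only pictorially. The sketch below applies them to the color/size counts, using the usual textbook meanings of the terms: roll-up aggregates a dimension away, drill-down goes back to the finer cells, slice fixes one dimension to a single value, and dice keeps a sub-cube. The function names and the dictionary layout are hypothetical choices for illustration.

```python
# Base cells of the DataCube: counts per (color, size).
CELLS = {
    ("Red", "S"): 20, ("Red", "M"): 3, ("Red", "L"): 5,
    ("Blue", "S"): 3, ("Blue", "M"): 3, ("Blue", "L"): 8,
    ("Gray", "S"): 0, ("Gray", "M"): 0, ("Gray", "L"): 5,
}


def roll_up(cells, dim):
    """Keep only dimension `dim` (0 = color, 1 = size); sum over the other."""
    out = {}
    for key, count in cells.items():
        out[key[dim]] = out.get(key[dim], 0) + count
    return out


def slice_(cells, dim, value):
    """Fix one dimension to a single value."""
    return {k: v for k, v in cells.items() if k[dim] == value}


def dice(cells, colors, sizes):
    """Keep a sub-cube: a set of allowed values on each dimension."""
    return {k: v for k, v in cells.items() if k[0] in colors and k[1] in sizes}


by_color = roll_up(CELLS, dim=0)                 # roll-up: (color, size) -> color
large_only = slice_(CELLS, dim=1, value="L")     # slice: size = L
sub_cube = dice(CELLS, {"Red", "Blue"}, {"M", "L"})
# Drill-down is the reverse of roll-up: from by_color back to the finer CELLS.
print(by_color, large_only, sub_cube, sep="\n")
```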

  9. Slide 33: Decision trees - Problem

       Age   Chol-level   Gender   …   CLASS-ID
       30    150          M        …   +
       …
                                       ??

     Slide 34: Decision trees
       • Pictorially, we have
       [Figure: scatter plot of '+' and '-' points; axes: num. attr#1 (eg., 'age') and num. attr#2 (eg., chol-level)]

     Slide 35: Decision trees
       • and we want to label '?'
       [Figure: the same scatter plot with an unlabeled point '?']

     Slide 36: Decision trees
       • so we build a decision tree:
       [Figure: the same scatter plot with the '?' point, with the values 50 (on num. attr#1, 'age') and 40 (on num. attr#2, chol-level) marked as split points]

  10. Slide 37: Decision trees
       • so we build a decision tree:
       [Figure: decision tree with tests age<50 and chol.<40, Y/N branches, and leaves '+', '-', ...]

     Slide 38: Outline
       • Problem
       • Getting the data: Data Warehouses, DataCubes, OLAP
       • Supervised learning: decision trees
         – problem
         – approach
         – scalability enhancements
       • Unsupervised learning
         – association rules
         – (clustering)

     Slide 39: Decision trees
       • Typically, two steps:
         – tree building
         – tree pruning (for over-training/over-fitting)

     Slide 40: Tree building
       • How?
       [Figure: scatter plot of '+' and '-' points; axes: num. attr#1 (eg., 'age') and num. attr#2 (eg., chol-level)]

  11. Slide 41: Tree building
       • How?
       • A: Partition, recursively - pseudocode:

           Partition(Dataset S)
             if all points in S have same label then return
             evaluate splits along each attribute A
             pick best split, to divide S into S1 and S2
             Partition(S1); Partition(S2)

     Slide 42: Tree building
       • Q1: how to introduce splits along attribute A_i
       • Q2: how to evaluate a split?

     Slide 43: Tree building
       • Q1: how to introduce splits along attribute A_i
       • A1:
         – for num. attributes:
           • binary split, or
           • multiple split
         – for categorical attributes:
           • compute all subsets (expensive!), or
           • use a greedy algo

     Slide 44: Tree building
       • Q1: how to introduce splits along attribute A_i
       • Q2: how to evaluate a split?
       [Figure: scatter plot of '+' and '-' points]
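     The Partition pseudocode can be turned into a small runnable sketch. The version below handles numeric attributes with binary splits of the form "attribute < threshold". This part of the slides leaves Q2 (how to evaluate a split) open, so the sketch assumes weighted Gini impurity as the score; that choice, the function names, and the (age, chol-level) data points are all illustrative assumptions and not taken from the slides.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed split score)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(points):
    """Evaluate binary splits 'attr < threshold' along each attribute."""
    best = None
    for a in range(len(points[0][0])):
        for threshold in sorted({x[a] for x, _ in points}):
            left = [p for p in points if p[0][a] < threshold]
            right = [p for p in points if p[0][a] >= threshold]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini([y for _, y in left])
                     + len(right) * gini([y for _, y in right])) / len(points)
            if best is None or score < best[0]:
                best = (score, a, threshold, left, right)
    return best


def partition(points):
    """Recursive Partition(S): returns a label (leaf) or a split node."""
    labels = [y for _, y in points]
    if len(set(labels)) == 1:
        return labels[0]  # all points in S have the same label
    split = best_split(points)
    if split is None:  # identical points with mixed labels: majority vote
        return Counter(labels).most_common(1)[0][0]
    _, a, threshold, s1, s2 = split
    return (a, threshold, partition(s1), partition(s2))


def classify(tree, x):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        a, threshold, left, right = tree
        tree = left if x[a] < threshold else right
    return tree


# Made-up training points: ((age, chol-level), class label).
DATA = [((30, 150), "+"), ((35, 60), "+"), ((45, 30), "+"), ((55, 30), "-"),
        ((60, 35), "-"), ((65, 90), "+"), ((70, 120), "+")]
tree = partition(DATA)
print(tree)
print(classify(tree, (58, 32)))
```

     The sketch covers only the tree-building step; pruning against over-fitting, the second step named in the slides, is not included.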
