
Principles of Knowledge Discovery in Data, Chapter 5: Data Summarization



  1. Principles of Knowledge Discovery in Data
     Fall 2004
     Chapter 5: Data Summarization
     Dr. Osmar R. Zaïane, University of Alberta
     Source: Dr. Jiawei Han
     © Dr. Osmar R. Zaïane, 1999-2004

     Summary of Last Chapter
     • What is the motivation for ad-hoc mining process?
     • What defines a data mining task?
     • Can we define an ad-hoc mining language?

     Course Content
     • Introduction to Data Mining
     • Data warehousing and OLAP
     • Data cleaning
     • Data mining operations
     • Data summarization
     • Association analysis
     • Classification and prediction
     • Clustering
     • Web Mining
     • Spatial and Multimedia Data Mining
     • Other topics if time permits

     Chapter 4 Objectives
     Understand Characterization and Discrimination of data.
     See some examples of data summarization.

  2. Data Summarization Outline
     • What are summarization and generalization?
     • What are the methods for descriptive data mining?
     • What is the difference with OLAP?
     • Can we discriminate between data classes?

     Descriptive vs. Predictive Data Mining
     • Descriptive mining: describe concepts or task-relevant data sets in concise, informative, discriminative forms.
     • Predictive mining: Based on data and analysis, construct models for the database, and predict the trend and properties of unknown data.
     Concept description:
     • Characterization: provides a concise and succinct summarization of the given collection of data.
     • Comparison: provides descriptions comparing two or more collections of data.

     Need for Hierarchies in Descriptive Mining
     • Defined by database schema:
       – Some attributes naturally form a hierarchy:
         • Address (street, city, province, country, continent)
       – Some hierarchies are formed with different attribute combinations:
         • food (category, brand, content_spec, package_size, price).
     • Defined by set-grouping operations (by users/experts).
       • {chemistry, math, physics} ⊂ science.
     • Generated automatically by data distribution analysis.
     • Adjusted automatically based on the existing hierarchy.

     Creating Hierarchies
     • Schema hierarchy
       – Ex: house_number < street < city < province < country
         • define hierarchy as [house_number, street, city, province, country]
     • Instance-based (Set-Grouping Hierarchy):
       – Ex: {freshman, ..., senior} ⊂ undergraduate.
         define hierarchy statusHier as
           level2: {freshman, sophomore, junior, senior} < level1: undergraduate;
           level2: {M.Sc, Ph.D} < level1: graduate;
           level1: {undergraduate, graduate} < level0: allStatus
     • Rule-based:
       – undergraduate(x) ∧ gpa(x) > 3.5 → good(x).
     • Operation-based:
       – aggregation, approximation, clustering, etc.
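The set-grouping hierarchy statusHier and the rule-based example can be sketched in Python as follows. This is only an illustration under assumptions: the dictionary encoding and the names `STATUS_PARENT`, `generalize`, and `is_good` are hypothetical and do not come from the slides, which use a declarative "define hierarchy" statement instead.

```python
# Hypothetical encoding of statusHier: each value maps to its parent,
# one level up (level2 -> level1 -> level0).
STATUS_PARENT = {
    "freshman": "undergraduate",
    "sophomore": "undergraduate",
    "junior": "undergraduate",
    "senior": "undergraduate",
    "M.Sc": "graduate",
    "Ph.D": "graduate",
    "undergraduate": "allStatus",
    "graduate": "allStatus",
}


def generalize(value, parent, steps=1):
    """Climb up to `steps` levels in the hierarchy, stopping at the root."""
    for _ in range(steps):
        if value not in parent:  # already at the top level
            break
        value = parent[value]
    return value


def is_good(status, gpa):
    """Rule-based level: undergraduate(x) and gpa(x) > 3.5 implies good(x).

    Assumes `status` is a level2 value such as "junior".
    """
    return generalize(status, STATUS_PARENT) == "undergraduate" and gpa > 3.5
```

For instance, `generalize("junior", STATUS_PARENT)` yields `"undergraduate"`, and asking for more steps than the hierarchy has levels simply returns the root `"allStatus"`.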

  3. Methods for Automatic Generation of Hierarchies
     • Categorical hierarchies: (Cardinality heuristics)
       – Observation: the higher the hierarchy level, the smaller the cardinality.
         • card(country) < card(state) < card(city).
       – There are exceptions, e.g., {day, month, quarter, year}.
       – Automatic generation of categorical hierarchies based on cardinality heuristic:
         • location: {country, street, city, region, big-region, province}.
     • Numerical hierarchies:
       – Many algorithms are applicable for generation of hierarchies based on data distribution.
       – Range-based vs. distribution-based (different binning methods)

     Automatic Generation of Numeric Hierarchies
     [Figure: histogram of Count (0 to 40) against Amount (10000 to 90000)]
     Hierarchy of ranges generated from the distribution:
       2000-97000
         2000-25000
           2000-12000
           12000-25000
         25000-97000
           25000-38000
           38000-97000

     Dynamic Adjustment of Concept Hierarchies
     • Why adjust hierarchies dynamically?
       – Different applications may view data differently.
       – Example: Geography in the eyes of politicians, researchers, and merchants.
     • How to adjust the hierarchy?
       – Maximally preserve the given hierarchy shape.
       – Node merge and split based on certain weighted measure (such as count, sum, etc.)
         • E.g., small nodes (such as small provinces) should be merged and big nodes should be split.

     Automatic Hierarchy Adjustment
     Original concept hierarchy (node counts in parentheses):
       CANADA
         Western
           B.C. (68)
           Prairies
             Alberta (40)
             Manitoba (8)
             Saskatchewan (15)
         Central
           Ontario (212)
           Quebec (97)
         Maritime
           Nova Scotia (15)
           New Brunswick (9)
           Newfoundland (9)
     Adjusted concept hierarchy:
       CANADA
         Western
           B.C. (68)
           Alberta (40)
           Man+Sas (23)
             Manitoba (8)
             Saskatchewan (15)
         Central
           Ontario (212)
           Quebec (97)
         (Maritime)
           Maritime (33)
             Nova Scotia (15)
             New Brunswick (9)
             Newfoundland (9)
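The cardinality heuristic for categorical hierarchies can be sketched as below: count the distinct values of each attribute and place the attribute with the fewest distinct values at the top. The function name `order_by_cardinality` and the toy rows are invented for illustration and are not part of the slides.

```python
def order_by_cardinality(rows, attributes):
    """Order attributes from most general (fewest distinct values)
    to most specific (most distinct values)."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a])


# Invented toy data: 1 country, 2 provinces, 4 cities, 5 streets.
rows = [
    {"country": "Canada", "province": "Alberta", "city": "Edmonton", "street": "Jasper Ave"},
    {"country": "Canada", "province": "Alberta", "city": "Edmonton", "street": "Whyte Ave"},
    {"country": "Canada", "province": "Alberta", "city": "Calgary", "street": "Centre St"},
    {"country": "Canada", "province": "Ontario", "city": "Toronto", "street": "Yonge St"},
    {"country": "Canada", "province": "Ontario", "city": "Ottawa", "street": "Bank St"},
]

hierarchy = order_by_cardinality(rows, ["street", "city", "country", "province"])
print(hierarchy)  # ['country', 'province', 'city', 'street']
```

As the slide notes, the heuristic has exceptions: for {day, month, quarter, year} the counts of distinct values do not follow the intended order, so the result should be treated as a suggestion for a user or expert to confirm.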

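For numerical hierarchies, the contrast between range-based and distribution-based binning can be illustrated with equal-width and equal-frequency bins. This is a sketch of two common binning methods, not necessarily the algorithm behind the slide's figure; the function names and the sample amounts are assumptions made for the example.

```python
def equal_width_bins(values, k):
    """Range-based: split [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]


def equal_frequency_bins(values, k):
    """Distribution-based: each bin holds roughly the same number of values.

    Assumes len(values) >= k so that no bin is empty.
    Returns the (smallest, largest) value found in each bin.
    """
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for i in range(k):
        chunk = ordered[i * n // k:(i + 1) * n // k]
        bins.append((chunk[0], chunk[-1]))
    return bins


# Invented sample amounts, skewed toward small values.
amounts = [2000, 5000, 9000, 12000, 20000, 25000, 38000, 97000]

print(equal_width_bins(amounts, 2))
# [(2000.0, 49500.0), (49500.0, 97000.0)]
print(equal_frequency_bins(amounts, 4))
# [(2000, 5000), (9000, 12000), (20000, 25000), (38000, 97000)]
```

With skewed data the equal-width split puts seven of the eight values in the first interval, whereas the equal-frequency split follows the distribution. Applying either method recursively to each bin gives a multi-level hierarchy of ranges.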
  4. Data Summarization Outline
     • What are summarization and generalization?
     • What are the methods for descriptive data mining?
     • What is the difference with OLAP?
     • Can we discriminate between data classes?

     Methods of Descriptive Data Mining
     • Data cube-based approach:
       – Dimensions: Attributes form concept hierarchies
       – Measures: sum, count, avg, max, standard-deviation, etc.
       – Drilling: generalization and specialization.
       – Limitations: dimension/measure types, intelligent analysis.
     • Attribute-oriented induction:
       – Proposed in 1989 (KDD'89 workshop).
       – Not confined to categorical data nor particular measures.
       – Can be presented in both table and rule forms.

     Basic Principles of Attribute-Oriented Induction
     • Data focusing: task-relevant data, including dimensions, and the result is the initial relation.
     • Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher level concepts are expressed in terms of other attributes.
     • Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
     • Attribute-threshold control: typical 2-8, specified/default.
     • Generalized relation threshold control: control the final relation/rule size.

     Basic Algorithm for Attribute-Oriented Induction
     • InitialRel: Query processing of task-relevant data, deriving the initial relation.
     • PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?
     • PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation".
     • Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
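The attribute-removal and attribute-generalization principles can be combined into a small Python sketch. It is a simplified illustration under stated assumptions, not the algorithm as published: hierarchies are plain value-to-parent dictionaries, they are assumed to be acyclic, a single attribute threshold applies to every attribute, the initial relation is assumed to be non-empty, and the function name and sample rows are invented.

```python
from collections import Counter


def attribute_oriented_induction(rows, hierarchies, threshold):
    """Generalize or remove attributes until each keeps at most
    `threshold` distinct values, then merge identical tuples.

    rows: list of dicts (the initial relation).
    hierarchies: attribute -> {value: parent}; an attribute without an
                 entry has no generalization operator.
    Returns a Counter mapping each generalized tuple to its count.
    """
    rows = [dict(r) for r in rows]  # work on a copy
    for attr in list(rows[0]):
        parent = hierarchies.get(attr)
        while len({r[attr] for r in rows}) > threshold:
            if parent is None or not any(r[attr] in parent for r in rows):
                # Attribute removal: too many values, nothing left to climb.
                for r in rows:
                    del r[attr]
                break
            # Attribute generalization: climb one level where possible.
            for r in rows:
                r[attr] = parent.get(r[attr], r[attr])
    # Identical generalized tuples collapse into one row with a count.
    return Counter(tuple(sorted(r.items())) for r in rows)


# Invented initial relation and hierarchies.
students = [
    {"name": "Ann", "status": "freshman", "major": "math"},
    {"name": "Bob", "status": "junior", "major": "physics"},
    {"name": "Cid", "status": "M.Sc", "major": "chemistry"},
    {"name": "Dee", "status": "Ph.D", "major": "math"},
]
hierarchies = {
    "status": {"freshman": "undergraduate", "junior": "undergraduate",
               "M.Sc": "graduate", "Ph.D": "graduate"},
    "major": {"math": "science", "physics": "science", "chemistry": "science"},
}

result = attribute_oriented_induction(students, hierarchies, threshold=2)
for generalized_tuple, count in result.items():
    print(dict(generalized_tuple), count)
```

In this run `name` is removed because it has four distinct values and no hierarchy, `status` climbs one level to {undergraduate, graduate}, and `major` climbs to science, leaving two generalized tuples with a count of 2 each.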
