chapter 4 data w arehouse and olap operations
play

Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho - PDF document

11/16/2017 CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapter 4 , Data W


  1. 11/16/2017 CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse & OLAP Operations  Basic Concept of Data W arehouse  Data W arehouse Modeling  Data W arehouse Architecture  Data W arehouse I m plem entation  From Data W arehousing to Data Mining 1

  2. 11/16/2017 What is Data Warehouse?  Data W arehouse ( defined in many different ways )  A decision support database that is maintained separately from the organization’s operational database  The support of information processing by providing a solid platform of consolidated, historical data for analysis  “A data warehouse is a (1) subject-oriented , (2) integrated , (3) time-variant , and (4) nonvolatile collection of data in support of management’s decision-making process.” — W. H. I nmon  Data W arehousing  The process of constructing and using data warehouses Data Warehouse – Subject-Oriented  Organized around major subjects  e.g., customers, products, sales  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing  Provide a simple and concise view around particular subject issues  Excluding data that are not useful in the decision support process 2

  3. 11/16/2017 Data Warehouse – Integrated  I ntegrating multiple, heterogeneous data sources  Relational databases, flat files, on-line transaction records  Apply data cleaning and data integration techniques  Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources Data Warehouse – Time Variant  The time horizon of data warehouses is significantly longer than that of operational systems  Operational databases have current data values  Data warehouses provide information from a historical perspective (e.g., 10-20 years)  Time is a key structure in data warehouses  Contain the attribute of time (explicitly or implicitly) 3

  4. 11/16/2017 Data Warehouse – Nonvolatile  A physically separate storage of data transformed from operational databases  Operational update of data does not occur  Not require transaction processing, recovery, and concurrency control mechanisms  Require only two operations, initial loading of data and access of data Data Integration Methods  Methods (1) Process to provide uniform interface to multiple data sources → Tradition Database I ntegration (2) Process to combine multiple data sources into coherent storage → Data warehousing  Traditional DB I ntegration  A query-driven approach  Wrappers / mediators on top of heterogeneous data sources  Data W arehousing  An update-driven approach  Combined the heterogeneous data sources in advance  Stored them in a warehouse for direct query and analysis 4

  5. 11/16/2017 OLTP vs. OLAP  OLTP ( on-line transaction processing)  Major task of traditional relational DBMS  Day-to-day operations: e.g., purchasing, inventory, manufacturing, banking, payroll, registration, accounting, etc.  OLAP ( on-line analytical processing)  Major task of data warehouse system  Data analysis and decision making  Distinct Features ( OLTP vs. OLAP)  User and system orientation (customers vs. market analysts)  Data contents (current, detailed vs. historical, consolidated)  Database design (ER + application vs. star + subject)  View (current, local vs. evolutionary, integrated) OLTP vs. OLAP Feature OLTP OLAP Characteristic operational processing information processing Orientation transaction analysis Users clerk, DBA knowledge worker (CEO, analyst) Function day-to-day operations decision support DB Design application-oriented subject-oriented Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query Access read/write/update read-only (lots of scans) 5

  6. 11/16/2017 Why Data Warehouse?  Perform ance I ssue  DBMS: tuned for OLTP e.g., access methods, indexing, concurrency control, recovery  Data warehouse: tuned for OLAP e.g., complex queries, multidimensional view, consolidation  Data I ssue  Decision support requires historical data, consolidated and summarized data, consistent data CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse & OLAP Operations  Basic Concept of Data W arehouse  Data W arehouse Modeling  Data W arehouse Architecture  Data W arehouse I m plem entation  From Data W arehousing to Data Mining 6

  7. 11/16/2017 Data Format for Warehouse  Dim ensions  Multi-dimensional data model  Data are stored in the form of a data cube  Data Cube  A view of multi-dimensions  Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables  Cuboid  Each combination of dimensional spaces in a data cube  0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid The Lattice of Cuboids all 0-D (apex) cuboid time item location supplier 1-D cuboids time,location item,location location,supplier time,item 2-D cuboids time,supplier item,supplier time,item,location time,location,supplier 3-D cuboids time,item,supplier item,location,supplier 4-D (base) cuboid time, item, location, supplier 7

  8. 11/16/2017 Conceptual Modeling  Key of Modeling Data W arehouses  Handling dimensions & measures  Exam ples  Star schema: A fact table in the middle connected to a set of dimension tables  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation Example of Star Schema time item time_key day item_key day_of_the_week Sales Fact Table item_name month brand time_key quarter type year supplier_type item_key branch_key branch location_key location branch_key units_sold location_key branch_name branch_type street dollars_sold city state avg_sales country Measures 8

  9. 11/16/2017 Example of Snowflake Schema time item time_key supplier day item_key day_of_the_week Sales Fact Table item_name supplier_key month brand supplier_type time_key quarter type year supplier_key item_key branch_key branch location_key location branch_key units_sold branch_name location_key branch_type street city dollars_sold city_key city_key avg_sales city state Measures country Example of Fact Constellation Shipping Fact Table time item time_key time_key day item_key day_of_the_week item_key Sales Fact Table item_name month brand shipper_key time_key quarter type year supplier_type from_location item_key to_location branch_key branch dollars_cost location_key location branch_key units_shipped units_sold location_key branch_name branch_type street dollars_sold shipper city state avg_sales shipper_key country shipper_name Measures location_key shipper_type 9

  10. 11/16/2017 Cube Definition in DMQL  Cube Definition (Fact Table)  define cube < cube_name> [ < dimension_list> ] : < measure_list>  Dim ension Definition (Dimension Table)  define dimension < dimension_name> as (< attribute_or_dimension_list> )  Special Case (Shared Dimension Table)  define dimension < dimension_name> as < dimension_name_first> in cube < cube_name_first> Star Schema Definition in DMQL  Exam ple  define cube sales [ time, item, branch, location] : dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(* )  define dimension time as (time_key, day, day_of_week, month, quarter, year)  define dimension item as (item_key, item_name, brand, type, supplier_type)  define dimension branch as (branch_key, branch_name, branch_type)  define dimension location as (location_key, street, city, state, country) 10

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend