 
              11/16/2017 CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse & OLAP Operations  Basic Concept of Data W arehouse  Data W arehouse Modeling  Data W arehouse Architecture  Data W arehouse I m plem entation  From Data W arehousing to Data Mining 1
11/16/2017 What is Data Warehouse?  Data W arehouse ( defined in many different ways )  A decision support database that is maintained separately from the organization’s operational database  The support of information processing by providing a solid platform of consolidated, historical data for analysis  “A data warehouse is a (1) subject-oriented , (2) integrated , (3) time-variant , and (4) nonvolatile collection of data in support of management’s decision-making process.” — W. H. I nmon  Data W arehousing  The process of constructing and using data warehouses Data Warehouse – Subject-Oriented  Organized around major subjects  e.g., customers, products, sales  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing  Provide a simple and concise view around particular subject issues  Excluding data that are not useful in the decision support process 2
11/16/2017 Data Warehouse – Integrated  I ntegrating multiple, heterogeneous data sources  Relational databases, flat files, on-line transaction records  Apply data cleaning and data integration techniques  Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources Data Warehouse – Time Variant  The time horizon of data warehouses is significantly longer than that of operational systems  Operational databases have current data values  Data warehouses provide information from a historical perspective (e.g., 10-20 years)  Time is a key structure in data warehouses  Contain the attribute of time (explicitly or implicitly) 3
11/16/2017 Data Warehouse – Nonvolatile  A physically separate storage of data transformed from operational databases  Operational update of data does not occur  Not require transaction processing, recovery, and concurrency control mechanisms  Require only two operations, initial loading of data and access of data Data Integration Methods  Methods (1) Process to provide uniform interface to multiple data sources → Tradition Database I ntegration (2) Process to combine multiple data sources into coherent storage → Data warehousing  Traditional DB I ntegration  A query-driven approach  Wrappers / mediators on top of heterogeneous data sources  Data W arehousing  An update-driven approach  Combined the heterogeneous data sources in advance  Stored them in a warehouse for direct query and analysis 4
11/16/2017 OLTP vs. OLAP  OLTP ( on-line transaction processing)  Major task of traditional relational DBMS  Day-to-day operations: e.g., purchasing, inventory, manufacturing, banking, payroll, registration, accounting, etc.  OLAP ( on-line analytical processing)  Major task of data warehouse system  Data analysis and decision making  Distinct Features ( OLTP vs. OLAP)  User and system orientation (customers vs. market analysts)  Data contents (current, detailed vs. historical, consolidated)  Database design (ER + application vs. star + subject)  View (current, local vs. evolutionary, integrated) OLTP vs. OLAP Feature OLTP OLAP Characteristic operational processing information processing Orientation transaction analysis Users clerk, DBA knowledge worker (CEO, analyst) Function day-to-day operations decision support DB Design application-oriented subject-oriented Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query Access read/write/update read-only (lots of scans) 5
11/16/2017 Why Data Warehouse?  Perform ance I ssue  DBMS: tuned for OLTP e.g., access methods, indexing, concurrency control, recovery  Data warehouse: tuned for OLAP e.g., complex queries, multidimensional view, consolidation  Data I ssue  Decision support requires historical data, consolidated and summarized data, consistent data CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse & OLAP Operations  Basic Concept of Data W arehouse  Data W arehouse Modeling  Data W arehouse Architecture  Data W arehouse I m plem entation  From Data W arehousing to Data Mining 6
11/16/2017 Data Format for Warehouse  Dim ensions  Multi-dimensional data model  Data are stored in the form of a data cube  Data Cube  A view of multi-dimensions  Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables  Cuboid  Each combination of dimensional spaces in a data cube  0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid The Lattice of Cuboids all 0-D (apex) cuboid time item location supplier 1-D cuboids time,location item,location location,supplier time,item 2-D cuboids time,supplier item,supplier time,item,location time,location,supplier 3-D cuboids time,item,supplier item,location,supplier 4-D (base) cuboid time, item, location, supplier 7
11/16/2017 Conceptual Modeling  Key of Modeling Data W arehouses  Handling dimensions & measures  Exam ples  Star schema: A fact table in the middle connected to a set of dimension tables  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation Example of Star Schema time item time_key day item_key day_of_the_week Sales Fact Table item_name month brand time_key quarter type year supplier_type item_key branch_key branch location_key location branch_key units_sold location_key branch_name branch_type street dollars_sold city state avg_sales country Measures 8
11/16/2017 Example of Snowflake Schema time item time_key supplier day item_key day_of_the_week Sales Fact Table item_name supplier_key month brand supplier_type time_key quarter type year supplier_key item_key branch_key branch location_key location branch_key units_sold branch_name location_key branch_type street city dollars_sold city_key city_key avg_sales city state Measures country Example of Fact Constellation Shipping Fact Table time item time_key time_key day item_key day_of_the_week item_key Sales Fact Table item_name month brand shipper_key time_key quarter type year supplier_type from_location item_key to_location branch_key branch dollars_cost location_key location branch_key units_shipped units_sold location_key branch_name branch_type street dollars_sold shipper city state avg_sales shipper_key country shipper_name Measures location_key shipper_type 9
11/16/2017 Cube Definition in DMQL  Cube Definition (Fact Table)  define cube < cube_name> [ < dimension_list> ] : < measure_list>  Dim ension Definition (Dimension Table)  define dimension < dimension_name> as (< attribute_or_dimension_list> )  Special Case (Shared Dimension Table)  define dimension < dimension_name> as < dimension_name_first> in cube < cube_name_first> Star Schema Definition in DMQL  Exam ple  define cube sales [ time, item, branch, location] : dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(* )  define dimension time as (time_key, day, day_of_week, month, quarter, year)  define dimension item as (item_key, item_name, brand, type, supplier_type)  define dimension branch as (branch_key, branch_name, branch_type)  define dimension location as (location_key, street, city, state, country) 10
Recommend
More recommend