w arehousing the most common form of information in
play

W arehousing The most common form of information in - PDF document

W arehousing The most common form of information in tegration: cop y sources in to a single DB and try to k eep it up-to-date. Usual metho d: p erio dic reconstruction of the w arehouse, p erhaps o v


  1. W arehousing The most common form of information � in tegration: cop y sources in to a single DB and try to k eep it up-to-date. Usual metho d: p erio dic reconstruction of the � w arehouse, p erhaps o v ernigh t. 1

  2. OL TP V ersus OLAP Most database op erations are of a t yp e called � ( OL TP ). on-line tr ansaction pr o c essing ✦ Short, simple queries and frequen t up dates in v olving one or a small n um b er of tuples. ✦ Examples: answ ering queries from a W eb in terface, recording sales at cash-registers, selling airline tic k ets. 2

  3. Of increasing imp ortance are op erations of the � ( OLAP ) t yp e. on-line analytic pr o c essing ✦ F ew, but v ery complex and time- consuming queries (can run for hours). ✦ Up dates are infrequen t, and/or the answ er to the query is not dep enden t on ha ving an absolutely up-to-date database. ✦ Example: Amazon analyzes purc hases b y all its customers to come up with an individual screen with pro ducts of lik ely in terest to the customer. ✦ Example: Analysts at W al-Mart lo ok for items with increasing sales at stores in some region. Common arc hitecture: Lo cal databases, sa y � one p er branc h store, handle OL TP , while a w arehouse in tegrating information from all branc hes handles OLAP . The most complex OLAP queries are often � referred to as mining . data 3

  4. Star Sc hemas Commonly , the data at a w arehouse is of t w o t yp es: 1. : V ery large, accum ulation of facts F act Data suc h as sales. ✦ Often \insert-only"; once there, a tuple remains. 2. : Smaller, generally static, Dimension Data information ab out the en tities in v olv ed in the facts. 4

  5. Example Supp ose w e w an ted to record ev ery sale of b eer at all bars: the bar, the b eer, the drink er who b ough t the b eer, the da y and time, the price c harged. F act data is in a relation with sc hema: � Sales(bar, beer, drinker, day, time, price) Dimension data could include a relation for � bars, one for b eers, and one for drink ers. Bars(bar, addr, lic) Beers(beer, manf) Drinkers(drinker, addr, phone) 5

  6. Tw o Approac hes to Building W arehouses 1. (Relational OLAP): relational R OLAP database system tuned for star sc hemas, e.g. using sp ecial index structures suc h as: ✦ \Bitmap indexes" (for eac h k ey of a dimension table, e.g., bar name, a bit- v ector telling whic h tuples of the fact table ha v e that v alue). ✦ = answ ers to general Materialize d views queries from whic h more sp eci�c queries can b e answ ered with less w ork than if w e had to w ork from the ra w data. 2. (Multidi mensional OLAP): A MOLAP sp eciali zed mo del based on a \cub e" view of data. 6

  7. R OLAP T ypical queries b egin with a complete \star join," for example: SELECT * FROM Sales, Bars, Beers, Drinkers WHERE Sales.bar = Bars.bar AND Sales.beer = Beers.beer AND Sales.drinker = Drinkers.drinker; T ypical OLAP query will: � 1. Do all or part of the star join. 2. Filter in teresting tuples based on fact and/or dimension data. 3. Group b y one or more dimensions. 4. Aggregate the result. Example: \F or eac h bar in P alo Alto, �nd � the total sale of eac h b eer man ufactured b y Anheuser-Busc h." 7

  8. P erformance Issues If the fact table is large, queries will tak e � m uc h to o long. Materializ ed views can help. � Example F or the question ab out bars in P alo Alto and b eers b y Anheuser-Busc h, w e w ould b e aided b y the materialized view: CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS SELECT bar, addr, beer, manf, SUM(price) AS sales FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers GROUP BY bar, addr, beer, manf; 8

  9. MOLAP Based on \data cub e": k eys of dimension tables form axes of the cub e. Example: for our running example, w e migh t � ha v e four dimensions: bar, b eer, drink er, and time. Dep enden t attributes (price of the sale in our � example) app ear at the p oin ts of the cub e. But the cub e also includes aggregations (sums, � t ypicall y) along the margins. ✦ Example: in our 4-dimensional cub e, w e w ould ha v e the sum o v er eac h bar, eac h b eer, eac h drink er, and eac h time instan t (p erhaps group b y da y). ✦ W e w ould also ha v e aggregations b y all subsets of the dimensions, e.g., b y eac h bar and b eer, or b y eac h b eer, drink er, and da y . 9

  10. Slicing and Dicing = select a v alue along one dimension, Slic e � e.g., a particular bar. = the same thing along another Dic e � dimension, e.g., a particular b eer. Drill-Do wn and Roll-Up = \de-aggregate" = break an Dril l-down � aggregate in to its constituen ts. ✦ Example: ha ving determined that Jo e's Bar in P alo Alto is selling v ery few Anheuser-Busc h b eers, break do wn his sales b y the particular b eer. = aggregate along one dimension. R ol l-up � ✦ Example: giv en a table of ho w m uc h Budw eiser eac h drink er consumes at eac h bar, roll it up in to a table of amoun t consumed b y eac h drink er. 10

  11. P erformance As with R OLAP , materialized views can help. Data-cub es in vite materialized views that are � aggregations in one or more dimensions. Dimensions need not b e aggregated � completely . Rather, grouping b y attributes of the dimension table is p ossible. ✦ Example: a materialized view migh t aggregate b y drink er completely , b y b eer not at all, b y time according to the da y , and b y bar only according to the cit y of the bar. ✦ Example: time is a really in teresting dimension, since there are natural groupings, suc h as w eeks and mon ths, that are not commensurate. 11

  12. Data Mining Large-scale queries designed to extract patterns from data. Big example: \asso ciation-rule s" or \frequen t � itemsets." Mark et-Bask et Data An imp ortan t source of data for asso ciation rules is askets . market b As a customer passes through the c hec k out, � w e learn what items they buy together, e.g., ham burger and k etc h up. Giv es us data with sc hema Baskets(bid, � item) . Mark eters w ould lik e to kno w what items � p eople buy together. ✦ Example: if p eople tend to buy ham burger and k etc h up together, put them near eac h other, with p otato c hips b et w een. ✦ Example: run a sale on ham burger and raise the price of k etc h up. 12

  13. Simplest Problem: Find the F requen t P airs of Items Giv en a s , w e could ask: supp ort thr eshold Find the pairs of items that app ear together in � at least bask ets. s SELECT b1.item, b2.item FROM Baskets b1, Baskets b2 WHERE b1.bid = b2.bid AND b1.item < b2.item GROUP BY b1.item, b2.item HAVING COUNT(*) >= s; 13

  14. A-Priori T ric k Ab o v e query is prohibitiv ely exp ensiv e for � large data. uses the fact that a pair A-priori algorithm � ( i; ) cannot ha v e supp ort unless and j s i j b oth ha v e supp ort b y themselv es. s More e�cien t implemen tation uses an � in termediate relation Baskets1 . INSERT INTO Baskets1(bid, item) SELECT * FROM Baskets WHERE item IN ( SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= s ); Then run the query for pairs on Baskets1 � instead of Baskets . 14

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend