Warehousing



slide-1
SLIDE 1 Warehousing
  • The most common form of information integration: copy sources into a single DB and try to keep it up-to-date.
  • Usual method: periodic reconstruction of the warehouse, perhaps overnight.
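A minimal sketch of periodic reconstruction, assuming each source is a small SQLite database with a hypothetical Sales(item, price) table. The function name `rebuild_warehouse`, the store names, and the schemas are illustrative assumptions, not part of the slides.

```python
import sqlite3

def make_source(rows):
    """Create a stand-in source database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Sales(item TEXT, price REAL)")
    conn.executemany("INSERT INTO Sales VALUES (?, ?)", rows)
    return conn

def rebuild_warehouse(sources):
    """Rebuild the warehouse from scratch by copying every source's rows."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE Sales(store TEXT, item TEXT, price REAL)")
    for store, conn in sources.items():
        rows = conn.execute("SELECT item, price FROM Sales").fetchall()
        # Tag each copied row with the source it came from.
        warehouse.executemany(
            "INSERT INTO Sales VALUES (?, ?, ?)",
            [(store, item, price) for item, price in rows],
        )
    warehouse.commit()
    return warehouse

sources = {
    "store_a": make_source([("milk", 2.0), ("bread", 3.0)]),
    "store_b": make_source([("milk", 2.5)]),
}
warehouse = rebuild_warehouse(sources)  # in practice, run on a schedule
row_count = warehouse.execute("SELECT COUNT(*) FROM Sales").fetchone()[0]
print(row_count)
```

Rebuilding from scratch is simple but means the warehouse is only as fresh as the last run.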
slide-2
SLIDE 2 OLTP Versus OLAP
  • Most database operations are of a type called on-line transaction processing (OLTP).
      • Short, simple queries and frequent updates involving one or a small number of tuples.
      • Examples: answering queries from a Web interface, recording sales at cash registers, selling airline tickets.
slide-3
SLIDE 3
  • Of increasing importance are operations of the on-line analytic processing (OLAP) type.
      • Few, but very complex and time-consuming queries (can run for hours).
      • Updates are infrequent, and/or the answer to the query is not dependent on having an absolutely up-to-date database.
      • Example: Amazon analyzes purchases by all its customers to come up with an individual screen with products of likely interest to the customer.
      • Example: Analysts at Wal-Mart look for items with increasing sales at stores in some region.
  • Common architecture: Local databases, say one per branch store, handle OLTP, while a warehouse integrating information from all branches handles OLAP.
  • The most complex OLAP queries are often referred to as data mining.
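To contrast the two workloads, here is a small sketch using an in-memory SQLite table. The table layout and sample rows are assumptions made for illustration; the point is only that the OLTP statement touches one tuple while the OLAP query scans and aggregates all of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales(store TEXT, item TEXT, day TEXT, price REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("north", "milk", "2024-01-01", 2.0),
    ("north", "milk", "2024-01-02", 2.0),
    ("south", "bread", "2024-01-01", 3.0),
])

# OLTP-style operation: record a single sale at a cash register (one tuple).
conn.execute(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    ("south", "milk", "2024-01-02", 2.5),
)

# OLAP-style operation: aggregate over the whole table.
totals = conn.execute(
    "SELECT item, SUM(price) FROM Sales GROUP BY item ORDER BY item"
).fetchall()
print(totals)
```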
slide-4
SLIDE 4 Star Schemas
Commonly, the data at a warehouse is of two types:
1. Fact Data: Very large accumulation of facts such as sales.
      • Often "insert-only"; once there, a tuple remains.
2. Dimension Data: Smaller, generally static information about the entities involved in the facts.
slide-5
SLIDE 5 Example
Suppose we wanted to record every sale of beer at all bars: the bar, the beer, the drinker who bought the beer, the day and time, the price charged.
  • Fact data is in a relation with schema:
        Sales(bar, beer, drinker, day, time, price)
  • Dimension data could include a relation for bars, one for beers, and one for drinkers.
        Bars(bar, addr, lic)
        Beers(beer, manf)
        Drinkers(drinker, addr, phone)
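The star schema above might be declared as follows. This is a sketch only: the column types, the primary keys, the REFERENCES clauses, and the sample rows are assumptions, since the slides give attribute names but no types or constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces REFERENCES clauses only when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, lic TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Drinkers(drinker TEXT PRIMARY KEY, addr TEXT, phone TEXT);
-- Fact table: each row points at one row of every dimension table.
CREATE TABLE Sales(
    bar TEXT REFERENCES Bars(bar),
    beer TEXT REFERENCES Beers(beer),
    drinker TEXT REFERENCES Drinkers(drinker),
    day TEXT, time TEXT, price REAL
);
""")
conn.execute("INSERT INTO Bars VALUES (?, ?, ?)", ("Joe's Bar", "Palo Alto", "L1"))
conn.execute("INSERT INTO Beers VALUES (?, ?)", ("Bud", "Anheuser-Busch"))
conn.execute("INSERT INTO Drinkers VALUES (?, ?, ?)", ("Ann", "Palo Alto", "555-0100"))
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Joe's Bar", "Bud", "Ann", "2024-01-01", "18:00", 3.0),
    ("Joe's Bar", "Bud", "Ann", "2024-01-02", "19:30", 3.0),
])

# A fact that names an unknown bar is rejected by the foreign key.
try:
    conn.execute(
        "INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)",
        ("No Such Bar", "Bud", "Ann", "2024-01-03", "20:00", 3.0),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM Sales").fetchone()[0]
print(fact_rows, rejected)
```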
slide-6
SLIDE 6 Two Approaches to Building Warehouses
1. ROLAP (Relational OLAP): relational database system tuned for star schemas, e.g., using special index structures such as:
      • "Bitmap indexes" (for each key of a dimension table, e.g., bar name, a bit-vector telling which tuples of the fact table have that value).
      • Materialized views = answers to general queries from which more specific queries can be answered with less work than if we had to work from the raw data.
2. MOLAP (Multidimensional OLAP): A specialized model based on a "cube" view of data.
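The bitmap-index idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not how any particular system stores its indexes: the fact table is a Python list, each bit-vector is a Python integer whose bit i is set when fact row i has the value, and the helper names are made up here.

```python
def build_bitmap_index(fact_rows, column):
    """Map each distinct value of `column` to an integer used as a bit-vector."""
    index = {}
    for position, row in enumerate(fact_rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << position)
    return index

def matching_positions(bits):
    """List the fact-table row positions whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

sales = [
    {"bar": "Joe's", "beer": "Bud"},
    {"bar": "Sue's", "beer": "Bud"},
    {"bar": "Joe's", "beer": "Miller"},
    {"bar": "Joe's", "beer": "Bud"},
]
bar_index = build_bitmap_index(sales, "bar")
beer_index = build_bitmap_index(sales, "beer")

# Conditions on two dimensions combine with a single bitwise AND.
both = bar_index["Joe's"] & beer_index["Bud"]
positions = matching_positions(both)
print(positions)
```

The appeal is that a conjunction of dimension conditions reduces to bitwise operations before any fact tuple is fetched.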
slide-7
SLIDE 7 ROLAP
Typical queries begin with a complete "star join," for example:
    SELECT *
    FROM Sales, Bars, Beers, Drinkers
    WHERE Sales.bar = Bars.bar AND
          Sales.beer = Beers.beer AND
          Sales.drinker = Drinkers.drinker;
  • Typical OLAP query will:
    1. Do all or part of the star join.
    2. Filter interesting tuples based on fact and/or dimension data.
    3. Group by one or more dimensions.
    4. Aggregate the result.
  • Example: "For each bar in Palo Alto, find the total sale of each beer manufactured by Anheuser-Busch."
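One way the Palo Alto example could be written, following the four steps above. The sample rows are invented, and the filter assumes the city name appears inside Bars.addr (hence the LIKE pattern); the slides do not say how addresses are stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bars(bar TEXT, addr TEXT, lic TEXT);
CREATE TABLE Beers(beer TEXT, manf TEXT);
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, time TEXT, price REAL);
""")
conn.executemany("INSERT INTO Bars VALUES (?, ?, ?)", [
    ("Joe's Bar", "1 Main St, Palo Alto", "L1"),
    ("Sue's Bar", "2 Oak St, Palo Alto", "L2"),
    ("Moe's Bar", "3 Elm St, Menlo Park", "L3"),
])
conn.executemany("INSERT INTO Beers VALUES (?, ?)", [
    ("Bud", "Anheuser-Busch"),
    ("Michelob", "Anheuser-Busch"),
    ("Miller", "Miller Brewing"),
])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Joe's Bar", "Bud", "Ann", "2024-01-01", "18:00", 3.0),
    ("Joe's Bar", "Bud", "Bob", "2024-01-01", "19:00", 3.0),
    ("Joe's Bar", "Miller", "Ann", "2024-01-02", "18:00", 2.5),
    ("Sue's Bar", "Michelob", "Bob", "2024-01-02", "20:00", 4.0),
    ("Moe's Bar", "Bud", "Ann", "2024-01-03", "21:00", 3.5),
])

result = conn.execute("""
    SELECT Sales.bar, Sales.beer, SUM(Sales.price)      -- 4. aggregate
    FROM Sales, Bars, Beers                             -- 1. part of the star join
    WHERE Sales.bar = Bars.bar AND Sales.beer = Beers.beer
      AND Bars.addr LIKE '%Palo Alto%'                  -- 2. filter on dimension data
      AND Beers.manf = 'Anheuser-Busch'
    GROUP BY Sales.bar, Sales.beer                      -- 3. group by dimensions
    ORDER BY Sales.bar, Sales.beer
""").fetchall()
print(result)
```

Drinkers is left out of the join because the question does not mention drinkers, which is the "part of the star join" case.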
slide-8
SLIDE 8 Performance Issues
  • If the fact table is large, queries will take much too long.
  • Materialized views can help.
Example
For the question about bars in Palo Alto and beers by Anheuser-Busch, we would be aided by the materialized view:
    CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS
        SELECT bar, addr, beer, manf, SUM(price) AS sales
        FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
        GROUP BY bar, addr, beer, manf;
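A sketch of answering the question from BABMS instead of from Sales. SQLite has no materialized views, so this example stores the view's result with CREATE TABLE ... AS, which is a stand-in chosen for the demo; the sample data and the LIKE test on addr are also assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bars(bar TEXT, addr TEXT, lic TEXT);
CREATE TABLE Beers(beer TEXT, manf TEXT);
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, time TEXT, price REAL);
""")
conn.executemany("INSERT INTO Bars VALUES (?, ?, ?)", [
    ("Joe's Bar", "1 Main St, Palo Alto", "L1"),
    ("Sue's Bar", "2 Oak St, Palo Alto", "L2"),
    ("Moe's Bar", "3 Elm St, Menlo Park", "L3"),
])
conn.executemany("INSERT INTO Beers VALUES (?, ?)", [
    ("Bud", "Anheuser-Busch"),
    ("Michelob", "Anheuser-Busch"),
    ("Miller", "Miller Brewing"),
])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Joe's Bar", "Bud", "Ann", "2024-01-01", "18:00", 3.0),
    ("Joe's Bar", "Bud", "Bob", "2024-01-01", "19:00", 3.0),
    ("Joe's Bar", "Miller", "Ann", "2024-01-02", "18:00", 2.5),
    ("Sue's Bar", "Michelob", "Bob", "2024-01-02", "20:00", 4.0),
    ("Moe's Bar", "Bud", "Ann", "2024-01-03", "21:00", 3.5),
])

# Stand-in for the materialized view: compute once, store the result.
conn.execute("""
    CREATE TABLE BABMS AS
    SELECT bar, addr, beer, manf, SUM(price) AS sales
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    GROUP BY bar, addr, beer, manf
""")

# The specific question now needs only a filter over the smaller table.
answer = conn.execute("""
    SELECT bar, beer, sales FROM BABMS
    WHERE addr LIKE '%Palo Alto%' AND manf = 'Anheuser-Busch'
    ORDER BY bar, beer
""").fetchall()
view_rows = conn.execute("SELECT COUNT(*) FROM BABMS").fetchone()[0]
print(answer, view_rows)
```

A stored copy like this goes stale when Sales changes, so it would need to be refreshed, for example during the periodic rebuild.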
slide-9
SLIDE 9 MOLAP
Based on "data cube": keys of dimension tables form axes of the cube.
  • Example: for our running example, we might have four dimensions: bar, beer, drinker, and time.
  • Dependent attributes (price of the sale in our example) appear at the points of the cube.
  • But the cube also includes aggregations (sums, typically) along the margins.
      • Example: in our 4-dimensional cube, we would have the sum over each bar, each beer, each drinker, and each time instant (perhaps group by day).
      • We would also have aggregations by all subsets of the dimensions, e.g., by each bar and beer, or by each beer, drinker, and day.
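A small sketch of the cube's aggregations, assuming the facts fit in a Python list and time is already grouped by day. Each entry of the result is keyed by the tuple of dimensions that are kept; all other dimensions are summed out. The function name and sample facts are illustrative.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("bar", "beer", "drinker", "day")

def build_cube(facts):
    """Sum price for every subset of the dimensions (2^4 = 16 groupings)."""
    cube = {}
    for k in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, k):
            totals = defaultdict(float)
            for fact in facts:
                key = tuple(fact[d] for d in dims)
                totals[key] += fact["price"]
            cube[dims] = dict(totals)
    return cube

facts = [
    {"bar": "Joe's", "beer": "Bud", "drinker": "Ann", "day": "Mon", "price": 3.0},
    {"bar": "Joe's", "beer": "Miller", "drinker": "Bob", "day": "Mon", "price": 2.5},
    {"bar": "Sue's", "beer": "Bud", "drinker": "Ann", "day": "Tue", "price": 3.5},
]
cube = build_cube(facts)

print(cube[()])                # grand total over all dimensions
print(cube[("bar",)])          # total per bar
print(cube[("bar", "beer")])   # total per (bar, beer) pair
```

Keeping all four dimensions gives the points of the cube themselves; the empty tuple gives the grand total.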
slide-10
SLIDE 10 Slicing and Dicing
  • Slice = select a value along one dimension, e.g., a particular bar.
  • Dice = the same thing along another dimension, e.g., a particular beer.
Drill-Down and Roll-Up
  • Drill-down = "de-aggregate" = break an aggregate into its constituents.
      • Example: having determined that Joe's Bar in Palo Alto is selling very few Anheuser-Busch beers, break down his sales by the particular beer.
  • Roll-up = aggregate along one dimension.
      • Example: given a table of how much Budweiser each drinker consumes at each bar, roll it up into a table of amount consumed by each drinker.
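The four operations can be sketched over a small in-memory fact list. The data and the `aggregate` helper are assumptions for illustration, and price stands in for "amount consumed" in the roll-up example.

```python
from collections import defaultdict

def aggregate(rows, dims):
    """Sum price grouped by the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["price"]
    return dict(totals)

sales = [
    {"bar": "Joe's", "beer": "Bud", "drinker": "Ann", "price": 3.0},
    {"bar": "Joe's", "beer": "Michelob", "drinker": "Bob", "price": 4.0},
    {"bar": "Sue's", "beer": "Bud", "drinker": "Ann", "price": 3.5},
    {"bar": "Sue's", "beer": "Bud", "drinker": "Bob", "price": 3.5},
]

# Slice: fix a value along one dimension (a particular bar).
joes = [r for r in sales if r["bar"] == "Joe's"]
# Dice: fix a value along another dimension as well (a particular beer).
joes_bud = [r for r in joes if r["beer"] == "Bud"]

# Drill-down: start from one aggregate, then break it down by beer.
joes_total = aggregate(joes, ())
joes_by_beer = aggregate(joes, ("beer",))

# Roll-up: from Bud per (drinker, bar), aggregate away the bar dimension.
bud_by_drinker_bar = aggregate(
    [r for r in sales if r["beer"] == "Bud"], ("drinker", "bar")
)
bud_by_drinker = defaultdict(float)
for (drinker, _bar), amount in bud_by_drinker_bar.items():
    bud_by_drinker[drinker] += amount
bud_by_drinker = dict(bud_by_drinker)

print(joes_by_beer, bud_by_drinker)
```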
slide-11
SLIDE 11 Performance
As with ROLAP, materialized views can help.
  • Data-cubes invite materialized views that are aggregations in one or more dimensions.
  • Dimensions need not be aggregated completely. Rather, grouping by attributes of the dimension table is possible.
      • Example: a materialized view might aggregate by drinker completely, by beer not at all, by time according to the day, and by bar only according to the city of the bar.
      • Example: time is a really interesting dimension, since there are natural groupings, such as weeks and months, that are not commensurate.
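A sketch of the partially aggregated view from the first example: drinkers are summed out, beers are kept, time is grouped by day, and bars are grouped by city. It assumes a hypothetical `city` column on Bars (the running schema only has addr) and invented sample rows. The last lines illustrate the second example with two dates that share an ISO week but fall in different months.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bars(bar TEXT, city TEXT);  -- city is an assumed attribute
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, price REAL);
""")
conn.executemany("INSERT INTO Bars VALUES (?, ?)", [
    ("Joe's Bar", "Palo Alto"),
    ("Sue's Bar", "Palo Alto"),
    ("Moe's Bar", "Menlo Park"),
])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?)", [
    ("Joe's Bar", "Bud", "Ann", "2024-01-31", 3.0),
    ("Sue's Bar", "Bud", "Bob", "2024-01-31", 3.5),
    ("Moe's Bar", "Bud", "Ann", "2024-01-31", 3.0),
    ("Joe's Bar", "Bud", "Ann", "2024-02-01", 3.0),
])

# Group by city (part of the bar dimension), beer, and day; drinker is summed out.
view = conn.execute("""
    SELECT Bars.city, Sales.beer, Sales.day, SUM(Sales.price)
    FROM Sales JOIN Bars ON Sales.bar = Bars.bar
    GROUP BY Bars.city, Sales.beer, Sales.day
    ORDER BY Bars.city, Sales.beer, Sales.day
""").fetchall()
print(view)

# Weeks and months do not nest: same ISO week, different months.
d1, d2 = date(2024, 1, 31), date(2024, 2, 1)
same_week = d1.isocalendar()[1] == d2.isocalendar()[1]
same_month = d1.month == d2.month
print(same_week, same_month)
```

Because weeks straddle month boundaries, a per-week aggregate cannot be rolled up into a per-month one; both would have to be derived from per-day data.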
slide-12
SLIDE 12 Data Mining
Large-scale queries designed to extract patterns from data.
  • Big example: "association rules" or "frequent itemsets."
Market-Basket Data
An important source of data for association rules is market baskets.
  • As a customer passes through the checkout, we learn what items they buy together, e.g., hamburger and ketchup.
  • Gives us data with schema Baskets(bid, item).
  • Marketers would like to know what items people buy together.
      • Example: if people tend to buy hamburger and ketchup together, put them near each other, with potato chips between.
      • Example: run a sale on hamburger and raise the price of ketchup.
slide-13
SLIDE 13 Simplest Problem: Find the Frequent Pairs of Items
Given a support threshold s, we could ask:
  • Find the pairs of items that appear together in at least s baskets.
    SELECT b1.item, b2.item
    FROM Baskets b1, Baskets b2
    WHERE b1.bid = b2.bid AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= s;
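The pairs query can be tried on a toy Baskets table as follows. The sample baskets and the threshold s = 2 are made up, s is passed as a query parameter, and the sketch assumes no basket lists the same item twice. An ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets(bid INTEGER, item TEXT)")
conn.executemany("INSERT INTO Baskets VALUES (?, ?)", [
    (1, "hamburger"), (1, "ketchup"), (1, "chips"),
    (2, "hamburger"), (2, "ketchup"),
    (3, "hamburger"), (3, "chips"),
    (4, "ketchup"), (4, "milk"),
])

s = 2  # support threshold (assumed value for the demo)
frequent_pairs = conn.execute("""
    SELECT b1.item, b2.item
    FROM Baskets b1, Baskets b2
    WHERE b1.bid = b2.bid AND b1.item < b2.item  -- each pair once, no self-pairs
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= ?
    ORDER BY b1.item, b2.item
""", (s,)).fetchall()
print(frequent_pairs)
```

The condition b1.item < b2.item keeps each unordered pair exactly once and excludes an item paired with itself.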
slide-14
SLIDE 14 A-Priori Trick
  • Above query is prohibitively expensive for large data.
  • A-priori algorithm uses the fact that a pair (i, j) cannot have support s unless i and j both have support s by themselves.
  • More efficient implementation uses an intermediate relation Baskets1.
    INSERT INTO Baskets1(bid, item)
    SELECT * FROM Baskets
    WHERE item IN (
        SELECT item
        FROM Baskets
        GROUP BY item
        HAVING COUNT(*) >= s
    );
  • Then run the query for pairs on Baskets1 instead of Baskets.
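A sketch of the two-step a-priori approach on a toy table, checking that the pairs found on Baskets1 match those found on Baskets. The sample data, the threshold s = 2, and the helper `frequent_pairs` are assumptions for the demo; Baskets1 must be created before the INSERT can run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets(bid INTEGER, item TEXT)")
conn.execute("CREATE TABLE Baskets1(bid INTEGER, item TEXT)")
conn.executemany("INSERT INTO Baskets VALUES (?, ?)", [
    (1, "hamburger"), (1, "ketchup"), (1, "chips"),
    (2, "hamburger"), (2, "ketchup"),
    (3, "hamburger"), (3, "chips"),
    (4, "ketchup"), (4, "milk"),
])
s = 2  # support threshold (assumed value for the demo)

# Step 1: keep only rows whose item is frequent on its own.
conn.execute("""
    INSERT INTO Baskets1(bid, item)
    SELECT * FROM Baskets
    WHERE item IN (
        SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= ?
    )
""", (s,))

def frequent_pairs(table):
    """Run the pairs query on a table; name is trusted (fixed in this demo)."""
    return conn.execute(f"""
        SELECT b1.item, b2.item
        FROM {table} b1, {table} b2
        WHERE b1.bid = b2.bid AND b1.item < b2.item
        GROUP BY b1.item, b2.item
        HAVING COUNT(*) >= ?
        ORDER BY b1.item, b2.item
    """, (s,)).fetchall()

# Step 2: run the pairs query on the smaller relation.
pairs_filtered = frequent_pairs("Baskets1")
pairs_full = frequent_pairs("Baskets")
rows_full = conn.execute("SELECT COUNT(*) FROM Baskets").fetchone()[0]
rows_filtered = conn.execute("SELECT COUNT(*) FROM Baskets1").fetchone()[0]
print(pairs_filtered, rows_full, rows_filtered)
```

Here only milk is dropped, but on real data most items are infrequent, so the self-join over Baskets1 can be far smaller than the one over Baskets.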