OLAP over Imprecise Data with Domain Constraints Doug Burdick - - PowerPoint PPT Presentation

olap over imprecise data with domain constraints
SMART_READER_LITE
LIVE PREVIEW

OLAP over Imprecise Data with Domain Constraints Doug Burdick - - PowerPoint PPT Presentation

WILD PROJECT REVIEW WILD PROJECT REVIEW WILD PROJECT REVIEW OLAP over Imprecise Data with Domain Constraints Doug Burdick University of Wisconsin Madison Joint work with AnHai Doan (UW-Madison), Raghu Ramakrishnan (Yahoo! Research),


slide-1
SLIDE 1

WILD PROJECT REVIEW WILD PROJECT REVIEW WILD PROJECT REVIEW

OLAP over Imprecise Data with Domain Constraints

Doug Burdick University of Wisconsin – Madison Joint work with AnHai Doan (UW-Madison), Raghu Ramakrishnan (Yahoo! Research), Shivakumar Vaithyanathan (IBM Research at Almaden)

slide-2
SLIDE 2
  • Traditional OLAP: Data Model

Mil Mad LA SJC CA WI ALL Loc Civic Sierra F150 Camry Truck Sedan ALL Auto

200 100 500 100 Repair Mil Mil Mad Mad Loc Sierra F150 Sierra F150 Auto p4 p3 p2 p1 FactID p3 p1 p4 p2

slide-3
SLIDE 3
  • Traditional OLAP: Queries

200 100 500 100 Repair Mil Mil Mad Mad Loc Sierra F150 Sierra F150 Auto p4 p3 p2 p1 FactID p3 p1 p4 p2

Auto = Truck Loc = Mil SUM(Repair) = ?

Mil Mad LA SJC CA WI ALL Loc Civic Sierra F150 Camry Truck Sedan

ALL

Auto

Answer: 300

slide-4
SLIDE 4
  • Querying Information Extracted from Text

After my old car was totaled in the Madison flood, I bought a BMW 330. It’s at the mechanic’s all the time. p3 My 5-speed Subaru Outback handles well in Wisconsin

  • winters. Great value at $25000

p2 I love the reliability of my F150 from Zimbrick Ford in

  • Milwaukee. Much better than my
  • Sierra. Paid $30000 for a 4WD.

p1

Review Text ID

330 BMW 330 Madison p3 25000 Subaru Outback Wisconsin p2 30000 {F150, Sierra} Milwaukee p1

Price Model Location ID

For each location, what is the average price for different cars?

In a dataset from a real-world application at IBM Almaden with 800,000 facts, 30% were imprecise

slide-5
SLIDE 5
  • [VLDB 05] Proposed Solution:

Allow Imprecise Facts

p3 p1 p4 p2 100 Mil Truck p5 200 Mil Sierra p4 100 Mil F150 p3 500 Mad Sierra p2 100 Mad F150 p1 Repair Loc Auto FactID p5

Mil Mad LA SJC CA WI ALL Loc Civic Sierra F150 Camry Truck Sedan

ALL

Auto

slide-6
SLIDE 6
  • [VLDB 05] Problem: How to Query

Imprecise Facts

p3 p1 p4 p2 p5

Mil Mad Sierra F150

100 200 100 500 100 Repair Mil Mil Mil Mad Mad Loc Truck p5 Sierra F150 Sierra F150 Auto p4 p3 p2 p1 FactID

Truck WI

Auto = F150 Loc = Mil SUM(Repair) = ? Answer: ?

slide-7
SLIDE 7
  • [VLDB 05] Solution: Use possible

worlds

EDB D’ Imprecise fact table D

Q

Allocation

w1 w2 w3 w4

A

Possible worlds

Query answer is expected value over possible worlds

slide-8
SLIDE 8
  • [VLDB 05] Example

100 200 100 500 100 Repair Mil Truck p5 Mil Mil Mad Mad Loc Sierra F150 Sierra F150 Auto p4 p3 p2 p1 FactID 6 5 4 3 2 1 ID 100 100 200 100 500 100 Repair 0.6 Mil F150 p5 0.4 1.0 1.0 1.0 1.0 Alloc Mil Mil Mil Mad Mad Loc Sierra p5 Sierra F150 Sierra F150 Auto p4 p3 p2 p1 FactID

Imprecise Fact Table D Extended Database D’

p3 p1 p4 p2 Mil Mad Sierra F150 Truck WI p5 p3 p1 p4 p2 Mil Mad Sierra F150 Truck WI p5 p5

0.60.4

slide-9
SLIDE 9
  • p3

p1 p4 p2 p5

Mil Mad Sierra F150 Truck WI

[VLDB 05] Example

p3 p1 p4 p2 p5

Mil Mad Sierra F150

p3 p1 p4 p2 p5

Mil Mad Sierra F150

P(w1) = 0.6 P(w2) = 0.4 w1 w2

slide-10
SLIDE 10
  • Contributions [VLDB 05, VLDB 06]

Formalize entire process Develop several allocation policies Show how to execute allocation efficiently Demonstrate how to answer queries

efficiently

Assumes all imprecise facts are independent

slide-11
SLIDE 11
  • Challenge: Incorporate Domain

Constraints

130 Madison Honda Dells p4 130 Dells Honda Madison p3 250 John Smith F150 Wisconsin p2 100 John Smith F150 Wisconsin p1 Cost Name Auto Loc FactID

Madison, Honda, broken

  • ex. pipe, Dells & I-90,

towed 25 miles, $130 r3 customer John Smith brought F150 to garage engine noise, WI, $250 r2 F150, oil change, $100, WI, John Smith r1

Repair Text ID

“Two facts with same person name and model must have same city” “Exactly one of facts p3 or p4 exists”

slide-12
SLIDE 12
  • Summary of Contributions

Present constraint language L

Define both syntax of L and semantics of

answering queries with constraints defined in L

Efficiently answer queries with constraints

using a marginal database D*

Present algorithms to efficiently construct

marginal database D*

slide-13
SLIDE 13
  • Constraint Language: Examples

“Two facts with same person name and model must

have same location”

(r.Name = r’.Name) ^ (r.Auto = r’.Auto)

(r.Loc = r’.Loc)

“Exactly one of facts p3 or p4 exists”

exists(p3)

exists(p4)

exists(p4)

exists(p3)

“If the location for p1 is Madison, then p3 must exist

(and p4 cannot exist)”

(p1.Loc = “Madison”) exists(p3) ^ exists(p4)

¬ ¬ ¬

slide-14
SLIDE 14
  • Constraint Language: Syntax

A constraint has form A B where A,B are

conjunctions of atoms

Atoms have form [r.A Θ c ] or [r.A Θ r’.A] or

exists(r), exists(r) where

r,r' are either specific factIDs themselves variables that bind to factIDs in D r.A is the value of attribute A of fact r. Θ

{=, ≠, ≤,<,≥,>} is a comparison operator over the appropriate domain

c is a constant from dom(A), and exists(r) ( exists(r)) is a predicate that holds if

fact r exists (cannot exist)

¬ ¬

slide-15
SLIDE 15
  • Constraint Language: Semantics

A possible world satisfying all constraints

is valid

Query answer is expected value over

valid possible worlds

EDB D’

Imprecise fact table D

Q

Allocation

w1 w2 w3 w4

A

Possible worlds Constraints C

w2 w4

slide-16
SLIDE 16
  • Efficient Query Answering

EDB D’

Imprecise fact table D

Q

Allocation

w1 w2 w3 w4

A

Possible worlds Constraints C

w2 w4

Can compute expected value over valid possible

worlds in single scan of Marginal Database (MDB) D*

EDB D’

Imprecise fact table D

Q A

MDB D*

Allocation Marginalization

slide-17
SLIDE 17
  • Mad Dells

Civ Cam

r2 r2 r1 r1 FactID Cam Cam Cam Cam Model Dells Mad Dells Mad Loc 400 400 100 100 Cost 0.2 0.8 0.3 0.7 Alloc

P(w1) = 0.56 P(w2) = 0.24

Constraint: (r.Model = r’.Model)

  • (r.Loc = r’.Loc)

r2 r2 r1 r1 FactID Cam Cam Cam Cam Model Dells Mad Dells Mad Loc 400 400 100 100 Cost 0.1 0.9 0.1 0.9 Mar

r1 r2 Mad Dells Civ Cam r1 r2 Mad Dells Civ Cam P(w3) = 0.56 P(w4) = 0.06 r1 r2 Mad Dells Civ Cam r1 r2 Mad Dells Civ Cam PN(w1) = 0.90 PN(w2) = 0 r1 r2 Mad Dells Civ Cam r1 r2 Mad Dells Civ Cam PN(w3) = 0 PN(w4) = 0.10 r1 r2 Mad Dells Civ Cam r1 r2

EDB D’ MDB D*

slide-18
SLIDE 18
  • Marginal Database (MDB) D*

Let D’ be EDB obtained from imprecise fact table D Each claim in D’ has tuple ft with allocation weight wt Let W be set of valid possible worlds satisfying a

given set of constraints C

Let mt be the total probability of worlds in W where ft

is true.

We refer to mt as the marginal probability of ft and

(ft, mt) is a marginal tuple.

Store all marginal tuples in marginal database

(MDB) D*

slide-19
SLIDE 19
  • Marginalization Algorithms

MDB D* EDB D’ Decomp CC1 CC2 CC3 Constraint Hypergraph G

Can process connected component in

constraint hypergraph independently

slide-20
SLIDE 20
  • Constraint Hypergraph: Example

Mad Dells WI Civ Cam Sedan

Loc Model

r1

r1 r2 r3 r4

r2 r3 r4

Constraint: (r.Model = r’.Model)

  • (r.Loc = r’.Loc)
slide-21
SLIDE 21
  • Constraint Hypergraph: G=(V,H)

Nodes V: For each fact r in given imprecise

database D, introduce a node to V

Hyperedges H: For each minimal set of facts

with a combination of completions violating a constraint, introduce a hyperedge to H

slide-22
SLIDE 22
  • Experimental Setup

Algorithms evaluated on several datasets

Real-world dataset: 798K facts , 4 dimensions Used several synthetic datasets Scalability (up to 3.2 million tuples)

Constraint sets

Randomly generated several constraint sets of

varying “complexity”

Develop suitable complexity metric

slide-23
SLIDE 23
  • Performance

800K Facts

Total Time GenerateComponents ProcessComponents Best Fit (Total Time) Best Fit (GenComps) Best Fit (ProcComps)

slide-24
SLIDE 24
  • Performance

3200K Facts

Total Time GenerateComponents ProcessComponents Best Fit (Total Time) Best Fit (GenComps) Best Fit (ProcComps)

slide-25
SLIDE 25
  • Component Sizes
slide-26
SLIDE 26
  • Related work

Imprecise data with constraints

MayBMS [Antova et al. 07] Representing and Querying Correlated Tuples in

Probabilistic Databases [Sen, Deshpande 07]

ConQuer [Fuxman et al 05]

Probabilistic databases

Probabilistic Databases [Dalvi et al. 04] TRIO system for uncertain data [Widom et al.05]

OLAP

Constraints in OLAP [Hurtado et. al 02] OLAP over Incomplete Data [Dyreson 96]

slide-27
SLIDE 27
  • Summary

We extend our framework for OLAP over

imprecise data to support domain information.

Eliminate the strong independence

assumptions required earlier

Often violated in many applications (e.g., IE

from text)

First work we are aware of to consider OLAP

aggregation queries over imprecise data in the presence of constraints