Part II: Applications in Database Systems



SLIDE 1

Part II: Applications

Application of Graphical Models in Database Systems

Some slides courtesy of Amol Deshpande

SLIDE 2

Outline

  • Selectivity Estimation and Query Optimization

  • Probabilistic Relational Models
  • Probabilistic Databases
  • Sensor/Stream Data Management
  • References
SLIDE 3

Selectivity Estimation

  • Estimate the intermediate result sizes that may be generated during query processing

  • Equivalently, selectivity of predicates over tables
  • Key to obtaining good plans during optimization

Customer

SSN   ..   Income   ..   Homeowner?
..    ..   100000   ..   Yes
..    ..   11000    ..   Yes

Purchases

SSN   Store   ..   Amount

  • Single-table predicates: income > 90,000 and homeowner = ‘yes’ (on Customer)
  • Multi-table predicates: c.homeowner = ‘yes’ and p.amount > 10,000 and p.ssn = c.ssn (over Customer c and Purchases p)

SLIDE 4

Optimizer’s Assumption

  • Attribute value independence assumption
  • Attributes assumed to be independently distributed
  • Rarely true in practice

Customer

SSN   ..   Income   ..   Homeowner?
..    ..   100000   ..   Yes
..    ..   11000    ..   Yes
..    ..   50000    ..   No
..    ..   30000    ..   No
..    ..   200000   ..   Yes

  • Estimate p(income > 90,000 and homeowner = yes) as p(income > 90,000) * p(homeowner = yes)
  • Can result in severe underestimation
  • In reality: p(income > 90,000, homeowner = yes) ≈ p(homeowner = yes)
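
The gap between the independence estimate and the true selectivity can be checked directly on the five Customer rows of this slide. The following is a minimal sketch; the `customers` list and the `selectivity` helper are illustrative names, not part of the original slides.

```python
# Five Customer rows from the slide: (income, homeowner).
customers = [
    (100000, "yes"),
    (11000, "yes"),
    (50000, "no"),
    (30000, "no"),
    (200000, "yes"),
]

def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

p_income = selectivity(customers, lambda r: r[0] > 90000)   # 2 of 5 rows
p_home = selectivity(customers, lambda r: r[1] == "yes")    # 3 of 5 rows

# Optimizer's estimate under the attribute value independence assumption.
independent_estimate = p_income * p_home

# True selectivity of the conjunction, counted from the data.
true_selectivity = selectivity(
    customers, lambda r: r[0] > 90000 and r[1] == "yes"
)

print(independent_estimate, true_selectivity)
```

On this sample the independence estimate is about 0.24 while the true selectivity is 0.4, so the optimizer would underestimate the result size.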

SLIDE 5

Optimizer’s Assumption

  • Join uniformity assumption
  • Tuples from one relation assumed equally likely to join with tuples from other relation

  • Real datasets exhibit large skews

Customer

SSN   ..   Income    ..   Homeowner?
..    ..   100,000   ..   Yes
..    ..   11,000    ..   Yes
..    ..   50,000    ..   No
..    ..   30,000    ..   No
..    ..   200,000   ..   Yes

Purchases

SSN   Store   ..   Amount
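
The effect of skew on a join estimate can be illustrated with a small example. The slide gives no Purchases rows, so the data below is entirely hypothetical: five customers and ten purchases, most of them made by a single customer.

```python
# Hypothetical data (not from the slides): ssn -> homeowner flag, and one
# ssn entry per purchase. Customer 1 accounts for most of the purchases.
homeowner = {1: True, 2: True, 3: False, 4: False, 5: True}
purchase_ssns = [1, 1, 1, 1, 1, 1, 1, 3, 4, 5]

# Query: purchases made by homeowners
# (c.homeowner = 'yes' and p.ssn = c.ssn).
# Join uniformity: each purchase is assumed equally likely to join with
# any customer, so the estimate is |Purchases| * p(homeowner = yes).
p_home = sum(homeowner.values()) / len(homeowner)
uniform_estimate = len(purchase_ssns) * p_home

# Actual join size, counted directly from the data.
actual = sum(1 for ssn in purchase_ssns if homeowner[ssn])

print(uniform_estimate, actual)
```

With this made-up skew the uniformity estimate is 6 rows while the join actually produces 8.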

SLIDE 6

Selectivity Estimation using PGMs

  • Eliminating attribute value independence assumption [GTK’01,DGR’01,LWV’03,PMW’03]

Customer

SSN   age   Income   zipcode   Homeowner?
..    ..    100000   ..        Yes
..    ..    11000    ..        Yes
..    ..    50000    ..        No
..    ..    30000    ..        No
..    ..    200000   ..        Yes

[Figure: PGM learned from the Customer table, with nodes Income, Age, and Homeowner?]

  • Learn a PGM
  • Approximate CPDs using histograms
  • Learning process modified to optimize for accuracy as well as storage space

SLIDE 7

Selectivity Estimation using PGMs

  • Eliminating attribute value independence assumption [GTK’01,DGR’01,LWV’03,PMW’03]

Customer

SSN   age   Income   zipcode   Homeowner?
..    ..    100000   ..        Yes
..    ..    11000    ..        Yes
..    ..    50000    ..        No
..    ..    30000    ..        No
..    ..    200000   ..        Yes

[Figure: PGM learned from the Customer table, with nodes Income, Age, and Homeowner?]

  • Learn a PGM
  • Approximate CPDs using histograms
  • At query time, an inference algorithm over the PGM takes a query and returns selectivity estimates
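
The learn-then-infer pipeline can be sketched on the Customer rows. This is a simplified illustration, not the method of the cited papers: it assumes a two-node model Income → Homeowner (the slide's Age column has no values to learn from), and a hypothetical two-bucket histogram on income whose boundary happens to match the predicate.

```python
from collections import Counter

# Customer rows from the slide: (income, homeowner).
rows = [(100000, "yes"), (11000, "yes"), (50000, "no"),
        (30000, "no"), (200000, "yes")]

def bucket(income):
    """Hypothetical two-bucket histogram on income (boundary is an assumption)."""
    return "high" if income > 90000 else "low"

# "Learn" the model Income -> Homeowner by counting:
# p(bucket) and the CPD p(homeowner | bucket).
bucket_counts = Counter(bucket(inc) for inc, _ in rows)
joint_counts = Counter((bucket(inc), home) for inc, home in rows)

p_bucket = {b: n / len(rows) for b, n in bucket_counts.items()}
cpd_home = {(b, h): n / bucket_counts[b] for (b, h), n in joint_counts.items()}

def estimate(income_bucket, home):
    """Selectivity of (income in bucket AND homeowner = home) via the chain rule."""
    return p_bucket[income_bucket] * cpd_home.get((income_bucket, home), 0.0)

print(estimate("high", "yes"))
```

Because the model stores p(homeowner | income bucket) rather than p(homeowner) alone, the estimate for the high-income homeowners is about 0.4, which matches the data. With coarser or misaligned buckets the estimate would only be approximate.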

SLIDE 8

Outline

  • Selectivity Estimation and Query Optimization

  • Probabilistic Relational Models
  • Probabilistic Databases
  • Sensor/Stream Data Management
  • References
SLIDE 9

Example Probabilistic Database

  • Example from Dalvi and Suciu [2004]

S

      A   B   prob
s1    m   1   0.6
s2    n   1   0.5

T

      C   D   prob
t1    1   p   0.4

Possible worlds

instance         probability
{s1, s2, t1}     0.12
{s1, s2}         0.18
{s1, t1}         0.12
{s1}             0.18
{s2, t1}         0.08
{s2}             0.12
{t1}             0.08
{}               0.12
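
The possible-worlds table follows from treating the three tuples as independent events. A minimal sketch that enumerates the worlds from the tuple probabilities on this slide (the function and variable names are illustrative):

```python
from itertools import product

# Tuple existence probabilities from the slide (tuples assumed independent).
tuple_prob = {"s1": 0.6, "s2": 0.5, "t1": 0.4}

def possible_worlds(probs):
    """Yield (world, probability) for every subset of the tuples."""
    names = list(probs)
    for present in product([True, False], repeat=len(names)):
        p = 1.0
        for name, is_in in zip(names, present):
            # A present tuple contributes p, an absent one contributes 1 - p.
            p *= probs[name] if is_in else 1.0 - probs[name]
        yield frozenset(n for n, is_in in zip(names, present) if is_in), p

worlds = dict(possible_worlds(tuple_prob))
for world, p in worlds.items():
    print(sorted(world), round(p, 2))
```

For example, the world {s1, s2} has probability 0.6 * 0.5 * (1 - 0.4) = 0.18, and the eight probabilities sum to 1.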

SLIDE 10

Probabilistic Databases

  • Much of probabilistic data is naturally correlated
  • E.g. sensor data, data integration [AFM’06]
  • If not, query processing introduces correlation
  • Can use graphical models to capture such correlations

SLIDE 11

Example: Mutual Exclusiveness

S

      A   B   prob
s1    m   1   0.6
s2    n   1   0.5

T

      C   D   prob
t1    1   p   0.4

Factors

Xs1   Xt1   f1()
0     0     0
0     1     0.4
1     0     0.6
1     1     0

Xs2   f2()
0     0.5
1     0.5

Possible worlds

instance         probability
{s1, s2, t1}     0
{s1, s2}         0.3
{s1, t1}         0
{s1}             0.3
{s2, t1}         0.2
{s2}             0
{t1}             0.2
{}               0

Possible worlds (if desired) computed using inference
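
A minimal sketch of how the possible-world probabilities follow from the two factors. It takes f1 to be zero when s1 and t1 are both present or both absent, which is what the listed world probabilities (0.3, 0.3, 0.2, 0.2, summing to 1) imply.

```python
from itertools import product

# Factor over (Xs1, Xt1): s1 and t1 are mutually exclusive.
f1 = {(0, 0): 0.0, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.0}
# Factor over Xs2: s2 is independent of the other two tuples.
f2 = {0: 0.5, 1: 0.5}

worlds = {}
for xs1, xs2, xt1 in product([0, 1], repeat=3):
    present = [n for n, x in (("s1", xs1), ("s2", xs2), ("t1", xt1)) if x]
    # The probability of a world is the product of the factor values.
    worlds[frozenset(present)] = f1[(xs1, xt1)] * f2[xs2]

# These factors are already normalized, so the products sum to 1.
total = sum(worlds.values())

for world, p in worlds.items():
    print(sorted(world), round(p, 2))
```

With only three tuples the worlds can be enumerated exhaustively; for larger databases the enumeration is exponential, which is why the slide leaves it to inference.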

SLIDE 12

Outline

  • Selectivity Estimation and Query Optimization

  • Probabilistic Relational Models
  • Probabilistic Databases
  • Sensor/Stream Data Management
  • References
SLIDE 13

Motivation

  • Unprecedented, and rapidly increasing, instrumentation of our everyday world
  • Wireless sensor networks
  • RFID
  • Distributed measurement networks (e.g. GPS)
  • Industrial Monitoring
  • Network Monitoring

SLIDE 14

Outline

  • A generic temporal model for sensor stream data

  • A range of applications
  • Model-based query processing
  • Object tracking and monitoring
SLIDE 15

[Figure: SENSOR NETWORK with nodes 1 to 5, and a graphical model over the hidden variables X1,t ... X5,t with observations O1,t ... O5,t]

  • X1,t: true temperature at node 1 at time t
  • O1,t: observed temperature at node 1 at time t
  • Interpretation: X4,t is independent of X2,t given X1,t and X5,t

SLIDE 16

[Figure: SENSOR NETWORK with nodes 1 to 3; the variables Xi,t and Oi,t (i = 1, 2, 3) are shown at times t-1, t, and t+1]

  • Markov property
  • Interpretation: {Xi,t+1} is independent of {Xi,t-1} given {Xi,t}

SLIDE 17

State evolution can be modeled as a Dynamic Bayesian Network

[Figure: dynamic Bayesian network over X1..X3 and O1..O3 at times t-1, t, and t+1]

SLIDE 18

Parameters?

(1) System model

  • Prior: p(X1,0, X2,0, X3,0)
  • Evolution: p(X1,t, X2,t, X3,t | X1,t-1, X2,t-1, X3,t-1)

[Figure: dynamic Bayesian network over X1..X3 and O1..O3 at times t-1, t, and t+1]

SLIDE 19

Parameters?

(2) Measurement model

  • p(O1,t, O2,t, O3,t | X1,t, X2,t, X3,t)

[Figure: dynamic Bayesian network over X1..X3 and O1..O3 at times t-1, t, and t+1]
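
The three parameter sets (prior, evolution model, measurement model) are all that forward filtering needs. The sketch below is a simplified illustration with a single discretised hidden variable instead of the three-sensor joint state; every state name and probability in it is hypothetical.

```python
# Hypothetical discretised model for one sensor; all numbers are illustrative.
states = ["cold", "warm"]
prior = {"cold": 0.5, "warm": 0.5}            # p(X_0)
evolution = {                                 # p(X_t | X_{t-1})
    "cold": {"cold": 0.9, "warm": 0.1},
    "warm": {"cold": 0.2, "warm": 0.8},
}
measurement = {                               # p(O_t | X_t)
    "cold": {"low": 0.8, "high": 0.2},
    "warm": {"low": 0.3, "high": 0.7},
}

def filter_step(belief, observation):
    """One forward-filtering step: predict with the evolution model,
    then weight by the measurement model and renormalise."""
    predicted = {
        s: sum(belief[prev] * evolution[prev][s] for prev in states)
        for s in states
    }
    unnormalised = {s: predicted[s] * measurement[s][observation] for s in states}
    z = sum(unnormalised.values())
    return {s: v / z for s, v in unnormalised.items()}

belief = prior
for obs in ["high", "high", "low"]:
    belief = filter_step(belief, obs)
    print(obs, {s: round(p, 3) for s, p in belief.items()})
```

The Markov property is what makes this recursive form valid: the belief at time t summarises everything the earlier observations say about the state.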

SLIDE 20

Application: Model-based Query Processing [DGMHH’04,SBEMY’06]

[Figure: USER, query processor with probabilistic model, and SENSOR NETWORK (nodes 1 to 6), exchanging the items below]

  • Declarative query: Select nodeID, temp ± .1C, conf(.95) Where nodeID in {1..6}
  • Observation plan: {[temp, 1], [voltage, 3], [voltage, 6]}
  • Data: 1, temp = 22.73; 3, voltage = 2.73; 6, voltage = 2.65
  • Query results: 1, 22.73, 100% … 6, 22.1, 99%

SLIDE 21

Application: Model-based Query Processing [DGMHH’04,SBEMY’06]

[Figure: USER, query processor with probabilistic model, and SENSOR NETWORK (nodes 1 to 6), exchanging the items below]

  • Declarative query: Select nodeID, temp ± .1C, conf(.95) Where nodeID in {1..6}
  • Observation plan: {[temp, 1], [voltage, 3], [voltage, 6]}
  • Data: 1, temp = 22.73; 3, voltage = 2.73; 6, voltage = 2.65
  • Query results: 1, 22.73, 100% … 6, 22.1, 99%

Advantages:

  • Exploit correlations
  • Handle noise, biases in the data
  • Predict missing or future values
  • Reduce communication cost
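
One way to read the query `temp ± .1C, conf(.95)` is: answer from the model when the model's posterior puts at least 95% of its mass within 0.1 C of the reported value, and otherwise read the sensor. The sketch below illustrates that idea with a hypothetical bivariate Gaussian over temperature and voltage at one node. The model parameters and function names are assumptions made for illustration, not values from the slides or the cited systems.

```python
import math

# Hypothetical bivariate Gaussian over (temperature, voltage) at one node.
# Means, variances, and covariance are illustrative only.
mu_t, mu_v = 22.0, 2.7
var_t, var_v = 0.25, 0.01
cov_tv = 0.045

def condition_on_voltage(v_obs):
    """Mean and variance of temperature given an observed voltage."""
    mean = mu_t + cov_tv / var_v * (v_obs - mu_v)
    var = var_t - cov_tv ** 2 / var_v
    return mean, var

def confidence(var, eps):
    """p(|T - mean| <= eps) for a Gaussian with the given variance."""
    return math.erf(eps / math.sqrt(2.0 * var))

def answer(v_obs, eps=0.1, conf=0.95):
    """Return (estimate, confidence, model_is_sufficient)."""
    mean, var = condition_on_voltage(v_obs)
    c = confidence(var, eps)
    # If the model is not confident enough, the observation plan would
    # have to include a temperature reading for this node.
    return mean, c, c >= conf

print(answer(2.73))
print(answer(2.73, eps=1.0))
```

With these made-up parameters a voltage reading alone is not enough to meet the ± 0.1 C requirement, but it is enough for a looser ± 1.0 C query. This is the trade-off behind the "reduce communication cost" advantage: cheap, correlated readings can replace expensive ones when the requested precision allows.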

SLIDE 22

Object Tracking and Monitoring

  • Mobile RFID readers
  • Handheld, robot-mounted
  • Incomplete, noisy data
  • Environmental factors
  • Orientation of reading
  • Not directly queryable
  • Raw data: <tag id, reader id, ts>
  • Data needed for querying: e.g., precise object locations

SLIDE 23

Graphical Modeling

  • A generative model p(X,O)
  • X: true object location (x,y,z)
  • O: boolean for RFID readings
  • How state of the world changes
  • Object movement, reader motion
  • How sensing generates data from the state of the world
  • Sensor measurement model

Probabilistic Inference over RFID Streams in Mobile Environments. T. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. Shenoy. ICDE 2009.
SLIDE 24

Inference over RFID Streams

  • Probabilistic inference over streams -- p(X|O)
  • Particle filtering: sampling-based inference
  • Key to performance: using a small number of samples

               Particle filtering               Our optimizations
Accuracy       0.6 - 0.8 foot                   0.1 - 0.5 foot
Performance    0.1 reading/sec for 20 objects   > 1000 readings/sec for 20,000 objects

7 orders of magnitude improvement!
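
For orientation, a plain bootstrap particle filter for p(X|O) can be sketched as follows. This is the unoptimized baseline technique, not the optimized method reported on this slide, and the one-dimensional setting, sensor model, and all constants are hypothetical.

```python
import random

rng = random.Random(0)          # fixed seed so the sketch is repeatable

TRUE_X = 6.0                    # hypothetical true tag location on a 10-unit line
READ_RANGE = 2.0                # hypothetical reader range

def p_read(x, reader_pos):
    """Hypothetical sensor model p(O = 1 | X = x, reader position)."""
    return 0.9 if abs(x - reader_pos) <= READ_RANGE else 0.05

def step(particles, reader_pos, observed):
    """One particle-filter step: move, weight by the sensor model, resample."""
    # Motion model: the tag is nearly static, so only add a little jitter.
    moved = [min(10.0, max(0.0, x + rng.gauss(0.0, 0.1))) for x in particles]
    # Weight each particle by the likelihood of the observed reading.
    weights = [
        p_read(x, reader_pos) if observed else 1.0 - p_read(x, reader_pos)
        for x in moved
    ]
    # Resample in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))

particles = [rng.uniform(0.0, 10.0) for _ in range(500)]
for reader_pos in range(11):                             # reader sweeps 0, 1, ..., 10
    observed = abs(TRUE_X - reader_pos) <= READ_RANGE    # noise-free readings
    particles = step(particles, reader_pos, observed)

estimate = sum(particles) / len(particles)   # posterior mean of the location
print(round(estimate, 2))
```

The cost of each step is linear in the number of particles per object, which is why keeping the sample count small matters for throughput.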

SLIDE 25

Open Discussion

  • Where does our contribution lie when applying graphical models?

  • Devise the right model
  • Local probability distributions
  • Parameter estimation
  • Efficiency and scalability
  • Number of variables (e.g., objects)
  • Inference on streams (one pass, constant time/item)
  • Distributed query processing
  • The giant graphical model is distributed