
Answering Queries from Statistics and Probabilistic Views
Nilesh Dalvi and Dan Suciu, University of Washington

Background
• The "query answering using views" problem: find answers to a query q over a database schema R using a set of views V = {v1, v2, ...} over R.
• Example: R(name, dept, phone)

  v1(n, d) :- R(n, d, p)
  v2(d, p) :- R(n, d, p)

  v1:  name   dept
       Larry  Sales
       John   Sales

  v2:  dept   phone
       Sales  x1234
       Sales  x5678
       HR     x2222

  q(p) :- R(Larry, d, p)

Background: Certain Answers
Let U be a finite universe of size n.
• Consider all possible data instances over U: D1, D2, D3, D4, ..., Dm.
• Only some of these data instances are consistent with the views V.
• Certain answers: tuples that occur as answers in all data instances consistent with V.

Example
  v1(n, d) :- R(n, d, p)
  v2(d, p) :- R(n, d, p)

  v1:  name   dept
       Larry  Sales
       John   Sales

  v2:  dept   phone
       Sales  x1234
       Sales  x5678
       HR     x2222

  q(p) :- R(Larry, d, p)

Data instances consistent with the views:

  D1:  name   dept   phone
       Larry  Sales  x1234
       John   Sales  x5678
       Sue    HR     x2222

  D2:  name   dept   phone
       Frank  Sales  x5678
       Larry  Sales  x1111
       John   Sales  x1234
       Sue    HR     x2222

  ...
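The certain-answer computation for this example can be sketched by brute force. This is a minimal illustration, not the authors' method: it assumes a tiny hypothetical universe (two names, two departments, three phones) and reads the views as sound, meaning every view tuple must be produced by the instance while the instance may contain more. That reading is what makes instances such as D1 and D2 consistent.

```python
from itertools import combinations, product

# Hypothetical toy universe; the slides leave the actual domain open.
NAMES = ["Larry", "John"]
DEPTS = ["Sales", "HR"]
PHONES = ["x1234", "x5678", "x2222"]

V1 = {("Larry", "Sales"), ("John", "Sales")}                     # v1(n, d)
V2 = {("Sales", "x1234"), ("Sales", "x5678"), ("HR", "x2222")}   # v2(d, p)


def consistent(instance):
    """Sound-view reading: each view tuple must appear in the projection."""
    proj_v1 = {(n, d) for n, d, _ in instance}
    proj_v2 = {(d, ph) for _, d, ph in instance}
    return V1 <= proj_v1 and V2 <= proj_v2


def query(instance):
    """q(p) :- R(Larry, d, p)."""
    return {ph for n, _, ph in instance if n == "Larry"}


def certain_and_possible():
    """Intersect (certain) and union (possible) the answers over all consistent instances."""
    tuples = list(product(NAMES, DEPTS, PHONES))
    certain, possible = None, set()
    for k in range(len(tuples) + 1):
        for instance in combinations(tuples, k):
            if not consistent(instance):
                continue
            answers = query(instance)
            certain = answers if certain is None else certain & answers
            possible |= answers
    return certain, possible


certain, possible = certain_and_possible()
print("certain answers:", certain)
print("possible answers:", sorted(possible))
```

On this toy universe the set of certain answers is empty while several phone numbers remain possible, which matches the observation that certain answers alone say nothing about Larry's phone.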

Example (contd.)
  v1:  name   dept
       Larry  Sales
       John   Sales

  v2:  dept   phone
       Sales  x1234
       Sales  x5678
       HR     x2222

• There are no certain answers, but some answers are more likely than others.
• The domain is huge, so we cannot simply guess Larry's number.
• A data instance is much smaller than the domain. If we know that the average number of employees per department is 5, then x1234 and x5678 each have probability 0.2 of being an answer.

Going beyond certain answers
• The certain-answers approach assumes complete ignorance about how likely each possible database is.
• Often we have additional knowledge about the data, in the form of various statistics.
Can we use such information to find answers to queries that are statistically meaningful?

Why Do We Care?
• Data privacy: publishers can analyze the amount of information disclosed by public views about private information in the database.
• Ranked search: a ranked list of probable answers can be returned for queries with no certain answers.

Using Common Knowledge
• Suppose we have an a priori distribution Pr over all possible databases: $\Pr : \{D_1, \ldots, D_m\} \to [0, 1]$
• We can compute the probability of a tuple t being an answer to q as $\Pr[(t \in q) \mid V]$
Query answering using views = computing conditional probabilities on a distribution
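As a rough sketch of what computing Pr[(t ∈ q) | V] means, the following enumerates every instance over a small hypothetical universe, weights each by a prior, and divides the mass of the consistent instances that return t by the mass of all consistent instances. The universe, the tuple-independent prior, and the value p = 0.25 are assumptions chosen only for illustration; the views are read as sound.

```python
from itertools import combinations, product

# Hypothetical toy universe and views (same running example as the slides).
NAMES = ["Larry", "John"]
DEPTS = ["Sales", "HR"]
PHONES = ["x1234", "x5678", "x2222"]
TUPLES = list(product(NAMES, DEPTS, PHONES))

V1 = {("Larry", "Sales"), ("John", "Sales")}
V2 = {("Sales", "x1234"), ("Sales", "x5678"), ("HR", "x2222")}


def prior(instance, p=0.25):
    """Assumed prior Pr[D]: each tuple is present independently with probability p."""
    k = len(instance)
    return p ** k * (1 - p) ** (len(TUPLES) - k)


def views_hold(instance):
    """Sound-view reading of the evidence V."""
    return (V1 <= {(n, d) for n, d, _ in instance}
            and V2 <= {(d, ph) for _, d, ph in instance})


def answer_probability(phone):
    """Pr[(phone in q) | V] for q(p) :- R(Larry, d, p)."""
    joint = evidence = 0.0
    for k in range(len(TUPLES) + 1):
        for instance in combinations(TUPLES, k):
            if not views_hold(instance):
                continue
            weight = prior(instance)
            evidence += weight
            if any(n == "Larry" and ph == phone for n, _, ph in instance):
                joint += weight
    return joint / evidence


for phone in PHONES:
    print(phone, round(answer_probability(phone), 4))
```

Because x1234 and x5678 play symmetric roles in the views, they receive the same conditional probability under any prior that treats tuples alike.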

Part I: Query answering using views under some specific distributions

Binomial Distribution
U: a domain of size n
We start from a simple case:
- R(name, dept, phone), a relation of arity 3
- Expected size of R is c
Binomial: choose each of the $n^3$ possible tuples independently with probability p.
Expected size of R is c ⇒ $p = c/n^3$
Let $\mu_n$ denote the resulting distribution. For any instance D, $\mu_n[D] = p^k (1 - p)^{n^3 - k}$, where $k = |D|$.
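The definition of the binomial distribution can be checked numerically on a tiny domain. The sketch below is illustrative only; the function names are hypothetical. It computes p = c/n^3, evaluates the probability of an instance of size k in log space to avoid underflow, and confirms that the probabilities of all instances sum to 1.

```python
import math


def tuple_probability(c, n, arity=3):
    """p = c / n^arity, so that the expected size of R is c."""
    return c / n ** arity


def log_instance_probability(k, c, n, arity=3):
    """log of mu_n[D] = p^k (1 - p)^(n^arity - k) for an instance D with |D| = k."""
    p = tuple_probability(c, n, arity)
    total = n ** arity
    return k * math.log(p) + (total - k) * math.log1p(-p)


def total_mass(c, n, arity=3):
    """Sum of mu_n[D] over all instances, grouping instances by their size k."""
    total = n ** arity
    return sum(
        math.comb(total, k) * math.exp(log_instance_probability(k, c, n, arity))
        for k in range(total + 1)
    )


def expected_size(c, n, arity=3):
    """Expected |D| under mu_n, computed from the definition."""
    total = n ** arity
    return sum(
        k * math.comb(total, k) * math.exp(log_instance_probability(k, c, n, arity))
        for k in range(total + 1)
    )


print(total_mass(c=2, n=2), expected_size(c=2, n=2))
```

Only the size k of an instance matters for its probability, which is why the sums group the 2^(n^3) instances by size.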

Binomial: Example I
R(name, dept, phone), |R| = c, domain size = n
v : R(Larry, -, -)
q : R(-, -, x1234)
$\mu_n[q \mid v] \approx (c + 1)/n$, which is negligible if n is large: $\lim_{n \to \infty} \mu_n[q \mid v] = 0$
v gives negligible information about q when the domain is large.

Binomial: Example II
R(name, dept, phone), |R| = c, domain size = n
v : R(Larry, -, -), R(-, -, x1234)
q : R(Larry, -, x1234)
$\lim_{n \to \infty} \mu_n[q \mid v] = 1/(1 + c)$
v gives non-negligible information about q, even for large domains.

Binomial: Example III
R(name, dept, phone), |R| = c, domain size = n
v : R(Larry, Sales, -), R(-, Sales, x1234)
q : R(Larry, Sales, x1234)
$\lim_{n \to \infty} \mu_n[q \mid v] = 1$
The binomial distribution cannot express more interesting statistics.
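The three limits can be sanity-checked numerically. The sketch below is one way to do so, assuming the tuple-independent binomial model with p = c/n^3. It splits the relevant tuples into disjoint groups (those matching both patterns, and those matching only one) and uses independence to write each conditional probability in closed form. The helper names are hypothetical and the closed forms are derived for this illustration rather than taken from the slides.

```python
import math


def some_present(m, p):
    """Probability that at least one of m independent tuples is present."""
    return -math.expm1(m * math.log1p(-p))


def example_1(c, n):
    """v: R(Larry,-,-); q: R(-,-,x1234)."""
    p = c / n ** 3
    both = some_present(n, p)            # tuples (Larry, *, x1234)
    one = some_present(n * (n - 1), p)   # Larry with another phone; same count for x1234 with another name
    q_and_v = both + (1 - both) * one * one
    v = both + (1 - both) * one
    return q_and_v / v


def example_2(c, n):
    """v: R(Larry,-,-), R(-,-,x1234); q: R(Larry,-,x1234)."""
    p = c / n ** 3
    both = some_present(n, p)
    one = some_present(n * (n - 1), p)
    return both / (both + (1 - both) * one * one)


def example_3(c, n):
    """v: R(Larry,Sales,-), R(-,Sales,x1234); q: R(Larry,Sales,x1234)."""
    p = c / n ** 3
    both = p                             # the single tuple (Larry, Sales, x1234)
    one = some_present(n - 1, p)
    return both / (both + (1 - both) * one * one)


c = 3
for n in (10, 1_000, 100_000):
    print(n, example_1(c, n), example_2(c, n), example_3(c, n))
```

As n grows, the three columns approach (c + 1)/n, 1/(1 + c), and 1 respectively. `expm1` and `log1p` are used because p is tiny for large n and the naive expression 1 - (1 - p)^m would lose precision.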

A Variation on Binomial
• Suppose we have the following statistics on R(name, dept, phone):
  – Expected number of distinct R.dept values = c1
  – Expected number of distinct tuples for each R.dept value = c2
• Consider the following distribution $\mu_n$:
  – For each $x_d \in U$, choose it as an R.dept value with probability $c_1/n$.
  – For each $x_d$ chosen above and for each $(x_n, x_p) \in U^2$, include the tuple $(x_n, x_d, x_p)$ in R with probability $c_2/n^2$.

Examples
R(name, dept, phone), |dept| = c1, |dept ⇒ name, phone| = c2, |R| = c1 c2
Example 1:
v : R(Larry, -, -), R(-, -, x1234)
q : R(Larry, -, x1234)
$\mu[q \mid v] = 1/(c_1 c_2 + 1)$
Example 2:
v : R(Larry, Sales, -), R(-, Sales, x1234)
q : R(Larry, Sales, x1234)
$\mu[q \mid v] = 1/(c_2 + 1)$
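Example 2 can be checked numerically under the two-stage distribution. This is a sketch under stated assumptions: Sales is chosen as a department with probability c1/n, and, given that, each (name, phone) pair is included independently with probability c2/n^2. The function name and the closed form are hypothetical, worked out for this illustration.

```python
import math


def variation_example_2(c1, c2, n):
    """mu_n[q | v] for v: R(Larry,Sales,-), R(-,Sales,x1234); q: R(Larry,Sales,x1234)."""
    dept_chosen = c1 / n                 # probability that Sales is an R.dept value
    p = c2 / n ** 2                      # per-pair inclusion probability inside a chosen dept
    both = p                             # the single tuple (Larry, Sales, x1234)
    one = -math.expm1((n - 1) * math.log1p(-p))  # Larry with another phone; same for x1234 with another name
    q_and_v = dept_chosen * both
    v = dept_chosen * (both + (1 - both) * one * one)
    return q_and_v / v


for n in (10, 1_000, 1_000_000):
    print(n, variation_example_2(c1=4, c2=3, n=n))
```

The department-selection probability appears in both numerator and denominator and cancels, so the result depends only on c2 and tends to 1/(c2 + 1).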

Part II: Representing Knowledge as a Probability Distribution

Knowledge about data
• A set of statistics Σ on the database
  - cardinality statistics: $\mathrm{card}_R[A] = c$
  - fanout statistics: $\mathrm{fanout}_R[A \Rightarrow B] = c$
• A set of integrity constraints Γ
  - functional dependencies: R.A → R.B
  - inclusion dependencies: R.A ⊆ R.B

Representing Knowledge
Statistics and constraints are statements on the probability distribution P:
  – $\mathrm{card}_R[A] = c$ implies $\sum_i P[D_i] \cdot \mathrm{card}(\Pi_A(R^{D_i})) = c$
  – $\mathrm{fanout}_R[A \Rightarrow B] = c$ implies a similar condition
  – A constraint in Γ implies that $P[D_i] = 0$ on data instances $D_i$ that violate it
Problem: P is not uniquely defined by these statements!
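The cardinality condition is simply an expectation over instances. A minimal sketch, assuming a binary relation R(A, B) over a tiny universe and a tuple-independent distribution P (both are illustrative choices, not part of the slides), computes the expected number of distinct A-values by direct enumeration:

```python
from itertools import combinations, product


def expected_projection_size(universe, p, attr=0):
    """Sum over instances D of P[D] * card(projection of R^D on one attribute)."""
    tuples = list(product(universe, repeat=2))      # all possible tuples of R(A, B)
    total = 0.0
    for k in range(len(tuples) + 1):
        for instance in combinations(tuples, k):
            prob = p ** k * (1 - p) ** (len(tuples) - k)   # assumed P: independent tuples
            total += prob * len({t[attr] for t in instance})
    return total


print(expected_projection_size(["a", "b"], p=0.5))
```

A statistic such as a cardinality of c then requires this expectation to equal c, which is one linear equation in the unknowns P[D_i]. Many distributions satisfy it, which is the non-uniqueness problem stated above.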

The Maximum Entropy Principle
• Among all the probability distributions that satisfy Σ and Γ, choose the one with maximum entropy.
• Widely used to convert prior information into a prior probability distribution.
• Gives a distribution that commits the least to any specific instance while satisfying all the equations.

Examples of Entropy Maximization
• R(name, dept, phone), a relation of arity 3
• Example 1: Γ = empty, Σ = { card[R] = c }
  Entropy-maximizing distribution = binomial
• Example 2: Γ = empty, Σ = { $\mathrm{card}_R[dept] = c_1$, $\mathrm{fanout}_R[dept \Rightarrow name, phone] = c_2$ }
  Entropy-maximizing distribution = the variation on the binomial distribution we studied earlier.
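Example 1 can be illustrated numerically. A standard fact about entropy maximization under an expectation constraint is that the maximizer has exponential form, here P[D] ∝ exp(λ·|D|) when the only constraint is the expected size of R. The sketch below, with hypothetical function names and a small number of possible tuples, solves for λ by bisection and shows that the resulting distribution is the binomial one with p = c/N, where N is the number of possible tuples.

```python
import math


def expected_size(lam, num_tuples):
    """E[|D|] under P[D] proportional to exp(lam * |D|), over all subsets of num_tuples tuples."""
    weights = [math.comb(num_tuples, k) * math.exp(lam * k) for k in range(num_tuples + 1)]
    z = sum(weights)
    return sum(k * w for k, w in enumerate(weights)) / z


def solve_multiplier(c, num_tuples, lo=-50.0, hi=50.0):
    """Bisection on lam so that the expected size equals c (expected_size is increasing in lam)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if expected_size(mid, num_tuples) < c:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


num_tuples, c = 8, 2.0                      # e.g. n = 2, arity 3, expected size 2
lam = solve_multiplier(c, num_tuples)
p = math.exp(lam) / (1 + math.exp(lam))     # per-tuple inclusion probability implied by lam
print(p, c / num_tuples)
```

With w = exp(λ), the normalized weight of an instance of size k is w^k / (1 + w)^N = p^k (1 - p)^(N - k) for p = w / (1 + w), which is exactly the binomial form.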

Query answering problem
Given a set of statistics Σ and constraints Γ, let $\mu_{\Sigma,\Gamma,n}$ denote the maximum entropy distribution assuming a domain of size n.
Problem: Given statistics Σ, constraints Γ, and Boolean conjunctive queries q and v, compute the asymptotic limit of $\mu_{\Sigma,\Gamma,n}[q \mid v]$ as $n \to \infty$.

Main Result
• For Boolean conjunctive queries q and v, the quantity $\mu_{\Sigma,\Gamma,n}[q \mid v]$ always has an asymptotic limit, and we show how to compute it.

Glimpse into Main Result
• For any conjunctive query Q, we show that $\mu_{\Sigma,\Gamma,n}[Q]$ is a polynomial of the form $c_1 (1/n)^d + c_2 (1/n)^{d+1} + \ldots$
• $\mu_{\Sigma,\Gamma,n}[q \mid v] = \mu_{\Sigma,\Gamma,n}[qv] \,/\, \mu_{\Sigma,\Gamma,n}[v]$ = ratio of two polynomials.
• Only the leading coefficient and exponent matter, and we show how to compute them.
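The last step, taking the limit of a ratio of two polynomials in 1/n from their leading terms, can be sketched as follows. The function name is hypothetical, and the leading terms used in the usage lines were worked out by hand for the binomial Examples I and II with c = 3; treat them as illustrative inputs rather than output of the authors' procedure.

```python
from fractions import Fraction


def asymptotic_limit(lead_qv, lead_v):
    """Limit of mu[qv] / mu[v] as n grows, given each leading term as (coefficient, exponent of 1/n)."""
    (a, d), (b, e) = lead_qv, lead_v
    if d > e:
        return Fraction(0)               # numerator vanishes faster than denominator
    if d == e:
        return Fraction(a) / Fraction(b)
    # qv implies v, so mu[qv] <= mu[v] and the numerator cannot decay more slowly.
    raise ValueError("leading exponent of mu[qv] cannot be smaller than that of mu[v]")


c = 3
# Binomial Example II: mu[qv] ~ c (1/n)^2 and mu[v] ~ (c + c^2) (1/n)^2.
print(asymptotic_limit((c, 2), (c + c * c, 2)))
# Binomial Example I: mu[qv] ~ (c + c^2) (1/n)^2 and mu[v] ~ c (1/n).
print(asymptotic_limit((c + c * c, 2), (c, 1)))
```

Exact rational arithmetic is used so that a limit such as 1/(1 + c) is returned as a fraction rather than a rounded float.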

Conclusions
• We show how to use common knowledge about data to find answers to queries that are statistically meaningful.
  - Provides a formal framework for studying database privacy breaches using statistical attacks.
• We use the principle of entropy maximization to represent statistics as a prior probability distribution.
• The techniques are also applicable when the contents of views are themselves uncertain.

Questions?
