A Novel Probabilistic Pruning Approach to Speed Up Similarity Queries in Uncertain Databases



SLIDE 1

LUDWIG-MAXIMILIANS-UNIVERSITY MUNICH, DEPARTMENT INSTITUTE FOR INFORMATICS, DATABASE SYSTEMS GROUP

A Novel Probabilistic Pruning Approach to Speed Up Similarity Queries in Uncertain Databases

Thomas Bernecker*, Tobias Emrich*, Hans-Peter Kriegel*, Nikos Mamoulis**, Matthias Renz* and Andreas Zuefle*

*) Ludwig-Maximilians-Universität München (LMU), Munich, Germany, http://www.dbs.ifi.lmu.de, {bernecker, emrich, kriegel, renz, zuefle}@dbs.ifi.lmu.de
**) University of Hong Kong (HKU), Hong Kong, http://www.cs.hku.hk, nikos@cs.hku.hk

SLIDE 2

Outline

  • Background

– Uncertain Data Model
– Similarity Queries

  • Probabilistic Pruning

– Obtaining probability bounds
– Using probability bounds for pruning

  • Evaluation

SLIDE 3

Uncertain Data Model

  • Uncertain attribute

An attribute x is uncertain if its value is given by a probability density function (PDF), which describes all possible values v of x, each associated with its probability P(x = v).

− Discrete PDF (e.g., derived from missing data, see Julia’s talk, or derived from time series data, see Saket’s talk)
− Continuous PDF (e.g., sensor measurement error)

SLIDE 4

Uncertain Data Model

  • Uncertain Object X

− Has d ≥ 1 uncertain attributes.
− X is a random variable, where the set of attribute values of X is described by a multi-dimensional probability distribution.
− X has a spatial region UR_X (Uncertain Region), where PDF_X(t) > 0 if t ∈ UR_X and PDF_X(t) = 0 otherwise.

  • Uncertain Object Database

− Contains N uncertain objects
− Object Independence Assumption

[Figure: example uncertain objects A, B and C; label PDF_X]
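The object model above can be sketched in a few lines of Python. This is only an illustration for the discrete case: the class name `UncertainObject`, its fields, and the sample data are hypothetical, and the uncertain region is approximated here by the bounding box of all alternatives with non-zero probability.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, ...]


@dataclass
class UncertainObject:
    """Discrete uncertain object: alternative positions with probabilities."""
    name: str
    samples: List[Point]   # possible attribute vectors t
    probs: List[float]     # P(X = t), one entry per sample

    def __post_init__(self) -> None:
        if len(self.samples) != len(self.probs):
            raise ValueError("one probability per sample is required")
        if abs(sum(self.probs) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")

    def uncertain_region(self) -> Tuple[Point, Point]:
        """Bounding box (lower corner, upper corner) of all samples with P > 0."""
        support = [s for s, p in zip(self.samples, self.probs) if p > 0]
        lower = tuple(min(c) for c in zip(*support))
        upper = tuple(max(c) for c in zip(*support))
        return lower, upper


# Hypothetical 2-D object with three alternatives.
a = UncertainObject("A", [(1.0, 1.0), (2.0, 3.0), (1.5, 2.0)], [0.5, 0.3, 0.2])
region = a.uncertain_region()
print(region)
```

A continuous PDF would need a different representation (for example a sampler plus a bounding region); the discrete form is used here because it keeps the possible-worlds reasoning on the following slides easy to follow.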

SLIDE 5

Probabilistic Similarity Queries

  • Probabilistic k-Nearest Neighbor query

− What are the k objects closest to Q?

  • Probabilistic Similarity Ranking

− Return all objects sorted by their distance to Q.

  • Probabilistic Reverse k-Nearest Neighbor queries

Note: The query object may now be uncertain as well!

[Figure: query object Q and uncertain objects A, B, C]

SLIDE 6

Similarity Queries: Example

  • Probabilistic Nearest Neighbor query
  • Which object is the nearest neighbor of Q?

[Figure: query object Q and uncertain objects A, B, C]

SLIDE 7

Similarity Queries: Example

  • Probabilistic Nearest Neighbor query
  • Which object is the nearest neighbor of Q?

In some possible worlds A is the nearest neighbor of Q, …

[Figure: query object Q and uncertain objects A, B, C]

SLIDE 8

Similarity Queries: Example

  • Probabilistic Nearest Neighbor query
  • Which object is the nearest neighbor of Q?

… in other possible worlds, A is not the nearest neighbor of Q.

[Figure: query object Q and uncertain objects A, B, C]
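The possible-worlds argument on slides 7 and 8 can be made concrete for discrete objects by enumerating every world and adding up the probability of the worlds in which each object wins. The sketch below assumes a certain (non-uncertain) query point for brevity; the positions and probabilities are made up for illustration.

```python
from itertools import product
from math import dist

# Each object: list of (position, probability) alternatives (hypothetical data).
objects = {
    "A": [((1.0, 0.0), 0.5), ((4.0, 0.0), 0.5)],
    "B": [((2.0, 0.0), 0.6), ((5.0, 0.0), 0.4)],
    "C": [((3.0, 0.0), 1.0)],
}
q = (0.0, 0.0)  # certain query point, an assumption made to keep the sketch short


def nn_probabilities(objects, q):
    """Sum the probability of every possible world in which an object is the NN of q."""
    names = list(objects)
    result = {name: 0.0 for name in names}
    # One possible world = one alternative chosen for every object.
    for world in product(*(objects[n] for n in names)):
        world_prob = 1.0
        for _, p in world:
            world_prob *= p  # object independence assumption
        dists = [dist(pos, q) for pos, _ in world]
        winner = names[dists.index(min(dists))]
        result[winner] += world_prob
    return result


nn = nn_probabilities(objects, q)
print(nn)
```

With this data A is the nearest neighbor in the worlds where it sits at distance 1 (probability 0.5), and B or C wins otherwise. The number of worlds grows exponentially with the number of objects, which is why exact computation is listed as very expensive in the general framework.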

SLIDE 9

General Framework

  • Efficient probabilistic similarity search:

– Approximation (Index)

  • Simplification of spatial-probabilistic keys

– Spatial Filter

  • Filter objects according to simple spatial keys

– Probabilistic Filter

  • Derive lower/upper bounds of qualification probability (by means of simple spatial-probabilistic keys)
  • Filter objects according to lower/upper probability bounds

– Verification

  • Computation of the exact probability (very expensive)
  • Monte-Carlo Sampling (many samples required)
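The Monte-Carlo verification step can be illustrated with a minimal sampler. This is a sketch under stated assumptions, not the procedure used in the evaluation: the objects are hypothetical one-dimensional uniform distributions, the query is a fixed point, and the function name `estimate_closer_probability` is invented for this example.

```python
import random


def estimate_closer_probability(sample_a, sample_b, q, n_samples, rng):
    """Estimate P(|A - q| < |B - q|) from independent samples of A and B."""
    hits = 0
    for _ in range(n_samples):
        a = sample_a(rng)
        b = sample_b(rng)
        if abs(a - q) < abs(b - q):
            hits += 1
    return hits / n_samples


rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical objects: A uniform on [1, 3], B uniform on [2, 4], query at 0.
estimate = estimate_closer_probability(
    lambda r: r.uniform(1.0, 3.0),
    lambda r: r.uniform(2.0, 4.0),
    q=0.0,
    n_samples=20000,
    rng=rng,
)
print(round(estimate, 3))  # should be near the exact value 7/8
```

For this toy setting the exact probability is 7/8, and the standard error with 20,000 samples is roughly 0.002. Tightening the estimate by a factor of ten needs a hundred times as many samples, which is the cost the probabilistic filter tries to avoid.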

SLIDE 10

Spatial Filter

Pruning based on rectangular approximations only [1].

[Figure: rectangles A and B; the space of possible query positions is divided into three kinds of regions]

− Regions where, for any Q, A is closer to Q than B.
− Regions where, for any Q, A is not closer to Q than B.
− Regions where, for any Q, A may possibly be closer to Q than B.

[1] Tobias Emrich, Hans-Peter Kriegel, Peer Kröger, Matthias Renz, Andreas Züfle: Boosting Spatial Pruning: On Optimal Pruning of MBRs. SIGMOD Conference 2010: 39-50
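The three-way decision on this slide can be sketched with the classic minimum/maximum distance test between rectangles. Note that this is a simpler, conservative criterion swapped in for illustration; it is not the optimal criterion of [1] and may answer "undecided" in cases where [1] can still decide. All names below are hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple

Rect = Tuple[Sequence[float], Sequence[float]]  # (lower corner, upper corner)


def min_dist(r: Rect, s: Rect) -> float:
    """Smallest Euclidean distance between any point of r and any point of s."""
    total = 0.0
    for rl, ru, sl, su in zip(r[0], r[1], s[0], s[1]):
        gap = max(sl - ru, rl - su, 0.0)  # 0 if the intervals overlap
        total += gap * gap
    return sqrt(total)


def max_dist(r: Rect, s: Rect) -> float:
    """Largest Euclidean distance between any point of r and any point of s."""
    total = 0.0
    for rl, ru, sl, su in zip(r[0], r[1], s[0], s[1]):
        far = max(abs(su - rl), abs(ru - sl))
        total += far * far
    return sqrt(total)


def spatial_filter(a: Rect, b: Rect, q: Rect) -> str:
    """Decide whether A is closer to Q than B for all positions inside the rectangles."""
    if max_dist(a, q) < min_dist(b, q):
        return "A closer"
    if max_dist(b, q) <= min_dist(a, q):
        return "A not closer"
    return "undecided"


# Hypothetical rectangles.
q = ((0.0, 0.0), (1.0, 1.0))
a = ((2.0, 0.0), (3.0, 1.0))
b = ((6.0, 0.0), (7.0, 1.0))
c = ((3.0, 0.0), (5.0, 1.0))
print(spatial_filter(a, b, q), spatial_filter(b, a, q), spatial_filter(a, c, q))
```

The test treats the distances to Q independently for A and B, which is why it is only sufficient: it ignores that both distances refer to the same position of Q.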

SLIDE 11

Probabilistic Pruning

How many objects are closer to Q than A?

Lower Probability Bound

“B1 is closer to Q than A with a probability of at least x%”

Upper Probability Bound

“B2 is closer to Q than A with a probability of at most x%”

[Figure: two panels, one with Q, A and B1, the other with Q, A and B2]
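One plausible way to obtain such bounds, sketched here as an assumption rather than as the authors' exact procedure, is to split B into partitions with known probability mass and apply a spatial decision to each partition: mass that is certainly closer counts toward the lower bound, and mass that is certainly not closer is subtracted from the upper bound. The example is one-dimensional, uses a simple minimum/maximum distance test, and all names and numbers are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower end, upper end)


def min_dist(r: Interval, s: Interval) -> float:
    """Smallest distance between any point of r and any point of s."""
    return max(s[0] - r[1], r[0] - s[1], 0.0)


def max_dist(r: Interval, s: Interval) -> float:
    """Largest distance between any point of r and any point of s."""
    return max(abs(s[1] - r[0]), abs(r[1] - s[0]))


def closer_probability_bounds(
    b_parts: List[Tuple[Interval, float]], a: Interval, q: Interval
) -> Tuple[float, float]:
    """Bounds on P(B is closer to Q than A) from per-partition spatial decisions."""
    surely_closer = 0.0
    surely_not_closer = 0.0
    for part, mass in b_parts:
        if max_dist(part, q) < min_dist(a, q):
            surely_closer += mass        # every position in this partition beats A
        elif max_dist(a, q) <= min_dist(part, q):
            surely_not_closer += mass    # no position in this partition beats A
        # otherwise the partition stays undecided
    return surely_closer, 1.0 - surely_not_closer


q = (0.0, 0.0)
a = (4.0, 6.0)
b_parts = [((1.0, 3.0), 0.2), ((3.0, 5.0), 0.3), ((7.0, 9.0), 0.5)]
lb, ub = closer_probability_bounds(b_parts, a, q)
print(lb, ub)
```

Here the first partition is certainly closer, the last is certainly not, and the middle one is undecided, so the bounds are 0.2 and 0.5. Finer partitions shrink the undecided mass and tighten the bounds at the price of more spatial tests.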

SLIDE 12

Uncertain Generating Functions

  • What we have now is:

− B1 is closer to Q than A with a probability of at least $p_1^{lb}$ and at most $p_1^{ub}$
− B2 is closer to Q than A with a probability of at least $p_2^{lb}$ and at most $p_2^{ub}$
− ...

  • How can we derive the probability that at least (at most, exactly) k objects are closer to Q than A?

SLIDE 13

Uncertain Generating Functions

  • Let φ be a predicate and let X1, …, Xn be uncertain objects. Let $p_i^{lb}$ and $p_i^{ub}$ be lower and upper bounds of the probability that Xi satisfies φ.
  • How many objects satisfy φ?
  • We consider the following generating function:

$$\prod_{i=1}^{n} \left( p_i^{lb}\, x + \left(p_i^{ub} - p_i^{lb}\right) y + \left(1 - p_i^{ub}\right) \right)$$

SLIDE 14

Example

  • Assume the following probability bounds have been derived:

− X1 satisfies φ with a probability of at least 0.2 and at most 0.5
− X2 satisfies φ with a probability of at least 0.6 and at most 0.8

  • What is the probability that the number #X of objects that satisfy φ is at least (at most, exactly) k?

− Consider the following Generating Function: (0.2x + 0.3y + 0.5) * (0.6x + 0.2y + 0.2)
− Expansion yields: 0.12x² + 0.34x + 0.1 + 0.22xy + 0.16y + 0.06y²
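The expansion in this example can be reproduced mechanically by multiplying in one factor per object. The sketch below stores the polynomial as a dictionary from exponent pairs (i, j) to the coefficient of x^i y^j; the function name `expand_ugf` and the comments on what each term stands for are this sketch's reading of the slides, not wording from the source.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def expand_ugf(bounds: List[Tuple[float, float]]) -> Dict[Tuple[int, int], float]:
    """Return {(i, j): c}, where c is the coefficient of x^i y^j in the product."""
    coeffs: Dict[Tuple[int, int], float] = {(0, 0): 1.0}
    for lb, ub in bounds:
        if not 0.0 <= lb <= ub <= 1.0:
            raise ValueError("need 0 <= lb <= ub <= 1")
        nxt: Dict[Tuple[int, int], float] = defaultdict(float)
        for (i, j), c in coeffs.items():
            nxt[(i + 1, j)] += c * lb         # x term: object certainly satisfies phi
            nxt[(i, j + 1)] += c * (ub - lb)  # y term: undecided
            nxt[(i, j)] += c * (1.0 - ub)     # constant term: certainly does not
        coeffs = dict(nxt)
    return coeffs


coeffs = expand_ugf([(0.2, 0.5), (0.6, 0.8)])
for (i, j), c in sorted(coeffs.items()):
    print(f"x^{i} y^{j}: {c:.2f}")
```

Each factor adds at most one to i + j, so after n objects there are O(n²) terms and the whole expansion costs O(n³) multiplications in this naive form.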

SLIDE 15

Uncertain Generating Functions

− Expansion yields: 0.12x² + 0.34x + 0.1 + 0.22xy + 0.16y + 0.06y²

[Figure, built up step by step over slides 15 to 20: plot of P(#X=k) over k, with the vertical axis marked at 20%, 40%, 60% and 80%]

SLIDE 21

Approximated PDF

The result is an approximated PDF of #X.

[Figure: plot of P(#X=k) over k, with the vertical axis marked at 20%, 40%, 60% and 80%]
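The approximated PDF can be read off the expanded polynomial under one assumption that the slides imply but do not spell out: a term x^i y^j collects the probability that i objects certainly satisfy φ and j further objects may or may not. Under that reading, P(#X = k) is at least the coefficient of x^k (no undecided objects) and at most the sum of all terms with i ≤ k ≤ i + j. The helper name `count_bounds` is hypothetical.

```python
from typing import Dict, Tuple


def count_bounds(coeffs: Dict[Tuple[int, int], float], k: int) -> Tuple[float, float]:
    """Lower and upper bound of P(#X = k) from the coefficients of x^i y^j."""
    # Certainly exactly k: i == k and no undecided objects.
    lower = sum(c for (i, j), c in coeffs.items() if i == k and j == 0)
    # Possibly k: k lies between i and i + j.
    upper = sum(c for (i, j), c in coeffs.items() if i <= k <= i + j)
    return lower, upper


# Coefficients of 0.12x² + 0.34x + 0.1 + 0.22xy + 0.16y + 0.06y².
coeffs = {
    (2, 0): 0.12, (1, 0): 0.34, (0, 0): 0.10,
    (1, 1): 0.22, (0, 1): 0.16, (0, 2): 0.06,
}
bounds = {k: count_bounds(coeffs, k) for k in range(3)}
for k, (lo, hi) in bounds.items():
    print(k, round(lo, 2), round(hi, 2))
```

For the running example this gives the intervals [0.10, 0.32] for k = 0, [0.34, 0.78] for k = 1 and [0.12, 0.40] for k = 2. Bounds for "at least k" or "at most k" follow the same pattern with the conditions on i and i + j adjusted accordingly.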

SLIDE 22

Uncertain Generating Functions

Now let #X denote the number of objects that are closer to Q than A. The PDF of #X corresponds directly to the similarity rank of A with respect to Q.

Example Query: Return all objects that are the nearest neighbor of Q with a probability of at least 50%.

A can be pruned.

[Figure: plot of P(#X=k) over k, with the vertical axis marked at 20%, 40%, 60% and 80%]

SLIDE 23

Uncertain Generating Functions

Now let #X denote the number of objects that are closer to Q than A. The PDF of #X corresponds directly to the similarity rank of A with respect to Q.

Example Query: Return the most likely rank of each object.

For A, Rank 1 can be pruned.

[Figure: plot of P(#X=k) over k, with the vertical axis marked at 20%, 40%, 60% and 80%]
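Both pruning decisions on slides 22 and 23 can be checked against probability intervals for #X. The intervals below are an assumption of this sketch: they are what one gets from the example expansion when a term x^i y^j is read as "i objects certainly closer, j undecided". Rank r corresponds to #X = r − 1, so A is the nearest neighbor exactly when #X = 0. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (lower, upper) bounds of P(#X = k) for object A, assumed from the example expansion.
bounds_a: Dict[int, Tuple[float, float]] = {
    0: (0.10, 0.32),
    1: (0.34, 0.78),
    2: (0.12, 0.40),
}


def can_prune_threshold(bounds: Dict[int, Tuple[float, float]], k: int, tau: float) -> bool:
    """True if P(#X = k) is certainly below the threshold tau."""
    return bounds[k][1] < tau


def can_accept_threshold(bounds: Dict[int, Tuple[float, float]], k: int, tau: float) -> bool:
    """True if P(#X = k) certainly reaches the threshold tau."""
    return bounds[k][0] >= tau


def prunable_ranks(bounds: Dict[int, Tuple[float, float]]) -> List[int]:
    """Ranks that cannot be the most likely one: upper bound below another rank's lower bound."""
    best_lower = max(lo for lo, _ in bounds.values())
    return [k + 1 for k, (_, hi) in sorted(bounds.items()) if hi < best_lower]


nn_pruned = can_prune_threshold(bounds_a, 0, 0.5)  # nearest neighbor means #X = 0
pruned = prunable_ranks(bounds_a)
print(nn_pruned, pruned)
```

With these intervals the probability that A is the nearest neighbor is at most 0.32, below the 50% threshold, so A is pruned without verification. Rank 1 is also ruled out as the most likely rank because its upper bound 0.32 is below the lower bound 0.34 of rank 2. When neither the prune nor the accept test fires, the object has to go on to refinement or verification.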

SLIDE 24

Evaluation

[Figure: runtime (sec) over k for k = 1, 3, …, 25, with the runtime axis marked from 20 to 180; series labels: MC, with PF, w/o PF; τ = 0.5]

SLIDE 25

Summary

  • Algorithm to handle probabilistic similarity queries with an uncertain query object
  • Use of spatial pruning technique to obtain probability bounds
  • Efficient and correct accumulation of bounds using uncertain generating functions