Discovering Conditional Functional Dependencies



SLIDE 1

Discovering Conditional Functional Dependencies

Wenfei Fan, Floris Geerts, Jianzhong Li, and Ming Xiong
ICDE 2009

Amira Ghenai

SLIDE 2

Outline

  • Introduction and Motivation
  • Contributions of the paper
  • Algorithms description
      • CFDMiner
      • CTANE
      • FastCFD
  • Experimental Evaluation
  • Summary

SLIDE 3

Introduction & Motivation

  • CFDs (as previously discussed) were introduced for data cleaning purposes
  • CFDs are more effective than FDs in detecting and repairing inconsistencies
  • It is unrealistic to rely on human experts to design CFDs via experiments
  • CFDs should therefore be discovered automatically
  • The discovery problem is highly non-trivial

SLIDE 4

Example

Schema cust: (country code (CC), area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)).

[Figure: FDs and CFDs over cust, with the CFDs grouped into variable CFDs and constant CFDs; of the formulas only the fragment (CC, ZIP, STR) survives.]

SLIDE 5

Main Contributions

  • Three algorithms for CFD discovery:
      1. CFDMiner: discovers constant CFDs only, using a depth-first search scheme
      2. CTANE: extends TANE (presented last week), which discovers FDs levelwise, to CFDs
      3. FastCFD: a depth-first approach to discovering general CFDs; an extension of FastFD
  • Experimental study on real-life datasets

SLIDE 6

Problem Statement

  • Minimal CFDs

A minimal CFD is non-trivial and left-reduced. A CFD φ = (X → A, t_p) is left-reduced if:

  • None of its LHS attributes (X) can be removed
  • None of the constants in the LHS can be upgraded to “_”, i.e. t_p is “most general” (applies to variable CFDs only)

SLIDE 7

Problem Statement

  • Minimal CFDs Example

φ2 = ([CC, AC] → CT, (44, 131 ‖ EDI))

  • Constant CFD
  • True for t5 and t6
  • Can’t remove CC or AC from the LHS
  • => Minimal CFD

φ3 = ([CC, AC] → CT, (01, 212 ‖ NYC))

  • Only true for t3
  • Still holds even if we remove CC from the LHS
  • => Non-minimal CFD
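The two checks in this example (does the CFD hold, and can the LHS be reduced) can be sketched in a few lines of Python. This is only an illustrative sketch: the instance `TOY` is invented (the deck's own table is not reproduced in this transcript), and the function names `holds` and `is_left_reduced` are hypothetical.

```python
WILDCARD = "_"

# Hypothetical instance (invented tuples over CC, AC, CT only).
ROWS = [
    ("01", "908", "MH"), ("01", "908", "MH"), ("01", "212", "NYC"),
    ("01", "908", "MH"), ("44", "131", "EDI"), ("44", "131", "EDI"),
    ("44", "908", "MH"), ("01", "131", "UN"),
]
TOY = [dict(zip(("CC", "AC", "CT"), row)) for row in ROWS]


def holds(relation, lhs, rhs, tp):
    """Check the CFD (lhs -> rhs, tp) on a list of dict tuples."""
    seen = {}  # LHS values -> RHS value already observed for them
    for t in relation:
        if any(tp[a] not in (WILDCARD, t[a]) for a in lhs):
            continue  # t does not match the LHS pattern
        if tp[rhs] not in (WILDCARD, t[rhs]):
            return False  # RHS constant violated
        key = tuple(t[a] for a in lhs)
        if seen.setdefault(key, t[rhs]) != t[rhs]:
            return False  # same LHS values, different RHS values
    return True


def is_left_reduced(relation, lhs, rhs, tp):
    """True if the CFD holds and no LHS attribute or constant is redundant."""
    if not holds(relation, lhs, rhs, tp):
        return False
    for a in lhs:
        rest = [b for b in lhs if b != a]
        if holds(relation, rest, rhs, tp):
            return False  # attribute a can be removed
        # Generalising a constant is only tested for variable CFDs,
        # following the note on the slide.
        if tp[rhs] == WILDCARD and tp[a] != WILDCARD:
            if holds(relation, lhs, rhs, {**tp, a: WILDCARD}):
                return False
    return True


phi2 = {"CC": "44", "AC": "131", "CT": "EDI"}
phi3 = {"CC": "01", "AC": "212", "CT": "NYC"}
print(is_left_reduced(TOY, ["CC", "AC"], "CT", phi2))  # True
print(is_left_reduced(TOY, ["CC", "AC"], "CT", phi3))  # False
```

On this invented data the outcome agrees with the slide: dropping CC or AC breaks φ2, while φ3 still holds after dropping CC.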

SLIDE 8

Problem Statement

  • Frequent CFDs

Given a CFD φ = (X → A, t_p) and an instance r, the support of φ in r, denoted sup(φ, r), is the set of tuples t in r such that t[X] ⪯ t_p[X] and t[A] ⪯ t_p[A]. φ is k-frequent in r if sup(φ, r) contains at least k tuples.

Example:

  • φ1 = ([CC, AC] → CT, (01, 908 ‖ MH)) is 3-frequent
  • f1: [CC, AC] → CT is 8-frequent
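The support definition translates directly into a filter over the tuples. The sketch below assumes a pattern tuple is a dict mapping attributes to a constant or the wildcard "_". The instance `TOY` is invented (it is not the deck's table) and is sized so that the counts agree with the 3-frequent and 8-frequent figures quoted above. The names `matches`, `support` and `is_k_frequent` are hypothetical.

```python
WILDCARD = "_"

# Hypothetical instance (invented tuples over CC, AC, CT only).
ROWS = [
    ("01", "908", "MH"), ("01", "908", "MH"), ("01", "212", "NYC"),
    ("01", "908", "MH"), ("44", "131", "EDI"), ("44", "131", "EDI"),
    ("44", "908", "MH"), ("01", "131", "UN"),
]
TOY = [dict(zip(("CC", "AC", "CT"), row)) for row in ROWS]


def matches(value, pattern_value):
    """value ⪯ pattern_value: same constant, or the pattern is the wildcard."""
    return pattern_value == WILDCARD or value == pattern_value


def support(relation, lhs, rhs, tp):
    """Tuples t with t[X] ⪯ t_p[X] and t[A] ⪯ t_p[A]."""
    attrs = list(lhs) + [rhs]
    return [t for t in relation if all(matches(t[a], tp[a]) for a in attrs)]


def is_k_frequent(relation, lhs, rhs, tp, k):
    """A CFD is k-frequent if its support contains at least k tuples."""
    return len(support(relation, lhs, rhs, tp)) >= k


phi1 = {"CC": "01", "AC": "908", "CT": "MH"}  # constant CFD
f1 = {"CC": "_", "AC": "_", "CT": "_"}        # plain FD: all wildcards
print(len(support(TOY, ["CC", "AC"], "CT", phi1)))  # 3
print(len(support(TOY, ["CC", "AC"], "CT", f1)))    # 8
```

A plain FD has an all-wildcard pattern, so every tuple supports it; that is why f1 is as frequent as the instance is large.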

SLIDE 9

CFDMiner Algorithm

  • Goal: Given an instance r of R and a support threshold k, the algorithm finds a canonical cover of the k-frequent minimal constant CFDs, which have the form (X → A, (t_p ‖ a))
  • The algorithm uses the notions of free and closed item sets for an item set (X, t_p):
      • Closed set: cannot be extended without decreasing its support
      • Free set: cannot be generalized without increasing its support
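To make the two notions concrete, here is a small brute-force sketch. It assumes an item set (X, t_p) is represented as a frozenset of (attribute, constant) pairs; the instance `TOY` is invented and the helper names are hypothetical. It is meant to illustrate the definitions, not the mining procedure CFDMiner relies on.

```python
# Hypothetical instance (invented tuples over CC, AC, CT only).
ROWS = [
    ("01", "908", "MH"), ("01", "908", "MH"), ("01", "212", "NYC"),
    ("01", "908", "MH"), ("44", "131", "EDI"), ("44", "131", "EDI"),
    ("44", "908", "MH"), ("01", "131", "UN"),
]
TOY = [dict(zip(("CC", "AC", "CT"), row)) for row in ROWS]


def items(t):
    """A tuple viewed as a set of (attribute, value) items."""
    return frozenset(t.items())


def supp(relation, itemset):
    """Tuples that contain every item of the item set."""
    return [t for t in relation if itemset <= items(t)]


def closure(relation, itemset):
    """Largest extension with the same support: the items shared by all supporting tuples."""
    rows = supp(relation, itemset)
    if not rows:
        return itemset
    common = items(rows[0])
    for t in rows[1:]:
        common &= items(t)
    return common


def is_closed(relation, itemset):
    """Closed: any extension would lose support."""
    return closure(relation, itemset) == itemset


def is_free(relation, itemset):
    """Free: dropping any single item strictly increases the support."""
    n = len(supp(relation, itemset))
    return all(len(supp(relation, itemset - {i})) > n for i in itemset)


ac908 = frozenset({("AC", "908")})
print(sorted(closure(TOY, ac908)))  # [('AC', '908'), ('CT', 'MH')]
print(is_free(TOY, ac908))          # True
```

Checking only single-item removals is enough for freeness, because support can only grow as items are removed.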

SLIDE 10

CFDMiner Algorithm

SLIDE 11

CFDMiner Algorithm

  • The relation between free/closed item sets and left-reduced constant CFDs is:

For an instance r of R, a k-frequent left-reduced constant CFD φ = (X → A, (t_p ‖ a)) holds on r iff:

1. The item set (X, t_p) is free and k-frequent, and does not contain (A, a)
2. clo(X, t_p) ⪯ (A, a), i.e. the closure of (X, t_p) is less general than (A, a), and
3. (X, t_p) does not contain a smaller free set (Y, s_p) such that
     a. (X, t_p) ⪯ (Y, s_p), i.e. (Y, s_p) is more general, and
     b. clo(Y, s_p) ⪯ (A, a)
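The three conditions can be tested by brute force on a small instance. The sketch below is an assumption-laden illustration, not CFDMiner itself: it enumerates every candidate item set instead of mining closed and free sets efficiently, the instance `TOY` is invented, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical instance (invented tuples over CC, AC, CT only).
ROWS = [
    ("01", "908", "MH"), ("01", "908", "MH"), ("01", "212", "NYC"),
    ("01", "908", "MH"), ("44", "131", "EDI"), ("44", "131", "EDI"),
    ("44", "908", "MH"), ("01", "131", "UN"),
]
TOY = [dict(zip(("CC", "AC", "CT"), row)) for row in ROWS]


def items(t):
    return frozenset(t.items())


def supp(relation, itemset):
    return [t for t in relation if itemset <= items(t)]


def closure(relation, itemset):
    rows = supp(relation, itemset)
    common = items(rows[0]) if rows else itemset
    for t in rows[1:]:
        common &= items(t)
    return common


def is_free(relation, itemset):
    n = len(supp(relation, itemset))
    return all(len(supp(relation, itemset - {i})) > n for i in itemset)


def left_reduced_constant_cfds(relation, k):
    """Pairs (item set for X, item (A, a)) that pass conditions 1 to 3."""
    found = set()
    # Every non-empty item set that occurs in some tuple and leaves room for an RHS.
    candidates = {frozenset(c) for t in relation
                  for n in range(1, len(t))
                  for c in combinations(sorted(t.items()), n)}
    for x in candidates:
        # Condition 1: free and k-frequent.
        if len(supp(relation, x)) < k or not is_free(relation, x):
            continue
        # Condition 2: (A, a) lies in the closure but not in the item set itself.
        for item in closure(relation, x) - x:
            # Condition 3: no smaller free subset already has (A, a) in its closure.
            smaller = (frozenset(y) for n in range(len(x))
                       for y in combinations(sorted(x), n))
            if not any(is_free(relation, y) and item in closure(relation, y)
                       for y in smaller):
                found.add((x, item))
    return found


result = left_reduced_constant_cfds(TOY, k=2)
for x, (a, v) in sorted(result, key=lambda r: (sorted(r[0]), r[1])):
    print(sorted(x), "->", (a, v))
```

On this invented data the CFD with LHS (CC, AC) = (44, 131) and RHS CT = EDI is reported, whereas the one with LHS (01, 908) and RHS CT = MH is rejected because the smaller free set AC = 908 already determines CT = MH.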

SLIDE 12

CFDMiner Algorithm

  • Example

φ1 = ([CC, AC] → CT, (01, 908 ‖ MH))

  • 1. φ1 is a 3-frequent constant CFD and matches the free item set ([CC, AC], (01, 908))
  • 2. φ1 contains the more general free item set ([AC], (908)), which belongs to the closed set ([AC, CT], (908, MH))
  • So clo(Y, s_p) ⪯ (A, a) holds for (Y, s_p) = ([AC], (908))
  • => φ1 is not left-reduced

Closed sets and free sets that contain (CT, MH), i.e. (A, a) = (CT, MH)

SLIDE 13

CFDMiner Algorithm

  • 1. Get the k-frequent closed item sets (X, t_p) and their corresponding free sets
  • 2. Associate with every free item set (Y, s_p) the set RHS(Y, s_p) = (X \ Y, t_p[X \ Y])
  • 3. Construct an ordered list L to keep track of all k-frequent free item sets
  • 4. For each free item set (Y, s_p) in L:
      a. Replace RHS(Y, s_p) with RHS(Y, s_p) ∩ RHS(Y′, s_p[Y′]) for each Y′ ⊊ Y
      b. After checking all subsets, CFDMiner outputs the k-frequent CFDs

SLIDE 14

CTANE Algorithm

  • Goal: a levelwise algorithm for discovering minimal k-frequent (variable and constant) CFDs; an extension of the TANE algorithm
  • Briefly, the algorithm works as follows:
      1. Compute the RHS for minimal CFDs with their LHS in L_ℓ (where L_ℓ is the corresponding level in the lattice)
      2. For each (X, t_p) ∈ L_ℓ, look for CFDs
      3. Prune L_ℓ
      4. Generate the next level L_(ℓ+1)
  • The following demonstrative example …

SLIDE 15

CTANE Algorithm

Assume a support threshold k ≥ 3 for attributes [CC, AC, ZIP, STR].
Figure: two levels of the lattice, and a partial third level for the attributes [CC, AC, ZIP].
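The level-generation step of a levelwise search can be sketched as an Apriori-style join. This is a deliberately simplified illustration and not CTANE as published: the RHS bookkeeping and the pruning of steps 1 to 3 are omitted, lattice elements are assumed to be frozensets of (attribute, value) pairs where the value may be the wildcard "_", and the instance `TOY` and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical instance (invented tuples over CC, AC, CT only).
ROWS = [
    ("01", "908", "MH"), ("01", "908", "MH"), ("01", "212", "NYC"),
    ("01", "908", "MH"), ("44", "131", "EDI"), ("44", "131", "EDI"),
    ("44", "908", "MH"), ("01", "131", "UN"),
]
TOY = [dict(zip(("CC", "AC", "CT"), row)) for row in ROWS]


def pattern_support(relation, element):
    """Number of tuples matching every (attribute, value-or-wildcard) pair."""
    return sum(all(v == "_" or t[a] == v for a, v in element) for t in relation)


def first_level(relation, k):
    """Single-attribute elements: the wildcard plus each k-frequent constant."""
    level = set()
    for a in relation[0]:
        for v in {"_"} | {t[a] for t in relation}:
            e = frozenset({(a, v)})
            if pattern_support(relation, e) >= k:
                level.add(e)
    return level


def next_level(relation, level, k):
    """Join elements that differ in one item; keep k-frequent, well-formed results."""
    out = set()
    for e1, e2 in combinations(level, 2):
        cand = e1 | e2
        attrs = {a for a, _ in cand}
        # Exactly one new item, and no attribute bound to two different values.
        if len(cand) != len(e1) + 1 or len(attrs) != len(cand):
            continue
        # Every sub-element must itself have survived to the current level.
        if all(cand - {i} in level for i in cand) and pattern_support(relation, cand) >= k:
            out.add(cand)
    return out


level1 = first_level(TOY, k=3)
level2 = next_level(TOY, level1, k=3)
print(len(level1), len(level2))  # 8 16
```

The support filter is what keeps the lattice small: with k = 3 on this invented data, five of the twenty-one possible two-attribute combinations are discarded immediately.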

SLIDE 16

FastCFD Algorithm

  • Goal: Find minimal k-frequent variable and constant CFDs with a depth-first search inspired by the FastFD algorithm
  • Key idea: Minimal CFDs are minimal covers of difference sets
  • Difference sets:
      D(t1, t2; r) = {B ∈ attr(R) | t1[B] ≠ t2[B]} (the set of attributes on which t1 and t2 differ)
      D_A^r is the set {Y \ {A} | Y ∈ D^r, A ∈ Y}, where D^r collects the difference sets of the tuple pairs in r
      Example: D(t1, t2; r0) = {NM}
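Both definitions are short enough to write out directly. The sketch below uses a three-tuple invented instance and hypothetical function names; leaving out empty difference sets (pairs of identical tuples) is an assumption made here, not something stated on the slide.

```python
from itertools import combinations

# Hypothetical three-tuple instance over CC, AC, CT.
SMALL = [
    {"CC": "01", "AC": "908", "CT": "MH"},
    {"CC": "01", "AC": "212", "CT": "NYC"},
    {"CC": "44", "AC": "908", "CT": "MH"},
]


def difference_set(t1, t2):
    """D(t1, t2): the attributes on which the two tuples disagree."""
    return frozenset(b for b in t1 if t1[b] != t2[b])


def all_difference_sets(relation):
    """Difference sets of all tuple pairs (empty sets dropped by assumption)."""
    pairs = combinations(relation, 2)
    return {difference_set(t1, t2) for t1, t2 in pairs} - {frozenset()}


def difference_sets_for(relation, a):
    """D_A^r: difference sets that contain A, with A itself removed."""
    return {y - {a} for y in all_difference_sets(relation) if a in y}


print(sorted(difference_set(SMALL[0], SMALL[1])))  # ['AC', 'CT']
print(sorted(sorted(d) for d in difference_sets_for(SMALL, "CT")))
# [['AC'], ['AC', 'CC']]
```

Any LHS for CT has to intersect every set in D_CT^r, since two tuples that differ on CT must also differ on some LHS attribute; here that forces AC into the LHS.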

SLIDE 17

FastCFD Algorithm

1. FindCover Algorithm:
     I. Extracts the list of k-frequent free item sets in r (A)
     II. For each item set, produces the minimal difference sets D_A^m (B)
     III. Calls FindMin to find the minimal cover of D_A^m

2. FindMin Algorithm: (down-left of example)
     I. Orders the attributes (alphabetically in the example)
     II. Enumerates all subsets of attributes in a depth-first, left-to-right fashion. Example: for the sets {[PN], [AC, CT]}, the possible subsets include [AC, PN], [CT, PN] …
     III. For each subset obtained, verifies whether the corresponding CFD is minimal. Example: the tree for CC = 01 and Y = [AC, PN], looking for STR (input), gives
          φ′ = ([CC, AC, PN] → STR, (01, _, _ ‖ _))

Example setting: free pattern r_(CC=01) and k ≥ 2 for [CC, AC, PN, CT, ZIP, STR]
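The depth-first, left-to-right enumeration of minimal covers can be sketched as follows. This is a simplified stand-in written for illustration, not the FindMin pseudocode: it ignores pattern constants, support thresholds and the validity checks, and the function name `find_min_covers` is hypothetical. The input reuses the slide's example sets {[PN], [AC, CT]}.

```python
def find_min_covers(diff_sets, attrs):
    """All minimal attribute sets that intersect every difference set."""
    diff_sets = [frozenset(d) for d in diff_sets]
    covers = []

    def dfs(chosen, uncovered, remaining):
        if not uncovered:
            # Keep the cover only if every chosen attribute is needed,
            # i.e. removing it leaves some difference set uncovered.
            if all(any(not (d & (chosen - {a})) for d in diff_sets) for a in chosen):
                covers.append(chosen)
            return
        for i, a in enumerate(remaining):
            left = [d for d in uncovered if a not in d]
            if len(left) < len(uncovered):  # a covers at least one new set
                dfs(chosen | {a}, left, remaining[i + 1:])

    # Attributes are ordered alphabetically, as in the slide's example.
    dfs(frozenset(), diff_sets, sorted(attrs))
    return covers


covers = find_min_covers([{"PN"}, {"AC", "CT"}], ["AC", "CT", "PN"])
print([sorted(c) for c in covers])  # [['AC', 'PN'], ['CT', 'PN']]
```

Because attributes are only ever added in the fixed order, each candidate set is visited once; branches whose next attribute covers nothing new are cut immediately.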

SLIDE 18

FastCFD Algorithm

Conditions for checking whether a CFD is valid and whether it is minimal.
Frk: list of free sets; D: minimal difference set; A: attributes in R

SLIDE 19

FastCFD Algorithm

  • Differences compared to FastFD:
      • More complicated (constants, unnamed variables)
      • k-frequent CFDs instead of 1-frequent FDs
  • Needs an efficient way of computing difference sets:
      • NaiveFast algorithm: stripped-partition-based; a naive and fast approach
      • FastCFD algorithm:
          • Considers only the 2-frequent closed item sets in r, which are computed by the CFDMiner algorithm
          • Difference sets can be computed more efficiently
  • Attributes are reordered so that those covering most of the sets are treated first, to improve efficiency

SLIDE 20

Experimental Evaluation

  • Two real-life datasets and a synthetic dataset
  • The experiments studied the effect of:
      • The support threshold k
      • The number of tuples (DBSIZE)
      • The number of columns (arity)
      • The correlation factor CF (average range of distinct values in an attribute domain)

SLIDE 21

Experimental Evaluation

  • Scalability w.r.t. DBSIZE

SLIDE 22

Experimental Evaluation

  • Scalability w.r.t. arity and k

SLIDE 23

Experimental Evaluation

  • Scalability w.r.t. CF
  • The results shown are on the synthetic dataset
  • Similar results were achieved on the real datasets

SLIDE 24

Summary

  • CFDMiner is efficient in discovering constant CFDs.
  • CTANE works well on databases where the arity is small and the support threshold is large.
  • NaiveFast and FastCFD are very efficient when the arity of the relation is very large.
  • FastCFD is more efficient than the NaiveFast implementation, especially when the arity is large.