B-HUNT: Automatic Discovery of Fuzzy Algebraic Constraints in - PowerPoint PPT Presentation

B-HUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data
Paul G. Brown & Peter J. Haas
IBM Almaden Research Center, San Jose, CA
VLDB 2003

A Motivating Example
- Shipment data:

orders
orderID  shipDate
2A5      2001-01-03
3C2      2001-04-15
3B8      2002-11-25
2E1      2002-10-31
3D6      2002-07-25
…        …

deliveries
orderID  deliveryDate  deliveryTime
2A5      2001-01-06    09:50
3C2      2001-04-27    13:00
3B8      2002-12-19    11:20
2E1      2001-12-02    16:10
3D6      2002-07-29    08:50
…        …             …

Example: Fuzzy Constraints

    SELECT DAYS(deliveryDate) - DAYS(shipDate)
    FROM orders, deliveries
    WHERE orders.orderID = deliveries.orderID

[Figure: histogram of # of orders against deliveryDate - shipDate (in days, axis from 0 to 48); annotations on the plot: Parcel Post, FedEx, Air Shipping, wrong address, dock strike]

Resulting fuzzy constraint, with the share of orders in each bump:

       (deliveryDate BETWEEN shipDate + 2 AND shipDate + 5)     (25%)
    OR (deliveryDate BETWEEN shipDate + 12 AND shipDate + 19)   (50%)
    OR (deliveryDate BETWEEN shipDate + 31 AND shipDate + 35)   (25%)

Exploiting the Constraints

    SELECT COUNT(*)
    FROM orders, deliveries
    WHERE shipDate = '2003-07-02'
      AND deliveryTime > '17:00'
      AND orders.orderID = deliveries.orderID

Indexes: orders.orderID, deliveries.orderID, deliveries.deliveryDate (NOT orders.shipDate)

A plan (join of):
- Scan: orders (Pred: shipDate)
- IScan: deliveries.orderID (Pred: deliveryTime)

A better plan (join of):
- IScan: orders.orderID (Pred: shipDate)
- IScan: deliveries.deliveryDate (Pred: *, the derived predicate; Pred: deliveryTime)

Derived predicate (*):

       (2003-07-04 ≤ deliveryDate ≤ 2003-07-07)
    OR (2003-07-14 ≤ deliveryDate ≤ 2003-07-21)
    OR (2003-08-02 ≤ deliveryDate ≤ 2003-08-06)
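The derived predicate is the shipDate shifted by each bump interval of the fuzzy constraint. A minimal Python sketch of that arithmetic, assuming the three bump intervals from the shipment example; the function name `derived_delivery_ranges` is hypothetical and not part of B-HUNT:

```python
from datetime import date, timedelta

# Bump intervals (in days) for deliveryDate - shipDate, from the shipment example.
BUMPS = [(2, 5), (12, 19), (31, 35)]

def derived_delivery_ranges(ship_date, bumps=BUMPS):
    """Return inclusive (low, high) deliveryDate ranges implied by a shipDate."""
    return [
        (ship_date + timedelta(days=lo), ship_date + timedelta(days=hi))
        for lo, hi in bumps
    ]

ranges = derived_delivery_ranges(date(2003, 7, 2))
for low, high in ranges:
    print(f"({low.isoformat()} <= deliveryDate <= {high.isoformat()})")
```

For shipDate = 2003-07-02 this reproduces the three date ranges of the derived predicate on the slide.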

Example 2: Partitioned Data

orders (horizontally range-partitioned)
orderID  shipDate    deliveryDate
2A5      2003-01-03  2003-01-06
7D3      2003-01-17  2003-01-20
3C2      2003-04-15  2003-04-27
3B8      2003-06-19  2003-07-02
2E1      2003-06-16  2003-07-03
3D6      2003-08-25  2003-08-29
4D2      2003-09-12  2003-09-22
…

    SELECT COUNT(*)
    FROM orders
    WHERE shipDate = '2003-07-01'

Derived predicate:

       (2003-07-03 ≤ deliveryDate ≤ 2003-07-10)
    OR (2003-07-13 ≤ deliveryDate ≤ 2003-07-24)
    OR (2003-08-01 ≤ deliveryDate ≤ 2003-08-05)

Fragment elimination!
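Fragment elimination amounts to an interval-overlap test between each fragment's range and the derived predicate. The sketch below assumes the table is range-partitioned on deliveryDate into calendar quarters; the slide does not state the partitioning column or the boundaries, so both are hypothetical, as is the function name.

```python
from datetime import date

def overlapping_fragments(fragments, ranges):
    """Return names of fragments whose [low, high] range overlaps any derived range."""
    keep = []
    for name, frag_low, frag_high in fragments:
        if any(frag_low <= high and low <= frag_high for low, high in ranges):
            keep.append(name)
    return keep

# Hypothetical quarterly fragments on deliveryDate.
fragments = [
    ("2003Q1", date(2003, 1, 1), date(2003, 3, 31)),
    ("2003Q2", date(2003, 4, 1), date(2003, 6, 30)),
    ("2003Q3", date(2003, 7, 1), date(2003, 9, 30)),
    ("2003Q4", date(2003, 10, 1), date(2003, 12, 31)),
]

# Derived predicate for shipDate = '2003-07-01', as given on the slide.
ranges = [
    (date(2003, 7, 3), date(2003, 7, 10)),
    (date(2003, 7, 13), date(2003, 7, 24)),
    (date(2003, 8, 1), date(2003, 8, 5)),
]

to_scan = overlapping_fragments(fragments, ranges)
print(to_scan)
```

Under these assumed boundaries only the third-quarter fragment has to be scanned; the other three are eliminated.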

B-HUNT Overview
- Automatic discovery of fuzzy algebraic constraints
- Why useful?
  - Query optimization (new plans, costing)
  - Advice on data partitioning, view/index creation
  - Constraints interesting in themselves
- Hidden constraints abound in real world
  - Unknown to application developer and DBA
  - Enforced by application but unknown to DBA
  - Known to DBA but not enforced due to high cost
  - Constraint is fuzzy, so not a standard DB "rule" per se

Fuzzy Algebraic Constraints
- Algebraic relationships: $a_1 \oplus a_2 \in I$
  - $\oplus$ is $+$, $-$, $\times$, $\div$, etc.
  - $a_1, a_2$ are attributes
  - $I$ is a subset of the real numbers
- Pairing rule $P$
  - Determines which $a_1$ value goes with which $a_2$ value
  - Trivial pairing rule $\emptyset_R$ for table $R$: $a_1$ value paired with $a_2$ value in same row of $R$
  - If attributes in different tables: $P$ = join predicate
    - Self-joins OK also
- Algebraic constraint: $AC = (a_1, a_2, P, \oplus, I)$

Algebraic Constraints, Continued
- Previous Example 1:
  - $a_1$ = deliveries.deliveryDate, $a_2$ = orders.shipDate
  - $\oplus$ is subtraction operator
  - $P$: 'orders.orderID = deliveries.orderID'
  - $I = \{2,3,4,5\} \cup \{12,13,\ldots,19\} \cup \{31,32,33,34,35\}$
- Previous Example 2: same as Example 1 except
  - $a_1$ = orders.deliveryDate
  - $P = \emptyset_{\text{orders}}$
- Focus on case where $I = I_1 \cup I_2 \cup \cdots \cup I_k$
  - The $I_j$'s are disjoint "bump intervals" (of real line or integers)
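One way to picture the five-tuple is as a small record with a membership test. This is an illustrative sketch only: the class name, field names, and the `holds` method are hypothetical, and dates are mapped to day numbers so that subtraction yields a difference in days.

```python
import operator
from dataclasses import dataclass
from datetime import date
from typing import Callable, Tuple

@dataclass(frozen=True)
class AlgebraicConstraint:
    a1: str                                   # first attribute
    a2: str                                   # second attribute
    pairing_rule: str                         # join predicate, or a trivial same-row rule
    op: Callable[[float, float], float]       # the operator applied to (a1, a2)
    intervals: Tuple[Tuple[float, float], ...]  # disjoint bump intervals making up I

    def holds(self, v1, v2):
        """True if op(v1, v2) falls inside one of the bump intervals."""
        x = self.op(v1, v2)
        return any(lo <= x <= hi for lo, hi in self.intervals)

# Example 1 from the slides.
ac = AlgebraicConstraint(
    a1="deliveries.deliveryDate",
    a2="orders.shipDate",
    pairing_rule="orders.orderID = deliveries.orderID",
    op=operator.sub,
    intervals=((2, 5), (12, 19), (31, 35)),
)

def days(d):
    """Map a date to a day number so that subtraction gives days."""
    return d.toordinal()

ok = ac.holds(days(date(2001, 1, 6)), days(date(2001, 1, 3)))           # order 2A5, 3 days
exception = ac.holds(days(date(2002, 12, 19)), days(date(2002, 11, 25)))  # order 3B8, 24 days
print(ok, exception)
```

Order 2A5 satisfies the constraint, while order 3B8 (24 days) falls between bumps and would count as an exception.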

Outline of B-HUNT Algorithm
- Find candidates of form: $C = (a_1, a_2, P, \oplus)$
  - Find useful pairing rules
  - For each rule $P$ find useful triples $(a_1, a_2, \oplus)$
- For each candidate, construct bump intervals
  - Based on sampled rows of (key) table
  - Use histogramming, segmentation, or clustering
  - Choose sample size to control # of exceptions

For query optimization:
- At load time: partition data into compliant + exceptions
- During query processing: combine results of
  - Running modified query that incorporates constraints
  - Running original query over (small) exception data

Candidate Generation: Pairing Rules
[Figure: tables T1, T2, T3, T4]
1. Generate trivial pairing rules: $\emptyset_{T1}, \emptyset_{T2}, \emptyset_{T3}, \emptyset_{T4}$
2. Generate set $K$ of "key-like" attributes:
   - declared primary and unique keys (and declared compound keys)
   - attributes $a$ such that $\#\mathrm{rows}(a) \div \#\mathrm{distinctValues}(a) \approx 1$
3. For each $a \in K$, add 'R.a = S.b' to set of pairing rules iff
   (i) a and b are of same datatype, and either
   (ii) (a, b) is declared (primary key, foreign key) pair; or
   (iii) every value in a sample from b has a match in a
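The key-likeness test in step 2 and the sampling test in step 3(iii) can be sketched in a few lines of Python. The tolerance on the ratio, the sample size, and the choice to ignore NULLs are assumptions made for illustration; the slide only says the ratio should be approximately 1. Both function names are hypothetical.

```python
import random

def is_key_like(values, tolerance=0.05):
    """Step 2: #rows / #distinctValues is approximately 1 (NULLs ignored here)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    ratio = len(non_null) / len(set(non_null))
    return ratio <= 1.0 + tolerance

def sample_matches(key_values, other_values, sample_size=100, seed=0):
    """Step 3(iii): every value in a sample from b has a match in a."""
    candidates = [v for v in other_values if v is not None]
    if not candidates:
        return False
    rng = random.Random(seed)
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    keys = set(key_values)
    return all(v in keys for v in sample)

order_ids = ["2A5", "3C2", "3B8", "2E1", "3D6"]        # orders.orderID
delivery_ids = ["2A5", "3C2", "3B8", "2E1", "3D6"]     # deliveries.orderID
ship_modes = ["air", "fedex", "air", "post", "air"]    # hypothetical non-key column

print(is_key_like(order_ids), is_key_like(ship_modes))
print(sample_matches(order_ids, delivery_ids))
```

With these toy columns, orders.orderID is key-like and deliveries.orderID passes the sampling test, so 'orders.orderID = deliveries.orderID' would be proposed as a pairing rule, subject to the datatype check in (i).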

Pruning the Pairing Rules
- Adjustable heuristic pruning criteria:
  - Trade off thoroughness and efficiency
  - For optimization: want pairing rules that
    - Lead to constraints with impact
    - Are easy to exploit at run time
    - Occur frequently in workload
- Examples: prune a pairing rule "R.a = S.b" if
  - R and S are "small" (no impact)
  - R or S has no index (hard to exploit)
  - $a \in K$ and S.b / R.a is "small" (spurious relationship)
  - S.b is a system-generated key (spurious relationship)

From Pairing Rules to Candidates
[Figure: tables T1, T2, T3 with pairing rules $P_1$ and $P_2 = \emptyset_{T3}$]

For each pairing rule, consider all attribute pairs $(a_1, a_2)$ such that
- $a_1$ and $a_2$ can be operated on by $\oplus$
- $(a_1, a_2)$ not equal to attributes in pairing-rule join predicate

Prune candidate $(a_1, a_2, P, \oplus)$ if, e.g.,
- attributes have different data types
- too many NULL values
- either attribute lacks an index
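A rough sketch of the enumeration for a single pairing rule, applying only the data-type pruning criterion. It reads "not equal to attributes in pairing-rule join predicate" as skipping the join pair itself, which is an interpretation; the column types and the function name are hypothetical.

```python
from itertools import combinations

def candidate_pairs(columns, join_pair=None):
    """columns maps qualified attribute name -> datatype.

    Returns attribute pairs of matching datatype, skipping the pair that
    already appears in the pairing rule's join predicate.
    """
    skip = frozenset(join_pair) if join_pair else frozenset()
    pairs = []
    for a1, a2 in combinations(sorted(columns), 2):
        if frozenset((a1, a2)) == skip:
            continue  # the join attributes themselves
        if columns[a1] != columns[a2]:
            continue  # prune: different data types
        pairs.append((a1, a2))
    return pairs

# Assumed datatypes for the shipment example.
columns = {
    "orders.orderID": "char",
    "orders.shipDate": "date",
    "deliveries.orderID": "char",
    "deliveries.deliveryDate": "date",
    "deliveries.deliveryTime": "time",
}

pairs = candidate_pairs(
    columns, join_pair=("orders.orderID", "deliveries.orderID")
)
print(pairs)
```

For the shipment schema the only surviving pair is (deliveries.deliveryDate, orders.shipDate), which matches Example 1. NULL counts and index availability would need catalog statistics and are left out.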

Phrenology: Hunting the Bumps
- Each candidate $C = (a_1, a_2, P, \oplus)$ defines set of points $\Omega_C$
- Bump hunt on sample of points from $\Omega_C$
  - Because bump hunting must be scalable
- No exceptions in sample
  - I.e., segment the sample points

[Figure: sample points $x_1, \ldots, x_9$ on a line, covered by intervals $I_1$, $I_2$, $I_3$ of lengths $L_1 = x_4 - x_1$, $L_2 = x_7 - x_5$, $L_3 = x_9 - x_8$]

- Choose sample size to control # of exceptions in full DB

Direct "Optimal" Segmentation
- Trade off filtering power vs complexity
- Rough cost function ($k$ = # intervals):

  $c(S) = wk + (1 - w)\left[\frac{1}{\Delta}\sum_{j=1}^{k} L_j\right]$

  The first term measures complexity, the bracketed term filtering power.
  - $w$ is a weight between 0 and 1
  - $\Delta$ is estimated range of data values
- To minimize $c(S)$:
  - adjacent points in same segment iff $x_{i+1} - x_i < d^*$, where $d^* = \Delta\,\bigl(w/(1-w)\bigr)$
  - For discrete data types use $\max(d^*, 1 + \varepsilon)$
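The threshold rule follows from the cost function: merging two neighbouring segments separated by a gap g saves w (one fewer interval) and adds (1 - w) g / Δ to the length term, so merging pays off exactly when g < Δ w / (1 - w). A minimal Python sketch of the rule; the function names, the sample points, and the values w = 0.1 and Δ = 48 are illustrative assumptions.

```python
def segment(points, w, data_range, discrete=False, eps=1e-9):
    """Split sorted points wherever the gap is at least d* = data_range * w / (1 - w)."""
    if not 0.0 < w < 1.0:
        raise ValueError("w must lie strictly between 0 and 1")
    xs = sorted(points)
    if not xs:
        return []
    d_star = data_range * w / (1.0 - w)
    if discrete:
        d_star = max(d_star, 1.0 + eps)  # keep adjacent integers together
    segments = [[xs[0], xs[0]]]
    for prev, cur in zip(xs, xs[1:]):
        if cur - prev < d_star:
            segments[-1][1] = cur        # extend the current segment
        else:
            segments.append([cur, cur])  # gap too large: start a new segment
    return [(lo, hi) for lo, hi in segments]

def cost(segments, w, data_range):
    """c(S) = w*k + (1 - w) * (sum of segment lengths) / data_range."""
    k = len(segments)
    total_length = sum(hi - lo for lo, hi in segments)
    return w * k + (1.0 - w) * total_length / data_range

sample = [2, 3, 4, 5, 12, 13, 15, 19, 31, 33, 35]  # day differences
bumps = segment(sample, w=0.1, data_range=48, discrete=True)
print(bumps, cost(bumps, w=0.1, data_range=48))
```

With these values d* is about 5.33 days, so the gaps of 7 and 12 days split the sample into the three bumps of the shipment example.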

Histogram-Based Segmentation
[Figure: sample points $x_1, \ldots, x_9$ placed into histogram buckets, yielding intervals $I_1$, $I_2$, $I_3$]
- Use $2h(n)$ buckets:
  - $h(n) = (2n)^{1/3}$ is "oversmoothing" lower bound
  - Minimizes asymptotic mean integrated squared error
  - Center an interval of length $\Delta / (2h(n))$ around each isolated point
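The slide gives the bucket count but not the full procedure, so the sketch below fills the gaps with assumptions: the bucket count is rounded up to an integer, maximal runs of adjacent nonempty buckets are merged into one bump, each bump is trimmed to its smallest and largest sample point, and a bump consisting of a single value is widened to one bucket width. The function name is hypothetical.

```python
import math

def histogram_segments(points):
    """Bump intervals from an equi-width histogram with ceil(2 * (2n)^(1/3)) buckets."""
    xs = sorted(points)
    if not xs:
        return []
    n = len(xs)
    lo, hi = xs[0], xs[-1]
    data_range = hi - lo
    if data_range == 0:
        return [(lo, hi)]
    num_buckets = math.ceil(2 * (2 * n) ** (1 / 3))
    width = data_range / num_buckets

    # Assign each point to a bucket; the maximum goes into the last bucket.
    buckets = {}
    for x in xs:
        b = min(int((x - lo) / width), num_buckets - 1)
        buckets.setdefault(b, []).append(x)

    # Merge maximal runs of adjacent nonempty buckets.
    segments, run, prev = [], [], None
    for b in sorted(buckets):
        if prev is not None and b != prev + 1:
            segments.append((run[0], run[-1]))
            run = []
        run.extend(buckets[b])
        prev = b
    segments.append((run[0], run[-1]))

    # Widen isolated points to an interval of one bucket width.
    return [
        (a - width / 2, a + width / 2) if a == c else (a, c)
        for a, c in segments
    ]

sample = [2, 3, 4, 5] * 2 + list(range(12, 20)) * 2 + [31, 32, 33, 34, 35] * 2
hist_bumps = histogram_segments(sample)
print(hist_bumps)
```

On this 34-point sample the histogram has 9 buckets and recovers the three bumps of the shipment example. With much smaller samples the buckets become wide enough that neighbouring bumps may merge.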

Choosing the Sample Size
- Uses approximate (conservative) estimate $n^*(k)$ of required sample size for a $k$-segmentation:

  $n^*(k) = \frac{\chi^2_{1-p,\,2(k+1)}}{4f} + \frac{k}{2}$

  - With probability $p$, fraction of exceptions is at most $f$
  - Uses theory of tolerance intervals (Tukey and Scheffé)
- Iterative procedure:
  1. (Initialization) Set $k = 1$
  2. Take sample of size $n \ge n^*(k)$
  3. Compute constraint and observe number of bump intervals $k'$
  4. If $n \ge n^*(k')$ then go to step 5, else set $k = k'$ and go to step 2
  5. (Cleanup) Adjust for NULLs, Bernoulli fluctuations
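The iteration can be sketched in Python using only the standard library, because the chi-square distribution with an even number of degrees of freedom, here 2(k + 1), has a closed-form CDF. Several things are assumptions: the sample-size expression is taken as the chi-square point divided by 4f plus k/2, as the slide's formula appears to read; the subscript 1 - p is read as an upper-tail probability, so the chi-square point is the p-quantile; and `count_bumps` is a hypothetical callback that draws a sample of the requested size and returns the number of bump intervals found.

```python
import math

def chi2_cdf_even(x, df):
    """Chi-square CDF for even df: 1 - exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(p, df):
    """p-quantile of the chi-square distribution (even df) by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return hi

def required_sample_size(k, p, f):
    """n*(k): chi-square point with 2(k+1) degrees of freedom over 4f, plus k/2."""
    q = chi2_quantile_even(p, 2 * (k + 1))
    return math.ceil(q / (4.0 * f) + k / 2.0)

def choose_sample_size(count_bumps, p, f, max_rounds=20):
    """Steps 1 to 4 of the iterative procedure; returns (n, observed k)."""
    k = 1
    for _ in range(max_rounds):
        n = required_sample_size(k, p, f)
        k_observed = count_bumps(n)
        if n >= required_sample_size(k_observed, p, f):
            return n, k_observed
        k = k_observed
    raise RuntimeError("sample size did not stabilise")

# Stand-in callback that always finds three bumps, as in the shipment example.
n, k = choose_sample_size(lambda n: 3, p=0.99, f=0.01)
print(n, k)
```

The cleanup in step 5 (NULLs, Bernoulli fluctuations) is not covered by this sketch.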

Using the Constraints for Optimization
- Choose most important constraints (e.g. by filtering power)
- Partition data into "compliant" and "exception"
  - Physical partitioning or partial indexes
  - Table creation, e.g.:

    CREATE TABLE exceptions(…);
    INSERT INTO exceptions AS
      (SELECT orders.orderID, deliveries.orderID,
              orders.shipDate, deliveries.deliveryDate,
              deliveries.deliveryTime
       FROM orders, deliveries
       WHERE orders.orderID = deliveries.orderID
         AND NOT (
              (deliveryDate BETWEEN shipDate + 2 DAYS
                                AND shipDate + 5 DAYS)
           OR (deliveryDate BETWEEN shipDate + 12 DAYS
                                AND shipDate + 19 DAYS)
           OR (deliveryDate BETWEEN shipDate + 31 DAYS
                                AND shipDate + 35 DAYS)));

- Subsequent optimization builds on standard query processing technology
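The same split can be illustrated outside the database. The Python sketch below partitions already-joined rows into compliant and exception lists using the three bump intervals; the rows are the five orders from the motivating example with dates exactly as printed there, and the function names are hypothetical.

```python
from datetime import date

BUMPS = [(2, 5), (12, 19), (31, 35)]  # days between shipDate and deliveryDate

def is_compliant(ship_date, delivery_date, bumps=BUMPS):
    """True if deliveryDate - shipDate (in days) falls inside a bump interval."""
    delta = (delivery_date - ship_date).days
    return any(lo <= delta <= hi for lo, hi in bumps)

def partition(rows):
    """Split joined rows into (compliant, exceptions)."""
    compliant, exceptions = [], []
    for row in rows:
        target = compliant if is_compliant(row["shipDate"], row["deliveryDate"]) else exceptions
        target.append(row)
    return compliant, exceptions

rows = [
    {"orderID": "2A5", "shipDate": date(2001, 1, 3), "deliveryDate": date(2001, 1, 6)},
    {"orderID": "3C2", "shipDate": date(2001, 4, 15), "deliveryDate": date(2001, 4, 27)},
    {"orderID": "3B8", "shipDate": date(2002, 11, 25), "deliveryDate": date(2002, 12, 19)},
    {"orderID": "2E1", "shipDate": date(2002, 10, 31), "deliveryDate": date(2001, 12, 2)},
    {"orderID": "3D6", "shipDate": date(2002, 7, 25), "deliveryDate": date(2002, 7, 29)},
]

compliant, exceptions = partition(rows)
print([r["orderID"] for r in compliant], [r["orderID"] for r in exceptions])
```

Orders 3B8 (24 days) and 2E1 (delivery date earlier than ship date as printed) land in the exception list. A query would then run in its constraint-aware form over the compliant rows and in its original form over the exception rows, with the two results combined.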

An Empirical Study
- The Database
  - 7 years of synthetic retail data
  - Similar to TPC-D schema
  - > 2.3 terabytes
  - Two largest tables exceed 13.8 billion and 3.45 billion rows
- Discovered constraints include:
  - t1.orderDate ≤ t2.shipDate ≤ t1.orderDate + 4 MONTHS
  - t2.shipDate ≤ t2.receiveDate ≤ t2.shipDate + 1 MONTH
- Time to discover constraints:
  - 4 minutes (in addition to ordinary statistics collection)
  - Versus hours or days for fancy mining methods

Empirical Study, Continued
[Figure: bar chart of speedup (axis from 0 to 8) by query number (1 to 20)]
- Improvement for 50% of the queries
- Significant improvement for 25%
- Best speedup: 6.8x (accesses to largest table reduced 100x)
- No significant performance decreases
