What queries can the Domain mediator answer for me? Developer - - PowerPoint PPT Presentation

what queries can the domain mediator answer for me
SMART_READER_LITE
LIVE PREVIEW

What queries can the Domain mediator answer for me? Developer - - PowerPoint PPT Presentation

Large-Scale Data Integration Systems Exporting and Interactively Querying Application Domain CNET Computer Developer Web Service-Accessed Sources: PCWorld Portals Application Application The CLIDE System


slide-1
SLIDE 1

1 Exporting and Interactively Querying Web Service-Accessed Sources: The CLIDE System

Michalis Petropoulos

Database Seminar, February 2010

2

Large-Scale Data Integration Systems

Source Domain Application Domain Integration Domain

Application Data Source Data Source Mediator Integrated Schema

Developer

Integration Engineer

Source Owner

Application Source Schema …

Web Service Web Service Web Service

Source Schema …

  • Dell Computers
  • Cisco Routers
  • HP Printers
  • Dell Computers by CPU
  • Cisco Routers by Rate
  • HP Printers by Speed
  • CNET Computer
  • PCWorld Portals

Compatible Combinations

  • f Computers, Routers

and Printers

3

Large-Scale Data Integration Systems

What queries can the mediator answer for me?

CLIDE

Source Domain Application Domain Integration Domain

Application Data Source Data Source Mediator Integrated Schema

Developer

Integration Engineer

Source Owner

Application Source Schema …

Web Service Web Service Web Service

Source Schema …

4

Running Example

Schema

Computers(cid, cpu, ram, price) NetCards(cid, rate, standard, interface)

Views

V1 ComByCpu(cpu) → (Computer)* SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu V2 ComNetByCpuRate(cpu, rate) → (Computer, NetCard)* SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, Network Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate

Parameterized Views

Schema

Routers(rate, standard, price, type)

Views

V3 RouWired() → (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired' V4 RouWireless() → (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless'

Conjunctive Queries CQ

  • Equality & Comparison Conditions
  • Parameters

Computers for a given cpu Computers & NetCards for a given cpu & rate Wired Routers Wireless Routers

slide-2
SLIDE 2

2

5

Running Example

  • Integrated schema puts together

the Dell and Cisco schemas Attribute Associations

  • (Computers.cid, NetCards.cid)
  • (NetCards.rate, Routers.rate)
  • (NetCards.standard, Routers.standard)

Integrated Schema

V1

Application

V3 V2

Dell Cisco Mediator Integrated Schema Developer

V4

6

Sophisticated Mediators Make Feasible Queries Hard to Predict

Feasible Queries FQ

  • Equivalent CQ query rewritings using the views
  • Might involve more than one views
  • Order might matter

V4

Mediator

RouWireless() Routers.* 10 .11b 50 Wireless 54 .11g 120 Wireless

A B

V2

ComNetByCpuRate(‘P4’, ‘10’)

C D

Computers.* NetCards.* A123 P4 512 400 A123 10 .11b USB B123 P4 1024 550 B123 54 .11g USB

Feasible

ComNetByCpuRate(‘P4’, ‘54’) Computers.* NetCards.* Routers.* A123 P4 512 400 A123 10 .11b USB 10 .11b 50 Wireless B123 P4 1024 550 B123 54 .11g USB 54 .11g 120 Wireless

E

Query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers

V1

Mediator

Query: Get all Computers Infeasible

7

Problem

  • 1. Large number of sources
  • 2. Large number of views (web-services)
  • 3. Mediator capabilities

Developer formulates an application query  Is an application query feasible?  If not, how do I know which ones are feasible? Previous options:

– The developer had to browse the view definitions and somehow formulate a feasible query – Or formulate queries until a feasible one is found (trial-and-error)

No system-provided guidance

8

The CLIDE Solution

 A query formulation interface, which interactively guides the developer toward feasible queries by employing a coloring scheme

CLIDE

V1

Application

V3 V2

Dell Cisco Mediator Integrated Schema Developer

V4

slide-3
SLIDE 3

3

9

QBE-Like Interfaces

Microsoft SQL-Server

10

CLIDE Interface

  • Table, selection, projection and join actions
  • Feasibility Flag
  • Color-based suggestions

Projection Box Table Boxes Selection Boxes Table Alias Feasibility Flag Last/Next Step

11

Example Interaction

Yellow  required action

– All feasible queries require this action

White  optional action

– Feasible queries can be formulated w/ or w/o these actions

Snapshot 1

12

Example Interaction

Snapshot 2 Blue  required choice of action

– At least one feasible query cannot be formulated unless this action is performed

V1

Mediator

ComByCpu(‘P4’) cid cpu ram price A123 P4 512 400 B123 P4 1024 550 ram price 512 400 1024 550

A B C

slide-4
SLIDE 4

4

13

Example Interaction

Join Lines:

  • Only yellow and blue are displayed
  • Must appear in Attribute Associations

Snapshot 3

14

Example Interaction

Snapshot 4

  • *  any other constant
  • Red  prohibited action

– Does not appear in any feasible query – Lead to “Dead End” state

15

Example Interaction

Computers.* NetCards.* A123 P4 512 400 A123 10 .11b 50 B123 P4 1024 550 B123 54 .11g 120

Snapshot 5

Mediator

RouWireless() Routers.* 10 .11b 512 Wireless 54 .11g 1024 Wireless

A B

ComNetByCpuRate(‘P4’, rate)

D E

ram price rate interface price 512 400 10 USB 50 1024 550 54 USB 120

F

V4 V2

16

Demo

slide-5
SLIDE 5

5

17

CLIDE Properties

  • Completeness of Suggestions

– Every feasible query can be formulated by performing yellow and blue actions at every step

  • Summarization of Suggestions

– At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness

  • Rapid Convergence By Following

Suggestions

– The shortest sequence of actions from a query to any feasible query consists of suggested actions

18

Join Action Table Action Selection Action

Interaction Graph

  • Nodes are queries: One for each q∈CQ
  • Edges are actions: Table, selection, projection and join actions
  • Green nodes are feasible queries
  • Infinitely big structure

– All CQ queries – All possible combinations of actions formulating them

Com1.cid=Net1.cid Com1.cpu=‘P4’ Com1 Com1.ram Rou1

… …

Com1.price

… … … … … … …

Net1

… 19

Interaction Graph: Colorable Actions

  • Colorable actions AC label
  • utgoing edges of the current node

Net1 Com1.cpu=* Com1.price=* Rou1 Com1.ram=* Com1.cid=* Com2 Com1.cid

… … … … …

Com1.cpu

… … … …

Current Node

20

Com1.cpu=*

Interaction Graph: Colors

Com1.cpu=*

… … … … … …

… … …

Current Node Net1 Com1.cid=Net1.cid Com2.cid=Net1.cid Com2 Com2.cpu=‘P4’ Net1.rate=‘54Mbps’ Net1.rate=’54Mbps’

… … … … … … …

Com1.cpu=* Rou1 Net1.rate=Rou1.rate

… … … …

Net1.rate=’54Mbps’

Com1.cid=Net1.cid Com1.cid=Net1.cid

Net1 Com1.price=* Rou1 Com1.ram=* Com1.cid=* Com2 Com1.cid Com1.cpu

  • Yellow action α

– Every path from current node n to a feasible node contains α

  • Blue action α

– At least one feasible query cannot be formulated unless this action is performed (summarization)

  • Red action α

– No path to a feasible node contains α

Current Node Com1.cid=Net1.cid Com2 Rou1 Net1.rate=’54Mbps’

slide-6
SLIDE 6

6

21

CLIDE Architecture

  • Back-End invoked every time the user performs an action

– i.e., the user arrives at a new node in the interactions graph

Back-End

Closest Feasible Queries Algorithm

User Closest Feasible Queries FQC Current Query

Color Algorithm

Colored Actions + Feasibility Flag Aliases Collapse Rule Maximally-Contained Rewriter Views Schemas Column Associations Minimal Feasible Extension Queries

Front-End

Actions

Parameters Algorithm

Seed Queries SQ

22

Color Determined By a Finite Set of Feasible Queries

  • FQC is sufficient to color actions in AC
  • Theorem: Set of Closest Feasible Queries is Finite

n

… … … … … …

Closest Feasible Queries FQC

Challenge: Infinitely Many Feasible Queries

R a d i u s

?

Solution: Closest Feasible Queries FQC Challenge: How far can the Closest Feasible Queries FQC be? Solution: Based on Maximally Contained Queries FQMC

23

Maximally Contained Queries FQMC

  • Assuming fixed SELECT clause (projection list)
  • Covered extensively in literature

– MiniCon, Bucket, InverseRules Algorithms

  • FQMC is finite

Maximally Contained Query

Query: Q1 Get all Computers Query: Q2 Get all Computers with a given cpu Query: Q3 Get all Computers with a given cpu & ram

Not Maximally Contained Maximally Contained Query

Query: Q4 Get all Computers with a given ram

24

Closest Feasible Queries FQC Algorithm

  • Compute maximally contained queries FQMC
  • Theorem: All FQC queries are reachable

via a path of length p ≤ pL

  • The radius pL is the longest path to a maximally contained

query

Closest Feasible Queries FQC Maximally Contained Queries FQMC

n

… … … … … …

p

L

R a d i u s …

Solution: Maximally Contained Queries FQMC Challenge: How far can the Closest Feasible Queries FQC be?

slide-7
SLIDE 7

7

25

Closest Feasible Queries FQC Algorithm

  • Theorem: All queries in FQMC are in FQC
  • But not all queries in FQC are in FQMC

Closest Feasible Queries FQC Maximally Contained Feasible Queries FQMC

… … … … … …

More feasible nodes

n

Challenge: Find the Closest Feasible Queries

26

Closest Feasible Queries FQC Algorithm

  • Collapse Aliases to compute FQC \ FQMC
  • Check satisfiability

Closest Feasible Queries FQC Maximally Contained Feasible Queries FQMC

n

… … … … … …

Solution: Collapse Aliases

27

Color Algorithm

Yellow and Blue

  • An action α is colored based on which closest feasible

queries it appear in

  • Yellow, if α appears in all queries in FQC
  • Blue, if α appears in at least one (but not all) query in FQC

White and Red

  • Attach Maximum Projection Lists to Closest Feasible

Queries

– Projections that can be added to a feasible query, without compromising feasibility

  • Projection α is white if in the maximum projection list
  • Color selections based on projections

28

CLIDE Implementation & Optimizations

  • Views expansion introduce redundancy

– Affects CLIDE’s rapid convergence and summarization

  • Efficient containment test crucial to redundancy removal

Maximally-Contained Rewriter

Feasible Extension Queries + Maximum Projection Lists Maximally-Contained Feasible Extension Queries + Maximum Projection Lists Maximally-Contained Feasible Queries over Views + Containment Mappings

MiniCon

Containment Mappings Logging

Redundant Queries Removal

Minimal Feasible Extension Queries FQME + Maximum Projection Lists

Redundant Actions Removal Views Expansion

Back-End

Closest Feasible Queries Algorithm

Closest Feasible Queries FQC Current Query

Color Algorithm

Colored Actions + Feasibility Flag Aliases Collapse Rule Maximally-Contained Rewriter Views Schemas Column Associations Minimal Feasible Extension Queries

Front-End

Parameters Algorithm

Seed Queries SQ

slide-8
SLIDE 8

8

29

CLIDE Performance

  • Queries

A-span = 7 B-span = 3 Selections = 4,6,8,10 A B1 … C1 B2 C1 A BK B1 … C1 CL …

  • Schema

… Bi … Ci

  • Views

A BK B1 … C1 CL … … … BiM Bi1 … CiM Ci1 …

Chains of Stars – No Parameters

500 1000 1500 2000 2500 3000 3500 4000 (2,1) (2,2) (4,3) (4,4) (6,5) (6,6) (8,7) (8,8) (10,9) (10,10) (12,11) (12,12) (14,13) (14,14) Time (ms) (# of Joins, # of Selections) In Current Query 112 Views 168 Views 224 Views 280 Views

30

CLIDE Performance

Chains of Stars – No Parameters

500 1000 1500 2000 2500 3000 3500 4000 7720 142 7820 143 7920 144 8020 145 8120 146 8220 147 8320 148 Time (ms) # of Containment Tests Rest of Algorithms Containment Testing

31

CLIDE Performance

  • Queries

A-span = 7 B-span = 3 Selections = 4,6,8,10 A B1 … C1 B2 C1 A BK B1 … C1 CL …

  • Schema

… Bi … Ci

  • Views

A BK B1 … C1 CL … … … BiM Bi1 … CiM Ci1 …

Chains of Stars – With Parameters

500 1000 1500 2000 2500 3000 3500 4000 (2,1) (2,2) (4,3) (4,4) (6,5) (6,6) (8,7) (8,8) (10,9) (10,10) (12,11) (12,12) (14,13) (14,14) Time (ms) (# of Joins, # of Selections) In Current Query 112 Views 168 Views 224 Views 280 Views 200 400 600 800 1000 1200 1400 1600 (2,1) (2,2) (4,3) (4,4) (6,5) (6,6) (8,7) (8,8) (10,9) (10,10) (12,11) (12,12) (14,13) (14,14) Time (ms) (# of Joins, # of Selections) In Current Query 140 Views 196 Views 252 Views 308 Views

32

CLIDE Performance

Chains of Stars – With Parameters

200 400 600 800 1000 1200 1400 1600 1696 156 1697 157 1698 158 1699 159 1700 160 1701 161 1702 162 Time (ms) # of Containment Tests Rest of Algorithms Containment Testing

slide-9
SLIDE 9

9

33

CLIDE Summary

First interactive query formulation interface based on source and mediator capabilities Applicability

  • Service-Oriented Architectures
  • Privacy-Preserving Services

Contributions

  • Interaction Guarantees: Rapid Convergence, Completeness,

Summarization of Suggestions

  • Interaction Graph
  • Back-End Algorithms

– Closest Feasible Queries, Colors, Parameters

  • Modular, Customizable Architecture

http://www.clide.info