Querying Heterogeneous Information Sources Using Source Descriptions - PowerPoint PPT Presentation

Data Integration: Querying Heterogeneous Information Sources Using Source Descriptions & Data Integration: The Teenage Years CPSC 534P Rachel Pottinger September 19, 2011

Administrative Notes Homework 1 due… now I’ll get grades on your first reviews back ASAP If you got a 2, it almost certainly means that you need more analysis/synthesis Try to ask more questions that you think would be good for discussion (partially my fault) It’s good to think of weaknesses, but remember that your work’s not perfect either Project proposals due next Monday

For today’s class, I’ll wear three hats The presenter’s hat The discussion leader’s hat The “me” hat I’ll try to make it clear which is which, but if you get confused, let me know

Data Integration Up until now: one database – one schema Queries programmed by experts General users issue pre-programmed queries Interaction between databases Not very common Extremely manually intensive to set up Expensive Time consuming Hard to change

Planning a Beach Vacation Beach Good Cheap Weather Flight weather. wunder Fodors ca ground CAA Expedia Orbitz

BioMedical Research Phenotype Gene Protein Nucleotide Sequence Swiss- OMIM HUGO Prot Gene- Locus- Entrez Tests Link

Modern Data Management Many overlapping databases Vast user base Users want data from multiple sources Users want to combine data from many databases without knowing where it comes from The catch? They all have different schemas

Data Integration Systems User Query Virtual Mediated “Airport” database Schema Local Schema 1 Local Schema N Local Local Database 1 Database N Expedia Orbitz

Discussion question Where do you think this mediated schema comes from? What kinds of information should be taken into account when building one?

How can we relate concepts in one schema to concepts in another? Views, glorious views! (I told you they were handy) In a materialized view , we compute what the answers are and save the result

Previous Data Integration Architecture: Global-As-View (GAV) User Query Mediated Schema Local Schema 1 Local Schema N Local Local Database 1 Database N Expedia Orbitz Global sources are views on source schemas

Example of Global-As-View (GAV) Mediated schema: Airport(code, city) Feature(city, attraction) Source schemas: Expedia-Air(aircode,postalcode) CanadaPost(postalcode, city) Mapping: Airport(code, city) :- Expedia-Air(code, postcode), CanadaPost(postalcode,city) How do you answer a query? What if you want to add OrbitzA(code,postcode)?

Information Manifold Data Integration Architecture: Local-As-View (LAV) User Query Mediated Schema Local Schema 1 Local Schema N Local Local Database 1 Database N Expedia Orbitz Local sources are views on mediated schema

Local As View (LAV) A view is a named query LAV: local source is materialized view over mediated schema Mediated Mediated Schema: Schema Airport(code, city) Feature(city, attraction) CAA-Air … Beaches Local Sources/Views: CAA-Air(code, city) :- Airport(code, city) Beaches(code) :- Airport(code, city), Feature(city, “Beach”)

Local As View (LAV) A view is a named query LAV: local source is materialized view over mediated schema Mediated Mediated Schema: Schema Airport(code, city) Feature(city, attraction) CAA-Air … Beaches Local Sources/Views: CAA-Air(code, city) :- Airport(code, city) Beaches(code) :- Airport(code, city), Feature(city, “Beach”)  Adding new sources is easy  Rewriting queries is NP-complete

Answering Queries Using Views Query: Dest(code) :- Airport(code, city), Feature(city, “Beach”) Sources/Views: Q CAA-Air(code, city) :- Airport(code, city) Fodors(city, POI) :- Feature(city, POI) MS CAA Fodors … Rewriting: Dest(code):-CAA- Air(code, city), Fodors(city, “Beach”) Maximally Contained Rewriting: all answers to Query are a subset of those of Rewriting, and Rewriting contains all possible answers given local sources

Answering Queries Using Views Query: Dest(code) :- Airport(code, city), Feature(city, “Beach”) Sources/Views: Q CAA-Air(code, city) :- Airport(code, city) Fodors(city, POI) :- Feature(city, POI) MS Sun-Surf(city) :- Feature(city, “Beach”) CAA Fodors … Rewriting: Dest(code):-CAA- Air(code, city), Fodors(city, “Beach”)  Dest(code):-CAA-Air(code, city), Sun-Surf(city) Maximally Contained Rewriting: all answers to Query are a subset of those of Rewriting, and Rewriting contains all possible answers given local sources

Containment, what is it? For two queries, Q 1 and Q 2 , if all answers to Q 1 are a subset of those for Q 2 for all databases, then Q 1 is contained in Q 2 . Denoted as Q 1  Q 2 . For example, if Q 1 (x,x):-e1(x,x) Q 2 (y,z):-e1(y,z) Q 1  Q 2 .

Equivalent queries Q 1  Q 2 if they return the same answers for all databases. This is the same as Q 1  Q 2 and Q 2  Q 1 For example, if Q 1 (X,Y):- e 1 (X,Z),e 2 (Z,Y),e 1 (X,W) Q 2 (X,Y):-e 1 (X,Z), e 2 (Z,Y) Q 1  Q 2 .

How do you prove containment? There are a number of different ways, but don’t worry about it. The key thing is that even for conjunctive queries, it’s still NP - complete in the number of subgoals in the query.

So what’s a maximally contained rewriting then? It’s a rewriting where the rewritten query is contained in the original query, but it has as many answers as possible given the sources. Like the example above So how do you compute them?

Bucket Algorithm: Populating buckets For each subgoal in the query, place relevant views in the subgoal’s bucket Inputs: Q(x):- r 1 (x,y) & r 2 (y,x) V 1 (a):-r 1 (a,b) V 2 (d):-r 2 (c,d) V 3 (f):- r 1 (f,g) & r 2 (g,f) Buckets: r 2 (y,x) r 1 (x,y) V 2 (x), V 3 (x) V 1 (x),V 3 (x) 22

Combining Buckets For every combination in the Cartesian products from the buckets, check containment in the query Q(x):- r 1 (x,y) & r 2 (y,x) V 1 (a):-r 1 (a,b) V 2 (d):-r 2 (c,d) Candidate rewritings: V 3 (f):- r 1 (f,g) & r 2 (g,f) Q’ 1 (x) :- V 1 (x) & V 2 (x)  Bucket Algorithm checks Q’ 2 (x) :- V 1 (x) & V 3 (x)  all possible combinations Q’ 3 (x) :- V 3 (x) & V 2 (x)  Buckets: Q’ 4 (x) :- V 3 (x) & V 3 (x)  r 2 (y,x) r 1 (x,y) V 2 (x), V 3 (x) r 1 (x,y) r 2 (y,x) V 1 (x),V 3 (x)

Sample Data Integration Architecture User Query catalog Query Reformulation Global Schema Query Optimization & Execution Engine Wrapper Wrapper Local Schema Data Data Source Source

Discussion This paper won the 10 year test of time award. Why do you think that the committee chose it?

So that’s the initial data integration paper. What happened then?

Schema mappings (coming up a bit in a few weeks) Where do those mappings come from? What do they look like?

Peer Data Management Systems (coming up Wednesday) Rather than have a centralized authority, make things distributed

Model Management Most metadata applications are redone from scratch every time. It would be nice to have an algebra (like relational algebra) only on the schema level so that these algorithms could be reused

Data Spaces (coming up next Monday) Pay as you go data integration

Discussion Which of these topics would you most want to work on and why?

Industry: Data Integration  Enterprise Information Integration Challenges: Scale up and performance Horizontal (general) vs. vertical (solving entire problem) Integration with EAI and other middleware But did make it

Discussion The second paper was a result of a 10 year “test of time award”. As such it was not subject to rigorous peer review. What should we expect to be different about such papers from normal ones? What should we expect to be the same?

Any questions about what I expect? Things to keep in mind for the presenters: It is not necessary to present the entire paper (I’ll give you a list of things not to skip) You do not need to understand every last detail of the paper Things to keep in mind for the discussion leaders Make sure you don’t leave all the discussion until the end If you have trouble calling on people, I will help.

Querying Heterogeneous Information Sources Using Source Descriptions - PowerPoint PPT Presentation

Data Integration: Querying Heterogeneous Information Sources Using Source Descriptions & Data Integration: The Teenage Years CPSC 534P Rachel Pottinger September 19, 2011 Administrative Notes Homework 1 due now Ill get grades on

The problem Combining querying of XML data with ontology queries Example XML document

Wavelets for Efficient Querying of Large Wavelets for Efficient Querying of Large

Combining XML querying Combining XML querying with ontology reasoning: with ontology reasoning:

Querying XML Documents Querying XML Documents How XML may be supported in databases with

Querying and Mining Data Streams: Querying and Mining Data Streams: You Only Get One Look You

QUERYING AND MINING QUERYING AND MINING DATA STREAMS Elena Ikonomovska Joef Stefan Institute

Coverage in Heterogeneous Coverage in Heterogeneous Networks Xiaoli Chu King s College

Unifying Heterogeneous Cray Unifying Heterogeneous Cray Resources and Systems into an

Mining Heterogeneous Mining Heterogeneous Information Networks Information Networks Xifeng Yan

Critical Social Futures: Querying Systems of Disability Support 19 th June, 2015 University of

Distance Measure for Querying Arrangements of Temporal Intervals Orestis Kostakis, Panagiotis

Graph Data Processing M. Tamer Ozsu 1 / 75 Outline Introduction RDF Graph Querying

Provenance Exchange, Integration and Querying Marta Ma&oso Federal

Introduction to data stream querying and mining Georges HEBRAIL Workshop Franco-Brasileiro sobre

Relational Algebra Spring 2012 Instructor: Hassan Khosravi Querying relational databases

LINQ to SQL: Taking the Boredom out of Querying Introduction LINQ = Language INtegrated Query =

CSE 4/562 Database Systems Practicum Component Project Outline Parser & SQL Query

First things first Project 1 Hashing Still working on grading Definitely by

HMEH: write-optimal extendible hashing for hybrid DRAM-NVM memory Xiaomin Zou 1 , Fang Wang 1 *,

Intro to AWS and Boto3 IN TRODUCTION TO AW S BOTO IN P YTH ON Maksim Pecherskiy Data

So You Want To Start A Green Business Program? Green Business Engagement National Network (GBENN)

Gamification: Behind the buzz Unity Studios GOTO, October 2013 Unity Studios GOTO,

What is the buzz term in the current field of computer science? A. Cloud Computing B. Grid

Policies in Stateful Networks Seyed K. Fayaz , Tianlong Yu, Yoshiaki Tobioka, Sagar Chaki, Vyas

Querying Heterogeneous Information Sources Using Source Descriptions - PowerPoint PPT Presentation

Data Integration: Querying Heterogeneous Information Sources Using Source Descriptions & Data Integration: The Teenage Years CPSC 534P Rachel Pottinger September 19, 2011 Administrative Notes Homework 1 due now Ill get grades on

The problem Combining querying of XML data with ontology queries Example XML document

Wavelets for Efficient Querying of Large Wavelets for Efficient Querying of Large

Combining XML querying Combining XML querying with ontology reasoning: with ontology reasoning:

Querying XML Documents Querying XML Documents How XML may be supported in databases with

Querying and Mining Data Streams: Querying and Mining Data Streams: You Only Get One Look You

QUERYING AND MINING QUERYING AND MINING DATA STREAMS Elena Ikonomovska Joef Stefan Institute

Coverage in Heterogeneous Coverage in Heterogeneous Networks Xiaoli Chu King s College

Unifying Heterogeneous Cray Unifying Heterogeneous Cray Resources and Systems into an

Mining Heterogeneous Mining Heterogeneous Information Networks Information Networks Xifeng Yan

Critical Social Futures: Querying Systems of Disability Support 19 th June, 2015 University of

Distance Measure for Querying Arrangements of Temporal Intervals Orestis Kostakis, Panagiotis

Graph Data Processing M. Tamer Ozsu 1 / 75 Outline Introduction RDF Graph Querying

Provenance Exchange, Integration and Querying Marta Ma&amp;oso Federal

Introduction to data stream querying and mining Georges HEBRAIL Workshop Franco-Brasileiro sobre

Relational Algebra Spring 2012 Instructor: Hassan Khosravi Querying relational databases

LINQ to SQL: Taking the Boredom out of Querying Introduction LINQ = Language INtegrated Query =

CSE 4/562 Database Systems Practicum Component Project Outline Parser &amp; SQL Query

First things first Project 1 Hashing Still working on grading Definitely by

HMEH: write-optimal extendible hashing for hybrid DRAM-NVM memory Xiaomin Zou 1 , Fang Wang 1 *,

Intro to AWS and Boto3 IN TRODUCTION TO AW S BOTO IN P YTH ON Maksim Pecherskiy Data

So You Want To Start A Green Business Program? Green Business Engagement National Network (GBENN)

Gamification: Behind the buzz Unity Studios GOTO, October 2013 Unity Studios GOTO,

What is the buzz term in the current field of computer science? A. Cloud Computing B. Grid

Policies in Stateful Networks Seyed K. Fayaz , Tianlong Yu, Yoshiaki Tobioka, Sagar Chaki, Vyas

Provenance Exchange, Integration and Querying Marta Ma&oso Federal

CSE 4/562 Database Systems Practicum Component Project Outline Parser & SQL Query