Enhancing Traditional Databases to Support Broader Data Management - - PowerPoint PPT Presentation
Enhancing Traditional Databases to Support Broader Data Management - - PowerPoint PPT Presentation
Enhancing Traditional Databases to Support Broader Data Management Applications Yi Chen Computer Science & Engineering Arizona State University What Is a Database System? Of course, there are traditional relational database management
2
Yi Chen --- January 23, 2006
What Is a Database System?
Of course, there are traditional relational
database management systems (RDBMS)
Was introduced in 1970 by Dr. E. F. Codd (of
IBM)
Commercial relational databases began to
appear in the 1980s
The focus of most work in the past 30 years
3
Yi Chen --- January 23, 2006
A Relational Database (RDBMS)
Row (Tuple, Record) Column (Attribute) Table (Relation)
Name Skill age James Beginner 21 Bob Experienced 33 Climber Name Route Date Duration Bob Last Tango 10/10/05 5 Bob Last Tango 1/10/06 4.5 Climbs
A predefined data structure (schema) is required. Refers to
4
Yi Chen --- January 23, 2006
projection: ∏Route = “Last
Tango”
selection: σName = “James”
Querying RDBMS: SQL
Name Skill age James Beginner 21 Bob Experienced 33 Climber Name Route Date Duration Bob Last Tango 10/10/05 5 Bob Last Tango 1/10/06 4.5 Climbs
Climber.name = climbs.name Climbs
join: Climber
Name Skill Age Route Date Duration Bob Experienced 33 Last Tango 10/10/05 5 Bob Experienced 33 Last Tango 1/10/06 4.5
5
Yi Chen --- January 23, 2006
The Advantages of RDBMS
Good data organization High efficiency for large datasets via indexing
and query optimization
Concurrency control and reliability
6
Yi Chen --- January 23, 2006
But, 80% of the World’s Data is Not in RDBMS!
Examples:
WWW, Emails Personal data, documents of various format Sensor data A lot of scientific data (experimental data, large images,
documentation, etc)
Why not? There are several assumptions in relational databases that
do not fit for handling this data.
My research addresses how to enhance RDBMS to
manage them.
7
Yi Chen --- January 23, 2006
Challenges for RDBMS (I)
RDBMS Assumption: data conforms to a predefined
fixed schema, which is separated from the data itself
Reality:
Data may be collected from different sources on the
web, therefore has different schemas
Schema can change over time for a single source
Requirements: We need to handle data of different
schemas and have the schemas tightly associated with the data
8
Yi Chen --- January 23, 2006
XML as a Data Representation Format
XML has become a standard data format for
various applications, because of:
Flexibility in schemas -- semi-structured data Self - describing feature Representing tree data model naturally
9
Yi Chen --- January 23, 2006
XML: the Standard for Web Data Representation
PubMed GenBank BLAST
Internet
Web Service Requester NCBI Web Service Publisher
...
XML Data XML Data
10
Yi Chen --- January 23, 2006
XML: Representing Phylogenetic Trees
From the Tree of the Life Website, University of Arizona
Orangutan Gorilla Chimpanzee Human
11
Yi Chen --- January 23, 2006
Challenges for RDBMS (II)
RDBMS Assumption: Data is clean and consistent. Reality: real world data is dirty
Data collected from different sources may have
missing and conflicting information
Data that is obtained from data mining is often not
error-prone
Experimental data often contains random errors
Requirements: we need to measure data quality and
handle imprecise and/or incomplete data
12
Yi Chen --- January 23, 2006
Roadmap of This Talk
Managing XML by leveraging mature RDBMS
[Chen et al 04]
Introduction to XML A generic and efficient XML-to-RDBMS mapping
Data mapping from trees to tables Query translation from tree navigation queries
to SQL queries that are efficient
Handling imprecise and incomplete data in
DBMS [Chen et al 06]
13
Yi Chen --- January 23, 2006
Sample XML Data
<books> <book> <title> The lord of the rings... </title> <section> <title> Locating middle-earth </title> ... </section> … </book> </books>
books book title section title section title figure description “The lord
- f the
rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”
... ... ...
14
Yi Chen --- January 23, 2006
...
Sample XML Queries
XML query languages are
based on hierarchical structure navigation (e.g. XPath)
Sample queries:
What are all the section
titles: //section/title
books book title section title section title figure description “The lord
- f the
rings …” “Locating middle- earth” “King Theoden's golden hall”
... ...
“A hall fit for a king”
Descendant axis Child axis
15
Yi Chen --- January 23, 2006
...
Sample XML Queries
XML query languages are
based on hierarchical structure navigation (e.g. XPath)
Sample queries:
What are all the section
titles: //section/title
What are the titles of
sections that contain a figure: //section[/figure]/title
books book title section title section title figure description “The lord
- f the
rings …” “Locating middle- earth” “King Theoden's golden hall”
... ...
“A hall fit for a king”
Predicates
16
Yi Chen --- January 23, 2006
How to Query XML Data efficiently?
RDBMS have achieved high performance in
query evaluation.
Can we leverage RDBMS by encoding XML to
tables?
17
Yi Chen --- January 23, 2006
Analogy: Fourier Transforms
g * h = ∫∫-∞g(u)h(u)du G(f)H(f) Complex
+∞
Efficient
18
Yi Chen --- January 23, 2006
Mapping XML Data to RDBMS
XPath XML data SQL Query Translation XML fragments Relational databases Storage Mapping
Challenge: How to build the bridge between hierarchies and tables?
19
Yi Chen --- January 23, 2006
Data Mapping
ID Tag Value Structural Information 1 books 2 book 3 title The... 4 section 5 title Locating… … … … … T
(5)
books book title section title section title figure description “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”
(4) (3) (2) (1)
“The lord of the rings …”
Parent ID
[Florescu & Kossmann 99]
20
Yi Chen --- January 23, 2006
Data Mapping
ID Tag Value Structural Information 1 books 2 book 3 title The... 4 section 5 title Locating… … … … … T
(5)
books book title section title section title figure description “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”
(4) (3) (2) (1)
“The lord of the rings …”
. Design special labels to encode node relationships
21
Yi Chen --- January 23, 2006
Query Translator Architecture
How to choose XPath subqueries, such that:
they can be easily translated to SQL subqueries the SQL subqueries can be efficiently evaluated
How to combine SQL subqueries to a complete one?
XPath decomposition
XPath
Sub-query translation SQL sub-query composition
Query Translator SQL
22
Yi Chen --- January 23, 2006
Query Translator
Q: //book[//figure]/section/title
section book figure title
23
Yi Chen --- January 23, 2006
Query Translator: (I) Decomposition to Suffix Paths
Q: //book[//figure]/section/title
book figure section title book
24
Yi Chen --- January 23, 2006
(342000,343000)
σ342000 ≤ Plabel ≤ 343000T
Encoding Suffix Paths Using P-labeling
//book/section/title
books book title section title section title figure description “The lord
- f the
rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”
(1) (2) (3) (4)
... ... ...
(100)
id Plabel 1 100000 2 210000 3 321000 4 421000 5 342100 … …
T
(5)
books book title section title section title figure description “The lord
- f the
rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”
(1) (2) (3) (4)
... ... ...
(100)
Evaluating suffix paths SQL selections on P-labels
25
Yi Chen --- January 23, 2006
Query Translator: (II) Selection on P-labels
Q: //book[//figure]/section/title
book figure section title book
26
Yi Chen --- January 23, 2006
D-labeling Scheme
books book title section title section title figure description “The lord
- f the
rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”
(1, 20000, 1) (6, 1200, 2) (10,80,3) (81, 250,3)
... ... ...
(100, 200,4)
- D-labeling is used to connect
suffix paths.
- D-labels (start, end, depth)
can be used to detect ancestor-descendant relationships between nodes in a tree.
(120, 160, 5)
27
Yi Chen --- January 23, 2006
A Bi-labeling Based Query Translation
//book//figure
books book title section title section title Figure (120, 160, 5) description “The lord
- f the
rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”
(1, 20000, 1) (6, 1200, 2) (10,80,3) (81, 250,3)
... ... ...
(100, 200,4)
(σPlabel for “//book”(ρT→T’))
T’.Start ≤ T.Start∧T’.End ≥ T.End
(σPlabel for “//figure”T )
Plabel Start End Depth Val 100000 1 20000 1 … 210000 6 1200 2 … 321000 10 80 3 … 421000 81 250 3 … 442100 100 200 4 … 554680 120 160 5 … … … … … … T
Stitching suffix paths SQL joins on D-labels
28
Yi Chen --- January 23, 2006
Query Translator: (III) Join on D-labels
Q: //book[//figure]/section/title
book figure section title book
29
Yi Chen --- January 23, 2006
Comparison with Previous Approach
book figure section title book
Ours: fewer disk accesses, fewer joins
book figure section title
Previous Approach
[Li & Moon 01, Zhang et al 01, Tatarinov et al 02, Grust 02, DeHaan et al 03, etc]
30
Yi Chen --- January 23, 2006
Compare our system (BLAS) with the previous approach using D-labeling scheme only
Data sets Query sets
Suffix path queries Path queries XPath queries Benchmark queries
Query Engines: DB2, TwigStack Join [Bruno et al 02]
Experiment Setup
Data Set Size(MB) Nodes(K) Tags Depth DTD Protein 70 2277 66 7 Tree Shakespeare 26 640 19 7 Acyclic graph Auction 69 1238 77 12 Cyclic graph
31
Yi Chen --- January 23, 2006
Query Execution Time
Query Name: A:Auction P: Protein S: Shakespeare 1: suffix path query 2: path query 3: XPath query
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% QA1 QA2 QA3 QP1 QP2 QP3 QS1 QS2 QS3 Queries
Time
D-labeling BLAS
32
Yi Chen --- January 23, 2006
2 4 6 8 10 12 14 35 70 105 139 174
File Size (MB) Time (seconds) DLabeling BLAS
Benchmark data, Benchmark query Q3
Scalability
33
Yi Chen --- January 23, 2006
Summary of XML Data Management
We proposed a generic XML-to-RDB mapping,
based on a bi-labeling scheme.
It is more efficient compared with previous
approach, since it generates SQL queries that require:
fewer disk accesses fewer joins fewer intermediate results
Experiments show the effectiveness
34
Yi Chen --- January 23, 2006
Roadmap of This Talk
Managing XML by leveraging mature RDBMS
[Chen et al 04]
Introduction to XML A generic and efficient XML-to-RDBMS mapping
Data mapping from trees to tables Query translation from tree navigation queries
to SQL queries that are efficient
Handling imprecise and incomplete data in
DBMS [Chen et al 06]
35
Yi Chen --- January 23, 2006
Probabilistic Databases: Managing Imprecise Data
How to measure the imprecision of the data? The simplest model: associate each tuple a
probability
Name Skill age Pr James Beginner 21 p1 Bob Experienced 33 p2 Climber Name Route Date Duration Pr Bob Last Tango 10/10/05 5 q1 Bob Last Tango 1/10/06 4.5 q2 Climbs
Assume that all the tuples are independent events
36
Yi Chen --- January 23, 2006
How Does This Affect Query Evaluation? [Fuhr&Roellke 97]
σ
v p v p v1 p1 v1 v2 p1 p2 v2 p2
Π
v p1 v p2 v 1-(1-p1)(1-p2)
37
Yi Chen --- January 23, 2006
LT 1-(1-p2q1)(1- p2q2)
Π
LT 1-(1-q1)(1-q2) Bob LT p2(1-(1-q1)(1-q2))
Wrong ! Correct
Challenges: Correctness Depends on Execution Order [Suciu et al 05]!
Name Skill Pr Bob Exp p2 Name Route Pr Bob LT q1 Bob LT q2
Find distinct climb routes that have been climbed by an experienced climber.
Name Route Pr Bob LT q1 Bob LT q2
Π
Name Skill Pr Bob Exp p2 Bob LT p2q1 Bob LT p2q2
38
Yi Chen --- January 23, 2006
This can not be convertible !
How to Handle Incomplete Data?
Seller Make Model Body Style Year Price Mark BMW 325Cic convertible 2006 36000 Alice Ford Taurus 2001 8000
Our techniques infer possible values with probability by mining data statistics Find cars that are convertible and have price less than 10k.
39
Yi Chen --- January 23, 2006
Querying Incomplete Databases
Given missing value prediction
Querying incomplete databases Querying probabilistic databases
What if we are not able to store the predicted
values? --- e.g. data integration applications
On-the-fly query rewriting
How should we handling imprecise and incomplete XML Data?
40
Yi Chen --- January 23, 2006
Conclusions
Traditional RDBMS does not satisfy the requirements
in ever growing scientific and web applications
We have discussed two enhancement to RDBMS
Efficient XML data management Handling imprecise and incomplete data
Other enhancement to RDBMS that I am working on
Data stream processing Scientific workflow modeling and query processing
41
Yi Chen --- January 23, 2006