Enhancing Traditional Databases to Support Broader Data Management - - PowerPoint PPT Presentation

enhancing traditional databases to support broader data
SMART_READER_LITE
LIVE PREVIEW

Enhancing Traditional Databases to Support Broader Data Management - - PowerPoint PPT Presentation

Enhancing Traditional Databases to Support Broader Data Management Applications Yi Chen Computer Science & Engineering Arizona State University What Is a Database System? Of course, there are traditional relational database management


slide-1
SLIDE 1

Enhancing Traditional Databases to Support Broader Data Management Applications

Yi Chen Computer Science & Engineering Arizona State University

slide-2
SLIDE 2

2

Yi Chen --- January 23, 2006

What Is a Database System?

Of course, there are traditional relational

database management systems (RDBMS)

Was introduced in 1970 by Dr. E. F. Codd (of

IBM)

Commercial relational databases began to

appear in the 1980s

The focus of most work in the past 30 years

slide-3
SLIDE 3

3

Yi Chen --- January 23, 2006

A Relational Database (RDBMS)

Row (Tuple, Record) Column (Attribute) Table (Relation)

Name Skill age James Beginner 21 Bob Experienced 33 Climber Name Route Date Duration Bob Last Tango 10/10/05 5 Bob Last Tango 1/10/06 4.5 Climbs

A predefined data structure (schema) is required. Refers to

slide-4
SLIDE 4

4

Yi Chen --- January 23, 2006

projection: ∏Route = “Last

Tango”

selection: σName = “James”

Querying RDBMS: SQL

Name Skill age James Beginner 21 Bob Experienced 33 Climber Name Route Date Duration Bob Last Tango 10/10/05 5 Bob Last Tango 1/10/06 4.5 Climbs

Climber.name = climbs.name Climbs

join: Climber

Name Skill Age Route Date Duration Bob Experienced 33 Last Tango 10/10/05 5 Bob Experienced 33 Last Tango 1/10/06 4.5

slide-5
SLIDE 5

5

Yi Chen --- January 23, 2006

The Advantages of RDBMS

Good data organization High efficiency for large datasets via indexing

and query optimization

Concurrency control and reliability

slide-6
SLIDE 6

6

Yi Chen --- January 23, 2006

But, 80% of the World’s Data is Not in RDBMS!

Examples:

WWW, Emails Personal data, documents of various format Sensor data A lot of scientific data (experimental data, large images,

documentation, etc)

Why not? There are several assumptions in relational databases that

do not fit for handling this data.

My research addresses how to enhance RDBMS to

manage them.

slide-7
SLIDE 7

7

Yi Chen --- January 23, 2006

Challenges for RDBMS (I)

RDBMS Assumption: data conforms to a predefined

fixed schema, which is separated from the data itself

Reality:

Data may be collected from different sources on the

web, therefore has different schemas

Schema can change over time for a single source

Requirements: We need to handle data of different

schemas and have the schemas tightly associated with the data

slide-8
SLIDE 8

8

Yi Chen --- January 23, 2006

XML as a Data Representation Format

XML has become a standard data format for

various applications, because of:

Flexibility in schemas -- semi-structured data Self - describing feature Representing tree data model naturally

slide-9
SLIDE 9

9

Yi Chen --- January 23, 2006

XML: the Standard for Web Data Representation

PubMed GenBank BLAST

Internet

Web Service Requester NCBI Web Service Publisher

...

XML Data XML Data

slide-10
SLIDE 10

10

Yi Chen --- January 23, 2006

XML: Representing Phylogenetic Trees

From the Tree of the Life Website, University of Arizona

Orangutan Gorilla Chimpanzee Human

slide-11
SLIDE 11

11

Yi Chen --- January 23, 2006

Challenges for RDBMS (II)

RDBMS Assumption: Data is clean and consistent. Reality: real world data is dirty

Data collected from different sources may have

missing and conflicting information

Data that is obtained from data mining is often not

error-prone

Experimental data often contains random errors

Requirements: we need to measure data quality and

handle imprecise and/or incomplete data

slide-12
SLIDE 12

12

Yi Chen --- January 23, 2006

Roadmap of This Talk

Managing XML by leveraging mature RDBMS

[Chen et al 04]

Introduction to XML A generic and efficient XML-to-RDBMS mapping

Data mapping from trees to tables Query translation from tree navigation queries

to SQL queries that are efficient

Handling imprecise and incomplete data in

DBMS [Chen et al 06]

slide-13
SLIDE 13

13

Yi Chen --- January 23, 2006

Sample XML Data

<books> <book> <title> The lord of the rings... </title> <section> <title> Locating middle-earth </title> ... </section> … </book> </books>

books book title section title section title figure description “The lord

  • f the

rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”

... ... ...

slide-14
SLIDE 14

14

Yi Chen --- January 23, 2006

...

Sample XML Queries

XML query languages are

based on hierarchical structure navigation (e.g. XPath)

Sample queries:

What are all the section

titles: //section/title

books book title section title section title figure description “The lord

  • f the

rings …” “Locating middle- earth” “King Theoden's golden hall”

... ...

“A hall fit for a king”

Descendant axis Child axis

slide-15
SLIDE 15

15

Yi Chen --- January 23, 2006

...

Sample XML Queries

XML query languages are

based on hierarchical structure navigation (e.g. XPath)

Sample queries:

What are all the section

titles: //section/title

What are the titles of

sections that contain a figure: //section[/figure]/title

books book title section title section title figure description “The lord

  • f the

rings …” “Locating middle- earth” “King Theoden's golden hall”

... ...

“A hall fit for a king”

Predicates

slide-16
SLIDE 16

16

Yi Chen --- January 23, 2006

How to Query XML Data efficiently?

RDBMS have achieved high performance in

query evaluation.

Can we leverage RDBMS by encoding XML to

tables?

slide-17
SLIDE 17

17

Yi Chen --- January 23, 2006

Analogy: Fourier Transforms

g * h = ∫∫-∞g(u)h(u)du G(f)H(f) Complex

+∞

Efficient

slide-18
SLIDE 18

18

Yi Chen --- January 23, 2006

Mapping XML Data to RDBMS

XPath XML data SQL Query Translation XML fragments Relational databases Storage Mapping

Challenge: How to build the bridge between hierarchies and tables?

slide-19
SLIDE 19

19

Yi Chen --- January 23, 2006

Data Mapping

ID Tag Value Structural Information 1 books 2 book 3 title The... 4 section 5 title Locating… … … … … T

(5)

books book title section title section title figure description “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”

(4) (3) (2) (1)

“The lord of the rings …”

Parent ID

[Florescu & Kossmann 99]

slide-20
SLIDE 20

20

Yi Chen --- January 23, 2006

Data Mapping

ID Tag Value Structural Information 1 books 2 book 3 title The... 4 section 5 title Locating… … … … … T

(5)

books book title section title section title figure description “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”

(4) (3) (2) (1)

“The lord of the rings …”

. Design special labels to encode node relationships

slide-21
SLIDE 21

21

Yi Chen --- January 23, 2006

Query Translator Architecture

How to choose XPath subqueries, such that:

they can be easily translated to SQL subqueries the SQL subqueries can be efficiently evaluated

How to combine SQL subqueries to a complete one?

XPath decomposition

XPath

Sub-query translation SQL sub-query composition

Query Translator SQL

slide-22
SLIDE 22

22

Yi Chen --- January 23, 2006

Query Translator

Q: //book[//figure]/section/title

section book figure title

slide-23
SLIDE 23

23

Yi Chen --- January 23, 2006

Query Translator: (I) Decomposition to Suffix Paths

Q: //book[//figure]/section/title

book figure section title book

slide-24
SLIDE 24

24

Yi Chen --- January 23, 2006

(342000,343000)

σ342000 ≤ Plabel ≤ 343000T

Encoding Suffix Paths Using P-labeling

//book/section/title

books book title section title section title figure description “The lord

  • f the

rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”

(1) (2) (3) (4)

... ... ...

(100)

id Plabel 1 100000 2 210000 3 321000 4 421000 5 342100 … …

T

(5)

books book title section title section title figure description “The lord

  • f the

rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”

(1) (2) (3) (4)

... ... ...

(100)

Evaluating suffix paths SQL selections on P-labels

slide-25
SLIDE 25

25

Yi Chen --- January 23, 2006

Query Translator: (II) Selection on P-labels

Q: //book[//figure]/section/title

book figure section title book

slide-26
SLIDE 26

26

Yi Chen --- January 23, 2006

D-labeling Scheme

books book title section title section title figure description “The lord

  • f the

rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”

(1, 20000, 1) (6, 1200, 2) (10,80,3) (81, 250,3)

... ... ...

(100, 200,4)

  • D-labeling is used to connect

suffix paths.

  • D-labels (start, end, depth)

can be used to detect ancestor-descendant relationships between nodes in a tree.

(120, 160, 5)

slide-27
SLIDE 27

27

Yi Chen --- January 23, 2006

A Bi-labeling Based Query Translation

//book//figure

books book title section title section title Figure (120, 160, 5) description “The lord

  • f the

rings …” “Locating middle- earth” “A hall fit for a king” “King Theoden's golden hall”

(1, 20000, 1) (6, 1200, 2) (10,80,3) (81, 250,3)

... ... ...

(100, 200,4)

(σPlabel for “//book”(ρT→T’))

T’.Start ≤ T.Start∧T’.End ≥ T.End

(σPlabel for “//figure”T )

Plabel Start End Depth Val 100000 1 20000 1 … 210000 6 1200 2 … 321000 10 80 3 … 421000 81 250 3 … 442100 100 200 4 … 554680 120 160 5 … … … … … … T

Stitching suffix paths SQL joins on D-labels

slide-28
SLIDE 28

28

Yi Chen --- January 23, 2006

Query Translator: (III) Join on D-labels

Q: //book[//figure]/section/title

book figure section title book

slide-29
SLIDE 29

29

Yi Chen --- January 23, 2006

Comparison with Previous Approach

book figure section title book

Ours: fewer disk accesses, fewer joins

book figure section title

Previous Approach

[Li & Moon 01, Zhang et al 01, Tatarinov et al 02, Grust 02, DeHaan et al 03, etc]

slide-30
SLIDE 30

30

Yi Chen --- January 23, 2006

Compare our system (BLAS) with the previous approach using D-labeling scheme only

Data sets Query sets

Suffix path queries Path queries XPath queries Benchmark queries

Query Engines: DB2, TwigStack Join [Bruno et al 02]

Experiment Setup

Data Set Size(MB) Nodes(K) Tags Depth DTD Protein 70 2277 66 7 Tree Shakespeare 26 640 19 7 Acyclic graph Auction 69 1238 77 12 Cyclic graph

slide-31
SLIDE 31

31

Yi Chen --- January 23, 2006

Query Execution Time

Query Name: A:Auction P: Protein S: Shakespeare 1: suffix path query 2: path query 3: XPath query

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% QA1 QA2 QA3 QP1 QP2 QP3 QS1 QS2 QS3 Queries

Time

D-labeling BLAS

slide-32
SLIDE 32

32

Yi Chen --- January 23, 2006

2 4 6 8 10 12 14 35 70 105 139 174

File Size (MB) Time (seconds) DLabeling BLAS

Benchmark data, Benchmark query Q3

Scalability

slide-33
SLIDE 33

33

Yi Chen --- January 23, 2006

Summary of XML Data Management

We proposed a generic XML-to-RDB mapping,

based on a bi-labeling scheme.

It is more efficient compared with previous

approach, since it generates SQL queries that require:

fewer disk accesses fewer joins fewer intermediate results

Experiments show the effectiveness

slide-34
SLIDE 34

34

Yi Chen --- January 23, 2006

Roadmap of This Talk

Managing XML by leveraging mature RDBMS

[Chen et al 04]

Introduction to XML A generic and efficient XML-to-RDBMS mapping

Data mapping from trees to tables Query translation from tree navigation queries

to SQL queries that are efficient

Handling imprecise and incomplete data in

DBMS [Chen et al 06]

slide-35
SLIDE 35

35

Yi Chen --- January 23, 2006

Probabilistic Databases: Managing Imprecise Data

How to measure the imprecision of the data? The simplest model: associate each tuple a

probability

Name Skill age Pr James Beginner 21 p1 Bob Experienced 33 p2 Climber Name Route Date Duration Pr Bob Last Tango 10/10/05 5 q1 Bob Last Tango 1/10/06 4.5 q2 Climbs

Assume that all the tuples are independent events

slide-36
SLIDE 36

36

Yi Chen --- January 23, 2006

How Does This Affect Query Evaluation? [Fuhr&Roellke 97]

σ

v p v p v1 p1 v1 v2 p1 p2 v2 p2

Π

v p1 v p2 v 1-(1-p1)(1-p2)

slide-37
SLIDE 37

37

Yi Chen --- January 23, 2006

LT 1-(1-p2q1)(1- p2q2)

Π

LT 1-(1-q1)(1-q2) Bob LT p2(1-(1-q1)(1-q2))

Wrong ! Correct

Challenges: Correctness Depends on Execution Order [Suciu et al 05]!

Name Skill Pr Bob Exp p2 Name Route Pr Bob LT q1 Bob LT q2

Find distinct climb routes that have been climbed by an experienced climber.

Name Route Pr Bob LT q1 Bob LT q2

Π

Name Skill Pr Bob Exp p2 Bob LT p2q1 Bob LT p2q2

slide-38
SLIDE 38

38

Yi Chen --- January 23, 2006

This can not be convertible !

How to Handle Incomplete Data?

Seller Make Model Body Style Year Price Mark BMW 325Cic convertible 2006 36000 Alice Ford Taurus 2001 8000

Our techniques infer possible values with probability by mining data statistics Find cars that are convertible and have price less than 10k.

slide-39
SLIDE 39

39

Yi Chen --- January 23, 2006

Querying Incomplete Databases

Given missing value prediction

Querying incomplete databases Querying probabilistic databases

What if we are not able to store the predicted

values? --- e.g. data integration applications

On-the-fly query rewriting

How should we handling imprecise and incomplete XML Data?

slide-40
SLIDE 40

40

Yi Chen --- January 23, 2006

Conclusions

Traditional RDBMS does not satisfy the requirements

in ever growing scientific and web applications

We have discussed two enhancement to RDBMS

Efficient XML data management Handling imprecise and incomplete data

Other enhancement to RDBMS that I am working on

Data stream processing Scientific workflow modeling and query processing

slide-41
SLIDE 41

41

Yi Chen --- January 23, 2006

Thank you!

Questions ?