
XML

Semi-structured Data
Extensible Markup Language
Document Type Definitions

2

Framework

1. Information Integration: Making databases from various places work as one.
2. Semi-structured Data: A new data model designed to cope with problems of information integration.
3. XML: A standard language for describing semi-structured data schemas and representing data.

3

1. Information Integration
Generally, databases in an enterprise have:
  Several underlying database management systems:
    Oracle, MS SQL Server, DB2, Informix, Sybase (SQL Server), MS Access, etc.
  Several underlying database schemas. Information in an employee table can contain:
    Employee Name, SSN, DOB, title, hrsPerWeek, modifiedTime, modifiedBy
    Employee Name, SSN, DOB, title, degree, createTime, createBy
    Employee Name, SSN, DOB, title, salary, modifiedTime, modifiedBy, createTime, createBy

4

2. Semi-structured Data
A new data model designed to cope with problems of information integration.
  Accommodates different DBMSs.
  Integrates different schemas.
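One way to picture how a semi-structured model integrates different schemas is to let every record carry its own labels. The sketch below is only an illustration: the field names are a subset of those in the three employee tables listed earlier, while the values and the helper names labels_in_use and select are invented for the example.

```python
# Each record carries its own set of labels; no shared table schema is required.
# Field names follow the three employee tables above; all values are invented.
employees = [
    {"name": "A. Smith", "title": "Engineer", "hrsPerWeek": 40, "modifiedBy": "hr1"},
    {"name": "B. Jones", "title": "Analyst", "degree": "MS", "createBy": "hr2"},
    {"name": "C. Lee", "title": "Manager", "salary": 90000,
     "createBy": "hr1", "modifiedBy": "hr3"},
]


def labels_in_use(records):
    """Collect every label that appears on at least one record."""
    seen = set()
    for record in records:
        seen.update(record)
    return seen


def select(records, label):
    """Return (name, value) pairs for the records that carry the given label."""
    return [(r["name"], r[label]) for r in records if label in r]
```

A query such as select(employees, "degree") simply skips records that lack the label, instead of forcing every source into one fixed set of columns.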

5

3. XML
XML: A standard language for describing semi-structured data schemas and representing data.

6

The Information-Integration Problem

Major bottleneck in enterprise application integration.
For example:
  Hewlett Packard split into HP and Agilent.
  HP bought Compaq.
Need to integrate data from different sources.


7

The Information-Integration Problem

Related data exists in many places and could, in principle, work together.
But different databases differ in:
1. Model (relational, object-oriented?).
2. Schema (normalized/unnormalized?).
3. Terminology: are consultants employees? Retirees? Subcontractors?
4. Conventions (meters versus feet?).

8

Example

Consider the merger of three stores in a mall.
There is some overlap in the products sold, but the databases are different.

9

Example

Every store has a database.
  One may use a relational DBMS; another keeps the menu in an MS-Word document.
  One stores the phone numbers of distributors; another does not.
  One distinguishes products in one department and another doesn’t.
  One counts inventory by number of items, another by cases.

10

Two Approaches to Integration

1. Warehousing
   Makes a copy of the data.
   The more developed of the two.
2. Mediation
   Creates a view of the data.
   Newer and less developed.

11

Warehousing

Make copies of the data sources at a central site and transform them to a common schema.
  Reconstruct the data daily or weekly.
  Do not try to keep it more up-to-date than that.
Pro:
  Very well developed, and several commercial tools are available.
Con:
  Data can be old, since updates are expensive.

12

Mediation

Create a view of all sources, as if they were integrated.
Answer a view query by translating it to the terminology of the sources and querying them.
Pro:
  Current data.
Con:
  Can be slow.
  Limited availability of tools.


13

Warehouse Diagram

[Figure: a Warehouse connected to two Wrappers, one in front of Source 1 and one in front of Source 2.]

14

A Mediator

[Figure: a user query goes to the Mediator, which sends queries to two Wrappers, one for Source 1 and one for Source 2. Each wrapper queries its source, and results flow back along the same path to the user.]
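The mediator flow (user query, wrappers, sources, results) can be sketched in Python. This is a minimal illustration under stated assumptions, not a real integration tool: the two in-memory sources, their record shapes, the cents-versus-dollars convention, and the names wrapper_1, wrapper_2, and mediator are all invented for the example.

```python
# Hypothetical sources: each store keeps soda prices in its own shape and units.
SOURCE_1 = [{"soda": "Pepsi", "price_usd": 1.00}, {"soda": "Sobe", "price_usd": 2.00}]
SOURCE_2 = [("Pepsi", 95), ("Sierra Mist", 100)]  # (name, price in cents)


def wrapper_1(name):
    """Translate the mediator's query into Source 1's terms; return common-schema rows."""
    return [{"name": r["soda"], "price": r["price_usd"], "source": "Source 1"}
            for r in SOURCE_1 if r["soda"] == name]


def wrapper_2(name):
    """Same query against Source 2, converting cents to dollars on the way back."""
    return [{"name": n, "price": cents / 100, "source": "Source 2"}
            for n, cents in SOURCE_2 if n == name]


def mediator(name):
    """Answer a user query by asking every wrapper at query time and merging the results."""
    results = []
    for wrapper in (wrapper_1, wrapper_2):
        results.extend(wrapper(name))
    return results
```

Nothing is copied ahead of time, so the answer reflects the sources as they are when the query runs. A warehouse would instead run the same translations in a periodic batch and answer queries from the stored copy.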

15

Semi-structured: Motivation

Most effective approach to information integration: the semi-structured data model, or semi-structured objects.

16

Semi-structured: Motivation

Main limitation of object-oriented models: object models are strongly typed.
  Objects of a class have one structure only.
The semi-structured approach solves this problem.

17

Semi-structured Data

Purpose: Represent data from independent sources more flexibly than either relational or object-oriented models.

18

Semi-structured Data

Each object has a class of its own, and its properties are defined by whatever labels are attached to that object.
Properties mean attributes, relationships, methods, etc.


19

Semi-structured Data

Think of objects, but with the type of each object its own business, not that of its “class.”
Labels indicate the meaning of substructures.

20

Semi-structured Graphs

It is easy to think of semi-structured data as graphs.
Nodes = objects.
Labels on arcs:
  Attributes leading to a leaf node.
  Relationships leading to another node.

21

Semi-structured Graphs

Atomic values at leaf nodes (nodes with no arcs out).
Flexibility: no restriction on:
  Labels out of a node.
  Number of successors with a given label.

22

Example: Data Graph

[Figure: data graph. Node values: Pepsi, PepsiCo, BestSeller, 2003, Main St, KFC, Sobe. Arc labels: root, soda, soda, rest, manf, manf, sellsAt, name, name, name, addr, prize, year, award.]
Callouts:
  The restaurant object for KFC (arc in labeled rest; arc out labeled name, leading to KFC).
  The soda object for Pepsi (arc in labeled soda; arc out labeled name, leading to Pepsi).
  Notice a new kind of data.
Root object represents the entire DB.
Data graphs often look like trees, but are not.
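A data graph of this kind can be held in a few lines of Python. The sketch below is illustrative only: the Node class is hypothetical, and which node each arc connects is an assumption based on the labels and callouts in the example, covering only part of the graph.

```python
class Node:
    """An object in a semi-structured data graph."""

    def __init__(self, value=None):
        self.value = value   # atomic value; only meaningful at a leaf
        self.arcs = []       # list of (label, target Node) pairs

    def add(self, label, target):
        """Attach an outgoing arc with the given label and return its target."""
        self.arcs.append((label, target))
        return target

    def follow(self, label):
        """Return every successor reached by an arc with this label."""
        return [t for lbl, t in self.arcs if lbl == label]

    def is_leaf(self):
        return not self.arcs


root = Node()                            # the root object represents the entire DB
pepsi = root.add("soda", Node())
pepsi.add("name", Node("Pepsi"))
sobe = root.add("soda", Node())          # a second arc with the same label
sobe.add("name", Node("Sobe"))
kfc = root.add("rest", Node())
kfc.add("name", Node("KFC"))
kfc.add("addr", Node("Main St"))         # only this object has an addr arc
pepsi.add("sellsAt", kfc)                # assumed endpoints; kfc now has two arcs in
```

The last line is what keeps the structure from being a tree: the KFC object is reachable both from the root and from the Pepsi object.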

23

XML

XML = Extensible Markup Language.
While HTML uses tags for formatting (e.g., “italic”), XML uses tags for semantics (e.g., “this is an address”).
Key idea: create tag sets for a domain (e.g., genomics), and translate all data into properly tagged XML documents.

24

Well-Formed and Valid XML

Well-Formed XML allows you to invent your own tags.
  Similar to labels in a semi-structured data graph.
Valid XML involves a DTD (Document Type Definition), which
  gives a grammar for the use of labels;
  limits the set of labels out of a node;
  limits the order and number of times a label occurs.


25

Well-Formed XML: Header

Start the document with a declaration, surrounded by <? … ?>.
Normal declaration for Well-Formed XML is:
  <?xml version="1.0" standalone="yes"?>
version indicates the version number.
standalone="yes" means no DTD is provided.

26

Well-Formed XML: Body

Body of the document is a root tag surrounding nested tags.
Body can include:
  several properly matching tags (as in HTML structure)
Root tag can
  have a special meaning, such as the document type,
  or be generic.

27

Example: Well-Formed XML

<?xml version="1.0" standalone="yes"?>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA>
      <NAME>Pepsi</NAME>
      <PRICE>1.00</PRICE>
    </SODA>
    <SODA>
      <NAME>Sobe</NAME>
      <PRICE>2.00</PRICE>
    </SODA>
  </REST>
  <REST> … </REST>
  …
</RESTS>

Root tag RESTS surrounds the entire document.
Each of the several nested REST tags represents information about a single REST.
  The <NAME> tag specifies the REST name.
  The <SODA> tags hold the name and price of each soda, nested in <NAME> and <PRICE> tags.
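A quick way to check well-formedness and read the nested tags is Python's standard xml.etree.ElementTree module. The sketch below uses a shortened copy of the example document with the elided parts left out; the helper name is_well_formed is an assumption made for the illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" standalone="yes"?>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
    <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
  </REST>
</RESTS>"""


def is_well_formed(text):
    """Return True if the parser accepts the text, i.e. tags nest and match properly."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)   # the root element is RESTS

# Walk every SODA element and pull out its nested NAME and PRICE text.
prices = {soda.findtext("NAME"): float(soda.findtext("PRICE"))
          for soda in root.iter("SODA")}
```

The parser only checks well-formedness here. It does not know which tags are allowed, which is the job of a DTD.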

29

XML and Semi-structured Data

Well-Formed XML documents with nested tags are exactly the same idea as trees of semi-structured data.
  Tags are the labels on edges.
  Nodes represent data between matching tags, following the nesting in XML.

30

XML and Semi-structured Data

The semi-structured approach allows for non-tree structures.
We shall see that XML also enables non-tree structures, as does the semi-structured data model.


31

Exercise

Convert the following into a semi-structured representation:

<?xml version="1.0" standalone="yes"?>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA>
      <NAME>Pepsi</NAME>
      <PRICE>1.00</PRICE>
    </SODA>
    <SODA>
      <NAME>Sobe</NAME>
      <PRICE>2.00</PRICE>
    </SODA>
  </REST>
  <REST> … </REST>
  …
</RESTS>

32

Solution

The <RESTS> XML document is:
[Figure: tree rooted at RESTS with REST children. One REST has a NAME leaf (KFC) and two SODA children, each with a NAME leaf and a PRICE leaf: Pepsi, 1.00 and Sobe, 2.00.]

Note: Data is stored in leaf nodes and structure (tags) in internal nodes.
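The conversion in this exercise can also be done mechanically. The sketch below is one possible approach, not the slides' own method: it parses the exercise document (elided parts left out) with Python's standard xml.etree.ElementTree and builds nested (label, content) pairs. The function names to_tree and render are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
    <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
  </REST>
</RESTS>"""


def to_tree(elem):
    """Turn an element into (label, value) at a leaf or (label, [children]) inside."""
    children = list(elem)
    if not children:
        return (elem.tag, (elem.text or "").strip())
    return (elem.tag, [to_tree(child) for child in children])


def render(node, depth=0):
    """Return indented text lines, one per node, with leaf data after the label."""
    label, content = node
    if isinstance(content, str):
        return ["  " * depth + f"{label}: {content}"]
    lines = ["  " * depth + label]
    for child in content:
        lines.extend(render(child, depth + 1))
    return lines


tree = to_tree(ET.fromstring(DOC))
```

Calling print("\n".join(render(tree))) shows the same shape as the solution: data appears only at the leaves, and the tags supply the structure.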

33

Document Type Definitions

Most interesting use of XML: Valid XML.
A DTD is essentially a context-free grammar for describing XML tags and their nesting.
Each domain of interest creates one DTD that describes all the documents this group will share.
  For example, electronic components, the travel industry, etc., will have their own DTDs.

34

DTD Structure

<!DOCTYPE <root tag> [
  <!ELEMENT <name> ( <components> )>
  <more elements>
]>

Note:
  !DOCTYPE is a keyword, with <root tag> being the name of the DOCTYPE.
  Between [ … ] is a list of ELEMENT definitions.
  Each !ELEMENT has a <name> with the allowed list of <components>, usually in the order listed.

35

DTD Elements

An element definition consists of its name (tag) and a parenthesized description of any nested tags.
  The description includes the order of subtags and their multiplicity (0, 1, or many times).
Leaves (text elements) have #PCDATA in place of nested tags.

36

Example: DTD

<!DOCTYPE RESTS [
  <!ELEMENT RESTS (REST*)>
  <!ELEMENT REST (NAME, SODA+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT SODA (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>

RESTS can have * (0 or more) REST.
REST has NAME and then + (1 or more) SODA. Order matters!
SODA has NAME followed by PRICE.
NAME and PRICE are data (#PCDATA): no more tags, just text.


37

Element Description Rules

Subtags must appear in the order shown.
A tag may be followed by a symbol to indicate its multiplicity, identical to UNIX regular expressions:
  * = zero or more.
  + = one or more.
  ? = zero or one.
Symbol | can connect alternative sequences of tags.

38

Example: Element Description

A name is either
  an optional title (e.g., “Dr.”), a first name, and a last name, in that order,
  or an IP address:
<!ELEMENT NAME ( (TITLE?, FIRST, LAST) | IPADDR )>
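Because element descriptions use the same operators as regular expressions, one can sketch a checker by compiling a description into a Python regular expression over the sequence of child tag names. This is a simplified illustration, not how a real validating parser works: the function names are hypothetical, and the sketch handles only descriptions built from tag names, commas, |, parentheses, and the *, +, ? symbols (not #PCDATA).

```python
import re


def model_to_regex(model):
    """Compile an element description such as '(NAME, SODA+)' into a regex.

    Each tag name becomes a group that matches the name followed by one space,
    so the regex can be matched against child tags joined with trailing spaces.
    """
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|*+?]", model):
        if token == ",":
            continue                      # a sequence is plain concatenation
        elif token == "(":
            parts.append("(?:")           # grouping only, no capture needed
        elif token in ")|*+?":
            parts.append(token)           # same meaning in both notations
        else:
            parts.append(f"(?:{re.escape(token)} )")
    return re.compile("".join(parts))


def children_match(model, child_tags):
    """Return True if the list of child tag names satisfies the description."""
    text = "".join(tag + " " for tag in child_tags)
    return model_to_regex(model).fullmatch(text) is not None


NAME_MODEL = "((TITLE?, FIRST, LAST) | IPADDR)"
```

For instance, children_match(NAME_MODEL, ["FIRST", "LAST"]) holds because TITLE is optional, while a TITLE followed directly by an IPADDR fits neither alternative.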

39

Use of DTDs

To specify that a document follows a particular DTD:
1. Set standalone="no".
2. Either:
   a) Include the DTD as a preamble of the XML document, or
   b) Follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD is stored.

40

Example (a)

<?xml version="1.0" standalone="no"?>
<!DOCTYPE RESTS [
  <!ELEMENT RESTS (REST*)>
  <!ELEMENT REST (NAME, SODA+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT SODA (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA>
      <NAME>Pepsi</NAME>
      <PRICE>1.00</PRICE>
    </SODA>
    <SODA>
      <NAME>Sobe</NAME>
      <PRICE>2.00</PRICE>
    </SODA>
  </REST>
  <REST> …
</RESTS>

DTD: the <!DOCTYPE … ]> preamble.
Document: same as earlier, but this time it conforms to the above DTD.

41

Example (b)

Assume the RESTS DTD is in file rest.dtd.

<?xml version="1.0" standalone="no"?>
<!DOCTYPE RESTS SYSTEM "rest.dtd">
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA>
      <NAME>Pepsi</NAME>
      <PRICE>1.00</PRICE>
    </SODA>
    <SODA>
      <NAME>Sobe</NAME>
      <PRICE>2.00</PRICE>
    </SODA>
  </REST>
  <REST> …
</RESTS>

The DOCTYPE line says to get the DTD from the file rest.dtd.
Document: same as earlier, but this time it conforms to the DTD in rest.dtd.

42

Attributes

Attributes are another important component of DTDs and XML documents.
Opening tags in XML can have attributes, like <A HREF="…"> in HTML.
In a DTD, <!ATTLIST <element name> … > gives a list of attributes and their data types for this element.


43

Example: Attributes

Rests can have an attribute kind, which is either qsr, family, or other.
The element definition is unchanged. However, we add an ATTLIST.

<!ELEMENT REST (NAME, SODA+)>
<!ATTLIST REST kind (qsr | family | other) #IMPLIED>

44

Example: Attribute Use

In a document that allows REST tags, we might see:

<REST kind="qsr">
  <NAME>KFC</NAME>
  <SODA><NAME>Sierra Mist</NAME><PRICE>1.00</PRICE></SODA>
  ...
</REST>

New info: kind="qsr"
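Reading such an attribute from a program is straightforward with Python's standard xml.etree.ElementTree. The sketch below is a small illustration with the elided part of the element left out. ElementTree does not enforce a DTD, so the allowed values are restated by hand, and treating a missing kind as acceptable is an assumption of the example.

```python
import xml.etree.ElementTree as ET

ALLOWED_KINDS = {"qsr", "family", "other"}   # the three values named for kind

rest = ET.fromstring(
    '<REST kind="qsr"><NAME>KFC</NAME>'
    '<SODA><NAME>Sierra Mist</NAME><PRICE>1.00</PRICE></SODA></REST>'
)

kind = rest.get("kind")                      # None if the attribute is absent
# Assumption: the attribute is optional, so a missing kind is accepted.
kind_ok = kind is None or kind in ALLOWED_KINDS
```

Attributes hold data about the element itself, here the kind of restaurant, while nested tags such as NAME and SODA hold its substructure.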

45

IDs and IDREFs

Introduce links from one object to another.
Allow the structure of an XML document to be a general graph, rather than just a tree.
These are pointers from one object to another, in analogy to HTML’s NAME = “blah” and HREF = “#blah”.

46

Creating IDs

We give an element Elephant an attribute Attention of type ID in the DTD.
When using tag <Elephant> in an XML document, give its attribute Attention a unique value. For example,

47

Creating IDREFs

IDREFs are similar to IDs:
  To allow objects of type Fig to refer to another object with an ID attribute, give Fig an attribute of type IDREF (a single string of type ID).
  Or, let the attribute have type IDREFS, so the Fig object can refer to any number of other objects (any number of strings of type ID).

48

Example: IDs and IDREFs

Let’s redesign our RESTS DTD to include both REST and SODA sub-elements.
Both rests and sodas will have ID attributes called name.
Rests have PRICE sub-objects, consisting of a number (the price of one soda) and an IDREF theSoda leading to that soda.
Sodas have attribute soldBy, which is an IDREFS leading to all the rests that sell it.


49

The DTD

<!DOCTYPE RESTS [
  <!ELEMENT RESTS (REST*, SODA*)>
  <!ELEMENT REST (PRICE+)>
  <!ATTLIST REST name ID #REQUIRED>
  <!ELEMENT PRICE (#PCDATA)>
  <!ATTLIST PRICE theSoda IDREF #REQUIRED>
  <!ELEMENT SODA EMPTY>
  <!ATTLIST SODA name ID #REQUIRED
                 soldBy IDREFS #IMPLIED>
]>

RESTS have 0 or more REST and 0 or more SODA sub-objects.
REST objects have name as an ID attribute and have one or more PRICE sub-objects.
PRICE objects have a number (the price) and one reference to a soda.
SODA objects have an ID attribute called name, and a soldBy attribute that is a set of REST names.

50

Example Document

<RESTS>
  <REST name="KFC">
    <PRICE theSoda="Pepsi">1.00</PRICE>
    <PRICE theSoda="Sobe">2.00</PRICE>
  </REST>
  …
  <SODA name="Pepsi" soldBy="KFC TacoBell …"></SODA>
  …
</RESTS>
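Following these references in a program amounts to building an index from ID values to elements. The sketch below is an illustration using Python's standard xml.etree.ElementTree, which does not read the DTD, so the code states by hand that name is the ID attribute. The TacoBell entry, its price, and the Sobe SODA element fill in parts the example elides and are invented for the illustration; the helper name resolve is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<RESTS>
  <REST name="KFC">
    <PRICE theSoda="Pepsi">1.00</PRICE>
    <PRICE theSoda="Sobe">2.00</PRICE>
  </REST>
  <REST name="TacoBell">
    <PRICE theSoda="Pepsi">1.25</PRICE>
  </REST>
  <SODA name="Pepsi" soldBy="KFC TacoBell"/>
  <SODA name="Sobe" soldBy="KFC"/>
</RESTS>"""

root = ET.fromstring(DOC)

# Index every element that carries the ID attribute; ID values must be unique.
by_id = {}
for elem in root.iter():
    ident = elem.get("name")
    if ident is not None:
        if ident in by_id:
            raise ValueError(f"duplicate ID: {ident}")
        by_id[ident] = elem


def resolve(idrefs):
    """Turn an IDREF or IDREFS value (space-separated IDs) into the elements referred to."""
    return [by_id[token] for token in idrefs.split()]


# Follow soldBy from the Pepsi object to the rests that sell it.
sellers = [r.get("name") for r in resolve(by_id["Pepsi"].get("soldBy"))]

# Follow theSoda from each of KFC's PRICE sub-objects to the soda it prices.
kfc_menu = {resolve(p.get("theSoda"))[0].get("name"): float(p.text)
            for p in by_id["KFC"].findall("PRICE")}
```

The references run in both directions, from rests to sodas and back, so the document as a whole describes a graph even though its tags nest as a tree.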
