
XML

Semi-structured Data
Extensible Markup Language
Document Type Definitions

Framework

1. Information Integration: Making databases from various places work as one.

2. Semi-structured Data: A new data model designed to cope with problems of information integration.

3. XML: A standard language for describing semi-structured data schemas and representing data.

1. Information Integration

  • Databases in an enterprise generally have:
  • Several underlying database management systems
      • Oracle, MS SQL Server, DB2, Informix, Sybase (SQL Server), MS Access, etc.
  • Several underlying database schemas
      • Information in an employee table can contain
          • Employee Name, SSN, DOB, title, hrsPerWeek, modifiedTime, modifiedBy
          • Employee Name, SSN, DOB, title, degree, createTime, createBy
          • Employee Name, SSN, DOB, title, salary, modifiedTime, modifiedBy, createTime, createBy

2. Semi-structured Data

  • A new data model designed to cope with problems of information integration
      • Accommodates different DBMSs
      • Integrates different schemas

3. XML

  • XML: A standard language for describing semi-structured data schemas and representing data.

The Information-Integration Problem

  • Major bottleneck in enterprise application integration
  • For example,
      • Hewlett Packard split into HP and Agilent
      • HP bought Compaq
  • Need to integrate data from different sources

The Information-Integration Problem

  • Related data exists in many places and could, in principle, work together.
  • But different databases differ in:

1. Model (relational, object-oriented?).

2. Schema (normalized/unnormalized?).

3. Terminology: are consultants employees? Retirees? Subcontractors?

4. Conventions (meters versus feet?).

Example

Consider the merger of three stores in a mall.

There is some overlap in the products sold, but the databases are different.

Example

Every store has a database.

One may use a relational DBMS; another keeps the menu in an MS-Word document.

One stores the phones of distributors, another does not.

One distinguishes products by department, another doesn’t.

One counts inventory by number of items, another by cases.

Two Approaches to Integration

1. Warehousing

  • Makes a copy of the data
  • More developed of the two

2. Mediation

  • Creates a view of the data
  • Newer and less developed

Warehousing

  • Make copies of the data sources at a central site and transform them to a common schema.
  • Reconstruct the data daily/weekly
  • Do not try to keep it more up-to-date than that.
  • Pro:
      • Very well developed, and several commercial tools are available
  • Con:
      • Data can be old since updates are expensive
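
The warehousing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two stores, their field names, the helper functions, and the conversion factor of 24 items per case are all hypothetical and do not come from the slides.

```python
# Hypothetical sources with different schemas and conventions.
STORE_A = [{"product": "Pepsi", "items": 48}]                 # counts single items
STORE_B = [{"name": "Pepsi", "cases": 3},
           {"name": "Sobe", "cases": 1}]                      # counts cases

ITEMS_PER_CASE = 24  # assumed conversion factor, for illustration only


def wrap_a(rows):
    """Wrapper for store A: already close to the common schema."""
    return [{"product": r["product"], "items": r["items"], "source": "A"}
            for r in rows]


def wrap_b(rows):
    """Wrapper for store B: rename 'name' and convert cases to items."""
    return [{"product": r["name"], "items": r["cases"] * ITEMS_PER_CASE,
             "source": "B"} for r in rows]


def rebuild_warehouse():
    """Copy every source through its wrapper into one common table."""
    return wrap_a(STORE_A) + wrap_b(STORE_B)


def total_items(warehouse, product):
    """Query the warehouse copy only; the sources are not contacted."""
    return sum(r["items"] for r in warehouse if r["product"] == product)


warehouse = rebuild_warehouse()
STORE_A[0]["items"] = 10  # a source changes after the copy was made
stale_total = total_items(warehouse, "Pepsi")            # still the old copy
fresh_total = total_items(rebuild_warehouse(), "Pepsi")  # after the next rebuild
print(stale_total, fresh_total)
```

The gap between the two totals is the "data can be old" drawback: queries are answered from the copy until the next rebuild.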

Mediation

  • Create a view of all sources, as if they were integrated.
  • Answer a view query by translating it to the terminology of the sources and querying them.
  • Pro:
      • Current data
  • Con:
      • Can be slow
      • Limited availability of tools
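
A mediator can be sketched in the same spirit. Again this is a minimal illustration, not a real mediation system: the source lists, the wrapper functions, and the 24-items-per-case factor are assumptions made up for the example.

```python
ITEMS_PER_CASE = 24  # assumed conversion factor, for illustration only

# Hypothetical live sources; nothing is copied out of them ahead of time.
store_a = [{"product": "Pepsi", "items": 48}]
store_b = [{"name": "Pepsi", "cases": 3}]


def query_a(product):
    """Wrapper A: the source already uses the mediator's terminology."""
    return sum(r["items"] for r in store_a if r["product"] == product)


def query_b(product):
    """Wrapper B: translate 'product' to 'name' and cases to items."""
    return sum(r["cases"] * ITEMS_PER_CASE
               for r in store_b if r["name"] == product)


def mediator_total_items(product):
    """Answer a view query by querying every source at request time."""
    return query_a(product) + query_b(product)


before = mediator_total_items("Pepsi")
store_a[0]["items"] = 10                  # the source changes...
after = mediator_total_items("Pepsi")     # ...and the next query sees it
print(before, after)
```

Every query pays the cost of contacting each source, which is why mediation gives current data but can be slow.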

Warehouse Diagram

[Figure: a Warehouse fed by two Wrappers, one for Source 1 and one for Source 2.]

A Mediator

[Figure: a user query enters the Mediator, which sends queries to two Wrappers; each Wrapper queries its source (Source 1, Source 2), and results flow back along the same path to the user.]

Semi-structured: Motivation

Most effective approach to Information Integration:

Semi-structured Data Model or Semi-structured Objects

Semi-structured: Motivation

Main limitation of Object-Oriented Models: Object Models are Strongly Typed

Objects of a class have one structure only

Semi-structured approach solves this problem

Semi-structured Data

Purpose:

Represent data from independent sources more flexibly than either relational or object-oriented models.

Semi-structured Data

Each object has a class of its own, and its properties are defined by whatever labels are attached to that object.

Properties mean attributes, relationships, methods, etc.

Semi-structured Data

Think of objects, but with the type of each object its own business, not that of its “class.”

Labels indicate the meaning of substructures.

Semi-structured Graphs

It is easy to think of semi-structured data as graphs.

Nodes = objects.

Labels on arcs:

  • Attributes leading to a leaf node.
  • Relationships leading to another node.

Semi-structured Graphs

Atomic values at leaf nodes (nodes with no arcs out).

Flexibility: no restriction on:

  • Labels out of a node.
  • Number of successors with a given label.

Example: Data Graph

[Figure: data graph. Node values: Pepsi, PepsiCo, BestSeller, 2003, Main St, KFC, Sobe. Arc labels: root, soda, soda, rest, manf, manf, sellsAt, name, name, name, addr, prize, year, award.]

  • The restaurant object for KFC (arc in labeled rest; arc out labeled name to KFC).
  • The soda object for Pepsi (arc in labeled soda; arc out labeled name to Pepsi).
  • Notice a new kind of data.
  • Root object represents the entire DB.
  • Data graphs often look like trees, but are not.
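
A data graph of this kind can be modeled in Python as nodes holding labeled arcs. The sketch below is an assumption-heavy illustration: the `Node` class is hypothetical, and because the figure's layout did not survive extraction, the choice of which node each arc connects is a guess based only on the labels and values listed above.

```python
class Node:
    """An object in a semi-structured data graph: a list of labeled arcs."""

    def __init__(self):
        # (label, target) pairs; a target is another Node or an atomic value.
        self.arcs = []

    def add(self, label, target):
        self.arcs.append((label, target))
        return self

    def get(self, label):
        """All successors reached by arcs with this label (possibly none)."""
        return [target for (lbl, target) in self.arcs if lbl == label]


# Assumed layout: the arc endpoints below are guesses, not read off the figure.
kfc = Node().add("name", "KFC").add("addr", "Main St")
prize = Node().add("year", 2003).add("award", "BestSeller")
pepsico = Node().add("name", "PepsiCo").add("prize", prize)
pepsi = Node().add("name", "Pepsi").add("manf", pepsico).add("sellsAt", kfc)
sobe = Node().add("name", "Sobe").add("manf", pepsico)
root = Node().add("soda", pepsi).add("soda", sobe).add("rest", kfc)

# Two arcs with the same label leave the root; no schema forbids that.
soda_names = [s.get("name")[0] for s in root.get("soda")]
# One node has two incoming arcs, so the structure is a graph, not a tree.
shared = pepsi.get("manf")[0] is sobe.get("manf")[0]
print(soda_names, shared)
```

Nothing forces two objects reached by the same label to carry the same set of arcs: here only one soda has a `sellsAt` arc, which is the flexibility the slides describe.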

XML

XML = Extensible Markup Language.

While HTML uses tags for formatting (e.g., “italic”), XML uses tags for semantics (e.g., “this is an address”).

Key idea: create tag sets for a domain (e.g., genomics), and translate all data into properly tagged XML documents.

Well-Formed and Valid XML

Well-Formed XML allows you to invent your own tags.

  • Similar to labels in a semi-structured data graph.

Valid XML involves a DTD (Document Type Definition), which

  • gives a grammar for the use of labels
  • limits the set of labels out of a node
  • limits the order and number of times a label occurs

Well-Formed XML: Header

Start the document with a declaration, surrounded by <? … ?>.

Normal declaration for Well-Formed XML is:

<?xml version="1.0" standalone="yes"?>

  • version indicates the version number
  • standalone="yes" means no DTD is provided.

Well-Formed XML: Body

Body of the document is a root tag surrounding nested tags.

Body can include:

  • several properly matching tags (as in HTML structure)

Root tag can

  • have a special meaning, such as the document type
  • or can be generic

Tags

Tags, as in HTML, are normally matched pairs, as <BLAH> … </BLAH>.

Tags may be nested arbitrarily.

Some tags requiring no matching ender, such as <P> in HTML, are also permitted; in XML such a tag is written <P/>.

  • However, we will not use these in examples.

Example: Well-Formed XML

<?xml version="1.0" standalone="yes"?>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA>
      <NAME>Pepsi</NAME>
      <PRICE>1.00</PRICE>
    </SODA>
    <SODA>
      <NAME>Sobe</NAME>
      <PRICE>2.00</PRICE>
    </SODA>
  </REST>
  <REST> … </REST>
  …
</RESTS>

  • Root tag RESTS surrounds the entire document.
  • One of several nested REST tags, each representing information about a single REST.
  • The <NAME> tag specifies the REST name.
  • <SODA> tags hold the name and price of each soda, nested in <NAME> and <PRICE> tags.
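
Well-formedness can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` parser, which rejects documents whose tags do not match. The helper name `is_well_formed` is hypothetical, and the document is a trimmed copy of the example with the elided REST elements left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example; the elided REST elements are omitted.
DOC = """<?xml version="1.0" standalone="yes"?>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>
    <SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>
  </REST>
</RESTS>"""


def is_well_formed(text):
    """True if the parser accepts the text, i.e. all tags nest and match."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)
# Walk every SODA element and read its NAME and PRICE children.
prices = {soda.findtext("NAME"): float(soda.findtext("PRICE"))
          for soda in root.iter("SODA")}
print(prices)
print(is_well_formed("<REST><NAME>KFC</REST>"))  # NAME is never closed
```

Note that this only tests well-formedness. `ElementTree` does not validate against a DTD, so it says nothing about whether a document is valid.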

XML and Semi-structured Data

A well-formed XML document with nested tags is exactly the same idea as a tree of semi-structured data.

  • Tags are the labels on edges
  • Nodes represent the data between matching tags
  • The parent-child relationship is immediate nesting in XML

XML and Semi-structured Data

The semi-structured approach allows for non-tree structures.

We shall see that XML also enables non-tree structures, as does the semi-structured data model.

Exercise

Convert the following into a semi-structured representation:

<?xml version="1.0" standalone="yes"?>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA>
      <NAME>Pepsi</NAME>
      <PRICE>1.00</PRICE>
    </SODA>
    <SODA>
      <NAME>Sobe</NAME>
      <PRICE>2.00</PRICE>
    </SODA>
  </REST>
  <REST> … </REST>
  …
</RESTS>

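Solution

The <RESTS> XML document is:

[Figure: tree rooted at RESTS with REST children (one shown in full, the others elided as “. . .”). The expanded REST has a NAME leaf (KFC) and two SODA children, each with a NAME and a PRICE leaf: Pepsi, 1.00 and Sobe, 2.00.]

Note: Data is stored in leaf nodes and structure (tags) in internal nodes.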
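
The conversion in this exercise can also be done programmatically. The following sketch, using Python's standard `xml.etree.ElementTree`, turns a document into nested `(tag, children)` pairs with the data in the leaves. The functions `to_tree` and `show` are hypothetical helpers, and the input is a single-REST document built from the values shown in the solution.

```python
import xml.etree.ElementTree as ET


def to_tree(elem):
    """Return (tag, children) for internal nodes and (tag, text) for leaves."""
    children = list(elem)
    if not children:
        return (elem.tag, (elem.text or "").strip())
    return (elem.tag, [to_tree(child) for child in children])


def show(node, depth=0):
    """Render the tree as indented lines, one node per line."""
    tag, body = node
    if isinstance(body, str):
        return ["  " * depth + f"{tag}: {body}"]
    lines = ["  " * depth + tag]
    for child in body:
        lines.extend(show(child, depth + 1))
    return lines


DOC = (
    "<RESTS><REST><NAME>KFC</NAME>"
    "<SODA><NAME>Pepsi</NAME><PRICE>1.00</PRICE></SODA>"
    "<SODA><NAME>Sobe</NAME><PRICE>2.00</PRICE></SODA>"
    "</REST></RESTS>"
)

tree = to_tree(ET.fromstring(DOC))
print("\n".join(show(tree)))
```

Internal nodes carry only a tag and a list of children, while text appears only at the leaves, matching the note above.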

Document Type Definitions

Most interesting use of XML: Valid XML

A DTD is essentially a context-free grammar for describing XML tags and their nesting.

Each domain of interest creates one DTD that describes all the documents this group will share.

  • For example, electronic components, the travel industry, etc., will have their own DTDs

DTD Structure

<!DOCTYPE <root tag> [
  <!ELEMENT <name> ( <components> )>
  <more elements>
]>

Note:

  • !DOCTYPE is a keyword, with <root tag> being the name of the DOCTYPE
  • Between [ … ] is a list of ELEMENT definitions
  • Each !ELEMENT has a <name> with the allowed list of <components>, usually in the order listed

DTD Elements

An element definition consists of its name (tag) and a parenthesized description of any nested tags.

  • Includes the order of subtags and their multiplicity (0, 1, many times).

Leaves (text elements) have #PCDATA in place of nested tags.

Example: DTD

<!DOCTYPE RESTS [
  <!ELEMENT RESTS (REST*)>
  <!ELEMENT REST (NAME, SODA+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT SODA (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>

  • RESTS can have * (0 or more) REST.
  • REST has NAME and then + (1 or more) SODA. Order matters!
  • NAME and PRICE are data (#PCDATA): no more tags, just text.
  • SODA has NAME followed by PRICE.

Element Description Rules

Subtags must appear in order shown.

A tag may be followed by a symbol to indicate its multiplicity:

  • Identical to UNIX regular expressions.
  • * = zero or more.
  • + = one or more.
  • ? = zero or one.

Symbol | can connect alternative sequences of tags.

Example: Element Description

A name is

  • either an optional title (e.g., “Dr.”), a first name, and a last name, in that order,
  • or it is an IP address:

<!ELEMENT NAME ( (TITLE?, FIRST, LAST) | IPADDR )>
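
Because element descriptions use regular-expression operators, the content model above can be mimicked with an ordinary regular expression over the sequence of child tags. This is a hand translation for illustration only, not how a validating parser works internally; the function name and the space-terminated encoding of the tag sequence are assumptions.

```python
import re

# Hand translation of ( (TITLE?, FIRST, LAST) | IPADDR ).
# Each child tag is written followed by one space, so "FIRST LAST " is a
# sequence of two children.
NAME_MODEL = re.compile(r"(?:(?:TITLE )?FIRST LAST |IPADDR )")


def matches_name_model(child_tags):
    """True if the list of child tag names satisfies the NAME content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return NAME_MODEL.fullmatch(sequence) is not None


print(matches_name_model(["TITLE", "FIRST", "LAST"]))  # title is optional
print(matches_name_model(["FIRST", "LAST"]))
print(matches_name_model(["IPADDR"]))
print(matches_name_model(["LAST", "FIRST"]))           # wrong order
```

The comma in the DTD becomes concatenation, `?` keeps its meaning, and `|` separates the two alternatives, so order and multiplicity are both enforced.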

Use of DTDs

  • To specify that a document follows a particular DTD:

1. Set standalone = "no".

2. Either:

a) Include the DTD as a preamble of the XML document, or

b) Follow DOCTYPE and the <root tag> by SYSTEM and a path to the file where the DTD is stored.

Example (a)

<?xml version="1.0" standalone="no"?>
<!DOCTYPE RESTS [
  <!ELEMENT RESTS (REST*)>
  <!ELEMENT REST (NAME, SODA+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT SODA (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA>
      <NAME>Pepsi</NAME>
      <PRICE>1.00</PRICE>
    </SODA>
    <SODA>
      <NAME>Sobe</NAME>
      <PRICE>2.00</PRICE>
    </SODA>
  </REST>
  <REST> …
</RESTS>

  • DTD: the <!DOCTYPE … ]> block.
  • Document: same as earlier, but this time it conforms to the above DTD.

Example (b)

Assume the RESTS DTD is in file rest.dtd.

<?xml version="1.0" standalone="no"?>
<!DOCTYPE RESTS SYSTEM "rest.dtd">
<RESTS>
  <REST>
    <NAME>Taco Bell</NAME>
    <SODA>
      <NAME>Pepsi</NAME>
      <PRICE>1.00</PRICE>
    </SODA>
    <SODA>
      <NAME>Sobe</NAME>
      <PRICE>2.00</PRICE>
    </SODA>
  </REST>
  <REST> …
</RESTS>

  • Get the DTD from the file rest.dtd.
  • Document: same as earlier, but this time it conforms to the DTD in rest.dtd.

Attributes

Attributes are another important component of DTDs and XML documents.

Opening tags in XML can have attributes, like <A HREF = "…"> in HTML.

In a DTD,

<!ATTLIST <element name> … >

gives a list of attributes and their data types for this element.

Example: Attributes

Rests can have an attribute kind, which is either qsr, family, or other.

The element definition is unchanged. However, we add an ATTLIST.

<!ELEMENT REST (NAME, SODA+)>
<!ATTLIST REST kind (qsr | family | other) #IMPLIED>

Example: Attribute Use

In a document that allows REST tags, we might see:

<REST kind="qsr">
  <NAME>KFC</NAME>
  <SODA>
    <NAME>Sierra Mist</NAME>
    <PRICE>1.00</PRICE>
  </SODA>
  ...
</REST>

New info: kind = "qsr"

IDs and IDREFs

These introduce links from one object to another.

They allow the structure of an XML document to be a general graph, rather than just a tree.

These are pointers from one object to another, in analogy to HTML’s NAME = "blah" and HREF = "#blah".

Creating IDs

We give an element Elephant an attribute Attention of type ID in the DTD.

When using tag <Elephant> in an XML document, give its attribute Attention a unique value. For example,

<Elephant Attention = "213">

Creating IDREFs

IDREFs are similar to IDs:

To allow objects of type Fig to refer to another object with an ID attribute, give Fig an attribute of type IDREF (a single string of type ID).

Or, let the attribute have type IDREFS, so the Fig object can refer to any number of other objects (any number of strings of type ID).

Example: IDs and IDREFs

Let’s redesign our RESTS DTD to include both REST and SODA sub-elements.

Both rests and sodas will have ID attributes called name.

Rests have PRICE sub-objects, consisting of a number (the price of one soda) and an IDREF theSoda leading to that soda.

Sodas have attribute soldBy, which is an IDREFS leading to all the rests that sell it.

The DTD

<!DOCTYPE RESTS [
  <!ELEMENT RESTS (REST*, SODA*)>
  <!ELEMENT REST (PRICE+)>
  <!ATTLIST REST name ID #REQUIRED>
  <!ELEMENT PRICE (#PCDATA)>
  <!ATTLIST PRICE theSoda IDREF #REQUIRED>
  <!ELEMENT SODA EMPTY>
  <!ATTLIST SODA name ID #REQUIRED
                 soldBy IDREFS #IMPLIED>
]>

  • RESTS has 0 or more REST and 0 or more SODA sub-objects.
  • REST objects have name as an ID attribute and have one or more PRICE sub-objects.
  • PRICE objects have a number (the price) and one reference to a soda.
  • SODA objects have an ID attribute called name, and a soldBy attribute that is a set of REST names.

Example Document

<RESTS>
  <REST name="KFC">
    <PRICE theSoda="Pepsi">1.00</PRICE>
    <PRICE theSoda="Sobe">2.00</PRICE>
  </REST>
  …
  <SODA name="Pepsi" soldBy="KFC TacoBell …"></SODA>
  …
</RESTS>
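
Following ID and IDREF links can be sketched with Python's standard `xml.etree.ElementTree`. Several things here are assumptions: the parser does not read the DTD, so the code hard-codes that `name` is the ID attribute, `theSoda` an IDREF, and `soldBy` a whitespace-separated IDREFS; the document is a trimmed, hypothetical variant of the example with the elided parts left out and only KFC listed as a seller.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical variant of the example document (no elided parts).
DOC = """<RESTS>
  <REST name="KFC">
    <PRICE theSoda="Pepsi">1.00</PRICE>
    <PRICE theSoda="Sobe">2.00</PRICE>
  </REST>
  <SODA name="Pepsi" soldBy="KFC"></SODA>
  <SODA name="Sobe" soldBy="KFC"></SODA>
</RESTS>"""

root = ET.fromstring(DOC)

# Index every element by its ID attribute, rejecting duplicates because
# ID values must be unique within a document.
by_id = {}
for elem in root.iter():
    if "name" in elem.attrib:
        if elem.attrib["name"] in by_id:
            raise ValueError("duplicate ID: " + elem.attrib["name"])
        by_id[elem.attrib["name"]] = elem


def deref(idref):
    """Follow a single IDREF to the element carrying that ID."""
    return by_id[idref]


def deref_all(idrefs):
    """Follow an IDREFS value: a whitespace-separated list of IDs."""
    return [by_id[token] for token in idrefs.split()]


kfc = by_id["KFC"]
# From each PRICE, follow theSoda to the SODA element it points at.
menu = {deref(p.get("theSoda")).get("name"): float(p.text)
        for p in kfc.findall("PRICE")}
# From a SODA, follow soldBy back to the REST elements that sell it.
sellers = [r.get("name") for r in deref_all(by_id["Pepsi"].get("soldBy"))]
print(menu, sellers)
```

The links run in both directions (REST to SODA through PRICE, and SODA back to REST through soldBy), so the document describes a graph with a cycle rather than a tree.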