Venkat Subramaniam – svenkat@cs.uh.edu

HTML and XML

HTML

  • Hyper Text Markup Language
  • HTML 4.0 is an SGML application; XHTML recasts it in strict compliance with the XML standard
  • Presentation details are provided along with the information
– using markup
  • Browsers act as interpreters/parsers
– parsing HTML documents
– displaying the contents of the documents

Tags, Elements and Attributes

<STRONG>boldface Text</STRONG> <HR> <TABLE BORDER="1">…</TABLE>

  • A tag starts with < and ends with >
  • Elements generally have start and end tags
– start tag: <TagName>
– end tag: </TagName> (optional in some cases)
– the contents of the element are included between the tags
  • Attributes
– Name=Value specifies information about the contents of an element
– provided between the tag name and the closing >
– multiple attributes are separated by spaces
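
The tag, element and attribute structure described above can be observed with a small sketch. It assumes Python's standard `html.parser` module as a stand-in for a browser's parser; the class name `TagLogger` and the helper `tag_events` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start tags (with attributes), end tags and text content."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags
        if data.strip():
            self.events.append(("data", data.strip()))


def tag_events(markup):
    """Return the list of parse events for a piece of HTML."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    sample = '<STRONG>boldface Text</STRONG> <HR> <TABLE BORDER="1"></TABLE>'
    for event in tag_events(sample):
        print(event)
```

Note that this parser reports tag and attribute names in lowercase, and that HR produces a start event with no matching end event.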

Tags, Case, well-formedness

  • HTML is relaxed when it comes to case and well-formedness
  • <HR> is as good as <hr>, as are <Hr> and <hR>
  • <STRONG>This is <I> italics</I> Text</STRONG>
  • However,
– <STRONG>This is <I> italics</STRONG> </I> Text
– is generally accepted, though not well-formed
– How does a browser handle this? Try it on different browsers
  • XML, on the other hand, requires well-formed documents and is case sensitive
  • XHTML is HTML following XML restrictions
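
The difference between relaxed HTML and strict XML can be checked with a short sketch. It assumes Python's standard `xml.etree.ElementTree` as the XML parser; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # Properly nested: accepted
    print(is_well_formed("<STRONG>This is <I>italics</I> Text</STRONG>"))
    # Overlapping tags: rejected by an XML parser
    print(is_well_formed("<STRONG>This is <I>italics</STRONG></I>"))
    # Start and end tags differ in case: rejected, since XML is case sensitive
    print(is_well_formed("<Strong>text</strong>"))
```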

Tags, Line Breaks, Special Characters

  • Block-level tags affect a block of text/content
– HEAD, BODY, P, H1, BR, UL, TABLE
  • Inline tags affect only a few letters or words
– EM, B, IMG
  • Line breaks
– are generally inserted automatically around block-level tags
– not so with inline tags
  • Special characters
– <, >, & and " are special characters
– to display these, use names (&lt;, &gt;, &amp;, &quot;) or numbers (&#60;, &#62;, &#38;, &#34;)
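
As a minimal sketch of the special-character rule above, Python's standard `html` module can convert text to and from these references. The function names `to_display_text` and `from_references` are hypothetical wrappers added for illustration.

```python
import html


def to_display_text(raw):
    """Replace <, >, & and quote characters with named references."""
    return html.escape(raw, quote=True)


def from_references(text):
    """Turn named or numeric references back into characters."""
    return html.unescape(text)


if __name__ == "__main__":
    print(to_display_text('a < b & "c" > d'))
    print(from_references("&#60;&#62;&#38;&#34;"))
```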

Common Tags

  • <HTML>
– optional tag indicating content type
  • <TITLE>
– title of a web page
  • <BODY>
– content of a web page
  • <Hn ALIGN=direction>
– heading levels 1 to 6 (Times New Roman at 24, 18, 14, 12, 10 and 8 points)
– direction = left, right or center
  • <P ALIGN=direction>
– paragraph, with space between paragraphs

Text Formatting – Font, Size

  • Specifying font (deprecated in HTML 4.0)
  • <FONT SIZE="value" FACE="name1, name2" COLOR="value">
– SIZE value may be 1 to 7 (Times at 8, 10, 12, 14, 18, 24, 36 points)
– SIZE may also be +n or -n to specify a size n steps higher or lower
    • may also be altered with <BIG> or <SMALL> tags
– if name1 is not available on the system, name2 is selected
    • more alternatives may be specified
– if none of the alternatives is available, the default is chosen
  • You may set the default size for the entire document using <BASEFONT SIZE="value">

Text Formatting - Color

  • Color value can be specified
– using a #rrggbb value
– or using the name of one of 16 predefined colors
  • <BODY TEXT="value">
– sets the default color for text in the document
  • <FONT COLOR="value">
– sets the color for the content of this element

Text Formatting - Miscellaneous

  • <SUB> for subscript
  • <SUP> for superscript
  • <STRIKE> for strikeout
  • <U> for underline
  • <B> or <STRONG> for boldface
  • <I> or <EM> for italics
  • <CODE>, <KBD>, <SAMP>, <TT> for monospace
  • <BLINK> for blinking text
  • Use <!-- to start comments and --> to end them
  • All these tags have a start and end tag

Links

  • Links are used to relate documents to one another
– to navigate, to view, to take some action, etc.
  • A link has three parts: destination, label and target
– <A HREF="anotherPage.html">Next</A>
– HREF provides the destination, Next is the label
– a special attribute called TARGET may be used to tell the browser to display the page in another frame or a new window (_blank)
    • target names are case sensitive
    • <BASE TARGET="…"> in the head section sets the default target for the page
  • It is good practice to use relative URLs
– use absolute URLs for outside web pages
  • Links may be of other types: ftp, news, mailto, etc.

Links and Anchors

  • You may define an anchor within a document
– <A NAME="anchorName">…</A>
  • You may link to that location in the document by
– <A HREF="#anchorName">label</A>
– <A HREF="URL#anchorName">label</A>
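
How a relative HREF or an anchor reference turns into a full address can be sketched with Python's standard `urllib.parse`. The page address `http://www.example.com/course/index.html` is a made-up example, and `resolve` is a hypothetical helper name.

```python
from urllib.parse import urljoin, urldefrag


def resolve(page_url, href):
    """Resolve an HREF value against the URL of the page that contains it."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "http://www.example.com/course/index.html"  # hypothetical page
    print(resolve(page, "anotherPage.html"))   # relative URL
    print(resolve(page, "#anchorName"))        # anchor in the same document
    print(resolve(page, "http://other.example.org/page.html"))  # absolute URL
    # Split a URL into the document part and the anchor name
    print(urldefrag("http://www.example.com/course/notes.html#anchorName"))
```

Relative links keep working when the whole site moves to another host, which is one reason the slides recommend them for pages within the same site.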

Tables

<TABLE> <TR> <TD>cell 1 content</TD><TD>cell 2 content</TD> </TR> … </TABLE>

  • TABLE attribute BORDER=n defines the border thickness
– default is 2
– if you do not specify it, the border is drawn with space, not a line
– to add extra space around the table, use HSPACE or VSPACE
  • TABLE attribute ALIGN=center will center the table
  • TABLE or TD attribute WIDTH=n sets the table or cell width in pixels
– the specified size is ignored if it is too small for the contents
  • TD attribute COLSPAN=n specifies the number of columns to span
– use ROWSPAN to span across rows
  • Use <TH> for a table header, centered and boldface
  • Use <CAPTION> for a table caption
– attribute ALIGN=direction (top, bottom, left, right)

Lists

  • You may create ordered lists, unordered lists and definition lists
– may be plain, numbered or bulleted
<OL TYPE=X> <LI> list item 1</LI> <LI> list item 2</LI> </OL>
– TYPE is optional (defaults to 1 for numbers)
– A for capital letters, a for small letters, I for capital roman numerals, i for small roman numerals
– use START=n for the initial value of the list items
    • always numeric and converted automatically to the proper type
– in LI, you may override TYPE and VALUE for this and the following items
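
The conversion of a numeric START or VALUE into the marker for each TYPE can be sketched as follows. This is an illustration, not a browser's actual algorithm: the function names are hypothetical, and continuing after Z with AA, AB, … is an assumption.

```python
def to_roman(n):
    """Convert a positive integer to capital roman numerals."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """Convert 1 -> A, 26 -> Z, 27 -> AA (assumed continuation)."""
    out = []
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        out.append(chr(ord("A") + remainder))
    return "".join(reversed(out))


def list_marker(number, list_type="1"):
    """Return the marker shown for item `number` in an OL of the given TYPE."""
    if list_type == "1":
        return str(number)
    if list_type == "A":
        return to_letters(number)
    if list_type == "a":
        return to_letters(number).lower()
    if list_type == "I":
        return to_roman(number)
    if list_type == "i":
        return to_roman(number).lower()
    raise ValueError(f"unknown list type: {list_type!r}")


if __name__ == "__main__":
    # <OL TYPE=I START=4> numbers its first three items like this
    print([list_marker(n, "I") for n in range(4, 7)])
```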

Unordered List

  • Use <UL> to create an unordered list
  • Use attribute TYPE=shape for the bullet type
– disc for a solid round bullet (default for 1st level)
– circle for an empty round bullet (default for 2nd level)
– square for square bullets (default for 3rd level and deeper)
  • <LI> may override the type

Definition Lists

  • Great to create lists that describe items

– Like glossaries

<DL>Text here will appear on own line <DT>Text To Appear On Own Line Aligned Left</DT> <DD> Definition text </DD> … </DL>

– You may have multiple DTs and DDs to allow multiple words or definitions

Images

  • HTML tag IMG allows placement of images
  • <IMG SRC="LocationAndNameOfImageFile">
  • Attributes
– BORDER="n"
– ALT="tooltip or alternate text"
    • specifies text that may appear instead of the image
    • this also serves as a tool tip on Windows
    • a required attribute in HTML 4
– WIDTH="x" HEIGHT="y"
    • allows the browser to set aside space for the image while displaying text
– LOWSRC
    • specifies a fast-loading low-resolution image to be shown first
    • the high-resolution image loads more slowly and then replaces the low-resolution image
– ALIGN
    • align left or right to allow text to wrap around the image
– HSPACE="pixel" VSPACE="pixel"
    • provides padding on the sides (horizontal and vertical) around the image

BR, CLEAR and Text Wrapping

  • The <BR> tag provides a line break
  • The CLEAR attribute says do not begin text until the specified margin is clear
– <BR CLEAR="left">
    • do not begin text until the left margin is clear of images
– <BR CLEAR="right">
    • do not begin text until the right margin is clear of images
– <BR CLEAR="all">
    • do not begin text until both margins are clear of images

Forms

  • A form has three parts
– FORM tag with the URL of the action script
– form elements: text, radio buttons, etc.
– Submit button to send data to the script
<FORM METHOD=POST ACTION="scriptURL"> … </FORM>
  • The method may be POST or GET
– GET limits the amount of information sent
    • data is sent as part of the query string
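
With GET, the form data travels in the URL as name=value pairs. The sketch below shows that encoding using Python's standard `urllib.parse`; the action URL and the field names are made up for illustration, and `build_get_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs


def build_get_url(action_url, fields):
    """Append form fields to the action URL as a query string."""
    return action_url + "?" + urlencode(fields)


if __name__ == "__main__":
    # Hypothetical script URL and form fields
    url = build_get_url(
        "http://www.example.com/scriptURL",
        {"name": "Venkat", "topic": "HTML & XML"},
    )
    print(url)
    # The server-side script decodes the query string back into fields
    print(parse_qs(url.split("?", 1)[1]))
```

Because everything ends up in the URL, the amount of data is limited and the values are visible in the address bar; POST sends the same pairs in the request body instead.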

FORM elements

  • Elements are created using
<INPUT TYPE="type" NAME="name" VALUE="initvalue">
– the name and the user-given value are sent as name=value
– use attributes DISABLED or READONLY if desired
  • Text box
– TYPE="text"
– attributes: SIZE="n" MAXLENGTH=n
– these two attributes are in number of characters and are optional
– SIZE defaults to 20
  • Password box
– a text box where what you type is not shown (asterisks appear instead)
– not encrypted when sent to the server, though

FORM elements…

  • Radio button
– TYPE="radio"
– NAME="radioset"
    • where radioset is the group name for mutually exclusive buttons
    • the browser ensures that only one button of the group is set
    • this is also the name sent to the server-side script
– attribute CHECKED if you want the button checked initially
– VALUE="value" is the value sent if this button is checked
  • Check box
– TYPE="checkbox"
– attribute CHECKED if you want the box checked initially
– VALUE="value" is the value sent if this box is checked

FORM elements…

  • Uploading files
– TYPE="file"
– NAME="title" for the server to identify the file
– SIZE=n is the number of characters in the field for entering the path/file
    • default 20
– in the FORM tag, use attribute ENCTYPE="multipart/form-data"
– METHOD on FORM should be POST
  • Hidden fields
– useful to maintain session information
– TYPE="hidden"

FORM elements…

  • Menu
<SELECT NAME="name" SIZE="n" MULTIPLE> <OPTION SELECTED VALUE="value">label</OPTION> … </SELECT>
– SIZE is the height in lines
– SELECTED is optional and marks the initial selection of a menu item
  • Text Area
– when one line is not enough
– <TEXTAREA NAME="name" ROWS="n" COLS="n" WRAP>
– ROWS defaults to 4 and COLS to 40; WRAP is optional
– the user may provide up to 32,700 characters

FORM elements…

  • Submit button
<INPUT TYPE="submit" VALUE="button text">
– if you do not provide a value, the word Submit appears
– if you set the NAME attribute, the value is sent to the server
  • Use TYPE="reset" to provide a clear/reset button
  • HTML 4 adds the BUTTON tag, which allows you to change
– the font
– the background color
– the image
<BUTTON TYPE="submit" NAME="name" VALUE="value" STYLE="font: size FontName;background:color"> Text to left of image <IMG SRC="imageFileName"> Text to right of image </BUTTON>

FORM elements…

  • You may also use an image to send information
  • <INPUT TYPE="image" SRC="imageFileName" NAME="name">
  • The mouse coordinates at which the user clicks are sent
– as name.x and name.y
– the top-left of the image is (0, 0)

Organizing Form Elements

  • You may put a box around elements
<FORM …>
<FIELDSET>
<LEGEND ALIGN=right>box caption</LEGEND>
… elements …
</FIELDSET>
… other fieldsets
</FORM>
  • Simply surround the elements with a FIELDSET element

Running a Script on Input

  • It is useful to run a script when the user makes a selection
– JavaScript is the default scripting language
  • Simply add an attribute of an event type to the tag
  • Specify the code to execute
– you may either type the code right there or refer to it
<BUTTON TYPE="button" NAME="Time" ONCLICK="alert('Today is ' + Date())">Current Time</BUTTON>
  • We will see this put to work in the JavaScript session

HTML Events

  • ONBLUR
– user leaves an element that has focus
  • ONCHANGE
– user modifies content of element (like INPUT)
  • ONCLICK / ONDBLCLICK
– user clicks / double clicks on specified area
  • ONFOCUS
– user selects, clicks or tabs to element
  • ONKEYDOWN / ONKEYPRESS
– user types something in the specified area
  • ONKEYUP
– user releases key after typing
  • ONLOAD
– page is loaded in browser
  • ONMOUSEDOWN
– mouse pressed down over the element
  • ONMOUSEMOVE
– mouse moved while pointing at element
  • ONMOUSEOVER
– mouse moved onto element
  • ONMOUSEOUT
– mouse moved away from element after being over it
  • ONMOUSEUP
– mouse released after the click
  • ONRESET
– form's reset button clicked
  • ONSELECT
– user selected one or more words in element
  • ONSUBMIT
– form's submit button clicked
  • ONUNLOAD
– browser loads a different page after the specified page

Cascading Style Sheets

  • HTML allows specification of fonts, colors, etc.
  • These may be placed throughout the document
– results in poor maintainability
– what if you want to change these?
  • This is where CSS comes in
  • You specify the formatting or styling separately
– at the top of the document
– or in a separate document

CSS: Specifying Style

  • Instead of defining style all over the document, specify it at the top and simply refer to it in the document
  • A specification has two parts:
– selector
    • this is the name you associate a style with
– declarations
    • this is the definition of how it should look
  • The specification may be local, internal or external
  • The cascade:
– local overrides internal, which in turn may override external specifications
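
The override order of the cascade can be pictured with a deliberately simplified sketch: each level is a dictionary of declarations, and later levels win. Real CSS resolution also weighs selector specificity and other rules, so this only illustrates the ordering stated above; `cascade` is a hypothetical function name.

```python
def cascade(external, internal, local):
    """Merge declarations so that local beats internal, which beats external."""
    resolved = {}
    # Apply from lowest to highest priority; later updates overwrite earlier ones
    for declarations in (external, internal, local):
        resolved.update(declarations)
    return resolved


if __name__ == "__main__":
    external = {"color": "black", "font-size": "12pt", "font-family": "Times"}
    internal = {"color": "blue"}
    local = {"font-size": "18pt"}
    print(cascade(external, internal, local))
```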

CSS: Local Style

  • This style applies to the element on which it is declared
  • It takes effect locally
  • Useful to alter a style specified internally in the document or externally in another file

CSS: Internal Style

  • Specified between <HEAD> and </HEAD>
  • Provide one or more selectors
– separate by commas for the declarations to apply to all of the selectors
– separate by spaces if the declarations are to apply only to nested selectors and not to other appearances
  • Provide the declarations
– within { }, separated by ;
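
The shape of a style rule (selectors, then declarations in braces separated by semicolons) can be made concrete with a small sketch. It handles only simple rules like the ones on these slides, not full CSS syntax; `parse_rule` is a hypothetical helper.

```python
def parse_rule(rule):
    """Split a simple CSS rule into (selectors, declarations)."""
    selector_part, _, rest = rule.partition("{")
    body = rest.rsplit("}", 1)[0]
    # Comma-separated selectors each receive all of the declarations
    selectors = [s.strip() for s in selector_part.split(",") if s.strip()]
    declarations = {}
    for declaration in body.split(";"):
        if ":" in declaration:
            prop, _, value = declaration.partition(":")
            declarations[prop.strip()] = value.strip()
    return selectors, declarations


if __name__ == "__main__":
    # Declarations apply to both H1 and H2
    print(parse_rule("H1, H2 { color: red; font-size: 18pt }"))
    # Space-separated: applies only to EM nested inside P
    print(parse_rule("P EM { font-style: normal; }"))
```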

CSS: External Style Sheet

  • Writing the style in a separate file allows sharing the style and applying it to more than one page
  • Pages link to the style sheet that specifies the style
  • You may apply an internal style sheet as well as local styles at the same time

CSS: Defining Classes

  • You can define a class or category and a style for that class
  • Any element defined as part of that class will use the style specified for that class
  • Classes are defined to belong to a certain selector type using the format selectorName.className

CSS: Defining IDs

  • An ID can be defined for individual elements in your document
– the ID must be unique
  • A style can be specified for that tag/element
– tag name followed by # followed by the ID
  • The style applies only to the element with that ID
  • Scripts may also use the ID to identify that element in the document

CSS: DIV and SPAN

  • Style may be specified on pre-defined tags
– like Hn and P
– how do you apply a style to a wider range of items?
  • DIV and SPAN allow you to define areas of the document over which a style may be applied
  • DIV is a block-level tag while SPAN is an inline tag

CSS: Font Styles

  • font-family
– specifies a list of fonts to choose from
– font-family: "Times Roman", "Helvetica", "Arial"
  • font-style
– specifies whether the font should be italic, oblique or normal
– font-style:italic
– to remove italic, use font-style:normal
  • font-weight
– specifies the boldness of text; possible values: bold, bolder, lighter
– or a multiple of 100 between 100 and 900, with 400 for book weight and 700 for bold
– normal will remove bold
  • font-size
– absolute font size: xx-small, x-small, small, medium, large, x-large, xx-large
– relative font size: larger, smaller
– exact point size: 18pt
– percentage relative size: 200%

CSS: Font Style…

  • line-height
– specifies the space between lines (leading) within a paragraph
– line-height:15pt or line-height:50%
  • All the font styles may be specified in one shot as well
– specify in the following order, space separated:
    • font-style font-variant font-weight font-size/line-height font-family
– / separates font-size from line-height

CSS: Text Color Style

  • color
– specify one of 16 colors, or #rrggbb, or rgb(r, g, b), or rgb(r%, g%, b%)
  • background
– transparent or a color value
– url(image.gif) to specify an image file name
– repeat to tile the image, repeat-x for horizontal tiling, repeat-y for vertical tiling
– fixed or scroll to control whether the background scrolls along with the canvas
– x y for the position of the background image from the top-left corner
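
The three numeric color notations describe the same thing: a red, green and blue value from 0 to 255. The sketch below converts each notation to such a triple. It is an illustration under simplifying assumptions (no named colors, no range checks), and `parse_color` is a hypothetical helper.

```python
import re


def parse_color(value):
    """Convert #rrggbb, rgb(r, g, b) or rgb(r%, g%, b%) to an (r, g, b) tuple."""
    value = value.strip().lower()
    match = re.fullmatch(r"#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})", value)
    if match:
        # Each pair of hex digits is one channel
        return tuple(int(part, 16) for part in match.groups())
    match = re.fullmatch(r"rgb\((.*)\)", value)
    if match:
        parts = [part.strip() for part in match.group(1).split(",")]
        if len(parts) != 3:
            raise ValueError(f"expected three channels: {value!r}")
        channels = []
        for part in parts:
            if part.endswith("%"):
                # Percentages scale to the 0..255 range
                channels.append(round(float(part[:-1]) * 255 / 100))
            else:
                channels.append(int(part))
        return tuple(channels)
    raise ValueError(f"unsupported color value: {value!r}")


if __name__ == "__main__":
    print(parse_color("#FF3300"))
    print(parse_color("rgb(255, 51, 0)"))
    print(parse_color("rgb(100%, 20%, 0%)"))
```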

CSS: Text Spacing Style

  • word-spacing
  • letter-spacing
  • text-indent
  • white-space
– pre to preserve extra spaces; nowrap to keep elements on the same line; normal to return to normal behavior
  • text-align
– left, center, right, justify
  • text-decoration
– underline, overline, line-through, none, blink
    • blink is not supported by IE and is generally not recommended
  • text-transform
– capitalize, uppercase, lowercase, none
    • font-variant: small-caps displays uppercase letters in lowercase size

Markup and XML

  • Markup
– conveying metadata with literals/tags to delimit and describe
– Generalized Markup Language (GML)
– Standard Generalized Markup Language (SGML)
    • adopted by ISO
    • popular, however too complex
  • eXtensible Markup Language (XML)
– designed by the World Wide Web Consortium (W3C)
– subset of SGML
– simpler to read and write, and simpler to develop parsers for

Why XML?

  • HTML is the de facto standard for markup
– markup for information presentation
– talks about how information looks and is presented
– does not let you add more markup of your own
  • What about the information itself?
  • Need to
– describe information
– extend the descriptions
– must be structured, easy to express and validate

What is XML?

  • XML is about extensibility and flexibility
  • Tags describe and surround the data
  • Example:

<?xml version="1.0"?>
<equipment>
  <pump>
    <name>p01</name>
    <pressure units="psi">32.23</pressure>
  </pump>
  <pump>
    <name>p02</name>
    <pressure units="psi">22.887</pressure>
  </pump>
</equipment>

  • Open, extensible
  • Platform independent
  • Self-describing data
– Data exchange
  • Supports query and discovery of data
  • Dynamic data exchange
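
Because the tags name the data, a program can pull values out of the equipment example without knowing anything else about the file. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; `read_pumps` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

EQUIPMENT_XML = """<?xml version="1.0"?>
<equipment>
  <pump>
    <name>p01</name>
    <pressure units="psi">32.23</pressure>
  </pump>
  <pump>
    <name>p02</name>
    <pressure units="psi">22.887</pressure>
  </pump>
</equipment>"""


def read_pumps(xml_text):
    """Return a list of (name, pressure, units) tuples, one per pump."""
    root = ET.fromstring(xml_text)
    pumps = []
    for pump in root.findall("pump"):
        pressure = pump.find("pressure")
        pumps.append(
            (pump.findtext("name").strip(), float(pressure.text), pressure.get("units"))
        )
    return pumps


if __name__ == "__main__":
    for name, pressure, units in read_pumps(EQUIPMENT_XML):
        print(name, pressure, units)
```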

What does XML provide?

  • Tags delimit content
– lets you define structure of arbitrary complexity
  • Self-describing data
– tags describe and name the data being defined
– the name relates to the information it models/represents
  • Standard extensibility
– in defining new tags and semantics
  • Vocabularies
– description of data used for information exchange
– within specific domains
  • Separates content from presentation

XML System

[Diagram: XML Document, XML Constraint (DTD, Schema), XML Parser/Processor/Styling, XML APP]

Features of XML technologies

  • Well-formed syntax
  • Document Type Definitions (DTDs)
– capture rules added to extend the core syntax rules
  • Document Object Model (DOM)
– API for manipulating, parsing and creating XML documents
– provides a tree-structured view of the document
– standard API
  • Simple API for XML (SAX)
– provides events as the document is being parsed
– leaves it to the application to keep state and content information
  • Styling and Transformation (XSL and XSLT)

The Markup Syntax

  • XML Entity
– a file or stream with a well-formed structure
  • Tags delimit the elements of the structure
  • XML tags are case-sensitive
  • XML uses the Unicode character set
  • Names are used to identify structures
– names begin with a letter, underscore or colon
    • followed by further name characters, including digits, hyphens and periods
  • [Diagram: parts of an element: Start Tag, Attributes, Content, End Tag]

Structure of a Document

  • Prolog (optional): comments, processing instructions
  • Document Type Declaration: comments, processing instructions, Document Type Definitions
– element declarations, attribute declarations, entity declarations, notation declarations
  • Body: root element
– comments, processing instructions, elements
– attributes: CDATA, entities, ID, …
– PCDATA, entity references
– entity references, CDATA sections
  • Epilog (optional): comments, processing instructions

Markups that go in XML Document

  • The following markup may be contained in any XML document
– Element start and end tags
– Attributes
– Comments
– Entity references
– Processing instructions
– Character data sections (CDATA)
– Document type declarations

A Sample XML File

Elements

  • Building blocks of an XML document
– StartTag Content EndTag: <ElementTypeName> content </ElementTypeName>
  • Element content may include
– other elements
– character data
– character references
– entity references
– processing instructions
– comments
– CDATA sections
  • Empty elements may be abbreviated to save space
– <ElementTypeName/> indicates an empty element

Document and Elements

  • An XML document may be viewed as a hierarchical tree
  • [Diagram: tree of Document, Document Root, Prolog, Document Element, Epilog and nested Elements; * represents containment/aggregation]

Contents

  • Element Content

– Contains other elements but no character data

  • Mixed content

– Contains character data and other elements

  • Character content

– Contains nothing but character data

  • Empty element

– Contains nothing
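
The four kinds of content can be told apart programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` and treats whitespace-only text as "no character data", which is a simplifying assumption; `content_kind` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def content_kind(element):
    """Classify an element as 'element', 'mixed', 'character' or 'empty'."""
    has_children = len(element) > 0
    # Text directly inside the element: before the first child (text)
    # and after each child (tail)
    texts = [element.text] + [child.tail for child in element]
    has_text = any(text and text.strip() for text in texts)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "character"
    return "empty"


if __name__ == "__main__":
    samples = [
        "<pump><name>p01</name></pump>",
        "<p>This is <i>italics</i> text</p>",
        "<name>p01</name>",
        "<hr/>",
    ]
    for sample in samples:
        print(sample, "->", content_kind(ET.fromstring(sample)))
```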

Nesting

  • XML requires proper nesting of elements
  • Items must be fully contained within their nested level
  • XML is strict about proper nesting, unlike HTML
– allowing ambiguity leads to programming complexity
– keep-it-simple policy
– gives a not-well-formed error if improper nesting is encountered
– results in a fatal error/termination of parsing

Name

  • A name
– begins with an alphabetic character or an underscore
– followed by alphanumeric characters, periods, hyphens or underscores

Name = (Letter | '_') (Char)*
Char = Letter | Digit | '.' | '-' | '_'
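
The grammar above translates almost directly into a check. This sketch follows the slide's simplified rule rather than the full Name production of the XML specification (which admits further characters such as the colon); `is_name` is a hypothetical helper.

```python
def is_name(text):
    """Check text against: Name = (Letter | '_') (Char)*."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    # First character: a letter or an underscore
    if not (first.isalpha() or first == "_"):
        return False
    # Remaining characters: letters, digits, period, hyphen or underscore
    return all(ch.isalpha() or ch.isdigit() or ch in "._-" for ch in rest)


if __name__ == "__main__":
    for candidate in ["pump_01", "_x.y-z", "1pump", "pump name", ""]:
        print(repr(candidate), is_name(candidate))
```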

XML String Literals

  • Literals are delimited by apostrophes or quotes
    • "hello"
    • 'hi'
  • The character used as the delimiter can't appear in the literal
    • "George, What's up!"
    • 'He said "what a nice day!"'
    • The following is not valid: 'what's up'
– XML has no escape character: an apostrophe can't escape a quote, nor a quote an apostrophe
  • What if you need to use both apostrophe and quote?
– you may use the entity references &apos; or &quot;
    • 'I asked George, What&apos;s up, "He said, fine"'
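
Choosing a delimiter and falling back to an entity reference can be left to a library. As a sketch, Python's standard `xml.sax.saxutils.quoteattr` picks quotes or apostrophes depending on the text and uses `&quot;` when both characters occur; the wrapper name `as_literal` is hypothetical.

```python
from xml.sax.saxutils import quoteattr


def as_literal(text):
    """Return text as a delimited XML string literal."""
    return quoteattr(text)


if __name__ == "__main__":
    print(as_literal("George, What's up!"))           # delimited by quotes
    print(as_literal('He said "what a nice day!"'))   # delimited by apostrophes
    print(as_literal('What\'s up, "He said, fine"'))  # needs an entity reference
```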

Attributes

  • An element generally describes and contains information
  • Attributes provide information that is part of the element rather than being contained in it
– generally about the format of the information, etc.
  • Name-value pair
    • attributeName="value"
    • attributeName='value'
– the value must be a quoted string literal; unquoted numbers are not allowed
– an attribute may appear only once within a tag

Special Attributes

  • xml:space
– white space is not generally preserved
– how does one indicate that white space is significant?
– xml:space says that significant white space is encoded in the document
– recommends that the white space be preserved
– applications may choose to honor or ignore it
– must take a value of "preserve" or "default"
  • xml:lang
– indicates the language/locale of the XML document
  • If present, these two attributes apply to all nested elements as well

Special Characters

  • White space:
– horizontal tab (09), line feed (0A), carriage return (0D), space (20)
– parsers preserve white space within element content
– may remove it from attributes and element tags
  • End-of-line
– end of line is generally indicated by
    • a carriage return followed by a line feed
    • only a line feed
    • only a carriage return
– XML parsers are required to convert these to a single line feed
    • UNIX style is favored
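
The end-of-line rule amounts to a two-step replacement. The sketch below shows the idea on a plain string; it is an illustration of the rule, not the code of any particular parser, and `normalize_line_ends` is a hypothetical helper.

```python
def normalize_line_ends(text):
    """Convert CR LF pairs and lone CRs to a single line feed."""
    # Replace the two-character sequence first so it is not counted twice
    return text.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    mixed = "line 1\r\nline 2\rline 3\nline 4"
    print(normalize_line_ends(mixed).split("\n"))
```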

Character References

  • Character References
– represent displayable characters that can't be placed in a well-formed document as is
– the character may be represented using
    • &# followed by the decimal number of the character and a ;
    • &#x followed by the hexadecimal number of the character and a ;
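
Both forms of character reference can be produced from a character's code point, and an XML parser turns them back into the character. A minimal sketch using Python's standard library follows; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def decimal_reference(ch):
    """Return the decimal character reference for a single character."""
    return f"&#{ord(ch)};"


def hex_reference(ch):
    """Return the hexadecimal character reference for a single character."""
    return f"&#x{ord(ch):X};"


def element_text(xml_text):
    """Parse a one-element document and return its text content."""
    return ET.fromstring(xml_text).text


if __name__ == "__main__":
    copyright_sign = "\u00a9"
    print(decimal_reference(copyright_sign), hex_reference(copyright_sign))
    # The parser expands both references to the same character
    print(element_text("<note>&#169; and &#xA9;</note>"))
```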

Entity References

  • Entity References
– think of these as macro definitions
– allow insertion of string literals
– provide mnemonic equivalents
– start with an & and end with a ;
– predefined entity references:
    • &amp;, &lt;, &gt;, &apos;, &quot;
  • Rather than repeating content, you can refer to where to find it
– declare the substitution text in the doctype
– refer to it by &name;
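
The macro-like behavior can be seen by declaring an entity in the doctype and letting a parser expand it. In this sketch the entity name `course` and its text are made up for illustration, and Python's standard `xml.etree.ElementTree` does the parsing.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: declares the entity "course" in its internal DTD subset
DOCUMENT = """<!DOCTYPE note [
  <!ENTITY course "HTML and XML">
]>
<note>Welcome to &course;, an &lt;introductory&gt; session</note>"""


def expanded_text(xml_text):
    """Parse the document and return the root element's text after expansion."""
    return ET.fromstring(xml_text).text


if __name__ == "__main__":
    print(expanded_text(DOCUMENT))
```

Expanding entities from untrusted documents can be abused to consume large amounts of memory, so this is best tried on documents you control.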

Processing Instructions

  • Processing Instructions (PIs) allow you to provide hints to applications as part of the document
  • A PI consists of two things:
– a target followed by an instruction
    • <?target instruction ?>
– the target is an XML name that identifies the application the instruction is intended for
– the instruction is a string literal
  • To avoid confusion with
– <?xml version="1.0" ?>
– the PI target can't be the string "xml" or "XML"

XML Comments

  • Comments may be present anywhere in a document
– except as part of other markup
  • Comments start with <!-- and end with -->
  • May contain any string that
– does not contain --
– does not end with -
  • Entities within comments are not expanded
  • Markup within comments is not interpreted

CDATA Sections

  • CDATA sections are portions of the document that will not be interpreted as markup
– <![CDATA[ non-parsed data ]]>
  • Starts with the tag
– <![CDATA[
  • Ends with the tag
– ]]>
  • The contained text can't have
– a string that contains the delimiter ]]>
– nested CDATA sections
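
A CDATA section lets text containing < and & pass through untouched. The sketch below wraps text in a CDATA section and reads it back with Python's standard `xml.etree.ElementTree`. Splitting the text wherever `]]>` occurs is a commonly used workaround for the delimiter restriction; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def cdata_section(text):
    """Wrap text in CDATA, splitting it wherever the delimiter ]]> occurs."""
    # "]]>" becomes "]]" at the end of one section and ">" at the start of the next
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def round_trip(text):
    """Embed text in a document as CDATA and return what the parser reads back."""
    document = "<data>" + cdata_section(text) + "</data>"
    return ET.fromstring(document).text


if __name__ == "__main__":
    print(cdata_section("if (a < b && c > d) { }"))
    print(round_trip("if (a < b && c > d) { }"))
    print(round_trip("array[index[0]]>limit"))
```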

Prolog

  • Optional member of an XML document
  • Provides hints and information on encoding methods
  • Contains
– optional XML declaration
– optional comments (several)
– PIs
– white space characters
– optional Document Type Declaration (not the DTD itself)
    • ties the DTD to the document

XML Declaration

  • XML declaration is optional
  • If present
– must be the first thing in the document
    • no comments or white space allowed to precede it
– the xml tag must be lowercase
    • <?xml version="1.0" ?>
  • Attributes:
– version: required; allows for future versions
– encoding: optional; UTF-8, UTF-16, ISO-8859-1 (Latin-1), etc.
– standalone: optional; yes or no (no if an external DTD is required)

Epilog

  • Optional member of an XML document
  • Contains
– optional comments (several)
– PIs
– white space characters
  • Its use is ambiguous since it is optional and most applications may not wait to read it

Well-formed Document

  • An XML document is said to be well-formed if
– the document syntax conforms to the XML specification
– elements form a hierarchical tree with a single root node
– there are no references to external entities
    • unless a DTD is provided
  • A well-formed XML document
    • is case sensitive
    • requires you to close tags
    • does not allow overlapping tags

Parsers

  • An XML processor or parser is an application that reads through an XML document and interprets it
  • Parser types
– Non-validating
    • ensures the data object/document is well-formed XML
– Validating
    • validates, using a DTD, the form and content of a well-formed data object
  • Parser implementations
– Event-driven parsers
    • parser calls back into the application as it identifies data
    • applications handle the data
    • parser does not keep the tree structure or the data after parsing
    • memory usage is minimal
– Tree-based parsers
    • a tree structure of the document is built in memory
    • this tree is then manipulated using an interface
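
The two parser styles can be compared on the same input. This sketch uses Python's standard `xml.sax` (event-driven) and `xml.dom.minidom` (tree-based) purely as stand-ins for the two approaches; the sample document and the names `NameCollector`, `names_with_sax` and `names_with_dom` are hypothetical.

```python
import xml.sax
from xml.dom import minidom

XML_TEXT = (
    "<equipment>"
    "<pump><name>p01</name></pump>"
    "<pump><name>p02</name></pump>"
    "</equipment>"
)


class NameCollector(xml.sax.ContentHandler):
    """Event-driven: the parser calls back; the application keeps the state."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect it
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False


def names_with_sax(xml_text):
    handler = NameCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


def names_with_dom(xml_text):
    """Tree-based: the whole document is built in memory, then queried."""
    document = minidom.parseString(xml_text)
    return [node.firstChild.data for node in document.getElementsByTagName("name")]


if __name__ == "__main__":
    print(names_with_sax(XML_TEXT))
    print(names_with_dom(XML_TEXT))
```

The event-driven version never holds more than the current name in memory, while the tree-based version is shorter to write because the whole document can be queried after parsing.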

XML Parsers

  • Several parsers are available in the market
– Xerces (Apache)
– JAXP (more of an API, from Sun)
– MSXML (Microsoft)
– Expat (James Clark)
– RXP (Richard Tobin)
– XP (James Clark)
– XML4J (IBM)
– XML::Parser (Clark Cooper)
– Pyexpat (Jack Jansen)
– Lark (Tim Bray)
– TclXML (Steve Ball)

Major APIs

  • DOM API
  • SAX API
  • JDOM
  • XSLT
  • XPath