Search, Analysis, and Integration of Web Documents: A Case Study with FLORID

slide-1
SLIDE 1 Search, Analysis, and Integration of Web Documents: A Case Study with FLORID
Rainer Himmeröder, Paul-Th. Kandzia, Bertram Ludäscher, Wolfgang May, Georg Lausen
Institut für Informatik, Universität Freiburg, Germany
Overview
  • Introduction/Motivation
  • FLORID Web model
  • Integration: CIA WORLD FACTBOOK and WORLD ONLINE
  • Semistructured Data
  • Conclusions
slide-2
SLIDE 2 MOTIVATION
Goal: A uniform framework for
  • Querying the Web
      • express declaratively how to query/navigate on the Web
      • extract data from Web pages for populating a database (Web-data warehousing)
  • Management of Semistructured Data
      • structure is irregular, partial, unknown, implicit in the data
      • example: HTML pages
      • querying/navigation using general path expressions
      • discover structure
  • Information Integration
      • heterogeneous sources with different structure
      • wrappers, mediators
slide-3
SLIDE 3 QUERYING THE WEB WITH F-LOGIC/FLORID
DOOD Paradigm:
  • deduction: for data-driven exploration of the Web and high-level querying
  • object-orientation: for flexible modeling of semistructured data
  • optional methods instead of NULLs
WebFLORID:
  • extension of F-logic for querying and restructuring the Web
  • declarative rule-based programming style: uniform language for wrappers and mediators
  • meta features: schema browsing/reasoning, variables at class/method positions
  • restructuring of information
  • navigation by general path expressions
  • uniform access to local db and Web data
  • integration of heterogeneous information
slide-4
SLIDE 4 F-LOGIC IN A NUTSHELL
Basic Constructs:
  • Object : Class (ISA relation)
  • SubClass :: Class (SUBCLASS relation)
  • Class[Method@(Ptypes) => Rtype] (SIGNATURE, single-valued)
  • Class[Method@(Ptypes) =>> Rtypes] (and multi-valued)
  • Object[Method@(Params) -> R] (DATA, single-valued)
  • Object[Method@(Params) ->> {R1, ..., Rn}] (and multi-valued)
  • Obj.M@(P)[Spec], Obj..M@(P)[Spec] (PATH EXPRESSION)
Object Creation via Path Expressions in the Head:
  X.father : man :- X : person.
  X.mother : woman :- X : person.
  ?- person[M => C].
  M/father C/man
  M/mother C/woman
slide-5
SLIDE 5 WEB MODEL
[Figure: a url is mapped to the HTML page it addresses, <HTML><HEAD>…</HEAD> … <A HREF="url'">label</A> … </HTML>, which is modeled as a webdoc object wd; each anchor yields a link wd[hrefs@(label) ->> url'].]
Link Structure:
  • Signature: webdoc[hrefs@(string) =>> url]
  • Example: wd:webdoc[hrefs@(label) ->> url']
Further Attributes:
  webdoc[self => url; address => string; modif => string; error =>> string]
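The link structure above can be imitated outside F-logic. The following is a minimal Python sketch, assuming a dictionary stands in for the webdoc object; the class name HrefCollector, the function webdoc, and the example.org URLs are hypothetical and not part of FLORID.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect label -> {url} pairs from <A HREF="url">label</A> anchors."""

    def __init__(self):
        super().__init__()
        self.hrefs = {}        # label -> set of urls, mirroring hrefs@(label) ->> url
        self._current = None   # href of the anchor currently being read, if any
        self._label = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            self._current = dict(attrs).get("href")
            self._label = []

    def handle_data(self, data):
        if self._current is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            label = "".join(self._label).strip()
            self.hrefs.setdefault(label, set()).add(self._current)
            self._current = None


def webdoc(url, html):
    """Build a dictionary standing in for a webdoc object (hypothetical layout)."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return {"self": url, "hrefs": parser.hrefs}


page = '<HTML><HEAD></HEAD><A HREF="http://example.org/b">next</A></HTML>'
wd = webdoc("http://example.org/a", page)
print(wd["hrefs"])
```

Each anchor label becomes a key and its targets a set, which corresponds to the multi-valued method hrefs@(label).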
slide-6
SLIDE 6 F-LOGIC VIEW ON THE WEB
[Figure: a url object u in the F-logic database is linked by the method get to a webdoc object with the slots hrefs and address; signature url::string[get => webdoc].]
Rule-Based Exploration:
  U.get :- U:url.                         % generate OID
  U.get : webdoc                          % add to webdoc
  U.get[address -> ...; hrefs ->> ...]    % fill in slots
  U:explored :- U:url.get.
  U:unexplored :- U:url, not U:explored.
slide-7
SLIDE 7 SEMANTICS
  • Extension of F-logic by
      • Path Expressions (FLU, VLDB) and closure axioms
      • extended Herbrand universe $U$, Herbrand base $HB$
  • Web Interface:
      • set of reserved names $R$: get, url, hrefs
      • $\mathit{explore}: \mathit{URL} \to \mathcal{P}(HB(\mathit{URL} \cup R))$ maps URLs to sets of new facts
  • Web Access Axiom: for $H \subseteq HB$:
      $H \models u{:}\mathit{url} \wedge u.\mathit{get} \;\Rightarrow\; H \models \mathit{new}$ for all $\mathit{new} \in \mathit{explore}(u)$
      (if get is defined for a URL $u$, then all explored data is in $H$)
  • minimal Herbrand Web Model
  • Integration with Bottom-up Evaluation:
      $T^W_P(H) := H \cup T_P(H) \cup \bigcup \{\mathit{explore}(u) \mid u{:}\mathit{url},\ u.\mathit{get} \in T_P(H)\}$
  • declarative semantics: if $\mathit{explore} = \emptyset$ then WebFLORID = FLORID
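The bottom-up evaluation with Web access can be illustrated with a toy fixpoint computation. This is a sketch under stated assumptions, not FLORID's implementation: facts are plain tuples, the program consists of the single rule "request get for every known url", and the dictionary WEB is a hypothetical stand-in for the real Web.

```python
# Hypothetical stand-in for the Web: each url maps to the facts found on its page.
WEB = {
    "u1": {("url", "u2"), ("url", "u3")},
    "u2": {("url", "u1")},
    "u3": set(),
}


def explore(u):
    """Map a url to the set of new facts obtained by accessing it."""
    return WEB.get(u, set())


def t_p(facts):
    """Toy immediate-consequence operator: derive a get fact for every url fact."""
    return {("get", u) for (kind, u) in facts if kind == "url"}


def t_w_p(facts):
    """One step of the Web-aware operator: H, plus T_P(H), plus explored facts."""
    derived = t_p(facts)
    known = set(facts) | derived
    result = set(known)
    for kind, u in derived:
        if kind == "get" and ("url", u) in known:
            result |= explore(u)
    return result


def least_fixpoint(facts):
    """Iterate the operator until no new facts appear."""
    facts = set(facts)
    while True:
        nxt = t_w_p(facts)
        if nxt == facts:
            return facts
        facts = nxt


model = least_fixpoint({("url", "u1")})
print(sorted(model))
```

If explore returned the empty set for every url, the iteration would reduce to the ordinary fixpoint of the toy program, which matches the remark that WebFLORID then coincides with FLORID.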
slide-8
SLIDE 8 EXAMPLE INTEGRATION: CIA WORLD FACTBOOK and WORLD ONLINE
CIA WORLD FACTBOOK (CIA):
  • geography, people, government, economy
  • no cities apart from country capitals
  • information: link structure, formatted text
  • very structured and regular
  • complete
WORLD ONLINE (WOL):
  • administrative divisions, main cities
  • information: link structure, tables
  • not very regular
  • incomplete
  • WOL author: "All visitors must realize that this site (i.e. collecting the data and putting it up here) is a logical development of one of my hobbies; you therefore cannot expect all data to be of academic standard. What you see is what you get, although I try to be as thorough as possible."
slide-9
SLIDE 9 EXAMPLE INTEGRATION: CIA WORLD FACTBOOK and WORLD ONLINE
slide-10
SLIDE 10 INTEGRATION METHODOLOGY: Typical Steps and Rules
  • ACCESSING RELEVANT PAGES
      C[url@(cia) -> U] :- C:continent[file@(cia) -> FN], strcat(ciasrc, FN, U).
      U:url.get :- C:continent[url@(cia) -> U].
  • EXTRACTING RAW DATA
      pattern(capital, "Capital…\n…").
      pattern(totalarea, "total area…\n…sq km").
      C[Method -> X] :- pattern(Method, RegEx), pmatch(C:country.url@(cia).get, RegEx, …, X).
  • RESTRUCTURING AND DATA CLEANING
      C:realcountry :- C:country[capital -> CA], not substr("none", CA).
  • INTEGRATION OF SOURCES (OBJECT FUSION)
      C1 = C2 :- C1:country[continent -> CT]..main_cities[name@(wol) -> N],
                 C2:country[continent -> CT; capital -> N].name@(cia), not C1 = C2.
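The extraction, cleaning, and fusion steps can be sketched in Python. All inputs below are invented for illustration: the page texts, the two regular expressions, the area values, and the WOL record are assumptions, and the fusion criterion (same continent, CIA capital among the WOL main cities) is one reading of the object-fusion rule rather than the authors' exact formulation.

```python
import re

# Hypothetical inputs; none of these strings or numbers come from the real sources.
CIA_PAGES = {
    "Austria": {"continent": "Europe",
                "text": "Capital: Vienna\ntotal area: 100 sq km\n"},
    "Antarctica": {"continent": "Antarctica",
                   "text": "Capital: none\ntotal area: 200 sq km\n"},
}
PATTERNS = {
    "capital": r"Capital:\s*(.+)",
    "totalarea": r"total area:\s*(\d+) sq km",
}
WOL = [{"name": "Republic of Austria", "continent": "Europe",
        "main_cities": ["Graz", "Vienna"]}]


def extract(pages, patterns):
    """Extraction: one attribute per pattern that matches a country's page."""
    data = {}
    for country, page in pages.items():
        attrs = {"continent": page["continent"]}
        for method, regex in patterns.items():
            match = re.search(regex, page["text"])
            if match:
                attrs[method] = match.group(1)
        data[country] = attrs
    return data


def real_countries(data):
    """Cleaning: drop entries whose capital is missing or contains 'none'."""
    return {c: a for c, a in data.items()
            if "none" not in a.get("capital", "none")}


def fuse(cia, wol):
    """Fusion: pair records sharing a continent and a capital/main-city name."""
    return [(c, w["name"]) for c, a in cia.items() for w in wol
            if a["continent"] == w["continent"]
            and a["capital"] in w["main_cities"]]


cia = real_countries(extract(CIA_PAGES, PATTERNS))
fused = fuse(cia, WOL)
print(fused)
```

Keeping the patterns in a table, as the slide does with pattern(Method, RegEx), means a new attribute needs only a new table entry, not a new extraction function.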
slide-11
SLIDE 11 QUERYING THE INTEGRATED DATA
QUERY: Name the capitals from CIA with their population from WOL.
  ?- _:country[name@(cia) -> Country; capital -> City], _:city[name@(wol) -> City; population -> P].
  P/… City/Vienna Country/Austria
  P/… City/Prague Country/Czech Republic
  P/… City/Paris Country/France
  P/… City/Berlin Country/Germany
  P/… City/Budapest Country/Hungary
  P/… City/Madrid Country/Spain
  P/… City/Stockholm Country/Sweden
  P/… City/Bern Country/Switzerland
  P/… City/London Country/United Kingdom
  … outputs printed
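The query is a join between the two sources on the city name. A minimal Python sketch follows, assuming the integrated data has already been reduced to two dictionaries; the population numbers are placeholders and not real figures, and "Atlantis" is an invented entry that shows a capital without a WOL match.

```python
# Hypothetical data; population values are placeholders, not real figures.
CIA_CAPITALS = {"Austria": "Vienna", "France": "Paris", "Atlantis": "Poseidonia"}
WOL_POPULATION = {"Vienna": 1, "Paris": 2, "Lyon": 3}


def capitals_with_population(capitals, population):
    """Join CIA capitals with WOL populations on the city name."""
    return [{"P": population[city], "City": city, "Country": country}
            for country, city in sorted(capitals.items())
            if city in population]


answers = capitals_with_population(CIA_CAPITALS, WOL_POPULATION)
for row in answers:
    print(row)
```

Capitals that WOL does not list simply produce no answer, which mirrors how the query behaves on an incomplete source.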
slide-12
SLIDE 12 SEMISTRUCTURED DATA
  • So far: structure of documents known in advance ⇒ content-based queries, data extraction
  • However: document structure is often unknown/irregular/partial ⇒ semistructured data
Def.: A semistructured database (ssdb) $D$ is a finite set of labeled edges $x \xrightarrow{\ell} y$; each edge corresponds to the F-logic atom x[l ->> {y}].
Mapping a ssdb to F-logic:
  X:node, Y:node, L:label, X[L ->> {Y}] :- ssdb(X, L, Y).
Example: Web Skeleton Extractor P_ext
  root[src ->> {u1, ..., un}].                  % define root nodes
  node :: url.                                  % nodes are urls
  U:node.get :- root[src ->> {U}].              % get root nodes
  Y:node, L:label, X[L ->> {Y}] :-              % define new nodes/labels/links
      X:node.get[hrefs@(L) ->> {Y}], … .        % by following hrefs
  Y.get :- Y:node.                              % access nodes which satisfy the condition
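The skeleton extractor amounts to a filtered graph traversal that records labeled edges. The sketch below is a Python approximation under stated assumptions: TOY_WEB is a hypothetical in-memory Web, and the function names skeleton, get_hrefs, and accept are illustrative, not FLORID primitives.

```python
from collections import deque

# Hypothetical in-memory Web: url -> {label: [target urls]}.
TOY_WEB = {
    "http://trier.example/db": {
        "journals": ["http://trier.example/db/journals"],
        "ads": ["http://other.example/"],
    },
    "http://trier.example/db/journals": {
        "home": ["http://trier.example/db"],
    },
}


def get_hrefs(url):
    """Stand-in for fetching a page and reading its labeled links."""
    return TOY_WEB.get(url, {})


def skeleton(roots, get_hrefs, accept):
    """Return the labeled edges (x, label, y) reachable from the root nodes."""
    edges = set()
    seen = set(roots)
    queue = deque(roots)
    while queue:
        x = queue.popleft()
        for label, targets in get_hrefs(x).items():
            for y in targets:
                if not accept(x, label, y):
                    continue           # the condition filters which links to follow
                edges.add((x, label, y))
                if y not in seen:      # access each node at most once
                    seen.add(y)
                    queue.append(y)
    return edges


edges = skeleton(["http://trier.example/db"], get_hrefs,
                 lambda x, label, y: "trier" in y)
print(sorted(edges))
```

The resulting set of triples is a semistructured database in the sense of the definition above: every triple is one labeled edge.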
slide-13
SLIDE 13 SEMISTRUCTURED DATA
Specialization of the Skeleton Extractor for DBLP:
  root[src ->> {dblp}].  dblp = "http://www.informatik.uni-trier.de/~ley/db".
  substr("trier", Y)             % and consider only urls containing "trier"
  substr("db/journals/is", Y)    % restrict to IS journal
Queries with path expressions:
  ?- dblp."Inf. Systems".L."Michael E. Senko".
Def.: General path expressions (GPE):
  • $\mathcal{L} \cup \{\mathit{any}\} \subseteq GPE$
  • if $M, N \in GPE$ and $n \in \mathbb{N}$, then the following are in GPE: $M.N$, $M \mid N$, $(M)$, $M^{?}$, $M^{*}$, $M^{+}$, $M^{n}$
  • if … is a binary relation symbol, then if(…) ∈ GPE
  • if $\ell \in \mathcal{L}$ and … is a unary relation symbol, then … ∈ GPE
  • specification/implementation by simple path expressions and rules
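Evaluating a general path expression over labeled edges can be sketched as a recursive function on sets of nodes. This Python sketch covers only labels, any, concatenation, alternation, and Kleene star, as these operators are commonly understood for regular path expressions; the tuple encoding and the toy edge set are assumptions made for illustration.

```python
def step(edges, nodes, gpe):
    """Return the nodes reachable from `nodes` along paths matching `gpe`."""
    kind = gpe[0]
    if kind == "label":                       # a single edge with the given label
        return {y for (x, l, y) in edges if x in nodes and l == gpe[1]}
    if kind == "any":                         # a single edge with any label
        return {y for (x, l, y) in edges if x in nodes}
    if kind == "seq":                         # M.N: first M, then N
        return step(edges, step(edges, nodes, gpe[1]), gpe[2])
    if kind == "alt":                         # M | N: either M or N
        return step(edges, nodes, gpe[1]) | step(edges, nodes, gpe[2])
    if kind == "star":                        # M*: zero or more repetitions of M
        reached = set(nodes)
        frontier = set(nodes)
        while frontier:
            frontier = step(edges, frontier, gpe[1]) - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown path expression: {kind}")


# Hypothetical labeled edges (x, label, y).
EDGES = {
    ("dblp", "journals", "is"),
    ("is", "vol", "v1"),
    ("v1", "author", "senko"),
}

below_journals = step(EDGES, {"dblp"},
                      ("seq", ("label", "journals"), ("star", ("any",))))
print(sorted(below_journals))
```

The star case keeps a set of already reached nodes, so the evaluation terminates on cyclic link structures as well.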
slide-14
SLIDE 14 CONCLUSIONS
Summary:
  • DOOD paradigm attractive for querying and restructuring the Web
      • uniform access to local db and Web data
      • integration of heterogeneous information
      • reasoning about document structure and Web structure
      • use of search engines (AltaVista)
  • Implementation in WebFLORID
  • Florid: http://www.informatik.uni-freiburg.de/~dbis/florid
Outlook:
  • SGML parser
  • Output primitives