
Search, Analysis, and Integration of Web Documents: A Case Study with FLORID



  1. Search, Analysis, and Integration of Web Documents: A Case Study with FLORID
     Rainer Himmeröder, Paul-Th. Kandzia, Bertram Ludäscher, Wolfgang May, Georg Lausen
     Institut für Informatik, Universität Freiburg, Germany
     Overview:
     - Introduction/Motivation
     - FLORID Web model
     - Integration: CIA WORLD FACTBOOK and WORLD ONLINE
     - (Semistructured Data)
     - Conclusions

  2. MOTIVATION
     Goal: a uniform framework for
     - Querying the Web:
       - express declaratively how to query/navigate on the Web
       - extract data from Web pages for populating a database
       - Web-data warehousing
     - Management of Semistructured Data:
       - structure is irregular, partial, unknown, implicit in the data
       - example: HTML pages
       - querying/navigation using general path expressions
       - discover structure
     - Information Integration:
       - heterogeneous sources with different structure
       - wrappers, mediators

  3. QUERYING THE WEB WITH F-LOGIC/FLORID
     - DOOD Paradigm:
       - deduction: for data-driven exploration of the Web and high-level querying
       - object-orientation: for flexible modeling of semistructured data (optional methods instead of NULLs)
     - extension of F-logic for querying and restructuring the Web: Web-FLORID
       - declarative rule-based programming style: uniform language for wrappers and mediators
       - meta features: schema browsing/reasoning, variables at class/method positions
       - restructuring of information
       - navigation by (general) path expressions
     ⇒ uniform access to local db and Web data, integration of heterogeneous information

  4. F-LOGIC IN A NUTSHELL
     Basic Constructs:

         ISA-relation                 Object : Class
         SUBCLASS-relation            SubClass :: Class
         SIGNATURE: single-valued     Class[Method@(P-types) => R-type]
         ... and multi-valued         Class[Method@(P-types) =>> R-types]
         DATA: single-valued          Object[Method@(Params) -> R]
         ... and multi-valued         Object[Method@(Params) ->> {R1,R2}]
         PATH EXPRESSION              Obj.M1@(P1)[Spec1].M2@(P2)[Spec2]

     Object Creation via Path Expressions in the Head:

         X.father : man   :- X : person.
         X.mother : woman :- X : person.

         ?- _:person.M:C.
         M/father  C/man
         M/mother  C/woman

  5. WEB MODEL
     [Figure: two HTML documents, wd1 at "url1" and wd2 at "url2"; wd1 contains <A HREF="url2">label</A>, drawn as the link hrefs@("label") from wd1 to "url2"]
     Link Structure:

         Signature:  webdoc[hrefs@(string) =>> url]
         Example:    wd1:webdoc[hrefs@("label") ->> "url2"]

     Further Attributes:

         webdoc[self => url; address => string; modif => string; ...; error =>> string]
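The link-structure signature on this slide can be mimicked outside F-logic. The following is a minimal Python sketch, not FLORID's implementation: it assumes the standard-library `html.parser` module and a hypothetical helper `webdoc()` that returns a plain dictionary standing in for a webdoc object with `self` and `hrefs` slots. Each anchor label maps to a set of URLs, mirroring the multi-valued method hrefs@(label) ->> url.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects label -> {url, ...} pairs, mirroring hrefs@(label) ->> url."""

    def __init__(self):
        super().__init__()
        self.hrefs = defaultdict(set)  # multi-valued: a label may map to several URLs
        self._current_url = None       # href of the <a> element being read, if any
        self._label_parts = []         # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._current_url = href
                self._label_parts = []

    def handle_data(self, data):
        if self._current_url is not None:
            self._label_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_url is not None:
            label = "".join(self._label_parts).strip()
            self.hrefs[label].add(self._current_url)
            self._current_url = None


def webdoc(url, html):
    """Hypothetical helper: a dict standing in for a webdoc object."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return {"self": url, "hrefs": dict(collector.hrefs)}


wd1 = webdoc("url1", '<HTML><HEAD></HEAD><A HREF="url2">label</A></HTML>')
print(wd1["hrefs"])
```

Keeping a set per label is a deliberate choice here: two anchors with the same text but different targets both survive, which is what a multi-valued method expresses.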

  6. F-LOGIC VIEW ON THE WEB
     [Figure: F-LOGIC-DB; labels: url, webdoc, hrefs, u, get, address]
     Signature:

         url::string[get => webdoc]

     Rule-Based Exploration:

         U.get :- U:url, ... .

     - generate OID U.get
     - add U.get to class webdoc (U.get:webdoc)
     - fill in slots: U.get[address -> ...; hrefs@(...) ->> ...; ...]

         U:explored   :- U:url.get.
         U:unexplored :- U:url, not U:explored.
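The exploration rules on this slide amount to a loop that keeps fetching URLs that are known but not yet explored. Here is a minimal Python sketch of that idea under stated assumptions: `fetch` is a hypothetical callback standing in for U.get, the toy Web is an in-memory dictionary rather than real HTTP access, and every link target is treated as a new url object (in a rule-based setting, the program's own rules would decide which links to follow).

```python
def explore_web(start_urls, fetch, max_pages=100):
    """Fetch unexplored URLs until none remain or max_pages is reached.

    fetch(url) is a hypothetical callback returning {label: set_of_urls};
    it stands in for U.get and the hrefs slots filled in on access.
    """
    known = set(start_urls)  # corresponds to U:url
    explored = {}            # corresponds to U:explored, with the fetched hrefs
    while len(explored) < max_pages:
        # corresponds to U:unexplored :- U:url, not U:explored.
        unexplored = sorted(known - set(explored))
        if not unexplored:
            break
        url = unexplored[0]
        hrefs = fetch(url)
        explored[url] = hrefs
        for targets in hrefs.values():
            known.update(targets)  # assumption: every link target becomes a url
    return known, explored


# Toy in-memory "Web" (invented for illustration only).
TOY_WEB = {
    "url1": {"label": {"url2"}},
    "url2": {"back": {"url1"}, "next": {"url3"}},
    "url3": {},
}

known, explored = explore_web(["url1"], lambda u: TOY_WEB.get(u, {}))
print(sorted(explored))
```

The `max_pages` bound is not part of the slide; it is added because an unrestricted follow-everything policy would not terminate on the real Web.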

  7. SEMANTICS
     - Extension of F-logic by
       - Path Expressions (FLU, VLDB'94)
       - closure axioms
       ⇒ extended Herbrand universe $U$, Herbrand base $HB$
     - Web Interface:
       - set of reserved names $R = \{\mathit{get}, \mathit{url}, \mathit{hrefs}, \ldots\}$
       - $\mathit{explore} : \mathit{URL} \to \mathcal{P}(HB)$ maps URLs to sets of new facts
     - Web Access Axiom: $H \subseteq HB$ satisfies the axiom iff for all $u$ with $(u{:}\mathit{url}) \in H$ and $u.\mathit{get} \in H$: $\mathit{explore}(u) \subseteq H$
       (if $u.\mathit{get}$ is defined for a URL $u$, then all explored data is in $H$)
       ⇒ minimal Herbrand Web Model
     - Integration with Bottom-up Evaluation:
       $T_P^W(H) := T_P(H) \cup \bigcup \{\, \mathit{explore}(u) \mid (u{:}\mathit{url}),\ u.\mathit{get} \in T_P(H) \,\}$
       ⇒ declarative semantics
       - if $\mathit{explore} \equiv \emptyset$ then Web-FLORID = FLORID
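The integration of Web access with bottom-up evaluation can be illustrated with a small fixpoint computation. This is a hedged Python sketch, not FLORID's evaluator: facts are plain tuples, rules are hypothetical Python functions from a fact set to derived facts, and the step operator is an inflationary variant (it keeps all earlier facts), which is an assumption made to keep the example short. After each ordinary rule step, `explore(u)` is added for every derived `("get", u)` fact.

```python
def t_p(rules, facts):
    """One inflationary application of the rules (stands in for T_P)."""
    derived = set(facts)
    for rule in rules:
        derived |= rule(facts)
    return derived


def t_p_web(rules, explore, facts):
    """A T_P step, then explore(u) is added for every derived ("get", u)."""
    derived = t_p(rules, facts)
    for fact in list(derived):
        if fact[0] == "get":
            derived |= explore(fact[1])
    return derived


def least_fixpoint(step, start):
    """Iterates step from start until nothing new is derived."""
    current = set(start)
    while True:
        following = step(current)
        if following == current:
            return current
        current = following


# Hypothetical rules: access every known url; turn link targets into urls.
def get_rule(facts):
    return {("get", f[1]) for f in facts if f[0] == "url"}


def link_rule(facts):
    return {("url", f[3]) for f in facts if f[0] == "href"}


# Toy explore function over an invented two-page Web.
TOY_WEB = {"u1": {("href", "u1", "next", "u2")}, "u2": set()}


def toy_explore(u):
    return set(TOY_WEB.get(u, set()))


RULES = [get_rule, link_rule]
START = {("url", "u1")}
web_model = least_fixpoint(lambda h: t_p_web(RULES, toy_explore, h), START)
plain_model = least_fixpoint(lambda h: t_p_web(RULES, lambda u: set(), h), START)
print(sorted(web_model))
```

With an `explore` that always returns the empty set, `plain_model` is exactly what the rules alone derive, which matches the slide's remark that Web-FLORID then coincides with FLORID.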

  8. EXAMPLE: INTEGRATION. CIA WORLD FACTBOOK and WORLD ONLINE
     CIA WORLD FACTBOOK (CIA):
     - geography, people, government, economy, ...
     - no cities (apart from country capitals)
     - information: link structure, formatted text
     - very structured and regular
     - complete
     WORLD ONLINE (WOL):
     - administrative divisions, main cities
     - information: link structure, tables
     - not very regular
     - incomplete
     - WOL author: "All visitors must realize that this site (i.e. collecting the data and putting it up here) is a logical development of one of my hobbies; you therefore cannot expect all data to be of academic standard. What you see is what you get, although I try to be as thorough as possible."

  9. EXAMPLE: INTEGRATION. CIA WORLD FACTBOOK and WORLD ONLINE

  10. INTEGRATION METHODOLOGY: Typical Steps and Rules

          %%%% ACCESSING RELEVANT PAGES: %%%%
          C[url@(cia)->U] :- C:continent[file@(cia)->FN], strcat(cia_src,FN,U).
          U:url.get :- C:continent[url@(cia)->U].

          %%%% EXTRACTING "RAW DATA": %%%%
          pattern(capital,"Capital:…\n…").
          ...
          pattern(total_area,"total area:…\n…sq km…").
          C[Method -> X] :- pattern(Method,RegEx),
                            pmatch(C:country.url@(cia).get, RegEx, …, X).

          %%%% RESTRUCTURING AND DATA CLEANING: %%%%
          C:real_country :- C:country[capital->CA], not substr("none",CA).

          %%%% INTEGRATION OF SOURCES, OBJECT FUSION: %%%%
          C1 = C2 :- C1:country[continent->CT]…main_cities…name@(wol)…N…,
                     C2:country[continent->CT]…capital…N…name@(cia)…,
                     not C1=C2.
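The extraction and cleaning steps on this slide follow a table-driven pattern: one regular expression per attribute, one generic rule that applies them all, and a filter on the result. A minimal Python sketch of that pattern follows. The regular expressions, the sample page texts, and the function names are all invented for illustration (the original patterns are not fully recoverable from the slide), and Python's `re` module stands in for the `pmatch` predicate.

```python
import re

# Hypothetical patterns: attribute name -> regex with one capture group.
PATTERNS = {
    "capital": r"Capital:\s*(.+)",
    "total_area": r"total area:\s*([\d,]+)\s*sq km",
}


def extract(page_text, patterns=PATTERNS):
    """Generic rule: each pattern that matches defines one attribute."""
    record = {}
    for method, regex in patterns.items():
        match = re.search(regex, page_text)
        if match:
            record[method] = match.group(1).strip()
    return record  # attributes without a match are simply absent (no NULLs)


def is_real_country(record):
    """Cleaning rule: a capital must be defined and must not contain 'none'."""
    capital = record.get("capital")
    return capital is not None and "none" not in capital


# Invented sample pages; they are not taken from the CIA World Factbook.
PAGES = {
    "Examplia": "Capital: Example City\ntotal area: 1,000 sq km\n",
    "Nowhere Island": "Capital: none; administered from elsewhere\n",
}

records = {name: extract(text) for name, text in PAGES.items()}
real_countries = sorted(n for n, r in records.items() if is_real_country(r))
print(real_countries)
```

Adding a new attribute then means adding one entry to `PATTERNS`, which is the same economy the generic `C[Method -> X]` rule gives on the slide.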

  11. QUERYING THE INTEGRATED DATA

          %% QUERY: "Name the capitals (from CIA) with their population (from WOL)"
          ?- _:country[name@(cia) -> Country; capital -> City],
             _:city[name@(wol) -> City; population -> P].

          P/…  City/"Vienna"     Country/"Austria"
          P/…  City/"Prague"     Country/"Czech Republic"
          P/…  City/"Paris"      Country/"France"
          P/…  City/"Berlin"     Country/"Germany"
          P/…  City/"Budapest"   Country/"Hungary"
          P/…  City/"Madrid"     Country/"Spain"
          P/…  City/"Stockholm"  Country/"Sweden"
          P/…  City/"Bern"       Country/"Switzerland"
          P/…  City/"London"     Country/"United Kingdom"
          9 output(s) printed
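The query on this slide is a join between the two sources on the city name. The sketch below shows the same join in Python under stated assumptions: records are plain dictionaries, the function name `capitals_with_population` is hypothetical, and all sample names and numbers are invented placeholders rather than data from either source.

```python
def capitals_with_population(cia_countries, wol_cities):
    """Joins CIA capitals with WOL city populations on the city name."""
    population_by_city = {}
    for city in wol_cities:
        if "population" in city:  # cities without a population yield no answer
            population_by_city.setdefault(city["name"], []).append(city["population"])

    answers = []
    for country in cia_countries:
        capital = country.get("capital")  # countries without a capital are skipped
        for population in population_by_city.get(capital, []):
            answers.append(
                {"P": population, "City": capital, "Country": country["name"]}
            )
    return answers


# Invented sample data, for illustration only.
CIA = [
    {"name": "Examplia", "capital": "Example City"},
    {"name": "Sampleland", "capital": "Sampleville"},
    {"name": "Nocapitalia"},
]
WOL = [
    {"name": "Example City", "population": 1000},
    {"name": "Sampleville"},
    {"name": "Elsewhere", "population": 5},
]

for answer in capitals_with_population(CIA, WOL):
    print(answer)
```

Records with missing attributes drop out of the result without special handling, which is the behaviour that optional methods (instead of NULLs) give for semistructured data.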
