Efficient Keyword Search over Virtual XML Views

VLDB 2007 Efficient Keyword Search over Virtual XML Views Feng Shao, Lin Guo, Chavdar Botev, Anand Bhaskar, Muthiah Chettiar, Fan Yang Cornell University Jayavel Shanmugasundaram Yahoo! Research

Applications - Personal Portal
• Example keyword searches: “Beetles Record”, “Chevy Malibu”, “NCAA Rankings”, “DOW Index”
• Portal views: Sports, News, Auto, Finances
• Overlap/Duplicate Views

Applications – Information Integration
• Keyword searches (“Vista”, “budget”) over personalized views, by privilege
• Each XML view combines Projects/Feedback
• Base data:
  • Projects (in XML format): <project> <title>…</title> <comment> … </comment> … </project>
  • Email with comments on project.doc (in XML format): <feedback> <comment> … </comment> … </feedback>

Keyword Search over XML View
• Materialized XML Views?
  • Similar to keyword search over XML documents
    • Many well-studied algorithms
    • Materialize views when loading documents
  • Not applicable in emerging applications!
    • Overlap/Duplicate/Update overhead
    • View definitions not known a priori
• Hence: Keyword Search over Virtual XML Views

Related Work
• Scoring and Indexing in IR community
  • DBXplorer [Agrawal 02], Banks [Bhalotia 02], ObjectRank [Balmin 04], XRank [Guo 02], Discover [Hristidis 02]
  • Work with materialized documents
• Integrating keyword search and structural queries
  • GTP [Chen 03], TermJoin [Khalifa 03]
  • Access base data to evaluate the view
• Projecting XML documents [Marian 03]
  • Access base data; not leveraging indexes

Outline
• Motivation
• Problem Definition
• High-level Overview
• PDT Generation Algorithm
• Experimental Results
• Conclusion

Problem Definition
• Ranked Keyword Search over Virtual XML Views
  • Input: a set of keywords Q = {k1, k2, …, kn}, an XML view definition V over an XML database D
  • Output: the k view elements with the highest scores
  • TF-IDF scores
    • TF(k, e): the number of occurrences of the keyword k in an element e
    • IDF(k): the inverse of the number of elements containing k
    • $\mathrm{Score}(e, Q) = \sum_{i=1}^{n} \mathrm{TF}(k_i, e) \cdot \mathrm{IDF}(k_i)$
    • Score(e, Q) is further normalized by the length of the view element
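The scoring definition above can be illustrated with a minimal Python sketch. It assumes each view element is already available as a list of tokens, and it normalizes by dividing the raw score by the token count; the slide does not say how the length normalization is done, so that choice, the function name `score_elements`, and the toy data are all assumptions for illustration.

```python
import heapq
from collections import Counter


def score_elements(elements, keywords, k):
    """Return the k highest-scoring (score, element_id) pairs.

    elements: dict mapping a (hypothetical) element id to its list of tokens.
    """
    # IDF(kw): inverse of the number of elements containing kw.
    idf = {}
    for kw in keywords:
        containing = sum(1 for tokens in elements.values() if kw in tokens)
        idf[kw] = 1.0 / containing if containing else 0.0

    scored = []
    for elem_id, tokens in elements.items():
        tf = Counter(tokens)  # TF(kw, e): occurrences of kw in the element
        raw = sum(tf[kw] * idf[kw] for kw in keywords)
        # Assumed normalization: divide by the element length in tokens.
        length = len(tokens)
        scored.append((raw / length if length else 0.0, elem_id))
    return heapq.nlargest(k, scored)


# Toy view elements (invented text, loosely modeled on the running example).
view = {
    "book1": "design patterns this book describes".split(),
    "book2": "xml primer excellent search and query decent book on xml".split(),
}
top = score_elements(view, ["xml", "search"], k=1)
```

Note that this sketch scores fully materialized elements, which is exactly what the virtual-view approach tries to avoid; it only pins down what the scores mean.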

Running Example
• Keywords: “XML” & “Search”
• Virtual View:
    for $book in fn:doc("books.xml")/books//book
    where $book/year > 1995
    return
      <book>
        { $book/title }
        { for $review in fn:doc("reviews.xml")/reviews//review
          where $review/isbn = $book/isbn
          return <review>{ $review/content }</review> }
      </book>
• books.xml: books → book elements, e.g. title “Design Patterns”, isbn 111-11-1111, year 1997, publisher Princeton
• reviews.xml: reviews → review elements, e.g. isbn 111-11-1111, rating 5, content “This book describes …”

Running Example
• Keywords: “XML” & “Search”
• Materialized View: book elements, each with a title and nested review/content elements
  • Titles shown: “Design Patterns”, “XML Primer”
  • Review contents shown: “This book describes …”, “Excellent! …”, “… search and query …”, “Decent book … on XML…”
• Base data: books.xml (book: title “Design Patterns”, isbn 111-11-1111, year 1997, publisher Princeton) and reviews.xml (review: isbn 111-11-1111, rating 5, content “This book describes …”)

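Our Approach
• Inputs: keywords (“XML”, “Search”) and the view definition
• Traditional Approach (> 300s): materialize the view over the base data (books, reviews), then run the Keyword Processor over the materialized view results to obtain Ranked Results
• Our Approach (5s):
  • PDT Generator: uses the indexes over books and reviews to build Pruned Document Trees (PDTs)
  • Evaluator: evaluates the view over the PDTs, giving Pruned View Results
  • Scoring & Materialization: scores the Pruned Results and materializes only the Ranked Results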

Our Approach
• Base document (books): book with title “XML Primer”, isbn 111-11-1111, year 1997, publisher Princeton
• PDT (Pruned Document Tree) for books: book with title, year 1997, isbn 111-11-1111, plus index-derived annotations id="1.2.1", kwd1="xml", tf="1", length="10"
• Orders of magnitude smaller!

Our Approach -- Challenges
1. Joining books & reviews requires isbn (a data value)
   -- How to get data values without accessing the base data?
2. Scoring view elements requires aggregate statistical data (e.g., tf from book and review)
   -- How to collect them without materializing the view elements?

Indexes (slide text not recoverable): B+-tree index; entries of the form (ID, TF)

XML View → Query Pattern Tree (QPT)
• Similar to GTP, proposed by Chen 2003 for normal query evaluation
  • Captures the structural parts required by queries
  • Mandatory/Optional edges
• New features
  • Node annotations
    • V: value required to evaluate the view
    • C: content used in the view
• QPT for the books part of the running-example view: books → book → isbn (v), year>1995, title (c), connected by mandatory and optional edges
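One way to picture a QPT is as a small tree of annotated nodes. The Python sketch below is illustrative only: the class `QPTNode`, its field names, and the helper `probe_paths` are hypothetical, and the choice of which edges are mandatory (here: book and year mandatory, isbn and title optional) is an assumption, since the slide figure does not survive extraction.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QPTNode:
    tag: str
    axis: str = "/"               # edge from parent: "/" child, "//" descendant
    mandatory: bool = True        # mandatory vs. optional edge from the parent
    needs_value: bool = False     # "v" annotation: value needed to evaluate the view
    needs_content: bool = False   # "c" annotation: content used in the view
    predicate: Optional[str] = None
    children: List["QPTNode"] = field(default_factory=list)


def probe_paths(node, prefix=""):
    """List the paths of all nodes that have no mandatory child edges."""
    path = prefix + node.axis + node.tag
    paths = []
    if not any(child.mandatory for child in node.children):
        paths.append(path)
    for child in node.children:
        paths.extend(probe_paths(child, path))
    return paths


# QPT for the books part of the running example (edge kinds are assumed).
qpt = QPTNode("books", axis="", children=[
    QPTNode("book", axis="//", children=[
        QPTNode("isbn", mandatory=False, needs_value=True),
        QPTNode("title", mandatory=False, needs_content=True),
        QPTNode("year", mandatory=True, predicate="> 1995"),
    ]),
])
paths = probe_paths(qpt)
```

Under these assumptions `probe_paths` returns the three leaf paths under book, which are the paths whose ID lists are fetched from the index during PDT generation.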

PDT Intuition
• Restrictions enforced by QPT
  • Predicate Restriction
  • Descendant Restriction
  • Ancestor Restriction
• Example base data (books):
  • book: author, year 1994, title “Database Concepts”, isbn 111-11-1112, publisher
  • book: author, year 1997, title “XML Primer”, isbn 121-32-8663, publisher; annotated with id:1.2.1, kwd1=“xml”, tf=1, length=10

PDT Generation
1. Get ID lists for paths in the QPT
2. Merge IDs in the lists to create the PDT

Step 1: Get List of IDs
• QPT: books → book → isbn (v), title (c), year>1995
• B+-Tree:
    PathID                   Value            IDList
    …                        …                …
    /books/book/isbn         “111-11-111”     1.1.1
    /books/book/isbn         “121-23-1321”    1.2.1
    …                        …                …
    /books/book/author/fn    “Jane”           1.2.3, 1.7.3
• Key idea: for each node without mandatory child edges, obtain the corresponding list of ids
  • books//book/isbn: (1.1.1:“111-11-111”), (1.2.1:“121-23-1321”)
  • books//book/title: 1.1.4, 1.2.3, 1.9.3
  • books//book/year: (1.2.6, 1.5.1:“1996”), (1.6.1:“1997”)
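The lookup in Step 1 can be sketched in Python with a sorted list standing in for the B+-tree. This is a simplification under stated assumptions: `INDEX` and `id_list` are hypothetical names, the index is keyed by concrete (path, value) pairs, resolving `//` steps to concrete paths is left out, and the year-1994 entry with ID 1.1.3 is invented to show the predicate filtering.

```python
import bisect

# Hypothetical path-value index kept sorted by (path, value), mimicking the
# key order of a B+-tree. Each entry maps a key to a list of Dewey IDs.
INDEX = sorted([
    (("/books/book/isbn", "111-11-111"), ["1.1.1"]),
    (("/books/book/isbn", "121-23-1321"), ["1.2.1"]),
    (("/books/book/author/fn", "Jane"), ["1.2.3", "1.7.3"]),
    (("/books/book/year", "1994"), ["1.1.3"]),   # invented entry
    (("/books/book/year", "1996"), ["1.2.6", "1.5.1"]),
    (("/books/book/year", "1997"), ["1.6.1"]),
])
KEYS = [key for key, _ in INDEX]


def dewey_key(dewey_id):
    """Turn '1.2.6' into (1, 2, 6) so IDs sort in document order."""
    return tuple(int(part) for part in dewey_id.split("."))


def id_list(path, predicate=None):
    """Range-scan all entries for `path`; return (id, value) pairs in Dewey order."""
    start = bisect.bisect_left(KEYS, (path, ""))  # first key with this path
    result = []
    for (entry_path, value), ids in INDEX[start:]:
        if entry_path != path:
            break  # left the contiguous run of keys for this path
        if predicate is None or predicate(value):
            result.extend((dewey_id, value) for dewey_id in ids)
    return sorted(result, key=lambda pair: dewey_key(pair[0]))


isbn_ids = id_list("/books/book/isbn")
year_ids = id_list("/books/book/year", lambda v: int(v) > 1995)
```

Applying the predicate during the scan means that values failing `year > 1995` never enter the ID list, and no base document is touched.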

Step 2: Merging IDs -- Challenges
• QPT: books → book → title, isbn, year>1995
• Makes a single pass over relevant id lists
  • Flat indices → nested structure
  • Enforce ancestor/descendant restrictions
• Figure (partially recoverable): base tree with Dewey IDs: books (1); book (1.1) with children 1.1.1 to 1.1.5; book (1.2) with children 1.2.1, 1.2.3, 1.2.6, 1.2.7, 1.2.8 (isbn, title, year, author, publisher)

PDT Generator – Merging IDs
• Figure (not recoverable): ID lists → Candidate Tree (CT) → PDT
• Idea: a loop that merges the IDs in the lists and creates the CT nodes in Dewey ID order
• At each step, check the node with the minimum ID in the CT:
  • if it satisfies all restrictions → PDT
  • if it satisfies the descendant restriction but not the ancestor restriction → PDT Cache
  • if it does not satisfy the descendant restriction and has no child node in the CT → Discard
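The single-pass merge can be sketched in Python as a k-way merge over Dewey-ordered ID lists. This is a deliberately simplified stand-in, not the algorithm on the slide: it has no Candidate Tree or PDT Cache, it handles only one level of nesting (children of book), and it checks only a descendant-style restriction (a book is kept only if every mandatory child is present). The names `merge_id_lists` and `mandatory`, and the assumption that year is the mandatory child, are illustrative.

```python
import heapq
from itertools import groupby


def dewey(dewey_id):
    """Turn '1.2.6' into (1, 2, 6) so IDs compare in document order."""
    return tuple(int(part) for part in dewey_id.split("."))


def merge_id_lists(lists, mandatory):
    """Merge per-tag ID lists into nested book entries in one pass.

    lists: dict tag -> Dewey IDs (already sorted in Dewey order).
    mandatory: set of child tags that a book must have to be kept.
    """
    streams = [[(dewey(i), tag) for i in ids] for tag, ids in lists.items()]
    merged = heapq.merge(*streams)  # global Dewey order, single pass
    pdt = {}
    # Consecutive IDs sharing the same two-component prefix belong to one book.
    for book_id, group in groupby(merged, key=lambda entry: entry[0][:2]):
        children = {}
        for child_id, tag in group:
            children.setdefault(tag, []).append(".".join(map(str, child_id)))
        if mandatory.issubset(children):  # simplified descendant restriction
            pdt[".".join(map(str, book_id))] = children
        # otherwise the candidate book is discarded
    return pdt


# ID lists as given on the "Step 1" slide (values omitted).
id_lists = {
    "isbn": ["1.1.1", "1.2.1"],
    "title": ["1.1.4", "1.2.3", "1.9.3"],
    "year": ["1.2.6", "1.5.1", "1.6.1"],
}
pdt = merge_id_lists(id_lists, mandatory={"year"})
```

With these lists, books 1.1 and 1.9 are dropped because no year ID survived the predicate for them, while 1.2, 1.5, and 1.6 are kept.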
