


Text Search for Fine-grained Semi-structured Data

Soumen Chakrabarti
Indian Institute of Technology, Bombay
www.cse.iitb.ac.in/~soumen/

Acknowledgments: S. Sudarshan, Arvind Hulgeri, B. Aditya, Parag


Two extreme search paradigms

Searching an RDBMS
  • Complex data model: tables, rows, columns, data types
  • Expressive, powerful query language
  • Need to know the schema to query
  • Answer = unordered set of rows
  • Ranking: an afterthought

Information Retrieval
  • Collection = set of documents; document = sequence of terms
  • Terms and phrases present or absent
  • No (nontrivial) schema to learn
  • Answer = sequence of documents
  • Ranking: central to IR


Convergence?

SQL → XML search
  • Trees, reference links
  • Labeled edges
  • Nodes may contain structured data or free text fields (data vs. document)
  • Query involves node data and edge labels; partial knowledge of the schema is OK
  • Answer = set of paths

Web search → IR
  • Documents are nodes in a graph
  • Hyperlink edges have important but unspecified semantics (Google, HITS)
  • Query language remains primitive: no data types, no use of the tag-tree
  • Answer = URL list

Outline of this tutorial

  • Review of text indexing and information retrieval (IR)
  • Support for text search and similarity join in relational databases with text columns
  • Text search features in major XML query languages (and what’s missing)
  • A graph model for semi-structured data with “free-form” text in nodes
  • Proximity search formulations and techniques; how to rank responses
  • Folding in user feedback
  • Trends and research problems


Text indexing basics

  • “Inverted index” maps from term to document IDs
  • Term offset info enables phrase and proximity (“near”) searches
  • Document boundaries limit what “near” queries can express
  • Can extend the inverted index to map terms to: table names, column names, primary keys, RIDs, XML DOM node IDs

(Figure: posting lists for two example documents D1 and D2; each term maps to document:offset entries, e.g. D1:1,5,8 D2:1,5,8 for a term occurring in both documents, and D2:7, D1:7, D1:3 for terms occurring once.)
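The positional inverted index described above can be sketched in a few lines. This is an illustrative toy (whitespace tokenization, dict-of-lists postings), not the representation any particular system uses:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]}, enabling phrase/proximity search."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, phrase):
    """Return doc ids where the terms of `phrase` occur at consecutive offsets."""
    terms = phrase.lower().split()
    if not terms or terms[0] not in index:
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for s in starts:
            if all(doc_id in index.get(t, {}) and s + i in index[t][doc_id]
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
    return hits
```

Proximity ("near") search follows the same pattern, replacing the consecutive-offset test with a window check.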
Information retrieval basics

  • Stopwords and stemming
  • Each term t in the lexicon gets a dimension in vector space
  • Documents and the query are vectors in term space
  • Component of d along axis t is TF(d,t) (scale up): absolute term count, or scaled by the max term count
  • Downplay frequent terms (scale down): IDF(t) = log(1 + |D|/|D_t|), where D_t is the set of documents containing t
  • Better model: document vector d has component TF(d,t)·IDF(t) for term t
  • Query is like another “document”; documents are ranked by cosine similarity with the query
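The TF-IDF ranking just described can be sketched directly, using the slide’s IDF formula log(1 + |D|/|D_t|); this is a minimal illustration, not an engine:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: name -> list of terms. Returns name -> {term: TF*IDF weight}."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    vecs = {}
    for name, terms in docs.items():
        tf = Counter(terms)
        # TF(d,t) * IDF(t), with IDF(t) = log(1 + |D| / |D_t|)
        vecs[name] = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Treating the query as just another vector, documents are sorted by `cosine(query_vec, doc_vec)`.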


Map

  IR support | Relational (data model)     | XML-like (data model)
  None       | SQL, Datalog                | XML-QL, XQuery
  Schema     | WHIRL                       | ELIXIR, XIRQL
  No schema  | DBXplorer, BANKS, DISCOVER  | EasyAsk, Mercado, DataSpot, BANKS

  • “None” = nothing more than string equality, containment (substring), and perhaps lexicographic ordering
  • “Schema”: extensions to query languages; user needs to know the data schema; IR-like ranking schemes; no implicit joins
  • “No schema”: keyword queries, implicit joins

WHIRL (Cohen 1998)

Relations place(univ,state) and job(univ,dept)

  • Ranked retrieval from an RDBMS:
      select univ from job where dept ~ ‘Civil’
  • Ranked similarity join on text columns:
      select state, dept from place, job where place.univ ~ job.univ
  • Limit the answer to the best k matches only
  • Avoid evaluating the full Cartesian product: an “iceberg” query
  • Useful for data cleaning and integration
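A naive version of the ranked similarity join can be written as a scored cross product followed by top-k selection. This baseline is exactly what WHIRL works to avoid evaluating in full; it is shown here only to make the semantics concrete (unweighted term counts stand in for TF-IDF):

```python
import heapq
import math
from collections import Counter

def cos(a, b):
    """Cosine similarity of two short text fields, as bags of words."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_join(left, right, k):
    """Best k pairs (score, l, r) by similarity of the joined text columns.

    Note: this materializes the full cross product; WHIRL's contribution is
    finding the top k without doing so.
    """
    pairs = ((cos(l, r), l, r) for l in left for r in right)
    return heapq.nlargest(k, pairs)
```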


WHIRL scoring function

A where-clause in WHIRL is a
  • Boolean predicate as in SQL (age=35): the score for such clauses is 0/1
  • Similarity predicate (job ~ ‘Web design’): score = cosine(job, ‘Web design’)
  • Conjunction or disjunction of clauses: sub-clause scores are interpreted as probabilities
      score(B1 ∧ … ∧ Bn; θ) = Π_{1≤i≤n} score(Bi, θ)
      score(B1 ∨ … ∨ Bn; θ) = 1 − Π_{1≤i≤n} (1 − score(Bi, θ))
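The two combination rules above (product for conjunction, noisy-or for disjunction) are simple to state in code:

```python
def score_and(scores):
    """Conjunction: product of sub-clause scores, treated as probabilities."""
    p = 1.0
    for s in scores:
        p *= s
    return p

def score_or(scores):
    """Disjunction: noisy-or, 1 minus the product of the complements."""
    p = 1.0
    for s in scores:
        p *= 1.0 - s
    return 1.0 - p
```

For example, two clauses each scoring 0.5 give 0.25 under conjunction and 0.75 under disjunction.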

Query execution strategy

select state, dept from place, job where place.univ ~ job.univ

  • Start with place(U1,S) and job(U2,D), where U1, U2, S and D are “free”: any binding of these variables to constants is associated with a score
  • Greedily extend the current bindings for the maximum gain in score
  • Backtrack to find more solutions


XQuery (Quilt + Lorel + YATL + XML-QL path expressions)

(Figure: the recipes.xml document tree, with a recipe node bound to $r, an ingredient whose name attribute is “flour”, and a title such as “Tortilla …”.)

<dishes_with_flour> {
  FOR $r IN document("recipes.xml")//recipe[//ingredient[@name="flour"]]
  RETURN <dish>{$r/title/text()}</dish>
} </dishes_with_flour>

Early text support in XQuery

  • Title of books containing some paragraph mentioning both “sailing” and “windsurfing”:

    FOR $b IN document("bib.xml")//book
    WHERE SOME $p IN $b//paragraph SATISFIES
      (contains($p,"sailing") AND contains($p,"windsurfing"))
    RETURN $b/title

  • Title and text of documents containing at least three occurrences of “stocks”:

    FOR $a IN view("text_table")
    WHERE numMatches($a/text_document,"stocks") > 3
    RETURN <text>{$a/text_title}{$a/text_document}</>



Tutorial outline

  • Review of text indexing and information retrieval
  • Support for text search and similarity join in relational databases with text columns (WHIRL)
  • Adding IR-like text search features to XML query languages (Chinenyanga et al. 2001, Fuhr et al. 2001)

  IR support | Relational (data model)     | XML-like (data model)
  None       | SQL, Datalog                | XML-QL, XQuery
  Schema     | WHIRL                       | ELIXIR, XIRQL
  No schema  | DBXplorer, BANKS, DISCOVER  | EasyAsk, Mercado, DataSpot, BANKS


ELIXIR: Adding IR to XQuery

  • Ranked select:

    for $t in document("db.xml")/items/(book|cd)
    where $t/text() ~ "Ukrainian recipe"
    return <dish>$t</dish>

  • Ranked similarity join: find titles in recent VLDB proceedings similar to speeches in Macbeth:

    for $vi in document("vldb.xml")/issue[@volume>24],
        $si in document("macbeth.xml")//speech
    where $vi//article/title ~ $si
    return <similar><title>$vi//article/title</>
           <speech>$si</></similar>



How ELIXIR works

(Figure: the ELIXIR query enters the ELIXIR compiler, which flattens it to WHIRL. XQuery filters/transformers operate over the base XML documents (VLDB.xml, Macbeth.xml); WHIRL select/join filters combine their outputs, which are then rewritten to XML as the result.)


A more detailed view

(Figure: a worked example of the flattening, showing XML fragments from the source documents, the intermediate XQuery filters, the generated WHIRL query, and the final XML result.)



Observations

  • SQL/XQuery + IR-like result ranking; schema knowledge remains essential
  • “Free-form” text vs. tagged, typed fields; element hierarchy, element names, IDREFs
  • Typical Web search is two words long
  • End-users don’t type SQL or XQuery
      Possible remedy: HTML form access
      Limitation: restricted views and queries


Using proximity without schema

  • General, detailed representation: XML
  • Lowest common representation: collection, document, terms; document = node, hyperlink = edge
  • Middle ground
      Graph with text (or structured data) in nodes
      Links: element, subpart, IDREF, foreign keys
      All links hint at an unspecified notion of proximity
  • Exploit structure where available, but do not impose structure by fiat



Two paradigms of proximity search

  • A single node as query response
      Find a node that matches the query terms… …or is “near” nodes matching the query terms (Goldman et al., 1998)
  • A connected subgraph as query response
      A single node may not match all keywords
      No natural “page boundary”


Single-node response examples

  • Query: Travolta, Cage → Answer: Actor, Face/Off
  • Query: Travolta, Cage, Movie → Answer: Face/Off
  • Query: Kleiser, Movie → Answer: Gathering, Grease
  • Query: Kleiser, Woo, Actor → Answer: Travolta

(Figure: a graph with “acted-in” edges from Travolta to Grease and Face/Off and from Cage to Face/Off, “directed” edges from Kleiser to Grease and Gathering and from Woo to Face/Off, and “is-a” edges connecting each person to the Actor or Director type node and each film to the Movie type node.)



Basic search strategy

  • A node subset A is activated because its nodes match the query keyword(s)
  • Look for nodes near the activated nodes
  • Goodness of a response node depends
      directly on its degree of activation
      inversely on its distance from the activated node(s)


Ranking a single node response

  • Activated node set A
  • Rank each node r in the “response set” R based on its proximity to nodes a in A
  • Nodes have relevance ρ(a), ρ(r) in [0,1]; edge costs are “specified by the system”
  • d(a,r) = cost of the shortest path from a to r
  • Bond between a and r:
      b(a,r) = ρ(a)·ρ(r) / d(a,r)^t
  • Parameter t tunes the relative emphasis on distance and relevance score
  • Several ad-hoc choices



Scoring single response nodes

  • Additive:  score(r) = Σ_{a∈A} b(a,r)
  • Belief:  score(r) = 1 − Π_{a∈A} (1 − b(a,r))
  • Goal: list a limited number of nodes with the largest scores
  • Performance issues
      Assume the graph is in memory?
      Precompute all-pairs shortest paths (|V|² space)?
      Prune unpromising candidates?

Hub indexing

  • Decompose the APSP problem using sparse vertex cuts
  • With a cut {p,q} separating node sets A and B, store |A|+|B| shortest paths to p, |A|+|B| shortest paths to q, and d(p,q)
  • To find d(a,b), compare
      d(a,p,b) not through q
      d(a,q,b) not through p
      d(a,p,q,b)
      d(a,q,p,b)
  • Greatest savings when |A| ≈ |B|
  • Heuristics to find cuts, e.g. large-degree nodes

(Figure: sets A and B with a ∈ A and b ∈ B, joined only through the hub nodes p and q.)
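The distance lookup through a two-node hub cut reduces to comparing the four routes listed above. A minimal sketch, assuming the per-side distances to each hub are already stored:

```python
def hub_distance(a, b, d_to_p, d_to_q, d_pq):
    """d(a,b) across a cut {p,q}: the best of the four hub routes.

    d_to_p, d_to_q: dicts of each node's stored shortest distance to hubs
    p and q (computed within its own side); d_pq = d(p,q).
    """
    return min(d_to_p[a] + d_to_p[b],          # a -> p -> b
               d_to_q[a] + d_to_q[b],          # a -> q -> b
               d_to_p[a] + d_pq + d_to_q[b],   # a -> p -> q -> b
               d_to_q[a] + d_pq + d_to_p[b])   # a -> q -> p -> b
```

Storage drops from |A|·|B| pairwise distances to |A|+|B| distances per hub, which is why a balanced cut (|A| ≈ |B|) saves the most.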



Connected subgraph as response

  • A single node may not match all keywords
  • No natural “page boundary”
  • Two scenarios
      Keyword search on relational data: keywords spread among normalized relations
      Keyword search on XML-like or Web data: keywords spread among DOM nodes and subtrees


Tutorial outline

  • Adding IR-like text search features to XML query languages
  • A graph model for relational data with “free-form” text search and implicit joins
  • Generalizing to graph models for XML

  IR support | Relational (data model)     | XML-like (data model)
  None       | SQL, Datalog                | XML-QL, XQuery
  Schema     | WHIRL                       | ELIXIR, XIRQL
  No schema  | DBXplorer, BANKS, DISCOVER  | EasyAsk, Mercado, DataSpot, BANKS


Keyword search on relational data

  • Tuple = node
  • Some columns have text
  • Foreign key constraints = edges in the schema graph
  • Query = set of terms
  • No natural notion of a document
      Normalization: joins may be needed to generate results
      Cycles may exist in the schema graph: ‘Cites’

(Example tables:)

  PaperID  PaperName        AuthorID  AuthorName
  P1       DBXplorer        A1        Chaudhuri
  P2       BANKS            A2        Sudarshan
                            A3        Hulgeri

  Citing  Cited             AuthorID  PaperID
  P2      P1                A1        P1
                            A2        P2
                            A3        P2


DBXplorer and DISCOVER

  • Enumerate subsets of relations in the schema graph which, when joined, may contain rows that have all the keywords in the query: “join trees” derived from the schema graph
  • Output a SQL query for each join tree
  • Generate the joins, checking rows for matches

(Agrawal et al. 2002, Hristidis et al. 2002)

(Figure: a schema graph over tables T1–T5 with keyword matches K1, K2, K3, and candidate join trees such as T2; T2–T3–T4; T2–T3–T5; T2–T3–T4–T5.)
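The second step, emitting one SQL query per join tree, can be sketched as straightforward string assembly. Table, column, and keyword-predicate names below are illustrative, not taken from either system:

```python
def join_tree_sql(tables, join_edges, keyword_conds):
    """Emit a SQL query for one candidate join tree.

    tables:        e.g. ["T2", "T3", "T4"]
    join_edges:    pairs of joined columns, e.g. [("T2.k", "T3.k")]
    keyword_conds: per-table keyword predicates, e.g. ["T3.txt LIKE '%K3%'"]
    """
    where = [f"{a} = {b}" for a, b in join_edges] + keyword_conds
    return ("SELECT * FROM " + ", ".join(tables) +
            " WHERE " + " AND ".join(where))
```

Each generated query is then handed to the RDBMS, which does the actual join and keyword filtering.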



Discussion

  • Exploits relational schema information to contain the search
  • Pushes final extraction of joined tuples into the RDBMS; faster than dealing with the full data graph directly
  • Coarse-grained ranking based on the schema tree
  • Does not model proximity or (dis)similarity of individual tuples
  • No recipe for data with less regular (e.g. XML) or ill-defined schema


Generalized graph proximity

  • General data graph
      Nodes have text and can be scored against the query
      Edge weights express dissimilarity
  • Query is a set of keywords, as before
  • Response is a connected subgraph of the database
  • Each response graph is scored using
      node weights, which reflect match quality (maximize)
      edge weights, which reflect lack of proximity (minimize)



Motivation from Web search

  • “Linux modem driver for a Thinkpad A22p”
      A hyperlink path matches the query collectively
      A conjunction query would fail
  • Projects where X and P work together
      A conjunction may retrieve the wrong page
  • General notion of graph proximity

(Figure: example Web pages connected by hyperlinks that collectively match the query terms.)

“Information unit” (Lee et al., 2001)

  • Generalizes join trees to arbitrary graph data
  • A connected subgraph of the data without cycles
  • Includes at least one node containing each query keyword
  • Edge weights represent the price to pay to connect all keyword-matching nodes together
  • May have to include non-matching nodes

(Figure: a small weighted graph in which the cheapest tree connecting the keyword-matching nodes passes through non-matching nodes.)


Setting edge weights

  • Edges are generally directed
      Foreign to primary key in relational data
      Containing to contained element in XML
      IDREFs have a clear source and target
  • Consider the RDBMS scenario
  • Forward edge weight for edge (u,v)
      u, v are tuples in tables R(u), R(v)
      Weight s(R(u),R(v)) is set between the tables
  • Proximity search must traverse edges in both directions … what should the backward weight w_b(u,v) be?

(Figure: Paper1 and Paper2 tuples connected through a Cites table with Citing (Src) and Cited (Dst) columns.)


Backward edge weights

  • “Distance” between a pair of nodes is asymmetric in general
      Ted Raymond acted only in The Truman Show, which is 1 of 55 movies for Jim Carrey
      w(e1) should be larger than w(e2): think “resistance” on the edge
  • For every edge (u,v) that exists, set
      w_b(u,v) = s(R(v),R(u)) · IN_v(u)
      where IN_v(u) is the number of edges from tuples of R(v) to u
  • w(u,v) = min{ w_f(u,v), w_b(u,v) }
  • More general edge weight models are possible, e.g., RST relation path-based weights


Node weights

  • Relevance w.r.t. keyword(s)
      0/1: the node contains the term or it does not
      Cosine score in [0,1], as in IR
      Uniform model: a node for each keyword (e.g. DataSpot)
  • Popularity or prestige, e.g. for the query “mohan transaction”
      Indegree
      PageRank:  p(v) = d/N + (1 − d) · Σ_{u→v} p(u) / OutDegree(u)
      (with probability d, jump to a random node; with probability 1 − d, jump to an out-neighbor chosen uniformly at random)
  • Node weight = relevance + prestige
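The PageRank recurrence above converges under simple power iteration. A minimal sketch, with d as the random-jump probability to match the slide’s formula (and assuming every node has at least one out-link):

```python
def pagerank(out_links, d=0.15, iters=50):
    """Power iteration for p(v) = d/N + (1-d) * sum_{u->v} p(u)/OutDegree(u).

    out_links: node -> list of successors (assumed nonempty for every node).
    """
    nodes = list(out_links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: d / n for v in nodes}          # random-jump mass
        for u, succs in out_links.items():
            share = (1.0 - d) * p[u] / len(succs)
            for v in succs:
                nxt[v] += share                  # mass pushed along out-edges
        p = nxt
    return p
```

Since every node keeps its out-degree nonzero here, the scores remain a probability distribution across iterations.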


Trading off node and edge weights

  • A high-scoring answer A should have
      large node weight
      small edge weight
  • Weights must be normalized to extreme values
  • N(v) = node weight of v
  • Overall NodeScore = Σ_{v∈A} N(v) / ( N_max · log(1 + #nodes) )
  • Overall EdgeScore = 1 / ( 1 + Σ_{e∈A} log(1 + w(e)/w_min) )
  • Overall score = EdgeScore × NodeScore^λ
      λ tunes the relative contribution of nodes and edges
  • Ad-hoc, but guided by heuristic choices in IR



Data structures for search

  • Answer = tree with at least one leaf containing each keyword in the query
      Group Steiner tree problem, NP-hard
  • Query term t is found in source nodes S_t
  • Single-source shortest path (SSSP) iterator
      Initialize with a source (or near) node
      Consider edges backwards
      getNext() returns the next nearest node
  • For each iterator, each visited node v maintains, for each term t, a set v.R_t of the nodes in S_t which have reached v

Generic expanding search

  • Near node sets S_t, with S = ∪_t S_t
  • For all source nodes σ ∈ S, create a SSSP iterator with source σ
  • While more results are required
      Get the next iterator and its next-nearest node v
      Let t be the term for the iterator’s source s
      crossProduct = {s} × Π_{t′≠t} v.R_{t′}
      For each tuple of nodes in crossProduct, create an answer tree rooted at v with paths to each source node in the tuple
      Add s to v.R_t


Search example (“Vu Kleinberg”)

(Figure: a bibliographic graph. Authors Quoc Vu, Jon Kleinberg, Divyakant Agrawal, and Eva Tardos have “writes” edges to the papers “A metric labeling problem”, “Authoritative sources in a hyperlinked environment”, and “Organizing Web pages by ‘Information Unit’”; “cites” edges connect the papers.)


First response

(Figure: the same bibliographic graph, highlighting the first answer tree, which connects Quoc Vu and Jon Kleinberg through their papers and the citation edges between them.)



Folding in user feedback

  • As in IR systems, results may be imperfect
      Unlike SQL or XQuery, no exact control over matching, ranking, and answer graph form
      Ad-hoc choices for node and edge weights
  • Per-user and/or per-session
      By graph/path/node type, e.g. “want author citing author,” not “author coauthoring with author”
  • Across users
      Modify edge costs to favor nodes (or node types) liked by users


Random walk formulations

  • Generalize PageRank to treat outlinks differently
      τ(u,v) is the “conductance” of edge (u,v)
      p(v) is a function of τ(u,v) for all in-neighbors u of v:
        p(v) = d/N + (1 − d) · Σ_{u→v} τ(u,v) · p(u),  so  ∂p(v)/∂τ(u,v) ∝ p(u)
      (with probability d, jump to a random node; with probability 1 − d, jump to an out-neighbor chosen in proportion to the τ values)
  • p_guess(v) … scores at convergence
  • p_user(v) … scores from user feedback
  • Gradient ascent/descent
      For each edge (u,v), set (with learning rate η):
        τ(u,v) ← τ(u,v) + η · sgn(p_user(v) − p_guess(v)) · p(u) / Σ_{u′→v} p(u′)
      Re-iterate to convergence



Prototypes and products

  • DTL DataSpot / Mercado Intuifind  www.mercado.com/
  • EasyAsk  www.easyask.com/
  • ELIXIR  www.smi.ucd.ie/elixir/
  • XIRQL  ls6-www.informatik.uni-dortmund.de/ir/projects/hyrex/
  • Microsoft DBXplorer
  • BANKS  www.cse.iitb.ac.in/banks/

Summary

  • Confluence of structured and free-format, keyword-based search
      Extends SQL, XQuery, Web search, IR
      Many useful applications: product catalogs, software libraries, Web search
  • Key idiom: proximity in a graph representation of textual data
      Implicit joins on foreign keys
      Proximity via IDREF and other links
  • Several working systems
  • Not enough consensus on clean models


Open problems

  • Simple, clean principles for setting weights
      Node/edge scoring is ad-hoc
      Contrast with classification and distillation
  • Iceberg queries
      Incremental answer generation heuristics do not capture the bicriteria nature of the cost
  • Aggregation: how to express / execute it
  • User interaction and query refinement
  • Advanced applications
      Web query, multipage knowledge extraction
      Linguistic connections through WordNet


Selected references

  • R. Goldman, N. Shivakumar, S. Venkatasubramanian, H. Garcia-Molina. Proximity search in databases. VLDB 1998, pages 26-37.
  • S. Dar, G. Entin, S. Geva, E. Palmon. DTL’s DataSpot: Database exploration using plain language. VLDB 1998, pages 645-649.
  • W. Cohen. WHIRL: A word-based information representation language. Artificial Intelligence 118(1-2), pages 163-196, 2000.
  • D. Florescu, D. Kossmann, I. Manolescu. Integrating keyword search into XML query processing. Computer Networks 33(1-6), pages 119-135, 2000.
  • H. Chang, D. Cohn, A. McCallum. Creating customized authority lists. ICML 2000.



Selected references

  • T. Chinenyanga and N. Kushmerick. Expressive retrieval from XML documents. SIGIR 2001, pages 163-171.
  • N. Fuhr and K. Großjohann. XIRQL: A query language for information retrieval in XML documents. SIGIR 2001, pages 172-180.
  • A. Hulgeri, G. Bhalotia, C. Nakhe, S. Chakrabarti, S. Sudarshan. Keyword search in databases. IEEE Data Engineering Bulletin 24(3): 22-32, 2001.
  • S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. ICDE 2002.