

Cost-Effective Conceptual Design over Taxonomies
Yodsawalai Chodpathumwan, University of Illinois at Urbana-Champaign
Ali Vakilian, Massachusetts Institute of Technology
Arash Termehchy, Amir Nayyeri, Oregon State University


  1. Cost-Effective Conceptual Design over Taxonomies. Yodsawalai Chodpathumwan (University of Illinois at Urbana-Champaign), Ali Vakilian (Massachusetts Institute of Technology), Arash Termehchy and Amir Nayyeri (Oregon State University).

  2. Most information on the web is unstructured (medical articles, HTML pages, …), so users usually have to query unstructured data.
     Example: the query “John Adams, politician” over three Wikipedia articles:
     • <article id=1>: John Adams has been a former member of the Ohio House of Representatives from 2007 to 2014 …
     • <article id=2>: John Adams is a composer whose music is inspired by nature …
     • <article id=3>: John Adams is a public high school located on the east side of Cleveland, Ohio, …
     The ranked list returns all three articles.
     $\text{Precision@}k = \frac{\#\,\text{relevant answers returned in the top } k}{\#\,\text{answers returned in the top } k}$
     precision@3 = 1/3: only article 1 is about a politician, so the ranking quality is poor.
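The precision@k definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `precision_at_k`, the order of the ranked list, and the handling of an empty result list are assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the answers returned in the top k that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0  # assumption: an empty result list scores 0
    hits = sum(1 for answer in top if answer in relevant)
    return hits / len(top)


relevant = {1}  # only article 1 is about a politician

# All three articles are returned; the order [2, 3, 1] is assumed.
unannotated = precision_at_k([2, 3, 1], relevant, k=3)

# A list that returns only article 1 has a denominator of 1, not k.
single_hit = precision_at_k([1], relevant, k=3)

print(unannotated, single_hit)
```

Note that the denominator is the number of answers actually returned in the top k, so a short but accurate list is not penalized.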

  3. Annotating a dataset improves the effectiveness of answering queries.
     Taxonomy (here, the DBpedia taxonomy):
     • A DAG in which each vertex is a concept and each edge is a subclass relation
     • This work considers tree taxonomies
     Example taxonomy, rooted at thing:
     • place → populated place → city, state
     • agent → person → athlete, artist, politician
     • agent → organization → school, legislature
     Annotated Wikipedia articles:
     • <article id=1> (politician, legislature): John Adams has been a former member of the Ohio House of Representatives from 2007 to 2014 …
     • <article id=2> (artist): John Adams is a composer whose music is inspired by nature …
     • <article id=3> (school, city, state): John Adams is a public high school located on the east side of Cleveland, Ohio, …

  4. Users can submit queries with concepts over an annotated dataset.
     Example: over the annotated Wikipedia articles (article 1 annotated politician, article 2 annotated artist, article 3 annotated school), the query Politician(“John Adams”) returns a ranked list containing only <article id=1>.
     precision@3 = 1/1 = 1. Perfect!

  5. Concept annotation is costly.
     Instances of concepts are annotated by a program called a concept annotator. Researchers estimate that annotating each article in the MEDLINE/PubMed dataset using concepts in the MeSH taxonomy costs about $9.4 [K. Liu, 2015].
     It is costly to develop, execute, and maintain a concept annotator.
     • Development:
       • Hand-tuned programming rules: need experts and thousands of rules
       • Machine learning techniques: need to find and extract many relevant features
     • Execution: may take several days and require substantial computational resources
     • Maintenance: datasets evolve over time, so concept annotators must be rewritten and re-executed

  6. It is usually not possible to annotate all concepts.
     Ideally, we would like to annotate instances of all concepts in a given taxonomy from a dataset to answer all queries effectively. With a limited budget, we can annotate instances of only some concepts, because concept annotation is costly.
     Example (DBpedia taxonomy, Wikipedia articles): article 1 could be annotated person or politician, plus legislature; article 2 person or artist; article 3 school or organization, plus city and state.

  7. Annotating datasets with only a subset of concepts from a taxonomy still improves the effectiveness of answering queries.
     Example: the Wikipedia articles are annotated with only person and organization (articles 1 and 2 are annotated person, article 3 is annotated organization). The query Politician(“John Adams”) returns a ranked list containing <article id=2> and <article id=1>.
     precision@3 = 1/2 > 1/3, the precision over the unannotated dataset.

  8. A subset of concepts in a taxonomy used to annotate a dataset is called a conceptual design for the data.
     • $S_2$ = {politician, artist, school, city, state, legislature}: article 1 is annotated politician and legislature, article 2 artist, article 3 school, city, and state.
     • $S_3$ = {person, organization}: articles 1 and 2 are annotated person, article 3 organization.

  9. Which conceptual design to pick?
     Given a dataset, a taxonomy, a sample of the query workload, and a budget, find a subset of concepts from the input taxonomy that maximizes the effectiveness (precision@k) of answering queries.
     Example: the candidate designs {person, agent}, {state, city}, {person, organization}, … yield different p@3 values over the sample queries (0.1, 0.2, and 0.5 in the figure).
     “I want the largest average precision over these queries and under my budget!”
     “I will pick {person, organization} because it is the most effective.”

  10. Problem of Cost-Effective Conceptual Design (CECD)
      Given a dataset, a sample of the query workload, a taxonomy, and an available budget, we would like to select a conceptual design $S$ such that
      • $\sum_{C \in S} w(C) \le B$, where $w$ is the cost function and $B$ is the budget;
      • $S$ provides the largest improvement in the average precision@k of answering queries among all designs that satisfy the budget constraint.
      To quantify the amount of improvement in precision@k, we use the queriability of a design $S$, written $QU(S)$.
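The problem statement can be read as a search over subsets of concepts. The sketch below is a brute-force baseline for tiny taxonomies, not the authors' algorithm: it enumerates every design, keeps those within budget, and returns the one with the highest score. The function `best_design`, the costs, the budget, and the score table are all hypothetical placeholders; in practice the score would be the measured or estimated improvement in precision@k.

```python
from itertools import combinations


def best_design(costs, budget, score):
    """Exhaustively pick the budget-feasible design with the highest score."""
    best, best_score = frozenset(), score(frozenset())
    concepts = sorted(costs)
    for size in range(1, len(concepts) + 1):
        for combo in combinations(concepts, size):
            if sum(costs[c] for c in combo) > budget:
                continue  # violates the budget constraint
            design = frozenset(combo)
            value = score(design)
            if value > best_score:
                best, best_score = design, value
    return best, best_score


# Hypothetical annotation costs and budget.
costs = {"agent": 3, "person": 2, "organization": 2, "state": 1, "city": 1}
budget = 4

# Hypothetical effectiveness scores; designs not listed score 0.
toy_scores = {
    frozenset({"person", "agent"}): 0.1,
    frozenset({"state", "city"}): 0.2,
    frozenset({"person", "organization"}): 0.5,
}

design, value = best_design(costs, budget, lambda s: toy_scores.get(s, 0.0))
print(sorted(design), value)
```

Exhaustive enumeration is exponential in the number of concepts, so it only serves to make the objective and the constraint concrete.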

  11. Partitions of a conceptual design
      Annotating a concept in a taxonomy also improves the quality of answering queries whose concepts are subclasses or descendants of it.
      Example: for $S_3$ = {agent, person} over the example taxonomy,
      • partition(agent) = {legislature, school}
      • partition(person) = {politician, athlete, artist}
      partition($S$) is the set of partitions of the concepts in the conceptual design $S$.
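The example suggests that each leaf concept is assigned to its closest ancestor in the design. A minimal sketch under that reading follows; the `PARENT` map is read off the flattened taxonomy figure, and the function names are hypothetical.

```python
# Child -> parent edges of the example tree taxonomy (assumed from the figure).
PARENT = {
    "place": "thing", "agent": "thing",
    "populated place": "place",
    "city": "populated place", "state": "populated place",
    "person": "agent", "organization": "agent",
    "athlete": "person", "artist": "person", "politician": "person",
    "school": "organization", "legislature": "organization",
}


def leaves(parent):
    """Concepts that are nobody's parent."""
    internal = set(parent.values())
    return {c for c in parent if c not in internal}


def nearest_design_ancestor(concept, design, parent):
    """Walk up from the concept (inclusive) to the first concept in the design."""
    node = concept
    while node is not None:
        if node in design:
            return node
        node = parent.get(node)
    return None


def partitions(design, parent):
    """Map each design concept to the leaves whose closest design ancestor it is."""
    parts = {c: set() for c in design}
    for leaf in leaves(parent):
        owner = nearest_design_ancestor(leaf, design, parent)
        if owner is not None:
            parts[owner].add(leaf)
    return parts


parts = partitions({"agent", "person"}, PARENT)
print(parts)
```

Under this reading, politician goes to person rather than agent because person is the closer annotated ancestor.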

  12. A conceptual design may not help all the queries.
      Example: for $S_3$ = {agent, person} over the example taxonomy,
      • partition(agent) = {legislature, school}
      • partition(person) = {politician, athlete, artist}
      • free($S$) = {state, city}
      The set of leaf concepts that do not belong to any partition of $S$ is called free($S$).
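The free set can be computed by checking which leaves have no ancestor in the design. This is a minimal sketch; the `PARENT` map is an assumption read off the flattened taxonomy figure, and `free_concepts` is a hypothetical helper name.

```python
# Child -> parent edges of the example tree taxonomy (assumed from the figure).
PARENT = {
    "place": "thing", "agent": "thing",
    "populated place": "place",
    "city": "populated place", "state": "populated place",
    "person": "agent", "organization": "agent",
    "athlete": "person", "artist": "person", "politician": "person",
    "school": "organization", "legislature": "organization",
}


def free_concepts(design, parent):
    """Leaves with no ancestor (or themselves) in the design."""
    internal = set(parent.values())
    free = set()
    for concept in parent:
        if concept in internal:
            continue  # only leaf concepts are considered
        node = concept
        while node is not None and node not in design:
            node = parent.get(node)
        if node is None:  # reached the root without meeting the design
            free.add(concept)
    return free


free = free_concepts({"agent", "person"}, PARENT)
print(sorted(free))
```

Queries about concepts in the free set gain nothing from the design, which is why no annotated concept covers city or state here.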

  13. Conceptual design $S$ improves the effectiveness of answering queries whose concepts are in partitions of $S$.
      Example: $S$ = {person, organization} and the query Politician(“John Adams”), where politician ∈ partition(person).
      Let $d(c)$ be the fraction of documents of concept $c$ in the dataset. On the dataset annotated by $S$, the query is answered from the documents annotated person, so the likelihood of returning relevant answers with concept “politician” is
      $\frac{d(\text{politician})}{d(\text{person})}$,
      which is an improvement over the unannotated dataset.

  14. Conceptual design $S$ improves the effectiveness of answering queries whose concepts are in partitions of $S$.
      Example: $S$ = {person, organization}, with a query workload containing queries such as school(…), politician(…), and artist(…).
      • The portion of queries about “politician” is $u(\text{politician})$.
      • The overall improvement for concept “politician” is $\frac{u(\text{politician})\, d(\text{politician})}{d(\text{person})}$.
      • The total improvement from the partition of “person” is $\sum_{c \in \text{partition}(\text{person})} \frac{u(c)\, d(c)}{d(\text{person})}$.
      • The total improvement from design $S$ is $\sum_{P \in \text{partition}(S)} \sum_{c \in \text{partition}(P)} \frac{u(c)\, d(c)}{d(P)}$.
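The total-improvement formula on this slide can be evaluated directly once the partitions, query portions, and document fractions are known. The sketch below does so for the design {person, organization}; every number in `u` and `d` is a hypothetical placeholder, and the function name `queriability` is an assumption.

```python
def queriability(partitions, u, d):
    """Sum of u(c) * d(c) / d(P) over every concept c in every partition P."""
    total = 0.0
    for design_concept, members in partitions.items():
        for c in members:
            total += u[c] * d[c] / d[design_concept]
    return total


# Partitions of the design {person, organization} in the example taxonomy.
partitions = {
    "person": {"politician", "athlete", "artist"},
    "organization": {"school", "legislature"},
}

# Hypothetical portion of queries about each leaf concept.
u = {"politician": 0.4, "athlete": 0.1, "artist": 0.2,
     "school": 0.2, "legislature": 0.1}

# Hypothetical fraction of documents of each concept.
d = {"politician": 0.1, "athlete": 0.2, "artist": 0.1,
     "school": 0.05, "legislature": 0.05,
     "person": 0.4, "organization": 0.1}

total = queriability(partitions, u, d)
print(round(total, 4))
```

With these placeholder values the person partition contributes (0.04 + 0.02 + 0.02) / 0.4 = 0.2 and the organization partition contributes 0.015 / 0.1 = 0.15.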
