

SLIDE 1

Cost-Effective Conceptual Design over Taxonomies

Yodsawalai Chodpathumwan

University of Illinois at Urbana-Champaign

Ali Vakilian

Massachusetts Institute of Technology

Arash Termehchy, Amir Nayyeri

Oregon State University

SLIDE 2

Users have to query over unstructured datasets.

Medical articles, HTML pages, …

“John Adams, politician”

keyword query

ranked list <article id=1> <article id=2> <article id=3>

Only Article id 1 is about a politician.

poor ranking quality!

precision = 1/3

Precision = #returned relevant answers / #returned answers

<article id=1>

John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …

<article id=2>

John Adams is a composer whose music inspired by nature …

<article id=3>

John Adams is a public high school located on the east side of Cleveland, Ohio, …

Wikipedia article excerpts
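As a small illustration of the precision figure on this slide, the sketch below recomputes 1/3 for the three-article ranked list. The function name `precision` and the use of integer article ids are choices made for this example only.

```python
def precision(returned, relevant):
    """Fraction of the returned answers that are relevant."""
    if not returned:
        return 0.0
    hits = sum(1 for doc_id in returned if doc_id in relevant)
    return hits / len(returned)


# Ranked list returned for the keyword query "John Adams, politician".
returned = [1, 2, 3]
# Only article 1 is about a politician.
relevant = {1}

print(precision(returned, relevant))  # 0.3333333333333333
```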

SLIDE 3

Annotating a dataset helps answer the query.

We can annotate the dataset using concepts from a taxonomy.

DBpedia taxonomy:

thing
  agent
    person: artist, politician, athlete
    organization: school, legislature
  place
    populated place: state, city

<article id=1>

John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …

<article id=2>

John Adams is a composer whose music inspired by nature …

<article id=3>

John Adams is a public high school located on the east side of Cleveland, Ohio, …

Wikipedia article excerpts

Taxonomy:

  • Tree-shaped graph
  • Vertex = concept
  • Edge = subclass relation

Annotation labels in the figure: artist, politician, city, school, state, legislature
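One way to hold such a taxonomy in code is a child-to-parent map, where each entry is one subclass edge. The sketch below is only an illustration: the exact edges are inferred from the partition example later in the deck, and the helper names `ancestors` and `leaves` are made up for this example.

```python
# Child -> parent edges (subclass relation) for the DBpedia fragment on the slide.
# The tree shape is an inference from the deck, not an official DBpedia export.
PARENT = {
    "agent": "thing",
    "place": "thing",
    "person": "agent",
    "organization": "agent",
    "populated place": "place",
    "artist": "person",
    "politician": "person",
    "athlete": "person",
    "school": "organization",
    "legislature": "organization",
    "state": "populated place",
    "city": "populated place",
}


def ancestors(concept):
    """Return the superclasses of `concept`, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def leaves():
    """Concepts that have no subclass in the taxonomy."""
    parents = set(PARENT.values())
    return sorted(c for c in PARENT if c not in parents)


print(ancestors("politician"))  # ['person', 'agent', 'thing']
print(leaves())
```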

SLIDE 4

Users can submit structured queries over the annotated dataset.

Politician(“John Adams”)

Structured keyword query

ranked list <article id=1>

<article id=1>

John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …

<article id=2>

John Adams is a composer whose music inspired by nature …

<article id=3>

John Adams is a public high school located on the east side of Cleveland, Ohio, …

Wikipedia article excerpts

precision = 1/1 = 1

Perfect!

Annotation labels in the figure: city, school, state, artist, politician, legislature

SLIDE 5

Concept annotation is costly

Instances of concepts are annotated by a program called a concept annotator.

It is costly to develop, execute, and maintain a concept annotator.

  • Hand-tuned program rules – need experts, time-consuming
  • Machine learning techniques – lots of relevant features, thousands of rules
  • Executing concept annotators may take several days and require lots of computational resources
  • Datasets evolve over time – rewrite and re-execute concept annotators
SLIDE 6

It is not possible to always annotate all concepts.

Ideally, we would like to annotate instances of all concepts in a given taxonomy from a dataset to answer all queries effectively. In reality, we can only annotate instances of some concepts.

DBpedia taxonomy:

thing
  agent
    person: artist, politician, athlete
    organization: school, legislature
  place
    populated place: state, city

<article id=1>

John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …

<article id=2>

John Adams is a composer whose music inspired by nature …

<article id=3>

John Adams is a public high school located on the east side of Cleveland, Ohio, …

Wikipedia article excerpts

Annotation labels in the figure: person, organization, person

SLIDE 7

Annotating a dataset with only a subset of concepts from a taxonomy still helps.

Politician(“John Adams”)

Structured keyword query

ranked list

<article id=1>

John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …

<article id=2>

John Adams is a composer whose music inspired by nature …

<article id=3>

John Adams is a public high school located on the east side of Cleveland, Ohio, …

Wikipedia article excerpts

Annotated concepts in the figure: person, organization (politician itself is not annotated)

Ranked list: <article id=1> <article id=2>

precision = 1/2 > 1/3 (1/3 is the precision over the unannotated dataset)
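The 1/2 figure can be reproduced with a small sketch. It assumes, purely for illustration, that the query concept is mapped to its lowest ancestor that was actually annotated, and that each article carries one annotation for its "John Adams" mention. The `ANNOTATIONS` table and the function names are hypothetical.

```python
# Subclass edges needed for this example (child -> parent).
PARENT = {
    "politician": "person", "artist": "person", "school": "organization",
    "person": "agent", "organization": "agent", "agent": "thing",
}

# Only these concepts were annotated (the conceptual design).
DESIGN = {"person", "organization"}

# Hypothetical annotation of the "John Adams" mention in each article.
ANNOTATIONS = {1: "person", 2: "person", 3: "organization"}
RELEVANT = {1}  # only article 1 is about a politician


def lowest_annotated(concept, design):
    """Walk up from `concept` to the nearest concept in the design, or None."""
    while concept is not None and concept not in design:
        concept = PARENT.get(concept)
    return concept


def answer(query_concept, design):
    """Return ids of articles whose annotation matches the mapped query concept."""
    target = lowest_annotated(query_concept, design)
    if target is None:  # nothing usable was annotated: keep every keyword match
        return sorted(ANNOTATIONS)
    return sorted(doc for doc, c in ANNOTATIONS.items() if c == target)


returned = answer("politician", DESIGN)
print(returned, len(RELEVANT & set(returned)) / len(returned))  # [1, 2] 0.5
```

With an empty design the same call falls back to all three articles, which corresponds to the 1/3 precision of the unannotated dataset.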

SLIDE 8

Many taxonomies contain a large number of concepts.

  • Medical Subject Headings (MeSH), Plant Ontology, …
  • An organization has a limited amount of resources
  • Annotate a dataset using only a subset of concepts from a given taxonomy: a conceptual design for the data

SLIDE 9

Which conceptual design to pick?

I can only annotate a few concepts over this dataset.

I want the largest average precision over these queries.

Find a cost-effective subset of concepts from an input taxonomy that maximizes the effectiveness of answering queries.

thing
  agent
    person: artist, politician, athlete
    organization: school, legislature
  place
    populated place: state, city

Precision@k

SLIDE 10

Problem of Cost-Effective Conceptual Design (CECD)

Given a dataset, a query workload, a taxonomy, and a fixed budget, we would like to select a conceptual design $S$ such that

  • $\sum_{C \in S} w(C) \le B$, where $w$ is the cost function and $B$ is the fixed budget
  • $S$ provides a larger precision@k of answering queries than the other designs that satisfy the budget constraint.

Let’s quantify the amount of improvement for precision@k: the queriability of a design

SLIDE 11

Partitions of a conceptual design

Given a design $S$ over a taxonomy $X$, the partition of a concept $c \in S$, or $\mathrm{part}(c)$, is a subset of leaf nodes in $X$ such that, for every concept $d \in \mathrm{part}(c)$, the lowest ancestor of $d$ in $S$ is $c$, or $d = c$.

Each leaf concept in $X$ belongs to at most one partition of a design $S$.

thing
  agent
    person: artist, politician, athlete
    organization: school, legislature
  place
    populated place: state, city

$S$ = {agent, person}

  • $\mathrm{part}(\text{person})$ = {politician, athlete, artist}
  • $\mathrm{part}(\text{agent})$ = {legislature, school}
  • $\mathrm{free}(S)$ = {state, city}

The set of leaf concepts that do not belong to any partition of $S$ is called $\mathrm{free}(S)$.
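The definition above can be turned into a short routine: for each leaf, climb the taxonomy until a concept of the design is reached. The sketch below is a minimal illustration; the tree edges are inferred from this slide's example and the function names are made up here.

```python
# Child -> parent edges, inferred from the example on this slide.
PARENT = {
    "agent": "thing", "place": "thing",
    "person": "agent", "organization": "agent",
    "populated place": "place",
    "artist": "person", "politician": "person", "athlete": "person",
    "school": "organization", "legislature": "organization",
    "state": "populated place", "city": "populated place",
}


def leaves(parent):
    """Concepts that never appear as a parent."""
    return sorted(set(parent) - set(parent.values()))


def partitions(design, parent):
    """Return (part, free): part maps each design concept to its leaf partition."""
    part = {c: set() for c in design}
    free = set()
    for leaf in leaves(parent):
        node = leaf
        # Climb to the lowest ancestor (or the leaf itself) that is in the design.
        while node is not None and node not in design:
            node = parent.get(node)
        if node is None:
            free.add(leaf)
        else:
            part[node].add(leaf)
    return part, free


part, free = partitions({"agent", "person"}, PARENT)
print(sorted(part["person"]))  # ['artist', 'athlete', 'politician']
print(sorted(part["agent"]))   # ['legislature', 'school']
print(sorted(free))            # ['city', 'state']
```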

SLIDE 12

Conceptual design $S$ helps answer queries whose concepts are in partitions of $S$.

Example: $C$ = “politician”, $S$ = {person, …}

  • $d(c)$: frequency of documents of concept $c$ in the dataset
  • $u(c)$: popularity of concept $c$ in the query workload, e.g. school(…), politician(…), politician(…), artist(…)

Portion of queries about “politician” is $u(\text{politician})$. Fraction of “politician” documents amongst “person” is

$$\frac{d(\text{politician})}{d(\text{person})}$$

Improvement is

$$\frac{u(\text{politician})\, d(\text{politician})}{d(\text{person})}$$

Total improvement from partition of “person” is

$$\frac{u(\text{politician})\, d(\text{politician})}{d(\text{person})} + \frac{u(\text{artist})\, d(\text{artist})}{d(\text{person})} + \cdots = \sum_{c \in \mathrm{part}(\text{person})} \frac{u(c)\, d(c)}{d(\text{person})}$$

Total improvement from design $S$ is

$$\sum_{P \in S} \sum_{c \in \mathrm{part}(P)} \frac{u(c)\, d(c)}{d(P)}$$

SLIDE 13

The contribution of a design for queries whose concepts are not in any partition of the design.

Generally, the concepts with more instances in the dataset are more likely to appear in the top answers. Thus, it is more likely they contain some relevant answers for the query. The total improvement by concepts in $\mathrm{free}(S)$ is

$$\sum_{c \in \mathrm{free}(S)} u(c)\, d(c)$$

where $u(c)$ is the portion of queries whose concept is $c$ and $d(c)$ is the portion of instances in the dataset that belong to $c$.

[Figure: dataset of person and organization instances; answers; relevant answers]

SLIDE 14

Formal definition of Cost-Effective Conceptual Design Problem

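Given a taxonomy $X$, a dataset $D$, a query workload $Q$ and a budget $B$,

find a conceptual design $S$ over $X$ such that

$$\sum_{c \in S} w(c) \le B$$

and $S$ maximizes the queriability

$$QU(S) = \sum_{P \in S} \sum_{c \in \mathrm{part}(P)} \frac{u(c)\, d(c)\, pr(P)}{d(P)} + \sum_{c \in \mathrm{free}(S)} u(c)\, d(c)$$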
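To make the objective concrete, the sketch below evaluates the queriability formula on the small DBpedia fragment used earlier in the deck. Everything numeric is hypothetical: the values of `D` and `U` are invented for illustration, the frequency of an inner concept is assumed to be the sum over the leaves below it, and the factor written pr(P), which the slides do not define, is assumed to be 1 unless supplied.

```python
# Child -> parent edges; the tree shape is inferred from the partition example.
PARENT = {
    "agent": "thing", "place": "thing",
    "person": "agent", "organization": "agent",
    "populated place": "place",
    "artist": "person", "politician": "person", "athlete": "person",
    "school": "organization", "legislature": "organization",
    "state": "populated place", "city": "populated place",
}
LEAVES = sorted(set(PARENT) - set(PARENT.values()))

# Hypothetical statistics (not from the slides).
# D: fraction of documents per leaf concept, U: fraction of queries per leaf concept.
D = {"politician": 0.2, "artist": 0.1, "athlete": 0.1, "school": 0.2,
     "legislature": 0.1, "state": 0.1, "city": 0.2}
U = {"politician": 0.4, "artist": 0.2, "athlete": 0.0, "school": 0.2,
     "legislature": 0.0, "state": 0.1, "city": 0.1}


def is_under(leaf, concept):
    """True if `concept` is `leaf` itself or one of its ancestors."""
    node = leaf
    while node is not None and node != concept:
        node = PARENT.get(node)
    return node is not None


def freq(concept):
    """Frequency of `concept`, assumed to be the sum over the leaves below it."""
    return sum(D[leaf] for leaf in LEAVES if is_under(leaf, concept))


def queriability(design, pr=None):
    """Evaluate QU for a design; `pr` maps a concept to pr(P), default 1.0."""
    pr = pr or {}
    total = 0.0
    for leaf in LEAVES:
        node = leaf
        while node is not None and node not in design:
            node = PARENT.get(node)      # climb to the lowest ancestor in the design
        if node is None:                 # leaf belongs to the free set
            total += U[leaf] * D[leaf]
        elif freq(node) > 0:             # leaf belongs to the partition of `node`
            total += U[leaf] * D[leaf] * pr.get(node, 1.0) / freq(node)
    return total


print(round(queriability(set()), 4))       # 0.17
print(round(queriability({"person"}), 4))  # 0.32
```

Under these made-up numbers, annotating only person raises the score from 0.17 to 0.32, because the three person leaves are now divided by the frequency of person (0.4) rather than competing against the whole dataset.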

SLIDE 15

We have proposed an approximation algorithm called the “Level-wise Algorithm” (LW)

Find a design whose concepts are all from the same level of the input taxonomy.

Example taxonomy, by level:

  • Level 1: Infections
  • Level 2: Skin-Infections, Eye-Infections, Bone-Infections, …
  • Level 3: Trachoma, Hordeolum, Ecthyma, Erysipelas, Periostitis, Spondylitis, …

Find the design with maximum queriability for each level using the APM algorithm [Termehchy, SIGMOD’14]. APM returns a design with the largest queriability over a set of concepts.

$QU(\{\text{Infections}, \ldots\})$, $QU(\{\text{Eye-Infections}, \ldots\})$, $QU(\{\text{Trachoma}, \ldots\})$

$S_{level}$ ← a design with $\max\{QU, QU, QU, \ldots\}$

$S_{leaf}$ ← leaf concept with largest popularity ($u$)

Return a design with $\max\{QU(S_{level}), QU(S_{leaf})\}$
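The steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-level search uses exhaustive enumeration as a stand-in for APM (whose internals are not given here), pr(P) is taken to be 1, the cost is uniform, and the taxonomy and the statistics `D` and `U` are hypothetical.

```python
from itertools import combinations

# Child -> parent edges of a small example taxonomy (hypothetical shape).
PARENT = {
    "agent": "thing", "place": "thing",
    "person": "agent", "organization": "agent",
    "populated place": "place",
    "artist": "person", "politician": "person", "athlete": "person",
    "school": "organization", "legislature": "organization",
    "state": "populated place", "city": "populated place",
}
ALL = sorted(set(PARENT) | set(PARENT.values()))
LEAVES = sorted(set(PARENT) - set(PARENT.values()))

# Hypothetical document frequencies, query popularities and uniform costs.
D = {"politician": 0.2, "artist": 0.1, "athlete": 0.1, "school": 0.2,
     "legislature": 0.1, "state": 0.1, "city": 0.2}
U = {"politician": 0.4, "artist": 0.2, "athlete": 0.0, "school": 0.2,
     "legislature": 0.0, "state": 0.1, "city": 0.1}
COST = {c: 1 for c in ALL}


def depth(concept):
    """Number of edges between `concept` and the root."""
    n = 0
    while concept in PARENT:
        concept, n = PARENT[concept], n + 1
    return n


def owner(leaf, concepts):
    """Lowest ancestor-or-self of `leaf` that lies in `concepts`, or None."""
    node = leaf
    while node is not None and node not in concepts:
        node = PARENT.get(node)
    return node


def queriability(design):
    """Queriability with pr(P) = 1; inner frequencies are sums over leaves."""
    total = 0.0
    for leaf in LEAVES:
        p = owner(leaf, design)
        if p is None:
            total += U[leaf] * D[leaf]
        else:
            d_p = sum(D[l] for l in LEAVES if owner(l, {p}) == p)
            total += U[leaf] * D[leaf] / d_p
    return total


def best_on_level(concepts, budget):
    """Exhaustive stand-in for APM: best affordable subset of one level."""
    best, best_q = frozenset(), queriability(frozenset())
    for r in range(1, len(concepts) + 1):
        for combo in combinations(concepts, r):
            if sum(COST[c] for c in combo) <= budget:
                q = queriability(frozenset(combo))
                if q > best_q:
                    best, best_q = frozenset(combo), q
    return best, best_q


def level_wise(budget):
    """Pick the better of the best single-level design and the most popular leaf."""
    levels = {}
    for c in ALL:
        levels.setdefault(depth(c), []).append(c)
    candidates = [best_on_level(cs, budget) for cs in levels.values()]
    affordable = [leaf for leaf in LEAVES if COST[leaf] <= budget]
    if affordable:
        top = max(affordable, key=lambda leaf: U[leaf])
        candidates.append((frozenset({top}), queriability(frozenset({top}))))
    return max(candidates, key=lambda pair: pair[1])


design, score = level_wise(budget=2)
print(sorted(design), round(score, 4))  # ['artist', 'politician'] 0.67
```

The enumeration is exponential in the size of a level, so it only serves to show the control flow on a toy taxonomy.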

SLIDE 16

Level-wise algorithm has a bounded approximation ratio over a special case of the CECD problem

  • Sometimes it is easier to use and manage a conceptual design whose concepts are not subclasses or superclasses of each other.
  • We call such a design a disjoint design.
  • We may restrict the solutions of the CECD problem to disjoint designs.
  • We call this problem the disjoint CECD problem.

Theorem: The Level-wise algorithm is an $O(\log |C|)$-approximation for the disjoint CECD problem.

SLIDE 17

Experiment Settings

  • 8 tree taxonomies extracted from the YAGO ontology, T1-T8
  • Number of concepts between 10 and 400, with height of 2 to 9
  • 8 datasets of articles from the English Wikipedia collection
  • Bing (bing.com) query log whose relevant answers are Wikipedia articles
  • Effectiveness metric: precision at 3 ($p@3$)
  • Two cost models: uniform cost and random cost
SLIDE 18

Validation of Queriability Function

  • Oracle: enumerate all feasible designs and find the design with maximum precision at 3.
  • Queriability Maximization (QM): enumerate all feasible designs and find the design with maximum queriability.

Uniform cost, by budget B:

            B=0.1  B=0.2  B=0.3  B=0.4  …
T1 Oracle   0.149  0.168  0.177  0.192
T1 QM       0.149  0.168  0.177  0.192
T2 Oracle   0.241  0.303  0.318  0.320
T2 QM       0.232  0.285  0.315  0.318
T3 Oracle   0.222  0.281  0.304  0.306
T3 QM       0.210  0.269  0.304  0.304

B=1: enough budget to annotate all concepts.
Results for random cost are similar to the results for uniform cost.

SLIDE 19

Effectiveness of Level-Wise Algorithm

  • Compare LW with APM.

Uniform cost, by budget B:

          B=0.1  B=0.2  B=0.3  B=0.4  …
T1 APM    .089   .149   .164   .164
T1 LW     .103   .164   .164   .183
T2 APM    .234   .253   .292   .320
T2 LW     .232   .285   .316   .318
T3 APM    .208   .258   .288   .297
T3 LW     .210   .269   .297   .304
T4 APM    .158   .177   .191   .215
T4 LW     .179   .212   .231   .240
T5 APM    .178   .214   .228   .237
T5 LW     .206   .227   .242   .248
T6 APM    .229   .244   .247   .248
T6 LW     .240   .248   .248   .248
T7 APM    .243   .259   .260   .261
T7 LW     .254   .261   .261   .261
T8 APM    .250   .262   .262   .263
T8 LW     .259   .263   .263   .263

B=1: enough budget to annotate all concepts.
Results for random cost are similar to the results for uniform cost.

SLIDE 20

Efficiency of Level-Wise Algorithm

Level-Wise algorithm

  • Took 2 seconds over T4 and T5
  • Took 5 seconds over T6
  • Took 6 seconds over T7 and T8

APM algorithm

  • Took 2 seconds over T4, T5 and T6
  • Took 13 seconds over T7
  • Took 40 seconds over T8
SLIDE 21

Conclusion

  • We introduce the problem of cost-effective conceptual design over taxonomies.
  • We propose an efficient approximation algorithm (LW) for the problem.
  • Our empirical results show that LW is generally effective and scalable.
  • For future work, we are working on variations of the problem, including taxonomies that are directed acyclic graphs, multi-concept queries, and a cost-dependency model.

SLIDE 23

Unused Slides

SLIDE 24

Running time

  • APM algorithm is $O(|C| \log |C|)$
  • LW algorithm is $O(h\,|C| \log |C|)$

Since we perform APM for each level in the taxonomy, in fact, for a balanced tree, the running time of LW is $O(|C_1| \log |C_1| + \cdots + |C_h| \log |C_h|)$, which is smaller than $O(|C_1 \cup \cdots \cup C_h| \log |C_1 \cup \cdots \cup C_h|) = O(|C| \log |C|)$.
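The comparison in the last sentence can be checked in a few lines. The notation is an assumption, since the slide does not define it: $C_i$ is taken to be the set of concepts on level $i$, the levels are pairwise disjoint with union $C$, and $h$ is the height of the taxonomy.

```latex
\begin{align*}
\sum_{i=1}^{h} |C_i| \log |C_i|
  &\le \sum_{i=1}^{h} |C_i| \log |C|   && \text{since } |C_i| \le |C| \\
  &= \log |C| \sum_{i=1}^{h} |C_i| \\
  &= |C| \log |C|                      && \text{since the levels partition } C .
\end{align*}
```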

SLIDE 25

How to quantify the queriability of a design?

  • Effectiveness of returned answers is usually measured using precision (MRR, recall, ...).
  • Precision = #returned relevant answers / #returned answers
  • If a conceptual design helps improve ranking quality, it should replace non-relevant answers with relevant ones.
  • It also improves the precision (MRR, recall, …)
  • Queriability estimates the improvement in query answering using the amount by which a design increases the fraction of relevant answers in the returned answers.

SLIDE 26

Level-Wise Algorithm

SLIDE 27

Conceptual design $S$ helps answer queries whose concepts are in partitions of $S$.

Given a query $C$(terms) such that $C$ belongs to the partition $P$ in the design $S$.

Example taxonomy, by level: Infections; Skin-Infections, Eye-Infections, …; Trachoma, Hordeolum, Ecthyma, Erysipelas, …

Example: $C$ = “Trachoma”, $S$ = {Eye-Infections, …}

Let $d(C)$ be the frequency of documents with concept $C$.
Let $u(C)$ be the popularity of queries with concept $C$.

Fraction of documents about “Trachoma” within the set of documents annotated by “Eye-Infections” is

$$\frac{d(\text{Trachoma})}{d(\text{Eye-Infections})}.$$

Portion of queries about “Trachoma” is $u(\text{Trachoma})$. Hence, the improvement is

$$\frac{u(\text{Trachoma})\, d(\text{Trachoma})}{d(\text{Eye-Infections})}.$$

Total improvement from partition of “Eye-Infections” is

$$\frac{u(\text{Trachoma})\, d(\text{Trachoma})}{d(\text{Eye-Infections})} + \frac{u(\text{Hordeolum})\, d(\text{Hordeolum})}{d(\text{Eye-Infections})} + \cdots$$

For each partition $P$,

$$\sum_{c \in \mathrm{part}(P)} \frac{u(c)\, d(c)}{d(P)}.$$

Given $S$, the total improvement is

$$\sum_{P \in S} \sum_{c \in \mathrm{part}(P)} \frac{u(c)\, d(c)}{d(P)}$$

SLIDE 28

Validation of Queriability Function

  • Oracle: enumerate all feasible designs and find the design with maximum precision at 3.
  • Queriability Maximization (QM): enumerate all feasible designs and find the design with maximum queriability.

Uniform cost, by budget B:

            B=0.1  B=0.2  B=0.3  B=0.4  …
T1 Oracle   0.149  0.168  0.177  0.192
T1 QM       0.149  0.168  0.177  0.192
T2 Oracle   0.241  0.303  0.318  0.320
T2 QM       0.232  0.285  0.315  0.318
T3 Oracle   0.222  0.281  0.304  0.306
T3 QM       0.210  0.269  0.304  0.304

Random cost, by budget B:

            B=0.1  B=0.2  B=0.3  B=0.4  …
T1 Oracle   0.124  0.163  0.179  0.187
T1 QM       0.124  0.163  0.177  0.183
T2 Oracle   0.264  0.320  0.317  0.323
T2 QM       0.262  0.295  0.316  0.318
T3 Oracle   0.248  0.288  0.304  0.306
T3 QM       0.239  0.281  0.304  0.306

B=1: enough budget to annotate all concepts.

SLIDE 29

Effectiveness of Level-Wise Algorithm

  • Compare LW with APM.

Uniform cost, by budget B:

          B=0.1  B=0.2  B=0.3  B=0.4  …
T1 APM    .089   .149   .164   .164
T1 LW     .103   .164   .164   .183
T2 APM    .234   .253   .292   .320
T2 LW     .232   .285   .316   .318
T3 APM    .208   .258   .288   .297
T3 LW     .210   .269   .297   .304
T4 APM    .158   .177   .191   .215
T4 LW     .179   .212   .231   .240
T5 APM    .178   .214   .228   .237
T5 LW     .206   .227   .242   .248
T6 APM    .229   .244   .247   .248
T6 LW     .240   .248   .248   .248
T7 APM    .243   .259   .260   .261
T7 LW     .254   .261   .261   .261
T8 APM    .250   .262   .262   .263
T8 LW     .259   .263   .263   .263

Random cost, by budget B:

          B=0.1  B=0.2  B=0.3  B=0.4  …
T1 APM    .097   .111   .122   .164
T1 LW     .104   .145   .175   .185
T2 APM    .239   .263   .300   .321
T2 LW     .257   .291   .317   .320
T3 APM    .240   .263   .275   .294
T3 LW     .235   .283   .301   .305
T4 APM    .177   .180   .188   .212
T4 LW     .189   .212   .230   .239
T5 APM    .183   .216   .231   .239
T5 LW     .210   .230   .242   .248
T6 APM    .231   .245   .247   .248
T6 LW     .240   .248   .248   .248
T7 APM    .245   .259   .260   .261
T7 LW     .256   .261   .261   .261
T8 APM    .255   .262   .263   .263
T8 LW     .259   .263   .263   .263

B=1: enough budget to annotate all concepts.

SLIDE 30

<article id=1>

Granular conjunctivitis causes pain in the outer surface or cornea …

<article id=2>

Stye may lead to pain on the eyelids …

<article id=3>

GAS caused infections cause pain in tissues …