Cost-Effective Conceptual Design over Taxonomies
Yodsawalai Chodpathumwan
University of Illinois at Urbana-Champaign
Ali Vakilian
Massachusetts Institute of Technology
Arash Termehchy, Amir Nayyeri
Oregon State University
“John Adams, politician”
keyword query
ranked list <article id=1> <article id=2> <article id=3>
Only Article id 1 is about a politician.
precision = 1/3
Precision = (# returned relevant answers) / (# returned answers)
<article id=1>
John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located
Wikipedia article excerpts
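The precision figure for this keyword query can be checked with a minimal Python sketch. The function name `precision` and the convention of returning 0.0 for an empty result list are assumptions made for illustration; the article ids and the single relevant article come from the slide.

```python
def precision(returned_ids, relevant_ids):
    """Fraction of returned answers that are relevant."""
    if not returned_ids:
        return 0.0  # assumed convention for an empty result list
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in returned_ids if doc_id in relevant)
    return hits / len(returned_ids)


# Keyword query "John Adams, politician": all three articles are returned,
# but only article 1 is about a politician.
ranked_list = [1, 2, 3]
relevant = {1}
keyword_precision = precision(ranked_list, relevant)  # one relevant out of three
print(keyword_precision)
```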
We can annotate the dataset using concepts from a taxonomy.
DBpedia taxonomy (excerpt):
thing
  agent
    person
      artist
      politician
      athlete
    school
    legislature
  place
    populated place
      state
      city
<article id=1>
John Adams has been a former member of
Ohio House of Representative
from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located on the east side of Cleveland, Ohio, …
Wikipedia article excerpts
Taxonomy:
* Tree-shaped graph
* Vertex = concept
* Edge = subclass relation
artist politician city school state legislature
Politician(“John Adams”)
Structured keyword query
ranked list <article id=1>
<article id=1>
John Adams has been a former member of
Ohio House of Representative
from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located
Wikipedia article excerpts
precision = 1/1 = 1
city school state artist politician legislature
Ideally, we would like to annotate instances of all concepts in a given taxonomy from a dataset to answer all queries effectively. In reality, annotation consumes limited resources such as computational resources, so we can only annotate instances of some concepts.
DBpedia taxonomy (excerpt):
thing
  agent
    person
      artist
      politician
      athlete
    school
    legislature
  place
    populated place
      state
      city
<article id=1>
John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located on the east side of Cleveland, Ohio, …
Wikipedia article excerpts
person
person
Politician(“John Adams”)
Structured keyword query
ranked list
<article id=1>
John Adams has been a former member of
Ohio House of Representative
from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located
Wikipedia article excerpts
person person … person
politician … …
<article id=1> <article id=2>
precision = 1/2 > 1/3 (the precision over the unannotated dataset)
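The effect of annotating only "person" can be sketched in Python as follows. The annotation table and the hand-written mapping from the query concept "politician" to the annotated concept "person" are assumptions for illustration; they mirror the three article excerpts on the slide.

```python
# Annotations produced when only "person" is annotated: articles 1 and 2
# mention a person, article 3 (a school) receives no annotation.
annotations = {1: {"person"}, 2: {"person"}, 3: set()}

# "politician" itself is not annotated, so the query is answered with its
# lowest annotated ancestor, "person" (mapping written by hand here).
query_concept_to_annotation = {"politician": "person"}


def answer(query_concept, keyword_hits):
    """Keep only the keyword hits that carry the annotation used for the query."""
    label = query_concept_to_annotation.get(query_concept)
    if label is None:
        return list(keyword_hits)  # no usable annotation: plain keyword results
    return [doc for doc in keyword_hits if label in annotations[doc]]


returned = answer("politician", [1, 2, 3])
relevant = {1}  # only article 1 is about a politician
precision_with_person = len([d for d in returned if d in relevant]) / len(returned)
print(returned, precision_with_person)
```

Article 3 drops out because it is not annotated as a person, while the composer in article 2 still appears, which is why the precision rises to 1/2 rather than 1.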
Given a dataset and a query workload: I can only annotate a few concepts, and I want the largest average precision over these queries.
Taxonomy:
thing
  agent
    person
      artist
      politician
      athlete
    school
    legislature
  place
    populated place
      state
      city
Precision@k
Fixed budget Cost function
Let’s quantify the amount of improvement for precision@k: the Queriability of a design
Given a design $S$ over a taxonomy $X$, the partition of a concept $c \in S$, written $part(c)$, is the subset of leaf nodes in $X$ such that, for every concept $d \in part(c)$, the lowest ancestor of $d$ in $S$ is $c$, or $d = c$.
Each leaf concept in $X$ belongs to at most one partition of a design $S$.
The set of leaf concepts that do not belong to any partition of $S$ is called $free(S)$.
Taxonomy:
thing
  agent
    person
      artist
      politician
      athlete
    school
    legislature
  place
    populated place
      state
      city
Example: $S$ = {agent, person}
$part(\text{person})$ = {politician, athlete, artist}
$part(\text{agent})$ = {legislature, school}
$free(S)$ = {state, city}
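The partition of each design concept and the set of free leaves can be computed by walking up from every leaf, as in the following Python sketch. The `CHILDREN` map is an assumed reading of the taxonomy figure, and the function names are hypothetical.

```python
# Assumed reading of the taxonomy figure (parent -> children).
CHILDREN = {
    "thing": ["agent", "place"],
    "agent": ["person", "school", "legislature"],
    "person": ["artist", "politician", "athlete"],
    "place": ["populated place"],
    "populated place": ["state", "city"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}
LEAVES = [c for c in PARENT if c not in CHILDREN]


def lowest_design_ancestor(concept, design):
    """Walk up from `concept` (inclusive) to the first concept in the design."""
    node = concept
    while node is not None:
        if node in design:
            return node
        node = PARENT.get(node)
    return None


def partitions_and_free(design):
    """Return ({design concept: set of its leaves}, set of free leaves)."""
    partitions = {c: set() for c in design}
    free_leaves = set()
    for leaf in LEAVES:
        owner = lowest_design_ancestor(leaf, design)
        if owner is None:
            free_leaves.add(leaf)
        else:
            partitions[owner].add(leaf)
    return partitions, free_leaves


partitions, free_leaves = partitions_and_free({"agent", "person"})
print(partitions, free_leaves)
```

For the design {agent, person} this reproduces the example on the slide: the person leaves go to "person", school and legislature go to "agent", and state and city stay free.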
Example: $C$ = “politician”, $S$ = {person, …}
$d(c)$: frequency of documents of concept $c$ in the dataset
$u(c)$: popularity of concept $c$ in the query workload, e.g. school(…), politician(…), politician(…), artist(…)
Portion of queries about “politician” is $u(\text{politician})$.
Fraction of “politician” documents amongst “person” is $\frac{d(\text{politician})}{d(\text{person})}$.
Improvement is
$$u(\text{politician}) \, \frac{d(\text{politician})}{d(\text{person})}$$
In general, a concept $c$ in the partition of $P$ contributes $\frac{u(c)\, d(c)}{d(P)}$.
Total improvement from partition of “person” is
$$u(\text{politician}) \, \frac{d(\text{politician})}{d(\text{person})} + u(\text{artist}) \, \frac{d(\text{artist})}{d(\text{person})} + \cdots = \sum_{c \in part(\text{person})} \frac{u(c)\, d(c)}{d(\text{person})}$$
Generally, the concepts with more instances in the dataset are more likely to appear in the top answers. Thus, it is more likely they contain some relevant answers for the query. The total improvement by concepts in $free(S)$ is
$$\sum_{c \in free(S)} u(c)\, d(c)$$
where $d(c)$ is the portion of instances in the dataset that belong to $c$ and $u(c)$ is the portion of queries whose concept is $c$.
[Figure: dataset with documents annotated “person”; labels: answers, relevant answers]
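Budget constraint:
$$\sum_{c \in S} w(c) \le B$$
Queriability:
$$QU(S) = \sum_{P \in S} \sum_{c \in part(P)} \frac{u(c)\, d(c)\, pr(P)}{d(P)} + \sum_{c \in free(S)} u(c)\, d(c)$$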
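The queriability formula can be evaluated directly once the partitions and the free leaves of a design are known. The Python sketch below does this for the design {person}. All numbers are hypothetical, the frequency of "person" is assumed to be the sum of its leaves, and the per-partition factor in the first sum is assumed to be an annotator precision set to 1.0.

```python
def queriability(partitions, free_leaves, popularity, frequency, annot_precision):
    """Sum the per-partition improvement and the contribution of the free leaves."""
    total = 0.0
    for concept, leaves in partitions.items():
        for leaf in leaves:
            total += (popularity[leaf] * frequency[leaf]
                      * annot_precision[concept] / frequency[concept])
    total += sum(popularity[leaf] * frequency[leaf] for leaf in free_leaves)
    return total


# Hypothetical workload popularity and document frequency per leaf concept.
popularity = {"politician": 0.4, "artist": 0.2, "athlete": 0.1, "school": 0.1,
              "legislature": 0.05, "state": 0.05, "city": 0.1}
frequency = {"politician": 0.1, "artist": 0.2, "athlete": 0.1, "school": 0.2,
             "legislature": 0.05, "state": 0.05, "city": 0.3,
             "person": 0.4}  # assumed: sum over the person leaves
annot_precision = {"person": 1.0}  # assumed perfect annotator

all_leaves = ["politician", "artist", "athlete", "school",
              "legislature", "state", "city"]

# Empty design: every leaf is free.
no_design = queriability({}, all_leaves, popularity, frequency, annot_precision)

# Design {person}: its three leaves form one partition, the rest stay free.
with_person = queriability(
    {"person": ["politician", "artist", "athlete"]},
    ["school", "legislature", "state", "city"],
    popularity, frequency, annot_precision,
)
print(no_design, with_person)
```

With these made-up numbers the queriability rises from about 0.145 with no annotation to about 0.28 when "person" is annotated.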
Find a design whose concepts are all from the same level of the input taxonomy
Infections
  Skin-Infections
    Ecthyma
    Erysipelas
    …
  Eye-Infections
    Trachoma
    Hordeolum
    …
  Bone-Infections
    Periostitis
    Spondylitis
    …
  …
Find the design with maximum queriability for each level using the APM algorithm [Termehchy, SIGMOD’14]
APM returns a design with largest queriability over a set
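Per-level queriability: $QU(\{\text{Infections}, \dots\})$, $QU(\{\text{Eye-Infections}, \dots\})$, $QU(\{\text{Trachoma}, \dots\})$
$S_{level} \leftarrow$ a design with $\max\{QU, QU, QU, \dots\}$
$S_{leaf} \leftarrow$ leaf concept with largest popularity ($u$)
Return a design with $\max\{QU(S_{level}), QU(S_{leaf})\}$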
Theorem: The Level-wise algorithm is an $O(\log |C|)$-approximation for the disjoint CECD problem.
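The control flow of the Level-wise algorithm can be sketched in Python as below. This is not the authors' implementation: APM is replaced by an exhaustive search over the affordable subsets of one level, the queriability function is a toy additive score, and the affordability check on the leaf design is an added assumption. All gains, costs and popularities are hypothetical.

```python
from itertools import combinations


def best_design_on_level(level, cost, budget, qu):
    """Stand-in for APM: exhaustive search over affordable subsets of one level."""
    best, best_score = frozenset(), qu(frozenset())
    for size in range(1, len(level) + 1):
        for subset in combinations(level, size):
            if sum(cost[c] for c in subset) <= budget:
                score = qu(frozenset(subset))
                if score > best_score:
                    best, best_score = frozenset(subset), score
    return best, best_score


def level_wise(levels, leaves, cost, budget, popularity, qu):
    # Best single-level design across all levels.
    s_level, level_score = max(
        (best_design_on_level(level, cost, budget, qu) for level in levels),
        key=lambda pair: pair[1],
    )
    # Single leaf concept with the largest popularity (must be affordable: assumption).
    affordable = [c for c in leaves if cost[c] <= budget]
    if not affordable:
        return s_level
    s_leaf = frozenset([max(affordable, key=lambda c: popularity[c])])
    return s_leaf if qu(s_leaf) > level_score else s_level


levels = [
    ["Infections"],
    ["Skin-Infections", "Eye-Infections", "Bone-Infections"],
    ["Trachoma", "Hordeolum", "Ecthyma", "Erysipelas", "Periostitis", "Spondylitis"],
]
leaves = levels[-1]
# Hypothetical per-concept gains; a real run would use the queriability formula.
gain = {"Infections": 0.05, "Skin-Infections": 0.10, "Eye-Infections": 0.20,
        "Bone-Infections": 0.08, "Trachoma": 0.15, "Hordeolum": 0.04,
        "Ecthyma": 0.03, "Erysipelas": 0.03, "Periostitis": 0.03, "Spondylitis": 0.03}
cost = {c: 1 for c in gain}  # uniform cost
popularity = {"Trachoma": 0.4, "Hordeolum": 0.2, "Ecthyma": 0.1,
              "Erysipelas": 0.1, "Periostitis": 0.1, "Spondylitis": 0.1}


def toy_qu(design):
    return sum(gain[c] for c in design)


chosen = level_wise(levels, leaves, cost, 2, popularity, toy_qu)
print(sorted(chosen))
```

With a budget of two concepts and these toy gains, the middle level wins over both the root and the leaf level, and the single most popular leaf does not beat it.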
article.
precision at 3.
find design with maximum queriability.
Cost     B     T1              T2              T3
               Oracle  QM      Oracle  QM      Oracle  QM
Uniform  0.1   0.149   0.149   0.241   0.232   0.222   0.210
         0.2   0.168   0.168   0.303   0.285   0.281   0.269
         0.3   0.177   0.177   0.318   0.315   0.304   0.304
         0.4   0.192   0.192   0.320   0.318   0.306   0.304
         …
B = 1: enough budget to annotate all concepts.
Results for random cost are similar to the results for uniform cost.
Cost     B     T1          T2          T3          T4          T5          T6          T7          T8
               APM   LW    APM   LW    APM   LW    APM   LW    APM   LW    APM   LW    APM   LW    APM   LW
Uniform  0.1   .089  .103  .234  .232  .208  .210  .158  .179  .178  .206  .229  .240  .243  .254  .250  .259
         0.2   .149  .164  .253  .285  .258  .269  .177  .212  .214  .227  .244  .248  .259  .261  .262  .263
         0.3   .164  .164  .292  .316  .288  .297  .191  .231  .228  .242  .247  .248  .260  .261  .262  .263
         0.4   .164  .183  .320  .318  .297  .304  .215  .240  .237  .248  .248  .248  .261  .261  .263  .263
         …
B = 1: enough budget to annotate all concepts.
Results for random cost are similar to the results for uniform cost.
Since we run APM on each level of the taxonomy, for a balanced tree the running time of LW is $O(|C_1| \log |C_1| + \cdots + |C_h| \log |C_h|)$, where $C_i$ is the set of concepts at level $i$. This is smaller than $O(|C_1 \cup \cdots \cup C_h| \log |C_1 \cup \cdots \cup C_h|) = O(|C| \log |C|)$.
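The gap between the two bounds can be checked numerically for a balanced tree. The Python sketch below uses n log n as a cost model with constants ignored; the branching factor and height are hypothetical.

```python
import math


def level_sizes(branching, height):
    """Number of concepts on each level of a balanced tree, root first."""
    return [branching ** i for i in range(height)]


def per_level_cost(sizes):
    """Sum of n_i * log2(n_i) over the levels (one APM run per level)."""
    return sum(n * math.log2(n) for n in sizes if n > 0)


def whole_taxonomy_cost(sizes):
    """n * log2(n) for all concepts taken together."""
    n = sum(sizes)
    return n * math.log2(n)


sizes = level_sizes(branching=3, height=4)  # [1, 3, 9, 27]
per_level = per_level_cost(sizes)
whole = whole_taxonomy_cost(sizes)
print(sizes, per_level, whole)
```

The per-level sum stays below the whole-taxonomy figure because each level's logarithm is taken over that level alone rather than over all concepts.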
Consider a query $C$(terms) such that $C$ belongs to the partition $P$ in the design $S$.
Infections
  Skin-Infections
    Ecthyma
    Erysipelas
    …
  Eye-Infections
    Trachoma
    Hordeolum
    …
Example: $C$ = “Trachoma”, $S$ = {Eye-Infections, …}
Let $d(C)$ be the frequency of documents with concept $C$.
Let $u(C)$ be the popularity of queries with concept $C$.
The fraction of documents about “Trachoma” within the set of documents annotated by “Eye-Infections” is $\frac{d(\text{Trachoma})}{d(\text{Eye-Infections})}$.
The portion of queries about “Trachoma” is $u(\text{Trachoma})$. Hence, the improvement is
$$u(\text{Trachoma}) \, \frac{d(\text{Trachoma})}{d(\text{Eye-Infections})}$$
The total improvement from the partition of “Eye-Infections” is
$$u(\text{Trachoma}) \, \frac{d(\text{Trachoma})}{d(\text{Eye-Infections})} + u(\text{Hordeolum}) \, \frac{d(\text{Hordeolum})}{d(\text{Eye-Infections})} + \cdots$$
For each partition $P$, the total improvement is
$$\sum_{c \in part(P)} \frac{u(c)\, d(c)}{d(P)}$$
maximum precision at 3.
and find design with maximum queriability.
Cost     B     T1              T2              T3
               Oracle  QM      Oracle  QM      Oracle  QM
Uniform  0.1   0.149   0.149   0.241   0.232   0.222   0.210
         0.2   0.168   0.168   0.303   0.285   0.281   0.269
         0.3   0.177   0.177   0.318   0.315   0.304   0.304
         0.4   0.192   0.192   0.320   0.318   0.306   0.304
         …
Random   0.1   0.124   0.124   0.264   0.262   0.248   0.239
         0.2   0.163   0.163   0.320   0.295   0.288   0.281
         0.3   0.179   0.177   0.317   0.316   0.304   0.304
         0.4   0.187   0.183   0.323   0.318   0.306   0.306
         …
B = 1: enough budget to annotate all concepts.
Cost     B     T1          T2          T3          T4          T5          T6          T7          T8
               APM   LW    APM   LW    APM   LW    APM   LW    APM   LW    APM   LW    APM   LW    APM   LW
Uniform  0.1   .089  .103  .234  .232  .208  .210  .158  .179  .178  .206  .229  .240  .243  .254  .250  .259
         0.2   .149  .164  .253  .285  .258  .269  .177  .212  .214  .227  .244  .248  .259  .261  .262  .263
         0.3   .164  .164  .292  .316  .288  .297  .191  .231  .228  .242  .247  .248  .260  .261  .262  .263
         0.4   .164  .183  .320  .318  .297  .304  .215  .240  .237  .248  .248  .248  .261  .261  .263  .263
         …
Random   0.1   .097  .104  .239  .257  .240  .235  .177  .189  .183  .210  .231  .240  .245  .256  .255  .259
         0.2   .111  .145  .263  .291  .263  .283  .180  .212  .216  .230  .245  .248  .259  .261  .262  .263
         0.3   .122  .175  .300  .317  .275  .301  .188  .230  .231  .242  .247  .248  .260  .261  .263  .263
         0.4   .164  .185  .321  .320  .294  .305  .212  .239  .239  .248  .248  .248  .261  .261  .263  .263
         …
B = 1: enough budget to annotate all concepts.
<article id=1>
Granular conjunctivitis causes pain in the outer surface or cornea …
<article id=2>
Stye may lead to pain on the eyelids …
<article id=3>
GAS caused infections cause pain in tissues …