Chapter 12: Query Processing
You have to think anyway, so why not think big? -- Donald Trump
Computers are useless, they can only give you answers. -- Pablo Picasso
There are lies, damn lies, and workload assumptions. -- anonymous
IRDM WS 2015 12-1
Index lists
Data items: d1, …, dn; for each term ti a list of entries (d, s(ti,d)), e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
[figure: per-term index lists, e.g. d1: 0.7, d78: 0.9, d88: 0.2, …; d1: 0.2, d10: 0.6, d23: 0.6; d10: 0.7, d34: 0.1, d64: 0.4]
Index lists (docid-ordered):
a: d1: 1.0, d4: 2.0, d7: 0.2, d8: 0.1
b: d4: 1.0, d7: 2.0, d8: 0.2, d9: 0.1
c: d4: 3.0, d7: 1.0
result: d1: 1.0, d4: 6.0, d7: 3.2, d8: 0.3, d9: 0.1
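The accumulator merge can be sketched in Python (a minimal sketch; the list names a, b, c and summation as aggregation follow the example above):

```python
# Term-at-a-time (TAAT) evaluation: scan one index list after the other,
# maintaining one accumulator per document; aggregation is summation.
from collections import defaultdict

index_lists = {
    "a": [("d1", 1.0), ("d4", 2.0), ("d7", 0.2), ("d8", 0.1)],
    "b": [("d4", 1.0), ("d7", 2.0), ("d8", 0.2), ("d9", 0.1)],
    "c": [("d4", 3.0), ("d7", 1.0)],
}

def merge_accumulators(lists):
    acc = defaultdict(float)
    for postings in lists.values():
        for doc, score in postings:
            acc[doc] += score          # accumulate partial scores
    return dict(acc)

accumulators = merge_accumulators(index_lists)
# d4 accumulates 2.0 + 1.0 + 3.0 = 6.0; d7 accumulates 0.2 + 2.0 + 1.0 = 3.2
```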
Accumulators
[Broder et al. 2003]
[Broder et al. 2003]
(∑_{i=1,3} maxscore_i > min-k, ∑_{i=1,4} maxscore_i ≤ min-k)
⇒ the result cannot contain any docid ∈ [102, 599]
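The underlying pruning test can be sketched as follows (a minimal sketch with invented list names and maxscores; not the full two-level algorithm of Broder et al.): a document that can only occur in lists whose maxscores together do not exceed min-k can be skipped without any further accesses.

```python
# Max-score pruning test: bound a document's total score by the sum of the
# per-list upper bounds (maxscore_i) of the lists it can occur in; if that
# bound does not exceed the current min-k, the doc cannot enter the top-k.

maxscore = {"t1": 0.9, "t2": 0.6, "t3": 0.4}   # invented per-list upper bounds

def can_skip(lists_seen_in, min_k):
    upper_bound = sum(maxscore[t] for t in lists_seen_in)
    return upper_bound <= min_k

# with min-k = 1.2: a doc occurring only in t2 and t3 (bound 1.0) is skipped,
# a doc occurring in t1 and t2 (bound 1.5) must still be evaluated
```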
Index lists (docid-ordered):
a: d1: 1.0, d4: 2.0, d7: 0.2, d8: 0.1
b: d4: 1.0, d7: 2.0, d8: 0.2, d9: 0.1
c: d4: 3.0, d7: 1.0
Accumulators after each scanned posting (scan a, then b, then c):
a: d1 → 1.0; d4 → 2.0; d7 → 0.2; d8 → 0.1
b: d4 → 3.0; d7 → 2.2; d8 → 0.3; d9 → 0.1
c: d4 → 6.0; d7 → 3.2
final accumulators: d1: 1.0, d4: 6.0, d7: 3.2, d8: 0.3, d9: 0.1
Index lists
Data items: d1, …, dn; entries s(ti,d) per term ti, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
[figure: score-ordered index lists, e.g. L1: d78: 0.9, d1: 0.7, d88: 0.2, …; L2: d64: 0.8, d23: 0.6, d10: 0.6, …; L3: d10: 0.7, d78: 0.5, d64: 0.4, …]
[Zobel/Moffat 1996]
following R. Fagin: Optimal aggregation algorithms for middleware, JCSS 66(4), 2003
[Fagin 01, Güntzer 00, Nepal 99, Buckley 85]
Index lists
Data items: d1, …, dn    Query: q = (t1, t2, t3)
Score-ordered index lists (entries d: s(ti,d)):
L1: d78: 0.9, d23: 0.8, d10: 0.8, d88: 0.2, d34: 0.1
L2: d64: 0.9, d23: 0.6, d10: 0.6, d78: 0.1
L3: d10: 0.7, d78: 0.5, d64: 0.3
Threshold algorithm (TA):
scan index lists in round-robin; consider d at pos_i in L_i;
  high_i := s(t_i, d);
  if d ∉ top-k then {
    look up s_ν(d) in all lists L_ν with ν ≠ i;
    score(d) := aggr{s_ν(d) | ν = 1..m};
    if score(d) > min-k then
      add d to top-k and remove the min-score d';
    min-k := min{score(d') | d' ∈ top-k};
  }
  threshold := aggr{high_ν | ν = 1..m};
  if threshold ≤ min-k then exit;
simple & DB-style; needs only O(k) memory
Top-2 after each scan depth (Rank, Doc, Score):
scan depth 1: 1: d78 0.9, 2: d64 0.9
scan depth 2: 1: d78 1.5, 2: d64 1.2
scan depth 3: 1: d10 2.1, 2: d78 1.5
scan depth 4: 1: d10 2.1, 2: d78 1.5 → STOP!
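The TA loop can be sketched in Python (a minimal sketch with invented example lists; random accesses are simulated with per-list dictionaries, aggregation is summation):

```python
# Threshold Algorithm (TA): round-robin sorted access over score-ordered
# lists; each newly seen doc is completed by random accesses; stop when the
# threshold (sum of the current high_i) can no longer beat min-k.

lists = [
    [("d78", 0.9), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.9), ("d23", 0.6), ("d10", 0.6)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.3)],
]
random_access = [dict(l) for l in lists]   # simulated random-access lookups

def ta_topk(lists, ra, k):
    topk, seen = {}, set()
    for depth in range(max(len(l) for l in lists)):
        high = [l[depth][1] if depth < len(l) else 0.0 for l in lists]
        for l in lists:
            if depth >= len(l):
                continue
            doc = l[depth][0]
            if doc in seen:
                continue
            seen.add(doc)
            topk[doc] = sum(d.get(doc, 0.0) for d in ra)   # random accesses
            if len(topk) > k:                              # keep only k best
                topk.pop(min(topk, key=topk.get))
        min_k = min(topk.values()) if len(topk) == k else float("-inf")
        if sum(high) <= min_k:                             # threshold test
            break
    return sorted(topk.items(), key=lambda kv: -kv[1])

# top-2 here: d78 (0.9 + 0.5 = 1.4), then d10 (0.6 + 0.7 = 1.3)
```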
Index lists
Data items: d1, …, dn    Query: q = (t1, t2, t3)
Candidate bounds per scan depth (Doc [Worstscore, Bestscore]):
scan depth 1: d78 [0.9, 2.4], d64 [0.8, 2.4], d10 [0.7, 2.4]
scan depth 2: d78 [1.4, 2.0], d23 [1.4, 1.9], d64 [0.8, 2.1], d10 [0.7, 2.1]
scan depth 3: d10 [2.1, 2.1], d78 [1.4, 2.0], d23 [1.4, 1.8], d64 [1.2, 2.0]
Score-ordered index lists (entries d: s(ti,d)):
L1: d78: 0.9, d23: 0.8, d10: 0.8, d1: 0.7, d88: 0.2, d12: 0.2, …
L2: d64: 0.8, d23: 0.6, d10: 0.6, d99: 0.2, d78: 0.1, d34: 0.1, …
L3: d10: 0.7, d78: 0.5, d64: 0.4, … → STOP!
No-random-access algorithm (NRA):
scan index lists in round-robin; consider d at pos_i in L_i;
  E(d) := E(d) ∪ {i}; high_i := s(t_i, d);
  worstscore(d) := aggr{s(t_ν, d) | ν ∈ E(d)};
  bestscore(d) := aggr{worstscore(d), aggr{high_ν | ν ∉ E(d)}};
  if worstscore(d) > min-k then
    add d to top-k, min-k := min{worstscore(d') | d' ∈ top-k};
  else if bestscore(d) > min-k then cand := cand ∪ {d};
  threshold := max{bestscore(d') | d' ∈ cand};
  if threshold ≤ min-k then exit;
sequential accesses (SA) are faster than random accesses (RA) by a factor of 20 to 1000
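The NRA bookkeeping can be sketched in Python (a minimal sketch with invented example lists; summation as aggregation; for brevity the bounds are recomputed from scratch after each round-robin round):

```python
# No-Random-Access algorithm (NRA): sorted access only; per seen doc keep
# the set of lists it was seen in, a worstscore (sum of seen scores) and a
# bestscore (worstscore + high_i of the lists it has not been seen in).

lists = [
    [("d78", 0.9), ("d1", 0.7), ("d88", 0.2)],
    [("d64", 0.9), ("d23", 0.6), ("d10", 0.6)],
    [("d10", 0.7), ("d78", 0.5), ("d64", 0.3)],
]

def nra_topk(lists, k):
    m = len(lists)
    seen = {}                                  # doc -> {list index: score}
    high = [l[0][1] for l in lists]
    for depth in range(max(len(l) for l in lists)):
        for i, l in enumerate(lists):
            if depth < len(l):
                doc, s = l[depth]
                high[i] = s
                seen.setdefault(doc, {})[i] = s
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = worst[topk[-1]] if len(topk) == k else float("-inf")
        cands = [d for d in seen if d not in topk]
        if max((best[d] for d in cands), default=0.0) <= min_k \
           and sum(high) <= min_k:   # neither candidates nor unseen docs can win
            break
    return [(d, worst[d]) for d in topk]

# top-2 here: d78 (worstscore 1.4) and d10 (worstscore 1.3)
```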
if "wild guesses" are allowed, then no deterministic algorithm is instance-optimal
[Fagin et al. 2003]:
TA family of algorithms based on the invariant (with sum as aggr):
worstscore(d) = ∑_{i ∈ E(d)} s_i(d) ≤ score(d) ≤ ∑_{i ∈ E(d)} s_i(d) + ∑_{i ∉ E(d)} high_i = bestscore(d)
If worstscore(d) > min-k, d enters the current top-k; else if bestscore(d) > min-k, keep d in the priority queue (PQ); drop d from the priority queue once bestscore(d) ≤ min-k.
Often overly conservative (deep scans, high memory for the PQ).
Approximate top-k with probabilistic guarantees: discard candidates d from the queue if p(d) := P[∑_{i ∈ E(d)} s_i(d) + ∑_{i ∉ E(d)} S_i > δ] ≤ ε with δ = min-k; a score predictor estimates this probability via convolution with histograms or Poisson mixtures or …; then E[rel. precision@k] = 1 − ε
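A histogram-based score predictor can be sketched as follows (the per-list histograms are tiny invented examples; a real predictor would build them from the still-unscanned score ranges of each list):

```python
# Score-predictor sketch: approximate a candidate's missing score mass by
# convolving per-list histograms of the still-possible scores, then discard
# the candidate if P[worstscore + missing mass > delta] <= eps.

def convolve(histograms):
    """Distribution of the sum of independent discrete score histograms."""
    dist = {0.0: 1.0}
    for hist in histograms:            # hist: score value -> probability
        nxt = {}
        for v, p in dist.items():
            for dv, dp in hist.items():
                key = round(v + dv, 9)
                nxt[key] = nxt.get(key, 0.0) + p * dp
        dist = nxt
    return dist

def prob_exceeds(worstscore, histograms, delta):
    return sum(p for v, p in convolve(histograms).items()
               if worstscore + v > delta)

h2 = {0.0: 0.5, 0.2: 0.3, 0.4: 0.2}    # unseen list 2 (invented)
h3 = {0.0: 0.6, 0.3: 0.4}              # unseen list 3 (invented)
p = prob_exceeds(0.4, [h2, h3], delta=1.0)
# only the outcome 0.4 + 0.3 (probability 0.2 * 0.4 = 0.08) pushes the
# candidate's score past delta, so p = 0.08; with eps = 0.1 it is discarded
```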
[Fagin et al. 2003]
Index lists (entries doc: score):
L1: A: 0.8, B: 0.2, K: 0.19, F: 0.17, M: 0.16, Z: 0.15, W: 0.1, Q: 0.07
L2: G: 0.7, H: 0.5, R: 0.5, Y: 0.5, W: 0.3, D: 0.25, W: 0.2, A: 0.2
L3: Y: 0.9, A: 0.7, P: 0.3, F: 0.25, S: 0.25, T: 0.2, Q: 0.15, X: 0.1
batch of b = ∑_{i=1..m} b_i steps: choose the b_i values so as to achieve high score reduction, plus carefully chosen RAs: score lookups for "interesting" candidates
compute top-1 result using flexible SAs and RAs
candidates: A: [0.8, 2.4], G: [0.7, 2.4], Y: [0.9, 2.4], ?: [0.0, 2.4]
candidates: A: [1.5, 2.0], G: [0.7, 1.6], Y: [0.9, 1.6], ?: [0.0, 1.4]
candidates: A: [1.5, 2.0], G: [0.7, 1.2], Y: [1.4, 1.6]
candidates: A: [1.7, 2.0], Y: [1.4, 1.6]
execution costs: 9 SA + 1 RA
[Marian et al. 2004]
[Ilyas et al. 2008]
RestaurantsGuide
Name       | Type     | Price | City
BlueDragon | Chinese  | €15   | SB
Haiku      | Japanese | €30   | SB
Mahatma    | Indian   | €20   | IGB
Mescal     | Mexican  | €10   | IGB
BigSchwenk | German   | €25   | SLS

CinemasProgram
Theater   | Movie     | Rating | City
BlueSmoke | Tombstone | 7.5    | SB
Oscar's   | Hero      | 8.2    | SB
Holly's   | Die Hard  | 6.6    | SB
GoodNight | Seven     | 7.7    | IGB
BigHits   | Godfather | 9.1    | IGB
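A top-k join over the two relations can be sketched as follows. The combined scoring function (movie rating minus price in €10 units) is invented for illustration, and this naive version materializes all join pairs; a real rank-join operator (e.g. HRJN) would consume the inputs in score order and stop early:

```python
# Naive top-k join: equi-join RestaurantsGuide and CinemasProgram on City,
# rank the joined pairs by an (invented) combined score.

restaurants = [  # (name, type, price_eur, city)
    ("BlueDragon", "Chinese", 15, "SB"), ("Haiku", "Japanese", 30, "SB"),
    ("Mahatma", "Indian", 20, "IGB"), ("Mescal", "Mexican", 10, "IGB"),
    ("BigSchwenk", "German", 25, "SLS"),
]
cinemas = [  # (theater, movie, rating, city)
    ("BlueSmoke", "Tombstone", 7.5, "SB"), ("Oscar's", "Hero", 8.2, "SB"),
    ("Holly's", "Die Hard", 6.6, "SB"), ("GoodNight", "Seven", 7.7, "IGB"),
    ("BigHits", "Godfather", 9.1, "IGB"),
]

def top_join(restaurants, cinemas, k):
    pairs = [(cinema[2] - rest[2] / 10.0, rest[0], cinema[0])
             for rest in restaurants for cinema in cinemas
             if rest[3] == cinema[3]]          # equi-join on City
    return sorted(pairs, reverse=True)[:k]

# best pair: Mescal (€10, IGB) with BigHits showing Godfather (rating 9.1)
```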
"Star Wars Episode 7", "The Force Awakens", "Obi Wan Kenobi", "dark lord", "Wir schaffen das", "to be or not to be", "roots of cubic polynomials", "evil empire"
Inverted Index (term, doc, score):
...
empire 77 0.85
empire 39 0.82
...
evil   49 0.81
evil   39 0.78
evil   12 0.75
...
evil   77 0.12
...

Position Index (term, doc, offset):
...
empire 39 191
empire 77 375
...
evil   12 45
evil   39 190
evil   39 194
evil   49 190
...
evil   77 190
...
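Phrase matching over such a position index can be sketched as follows (data taken from the table above):

```python
# Evaluate the phrase query "evil empire": a document qualifies iff it
# contains "evil" at some offset o and "empire" at offset o + 1.

positions = {                      # (term, doc) -> offsets
    ("empire", 39): [191], ("empire", 77): [375],
    ("evil", 12): [45], ("evil", 39): [190, 194],
    ("evil", 49): [190], ("evil", 77): [190],
}

def phrase_docs(first, second, positions):
    result = set()
    for (term, doc), offsets in positions.items():
        if term != first:
            continue
        next_offsets = positions.get((second, doc), [])
        if any(o + 1 in next_offsets for o in offsets):
            result.add(doc)        # adjacent occurrence found
    return result

# doc 39 qualifies (evil@190, empire@191); doc 77 does not (190 vs. 375)
```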
Example query (from TREC benchmark): traffic tunnel disasters (each original term with weight 1.0)
traffic → transit: 0.9, highway: 0.8, car: 0.7, truck: 0.6, metro: 0.6, train: 0.5, "rail car": 0.1, …
tunnel → tube: 1.0, underground: 0.9, "Mont Blanc": 0.7, …
disasters → catastrophe: 0.6, accident: 0.6, fire: 0.5, flood: 0.9, earthquake: 0.8, "land slide": 0.7, …
maps each word to its word senses (synsets, concepts); ca. 200,000 concepts and lexical relations; can be cast into a graph with weights for concept-concept relatedness strength
http://wordnet.princeton.edu
meta-index (ontology / thesaurus):
professor → lecturer: 0.7, scholar: 0.6, academic: 0.53, scientist: 0.5, ...
index on terms (entries doc: score):
lecturer → 37: 0.9, 44: 0.8, ..., 22: 0.7, 23: 0.6, 51: 0.6, 52: 0.6
scholar → 92: 0.9, 67: 0.9, ..., 52: 0.9, 44: 0.8, 55: 0.8
research → 57: 0.6, 44: 0.4, ...
professor → 52: 0.4, 33: 0.3, 75: 0.3, ...
(further lists: 12: 0.9, 14: 0.8, ..., 28: 0.6, 17: 0.55, 61: 0.5, 44: 0.5, 44: 0.4)
[figure: WordNet neighborhood of "professor": scholar (Hyponym, 0.749), lecturer (Related, 0.48); further neighbors: academic / academician / faculty member, scientist, researcher, teacher, mentor, intellectual, investigator, director, artist, magician, wizard, alchemist, primadonna]
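Scoring with expansion terms can be sketched as follows (sim weights and postings loosely follow the figure; taking, for each query term, the best sim-weighted score over its expansion set is one common aggregation choice):

```python
# Expanded query scoring: for each query term take the maximum of
# sim(term, expansion) * s(expansion, doc) over its expansion set.

expansions = {   # query term -> [(expansion term, sim weight)]
    "professor": [("professor", 1.0), ("lecturer", 0.7), ("scholar", 0.6)],
    "research":  [("research", 1.0)],
}
postings = {     # term -> {doc: s(term, doc)}
    "professor": {52: 0.4, 33: 0.3, 75: 0.3},
    "lecturer":  {37: 0.9, 44: 0.8, 52: 0.6},
    "scholar":   {92: 0.9, 67: 0.9, 52: 0.9, 44: 0.8},
    "research":  {57: 0.6, 44: 0.4},
}

def expanded_score(doc, query):
    return sum(
        max((sim * postings.get(term, {}).get(doc, 0.0)
             for term, sim in expansions[q]), default=0.0)
        for q in query)

# doc 44: best for "professor" is lecturer (0.7 * 0.8 = 0.56), plus 0.4
# from "research" -> 0.96
```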
[M. Theobald et al.: SIGIR 2005]
the activity, and, if possible, collaborating organizations and the countries involved.
~META[1.00|1.00][{gangdom[1.00|1.00], gangland[0.742|1.00], "organ[0.213|1.00] & crime[0.312|1.00]", camorra[0.254|1.00], maffia[0.318|1.00], mafia[0.154|1.00], "sicilian[0.201|1.00] & mafia[0.154|1.00]", "black[0.066|1.00] & hand[0.053|1.00]", mob[0.123|1.00], syndicate[0.093|1.00]}],
columbian[0.686|0.20], cartel[0.466|0.20], ...}}
135530 sorted accesses in 11.073s.
1. Interpol Chief on Fight Against Narcotics
2. Economic Counterintelligence Tasks Viewed
3. Dresden Conference Views Growth of Organized Crime in Europe
4. Report on Drug, Weapons Seizures in Southwest Border Region
5. SWITZERLAND CALLED SOFT ON CRIME
Dice coefficient: 2 |{docs with c1} ∩ {docs with c2}| / (|{docs with c1}| + |{docs with c2}|)
Jaccard coefficient: |{docs with c1} ∩ {docs with c2}| / |{docs with c1} ∪ {docs with c2}|
Conditional probability: P[doc has c1 | doc has c2]
PMI (Pointwise Mutual Information): log( freq(c1 and c2) / (freq(c1) · freq(c2)) )
Wu-Palmer distance: |path(c1, lca(c1,c2))| + |path(c2, lca(c1,c2))| with lowest common ancestor lca(c1,c2) in the DAG; variants with edge weights based on edge type (hyponym, hypernym, …)
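The set-based measures can be computed directly (the document sets and corpus size below are invented; the Wu-Palmer distance additionally needs the concept DAG and is omitted here):

```python
# Dice, Jaccard, conditional probability and PMI for two concepts c1, c2,
# computed from the sets of documents containing them.
import math

docs_c1 = {1, 2, 3, 4}     # docs with c1 (invented)
docs_c2 = {3, 4, 5}        # docs with c2 (invented)
n_docs = 10                # corpus size, needed for PMI

inter = len(docs_c1 & docs_c2)
dice = 2 * inter / (len(docs_c1) + len(docs_c2))
jaccard = inter / len(docs_c1 | docs_c2)
cond_prob = inter / len(docs_c2)                      # P[c1 | c2]
pmi = math.log((inter / n_docs) /
               ((len(docs_c1) / n_docs) * (len(docs_c2) / n_docs)))
# dice = 4/7, jaccard = 0.4, cond_prob = 2/3, pmi = log(5/3)
```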
relative frequency of d ∈ D being clicked on when word w appears in the query q
(examples: "apple farm", "mpi research", "darwin expedition", "Star Wars 7: The Force Awakens", "physics nobel prize", …)
β ∑_j sim(s_j, r) − (1 − β) ∑_{j≠k} sim(s_j, s_k)
(trained on query-click logs, Wikipedia disambiguation pages, etc.):
∑_{i=1..m} P[t_i | q] · (1 − ∏_{r ∈ R} (1 − P[r | t_i])):
probability that at least one r is clicked given intention t_i for q
β ∑_{j=1..l} sim(s_j, r) − (1 − β) max_{j≠k} sim(s_j, s_k) = max!
[Carbonell/Goldstein 1998]
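The Maximal Marginal Relevance re-ranking of Carbonell/Goldstein can be sketched as follows (relevance and similarity values are invented):

```python
# Maximal Marginal Relevance (MMR): greedily select the result that best
# trades off relevance to the query against similarity to already selected
# results: argmax_d  lam * rel(d) - (1 - lam) * max_s sim(d, s).

def mmr(candidates, rel, sim, lam, k):
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda d: lam * rel[d] -
                   (1 - lam) * max((sim[frozenset((d, s))] for s in selected),
                                   default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

rel = {"a": 0.9, "b": 0.8, "c": 0.5}                 # invented relevance
sim = {frozenset(("a", "b")): 0.9,                   # a, b: near-duplicates
       frozenset(("a", "c")): 0.1,
       frozenset(("b", "c")): 0.2}

ranking = mmr(["a", "b", "c"], rel, sim, lam=0.5, k=2)
# picks "a" first, then "c": the near-duplicate "b" is penalized
```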
A. Broder, D. Carmel, M. Herscovici, A. Soffer, J. Zien: Efficient query evaluation using a two-level retrieval process, CIKM 2003
R. Fagin, A. Lotem, M. Naor: Optimal aggregation algorithms for middleware, Journal of Computer and System Sciences 66(4), 2003
I. F. Ilyas, G. Beskales, M. A. Soliman: A Survey of Top-k Query Processing Techniques in Relational Database Systems, ACM Comp. Surveys 40(4), 2008
A. Marian, N. Bruno, L. Gravano: Evaluating Top-k Queries over Web-Accessible Databases, ACM TODS 2004
A. Broschart, R. Schenkel: High-performance processing of text queries with tunable pruned term and term pair indexes, ACM TOIS 2012
B. Billerbeck, F. Scholer, H. E. Williams, J. Zobel: Query expansion using associated queries, CIKM 2003
S. Liu, F. Liu, C. Yu, W. Meng: An effective approach to document retrieval via utilizing WordNet and recognizing phrases, SIGIR 2004
M. Theobald, R. Schenkel, G. Weikum: Efficient and Self-tuning Incremental Query Expansion for Top-k Query Processing, SIGIR 2005
H. Bast, I. Weber: Type Less, Find More: Fast Autocompletion Search with a Succinct Index, SIGIR 2006
PVLDB 2008
query reformulations in Web search, TOIS 2007
R. Agrawal, S. Gollapudi, A. Halverson, S. Ieong: Diversifying Search Results, WSDM 2009