App # 1: Relation Search
- Schema statistics can help improve both:
- Relation Recovery (Metadata detection)
- Ranking
- By computing a schema coherency score S(R) for relation R and adding it to the feature vector
- Measures how well a schema “hangs together”
- High: {make, model}
- Low: {make, zipcode}
- Computed as the average pairwise Pointwise Mutual Information (PMI) score over all attributes in the schema
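A minimal sketch of the coherency score, assuming the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))) and assuming probabilities are estimated as the fraction of schemas in a corpus that contain the given attributes. The toy corpus `SCHEMAS` and the helper names `prob`, `pmi`, and `coherency` are hypothetical and only serve the illustration.

```python
import math
from itertools import combinations

# Hypothetical toy corpus: each entry is the attribute set of one extracted table.
SCHEMAS = [
    {"make", "model", "year"},
    {"make", "model", "price"},
    {"make", "model"},
    {"name", "zipcode"},
    {"name", "address", "zipcode"},
    {"make", "zipcode"},
]


def prob(attrs, schemas):
    """Fraction of schemas that contain every attribute in attrs."""
    attrs = set(attrs)
    return sum(attrs <= s for s in schemas) / len(schemas)


def pmi(a, b, schemas):
    """Pointwise mutual information of two attributes (assumed standard form)."""
    p_ab = prob({a, b}, schemas)
    if p_ab == 0:
        # The pair never co-occurs; this sketch treats it as maximally incoherent.
        return float("-inf")
    return math.log(p_ab / (prob({a}, schemas) * prob({b}, schemas)))


def coherency(schema, schemas):
    """Average PMI over all unordered attribute pairs of the schema."""
    pairs = list(combinations(sorted(schema), 2))
    if not pairs:
        return 0.0
    return sum(pmi(a, b, schemas) for a, b in pairs) / len(pairs)


high = coherency({"make", "model"}, SCHEMAS)
low = coherency({"make", "zipcode"}, SCHEMAS)
print(high > low)
```

On this toy corpus, {make, model} co-occur more often than chance and {make, zipcode} less often, so the first schema scores higher. A real system would likely smooth zero counts rather than return negative infinity.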
App # 1: Experiments
- Metadata detection, when adding schema stats scoring
- Precision 0.79 ⇒ 0.89
- Recall 0.84 ⇒ 0.85
- Ranking: compared 4 rankers on test set
- Naïve: Top-10 pages from google.com
- Filter: Top-10 good tables from google.com
- Rank: Trained ranker
- Rank-Stats: Trained ranker with coherency score
- What fraction of top-k are relevant?
k    Naïve   Filter        Rank          Rank-Stats
10   0.26    0.35 (34%)    0.43 (65%)    0.47 (80%)
20   0.33    0.47 (42%)    0.56 (70%)    0.59 (79%)
30   0.34    0.59 (74%)    0.66 (94%)    0.68 (100%)
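The slide does not say what the parenthesized percentages mean. Reading the results as columns k, Naïve, Filter, Rank, Rank-Stats, they appear consistent with relative improvement over the Naïve ranker. The sketch below checks that reading; the interpretation and the names `NAIVE`, `REPORTED`, `relative_improvement`, and `max_discrepancy` are assumptions for illustration.

```python
# Fraction of relevant results in the top k for the Naïve ranker.
NAIVE = {10: 0.26, 20: 0.33, 30: 0.34}

# (fraction relevant, reported percentage) for the other rankers.
REPORTED = {
    10: {"Filter": (0.35, 34), "Rank": (0.43, 65), "Rank-Stats": (0.47, 80)},
    20: {"Filter": (0.47, 42), "Rank": (0.56, 70), "Rank-Stats": (0.59, 79)},
    30: {"Filter": (0.59, 74), "Rank": (0.66, 94), "Rank-Stats": (0.68, 100)},
}


def relative_improvement(score, baseline):
    """Improvement of score over baseline, in percent."""
    return 100.0 * (score - baseline) / baseline


def max_discrepancy():
    """Largest gap, in percentage points, between computed and reported values."""
    return max(
        abs(relative_improvement(score, NAIVE[k]) - pct)
        for k, row in REPORTED.items()
        for score, pct in row.values()
    )


print(max_discrepancy() < 1.0)
```

The computed values agree with the reported ones to within one percentage point. The small gaps would be expected if the fractions shown on the slide are themselves rounded.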
App # 2: Schema Autocomplete
Input: topic attribute (e.g., make)
Output: relevant schema
{make, model, year, price}
“tab-complete” for your database
For input set I, output S, threshold t
while p(S-I | I) > t
    newAttr = max p(newAttr, S-I | I)
    S = S ∪ newAttr
    emit newAttr
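The greedy loop above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes p(X | I) is estimated as the share of corpus schemas containing I that also contain X, and it tests the threshold on the best candidate before adding it. The corpus `SCHEMAS` and the names `cond_prob` and `autocomplete` are hypothetical.

```python
# Hypothetical toy corpus of table schemas.
SCHEMAS = [
    {"make", "model", "year", "price"},
    {"make", "model", "year"},
    {"make", "model", "price"},
    {"make", "model"},
    {"make", "zipcode"},
    {"name", "address"},
]


def cond_prob(attrs, given, schemas):
    """Estimate p(attrs | given) as a ratio of schema counts."""
    given = set(given)
    attrs = set(attrs)
    support = [s for s in schemas if given <= s]
    if not support:
        return 0.0
    return sum(attrs <= s for s in support) / len(support)


def autocomplete(input_attrs, schemas, threshold):
    """Greedily extend the input set I; return attributes in emission order."""
    I = set(input_attrs)
    S = set(I)
    emitted = []
    vocabulary = set().union(*schemas)
    while True:
        candidates = sorted(vocabulary - S)  # sorted for deterministic ties
        if not candidates:
            break

        def score(attr):
            # p(newAttr, S-I | I): the candidate together with everything added so far
            return cond_prob((S - I) | {attr}, I, schemas)

        best = max(candidates, key=score)
        if score(best) <= threshold:
            break  # no extension remains likely enough given the input
        S.add(best)
        emitted.append(best)
    return emitted


print(autocomplete({"make"}, SCHEMAS, threshold=0.3))
```

Lowering the threshold yields longer schemas, since less probable extensions are still accepted.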
App # 2: Schema Autocomplete
Input        Output
name         name, size, last-modified, type
instructor   instructor, time, title, days, room, course
elected      elected, party, district, incumbent, status, …
ab           ab, h, r, bb, so, rbi, avg, lob, hr, pos, batters
sqft         sqft, price, baths, beds, year, type, lot-sqft, …
App # 2: Experiments
Asked experts for schemas in 10 areas
What was the autocompleter’s recall?
App # 3: Synonym Discovery
- Input: topic attribute (e.g., address)
- Output: relevant synonym pairs (telephone = tel-#)
- Used for schema matching
[VLDB01, “Generic Schema Matching…”, Madhavan et al]
- Linguistic thesauri are incomplete; hand-made
thesauri are burdensome
- For attributes a, b and input domain C,
when p(a,b) = 0