Supervised Learning for Linking Named Entities to KB Entries


Mar 07, 2024


SLIDE 1

Supervised Learning for Linking Named Entities to KB Entries

SLIDE 2

Information Extraction

Unstructured Data, Semi-Structured Data, Structured Data

SLIDE 3

Bush

ID: 1, ID: 9, ID: 23, ID: 55, ID: NIL

SLIDE 4

Problem Definition:
Given a name (query) and a background document, provide the ID of the KB entry to which the name refers, or NIL if there is no such entry. Also, cluster NIL queries referring to the same entities.

Our Goals:
Develop a baseline system based on supervised learning principles and simple-to-compute features;
Study the importance of different features and learning algorithms.

SLIDE 5

Query → Query Expansion → Candidate Generator → Candidate Ranking → Candidate Validation → Nil Resolution → Referent

Knowledge Base

SLIDE 6

Regular expressions for acronym queries
The American Broadcast Company (ABC) is an …
Apple (AAPL) sold 1.7 million in the first weekend …
NEW YORK (CNN) -- Finance ministers from …
The US (United States of America) are currently …

Named entities containing the query
As president, Barack Obama signed an economic stimulus …
The United States Secretary of State is the head of the …
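The two expansion strategies on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `max_extra_words` window, and the rule that a parenthetical is accepted when its initials start with the acronym are all assumptions, and the entity mentions are assumed to come from some external NER tagger.

```python
import re


def initials(phrase):
    """Initial letters of the capitalised words in a phrase."""
    return "".join(w[0] for w in phrase.split() if w[0].isupper())


def expand_acronym(acronym, text, max_extra_words=3):
    """Return long forms found next to `acronym` in `text` (hypothetical helper)."""
    found = []
    # Pattern A: "Long Form (ACRONYM)". Walk back over the preceding words and
    # keep the shortest span whose capitalised initials spell the acronym.
    for m in re.finditer(r"\(" + re.escape(acronym) + r"\)", text):
        words = text[:m.start()].split()
        longest = min(len(words), len(acronym) + max_extra_words)
        for k in range(len(acronym), longest + 1):
            span = words[-k:]
            if span[0][0].isupper() and initials(" ".join(span)) == acronym:
                found.append(" ".join(span))
                break
    # Pattern B: "ACRONYM (Long Form)". Assumption: accept the parenthetical
    # when its initials start with the acronym ("US" vs "United States of America").
    for m in re.finditer(r"\b" + re.escape(acronym) + r"\s*\(([^)]+)\)", text):
        phrase = m.group(1).strip()
        if initials(phrase).startswith(acronym):
            found.append(phrase)
    return found


def containing_entities(query, entity_mentions):
    """Entity mentions (assumed to come from an NER tagger) that contain the query."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [e for e in entity_mentions
            if e.lower() != query.lower() and pattern.search(e)]
```

With the slide's examples, "ABC" and "US" receive a long form, while "AAPL" and "CNN" do not, because the text around the parentheses does not spell them out.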

SLIDE 7

Candidates selected based on the n-gram similarity between query and KB entry name, with n = [1,4]

KB entries expanded with alternative names taken from:
Wikipedia’s redirect pages
Wikipedia’s disambiguation pages
Wikipedia’s anchors
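A candidate generator along these lines might look like the sketch below. The slide does not say whether the n-grams are over characters or words, nor which similarity measure or cut-off is used, so character n-grams, Jaccard overlap, the `threshold` and `top_k` parameters, and the `kb_names` layout are all assumptions made for illustration.

```python
def char_ngrams(text, n_min=1, n_max=4):
    """Set of character n-grams of a lower-cased string, for n in [n_min, n_max]."""
    text = text.lower()
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}


def ngram_similarity(a, b):
    """Jaccard overlap between the n-gram sets of two names (assumed measure)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def generate_candidates(query, kb_names, threshold=0.5, top_k=10):
    """Return (entry_id, score) pairs, best first.

    kb_names maps an entry ID to its title plus alternative names
    (e.g. collected from redirects, disambiguation pages and anchors).
    """
    scored = []
    for entry_id, names in kb_names.items():
        # An entry is as similar as its best-matching name.
        best = max((ngram_similarity(query, name) for name in names), default=0.0)
        if best >= threshold:
            scored.append((entry_id, best))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]
```

Scoring an entry by its best-matching alternative name is what lets a short query such as "Bush" reach an entry whose title is a longer full name.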


SLIDE 8

Learning to Rank (L2R) approach

SLIDE 9

Considered features

Popularity: text length, # alternative names
Text similarity: e.g., TF-IDF cosine similarity
Topic similarity: e.g., LDA cosine similarity
Named entities similarity: e.g., type-match, common entities
String similarity: e.g., Levenshtein distance, exact-match
Page type: e.g., web, newswire

40+ ranking features
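Three of the listed features (exact match, Levenshtein distance, TF-IDF cosine similarity) can be computed with the standard library alone, as in the sketch below. The tokenisation, the smoothed IDF formula, and the `pair_features` interface are assumptions; the slides do not describe how the actual 40+ features were implemented.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF (an assumed variant) so that no term gets weight zero.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def pair_features(query, kb_name, query_vec, kb_vec):
    """Feature dict for one (query, KB entry) pair (hypothetical interface)."""
    q, k = query.lower(), kb_name.lower()
    return {
        "exact_match": float(q == k),
        "levenshtein": levenshtein(q, k),
        "tfidf_cosine": cosine(query_vec, kb_vec),
    }
```

Here `query_vec` would be built from the background document and `kb_vec` from the KB entry text, both weighted against the same collection so that the IDF values are comparable.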

SLIDE 10

Considered L2R algorithms

Coordinate Ascent
ListNet
AdaRank
Ranking Perceptron
SVMrank

We also experimented with models trained specifically for the estimated query type.
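To make one of the listed algorithms concrete, here is a minimal pairwise ranking perceptron. It is only one common formulation and is not claimed to be the variant used in this work; the data layout (a list of candidate feature vectors plus the index of the correct candidate) and the function names are assumptions.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_ranking_perceptron(queries, n_features, epochs=10):
    """Learn a weight vector that scores the correct candidate above the others.

    queries: list of (candidates, gold_index), where candidates is a list of
    feature vectors (one per KB candidate) and gold_index marks the correct one.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for candidates, gold in queries:
            pos = candidates[gold]
            for i, neg in enumerate(candidates):
                # Update only when a wrong candidate ties with or beats the gold one.
                if i != gold and dot(w, pos) <= dot(w, neg):
                    w = [wi + p - n for wi, p, n in zip(w, pos, neg)]
    return w


def rank(w, candidates):
    """Candidate indices ordered from highest to lowest score."""
    return sorted(range(len(candidates)), key=lambda i: -dot(w, candidates[i]))
```

A query-type-specific setup, as mentioned on the slide, would amount to training one such weight vector per estimated query type instead of a single shared one.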

SLIDE 11

Supervised Learning approach

Algorithms
SVM (RBF kernel)
Random Forest

Query-specific models

Nil-only features
Ranking score
Ranking score statistics: e.g., mean, standard deviation
Ranking score test for outliers: e.g., Dixon’s Q test, Grubbs’ test
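The NIL-only features can be illustrated with a short sketch that turns the ranking scores of one query's candidates into a feature dict. The intuition is that a top candidate whose score stands out from the rest is more likely to be a correct link. The function name and output keys are assumptions, and only the test statistics are computed (Dixon's Q for the largest value, and the one-sided Grubbs statistic), not the significance decisions.

```python
import statistics


def nil_features(scores):
    """Features describing how much the top ranking score stands out.

    scores: ranking scores of all candidates retrieved for one query.
    """
    if not scores:
        raise ValueError("a query without candidates has no ranking scores")
    ordered = sorted(scores, reverse=True)
    top = ordered[0]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    score_range = ordered[0] - ordered[-1]
    # Dixon's Q for the largest value: gap to the runner-up over the full range.
    dixon_q = (ordered[0] - ordered[1]) / score_range if score_range > 0 else 0.0
    # One-sided Grubbs statistic: distance of the top score from the mean,
    # in units of the sample standard deviation.
    grubbs_g = (top - mean) / std if std > 0 else 0.0
    return {"top_score": top, "mean": mean, "std": std,
            "dixon_q": dixon_q, "grubbs_g": grubbs_g}
```

Such a dict could then be fed, together with the ranking features of the top candidate, to a binary classifier like the ones listed on the slide, which decides whether to accept the candidate or answer NIL.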


SLIDE 12

Step 1: Find Pairs → Compute Pairwise Features → Assign Labels → Train Classifier

SLIDE 13

Step 2: Find Pairs → Compute Pairwise Features → Apply Classifier → Build Query Graph → Compute Transitive Closure
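The last two boxes of Step 2 can be sketched with a union-find structure, assuming the two steps describe the clustering of NIL queries and that the pairwise classifier has already produced a list of query pairs judged to refer to the same entity. Taking the transitive closure of that graph is equivalent to finding its connected components. The function name and input format are illustrative assumptions.

```python
def cluster_nil_queries(query_ids, linked_pairs):
    """Group queries into clusters via the transitive closure of pairwise links.

    query_ids: all NIL query IDs.
    linked_pairs: (a, b) pairs that a pairwise classifier labelled "same entity".
    """
    parent = {q: q for q in query_ids}

    def find(q):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for q in query_ids:
        clusters.setdefault(find(q), []).append(q)
    return sorted(sorted(members) for members in clusters.values())
```

For example, links (q1, q2) and (q2, q3) place q1, q2 and q3 in one cluster even though q1 and q3 were never linked directly, which is exactly the effect of the transitive closure; it also means a single false-positive link can merge two otherwise separate clusters.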

SLIDE 14

Datasets

Year  Split  PER   ORG   GPE   ALL   NIL
2009  Train  627   2710  567   3904  57.1%
2009  Test   500   500   500   1500  28.4%
2010  Train  1127  3210  1067  5404  49.1%
2010  Test   750   750   750   2250  54.7%
2011  Train  1877  3960  1817  7654  50.8%
2011  Test   750   750   750   2250  50.0%

SLIDE 15

Best accuracy: 82.2% (2009), 85.8% (2010), ??% (2011)

Algorithm            2009   2010   2011
SVMrank              0.817  0.835  0.793
Ranking Perceptron   0.817  0.846  0.79
AdaRank              0.783  0.833  0.76
ListNet              0.802  0.832  0.768
Coordinate Ascent    0.823  0.848  0.788

SLIDE 16

Query estimate accuracy: 87% (2009), 82% (2010), 79% (2011)

[Chart, series 2009, 2010, 2011; values as extracted: -1.40%, -2.00%, -0.40%, 0.20%, -3.30%, 1.40%, 1.30%, 0.10%, 1.10%, -0.60%, 0.00%, 0.60%, -1.50%, 0.00%, -1.20%]

SLIDE 17

Results for SVMrank + Random Forests

Year  Ranking  Validation  Overall
2009  0.864    0.906       0.817
2010  0.847    0.897       0.835
2011  0.779    0.892       0.793

SLIDE 18

[Chart: axis 72.0% to 78.0%; bars: Text, NER, Name, Popularity, LDA, Page type]

SLIDE 19

[Chart: axis 74.0% to 80.0%; bars: All, Name, Text, NER, LDA, Popularity, Page type]

SLIDE 20

Developed a fully functional, data-driven entity-linking system with state-of-the-art results for many cases;
Compared different algorithm and feature contributions;
Studied the impact of query-specific models, with mixed results but an overall poor impact on performance;
Resolve full documents using relational learning techniques.