FacetE: Exploiting Web Tables for Domain-Specific Word Embedding Evaluation - PowerPoint PPT Presentation


SLIDE 1

FacetE: Exploiting Web Tables for Domain-Specific Word Embedding Evaluation

Michael Günther, Paul Sikorski, Maik Thiele, and Wolfgang Lehner
DBTest ’20 Workshop at SIGMOD 2020, 19.06.2020

SLIDE 2

NLP Systems Workflow

[Figure: data storage with textual data → extracted text data (words) → language model → numerical representation (vectors), e.g. 5.02, 43.07, ….]

SLIDE 3

NLP Systems Workflow

State-of-the-art Language Models: Word Embeddings

[Figure: data storage with textual data (including relational databases with text data and extracted relational information) → extracted text data (words) → language model → numerical representation (vectors), e.g. 5.02, 43.07, …. The language model is obtained by training a deep neural network on a dummy task over large text corpora and extracting the weights as a pre-trained language model.]

SLIDE 4

NLP Systems Workflow

[Figure: the same workflow; the vector representations feed classification and regression tasks as well as similarity search tasks.]

SLIDE 5

Word Embedding for Systems

ML Systems

▪ Utilize implicitly encoded knowledge from large text corpora ▪ Capture semantic similarities of text values

Database Systems

▪ Semantic text similarity queries ▪ Data exploration ▪ Data integration

Information Retrieval Systems

▪ Semantic search ▪ Query expansion ▪ Multi-lingual search

The choice of the word embedding model is crucial for performance!

SLIDE 6

Evaluation of Word Embedding Models

Word Similarity

▪ Similar words retrieved by cosine similarity of word vectors ▪ Example: most similar to “king”? → prince, man, and queen

Analogy Queries

▪ Retrieve similar relations: $a - b \approx c - \,?$ ▪ Example: man − woman ≈ king − ? → queen

$\mathrm{sim}_{\cos}(\mathbf{x}, \mathbf{y}) = \dfrac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$

3CosAdd: $\arg\max_{d \in W \setminus \{a, b, c\}} \mathrm{sim}_{\cos}(\mathbf{d}, \mathbf{c} - \mathbf{a} + \mathbf{b})$

[Figure: schematic representation of word vectors; related words cluster together, e.g. London/England and Berlin/Germany, dry/drier/driest, wet/wetter/wettest, man/woman and king/queen/prince.]
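To make the query concrete, here is a minimal sketch of cosine similarity and 3CosAdd in Python, assuming word vectors are held in a plain dict of numpy arrays (an illustrative setup, not the evaluation code used in the paper):

```python
import numpy as np

def sim_cos(x, y):
    """Cosine similarity between two vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def three_cos_add(a, b, c, vectors):
    """Answer the analogy a : b ~ c : ? (e.g. man : woman ~ king : ? -> queen).

    `vectors` maps words to numpy arrays. Returns the word d maximizing
    sim_cos(d, c - a + b), excluding the query words themselves.
    """
    target = vectors[c] - vectors[a] + vectors[b]
    candidates = ((w, v) for w, v in vectors.items() if w not in (a, b, c))
    return max(candidates, key=lambda wv: sim_cos(wv[1], target))[0]

# Toy usage with hand-made vectors:
# vectors = {"man": np.array([1.0, 0.0]), "woman": np.array([1.0, 1.0]),
#            "king": np.array([2.0, 0.0]), "queen": np.array([2.0, 1.0])}
# three_cos_add("man", "woman", "king", vectors)  # -> "queen"
```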

SLIDE 7

Evaluation of Word Embedding Models

Common Similarity Datasets

▪ WS-353: 353 word pairs of general domain knowledge quantifying semantic relatedness
▪ SimLex-999: 999 word pairs of general domain knowledge quantifying semantic similarity

Common Analogy Query Datasets

▪ Google Analogy: 550 semantic and syntactic relations, mostly city-country relations
▪ MSR: 8,000 analogies of 800 syntactic relations

Limitations:
▪ Only small datasets
▪ Return a single value only
▪ Only general domain

Similarity datasets depend on a human notion of similarity → require human labeling effort. Analogy datasets contain facts of general domain knowledge → automatic extraction is possible.

Similarity Eval*:

Embedding Model | WS353 | RW | …
CBOW | 57.2 | 32.5 | …
SkipGram | 62.8 | 37.2 | …
… | … | … | …

Analogy Eval*:

Embedding Model | Semantic | Syntactic | Total
CBOW | 57.3 | 68.9 | 63.7
SkipGram | 66.1 | 65.1 | 65.6
… | … | … | …

* Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

SLIDE 8

Evaluation of Word Embedding Models

Limitations (see SLIDE 7): only small datasets; return a single value only; only general domain; dependence on a human notion of similarity → human labeling effort. Facts of general domain knowledge → automatic extraction possible.

Design Goals:
▪ Flexible structure
▪ Multiple categories
▪ Large number of relations

Design Strategies:
▪ Organization in facets
▪ Extraction from millions of web tables
▪ Definition of categories

SLIDE 9

Dataset Design

Data Source: Web Tables

▪ Large amount of knowledge
▪ General enough to be expected in pre-trained word embedding models
▪ Redundancy allows excluding temporary facts (e.g. time-dependent facts such as home soccer team vs. visiting team)

Target Design: Facets

▪ Each facet F: O → V assigns objects (e.g. soccer players) to values (e.g. teams)
▪ Allows flexible construction of application-specific evaluation datasets
▪ More flexible than hierarchical categorization

[Figure: web tables corpus → header pairs and values (e.g. Team: AC Milan, Arsenal, …; Position: Keeper, Forward, …; Country: England, Brazil, …; City: New York, …; IATA: LGW, LHR, …) → FacetE storage format with facets such as Soccer Player → Team, Soccer Player → Country, Airport → City, Airport → IATA → collection of facets grouped into categories, e.g. Sports (Team → Rank, Event → Rank) and Economy (Airport → IATA, Airport → Country, Airport → Area).]
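A facet can be pictured as a simple mapping plus metadata. The following sketch shows one plausible in-memory representation; the class and field names are mine for illustration, not the actual FacetE storage format:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Facet:
    """A facet F: O -> V assigns objects of one type to values of one type."""
    object_type: str                      # e.g. "Soccer Player"
    value_type: str                       # e.g. "Team"
    pairs: Dict[str, str] = field(default_factory=dict)  # object -> value
    category: str = ""                    # e.g. "Sports", assigned in step 4

soccer_team = Facet("Soccer Player", "Team",
                    {"Bergkamp": "Arsenal", "Maldini": "AC Milan"}, "Sports")
```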

SLIDE 10

Extraction Pipeline

[Figure: extraction pipeline - (1) Pre-Filtering: frequency and regex filters, facet creation; (2) Soft Functional Dependencies: check contradictions of the most frequent relation; (3) Post-Filtering: filter by pooling, blacklist, …; (4) Categorization: assign facets to 8 broader categories. Input: 125M web tables; output: 250 facets / 600K values, used for word embedding analogy evaluation.]

SLIDE 11

Extraction Pipeline

1) Pre-Filtering

▪ Filters infrequent columns and non-textual data from English tables (sketched below)


[Figure: column tuples extracted from tables, e.g. (Country, Date), (Team, Country), (Team, Nickname, Country) - the basis for facets; infrequent columns and non-textual data are removed.]
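A rough sketch of such a pre-filter; the frequency threshold and the regex for non-textual cells are illustrative guesses, not the paper's actual settings:

```python
import re
from collections import Counter

NON_TEXTUAL = re.compile(r"^[\d\s.,:/%$€-]*$")  # digits, dates, punctuation only
MIN_PAIR_FREQ = 5                               # illustrative threshold

def pre_filter(column_tuples):
    """Keep (header pair, (object, value)) rows that are textual and whose
    header pair occurs frequently enough across the corpus."""
    freq = Counter(header for header, _ in column_tuples)
    return [
        (header, (obj, val))
        for header, (obj, val) in column_tuples
        if freq[header] >= MIN_PAIR_FREQ
        and not NON_TEXTUAL.match(obj)
        and not NON_TEXTUAL.match(val)
    ]
```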

SLIDE 12

Extraction Pipeline

2) Soft Functional Dependencies

▪ Determine static facts: 1) determine the most frequent relation pairs, 2) check for contradictions

Input web tables with (Team, Country) columns:

Team | Country
Arsenal | England
AC Milan | Italy
Juventus | Italy

Team | Country
Arsenal | United Kingdom
AC Milan | Italy

1) Determine most frequent relation pairs:

$\mathrm{SFD}(o, v) = \dfrac{\mathrm{count}(o, v)}{\sum_{v' : (o, v')} \mathrm{count}(o, v')}$

2) Check for contradictions: "England" is the most frequent country for "Arsenal", with $\mathrm{SFD}(\mathrm{Arsenal}, \mathrm{England}) = \frac{2}{3}$ and one contradiction ("United Kingdom").

Resulting static facts:

Team | Country
AC Milan | Italy
Juventus | Italy
Arsenal | England
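The SFD score can be computed with a couple of counters. A minimal sketch, assuming the extracted pairs are given as a flat list:

```python
from collections import Counter, defaultdict

def most_frequent_values(pairs):
    """Compute SFD(o, v) = count(o, v) / sum over v' of count(o, v') and
    keep, per object, the most frequent value together with its score."""
    counts = Counter(pairs)                # (object, value) -> occurrences
    totals = defaultdict(int)              # object -> total occurrences
    for (o, _), c in counts.items():
        totals[o] += c
    best = {}
    for (o, v), c in counts.items():
        score = c / totals[o]
        if o not in best or score > best[o][1]:
            best[o] = (v, score)
    return best

pairs = [("Arsenal", "England"), ("Arsenal", "England"),
         ("Arsenal", "United Kingdom"), ("AC Milan", "Italy")]
print(most_frequent_values(pairs))
# {'Arsenal': ('England', 0.666...), 'AC Milan': ('Italy', 1.0)}
```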

SLIDE 13

Extraction Pipeline

3) Post-Filtering

▪ Blacklists: remove too generic facets
▪ Word Embedding Pooling: retain only facets modeled by at least one word embedding model (see the sketch below)


[Example: a facet with the generic header pair (Name, Description) is removed by the blacklist.]
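A sketch of the two post-filters; the blacklist entry, the vocabulary-based coverage check, and the threshold are assumptions for illustration:

```python
BLACKLIST = {("Name", "Description")}   # illustrative entry for a too-generic facet

def post_filter(facets, embedding_vocabularies, min_coverage=0.5):
    """Drop blacklisted header pairs and keep only facets that at least one
    word embedding model covers sufficiently (pooling over models).

    `facets` maps (object_type, value_type) to {object: value};
    `embedding_vocabularies` is a list of word sets, one per model.
    """
    kept = {}
    for header, pairs in facets.items():
        if header in BLACKLIST:
            continue
        modeled = any(
            sum(o in vocab and v in vocab for o, v in pairs.items())
            >= min_coverage * len(pairs)
            for vocab in embedding_vocabularies
        )
        if modeled:
            kept[header] = pairs
    return kept
```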

SLIDE 14

Extraction Pipeline

4) Categorization

▪ Assign each of the 250 facets to one of 8 broader categories (e.g. geographic, music, sports, …)


[Figure: similarity to keywords - the values of a facet (e.g. Team → Country: AC Milan/Italy, Juventus/Italy, Arsenal/England) are compared via a word embedding model against keywords for each category (Cat./Sim.: Music 0.15, Sports 0.53, …); the facet is assigned to the most similar category.]
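One way to compute such a category similarity: average the facet's word vectors, average each category's keyword vectors, and pick the most similar category. The keyword lists below are invented for illustration, and the code reuses sim_cos from the 3CosAdd sketch above:

```python
import numpy as np

CATEGORY_KEYWORDS = {                       # illustrative keyword lists
    "sports": ["team", "league", "player", "match"],
    "music":  ["album", "song", "artist", "band"],
}

def mean_vector(words, vectors):
    """Average the vectors of all words the model knows."""
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0)

def categorize(facet_words, vectors):
    """Assign the facet to the category whose keywords are most similar
    (by cosine similarity of mean vectors) to the facet's words."""
    fv = mean_vector(facet_words, vectors)
    scores = {cat: sim_cos(fv, mean_vector(kw, vectors))
              for cat, kw in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get), scores  # e.g. ("sports", {...})
```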

SLIDE 15

Evaluation

Evaluation of Categories

Setup:
▪ 4 pre-trained word embedding models: GloVe, Word2Vec-SkipGram, fastText, SentenceBert
▪ Selection of 4 FacetE categories

Calculation (sketched below):
▪ Select facets F: O → V from the categories
▪ Determine the value V for each object O with the 3CosAdd analogy method
▪ Calculate the fraction of correctly assigned values
▪ Calculate the average in each category

Coverage: for some text values, word embedding models cannot determine a vector.

[Figure: evaluation of the 4 categories.]
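A sketch of the per-facet accuracy and coverage computation, reusing three_cos_add from the earlier sketch; the choice of a random anchor pair per query is my assumption about how the analogy is instantiated:

```python
import random

def evaluate_facet(facet_pairs, vectors):
    """Return (accuracy, coverage) of one embedding model on one facet.

    Each (object, value) pair is queried as anchor_obj : anchor_val ~
    object : ?, counting a hit when 3CosAdd retrieves the correct value.
    """
    covered = [(o, v) for o, v in facet_pairs.items()
               if o in vectors and v in vectors]
    coverage = len(covered) / len(facet_pairs)
    if len(covered) < 2:
        return 0.0, coverage
    hits = 0
    for o, v in covered:
        a, b = random.choice([p for p in covered if p != (o, v)])
        if three_cos_add(a, b, o, vectors) == v:
            hits += 1
    return hits / len(covered), coverage

# Category score: average evaluate_facet accuracy over the category's facets.
```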

SLIDE 16

Evaluation

Evaluation of Categories

[Figure: evaluation results for the 4 categories, as on SLIDE 15.]

Observation

  • No single best model
  • High Coverage
SLIDE 17

Evaluation

Evaluation of a Single Object Set

Setup:
▪ 4 pre-trained word embedding models: GloVe, Word2Vec-SkipGram, fastText, SentenceBert
▪ Selection of all facets for cities

Calculation (sketched below):
▪ Determine the value V for each object O with the 3CosAdd analogy method
▪ Calculate the fraction of correctly assigned values for each city name
▪ Calculate the average across all objects

[Figure: evaluation of a single object set - cities.]
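For the per-object view, hits are grouped by object before averaging. A sketch building on the pieces above (the deterministic anchor choice is again an assumption):

```python
from collections import defaultdict

def evaluate_object_set(facets, vectors):
    """Average accuracy per object (e.g. per city) across all facets that
    contain it, then the mean over all objects. `facets` is an iterable
    of {object: value} dicts; reuses three_cos_add from above."""
    hits = defaultdict(list)
    for pairs in facets:
        covered = [(o, v) for o, v in pairs.items()
                   if o in vectors and v in vectors]
        if len(covered) < 2:
            continue
        for i, (o, v) in enumerate(covered):
            a, b = covered[(i + 1) % len(covered)]     # fixed anchor pair
            hits[o].append(three_cos_add(a, b, o, vectors) == v)
    per_object = {o: sum(h) / len(h) for o, h in hits.items()}
    return sum(per_object.values()) / len(per_object), per_object
```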

SLIDE 18

Evaluation

Evaluation of a Single Object Set

[Figure: evaluation of a single object set - cities, as on SLIDE 17.]

Observation

Word2Vec performs better on geographic data; however, GloVe has a better representation of cities.

SLIDE 19


Conclusion

Web Table Extraction Pipeline

▪ Web tables are a good resource for structured relations of general common knowledge
▪ The pipeline can process millions of tables → reusable for other table corpora

Facet Structure

▪ Enables flexible construction of evaluation datasets
▪ Evaluation at different granularities: single facts (e.g. City → Country), objects (e.g. cities), or domains (e.g. geographic)

Evaluation of Common Word Embedding Models

▪ Large differences in accuracy across domains
▪ No single best model for all cases

[Figure: summary - 125M web tables → extraction pipeline → 250 facets / 600K values (e.g. Soccer Player → Team, Soccer Player → Country, Airport → City, Airport → IATA) → evaluation of fastText, GloVe, and Word2Vec.]

FacetE Dataset: https://www.kaggle.com/guenthermi/facete
