
DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System


  1. DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System
     Yoan Chabot (Orange Labs, @yoan_chabot), Thomas Labbé (Orange Labs, @tau_labbe), Jixiong Liu (Orange Labs), Raphaël Troncy (EURECOM, @rtroncy)
     Orange restricted

  2. Context & Goals
     • Design a semantic engine able to query (semi-)structured data: "I want to have precise and relevant answers to my queries expressed in natural language, without having to know the target database(s) model(s)"
     • We focus on tabular data: annotate the content and structure of tabular data for searching and recommending datasets

  3. Tabular Data to Knowledge Graph Matching
     • Goals: CTA (Column-Type Annotation), CEA (Cell-Entity Annotation), CPA (Columns-Property Annotation)
     • 1st step: preprocessing to identify table characteristics (orientation, key column…)
     • 2nd step: annotation workflows
        ‒ Method 1: Baseline lookups
        ‒ Method 2: Embedding approach
     • We focus on the CTA and CEA tasks. CPA processing: list of properties associated with entity pairs, plus majority voting
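The CPA step on this slide (properties associated with entity pairs, then majority voting) can be sketched as follows. This is a minimal illustration, not the authors' code: the input shape (one list of candidate properties per row) and the function name `cpa_majority_vote` are assumptions, and the Wikidata property IDs are only sample values.

```python
from collections import Counter


def cpa_majority_vote(pair_properties):
    """Return the property observed for the most rows, or None if there is none.

    pair_properties: one list per table row, holding the candidate properties
    that link that row's (subject, object) entity pair. Hypothetical shape.
    """
    votes = Counter()
    for props in pair_properties:
        votes.update(set(props))  # each row votes at most once per property
    if not votes:
        return None
    return votes.most_common(1)[0][0]


rows = [
    ["P17", "P131"],  # P17 = country, P131 = located in the administrative territorial entity
    ["P17"],
    ["P131", "P17"],
    [],               # no property found for this row's entity pair
]
print(cpa_majority_vote(rows))  # P17
```

Rows without any candidate property simply do not vote, so sparse lookups do not block the annotation of the column pair.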

  4. Preprocessing (new homogeneity factor)
     • Pipeline: Datatable corpus (CSV, TSV, HTML, …) → Converter → Table in WTC format → Primitive typing, Header detection, Table orientation detection, Key column detection → Pre-processed tables
     • Algorithms: content-based algorithm (homogeneity factor), DWTC algorithm [1]
     • Primitive typing: Object, Unit, Number, Date, Unknown
     • Homogeneity factor of a row or column $x$ whose cells have primitive types $t_i$:
       $\mathrm{Hom}(x) = \left[ \frac{1}{\mathrm{len}(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - 2 \cdot \frac{\mathrm{count}(t_i)}{\mathrm{len}(x)} \right)^{2} \right) \right]^{2}$
     • Orientation rule: $\mathrm{Mean}(CH) < \mathrm{Mean}(RH) \rightarrow \textbf{Horizontal}$
     • Header rule: $\exists\, col \text{ where } \mathrm{Hom}(col[0{:}3]) \neq 0 \rightarrow \textbf{Header} = \textbf{true}$
     • Example (lakes table; headers Lake, Area, Depth, Country): each row (Windermere, Kielder Reservoir, Ullswater, Bassenthwaite Lake, Derwent Water) is typed String_number, String_number, String, unknown, with row homogeneity Hom. RH = 0.89; column homogeneity Hom. CH = 0, 0, 0
     [1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
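The preprocessing rules on this slide can be sketched in a few lines. The formula is badly damaged in the slide text, so `hom` below is only one possible reading of it (the placement of the squares and the normalisation are assumptions); it does reproduce the property visible in the example, namely a score of 0 when all cells share one type. The two decision rules follow the slide: mean column homogeneity below mean row homogeneity means "horizontal", and a non-zero score on the first three cells of any column means a header is present. The "vertical" fallback and all function names are assumptions.

```python
from collections import Counter
from statistics import mean


def hom(types):
    """Homogeneity factor of a sequence of primitive types (assumed reading).

    Returns 0.0 when every cell has the same type; larger values mean the
    types are more mixed.
    """
    n = len(types)
    if n == 0:
        return 0.0
    counts = Counter(types)
    inner = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in types) / n
    return inner ** 2


def orientation(typed_table):
    """Apply Mean(CH) < Mean(RH) -> horizontal; 'vertical' otherwise (assumed)."""
    columns = list(zip(*typed_table))
    mean_rh = mean(hom(row) for row in typed_table)
    mean_ch = mean(hom(col) for col in columns)
    return "horizontal" if mean_ch < mean_rh else "vertical"


def has_header(typed_table):
    """A header exists if the first three cells of some column are not homogeneous."""
    columns = list(zip(*typed_table))
    return any(hom(col[0:3]) != 0 for col in columns)


# Primitive types of a small table: one text column, two numeric columns.
data = [["object", "number", "number"] for _ in range(3)]
with_header = [["object", "object", "object"]] + data

print(orientation(data))        # horizontal
print(has_header(data))         # False
print(has_header(with_header))  # True
```

Working on primitive types rather than raw cell values keeps the checks independent of the table's language and vocabulary, which fits the "context-free" goal stated in the title.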

  5. Baseline lookups
     ‒ Lookups from all table cells (4 external sources + 1 internal Wikidata ES)
     ‒ Wikidata as pivot metadata
     ‒ DBpedia translation (uri & types)
     ‒ TF-IDF-like types scoring
     ‒ Entities disambiguation with target type(s)
     Pipeline (from the diagram): Pre-processed tables → Entities lookups (API, CirrusSearch, Ingestion server, SPARQL) → DBpedia entity uri & types → Types scoring → Type(s) selection (CTA output) → Entities disambiguation (CEA output)
     Example input: Lake / Area: Windermere / 14,73 km²; Kielder Reservoir / 10,86 km²
     Example lookup results: {title: "Q119936", label: "Windermere"}, {title: "Q390370", label: "Windermere"}, …
     Example DBpedia types: {"mainType": "populated place", "types": "settlement", "subTypes": ""}
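The slide names a "TF-IDF-like types scoring" without giving its formula. The sketch below is therefore only an illustration of the general idea, not the DAGOBAH scoring: a type scores higher when many cells of the column have a candidate of that type (the TF part) and lower when the type is very common in the knowledge graph (the IDF part). The function name, the global frequency table and all numbers are made up for the example.

```python
import math
from collections import Counter


def score_types(cell_candidate_types, type_frequency, n_entities):
    """Score candidate column types in a TF-IDF-like way (illustrative only).

    cell_candidate_types: one list of candidate types per cell of the column.
    type_frequency: hypothetical count of knowledge-graph entities per type.
    n_entities: hypothetical total number of entities in the knowledge graph.
    """
    n_cells = len(cell_candidate_types)
    if n_cells == 0:
        return {}
    support = Counter()
    for types in cell_candidate_types:
        support.update(set(types))  # a cell supports each type at most once
    return {
        t: (count / n_cells) * math.log(n_entities / (1 + type_frequency.get(t, 0)))
        for t, count in support.items()
    }


column = [
    ["lake", "populated place"],
    ["lake", "populated place"],
    ["populated place", "settlement"],
]
frequency = {"populated place": 500, "lake": 20, "settlement": 100}  # invented counts
scores = score_types(column, frequency, n_entities=1000)
print(max(scores, key=scores.get))  # lake
```

Here "populated place" is supported by every cell but is too generic to win, while "lake" is both well supported and specific.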

  6. Embedding approach
     ‒ Embedding enrichment through Wikidata ES server
     ‒ Regex + Levenshtein lookup
     ‒ K-means clustering over candidates space
     ‒ Scoring algorithm to extract best cluster and deduce target type
     ‒ Candidates disambiguation from clusters, types and entities scores
     Pipeline (from the diagram): Embedding (OpenKE [2]) → Enrichment → Lookup → Entities candidates → Candidates clustering (lookup + table-based hyperparameters) → Clusters scoring → Candidates' types scoring (CTA output) → Candidates' entities scoring (CEA output)
     Example enriched embedding Q223687: Id: ["Q223687"], label: ["Wes Anderson"], aliases: ["Wesley Wales Anderson"], types: ["Q5", "dbPedia.Person"], subTypes: ["dbPedia.Director", "Q2526255", "Q36180"]
     Example input: Title / Director: Rushmore / Anderson; Fight Club / Fincher
     [2] http://139.129.163.161/index/toolkits#pretrained-wikidata
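The "Regex + Levenshtein lookup" step can be illustrated with a small self-contained sketch. It is an assumption about how such a lookup might work, not the authors' implementation: a regular expression first keeps the labels that contain a token of the cell as a whole word, then a normalised Levenshtein similarity ranks the survivors. The label list and function names are hypothetical.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def lookup(cell, labels):
    """Regex pre-filter on the cell's tokens, then rank by similarity."""
    tokens = re.findall(r"\w+", cell.lower())
    if not tokens:
        return []
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, tokens)) + r")\b")
    hits = [label for label in labels if pattern.search(label.lower())]
    scored = [(label, similarity(cell.lower(), label.lower())) for label in hits]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["Wes Anderson", "Paul Thomas Anderson", "Paul W. S. Anderson", "David Fincher"]
results = lookup("Anderson", labels)
print([label for label, _ in results])
# ['Wes Anderson', 'Paul W. S. Anderson', 'Paul Thomas Anderson']
```

A string-only lookup cannot separate the three directors reliably, which is why the slide follows it with clustering and scoring over the candidates' embeddings.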

  7. Embedding approach example
     • Candidates scoring (CTA): $S_c(\mathrm{Q941209})$
     • Clusters scoring: $S_k(\mathrm{cluster\#2})$
     • Entities scoring (CEA): $S_e(i) = 0.25 \cdot S_k(n) + 0.5 \cdot R(T) + 0.2 \cdot S_c(i)$
     • Entities disambiguation: $S_e(\text{Paul Thomas Anderson}),\ \boldsymbol{S_e(\text{Wes Anderson})} > S_e(\text{Paul W. S. Anderson})$
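The entity score on this slide is a weighted sum of three terms with weights 0.25, 0.5 and 0.2. The sketch below applies those weights to pick a candidate. The slide does not define the middle term, so it is treated here as an opaque value in [0, 1]; the parameter names and every input score are invented for illustration.

```python
def entity_score(cluster_score, r_term, candidate_score):
    """Weighted sum using the slide's weights (0.25, 0.5, 0.2)."""
    return 0.25 * cluster_score + 0.5 * r_term + 0.2 * candidate_score


# (cluster score, middle term, candidate score): hypothetical inputs.
candidates = {
    "Wes Anderson": (0.9, 0.67, 0.8),
    "Paul Thomas Anderson": (0.9, 0.40, 0.8),
    "Paul W. S. Anderson": (0.9, 0.42, 0.6),
}

scores = {name: entity_score(*terms) for name, terms in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # Wes Anderson
```

With these inputs all three candidates share a cluster score, so the middle term and the candidate score decide the outcome.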

  8. Results

     Table 1: Preprocessing results

     Task / Tool              DWTC            DAGOBAH
     Orientation Detection    0.9             0.957
     Header Extraction        Not evaluated   1.0
     Key Column Detection     0.857           0.986

     Table 2: Round 1 results (own evaluator < AIcrowd evaluator)

                   CTA                                CEA
     Approach      F1      Precision   AH      AP     F1      Precision
     Baseline      0.517   0.482       NA      NA     0.784   0.814
     Baseline++    0.641   0.641       1.108   0.246  0.881   0.890
     Embedding     0.683   0.683       1.483   0.258  0.840   0.852

     Pros and cons

     Baseline
        Pros: high coverage (multiple sources); computational efficiency
        Cons: lookup-services dependency (reliability); blackbox (indexing, scoring…); queries volume
     Embedding
        Pros: lookup strategy independence; relevant clustering even with few data; generalization (no tailored cleaning + less heuristics in lookups and scoring)
        Cons: computational performances; K optimization; embedding dependency

  9. Discussion & Future Work
     • Performance bottlenecks (due to the challenge context):
        ‒ Light data cleaning … on purpose
        ‒ Basic lookup strategies … on purpose (e.g. no use of dictionary)
        ‒ Missing Wikidata – DBpedia type mappings
        ‒ Subset embedding (restricted to baseline candidates)
     • Future work:
        ‒ Test other Wikidata embedding methods (on the whole space)
        ‒ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
        ‒ Experiment with more clustering algorithms and parameters on different datasets
        ‒ Learn data table embedding and find vectorial transformation(s) with KG embedding space
        ‒ …

  10. DAGOBAH: Datatable-powered Accurate-knowledge Graph for Outstanding and Beautiful Answers to Humans. Thanks!
