

slide-1
SLIDE 1

Extracting Linked Data from statistic spreadsheets

Tien-Duc Cao tien-duc.cao@inria.fr
Ioana Manolescu ioana.manolescu@inria.fr
Xavier Tannier xtannier@limsi.fr

Semantic Big Data workshop, Chicago, May 19th, 2017

slide-2
SLIDE 2

Agenda

1. Context: data journalism and journalistic fact-checking
2. Research problem: extracting linked open data from spreadsheets
3. Approach
4. Results
5. Future work

Tien-Duc CAO, Ioana Manolescu, Xavier Tannier, "Extracting linked data from statistic spreadsheets", 19/05/2017

slide-3
SLIDE 3
  • 1. Fact-checking is a content management problem

Diagram elements:

  • Claim to be checked (text or data)
  • Media content
  • Media context
  • Human actors (journalists, experts, crowd workers)
  • Reference information sources 1, 2, …, n
  • Verification tool (query, match, source search…)
  • Analysis result: « True / rather true / rather false / false. See sources: http://dataref.com… »

slide-4
SLIDE 4
  • 1. Fact-checking is a content management problem

Diagram elements added to the previous slide:

  • Claim extraction
  • Social network analysis
  • Reconciliation, reputation
  • Reference information source n+1
  • Source search / source selection
  • Reference source construction, refinement, integration

slide-5
SLIDE 5
  • 1. Context
  • Which data sources can help us fact-check a statistical claim from the media?
  • E.g.: “The unemployment rate in France last year was 50%?”
  • This work is part of the ContentCheck project: https://team.inria.fr/cedar/contentcheck/

slide-6
SLIDE 6
  • 2. Research problem: high-quality reference data
  • National statistics institutes such as INSEE (https://insee.fr/), France’s economic and societal statistics institute, are often valuable data providers

Chart from: http://abonnes.lemonde.fr/les-decodeurs/portfolio/2017/04/18/les-fractures-francaises-1-5-le-logement-les-raisons-de-la-crise_5112859_4355770.html

Chart series: existing house price index, available revenue per head, rent index, consumer price index

slide-7
SLIDE 7
  • 2. The road to high quality data…

Unfortunately, most of the data published by INSEE looks like this (our text coloring):

slide-8
SLIDE 8
  • 2. The road to high quality data…

Sometimes there is more than one table per sheet

slide-9
SLIDE 9
  • 3. Extraction approach

Image sources:
https://www.iconfinder.com/icons/7661/excel_microsoft_word_xls_icon#size=128
https://www.w3.org/RDF/icons/rdf_w3c.svg

slide-10
SLIDE 10
  • 3. Extraction approach


slide-11
SLIDE 11
  • 3. Approach: finding table boundaries


slide-12
SLIDE 12
  • 3. Extraction approach


slide-13
SLIDE 13
  • 3. Approach: table extractor


  • Header cells mostly contain text
  • They are located at:
      • the top of the table (header rows)
      • the left of the table (header columns)
  • Having more than one header row/column indicates data aggregation
  • Data cells mostly contain numeric values
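
The two observations above (header cells are mostly text, data cells are mostly numeric) can be turned into a simple majority vote per row or column. The following is a minimal sketch under stated assumptions, not the authors' extractor: the names `cell_type`, `line_kind` and the `MISSING_MARKERS` set are hypothetical, and cell values are assumed to arrive already typed (numbers as `int`/`float`, everything else as strings or `None`).

```python
from collections import Counter

# Hypothetical markers for "missing value"; real spreadsheets may use others.
MISSING_MARKERS = {"n.d.", "nd", "-", "..."}


def cell_type(value):
    """Return 'empty', 'missing', 'number' or 'text' for one cell value."""
    if value is None or (isinstance(value, str) and not value.strip()):
        return "empty"
    if isinstance(value, bool):
        # bool is a subclass of int in Python; do not count it as numeric data.
        return "text"
    if isinstance(value, (int, float)):
        return "number"
    if str(value).strip().lower() in MISSING_MARKERS:
        return "missing"
    return "text"


def line_kind(cells):
    """Label one row or column 'header', 'data' or 'empty' by majority vote."""
    counts = Counter(cell_type(c) for c in cells)
    text = counts["text"]
    numeric = counts["number"] + counts["missing"]
    if text == 0 and numeric == 0:
        return "empty"
    # Ties go to 'data' so a row with one label and one value is not a header.
    return "header" if text > numeric else "data"


print(line_kind(["Region", "Men", "Women"]))
print(line_kind(["Paris", 10.5, 12]))
```

Missing-value markers are counted on the numeric side because they stand in for a data value, which is a design choice of this sketch rather than something stated on the slide.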


slide-14
SLIDE 14
  • 3. Approach: table extractor

1. We distinguish header rows/columns from data rows/columns using:

  • the data types of their cells (text, number, a special value indicating a missing value, null for an empty cell)
  • the formatting of their cells: cell borders, and whether cells belong to a merged cell
  • the types of their neighboring rows/columns

2. Based on these, we identify the exact structure of each table

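The third criterion above, the types of neighboring rows/columns, can be illustrated with a small sketch. It assumes each row already carries a provisional label ('header', 'data', or the hypothetical label 'unknown' for ambiguous rows), relabels an unknown row when both of its neighbors agree, and then reads the leading header rows off the top of the table. Formatting cues (borders, merged cells) are not modeled, and the function names are illustrative, not the authors' code.

```python
def smooth_with_neighbors(kinds):
    """Relabel 'unknown' lines with the label shared by both neighbors, if any."""
    result = list(kinds)
    for i, kind in enumerate(kinds):
        if kind != "unknown":
            continue
        before = kinds[i - 1] if i > 0 else None
        after = kinds[i + 1] if i + 1 < len(kinds) else None
        # Only relabel when both neighbors exist and carry the same firm label.
        if before == after and before in ("header", "data"):
            result[i] = before
    return result


def split_header_and_data(kinds):
    """Return (header_indices, data_indices): leading headers, then data lines."""
    headers = []
    for i, kind in enumerate(kinds):
        if kind != "header":
            break
        headers.append(i)
    data = [i for i in range(len(headers), len(kinds)) if kinds[i] == "data"]
    return headers, data


labels = smooth_with_neighbors(["header", "header", "data", "unknown", "data"])
print(labels)
print(split_header_and_data(labels))
```

In this sketch, more than one index in the first returned list corresponds to the multi-row header case that the previous slide associates with data aggregation.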
slide-15
SLIDE 15
  • 3. Conceptual data model

14 Tien-Duc CAO, Ioana Manolescu, Xavier Tannier 
 "Extracting linked data from statistic spreadsheets" 19/05/2017

slide-16
SLIDE 16
  • 4. Results
  • We collected 16011 Excel spreadsheets and extracted 74117 tables.
  • Accuracy evaluation:
      • We randomly selected 100 Excel files → 2432 tables
      • We visually identified the header cells, data cells and header hierarchy, then compared them with those obtained from our system.
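
The slide does not state how agreement between the manual annotation and the system output was scored. One simple option, assumed here purely for illustration, is per-cell label accuracy: the share of cells whose system label matches the manual one. The function name and the toy labels below are hypothetical and are not taken from the evaluation.

```python
def label_accuracy(expected, predicted):
    """Fraction of cells whose predicted label equals the manually assigned one."""
    if expected.keys() != predicted.keys():
        raise ValueError("both labelings must cover the same cells")
    if not expected:
        raise ValueError("no cells to compare")
    matches = sum(1 for cell, label in expected.items() if predicted[cell] == label)
    return matches / len(expected)


# Toy labels keyed by (row, column); illustrative only, not evaluation data.
manual = {(0, 0): "header", (0, 1): "header", (1, 0): "header", (1, 1): "data"}
system = {(0, 0): "header", (0, 1): "header", (1, 0): "data", (1, 1): "data"}

print(label_accuracy(manual, system))  # 0.75 (3 of 4 toy cells agree)
```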


slide-17
SLIDE 17

  • 4. Sample extracted RDF
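
The sample shown on this slide is an image and is not reproduced here. As a rough idea of what emitting RDF for one data cell could look like, the sketch below writes N-Triples lines linking a cell to its value and to the row and column headers it falls under. The namespaces, property names and function names are all hypothetical placeholders; they are not the vocabulary of the authors' conceptual data model.

```python
BASE = "http://example.org/table/"    # hypothetical namespace
VOCAB = "http://example.org/vocab#"   # hypothetical vocabulary


def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def cell_triples(table_id, row, col, value, row_header, col_header):
    """Return N-Triples lines describing one data cell and its two headers."""
    cell = f"<{BASE}{table_id}/cell/{row}-{col}>"
    return [
        f"{cell} <{VOCAB}value> {literal(str(value))} .",
        f"{cell} <{VOCAB}rowHeader> {literal(row_header)} .",
        f"{cell} <{VOCAB}colHeader> {literal(col_header)} .",
    ]


for line in cell_triples("t1", 2, 3, 10.5, "Paris", "2016"):
    print(line)
```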
slide-18
SLIDE 18
  • 5. Future work

Diagram elements:

  • Reference information sources 1, 2, …, n
  • Verification tool (query, match, source search…)
  • Source search / source selection
  • Reference source construction, refinement, integration

slide-19
SLIDE 19

Thanks / questions?


Excel files and extracted RDF files (10.5 GB, link expires on May 29th, 2017): https://goo.gl/4Y5Dtv

Source code (no expiration date :)):
https://gitlab.inria.fr/cedar/insee-crawler
https://gitlab.inria.fr/cedar/excel-extractor