Extracting Linked Data from statistic spreadsheets
Tien-Duc Cao tien-duc.cao@inria.fr Ioana Manolescu ioana.manolescu@inria.fr Xavier Tannier xtannier@limsi.fr
Semantic Big Data workshop, Chicago, May 19th, 2017
Agenda
1. Context: data journalism and journalistic fact-checking
2. Research problem: extracting linked open data from spreadsheets
3. Approach
4. Results
5. Future work
Tien-Duc CAO, Ioana Manolescu, Xavier Tannier "Extracting linked data from statistic spreadsheets" 19/05/2017
Claim to be checked (text
Media content
Media context
Reference information source 1
Reference information source 2
…
Reference information source n
Human actors (journalists, experts, crowd workers)
Verification tool (query, match, source search…)
Analysis result: « True / rather true / rather false / false. See sources: http://dataref.com… »
Additional components around the workflow:
Claim extraction
Social network analysis
Reconciliation, reputation
Reference information source n+1
Source search / source selection
Reference source construction, refinement, integration
1 https://team.inria.fr/cedar/contentcheck/
INSEE¹, France's economic and societal statistics institute, is often a valuable data provider
¹ https://insee.fr/
http://abonnes.lemonde.fr/les-decodeurs/portfolio/2017/04/18/les-fractures-francaises-1-5-le-logement-les-raisons-de-la-crise_5112859_4355770.html
Chart series: existing house price index, available revenue per head, rent index, consumer price index
Unfortunately, most of the data published by INSEE looks like this (text coloring is ours):
Sometimes there is more than one table per sheet
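One simple way to handle several tables on one sheet is to split the sheet on fully empty rows. The sketch below assumes the sheet has already been loaded as a list of rows, each a list of cell values with None for empty cells. The function names `is_blank` and `split_tables` and the blank-row criterion are illustrative assumptions, not the method used in this work.

```python
from typing import List, Optional, Union

Cell = Optional[Union[str, float, int]]
Row = List[Cell]


def is_blank(row: Row) -> bool:
    """A row is blank when every cell is None or whitespace-only text."""
    return all(c is None or (isinstance(c, str) and not c.strip()) for c in row)


def split_tables(sheet: List[Row]) -> List[List[Row]]:
    """Split a sheet into blocks of consecutive non-blank rows (assumed criterion)."""
    tables: List[List[Row]] = []
    current: List[Row] = []
    for row in sheet:
        if is_blank(row):
            # A blank row closes the table collected so far, if any.
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)
    return tables


# Hypothetical sheet holding two tables separated by one empty row.
sheet = [
    ["Region", "Men", "Women"],
    ["North", 10.0, 11.5],
    [None, None, None],
    ["Sector", "Men", "Women"],
    ["Industry", 3.2, 3.4],
    ["Services", 7.1, 7.9],
]
tables = split_tables(sheet)
```

Real sheets may also place tables side by side or separate them with notes rather than empty rows, which this sketch does not cover.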
Image sources: https://www.iconfinder.com/icons/7661/excel_microsoft_word_xls_icon#size=128 https://www.w3.org/RDF/icons/rdf_w3c.svg
indicates data aggregation
1. We distinguish header/data rows/columns using
2. Based on these, we identify the exact structure of each table
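The two steps above can be sketched as follows. The slide does not preserve the exact criteria used to tell headers from data, so this sketch assumes a simple cell-type heuristic: a row is a data row when at least half of its non-empty cells are numeric, and the leading columns that hold no number in any data row are header columns. The names `classify_rows` and `table_structure` are hypothetical.

```python
from typing import Dict, List, Optional, Union

Cell = Optional[Union[str, float, int]]
Row = List[Cell]


def is_number(cell: Cell) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(cell, (int, float)) and not isinstance(cell, bool)


def classify_rows(table: List[Row]) -> List[str]:
    """Step 1 (assumed heuristic): label each row 'data' or 'header'."""
    labels = []
    for row in table:
        filled = [c for c in row if c is not None]
        numeric = sum(is_number(c) for c in filled)
        is_data = bool(filled) and numeric * 2 >= len(filled)
        labels.append("data" if is_data else "header")
    return labels


def table_structure(table: List[Row]) -> Dict[str, object]:
    """Step 2: derive header rows, data rows, and the number of header columns."""
    labels = classify_rows(table)
    header_rows = [i for i, lab in enumerate(labels) if lab == "header"]
    data_rows = [i for i, lab in enumerate(labels) if lab == "data"]
    width = max((len(r) for r in table), default=0)
    header_cols = 0
    for col in range(width):
        # Stop at the first column that holds a number in some data row.
        if any(col < len(table[i]) and is_number(table[i][col]) for i in data_rows):
            break
        header_cols += 1
    return {"header_rows": header_rows, "data_rows": data_rows, "header_cols": header_cols}


# Hypothetical table: one header row, one header column, numeric data cells.
table = [
    ["Age group", "Men", "Women"],
    ["15-24", 12.5, 11.0],
    ["25-49", 8.1, 8.4],
]
structure = table_structure(table)
```

A purely type-based rule misclassifies header rows made of numbers, such as years stored as numeric cells, so a practical version would need additional signals.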
Reference information source 1
Reference information source 2
Reference information source n
Verification tool (query, match, source search…)
Source search / source selection
Reference source construction, refinement, integration