SLIDE 1

Doctoral Consortium – ADBIS 2019 – Bled, Slovenia Textual Data Analysis from Data Lakes

Pegdwendé N. Sawadogo

pegdwende.sawadogo@univ-lyon2.fr

Supervised by Pr. Jérôme Darmont

September 8, 2019

SLIDE 2

Outline

1. Introduction
2. Thesis Objectives
3. Metadata Models
4. First Results
5. Conclusion

SLIDE 4

Introduction

We are in the big data era

Innovations in IT until the 2000s

  • RDBMSs
  • World Wide Web
  • Data Warehouses

Innovations in IT since the 2000s

  • NoSQL DBMSs
  • Internet of Things
  • Data Lakes

slideserve.com/DeZyre Pegdwendé N. Sawadogo Textual Data Analysis from Data Lakes September 8, 2019 3 / 19

SLIDE 5

Introduction

What is a data lake?

Definition (Sawadogo et al., 2019) A data lake is a scalable storage and analysis system for data of any type, retained in their native format and used mainly by data specialists for knowledge extraction.

dwbimaster.com

SLIDE 6

Introduction

Benefits of data lakes

SLIDE 8

Introduction

Data lake challenges

“Data swamp” syndrome

  • Data swamp: inoperable DL
  • Poor metadata management
  • Poor data governance

medium.com

Enabling industrialized analyses

  • Opening DLs to business users
  • Rich and intuitive metadata
  • OLAP analysis

  • penflyers

SLIDE 10

Thesis Objectives

Main Purposes

  • Enable industrialized analyses from data lakes
  • Focus on textual data analysis
  • Alternative solution to text data warehouses

SLIDE 13

Metadata Models

Data provenance-centric models

  • DAG organization
  • Nodes = data objects
  • Edges = operations (users, transformations, etc.)
  • Help to understand, explain and repair inconsistencies in the data

ericsink.com
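
A provenance-centric model of this kind can be sketched in a few lines of Python. This is only an illustration, not the implementation of any cited system: the class name `ProvenanceGraph`, its methods and the file names are hypothetical, and each operation is assumed to link exactly one source object to one derived object.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Directed acyclic graph: nodes are data objects, edges are operations."""

    def __init__(self):
        # Maps a derived object to its (source, operation) pairs.
        self._parents = defaultdict(list)

    def record(self, source, target, operation):
        """Register that `target` was derived from `source` by `operation`."""
        self._parents[target].append((source, operation))

    def lineage(self, obj):
        """Return every (source, operation, target) step upstream of `obj`."""
        steps, stack, seen = [], [obj], set()
        while stack:
            current = stack.pop()
            for parent, operation in self._parents.get(current, []):
                steps.append((parent, operation, current))
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return steps


# Hypothetical objects and operations, for illustration only.
graph = ProvenanceGraph()
graph.record("raw_tweets.json", "clean_tweets.json", "deduplicate")
graph.record("clean_tweets.json", "tweets_tfidf.npz", "vectorize")
steps = graph.lineage("tweets_tfidf.npz")
```

Walking the lineage of an object lists the operations that produced it, which is what makes an inconsistency traceable back to the step that introduced it.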

SLIDE 15

Metadata Models

Similarity-centric models

  • Allow recommending related data
  • Make it possible to detect data clusters

Simple variant

  • Undirected graph
  • Nodes = data objects
  • Edges = similarity strengths

[Maccioni and Torlone, 2018]

Decomposition into droplets

  • Data object = several nodes
  • Connections are deduced from similarity between related “droplets”
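
The simple variant can be illustrated with a short Python sketch. It is not taken from the cited work. It assumes that each data object is reduced to a set of tokens, that similarity strength is the Jaccard index, and that an edge is kept only above an arbitrary threshold. All function names and sample documents are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity strength between two token collections, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_similarity_graph(objects, threshold=0.3):
    """Undirected weighted graph stored as {node: {neighbor: strength}}."""
    graph = {name: {} for name in objects}
    for left, right in combinations(objects, 2):
        strength = jaccard(objects[left], objects[right])
        if strength >= threshold:
            graph[left][right] = strength
            graph[right][left] = strength
    return graph


def recommend(graph, name):
    """Neighbors of `name`, most similar first."""
    return sorted(graph[name], key=graph[name].get, reverse=True)


def clusters(graph):
    """Connected components of the graph, each as a sorted list of nodes."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        result.append(sorted(component))
    return result


# Hypothetical textual objects, already tokenized.
docs = {
    "report_a": ["data", "lake", "metadata"],
    "report_b": ["data", "lake", "governance"],
    "memo_c": ["budget", "travel"],
}
graph = build_similarity_graph(docs)
```

Here `recommend(graph, "report_a")` returns the objects related to `report_a`, and `clusters(graph)` groups objects that are connected through similarity edges. Connected components are only one simple way to detect clusters.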

SLIDE 17

Metadata Models

Discussion (Sawadogo et al., 2019)

Metadata models/systems compared on six features (SE, DI, LG, DP, DV, UT):

  • SPAR (Fauduet and Peyrard, 2010)
  • Terrizzano et al. (2015)
  • Singh et al. (2016)
  • GOODS (Halevy et al., 2016)
  • Ground (Hellerstein et al., 2017)
  • KAYAK (Maccioni and Torlone, 2018)
  • CoreKG (Beheshti et al., 2018)
  • Diamantini et al. (2018)

Features:

  • SE: Semantic Enrichment
  • DI: Data Indexing
  • LG: Links Generation
  • DP: Data Polymorphism
  • DV: Data Versioning
  • UT: Usage Tracking

[Sawadogo et al., 2019b] - BBIGAP@ADBIS 2019

  • No comprehensive metadata model
  • Data versioning and data polymorphism as advanced features

SLIDE 19

First Results

Typology of data lake metadata (Sawadogo et al., 2019)

[Sawadogo et al., 2019a] - ICEIS 2019

SLIDE 23

First Results

Generic metadata model for data lakes

Intra-object metadata

Inter-object metadata

Global metadata

  • Not included
  • Ontologies = graphs
  • Mostly depend on adopted technologies
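
To make the split between intra-object and inter-object metadata concrete, here is a minimal Python sketch. It is not the model from the thesis. It assumes that intra-object metadata can be held as key/value properties attached to one object and that inter-object metadata can be held as typed links between two objects. Every class, field and link name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A data lake object with its intra-object metadata (hypothetical fields)."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Link:
    """Inter-object metadata: a typed relation between two objects."""
    source: str
    target: str
    kind: str


@dataclass
class MetadataCatalog:
    objects: dict = field(default_factory=dict)
    links: list = field(default_factory=list)

    def add_object(self, obj):
        self.objects[obj.name] = obj

    def relate(self, source, target, kind):
        # Refuse links to objects the catalog does not know about.
        for name in (source, target):
            if name not in self.objects:
                raise KeyError(name)
        self.links.append(Link(source, target, kind))

    def related(self, name):
        """Names of objects linked to `name`, in either direction."""
        out = []
        for link in self.links:
            if link.source == name:
                out.append(link.target)
            elif link.target == name:
                out.append(link.source)
        return out


# Hypothetical objects and link type, for illustration only.
catalog = MetadataCatalog()
catalog.add_object(DataObject("minutes.txt", {"language": "en", "size_bytes": 2048}))
catalog.add_object(DataObject("minutes_v2.txt", {"language": "en"}))
catalog.relate("minutes.txt", "minutes_v2.txt", "version_of")
```

Global metadata are left out of the sketch, in line with the slide: they are not part of the model.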

SLIDE 27

First Results

Expected features

Data search

  • Keyword/pattern-based querying
  • Query extension
  • Navigation across data

Navigation/OLAP analysis

  • Dimensions = data groupings
  • Hierarchies = ontologies
  • Aggregations = data fusion

Recommendation of data

  • Similar data
  • Affiliated data
  • Data of same cluster

Compliant with FAIR principles

  • Findable
  • Accessible
  • Interoperable
  • Re-usable
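
As an illustration of the data search feature, keyword/pattern-based querying over textual objects could look like the Python sketch below. It assumes that documents are plain strings held in memory, that a keyword must match a whole whitespace-separated word, and that patterns are regular expressions. The `search` function and the sample documents are hypothetical.

```python
import re


def search(documents, keyword=None, pattern=None):
    """Return sorted names of documents matching a keyword and/or a regex."""
    regex = re.compile(pattern, re.IGNORECASE) if pattern else None
    hits = []
    for name, text in documents.items():
        # Keyword filter: case-insensitive match on whole words.
        if keyword and keyword.lower() not in text.lower().split():
            continue
        # Pattern filter: the regular expression may match anywhere.
        if regex and not regex.search(text):
            continue
        hits.append(name)
    return sorted(hits)


# Hypothetical textual objects, for illustration only.
documents = {
    "minutes.txt": "Budget meeting held on 2019-09-08 in Bled",
    "notes.txt": "Draft budget for next year",
    "todo.txt": "Book travel to Bled",
}
by_keyword = search(documents, keyword="budget")
by_pattern = search(documents, pattern=r"\d{4}-\d{2}-\d{2}")
```

A real data lake would run such queries against an index rather than scanning every document. The sketch only shows how keyword and pattern filters combine.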

SLIDE 30

Conclusion

Overview

  • Opening data lakes to business users
  • 6 key features to evaluate data lake metadata models/systems
  • Consideration of OLAP analysis in data lakes

Future work

  • Implementing our metadata model into a metadata system
  • Designing an OLAP analysis platform for textual data ponds
  • Identifying techniques and tools to ensure scalability
