Towards Keyword-Based Search over Environmental Data Sources


Towards Keyword-Based Search over Environmental Data Sources. 3rd International KEYSTONE Conference (IKC 2017), Gdańsk, Poland, 11-12 September 2017. David Álvarez-Castro, José R.R. Viqueira, Alberto Bugarín. Centro Singular de Investigación


slide-1
SLIDE 1

citius.usc.es Centro Singular de Investigación en Tecnoloxías da Información UNIVERSIDADE DE SANTIAGO DE COMPOSTELA

Towards Keyword-Based Search over Environmental Data Sources

David Álvarez-Castro, José R.R. Viqueira, Alberto Bugarín

3rd International KEYSTONE Conference (IKC 2017) Gdańsk Poland, 11-12 September 2017.

slide-2
SLIDE 2

Contents

  • Motivation and Objective
  • KEYWORDTERM Architecture
  • Catalog and Index structure
  • Searching process (PoS Data Restrictions)
  • Conclusions and Future Work

slide-3
SLIDE 3

Contents

  • Motivation and Objective
  • KEYWORDTERM Architecture
  • Catalog and Index structure
  • Searching process (PoS Data Restrictions)
  • Conclusions and Future Work

slide-4
SLIDE 4

Motivation

Motivation and Objective

RELEVANT METAPHOR: SEARCHING BOOKS

slide-5
SLIDE 5

Motivation

Motivation and Objective

EXAMPLE1: CHOLERA RISK

High sea surface temperature and rainfall near sea level during monsoon

slide-6
SLIDE 6

Motivation

Motivation and Objective

EXAMPLE1: CHOLERA RISK

High sea surface temperature and rainfall near sea level during monsoon

Property of Space (PoS): conventional value at each point of space

Sea Surface Temperature (sst). 1/08/2011

slide-7
SLIDE 7

Motivation

Motivation and Objective

EXAMPLE1: CHOLERA RISK

High sea surface temperature and rainfall near sea level during monsoon

Fuzzy Linguistic Value (FLV): fuzzy set of numeric values
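
One way to read a fuzzy linguistic value such as "High" is as a membership function over numeric values. The following is a minimal sketch, assuming a linearly rising shape; the function names and the breakpoints (26 °C and 29 °C) are hypothetical and are not taken from the slides.

```python
def rising_membership(value, start, full):
    """Membership in [0, 1] that rises linearly from `start` to `full`.

    Hypothetical shape for an FLV such as "High"; the slides do not
    specify how the fuzzy sets are defined.
    """
    if value <= start:
        return 0.0
    if value >= full:
        return 1.0
    return (value - start) / (full - start)


def high_sst(celsius):
    # Assumed breakpoints: not "High" at or below 26 °C, fully "High" from 29 °C.
    return rising_membership(celsius, 26.0, 29.0)
```

Under these assumed breakpoints, a reading between 26 °C and 29 °C is "High" only to a degree, which is the kind of value that can be stored as a membership in [0,1].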

slide-8
SLIDE 8

Motivation

Motivation and Objective

EXAMPLE1: CHOLERA RISK

High sea surface temperature and rainfall near sea level during monsoon

sst [1/08/2011]
sst mean August [2005‐2012]

Data Restriction: fuzzy set of spatio‐temporal elements

slide-9
SLIDE 9

Motivation

Motivation and Objective

EXAMPLE1: CHOLERA RISK

High sea surface temperature and rainfall near sea level during monsoon

High rainfall: Data Restriction

slide-10
SLIDE 10

Motivation

Motivation and Objective

EXAMPLE1: CHOLERA RISK

High sea surface temperature and rainfall near sea level during monsoon

Near the coastline and low elevation
  • Fuzzy Spatial Relationship (FSR)
  • Geographic Named Entity (GNE)
  • Spatial Restriction
  • Data restriction

Fuzzy set of spatio‐temporal elements

Name (time), Geometry (time), Properties (time)

slide-11
SLIDE 11

Motivation

Motivation and Objective

EXAMPLE1: CHOLERA RISK

High sea surface temperature and rainfall near sea level during monsoon

  • Fuzzy Temporal Relationship (FTR)
  • Geographic Named Entity (GNE)
  • Temporal Restriction

Fuzzy set of spatio‐temporal elements

slide-12
SLIDE 12

Motivation

Motivation and Objective

EXAMPLE2: TOURISM

High sea surface temperature near Camping Miño

slide-13
SLIDE 13

Motivation

Motivation and Objective

EXAMPLE2: TOURISM

High sea surface temperature near Camping Miño

Fuzzy set of spatio‐temporal elements

  • Spatial Restriction
  • Data restriction

slide-14
SLIDE 14

Motivation

Motivation and Objective

STATE OF THE ART

High sea surface temperature and rainfall near sea level during monsoon

Sea Surface Temperature, Rainfall, Elevation, Coastline

[Diagram: discover data sources through several catalogs, download the data, and run the geo data analysis in a Geographic Information System toolkit.]

Not Feasible Task

slide-15
SLIDE 15

Objective

Motivation and Objective

slide-16
SLIDE 16

Contents

  • Motivation and Objective
  • KEYWORDTERM Architecture
  • Catalog and Index structure
  • Searching process (PoS Data Restrictions)
  • Conclusions and Future Work

slide-17
SLIDE 17

KEYWORDTERM Architecture

[Architecture diagram. Components: GNE Data Source, PoS Data Source, Crawler, Catalog, Index Structure, Search Engine, Web GUI. Interfaces: OGC WFS, Unidata NetCDF Subset, OGC WMS. Operations: Discovery, Search, Update.]

slide-18
SLIDE 18

Contents

  • Motivation and Objective
  • KEYWORDTERM Architecture
  • Catalog and Index structure
  • Searching process (PoS Data Restrictions)
  • Conclusions and Future Work

slide-19
SLIDE 19

Catalog

Catalog and Index Structure

  • Properties of Space (PoS)
    - Examples: Sea Surface Temperature, Rainfall, Elevation, etc.
    - Defined FLVs
      - High, Normal, Low, etc.
  • Geographic Named Entity Types (GNET)
    - Examples: Accomodation_facility, Municipality, Coastline_feature, etc.
    - List of properties
      - Beds of Accomodation_facility, population of Municipality, etc.
      - Defined FLVs for each property

Harmonized vocabulary assumed
One harmonized data source for each PoS/GNET assumed

Not Semantic Data Integration

slide-20
SLIDE 20

Index Structure

Catalog and Index Structure

  • Properties (of Space and of GNETs)
    - Precomputed memberships of all possible primitive data restrictions (defined FLVs)
      - High Sea Surface Temperature, low elevation, many beds, low population, etc.
  • GNETs
    - Temporal evolution of:
      - Names
      - Geometries

CONTENTS

Crawling data sources registered in the harmonized Catalog

slide-21
SLIDE 21

Index Structure

Catalog and Index Structure

  • Multiresolution spatial and temporal pyramids of raster tiles

PRECOMPUTED POS DATA RESTRICTIONS

[Figure: spatial and temporal tile pyramids]

slide-22
SLIDE 22

Index Structure

Catalog and Index Structure

  • Generation of Membership tiles

PRECOMPUTED POS DATA RESTRICTIONS

[Figure: Sea Surface Temperature (GL2, TL3) with the FLVs Very low, Low, Normal, High, Very high, producing membership raster tiles]

  • Membership raster tile: membership value in [0,1]
  • 180 x 360 x 20 real values ~ 10 MB
  • Tiles with all 0's are discarded
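
The tile generation rule and the size figure can be sketched as follows. This is an illustration only: the function names are hypothetical, the tile is a plain nested list, and 8 bytes per value (double precision) is an assumption that happens to be consistent with the ~10 MB figure.

```python
def make_membership_tile(values, membership):
    """Apply a membership function to every cell of a 2-D tile.

    Returns None when every membership is 0, mirroring the rule that
    tiles with all 0's are discarded.
    """
    tile = [[membership(v) for v in row] for row in values]
    if all(m == 0.0 for row in tile for m in row):
        return None
    return tile


def tile_size_bytes(rows=180, cols=360, steps=20, bytes_per_value=8):
    # bytes_per_value=8 is an assumption (double precision reals).
    return rows * cols * steps * bytes_per_value
```

With these defaults, `tile_size_bytes()` gives 10,368,000 bytes, that is, roughly 10 MB for a 180 x 360 x 20 tile.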

slide-23
SLIDE 23

Index Structure

Catalog and Index Structure

  • Data access structures

PRECOMPUTED POS DATA RESTRICTIONS

[Diagram: access path to the membership raster tiles]
  1. Property Name (Hash): Sea Surface Temperature, Water Temperature, Humidity, Wind Speed, Population Density, ...
  2. FLV (Hash): Very high, High, Normal, Low, Very Low
  3. Spatial/Temporal Indexing: R‐Tree (Space), B+‐Tree (Time)
  4. Membership raster tiles
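
The access path on this slide (hash on property name, hash on FLV, then spatial/temporal indexing) can be approximated in a few lines. This is a sketch under stated assumptions: the class and method names are hypothetical, and a plain list scan stands in for the R‐Tree and B+‐Tree, which is a simplification and not the indexing used by the authors.

```python
from collections import defaultdict


class TileIndex:
    """Two hash levels (property name, FLV) in front of per-FLV tile lists.

    A linear scan replaces the R-Tree (space) and B+-Tree (time) of the
    slide; that substitution is a simplification of this sketch.
    """

    def __init__(self):
        self._by_property = defaultdict(lambda: defaultdict(list))

    def add(self, prop, flv, bbox, t_start, t_end, tile_id):
        # bbox = (min_x, min_y, max_x, max_y); times are any comparable values.
        self._by_property[prop][flv].append((bbox, t_start, t_end, tile_id))

    def search(self, prop, flv, bbox, t_start, t_end):
        """Return ids of tiles whose extent overlaps the query window."""
        hits = []
        for tb, ts, te, tile_id in self._by_property.get(prop, {}).get(flv, []):
            overlaps_space = (tb[0] <= bbox[2] and bbox[0] <= tb[2]
                              and tb[1] <= bbox[3] and bbox[1] <= tb[3])
            overlaps_time = ts <= t_end and t_start <= te
            if overlaps_space and overlaps_time:
                hits.append(tile_id)
        return hits
```

A lookup such as `index.search("sst", "High", (5, 5, 15, 15), 4, 8)` first resolves the two hash levels and only then tests spatial and temporal overlap, which is the order the diagram suggests.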

slide-24
SLIDE 24

Index Structure

Catalog and Index Structure

  • Data access structures

PRECOMPUTED GNET PROPERTY DATA RESTRICTIONS

[Diagram: access path to the membership vector zones]
  1. Property Name (Hash): Sea Surface Temperature, Water Temperature, Humidity, Wind Speed, Population Density, ...
  2. FLV (Hash): High, Normal, Low
  3. Spatial/Temporal Indexing: R‐Tree (Space), B+‐Tree (Time)
  4. Membership vector zones: entries with Geo, Time and Memb. fields, for example time intervals [t1, t2], [t3, t8], [ti, tj] and memberships 0.5, 0.7, 1

slide-25
SLIDE 25

Index Structure

Catalog and Index Structure

  • Data access structures

TEMPORAL EVOLUTION OF GNE DATA

[Diagram: access path to GNE data]
  1. GNETs: Sport Facilities, Roads, Hotels, Storms, Administrative Divisions, ...
  2. Textual/Spatial/Temporal Indexing: Hash (Text), R‐Tree (Space), B+‐Tree (Time)
  3. GNEs: entries with Geo, Time and Name fields, for example Camping Miño, Araguaney, Virxe da cerca, with time intervals [t1, t2], [t1, t8], [t5, t9]

slide-26
SLIDE 26

Contents

  • Motivation and Objective
  • KEYWORDTERM Architecture
  • Catalog and Index structure
  • Searching process (PoS Data Restrictions)
  • Conclusions and Future Work

slide-27
SLIDE 27

Phase 1: Accessing relevant raster membership tiles metadata

Searching process (PoS Data Restrictions)

ONE DATA RESTRICTION

  • Obtain metadata of relevant tiles
  • Result
    - Set of relevant tile metadata

TWO OR MORE DATA RESTRICTIONS

  • Spatio-temporal join of tile metadata
  • Result
    - Set of tuples of tile metadata
    - If (T1, T2, ..., Tn) is a tuple of tiles of the result, then the intersection of their spatial and temporal extensions must be non-empty
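
The tuple condition for two or more data restrictions can be sketched as a join over tile metadata. This is a minimal illustration, not the authors' implementation: the field names (`bbox`, `t_start`, `t_end`) are hypothetical, closed boundaries that merely touch are assumed to count as intersecting, and a nested-loop join is used here instead of the index-based join of a spatial DBMS.

```python
from itertools import product


def extents_intersect(tiles):
    """True if all tiles share a common, non-empty space-time extent.

    Each tile is a dict with 'bbox' = (min_x, min_y, max_x, max_y),
    't_start' and 't_end' (hypothetical field names).
    """
    min_x = max(t["bbox"][0] for t in tiles)
    min_y = max(t["bbox"][1] for t in tiles)
    max_x = min(t["bbox"][2] for t in tiles)
    max_y = min(t["bbox"][3] for t in tiles)
    t_start = max(t["t_start"] for t in tiles)
    t_end = min(t["t_end"] for t in tiles)
    return min_x <= max_x and min_y <= max_y and t_start <= t_end


def spatio_temporal_join(*tile_sets):
    """Nested-loop join: one tile per restriction, kept if extents overlap."""
    return [combo for combo in product(*tile_sets) if extents_intersect(combo)]
```

Each argument of `spatio_temporal_join` is the set of tile metadata selected for one data restriction; the result is the set of tuples (T1, T2, ..., Tn) whose extents have a non-empty intersection.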

slide-28
SLIDE 28

Phase 1: Accessing relevant raster membership tiles metadata

Searching process (PoS Data Restrictions)

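IMPLEMENTATION

  • Spatial Relational DBMS (PostgreSQL + PostGIS)

Tile metadata table (sample rows; "..." marks values elided on the slide):

PID | GL | TL | PoS | FLV    | BBox | TimeS | TimeE | Tile
... | 4  | 2  | ... | High   | ...  | t12   | t27   | tile1
... | 4  | 2  | ... | High   | ...  | t33   | t49   | tile2
...
... | 4  | 2  | ... | Normal | ...  | t94   | t99   | tile23
...
... | 4  | 2  | ... | Low    | ...  | t7    | t85   | tile45
...

Indexes: Hash (PoS), Hash (FLV), R-Tree (BBox), B+-Tree (TimeS, TimeE)

Query: P1 V1 AND P2 V2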

slide-29
SLIDE 29

Phase 1: Accessing relevant raster membership tiles metadata

Searching process (PoS Data Restrictions)

PERFORMANCE

Real Dataset
  • 8340 Tiles
  • ~ 80 GB of numeric real data

Hardware
  • 2 CPU x 2 Cores
  • 4 GB RAM
  • 50 GB DISK

slide-30
SLIDE 30

Phase 1: Accessing relevant raster membership tiles metadata

Searching process (PoS Data Restrictions)

PERFORMANCE

[Chart: performance of two kinds of queries: only select, and spatio‐temporal join]

slide-31
SLIDE 31

Phase 2: Tile data access + [Fuzzy intersection of tile tuples]

Searching process (PoS Data Restrictions)

ONE DATA RESTRICTION

  • Obtain tile data from disk
  • Generate response WMS layers

TWO OR MORE DATA RESTRICTIONS

  • Perform fuzzy intersection between the tiles of each tuple
    - Minimum membership at each spatio-temporal cell
  • Algorithm 1
    - Tiles with the same spatial and temporal resolution
    - Hash Join using space and time
  • Algorithm 2
    - Tiles with different spatial and/or temporal resolution
    - Spatial and/or temporal resampling + Hash Join using space and time
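
Both algorithms can be sketched on tiles stored as dictionaries keyed by (x, y, t) cell. This is an illustration under several assumptions that do not come from the slides: the dictionary layout, the function names, dense input tiles for resampling, an integer spatial resampling factor, and averaging as the resampling rule.

```python
def fuzzy_intersection(tile_a, tile_b):
    """Algorithm 1 sketch: hash join on the (x, y, t) cell key.

    Keeps the minimum membership per shared cell; cells missing from
    either tile, or with minimum 0, are left out of the result.
    """
    small, large = sorted((tile_a, tile_b), key=len)
    result = {}
    for cell, m in small.items():  # probe the larger hash table
        other = large.get(cell)
        if other is not None:
            low = min(m, other)
            if low > 0.0:
                result[cell] = low
    return result


def coarsen(tile, factor):
    """Algorithm 2 helper: resample to a coarser spatial grid.

    Averages the memberships of each factor x factor block; the
    averaging rule and dense input tiles are assumptions of this sketch.
    """
    sums, counts = {}, {}
    for (x, y, t), m in tile.items():
        key = (x // factor, y // factor, t)
        sums[key] = sums.get(key, 0.0) + m
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

For tiles of different resolution, the finer tile would first go through `coarsen` and the result would then be passed to `fuzzy_intersection`, which adds a full pass over the finer tile before the join.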

slide-32
SLIDE 32

Phase 2: Tile data access + [Fuzzy intersection of tile tuples]

Searching process (PoS Data Restrictions)

IMPLEMENTATION

  • Centralized implementation in Python
  • Distributed implementation
    - Storage: Apache Parquet
      - Distributed columnar storage
      - Data encodings and compression
    - Processing: Apache Spark
      - Map/reduce
      - Distributed relational operations
        • Efficient Hash Join based on Map/Reduce

slide-33
SLIDE 33

Phase 2: Tile data access + [Fuzzy intersection of tile tuples]

Searching process (PoS Data Restrictions)

PERFORMANCE

[Performance charts: 8 executors; 8 GB RAM]

slide-34
SLIDE 34

Phase 2: Tile data access + [Fuzzy intersection of tile tuples]

Searching process (PoS Data Restrictions)

PERFORMANCE

[Performance charts: 20 tuples of tiles, 8 executors; 20 tuples of tiles, 8 GB RAM]

Resampling -> more processing

slide-35
SLIDE 35

Contents

  • Motivation and Objective
  • KEYWORDTERM Architecture
  • Catalog and Index structure
  • Searching process (PoS Data Restrictions)
  • Conclusions and Future Work

slide-36
SLIDE 36

Conclusions

Conclusions and Future Work

  • Problem of keyword-based search over environmental data sources introduced for the first time
  • First prototype implementation
    - Limited to Data Restrictions over Properties of Space (PoS)
  • Performance evaluation
    - Phase 2 of the search process is more expensive
    - Distributed processing is required
  • Feasibility of the approach and identification of future work lines

slide-37
SLIDE 37

Future Work

Conclusions and Future Work

  • Complete the functionality with Spatial and Temporal restrictions
  • Semantic integration and fusion of different data sources, including non-structured data sources
    - Data quality and provenance, reliability and authority
  • More specific indexing and ranking solutions to improve efficiency and effectiveness
  • Crawling of geospatial and environmental data sources
    - Efficiency
    - Politeness rules (data source overload)
  • Interaction with end user
    - Trade-off between expressiveness and simplicity of the search language
    - Incorporation of user profile data in the searching and ranking process
      - User / Expert defined FLVs
    - Usability of the GUI

slide-38
SLIDE 38

Fondo Europeo de Desenvolvemento Rexional “Unha maneira de facer Europa”

3rd International KEYSTONE Conference (IKC2017), Gdańsk Poland, 11-12 September 2017

Towards Keyword-Based Search over Environmental Data Sources

Thanks!!

José R.R. Viqueira: jrr.viqueira@usc.es citius.usc.es