NoSQL working group Use case: Network of Life, Mario David (LIP)



SLIDE 1

www.egi.eu EGI-Engage

NoSQL working group Use case: Network of Life

Mario David (LIP) With contribution from Miguel Porto and Rui Figueira (CIBIO Portugal)

SLIDE 2

Outline

  • GBIF and Atlas of Living Australia Web portal
  • From GBIF to Network of Life
  • Graph DBs - ArangoDB
  • Current status and first tests

SLIDE 3

Challenges of GBIF biodiversity data

Global Biodiversity Information Facility

  • 570 million records with many dimensions.
  • Need to support different spatial scales and levels of information detail in the same platform.
  • To ensure confidence, users need to be able to scrutinize all details of the information.
  • The rate of new data addition is not fully predictable.
  • Crossing data with other types of information (remote sensing, climatic) is also resource-demanding.

Rui Figueira (CIBIO)

SLIDE 4

Atlas of Living Australia

Provide:

  • Efficient organization and management of biodiversity information, including finding, accessing and visualizing data;
  • Integration with genetic, habitat, ecosystem and geographical data;
  • Building different facets, e.g., for Invasive Alien Species, threatened species, nature conservation;
  • Web data services through an API.

Platform for web portals and services for societal uses in biodiversity

Rui Figueira (CIBIO)

SLIDE 5

One platform, many facets (thematic, regional, national), different user communities

Rui Figueira (CIBIO)

SLIDE 6

One platform, many facets (thematic, regional, national), different user communities

Rui Figueira (CIBIO)

SLIDE 7

One platform, many facets (thematic, regional, national), different user communities

Rui Figueira (CIBIO)

SLIDE 8

One platform, many facets (thematic, regional, national), different user communities

Rui Figueira (CIBIO)

SLIDE 9

One platform, many facets (thematic, regional, national), different user communities

Rui Figueira (CIBIO)

SLIDE 10

Advantages of cloud solutions

Provide:

  • Scalability of the allocation of resources.
  • Sharing of infrastructure and capacity between members of the GBIF network.
  • Persistence and availability of big volumes of data.

Rui Figueira (CIBIO)

SLIDE 11

GBIF ⇒ Net of Life

[Figure: "GBIF", Biologists' POV]

SLIDE 12

GBIF ⇒ Net of Life

[Figure: "Network of Life", with the label "pollination", Biologists' POV]

SLIDE 13

GBIF ⇒ Net of Life

Graph ⇒ GraphDB

  • Vertices: V = {v1, v2, …}
  • Edges: E = { {v1, v2}, {v1, v3}, … }
  • Graph: G = (V, E)

Maths/Comp. Scientist POV
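
The definition G = (V, E) can be sketched in a few lines of Python. This is only an illustration: the species names and the `neighbours` helper are hypothetical and do not come from the slides.

```python
# Hypothetical vertices: taxa that could appear in an interaction network.
V = {"Apis mellifera", "Lavandula stoechas", "Cistus ladanifer"}

# Undirected edges as two-element frozensets, mirroring E = { {v1, v2}, {v1, v3}, ... }.
E = {
    frozenset({"Apis mellifera", "Lavandula stoechas"}),
    frozenset({"Apis mellifera", "Cistus ladanifer"}),
}

G = (V, E)


def neighbours(graph, vertex):
    """Return the set of vertices sharing an edge with `vertex`."""
    _, edges = graph
    return {other for edge in edges if vertex in edge for other in edge - {vertex}}
```

Calling `neighbours(G, "Apis mellifera")` collects the two plant vertices connected to the bee, which is the kind of question a graph database answers natively.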

SLIDE 14

GBIF ⇒ Net of Life

GraphDB + Documents ⇒ ArangoDB

[Figure: vertices, edges and documents]

Maths/Comp. Scientist POV

SLIDE 15

ArangoDB - I

  • Multi-model database: document, graph, key-value
  • Open source: https://github.com/arangodb/arangodb
  • Document model:
      • Data stored as linked JSON-like documents, organized in collections
      • No schema enforced, but a set of indexes can be defined for each collection
      • Fields can store subdocuments and pointers to independent documents
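
The document model can be illustrated with plain Python dictionaries. This is a minimal sketch, not ArangoDB code: the collection names, field names, coordinates and the `resolve` helper are all assumptions made for the example; only the `_key` attribute and the "collection/key" form of a document identifier follow ArangoDB conventions.

```python
import json

# Two hypothetical collections, each a mapping from document key to document.
collections = {
    "species": {
        "apis_mellifera": {"_key": "apis_mellifera", "name": "Apis mellifera"},
    },
    "occurrences": {
        "occ1": {
            "_key": "occ1",
            # Subdocument stored directly inside the field.
            "location": {"latitude": 38.7, "longitude": -9.1},
            # Pointer to an independent document, written as "collection/key".
            "species": "species/apis_mellifera",
        },
    },
}


def resolve(pointer):
    """Follow a "collection/key" pointer to the document it names."""
    collection, key = pointer.split("/", 1)
    return collections[collection][key]


occurrence = collections["occurrences"]["occ1"]
species = resolve(occurrence["species"])
print(json.dumps({"occurrence": occurrence["_key"], "species": species["name"]}))
```

Because no schema is enforced, a second occurrence document could carry entirely different fields without any change to the collection.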

Miguel Porto (CIBIO)

SLIDE 16

ArangoDB - II

  • Graph model:
      • An “interpretation” built upon the document model:
          • Defined by a set of document collections representing the vertices.
          • Another set of collections represents the edges connecting the vertices.
          • Vertices and edges are documents.
      • Native support for traversal queries:
          • Highly customizable behaviour.
          • No need for “infinite” JOINs.
  • Indexes:
      • Graph traversal indexes (edge-vertex connections)
      • Geo indexes (constructed from latitude-longitude fields)
      • Full text, hash, etc.
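
To show why a traversal replaces a chain of JOINs, here is a minimal Python sketch of a breadth-first outbound traversal over edge documents. It is an illustration only, not how ArangoDB implements traversals: the edge list, the `type` values and the `traverse` function are hypothetical, while the `_from` and `_to` attributes follow the ArangoDB convention for edge documents.

```python
from collections import deque

# Hypothetical edge collection: every edge is itself a document with
# _from and _to fields naming the vertices it connects.
interactions = [
    {"_from": "species/bee", "_to": "species/lavender", "type": "pollination"},
    {"_from": "species/bee", "_to": "species/rockrose", "type": "pollination"},
    {"_from": "species/lavender", "_to": "species/aphid", "type": "host"},
]


def traverse(start, edges, max_depth):
    """Breadth-first outbound traversal returning (vertex, depth) pairs."""
    # Index edges by their source vertex, a stand-in for an edge index.
    outbound = {}
    for edge in edges:
        outbound.setdefault(edge["_from"], []).append(edge["_to"])

    visited = {start}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for target in outbound.get(vertex, []):
            if target not in visited:
                visited.add(target)
                reached.append((target, depth + 1))
                queue.append((target, depth + 1))
    return reached
```

Raising `max_depth` from 1 to 2 reaches vertices two hops away with the same function, where a relational model would need one more self-JOIN per hop.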

Miguel Porto (CIBIO)

SLIDE 17

ArangoDB - III

  • AQL query language:
      • SQL-like but with a very different logic:
          • Entirely JSON-based.
          • No tables.
      • Rather complete set of functions to work with documents:
          • Data aggregation.
          • Filtering (including Geo functions), etc.
          • Document and array manipulation.
          • Graph traversal and shortest path functions.
      • Easy querying, processing and output of results in the desired data format.
      • Very flexible in chaining and nesting query statements.

“Powerful and Fast”
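
As a hedged illustration of filtering with a geo function followed by aggregation, the Python sketch below builds an AQL query and its bind variables. The collection name `occurrences`, the attributes `latitude`, `longitude` and `species`, and the `build_cursor_request` helper are assumptions for the example and are not taken from the Network of Life code. The body is shaped as a `query` plus `bindVars` object, which is assumed here to match what the ArangoDB cursor HTTP endpoint accepts; nothing is sent to a server.

```python
import json

# AQL text: filter occurrences by distance to a point, then count per species.
# Collection and attribute names are hypothetical.
AQL_QUERY = """
FOR occ IN @@collection
  FILTER DISTANCE(occ.latitude, occ.longitude, @lat, @lon) <= @radius
  COLLECT species = occ.species WITH COUNT INTO records
  SORT records DESC
  RETURN { species: species, records: records }
"""


def build_cursor_request(collection, lat, lon, radius_m):
    """Return a JSON-serialisable body holding the query and its bind variables."""
    return {
        "query": AQL_QUERY.strip(),
        "bindVars": {
            "@collection": collection,  # collection bind parameter (@@collection)
            "lat": lat,
            "lon": lon,
            "radius": radius_m,  # DISTANCE() returns metres
        },
    }


body = build_cursor_request("occurrences", 38.7, -9.1, 50_000)
print(json.dumps(body, indent=2))
```

The `RETURN` clause builds the output document directly, which is what makes it easy to emit results in the desired data format.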

Miguel Porto (CIBIO)

SLIDE 18

Network of Life: Architecture

Components:

  • Frontends, Web, R: visualization, network queries, network data analysis, hypothesis testing, data downloading, ...
  • Network of Life Java server: exposes services for:
      • querying interaction data at different levels of aggregation
      • downloading raw data
      • submitting data analysis jobs
      • uploading new data
  • ArangoDB server: graph traversal, data aggregation
  • Data analysis native modules: parallelized computations

Links between components: WEB services, AQL queries, JSON data

Miguel Porto (CIBIO)

SLIDE 19

Some first tests

  • Simple ArangoDB instance running on a desktop
  • Good query performance, in particular for queries involving geographic indexes and graph traversal
      • ArangoDB's integrated geo indexes match the use case nicely
  • The application logic should be implemented in the AQL queries.

Miguel Porto (CIBIO)

SLIDE 20

Test deployment

  • ArangoDB in cluster mode ⇒ allows sharding
  • Deployed 2 VMs in INCD OpenStack
  • Each VM runs 2 types of processes:
      • Coordinators: receive requests, distribute them to the DBServers, execute AQL queries and return the results to the clients. The coordinators also expose information about cluster health and cluster statistics.
      • DBServers: store both sharded and non-sharded collections.
  • A DBServer and a coordinator can live on the same server.
  • And… learning the business :)
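
The idea behind sharding and coordinator routing can be sketched in Python as below. This is a conceptual illustration under stated assumptions, not ArangoDB's actual algorithm: the server names, the number of shards, the SHA-256 based hash and the round-robin shard placement are all hypothetical choices for the example.

```python
import hashlib

# Hypothetical DBServer names; the test deployment used two VMs.
DB_SERVERS = ["dbserver-1", "dbserver-2"]


def shard_for(key, num_shards):
    """Map a document key to a shard number with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(keys, num_shards=4):
    """Coordinator-style routing: group keys by the DBServer holding their shard."""
    placement = {server: [] for server in DB_SERVERS}
    for key in keys:
        shard = shard_for(key, num_shards)
        # Assumed placement rule: shards are spread round-robin over the servers.
        server = DB_SERVERS[shard % len(DB_SERVERS)]
        placement[server].append(key)
    return placement
```

With a stable hash, the same key always lands on the same shard, so a coordinator can forward a request for one document to a single DBServer instead of asking all of them.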
