Big data analytics in the EUBrazil Cloud Connect project


Big data analytics in the EUBrazil Cloud Connect project. EGI CF 2014, Helsinki, May 19-23, 2014. S. Fiore 1, D. Lezzi 2, R. Badia 2, I. Blanquer 3

SLIDE 1

Big data analytics in the EUBrazil Cloud Connect project

EGI CF 2014, Helsinki, May 19-23, 2014

S. Fiore (1), D. Lezzi (2), R. Badia (2), I. Blanquer (3), G. Aloisio (1,4)

(1) Euro Mediterranean Center on Climate Change (CMCC)
(2) Barcelona Supercomputing Center (BSC)
(3) Universitat Politècnica de València (UPVLC)
(4) University of Salento (U. Salento)

SLIDE 2

EUBrazil Cloud Connect

The main objective is the creation of a federated e-infrastructure for research using a user-centric approach. To achieve this, we need to pursue three objectives:

- Adaptation of existing applications to tackle new scenarios emerging from cooperation between Europe and Brazil relevant to both regions.
- Integration of frameworks and programming models for scientific gateways and complex workflows.
- Federation of resources, to build up a general-purpose infrastructure comprising existing and heterogeneous resources.

Additionally, EUBrazilCC will: perform an active dissemination campaign, analyse innovation, foster the involvement of Brazilian institutions in cloud standards definition, and bring the EU Cloudscape series to a broader international audience.

614048 - EUBrazilCC, 20/5/2014

SLIDE 3

EUBrazilCC consortium

CRIA, SP; BSC, ES; UPV, ES; Trust-IT, UK; UNEW, UK; UvA, NL; CMCC, IT; ISCIII, ES; UFCG, CG; LNCC, RJ; PUC-Rio, RJ; FIOCRUZ, RJ


BR Coordinator: Francisco Vilar Brasileiro, fubica@dsc.ufcg.edu.br, Universidade Federal de Campina Grande, Brazil
EU Coordinator: Ignacio Blanquer-Espert, iblanque@dsic.upv.es, Universitat Politècnica de València, Spain

A minimum of 5500 CPUs and 400 TB of storage

SLIDE 4

Use Case on Biodiversity and Climate Change

Objective: Understand the impact of climate change on terrestrial biodiversity through two workflows based on Earth observation and ground level data.
Technical Challenge: Integrate parallel data analysis with other processing workflows in a geographically distributed environment.
International Added Value: Integration of biodiversity data and modelling with multispectral and remote sensing data for studying the cross-correlation of biodiversity and climate change.

[Figure labels: Species-Link; CMCC CMIP5; Imaging Data; Federated Infrastructure & Platform; Parallel Data Analysis; Climate & Biodiversity Clearing-house]

SLIDE 5

Vertical view of the use case

SLIDE 6

Data analytics requirements

A set of requirements has been jointly discussed with project partners to carry out data analysis on climate and satellite data.
Preliminary requirements and needs focus on:
- Time series analysis
- Data reduction (e.g. by aggregation)
- Model intercomparison
- Data subsetting
- Multimodel means
- Massive experiments (the same task applied to a set of data)
- Workflow experiments (processing chains)
- Massive data reduction
- Climate indicators computation
- Comparison of historical data and future scenarios
- Map generation
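Two of the requirements listed for this slide, data reduction by aggregation and multimodel means, can be illustrated with a minimal pure-Python sketch. The model names, the values and both helper functions are hypothetical and only show the idea; they are not part of PDAS/Ophidia.

```python
from statistics import mean

def reduce_by_aggregation(series, window, agg=mean):
    """Reduce a time series by aggregating consecutive windows
    (e.g. 12 monthly values -> 1 yearly value)."""
    return [agg(series[i:i + window]) for i in range(0, len(series), window)]

def multimodel_mean(model_series):
    """Element-wise mean across models; all series must have the same length."""
    lengths = {len(s) for s in model_series.values()}
    if len(lengths) != 1:
        raise ValueError("all model series must have the same length")
    return [mean(values) for values in zip(*model_series.values())]

# Hypothetical temperature series (4 time steps) from two made-up models.
models = {
    "model_a": [14.0, 15.0, 16.0, 17.0],
    "model_b": [16.0, 15.0, 18.0, 19.0],
}

ensemble = multimodel_mean(models)            # [15.0, 15.0, 17.0, 18.0]
reduced = reduce_by_aggregation(ensemble, 2)  # [15.0, 17.5]
```

Passing another aggregation function (for example `max` or `min`) as `agg` gives a different reduction over the same windows.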

SLIDE 7

Climate change domain: the current scientific workflow and the ESGF use case

Workflow: search, locate, download, analyze, display results

J. Chen, A. Choudhary, S. Feldman, B. Hendrickson, C.R. Johnson, R. Mount, V. Sarkar, V. White, D. Williams. “Synergistic Challenges in Data-Intensive Science and Exascale Computing,” DOE ASCAC Data Subcommittee Report, Department of Energy Office of Science, March 2013.

SLIDE 8

Parallel data analysis

- In the EUBrazilCC project we will provide a parallel data analysis service exploiting scalable VM-based solutions for the management of large volumes of scientific multidimensional data:
  - Climate data from the CMIP5 federated data archive
  - Landsat 5-7-8 satellite data repository
- The platform exploits high-performance database management paradigms and efficient storage models to address data analysis.
- The platform is designed to address data post-processing, analysis and mining, time series extraction, sub-setting and data reduction (e.g. data aggregation).
- The front-end is designed to provide multiple interfaces: WS-I+ (default, available), GSI/VOMS (in progress, EGI interoperability), OGC WPS (in progress, geo-sciences infrastructure interoperability), … .

SLIDE 9

PDAS (aka ‘Ophidia’) Architecture

[Figure labels: Front end; Compute layer; I/O layer; I/O server instance; Storage layer; System catalog]
[Figure annotations: Standard interfaces; Analytics Framework; Array-based primitives; Declarative language; Partitioning/hierarchical data management; New storage model]

SLIDE 10

Array-based primitives

- Array data type support alone is not enough to provide scientific data management capabilities… primitives are needed as well.
- A set of array-based primitives has been implemented
  - By definition, a primitive is applied to a single fragment
  - They come in the form of plugins (I/O server extensions)
- So far, Ophidia provides a wide set of plugins (about 100) to perform data reduction (by aggregation), sub-setting, predicate evaluation, statistical analysis, compression, and so forth.
  - Plugins can be nested to obtain more complex functionality
  - Compression is provided as a primitive too
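The idea that primitives are plugins and that each one is applied to a single fragment can be sketched in plain Python. Everything below (the registry, the two primitives, the list-of-lists fragment layout) is a hypothetical illustration, not the Ophidia plugin interface.

```python
from statistics import mean

# Hypothetical plugin registry: primitive name -> function working on one array.
PRIMITIVES = {}

def primitive(name):
    """Register a function as a named array-based primitive."""
    def register(func):
        PRIMITIVES[name] = func
        return func
    return register

@primitive("reduce_mean")
def reduce_mean(array):
    # Data reduction by aggregation: one array -> one value.
    return [mean(array)]

@primitive("predicate_gt_zero")
def predicate_gt_zero(array):
    # Predicate evaluation: 1 where the value is positive, else 0.
    return [1 if value > 0 else 0 for value in array]

def apply_to_fragments(fragments, name):
    """Apply one primitive to each fragment independently of the others."""
    func = PRIMITIVES[name]
    return [[func(array) for array in fragment] for fragment in fragments]

# A small dataset split into two fragments, each holding two arrays.
fragments = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[-1.0, 0.0, 1.0], [2.0, 2.0, 2.0]],
]

means = apply_to_fragments(fragments, "reduce_mean")
# [[[2.0], [5.0]], [[0.0], [2.0]]]
```

Because no fragment depends on another, the outer loop in `apply_to_fragments` is the part that a parallel system could distribute across workers.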

SLIDE 11

Array-based primitives: OPH_BOXPLOT

[Figure panels: Scientific point of view; Ophidia storage level view]

oph_gsl_boxplot(measure, "OPH_DOUBLE");

SLIDE 12

Array-based primitives: nesting feature

[Figure panels: Scientific point of view; Storage level view]

oph_boxplot(oph_subarray(oph_uncompress(measure), 1,18), "OPH_DOUBLE")

subarray(measure, 1,18)
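The nested expression on this slide reads innermost first: uncompress the stored measure, take a sub-array, then compute a box plot summary. A rough Python analogue is sketched below. The function names only echo the slide; the byte layout, the 1-based inclusive indexing and the quartile method are assumptions made for this example, not Ophidia's actual behaviour.

```python
import struct
import zlib
from statistics import quantiles

def compress(values):
    """Pack doubles into bytes and compress them (stand-in for a stored measure)."""
    return zlib.compress(struct.pack(f"<{len(values)}d", *values))

def uncompress(blob):
    """Inverse of compress(): bytes -> list of doubles."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

def subarray(values, start, end):
    """Select elements start..end (assumed 1-based, inclusive)."""
    return values[start - 1:end]

def boxplot(values):
    """Five-number summary: min, Q1, median, Q3, max."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# A hypothetical measure array of 24 values, stored compressed.
measure = compress([float(v) for v in range(1, 25)])

# Nested call, evaluated innermost first: uncompress, subset, summarise.
summary = boxplot(subarray(uncompress(measure), 1, 18))
```

Each function takes an array and returns an array (or a short summary), which is what makes this kind of composition possible without intermediate storage.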

SLIDE 13

Architecture (compute layer)

[Figure labels: Front end; Compute layer; I/O layer; I/O server instance; Storage layer; System catalog; Analytics Framework]

SLIDE 14

The analytics framework: (some) datacube operators

OPERATOR NAME: OPERATOR DESCRIPTION

Operators “Data processing” - Domain-agnostic (parallel operators)
OPH_APPLY(datacube_in, datacube_out, array_based_primitive): Creates the datacube_out by applying the array-based primitive to the datacube_in
OPH_DUPLICATE(datacube_in, datacube_out): Creates a copy of the datacube_in in the datacube_out
OPH_SUBSET(datacube_in, subset_string, datacube_out): Creates the datacube_out by sub-setting the datacube_in according to the subset_string
OPH_MERGE(datacube_in, merge_param, datacube_out): Creates the datacube_out by merging groups of merge_param fragments from datacube_in
OPH_SPLIT(datacube_in, split_param, datacube_out): Creates the datacube_out by splitting each fragment of the datacube_in into groups of split_param fragments
OPH_INTERCOMPARISON(datacube_in1, datacube_in2, datacube_out): Creates the datacube_out, which is the element-wise difference between datacube_in1 and datacube_in2
OPH_DELETE(datacube_in): Removes the datacube_in

Operators “Data processing” - Domain-oriented (parallel operators)
OPH_EXPORT_NC(datacube_in, file_out): Exports the datacube_in data into the file_out NetCDF file
OPH_IMPORT_NC(file_in, datacube_out): Imports the data stored in the file_in NetCDF file into the new datacube_out datacube

Operators “Data access” (sequential and parallel operators)
OPH_INSPECT_FRAG(datacube_in, fragment_in): Inspects the data stored in the fragment_in from the datacube_in
OPH_PUBLISH(datacube_in): Publishes the datacube_in fragments into HTML pages

Operators “Metadata” (sequential and parallel operators)
OPH_CUBE_ELEMENTS(datacube_in): Provides the total number of elements in the datacube_in
OPH_CUBE_SIZE(datacube_in): Provides the disk space occupied by the datacube_in
OPH_LIST(void): Provides the list of available datacubes
OPH_CUBEIO(datacube_in): Provides the provenance information related to the datacube_in
OPH_FIND(search_param): Provides the list of datacubes matching the search_param criteria
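To make the operator descriptions concrete, here is a toy in-memory model in which a datacube is a list of fragments and each fragment is a list of arrays. The function names mirror the operator names for readability only; the signatures, the return-a-new-cube style and the data layout are assumptions of this sketch and do not reproduce the Ophidia interface.

```python
def oph_apply(cube, primitive):
    """New cube obtained by applying an array-based primitive to every array."""
    return [[primitive(array) for array in fragment] for fragment in cube]

def oph_merge(cube, merge_param):
    """Merge each group of merge_param consecutive fragments into one fragment."""
    return [
        [array for fragment in cube[i:i + merge_param] for array in fragment]
        for i in range(0, len(cube), merge_param)
    ]

def oph_split(cube, split_param):
    """Split each fragment into up to split_param fragments of similar size."""
    out = []
    for fragment in cube:
        size = max(1, -(-len(fragment) // split_param))  # ceiling division
        out.extend(fragment[i:i + size] for i in range(0, len(fragment), size))
    return out

def oph_intercomparison(cube1, cube2):
    """Element-wise difference between two cubes with identical structure."""
    return [
        [[a - b for a, b in zip(arr1, arr2)] for arr1, arr2 in zip(frag1, frag2)]
        for frag1, frag2 in zip(cube1, cube2)
    ]

def oph_cube_elements(cube):
    """Total number of elements stored in the cube."""
    return sum(len(array) for fragment in cube for array in fragment)

# Two fragments, two arrays per fragment, two elements per array.
cube = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

doubled = oph_apply(cube, lambda array: [2 * v for v in array])
diff = oph_intercomparison(doubled, cube)  # 2v - v, so equal to cube
merged = oph_merge(cube, 2)                # 1 fragment holding 4 arrays
split = oph_split(cube, 2)                 # 4 fragments holding 1 array each
```

In this model, merging and splitting change only how the arrays are grouped into fragments, never the values themselves, so `oph_merge(split, 2)` gives back the original cube.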

SLIDE 15


COMPSs integration with PDAS

Task Scheduler
- Assigns tasks to VMs or physical resources
- Each VM or resource has its own task queue

Scheduling Optimizer
- Checks the status of workers
- Can decide
  - To perform load balancing
  - To create or destroy VMs

Resource Manager
- Manages all cloud middleware related features
- Holds information about all workers and about cloud providers
- The Scheduling Optimizer sends the RM requirements about new VM characteristics
  - e.g., a VM that can run 3 tasks of type T1 and 2 tasks of type T2
- The Resource Manager evaluates the cloud providers and chooses the best option
  - The most economical one
  - The decision can be to open a new private or public VM
- For each cloud provider, a data structure stores the different available instances (with their features) and the connector that should be used
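The Resource Manager's choice can be sketched as a lookup over a per-provider catalogue of instance types. This is a hypothetical illustration of the selection rule described on the slide (meet the requirements, prefer the cheapest); the class, field names, prices and connector strings are invented and do not come from the COMPSs code base.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InstanceType:
    provider: str          # e.g. a private or a public cloud
    name: str
    capacity: dict         # task type -> number of tasks it can run
    price_per_hour: float
    connector: str         # connector to use when creating this instance

def choose_instance(catalog, requirements):
    """Return the cheapest instance type meeting all requirements, or None."""
    candidates = [
        inst for inst in catalog
        if all(inst.capacity.get(task, 0) >= count
               for task, count in requirements.items())
    ]
    return min(candidates, key=lambda inst: inst.price_per_hour, default=None)

# Hypothetical catalogue spanning one private and one public provider.
catalog = [
    InstanceType("private", "small", {"T1": 2, "T2": 2}, 0.05, "private-connector"),
    InstanceType("private", "large", {"T1": 4, "T2": 2}, 0.10, "private-connector"),
    InstanceType("public", "large", {"T1": 4, "T2": 4}, 0.25, "public-connector"),
]

# Requirement from the slide: a VM that can run 3 tasks of T1 and 2 of T2.
chosen = choose_instance(catalog, {"T1": 3, "T2": 2})
```

With these made-up prices the private "large" instance is selected, since the "small" one cannot host three T1 tasks and the public one costs more.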

SLIDE 16


COMPSs integration with PDAS

- COMPSs application: implementation of the application logic (PDAS-based), where some data analytics operators will be instrumented by the COMPSs runtime and executed remotely on the EUBrazilCC resources.
- Workflow execution: data analytics workflows will perform spatio-temporal data reduction, data inter-comparison, and time series and maps production.
- Massive data analysis: data challenges running on CMIP5 datasets will focus on computing climate change indicators in the target areas.
- Integration: COMPSs and PDAS will target high-level user workflows to address the needs and requirements of the involved user communities. Integrating PDAS with the COMPSs framework to provide workflow-based analytics on massive volumes of data is a major goal to be addressed during the EUBrazilCC project implementation.
- Outcome: the implemented COMPSs-PDAS application will be offered as an additional service in the project infrastructure, allowing users to run DAGs of multiple operators which enact relevant processing chains for the Biodiversity & Climate Change use case.
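A DAG of operators of the kind mentioned in the outcome can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The operators and node names below are hypothetical stand-ins, and the sketch runs every node sequentially in one process; a runtime such as COMPSs would instead dispatch independent nodes to remote resources.

```python
from graphlib import TopologicalSorter

# Hypothetical operators; each receives the list of its predecessors' outputs.
def import_data(_inputs):
    return [1.0, 2.0, 3.0, 4.0]

def subset(inputs):
    return inputs[0][:2]

def reduce_mean(inputs):
    data = inputs[0]
    return [sum(data) / len(data)]

def intercompare(inputs):
    first, second = inputs
    return [x - y for x, y in zip(first, second)]

# node -> (operator, predecessor nodes whose outputs it consumes, in order)
dag = {
    "import": (import_data, []),
    "subset": (subset, ["import"]),
    "mean_all": (reduce_mean, ["import"]),
    "mean_subset": (reduce_mean, ["subset"]),
    "compare": (intercompare, ["mean_all", "mean_subset"]),
}

def run_dag(dag):
    """Run every node after all of its predecessors have produced a result."""
    graph = {node: deps for node, (_, deps) in dag.items()}
    results = {}
    for node in TopologicalSorter(graph).static_order():
        operator, deps = dag[node]
        results[node] = operator([results[dep] for dep in deps])
    return results

results = run_dag(dag)
```

Here `mean_all` and `subset` depend only on `import`, so they are the kind of independent steps that a parallel runtime could execute at the same time.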

[Figure labels: PDAS on private cloud; PDAS on HPC cluster]

SLIDE 17

Conclusions

- EUBrazilCC is a project defined from the actual needs of a complementary international consortium
- EUBrazilCC has a strong user focus, with three different use cases to validate the infrastructure and services.
  - The one presented today focuses on climate change and biodiversity
- Running big data analytics workflows in a cloud environment is a key challenge of this use case
- The new PDAS (GSI and VOMS enabled) will address interoperability with EGI
- The integration with COMPSs will be key to dynamically support big data analytics DAGs for climate and satellite data


SLIDE 18

[1] G. Aloisio, S. Fiore, I. Foster, D. N. Williams, “Scientific big data analytics challenges at large scale”, Big Data and Extreme-scale Computing (BDEC), April 30 to May 1, 2013, Charleston, USA (position paper).
[2] S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “Ophidia: Toward Big Data Analytics for eScience”, ICCS 2013, June 5-7, 2013, Barcelona, Spain, Procedia Computer Science, Elsevier, pp. 2376-2385.
[3] S. Fiore, C. Palazzo, A. D’Anca, I. Foster, D. N. Williams, G. Aloisio, “A big data analytics framework for scientific data management”, Workshop on “Big Data and Science: Infrastructure and Services”, IEEE International Conference on BigData 2013, October 6-9, 2013, Santa Clara, USA, pp. 1-8.
[4] S. Fiore, A. D’Anca, D. Elia, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “Ophidia: a full software stack for scientific data analytics”, Workshop on Big Data Principles, Architectures & Applications, HPCS2014, Bologna, Italy, July 21-25, 2014.
[5] F. Lordan, E. Tejedor, J. Ejarque, R. Rafanell, J. Álvarez, F. Marozzo, D. Lezzi, R. Sirvent, D. Talia, R. M. Badia, “ServiceSs: An Interoperable Programming Framework for the Cloud”, Journal of Grid Computing (2014), Volume 12, Issue 1, pp. 67-91. doi:10.1007/s10723-013-9272-5

Contacts:

EUBrazilCC coordinator (EU): Ignacio Blanquer-Espert (UPV), Spain, iblanque@dsic.upv.es
PDAS: Sandro Fiore, Giovanni Aloisio (CMCC), Italy, sandro.fiore@cmcc.it, giovanni.aloisio@cmcc.it
COMPSs: Rosa Badia, Daniele Lezzi (BSC), Spain, rosa.m.badia@bsc.es, daniele.lezzi@bsc.es