GRNET eScience platform for Big Data management Codename: orka

GRNET eScience platform for Big Data management Codename: orka Monday, February 1, 2016

Project Vision
• Data-Intensive Science (store and process big data, at Petabyte scale)
• Scientific workflows
• Virtual Research Environment
• Data streaming

Big data
• The problem: data deluge
• Solution:
  – PaaS over
    • ~okeanos (VM, processing)
    • Pithos+ (storage)

Hadoop project
• Most popular implementation of the MapReduce programming paradigm
• Open source, runs on commodity hardware
• Hadoop core (MapReduce, Hadoop Distributed File System)
• Rich ecosystem (Pig, Hive, HBase, many more)
• The researcher focuses on the algorithm, not on installing, maintaining, or scaling the software
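To make the MapReduce paradigm named above concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. It only shows the three conceptual steps that a framework like Hadoop distributes across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records, invented for illustration.
counts = word_count(["big data at scale", "big clusters process big data"])
print(counts)
```

The researcher writes only the map and reduce functions; in a real cluster the framework takes care of splitting the input, moving the grouped pairs between nodes, and retrying failed tasks.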

Hadoop cluster with ~orka
• GUI, CLI, REST on top of ~okeanos to:
  – Create a cluster (with configurable options) from a range of Hadoop distros (aka images)
  – Transfer your data
  – Submit, execute, monitor jobs
  – Delete cluster
  – Start/stop/format cluster
  – Scale cluster, add/remove nodes
  – Save cluster creation metadata for reproducibility

Hadoop cluster with ~orka

Add-ons to basic Hadoop
• Other components & runtimes
  – Spark
• Apache Hadoop-based distros
  – Cloudera
  – Hue (HDFS explorer, Oozie web editor)
• Storage backend
  – Pithos ↔ HDFS connector (analogous to Amazon S3 Filesystem for Hadoop)

Scientific Workflows
• Orchestration of atomic jobs
• Apache Oozie
• Apache Pig
  – Built into orka images
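"Orchestration of atomic jobs" can be pictured as running the nodes of a dependency graph in a valid order. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+); it is not Oozie's or Pig's API, and the job names and the `run_workflow` helper are hypothetical, chosen only for illustration.

```python
from graphlib import TopologicalSorter


def run_workflow(jobs, dependencies):
    """Run every job exactly once, after all jobs it depends on.

    jobs: mapping of job name -> zero-argument callable
    dependencies: mapping of job name -> set of prerequisite job names
    Returns the order in which the jobs were executed.
    """
    # Make sure jobs without prerequisites are part of the graph too.
    graph = {name: dependencies.get(name, set()) for name in jobs}
    order = []
    for name in TopologicalSorter(graph).static_order():
        jobs[name]()
        order.append(name)
    return order


# Hypothetical atomic jobs; each one just records that it ran.
log = []
jobs = {
    "ingest": lambda: log.append("ingest"),
    "clean": lambda: log.append("clean"),
    "aggregate": lambda: log.append("aggregate"),
    "report": lambda: log.append("report"),
}
dependencies = {
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}

order = run_workflow(jobs, dependencies)
print(order)
```

A workflow engine adds what this sketch leaves out: persistence of the workflow state, retries, scheduling, and submission of each step to the cluster.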

Collaborative scientific research
• Virtual Research Environment
• Complete system for teams and projects
• Components:
  – Research/Project home page (portal, wiki)
  – Project Management
  – Teleconference
  – Digital repositories
• Implemented as Docker images

Virtual Research Environment

Category              Software stack
Portal / CMS          Drupal (v7.37)
Wiki, blog, forum     MediaWiki (v1.2.4)
Project management    Redmine (v3.04)
Web conferencing      BigBlueButton (v0.81)
Digital repositories  DSpace (v5.3)

Reproducible Research
• Save your experiment's metadata as a bundle
• Domain Specific Language (DSL) that fully describes an experiment/job
• Text editor => simple YAML file
• Replay, possibly with different parameters
• Save bundle to Pithos
• Share your bundle with other ~okeanos users
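The idea of replaying a saved experiment with different parameters can be sketched as follows. The slides do not show the bundle schema, so every field name and value below (`cluster`, `job`, `parameters`, `hadoop-base`, and so on) is an assumption invented for illustration, and a plain Python dict stands in for the YAML file to avoid a third-party parser.

```python
import copy

# Hypothetical bundle contents; the real orka DSL schema is not shown in the
# slides, so these keys and values are illustrative assumptions only.
bundle = {
    "cluster": {"nodes": 4, "image": "hadoop-base"},
    "job": {
        "name": "wordcount",
        "parameters": {"input_path": "input/data.txt", "reducers": 2},
    },
}


def replay(saved_bundle, **overrides):
    """Return a new experiment description with some job parameters replaced.

    The saved bundle is deep-copied first, so the original record of the
    experiment stays untouched and can be replayed again later.
    """
    experiment = copy.deepcopy(saved_bundle)
    experiment["job"]["parameters"].update(overrides)
    return experiment


rerun = replay(bundle, reducers=8)
print(rerun["job"]["parameters"])
```

Keeping the saved bundle immutable and deriving each rerun from a copy is what makes the original experiment reproducible even after many variations.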

Data streams into HDFS
• Apache Flume
• Integrated into the Hadoop ecosystem
• Focus on streaming data

High-level Architecture

Technology Stack
eScience Subsystem 1 [Orka 0.1.1]
Orka SubSystem: Technologies Overview
• Back-End
  – Web Server: nginx
  – Data: Postgres DBMS
  – REST API: Django REST Framework
  – App Server: uWSGI
• Front-End
  – Single Page Application (SPA): HTML5, CSS 3, Ember JS, Bootstrap
• External APIs / Technologies
  – Synnefo/kamaki
  – Authentication
  – Hadoop
• Command Line (CLI) API
  – OrkaAPI (Python scripts)
• Supported also (in progress)
  – RabbitMQ, Message Broker
  – Celery Task Manager

Current state
– github.com/grnet/e-science
– escience.grnet.gr

lambda λ on demand

Simplifying Computing
The lambda architecture:
a. a useful framework to think about designing big data applications
b. a robust framework for ingesting real-time streams of data while providing efficient stream and batch analytics
c. fault-tolerant against both hardware failures and human errors
d. serves a wide range of use cases in which low-latency reads and updates are required
λ lambda.grnet.gr

λ: lambda architecture
The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
• Batch layer: has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing arbitrary query functions, called batch views.
• Serving layer: indexes the batch views so that they can be queried in a low-latency, ad hoc way.
• Speed layer: compensates for the high latency of updates to the serving layer and deals with recent data only.

λ: lambda architecture, an example
[Diagram: incoming data feeds the batch layer (master dataset, batch views) and the speed layer (real-time views); the serving layer exposes the batch views; queries read from both.]
1. Data is dispatched to both the batch layer and the speed layer for processing.
2. The batch layer precomputes the batch views.
3. The serving layer indexes the batch views.
4. The speed layer deals with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views.
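The interplay of the batch, serving, and speed layers can be sketched in a few lines of Python. This is a toy, single-process illustration, not the implementation behind lambda.grnet.gr; the class name `LambdaSketch` and its methods are assumptions, and the "query function" is simply a count of events per key.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, serving, and speed layers for event counts."""

    def __init__(self):
        self.master = []                # master dataset: append-only raw events
        self.batch_view = Counter()     # batch view, indexed by key for serving
        self.realtime_view = Counter()  # speed layer: events since last batch run

    def ingest(self, key):
        """Dispatch a new event to both the batch and the speed layer."""
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from the whole master dataset.

        After the run the batch view covers every event so far, so the
        real-time view is reset and only tracks data that arrives later.
        """
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        """Answer a query by merging the batch view and the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]


sketch = LambdaSketch()
for event in ["a", "b", "a"]:
    sketch.ingest(event)
before_batch = sketch.query("a")   # served entirely by the real-time view

sketch.run_batch()
sketch.ingest("a")
after_batch = sketch.query("a")    # batch view plus one recent event
print(before_batch, after_batch)
```

In a real deployment the batch recomputation is slow and runs periodically, which is why the speed layer exists: it keeps query results current for data the batch views have not absorbed yet.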

Provisioning a λ instance
[Diagram labels: okeanos, Users, Lambda on demand, λ API service, λ instances, layers (Speed, Batch)]

λambda UI
Dashboard, Instances, Applications and Help
• Instances: manage lambda instances. Create your lambda instances based on your needs. Manage, deploy applications and start your lambda instance.
• Applications: manage your applications. Upload your Java or Scala application for streaming and batch jobs. Your applications are stored on the Pithos+ storage service.
• Help: informational guides. Short guides on how to i) deploy, run and manage your lambda instances, ii) deploy, run and manage your applications, iii) export and view your results.

Experienced User
Use the λambda API
• lambda instance: create, manage, delete
• lambda applications: upload, manage, delete
• Well documented with mkdocs and Swagger

e-science vs λ
• Lambda (λ): focuses on analysing streaming data
• e-Science: focuses on existing data and offers pre-installed collaborative tools to handle data

Questions?
