SLIDE 1

Åke Edlund

KTH PDC-HPC Center for High Performance Computing
KTH HPCViz Data-Intensive Computing Group
KTH PDC-HPC Cloud

OpenNebula: Experiences at KTH

With a deeper dive into emerging data analytics stacks

SLIDE 2

Outline of this talk

Cloud computing and data-intensive computing at PDC - a brief overview
OpenNebula at PDC - examples
Apache Spark at PDC - what I use our cloud for

SLIDE 3

Cloud computing and data-intensive computing at PDC - a brief overview
OpenNebula at PDC - examples
Apache Spark at PDC - what I use our cloud for

SLIDE 4

Cloud computing and data-intensive computing at PDC - a brief overview

Cloud research since 2007

– Cloud provider since 2009 – national and international users

Spark user since May 2012 (more in the last section)

– Version 0.6 released on October 15, 2012

Research and Development

– Distributed and federated clouds and data analytics stacks
– Bioinformatics and Life Science applications
– Scalable statistics
– Self-improving systems
– Strong and usable security factors to enable researchers to store sensitive data in the Cloud

Projects (many)

– SNIC Cloud Infrastructure (co-Initiator and Coordinator) – the Swedish roll out of cloud for eScience
– NeIC Nordic Cloud (co-Initiator and coordinator Swedish part)
– BioBankCloud (WP leader) – PaaS for biobanking
– EGI Federated Cloud task force (development and resource provider)
– VENUS-C (WP-Leader) (2010 – 2012)
– …

SLIDE 5

Cloud Resources at PDC

PDC Cloud has been in production (with external users) since 2010 and is today an installation of 364 cores

- 12 nodes, each consisting of 32 cores – 1 TB x 2 disk and 64 GB RAM
- 20 TB shared (through Infiniband) by the 12 nodes using Ceph (RBD (block devices), S3 (Object Storage)) - this is under reconstruction (from SAN to dedicated Ceph storage nodes -> 36 TB)
- Cloud middlewares used over the years range from Eucalyptus to OpenNebula, and now a mix of OpenNebula and OpenStack
- Users access their resources using web panel and/or CLI/API

Users (so far) are Nordic and European researchers. PDC Cloud is a leading partner in a number of Swedish, Nordic and European cloud projects, e.g. being one of the first certified cloud resource providers to EGI Federated Cloud.

SLIDE 6

Data-Intensive Computing at PDC

HPCViz Data-Intensive Computing Group (started 2012) is a research group building on the experiences from PDC.

- 9 group members (7 researchers, 2 developers)
- Collaborating mainly with Uppsala University (bioinformatics) and KI (SciLifeLab) on applying, and further expanding, emerging novel techniques for iterative and interactive in-memory data analytics stacks (Spark, Stratosphere, H2O, …)
- Other areas of interest include anomaly detection in streaming data, with applications in performance improvement of distributed systems, and security (intrusion detection).

SLIDE 7

Our Cloud Learning Curve

Timeline: 2001, 2004, 2007, 2010, 2011, 2012, 2013, 2014

PDC-HPC (since 1989)
Grid Computing projects (DataGrid, EGEE, EGI) – including EGI Federated Clouds TF
KTH PDC Cloud experimentation
Legend: Public IaaS, Private IaaS, Private PaaS, Public PaaS

Nordic cloud project, NEON (2010): Practical evaluation [1], testing public vs private cloud for eScience users (bioinformatics)
SNIC Cloud project (2011.6-2012.6+): Enabled cloud access (public and private) to SNIC users. 14 (some recurring) users of SNIC Cloud for Amazon (e.g. running Galaxy) and 54 on the private cloud (currently only PDC Cloud, partially from outside SNIC)
SNIC Galaxy project (2013.3-2014.3): The goal of the project is to deliver Galaxy as a service, using the Galaxy cloud management platform, CloudMan, on local cloud installations (private clouds).
SNIC Cloud Infrastructure (long-term, started Jan 2014): A (generic) IaaS on which communities/users can build their PaaS. Strong emphasis on user communities and their commitment.

[1] "Practical Cloud Evaluation from a Nordic eScience User Perspective", VTDC'11, ACM conference San Jose (2011) by Åke Edlund, Maarten Koopman, Zeeshan Ali Shah, Ilja Livenson, Frederik Orellana, Jukka Kommeri, Miika Tuisku, Pekka Lehtovuori, Klaus Marius Hansen, Helmut Neukirchen, Ebba Þóra Hvannberg

SLIDE 8


Iaas ¡à PaaS ¡ Security ¡concerns. ¡Service ¡to ¡our ¡users. ¡ Easier ¡to ¡manage ¡larger ¡user ¡groups. Public ¡IaaS ¡à Private ¡IaaS ¡ Large ¡amount ¡of ¡sensitive ¡data, ¡

ften ¡too ¡cumbersome ¡for ¡

practical ¡use ¡of ¡public ¡clouds. ¡

Our Cloud Learning Curve

SLIDE 9

Federated Cloud Projects

Current Cloud Projects

SNIC Cloud (co-Initiator and Coordinator) – the Swedish roll out of cloud for eScience
NeIC Nordic Cloud (co-Initiator and Coordinator Swedish part)
BioBankCloud (WP leader) – PaaS for biobanking
EGI Federated Cloud (development and resource provider)

Earlier Cloud Projects

SNIC Galaxy (PaaS) (co-Initiator and Coordinator) (2013)
SNIC Cloud (Initiator and Coordinator) (2011-2012)
SICS Startup Accelerator (co-Initiator and Coordinator) (2011)
VENUS-C (WP leader) (2010-2012)
NEON – Northern Europe cloud project (Initiator and Coordinator) (2010)

SLIDE 10

Cloud computing and data-intensive computing at PDC - a brief overview
OpenNebula at PDC - examples
Apache Spark at PDC - what I use our cloud for

Main contribution to this section: from Zeeshan Ali Shah (zashah@pdc.kth.se)

SLIDE 11

Started with Eucalyptus

Back in 2009
Federated between KTH centers across Stockholm.
Then Eucalyptus adopted a Red Hat-style licensing model.
And we selected OpenNebula due to its openness and easy access to its core team, which was located in the EU.

SLIDE 12

OpenNebula

2010 - Selected during technical kick-off of the VENUS-C project
Based in the EU, easy access to developers
Fully open source
Started with OpenNebula 2.0
An OVF (Open Virtualization Format) interface was developed within VENUS-C
Federated with other VENUS-C sites such as BSC (Spain) and ENGINEERING (Italy).

SLIDE 13

User base

www.e-science.se
www.scilifelab.se
www.natmeg.se
Neurosciences, Karolinska Institute
And, yes, from EGI Fed cloud communities

Science for Life Laboratory (SciLifeLab) is a national center for molecular biosciences with focus on health and environmental research.

SLIDE 14

OpenNebula User experience

Served 100+ users, both Swedish and other EU researchers

Interfaces:

– OpenNebula CLI
– Sunstone Dashboard
– SDK (not so many users, but the option was there)

Conducted hands-on workshops for users

SLIDE 15

Federation with EGI

Compute using OCCI (backend with OpenNebula)
Auto injection of user keys from VOMS server
Federated identity with VOMS and X.509
Information system
Accounting service

From “The EGI Federated Cloud, a production IaaS infrastructure for the EEA”, D. Wallom (EGI CF, 20.04.2014)

SLIDE 16

Bio science users

Pre-configured apps with OpenNebula

Galaxy - galaxyproject.org
CloudBioLinux - cloudbiolinux.org

Cloud Bio Linux
Galaxy (AWS - for CloudMan)

Issue: PoC of CloudMan on OpenNebula (SARA, NL) - but moved to OpenStack

SLIDE 17

Way forward

“Wish list” from Zeeshan Ali Shah (zashah@pdc.kth.se)

Dedicated storage service, like S3, Swift (OpenStack)
Network service for versatile setups, like Neutron (OpenStack)
Image caching on compute nodes.

– To minimize launch time of VMs: we noticed that most of the VM launch time is spent copying the image to the designated host
– Shared FS is an option, but it has its own limitations.
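
The image-caching wish can be illustrated with a small sketch. This is not OpenNebula code: `HostImageCache`, `fetchFromDatastore` and the capacity parameter are hypothetical names chosen for illustration. The assumption is that a compute node keeps recently used images locally (least recently used evicted first), so only a cache miss pays for the copy from the image datastore.

```scala
import scala.collection.mutable

// Hypothetical per-host image cache (illustration only, not an OpenNebula API).
// Keeps at most `capacity` images locally and evicts the least recently used one.
final class HostImageCache(capacity: Int, fetchFromDatastore: String => Array[Byte]) {
  require(capacity > 0, "capacity must be positive")

  // LinkedHashMap preserves insertion order; re-inserting on each hit keeps
  // the least recently used image at the head.
  private val cached = mutable.LinkedHashMap.empty[String, Array[Byte]]
  private var copies = 0 // number of (slow) copies from the datastore

  def copiesFromDatastore: Int = copies

  def imageFor(imageId: String): Array[Byte] = cached.remove(imageId) match {
    case Some(bytes) =>
      cached.put(imageId, bytes) // cache hit: mark as most recently used
      bytes
    case None =>
      copies += 1 // cache miss: pay for the copy once
      val bytes = fetchFromDatastore(imageId)
      if (cached.size >= capacity) cached.remove(cached.head._1) // evict LRU
      cached.put(imageId, bytes)
      bytes
  }
}

object ImageCacheDemo {
  // Launches six VMs on one host and returns how many image copies were needed.
  def run(): Int = {
    val cache = new HostImageCache(2, id => id.getBytes("UTF-8"))
    Seq("ubuntu", "ubuntu", "ubuntu", "galaxy", "cloudbiolinux", "ubuntu")
      .foreach(cache.imageFor)
    cache.copiesFromDatastore
  }
}
```

With a cache of two images, the six launches in `ImageCacheDemo.run()` need four copies instead of six; repeated launches of the same image are where the saving comes from.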

SLIDE 18

Big Data analytics

Apache Spark
Hadoop
Mesos -> YARN
Orchestration of Spark clusters with OpenNebula

See next section…

SLIDE 19

Cloud computing and data-intensive computing at PDC - a brief overview
OpenNebula at PDC - examples
Apache Spark at PDC - what I use our cloud for

SLIDE 20

Sources of Big Data

Timeline axis: 1990 to 2010

Probing extreme phenomena in scientific fields with mature theories
Increasingly exploratory research areas
Making meaning of human activity on the Internet
Sensing everything

SLIDE 21

Sthlm, May 2014

SLIDE 22

Research at HPCViz Data-Intensive Computing Group

Axis labels: Applications, Technologies, Industry, Algorithms, Theory

…. building a DS curriculum for the group
Brain images – Scabia project, MEG data
PaaS for Life Science
- Biobankcloud, Galaxy, ..
Privacy preservation in the cloud
- Biobankcloud
Federated clouds
- EGI, Nordic Cloud, CDMI proxy
Cloud environments
- Environment launching
- Streaming capabilities
- Workflows - including graph data capabilities
Anomaly detection in performance data
- Intrusion Detection
- Performance Analysis
- Sensor data, IoT, …
Next: Scalable statistics
Cloud and industry – esp. startups
Chemoinformatics
- MapReduce based Parallel Virtual Screening

SLIDE 23

Federated Cloud Services

Federated IaaS and STaaS Cloud

Tier 1: Reliable Infrastructure Cloud
Tier 2: General-purpose platform services
Tier 3: Platform as a Service
Tier 4: Zero ICT Infrastructures

Services shown: PaaS, DBaaS, Hadoop aaS, VRE, Secure storage, Key Mgmt, Encryption, ACL mgmt, Virtual eLaboratory

From “The EGI Federated Cloud, a production IaaS infrastructure for the EEA”, D. Wallom (EGI CF, 20.04.2014)

SLIDE 25

DAaaS - What do We Need?

Interactive queries: enable faster decisions
Queries on streaming data: enable decisions on real-time data
Sophisticated data processing: enable “better” decisions
Need of statistical principles (that scale): to justify the inferential leap from data to knowledge:

– Need estimates of uncertainty in the outputs of algorithms (“error bars”)

Pipelines: ability to run mixed analysis under one framework – for efficiency and to be able to develop sophisticated algorithms

Support batch, streaming, and interactive computations… in a unified framework
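
One common way to attach “error bars” to an estimate is the bootstrap: resample the data with replacement many times and look at the spread of the recomputed estimate. The sketch below is an assumption about what such a step might look like, not something taken from the slides. `Bootstrap`, `interval` and the default parameters are hypothetical, and this plain single-machine version does not address the scalability concern raised above.

```scala
import scala.util.Random

object Bootstrap {
  def mean(xs: IndexedSeq[Double]): Double = xs.sum / xs.size

  // Percentile bootstrap: returns (lower, upper) bounds that cover the middle
  // `level` fraction of the resampled estimates.
  def interval(data: IndexedSeq[Double],
               estimator: IndexedSeq[Double] => Double,
               resamples: Int = 1000,
               level: Double = 0.95,
               seed: Long = 42L): (Double, Double) = {
    require(data.nonEmpty && resamples > 1 && level > 0.0 && level < 1.0)
    val rng = new Random(seed)
    val estimates = IndexedSeq.fill(resamples) {
      // Draw data.size points with replacement and re-run the estimator.
      estimator(IndexedSeq.fill(data.size)(data(rng.nextInt(data.size))))
    }
    val sorted = estimates.sortWith(_ < _)
    val alpha = (1.0 - level) / 2.0
    val lo = sorted(math.floor(alpha * (resamples - 1)).toInt)
    val hi = sorted(math.ceil((1.0 - alpha) * (resamples - 1)).toInt)
    (lo, hi)
  }
}
```

A call such as `Bootstrap.interval(samples, Bootstrap.mean)` yields a lower and an upper bound around the mean; any other estimator with the same signature can be passed instead.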

SLIDE 26

Berkeley Data Analytics Stack

Applications
Data Processing: Spark Streaming, GraphX, MLBase, BlinkDB, Shark, Spark; alongside Pig, HIVE, Storm, MPI, Hadoop MR, …
Data Management: HDFS - Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Resource Management: Hadoop YARN - “Yet-Another-Resource-Negotiator”. A framework for job scheduling and cluster resource management.
Infrastructure: e.g. public and private clouds

SLIDE 27

Apache Hadoop

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.

Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

ZooKeeper™: A high-performance coordination service for distributed applications.


SLIDE 28

Berkeley Data Analytics Stack

Shark - Hive and SQL on top of Spark
MLbase - Machine Learning project on top of Spark
BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
GraphX - a graph processing & analytics framework on top of Spark (GraphX has been merged into Spark 0.9)
Apache Mesos - Cluster management system that supports running Spark
Tachyon - In memory storage system that supports running Spark
Apache MRQL - A query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
OpenDL - A deep learning algorithm library based on the Spark framework. Just kicked off.
SparkR - R frontend for Spark
Spark Job Server - REST interface for managing and submitting Spark jobs on the same cluster

SLIDE 29

Berkeley Data Analytics Stack

Unifies batch, streaming, interactive computation
Easy to build sophisticated applications

– Support iterative, graph-parallel algorithms
– Powerful APIs in Scala, Python, Java

Component annotations in the figure: Streaming; Interactive; Sophisticated algorithms; Batch, Interactive

spark.apache.org

SLIDE 30

Turning Data into Value, Examples

Unify real-time and historical data analysis

– Easier to build and maintain
– Cheaper to operate
– Easier to get insights, faster decisions

Unify streaming and machine-learning

– Faster diagnosis, decisions (e.g., better ad targeting)

Unify graph processing and ETLs

– Faster to get social network insights (e.g., improve user experience)

SLIDE 31

What it Means for Users

Separate frameworks: HDFS read -> ETL -> HDFS write -> HDFS read -> train -> HDFS write -> HDFS read -> query -> HDFS write -> …

Spark: HDFS read -> ETL -> train -> query, with interactive analysis

SLIDE 32

Advantage of a unified stack

Explore data interactively to identify problems:

  $ ./spark-shell
  scala> val file = sc.textFile("smallLogs")   // textFile yields the log lines as strings
  ...
  scala> val filtered = file.filter(_.contains("ERROR"))
  ...
  scala> val mapped = filtered.map(...)
  ...

Use same code in Spark for processing large logs:

  import org.apache.spark.SparkContext

  object ProcessProductionData {
    def main(args: Array[String]) {
      val sc = new SparkContext(...)
      val file = sc.textFile("productionLogs")
      // Same filter and map as in the interactive session
      val filtered = file.filter(_.contains("ERROR"))
      val mapped = filtered.map(...)
      ...
    }
  }

Use similar code in Spark Streaming for realtime processing:

  import org.apache.spark.streaming.StreamingContext

  object ProcessLiveStream {
    def main(args: Array[String]) {
      val sc = new StreamingContext(...)
      // Only the input source changes; the processing steps stay the same
      val stream = sc.kafkaStream(...)
      val filtered = stream.filter(_.contains("ERROR"))
      val mapped = filtered.map(...)
      ...
    }
  }
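
The point of the slide code is that the processing logic is written once and reused across modes. A minimal sketch of that idea follows, assuming plain Scala collections as stand-ins for RDDs and DStreams so that it runs without a Spark dependency. `SharedLogLogic`, the log format "LEVEL component message" and the micro-batch size are hypothetical.

```scala
object SharedLogLogic {
  // Shared business logic: written once, reused by every mode below.
  def isError(line: String): Boolean = line.contains("ERROR")

  // Assumes a hypothetical log format "LEVEL component message".
  def component(line: String): String = {
    val fields = line.split(" ")
    if (fields.length > 1) fields(1) else "unknown"
  }

  def errorsPerComponent(lines: Seq[String]): Map[String, Int] =
    lines.filter(isError).groupBy(component).map { case (c, ls) => c -> ls.size }

  // "Batch" mode: one pass over the full log.
  def batch(log: Seq[String]): Map[String, Int] = errorsPerComponent(log)

  // "Streaming" mode: the same function applied to each micro-batch in turn.
  def streaming(log: Seq[String], batchSize: Int): List[Map[String, Int]] =
    log.grouped(batchSize).map(errorsPerComponent).toList
}
```

Only the way the input is fed differs between `batch` and `streaming`; both delegate to `errorsPerComponent`, which is the property the unified stack is meant to provide at cluster scale.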

SLIDE 33

Spark Integration

From Scala:

  val points = sc.runSql[Double, Double](
    "select latitude, longitude from historic_tweets")

  val model = KMeans.train(points, 10)

  sc.twitterStream(...)
    .map(t => (model.closestCenter(t.location), 1))
    .reduceByWindow("5s", _ + _)
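
The last two steps of the slide code (assign each incoming location to its closest cluster center, then count per center within a window) can be sketched in plain Scala. This is an illustration under assumptions, not the Spark API: `NearestCenter` and its methods are hypothetical, the centers are taken as already trained, and latitude/longitude are treated as planar coordinates for simplicity.

```scala
object NearestCenter {
  type Point = (Double, Double)

  private def squaredDistance(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Index of the center closest to `p`; ties go to the lower index.
  def closestCenter(centers: IndexedSeq[Point], p: Point): Int = {
    require(centers.nonEmpty, "at least one center is needed")
    centers.indices.reduceLeft { (best, i) =>
      if (squaredDistance(centers(i), p) < squaredDistance(centers(best), p)) i else best
    }
  }

  // Count events per closest center within one window of incoming locations,
  // mirroring map(... -> 1) followed by a per-window reduce with _ + _.
  def countsPerCenter(centers: IndexedSeq[Point], window: Seq[Point]): Map[Int, Int] =
    window
      .map(p => (closestCenter(centers, p), 1))
      .groupBy(_._1)
      .map { case (center, pairs) => center -> pairs.map(_._2).sum }
}
```

In a streaming setting, `countsPerCenter` would be applied to the locations collected in each window, for example every five seconds as on the slide.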

SLIDE 34

Summary – challenges and opportunities arising

Data processing: from special to general - and back?
Data locality: from detailed, to general – and back? See e.g. Google’s OMEGA efforts
Infrastructure: from public to private to hybrid cloud
Disk vs in-memory: going back to earlier more complex environments? Not yet.
Workflows/pipelines: unification crucial for performance and usability
New areas evolving, both in computer science and in statistics

– Quality: Need of “error bars” around outcomes

Need of new solutions to make this possible, on large data sets

– Algorithmic weakening for statistical inference

a new area in theoretical computer science?
a new area in statistics?

SLIDE 35

Summary – Exciting times ahead!


“Use Clouds running Data Analytics processing Big Data to solve problems in X-Informatics (or e-X)”

Need to excel in many areas, at the same time!

Venn diagram: Computer Skills, Mathematics & Statistics Knowledge, Substantive Experience; overlaps: Data Science, Machine Learning, Traditional Research, Danger Zone!

SLIDE 36

References

Geoffrey Fox, Indiana University

– http://www.soic.indiana.edu/people/profiles/fox-geoffrey-charles.shtml - great visionary researcher in distributed computing and its usage

Frontiers in Massive Data Analysis

– http://www.nap.edu/catalog.php?record_id=18374 - foundation of the current state of the art

The Fourth Paradigm: Data-Intensive Scientific Discovery

– http://research.microsoft.com/en-us/collaboration/fourthparadigm/ - a good starting point, esp. visions from Jim Gray

Spark related slides from

– Spark team

Matei Zaharia, MIT and Databricks
Ion Stoica, UC Berkeley and Databricks
Patrick Wendell, Databricks
Joseph Gonzalez (GraphX), UC Berkeley

SLIDE 37

Thanks!

Åke Edlund

edlund@pdc.kth.se
