SLIDE 1

AI and Predictive Analytics in Data-Center Environments

Distributed Computing using Spark

An Introduction to Spark Environments Josep Ll. Berral @BSC

Intel Academic Education Mindshare Initiative for AI

SLIDE 2

Presentation

Distributed computing using Apache Spark!

  • Apache Spark is a framework for processing data in a distributed manner
  • For distributing our experiments and analytics
SLIDE 3

Introduction

“Describe what to execute and let Spark distribute it for execution”

SLIDE 4

Introduction to Spark

  • What is Apache Spark?
  • Cluster Computing Framework
  • Programming clusters with data parallelism and fault tolerance
  • Programmable in Java, Scala, Python and R
SLIDE 5

Motivation for using Spark

  • Spark schedules data parallelism
  • User defines the set of operations to be performed
  • Spark performs an orchestrated execution
  • Libraries of distributed algorithms:
  • ML, Graphs, Streaming, DB queries

[Diagram: an experiment (exp) running in parallel over data partitions d1, d2, d3]
SLIDE 6

Motivation for using Spark

  • It works with the Hadoop Distributed File System (HDFS)
  • Takes advantage of distributed file systems
  • Brings the execution to where the data is distributed

[Diagram: the experiment (exp) is shipped to the HDFS nodes holding data partitions d1, d2, d3]

SLIDE 7

Introduction to Apache Spark

  • Cluster Computing Framework

1. Define your cluster (master and workers)

[Diagram: Cluster]

SLIDE 8

Introduction to Apache Spark

  • Cluster Computing Framework

1. Define your cluster (master and workers)
2. Link to your distributed file system

[Diagram: Cluster + DFS]

SLIDE 9

Introduction to Apache Spark

  • Cluster Computing Framework

1. Define your cluster (master and workers)
2. Link to your distributed file system
3. Start a session / create an app

[Diagram: My Cluster + DFS + Local Session]

SLIDE 10

Introduction to Apache Spark

  • Cluster Computing Framework

1. Define your cluster (master and workers)
2. Link to your distributed file system
3. Start a session / create an app
4. Let Spark plan and execute the workflow and data-flow

[Diagram: My Cluster + DFS + Local Session → Run!]

SLIDE 11

Introduction to Apache Spark

  • Distributed Data and Shuffling
  • Spark takes advantage of data distribution
  • If an operation needs to combine data from different locations
  • Shuffling: data must be exchanged among workers
  • We must keep this in mind when designing our analytics

[Diagram: partitions d1, d2 are processed; intermediate results r1, r2 are exchanged among workers (data exchange); then processing continues]

SLIDE 12

Virtualized Environments

  • Cloud environments
  • Take advantage of Virtualization/Containers
SLIDE 13

Virtualized Environments

  • Cloud environments
  • Take advantage of Virtualization/Containers

Worker image: 2 × CPU, 16 GB memory, 1 TB disk
Master image: 4 × CPU, 32 GB memory, 2 TB disk

VM/Container manager:
  • “Deploy N workers and 1 master”
  • “Create a virtual network to let them see each other”
  • “Give them a common configuration (the master can find the workers, the workers can find the DFS and the files, ...)”
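A hypothetical deployment sketch of those manager instructions, written as a docker-compose file. The image name, environment variables, and ports follow the `bitnami/spark` container image and are assumptions to adapt to your own platform; Compose puts both services on a shared network so they can reach each other by name.

```yaml
# docker-compose.yml -- 1 master + N workers on a shared virtual network
# (hypothetical sketch; image and env vars follow bitnami/spark).
services:
  spark-master:
    image: bitnami/spark
    environment:
      - SPARK_MODE=master
    ports:
      - "7077:7077"   # cluster port the workers connect to
      - "8080:8080"   # web UI
  spark-worker:
    image: bitnami/spark
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
```

Scaling to N workers is then a one-liner, e.g. `docker compose up --scale spark-worker=3`.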
SLIDE 14

Summary

  • What is Spark
  • A distributed computing framework
  • Spark's distributed architecture
  • Master and workers
  • Distributing experiments and data
  • Leveraging virtualization
  • How we can deploy/scale using VMs and containers