SLIDE 1

AI and Predictive Analytics in Data-Center Environments

Distributed Computing using Spark

An Introduction to Spark Environments Josep Ll. Berral @BSC

Intel Academic Education Mindshare Initiative for AI

SLIDE 2

Presentation

Distributed computing using Apache Spark!

  • Apache Spark is a framework for processing data in a distributed manner
  • For distributing our experiments and analytics
SLIDE 3

Introduction

“Describe what to execute and let Spark distribute it for execution”

SLIDE 4

Introduction to Spark

  • What is Apache Spark?
  • Cluster Computing Framework
  • Programming clusters with data parallelism and fault tolerance
  • Programmable in Java, Scala, Python and R
SLIDE 5

Motivation for using Spark

  • Spark schedules data parallelism
  • User defines the set of operations to be performed
  • Spark performs an orchestrated execution
  • Libraries of distributed algorithms:
  • ML, Graphs, Streaming, DB queries

[Diagram: an experiment (exp) running in parallel over data partitions d1, d2, d3]
SLIDE 6

Motivation for using Spark

  • It works with the Hadoop Distributed File System (HDFS)
  • Takes advantage of distributed file systems
  • Brings the execution to where the data is distributed

[Diagram: the experiment (exp) is shipped to the HDFS nodes holding data partitions d1, d2, d3]

SLIDE 7

Introduction to Apache Spark

  • Cluster Computing Framework

1. Define your cluster (master and workers)

[Diagram: Cluster]

SLIDE 8

Introduction to Apache Spark

  • Cluster Computing Framework

1. Define your cluster (master and workers)
2. Link to your distributed file system

[Diagram: Cluster + DFS]

SLIDE 9

Introduction to Apache Spark

  • Cluster Computing Framework

1. Define your cluster (master and workers)
2. Link to your distributed file system
3. Start a session / create an app

[Diagram: My Cluster + DFS + Local Session]

SLIDE 10

Introduction to Apache Spark

  • Cluster Computing Framework

1. Define your cluster (master and workers)
2. Link to your distributed file system
3. Start a session / create an app
4. Let Spark plan and execute the workflow and data-flow

[Diagram: My Cluster + DFS + Local Session → Run!]

SLIDE 11

Introduction to Apache Spark

  • Distributed Data and Shuffling
  • Spark takes advantage of data distribution
  • If an operation needs to combine data from different locations
  • Shuffling: data must be exchanged among workers
  • We must keep this in mind when designing our analytics

[Diagram: partitions d1, d2 are processed; intermediate results r1, r2 are exchanged among workers (data exchange); then processing continues]

SLIDE 12

Virtualized Environments

  • Cloud environments
  • Take advantage of Virtualization/Containers
SLIDE 13

Virtualized Environments

  • Cloud environments
  • Take advantage of Virtualization/Containers

Worker image: 2 × CPU, 16 GB memory, 1 TB disk
Master image: 4 × CPU, 32 GB memory, 2 TB disk

VM/Container manager:
  • “Deploy N workers and 1 master”
  • “Create a virtual network to let them see each other”
  • “Give them a common configuration (the master can find the workers, the workers can find the DFS and the files, ...)”
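A hypothetical deployment sketch of those manager instructions, written as a docker-compose file. The image name, environment variables, and ports follow the `bitnami/spark` container image and are assumptions to adapt to your own platform; Compose puts both services on a shared network so they can reach each other by name.

```yaml
# docker-compose.yml -- 1 master + N workers on a shared virtual network
# (hypothetical sketch; image and env vars follow bitnami/spark).
services:
  spark-master:
    image: bitnami/spark
    environment:
      - SPARK_MODE=master
    ports:
      - "7077:7077"   # cluster port the workers connect to
      - "8080:8080"   # web UI
  spark-worker:
    image: bitnami/spark
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
```

Scaling to N workers is then a one-liner, e.g. `docker compose up --scale spark-worker=3`.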
SLIDE 14

Summary

  • What is Spark
  • A distributed computing framework
  • Spark's distributed architecture
  • Master and workers
  • Distributing experiments and data
  • Leveraging virtualization
  • How we can deploy/scale using VMs and containers