Distributed Computing Using Spark – Practical / Praktikum WS17/18


SLIDE 1

Albert-Ludwigs-Universität Freiburg

Practical / Praktikum WS17/18

October 18th, 2017

Distributed Computing Using Spark

  • Prof. Dr. Georg Lausen
  • Anas Alzogbi
  • Victor Anthony Arrascue Ayala

SLIDE 2

Agenda

  • Introduction to Spark
  • Case study: Recommender system for scientific papers

  • Organization
  • Hands-on session

SLIDE 4

Introduction to Spark

  • Distributed programming
  • MapReduce
  • Spark

SLIDE 5

Distributed programming – problem

  • Data grows faster than processing capabilities

  • Web 2.0: users generate content
  • Social networks, online communities, etc.

Source: https://www.flickr.com/photos/will-lion/2595497078

SLIDE 6

Big Data

Source: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
Source: http://www.bigdata-startups.com/open-source-tools/

SLIDE 7

Big Data

  • Buzzword
  • Often less structured
  • Requires different techniques, tools, and approaches
  • To solve new problems, or old ones in a better way

SLIDE 8

Network Programming Models

  • Requires a communication protocol for programming parallel computers (slow)
    • MPI (wiki)
  • Data and code locality across the network has to be managed manually
  • No failure management
  • Network problems not solved (e.g. stragglers)

SLIDE 9

Data Flow Models

  • Higher level of abstraction: algorithms are parallelized on large clusters
  • Fault recovery by means of data replication
  • A job is divided into a set of independent tasks
  • Code is shipped to where the data is located
  • Good scalability

SLIDE 10

MapReduce – Key ideas

  1. The problem is split into smaller problems (map step)
  2. The smaller problems are solved in parallel
  3. Finally, the solutions to the smaller problems are synthesized into a solution of the original problem (reduce step)

SLIDE 11

MapReduce – Overview

[Diagram: input data is divided into splits 0-2; map tasks emit <k,v> pairs, which are shuffled to reduce tasks producing outputs 0 and 1]

A target problem has to be parallelizable!

SLIDE 12

MapReduce – Wordcount example

Input (four headlines):
  Google Maps charts new territory into businesses
  Google selling new tools for businesses to build their own maps
  Google promises consumer experience for businesses with Maps Engine Pro
  Google is trying to get its Maps service used by more businesses

Output word counts: Google 4, Maps 4, Businesses 4, Engine 1, Charts 1, Territory 1, Tools 1, …

SLIDE 13

MapReduce – Wordcount’s map

The input is divided into splits, each processed by a map task that emits partial word counts:

  Map 1: Google 2, Charts 1, Maps 2, Territory 1, …
  Map 2: Google 2, Businesses 2, Maps 2, Service 1, …

SLIDE 15

MapReduce – Wordcount’s reduce

The emitted pairs are grouped by key and each reduce task sums the partial counts:

  Reduce 1: Google 2 + Google 2 → Google 4; Maps 2 + Maps 2 → Maps 4; …
  Reduce 2: Businesses 2 + Businesses 2 → Businesses 4; Charts 1; Territory 1; …
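To make the whole flow concrete, here is a minimal single-machine sketch of the same map/shuffle/reduce steps in plain Python (illustrative only; this is not how Hadoop schedules or distributes tasks):

from collections import defaultdict

def map_phase(split):
    # emit partial (word, count) pairs for one input split
    counts = defaultdict(int)
    for line in split:
        for word in line.split():
            counts[word] += 1
    return list(counts.items())

def reduce_phase(pairs):
    # group by key and sum the partial counts
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

splits = [["Google Maps charts new territory into businesses"],
          ["Google is trying to get its Maps service used by more businesses"]]
shuffled = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffled))   # {'Google': 2, 'Maps': 2, 'businesses': 2, ...}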

SLIDE 17

MapReduce

  • Automatic:
    • Partitioning and distribution of data
    • Parallelization and assignment of tasks
    • Scalability, fault tolerance, scheduling

SLIDE 18

Apache Hadoop

  • Open-source implementation of MapReduce

Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php

SLIDE 19

MapReduce – Parallelizable algorithms

  • Matrix-vector multiplication
  • Power iteration (e.g. PageRank)
  • Gradient descent methods
  • Stochastic SVD
  • Matrix factorization (tall-and-skinny QR)
  • etc…

SLIDE 20

MapReduce – Limitations

  • Inefficient for multi-pass algorithms
  • No efficient primitives for data sharing
  • State between steps is materialized and distributed
  • Slow due to replication and storage

Source: http://stanford.edu/~rezab/sparkclass

SLIDE 21

Limitations – PageRank

  • Requires repeated multiplications of a sparse matrix and a vector (see the sketch below)
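A toy sketch of that iteration in Python/NumPy (single machine; the link matrix is illustrative, not the distributed MapReduce formulation):

import numpy as np

# M: column-stochastic link matrix of a 3-page web, d: damping factor
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
d, n = 0.85, 3
r = np.full(n, 1.0 / n)              # initial rank vector
for _ in range(50):                  # each step is one matrix-vector multiplication
    r = (1 - d) / n + d * M.dot(r)
print(r)                             # converged PageRank scores

In MapReduce, each of these multiplications becomes a full job whose intermediate state is written back to disk, which is exactly the bottleneck described on the next slide.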

Source: http://stanford.edu/~rezab/sparkclass

SLIDE 22

Limitations – PageRank

  • MapReduce sometimes requires asymptotically more communication or I/O

  • Iterations are handled very poorly
  • Reading and writing to disk is a bottleneck
  • In some cases 90% of time is spent on I/O

SLIDE 23

Spark Processing Framework

  • Developed in 2009 at UC Berkeley's AMPLab
  • Open-sourced in 2010; later an Apache project
  • Most active big data community
  • Industrial contributions: over 50 companies
  • Written in Scala
    • Scala is good at serializing closures
  • Clean APIs in Java, Scala, Python, R

SLIDE 24

Spark Processing Framework

[Figure: Contributors (2014)]

SLIDE 25

Spark – High Level Architecture

[Diagram: Spark high-level architecture on top of HDFS]

Source: https://mapr.com/ebooks/spark/

SLIDE 26

Spark – Running modes

  • Local mode: for debugging
  • Cluster mode (see the launch sketch below):
    • Standalone mode
    • Apache Mesos
    • Hadoop YARN
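How a packaged application is launched in each mode with spark-submit (a sketch; the script name and host addresses are illustrative):

spark-submit --master local[*] my_app.py            # local mode, all cores
spark-submit --master spark://host:7077 my_app.py   # standalone cluster
spark-submit --master mesos://host:5050 my_app.py   # Apache Mesos
spark-submit --master yarn my_app.py                # Hadoop YARN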

SLIDE 27

Spark – Programming model

  • SparkContext: the entry point
  • SparkSession: since Spark 2.0 (see the sketch below)
    • New unified entry point; combines SQLContext, HiveContext and, in the future, StreamingContext
  • SparkConf: used to initialize the context
  • Spark's interactive shells
    • Scala: spark-shell
    • Python: pyspark
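A minimal pyspark sketch of these entry points (the app name and the local master are illustrative choices):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# SparkConf holds the configuration used to initialize the context
conf = SparkConf().setAppName("praktikum-demo").setMaster("local[*]")

# SparkSession: the unified entry point since Spark 2.0
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext   # the classic SparkContext remains accessible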

SLIDE 28

Spark – RDDs, the game changer

  • Resilient distributed datasets
  • A typed data structure (RDD[T]) that is not language-specific
  • Each element of type T is stored locally on a machine
  • Each element has to fit in memory
  • An RDD can be cached in memory

SLIDE 29

Resilient Distributed Datasets

  • Immutable collections of objects, spread across a cluster
  • User-controlled partitioning and storage (see the sketch below)
  • Automatically rebuilt on failure
  • RDDs are superseded by the Dataset API, which is strongly typed like an RDD (Spark ≥ 2.0)
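A small pyspark sketch of these properties (the partition count is an arbitrary choice):

rdd = sc.parallelize(range(1000), 8)    # user-controlled partitioning: 8 partitions
doubled = rdd.map(lambda x: 2 * x)      # immutability: map returns a new RDD
doubled.cache()                         # keep the result in memory across jobs
print(doubled.getNumPartitions())       # 8
print(doubled.sum())                    # 999000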

SLIDE 30

Spark – Wordcount example

text_file = sc.textFile("...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("...")

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext

SLIDE 31

Spark – Data manipulation

  • Transformations: always yield a new RDD instance (RDDs are immutable)
    • filter, map, flatMap, etc.
  • Actions: trigger a computation on the RDD's elements
    • count, foreach, etc.
  • Lazy evaluation of transformations (see the sketch below)
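A brief sketch of lazy evaluation (the data is illustrative):

lines = sc.parallelize(["a b", "b c", "c"])
words = lines.flatMap(lambda l: l.split())   # transformation: nothing runs yet
kept = words.filter(lambda w: w != "c")      # still lazy
print(kept.count())                          # action: triggers the chain, prints 3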

SLIDE 32

Spark – DataFrames

  • DataFrame API, introduced in Spark 1.3
  • Handles a table-like representation with named columns and declared column types
  • Not to be confused with Python's pandas DataFrames
  • DataFrames translate SQL code into low-level RDD operations
  • Since Spark 2.0, DataFrame is implemented as a special case of Dataset

SLIDE 33

DataFrames – How to create DFs

  1. Converting existing RDDs
  2. Running SQL queries
  3. Loading external data (all three are sketched below)
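A brief pyspark sketch of the three options, assuming the SparkSession `spark` from the earlier sketch (names and the JSON path are illustrative):

from pyspark.sql import Row

# 1. Converting an existing RDD of Rows
rdd = sc.parallelize([Row(name="Alice", occupation="student")])
people = spark.createDataFrame(rdd)

# 2. Running a SQL query over a registered view
people.createOrReplaceTempView("people")
students = spark.sql("SELECT name FROM people WHERE occupation = 'student'")

# 3. Loading external data
df = spark.read.json("people.json")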

SLIDE 34

Spark SQL

  • SQL context

http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

# Run SQL statements; returns a DataFrame
students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'")

SLIDE 35

Spark – DataFrames

Source: Spark in Action (book, see literature)

SLIDE 36

Machine Learning (ML) with Spark

  • ML project steps

  1. Data collection
  2. Data cleaning and preparation
  3. Data analysis and feature extraction
  4. Model training
  5. Model evaluation
  6. Model application

Source: Spark in Action (book, see literature)

SLIDE 37

Machine Learning (ML) with Spark

  • ML with Spark
    • Perfect for parallelizable ML algorithms!
    • A single platform (the same system and the same API) for performing most tasks:
      • Collecting, preparing, and analyzing the data
      • Training, evaluating, and using the model
    • Training and applying ML algorithms on very large datasets
    • Offers most of the popular ML algorithms

SLIDE 38

Machine Learning (ML) with Spark

  • MLlib
    • Spark's machine learning library
    • Provides a generalized API for training and tuning different algorithms in the same way (influenced by scikit-learn)
    • Relies on several low-level libraries for optimized linear algebra operations:
      • Breeze and jblas for Scala and Java
      • NumPy for Python

SLIDE 39

Machine Learning (ML) with Spark

  • MLlib has two APIs:
    • RDD-based API (spark.mllib): will be removed in Spark 3.0
    • DataFrame-based API (spark.ml): will keep receiving new features
  • More user-friendly API than RDDs
  • A uniform API across ML algorithms and across multiple languages
  • Facilitates practical ML pipelines (feature transformations)

SLIDE 40

MLlib abstractions

  • Transformer
    • Main method: transform
    • Examples: ML model, feature transformer
  • Estimator
    • Main method: fit
    • Example: ML algorithm
  • Evaluator
    • Example: RMSE metric

[Diagram: an Estimator fits an input dataset and produces a Transformer; the Transformer turns a dataset into a transformed dataset; an Evaluator estimates evaluation results from it]

Source: Spark in Action (book, see literature)
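A compact pyspark sketch of the three abstractions (the column names and the train_df/test_df DataFrames are assumed to exist):

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

lr = LinearRegression(featuresCol="features", labelCol="label")   # Estimator
model = lr.fit(train_df)                 # fit() returns a Transformer (the model)
predictions = model.transform(test_df)   # transform() appends a "prediction" column

evaluator = RegressionEvaluator(metricName="rmse", labelCol="label")   # Evaluator
print(evaluator.evaluate(predictions))   # RMSE metric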

SLIDE 41
MLlib Pipelines

  • A pipeline chains multiple Transformers and Estimators together to specify an ML workflow
  • Example: learn a prediction model using features extracted from text documents, with a training phase and a test phase (see the sketch below)

Source: http://spark.apache.org/docs/latest/ml-pipeline.html#properties-of-pipeline-components
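Roughly following the example in the linked documentation, such a text pipeline might look like this (the parameters and the training/test DataFrames are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

model = pipeline.fit(training)        # training phase: fits all stages in order
predictions = model.transform(test)   # test phase: runs the fitted stages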

SLIDE 42

Organization

  • Introduction to Spark
  • Case study: Recommender system for scientific papers

  • Organization
  • Hands-on session

SLIDE 43

Case study: Recommender system for scientific papers

  • Motivation
    • Recommend relevant papers to users
  • Dataset
    • Set of papers (~172 K)
      • Textual content: title + abstract
      • Attributes: type, journal, pages, year, …
    • Set of users (~28 K)
    • Ratings (~828 K)

SLIDE 44

Organization

  • Introduction to Spark
  • Case study: Recommender system for scientific papers

  • Organization
  • Hands-on session

SLIDE 45

Organization

  • Team
  • Educational goals
  • Requirements
  • ILIAS
  • Experiments’ submissions
  • Assessment
  • Discussion with the tutors
  • Schedule

SLIDE 46

Team

  • Prof. Georg Lausen
  • Assistants
    • Anas
    • Anthony
  • Tutors
    • Polina Koleva
    • Matteo Cossu

SLIDE 47

Educational goals

  • Distributed programming paradigm
  • Recommender Systems (use case)
  • Theoretical and practical training
  • Master project and thesis
  • Data science profile for the job market

SLIDE 48

Requirements

  • Mandatory
    • Registration via HISinOne
    • Attendance at the kick-off meeting
  • Recommended
    • Attendance of the DAQL, SIDS, or ML lectures
    • Basics in Python programming

SLIDE 49

ILIAS

  • Course: Distributed Computing Using Spark - WS1718
    https://ilias.uni-freiburg.de/goto.php?target=crs_878841
  • Access with the course password
  • Forum for clarification questions about the tasks
    • Do not post solutions or suggestions

SLIDE 50

Experiments’ submissions

  • 6 experiments, 2-3 weeks of working time each
  • Submissions in groups of 2 students (form your group)
  • Submissions via ILIAS

SLIDE 51

Assessment

  • Each experiment: 50 points; 300 points overall
  • At least 70% of the points are required to pass
  • Corrections done by tutors

SLIDE 52

Discussion of solutions with tutors

  • Mandatory attendance
  • Each member has to be able to explain all tasks!
    • Otherwise: 0 points for that task
  • Copied solutions:
    • First time: 0 points for that experiment
    • Second time: failure of the practical

SLIDE 53

Schedule

Experiment | Content                                                             | Release    | Submission      | Discussion
1          | Familiarizing with Tools, Loading Data, and Basic Analysis of Data | 18.10.2017 | 01.11.2017, 11h | 08.11.2017
2          | Experiment 2                                                        | 01.11.2017 | 15.11.2017, 11h | 22.11.2017
3          | Experiment 3                                                        | 15.11.2017 | 29.11.2017, 11h | 06.12.2017
4          | Experiment 4                                                        | 29.11.2017 | 13.12.2017, 11h | 20.12.2017
5          | Experiment 5                                                        | 13.12.2017 | 10.01.2018, 11h | 17.01.2018
6          | Experiment 6                                                        | 10.01.2018 | 31.01.2018, 11h | 07.02.2018

SLIDE 54

Literature

  • Spark in Action [book] by Petar Zečević and Marko Bonaći
  • Machine Learning with Spark [book] by Nick Pentreath
  • Apache Spark documentation: http://spark.apache.org/docs/latest

SLIDE 55

Organization

  • Introduction to Spark
  • Case study: Recommender system for scientific papers

  • Organization
  • Hands-on session
