Distributed Computing Using Spark – Practical / Praktikum WS17/18


SLIDE 1

Albert-Ludwigs-Universität Freiburg

Practical / Praktikum WS17/18

October 18th, 2017

Distributed Computing Using Spark

  • Prof. Dr. Georg Lausen
  • Anas Alzogbi
  • Victor Anthony Arrascue Ayala

SLIDE 2

Agenda

  • Introduction to Spark
  • Case study: Recommender system for scientific papers

  • Organization
  • Hands-on session

SLIDE 4

Introduction to Spark

  • Distributed programming
  • MapReduce
  • Spark

SLIDE 5

Distributed programming – problem

  • Data grows faster than processing capabilities

  • Web 2.0: users generate content
  • Social networks, online communities, etc.

Source: https://www.flickr.com/photos/will-lion/2595497078

SLIDE 6

Big Data

Source: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
Source: http://www.bigdata-startups.com/open-source-tools/

SLIDE 7

Big Data

  • Buzzword
  • Often less structured
  • Requires different techniques, tools, and approaches
  • To solve new problems, or old ones in a better way

SLIDE 8

Network Programming Models

  • Requires a communication protocol for programming parallel computers (slow)
    • MPI (wiki)
  • Data and code locality across the network has to be managed manually
  • No failure management
  • Network problems not solved (e.g. stragglers)

SLIDE 9

Data Flow Models

  • Higher level of abstraction: algorithms are parallelized on large clusters
  • Fault recovery by means of data replication
  • A job is divided into a set of independent tasks
  • Code is shipped to where the data is located
  • Good scalability

SLIDE 10

MapReduce – Key ideas

  1. The problem is split into smaller problems (map step)
  2. The smaller problems are solved in parallel
  3. Finally, the solutions to the smaller problems are synthesized into a solution of the original problem (reduce step)

SLIDE 11

MapReduce – Overview

[Diagram: input data is divided into splits 0-2; map tasks emit <k,v> pairs, which are shuffled to reduce tasks producing outputs 0 and 1]

A target problem has to be parallelizable!

SLIDE 12

MapReduce – Wordcount example

Input (four headlines):
  Google Maps charts new territory into businesses
  Google selling new tools for businesses to build their own maps
  Google promises consumer experience for businesses with Maps Engine Pro
  Google is trying to get its Maps service used by more businesses

Output word counts: Google 4, Maps 4, Businesses 4, Engine 1, Charts 1, Territory 1, Tools 1, …

SLIDE 13

MapReduce – Wordcount’s map

The input is divided into splits, each processed by a map task that emits partial word counts:

  Map 1: Google 2, Charts 1, Maps 2, Territory 1, …
  Map 2: Google 2, Businesses 2, Maps 2, Service 1, …

SLIDE 15

MapReduce – Wordcount’s reduce

The emitted pairs are grouped by key and each reduce task sums the partial counts:

  Reduce 1: Google 2 + Google 2 → Google 4; Maps 2 + Maps 2 → Maps 4; …
  Reduce 2: Businesses 2 + Businesses 2 → Businesses 4; Charts 1; Territory 1; …
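To make the whole flow concrete, here is a minimal single-machine sketch of the same map/shuffle/reduce steps in plain Python (illustrative only; this is not how Hadoop schedules or distributes tasks):

from collections import defaultdict

def map_phase(split):
    # emit partial (word, count) pairs for one input split
    counts = defaultdict(int)
    for line in split:
        for word in line.split():
            counts[word] += 1
    return list(counts.items())

def reduce_phase(pairs):
    # group by key and sum the partial counts
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

splits = [["Google Maps charts new territory into businesses"],
          ["Google is trying to get its Maps service used by more businesses"]]
shuffled = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffled))   # {'Google': 2, 'Maps': 2, 'businesses': 2, ...}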

SLIDE 17

MapReduce

  • Automatic:
    • Partitioning and distribution of data
    • Parallelization and assignment of tasks
    • Scalability, fault tolerance, scheduling

SLIDE 18

Apache Hadoop

  • Open-source implementation of MapReduce

Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php

SLIDE 19

MapReduce – Parallelizable algorithms

  • Matrix-vector multiplication
  • Power iteration (e.g. PageRank)
  • Gradient descent methods
  • Stochastic SVD
  • Matrix factorization (tall-and-skinny QR)
  • etc…

SLIDE 20

MapReduce – Limitations

  • Inefficient for multi-pass algorithms
  • No efficient primitives for data sharing
  • State between steps is materialized and distributed
  • Slow due to replication and storage

Source: http://stanford.edu/~rezab/sparkclass

SLIDE 21

Limitations – PageRank

  • Requires repeated multiplications of a sparse matrix and a vector (see the sketch below)
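A toy sketch of that iteration in Python/NumPy (single machine; the link matrix is illustrative, not the distributed MapReduce formulation):

import numpy as np

# M: column-stochastic link matrix of a 3-page web, d: damping factor
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
d, n = 0.85, 3
r = np.full(n, 1.0 / n)              # initial rank vector
for _ in range(50):                  # each step is one matrix-vector multiplication
    r = (1 - d) / n + d * M.dot(r)
print(r)                             # converged PageRank scores

In MapReduce, each of these multiplications becomes a full job whose intermediate state is written back to disk, which is exactly the bottleneck described on the next slide.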

Source: http://stanford.edu/~rezab/sparkclass

SLIDE 22

Limitations – PageRank

  • MapReduce sometimes requires asymptotically more communication or I/O

  • Iterations are handled very poorly
  • Reading and writing to disk is a bottleneck
  • In some cases 90% of time is spent on I/O

SLIDE 23

Spark Processing Framework

  • Developed in 2009 at UC Berkeley's AMPLab
  • Open-sourced in 2010; later an Apache project
  • Most active big data community
  • Industrial contributions: over 50 companies
  • Written in Scala
    • Scala is good at serializing closures
  • Clean APIs in Java, Scala, Python, R

SLIDE 24

Spark Processing Framework

[Figure: Contributors (2014)]

SLIDE 25

Spark – High Level Architecture

[Diagram: Spark high-level architecture on top of HDFS]

Source: https://mapr.com/ebooks/spark/

SLIDE 26

Spark – Running modes

  • Local mode: for debugging
  • Cluster mode (see the launch sketch below):
    • Standalone mode
    • Apache Mesos
    • Hadoop YARN
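How a packaged application is launched in each mode with spark-submit (a sketch; the script name and host addresses are illustrative):

spark-submit --master local[*] my_app.py            # local mode, all cores
spark-submit --master spark://host:7077 my_app.py   # standalone cluster
spark-submit --master mesos://host:5050 my_app.py   # Apache Mesos
spark-submit --master yarn my_app.py                # Hadoop YARN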

SLIDE 27

Spark – Programming model

  • SparkContext: the entry point
  • SparkSession: since Spark 2.0 (see the sketch below)
    • New unified entry point; combines SQLContext, HiveContext and, in the future, StreamingContext
  • SparkConf: used to initialize the context
  • Spark's interactive shells
    • Scala: spark-shell
    • Python: pyspark
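A minimal pyspark sketch of these entry points (the app name and the local master are illustrative choices):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# SparkConf holds the configuration used to initialize the context
conf = SparkConf().setAppName("praktikum-demo").setMaster("local[*]")

# SparkSession: the unified entry point since Spark 2.0
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext   # the classic SparkContext remains accessible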

SLIDE 28

Spark – RDDs, the game changer

  • Resilient distributed datasets
  • A typed data structure (RDD[T]) that is not language-specific
  • Each element of type T is stored locally on a machine
  • Each element has to fit in memory
  • An RDD can be cached in memory

SLIDE 29

Resilient Distributed Datasets

  • Immutable collections of objects, spread across a cluster
  • User-controlled partitioning and storage (see the sketch below)
  • Automatically rebuilt on failure
  • RDDs are superseded by the Dataset API, which is strongly typed like an RDD (Spark ≥ 2.0)
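A small pyspark sketch of these properties (the partition count is an arbitrary choice):

rdd = sc.parallelize(range(1000), 8)    # user-controlled partitioning: 8 partitions
doubled = rdd.map(lambda x: 2 * x)      # immutability: map returns a new RDD
doubled.cache()                         # keep the result in memory across jobs
print(doubled.getNumPartitions())       # 8
print(doubled.sum())                    # 999000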

SLIDE 30

Spark – Wordcount example

text_file = sc.textFile("...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("...")

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext

SLIDE 31

Spark – Data manipulation

  • Transformations: always yield a new RDD instance (RDDs are immutable)
    • filter, map, flatMap, etc.
  • Actions: trigger a computation on the RDD's elements
    • count, foreach, etc.
  • Lazy evaluation of transformations (see the sketch below)
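A brief sketch of lazy evaluation (the data is illustrative):

lines = sc.parallelize(["a b", "b c", "c"])
words = lines.flatMap(lambda l: l.split())   # transformation: nothing runs yet
kept = words.filter(lambda w: w != "c")      # still lazy
print(kept.count())                          # action: triggers the chain, prints 3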

SLIDE 32

Spark – DataFrames

  • DataFrame API, introduced in Spark 1.3
  • Handles a table-like representation with named columns and declared column types
  • Not to be confused with Python's pandas DataFrames
  • DataFrames translate SQL code into low-level RDD operations
  • Since Spark 2.0, DataFrame is implemented as a special case of Dataset

SLIDE 33

DataFrames – How to create DFs

  1. Converting existing RDDs
  2. Running SQL queries
  3. Loading external data (all three are sketched below)
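A brief pyspark sketch of the three options, assuming the SparkSession `spark` from the earlier sketch (names and the JSON path are illustrative):

from pyspark.sql import Row

# 1. Converting an existing RDD of Rows
rdd = sc.parallelize([Row(name="Alice", occupation="student")])
people = spark.createDataFrame(rdd)

# 2. Running a SQL query over a registered view
people.createOrReplaceTempView("people")
students = spark.sql("SELECT name FROM people WHERE occupation = 'student'")

# 3. Loading external data
df = spark.read.json("people.json")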

SLIDE 34

Spark SQL

  • SQL context

http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

# Run SQL statements; returns a DataFrame
students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'")

SLIDE 35

Spark – DataFrames

Source: Spark in Action (book, see literature)

SLIDE 36

Machine Learning (ML) with Spark

  • ML project steps

  1. Data collection
  2. Data cleaning and preparation
  3. Data analysis and feature extraction
  4. Model training
  5. Model evaluation
  6. Model application

Source: Spark in Action (book, see literature)

SLIDE 37

Machine Learning (ML) with Spark

  • ML with Spark
    • Perfect for parallelizable ML algorithms!
    • A single platform (the same system and the same API) for performing most tasks:
      • Collecting, preparing, and analyzing the data
      • Training, evaluating, and using the model
    • Training and applying ML algorithms on very large datasets
    • Offers most of the popular ML algorithms

SLIDE 38

Machine Learning (ML) with Spark

  • MLlib
    • Spark's machine learning library
    • Provides a generalized API for training and tuning different algorithms in the same way (influenced by scikit-learn)
    • Relies on several low-level libraries for optimized linear algebra operations:
      • Breeze and jblas for Scala and Java
      • NumPy for Python

SLIDE 39

Machine Learning (ML) with Spark

  • MLlib has two APIs:
    • RDD-based API (spark.mllib): will be removed in Spark 3.0
    • DataFrame-based API (spark.ml): will keep receiving new features
  • More user-friendly API than RDDs
  • A uniform API across ML algorithms and across multiple languages
  • Facilitates practical ML pipelines (feature transformations)

SLIDE 40

MLlib abstractions

  • Transformer
    • Main method: transform
    • Examples: ML model, feature transformer
  • Estimator
    • Main method: fit
    • Example: ML algorithm
  • Evaluator
    • Example: RMSE metric

[Diagram: an Estimator fits an input dataset and produces a Transformer; the Transformer turns a dataset into a transformed dataset; an Evaluator estimates evaluation results from it]

Source: Spark in Action (book, see literature)
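A compact pyspark sketch of the three abstractions (the column names and the train_df/test_df DataFrames are assumed to exist):

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

lr = LinearRegression(featuresCol="features", labelCol="label")   # Estimator
model = lr.fit(train_df)                 # fit() returns a Transformer (the model)
predictions = model.transform(test_df)   # transform() appends a "prediction" column

evaluator = RegressionEvaluator(metricName="rmse", labelCol="label")   # Evaluator
print(evaluator.evaluate(predictions))   # RMSE metric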

SLIDE 41
MLlib Pipelines

  • A pipeline chains multiple Transformers and Estimators together to specify an ML workflow
  • Example: learn a prediction model using features extracted from text documents, with a training phase and a test phase (see the sketch below)

Source: http://spark.apache.org/docs/latest/ml-pipeline.html#properties-of-pipeline-components
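Roughly following the example in the linked documentation, such a text pipeline might look like this (the parameters and the training/test DataFrames are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

model = pipeline.fit(training)        # training phase: fits all stages in order
predictions = model.transform(test)   # test phase: runs the fitted stages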

SLIDE 42

Organization

  • Introduction to Spark
  • Case study: Recommender system for scientific papers

  • Organization
  • Hands-on session

SLIDE 43

Case study: Recommender system for scientific papers

  • Motivation
    • Recommend relevant papers to users
  • Dataset
    • Set of papers (~172 K)
      • Textual content: title + abstract
      • Attributes: type, journal, pages, year, …
    • Set of users (~28 K)
    • Ratings (~828 K)

SLIDE 44

Organization

  • Introduction to Spark
  • Case study: Recommender system for scientific papers

  • Organization
  • Hands-on session

SLIDE 45

Organization

  • Team
  • Educational goals
  • Requirements
  • ILIAS
  • Experiments’ submissions
  • Assessment
  • Discussion with the tutors
  • Schedule

SLIDE 46

Team

  • Prof. Georg Lausen
  • Assistants
    • Anas
    • Anthony
  • Tutors
    • Polina Koleva
    • Matteo Cossu

SLIDE 47

Educational goals

  • Distributed programming paradigm
  • Recommender Systems (use case)
  • Theoretical and practical training
  • Master project and thesis
  • Data science profile for the job market

SLIDE 48

Requirements

  • Mandatory
    • Registration via HISinOne
    • Attendance at the kick-off meeting
  • Recommended
    • Attendance of the DAQL, SIDS, or ML lectures
    • Basics in Python programming

SLIDE 49

ILIAS

  • Course: Distributed Computing Using Spark - WS1718
    https://ilias.uni-freiburg.de/goto.php?target=crs_878841
  • Access with the course password
  • Forum for clarification questions about the tasks
    • Do not post solutions or suggestions

SLIDE 50

Experiments’ submissions

  • 6 experiments, 2-3 weeks of working time each
  • Submissions in groups of 2 students (form your group)
  • Submissions via ILIAS

SLIDE 51

Assessment

  • Each experiment: 50 points; 300 points overall
  • At least 70% of the points are required to pass
  • Corrections done by tutors

SLIDE 52

Discussion of solutions with tutors

  • Mandatory attendance
  • Each member has to be able to explain all tasks!
    • Otherwise: 0 points for that task
  • Copied solutions:
    • First time: 0 points for that experiment
    • Second time: failure of the practical

SLIDE 53

Schedule

Experiment | Content                                                             | Release    | Submission      | Discussion
1          | Familiarizing with Tools, Loading Data, and Basic Analysis of Data | 18.10.2017 | 01.11.2017, 11h | 08.11.2017
2          | Experiment 2                                                        | 01.11.2017 | 15.11.2017, 11h | 22.11.2017
3          | Experiment 3                                                        | 15.11.2017 | 29.11.2017, 11h | 06.12.2017
4          | Experiment 4                                                        | 29.11.2017 | 13.12.2017, 11h | 20.12.2017
5          | Experiment 5                                                        | 13.12.2017 | 10.01.2018, 11h | 17.01.2018
6          | Experiment 6                                                        | 10.01.2018 | 31.01.2018, 11h | 07.02.2018

SLIDE 54

Literature

  • Spark in Action [book] by Petar Zečević and Marko Bonaći
  • Machine Learning with Spark [book] by Nick Pentreath
  • Apache Spark documentation: http://spark.apache.org/docs/latest

SLIDE 55

Organization

  • Introduction to Spark
  • Case study: Recommender system for scientific papers

  • Organization
  • Hands-on session
