Distributed Computing Using Spark – Practical / Praktikum WS17/18


  1. Distributed Computing Using Spark – Practical / Praktikum WS17/18
     October 18th, 2017, Albert-Ludwigs-Universität Freiburg
     Prof. Dr. Georg Lausen, Anas Alzogbi, Victor Anthony Arrascue Ayala

  2. Agenda
     - Introduction to Spark
     - Case study: Recommender system for scientific papers
     - Organization
     - Hands-on session

  4. Introduction to Spark
     - Distributed programming
     - MapReduce
     - Spark

  5. Distributed programming – the problem
     - Data grows faster than processing capabilities
       - Web 2.0: users generate content
       - Social networks, online communities, etc.
     Source: https://www.flickr.com/photos/will-lion/2595497078

  6. Big Data
     Sources: http://www.bigdata-startups.com/open-source-tools/
              https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/

  7. Big Data
     - Buzzword
     - Often less structured
     - Requires different techniques, tools, and approaches
       - To solve new problems, or old ones in a better way

  8. Network programming models
     - Require a communication protocol for programming parallel computers (slow), e.g. MPI
     - Locality of data and code across the network must be managed manually
     - No failure management
     - Network problems (e.g. stragglers) are not addressed

  9. Data flow models
     - Higher level of abstraction: algorithms are parallelized on large clusters
     - Fault recovery by means of data replication
     - Job divided into a set of independent tasks; code is shipped to where the data is located
     - Good scalability

  10. MapReduce – Key ideas
      1. The problem is split into smaller problems (map step)
      2. The smaller problems are solved in a parallel fashion
      3. Finally, the solutions to the smaller problems are synthesized into a solution of the original problem (reduce step)

  11. MapReduce – Overview
      [diagram: the input data is divided into splits (split 0, split 1, split 2, …); one Map task per split emits <k,v> pairs, which Reduce tasks aggregate into output files (output 0, output 1, …)]
      A target problem has to be parallelizable!

  12. MapReduce – Wordcount example
      Input: news sentences such as "Google Maps charts new territory into businesses", "Google selling new tools for businesses to build their own maps", "Google promises consumer experience for businesses with Maps Engine Pro", "Google is trying to get its Maps service used by more businesses"
      Output: Google 4, Maps 4, Businesses 4, Engine 1, Charts 1, Territory 1, Tools 1, …

  13. MapReduce – Wordcount's map
      First mapper (sentences 1–2):  Google 2, Maps 2, Charts 1, Territory 1, …
      Second mapper (sentences 3–4): Google 2, Maps 2, Businesses 2, Service 1, …

  15. MapReduce – Wordcount's reduce
      First reducer:  Google 2 + 2 → Google 4;  Maps 2 + 2 → Maps 4; …
      Second reducer: Businesses 2 + 2 → Businesses 4;  Charts 1 → Charts 1;  Territory 1 → Territory 1; …
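
      To make the phases above concrete, here is a minimal pure-Python
      simulation of the wordcount job (an illustrative sketch only; the
      sentence list abbreviates the slides' input, and in real MapReduce
      the shuffle/grouping happens inside the framework):

      from collections import defaultdict

      # Two input splits, as on the slides.
      splits = [
          ["Google Maps charts new territory into businesses",
           "Google selling new tools for businesses to build their own maps"],
          ["Google promises consumer experience for businesses with Maps Engine Pro",
           "Google is trying to get its Maps service used by more businesses"],
      ]

      # Map phase: each mapper emits (word, 1) pairs for its split.
      mapped = [(word.lower(), 1)
                for split in splits
                for line in split
                for word in line.split()]

      # Shuffle phase: group the pairs by key.
      groups = defaultdict(list)
      for word, one in mapped:
          groups[word].append(one)

      # Reduce phase: sum the counts for each word.
      counts = {word: sum(ones) for word, ones in groups.items()}
      print(counts["google"], counts["maps"], counts["businesses"])  # 4 4 4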

  17. MapReduce
      Automatic:
      - Partition and distribution of data
      - Parallelization and assignment of tasks
      - Scalability, fault-tolerance, scheduling

  18. Apache Hadoop
      - Open-source implementation of MapReduce
      Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php

  19. MapReduce – Parallelizable algorithms
      - Matrix-vector multiplication
      - Power iteration (e.g. PageRank)
      - Gradient descent methods
      - Stochastic SVD
      - Matrix factorization (Tall Skinny QR)
      - etc.
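
      As a sketch of why, for example, matrix-vector multiplication fits the
      model: map tasks can emit partial products keyed by row index, and
      reduce tasks sum them. A minimal plain-Python illustration (the triple
      representation and variable names are our own, not from the slides):

      from collections import defaultdict

      # Sparse matrix A as (row, col, value) triples, and a dense vector v.
      A = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]
      v = [1.0, 4.0]

      # Map: emit (row, A[row,col] * v[col]) partial products.
      partials = [(row, value * v[col]) for row, col, value in A]

      # Reduce: sum the partial products per row to obtain A·v.
      result = defaultdict(float)
      for row, p in partials:
          result[row] += p
      print(dict(result))  # {0: 6.0, 1: 12.0}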

  20. MapReduce – Limitations
      - Inefficient for multi-pass algorithms
      - No efficient primitives for data sharing
      - State between steps is materialized and distributed
      - Slow due to replication and storage
      Source: http://stanford.edu/~rezab/sparkclass

  21. Limitations – PageRank
      - Requires repeated multiplications of a sparse matrix and a vector
      Source: http://stanford.edu/~rezab/sparkclass
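
      A toy power-iteration sketch (dense, single-machine, with a made-up
      3-page link matrix) showing why PageRank is iterative: every iteration
      is one matrix-vector multiplication, and under MapReduce each iteration
      becomes a separate job that rereads and rewrites its state on disk:

      import numpy as np

      # Column-stochastic link matrix of a toy 3-page web.
      M = np.array([[0.0, 0.5, 0.5],
                    [0.5, 0.0, 0.5],
                    [0.5, 0.5, 0.0]])
      beta = 0.85                      # damping factor
      n = M.shape[0]
      r = np.full(n, 1.0 / n)          # start from the uniform distribution

      for _ in range(50):              # one matrix-vector multiply per iteration
          r = beta * (M @ r) + (1 - beta) / n
      print(r)                         # converges to the PageRank vector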

  22. Limitations – PageRank
      - MapReduce sometimes requires asymptotically more communication or I/O
      - Iterations are handled very poorly
      - Reading from and writing to disk is a bottleneck; in some cases 90% of the time is spent on I/O

  23. Spark Processing Framework
      - Developed in 2009 at UC Berkeley's AMPLab
      - Open-sourced in 2010, now an Apache project
        - Most active big data community
        - Industrial contributions from over 50 companies
      - Written in Scala
        - Good at serializing closures
      - Clean APIs in Java, Scala, Python, and R

  24. Spark Processing Framework
      [chart: Spark contributors, 2014]

  25. Spark – High-level architecture
      [diagram: the Spark stack running on top of HDFS]
      Source: https://mapr.com/ebooks/spark/

  26. Spark – Running modes
      - Local mode: for debugging
      - Cluster mode:
        - Standalone mode
        - Apache Mesos
        - Hadoop YARN

  27. Spark – Programming model
      - SparkContext: the entry point
      - SparkSession: since Spark 2.0, the new unified entry point; it combines SQLContext, HiveContext, and, in the future, StreamingContext
      - SparkConf: used to initialize the context
      - Spark's interactive shells: spark-shell (Scala), pyspark (Python)
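
      A minimal PySpark sketch of both entry points (the app name and master
      URL are placeholders of ours; in one program you would normally use
      only one of the two styles):

      from pyspark import SparkConf, SparkContext
      from pyspark.sql import SparkSession

      # Pre-2.0 style: a SparkConf initializes the SparkContext.
      conf = SparkConf().setAppName("praktikum").setMaster("local[*]")
      sc = SparkContext(conf=conf)

      # Since Spark 2.0: the SparkSession is the unified entry point;
      # the underlying context is available as spark.sparkContext.
      spark = SparkSession.builder.appName("praktikum").getOrCreate()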

  28. Spark – RDDs, the game changer
      - Resilient Distributed Datasets
      - A typed data structure (RDD[T]) that is not language-specific
      - Each element of type T is stored locally on some machine; it has to fit in that machine's memory
      - An RDD can be cached in memory

  29. Resilient Distributed Datasets
      - Immutable collections of objects, spread across a cluster
      - User-controlled partitioning and storage
      - Automatically rebuilt on failure
      - Since Spark 2.0, RDDs are superseded by Datasets, which are strongly typed like RDDs
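
      For instance, a small sketch of creating, partitioning, and caching an
      RDD in PySpark (assuming a live SparkContext named sc):

      # Distribute a local collection over 4 partitions.
      rdd = sc.parallelize(range(1000000), numSlices=4)

      # Ask Spark to keep the RDD in memory across actions.
      rdd.cache()

      print(rdd.getNumPartitions())  # 4
      print(rdd.count())             # 1000000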

  30. Spark – Wordcount example
      text_file = sc.textFile("...")
      counts = text_file.flatMap(lambda line: line.split(" ")) \
                        .map(lambda word: (word, 1)) \
                        .reduceByKey(lambda a, b: a + b)
      counts.saveAsTextFile("...")
      http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext

  31. Spark – Data manipulation
      - Transformations: always yield a new RDD instance (RDDs are immutable), e.g. filter, map, flatMap
      - Actions: trigger a computation on the RDD's elements, e.g. count, foreach
      - Transformations are evaluated lazily
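
      A small sketch of the distinction (again assuming a SparkContext sc):
      the filter below only records lineage; nothing is computed until an
      action runs:

      nums = sc.parallelize(range(10))

      # Transformation: returns a new RDD immediately, computes nothing.
      evens = nums.filter(lambda x: x % 2 == 0)

      # Actions: trigger the actual computation of the whole lineage.
      print(evens.count())    # 5
      print(evens.collect())  # [0, 2, 4, 6, 8]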

  32. Spark – DataFrames
      - The DataFrame API was introduced in Spark 1.3
      - Handles a table-like representation with named columns and declared column types
      - Not to be confused with Python's pandas DataFrames
      - DataFrame operations, including SQL, are translated into low-level RDD operations
      - Since Spark 2.0, DataFrame is implemented as a special case of Dataset
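
      A brief sketch of the table-like representation (assuming a
      SparkSession named spark; the sample rows are invented):

      # Named columns with inferred types (name: string, age: bigint).
      df = spark.createDataFrame([("Alice", 25), ("Bob", 30)],
                                 ["name", "age"])
      df.printSchema()

      # Declarative, SQL-like operations instead of low-level RDD code.
      df.filter(df.age > 26).select("name").show()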

  33. DataFrames – How to create DataFrames
      1. Convert existing RDDs
      2. Run SQL queries
      3. Load external data
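
      A sketch of all three routes in PySpark (the file path, table name, and
      sample data are placeholders of ours, not from the slides):

      from pyspark.sql import Row

      # 1. Convert an existing RDD of Row objects.
      rdd = sc.parallelize([Row(name="Alice", occupation="student"),
                            Row(name="Bob", occupation="professor")])
      people = spark.createDataFrame(rdd)

      # 2. Run a SQL query (the table must be registered first).
      people.createOrReplaceTempView("people")
      students = spark.sql("SELECT name FROM people WHERE occupation = 'student'")

      # 3. Load external data, e.g. a JSON file.
      df = spark.read.json("path/to/people.json")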

  34. Spark SQL
      # Run SQL statements; returns a DataFrame
      students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'")
      http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

  35. Spark – DataFrames
      [diagram from Spark in Action (book, see literature)]
