Albert-Ludwigs-Universität Freiburg
Practical / Praktikum WS17/18
October 18th, 2017
Distributed Computing Using Spark
- Prof. Dr. Georg Lausen
Anas Alzogbi Victor Anthony Arrascue Ayala
Agenda
- Introduction to Spark
- Case study: …
18.10.2017 Distributed Computing Using Spark WS17/18 2
Source: https://www.flickr.com/photos/will-lion/2595497078
Source: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
Source: http://www.bigdata-startups.com/open-source-tools/
[Diagram: MapReduce data flow. The input data is divided into splits (split 0, split 1, split 2); each split is processed by a Map task that emits <k,v> pairs, which are then aggregated by Reduce tasks.]
Input (four headlines about Google Maps):
- "Google Maps charts new territory into businesses"
- "Google selling new tools for businesses to build their own maps"
- "Google promises consumer experience for businesses with Maps Engine Pro"
- "Google is trying to get its Maps service used by more businesses"
Desired word counts: Google 4, Maps 4, Businesses 4, Engine 1, Charts 1, Territory 1, Tools 1, …
Map phase (with local combining): the input is divided into two splits of two headlines each, and each mapper emits partial counts for its split.
Split 1: Google 2, Charts 1, Maps 2, Territory 1, …
Split 2: Google 2, Businesses 2, Maps 2, Service 1, …
Shuffle and reduce: the partial counts are grouped by key (Google: 2, 2; Maps: 2, 2; Businesses: 2, 2; Charts: 1; Territory: 1; …), and each reducer sums the values per key to produce the final counts: Google 4, Maps 4, Businesses 4, Charts 1, Territory 1, …
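The map, shuffle, and reduce steps above can be sketched in plain Python as a single-process simulation (illustrative only, not Spark or Hadoop code; the function names are ours):

```python
from collections import defaultdict

headlines = [
    "Google Maps charts new territory into businesses",
    "Google selling new tools for businesses to build their own maps",
    "Google promises consumer experience for businesses with Maps Engine Pro",
    "Google is trying to get its Maps service used by more businesses",
]

def map_phase(split):
    # Emit (word, count) pairs for one split, combining locally per split.
    partial = defaultdict(int)
    for line in split:
        for word in line.lower().split():
            partial[word] += 1
    return partial.items()

def shuffle(mapped_splits):
    # Group all partial counts by key, across splits.
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for word, n in pairs:
            groups[word].append(n)
    return groups

def reduce_phase(groups):
    # Sum the grouped values per key to get the final counts.
    return {word: sum(ns) for word, ns in groups.items()}

splits = [headlines[:2], headlines[2:]]  # two input splits, as on the slide
counts = reduce_phase(shuffle(map_phase(s) for s in splits))
print(counts["google"], counts["maps"], counts["businesses"])  # → 4 4 4
```

Each mapper sees only its own split, so the combine step can at best produce per-split counts; only after the shuffle brings all values for a key to one reducer can the global count be computed.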
Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php
Source: http://stanford.edu/~rezab/sparkclass
Source: http://stanford.edu/~rezab/sparkclass
Source: https://mapr.com/ebooks/spark/
text_file = sc.textFile("...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("...")
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
# Run SQL statements. Returns a DataFrame
students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'")
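Note that the query can only refer to `people` after the DataFrame has been registered as a table (in Spark 1.x with `registerTempTable`, in Spark 2.x with `createOrReplaceTempView`). What the query itself computes can be mimicked in plain Python over a list of records (illustrative only; the rows below are made up, and `people`, `name`, and `occupation` are the slide's assumed table and columns):

```python
# Hypothetical rows standing in for the slide's `people` table.
people = [
    {"name": "Alice", "occupation": "student"},
    {"name": "Bob",   "occupation": "professor"},
    {"name": "Carol", "occupation": "student"},
]

# Equivalent of: SELECT name FROM people WHERE occupation = 'student'
students = [row["name"] for row in people if row["occupation"] == "student"]
print(students)  # → ['Alice', 'Carol']
```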
Source: Spark in Action (book, see literature)
1. Data collection
2. Data cleaning and preparation
3. Data analysis and feature extraction
4. Model training
5. Model evaluation
6. Model application
Source: Spark in Action (book, see literature)
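The six workflow steps can be run end to end on a toy problem in plain Python (illustrative only; the data is made up, and a real project would use Spark MLlib at scale):

```python
# 1. Data collection: gather raw (x, y) observations.
raw = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8)]

# 2. Data cleaning and preparation: drop records with missing labels.
data = [(x, y) for x, y in raw if y is not None]

# 3. Data analysis and feature extraction: here the feature is simply x.
xs = [x for x, _ in data]

# 4. Model training: least-squares slope for the model y ≈ a * x.
a = sum(x * y for x, y in data) / sum(x * x for x in xs)

# 5. Model evaluation: mean squared error on the training data.
mse = sum((y - a * x) ** 2 for x, y in data) / len(data)

# 6. Model application: predict the label for a new input.
prediction = a * 6
```

In practice steps 4 and 5 would use held-out data rather than evaluating on the training set, but the toy run shows how each step feeds the next.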
[Diagram: an Estimator is fit on an input dataset and estimates a Transformer; the Transformer transforms the input dataset into a transformed dataset; an Evaluator produces evaluation results from it.]
Source: Spark in Action (book, see literature)
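The fit/transform/evaluate contract can be sketched with minimal Python classes (the class names mirror the Spark ML concepts, but this is not Spark API code; the mean-imputation example is ours):

```python
class MeanImputerEstimator:
    """Estimator: fit() learns parameters from a dataset and returns a Transformer."""
    def fit(self, dataset):
        observed = [x for x in dataset if x is not None]
        mean = sum(observed) / len(observed)
        return MeanImputerTransformer(mean)

class MeanImputerTransformer:
    """Transformer: transform() maps an input dataset to a transformed dataset."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, dataset):
        return [self.mean if x is None else x for x in dataset]

class CompletenessEvaluator:
    """Evaluator: evaluate() scores a (transformed) dataset."""
    def evaluate(self, dataset):
        return sum(x is not None for x in dataset) / len(dataset)

dataset = [1.0, None, 3.0]
transformer = MeanImputerEstimator().fit(dataset)      # Fit
transformed = transformer.transform(dataset)           # Transform
score = CompletenessEvaluator().evaluate(transformed)  # Evaluate
```

The key property, as in Spark ML, is that fitting and transforming are separate: the Estimator is trained once, and the resulting Transformer can then be applied to any dataset with the same schema.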
Training phase: learn a prediction model using features extracted from text documents.
Source: http://spark.apache.org/docs/latest/ml-pipeline.html#properties-of-pipeline-components
Test phase
Experiment | Content                                                            | Release    | Submission      | Discussion
1          | Familiarizing with Tools, Loading Data, and Basic Analysis of Data | 18.10.2017 | 01.11.2017, 11h | 08.11.2017
2          | Experiment 2                                                       | 01.11.2017 | 15.11.2017, 11h | 22.11.2017
3          | Experiment 3                                                       | 15.11.2017 | 29.11.2017, 11h | 06.12.2017
4          | Experiment 4                                                       | 29.11.2017 | 13.12.2017, 11h | 20.12.2017
5          | Experiment 5                                                       | 13.12.2017 | 10.01.2018, 11h | 17.01.2018
6          | Experiment 6                                                       | 10.01.2018 | 31.01.2018, 11h | 07.02.2018