SLIDE 1

Big Data Management & Analytics

EXERCISE 3

16th of November 2015

Sabrina Friedl LMU Munich

SLIDE 2
  • 1. Revision of Lecture

PARALLEL COMPUTING, MAPREDUCE

SLIDE 3

Parallel Computing Architectures

  • Required to analyse large amounts of data
  • Organisation of distributed file systems

 Replicas of files on different nodes
 Master node with directory of file copies

 Examples: Google File System, Hadoop DFS

  • Goals

 Fault tolerance
 Parallel execution of tasks

SLIDE 4

MapReduce - Motivation

MapReduce: Programming model for the parallel processing of Big Data on clusters

  • Stores data that is processed together close to each other and close to the worker (data locality)

  • Handles data flow, parallelization and coordination of tasks automatically
  • Copes with failures and stragglers
SLIDE 5

MapReduce – Processing (High Level)

(Diagram: the master node assigns map tasks and reduce tasks to the worker nodes.)

SLIDE 6

MapReduce – Programming Model

Transforms a set of input key-value pairs into a set of output key-value pairs

  • Step 1: map(k1, v1) -> list(k2, v2)
  • Step 2: sort by k2 -> list(k2, list(v2))
  • Step 3: reduce(k2, list(v2)) -> list(k3, v3)
  • -> The programmer specifies the map() and reduce() functions
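The three steps above can be sketched in plain Python (no Spark required), using word count as the running example; map_fn, reduce_fn and mapreduce are illustrative names, not part of any API.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(k1, v1):
    # Step 1: map(k1, v1) -> list(k2, v2); emit (word, 1) for every word in the line
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # Step 3: reduce(k2, list(v2)) -> list(k3, v3); sum the counts for one word
    return [(k2, sum(values))]

def mapreduce(records):
    # Apply map_fn to every input pair
    mapped = [kv for k, v in records for kv in map_fn(k, v)]
    # Step 2: sort by k2 and group -> list(k2, list(v2))
    mapped.sort(key=itemgetter(0))
    grouped = [(k, [v for _, v in group])
               for k, group in groupby(mapped, key=itemgetter(0))]
    # Apply reduce_fn to each group
    return [kv for k, vs in grouped for kv in reduce_fn(k, vs)]

result = mapreduce([(0, "vanilla mango vanilla")])
# result == [('mango', 1), ('vanilla', 2)]
```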
SLIDE 7

MapReduce – Word Count

Pipeline: Input -> Partition -> Map -> Shuffle & Sort -> Reduce -> Output

Input (two partitions): Amarena Strawberry Vanilla Mango Stracciatella | Strawberry Amarena Stracciatella Amarena
Map: (Amarena, 1) (Strawberry, 1) (Vanilla, 1) (Mango, 1) (Stracciatella, 1) | (Strawberry, 1) (Amarena, 1) (Stracciatella, 1) (Amarena, 1)
Shuffle & Sort: (Amarena, [1, 1, 1]) (Strawberry, [1, 1]) (Vanilla, [1]) (Mango, [1]) (Stracciatella, [1, 1])
Reduce -> Output: (Amarena, 3) (Strawberry, 2) (Vanilla, 1) (Mango, 1) (Stracciatella, 2)

SLIDE 8

MapReduce – Matrix Multiplication

Matrix multiplication can be written as a sequence of MapReduce steps:

  • 1. Map
  • 2. Join
  • 3. Map
  • 4. ReduceByKey
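A minimal pure-Python sketch of these four steps, assuming the common sparse representation of matrices as (row, column, value) triples; the sample matrices are made up for illustration. In Spark the same dataflow would use rdd.join() and rdd.reduceByKey().

```python
from collections import defaultdict

# A (2x2) and B (2x2) as sparse entry lists: (row, col, value)
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]

# Step 1 (Map): key each entry by the shared dimension j,
# i.e. A by its column index, B by its row index
a_by_j = [(j, (i, a)) for (i, j, a) in A]
b_by_j = [(j, (k, b)) for (j, k, b) in B]

# Step 2 (Join): pair up A and B entries that share the same j
b_index = defaultdict(list)
for j, kb in b_by_j:
    b_index[j].append(kb)
joined = [(j, (ia, kb)) for j, ia in a_by_j for kb in b_index[j]]

# Step 3 (Map): emit the partial products ((i, k), a*b)
partials = [((i, k), a * b) for _, ((i, a), (k, b)) in joined]

# Step 4 (ReduceByKey): sum the partial products per output cell (i, k)
C = defaultdict(int)
for key, v in partials:
    C[key] += v

# C == {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```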
SLIDE 9
  • 2. Spark and PySpark

WORKING WITH PYSPARK

SLIDE 10

Apache Spark™

Open source framework for cluster computing

  • Cluster managers that Spark runs on:

Hadoop YARN, Apache Mesos, standalone

  • Distributed Storage Systems that can be used:

Hadoop Distributed File System, Cassandra, HBase, Hive

Homepage: http://spark.apache.org/docs/latest/index.html

SLIDE 11

PySpark - Usage

Spark Python API

  • Spark shell (Scala): $ ./bin/spark-shell (use '\' as path separator on Windows)
  • PySpark shell: $ ./bin/pyspark
  • Use in Python program:

Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html#overview Quick Start Guide: http://spark.apache.org/docs/latest/quick-start.html

from pyspark import SparkConf, SparkContext
sc = SparkContext('local')

SLIDE 12

PySpark – Main Concepts

Resilient distributed dataset (RDD)*

  • Collection of elements that can be operated on in parallel
  • To work with data in Spark, RDDs have to be created
  • Examples

Actions and transformations, lazy evaluation principle*

* see next lectures

sc = SparkContext('local')
data = sc.parallelize([1, 2, 3, 4])  # use sc.parallelize() to create an RDD from a list
lines = sc.textFile("text.txt")      # create an RDD from a text file

SLIDE 13

PySpark – Working with MapReduce

MapReduce in PySpark

  • rdd.map(f) -> returns a new RDD (transformation)
  • rdd.reduce(f) -> returns a single aggregated value (action)
  • rdd.collect() -> returns the content of the RDD as a list (action)
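As a plain-Python analogy (not Spark itself), the built-ins map() and functools.reduce() show the shapes of the inputs and outputs of these calls:

```python
from functools import reduce

data = [1, 2, 3, 4]

# like rdd.map(f): element-wise transformation; in Spark this yields a new RDD
squared = list(map(lambda x: x * x, data))   # [1, 4, 9, 16]

# like rdd.reduce(f): folds all elements pairwise into ONE value (an action in Spark)
total = reduce(lambda a, b: a + b, data)     # 10

# rdd.collect() would materialise the distributed elements as a local list;
# here the data is already a local list, so there is nothing left to collect
```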

Examples: see code examples provided on course website

SLIDE 14
  • 3. Exercises

Will be discussed during the next exercise on 23rd of November

SLIDE 15

Exercises

Install Spark on your computer and configure your IDE to work with PySpark (shown for Anaconda/PyCharm on Windows).

1. Implement the word count example in PySpark. Use any text file you like.
2. Implement the matrix multiplication example in PySpark.

 Use the prepared code in matrixMultiplication_template.py and implement the missing parts

3. Implement K-Means in PySpark (see lecture slides).

 Define or generate some points to do the clustering on and initialize 3 centroids.
 Write two functions assign_to_centroid(point) and calculate_new_centroids(*cluster_points) to use in your map() and reduce() calls.
 Apply map() and reduce() iteratively and print out the new centroids as a list in each step.
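A minimal pure-Python sketch of one K-Means iteration, using the two function names from the exercise; the sample points and initial centroids are invented for illustration, and a full solution would run this dataflow via rdd.map() and rdd.reduceByKey() over several iterations.

```python
# Invented 2D sample points and 3 initial centroids for illustration
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5), (0.5, 1.5), (8.5, 9.0)]
centroids = [(1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]

def assign_to_centroid(point):
    # map step: emit (index of the nearest centroid, point)
    dists = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2 for cx, cy in centroids]
    return (dists.index(min(dists)), point)

def calculate_new_centroids(*cluster_points):
    # reduce step: mean of all points assigned to one centroid
    n = len(cluster_points)
    return (sum(p[0] for p in cluster_points) / n,
            sum(p[1] for p in cluster_points) / n)

# one map/reduce round
assignments = [assign_to_centroid(p) for p in points]
new_centroids = []
for idx in range(len(centroids)):
    members = [p for i, p in assignments if i == idx]
    if members:
        new_centroids.append(calculate_new_centroids(*members))
    else:
        new_centroids.append(centroids[idx])  # keep empty clusters unchanged
print(new_centroids)
```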