

SLIDE 1

Apache Spark

CS240A, T. Yang

Some slides are based on P. Wendell's Spark slides.

SLIDE 2

Parallel Processing using Spark+Hadoop

  • Hadoop: a distributed file system that connects machines.
  • MapReduce: a parallel programming style built on a Hadoop cluster.
  • Spark: Berkeley's design of MapReduce programming.
  • A file is treated as a big list:
    § A file may be divided into multiple parts (splits).
  • Each record (line) is processed by a Map function,
    § which produces a set of intermediate key/value pairs.
  • Reduce: combines the set of values for the same key (see the sketch below).
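
To make the pattern concrete, here is a minimal single-machine Python sketch of this map/reduce flow (illustrative only, not Hadoop code; map_fn and reduce_fn are hypothetical names):

    from collections import defaultdict

    def map_fn(line):
        # Map: each record (line) produces intermediate key/value pairs.
        return [(word, 1) for word in line.split()]

    def reduce_fn(key, values):
        # Reduce: combine the set of values collected for the same key.
        return (key, sum(values))

    lines = ["to be or", "not to be"]         # the file, treated as a big list
    intermediate = defaultdict(list)
    for line in lines:                        # "map" phase
        for key, value in map_fn(line):
            intermediate[key].append(value)

    results = [reduce_fn(k, vs) for k, vs in intermediate.items()]  # "reduce" phase
    print(results)   # => [('to', 2), ('be', 2), ('or', 1), ('not', 1)]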

SLIDE 3

Python Examples for List Processing

Lists:

    >>> lst = [3, 1, 4, 1, 5]
    >>> lst[0]                  # => 3
    >>> lst.append(2)
    >>> len(lst)                # => 6
    >>> lst.sort()
    >>> lst.insert(4, "Hello")
    >>> [1] + [2]               # => [1, 2]

    for i in [5, 4, 3, 2, 1]: print i

Python tuples:

    >>> num = (1, 2, 3, 4)
    >>> num + (5,)              # => (1, 2, 3, 4, 5); note the comma: (5) is just an int

List comprehensions:

    >>> S = [x**2 for x in range(10)]        # => [0, 1, 4, 9, 16, ..., 81]
    >>> M = [x for x in S if x % 2 == 0]

Strings and sets:

    >>> words = 'The quick brown fox jumps over the lazy dog'.split()
    >>> numset = set([1, 2, 3, 2])           # duplicated entries are deleted
    >>> numset = frozenset([1, 2, 3])        # such a set cannot be modified
    >>> words = 'hello lazy dog'.split()     # => ['hello', 'lazy', 'dog']
    >>> stuff = [(w.upper(), len(w)) for w in words]
    # => [('HELLO', 5), ('LAZY', 4), ('DOG', 3)]

SLIDE 4

Python map/reduce

    a = [1, 2, 3]
    b = [4, 5, 6, 7]
    c = [8, 9, 1, 2, 3]
    f = lambda x: len(x)
    L = map(f, [a, b, c])          # => [3, 4, 5]

    g = lambda x, y: x + y
    reduce(g, [47, 11, 42, 13])    # => 113
    # (in Python 3: from functools import reduce; map() returns an iterator)

SLIDE 5

MapReduce programming with Spark: key concept

Write programs in terms of operations on implicitly distributed datasets (RDDs).

RDD: Resilient Distributed Datasets
  • Like a big list:
    § collections of objects spread across a cluster, stored in RAM or on disk.
  • Built through parallel transformations.
  • Automatically rebuilt on failure.

Operations
  • Transformations (e.g. map, filter, groupBy); see the sketch below.
  • Make sure inputs/outputs match.
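
A small PySpark sketch of these concepts (assumes an existing SparkContext named sc, as in the interactive shell): transformations only record lineage, and nothing is computed until an action runs.

    nums = sc.parallelize([1, 2, 3, 4])            # build an RDD from a collection
    squares = nums.map(lambda x: x * x)            # transformation: nothing computed yet
    evens = squares.filter(lambda x: x % 2 == 0)   # another lazy transformation

    print(evens.collect())   # action: runs the whole lineage => [4, 16]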

SLIDE 6

MapReduce vs Spark

Spark operates on RDDs with aggressive memory caching. Map and reduce tasks operate on key-value pairs.
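
A sketch of the caching the slide refers to (assumes sc; "data.txt" is a hypothetical input file):

    lines = sc.textFile("data.txt")                 # hypothetical input file
    errors = lines.filter(lambda s: "ERROR" in s)
    errors.cache()        # ask Spark to keep this RDD in memory once computed

    errors.count()        # first action: reads the file and materializes the RDD
    errors.count()        # later actions reuse the cached copy instead of re-reading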

SLIDE 7

Language Support

Standalone Programs
  • Python, Scala, & Java

Interactive Shells
  • Python & Scala

Performance
  • Java & Scala are faster due to static typing
  • …but Python is often fine

Python

    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala

    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

Java

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) { return s.contains("ERROR"); }
    }).count();

SLIDE 8

Spark Context and Creating RDDs

    # Start with sc -- SparkContext is the main entry point to Spark functionality

    # Turn a Python collection into an RDD
    > sc.parallelize([1, 2, 3])

    # Load text file from local FS, HDFS, or S3
    > sc.textFile("file.txt")
    > sc.textFile("directory/*.txt")
    > sc.textFile("hdfs://namenode:9000/path/file")

SLIDE 9

Spark Architecture

SLIDE 10

Spark Architecture

SLIDE 11

Basic Transformations

    > nums = sc.parallelize([1, 2, 3])

    # Pass each element through a function
    > squares = nums.map(lambda x: x*x)             # => {1, 4, 9}

    # Keep elements passing a predicate
    > even = squares.filter(lambda x: x % 2 == 0)   # => {4}

    # Read a text file and count the number of lines containing "ERROR"
    lines = sc.textFile("file.log")
    lines.filter(lambda s: "ERROR" in s).count()

SLIDE 12

Basic Actions

    > nums = sc.parallelize([1, 2, 3])

    # Retrieve RDD contents as a local collection
    > nums.collect()        # => [1, 2, 3]

    # Return first K elements
    > nums.take(2)          # => [1, 2]

    # Count number of elements
    > nums.count()          # => 3

    # Merge elements with an associative function
    > nums.reduce(lambda x, y: x + y)   # => 6

    # Write elements to a text file
    > nums.saveAsTextFile("hdfs://file.txt")

SLIDE 13

Working with Key-Value Pairs

Spark's "distributed reduce" transformations
  • operate on RDDs of key-value pairs.

Python:

    pair = (a, b)
    pair[0]   # => a
    pair[1]   # => b

Scala:

    val pair = (a, b)
    pair._1   // => a
    pair._2   // => b

Java:

    Tuple2 pair = new Tuple2(a, b);
    pair._1   // => a
    pair._2   // => b

SLIDE 14

Some Key-Value Operations

    > pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

    > pets.reduceByKey(lambda x, y: x + y)
    # => {(cat, 3), (dog, 1)}

    > pets.groupByKey()
    # => {(cat, [1, 2]), (dog, [1])}

    > pets.sortByKey()
    # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey() also automatically implements combiners on the map side.
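
Because of that map-side combining, reduceByKey is usually preferable to groupByKey plus a local reduction when only an aggregate is needed; a sketch:

    pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

    # Partial sums are computed per partition before the shuffle (combiner-style),
    # so less data crosses the network.
    totals = pets.reduceByKey(lambda x, y: x + y)

    # Same result, but every individual value is shuffled before being summed.
    totals2 = pets.groupByKey().mapValues(sum)

    print(totals.collect())   # => [('cat', 3), ('dog', 1)] (order may vary)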

SLIDE 15

Example: Word Count

    > lines = sc.textFile("hamlet.txt")
    > counts = lines.flatMap(lambda line: line.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda x, y: x + y)

Data flow for the two input lines "to be or" and "not to be":

    lines:        "to be or"               "not to be"
    flatMap:      "to" "be" "or"           "not" "to" "be"
    map:          (to,1) (be,1) (or,1)     (not,1) (to,1) (be,1)
    reduceByKey:  (be,1)(be,1) (not,1) (or,1) (to,1)(to,1)  ->  (be,2) (not,1) (or,1) (to,2)

SLIDE 16

Other Key-Value Operations

    > visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6"),
                                ("index.html", "1.3.3.1") ])

    > pageNames = sc.parallelize([ ("index.html", "Home"),
                                   ("about.html", "About") ])

    > visits.join(pageNames)
    # ("index.html", ("1.2.3.4", "Home"))
    # ("index.html", ("1.3.3.1", "Home"))
    # ("about.html", ("3.4.5.6", "About"))

    > visits.cogroup(pageNames)
    # ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
    # ("about.html", (["3.4.5.6"], ["About"]))

SLIDE 17

Under The Hood: DAG Scheduler

  • General task graphs
  • Automatically pipelines functions
  • Data-locality aware
  • Partitioning aware, to avoid shuffles

[Figure: task DAG over RDDs A-F built from map, join, filter, and groupBy, grouped into Stages 1-3; cached partitions are reused.]
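
One way to see the stages the scheduler builds is to print an RDD's lineage (a sketch, assuming sc; in some PySpark versions toDebugString() returns bytes, so a decode() may be needed):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    grouped = pairs.groupByKey()        # shuffle: starts a new stage
    sizes = grouped.mapValues(len)      # pipelined into the post-shuffle stage

    print(sizes.toDebugString())        # indentation marks stage (shuffle) boundaries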

SLIDE 18

Setting the Level of Parallelism

All the pair RDD operations take an optional second parameter for the number of tasks:

    > words.reduceByKey(lambda x, y: x + y, 5)
    > words.groupByKey(5)
    > visits.join(pageViews, 5)
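
A quick way to confirm the resulting task count is the partition count of each RDD (a sketch, assuming sc):

    words = sc.parallelize([("to", 1), ("be", 1), ("to", 1)], 3)   # 3 input partitions
    counts = words.reduceByKey(lambda x, y: x + y, 5)              # ask for 5 reduce tasks

    print(words.getNumPartitions())    # => 3
    print(counts.getNumPartitions())   # => 5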

SLIDE 19

More RDD Operators

  • map
  • filter
  • groupBy
  • sort
  • union
  • join
  • leftOuterJoin
  • rightOuterJoin
  • reduce
  • count
  • fold
  • reduceByKey
  • groupByKey
  • cogroup
  • cross
  • zip
  • sample
  • take
  • first
  • partitionBy
  • mapWith
  • pipe
  • save
  • ...

SLIDE 20

Interactive Shell

  • The fastest way to learn Spark.
  • Available in Python and Scala.
  • Runs as an application on an existing Spark cluster…
  • OR can run locally (see the launch example below).
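
For reference, the shells ship with the Spark distribution; a typical launch looks like this (commands as in the Spark 1.x documentation contemporary with these slides; paths and the MASTER setting depend on your install):

    ./bin/pyspark                   # Python shell, local mode by default
    ./bin/spark-shell               # Scala shell
    MASTER=local[4] ./bin/pyspark   # run locally with 4 worker threads
    MASTER=spark://host:7077 ./bin/pyspark   # attach to an existing cluster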
SLIDE 21

… or a Standalone Application

    import sys
    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext("local", "WordCount", sys.argv[0], None)
        lines = sc.textFile(sys.argv[1])
        counts = lines.flatMap(lambda s: s.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda x, y: x + y)
        counts.saveAsTextFile(sys.argv[2])

SLIDE 22

Create a SparkContext

Scala

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))

Java

    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = new JavaSparkContext(
        "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python

    from pyspark import SparkContext

    sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

Arguments: the cluster URL (or local / local[N]), the app name, the Spark install path on the cluster, and the list of JARs (or Python libraries) with app code to ship.

SLIDE 23

Administrative GUIs

http://<Standalone Master>:8080 (by default)

SLIDE 24

EXAMPLE APPLICATION: PAGERANK

SLIDE 25

Google PageRank

Give pages ranks (scores) based on links to them:
  • Links from many pages → high rank.
  • A link from a high-rank page → high rank.

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

SLIDE 26

PageRank (one definition)

  • Models page reputation on the web.
  • t_i, for i = 1..n, lists all parents of page x.
  • PR(x) is the PageRank of page x.
  • C(t) is the out-degree of t.
  • d is a damping factor.

$$PR(x) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$

[Figure: example graph with node PageRank values 0.4, 0.4, 0.2, 0.2, 0.2, 0.2, 0.4]

SLIDE 27

Computing PageRank Iteratively

Start with seed Rank values. Each page distributes Rank "credit" to all outgoing pages it points to. Each target page adds up the "credit" from multiple in-bound links to compute PR_{i+1}.

  • Effects at each iteration are local: the (i+1)-th iteration depends only on the i-th iteration.
  • At iteration i, the PageRank for individual nodes can be computed independently.

SLIDE 28

PageRank using MapReduce

Map: distribute PageRank "credit" to link targets.
Reduce: gather up PageRank "credit" from multiple sources to compute a new PageRank value.

Iterate until convergence.

Source of image: Lin 2008
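
A minimal single-machine sketch of one such computation written as a map and a reduce function (illustrative, not Hadoop code; the three-page graph and d = 0.85 are made up for the example):

    from collections import defaultdict

    d = 0.85
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph
    ranks = {page: 1.0 for page in links}

    def pr_map(page):
        # Map: distribute this page's rank "credit" evenly over its link targets.
        share = ranks[page] / len(links[page])
        return [(target, share) for target in links[page]]

    def pr_reduce(page, credits):
        # Reduce: gather credit from all in-bound links into the new PageRank value.
        return (1 - d) + d * sum(credits)

    for _ in range(10):                      # iterate (until convergence in practice)
        incoming = defaultdict(list)
        for page in links:
            for target, share in pr_map(page):
                incoming[target].append(share)
        ranks = {page: pr_reduce(page, incoming[page]) for page in links}

    print(ranks)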

SLIDE 29

Algorithm demo

1. Start each page at a rank of 1.
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors.
3. Set each page's rank to 0.15 + 0.85 × contribs.

[Figure: four-node link graph, every rank initialized to 1.0]

SLIDE 30

Algorithm (same steps as on the previous slide)

[Figure: all ranks 1.0; contributions of 0.5 or 1 flowing along each link]

SLIDE 31

Algorithm (same steps as above)

[Figure: ranks after one iteration: 0.58, 1.0, 1.85, 0.58]

SLIDE 32

Algorithm (same steps as above)

[Figure: same graph; ranks 0.58, 1.0, 1.85, 0.58 with contributions (0.29, 0.5, 0.58, 1.85) on the edges]

SLIDE 33

Algorithm (same steps as above)

[Figure: ranks after the next iteration: 0.39, 1.72, 1.31, 0.58, …]

SLIDE 34

Algorithm (same steps as above)

[Figure: final state ranks: 0.46, 1.37, 1.44, 0.73]
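
In Spark the loop becomes a few RDD operations; a sketch close to the classic Spark PageRank example (assumes sc; the edge list is illustrative):

    edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]   # hypothetical (page, neighbor) pairs
    links = sc.parallelize(edges).groupByKey().cache()         # page -> its outgoing neighbors
    ranks = links.mapValues(lambda _: 1.0)                     # 1. start each page at rank 1

    for i in range(10):
        # 2. each page contributes rank / |outdegree| to its neighbors
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        # 3. new rank = 0.15 + 0.85 * (sum of contributions)
        ranks = contribs.reduceByKey(lambda x, y: x + y) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)

    print(ranks.collect())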

SLIDE 35

HW: SimplePageRank

Random surfer model to describe the algorithm:

  • Stay on the page: 0.05 × weight.
  • Randomly follow a link: 0.85 / outgoing-degree of the page goes to each child.
    § If a node has no children, give that portion to the other nodes evenly.
  • Randomly go to another page: 0.10.
    § Meaning: every node contributes 10% of its weight to others, and that weight is shared evenly. Repeat for everybody: since the sum of all weights is num-nodes, 10% × num-nodes divided by num-nodes is 0.1.

R(x) = 0.1 + 0.05 R(x) + incoming-contributions

Initial weight is 1 for everybody. Example update for a 4-node graph (the entry in row x, column t is the contribution node x receives from node t):

    To/From |   0   |   1   |   2   |   3   | Random Factor | New Weight
       0    | 0.05  | 0.283 | 0.0   | 0.283 |     0.10      |   0.716
       1    | 0.425 | 0.05  | 0.0   | 0.283 |     0.10      |   0.858
       2    | 0.425 | 0.283 | 0.05  | 0.283 |     0.10      |   1.141
       3    | 0.00  | 0.283 | 0.85  | 0.05  |     0.10      |   1.283
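
A sketch of one update round under this model (pure Python; the graph is the one implied by the table above, with nodes 0-3):

    # node -> list of children, reconstructed from the table's contribution pattern
    graph = {0: [1, 2], 1: [0, 2, 3], 2: [3], 3: [0, 1, 2]}
    weights = {node: 1.0 for node in graph}     # initial weight 1 for everybody

    def update(weights, graph):
        n = len(graph)
        # Everyone gives away 10% of its weight, shared evenly by all nodes.
        random_share = 0.10 * sum(weights.values()) / n
        new_w = {x: 0.05 * weights[x] + random_share for x in graph}  # stay + random
        for t, children in graph.items():
            if children:
                share = 0.85 * weights[t] / len(children)   # follow a link
                for child in children:
                    new_w[child] += share
            else:
                share = 0.85 * weights[t] / (n - 1)         # no children: spread evenly
                for other in graph:
                    if other != t:
                        new_w[other] += share
        return new_w

    weights = update(weights, graph)
    print(weights)   # ≈ {0: 0.716, 1: 0.858, 2: 1.141, 3: 1.283}, matching the table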

SLIDE 36

Data structure in SimplePageRank