

SLIDE 1

Apache Spark

CS240A Winter 2016. T Yang

Some of them are based on P. Wendell’s Spark slides

SLIDE 2

Parallel Processing using Spark+Hadoop

  • Hadoop: distributed file system that connects machines.
  • MapReduce: parallel programming style built on a Hadoop cluster.
  • Spark: Berkeley design of MapReduce programming.
  • A file is treated as a big list; it may be divided into multiple parts (splits).
  • Map: each record (line) is processed by a Map function, which produces a set of intermediate key/value pairs.
  • Reduce: combine a set of values for the same key.

SLIDE 3

Python Examples and List Comprehension

Python lists:

>>> lst = [3, 1, 4, 1, 5]
>>> lst.append(2)
>>> len(lst)
6
>>> lst.sort()
>>> lst.insert(4, "Hello")
>>> [1] + [2]
[1, 2]
>>> lst[0]
3

Python tuples:

>>> num = (1, 2, 3, 4)
>>> num + (5,)
(1, 2, 3, 4, 5)

for i in [5, 4, 3, 2, 1]:
    print i
print 'Blastoff!'

List comprehension:

>>> S = [x**2 for x in range(10)]   # [0, 1, 4, 9, 16, ..., 81]
>>> M = [x for x in S if x % 2 == 0]
>>> words = 'hello lazy dog'.split()
>>> stuff = [(w.upper(), len(w)) for w in words]
[('HELLO', 5), ('LAZY', 4), ('DOG', 3)]

>>> words = 'The quick brown fox jumps over the lazy dog'.split()

Sets:

>>> numset = set([1, 2, 3, 2])      # duplicated entries are deleted
>>> numset = frozenset([1, 2, 3])   # such a set cannot be modified

SLIDE 4

Python map/reduce

a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]
f = lambda x: len(x)
L = map(f, [a, b, c])        # [3, 4, 5]
g = lambda x, y: x + y
reduce(g, [47, 11, 42, 13])  # 113
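The snippet above is Python 2. In Python 3, map returns a lazy iterator and reduce has moved to functools; a version of the same example that runs under Python 3 might look like this:

```python
from functools import reduce  # in Python 3, reduce lives in functools

a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]

# map returns a lazy iterator in Python 3, so materialize it with list()
lengths = list(map(len, [a, b, c]))                    # [3, 4, 5]

# fold the list with an associative function, as on the slide
total = reduce(lambda x, y: x + y, [47, 11, 42, 13])   # 113
```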

SLIDE 5

Mapreduce programming with Spark: key concept

RDD: Resilient Distributed Datasets

  • Like a big list: collections of objects spread across a cluster, stored in RAM or on disk.
  • Built through parallel transformations.
  • Automatically rebuilt on failure.

Operations

  • Transformations (e.g. map, filter, groupBy).
  • Make sure input/output match.

Write programs in terms of operations on implicitly distributed datasets (RDDs).

[Figure: a pipeline of RDDs connected by transformations]

SLIDE 6

MapReduce vs Spark

  • Spark operates on RDDs.
  • Map and reduce tasks operate on key-value pairs.

[Figure: a chain of RDDs]

SLIDE 7

Language Support

Standalone Programs

  • Python, Scala, & Java

Interactive Shells

  • Python & Scala

Performance

  • Java & Scala are faster due to static typing
  • …but Python is often fine

Python

lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala

val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();

SLIDE 8

Spark Context and Creating RDDs

# Start with sc – SparkContext as main entry point to Spark functionality

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load a text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")

SLIDE 9

Spark Architecture

SLIDE 10

Spark Architecture

SLIDE 11

Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)   # {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # {4}

# Read a text file and count the number of lines containing ERROR
lines = sc.textFile("file.log")
lines.filter(lambda s: "ERROR" in s).count()
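Without a cluster, the same dataflow can be mimicked with plain Python lists. This is a local simulation of the transformations above (not PySpark itself), and the log contents are made up for illustration:

```python
nums = [1, 2, 3]

# map: pass each element through a function
squares = [x * x for x in nums]             # [1, 4, 9]

# filter: keep elements passing a predicate
even = [x for x in squares if x % 2 == 0]   # [4]

# count the lines containing "ERROR"
log_lines = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
n_errors = len([s for s in log_lines if "ERROR" in s])   # 2
```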

SLIDE 12

Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]

# Return first K elements
> nums.take(2)     # => [1, 2]

# Count number of elements
> nums.count()     # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
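Each action has a direct list analogue; a plain-Python sketch of what the actions above return (a local simulation, not the Spark API):

```python
from functools import reduce

nums = [1, 2, 3]

collected = list(nums)                     # collect()  => [1, 2, 3]
first_two = nums[:2]                       # take(2)    => [1, 2]
n = len(nums)                              # count()    => 3
total = reduce(lambda x, y: x + y, nums)   # reduce(+)  => 6
```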

SLIDE 13

Working with Key-Value Pairs

Spark’s “distributed reduce” transformations operate on RDDs of key-value pairs.

Python:

pair = (a, b)
pair[0]  # => a
pair[1]  # => b

Scala:

val pair = (a, b)
pair._1  // => a
pair._2  // => b

Java:

Tuple2 pair = new Tuple2(a, b);
pair._1  // => a
pair._2  // => b

SLIDE 14

Some Key-Value Operations

> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

> pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

> pets.groupByKey()
# => {(cat, [1, 2]), (dog, [1])}

> pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.
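A dictionary-based sketch of what reduceByKey, groupByKey, and sortByKey compute, in plain Python (a local simulation, not the Spark API):

```python
from collections import defaultdict

pets = [("cat", 1), ("dog", 1), ("cat", 2)]

# reduceByKey(lambda x, y: x + y): merge the values for each key
totals = defaultdict(int)
for k, v in pets:
    totals[k] += v
# {'cat': 3, 'dog': 1}

# groupByKey(): collect all values per key
groups = defaultdict(list)
for k, v in pets:
    groups[k].append(v)
# {'cat': [1, 2], 'dog': [1]}

# sortByKey(): order the pairs by key
ordered = sorted(pets)
# [('cat', 1), ('cat', 2), ('dog', 1)]
```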

SLIDE 15

Example: Word Count

> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda x, y: x + y)

“to be or” / “not to be”
→ split: “to” “be” “or” “not” “to” “be”
→ map: (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1)
→ group: (be, 1)(be, 1) (not, 1) (or, 1) (to, 1)(to, 1)
→ reduce: (be, 2) (not, 1) (or, 1) (to, 2)
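The same flatMap / map / reduceByKey pipeline can be traced locally in plain Python (a simulation of the dataflow, not PySpark):

```python
from collections import Counter

lines = ["to be or", "not to be"]

# flatMap: split each line and flatten the results into one list of words
words = [w for line in lines for w in line.split(" ")]

# map + reduceByKey: count occurrences per word
counts = Counter(words)
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```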

SLIDE 16

Other Key-Value Operations

> visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                            ("about.html", "3.4.5.6"),
                            ("index.html", "1.3.3.1") ])
> pageNames = sc.parallelize([ ("index.html", "Home"),
                               ("about.html", "About") ])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
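In plain Python, join matches pairs that share a key and cogroup gathers all values from both sides per key; a small simulation of the example above (not the Spark API):

```python
from collections import defaultdict

visits = [("index.html", "1.2.3.4"),
          ("about.html", "3.4.5.6"),
          ("index.html", "1.3.3.1")]
page_names = [("index.html", "Home"), ("about.html", "About")]

# join: pair each visit with the page's name when the key appears in both
names = dict(page_names)
joined = [(url, (ip, names[url])) for url, ip in visits if url in names]

# cogroup: one entry per key, holding all values from each side
grouped = defaultdict(lambda: ([], []))
for url, ip in visits:
    grouped[url][0].append(ip)
for url, name in page_names:
    grouped[url][1].append(name)
# grouped['index.html'] == (['1.2.3.4', '1.3.3.1'], ['Home'])
```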

SLIDE 17

Under The Hood: DAG Scheduler

  • General task graphs
  • Automatically pipelines functions
  • Data locality aware
  • Partitioning aware to avoid shuffles

[Figure: a task graph of RDDs A–F linked by map, join, filter, and groupBy, split into Stages 1–3; cached partitions are marked]

SLIDE 18

Setting the Level of Parallelism

All the pair RDD operations take an optional second parameter for number of tasks

> words.reduceByKey(lambda x, y: x + y, 5) > words.groupByKey(5) > visits.join(pageViews, 5)

SLIDE 19

More RDD Operators

  • map, filter, groupBy, sort, union
  • join, leftOuterJoin, rightOuterJoin
  • reduce, count, fold
  • reduceByKey, groupByKey, cogroup
  • cross, zip
  • sample, take, first
  • partitionBy, mapWith, pipe, save, ...

SLIDE 20

Interactive Shell

  • The fastest way to learn Spark
  • Available in Python and Scala
  • Runs as an application on an existing Spark cluster…
  • OR can run locally
SLIDE 21

… or a Standalone Application

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])

SLIDE 22

Create a SparkContext

Scala:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))

Java:

import org.apache.spark.api.java.JavaSparkContext;
JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:

from pyspark import SparkContext
sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

Arguments: cluster URL (or local / local[N]), app name, Spark install path on the cluster, and the list of JARs (or Python libraries) with app code to ship.

SLIDE 23

Administrative GUIs

http://<Standalone Master>:8080 (by default)

SLIDE 24

EXAMPLE APPLICATION: PAGERANK

SLIDE 25

Google PageRank

Give pages ranks (scores) based on links to them:

  • Links from many pages → high rank
  • A link from a high-rank page → high rank

Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

SLIDE 26

PageRank (one definition)

  • Models page reputation on the web.
  • t_1, …, t_n are the parents of page x (the pages linking to it).
  • PR(x) is the PageRank of page x.
  • C(t) is the out-degree of t.
  • d is a damping factor.

PR(x) = (1 - d) + d * Σ_{i=1}^{n} PR(t_i) / C(t_i)

[Figure: example graph with PageRank values between 0.2 and 0.4]

SLIDE 27

Computing PageRank Iteratively

  • Start with seed Rank values.
  • Each page distributes Rank “credit” to all outgoing pages it points to.
  • Each target page adds up “credit” from multiple in-bound links to compute PR_{i+1}.
  • Effects at each iteration are local: the (i+1)-th iteration depends only on the i-th iteration.
  • At iteration i, PageRank for individual nodes can be computed independently.

SLIDE 28

PageRank using MapReduce

Map: distribute PageRank “credit” to link targets.
Reduce: gather up PageRank “credit” from multiple sources to compute a new PageRank value.

Iterate until convergence.

Source of Image: Lin 2008

SLIDE 29

Algorithm demo

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs

[Figure: four pages, each starting at rank 1.0]

SLIDE 30

Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs

[Figure: all ranks 1.0; contributions of 0.5 or 1 flow along the links]

SLIDE 31

Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs

[Figure: ranks after one iteration: 0.58, 1.0, 1.85, 0.58]

SLIDE 32

Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs

[Figure: ranks 0.58, 1.0, 1.85, 0.58, with the new contributions (0.29, 0.29, 0.5, …) flowing along the links]

SLIDE 33

Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs

[Figure: ranks after further iterations: 0.39, 1.72, 1.31, 0.58, …]

SLIDE 34

Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs

Final state: [Figure: ranks 0.46, 1.37, 1.44, 0.73]
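The three steps above translate directly into Python. The link graph below is an assumption for illustration (the slides do not spell out the example's edges); the update rule is the one on the slide:

```python
# Hypothetical link graph: page -> pages it links to (NOT the figure's graph)
links = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "b"],
}

# Step 1: start each page at a rank of 1
ranks = {p: 1.0 for p in links}

for _ in range(20):
    # Step 2: each page p contributes rank_p / |outdegree_p| to its neighbors
    contribs = {p: 0.0 for p in links}
    for page, outs in links.items():
        for target in outs:
            contribs[target] += ranks[page] / len(outs)
    # Step 3: new rank = 0.15 + 0.85 * contribs
    ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}
```

Because every page in this graph has at least one out-link, the total rank stays equal to the number of pages at every iteration.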

SLIDE 35

HW: SimplePageRank

Random surfer model to describe the algorithm:

  • Stay on the page: 0.05 × weight.
  • Randomly follow a link: 0.85 / out-degree of the node goes to each child. If a node has no children, that portion is given to the other nodes evenly.
  • Randomly go to another page: 0.10. Meaning: each node contributes 10% of its weight to the others, who split it evenly. Repeated for everybody: since the sum of all weights is num-nodes, 10% × num-nodes divided by num-nodes is 0.1 per node.

R(x) = 0.1 + 0.05 R(x) + incoming-contributions

Initial weight 1 for everybody

To/From |   1   |   2   |   3   |   4   | Random Factor | New Weight
   1    | 0.05  | 0.283 | 0.0   | 0.283 |     0.10      |   0.716
   2    | 0.425 | 0.05  | 0.0   | 0.283 |     0.10      |   0.858
   3    | 0.425 | 0.283 | 0.05  | 0.283 |     0.10      |   1.141
   4    | 0.00  | 0.283 | 0.85  | 0.05  |     0.10      |   1.283
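A sketch of one update under this rule in Python, using a 4-node graph whose out-links are inferred from the contribution pattern in the table (the edge set is an assumption read off the columns, not stated in the slides):

```python
# Inferred out-links: node -> children (read from the table's columns)
out_links = {1: [2, 3], 2: [1, 3, 4], 3: [4], 4: [1, 2, 3]}

# Initial weight 1 for everybody
weights = {n: 1.0 for n in out_links}

# Random-jump share (0.10) plus "stay on the page" share (0.05 * weight)
new = {n: 0.10 + 0.05 * weights[n] for n in out_links}

# Each node sends 0.85 * weight / out-degree to every child
for node, children in out_links.items():
    share = 0.85 * weights[node] / len(children)
    for child in children:
        new[child] += share

# New weights come out near the table's values
# (≈ 0.717, 0.858, 1.142, 1.283), and their sum stays at 4.
```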

SLIDE 36

Data structure in SimplePageRank