Apache Spark



1. Apache Spark
CS240A, Winter 2016. T. Yang
Some slides are based on P. Wendell's Spark slides.

2. Parallel Processing using Spark+Hadoop
• Hadoop: distributed file system that connects machines.
• MapReduce: parallel programming style built on a Hadoop cluster.
• Spark: Berkeley's design of the MapReduce programming model.
• Given a file treated as a big list (a sketch of the pattern follows this list):
  – A file may be divided into multiple parts (splits).
  – Each record (line) is processed by a Map function, which produces a set of intermediate key/value pairs.
  – Reduce: combine the set of values for the same key.
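
As a rough single-machine sketch of this pattern (plain Python with hypothetical helper names; real Hadoop/Spark distribute this work across a cluster):

def map_fn(line):
    # Map: each record (line) produces intermediate key/value pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: combine the set of values collected for the same key
    return (key, sum(values))

lines = ["to be or", "not to be"]
intermediate = {}
for line in lines:                        # "shuffle": group values by key
    for key, value in map_fn(line):
        intermediate.setdefault(key, []).append(value)
results = [reduce_fn(k, vs) for k, vs in intermediate.items()]
# => [('to', 2), ('be', 2), ('or', 1), ('not', 1)]  (order may vary)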

3. Python Examples and List Comprehension
for i in [5, 4, 3, 2, 1]:
    print i
print 'Blastoff!'

>>> lst = [3, 1, 4, 1, 5]
>>> lst[0]          # => 3
>>> len(lst)        # => 5
>>> lst.append(2)
>>> lst.sort()
>>> lst.insert(4, "Hello")
>>> [1] + [2]       # => [1, 2]

>>> words = 'The quick brown fox jumps over the lazy dog'.split()
>>> S = [x**2 for x in range(10)]        # => [0, 1, 4, 9, 16, ..., 81]
>>> M = [x for x in S if x % 2 == 0]

Python tuples:
>>> num = (1, 2, 3, 4)
>>> num + (5,)                           # => (1, 2, 3, 4, 5)
>>> words = 'hello lazy dog'.split()
>>> stuff = [(w.upper(), len(w)) for w in words]
# => [('HELLO', 5), ('LAZY', 4), ('DOG', 3)]

>>> numset = set([1, 2, 3, 2])           # duplicate entries are deleted
>>> numset = frozenset([1, 2, 3])        # such a set cannot be modified

4. Python map/reduce
a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]
f = lambda x: len(x)
L = map(f, [a, b, c])          # => [3, 4, 5]

g = lambda x, y: x + y
reduce(g, [47, 11, 42, 13])    # => 113
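
Note that these snippets are Python 2, where print is a statement, map returns a list, and reduce is a built-in. In Python 3, reduce must be imported:

from functools import reduce                   # Python 3 only; built-in in Python 2
reduce(lambda x, y: x + y, [47, 11, 42, 13])   # => 113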

5. MapReduce programming with Spark: key concept
Write programs in terms of operations on implicitly distributed datasets (RDDs).

RDD: Resilient Distributed Dataset
• Like a big list: a collection of objects spread across a cluster, stored in RAM or on disk.
• Built through parallel transformations.
• Automatically rebuilt on failure.

Operations:
• Transformations (e.g. map, filter, groupBy).
• Make sure input/output match.
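
A minimal PySpark sketch of these ideas (assuming a SparkContext sc, introduced on slide 8): each transformation builds a new RDD and records its lineage, which is what lets Spark rebuild lost partitions.

rdd = sc.parallelize(range(100))          # distributed collection
doubled = rdd.map(lambda x: 2 * x)        # transformation: builds a new RDD
small = doubled.filter(lambda x: x < 10)
# Nothing has run yet; transformations are lazy and only record lineage.
small.collect()                           # action triggers computation => [0, 2, 4, 6, 8]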

6. MapReduce vs Spark
Spark operates on RDDs; MapReduce map and reduce tasks operate on key-value pairs.

7. Language Support
Standalone programs: Python, Scala, & Java.
Interactive shells: Python & Scala.
Performance: Java & Scala are faster due to static typing, but Python is often fine.

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();

8. Spark Context and Creating RDDs
# Start with sc – SparkContext, the main entry point to Spark functionality

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load a text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
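
In a standalone program, sc must be created explicitly (a minimal sketch; the interactive shell predefines it, and a full program appears on slide 21):

from pyspark import SparkContext
sc = SparkContext("local", "MyApp")   # "local" = run on this machine; "MyApp" = application name
nums = sc.parallelize([1, 2, 3])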

9. Spark Architecture
[architecture diagram]

10. Spark Architecture
[architecture diagram, continued]

11. Basic Transformations
> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x * x)             # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)     # => {4}

# Read a text file and count the lines containing "ERROR"
> lines = sc.textFile("file.log")
> lines.filter(lambda s: "ERROR" in s).count()

12. Basic Actions
> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()                        # => [1, 2, 3]

# Return first K elements
> nums.take(2)                          # => [1, 2]

# Count number of elements
> nums.count()                          # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)       # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

13. Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.

Python:
pair = (a, b)
pair[0]    # => a
pair[1]    # => b

Scala:
val pair = (a, b)
pair._1    // => a
pair._2    // => b

Java:
Tuple2 pair = new Tuple2(a, b);
pair._1    // => a
pair._2    // => b
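
Any RDD of 2-tuples is a pair RDD; for example (a small sketch), one is typically built with map:

pets = sc.parallelize(["cat", "dog", "cat"]).map(lambda w: (w, 1))
# elements are now (key, value) pairs: ("cat", 1), ("dog", 1), ("cat", 1)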

14. Some Key-Value Operations
> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

> pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

> pets.groupByKey()
# => {(cat, [1, 2]), (dog, [1])}

> pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.
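
The map-side combiner is why reduceByKey is preferred over groupByKey for aggregation: partial results are combined within each partition before the shuffle. A sketch of the two equivalent pipelines:

pets.reduceByKey(lambda x, y: x + y)   # partial sums computed map-side, then shuffled
pets.groupByKey().mapValues(sum)       # every (key, value) pair is shuffled first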

15. Example: Word Count
> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)

Dataflow for the input lines "to be or" and "not to be":
flatMap      => "to", "be", "or", "not", "to", "be"
map          => (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
reduceByKey  => (be, 2), (not, 1), (or, 1), (to, 2)
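
Any action can then be applied to counts; for instance (a usage sketch), pulling out the most frequent words:

counts.takeOrdered(3, key=lambda pair: -pair[1])   # top 3 words by count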

16. Other Key-Value Operations
> visits = sc.parallelize([("index.html", "1.2.3.4"),
                           ("about.html", "3.4.5.6"),
                           ("index.html", "1.3.3.1")])

> pageNames = sc.parallelize([("index.html", "Home"),
                              ("about.html", "About")])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))

17. Under The Hood: DAG Scheduler
• General task graphs
• Automatically pipelines functions
• Data-locality aware
• Partitioning aware, to avoid shuffles

[Diagram: a task graph over RDDs A–F cut into Stage 1 (ending in groupBy), Stage 2 (map, filter), and Stage 3 (join); legend: box = RDD, shaded box = cached partition]
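
The cached partitions in the diagram correspond to an API the programmer controls: persisting an RDD keeps its partitions in memory so later stages can reuse them instead of recomputing the lineage. A minimal sketch:

lines = sc.textFile("file.log")
errors = lines.filter(lambda s: "ERROR" in s).cache()   # mark for in-memory reuse
errors.count()    # first action computes and caches the partitions
errors.take(10)   # reuses the cached partitions; file.log is not re-read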

18. Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for the number of tasks:
> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
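
The resulting partition count can be checked from PySpark (a small sketch):

counts = words.reduceByKey(lambda x, y: x + y, 5)
counts.getNumPartitions()   # => 5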

19. More RDD Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...

20. Interactive Shell
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster...
• ...or can run locally
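
For example, in the PySpark shell (started with bin/pyspark from the Spark distribution), sc is already defined:

>>> sc.parallelize(range(10)).filter(lambda x: x % 2 == 0).collect()
[0, 2, 4, 6, 8]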

21. ... or a Standalone Application
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])
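
Assuming the script above is saved as wordcount.py (a hypothetical filename), it can be launched with Spark's submit script, passing the input and output paths as arguments, e.g. spark-submit wordcount.py <input> <output>.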
