Apache Spark
CS240A Winter 2016. T Yang
Some of them are based on P. Wendell’s Spark slides
Parallel Processing using Spark+Hadoop
- Hadoop: distributed file system that connects machines.
- MapReduce: parallel programming style built on a Hadoop cluster; a map function turns each input record into intermediate key/value pairs, and a reduce function merges the values for each key.
Python lists
>>> lst = [3, 1, 4, 1, 5]
>>> lst[0]
3
>>> lst.append(2)
>>> len(lst)
6
>>> lst.sort()              # lst is now [1, 1, 2, 3, 4, 5]
>>> lst.insert(4, "Hello")  # lst is now [1, 1, 2, 3, 'Hello', 4, 5]
>>> [1] + [2]
[1, 2]

Python tuples
>>> num = (1, 2, 3, 4)
>>> num + (5,)              # note the trailing comma: (5) is just the int 5
(1, 2, 3, 4, 5)
for i in [5, 4, 3, 2, 1]:
    print i
print 'Blastoff!'
List comprehensions
>>> S = [x**2 for x in range(10)]      # [0, 1, 4, 9, 16, ..., 81]
>>> M = [x for x in S if x % 2 == 0]   # [0, 4, 16, 36, 64]
>>> words = 'hello lazy dog'.split()
>>> stuff = [(w.upper(), len(w)) for w in words]
[('HELLO', 5), ('LAZY', 4), ('DOG', 3)]
>>> words = 'The quick brown fox jumps over the lazy dog'.split()
Python sets
>>> numset = set([1, 2, 3, 2])     # duplicated entries are deleted
>>> numset = frozenset([1, 2, 3])  # such a set cannot be modified
a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]
f = lambda x: len(x)
L = map(f, [a, b, c])          # [3, 4, 5]
g = lambda x, y: x + y
reduce(g, [47, 11, 42, 13])    # 113  (Python 3: from functools import reduce)
RDD: Resilient Distributed Datasets
- Collections of objects spread across a cluster, stored in RAM or on disk
- Built through parallel transformations
- Automatically rebuilt on failure

Operations
- Transformations (e.g. map, filter, groupBy), which can be chained as long as their input/output types match
[Diagram: a chain of RDDs produced by successive transformations.]
Map and reduce tasks operate on key-value pairs.
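A minimal sketch of how these pieces fit together, assuming a live SparkContext sc ("events.log" is a hypothetical input file): transformations only record lineage, and nothing runs until an action is called.

events = sc.textFile("events.log")              # transformation: nothing is read yet
errors = events.filter(lambda s: "ERROR" in s)  # transformation: builds lineage only
errors.cache()                                  # mark the RDD for in-memory reuse
print(errors.count())                           # action: triggers computation, fills the cache
print(errors.take(5))                           # second action: served from the cache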
Language support
- Standalone programs: Python, Scala, and Java
- Interactive shells: Python and Scala
- Performance: Java and Scala are faster due to static typing
Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
# Start with sc, the SparkContext: the main entry point to Spark functionality

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x*x)   # {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # {4}
# Read a text file and count the number of lines containing "ERROR"
lines = sc.textFile("file.log")
lines.filter(lambda s: "ERROR" in s).count()
> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]

# Return first K elements
> nums.take(2)   # => [1, 2]

# Count number of elements
> nums.count()   # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs:
Python:
pair = (a, b)
pair[0]   # => a
pair[1]   # => b

Scala:
val pair = (a, b)
pair._1   // => a
pair._2   // => b

Java:
Tuple2 pair = new Tuple2(a, b);
pair._1   // => a
pair._2   // => b
> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
> pets.groupByKey()                      # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()                       # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side.
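As a sketch of what map-side combining buys (the pets data reuses the example above; groupByKey is shown only for contrast):

pairs = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

# reduceByKey pre-sums values within each partition before the shuffle,
# so only partial sums cross the network.
counts = pairs.reduceByKey(lambda x, y: x + y)                  # {(cat, 3), (dog, 1)}

# groupByKey ships every individual pair across the network first,
# and the values still have to be summed afterwards.
counts2 = pairs.groupByKey().mapValues(lambda vals: sum(vals))  # same result, more shuffle traffic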
> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)
"to be or", "not to be"
→ "to", "be", "or", "not", "to", "be"
→ (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
→ [(be, 1), (be, 1)], [(not, 1)], [(or, 1)], [(to, 1), (to, 1)]
→ (be, 2), (not, 1), (or, 1), (to, 2)
> visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                            ("about.html", "3.4.5.6"),
                            ("index.html", "1.3.3.1") ])
> pageNames = sc.parallelize([ ("index.html", "Home"),
                               ("about.html", "About") ])
> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))
> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
Under the hood: the DAG scheduler
- Supports general task graphs
- Automatically pipelines functions
- Data locality aware
- Partitioning aware, to avoid shuffles (see the sketch below)
[Diagram: task DAG with RDDs A–F connected by map, join, filter, and groupBy, grouped into Stages 1–3; the legend distinguishes cached partitions from ordinary RDDs.]
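A hedged sketch of the "partitioning aware" point (names reuse the visits/pageNames example; the partition count 8 is an arbitrary choice): hash-partitioning one side ahead of time and caching it lets later joins reuse that layout instead of re-shuffling it.

# Pre-partition the (presumably larger) visits RDD and keep it in memory.
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6")]).partitionBy(8).cache()
pageNames = sc.parallelize([("index.html", "Home"),
                            ("about.html", "About")])
# The join can adopt visits' existing partitioner, shuffling only pageNames.
print(visits.join(pageNames).collect())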
All the pair RDD operations take an optional second parameter for number of tasks
> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
More RDD operators: map, filter, groupBy, union, leftOuterJoin, rightOuterJoin, reduce, reduceByKey, groupByKey, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])
Java:
import org.apache.spark.api.java.JavaSparkContext;
JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))

Python:
from pyspark import SparkContext
sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

Constructor arguments: cluster URL (or local / local[N]), app name, Spark install path on the cluster, and a list of JARs (or Python libraries) with app code to ship.
Cluster status UI: http://<Standalone Master>:8080 (by default)
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Model page reputation on the web:

PR(x) = (1 − d) + d · Σ_{i=1..n} PR(t_i) / C(t_i)

where t_i (i = 1..n) ranges over all parents of page x, PR(x) is the PageRank of page x, C(t) is the out-degree of t, and d is a damping factor.
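A plain-Python sketch of one iteration of this update (the three-page graph and d = 0.85 are illustrative assumptions, not from the slides):

# One PageRank iteration: PR(x) = (1 - d) + d * sum(PR(t)/C(t) for parents t of x)
def pagerank_step(pr, parents, out_degree, d=0.85):
    return {x: (1 - d) + d * sum(pr[t] / out_degree[t] for t in parents[x])
            for x in pr}

# Illustrative graph: A -> B, A -> C, B -> C, C -> A
parents = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_degree = {"A": 2.0, "B": 1.0, "C": 1.0}
pr = {"A": 1.0, "B": 1.0, "C": 1.0}
for _ in range(20):          # iterate toward the fixed point
    pr = pagerank_step(pr, parents, out_degree)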
- Start with seed PR_i values.
- Each page distributes its PR_i "credit" to all pages it points to.
- Each target page adds up the "credit" from its multiple in-bound links to compute PR_{i+1}.
- Effects at each iteration are local: the (i+1)-th iteration depends only on the i-th iteration.
- At iteration i, the PageRank of individual nodes can be computed independently.
Map: distribute PageRank "credit" to link targets.
Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value.
Iterate until convergence.
Source of Image: Lin 2008
1. Start each page at a rank of 1.
2. On each iteration, have page p contribute rank_p / |outdegree_p| to its neighbors.
3. Set each page's rank to 0.15 + 0.85 × contribs.

[Diagram: four-node example. Ranks start at (1.0, 1.0, 1.0, 1.0), become (0.58, 1.0, 1.85, 0.58) after one iteration, (0.39, 1.72, 1.31, 0.58) after the next, and converge to the final state (0.46, 1.37, 1.44, 0.73).]
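These three steps translate almost line for line into the RDD operations introduced earlier. A minimal PySpark sketch, assuming a live SparkContext sc; the four-node link structure below is illustrative, not the exact graph from the diagram:

# links: (page, list_of_neighbors); cached because it is reused every iteration
links = sc.parallelize([("A", ["B", "D"]), ("B", ["A", "C", "D"]),
                        ("C", ["D"]), ("D", ["A", "B", "C"])]).cache()
# Step 1: start each page at a rank of 1
ranks = links.mapValues(lambda neighbors: 1.0)

def compute_contribs(pair):
    # Step 2: page contributes rank / |outdegree| to each of its neighbors
    page, (neighbors, rank) = pair
    return [(dest, rank / len(neighbors)) for dest in neighbors]

for i in range(10):
    ranks = links.join(ranks).flatMap(compute_contribs) \
                 .reduceByKey(lambda x, y: x + y) \
                 .mapValues(lambda s: 0.15 + 0.85 * s)   # Step 3

print(ranks.collect())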
Random surfer model to describe the algorithm:
- With a 10% chance of a random jump, the random factor contributes 10% × num-nodes divided by num-nodes = 0.1 to every node.
- Update rule: R(x) = 0.1 + 0.05 R(x) + incoming contributions, i.e. each node keeps 5% of its own weight and distributes 85% evenly over its out-links.
- Initial weight is 1 for everybody.
To/From | 0     | 1     | 2     | 3     | Random Factor | New Weight
0       | 0.05  | 0.283 | 0.0   | 0.283 | 0.10          | 0.716
1       | 0.425 | 0.05  | 0.0   | 0.283 | 0.10          | 0.858
2       | 0.425 | 0.283 | 0.05  | 0.283 | 0.10          | 1.141
3       | 0.00  | 0.283 | 0.85  | 0.05  | 0.10          | 1.283
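A quick arithmetic check of the reconstructed table's first row (node 0), using the update rule above:

# Node 0's new weight = random factor + self term + incoming contributions.
# Nodes 1 and 3 each have out-degree 3, so each sends 0.85 * 1 / 3 (about 0.283);
# node 2 does not link to node 0.
r = 1.0                      # initial weight of every node
new_weight = 0.10 + 0.05 * r + 0.85 * r / 3 + 0.85 * r / 3
print(round(new_weight, 3))  # 0.717; the table shows 0.716 because it rounds 0.2833 to 0.283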