Making Big Data Processing Simple with Spark
Matei Zaharia
December 17, 2015
What is Apache Spark?
Fast and general cluster computing engine that generalizes the MapReduce model
Makes it easy and fast to process large datasets: high-level APIs in Java, Scala, Python and R
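Not from the slides: a minimal sketch of what that high-level API looks like, as a PySpark word count (the context setup and input path are placeholders, not the talk's code).

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

# Split lines into words, pair each word with 1, and sum counts per word.
counts = (sc.textFile("hdfs://...")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(5))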
[Diagram: the Spark stack, with libraries on top of the core engine: Spark Streaming (real-time), Spark SQL (structured data), MLlib (machine learning), GraphX (graph)]
[Chart: number of contributors per year, 2010-2015]
[Diagrams: data sharing in MapReduce vs. Spark. With MapReduce, each iteration reads its input from HDFS and writes its output back to HDFS, and each interactive query (query 1, query 2, query 3) re-reads the input from HDFS to produce its result (result 1, result 2, result 3). With Spark, the input is loaded into distributed memory once and iterations or queries process it there directly]
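To make the contrast concrete, a sketch of repeated passes over cached data (assuming sc is an existing SparkContext and one number per input line; both are assumptions):

# Parse the input once and keep the RDD in distributed memory.
data = sc.textFile("hdfs://...").map(lambda s: float(s)).cache()

n = data.count()       # first pass reads HDFS and materializes the cache
mean = data.sum() / n  # later passes scan memory instead of HDFS
variance = data.map(lambda x: (x - mean) ** 2).sum() / n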
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
[Diagram: the driver program sends tasks to three workers, each reading one block of the file (Block 1, Block 2, Block 3)]
messages.filter(lambda s: "MySQL" in s).count()
messages.filter(lambda s: "Redis" in s).count()
. . .
[Diagram: the workers send results back to the driver and keep the computed messages in memory (Cache 1, Cache 2, Cache 3)]
In the code above, lines is the base RDD, errors and messages are transformed RDDs, and count() is an action.
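Put together as one runnable sketch (the context setup, HDFS path, and tab-separated log format are assumptions):

from pyspark import SparkContext

sc = SparkContext("local[*]", "logmining")

lines = sc.textFile("hdfs://...")                       # base RDD (placeholder path)
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])       # third tab-separated field
messages.cache()                                        # keep in cluster memory

# Actions: the first count materializes the cache; later ones scan memory.
print(messages.filter(lambda s: "MySQL" in s).count())
print(messages.filter(lambda s: "Redis" in s).count())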
Example: full-text search of Wikipedia in 0.5 sec (vs. 20 sec for on-disk data)
file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda pair: pair[1] > 10)  # pair = (type, count)
[Diagram: operator graph for the code above: input file -> map -> reduce -> filter]
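Transformations like map and filter are lazy: Spark only records this operator graph (the lineage) and runs it when an action is called, which also lets it recompute lost partitions. A small sketch, assuming file is an RDD of records with a type field (an assumption):

# Nothing executes here; Spark just records the lineage of each RDD.
pairs = file.map(lambda rec: (rec.type, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)
frequent = counts.filter(lambda pair: pair[1] > 10)

# The action runs one job over the whole graph; if a partition is lost,
# Spark re-runs only the lineage needed to rebuild it.
print(frequent.collect())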
[Chart: running time (s) vs. number of iterations, Hadoop vs. Spark]
Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for further iterations.
Time to sort 100 TB (source: Daytona GraySort benchmark, sortbenchmark.org):
Hadoop record: 2100 machines, 72 minutes
Spark record: 207 machines, 23 minutes
[Diagram: the same library stack (real-time, structured data, machine learning, graph), combined in the single program below]
# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
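The code above is schematic. A hedged sketch of the same pipeline with the actual PySpark 1.x APIs; the registered tweets table and the socket source (standing in for the Twitter stream) are assumptions:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[2]", "combined")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 5)  # 5-second batches

# Load training data with SQL (assumes a "tweets" table is registered).
points = (sqlContext.sql("SELECT latitude, longitude FROM tweets")
          .rdd.map(lambda row: [row.latitude, row.longitude]))

# Train a k-means model with 10 clusters.
model = KMeans.train(points, 10)

# Score a live stream; each line is assumed to be "lat,lon".
locations = (ssc.socketTextStream("localhost", 9999)
             .map(lambda line: [float(v) for v in line.split(",")]))

# Count points per cluster over a sliding 30-second window.
(locations.map(lambda p: (model.predict(p), 1))
          .reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 5)
          .pprint())

ssc.start()
ssc.awaitTermination()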
[Diagram: with separate systems, each stage of a pipeline (ETL, train, query) is its own job, with an HDFS write and HDFS read between stages; in Spark, the same ETL -> train -> query pipeline runs as one program without the intermediate HDFS round trips]
[Charts: performance of Spark's libraries vs. specialized systems. SQL: response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem). Machine learning: response time (min) for Mahout, GraphLab, Spark. Streaming: throughput (MB/s/node) for Storm vs. Spark]
Many talks online at spark-summit.org
Top applications (share of users): Business Intelligence 68%, Data Warehousing 52%, Recommendation 44%, Log Processing 40%, User-Facing Services 36%, Fraud Detection / Security 29%
Most-used components (share of users): Spark SQL 69%, DataFrames 62%, Spark Streaming 58%, MLlib + GraphX 58%. Most users combine more than one component.