Large-Scale Data Engineering: Spark and MLlib (event.cwi.nl/lsde)
OVERVIEW OF SPARK
What is Spark?
- Fast and expressive cluster computing system interoperable with Apache Hadoop
- Improves efficiency through:
  – In-memory computing primitives
  – General computation graphs
- Improves usability through:
  – Rich APIs in Scala, Java, Python
  – Interactive shell
- Up to 100× faster (2-10× on disk); often 5× less code
The Spark Stack
- Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
  – Spark SQL
  – Spark Streaming (real-time)
  – MLlib (machine learning)
  – GraphX (graph)
  – …
- More details: amplab.berkeley.edu
Why a New Programming Model?
- MapReduce greatly simplified big data analysis
- But as soon as it got popular, users wanted more:
  – More complex, multi-pass analytics (e.g. ML, graph)
  – More interactive ad-hoc queries
  – More real-time stream processing
- All 3 need faster data sharing across parallel jobs
Data Sharing in MapReduce
[Diagram: each iteration and each ad-hoc query reads its input from HDFS and writes its result back, so every step pays an HDFS read and an HDFS write]

Slow due to replication, serialization, and disk I/O
Data Sharing in Spark

[Diagram: the input is read from HDFS once (one-time processing); iterations and queries then share data through distributed memory]

~10× faster than network and disk
Spark Programming Model
- Key idea: resilient distributed datasets (RDDs)
  – Distributed collections of objects that can be cached in memory across the cluster
  – Manipulated through parallel operators
  – Automatically recomputed on failure
- Programming interface
  – Functional APIs in Scala, Java, Python
  – Interactive use from Scala shell
Lambda Functions
A lambda function (functional programming!) is an implicit function definition:

  errors = lines.filter(lambda x: x.startswith("ERROR"))
  messages = errors.map(lambda x: x.split('\t')[2])

The first lambda is shorthand for an explicit function such as:

  bool detect_error(string x) {
      return x.startswith("ERROR");
  }
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

  lines = spark.textFile("hdfs://...")                       # base RDD
  errors = lines.filter(lambda x: x.startswith("ERROR"))     # transformed RDD
  messages = errors.map(lambda x: x.split('\t')[2])
  messages.cache()

  messages.filter(lambda x: "foo" in x).count()              # action
  messages.filter(lambda x: "bar" in x).count()
  . . .

[Diagram: the driver ships tasks to three workers, each reading one HDFS block; worker results are cached in memory (Cache 1-3) and returned to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance
  file.map(lambda rec: (rec.type, 1))
      .reduceByKey(lambda x, y: x + y)
      .filter(lambda (type, count): count > 10)

[Diagram: lineage graph from the input file through map, reduceByKey, and filter]

RDDs track lineage info to rebuild lost data
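
Note that the tuple-parameter lambda in the filter is Python 2 syntax. A minimal Python 3 / PySpark sketch of the same lineage, assuming `file` is an RDD of records with a `type` field:

  # Python 3 removed tuple parameters, so the (type, count) pair
  # is indexed instead; losing a partition of `counts` triggers
  # recomputation along this lineage chain
  counts = (file.map(lambda rec: (rec.type, 1))
                .reduceByKey(lambda x, y: x + y)
                .filter(lambda tc: tc[1] > 10))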
Example: Logistic Regression
[Chart: running time (s) over 1-30 iterations, Hadoop vs Spark]
Hadoop: 110 s per iteration; Spark: 80 s for the first iteration, 1 s for further iterations
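
The gap comes from caching: the canonical iterative logistic regression keeps the training points in memory and only re-ships the weight vector each pass. A minimal PySpark sketch (the input path, feature count D, and labels in {-1, +1} are assumptions):

  import numpy as np

  D = 10           # number of features (assumption)
  iterations = 10

  def parse_point(line):
      # assumed format: label followed by D feature values
      vals = [float(x) for x in line.split()]
      return (vals[0], np.array(vals[1:]))

  points = sc.textFile("hdfs://...").map(parse_point).cache()  # read HDFS once
  w = np.random.ranf(D)  # random initial weight vector

  for i in range(iterations):
      # each pass reuses the cached RDD; only the closure over w is re-sent
      gradient = points.map(
          lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
      ).reduce(lambda a, b: a + b)
      w -= gradient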
Spark in Scala and Java
  // Scala:
  val lines = sc.textFile(...)
  lines.filter(x => x.contains("ERROR")).count()

  // Java:
  JavaRDD<String> lines = sc.textFile(...);
  lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) { return s.contains("error"); }
  }).count();
Supported Operators
- map
- filter
- groupBy
- sort
- union
- join
- leftOuterJoin
- rightOuterJoin
- reduce
- count
- fold
- reduceByKey
- groupByKey
- cogroup
- cross
- zip
- sample
- take
- first
- partitionBy
- mapWith
- pipe
- save
- …
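
As a quick illustration of a few of these, a minimal PySpark sketch over made-up pair RDDs (results shown in comments; collect ordering may vary):

  pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
  other = sc.parallelize([("a", "x"), ("b", "y")])

  pairs.reduceByKey(lambda x, y: x + y).collect()  # [('a', 4), ('b', 2)]
  pairs.join(other).collect()   # [('a', (1, 'x')), ('a', (3, 'x')), ('b', (2, 'y'))]
  pairs.map(lambda kv: kv[1]).reduce(lambda x, y: x + y)  # 6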
Software Components
- Spark client is a library in the user program (one instance per app)
- Runs tasks locally or on a cluster
  – Mesos, YARN, standalone mode
- Accesses storage systems via the Hadoop InputFormat API
  – Can use HBase, HDFS, S3, …

[Diagram: your application holds a SparkContext that runs local threads and talks to a cluster manager; each worker runs a Spark executor, reading from HDFS or other storage]
Task Scheduler
- General task graphs
- Automatically pipelines functions
- Data locality aware
- Partitioning aware to avoid shuffles

[Diagram: a DAG over RDDs A-F built from map, join, filter, and groupBy, cut into Stages 1-3 at shuffle boundaries; stages whose partitions are already cached need not be recomputed]
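
For the last point, a minimal sketch of partitioning awareness (the data and the choice of 8 partitions are made up): hash-partitioning one side once lets a later join reuse that layout instead of re-shuffling it.

  # pre-partition and cache the large side; the join can then use the
  # existing partitioner rather than shuffling `users` again
  users = sc.parallelize([(1, "alice"), (2, "bob")]).partitionBy(8).cache()
  events = sc.parallelize([(1, "click"), (2, "view")]).partitionBy(8)
  joined = users.join(events)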
Spark SQL
- Columnar SQL analytics engine for Spark
  – Supports both SQL and complex analytics
  – Up to 100× faster than Apache Hive
- Compatible with Apache Hive
  – HiveQL, UDF/UDAF, SerDes, scripts
  – Runs on existing Hive warehouses
- In use at Yahoo! for fast in-memory OLAP
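
A minimal sketch of the Hive-compatible entry point in PySpark of this era (the `logs` table is hypothetical):

  from pyspark.sql import HiveContext

  hc = HiveContext(sc)
  # HiveQL against an existing Hive warehouse table
  top = hc.sql("SELECT type, COUNT(*) AS n FROM logs GROUP BY type ORDER BY n DESC")
  top.collect()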
Hive Architecture
[Diagram: clients (CLI, JDBC) connect to a driver containing a SQL parser, query optimizer, and physical plan execution; plans run as MapReduce jobs over HDFS, with metadata in the Hive catalog]
Spark SQL Architecture
[Diagram: the same client and driver structure, with a cache manager added beside the query optimizer; physical plans execute on Spark instead of MapReduce]
[Engle et al, SIGMOD 2012]
What Makes it Faster?
- Lower-latency engine (Spark OK with 0.5s jobs)
- Support for general DAGs
- Column-oriented storage and compression
- New optimizations (e.g. map pruning)
Other Spark Stack Projects
- Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

  sc.twitterStream(...)
    .flatMap(_.getText.split(" "))
    .map(word => (word, 1))
    .reduceByWindow("5s", _ + _)

- MLlib: library of high-quality machine learning algorithms (out since Spark 0.8)
Performance
[Charts comparing per-node performance:
 SQL – response time (s): Impala (disk), Impala (mem), Redshift, Spark SQL (disk), Spark SQL (mem);
 Streaming – throughput (MB/s/node): Storm vs Spark;
 Graph – response time (min): Hadoop, Giraph, GraphX]
What it Means for Users
- Separate frameworks: each stage (ETL, train, query) is its own job, with an HDFS read and an HDFS write around every stage
- Spark: one HDFS read, then ETL, train, and query run within a single program, sharing data in memory
Conclusion
- Big data analytics is evolving to include:
  – More complex analytics (e.g. machine learning)
  – More interactive ad-hoc queries
  – More real-time stream processing
- Spark is a fast platform that unifies these apps
- More info: spark-project.org
SPARK MLLIB
What is MLLIB?
MLlib is a Spark subproject providing machine learning primitives:
- initial contribution from AMPLab, UC Berkeley
- shipped with Spark since version 0.8
What is MLLIB?
Algorithms:
- classification: logistic regression, linear support vector machine (SVM), naive Bayes
- regression: generalized linear models (GLM)
- collaborative filtering: alternating least squares (ALS)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)
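
As an example of calling one of these, a minimal k-means sketch over made-up 2-D points:

  import numpy as np
  from pyspark.mllib.clustering import KMeans

  data = sc.parallelize([np.array([0.0, 0.0]), np.array([1.0, 1.0]),
                         np.array([9.0, 8.0]), np.array([8.0, 9.0])])
  model = KMeans.train(data, k=2, maxIterations=10)
  model.predict(np.array([0.5, 0.5]))  # cluster id of a new point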
Collaborative Filtering
Alternating Least Squares (ALS)
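
As a reminder (standard formulation, not taken from the slide): ALS factors the sparse rating matrix into user factors u_i and item factors v_j by minimizing

  \min_{U,V} \sum_{(i,j)\,\text{observed}} \left( r_{ij} - u_i^{\top} v_j \right)^2
      + \lambda \left( \sum_i \lVert u_i \rVert^2 + \sum_j \lVert v_j \rVert^2 \right)

With V fixed, each u_i has a closed-form ordinary least-squares solution (and vice versa), so the algorithm alternates the two solves until convergence.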
Collaborative Filtering in Spark MLLIB
  from pyspark.mllib.recommendation import ALS, Rating

  trainset = sc.textFile("s3n://bads-music-dataset/train_*.gz") \
      .map(lambda l: l.split('\t')) \
      .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))

  model = ALS.train(trainset, rank=10, iterations=10)  # train

  # load testing set
  testset = sc.textFile("s3n://bads-music-dataset/test_*.gz") \
      .map(lambda l: l.split('\t')) \
      .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))

  # apply model to testing set (only first two cols) to predict
  predictions = model.predictAll(testset.map(lambda p: (p[0], p[1]))) \
      .map(lambda r: ((r[0], r[1]), r[2]))
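
A common follow-up (not on the slide) is scoring the predictions against the held-out ratings, e.g. by RMSE:

  from math import sqrt

  # key the true ratings the same way as the predictions: ((user, item), rating)
  truth = testset.map(lambda r: ((r[0], r[1]), r[2]))
  mse = (truth.join(predictions)
              .map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)
              .mean())
  print("RMSE = %.3f" % sqrt(mse))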
Spark MLLIB – ALS Performance
  System     Wall-clock time (seconds)
  --------   -------------------------
  Matlab     15443
  Mahout     4206
  GraphLab   291
  MLlib      481
- Dataset: Netflix data
- Cluster: 9 machines.
- MLlib is an order of magnitude faster than Mahout.
- MLlib is within a factor of 2 of GraphLab.
Spark Implementation of ALS
- Workers load data
- Models are instantiated at workers
- At each iteration, models are shared via join between workers
- Good scalability
- Works on large datasets

[Diagram: a master coordinating workers that exchange model factors]
Spark SQL + MLLIB
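
The slide combines the two libraries; a minimal sketch of the idea, selecting training data with Spark SQL and handing it to MLlib (the sqlContext, table, and column names are assumptions):

  from pyspark.mllib.recommendation import ALS, Rating

  rows = sqlContext.sql("SELECT userId, itemId, rating FROM ratings")
  training = rows.rdd.map(lambda r: Rating(int(r[0]), int(r[1]), float(r[2])))
  model = ALS.train(training, rank=10, iterations=10)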
MLLIB Pointers
- Website: http://spark.apache.org
- Tutorials: http://ampcamp.berkeley.edu
- Spark Summit: http://spark-summit.org
- Github: https://github.com/apache/spark
- Mailing lists: user@spark.apache.org, dev@spark.apache.org