

  1. Large-Scale Data Engineering: Spark and MLlib (event.cwi.nl/lsde)

  2. OVERVIEW OF SPARK

  3. What is Spark?
     • Fast and expressive cluster computing system, interoperable with Apache Hadoop
     • Improves efficiency through:
       – in-memory computing primitives (up to 100× faster; 2-10× on disk)
       – general computation graphs
     • Improves usability through:
       – rich APIs in Scala, Java, Python (often 5× less code)
       – an interactive shell

  4. The Spark Stack
     • Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
       – Spark SQL
       – Spark Streaming (real-time)
       – MLlib (machine learning)
       – GraphX (graph)
       – ... all running on the Spark core engine
     • More details: amplab.berkeley.edu

  5. Why a New Programming Model?
     • MapReduce greatly simplified big data analysis
     • But as soon as it got popular, users wanted more:
       – more complex, multi-pass analytics (e.g. ML, graph)
       – more interactive ad-hoc queries
       – more real-time stream processing
     • All three need faster data sharing across parallel jobs

  6. Data Sharing in MapReduce
     [diagram: each iteration reads its input from HDFS and writes its result back to HDFS
      (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → ...); likewise, each of
      query 1-3 re-reads the input from HDFS to produce result 1-3]
     • Slow due to replication, serialization, and disk I/O

  7. Data Sharing in Spark
     [diagram: iterations (iter. 1 → iter. 2 → ...) pass data directly through distributed
      memory; the input gets one-time processing, after which queries 1-3 run against memory]
     • ~10× faster than network and disk

  8. Spark Programming Model
     • Key idea: resilient distributed datasets (RDDs)
       – distributed collections of objects that can be cached in memory across the cluster
       – manipulated through parallel operators
       – automatically recomputed on failure
     • Programming interface
       – functional APIs in Scala, Java, Python
       – interactive use from the Scala shell
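To make the model concrete, a minimal PySpark sketch (an editorial illustration, not from the slides; it assumes an already-running SparkContext named sc):

    nums = sc.parallelize(range(1000))          # a distributed collection (RDD)
    squares = nums.map(lambda x: x * x)         # parallel operator (lazy)
    squares.cache()                             # keep partitions in cluster memory
    total = squares.reduce(lambda a, b: a + b)  # action: actually runs the job
    print(total)                                # 332833500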

  9. Example: Log Mining
     Load error messages from a log into memory, then interactively search for various patterns:

         lines = spark.textFile("hdfs://...")                    # base RDD
         errors = lines.filter(lambda x: x.startswith("ERROR"))  # transformed RDD
         messages = errors.map(lambda x: x.split('\t')[2])
         messages.cache()

     [diagram: a driver program coordinating several workers]

  10. Lambda Functions
      Lambda functions come from functional programming: a lambda is an implicit (anonymous) function definition.

          errors = lines.filter(lambda x: x.startswith("ERROR"))
          messages = errors.map(lambda x: x.split('\t')[2])

      The filter predicate above is equivalent to the named function:

          bool detect_error(string x) {
              return x.startswith("ERROR");
          }
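The same equivalence written out in Python (a sketch; assumes sc):

    lines = sc.parallelize(["ERROR disk full", "INFO all good"])

    # a named function...
    def detect_error(x):
        return x.startswith("ERROR")
    errors_named = lines.filter(detect_error)

    # ...and the equivalent implicit (lambda) definition
    errors_lambda = lines.filter(lambda x: x.startswith("ERROR"))
    assert errors_named.collect() == errors_lambda.collect()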

  11. Example: Log Mining (continued)
      The cached messages RDD can now be queried interactively; the driver ships tasks to workers, which read their input blocks and cache results:

          messages.filter(lambda x: "foo" in x).count()  # action
          messages.filter(lambda x: "bar" in x).count()
          . . .

      [diagram: the driver sends tasks to three workers; each reads its block (Block 1-3) and fills its cache (Cache 1-3)]
      • Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
      • Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

  12.-13. Fault Tolerance
      RDDs track lineage info to rebuild lost data:

          file.map(lambda rec: (rec.type, 1)) \
              .reduceByKey(lambda x, y: x + y) \
              .filter(lambda tc: tc[1] > 10)  # keep (type, count) pairs with count > 10

      [diagram, shown in two animation steps: input file → map → reduce → filter; a lost partition is rebuilt by replaying this chain]
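Lineage can be inspected on any RDD via toDebugString(), a standard RDD method; a small sketch with toy data (assumes sc):

    records = sc.parallelize(["a 1", "a 2", "b 3"])
    counts = (records.map(lambda rec: (rec.split()[0], 1))
                     .reduceByKey(lambda x, y: x + y)
                     .filter(lambda tc: tc[1] > 10))
    # prints the chain Spark would replay to rebuild a lost partition
    print(counts.toDebugString())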

  14. Example: Logistic Regression
      [chart: running time (s) vs number of iterations (1-30) for Hadoop and Spark; Hadoop takes ~110 s per iteration, while Spark takes 80 s for the first iteration and ~1 s for each further iteration]
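The gap comes from where the training data lives between iterations. A rough sketch of the iterative pattern behind the chart (an editorial illustration with made-up data and step size; assumes sc):

    from math import exp
    from random import random, choice

    # read once, cache, then reuse in memory on every gradient step
    points = sc.parallelize(
        [(random(), choice([-1.0, 1.0])) for _ in range(100000)]
    ).cache()  # the first pass pays the load cost; later passes hit memory

    w = 0.0
    for _ in range(10):  # each iteration is one job over the cached RDD
        grad = points.map(
            lambda p: -p[1] * p[0] / (1.0 + exp(p[1] * p[0] * w))
        ).reduce(lambda a, b: a + b)
        w -= 0.001 * grad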

  15. Spark in Scala and Java

          // Scala:
          val lines = sc.textFile(...)
          lines.filter(x => x.contains("ERROR")).count()

          // Java:
          JavaRDD<String> lines = sc.textFile(...);
          lines.filter(new Function<String, Boolean>() {
              Boolean call(String s) { return s.contains("error"); }
          }).count();

  16. Supported Operators
      • map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
      • reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
      • sample, take, first, partitionBy, mapWith, pipe, save, ...
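A few of these in action (a sketch; assumes sc):

    words = sc.parallelize(["spark", "hive", "spark", "pig"])
    pairs = words.map(lambda w: (w, 1))

    print(words.filter(lambda w: "s" in w).count())         # 2
    print(pairs.reduceByKey(lambda a, b: a + b).collect())  # word counts
    print(pairs.join(sc.parallelize([("spark", "fast")])).collect())
    print(words.union(sc.parallelize(["hadoop"])).collect())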

  17. Software Components
      • Spark client is a library in the user program (1 instance per app)
      • Runs tasks locally or on a cluster
        – Mesos, YARN, standalone mode
      • Accesses storage systems via the Hadoop InputFormat API
        – can use HBase, HDFS, S3, ...
      [diagram: your application holds a SparkContext, which talks to local threads or a cluster manager; workers run Spark executors over HDFS or other storage]
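What "client is a library" looks like in code; a sketch (the app name, master URL, and path are made-up examples):

    from pyspark import SparkConf, SparkContext

    # one SparkContext per application; the master URL selects the cluster
    # manager: "local[4]", "spark://host:7077" (standalone), "mesos://...",
    # or "yarn"
    conf = SparkConf().setAppName("my-app").setMaster("local[4]")
    sc = SparkContext(conf=conf)

    # storage access goes through the Hadoop InputFormat API, so HDFS or
    # S3 URIs work directly
    rdd = sc.textFile("hdfs://namenode:9000/logs/part-*")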

  18. Task Scheduler
      • Supports general task graphs
      • Automatically pipelines functions
      • Data-locality aware
      • Partitioning aware, to avoid shuffles
      [diagram: a task graph over RDDs A-F, split at shuffle boundaries into Stage 1 (groupBy), Stage 2 (map, filter), and Stage 3 (join); already-cached partitions are skipped]
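The last bullet in code: with a known partitioner, a join can avoid re-shuffling; a sketch with toy data (assumes sc):

    pairs = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
    other = sc.parallelize([(1, "x"), (2, "y")])

    # hash-partition once and cache; the scheduler remembers the partitioning
    pairs_part = pairs.partitionBy(4).cache()

    # joining co-partitioned RDDs needs no extra shuffle of pairs_part
    joined = pairs_part.join(other.partitionBy(4))
    print(joined.toDebugString())  # shows the stage/shuffle structure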

  19. Spark SQL
      • Columnar SQL analytics engine for Spark
        – supports both SQL and complex analytics
        – up to 100× faster than Apache Hive
      • Compatible with Apache Hive
        – HiveQL, UDF/UDAF, SerDes, scripts
        – runs on existing Hive warehouses
      • In use at Yahoo! for fast in-memory OLAP
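A sketch of what Hive compatibility looks like from PySpark (the logs table is hypothetical; HiveContext is the Hive-compatible entry point from the Spark 1.x era):

    from pyspark.sql import HiveContext

    hc = HiveContext(sc)  # reads the existing Hive metastore and warehouse
    top = hc.sql("""
        SELECT status, COUNT(*) AS n
        FROM logs
        GROUP BY status
        ORDER BY n DESC
        LIMIT 10
    """)
    print(top.collect())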

  20. Hive Architecture
      [diagram: a client (CLI, JDBC) talks to the driver, which runs SQL query parsing → query optimization → physical plan execution, consulting the Hive catalog; execution runs as MapReduce jobs over HDFS]

  21. Spark SQL Architecture
      [diagram: same structure as the Hive architecture, with a cache manager added to the driver and execution on Spark instead of MapReduce, still over HDFS]
      [Engle et al., SIGMOD 2012]

  22. What Makes It Faster?
      • Lower-latency engine (Spark is fine with 0.5 s jobs)
      • Support for general DAGs
      • Column-oriented storage and compression
      • New optimizations (e.g. map pruning)

  23. Other Spark Stack Projects
      • Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

            sc.twitterStream(...)
              .flatMap(_.getText.split(" "))
              .map(word => (word, 1))
              .reduceByWindow("5s", _ + _)

      • MLlib: library of high-quality machine learning algorithms (out since 0.8)

  24. Performance
      [charts comparing Spark-stack components with specialized systems:
       – SQL: response time (s) for Impala (disk), Impala (mem), Redshift, Spark SQL (disk), Spark SQL (mem)
       – Streaming: throughput (MB/s/node) for Storm vs Spark
       – Graph: response time (min) for Hadoop, Giraph, GraphX]

  25. What It Means for Users
      • Separate frameworks: every step reads from and writes back to HDFS
        (HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → ...)
      • Spark: a single HDFS read, then ETL, train, and query share data in memory

  26. Conclusion
      • Big data analytics is evolving to include:
        – more complex analytics (e.g. machine learning)
        – more interactive ad-hoc queries
        – more real-time stream processing
      • Spark is a fast platform that unifies these applications
      • More info: spark-project.org

  27. SPARK MLLIB

  28. What is MLlib?
      MLlib is a Spark subproject providing machine learning primitives:
      • initial contribution from AMPLab, UC Berkeley
      • shipped with Spark since version 0.8

  29. What is MLlib?
      Algorithms:
      • classification: logistic regression, linear support vector machine (SVM), naive Bayes
      • regression: generalized linear regression (GLM)
      • collaborative filtering: alternating least squares (ALS)
      • clustering: k-means
      • decomposition: singular value decomposition (SVD), principal component analysis (PCA)
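One algorithm from the list as a minimal sketch, k-means on toy points (assumes sc):

    from pyspark.mllib.clustering import KMeans

    data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
    model = KMeans.train(data, k=2, maxIterations=10)
    print(model.clusterCenters)       # two centers, near (0.5, 0.5) and (8.5, 8.5)
    print(model.predict([0.5, 0.5]))  # cluster id for a new point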

  30. Collaborative Filtering

  31. Alternating Least Squares (ALS)
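For reference (the standard ALS formulation; the slide itself is a figure): ALS factorizes the rating matrix R ≈ XYᵀ by minimizing a regularized squared error over the set Ω of observed ratings,

    \min_{X, Y} \sum_{(u,i) \in \Omega} (r_{ui} - x_u^\top y_i)^2
                + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big)

Fixing Y turns the problem into an independent least-squares solve for each user vector x_u, and fixing X does the same for each item vector y_i; the two solves are alternated until convergence, hence the name.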

  32. Collaborative Filtering in Spark MLlib

          from pyspark.mllib.recommendation import ALS, Rating

          trainset = sc.textFile("s3n://bads-music-dataset/train_*.gz") \
              .map(lambda l: l.split('\t')) \
              .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))

          # train the model
          model = ALS.train(trainset, rank=10, iterations=10)

          # load testing set
          testset = sc.textFile("s3n://bads-music-dataset/test_*.gz") \
              .map(lambda l: l.split('\t')) \
              .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2])))

          # apply model to testing set (only first two cols) to predict
          predictions = model.predictAll(testset.map(lambda p: (p[0], p[1]))) \
              .map(lambda r: ((r[0], r[1]), r[2]))
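A follow-up the slide stops short of: scoring the predictions against the held-out ratings by joining on (user, item) and computing RMSE (a sketch; mean() is a standard RDD method):

    from math import sqrt

    rates = testset.map(lambda r: ((r[0], r[1]), r[2]))
    joined = rates.join(predictions)  # ((user, item), (actual, predicted))
    mse = joined.map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()
    print("RMSE:", sqrt(mse))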

  33. Spark MLlib – ALS Performance

          System      Wall-clock time (seconds)
          Matlab      15443
          Mahout       4206
          GraphLab      291
          MLlib         481

      • Dataset: Netflix data
      • Cluster: 9 machines
      • MLlib is an order of magnitude faster than Mahout
      • MLlib is within a factor of 2 of GraphLab

  34. Spark Implementation of ALS
      • Workers load the data
      • Models are instantiated at the workers
      • At each iteration, models are shared via a join between workers
      • Good scalability
      • Works on large datasets
      [diagram: a master coordinating several workers]

  35. Spark SQL + MLlib
