Spark Overview / High-level Architecture Indexing from Spark - PowerPoint PPT Presentation

• Spark Overview / High-level Architecture
• Indexing from Spark
• Reading data from Solr + term vectors & Spark SQL
• Document Matching

• User since 2010, committer since April 2014, work for … SolrCloud features … and bin/solr!
• Release manager for Lucene / Solr 5.1
• … in Action
• Several years of experience working with Hadoop, Pig, Hive, …, but only started using Spark about 6 months ago

• Wealth of overview / getting started resources on the Web
  • https://spark.apache.org/
  • https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Faster, more modernized alternative to MapReduce
  • Spark, running on Hadoop, sorted 100 TB in 23 minutes (3x faster than Yahoo's previous record while using 10x fewer machines)
• Unified platform for Big Data
• Great for iterative algorithms (PageRank, K-Means, logistic regression) & interactive data mining

[Diagram: Spark components] Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (BSP) sit on top of Spark Core (Execution Model, The Shuffle, Caching).

Spark Master (daemon)
• Keeps track of live workers
• Web UI on port 8080
• Task Scheduler
• Restart failed tasks

Spark Worker Node (1...N of these)
• Spark Slave (daemon)
• Spark Executor (JVM process)
  • Tasks

Losing a master prevents new applications from being executed
Can achieve HA using ZooKeeper and multiple master nodes

val file = spark.textFile("…")
// split each line into words, then pair every word with a count of 1
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))

[Diagram: word count data flow]
• flatMap: split lines into words (quick, brown, fox, quick, quick, …)
• map(word => (word, 1)): map words into pairs with count of 1: (quick,1), (brown,1), (fox,1), (quick,1), (quick,1), …
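The split and pair steps described above can be tried without a cluster. What follows is a minimal plain-Java sketch, assuming java.util.stream as a stand-in for Spark's RDD API; the class name WordCountSketch and the sample lines are made up for illustration. The merge function passed to toMap does the per-word summing that Spark's reduceByKey would do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper: a local, single-JVM analogue of the Spark word count.
final class WordCountSketch {

    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // flatMap: split every line into words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                // pair each word with 1 and sum the 1s per word (sorted by word)
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("quick brown fox", "quick quick");
        System.out.println(count(lines)); // prints {brown=1, fox=1, quick=3}
    }
}
```

Unlike an RDD, the stream here runs in one process; the point is only to show the shape of the pipeline.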

• Created from external system OR using a transformation of another RDD
• RDDs track the lineage of coarse-grained transformations (map, join, filter, etc.)
• If a partition is lost, RDDs can be re-computed by re-playing the transformations
• User can choose to persist an RDD (for reusing during interactive data mining)
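The lineage idea can be illustrated with a toy class. This is a sketch under stated assumptions, not Spark's RDD implementation: LineageSketch and all of its methods are hypothetical names, everything runs in one JVM, and there is no partitioning. Each dataset stores only how to compute itself from its parent, so dropping a persisted result just triggers a replay of the transformations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical toy class: remembers how to build its data instead of storing it.
final class LineageSketch<T> {
    private final Supplier<List<T>> compute; // replays the lineage up to this node
    private List<T> cached;                  // filled only when persist() was requested
    private boolean persist;

    private LineageSketch(Supplier<List<T>> compute) { this.compute = compute; }

    // "Created from external system": the source is re-read on every replay.
    static <T> LineageSketch<T> fromSource(Supplier<List<T>> source) {
        return new LineageSketch<>(source);
    }

    // "...or using a transformation of another RDD"
    <R> LineageSketch<R> map(Function<T, R> f) {
        return new LineageSketch<>(() -> collect().stream().map(f).collect(Collectors.toList()));
    }

    LineageSketch<T> filter(Predicate<T> p) {
        return new LineageSketch<>(() -> collect().stream().filter(p).collect(Collectors.toList()));
    }

    LineageSketch<T> persist() { persist = true; return this; }

    // Simulates losing the materialized data, e.g. because a worker died.
    void dropCache() { cached = null; }

    List<T> collect() {
        if (cached != null) return cached;
        List<T> result = compute.get();   // replay the transformations
        if (persist) cached = result;
        return result;
    }

    public static void main(String[] args) {
        LineageSketch<Integer> lengths = LineageSketch
                .<String>fromSource(() -> Arrays.asList("quick", "brown", "fox"))
                .map(String::length)
                .filter(n -> n > 3)
                .persist();
        System.out.println(lengths.collect()); // computed from the source
        lengths.dropCache();
        System.out.println(lengths.collect()); // recomputed by replaying map and filter
    }
}
```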

https://github.com/LucidWorks/spark-solr/
• Streaming applications
  • Real-time, streaming ETL jobs
  • Solr as sink for Spark job
  • Real-time document matching against stored queries
• Distributed computations (interactive data mining, machine learning)
  • Solr query as Spark RDD (resilient distributed dataset)
  • Optionally process results from each shard in parallel

• Transform a stream of records into small, deterministic batches
  • Discretized stream: sequence of RDDs
  • Once you have an RDD, you can use all the other Spark libs (MLlib, etc.)
  • Low-latency micro batches
  • Time to process a batch must be less than the batch interval time
• Two types of operators:
  • Transformations (group by, join, etc.)
  • Output (send to some external sink, e.g. Solr)
• Impressive performance!
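To make the discretization concrete, here is a minimal sketch, assuming timestamped events that are assigned to fixed time windows. MicroBatchSketch, Event, and keepsUp are hypothetical names, not part of Spark Streaming, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of cutting a stream into deterministic micro batches.
final class MicroBatchSketch {

    // Hypothetical record type: a payload plus its arrival time in milliseconds.
    static final class Event {
        final long timestampMs;
        final String payload;

        Event(long timestampMs, String payload) {
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    // Batch k holds the events with k * intervalMs <= timestamp < (k + 1) * intervalMs.
    static Map<Long, List<String>> discretize(List<Event> events, long intervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchIndex = e.timestampMs / intervalMs;
            batches.computeIfAbsent(batchIndex, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    // Processing has to finish within one interval, otherwise batches pile up.
    static boolean keepsUp(long processingMs, long intervalMs) {
        return processingMs < intervalMs;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(0, "a"), new Event(999, "b"), new Event(1000, "c"), new Event(2500, "d"));
        System.out.println(discretize(events, 1000)); // prints {0=[a, b], 1=[c], 2=[d]}
    }
}
```

Because the batch index depends only on the timestamp and the interval, re-running the same input yields the same batches, which is what makes the batches deterministic.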

[Diagram: indexing tweets from Spark Streaming into Solr]
• Command-line options: … -zkHost localhost:2181 -collection social
• Pipeline: Twitter stream → map() → SolrSupport.indexDStreamOfDocs → Solr
• map() step: various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection)

// receive a stream of tweets (TwitterUtils comes from the spark-streaming-twitter module)
JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

// map incoming tweets into SolrInputDocument objects for indexing in Solr
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc =
        SolrSupport.autoMapToSolrInputDoc("tweet-" + status.getId(), status, null);
      doc.setField("provider_s", "twitter");
      doc.setField("author_s", status.getUser().getScreenName());
      doc.setField("type_s", status.isRetweet() ? "echo" : "post");
      return doc;
    }
  }
);

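public static void indexDStreamOfDocs(final String zkHost, final String collection,
                                      final int batchSize, JavaDStream<SolrInputDocument> docs) {
  docs.foreachRDD(
    new Function<JavaRDD<SolrInputDocument>, Void>() {
      public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
        solrInputDocumentJavaRDD.foreachPartition(
          new VoidFunction<Iterator<SolrInputDocument>>() {
            public void call(Iterator<SolrInputDocument> solrInputDocumentIterator) throws Exception {
              // one Solr client per partition, created on the executor
              final SolrServer solrServer = getSolrServer(zkHost);
              List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
              while (solrInputDocumentIterator.hasNext()) {
                batch.add(solrInputDocumentIterator.next());
                if (batch.size() >= batchSize)
                  sendBatchToSolr(solrServer, collection, batch);
              }
              // flush whatever is left over after the last full batch
              if (!batch.isEmpty())
                sendBatchToSolr(solrServer, collection, batch);
            }
          }
        );
        return null;
      }
    }
  );
}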

• For each document, determine which of a large set of stored queries matches
• Useful for alerts, alternative flow paths through a stream, etc.
• Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match
• Matching framework; you have to decide where to load the stored queries from and what to do when matches are found
• Scale it using Spark; if you need to scale to many queries, check out …
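The matching step itself can be sketched in a few lines. This is a deliberately simplified illustration: instead of indexing the micro-batch into an embedded Solr instance and running real Solr queries, it models each stored query as a plain Java Predicate over a document's fields. DocMatchSketch, the query names, and the sample documents are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for query matching: predicates replace stored Solr queries.
final class DocMatchSketch {

    // For every document in the batch, list the names of the stored queries it matches.
    static Map<String, List<String>> match(
            Map<String, Map<String, String>> docsById,
            Map<String, Predicate<Map<String, String>>> storedQueries) {
        Map<String, List<String>> matches = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> doc : docsById.entrySet()) {
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, Predicate<Map<String, String>>> query : storedQueries.entrySet()) {
                if (query.getValue().test(doc.getValue())) {
                    hits.add(query.getKey());
                }
            }
            matches.put(doc.getKey(), hits);
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> tweet = new HashMap<>();
        tweet.put("type_s", "echo");
        tweet.put("author_s", "alice");

        Map<String, Map<String, String>> batch = new LinkedHashMap<>();
        batch.put("tweet-1", tweet);

        Map<String, Predicate<Map<String, String>>> queries = new LinkedHashMap<>();
        queries.put("retweets", d -> "echo".equals(d.get("type_s")));
        queries.put("by-bob", d -> "bob".equals(d.get("author_s")));

        System.out.println(match(batch, queries)); // prints {tweet-1=[retweets]}
    }
}
```

What to do with the resulting matches (raise an alert, route the document elsewhere) is left to the caller, mirroring the framework's division of responsibility.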

[Diagram: document matching pipeline]
• Twitter stream → map() → SolrSupport.filterDocuments
• JavaDStream<SolrInputDocument> enriched = SolrSupport.filterDocuments(docFilterContext, …
• Get queries …
• Index docs into an EmbeddedSolrServer, initialized from configs stored in ZooKeeper

• Custom partitioning scheme for RDD using Solr's DocRouter
• Stream docs directly to each shard leader using metadata from document shard assignment, and ConcurrentUpdateSolrClient

final ShardPartitioner shardPartitioner = new ShardPartitioner(zkHost, collection);
// pairs: assumed name for the pair RDD of (document ID, SolrInputDocument)
pairs.partitionBy(shardPartitioner).foreachPartition(
  new VoidFunction<Iterator<Tuple2<String, SolrInputDocument>>>() {
    public void call(Iterator<Tuple2<String, SolrInputDocument>> tupleIter) throws Exception {
      ConcurrentUpdateSolrClient cuss = null;
      while (tupleIter.hasNext()) {
        // … ConcurrentUpdateSolrClient once per partition
        // …
      }
    }
  }
);
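The idea behind partitioning by shard can be sketched as hashing each document ID into one of several contiguous slices of the hash space. This is only an illustration under loud assumptions: Solr's DocRouter uses its own hash function and the shard ranges recorded in the cluster state, whereas this sketch substitutes String.hashCode and equal-width ranges. ShardPartitionSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hash-range partitioning; not Solr's actual routing.
final class ShardPartitionSketch {

    // Maps a document ID to one of numShards equal-width slices of the 32-bit hash space.
    static int shardFor(String docId, int numShards) {
        long unsignedHash = docId.hashCode() & 0xFFFFFFFFL;  // 0 .. 2^32 - 1
        return (int) ((unsignedHash * numShards) >>> 32);    // 0 .. numShards - 1
    }

    // Groups document IDs by target shard, as a partitioner would before sending.
    static Map<Integer, List<String>> partition(List<String> docIds, int numShards) {
        Map<Integer, List<String>> byShard = new TreeMap<>();
        for (String id : docIds) {
            byShard.computeIfAbsent(shardFor(id, numShards), s -> new ArrayList<>()).add(id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("tweet-1", "tweet-2", "tweet-3", "tweet-4");
        System.out.println(partition(ids, 2));
    }
}
```

Once documents are grouped this way, each group can be streamed to a single destination, which is the property the partitioner in the slide relies on to send documents straight to the right shard leader.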

• Can execute any query and expose as an RDD
• Produces JavaRDD<SolrDocument>
• Use deep-paging if needed (cursorMark)
• For reading full result sets where global sort order doesn't matter, parallelize query execution by distributing requests across the Spark cluster

JavaRDD<SolrDocument> results =
  solrRDD.queryShards(jsc, solrQuery);

• Can be used to construct RDD<Vector> which can then be passed to MLlib

SolrRDD solrRDD = new SolrRDD(zkHost, collection);
JavaRDD<Vector> vectors =
  solrRDD.queryTermVectors(jsc, solrQuery, field, numFeatures);
// cluster the term vectors with MLlib's K-Means
KMeansModel clusters =
  KMeans.train(vectors.rdd(), numClusters, numIterations);

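SolrQuery solrQuery = new SolrQuery(...);
solrQuery.setFields("text_t", "type_s");
SolrRDD solrRDD = new SolrRDD(zkHost, collection);
JavaRDD<SolrDocument> solrJavaRDD = solrRDD.queryShards(jsc, solrQuery);

SQLContext sqlContext = new SQLContext(jsc);
// expose the Solr results to Spark SQL as a table named "tweets"
DataFrame tweets =
  solrRDD.applySchema(sqlContext, solrQuery, solrJavaRDD, zkHost, collection);
tweets.registerTempTable("tweets");

DataFrame results =
  sqlContext.sql("SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'");
results.javaRDD().map(new Function<Row, Long>() {
  public Long call(Row row) {
    return row.getLong(0);
  }
});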

• Reference implementation of Solr and Spark on YARN
• Formal benchmarks for reads and writes to Solr
• Check out SOLR-6816 – improving replication performance
• Add Spark support to Solr Scale Toolkit
• Integrate metrics to give visibility into performance
• More use cases …

Feel free to reach out to me with questions:
