Extending Spark ML: Super Happy New Pipeline Stage Time! (*Scala only)


SLIDE 1

Extending Spark ML

Super Happy New Pipeline Stage Time!

kroszk@

Built with public APIs* *Scala only - see developer for details.

SLIDE 2

Who am I?

  • My name is Holden Karau
  • Preferred pronouns are she/her
  • I’m a Principal Software Engineer at IBM’s Spark Technology Center
  • previously Alpine, Databricks, Google, Foursquare & Amazon
  • co-author of Learning Spark & Fast Data Processing with Spark
    ○ co-author of a new book focused on Spark performance coming this year*

  • @holdenkarau
  • Slide share http://www.slideshare.net/hkarau
  • Linkedin https://www.linkedin.com/in/holdenkarau
  • Github https://github.com/holdenk
  • Spark Videos http://bit.ly/holdenSparkVideos
SLIDE 3

What are we going to talk about?

  • What Spark ML pipelines look like
  • What Estimators and Transformers are
  • How to implement a Transformer - and what else you will need to do to make an estimator
  • I will of course try and sell you many copies of my new book if you have an expense account.

SLIDE 4

Spark ML pipelines

[Diagram: a batch pipeline (Tokenizer → HashingTF → String Indexer → Naive Bayes) and a streaming pipeline (Tokenizer → HashingTF → Streaming String Indexer → Streaming Naive Bayes); calling fit(df) turns the Estimator side into the Transformer side.]

  • In the batch setting, an estimator is trained on a dataset, and produces a static, immutable transformer.
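The batch half of that diagram can be wired up in a few lines; a minimal sketch, assuming a DataFrame `df` with a string column "text" and a string label column "category" (those column names are hypothetical):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val indexer = new StringIndexer().setInputCol("category").setOutputCol("label")
val nb = new NaiveBayes()

// The Pipeline is itself an Estimator: fit() trains the stages in order
// and returns a PipelineModel, which is a Transformer.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, indexer, nb))
val model = pipeline.fit(df)          // Estimator -> Transformer
val predictions = model.transform(df) // static, immutable transform
```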

SLIDE 5

So what does a pipeline stage look like?

Are either an:

  • Transformer - no need to train, can directly transform (e.g. HashingTF) (with transform)
  • Estimator - has a method called “fit” which returns a Transformer (a model)

Must provide:

  • transformSchema (used to validate the input schema is reasonable) & copy

Often have:

  • Special params for configuration (so we can do meta-algorithms)

SLIDE 6

Walking through a simple transformer:

class HardCodedWordCountStage(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("hardcodedwordcount"))

  def copy(extra: ParamMap): HardCodedWordCountStage = {
    defaultCopy(extra)
  }

SLIDE 7

Verify the input schema is reasonable:

override def transformSchema(schema: StructType): StructType = {
  // Check that the input type is a string
  val idx = schema.fieldIndex("happy_pandas")
  val field = schema.fields(idx)
  if (field.dataType != StringType) {
    throw new Exception(
      s"Input type ${field.dataType} did not match input type StringType")
  }
  // Add the return field
  schema.add(StructField("happy_panda_counts", IntegerType, false))
}

SLIDE 8

Do the “work” (e.g. predict labels or w/e):

def transform(df: Dataset[_]): DataFrame = {
  val wordcount = udf { in: String => in.split(" ").size }
  df.select(col("*"),
    wordcount(df.col("happy_pandas")).as("happy_panda_counts"))
}
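Putting the three pieces together, usage might look like this (a sketch; assumes a SparkSession `spark` is already in scope):

```scala
import spark.implicits._

// A toy DataFrame with the string column the stage expects.
val df = Seq("happy happy panda", "sad panda").toDF("happy_pandas")

val stage = new HardCodedWordCountStage()
stage.transformSchema(df.schema) // validates "happy_pandas" is a string column
val counted = stage.transform(df)
// counted now carries an extra IntegerType column "happy_panda_counts"
counted.show()
```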

SLIDE 9

What about configuring our stage?

class ConfigurableWordCount(override val uid: String) extends Transformer {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)
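With the params in place, `transformSchema` and `transform` can read the column names through `$(...)` (shorthand for `getOrDefault`) instead of hard-coding them; a sketch continuing the class above:

```scala
// Same logic as the hard-coded stage, but the column names
// come from the params via $(...).
override def transformSchema(schema: StructType): StructType = {
  val field = schema.fields(schema.fieldIndex($(inputCol)))
  if (field.dataType != StringType) {
    throw new Exception(
      s"Input type ${field.dataType} did not match input type StringType")
  }
  schema.add(StructField($(outputCol), IntegerType, false))
}

def transform(df: Dataset[_]): DataFrame = {
  val wordcount = udf { in: String => in.split(" ").size }
  df.select(col("*"), wordcount(df.col($(inputCol))).as($(outputCol)))
}
```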

SLIDE 10

So why do we configure it that way?

  • Allow meta algorithms to work on it
  • If you look inside of Spark you’ll see “sharedParams” for common params (like input column)
  • We can’t access those unless we pretend to be inside of org.apache.spark - so we have to make our own
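Rolling our own shared-param traits might look something like this (names here are hypothetical stand-ins for Spark's private HasInputCol/HasOutputCol):

```scala
import org.apache.spark.ml.param.{Param, Params}

// Our own "shared params", so multiple custom stages can
// share the same param definitions via mixin.
trait MyHasInputCol extends Params {
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final def getInputCol: String = $(inputCol)
}

trait MyHasOutputCol extends Params {
  final val outputCol = new Param[String](this, "outputCol", "The output column")
  final def getOutputCol: String = $(outputCol)
}
```

A custom stage can then `extends Transformer with MyHasInputCol with MyHasOutputCol` and drop the per-class param boilerplate.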

SLIDE 11

So how to make an estimator?

  • Very similar: instead of directly providing transform, provide a `fit` which returns a “model” which implements the Transformer interface as shown above
  • We could look at one - but I’m only supposed to talk for 10 minutes
  • So keep an eye out for my blog post in November :)
  • Also take a look at the algorithms in Spark itself (helpful traits you can mix in to take care of many common things).
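A very rough sketch of that estimator/model split, with hypothetical names and a toy “training” step (collecting the distinct words seen in the input column):

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical estimator: fit() scans the data to "learn" a word list,
// then returns a Model (which is a Transformer) holding that state.
class SimpleWordListEstimator(override val uid: String)
    extends Estimator[SimpleWordListModel] {
  def this() = this(Identifiable.randomUID("simplewordlist"))

  override def fit(dataset: Dataset[_]): SimpleWordListModel = {
    val words = dataset.select("happy_pandas").rdd
      .flatMap(_.getString(0).split(" ")).distinct().collect()
    new SimpleWordListModel(uid, words)
  }

  override def copy(extra: ParamMap): SimpleWordListEstimator = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType =
    schema.add(StructField("known_words", IntegerType, false))
}

// The "model": the static transformer produced by fit().
class SimpleWordListModel(override val uid: String, val words: Array[String])
    extends Model[SimpleWordListModel] {
  override def transform(dataset: Dataset[_]): DataFrame = {
    val known = words.toSet
    val countKnown = udf { in: String => in.split(" ").count(known.contains) }
    dataset.select(col("*"), countKnown(col("happy_pandas")).as("known_words"))
  }
  override def copy(extra: ParamMap): SimpleWordListModel =
    copyValues(new SimpleWordListModel(uid, words), extra)
  override def transformSchema(schema: StructType): StructType =
    schema.add(StructField("known_words", IntegerType, false))
}
```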

SLIDE 12

Resources to continue with:

  • O’Reilly Radar (“Ideas”) Blog Post http://bit.ly/extendSparkML
  • High Performance Spark Example Repo has some sample “custom” models https://github.com/high-performance-spark/high-performance-spark-examples
    ○ Of course buy several copies of the book - it is the gift of the season :p
  • The models inside of Spark itself: https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml (use some internal APIs but a good starting point)
  • As always the Spark API documentation: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
  • My Slide share http://www.slideshare.net/hkarau

SLIDE 13

  • Learning Spark
  • Fast Data Processing with Spark (Out of Date)
  • Fast Data Processing with Spark (2nd edition)
  • Advanced Analytics with Spark
  • Coming soon: Spark in Action
  • Coming soon: High Performance Spark

SLIDE 14

The next book…..

First seven chapters are available in “Early Release”*:

  • Buy from O’Reilly - http://bit.ly/highPerfSpark
  • Extending ML is covered in Chapter 9 :)

Get notified when updated & finished:

  • http://www.highperformancespark.com
  • https://twitter.com/highperfspark

* Early Release means extra mistakes, but also a chance to help us make a more awesome book.

SLIDE 15

k thnx bye :)

If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark

Will tweet results “eventually” @holdenkarau

Any PySpark users: have some simple UDFs you wish ran faster that you are willing to share? http://bit.ly/pySparkUDF

Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca) if you feel comfortable doing so :)