SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Hadoop and MapReduce Spark Streaming Analytics

SLIDE 2

Overview

Big data Hadoop MapReduce Spark SparkSQL MLlib Streaming analytics and other trends

SLIDE 3

Recall…

SLIDE 4

Infrastructure

“Big Data” “Integration” “Architecture” NoSQL and NewSQL “Streaming”

Two sides emerge

SLIDE 5

Analytics

“Data Science” “Machine Learning” “AI”

But also still: BI and Visualization

Two sides emerge

SLIDE 6

There’s a difference

SLIDE 7

Previously

In-memory analytics

Together with some intermediate techniques: mostly based on disk swapping and directed acyclic execution graphs

Now: moving to the world of big data

Managing, storing, querying data
Storage and computation in a distributed setup: over multiple machines
And (hopefully still, though it will be difficult for a while…): analytics

SLIDE 8

Hadoop

SLIDE 9

Hadoop

We all hear Hadoop being mentioned every time a team is talking about some big daunting task related to the analysis or management of big data

Have lots of volume? Hadoop! Unstructured data? Hadoop! Streaming data? Hadoop! Want to run super-fast machine learning in parallel? You guessed it… Hadoop!

So what is Hadoop and what is it not?

SLIDE 10

History

The genesis of Hadoop came from the Google File System paper that was published in 2003

Spawned another research paper from Google: MapReduce

Hadoop itself however started as part of a search engine called Nutch, which was being worked on by Doug Cutting and Mike Cafarella

5k lines of code for NDFS (Nutch Distributed File System) and 6k lines of code for MapReduce

In 2006, Cutting joined Yahoo! to work in its search engine division. The part of Nutch which dealt with distributed computing and processing (initially constructed to handle the simultaneous parsing of enormous amounts of web links in an efficient manner) was split off and renamed to “Hadoop”

(Hadoop was named after the toy elephant of Cutting’s son)

In 2008, Yahoo! open-sourced Hadoop. Hadoop became part of an ecosystem of technologies which are managed by the non-profit Apache Software Foundation. Today, most of the hype around Hadoop has passed us by, for reasons we’ll see later

SLIDE 11

Hadoop

Even when talking about “raw” Hadoop, it is important to know that it describes a stack containing four core modules:

1. Hadoop Common (a set of shared libraries)
2. Hadoop Distributed File System (HDFS), a Java-based file system to store data across multiple machines
3. MapReduce (a programming model to process large sets of data in parallel)
4. YARN (Yet Another Resource Negotiator), a framework to schedule and handle resource requests in a distributed environment

In MapReduce version 1 (Hadoop 1), HDFS and MapReduce were tightly coupled. Didn’t scale well to really big clusters. In Hadoop 2, the resource management and scheduling tasks are separated from MapReduce by YARN

SLIDE 12

HDFS

HDFS is the distributed file system used by Hadoop to store data in the cluster

HDFS lets you connect nodes (commodity personal computers, which was a big deal at the time) over which data files are distributed
You can then access and store the data files as one seamless file system
HDFS is fault tolerant and provides high-throughput access
Theoretically, you don’t need to have it running and files could instead be stored elsewhere

HDFS is composed of a NameNode, an optional SecondaryNameNode (for data recovery in the event of failure), and DataNodes which hold the actual data

A NameNode holds all the metadata regarding the stored files and manages namespace operations like opening, closing, and renaming files and directories, and maps data blocks to DataNodes

DataNodes handle read and write requests from HDFS clients and create, delete, and replicate data blocks according to instructions from the governing NameNode
A typical installation cluster has a dedicated machine that runs a name node and at least one data node
Data nodes continuously loop, asking the name node for instructions
HDFS supports a hierarchical file organization of directories and files inside them

SLIDE 13

HDFS

HDFS replicates file blocks for fault tolerance

An application can specify the number of replicas of a file at the time it is created, and this number can be changed any time after that. The name node makes all decisions concerning block replication

One of the main goals of HDFS is to support large files. The size of a typical HDFS block is 64MB

Therefore, each HDFS file consists of one or more 64MB blocks
HDFS tries to place each block on separate data nodes

SLIDE 14

HDFS

SLIDE 15

HDFS

SLIDE 16

HDFS

SLIDE 17

HDFS

SLIDE 18

HDFS

HDFS provides a native Java API and a native C-language wrapper for the Java API, as well as shell commands to interface with the file system

byte[] fileData = readFile();
String filePath = "/data/course/participants.csv";
Configuration config = new Configuration();

org.apache.hadoop.fs.FileSystem hdfs =
    org.apache.hadoop.fs.FileSystem.get(config);
org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(filePath);
org.apache.hadoop.fs.FSDataOutputStream outputStream = hdfs.create(path);
outputStream.write(fileData, 0, fileData.length);

In layman’s terms: a massive, distributed, C:-drive…

And note: reading in a massive file in a naive way will still end up badly

SLIDE 19

MapReduce

What is MapReduce?

A “programming framework” for coordinating tasks in a distributed environment
HDFS uses this “behind the scenes” to make access fast: reading a file is converted to a MapReduce task to read across multiple DataNodes and stream the resulting file
Can be used to construct scalable and fault-tolerant operations in general

HDFS provides a way to store files in a distributed fashion; MapReduce allows you to do something with them in a distributed fashion

SLIDE 20

MapReduce

The concepts of “map” and “reduce” existed long before Hadoop and stem from the domain of functional programming

Map: apply a function on every item in a list: result is a new list of values

numbers = [1,2,3,4,5]
numbers.map(λ x : x * x)   # [1,4,9,16,25]

Reduce: apply function on a list: result is a scalar

numbers.reduce(λ x : sum(x)) # 15
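These ideas translate directly to plain Python; a minimal sketch using the built-in map() and functools.reduce():

from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Map: apply a function to every item, producing a new list
squares = list(map(lambda x: x * x, numbers))    # [1, 4, 9, 16, 25]

# Reduce: fold the whole list down to a single scalar
total = reduce(lambda acc, x: acc + x, numbers)  # 15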

SLIDE 21

MapReduce

A Hadoop map-reduce pipeline works over lists of (key, value) pairs

The map operation maps each pair to a list of output key-value pairs

Zero, one, or more

This operation can be run in parallel over the input pairs
The input list could also contain a single key-value pair

Next, the output entries are shuffled and distributed so that all output entries belonging to the same key are assigned to the same worker
All of these workers then apply a reduce function to each group

Producing a final key-value pair for each distinct key
The resulting, final outputs are then (optionally) sorted per key to produce the final outcome

SLIDE 22

MapReduce

# Input: a list of key-value pairs
documents = [
    ('document1', 'two plus two does'),
    ('document2', 'not equal two'),
]

def map(key, value):
    # For each word, produce an output pair
    for word in value.split(' '):
        yield (word, 1)

for input in documents:
    map(*input)
# [ (two, 1), (plus, 1), (two, 1), (does, 1) ]
# [ (not, 1), (equal, 1), (two, 1) ]

def reduce(key, values):
    # For each key, produce output as sum of values
    yield (key, sum(values))

reduce('two', [1, 1, 1])  # ('two', 3)
reduce('plus', [1])       # ('plus', 1)
# ... and so on

SLIDE 23

MapReduce: word count example

SLIDE 24

MapReduce: word count example

SLIDE 25

MapReduce: averaging example

def map(key, value):
    yield (value['genre'], value['nrPages'])

SLIDE 26

MapReduce: averaging example

# A minibatch-style approach would also be possible
def map(key, value):
    for record in value:
        yield (record['genre'], record['nrPages'])

SLIDE 27

MapReduce: averaging example

def reduce(key, values):
    yield (key, sum(values) / len(values))

SLIDE 28

MapReduce: averaging example

There’s a gotcha, however: the reduce operation should work on partial results and be able to be applied multiple times in a chain
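A toy illustration of the problem in plain Python, assuming a naive reducer that simply averages the values it is given:

def naive_average(values):
    return sum(values) / len(values)

# Reducing everything in one go gives the true mean
naive_average([10, 20, 30])                    # 20.0

# Reducing in two chained steps gives a wrong answer, because the
# intermediate result has lost track of how many records it summarizes
naive_average([naive_average([10, 20]), 30])   # 22.5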

SLIDE 29

MapReduce: averaging example

SLIDE 30

MapReduce

The reduce operation should work on partial results and be able to be applied multiple times in a chain:

1. The reduce function should output the same structure as emitted by the map function, since this output can be used again in an additional reduce operation
2. The reduce function should provide correct results even if called multiple times on partial results

SLIDE 31

MapReduce: correct averaging example

def map(key, value):
    yield (value['genre'], (value['nrPages'], 1))

def reduce(key, values):
    sum, newcount = 0, 0
    for (nrPages, count) in values:
        sum = sum + nrPages * count
        newcount = newcount + count
    yield (key, (sum/newcount, newcount))

Instead of using a running average as a value, our value will now itself be a pair of (running average, number of records already seen)

SLIDE 32

MapReduce: correct averaging example

SLIDE 33

Testing it out

A small piece of Python code, map_reduce.py, will be made available as background material. You can use this to play around with the MapReduce paradigm without setting up a full Hadoop stack

SLIDE 34

Testing it out: minimum per group example

from map_reduce import runtask

documents = [
    ('drama', 200), ('education', 100), ('action', 20), ('thriller', 20),
    ('drama', 220), ('education', 150), ('action', 10), ('thriller', 160),
    ('drama', 140), ('education', 160), ('action', 20), ('thriller', 30)
]

# Provide a mapping function of the form mapfunc(value)
# Must yield (k,v) pairs
def mapfunc(value):
    genre, pages = value
    yield (genre, pages)

# Provide a reduce function of the form reducefunc(key, list_of_values)
# Must yield (k,v) pairs
def reducefunc(key, values):
    yield (key, min(values))

# Pass your input list, mapping and reduce functions
runtask(documents, mapfunc, reducefunc)
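As a sketch of how the chain-safe averaging pattern from slide 31 would look with the same helper (assuming runtask feeds reducer outputs back in as values, the way a real MapReduce run would):

# (running average, count) pairs make the reducer safe to chain
def mapfunc(value):
    genre, pages = value
    yield (genre, (pages, 1))

def reducefunc(key, values):
    total, count = 0, 0
    for (avg, n) in values:
        total += avg * n   # undo the averaging to recover the sum
        count += n
    yield (key, (total / count, count))

runtask(documents, mapfunc, reducefunc)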

SLIDE 35

Back to Hadoop

On Hadoop, MapReduce tasks are written using Java

Bindings for Python and other languages exist as well, but Java is the “native” environment
The Java program is packaged as a JAR archive and launched using the command:

hadoop jar myfile.jar ClassToRun [args...]
hadoop jar wordcount.jar RunWordCount /input/dataset.txt /output/

SLIDE 36

Back to Hadoop

public static class MyReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        IntWritable result = new IntWritable();
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

SLIDE 37

Back to Hadoop

hadoop jar wordcount.jar WordCount /users/me/dataset.txt /users/me/output/

SLIDE 38

Back to Hadoop

$ hadoop fs -ls /users/me/output
Found 2 items
-rw-r--r--   1 root hdfs          0 2017-05-20 15:11 /users/me/output/_SUCCESS
-rw-r--r--   1 root hdfs       2069 2017-05-20 15:11 /users/me/output/part-r-00000

$ hadoop fs -cat /users/me/output/part-r-00000
and 2
first 1
is 3
line 2
second 1
the 2
this 3

SLIDE 39

Back to Hadoop

MapReduce tasks can consist of more than mappers and reducers

Partitioners, Combiners, Shufflers, and Sorters

SLIDE 40

MapReduce

Constructing MapReduce programs requires “a certain skillset” in terms of programming (to put it lightly)

One does not simply implement Random Forest on MapReduce
There’s a reason why most tutorials don’t go much further than counting words

Tradeoffs in terms of speed, memory consumption, and scalability

Big does not mean fast
Does your use case really align with a search engine?

SLIDE 41

YARN

How is a MapReduce program coordinated amongst the different nodes in the cluster?

In the former Hadoop 1 architecture, the cluster was managed by a service called the JobTracker
TaskTracker services lived on each node and would launch tasks on behalf of jobs (instructed by the JobTracker)
The JobTracker would serve information about completed jobs
The JobTracker could still become overloaded, however!

In Hadoop 2, MapReduce is split into two components

The cluster resource management capabilities have become YARN, while the MapReduce- specific capabilities remain MapReduce

SLIDE 42

YARN

YARN’s setup has a couple of advantages

First, by breaking up the JobTracker into a few different services, it avoids many of the scaling issues facing Hadoop 1
It also makes it possible to run frameworks other than MapReduce on a Hadoop cluster. For example, Impala can also run on YARN and share resources on a cluster with MapReduce
I.e. it can be used for all sorts of coordination of tasks

This will be an advantage once we move away from Hadoop (see later)
Even then, people have also proposed alternatives to YARN (see later)

I.e. a general coordination and resource management framework

SLIDE 43

So… Hadoop?

Standard Hadoop: definitely not a turn-key solution for most environments

Just a big hard drive and a way to do scalable MapReduce? In a way which is not fun to program at all?

As such, many implementations and vendors also mix-in a number of additional projects such as:

HBase: a distributed database which runs on top of the Hadoop core stack (no SQL, just MapReduce)
Hive: a data warehouse solution with SQL-like query capabilities to handle data in the form of tables
Pig: a framework to manipulate data stored in HDFS without having to write complex MapReduce programs from scratch
Cassandra: another distributed database
Ambari: a web interface for managing Hadoop stacks (managing all these other fancy names)
Flume: a framework to collect and deal with streaming data intakes
Oozie: a more advanced job scheduler that cooperates with YARN
Zookeeper: a centralized service for maintaining configuration information and naming (a cluster on its own)
Sqoop: a connector to move data between Hadoop and relational databases
Atlas: a system to govern metadata and its compliance
Ranger: a centralized platform to define, administer and manage security policies consistently across Hadoop components
Spark: a computing framework geared towards data analytics

SLIDE 44

So… Hadoop?

SLIDE 45

SQL on Hadoop

SLIDE 46

The first letdown

2008: the first release of Apache Hive, the original SQL-on-Hadoop solution

Hive rapidly became one of the de-facto tools included with almost all Hadoop installations. It converts SQL queries to a series of map-reduce jobs, and presents itself to clients in a way which very much resembles a MySQL server. It also offers a command line client, Java APIs and JDBC drivers, which made the project wildly successful and quickly adopted by organizations which were beginning to realize that they’d taken a step back from their traditional data warehouse setups in their rush to switch to Hadoop as soon as possible

SELECT genre, SUM(nrPages)
FROM books        --\
GROUP BY genre    -- > convert to MapReduce job
ORDER BY genre    --/

“From the moment a new distributed data store gets popular, the next question will be how to run SQL on top of it… What do you mean it’s a file system? How do we query this thing? We need SQL!”

SLIDE 47

There is (was?) also HBase

The first database on Hadoop

Native database on top of Hadoop
No SQL, own get/put/filter operations
Complex queries as MapReduce jobs

hbase(main):009:0> scan 'users'
ROW     COLUMN+CELL
 seppe  column=email:, timestamp=1495293082872, value=seppe.vandenbroucke@kuleuven.be
 seppe  column=name:first, timestamp=1495293050816, value=Seppe
 seppe  column=name:last, timestamp=1495293067245, value=vanden Broucke
1 row(s) in 0.1170 seconds

hbase(main):011:0> get 'users', 'seppe'
COLUMN      CELL
 email:     timestamp=1495293082872, value=seppe.vandenbroucke@kuleuven.be
 name:first timestamp=1495293050816, value=Seppe
 name:last  timestamp=1495293067245, value=vanden Broucke
4 row(s) in 0.1250 seconds

SLIDE 48

There is (was?) also Pig

Another way to ease the pain of writing MapReduce programs

Still not very easy though
People still wanted good ole SQL

timesheet = LOAD 'timesheet.csv' USING PigStorage(',');
raw_timesheet = FILTER timesheet by $0>1;
timesheet_logged = FOREACH raw_timesheet GENERATE $0 AS driverId,
    $2 AS hours_logged, $3 AS miles_logged;
grp_logged = GROUP timesheet_logged by driverId;
sum_logged = FOREACH grp_logged GENERATE group as driverId,
    SUM(timesheet_logged.hours_logged) as sum_hourslogged,
    SUM(timesheet_logged.miles_logged) as sum_mileslogged;

SLIDE 49

Hive

2008: the first release of Apache Hive, the original SQL-on-Hadoop solution Hive converts SQL queries to a series of map-reduce jobs, and presents itself to clients in a way which very much resembles a MySQL server

SELECT genre, SUM(nrPages)
FROM books        --\
GROUP BY genre    -- > convert to MapReduce job
ORDER BY genre    --/

Hive is handy… but SQL-on-Hadoop technologies are not perfect implementations of relational database management systems:

They sacrifice features such as speed and SQL language compatibility; support for complex joins is lacking. For Hive, the main drawback was its lack of speed: because of the overhead incurred by translating each query into a series of map-reduce jobs, even the simplest of queries can consume a large amount of time

Big does not mean fast

SLIDE 50

So… without MapReduce?

For a long time, companies such as Hortonworks were pushing the development of Hive, mainly by putting efforts behind Apache Tez, which provides a new backend for Hive, no longer based on the map-reduce paradigm but on directed-acyclic-graph pipelines. In 2012, Cloudera, another well-known Hadoop vendor, introduced its own SQL-on-Hadoop technology as part of its “Impala” stack

Cloudera also opted to forego map-reduce completely, and instead uses its own set of execution daemons, which have to be installed alongside Hive-compatible data nodes. It offers SQL-92 syntax support, a command line client, and ODBC drivers
Much faster than a standard Hive installation, allowing for immediate feedback after queries, hence making them more interactive
Today: Apache Impala is open source

It didn’t take long for other vendors to take notice of the need for SQL-on-Hadoop, and in recent years, we saw almost every vendor joining the bandwagon and offering their own query engines (IBM’s BigSQL platform or Oracle’s Big Data SQL, for instance)

Some better, some worse

But…

SLIDE 51

Hype meets reality

Hadoop was created in 2006!

It’s now been more than a decade since Google’s papers on MapReduce

Interest in the concept of “Big Data” reached fever pitch sometime between 2011 and 2014

Big Data was the new “black”, “gold” or “oil”. There’s an increasing sense of having reached some kind of plateau. 2015 was probably the year when people started moving to AI and its many related concepts and flavors: machine intelligence, deep learning, etc. Today, we’re in the midst of a new “AI summer” (with its own hype as well)

“In a tech startup industry that loves its shiny new objects, the term “Big Data” is in the unenviable position of sounding increasingly “3 years ago”” – Matt Turck

SLIDE 52

Hype meets reality

Big Data wasn’t a very likely candidate for the type of hype it experienced in the first place

Big Data, fundamentally, is… plumbing. There’s a reason why most map-reduce examples don’t go much further than counting words. The early years of the Big Data phenomenon were propelled by a very symbiotic relationship among a core set of large Internet companies. Fast forward a few years, and we’re now in the thick of the much bigger, but also trickier, opportunity: the adoption of Big Data technologies by a broader set of companies

Those companies do not have the luxury of starting from scratch

Big Data success is not about implementing one piece of technology (like Hadoop), but instead requires putting together a collection of technologies, people and processes

SLIDE 53

Today

Today, the data side of the field has stabilized: the storage and querying aspect has found a good marriage between big data techniques, speed, a return to relational databases, and NoSQL-style scalability

E.g. Amazon Redshift, Snowflake, CockroachDB, Presto…

(We’ll revisit this later when talking about NoSQL)

“Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse.”

SLIDE 54

Big Analytics?

“What do you mean it’s a file system and some MapReduce? How do we query this thing?” We need SQL! Hive is too slow! Can we do it without MapReduce?

Most managers worth their salt have realized that Hadoop-based solutions might not be the right fit

Proper cloud-based databases might be

But the big unanswered question right now:

“How to use Hadoop for machine learning and analytics?”

Or rather:

“How to support distributed analytics?”

SLIDE 55

Big Analytics?

It turns out that MapReduce was never very well suited for analytics

Extremely hard to convert techniques to a map-reduce paradigm
Slow due to lots of in-out swapping to HDFS (ask the Mahout project, they tried)
Slow for most “online” tasks…
Querying is nice, but… we just end up with business intelligence dashboarding and pretending we have big data?

“2015 was the year of Apache Spark”

Bye bye, Hadoop!
Spark has been embraced by a variety of players, from IBM to Cloudera-Hortonworks
Spark is meaningful because it effectively addresses some of the key issues that were slowing down the adoption of Hadoop: it is much faster (benchmarks have shown Spark is 10 to 100 times faster than Hadoop’s MapReduce), easier to program, and lends itself well to machine learning

SLIDE 56

Spark

SLIDE 57

Time for a spark

Just as Hadoop was perhaps not the right solution to satisfy common querying needs, it was also not the right solution for analytics. In 2015, another project, Apache Spark, entered the scene in full with a radically different approach. Spark is meaningful because it effectively addresses some of the key issues that were slowing down the adoption of Hadoop: it is much faster (benchmarks have shown Spark is 10 to 100 times faster than Hadoop’s MapReduce), easier to program, and lends itself well to machine learning

SLIDE 58

Spark

Apache Spark is a top-level project of the Apache Software Foundation. It is an open-source distributed general-purpose cluster computing framework with an in-memory data processing engine that can do ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (streaming processing), with rich, concise, high-level APIs for the programming languages Scala, Python, Java, R, and SQL

Spark’s speed, simplicity, and broad support for existing development environments and storage systems make it popular with a wide range of developers, and relatively accessible to those learning to work with it for the first time. The project supporting Spark’s ongoing development is one of Apache’s largest and most vibrant, with over 500 contributors from more than 200 organizations responsible for code in the current software release

SLIDE 59

So what do we throw out?

The resource manager (YARN)?

We’re still running on a cluster of machines
Spark can run on top of YARN, but also Mesos (an alternative resource manager), or even in standalone mode

The data storage (HDFS)?

Again, Spark can work with a variety of storage systems:

Google Cloud
Amazon S3
Apache Cassandra
Apache Hadoop (HDFS)
Apache HBase
Apache Hive
Flat files (JSON, Parquet, CSV, others)

SLIDE 60

So what do we throw out?

One thing that we do “kick out” is MapReduce

Spark is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning (aha!)
How? Apache Spark replaces the MapReduce paradigm with an advanced DAG execution engine that supports cyclic data flow and in-memory computing

A smarter way to distribute jobs over machines!
Note the similarities with previous projects such as Dask…

SLIDE 61

Spark’s building blocks

SLIDE 62

Spark core

This is the heart of Spark, responsible for management functions such as task scheduling. Spark Core also implements the core abstraction to represent data elements: the Resilient Distributed Dataset (RDD)

The Resilient Distributed Dataset is the primary data abstraction in Apache Spark

Represents a collection of data elements

It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient
Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data
Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes

Once data is loaded into an RDD, two types of operations can be carried out:

Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more
Actions, such as counts, which measure but do not change the original data
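A minimal sketch of the difference, assuming a SparkContext sc is available:

rdd = sc.parallelize([1, 2, 3, 4, 5])

doubled = rdd.map(lambda x: x * 2)            # transformation: new RDD, nothing runs yet
evens = doubled.filter(lambda x: x % 4 == 0)  # another transformation, still lazy

evens.count()  # action: only now is the whole pipeline actually executed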

SLIDE 63

RDDs are distributed, fault-tolerant, efficient

SLIDE 64

RDDs are distributed, fault-tolerant, efficient

Note that an RDD represents a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel

Any sort of element collection: a collection of text lines, a collection of single words, a collection of objects, a collection of images, a collection of instances, …
The only feature provided is automatic distribution and task management over this collection

Through transformations and actions: do things with the RDD

The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged, and can be repeated in the event of data loss or the failure of a cluster node
Transformations are said to be lazily evaluated: they are not executed until a subsequent action has a need for the result
Where possible, these RDDs remain in memory, greatly increasing the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes

SLIDE 65

RDDs are distributed, fault-tolerant, efficient

SLIDE 66

Writing Spark-based programs

Similarly to MapReduce, an application using the RDD framework can be coded in Java or Scala and packaged as a JAR file to be launched on the cluster
However, Spark also provides an interactive shell interface (Spark shell) to its cluster environment
It also exposes APIs to work directly with the RDD concept in a variety of languages:

Scala, Java, Python, R, SQL

SLIDE 67

Spark shell (pyspark)

PySpark is the “driver program”: runs on the client and will set up a “SparkContext” (a connection to the Spark cluster)

>>> textFile = sc.textFile("README.md")  # sc is the SparkContext
# textFile is now an RDD (each element represents a line of text)

>>> textFile.count()  # Number of items in this RDD
126
>>> textFile.first()  # First item in this RDD
u'# Apache Spark'

# Chaining together a transformation and an action:
# How many lines contain "Spark"?
>>> textFile.filter(lambda line: "Spark" in line).count()
15

SLIDE 68

SparkContext

SparkContext sets up internal services and establishes a connection to a Spark execution environment

Data operations are not executed on your machine: the client sends them to be executed by the Spark cluster! No data is loaded in the client… unless you’d perform a .toPandas()
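A small sketch of this behavior (the file path and column name are hypothetical):

# Assuming a SparkSession `spark` connected to a cluster
df = spark.read.csv("hdfs:///data/participants.csv", header=True)  # lazy
subset = df.filter(df['city'] == 'Leuven')  # executed on the cluster

subset.count()             # action: runs remotely, only a number comes back
local = subset.toPandas()  # this DOES pull the full result into client memory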

SLIDE 69

Deploying an application

Alternative to the interactive mode:

from pyspark import SparkContext

# Set up the context ourselves
sc = SparkContext("local", "Simple App")

logData = sc.textFile("README.md")
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
sc.stop()

Execute using:

/bin/spark-submit MyExampleApp.py
Lines with a: 46, Lines with b: 23

SLIDE 70

More on the RDD API

So what can we do with RDDs?

Transformations

map(func), filter(func), flatMap(func), mapPartitions(func), sample(withReplacement, fraction, seed), union(otherRDD), intersection(otherRDD), distinct(), groupByKey(), reduceByKey(func), sortByKey(), join(otherRDD)

Actions

reduce(func), count(), first(), take(n), takeSample(withReplacement, n), saveAsTextFile(path), countByKey(), foreach(func)

SLIDE 71

Examples

https://github.com/wdm0006/DummyRDD

A test class that walks like an RDD and talks like an RDD, but is actually just a list
No real Spark behind it
Nice for testing and learning, however

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

# Make an RDD from a Python list: a collection of numbers
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.count())
print(rdd.map(lambda x: x**2).collect())

SLIDE 72

Examples: word count

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

# Make an RDD from a text file: a collection of lines
text_file = sc.textFile("kuleuven.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
print(counts)

SLIDE 73

Examples: filtering

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

rdd = sc.parallelize(list(range(1, 21)))
print(rdd.filter(lambda x: x % 3 == 0).collect())

SLIDE 74

SparkSQL, DataFrames and Datasets

SLIDE 75

These RDDs still “feel” a lot like MapReduce…

Indeed, many operations are familiar: map, reduce, reduceByKey, …
But remember: the actual execution is more optimized
However, from the perspective of the user, this is still very low-level

Nice if you want low-level control to perform transformations and actions on your dataset
Or when your data is unstructured, such as streams of text
Or when you actually want to manipulate your data with functional programming constructs
Or when you don’t care about imposing a schema, such as a columnar format

But what if you do want to work with tabular, structured data… like a data frame?

SLIDE 76

SparkSQL

Like Apache Spark in general, SparkSQL is all about distributed in-memory computations

SparkSQL builds on top of Spark Core with functionality to load and query structured data using queries that can be expressed using SQL, HiveQL, or through high-level APIs similar to e.g. pandas (called the “DataFrame” and “Dataset” APIs in Spark)
At the core of SparkSQL is the Catalyst query optimizer

Since Spark 2.0, Spark SQL is the primary and feature-rich interface to Spark’s underlying in-memory distributed platform (hiding Spark Core’s RDDs behind higher-level abstractions)

SLIDE 77

SparkSQL

# Note the difference: SparkSession instead of SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("Python Spark SQL example")\
    .getOrCreate()

# A Spark "DataFrame"
df = spark.read.json("people.json")

df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

df.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

SLIDE 78

SparkSQL

df.select("name").show() # +-------+ # | name| # +-------+ # |Michael| # | Andy| # | Justin| # +-------+ df.select(df['name'], df['age'] + 1).show() # +-------+---------+ # | name|(age + 1)| # +-------+---------+ # |Michael| null| # | Andy| 31| # | Justin| 20| # +-------+---------+

SLIDE 79

SparkSQL

df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+

df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# |  19|    1|
# |null|    1|
# |  30|    1|
# +----+-----+

SLIDE 80

SparkSQL

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

SLIDE 81

DataFrames

Like an RDD, a DataFrame is an immutable distributed collection of data elements

Extends the “free-form” elements by imposing that every element is organized as a set of values in named columns, e.g. (age=30, name=Seppe)
Imposes some additional structure on top of RDDs

Designed to make processing of large data sets easier

This allows for an easier and higher-level abstraction
Provides a domain specific language API to manipulate your distributed data (see the examples above)
Makes Spark accessible to a wider audience
Finally, much more in line with what data scientists are actually used to

SLIDE 82

DataFrames

pyspark.sql.SparkSession: Main entry point for DataFrame and SQL functionality
pyspark.sql.DataFrame: A distributed collection of data grouped into named columns
pyspark.sql.Row: A row of data in a DataFrame
pyspark.sql.Column: A column expression in a DataFrame
pyspark.sql.GroupedData: Aggregation methods, returned by DataFrame.groupBy()
pyspark.sql.DataFrameNaFunctions: Methods for handling missing data (null values)
pyspark.sql.DataFrameStatFunctions: Methods for statistics functionality
pyspark.sql.functions: List of built-in functions available for DataFrame
pyspark.sql.types: List of data types available
pyspark.sql.Window: For working with window functions

SLIDE 83

DataFrames

Class pyspark.sql.DataFrame : A distributed collection of data grouped into named columns:

agg(*exprs): Aggregate on the entire DataFrame without groups
columns: Returns all column names as a list
corr(col1, col2, method=None): Calculates the correlation of two columns
count(): Returns the number of rows in this DataFrame
cov(col1, col2): Calculate the sample covariance for the given columns
crossJoin(other): Returns the cartesian product with another DataFrame
crosstab(col1, col2): Computes a pair-wise frequency table of the given columns
describe(*cols): Computes statistics for numeric and string columns
distinct(): Returns a new DataFrame containing the distinct rows in this DataFrame
drop(*cols): Returns a new DataFrame that drops the specified column
dropDuplicates(subset=None): Return a new DataFrame with duplicate rows removed
dropna(how='any', thresh=None, subset=None): Returns a new DataFrame omitting rows with null values
fillna(value, subset=None): Replace null values

SLIDE 84

DataFrames

Class pyspark.sql.DataFrame : A distributed collection of data grouped into named columns:

filter(condition): Filters rows using the given condition; where() is an alias for filter()
first(): Returns the first row as a Row
foreach(f): Applies the f function to all rows of this DataFrame
groupBy(*cols): Groups the DataFrame using the specified columns
head(n=None): Returns the first n rows
intersect(other): Return an intersection with another DataFrame
join(other, on=None, how=None): Joins with another DataFrame, using the given join expression
orderBy(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s)
printSchema(): Prints out the schema in the tree format
randomSplit(weights, seed=None): Randomly splits this DataFrame with the provided weights
replace(to_replace, value, subset=None): Returns DataFrame replacing a value with another value
select(*cols): Projects a set of expressions and returns a new DataFrame
toPandas(): Returns the contents of this DataFrame as a Pandas data frame
union(other): Return a new DataFrame containing the union of rows in this frame and another frame
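A short sketch chaining several of the methods above, reusing the people DataFrame from the earlier examples:

# Drop rows with a null age, keep adults, sort, and project one column
df.dropna(subset=['age'])\
  .filter(df['age'] > 21)\
  .orderBy('age')\
  .select('name')\
  .show()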

SLIDE 85

DataFrames

Can be loaded in from:

Parquet files
Hive tables
JSON files
CSV files (as of Spark 2)
JDBC (to connect with a database)
AVRO files (using the “spark-avro” library, or built-in in Spark 2.4)
Normal RDDs (given that you specify or infer a “schema”)

Can also be converted back to a standard RDD
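A minimal sketch of the round trip between RDDs and DataFrames (assuming sc and spark are available):

from pyspark.sql import Row

# From an RDD of Row objects to a DataFrame (schema inferred from the fields)
rdd = sc.parallelize([Row(genre='drama', nrPages=200),
                      Row(genre='action', nrPages=20)])
df = spark.createDataFrame(rdd)

# ... and back to a plain RDD of Row objects
back = df.rdd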

SLIDE 86

SparkR

Implementation of the Spark DataFrame API for R

An R package that provides a light-weight frontend to use Apache Spark from R
Way of working very similar to dplyr
Can convert R data frames to SparkDataFrame objects

df <- as.DataFrame(faithful)

groupBy(df, df$waiting) %>%
  summarize(count = n(df$waiting)) %>%
  head(3)

##  waiting count
##1      70     4
##2      67     1
##3      69     2

SLIDE 87

Datasets

Spark Datasets is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface

Introduced in Spark 1.6
Like DataFrames, Datasets take advantage of Spark’s optimizer by exposing expressions and data fields to a query planner
Datasets extend these benefits with compile-time type safety: meaning production applications can be checked for errors before they are run

A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema

At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and tabular representation

Core idea: where a DataFrame represents a collection of Rows (with a number of named Columns), a Dataset represents a collection of typed objects (with their corresponding typed fields) which can be converted from and to table rows

SLIDE 88

Datasets

Since Spark 2.0, the DataFrame API has merged with the Dataset API, unifying data processing capabilities across libraries

Because of this unification, developers now have fewer concepts to learn or remember, and work with a single high-level and type-safe API called Dataset
However, DataFrame as a name is still used: a DataFrame is a Dataset[Row], so a collection of generic Row objects

SLIDE 89

Datasets

Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API

Dataset represents a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java
Consider DataFrame as an alias for Dataset[Row], where a Row represents a generic untyped JVM object
Since Python and R have no compile-time type-safety, there is only the untyped API, namely DataFrames

Language   Main Abstraction
Scala      Dataset[T] & DataFrame (= Dataset[Row])
Java       Dataset[T]
Python     DataFrame (= Dataset[Row])
R          DataFrame (= Dataset[Row])

SLIDE 90

Datasets

Benefits:

Static typing and runtime type-safety: both syntax and analysis errors can be caught during compilation of our program
High-level abstraction and custom view into structured and semi-structured data
Ease-of-use of APIs with structure
Performance and optimization

For us R and Python users, we can continue using DataFrames knowing that they are built on Dataset[Row]

Most common use case anyways

(A more detailed example will be posted in the background information for those interested)

SLIDE 91

MLlib

SLIDE 92

MLlib

MLlib is Spark’s machine learning (ML) library

Its goal is to make practical machine learning scalable and easy Think of it as a “scikit-learn”-on-Spark

Provides:

ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
Persistence: saving and loading algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.
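As a sketch of how these pieces fit together in one Pipeline (assuming a DataFrame `training` with 'text' and 'label' columns):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Chain featurization and learning into a single, tunable unit
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)           # fits all stages in order
predictions = model.transform(training)  # adds a 'prediction' column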

SLIDE 93

MLlib

As of Spark 2.0, the primary Machine Learning API for Spark is the DataFrame-based API in the spark.ml package

Before: spark.mllib was RDD-based, not a very helpful way of working
MLlib still supported the RDD-based API in spark.mllib

Since Spark 2.3, MLlib’s DataFrame-based API has reached feature parity with the RDD-based API

After reaching feature parity, the RDD-based API will be deprecated
The RDD-based API is expected to be removed in Spark 3.0
Why: DataFrames provide a more user-friendly API than RDDs

SLIDE 94

MLlib

Classification

Logistic regression, Decision tree classifier, Random forest classifier, Gradient-boosted tree classifier, Multilayer perceptron classifier, One-vs-Rest classifier (One-vs-All), Naive Bayes

Regression

Linear regression, Generalized linear regression, Decision tree regression, Random forest regression, Gradient-boosted tree regression, Survival regression, Isotonic regression

Clustering

K-means, Latent Dirichlet allocation (LDA), Bisecting k-means, Gaussian Mixture Model (GMM)

Recommender systems

Collaborative filtering

Validation routines

SLIDE 95

MLlib example

from pyspark.ml.classification import LogisticRegression

training = spark.read.format("libsvm").load("data.txt")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
print("Coefs: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

from pyspark.ml.clustering import KMeans

dataset = spark.read.format("libsvm").load("data/data.txt")
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
predictions = model.transform(dataset)

centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

SLIDE 96

Conclusions so Far

SLIDE 97

Spark versus…

Spark: a high-performance in-memory data-processing framework

Has been widely adopted and still one of the main computing platforms today

Versus:

MapReduce (a mature batch-processing platform for the petabyte scale): Spark is faster, better suited to an online, analytics setting, and implements data frame and ML concepts and algorithms

Apache Tez: “aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data”

Hortonworks: Spark is a general purpose engine with APIs for mainstream developers, while Tez is a framework for purpose-built tools such as Hive and Pig
Cloudera was rooting for Spark, Hortonworks for Tez (a few years ago…)
Today: Tez is out! (Hortonworks had to also adopt Spark, and merged with Cloudera)

Apache Mahout: “the goal is to build an environment for quickly creating scalable performant machine learning applications”

A simple and extensible programming environment and framework for building scalable algorithms
A wide variety of premade algorithms for Apache Spark, H2O, Apache Flink
Before: also a “MapReduce all things” approach
Kind of an extension to Spark
Though most of the algorithms are also in MLlib… so not that widely used any more!

SLIDE 98

Spark versus…

One contender that still is very much in the market is H2O (http://www.h2o.ai/)

Core is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines
The algorithms are implemented on top of H2O’s distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading
The data is read in parallel and is distributed across the cluster and stored in memory in a columnar format in a compressed way
H2O’s REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP

They also had the idea of coming up with a better “MapReduce” engine

Based on a distributed key-value store
In-memory map/reduce
Can work on top of Hadoop (YARN) or standalone
Though not as efficient as Spark’s engine

“H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform”

SLIDE 99

H2O

SLIDE 100

H2O

However, H2O was quick to realize the benefits of Spark, and the role they could play: “customers want to use Spark SQL to make a query, feed the results into H2O Deep Learning to build a model, make their predictions, and then use the results again in Spark”

“Sparkling Water”

SLIDE 101

H2O

Web based UI: Flow
Strong support for algorithms
Productionization in mind!

E.g. documentation describes “What happens when you try to predict on a categorical level not seen during training?” and “How does the algorithm handle missing values during testing?”

“A better MLlib”

We see a lot of companies embracing H2O on Spark as the “next extension”

SLIDE 102

H2O

library(h2o)
h2o.init(nthreads=-1, max_mem_size = "2G")
h2o.removeAll()

df <- h2o.importFile(path = normalizePath("./covtype.full.csv"))
splits <- h2o.splitFrame(df, c(0.6,0.2))
train <- h2o.assign(splits[[1]], "train.hex")
valid <- h2o.assign(splits[[2]], "valid.hex")
test <- h2o.assign(splits[[3]], "test.hex")

rf1 <- h2o.randomForest(
  training_frame = train,
  validation_frame = valid,
  x=1:12, y=13,
  model_id = "rf_covType_v1",
  ntrees = 200,
  stopping_rounds = 2,
  score_each_iteration = T)

summary(rf1)
rf1@model$validation_metrics
h2o.hit_ratio_table(rf1, valid = T)[1,2]
h2o.shutdown(prompt=FALSE)

SLIDE 103

H2O

http://datascience.la/benchmarking-random-forest-implementations/

Though faster implementations continue to arrive… see, e.g.: https://github.com/aksnzhy/xforest

SLIDE 104

Further reading

Spark docs:

https://spark.apache.org/docs/latest/index.html

SLIDE 105

Summary so far

Hadoop letdown: MapReduce not that user-friendly

Not convenient for analytics purposes
More for high-volume batch operations, ETL
What about SQL? What about analytics?

Spark re-uses components of Hadoop

HDFS (or HBase, Hive, flat files, Cassandra, …)
YARN (or Mesos, or stand-alone)

SLIDE 106

Summary so far

DAG approach adopted by many other projects

E.g. Dask, https://dask.pydata.org/en/latest/

Spark has a Directed Acyclic Graph (DAG) execution engine

Supporting cyclic data flow and in-memory computing

SLIDE 107

Summary so far

DAG engine powers core concept of Resilient Distributed Datasets (RDDs)

Fault-tolerant and efficient
Two types of operations: transformations and actions
Represents an unstructured collection of data (e.g. lines of text, images, vectors, …)

Programs are written interactively (Spark shell)

Or packaged as an application (like with a MapReduce app)
Bindings to Scala, Java, Python, R, and support for SQL

RDDs are still a bit hard to learn and use

Low-level control to perform transformations and actions on your dataset
Data not necessarily structured, columnar
SparkSQL is the engine to work with (semi-)structured data
Exposes DataFrame and Dataset APIs
Uses RDDs underneath, but more suited towards structured analyses

SLIDE 108

Summary so far

MLlib is Spark’s machine learning (ML) library

Before: spark.mllib was RDD-based, not a very helpful way of working
MLlib will still support the RDD-based API in spark.mllib
MLlib will not add new features to the RDD-based API
Newer features: spark.ml
Classification, regression, clustering, recommender systems

SLIDE 109

Streaming Analytics

SLIDE 110

Streaming analytics?

Not only serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making
Next to Spark, other exciting frameworks have emerged and gained momentum

Kafka, Samza, Flink, Ignite, Kudu, Splunk

Lots of attention today towards streaming / realtime analytics

Although many projects are still in an early stage
“Streaming analytics” might not be that hard

SLIDE 111

Streaming analytics?

With regards to the why, examples include:

Advertising: from mass branding to 1-1 targeting
Fintech: general advice to personalized robo-advisors
Healthcare: mass treatment to designer medicine
Retail: static branding to real-time personalization
Manufacturing: break-then-fix to repair-before-break

Note: many of these have more to do with “personalization”

Beware of slogans such as “self-learning models”

SLIDE 112

Streaming analytics

We need to differentiate between different “aspects” of “streaming”

Streaming data: my data source emits events, or a stream of instances, or a stream of …

How can I access historical events? How do I store this? How do I convert this to a data set?
How do I send such streams across my data center, clients, applications? (Plumbing?)

Streaming training: my algorithm needs to be trained on a continuous stream

“On-line” algorithms, a very hard problem for many techniques
Requires ad-hoc changes and modification of each algorithm

Streaming prediction: my model needs to predict on a stream of instances, events, …

Matter of deployment, operationalizing the model

SLIDE 113

Streaming analytics

We need to differentiate between different “aspects” of “streaming”

Streaming data: my data source emits events, or a stream of instances, or a stream of …

If we can do this, do we even need streaming training? Is this really a hard setting?

Streaming training: my algorithm needs to be trained on a continuous stream

See above

Streaming prediction: my model needs to predict on a stream of instances, events, …

Is this really a large issue in your setting?

SLIDE 114

Think about your use case

Netflix: 9 million events per second at peak
LinkedIn: 500 billion events per day, ~24 GB per second during peak hours
Bombardier showcased its C Series jetliner that carries Pratt & Whitney’s Geared Turbo Fan (GTF) engine, which is fitted with 5,000 sensors that generate up to 10 GB of data per second. A single twin-engine aircraft with an average 12-hr. flight time can produce up to 844 TB of data
WhatsApp, Uber, …?

How do you compare?

SLIDE 115

Streaming data, streaming engine

Data can be bounded (finite) or unbounded
The execution engine can be streaming or batch
Combinations of both are possible!

You can pretend a finite data set comes in as a stream
And you can handle an infinite data set in batches

For finite data-sets, accurate processing is relatively simple

On failure, schedule a new reprocessing
Use checkpoints to make it more efficient
Effectively what Spark does (the “resilient” in RDD)

For infinite data sets, this is harder

Need to take time into account: time of event creation, ingestion, processing

SLIDE 116

We also need to rethink aggregations

SLIDE 117

We also need to rethink aggregations

SLIDE 118

The reality

What about “on-line” algorithms?

Do we need them? In many cases: perhaps not
Also hard to find “on-line” implementations
Streaming linear regression, online k-means clustering, incremental matrix factorization

E.g. Netflix: 9 million events per second

Real-time (re)training of recommender matrix, but main training still offline

In most cases:

You might need to deal with streaming data: how to store it, how to access history?
We want to be able to perform online predictions on this data
Training can be done offline
The model can be deployed in a streaming setup
Depending on your needs: re-train every month, week, day, … but this depends more on how fast the feature space changes, not really on how fast the data comes in

SLIDE 119

Streaming overview

https://www.slideshare.net/sbaltagi/apache-flink-realworld-use-cases-for-streaming-analytics

SLIDE 120

Event collectors

(We’re on a quest for something to help with analytics in a streaming setting)

Event collectors gather, collect, and centralize events. Examples of event collectors include:

Apache Flume: one of the oldest Apache projects designed to collect, aggregate, and move large data sets such as web server logs to a centralized location

Main use case: streaming logs from multiple sources capable of running a JVM
Not that helpful for analytics

Apache NiFi: a relatively new project. It is based on Enterprise Integration Patterns (EIP) where the data flows through multiple stages and transformations before reaching the destination

Apache NiFi comes with a highly intuitive graphical interface that makes it easy to design data flows and transformations
Main use case: ETL, EIP, data transformations
Not that helpful for analytics

SLIDE 121

Event brokers

Message/event/data brokers handle message validation, transformation and routing

Message oriented middleware (“MOM”)
It mediates communication amongst applications, minimizing the mutual awareness that applications should have of each other in order to be able to exchange messages, effectively implementing decoupling
Message routing (one or more destinations), message transformation, simple message aggregation
In a way which is resilient, fail-safe, and scalable
A lot of the streaming data “plumbing” is thus handled by message/event/data brokers

Examples include:

Apache ActiveMQ, Apache Kafka, Celery, RabbitMQ, Redis, ZeroMQ
Especially Kafka is a popular choice: it can be easily integrated with Spark
But in itself: not really much analytics
E.g. in between Uber’s mobile app and data lakes

SLIDE 122

Event processors

Here we find the actual intelligence

Spark Streaming and Spark Structured Streaming as two very popular options

Spark Streaming enables developers to build streaming applications through Sparks’ high-level API

Since it runs on Spark, Spark Streaming lets developers reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state
Spark Streaming operates in micro-batching mode, where the batch size is much smaller than in conventional batch processing
Can be put on top of Kafka acting as the message broker: a common approach these days

SLIDE 123

Spark Streaming

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data

DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams
Internally, a DStream is represented as a sequence of RDDs
This helps with regards to ease of use: many of the same actions and operations can be applied

SLIDE 124

Spark Streaming

SLIDE 125

Spark Streaming

pyspark.streaming.StreamingContext is the main entry point for all streaming functionality

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to localhost:9999 as a source
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print out the first ten elements of each RDD generated in this DStream
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

SLIDE 126

Spark Streaming

A Discretized Stream or DStream is the basic abstraction provided by Spark Streaming

It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream
Internally, a DStream is represented by a continuous series of RDDs
Each RDD in a DStream contains data from a certain interval
Similar to RDDs, transformations allow the data from the input DStream to be modified
DStreams support many of the transformations available on normal Spark RDDs
The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream
Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data

126

slide-127
SLIDE 127

Spark Streaming

Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream

window length: the duration of the window (30 seconds in the example below); sliding interval: the interval at which the window operation is performed (10 seconds below)

windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, 30, 10)
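A more efficient variant of the same computation, as a hedged sketch: the inverse-reduce form incrementally subtracts the counts that leave the window, and requires checkpointing to be enabled ("checkpoint" below is an assumed directory name):

ssc.checkpoint("checkpoint")  # assumed directory for state snapshots
windowedWordCounts = pairs.reduceByKeyAndWindow(
    lambda x, y: x + y,  # add counts entering the window
    lambda x, y: x - y,  # subtract counts leaving the window
    30, 10)              # window length 30s, sliding interval 10s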

127

slide-128
SLIDE 128

Spark Streaming

You can perform different kinds of joins on streams in Spark Streaming

# Join two windowed streams
windowedStream1 = stream1.window(20)
windowedStream2 = stream2.window(60)
joinedStream = windowedStream1.join(windowedStream2)

# Join a windowed stream with a static dataset
dataset = ...  # some RDD
windowedStream = stream.window(20)
joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))

128

slide-129
SLIDE 129

Spark Streaming

Output operations allow a DStream’s data to be pushed out to external systems like a database or a file system

This triggers the actual execution of all the DStream transformations

print() prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application; useful for development and debugging (this is called pprint() in the Python API)
saveAsTextFiles(prefix, [suffix]) saves this DStream’s contents as text files; the file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”
saveAsHadoopFiles(prefix, [suffix]) saves this DStream’s contents as Hadoop files; the file name at each batch interval is generated based on prefix and suffix: “prefix-TIME_IN_MS[.suffix]”
foreachRDD(func) is the most generic output operator: it applies a function, func, to each RDD generated from the stream (see the sketch below)
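A sketch of the common foreachRDD pattern, where create_connection() is a hypothetical helper standing in for e.g. a database client, and wordCounts is the DStream from the earlier example:

def send_partition(records):
    # Open one connection per partition rather than per record (much cheaper)
    connection = create_connection()  # hypothetical helper, e.g. a database client
    for record in records:
        connection.send(record)
    connection.close()

wordCounts.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))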

129

slide-130
SLIDE 130

Spark Streaming

Note that Spark Streaming utilizes RDDs heavily

Not very pleasant

But: you can quite easily use DataFrames and SQL operations on streaming data

Just convert the RDD to a DataFrame. You have to create a SparkSession using the SparkContext that the StreamingContext is using. Furthermore, this has to be done in such a way that it can be restarted on driver failures, by creating a lazily instantiated singleton instance of SparkSession, as in the sketch below
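A sketch along the lines of the official programming guide, assuming the words DStream from the earlier word-count example:

from pyspark.sql import Row, SparkSession

def getSparkSessionInstance(sparkConf):
    # Lazily instantiated singleton, so it can be recreated after a driver failure
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = SparkSession.builder \
            .config(conf=sparkConf).getOrCreate()
    return globals()["sparkSessionSingletonInstance"]

def process(time, rdd):
    spark = getSparkSessionInstance(rdd.context.getConf())
    # Convert the RDD of words to a DataFrame and run a SQL query on it
    wordsDataFrame = spark.createDataFrame(rdd.map(lambda w: Row(word=w)))
    wordsDataFrame.createOrReplaceTempView("words")
    spark.sql("select word, count(*) as total from words group by word").show()

words.foreachRDD(process)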

130

slide-131
SLIDE 131

Spark Streaming and MLlib

You can also easily use machine learning algorithms provided by MLlib

First of all, there are streaming (on-line) machine learning algorithms which can simultaneously learn from the streaming data as well as apply the model on the streaming data (see the sketch below). Beyond these, for a much larger class of machine learning algorithms, you can learn a model offline (i.e. using historical data) and then apply it online on streaming data (the more common case)
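A hedged sketch of the on-line case with MLlib’s StreamingKMeans, assuming trainingStream and testStream are DStreams of feature vectors:

from pyspark.mllib.clustering import StreamingKMeans

# Online clustering: the model keeps updating with every micro-batch
model = StreamingKMeans(k=2, decayFactor=1.0) \
    .setRandomCenters(3, 1.0, 0)  # dimension, weight, seed
model.trainOn(trainingStream)              # keep learning from new data
predictions = model.predictOn(testStream)  # score incoming data with the current model
predictions.pprint()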

131

slide-132
SLIDE 132

Spark Structured Streaming

Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine

The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. This leads to a stream processing model that is very similar to a batch processing model: you express your streaming computation as a standard batch-like query, as if on a static table, and Spark runs it as an incremental query on the unbounded input table. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc.; the computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. Structured Streaming thus provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming. Structured Streaming was alpha in Spark 2.1 and became stable in 2.2

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

132

slide-133
SLIDE 133

Spark Structured Streaming

133

slide-134
SLIDE 134

Spark Structured Streaming

134

slide-135
SLIDE 135

Spark Structured Streaming

Note that Structured Streaming does not materialize the entire table

It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data. It only keeps around the minimal intermediate state required to update the result

135

slide-136
SLIDE 136

Spark Structured Streaming

Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations

In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In the case of window-based aggregations, aggregate values are maintained for each window that the event-time of a row falls into, as in the sketch below
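For instance, a sketch following the Structured Streaming programming guide, assuming a streaming DataFrame words with columns timestamp and word:

from pyspark.sql.functions import window

# Count words per 10-minute window, with the window sliding every 5 minutes
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()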

136

slide-137
SLIDE 137

Spark Structured Streaming

Structured Streaming can maintain the intermediate state for partial aggregates for a long period of time, so that late data can update the aggregates of old windows correctly
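The mechanism for bounding this state is a watermark; a hedged sketch, extending the windowed count from the previous slide:

# A watermark declares how late data may arrive (here 10 minutes), letting Spark
# eventually drop state for old windows instead of keeping it forever
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word) \
    .count()

137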

slide-138
SLIDE 138

Spark Structured Streaming

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("StructuredNetworkWordCount") \
    .getOrCreate()

# This will be a streaming data frame
lines = spark.readStream.format("socket").option("host", "localhost") \
    .option("port", 9999).load()

# Split the lines into words and generate a word count
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console;
# note that start() returns the StreamingQuery handle we then wait on
query = wordCounts.writeStream.outputMode("complete") \
    .format("console").start()
query.awaitTermination()

138

slide-139
SLIDE 139

Other event processors

Apache Storm was originally developed by Nathan Marz at BackType, a company that was acquired by Twitter. After the acquisition, Twitter open-sourced Storm before donating it to Apache

Used by Flipboard, Yahoo!, and Twitter, it was a long-time standard for developing distributed, real-time data processing platforms. Storm is often referred to as “the Hadoop for real-time processing”; it is primarily designed for scalability and fault-tolerance. For new projects it is not that much in use any more (mainly replaced by Kafka + Spark setups)

Apache Samza: tightly coupled to YARN. Used at LinkedIn and some other places, but not much beyond that
Apache Apex: also tightly coupled to YARN. Not much used any more…
Apache Flink: a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams

Gaining traction for use cases where mini-batching is not viable, e.g. used by Lyft, Huawei, Tencent; see https://flink.apache.org/poweredby.html

Apache Ignite: a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time

For very heavy duty applications

Apache Beam: based on a unified model for defining and executing data-parallel processing pipelines

139

slide-140
SLIDE 140

Event indexers, search, visualization

Tools like Splunk, Kibana

Good commercial offerings are available. These are more concerned with dashboarding on real-time data, not so much with learning

140

slide-141
SLIDE 141

Conclusions

141

slide-142
SLIDE 142

The Apache family

Most of the “big data tooling” consists of Apache Foundation projects

Open source, Java, “enterprisey”. Many of these depend on others… who can make sense of it all? Some are relatively new

Future to be seen in terms of adoption and maturity

Hadoop → Spark → Spark + H2O → Spark + Kafka + H2O → Flink → …

142

slide-143
SLIDE 143

Key analytics patterns

Key patterns and insights:

  • 1. The hype of big data is over (luckily)

Distributed storage and querying (the data plumbing): take the best of breed; most likely a cloud-based solution (Snowflake, Redshift, BigQuery)!

  • 2. For analytics, multiple patterns emerge

Development environment: heavily notebook-driven in all cases
Pure Python / notebook / virtual environment driven, either hosted or not (Google, Amazon)
Spark (+ H2O) (+ Kafka)… other “big” Apache projects are losing steam
Or: Kubeflow and other DAG-based approaches (e.g. Dask, Ray, Airflow) + containerization technologies (e.g. Docker) for scalability and reproducibility
In some cases: pure TensorFlow-based (deep learning only)
Choice between hosted and do-it-yourself for all of the above

Important: perform a thorough assessment before committing

No solutions before problems!

143

slide-144
SLIDE 144

Focus on what matters

https://landscape.cncf.io/category=streaming-messaging&format=card-mode&grouping=category

All of these distributed streaming architectures are massively overused. They certainly have their place, but you probably don’t need them. I see it all the time with ML work. You wind up using a cluster to overcome the memory inefficiency of Spark, when you could have just used a single machine. This has been my experience, too. I worked at just one place that had a really good handle on high-volume, high-velocity streaming data, and they didn’t use Flink or Storm or Kafka or anything like that. They mostly just used the KISS principle and a protobuf-style wire format. There is definitely a point where these sorts of scale-out-centric solutions are unavoidable. Short of that point, though, they’re probably best avoided.

– https://news.ycombinator.com/item?id=19016997

144

slide-145
SLIDE 145

Focus on what matters

https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

145

slide-146
SLIDE 146

Platforms: have someone solve it for you?

Either assemble a combination of Apache projects yourself

Hadoop on-premise, Spark in the cloud

Or provide a one-click Jupyter environment

Installing packages; scheduling and monitoring models

Options for “infrastructure in the cloud”: Amazon, Google, …

The challenge remains deployment, management, governance, and starting from the right question

Recall discussion in “Evaluation” and “Data Science Tools”!

146

slide-147
SLIDE 147

Focus on what matters

Just 5% of data scientists surveyed by O’Reilly use any SAS software, and 0% use any IBM analytics software. In the slightly broader KDnuggets poll, 6% use SAS Enterprise Miner, and 8% say they use IBM SPSS Modeler. Gartner’s obsession with ‘Citizen Data Scientists’ leads it to criticize Domino and H2O because they are ‘hard to use’. Imagine that! If you want to use a data science platform, you need to know how to do data science… Gartner is clueless about open source software.

– https://thomaswdinsmore.com/2017/02/28/gartner-looks-at-data-science-platforms/

147

slide-148
SLIDE 148

Focus on what matters

148

slide-149
SLIDE 149

So what should you do?

Steal from the best…

https://eng.uber.com/scaling-michelangelo/

149

slide-150
SLIDE 150

So what should you do?

Steal from the best…

https://github.com/uber/manifold

150

slide-151
SLIDE 151

So what should you do?

Steal from the best…

https://eng.uber.com/uber-big-data-platform/

151

slide-152
SLIDE 152

So what should you do?

Steal from the best…

https://medium.com/netflix-techblog/notebook-innovation-591ee3221233

152

slide-153
SLIDE 153

So what should you do?

Steal from the best…

https://labs.spotify.com/2020/02/27/how-we-improved-data-discovery-for-data-scientists-at-spotify/

153

slide-154
SLIDE 154

So what should you do?

Steal from the best…

https://airflow.apache.org/ https://github.com/spotify/luigi

154

slide-155
SLIDE 155

So what should you do?

Steal from the best…

https://blog.keen.io/architecture-of-giants-data-stacks-at-facebook-netflix-airbnb-and-pinterest/

155

slide-156
SLIDE 156

So what should you do?

Good artists copy, great artists steal. – Picasso – Steve Jobs


156

slide-157
SLIDE 157

Most importantly

Start with the analytics: big data stacks bring the plumbing, but that doesn’t mean you can make soup

Not every organization is engineering-driven. You probably do not have the capacity, or even the need, to keep up to date. Be honest: do you have big data? What do you mean by “real time” and “streaming”? Make an informed decision: the algorithms are the same everywhere anyway. One use case with small data can lead to much more value than a data lake filled with stuff; data is not gold. People and process matter: the right governance, thinking about deployment, maintenance, monitoring, and most of all the business question

157

slide-158
SLIDE 158

158