SLIDE 1

Applied Spark

From Concepts to Bitcoin Analytics

Andrew F. Hart – ahart@apache.org | @andrewfhart

SLIDE 2

My Day Job

  • CTO, Pogoseat
  • Upgrade technology for live events

3/28/16 QCON-SP Andrew Hart 2

SLIDE 3

Additionally…

  • Member, Apache Software Foundation

SLIDE 4

Additionally…

  • Founder, Data Fluency
– Software consultancy specializing in appropriate data solutions for startups and SMBs
– Helps clients make good decisions and leverage power traditionally accessible only to big business

SLIDE 5

Previously…

  • NASA Jet Propulsion Laboratory
– Building data management pipelines for research missions in many domains (Climate, Cancer, Mars, Radio Astronomy, etc.)

SLIDE 6

Apache Spark

SLIDE 7

Spark is…

  • General-purpose cluster computing software that maximizes use of cluster memory to process data
  • Used by hundreds of organizations to realize performance gains over previous-generation cluster compute platforms, particularly for map-reduce-style problems

SLIDE 8

Spark…

  • Was developed at the Algorithms, Machines, and People laboratory (AMP Lab) at the University of California, Berkeley in 2009
  • Is open source software presently under the governance of the Apache Software Foundation

SLIDE 9

Why Spark Exists

  • Confluence of three trends:
– Increased volume of digital data
– Decreasing cost of computer memory (RAM)
– Data processing technology liberation

SLIDE 10

Digital data volume

  • Early days
– low-resolution sensors, comparatively few people
  • On the internet
– proprietary data, custom solutions, expensive, custom hardware

SLIDE 11

Digital data volume

  • Modern era
– ubiquitous, high-resolution cameras
– mobile devices packed with sensors
– IoT (Internet of Things)
– open source software, cheap commodity hardware

SLIDE 12

We've gone from this Internet…

SLIDE 13


To this… (Humanity's global communication platform)

SLIDE 14


  • 1.5 billion connected PCs
  • 3.2 billion connected people
  • 6 billion connected mobile devices

SLIDE 15


What are we doing with all of this? Every minute:

  • 18,000 votes cast on Reddit
  • 51,000 apps downloaded by Apple users
  • 350,000 tweets posted on Twitter
  • 4,100,000 likes recorded on Facebook
SLIDE 16


We are awash in data…

  • Monetizing this data is a core competency for many businesses
  • Need tools to do this effectively at today's scale
SLIDE 17
2. Tool support and technology liberation


SLIDE 18


How far do you want to go back?

We have always used tools to help us cope with data

SLIDE 19


VisiCalc

  • Early "big data" tool
– Allowed business to move from the chalkboard to the digital spreadsheet
– Phenomenal increase in productivity running the numbers for a business

SLIDE 20


Modern-era Spreadsheet Tech

  • Microsoft Excel: 1,048,576 rows × 16,384 columns

SLIDE 21


Open Source Alternatives Exist…

  • Microsoft Excel: 1,048,576 rows × 16,384 columns
  • Apache OpenOffice: 1,048,576 rows × 1,024 columns

SLIDE 22


Relational Database Systems

  • Support thousands of tables, millions of rows…

SLIDE 23


Relational Database Systems

  • Support thousands of tables, millions of rows…
  • Viable open source alternatives exist for many use cases

SLIDE 24


Modern Big Data Era

  • MapReduce algorithm (2004)
– parallelize large-scale computation across clusters of servers
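The map/shuffle/reduce pattern can be sketched in plain Python. This is a toy, single-machine illustration of the dataflow, not the Hadoop implementation; in a real cluster, the map and reduce steps run in parallel across many servers:

```python
from collections import defaultdict
from functools import reduce

docs = ["spark makes clusters fast", "clusters of servers run spark"]

# Map phase: emit (word, 1) pairs from each input record
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group the pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in groups.items()}

print(counts["spark"], counts["clusters"])  # 2 2
```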

SLIDE 25


Modern Big Data Era

  • Hadoop
– Open source processing framework enabling MapReduce applications to run on large clusters of commodity (unreliable) hardware

SLIDE 26
3. Commoditization of computer memory


SLIDE 27

Early Days…

  • Main memory was hand-made.
  • You could see each bit.

SLIDE 28

Modern Era…

  • AWS EC2 r3.8xlarge: 244 GiB of RAM for US$1.41/hr*

SLIDE 29

Why talk about memory?

  • The ability to use memory efficiently distinguishes Spark from Hadoop and contributes to its speed advantage in many scenarios

SLIDE 30

How Spark Works

SLIDE 31

Primary abstraction in Spark: Resilient Distributed Datasets (RDD)

  • Immutable (read-only), partitioned dataset
  • Processed in parallel on each cluster node
  • Fault-tolerant – resilient to node failure

SLIDE 32

Primary abstraction in Spark: Resilient Distributed Datasets (RDD)

  • Uses the distributed memory of the cluster to store the state of a computation as a sharable object across jobs
(…instead of serializing to disk)

SLIDE 33

Traditional MapReduce:


Data on Disk → [Map-1, Map-2, …, Map-n] → Tuples on Disk → [Reduce-1, Reduce-2, …, Reduce-n] → Tuples on Disk

(HDFS read and HDFS write at every stage boundary)

SLIDE 34

Spark RDD Architecture:


Data on Disk → [Map-1, Map-2, …, Map-n] → Cluster Memory (RDD) → [Reduce-1, Reduce-2, …, Reduce-n] → Data on Disk

(one HDFS read at the start, one HDFS write at the end)

SLIDE 35

Unified computational model:

  • Spark unifies batch & streaming models, which traditionally require different architectures
  • Sort of like taking a limit in calculus
– If you imagine a time-series of RDDs with ever-smaller windows of time, you can approximate streaming workflows

SLIDE 36

Two ways to create an RDD in Spark Programs:

  • "Parallelize" an existing collection (e.g.: a Python array) in the driver program
  • Reference a dataset on external storage
– text files on disk
– anything with a supported Hadoop InputFormat

SLIDE 37

RDDs from RDDs:

  • RDDs are immutable (read-only)
  • The result of applying transformations and actions to an RDD is a new RDD
  • RDDs can be persisted to memory
– reducing the need to re-compute the RDD every time

SLIDE 38

Writing Spark Programs: Think of Spark programs as a way to describe a sequence of transformations and actions that should be applied to an RDD

SLIDE 39

Writing Spark Programs:

  • Transformations create a new dataset from an existing one (e.g.: map)
  • Actions return a value after running a computation (e.g.: reduce)

SLIDE 40

Writing Spark Programs:

  • Spark provides a rich set of transformations:

map, flatMap, filter, sample, union, intersection, distinct, groupByKey, sortByKey, cogroup, pipe, join, cartesian, coalesce

SLIDE 41

Writing Spark Programs:

  • Spark provides a rich set of actions:

reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach

SLIDE 42

Writing Spark Programs:

  • Transformations are lazily evaluated: they are only computed when a subsequent action (which must return a result) is run.
  • Transformations get recomputed each time an action is run (unless you explicitly persist the resulting RDD to memory)

SLIDE 43

Structure of Spark Programs: Spark programs have two principal components:

  • Driver Program
  • Worker function

SLIDE 44

Structure of Spark Programs: Driver program

  • Executes on the master node
  • Establishes context

SLIDE 45

Structure of Spark Programs: Worker (processing) function

  • Executes on each worker node
  • Computes transformations and actions on RDD partitions

SLIDE 46

Structure of Spark Programs: SparkContext

  • Holds all of the information about the cluster
  • Manages what gets shipped to nodes
  • Makes life extremely easy for developers

SLIDE 47

Structure of Spark Programs: Shared Variables

  • Broadcast variables: efficiently share static data with the cluster nodes
  • Accumulators: write-only variables that serve as counters

SLIDE 49

Interacting with Spark

SLIDE 50

Spark APIs:

Spark provides APIs for several languages:

  • Scala
  • Java
  • Python
  • R
  • SQL

SLIDE 51

Using Spark: There are two main ways to leverage Spark

  • Interactively, through the included command-line interface
  • Programmatically, via standalone programs submitted as jobs to the master node

SLIDE 52

Using Spark: Spark is GREAT for experimenting

  • Run experimental programs on small sample datasets on a single machine
  • To scale up, simply re-target the SparkContext to point to the master node of a Spark cluster

SLIDE 53

Bitcoin

SLIDE 54

Bitcoin is…

  • A decentralized digital currency

– value is exchanged directly between individuals with no need for a traditional central authority (e.g.: bank)

  • A global payment network

– transactions are broadcast and verified by network peers, with each using a complete copy of the network transaction history (the "blockchain")

SLIDE 56

Bitcoin Protocol

  • Open source software (http://github.com/bitcoin/bitcoin) for processing peer-to-peer financial transactions with zero trust
  • The backbone of an experimental financial system secured by math (instead of a trusted authority)

SLIDE 57

Bitcoin: Open Data

  • Unprecedented transparency into the workings of a financial network
– Peer-to-peer payments made using Bitcoin
– Trading markets speculating on Bitcoin value
  • Global, 24x7, free
  • Interesting research datasets abound

SLIDE 58

Bitcoin Exchange Companies

SLIDE 59
(Bitcoin exchange logos)

SLIDE 62

Spark Demos

SLIDE 63

The Hardware:


Master node plus worker nodes, each with 6 CPU cores and 8 GB RAM

SLIDE 64

The Goal:

  • Demonstrate experimenting with the Spark CLI
  • Demonstrate program execution on a standalone Spark cluster

SLIDE 65

The Data:

  • 1 month of log files containing transactions from 10 global exchanges (in 3 currencies)
  • 1 trade per line, JSON encoded
  • Organized into files by exchange, currency, and date

SLIDE 66

The Question:

  • What is the cumulative value of all "buy"-type transactions in the dataset?
  • Bonus: bucketed by fiat currency

SLIDE 68

Wrap-up

SLIDE 69

Things to keep in mind using Spark:

  • Speed comes from avoiding writes to disk
– Allocate enough memory in your cluster to hold your data completely in memory
  • Once data fits in memory, adding more is not going to boost performance

SLIDE 70

Things to keep in mind using Spark:

  • Once data fits in memory, most apps are CPU- or network-bound
– Allocate more cores in your cluster to increase parallelism (and tune Spark to use them!)
– Keep disks around to handle spillover; configure to reduce unnecessary writes

SLIDE 71

Spark ecosystem:


image credit: https://databricks.com/spark/about

SLIDE 72

Spark ecosystem (BDAS Stack):


image credit: https://amplab.cs.berkeley.edu/software/

SLIDE 73

Spark at the ASF:


https://spark.apache.org/

  • Outstanding documentation
  • Community mailing lists
  • Updated project status
  • Entered Incubator in 2013
  • Graduated to "Top Level" project 2014
  • 44 committers, 800+ contributors
SLIDE 74

Thank you!

Contact:
  • Twitter: @andrewfhart
  • Email: ahart@apache.org
  • Web: http://andrewfhart.com/
