Applied Spark: From Concepts to Bitcoin Analytics
Andrew F. Hart (QCON-SP, 3/28/16)


  1. Applied Spark: From Concepts to Bitcoin Analytics. Andrew F. Hart – ahart@apache.org | @andrewfhart

  2. My Day Job • CTO, Pogoseat • Upgrade technology for live events

  3. Additionally… • Member, Apache Software Foundation

  4. Additionally… • Founder, Data Fluency – A software consultancy specializing in appropriate data solutions for startups and SMBs – Helps clients make good decisions and leverage capabilities traditionally accessible only to big business

  5. Previously… • NASA Jet Propulsion Laboratory – Built data management pipelines for research missions in many domains (climate, cancer, Mars, radio astronomy, etc.)

  6. Apache Spark

  7. Spark is… • General-purpose cluster computing software that maximizes use of cluster memory to process data • Used by hundreds of organizations to realize performance gains over previous-generation cluster compute platforms, particularly for map-reduce-style problems

  8. Spark… • Was developed in 2009 at the Algorithms, Machines, and People Lab (AMPLab) at the University of California, Berkeley • Is open source software presently under the governance of the Apache Software Foundation

  9. Why Spark Exists • A confluence of three trends: 1. Increased volume of digital data 2. Tool support and technology liberation 3. Decreasing cost of computer memory (RAM)

  10. Digital data volume • Early days – low-resolution sensors, comparatively few people on the internet – proprietary data, custom solutions, expensive custom hardware

  11. Digital data volume • Modern era – ubiquitous, high-resolution cameras – mobile devices packed with sensors – the Internet of Things (IoT) – open source software, cheap commodity hardware

  12. We've gone from this Internet…

  13. To this… (humanity's global communication platform)

  14. 1.5 billion connected PCs • 3.2 billion connected people • 6 billion connected mobile devices

  15. What are we doing with all of this? Every minute: • 18,000 votes cast on Reddit • 51,000 apps downloaded by Apple users • 350,000 tweets posted on Twitter • 4,100,000 likes recorded on Facebook

  16. We are awash in data… • Monetizing this data is a core competency for many businesses • We need tools to do this effectively at today's scale

  17. Trend 2: Tool support and technology liberation

  18. How far back do you want to go? We have always used tools to help us cope with data

  19. VisiCalc • An early "big data" tool – Allowed businesses to move from the chalkboard to the digital spreadsheet – A phenomenal increase in productivity when running the numbers for a business

  20. Modern-era Spreadsheet Tech • Microsoft Excel: 1,048,576 rows × 16,384 columns

  21. Open Source Alternatives Exist… • Microsoft Excel: 1,048,576 rows × 16,384 columns • Apache OpenOffice: 1,048,576 rows × 1,024 columns

  22. Relational Database Systems • Support thousands of tables, millions of rows…

  23. Relational Database Systems • Support thousands of tables, millions of rows… • Viable open source alternatives exist for many use cases

  24. Modern Big Data Era • MapReduce (2004) – a programming model for parallelizing large-scale computation across clusters of servers

  25. Modern Big Data Era • Hadoop – an open source processing framework for running MapReduce applications on large clusters of commodity (unreliable) hardware

  26. Trend 3: Commoditization of computer memory

  27. Early Days… • Main memory was handmade • You could see each individual bit

  28. Modern Era… • AWS EC2 r3.8xlarge: 244 GiB for US$1.41/hr*

  29. Why talk about memory? • The ability to use memory efficiently distinguishes Spark from Hadoop and contributes to its speed advantage in many scenarios

  30. How Spark Works

  31. Primary abstraction in Spark: Resilient Distributed Datasets (RDDs) • Immutable (read-only), partitioned datasets • Processed in parallel on each cluster node • Fault-tolerant – resilient to node failure

  32. Primary abstraction in Spark: Resilient Distributed Datasets (RDDs) • Use the distributed memory of the cluster to store the state of a computation as a sharable object across jobs (…instead of serializing it to disk)
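
  A quick sketch of the partitioned, fault-tolerant nature of an RDD in PySpark. The partition count is an illustrative assumption, and an existing SparkContext named sc is assumed:

    # Distribute a collection across the cluster in 8 partitions
    rdd = sc.parallelize(range(1000), numSlices=8)
    print(rdd.getNumPartitions())  # 8

    # Each partition is processed in parallel by a task on some node;
    # if a node fails, its lost partitions are recomputed from lineage.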

  33. Traditional MapReduce: [Diagram] Data on disk (HDFS) → Read → Map-1 … Map-n → Write tuples to disk → Read → Reduce-1 … Reduce-n → Write data to disk (HDFS). Every stage round-trips through disk.

  34. Spark RDD Architecture: [Diagram] Data on disk (HDFS) → Read → Map-1 … Map-n → RDD in cluster memory → Reduce-1 … Reduce-n → Write data to disk (HDFS). Intermediate results stay in memory; only input and final output touch disk.

  35. Unified computational model: • Spark unifies the batch and streaming models, which traditionally require different architectures • Somewhat like limits in calculus – if you imagine a time series of RDDs over ever-smaller windows of time, you can approximate streaming workflows
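
  To make the "series of small RDDs" intuition concrete, here is a minimal sketch using Spark Streaming's DStream API, where each batch interval yields one RDD. The host, port, and 1-second batch interval are illustrative assumptions (a text source can be simulated with nc -lk 9999):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StreamingSketch")
    ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

    # Each 1-second window of socket data becomes one RDD in the stream
    lines = ssc.socketTextStream("localhost", 9999)
    counts = lines.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()   # print the word counts computed for each batch RDD

    ssc.start()
    ssc.awaitTermination()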

  36. Two ways to create an RDD in Spark programs: • "Parallelize" an existing collection (e.g. a Python list) in the driver program • Reference a dataset on external storage – text files on disk – anything with a supported Hadoop InputFormat
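
  A minimal PySpark sketch of both creation paths. The file path is a hypothetical example, and an existing SparkContext named sc is assumed:

    # 1. Parallelize an existing collection from the driver program
    numbers = sc.parallelize([1, 2, 3, 4, 5])

    # 2. Reference a dataset on external storage, one record per line
    lines = sc.textFile("/data/transactions.txt")  # hypothetical path

    print(numbers.count(), lines.count())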

  37. RDDs from RDDs: • RDDs are immutable (read-only) • The result of applying transformations and actions to an RDD is a new RDD • RDDs can be persisted to memory – reducing the need to re-compute an RDD every time it is used
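
  For example (a sketch, assuming an existing SparkContext named sc): each transformation below returns a new RDD rather than modifying its parent, and persist() keeps the final result in memory for reuse:

    evens = sc.parallelize(range(10)).filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)   # a new RDD derived from evens
    squares.persist()                      # keep it in memory for reuse
    print(squares.collect())               # [0, 4, 16, 36, 64]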

  38. Writing Spark Programs: Think of Spark programs as a way to describe a sequence of transformations and actions that should be applied to an RDD

  39. Writing Spark Programs: • Transformations create a new dataset from an existing one (e.g. map) • Actions return a value after running a computation (e.g. reduce)

  40. Writing Spark Programs: • Spark provides a rich set of transformations: map, flatMap, filter, sample, union, intersection, distinct, groupByKey, sortByKey, cogroup, pipe, join, cartesian, coalesce, …

  41. Writing Spark Programs: • Spark provides a rich set of actions: reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach, …
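
  A short sketch combining transformations (filter, map) with actions (reduce, count), assuming an existing SparkContext named sc:

    rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

    evens   = rdd.filter(lambda x: x % 2 == 0)   # transformation (lazy)
    doubled = evens.map(lambda x: x * 2)         # transformation (lazy)

    total = doubled.reduce(lambda a, b: a + b)   # action: triggers computation
    print(total, doubled.count())                # 24 3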

  42. Writing Spark Programs: • Transformations are lazily evaluated: they are only computed when a subsequent action (which must return a result) is run • Transformations are recomputed each time an action is run (unless you explicitly persist the resulting RDD to memory)
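
  A sketch of what lazy evaluation and persistence mean in practice. The file path is hypothetical, and an existing SparkContext named sc is assumed:

    words = sc.textFile("/data/words.txt") \
              .flatMap(lambda line: line.split())  # nothing is computed yet

    words.count()     # first action: the file is read and flatMap computed
    words.count()     # without persist(), the whole lineage is recomputed

    words.persist()   # mark the RDD for in-memory caching
    words.count()     # computed once more, then cached
    words.count()     # now served from cluster memory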

  43. Structure of Spark Programs: Spark programs have two principal components: • Driver program • Worker function

  44. Structure of Spark Programs: Driver program • Executes on the master node • Establishes the context (SparkContext)

  45. Structure of Spark Programs: Worker (processing) function • Executes on each worker node • Computes transformations and actions on RDD partitions

  46. Structure of Spark Programs: SparkContext • Holds all of the information about the cluster • Manages what gets shipped to the nodes • Makes life extremely easy for developers
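
  Putting slides 43–46 together, a minimal standalone driver program might look like the sketch below. The driver creates the SparkContext; the lambda passed to map() is the worker function that Spark ships to the worker nodes (the app name and master URL are assumptions):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("DriverSketch").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # The lambda executes on the workers, one task per RDD partition
    result = sc.parallelize(range(100)) \
               .map(lambda x: x * x) \
               .sum()
    print(result)   # 328350

    sc.stop()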

  47. Structure of Spark Programs: Shared Variables • Broadcast variables: efficiently share static, read-only data with the cluster nodes • Accumulators: write-only (from the workers' perspective) variables that serve as counters
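
  A sketch of both shared-variable types. The ticker-symbol lookup table is an invented example, and an existing SparkContext named sc is assumed:

    lookup = sc.broadcast({"BTC": "Bitcoin", "LTC": "Litecoin"})  # read-only on workers
    misses = sc.accumulator(0)                                    # workers may only add to it

    def resolve(symbol):
        if symbol not in lookup.value:
            misses.add(1)          # counted on the workers, read on the driver
            return None
        return lookup.value[symbol]

    names = sc.parallelize(["BTC", "XYZ", "LTC"]).map(resolve).collect()
    print(names, misses.value)     # ['Bitcoin', None, 'Litecoin'] 1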

  49. Interacting with Spark

  50. Spark APIs: Spark provides APIs for several languages • Scala • Java • Python • R • SQL

  51. Using Spark: There are two main ways to leverage Spark • Interactively, through the included command-line shell • Programmatically, via standalone programs submitted as jobs to the master node
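
  A quick sketch of both modes; paths and the master URL are assumptions:

    # Interactive: the PySpark shell (bin/pyspark) starts with a ready-made
    # SparkContext bound to the name `sc`:
    #   >>> sc.parallelize([1, 2, 3]).map(lambda x: x + 1).collect()
    #   [2, 3, 4]

    # Programmatic: save a driver program (like the sketch under slide 46)
    # to app.py and submit it as a job:
    #   $ bin/spark-submit --master spark://master:7077 app.py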
