slide-1
SLIDE 1

Lie to me… Demystifying Spark accumulators

Sergey Zhemzhitsky, s.zhemzhitsky@cleverdata.ru

slide-2
SLIDE 2

• More than 13 years in IT
• 7 years of friendship with data processing technologies
• CTO, engineer, an expert in data management platforms, highly loaded systems and systems integration
• At CleverDATA, in charge of the technical and technological vision and development of the 1DMx product line
• Can periodically be met at professional conferences as a speaker on the mentioned topics

Data Monetization Technology 1DMC - Data Exchange DMPkit – Data Management Platform

DMP About

slide-3
SLIDE 3

A data management platform that lets you build your own solutions for the collection, storage and processing of your data: the core of the entire data monetization infrastructure for companies focused on extracting all the benefits from their data. We offer large and medium-sized companies solutions that reduce the cost of customer acquisition and retention by implementing fully automated communications and online advertising management systems driven by modern AI technologies. A unified access point that connects to a multitude of data providers and data consumers, advertising platforms, marketing channels and ad networks.

Rated by Data Insight: Artificial Intelligence in Russia 2017, in the category of Big Data Services; Market analysis in programmatic 2016-2017, in the Data Market section. Rated by AdIndex Technology 2017 and 2018, in the category of DMP and Processed Data Suppliers.

1DMC - Data Exchange DMPkit – Data Management Platform

DMP Product Line

slide-4
SLIDE 4

REDUCTION OF YOUR CUSTOMER ACQUISITION AND RETENTION COSTS through fully automated communications and online advertising management based on:

ARTIFICIAL INTELLIGENCE YOUR OWN INTERNAL DATA about customers UNIQUE EXTERNAL DATA about your audience

WEBSITE ADS MOBILE APP CRM

EMAIL, SMS, PUSH

SOCIAL

CUSTOMER SERVICE

1DMC - Data Exchange DMPkit – Data Management Platform

DMP 1DMP – Data Marketing Platform

slide-5
SLIDE 5

CONNECTED ADVERTISING SYSTEMS: GOOGLE, MYTARGET, YANDEX, AUDITORIUS, HYBRID, ADSPEND, WEBORAMA, EXEBID, ADRIVER, GETINTENT, MEDIASNIPER, RTBID, VENGO, PMI, ADVARK, APPNEXUS.

1DMC - Data Exchange 1DMC DATA EXCHANGE – an independent marketplace platform

that unites data providers and data consumers to exchange depersonalized knowledge about their audience. At the moment this is the largest independent data-operator platform in the Eastern European segment of the Internet (85M unique users per day), and it has no direct substitutes on the market

DATA USAGE:

DSPs eCOMMERCE Publishers & Media B2B & B2C services DMPs MARKETING CHANNELS

  • Repeated sale
  • Anti-fraud
  • Targeted advertising
  • Scoring

20+ DATA PROVIDERS
9000+ DATA SOURCES
2500+ AUDIENCE ATTRIBUTES
85M TOTAL AUDIENCE PER DAY
10+ ADVERTISING SYSTEMS & DSP

ACCESS TO UNIQUE EXTERNAL 3rd PARTY DATA & MARKETING COMMUNICATION CHANNELS

DMPkit – Data Management Platform

DMP 1DMC – Data Exchange

slide-6
SLIDE 6

DMPkit – Data Management Platform

Audience Onboarding

Uploading your first-party data in a self-serve environment, and matching those users to any publisher's users and any advertiser platform's users

Data Collecting

Tracking and collecting detailed data about user behavior on websites, in mobile applications and on social networks. Uploading data from CRM and transactional systems

Data Processing

Forming a single user profile and a single Customer Journey across all channels and devices, for a detailed understanding of each user based on all the collected data

Audience Enrichment

Using data from external suppliers: data from social networks about online behavior, data from eCommerce sites and online services

Audience Research

Find users who are similar to your most valuable segments or users who are part of an entirely new and critical emerging audience

Campaign Optimization

Using AI for campaign and Customer Journey optimization, predictive data analytics and recommendation models

Audience Insights

Obtaining data on the response to the communication, impressions, clicks, conversions and targeted actions, responses and purchases

Audience Extension

Segmenting by events, profiles, steps of the Customer Journey. Creating predictive models and Look-a-like modeling

THE CORE of DATA MANAGEMENT AND MONETIZATION INFRASTRUCTURE

DMP DMPkit – Data Management Platform

slide-7
SLIDE 7

DMP Tools

slide-8
SLIDE 8

DMP Is it fast?

slide-9
SLIDE 9

DMP Is it fast?

slide-10
SLIDE 10

DMP Is it fast?

slide-11
SLIDE 11

DMP Is it fast?

slide-12
SLIDE 12

DMP How to speed it up?

Brute force - scale your servers either vertically or horizontally by adding more servers, more RAM, more CPUs, more GPUs, more LAN bandwidth, etc.

slide-13
SLIDE 13

DMP How to speed it up?

Brute force - scale your servers either vertically or horizontally by adding more servers, more RAM, more CPUs, more GPUs, more LAN bandwidth, etc. Know your data - don't process the data you don't have to: use partitioning, bucketing, columnar data formats, co-partitioned datasets, etc. (a sketch follows below).
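A minimal sketch of the second approach, assuming a SparkSession spark and an events DataFrame; the path and the day/event_type columns are illustrative:

import spark.implicits._

// Columnar + partitioned layout: queries filtering on `day` read only the
// matching directories, and Parquet's column pruning skips unused columns.
events.write
  .partitionBy("day")
  .parquet("/data/events")

// Only the needed partition and columns are actually scanned.
val clicks = spark.read.parquet("/data/events")
  .filter($"day" === "2018-04-01" && $"event_type" === "click")
  .select("cookie", "event_id")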

slide-14
SLIDE 14

Data Management Platform

slide-15
SLIDE 15

DMP

[Diagram: pixels, impressions, clicks, events, raw data and 3rd-party data flow into the Raw Data Store & Processing; user profiles, segments and aggregates flow through MOM into the Fast Data Store and the Analytical Data Store; user profiles, segments and 3rd-party data are exchanged with ad platforms]

Data Flows

slide-16
SLIDE 16

[Diagram: pixels, impressions, clicks, events, raw data and 3rd-party data flow into the Raw Data Store & Processing; user profiles, segments and aggregates flow through MOM into the Fast Data Store and the Analytical Data Store; user profiles, segments and 3rd-party data are exchanged with ad platforms]

DMP Data Flows

slide-17
SLIDE 17

find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker” during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber" to make a discount of 40%

DMP Segmentation

slide-18
SLIDE 18

find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker” during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber" [impression] [click] [tr. pixel] [tr. pixel]

id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-04-01 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-04-01 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-04-01 18:57:19.628 | …

[cookies]

DMP Segmentation

slide-19
SLIDE 19

find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber"

id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-04-01 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-04-01 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-04-01 18:57:19.628 | …

1 2 3 4 c1

DMP Segmentation

slide-20
SLIDE 20

find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber"

id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-04-01 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-04-01 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-04-01 18:57:19.628 | …

1 2 3 4 c1 (c1, 0)

DMP Segmentation

slide-21
SLIDE 21

find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber"

id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-10-18 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-10-18 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-10-18 18:57:19.628 | …

1 2 3 4 c1 (c1, 0) (c1, 1)

DMP Segmentation

slide-22
SLIDE 22

find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber"

id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-10-18 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-10-18 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-10-18 18:57:19.628 | …

1 2 3 4 c1 (c1, 0) (c1, 1) (c1, 2)

DMP Segmentation

slide-23
SLIDE 23

find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber"

id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-10-18 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-10-18 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-10-18 18:57:19.628 | …

1 2 3 4 c1 (c1, 0) (c1, 1) (c1, 2) (c1, 3)

DMP Segmentation

slide-24
SLIDE 24

find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber"

id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-10-18 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-10-18 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-10-18 18:57:19.628 | …

1 2 3 4 c1 (c1, 0) (c1, 1) (c1, 2) (c1, 3) Ø

DMP Segmentation

slide-25
SLIDE 25

find all users who have taken part in campaign[s] "Star Wars" viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] clicked banner[s] "Darth Vader's lightsaber" visited buying area of "Darth Vader's lightsaber" not visited order confirmed area of "Darth Vader's lightsaber"

DMP Segmentation

slide-26
SLIDE 26

find all users who have taken part in campaign[s] "Star Wars" viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] clicked banner[s] "Darth Vader's lightsaber" visited buying area of "Darth Vader's lightsaber" not visited order confirmed area of "Darth Vader's lightsaber"

(c1, 0) (c1, 1) (c1, 2) (c1, 3) Ø map

DMP Segmentation

slide-27
SLIDE 27

reduce

find all users who have taken part in campaign[s] "Star Wars" viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] clicked banner[s] "Darth Vader's lightsaber" visited buying area of "Darth Vader's lightsaber" not visited order confirmed area of "Darth Vader's lightsaber"

(c1, 0) (c1, 1) (c1, 2) (c1, 3) Ø map (c1, 0;1;2;3) true(0) and true(1) and true(2) and true(3) and not false(4)

DMP Segmentation

slide-28
SLIDE 28

reduce

find all users who have taken part in campaign[s] "Star Wars" viewed banner[s] "Darth Vader" or "Luke Skywalker" during [last] 6 day[s] clicked banner[s] "Darth Vader's lightsaber" visited buying area of "Darth Vader's lightsaber" not visited order confirmed area of "Darth Vader's lightsaber"

(c1, 0) (c1, 1) (c1, 2) (c1, 3) Ø map (c1, 0;1;2;3) true(0) and true(1) and true(2) and true(3) and not false(4) c1

DMP Segmentation

slide-29
SLIDE 29

val predicateMatches = events.flatMap { event =>
  rules.value.foldLeft(Set[((String, String), Set[Int])]()) {
    case (acc, (ruleId, rule)) =>
      if (rule.applyGlobal(event)) acc + ((event.cookie, ruleId) -> rule.getMatched(event))
      else acc
  }
}

val ruleMatches = predicateMatches
  .reduceByKey(_ ++ _)
  .filter { case ((uid, ruleId), predicates) => rules.value(ruleId).evalMatched(predicates) }
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Segmentation
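The rules handle in the code above is presumably a broadcast variable holding the compiled segmentation rules; a sketch of the surrounding setup, with Rule, Event and loadRules() as hypothetical stand-ins:

import org.apache.spark.broadcast.Broadcast

case class Event(cookie: String, eventId: String, eventType: String)

// Hypothetical rule abstraction matching the calls used above.
trait Rule extends Serializable {
  def applyGlobal(event: Event): Boolean          // does the event match the rule at all?
  def getMatched(event: Event): Set[Int]          // indexes of the predicates it satisfies
  def evalMatched(predicates: Set[Int]): Boolean  // evaluate the rule's boolean formula
}

// Broadcast once from the driver; tasks read it cheaply via rules.value.
val rules: Broadcast[Map[String, Rule]] = sc.broadcast(loadRules())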

slide-30
SLIDE 30

val predicateMatches = events.flatMap { event =>
  rules.value.foldLeft(Set[((String, String), Set[Int])]()) {
    case (acc, (ruleId, rule)) =>
      if (rule.applyGlobal(event)) acc + ((event.cookie, ruleId) -> rule.getMatched(event))
      else acc
  }
}

val ruleMatches = predicateMatches
  .reduceByKey(_ ++ _)
  .filter { case ((uid, ruleId), predicates) => rules.value(ruleId).evalMatched(predicates) }
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Segmentation

slide-31
SLIDE 31

val predicateMatches = events.flatMap { event =>
  rules.value.foldLeft(Set[((String, String), Set[Int])]()) {
    case (acc, (ruleId, rule)) =>
      if (rule.applyGlobal(event)) acc + ((event.cookie, ruleId) -> rule.getMatched(event))
      else acc
  }
}

val ruleMatches = predicateMatches
  .reduceByKey(_ ++ _)
  .filter { case ((uid, ruleId), predicates) => rules.value(ruleId).evalMatched(predicates) }
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Segmentation

slide-32
SLIDE 32

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Segmentation

slide-33
SLIDE 33

RDDs

slide-34
SLIDE 34

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)

val stats = ruleMatches.treeAggregate(...)

DMP Spark Actions

slide-35
SLIDE 35

val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

val rdd = sc.textFile("/data/input", 5)
rdd.saveAsTextFile("/data/output")

val stats = FileSystem.getStatistics("file", classOf[RawLocalFileSystem])
stats.getBytesRead shouldBe (data.length * 10L + data.length / 2 +- data.length / 2)

DMP Spark Actions

slide-36
SLIDE 36

val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

val rdd = sc.textFile("/data/input", 5)
rdd.saveAsTextFile("/data/output")

val stats = FileSystem.getStatistics("file", classOf[RawLocalFileSystem])
stats.getBytesRead shouldBe (data.length * 10L + data.length / 2 +- data.length / 2)

DMP Spark Actions

slide-37
SLIDE 37

val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

rdd.saveAsTextFile("/data/output")

val stats = FileSystem.getStatistics("file", classOf[RawLocalFileSystem])
stats.getBytesRead shouldBe (data.length * 10L + data.length / 2 +- data.length / 2)

BYTES READ: 10 486 160 (~10 MB)

DMP Spark Actions

slide-38
SLIDE 38

val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)

stats.getBytesRead shouldBe (data.length * 10L + data.length / 2 +- data.length / 2)

DMP Spark Actions

slide-39
SLIDE 39

val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)

stats.getBytesRead shouldBe (data.length * 10L + data.length / 2 +- data.length / 2)

BYTES READ: 20 841 256 (~20 MB)

DMP Spark Actions

slide-40
SLIDE 40

By default, each transformed RDD may be recomputed each time you run an action on it.

DMP Spark Actions

slide-41
SLIDE 41

By default, each transformed RDD may be recomputed each time you run an action on it.

2x data read

DMP Spark Actions

slide-42
SLIDE 42

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations
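A minimal sketch of that fix, using an explicit storage level (cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/data/input", 5)
  .persist(StorageLevel.MEMORY_AND_DISK)

rdd.saveAsTextFile("/data/output") // first action: reads the input and fills the cache
rdd.count()                        // second action: served from the cache, no second read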

slide-43
SLIDE 43

val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

val rdd = sc.textFile("/data/input", 5).cache()
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)

stats.getBytesRead shouldBe (data.length * 10L + data.length / 2 +- data.length / 2)

DMP Spark Actions

slide-44
SLIDE 44

val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

val rdd = sc.textFile("/data/input", 5).cache()
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)

stats.getBytesRead shouldBe (data.length * 10L + data.length / 2 +- data.length / 2)

BYTES READ: 10 486 160 (~10 MB)

DMP Spark Actions

slide-45
SLIDE 45

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val rdd = sc.textFile("/data/input", 5).cache()
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)
...

DMP Spark Actions

slide-46
SLIDE 46

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val rdd = sc.textFile("/data/input", 5).cache()
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)
...

BYTES READ: 14 557 992 (~14 MB)

DMP Spark Actions

slide-47
SLIDE 47

val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Spark Accumulators

slide-48
SLIDE 48

For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.

https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
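A compact illustration of the two semantics, assuming a local SparkContext sc; the counts in the comments follow the documented behavior:

val inTransformation = sc.longAccumulator("in-transformation")
val inAction = sc.longAccumulator("in-action")

val nums = sc.makeRDD(1 to 100)

// Updated inside a transformation: every recomputation adds again.
val mapped = nums.map { n => inTransformation.add(1); n }
mapped.count()   // inTransformation.value == 100
mapped.collect() // mapped is recomputed: inTransformation.value == 200

// Updated inside an action: each task's update is applied exactly once.
nums.foreach(_ => inAction.add(1)) // inAction.value == 100, even across retries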

slide-49
SLIDE 49

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3)

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-50
SLIDE 50

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3)

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-51
SLIDE 51

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3)

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-52
SLIDE 52

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3)

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-53
SLIDE 53

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3) // <== inserts a new Stage here

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-54
SLIDE 54

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3) // <== inserts a new Stage here

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-55
SLIDE 55

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3)

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-56
SLIDE 56

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3)

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

30 was not equal to 10
Expected :10
Actual   :30

DMP Spark Accumulators

slide-57
SLIDE 57

val blockManager = SparkEnv.get.blockManager
val block = blockManager.diskBlockManager.getAllBlocks()
  .filter(_.isInstanceOf[ShuffleDataBlockId])
  .map(_.asInstanceOf[ShuffleDataBlockId])
  .head

throw new FetchFailedException(
  blockManager.blockManagerId,
  block.shuffleId,
  block.mapId,
  block.reduceId,
  "__spark_stage_failed__")

DMP Spark Accumulators
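The deck uses a failStage helper throughout but only shows its core (above); a plausible reconstruction under stated assumptions (Spark 2.3+, where TaskContext exposes stageAttemptNumber; the helper's name and exact semantics are the speaker's, so treat this as a sketch):

import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.shuffle.FetchFailedException
import org.apache.spark.storage.ShuffleDataBlockId

// Hypothetical: an identity function that, on the first attempt of the given
// stage, reports a lost shuffle block, making Spark re-execute the stage.
def failStage[T](stageId: Int): T => T = (item: T) => {
  val tc = TaskContext.get
  if (tc.stageId == stageId && tc.stageAttemptNumber == 0) {
    val blockManager = SparkEnv.get.blockManager
    val block = blockManager.diskBlockManager.getAllBlocks()
      .collectFirst { case b: ShuffleDataBlockId => b }
      .get
    throw new FetchFailedException(
      blockManager.blockManagerId,
      block.shuffleId, block.mapId, block.reduceId,
      "__spark_stage_failed__")
  }
  item
}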

slide-58
SLIDE 58

val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Spark Accumulators

slide-59
SLIDE 59

Action               | Meaning
collect()            | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()              | Return the number of elements in the dataset.
...                  | ...
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
...                  | ...
foreach(func)        | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

DMP Spark Accumulators

slide-60
SLIDE 60

Action               | Meaning
collect()            | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()              | Return the number of elements in the dataset.
...                  | ...
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
...                  | ...
foreach(func)        | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

DMP Spark Accumulators

slide-61
SLIDE 61

val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)
ruleMatches.foreach(acc.add)

DMP Spark Accumulators

slide-62
SLIDE 62

val acc = sc.longAccumulator
...
val rdd = sc.textFile("/data/input", 5)
rdd.saveAsTextFile("/data/output")
rdd.foreach(_ => acc.add(1L))
...
stats.getBytesRead shouldBe (data.length * 10L + data.length / 2 +- data.length / 2)

BYTES READ: 20 841 256 (~20 MB)

DMP Spark Accumulators

slide-63
SLIDE 63

DMP Custom RDD?

def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  val outputFormatInstance = hadoopConf.getOutputFormat
  val keyClass = hadoopConf.getOutputKeyClass
  val valueClass = hadoopConf.getOutputValueClass
  if (outputFormatInstance == null) {
    throw new SparkException("Output format class not set")
  }
  if (keyClass == null) {
    throw new SparkException("Output key class not set")
  }
  if (valueClass == null) {
    throw new SparkException("Output value class not set")
  }
  SparkHadoopUtil.get.addCredentials(hadoopConf)

  logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " +
    valueClass.getSimpleName + ")")

  if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) {
    // FileOutputFormat ignores the filesystem parameter
    val ignoredFs = FileSystem.get(hadoopConf)
    hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
  }

  val writer = new SparkHadoopWriter(hadoopConf)
  writer.preSetup()

  val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
    // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue, roll it
    // around by taking a mod. We expect that no task will be attempted 2 billion times.
    val taskAttemptId = (context.taskAttemptId % Int.MaxValue).toInt

    val (outputMetrics, callback) = SparkHadoopWriterUtils.initHadoopOutputMetrics(context)

    writer.setup(context.stageId, context.partitionId, taskAttemptId)
    writer.open()
    var recordsWritten = 0L

    Utils.tryWithSafeFinallyAndFailureCallbacks {
      while (iter.hasNext) {
        val record = iter.next()
        writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])

        // Update bytes written metric every few records
        SparkHadoopWriterUtils.maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
        recordsWritten += 1
      }
    }(finallyBlock = writer.close())
    writer.commit()
    outputMetrics.setBytesWritten(callback())
    outputMetrics.setRecordsWritten(recordsWritten)
  }

  self.context.runJob(self, writeToFile)
  writer.commitJob()
}

slide-64
SLIDE 64

def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  val outputFormatInstance = hadoopConf.getOutputFormat
  val keyClass = hadoopConf.getOutputKeyClass
  val valueClass = hadoopConf.getOutputValueClass
  if (outputFormatInstance == null) {
    throw new SparkException("Output format class not set")
  }
  if (keyClass == null) {
    throw new SparkException("Output key class not set")
  }
  if (valueClass == null) {
    throw new SparkException("Output value class not set")
  }
  SparkHadoopUtil.get.addCredentials(hadoopConf)

  logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " +
    valueClass.getSimpleName + ")")

  if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) {
    // FileOutputFormat ignores the filesystem parameter
    val ignoredFs = FileSystem.get(hadoopConf)
    hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
  }

  val writer = new SparkHadoopWriter(hadoopConf)
  writer.preSetup()

  val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
    // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue, roll it
    // around by taking a mod. We expect that no task will be attempted 2 billion times.
    val taskAttemptId = (context.taskAttemptId % Int.MaxValue).toInt

    val (outputMetrics, callback) = SparkHadoopWriterUtils.initHadoopOutputMetrics(context)

    writer.setup(context.stageId, context.partitionId, taskAttemptId)
    writer.open()
    var recordsWritten = 0L

    Utils.tryWithSafeFinallyAndFailureCallbacks {
      while (iter.hasNext) {
        val record = iter.next()
        writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])

        // Update bytes written metric every few records
        SparkHadoopWriterUtils.maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
        recordsWritten += 1
      }
    }(finallyBlock = writer.close())
    writer.commit()
    outputMetrics.setBytesWritten(callback())
    outputMetrics.setRecordsWritten(recordsWritten)
  }

  self.context.runJob(self, writeToFile)
  writer.commitJob()
}

DMP Custom RDD?

slide-65
SLIDE 65

def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  val outputFormatInstance = hadoopConf.getOutputFormat
  val keyClass = hadoopConf.getOutputKeyClass
  val valueClass = hadoopConf.getOutputValueClass
  if (outputFormatInstance == null) {
    throw new SparkException("Output format class not set")
  }
  if (keyClass == null) {
    throw new SparkException("Output key class not set")
  }
  if (valueClass == null) {
    throw new SparkException("Output value class not set")
  }
  SparkHadoopUtil.get.addCredentials(hadoopConf)

  logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " +
    valueClass.getSimpleName + ")")

  if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) {
    // FileOutputFormat ignores the filesystem parameter
    val ignoredFs = FileSystem.get(hadoopConf)
    hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
  }

  val writer = new SparkHadoopWriter(hadoopConf)
  writer.preSetup()

  val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
    ...
    Utils.tryWithSafeFinallyAndFailureCallbacks {
      while (iter.hasNext) {
        val record = iter.next()
        writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])

        // Update bytes written metric every few records
        SparkHadoopWriterUtils.maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
        recordsWritten += 1
      }
    }(finallyBlock = writer.close())
    writer.commit()
    outputMetrics.setBytesWritten(callback())
    outputMetrics.setRecordsWritten(recordsWritten)
  }

  self.context.runJob(self, writeToFile)

  writer.commitJob()
}

DMP Custom RDD or …?

slide-66
SLIDE 66

… or SparkContext?

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

slide-67
SLIDE 67

… or SparkContext?

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

slide-68
SLIDE 68

… or SparkContext?

trait ActionAccumulable extends SparkContext {
  private val accumulators = new ConcurrentHashMap[Long, ActionCallable[_]]()

  abstract override def register(acc: AccumulatorV2[_, _]): Unit = {
    super.register(acc)
    acc match {
      case _: ActionAccumulator[_, _] => this.accumulators.put(acc.id, ActionCallable(acc))
      case _ =>
    }
  }
  ...
}

slide-69
SLIDE 69

… or SparkContext?

trait ActionAccumulable extends SparkContext {
  private val accumulators = new ConcurrentHashMap[Long, ActionCallable[_]]()

  abstract override def register(acc: AccumulatorV2[_, _]): Unit = {
    super.register(acc)
    acc match {
      case _: ActionAccumulator[_, _] => this.accumulators.put(acc.id, ActionCallable(acc))
      case _ =>
    }
  }
  ...
}

slide-70
SLIDE 70

… or SparkContext?

trait ActionAccumulable extends SparkContext {
  private val accumulators = new ConcurrentHashMap[Long, ActionCallable[_]]()

  abstract override def register(acc: AccumulatorV2[_, _]): Unit = {
    super.register(acc)
    acc match {
      case _: ActionAccumulator[_, _] => this.accumulators.put(acc.id, ActionCallable(acc))
      case _ =>
    }
  }
  ...
}

slide-71
SLIDE 71

… or SparkContext?

trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      func(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

slide-72
SLIDE 72

… or SparkContext?

trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      func(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

slide-73
SLIDE 73

… or SparkContext?

trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      func(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

slide-74
SLIDE 74

… or SparkContext?

trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      func(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

slide-75
SLIDE 75

SparkContext!

val sc = new SparkContext(...) with ActionAccumulable
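A fuller sketch of the wiring, assuming the ActionAccumulable trait and the ActionAccumulator factory from the previous slides are on the classpath:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("action-accumulators").setMaster("local[*]")

// Mixing the trait into an anonymous SparkContext subclass: its abstract
// override register/runJob now wrap the stock implementations.
val sc = new SparkContext(conf) with ActionAccumulable

val acc = ActionAccumulator(
  0L,
  (r: Long, t: String) => r + 1L,   // applied to every record an action consumes
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc) // intercepted by ActionAccumulable.register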

slide-76
SLIDE 76

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...

DMP Spark Accumulators

slide-77
SLIDE 77

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...

DMP Spark Accumulators

slide-78
SLIDE 78

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...

DMP Spark Accumulators

slide-79
SLIDE 79

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...

DMP Spark Accumulators

slide-80
SLIDE 80

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...

DMP Spark Accumulators

slide-81
SLIDE 81

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...

BYTES READ: 10 486 160 (~10 MB)

DMP Spark Accumulators

slide-82
SLIDE 82

DMP Spark Accumulators

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...

slide-83
SLIDE 83

DMP Spark Accumulators

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
// rdd.count()

acc.value shouldBe data.length
...

slide-84
SLIDE 84

DMP Spark Accumulators

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
// rdd.count()
rdd.saveAsTextFile("/data/output")

acc.value shouldBe data.length
...

slide-85
SLIDE 85

Task not serializable

org.apache.spark.SparkException: Task not serializable
  at ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at SparkContext.clean(SparkContext.scala:2104)
Caused by: java.io.NotSerializableException: JobConf
Serialization stack:
  - object not serializable (class: JobConf, value: Configuration: ...)
  - field (class: PairRDDFunctions$$anonfun$saveAsHadoopDataset$1,
           name: conf$4, type: class JobConf)
  - object (class PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, <function0>)
  ...
  - field (class: ActionAccumulable$$anonfun$1,
           name: func$1, type: interface Function2)
  - object (class ActionAccumulable$$anonfun$1, <function2>)

DMP Spark Accumulators

slide-86
SLIDE 86

Task not serializable

org.apache.spark.SparkException: Task not serializable
  at ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at SparkContext.clean(SparkContext.scala:2104)
Caused by: java.io.NotSerializableException: JobConf
Serialization stack:
  - object not serializable (class: JobConf, value: Configuration: ...)
  - field (class: PairRDDFunctions$$anonfun$saveAsHadoopDataset$1,
           name: conf$4, type: class JobConf)
  - object (class PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, <function0>)
  ...
  - field (class: ActionAccumulable$$anonfun$1,
           name: func$1, type: interface Function2)
  - object (class ActionAccumulable$$anonfun$1, <function2>)

DMP Spark Accumulators

slide-87
SLIDE 87

Task not serializable

org.apache.spark.SparkException: Task not serializable
  at ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at SparkContext.clean(SparkContext.scala:2104)
Caused by: java.io.NotSerializableException: JobConf
Serialization stack:
  - object not serializable (class: JobConf, value: Configuration: ...)
  - field (class: PairRDDFunctions$$anonfun$saveAsHadoopDataset$1,
           name: conf$4, type: class JobConf)
  - object (class PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, <function0>)
  ...
  - field (class: ActionAccumulable$$anonfun$1,
           name: func$1, type: interface Function2)
  - object (class ActionAccumulable$$anonfun$1, <function2>)

DMP Spark Accumulators

slide-88
SLIDE 88

Traverses the hierarchy of enclosing closures, nulling out any references that are not actually used by the starting closure but are still included in the compiled classes, turning "usually" non-serializable closures into serializable ones

DMP ClosureCleaner intro.
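ClosureCleaner can only null out enclosing references that a closure never touches; a reference the closure actually uses, such as a field read through `this`, has to be removed by hand. A minimal sketch of that well-known pattern:

import org.apache.spark.rdd.RDD

class Adder(n: Int) { // Adder itself is not Serializable
  def addAll(rdd: RDD[Int]): RDD[Int] =
    rdd.map(_ + n) // compiles to x => x + this.n, so the closure drags `this` along

  def addAllSafe(rdd: RDD[Int]): RDD[Int] = {
    val local = n      // copy the field into a local val
    rdd.map(_ + local) // the closure now captures only `local` and serializes fine
  }
}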

slide-89
SLIDE 89

Traverses the hierarchy of enclosing closures, nulling out any references that are not actually used by the starting closure but are still included in the compiled classes, turning "usually" non-serializable closures into serializable ones

DMP ClosureCleaner intro.

slide-90
SLIDE 90

Traverses the hierarchy of enclosing closures, nulling out any references that are not actually used by the starting closure but are still included in the compiled classes, turning "usually" non-serializable closures into serializable ones

DMP ClosureCleaner intro.

slide-91
SLIDE 91

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

DMP ClosureCleaner intro.

slide-92
SLIDE 92

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

DMP ClosureCleaner intro.

slide-93
SLIDE 93

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

DMP ClosureCleaner intro.

slide-94
SLIDE 94

trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      func(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

DMP ClosureCleaner intro.

slide-95
SLIDE 95

trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      func(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

DMP ClosureCleaner intro.

slide-96
SLIDE 96

trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val cleanF = clean(func)
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      cleanF(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

DMP ClosureCleaner intro.

slide-97
SLIDE 97

DMP Spark Accumulators

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
// rdd.count()
rdd.saveAsTextFile("/data/output")

acc.value shouldBe data.length
...

slide-98
SLIDE 98

DMP Spark Accumulators

sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
// rdd.count()
rdd.saveAsTextFile("/data/output")

acc.value shouldBe data.length
...

BYTES READ: 10 486 160 (~10 MB)

slide-99
SLIDE 99

Datasets

slide-100
SLIDE 100

DMP Why Datasets?

slide-101
SLIDE 101

DataFrame = Dataset[Row]

DMP Spark Quiz

slide-102
SLIDE 102

DataFrame = Dataset[Row]
Dataset = RDD[Row]

DMP Spark Quiz

slide-103
SLIDE 103

DataFrame = Dataset[Row]
Dataset = RDD[Row]
Dataset = RDD[InternalRow]

DMP Spark Quiz
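One way to check the last answer, using Spark's developer-facing queryExecution hook (an illustration, not a stable contract):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

val ds = spark.range(10)

val decoded: RDD[java.lang.Long] = ds.rdd                // public API: user-level objects
val internal: RDD[InternalRow] = ds.queryExecution.toRdd // physical plan output: InternalRows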

slide-104
SLIDE 104

import spark.implicits._
import org.apache.spark.sql.functions._

val acc = sc.longAccumulator
val data = 1L to 10L
val ds = spark.createDataset(data)

val accDs = ds.map { item =>
  acc.add(item)
  item
}

accDs.agg(sum("value")).first().getLong(0) shouldBe data.sum

DMP Dataset Accumulators

slide-105
SLIDE 105

import spark.implicits._
import org.apache.spark.sql.functions._

val acc = sc.longAccumulator
val data = 1L to 10L
val ds = spark.createDataset(data)

val accDs = ds.map { item =>
  acc.add(item)
  item
}

accDs.agg(sum("value")).first().getLong(0) shouldBe data.sum

DMP Dataset Accumulators

slide-106
SLIDE 106

import spark.implicits._
import org.apache.spark.sql.functions._

val acc = sc.longAccumulator
val data = 1L to 10L
val ds = spark.createDataset(data)

val accDs = ds.map { item =>
  acc.add(item)
  item
}

acc.sum shouldBe data.sum

DMP Dataset Accumulators

slide-107
SLIDE 107

import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong

DMP Dataset Accumulators

slide-108
SLIDE 108

import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong

DMP Dataset Accumulators

slide-109
SLIDE 109

Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver):
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.sql.catalyst.InternalRow
  at Spec$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$6$$anonfun$14.apply(Spec.scala:114)
  at SimpleAccumulator.add(SimpleAccumulator.scala:66)
  at ActionAccumulatorCallable.add(ActionCallable.scala:64)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.next(ActionAccumulableSparkContext.scala:181)
  at Iterator$class.foreach(Iterator.scala:893)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.foreach(ActionAccumulableSparkContext.scala:170)
  at Growable$class.$plus$plus$eq(Growable.scala:59)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)

DMP Dataset Accumulators

slide-110
SLIDE 110

import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

acc.value shouldBe data.size.toLong

DMP Dataset Accumulators

slide-111
SLIDE 111

import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

acc.value shouldBe data.size.toLong

DMP Dataset Accumulators

slide-112
SLIDE 112

import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

acc.value shouldBe data.size.toLong

import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong

DMP Dataset Accumulators

slide-113
SLIDE 113

Dataset = RDD[InternalRow] ?

DMP Dataset Accumulators

slide-114
SLIDE 114

Dataset = RDD[Array[Byte]] ???

DMP Dataset Accumulators

slide-115
SLIDE 115

abstract class SparkPlan extends QueryPlan[SparkPlan] ... {
  def executeToIterator(): Iterator[InternalRow] = {
  }
}

DMP Dataset Accumulators

slide-116
SLIDE 116

DMP Dataset Accumulators

class Dataset[T] private[sql](...) extends Serializable {
  def toLocalIterator(): java.util.Iterator[T] =
    withCallback("toLocalIterator", toDF()) { _ =>
      withNewExecutionId {
        queryExecution
          .executedPlan
          .executeToIterator()
          .map(boundEnc.fromRow)
          .asJava
      }
    }
}

slide-117
SLIDE 117

abstract class SparkPlan extends QueryPlan[SparkPlan] ... {
  def executeToIterator(): Iterator[InternalRow] = {
    getByteArrayRdd()
  }

  def getByteArrayRdd(n: Int = -1): RDD[Array[Byte]] = {
  }
}

DMP Dataset Accumulators

slide-118
SLIDE 118

abstract class SparkPlan extends QueryPlan[SparkPlan] ... {
  def executeToIterator(): Iterator[InternalRow] = {
    getByteArrayRdd()
  }

  def getByteArrayRdd(n: Int = -1): RDD[Array[Byte]] = {
    execute().mapPartitionsInternal { iter =>
      val bos = new ByteArrayOutputStream()
      while (iter.hasNext && (n < 0 || count < n)) {
        ...
      }
      Iterator(bos.toByteArray)
    }
  }
}

DMP Dataset Accumulators

slide-119
SLIDE 119

abstract class SparkPlan extends QueryPlan[SparkPlan] ... {
  def executeToIterator(): Iterator[InternalRow] = {
    getByteArrayRdd().toLocalIterator
  }

  def getByteArrayRdd(n: Int = -1): RDD[Array[Byte]] = {
  }
}

DMP Dataset Accumulators

slide-120
SLIDE 120

abstract class SparkPlan extends QueryPlan[SparkPlan] ... {
  def executeToIterator(): Iterator[InternalRow] = {
    getByteArrayRdd().toLocalIterator.flatMap(decodeUnsafeRows)
  }

  def getByteArrayRdd(n: Int = -1): RDD[Array[Byte]] = {
  }

  def decodeUnsafeRows(bytes: Array[Byte]): Iterator[InternalRow] = {
  }
}

DMP Dataset Accumulators

slide-121
SLIDE 121

abstract class SparkPlan extends QueryPlan[SparkPlan] ... {
  def executeToIterator(): Iterator[InternalRow] = {
    getByteArrayRdd().toLocalIterator.flatMap(decodeUnsafeRows)
  }

  def getByteArrayRdd(n: Int = -1): RDD[Array[Byte]] = {
  }

  def decodeUnsafeRows(bytes: Array[Byte]): Iterator[InternalRow] = {
    val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
    val bis = new ByteArrayInputStream(bytes)
    val ins = new DataInputStream(codec.compressedInputStream(bis))
    new Iterator[InternalRow] { ... }
  }
}

DMP Dataset Accumulators

slide-122
SLIDE 122
override def add(v: Any): Unit = v match {
  case arr: Array[Byte] =>
    val rows = decodeUnsafeRows(arr, encoder.schema.length)
    _value = rows.foldLeft(_value) { (acc, row) => add(acc, encoder.fromRow(row)) }
  case row: InternalRow =>
    _value = add(_value, encoder.fromRow(row))
  case _ =>
    throw new IllegalArgumentException(
      s"Value of unexpected data type received: ${v.getClass.getName}, " +
        "expecting: Array[Byte] or InternalRow")
}

DMP Dataset Accumulators

slide-123
SLIDE 123

import spark.implicits._

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong

DMP Dataset Accumulators

slide-124
SLIDE 124

import spark.implicits._

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong

DMP Dataset Accumulators

slide-125
SLIDE 125

import spark.implicits._

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong

DMP Dataset Accumulators

slide-126
SLIDE 126

case class Item(value: Long)

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data).as[Item]
ds.agg(sum("value")).first()

acc.value shouldBe data.sum

DMP Dataset Accumulators

slide-127
SLIDE 127

case class Item(value: Long)

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data).as[Item]
ds.agg(sum("value")).first()

acc.value shouldBe data.sum

DMP Dataset Accumulators

slide-128
SLIDE 128

case class Item(value: Long)

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data).as[Item]
ds.agg(sum("value")).first()

acc.value shouldBe data.sum

DMP Dataset Accumulators

slide-129
SLIDE 129
override def add(v: Any): Unit = v match {
  case arr: Array[Byte] =>
    val rows = decodeUnsafeRows(arr, encoder.schema.length)
    _value = rows.foldLeft(_value) { (acc, row) => add(acc, encoder.fromRow(row)) }
  case row: InternalRow =>
    _value = add(_value, encoder.fromRow(row))
  case _ =>
    throw new IllegalArgumentException(
      s"Value of unexpected data type received: ${v.getClass.getName}, " +
        "expecting: Array[Byte] or InternalRow")
}

DMP Double Decoding

slide-130
SLIDE 130

DMP Schema Inference

import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

spark.read.csv("/tmp/output").as[Item].count()

slide-131
SLIDE 131

DMP Schema Inference

import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

spark.read.csv("/tmp/output").as[Item].count()

slide-132
SLIDE 132

DMP Schema Inference

import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

spark.read.csv("/tmp/output").as[Item].count()

slide-133
SLIDE 133

Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver):
java.lang.IllegalArgumentException: Value of unexpected data type received:
java.lang.String, expecting: Array[Byte] or InternalRow
  at DatasetActionAccumulator.add(DatasetActionAccumulator.scala:141)
  at ActionAccumulatorCallable.add(ActionCallable.scala:64)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.next(ActionAccumulableSparkContext.scala:181)
  at Iterator$$anon$10.next(Iterator.scala:393)
  at Iterator$class.foreach(Iterator.scala:893)
  at AbstractIterator.foreach(Iterator.scala:1336)
  at Growable$class.$plus$plus$eq(Growable.scala:59)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at TraversableOnce$class.to(TraversableOnce.scala:310)
  at AbstractIterator.to(Iterator.scala:1336)

The failure is expected here: without an explicit schema, spark.read.csv types every column as a String, so the values reaching the accumulator are Strings rather than rows matching the Item encoder.

DMP Schema Inference

slide-134
SLIDE 134

DMP Schema Inference

import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val schema = StructType(Seq(StructField("value", LongType)))
spark.read.schema(schema).csv("/tmp/output").as[Item].count()

slide-135
SLIDE 135

DMP Schema Inference

import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val schema = StructType(Seq(StructField("value", LongType)))
spark.read.schema(schema).csv("/tmp/output").as[Item].count()

slide-136
SLIDE 136

Dataset Accumulators :: Aggregation differences
DMP Confusing Aggregate Functions' Behavior

RDDs

val acc = ActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
sc
  .makeRDD(data)
  .map(Item)
  .count()

acc.value

Datasets

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
spark
  .createDataset(data)
  .as[Item]
  .count()

acc.value

slide-137
SLIDE 137

RDDs

val acc = ActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
sc
  .makeRDD(data)
  .map(Item)
  .count()

acc.value

Datasets

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
spark
  .createDataset(data)
  .as[Item]
  .count()

acc.value

Dataset Accumulators :: Aggregation differences
DMP Confusing Aggregate Functions' Behavior

RDDs: 55    Datasets: 10

slide-138
SLIDE 138

RDDs

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

Datasets

def count(): Long = withCallback("count", groupBy().count()) { df =>
  df.collect(needCallback = false).head.getLong(0)
}

Dataset Accumulators :: Aggregation differences
DMP Confusing Aggregate Functions' Behavior

RDD.count() streams every element through the result tasks, so the per-record accumulator sums all ten values (1 + 2 + … + 10 = 55). Dataset.count() is planned as an aggregate (groupBy().count()), so only the already-aggregated count row reaches the accumulator, which therefore sees 10.

slide-139
SLIDE 139

DMP Accumulators

slide-140
SLIDE 140

DMP Accumulators

[Diagram: an action creates a Job; the Job consists of a ShuffleMapStage, with one ShuffleMapTask per partition, followed by a ResultStage, with one ResultTask per partition]

slide-141
SLIDE 141

DMP Accumulators

[Diagram: an action creates a Job; the Job consists of a ShuffleMapStage, with one ShuffleMapTask per partition, followed by a ResultStage, with one ResultTask per partition]

slide-142
SLIDE 142

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3) // <== a new Stage starts here

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-143
SLIDE 143

val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3) // <== a new Stage starts here

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

Dangerous

slide-144
SLIDE 144

DMP Accumulators

[Diagram: an action creates a Job; the Job consists of a ShuffleMapStage, with one ShuffleMapTask per partition, followed by a ResultStage, with one ResultTask per partition]

slide-145
SLIDE 145

val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3)

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-146
SLIDE 146

val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3) // <== inserts a new Stage here

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-147
SLIDE 147

val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3) // <== inserts a new Stage here

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-148
SLIDE 148

val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3) // <== inserts a new Stage here

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-149
SLIDE 149

val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3) // <== inserts a new Stage here

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-150
SLIDE 150

val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3) // <== inserts a new Stage here

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

slide-151
SLIDE 151

val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3) // <== inserts a new Stage here

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators

10 was equal to 10
Expected :10
Actual   :10

slide-152
SLIDE 152

val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...) // <== a new Stage starts here
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Spark Accumulators

Safe Dangerous

slide-153
SLIDE 153

DMP Accumulators

Dangerous

slide-154
SLIDE 154

DMP Accumulators

Dangerous Safe Safe

slide-155
SLIDE 155

DMP Accumulators

Safe Safe

slide-156
SLIDE 156

DMP More Info

SPARK-732   Accumulator should only be updated once for each task in result stage
SPARK-3628  Don't apply accumulator updates multiple times for tasks in result stages
SPARK-12469 Data Property Accumulators for Spark (formerly Consistent Accumulators)
SPARK-22681 Accumulator should only be updated once for each task in result stage
...

slide-157
SLIDE 157

https://github.com/cleverdata/dmpkit-spark-standard

DMP Coming Soon ...

slide-158
SLIDE 158

s.zhemzhitsky@cleverdata.ru
info@cleverdata.ru

slide-159
SLIDE 159
slide-160
SLIDE 160

http://cleverdata.ru | https://1dmp.io | https://1dmc.io | info@cleverdata.ru | +7 (495) 782-38-60
http://cleverleaf.co.uk | https://1dmp.io | https://1dmc.io | info@cleverleaf.co.uk | +44 (782) 785-14-28