Lie to Me: Demystifying Spark Accumulators (Sergey Zhemzhitsky)
Sergey Zhemzhitsky (s.zhemzhitsky@cleverdata.ru), CleverDATA
- More than 13 years in IT
- 7 years of friendship with data processing technologies
- CTO, engineer, an expert in data management platforms, highly loaded systems and systems integration
- At CleverDATA, in charge of the technical and technological vision and development of the 1DMx product line
- Speaks regularly at professional conferences on these topics
DMP About
A data management platform that lets you build your own solutions for the collection, storage and processing of your data. The core of the entire data monetization infrastructure for companies that are focused on extracting all the benefits from their data. We offer solutions that reduce the cost of customer acquisition and retention for large and medium-sized companies by implementing fully automated communications and online advertising management systems driven by modern AI technologies. A unified access point that connects to a multitude of data providers and data consumers, advertising platforms, marketing channels and ad networks.
Rated by Data Insight (Artificial Intelligence in Russia, 2017) in the category of Big Data Services. Listed in the Data Market section of the programmatic market analysis 2016-2017. Rated by AdIndex Technology 2017 and 2018 in the categories of DMP and Processed Data Suppliers.
DMP Product Line
REDUCTION OF YOUR CUSTOMER ACQUISITION AND RETENTION COSTS through fully automated communications and online advertising management based on:
ARTIFICIAL INTELLIGENCE YOUR OWN INTERNAL DATA about customers UNIQUE EXTERNAL DATA about your audience
WEBSITE ADS MOBILE APP CRM
EMAIL, SMS, PUSH
SOCIAL
CUSTOMER SERVICE
DMP 1DMP – Data Marketing Platform
CONNECTED ADVERTISING SYSTEMS: GOOGLE, MYTARGET, YANDEX, AUDITORIUS, HYBRID, ADSPEND, WEBORAMA, EXEBID, ADRIVER, GETINTENT, MEDIASNIPER, RTBID, VENGO, PMI, ADVARK, APPNEXUS.
1DMC DATA EXCHANGE – an independent marketplace platform that unites data providers and data consumers to exchange depersonalized knowledge about their audience. At the moment it is the largest independent data-exchange platform in the Eastern European segment of the Internet (85M unique users per day) and has no direct substitutes on the market.
DATA USAGE:
DSPs eCOMMERCE Publishers & Media B2B & B2C services DMPs MARKETING CHANNELS
- Repeated sale
- Anti-fraud
- Targeted advertising
- Scoring
20+ DATA PROVIDERS · 9000+ DATA SOURCES · 2500+ AUDIENCE ATTRIBUTES · 85M TOTAL AUDIENCE PER DAY · 10+ ADVERTISING SYSTEMS & DSPs
ACCESS TO UNIQUE EXTERNAL 3rd PARTY DATA & MARKETING COMMUNICATION CHANNELS
DMPkit – Data Management Platform
DMP 1DMC – Data Exchange
DMPkit – Data Management Platform Audience Onboarding
Uploading your first-party data in a self-serve environment and matching those users to any publisher's or advertiser platform's users
Data Collecting
Tracking and collecting detailed data on user behavior on websites, mobile applications and social networks. Uploading data from CRM and transactional systems
Data Processing
Forming a single user profile and a single Customer Journey across all channels and devices, for a detailed understanding of each user based on all the collected data
Audience Enrichment
Using data from external suppliers: social network data on online behavior, data from eCommerce sites and online services
Audience Research
Find users who are similar to your most valuable segments or users who are part of an entirely new and critical emerging audience
Campaign Optimization
Using AI for campaign and Customer Journey optimization, predictive data analytics and recommendation models
Audience Insights
Obtaining data on responses to communications: impressions, clicks, conversions, targeted actions and purchases
Audience Extension
Segmenting by events, profiles, and steps of the Customer Journey. Creating predictive models and look-alike modeling
THE CORE of DATA MANAGEMENT AND MONETIZATION INFRASTRUCTURE
DMP DMPkit – Data Management Platform
DMP Tools
DMP Is it fast?
DMP How to speed it up?
Brute force - scale your servers either vertically or horizontally by adding more servers, more RAM, more CPUs, more GPUs, more LAN bandwidth, etc.
Know your data - don't process the data you don't have to: use partitioning, bucketing, columnar data formats, co-partitioned datasets, etc. (a sketch follows below)
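For instance, a minimal "know your data" sketch (not from the slides; table and column names are made up, assuming an events DataFrame with event_date, cookie, campaign_id and timestamp columns): writing events partitioned by date, bucketed by cookie and stored in a columnar format lets later jobs skip irrelevant files and columns.

import org.apache.spark.sql.functions.col

// Hypothetical layout for the raw events
events.write
  .partitionBy("event_date")          // later date filters prune whole partitions
  .bucketBy(64, "cookie")             // pre-shuffled layout for joins/aggregations by cookie
  .sortBy("timestamp")
  .format("parquet")                  // columnar: only the referenced columns are read
  .saveAsTable("events_bucketed")

// A segmentation job then scans only the needed days and columns
spark.table("events_bucketed")
  .where(col("event_date") >= "2018-10-12" && col("campaign_id") === "Star Wars")
  .select("cookie", "event_id", "event_type")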
DMP Data Management Platform

[Architecture diagram: pixels, impressions, clicks, events and 3rd-party raw data arrive (via MOM) at the Raw Data Store & Processing layer, which feeds the Analytical Data Store and the Fast Data Store; user profiles, segments and aggregates are exchanged with ad platforms and 3rd-party data partners.]

DMP Data Flows
find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker” during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber" to make a discount of 40%
DMP Segmentation
find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker” during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber" [impression] [click] [tr. pixel] [tr. pixel]
id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-04-01 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-04-01 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-04-01 18:57:19.628 | …

[cookies]
map - for every matching event, emit the indices of the matched predicates per (cookie, rule): (c1, 0), (c1, 1), (c1, 2), (c1, 3); the "not visited" predicate has no matching event (Ø)
reduce - combine the indices per (cookie, rule) into (c1, 0;1;2;3) and evaluate the rule: true(0) and true(1) and true(2) and true(3) and not false(4) ⇒ c1 matches the segment
DMP Segmentation
val predicateMatches = events.flatMap { event =>
  rules.value.foldLeft(Set[((String, String), Set[Int])]()) {
    case (acc, (ruleId, rule)) =>
      if (rule.applyGlobal(event)) acc + ((event.cookie, ruleId) -> rule.getMatched(event))
      else acc
  }
}

val ruleMatches = predicateMatches
  .reduceByKey(_ ++ _)
  .filter { case ((uid, ruleId), predicates) => rules.value(ruleId).evalMatched(predicates) }
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Segmentation
val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches
  .saveAsNewAPIHadoopDataset(...)
DMP Segmentation
RDDs
val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches
  .saveAsNewAPIHadoopDataset(...)

val stats = ruleMatches
  .treeAggregate(...)
DMP Spark Actions
val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

val rdd = sc.textFile("/data/input", 5)
rdd.saveAsTextFile("/data/output")

val stats = FileSystem
  .getStatistics("file", classOf[RawLocalFileSystem])

stats.getBytesRead shouldBe data.length * 10L + data.length / 2 +- data.length / 2
BYTES READ: 10 486 160 (~10 MB)
DMP Spark Actions
val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)

stats.getBytesRead shouldBe data.length * 10L + data.length / 2 +- data.length / 2
BYTES READ: 20 841 256 (~20 MB)
DMP Spark Actions
By default, each transformed RDD may be recomputed each time you run an action on it.

⇒ 2x data read

However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations
val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

val rdd = sc.textFile("/data/input", 5).cache()
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)

stats.getBytesRead shouldBe data.length * 10L + data.length / 2 +- data.length / 2
BYTES READ: 10 486 160 (~10 MB)
DMP Spark Actions
sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val rdd = sc.textFile("/data/input", 5).cache()
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)
...
BYTES READ: 14 557 992 (~14 MB)
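The extra ~4 MB here comes from cached partitions that did not fit into the tiny test memory and were re-read from the source. One possible mitigation (a sketch, not shown in the slides) is a storage level that spills evicted blocks to local disk instead of dropping them:

import org.apache.spark.storage.StorageLevel

// Same pipeline as above, but evicted cached blocks go to local disk
// instead of being recomputed by re-reading /data/input.
val rdd = sc.textFile("/data/input", 5).persist(StorageLevel.MEMORY_AND_DISK)
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)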
DMP Spark Actions
val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches
  .saveAsNewAPIHadoopDataset(...)
DMP Spark Accumulators
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3) // <== inserts a new Stage here

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length
30 was not equal to 10 Expected :10 Actual :30
DMP Spark Accumulators
val blockManager = SparkEnv.get.blockManager
val block = blockManager.diskBlockManager.getAllBlocks()
  .filter(_.isInstanceOf[ShuffleDataBlockId])
  .map(_.asInstanceOf[ShuffleDataBlockId])
  .head

throw new FetchFailedException(
  blockManager.blockManagerId,
  block.shuffleId, block.mapId, block.reduceId,
  "__spark_stage_failed__")
DMP Spark Accumulators
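The failStage(n) helper used in the test is not defined in the slides; a possible implementation (an assumption, built around the FetchFailedException trick above and TaskContext.stageAttemptNumber available since Spark 2.3) could look like this:

import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.shuffle.FetchFailedException
import org.apache.spark.storage.ShuffleDataBlockId

// Pass records through unchanged, but fail the first `attempts` attempts of the
// current stage with a FetchFailedException, forcing Spark to re-run the parent
// ShuffleMapStage (and to re-apply its accumulator updates).
def failStage[T](attempts: Int): T => T = { record =>
  if (TaskContext.get.stageAttemptNumber < attempts) {
    val blockManager = SparkEnv.get.blockManager
    val block = blockManager.diskBlockManager.getAllBlocks()
      .filter(_.isInstanceOf[ShuffleDataBlockId])
      .map(_.asInstanceOf[ShuffleDataBlockId])
      .head
    throw new FetchFailedException(
      blockManager.blockManagerId,
      block.shuffleId, block.mapId, block.reduceId,
      "__spark_stage_failed__")
  }
  record
}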
val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches
  .saveAsNewAPIHadoopDataset(...)
DMP Spark Accumulators
Action | Meaning
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() | Return the number of elements in the dataset.
... | ...
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
... | ...
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

DMP Spark Accumulators
val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)
ruleMatches.foreach(acc.add)
DMP Spark Accumulators
val acc = sc.longAccumulator
...
val rdd = sc.textFile("/data/input", 5)
rdd.saveAsTextFile("/data/output")
rdd.foreach(_ => acc.add(1L))
...
stats.getBytesRead shouldBe data.length * 10L + data.length / 2 +- data.length / 2
BYTES READ: 20 841 256 (~20 MB)
DMP Spark Accumulators
DMP Custom RDD?

def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  val outputFormatInstance = hadoopConf.getOutputFormat
  val keyClass = hadoopConf.getOutputKeyClass
  val valueClass = hadoopConf.getOutputValueClass
  if (outputFormatInstance == null) {
    throw new SparkException("Output format class not set")
  }
  if (keyClass == null) {
    throw new SparkException("Output key class not set")
  }
  if (valueClass == null) {
    throw new SparkException("Output value class not set")
  }
  SparkHadoopUtil.get.addCredentials(hadoopConf)

  logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " +
    valueClass.getSimpleName + ")")

  if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) {
    // FileOutputFormat ignores the filesystem parameter
    val ignoredFs = FileSystem.get(hadoopConf)
    hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
  }

  val writer = new SparkHadoopWriter(hadoopConf)
  writer.preSetup()

  val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
    // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue, roll it
    // around by taking a mod. We expect that no task will be attempted 2 billion times.
    val taskAttemptId = (context.taskAttemptId % Int.MaxValue).toInt

    val (outputMetrics, callback) = SparkHadoopWriterUtils.initHadoopOutputMetrics(context)

    writer.setup(context.stageId, context.partitionId, taskAttemptId)
    writer.open()
    var recordsWritten = 0L

    Utils.tryWithSafeFinallyAndFailureCallbacks {
      while (iter.hasNext) {
        val record = iter.next()
        writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])

        // Update bytes written metric every few records
        SparkHadoopWriterUtils.maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
        recordsWritten += 1
      }
    }(finallyBlock = writer.close())
    writer.commit()
    outputMetrics.setBytesWritten(callback())
    outputMetrics.setRecordsWritten(recordsWritten)
  }

  self.context.runJob(self, writeToFile)
  writer.commitJob()
}
DMP Custom RDD or …?
… or SparkContext?

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
trait ActionAccumulable extends SparkContext {

  private val accumulators = new ConcurrentHashMap[Long, ActionCallable[_]]()

  abstract override def register(acc: AccumulatorV2[_, _]): Unit = {
    super.register(acc)
    acc match {
      case _: ActionAccumulator[_, _] => this.accumulators.put(acc.id, ActionCallable(acc))
      case _ =>
    }
  }
  ...
}
trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      func(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}
SparkContext!

val sc = new SparkContext(...) with ActionAccumulable
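The ActionAccumulator and ActionCallable types registered above are not defined in the slides; as a rough idea of the accumulator side (an assumption, not the talk's actual implementation), it can be a generic fold-style AccumulatorV2 parameterized by a seqOp/combOp pair, much like the arguments of treeAggregate:

import org.apache.spark.util.AccumulatorV2

// Hypothetical sketch of an ActionAccumulator built on the AccumulatorV2 API.
class ActionAccumulator[T, R](zero: R, seqOp: (R, T) => R, combOp: (R, R) => R)
  extends AccumulatorV2[T, R] {

  private var _value: R = zero

  override def isZero: Boolean = _value == zero
  override def copy(): ActionAccumulator[T, R] = {
    val acc = new ActionAccumulator[T, R](zero, seqOp, combOp)
    acc._value = _value
    acc
  }
  override def reset(): Unit = _value = zero
  override def add(v: T): Unit = _value = seqOp(_value, v)              // fold a record in
  override def merge(other: AccumulatorV2[T, R]): Unit =
    _value = combOp(_value, other.value)                                // combine partial results
  override def value: R = _value
}

object ActionAccumulator {
  def apply[T, R](zero: R, seqOp: (R, T) => R, combOp: (R, R) => R): ActionAccumulator[T, R] =
    new ActionAccumulator(zero, seqOp, combOp)
}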
sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...
BYTES READ: 10 486 160 (~10 MB)
DMP Spark Accumulators
sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
// rdd.count()
rdd.saveAsTextFile("/data/output")

acc.value shouldBe data.length
...
Task not serializable

org.apache.spark.SparkException: Task not serializable
  at ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at SparkContext.clean(SparkContext.scala:2104)
Caused by: java.io.NotSerializableException: JobConf
Serialization stack:
  - object not serializable (class: JobConf, value: Configuration: ...
  - field (class: PairRDDFunctions$$anonfun$saveAsHadoopDataset$1,
           name: conf$4, type: class JobConf)
  - object (class PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, <function0>)
  ...
  - field (class: ActionAccumulable$$anonfun$1,
           name: func$1, type: interface Function2)
  - object (class ActionAccumulable$$anonfun$1, <function2>)
DMP Spark Accumulators
Traverses the hierarchy of enclosing closures and nulls out any references that are not actually used by the starting closure but are still included in the compiled classes, turning "usually" non-serializable closures into serializable ones.

DMP ClosureCleaner intro.
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

DMP ClosureCleaner intro.
trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val cleanF = clean(func) // <== clean the closure before wrapping it
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      cleanF(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

DMP ClosureCleaner intro.
DMP Spark Accumulators
sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
// rdd.count()
rdd.saveAsTextFile("/data/output")

acc.value shouldBe data.length
...
BYTES READ: 10 486 160 (~10 MB)
Datasets
DMP Why Datasets?
DataFrame = Dataset[Row]
Dataset = RDD[Row]
Dataset = RDD[InternalRow]

DMP Spark Quiz
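A quick way to peek at the InternalRow layer (a sketch using the developer-facing queryExecution API, not something shown in the slides):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

val ds = spark.range(10)                              // Dataset[java.lang.Long] with a single "id" column
val internal: RDD[InternalRow] = ds.queryExecution.toRdd
internal.map(_.getLong(0)).collect()                  // operates on the unsafe-row representation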
import spark.implicits._
import org.apache.spark.sql.functions._

val acc = sc.longAccumulator
val data = 1L to 10L
val ds = spark.createDataset(data)

val accDs = ds.map { item =>
  acc.add(item)
  item
}

accDs.agg(sum("value")).first().getLong(0) shouldBe data.sum
acc.sum shouldBe data.sum

DMP Dataset Accumulators
import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong
DMP Dataset Accumulators
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver):
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.sql.catalyst.InternalRow
  at Spec$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$6$$anonfun$14.apply(Spec.scala:114)
  at SimpleAccumulator.add(SimpleAccumulator.scala:66)
  at ActionAccumulatorCallable.add(ActionCallable.scala:64)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.next(ActionAccumulableSparkContext.scala:181)
  at Iterator$class.foreach(Iterator.scala:893)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.foreach(ActionAccumulableSparkContext.scala:170)
  at Growable$class.$plus$plus$eq(Growable.scala:59)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
DMP Dataset Accumulators
import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

acc.value shouldBe data.size.toLong
DMP Dataset Accumulators
Dataset = RDD[InternalRow] ?
DMP Dataset Accumulators
Dataset = RDD[Array[Byte]] ???
DMP Dataset Accumulators
class Dataset[T] private[sql](...) extends Serializable {
  def toLocalIterator(): java.util.Iterator[T] =
    withCallback("toLocalIterator", toDF()) { _ =>
      withNewExecutionId {
        queryExecution
          .executedPlan
          .executeToIterator()
          .map(boundEnc.fromRow)
          .asJava
      }
    }
}

abstract class SparkPlan extends QueryPlan[SparkPlan] ... {

  def executeToIterator(): Iterator[InternalRow] = {
    getByteArrayRdd().toLocalIterator.flatMap(decodeUnsafeRows)
  }

  def getByteArrayRdd(n: Int = -1): RDD[Array[Byte]] = {
    execute().mapPartitionsInternal { iter =>
      val bos = new ByteArrayOutputStream()
      while (iter.hasNext && (n < 0 || count < n)) { ... }
      Iterator(bos.toByteArray)
    }
  }

  def decodeUnsafeRows(bytes: Array[Byte]): Iterator[InternalRow] = {
    val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
    val bis = new ByteArrayInputStream(bytes)
    val ins = new DataInputStream(codec.compressedInputStream(bis))
    new Iterator[InternalRow] { ... }
  }
}
DMP Dataset Accumulators
override def add(v: Any): Unit = v match {
  case arr: Array[Byte] =>
    val rows = decodeUnsafeRows(arr, encoder.schema.length)
    _value = rows.foldLeft(_value) { (acc, row) => add(acc, encoder.fromRow(row)) }
  case row: InternalRow =>
    _value = add(_value, encoder.fromRow(row))
  case _ =>
    throw new IllegalArgumentException(
      s"Value of unexpected data type received: ${v.getClass.getName}, " +
       "expecting: Array[Byte] or InternalRow")
}
DMP Dataset Accumulators
import spark.implicits._

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong
DMP Dataset Accumulators
case class Item(value: Long)

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data).as[Item]
ds.agg(sum("value")).first()

acc.value shouldBe data.sum
DMP Dataset Accumulators
override def add(v: Any): Unit = v match {
  case arr: Array[Byte] =>
    val rows = decodeUnsafeRows(arr, encoder.schema.length)
    _value = rows.foldLeft(_value) { (acc, row) => add(acc, encoder.fromRow(row)) }
  case row: InternalRow =>
    _value = add(_value, encoder.fromRow(row))
  case _ =>
    throw new IllegalArgumentException(
      s"Value of unexpected data type received: ${v.getClass.getName}, " +
       "expecting: Array[Byte] or InternalRow")
}
DMP Double Decoding
DMP Schema Inference
import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

spark.read.csv("/tmp/output").as[Item].count()
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver):
java.lang.IllegalArgumentException: Value of unexpected data type received: java.lang.String, expecting: Array[Byte] or InternalRow
  at DatasetActionAccumulator.add(DatasetActionAccumulator.scala:141)
  at ActionAccumulatorCallable.add(ActionCallable.scala:64)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.next(ActionAccumulableSparkContext.scala:181)
  at Iterator$$anon$10.next(Iterator.scala:393)
  at Iterator$class.foreach(Iterator.scala:893)
  at AbstractIterator.foreach(Iterator.scala:1336)
  at Growable$class.$plus$plus$eq(Growable.scala:59)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at TraversableOnce$class.to(TraversableOnce.scala:310)
  at AbstractIterator.to(Iterator.scala:1336)
DMP Schema Inference
DMP Schema Inference
import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val schema = StructType(Seq(StructField("value", LongType)))
spark.read.schema(schema).csv("/tmp/output").as[Item].count()
Dataset Accumulators :: Aggregation differences DMP Confusing Aggregate Functions' Behavior
RDDs

val acc = ActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
sc
  .makeRDD(data)
  .map(Item)
  .count()

acc.value

Datasets

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
spark
  .createDataset(data)
  .as[Item]
  .count()

acc.value
Dataset Accumulators :: Aggregation differences DMP Confusing Aggregate Functions' Behavior
acc.value: 55 for the RDD version, 10 for the Dataset version
RDDs
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Datasets
def count(): Long = withCallback("count", groupBy().count()) { df =>
  df.collect(needCallback = false).head.getLong(0)
}
Dataset Accumulators :: Aggregation differences DMP Confusing Aggregate Functions' Behavior
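The difference is visible in the physical plan (a sketch, assuming spark.range as the data source): Dataset.count() is compiled to groupBy().count(), a global aggregate, so the job's final stage iterates over a single aggregated row rather than over the original items.

// Dataset.count() is planned as a global aggregate; the action's final stage
// therefore sees one aggregated row (presumably decoded as Item(value = 10)),
// which is why the Dataset-side accumulator above ends up with 10, not 55.
spark.range(10).groupBy().count().explain()
// == Physical Plan == (roughly)
// *HashAggregate(keys=[], functions=[count(1)])
// +- Exchange SinglePartition
//    +- *HashAggregate(keys=[], functions=[partial_count(1)])
//       +- *Range (0, 10, step=1, splits=...)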
DMP Accumulators

[Diagram: an action creates a Job; the Job consists of a ShuffleMapStage (one ShuffleMapTask per partition) followed by a ResultStage (one ResultTask per partition).]
val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3) // <== new Stage is here

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators
Dangerous
[Same Job diagram: here the accumulator is updated inside the ShuffleMapStage, whose tasks may be re-executed.]

DMP Accumulators
val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3) // <== inserts a new Stage here

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators
10 was equal to 10 Expected :10 Actual :10
val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...) // <== new Stage is here
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches
  .saveAsNewAPIHadoopDataset(...)
DMP Spark Accumulators
DMP Accumulators

[Diagram: accumulator updates performed in ShuffleMapStage tasks are dangerous (they may be re-applied when the stage is retried); updates performed in ResultStage tasks, i.e. inside the action, are safe.]
DMP More Info
SPARK-732 Accumulator should only be updated once for each task in result stage
SPARK-3628 Don't apply accumulator updates multiple times for tasks in result stages
SPARK-12469 Data Property Accumulators for Spark (formerly Consistent Accumulators)
SPARK-22681 Accumulator should only be updated once for each task in result stage
...
https://github.com/cleverdata/dmpkit-spark-standard
DMP Coming Soon ...
s.zhemzhitsky@cleverdata.ru · info@cleverdata.ru
http://cleverdata.ru · https://1dmp.io · https://1dmc.io · info@cleverdata.ru · +7 (495) 782-38-60
http://cleverleaf.co.uk · https://1dmp.io · https://1dmc.io · info@cleverleaf.co.uk · +44 (782) 785-14-28