Lie to Me: Demystifying Spark Accumulators (Sergey Zhemzhitsky)
Sergey Zhemzhitsky (s.zhemzhitsky@cleverdata.ru), CleverDATA
- More than 13 years in IT
- 7 years of friendship with data processing technologies
- CTO, engineer, an expert in data management platforms, highly loaded systems and systems integration
- At CleverDATA, in charge of the technical and technological vision and development of the 1DMx product line
- Speaks regularly at professional conferences on these topics
DMP About
A data management platform that lets you build your own solutions for the collection, storage and processing of your data. The core of the entire data monetization infrastructure for companies that are focused on extracting all the benefits from their data. We offer solutions that reduce the cost of customer acquisition and retention for large and medium-sized companies by implementing fully automated communications and online advertising management systems driven by modern AI technologies. A unified access point that connects to a multitude of data providers and data consumers, advertising platforms, marketing channels and ad networks.
Rated by Data Insight (Artificial Intelligence in Russia, 2017) in the category of Big Data Services. Listed in the Data Market section of the programmatic market analysis 2016-2017. Rated by AdIndex Technology 2017 and 2018 in the categories of DMP and Processed Data Suppliers.
DMP Product Line
REDUCTION OF YOUR CUSTOMER ACQUISITION AND RETENTION COSTS through fully automated communications and online advertising management based on:
ARTIFICIAL INTELLIGENCE YOUR OWN INTERNAL DATA about customers UNIQUE EXTERNAL DATA about your audience
WEBSITE ADS MOBILE APP CRM
EMAIL, SMS, PUSH
SOCIAL
CUSTOMER SERVICE
DMP 1DMP – Data Marketing Platform
CONNECTED ADVERTISING SYSTEMS: GOOGLE, MYTARGET, YANDEX, AUDITORIUS, HYBRID, ADSPEND, WEBORAMA, EXEBID, ADRIVER, GETINTENT, MEDIASNIPER, RTBID, VENGO, PMI, ADVARK, APPNEXUS.
1DMC DATA EXCHANGE – an independent marketplace platform that unites data providers and data consumers to exchange depersonalized knowledge about their audience. At the moment it is the largest independent data-exchange platform in the Eastern European segment of the Internet (85M unique users per day) and has no direct substitutes on the market.
DATA USAGE:
DSPs eCOMMERCE Publishers & Media B2B & B2C services DMPs MARKETING CHANNELS
- Repeated sale
- Anti-fraud
- Targeted advertising
- Scoring
20+ DATA PROVIDERS · 9000+ DATA SOURCES · 2500+ AUDIENCE ATTRIBUTES · 85M TOTAL AUDIENCE PER DAY · 10+ ADVERTISING SYSTEMS & DSPs
ACCESS TO UNIQUE EXTERNAL 3rd PARTY DATA & MARKETING COMMUNICATION CHANNELS
DMPkit – Data Management Platform
DMP 1DMC – Data Exchange
DMPkit – Data Management Platform Audience Onboarding
Uploading your first-party data in a self-serve environment and matching those users to any publisher's or advertiser platform's users
Data Collecting
Tracking and collecting detailed data on user behavior on websites, mobile applications and social networks. Uploading data from CRM and transactional systems
Data Processing
Forming a single user profile and a single Customer Journey across all channels and devices, for a detailed understanding of each user based on all the collected data
Audience Enrichment
Using data from external suppliers: social network data on online behavior, data from eCommerce sites and online services
Audience Research
Find users who are similar to your most valuable segments or users who are part of an entirely new and critical emerging audience
Campaign Optimization
Using AI for campaign and Customer Journey optimization, predictive data analytics and recommendation models
Audience Insights
Obtaining data on responses to communications: impressions, clicks, conversions, targeted actions and purchases
Audience Extension
Segmenting by events, profiles, and steps of the Customer Journey. Creating predictive models and look-alike modeling
THE CORE of DATA MANAGEMENT AND MONETIZATION INFRASTRUCTURE
DMP DMPkit – Data Management Platform
DMP Tools
DMP Is it fast?
DMP How to speed it up?
Brute force - scale your servers either vertically or horizontally by adding more servers, more RAM, more CPUs, more GPUs, more LAN bandwidth, etc.
Know your data - don't process the data you don't have to: use partitioning, bucketing, columnar data formats, co-partitioned datasets, etc. (a sketch follows below)
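For instance, a minimal "know your data" sketch (not from the slides; table and column names are made up, assuming an events DataFrame with event_date, cookie, campaign_id and timestamp columns): writing events partitioned by date, bucketed by cookie and stored in a columnar format lets later jobs skip irrelevant files and columns.

import org.apache.spark.sql.functions.col

// Hypothetical layout for the raw events
events.write
  .partitionBy("event_date")          // later date filters prune whole partitions
  .bucketBy(64, "cookie")             // pre-shuffled layout for joins/aggregations by cookie
  .sortBy("timestamp")
  .format("parquet")                  // columnar: only the referenced columns are read
  .saveAsTable("events_bucketed")

// A segmentation job then scans only the needed days and columns
spark.table("events_bucketed")
  .where(col("event_date") >= "2018-10-12" && col("campaign_id") === "Star Wars")
  .select("cookie", "event_id", "event_type")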
DMP Data Management Platform

[Architecture diagram: pixels, impressions, clicks, events and 3rd-party raw data arrive (via MOM) at the Raw Data Store & Processing layer, which feeds the Analytical Data Store and the Fast Data Store; user profiles, segments and aggregates are exchanged with ad platforms and 3rd-party data partners.]

DMP Data Flows
find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker” during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber" to make a discount of 40%
DMP Segmentation
find all users who have taken part in campaign[s] "Star Wars" [and] viewed banner[s] "Darth Vader" or "Luke Skywalker” during [last] 6 day[s] [and] clicked banner[s] "Darth Vader's lightsaber" [and] visited buying area of "Darth Vader's lightsaber" [and] not visited order confirmed area of "Darth Vader's lightsaber" [impression] [click] [tr. pixel] [tr. pixel]
id | cookie | event_id                 | event_type | campaign_id | timestamp               | …
1  | c1     | Darth Vader              | impression | Star Wars   | 2018-04-01 14:25:11.462 | …
2  | c1     | Darth Vader's lightsaber | click      | Star Wars   | 2018-04-01 06:31:12.157 | …
3  | c1     | Darth Vader's lightsaber | tr. pixel  | Star Wars   | 2018-04-01 18:57:19.628 | …

[cookies]
map - for every matching event, emit the indices of the matched predicates per (cookie, rule): (c1, 0), (c1, 1), (c1, 2), (c1, 3); the "not visited" predicate has no matching event (Ø)
reduce - combine the indices per (cookie, rule) into (c1, 0;1;2;3) and evaluate the rule: true(0) and true(1) and true(2) and true(3) and not false(4) ⇒ c1 matches the segment
DMP Segmentation
val predicateMatches = events.flatMap { event =>
  rules.value.foldLeft(Set[((String, String), Set[Int])]()) {
    case (acc, (ruleId, rule)) =>
      if (rule.applyGlobal(event)) acc + ((event.cookie, ruleId) -> rule.getMatched(event))
      else acc
  }
}

val ruleMatches = predicateMatches
  .reduceByKey(_ ++ _)
  .filter { case ((uid, ruleId), predicates) => rules.value(ruleId).evalMatched(predicates) }
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)

DMP Segmentation
val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches
  .saveAsNewAPIHadoopDataset(...)
DMP Segmentation
RDDs
val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches
  .saveAsNewAPIHadoopDataset(...)

val stats = ruleMatches
  .treeAggregate(...)
DMP Spark Actions
val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

val rdd = sc.textFile("/data/input", 5)
rdd.saveAsTextFile("/data/output")

val stats = FileSystem
  .getStatistics("file", classOf[RawLocalFileSystem])

stats.getBytesRead shouldBe data.length * 10L + data.length / 2 +- data.length / 2
BYTES READ: 10 486 160 (~10 MB)
DMP Spark Actions
val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)

stats.getBytesRead shouldBe data.length * 10L + data.length / 2 +- data.length / 2
BYTES READ: 20 841 256 (~20 MB)
DMP Spark Actions
By default, each transformed RDD may be recomputed each time you run an action on it.

⇒ 2x data read

However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations
val data = 1L to 1000000L
sc.makeRDD(data)
  .map("%09d".format(_))
  .saveAsTextFile("/data/input")

val rdd = sc.textFile("/data/input", 5).cache()
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)

stats.getBytesRead shouldBe data.length * 10L + data.length / 2 +- data.length / 2
BYTES READ: 10 486 160 (~10 MB)
DMP Spark Actions
sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val rdd = sc.textFile("/data/input", 5).cache()
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)
...
BYTES READ: 14 557 992 (~14 MB)
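The extra ~4 MB here comes from cached partitions that did not fit into the tiny test memory and were re-read from the source. One possible mitigation (a sketch, not shown in the slides) is a storage level that spills evicted blocks to local disk instead of dropping them:

import org.apache.spark.storage.StorageLevel

// Same pipeline as above, but evicted cached blocks go to local disk
// instead of being recomputed by re-reading /data/input.
val rdd = sc.textFile("/data/input", 5).persist(StorageLevel.MEMORY_AND_DISK)
rdd.saveAsTextFile("/data/output")

val numRecords = rdd.treeAggregate(0L)(
  (r: Long, t: String) => r + 1L,
  (r1: Long, r2: Long) => r1 + r2
)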
DMP Spark Actions
val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches
  .saveAsNewAPIHadoopDataset(...)
DMP Spark Accumulators
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3) // <== inserts a new Stage here

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length
30 was not equal to 10 Expected :10 Actual :30
DMP Spark Accumulators
val blockManager = SparkEnv.get.blockManager
val block = blockManager.diskBlockManager.getAllBlocks()
  .filter(_.isInstanceOf[ShuffleDataBlockId])
  .map(_.asInstanceOf[ShuffleDataBlockId])
  .head

throw new FetchFailedException(
  blockManager.blockManagerId,
  block.shuffleId, block.mapId, block.reduceId,
  "__spark_stage_failed__")
DMP Spark Accumulators
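The failStage(n) helper used in the test is not defined in the slides; a possible implementation (an assumption, built around the FetchFailedException trick above and TaskContext.stageAttemptNumber available since Spark 2.3) could look like this:

import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.shuffle.FetchFailedException
import org.apache.spark.storage.ShuffleDataBlockId

// Pass records through unchanged, but fail the first `attempts` attempts of the
// current stage with a FetchFailedException, forcing Spark to re-run the parent
// ShuffleMapStage (and to re-apply its accumulator updates).
def failStage[T](attempts: Int): T => T = { record =>
  if (TaskContext.get.stageAttemptNumber < attempts) {
    val blockManager = SparkEnv.get.blockManager
    val block = blockManager.diskBlockManager.getAllBlocks()
      .filter(_.isInstanceOf[ShuffleDataBlockId])
      .map(_.asInstanceOf[ShuffleDataBlockId])
      .head
    throw new FetchFailedException(
      blockManager.blockManagerId,
      block.shuffleId, block.mapId, block.reduceId,
      "__spark_stage_failed__")
  }
  record
}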
val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches
  .saveAsNewAPIHadoopDataset(...)
DMP Spark Accumulators
Action | Meaning
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() | Return the number of elements in the dataset.
... | ...
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
... | ...
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

DMP Spark Accumulators
val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...)
  .filter(...)
  .keys

ruleMatches.saveAsNewAPIHadoopDataset(...)
ruleMatches.foreach(acc.add)
DMP Spark Accumulators
val acc = sc.longAccumulator
...
val rdd = sc.textFile("/data/input", 5)
rdd.saveAsTextFile("/data/output")
rdd.foreach(_ => acc.add(1L))
...
stats.getBytesRead shouldBe data.length * 10L + data.length / 2 +- data.length / 2
BYTES READ: 20 841 256 (~20 MB)
DMP Spark Accumulators
DMP Custom RDD?

def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  val outputFormatInstance = hadoopConf.getOutputFormat
  val keyClass = hadoopConf.getOutputKeyClass
  val valueClass = hadoopConf.getOutputValueClass
  if (outputFormatInstance == null) {
    throw new SparkException("Output format class not set")
  }
  if (keyClass == null) {
    throw new SparkException("Output key class not set")
  }
  if (valueClass == null) {
    throw new SparkException("Output value class not set")
  }
  SparkHadoopUtil.get.addCredentials(hadoopConf)

  logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " +
    valueClass.getSimpleName + ")")

  if (SparkHadoopWriterUtils.isOutputSpecValidationEnabled(self.conf)) {
    // FileOutputFormat ignores the filesystem parameter
    val ignoredFs = FileSystem.get(hadoopConf)
    hadoopConf.getOutputFormat.checkOutputSpecs(ignoredFs, hadoopConf)
  }

  val writer = new SparkHadoopWriter(hadoopConf)
  writer.preSetup()

  val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
    // Hadoop wants a 32-bit task attempt ID, so if ours is bigger than Int.MaxValue, roll it
    // around by taking a mod. We expect that no task will be attempted 2 billion times.
    val taskAttemptId = (context.taskAttemptId % Int.MaxValue).toInt

    val (outputMetrics, callback) = SparkHadoopWriterUtils.initHadoopOutputMetrics(context)

    writer.setup(context.stageId, context.partitionId, taskAttemptId)
    writer.open()
    var recordsWritten = 0L

    Utils.tryWithSafeFinallyAndFailureCallbacks {
      while (iter.hasNext) {
        val record = iter.next()
        writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])

        // Update bytes written metric every few records
        SparkHadoopWriterUtils.maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
        recordsWritten += 1
      }
    }(finallyBlock = writer.close())
    writer.commit()
    outputMetrics.setBytesWritten(callback())
    outputMetrics.setRecordsWritten(recordsWritten)
  }

  self.context.runJob(self, writeToFile)
  writer.commitJob()
}
DMP Custom RDD or …?
… or SparkContext?

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
trait ActionAccumulable extends SparkContext {

  private val accumulators = new ConcurrentHashMap[Long, ActionCallable[_]]()

  abstract override def register(acc: AccumulatorV2[_, _]): Unit = {
    super.register(acc)
    acc match {
      case _: ActionAccumulator[_, _] => this.accumulators.put(acc.id, ActionCallable(acc))
      case _ =>
    }
  }
  ...
}
trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      func(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}
SparkContext!

val sc = new SparkContext(...) with ActionAccumulable
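The ActionAccumulator and ActionCallable types registered above are not defined in the slides; as a rough idea of the accumulator side (an assumption, not the talk's actual implementation), it can be a generic fold-style AccumulatorV2 parameterized by a seqOp/combOp pair, much like the arguments of treeAggregate:

import org.apache.spark.util.AccumulatorV2

// Hypothetical sketch of an ActionAccumulator built on the AccumulatorV2 API.
class ActionAccumulator[T, R](zero: R, seqOp: (R, T) => R, combOp: (R, R) => R)
  extends AccumulatorV2[T, R] {

  private var _value: R = zero

  override def isZero: Boolean = _value == zero
  override def copy(): ActionAccumulator[T, R] = {
    val acc = new ActionAccumulator[T, R](zero, seqOp, combOp)
    acc._value = _value
    acc
  }
  override def reset(): Unit = _value = zero
  override def add(v: T): Unit = _value = seqOp(_value, v)              // fold a record in
  override def merge(other: AccumulatorV2[T, R]): Unit =
    _value = combOp(_value, other.value)                                // combine partial results
  override def value: R = _value
}

object ActionAccumulator {
  def apply[T, R](zero: R, seqOp: (R, T) => R, combOp: (R, R) => R): ActionAccumulator[T, R] =
    new ActionAccumulator(zero, seqOp, combOp)
}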
sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
rdd.count()

acc.value shouldBe data.length
...
BYTES READ: 10 486 160 (~10 MB)
DMP Spark Accumulators
sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
// rdd.count()
rdd.saveAsTextFile("/data/output")

acc.value shouldBe data.length
...
Task not serializable

org.apache.spark.SparkException: Task not serializable
  at ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at SparkContext.clean(SparkContext.scala:2104)
Caused by: java.io.NotSerializableException: JobConf
Serialization stack:
  - object not serializable (class: JobConf, value: Configuration: ...
  - field (class: PairRDDFunctions$$anonfun$saveAsHadoopDataset$1,
           name: conf$4, type: class JobConf)
  - object (class PairRDDFunctions$$anonfun$saveAsHadoopDataset$1, <function0>)
  ...
  - field (class: ActionAccumulable$$anonfun$1,
           name: func$1, type: interface Function2)
  - object (class ActionAccumulable$$anonfun$1, <function2>)
DMP Spark Accumulators
Traverses the hierarchy of enclosing closures and nulls out any references that are not actually used by the starting closure but are still included in the compiled classes, turning "usually" non-serializable closures into serializable ones.

DMP ClosureCleaner intro.
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

DMP ClosureCleaner intro.
trait ActionAccumulable extends SparkContext {
  abstract override def runJob[T, U: ClassTag](rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, ...): Unit = {
    val cleanF = clean(func) // <== clean the closure before wrapping it
    val accFunc = (tc: TaskContext, iter: Iterator[T]) => {
      val accIter = new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val rec: T = iter.next()
          accumulators.values.foreach(_.add(rec))
          rec
        }
      }
      cleanF(tc, accIter)
    }
    super.runJob(rdd, accFunc, partitions, resultHandler)
  }
}

DMP ClosureCleaner intro.
DMP Spark Accumulators
sparkConf
  .set("spark.testing", "true")
  .set("spark.testing.memory", (75*1024*1024).toString)
...
val acc = longAccumulator
sc.register(acc)
...
val rdd = sc.textFile("/data/input", 5)//.cache()
...
// rdd.count()
rdd.saveAsTextFile("/data/output")

acc.value shouldBe data.length
...
BYTES READ: 10 486 160 (~10 MB)
Datasets
DMP Why Datasets?
DataFrame = Dataset[Row]
Dataset = RDD[Row]
Dataset = RDD[InternalRow]

DMP Spark Quiz
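A quick way to peek at the InternalRow layer (a sketch using the developer-facing queryExecution API, not something shown in the slides):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

val ds = spark.range(10)                              // Dataset[java.lang.Long] with a single "id" column
val internal: RDD[InternalRow] = ds.queryExecution.toRdd
internal.map(_.getLong(0)).collect()                  // operates on the unsafe-row representation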
import spark.implicits._
import org.apache.spark.sql.functions._

val acc = sc.longAccumulator
val data = 1L to 10L
val ds = spark.createDataset(data)

val accDs = ds.map { item =>
  acc.add(item)
  item
}

accDs.agg(sum("value")).first().getLong(0) shouldBe data.sum
acc.sum shouldBe data.sum

DMP Dataset Accumulators
import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong
DMP Dataset Accumulators
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver):
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.sql.catalyst.InternalRow
  at Spec$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$6$$anonfun$14.apply(Spec.scala:114)
  at SimpleAccumulator.add(SimpleAccumulator.scala:66)
  at ActionAccumulatorCallable.add(ActionCallable.scala:64)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.next(ActionAccumulableSparkContext.scala:181)
  at Iterator$class.foreach(Iterator.scala:893)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.foreach(ActionAccumulableSparkContext.scala:170)
  at Growable$class.$plus$plus$eq(Growable.scala:59)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
DMP Dataset Accumulators
import spark.implicits._

val acc = ActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

acc.value shouldBe data.size.toLong
DMP Dataset Accumulators
Dataset = RDD[InternalRow] ?
DMP Dataset Accumulators
Dataset = RDD[Array[Byte]] ???
DMP Dataset Accumulators
class Dataset[T] private[sql](...) extends Serializable {
  def toLocalIterator(): java.util.Iterator[T] =
    withCallback("toLocalIterator", toDF()) { _ =>
      withNewExecutionId {
        queryExecution
          .executedPlan
          .executeToIterator()
          .map(boundEnc.fromRow)
          .asJava
      }
    }
}

abstract class SparkPlan extends QueryPlan[SparkPlan] ... {

  def executeToIterator(): Iterator[InternalRow] = {
    getByteArrayRdd().toLocalIterator.flatMap(decodeUnsafeRows)
  }

  def getByteArrayRdd(n: Int = -1): RDD[Array[Byte]] = {
    execute().mapPartitionsInternal { iter =>
      val bos = new ByteArrayOutputStream()
      while (iter.hasNext && (n < 0 || count < n)) { ... }
      Iterator(bos.toByteArray)
    }
  }

  def decodeUnsafeRows(bytes: Array[Byte]): Iterator[InternalRow] = {
    val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
    val bis = new ByteArrayInputStream(bytes)
    val ins = new DataInputStream(codec.compressedInputStream(bis))
    new Iterator[InternalRow] { ... }
  }
}
DMP Dataset Accumulators
override def add(v: Any): Unit = v match {
  case arr: Array[Byte] =>
    val rows = decodeUnsafeRows(arr, encoder.schema.length)
    _value = rows.foldLeft(_value) { (acc, row) => add(acc, encoder.fromRow(row)) }
  case row: InternalRow =>
    _value = add(_value, encoder.fromRow(row))
  case _ =>
    throw new IllegalArgumentException(
      s"Value of unexpected data type received: ${v.getClass.getName}, " +
       "expecting: Array[Byte] or InternalRow")
}
DMP Dataset Accumulators
import spark.implicits._

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: InternalRow) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.count()

acc.value shouldBe data.size.toLong
DMP Dataset Accumulators
case class Item(value: Long)

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
val ds = spark.createDataset(data).as[Item]
ds.agg(sum("value")).first()

acc.value shouldBe data.sum
DMP Dataset Accumulators
override def add(v: Any): Unit = v match {
  case arr: Array[Byte] =>
    val rows = decodeUnsafeRows(arr, encoder.schema.length)
    _value = rows.foldLeft(_value) { (acc, row) => add(acc, encoder.fromRow(row)) }
  case row: InternalRow =>
    _value = add(_value, encoder.fromRow(row))
  case _ =>
    throw new IllegalArgumentException(
      s"Value of unexpected data type received: ${v.getClass.getName}, " +
       "expecting: Array[Byte] or InternalRow")
}
DMP Double Decoding
DMP Schema Inference
import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

spark.read.csv("/tmp/output").as[Item].count()
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver):
java.lang.IllegalArgumentException: Value of unexpected data type received: java.lang.String, expecting: Array[Byte] or InternalRow
  at DatasetActionAccumulator.add(DatasetActionAccumulator.scala:141)
  at ActionAccumulatorCallable.add(ActionCallable.scala:64)
  at ActionAccumulableSparkContext$$anonfun$1$$anon$1.next(ActionAccumulableSparkContext.scala:181)
  at Iterator$$anon$10.next(Iterator.scala:393)
  at Iterator$class.foreach(Iterator.scala:893)
  at AbstractIterator.foreach(Iterator.scala:1336)
  at Growable$class.$plus$plus$eq(Growable.scala:59)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at TraversableOnce$class.to(TraversableOnce.scala:310)
  at AbstractIterator.to(Iterator.scala:1336)
DMP Schema Inference
DMP Schema Inference
import spark.implicits._

case class Item(value: Long)

val data = 1L to 10L
val ds = spark.createDataset(data)
ds.write.csv("/tmp/output")

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + 1,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val schema = StructType(Seq(StructField("value", LongType)))
spark.read.schema(schema).csv("/tmp/output").as[Item].count()
Dataset Accumulators :: Aggregation differences DMP Confusing Aggregate Functions' Behavior
RDDs

val acc = ActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
sc
  .makeRDD(data)
  .map(Item)
  .count()

acc.value

Datasets

val acc = DatasetActionAccumulator(
  0L,
  (r: Long, t: Item) => r + t.value,
  (r1: Long, r2: Long) => r1 + r2
)
sc.register(acc)

val data = 1L to 10L
spark
  .createDataset(data)
  .as[Item]
  .count()

acc.value
Dataset Accumulators :: Aggregation differences DMP Confusing Aggregate Functions' Behavior
acc.value: 55 for the RDD version, 10 for the Dataset version
RDDs
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Datasets
def count(): Long = withCallback("count", groupBy().count()) { df =>
  df.collect(needCallback = false).head.getLong(0)
}
Dataset Accumulators :: Aggregation differences DMP Confusing Aggregate Functions' Behavior
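The difference is visible in the physical plan (a sketch, assuming spark.range as the data source): Dataset.count() is compiled to groupBy().count(), a global aggregate, so the job's final stage iterates over a single aggregated row rather than over the original items.

// Dataset.count() is planned as a global aggregate; the action's final stage
// therefore sees one aggregated row (presumably decoded as Item(value = 10)),
// which is why the Dataset-side accumulator above ends up with 10, not 55.
spark.range(10).groupBy().count().explain()
// == Physical Plan == (roughly)
// *HashAggregate(keys=[], functions=[count(1)])
// +- Exchange SinglePartition
//    +- *HashAggregate(keys=[], functions=[partial_count(1)])
//       +- *Range (0, 10, step=1, splits=...)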
DMP Accumulators

[Diagram: an action creates a Job; the Job consists of a ShuffleMapStage (one ShuffleMapTask per partition) followed by a ResultStage (one ResultTask per partition).]
val data = 1L to 10L
val acc = sc.longAccumulator

val rdd = sc.makeRDD(data)
  .map { num =>
    acc.add(1L)
    num
  }
  .repartition(3) // <== new Stage is here

rdd
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators
Dangerous
[Same Job diagram: here the accumulator is updated inside the ShuffleMapStage, whose tasks may be re-executed.]

DMP Accumulators
val data = 1L to 10L
val rdd = sc.makeRDD(data)
  .repartition(3) // <== inserts a new Stage here

val acc = sc.longAccumulator

rdd
  .map { num =>
    acc.add(1L)
    num
  }
  .map(failStage(2))
  .saveAsTextFile("/data/output")

acc.value shouldBe data.length

DMP Spark Accumulators
10 was equal to 10 Expected :10 Actual :10
val acc = new MyAccumulator()
sc.register(acc)

val ruleMatches = events
  .flatMap(...)
  .reduceByKey(...) // <== new Stage is here
  .filter(...)
  .keys
  .map { item =>
    acc.add(item)
    item
  }

ruleMatches
  .saveAsNewAPIHadoopDataset(...)
DMP Spark Accumulators
DMP Accumulators

[Diagram: accumulator updates performed in ShuffleMapStage tasks are dangerous (they may be re-applied when the stage is retried); updates performed in ResultStage tasks, i.e. inside the action, are safe.]
DMP More Info
SPARK-732 Accumulator should only be updated once for each task in result stage
SPARK-3628 Don't apply accumulator updates multiple times for tasks in result stages
SPARK-12469 Data Property Accumulators for Spark (formerly Consistent Accumulators)
SPARK-22681 Accumulator should only be updated once for each task in result stage
...
https://github.com/cleverdata/dmpkit-spark-standard
DMP Coming Soon ...
s.zhemzhitsky@cleverdata.ru · info@cleverdata.ru
http://cleverdata.ru · https://1dmp.io · https://1dmc.io · info@cleverdata.ru · +7 (495) 782-38-60
http://cleverleaf.co.uk · https://1dmp.io · https://1dmc.io · info@cleverleaf.co.uk · +44 (782) 785-14-28