CS 744: DATAFLOW Shivaram Venkataraman Fall 2020


  1. Welcome! CS 744: DATAFLOW - Shivaram Venkataraman, Fall 2020

  2. ADMINISTRIVIA - Assignment 2 grades are up! (Canvas) - Midterm grading in progress - Course project proposal comments this week (Thursday): peer feedback, instructor feedback - AEFIS feedback (next slide)

  3. AEFIS FEEDBACK - Improve writing on the slides (better organization), speak slower - Get a better internet connection? Better microphone? (Let me know how this sounds) - More office hour slots - Discussion groups: same group each time? Also add prof. input - More time for Midterm exam, more guidance on deliverables - More homework/hands-on experience vs. too many evaluation components?

  4. Applications: Machine Learning, SQL, Streaming (stream processing), Graph - Computational Engines: Spark, MapReduce - Scalable Storage Systems: GFS - Resource Management: Mesos, DRF - Datacenter Architecture

  5. DATAFLOW MODEL (?) - A DAG of operators, as in Spark or PyTorch

  6. MOTIVATION - Streaming Video Provider (e.g. ESPN.com; each video has some ads) - How much to bill each advertiser? - Need per-user, per-video viewing sessions - Handle out of order data (out of order, unbounded data) - Goals - Easy to program - Balance correctness (how accurate your results are), latency (how much delay till results are available) and cost

  7. APPROACH - API Design (for developers writing applications) → Dataflow Model - Separate user-facing model from execution (the framework can process data as it arrives and output as it processes) - Decompose queries into - What is being computed - Where in time is it computed - When is it materialized - How does it relate to earlier results - Execution choices: ① MapReduce (batch, e.g. process one day of viewing events) ② streaming (process events as and when they arrive)

  8. TERMINOLOGY - Unbounded/bounded data (unbounded: data is constantly arriving) - Streaming/Batch execution (see previous slide) - Timestamps - Event time: time at which the event occurs with respect to the user input, e.g. the time at which an ad was viewed in a video - Processing time: time at which an event is processed, e.g. the time at which the ad view is processed to update the dashboard

  9. WINDOWING - Windows are logical constructs defined across keys [Figure: windows over event time per key (e.g. a window starting at 10am); fixed windows: consecutive windows do not overlap; sliding windows: windows overlap with each other]
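
As a rough illustration of windowing by event time, the sketch below maps a timestamp to the fixed window, or the sliding windows, that contain it. `WindowAssign` and its methods are hypothetical names chosen for this example, not part of the Dataflow SDK, and the sliding case assumes the window size is a multiple of the period.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any SDK: maps an event timestamp
// to the half-open windows [start, end) that contain it.
class WindowAssign {
    // Fixed windows: each timestamp falls in exactly one window.
    static long[] fixedWindow(long ts, long size) {
        long start = ts - Math.floorMod(ts, size);
        return new long[] {start, start + size};
    }

    // Sliding windows: a timestamp falls in size/period overlapping windows
    // (assumes size is a multiple of period).
    static List<long[]> slidingWindows(long ts, long size, long period) {
        List<long[]> windows = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, period);
        // Walk backwards over every window start whose window still covers ts.
        for (long start = lastStart; start > ts - size; start -= period) {
            windows.add(new long[] {start, start + size});
        }
        return windows;
    }
}
```

With a size of 60 and a period of 30, a timestamp of 125 lands in one fixed window, [120, 180), and in two sliding windows, [120, 180) and [90, 150).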

  10. WATERMARK or SKEW - Watermark: the system has processed all events up to a given event time (e.g. 12:02:30) - Skew: gap between event time and processing time - The watermark is not easy to know, so you use heuristics (e.g. event time lags, but most devices catch up after 10 mins) [Figure: event time vs. processing time]
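
One possible heuristic, sketched here under an assumed bound on skew: treat the watermark as the newest event time seen so far minus a fixed allowance. `HeuristicWatermark` and `maxSkewMillis` are hypothetical names for this example; real systems may estimate the watermark quite differently.

```java
// Hypothetical heuristic watermark: assumes events arrive at most
// maxSkewMillis behind the newest event time seen so far.
class HeuristicWatermark {
    private final long maxSkewMillis;
    private long maxEventTime = Long.MIN_VALUE;

    HeuristicWatermark(long maxSkewMillis) {
        this.maxSkewMillis = maxSkewMillis;
    }

    // Records an event; returns true if it arrived behind the watermark (late).
    boolean observe(long eventTime) {
        boolean late = maxEventTime != Long.MIN_VALUE && eventTime < current();
        maxEventTime = Math.max(maxEventTime, eventTime);
        return late;
    }

    // Estimate: all events with event time below this value have been seen.
    long current() {
        return maxEventTime == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTime - maxSkewMillis;
    }
}
```

Because the bound is only a guess, events can still arrive behind the watermark; `observe` flags them as late rather than assuming completeness.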

  11. API - ParDo: Map in MapReduce or flatMap in Spark - GroupByKey: Reduce in MapReduce - Windowing - AssignWindow: buckets a tuple into windows based on strategy - MergeWindow: merge buckets (sessions)

  12. EXAMPLE - Assign tuples to sessions: each tuple gets a window starting at its timestamp - MergeWindow merges windows that overlap - GroupByKey
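
The assign-then-merge idea for sessions can be sketched for the timestamps of a single key as follows. `SessionWindows`, `sessions`, and the `gap` parameter are hypothetical names for this example; in the model the merge happens per key as part of grouping, which this sketch does not show.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of session windowing for one key's timestamps.
class SessionWindows {
    // Assign step: each timestamp gets a proto-session [ts, ts + gap).
    // Merge step: overlapping proto-sessions are merged into one session.
    static List<long[]> sessions(List<Long> timestamps, long gap) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted); // tolerate out-of-order arrival
        List<long[]> merged = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && ts < last[1]) {
                last[1] = Math.max(last[1], ts + gap); // overlap: extend session
            } else {
                merged.add(new long[] {ts, ts + gap}); // no overlap: new session
            }
        }
        return merged;
    }
}
```

For timestamps 10, 0, 25, 100 and a gap of 20, this yields two sessions, [0, 45) and [100, 120).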

  13. TRIGGERS AND INCREMENTAL PROCESSING - Windowing: where in event time data are grouped - Triggering: when in processing time groups are emitted - Strategies (for a pane of 5 followed by a pane of 6) - Discarding: 6 - Accumulating: 11 - Accumulating & Retracting: -5, 11
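
The three strategies differ only in what each trigger firing emits. The sketch below simulates them for a single window and key; `PaneModes` is a hypothetical class for this example, and the input values 5 and 6 in the note that follows are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the three refinement strategies.
// paneSums[i] is the sum of the new elements that arrived before firing i.
class PaneModes {
    // Discarding: each firing emits only the new elements' sum.
    static List<Integer> discarding(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        for (int s : paneSums) out.add(s);
        return out;
    }

    // Accumulating: each firing emits the running total so far.
    static List<Integer> accumulating(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int s : paneSums) { total += s; out.add(total); }
        return out;
    }

    // Accumulating & retracting: before each updated total, emit a
    // retraction (the negated previous total).
    static List<Integer> accumulatingAndRetracting(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        boolean first = true;
        for (int s : paneSums) {
            if (!first) out.add(-total);
            total += s;
            out.add(total);
            first = false;
        }
        return out;
    }
}
```

For panes summing to 5 and then 6, discarding emits 5, 6; accumulating emits 5, 11; accumulating and retracting emits 5, -5, 11. Retractions matter when a downstream consumer re-groups the results and would otherwise count the first pane twice.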

  14. RUNNING EXAMPLE PCollection<KV<String, Integer>> input = IO.read(...); PCollection<KV<String, Integer>> output = input.apply(Sum.integersPerKey()); - Single sum for each key

  15. GLOBAL WINDOWS, ACCUMULATE PCollection<KV<String, Integer>> output = input .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE))) .accumulating()) .apply(Sum.integersPerKey()); - Trigger fires every 1 min

  16. GLOBAL WINDOWS, COUNT, DISCARDING PCollection<KV<String, Integer>> output = input .apply(Window.trigger(Repeat(AtCount(2))) .discarding()) .apply(Sum.integersPerKey());

  17. FIXED WINDOWS, MICRO BATCH PCollection<KV<String, Integer>> output = input .apply(Window.into(FixedWindows.of(2, MINUTES)) .trigger(Repeat(AtWatermark())) .accumulating()) .apply(Sum.integersPerKey()); - Windows: 12:00 - 12:02, 12:02 - 12:04, 12:04 - 12:06

  18. SUMMARY/LESSONS - Design for unbounded data: don't rely on completeness - Be flexible, diverse use cases: billing, recommendation, anomaly detection - Windowing, Trigger API to simplify programming on unbounded data

  19. DISCUSSION https://forms.gle/jwHjTBbR49vyQASq6

  20. Discussion notes - Streaming with fixed windows: assume the watermark is given; a window fires every time the watermark passes it ⇒ fewer outputs ⇒ worse latency - Micro-batch: partial sum per batch - Streaming pipeline: events (event-ts) → ingest (Pub/Sub, Apache Kafka) → system (proc-t) → update query, persist to disk

  21. Consider you are implementing a micro-batch streaming API on top of Apache Spark. What are some of the bottlenecks/challenges you might have in building such a system?

  22. NEXT STEPS - Next class: Naiad - Course project proposal peer feedback
