CS 327E Class 7 November 5, 2018 Check your GCP Credits :)

Oct 30, 2023

CS 327E Class 7 November 5, 2018
Check your GCP Credits :)
iClicker Question: Are you running low on GCP credits?
A. Yes
B. No
No Quiz Today
Dataflow Concepts
● A system for processing arbitrary computations on large amounts of data
● Can process batch data and streaming data using the same code
● Uses Apache Beam, an open-source programming model
● Designed to be very scalable, millions of QPS
Apache Beam Concepts
● A model for describing data and data processing operations:
  ○ Pipeline: a data processing task from start to finish
  ○ PCollection: a collection of data elements
  ○ Transform: a data transformation operation
● SDKs for Java, Python and Go
● Executed in the cloud on Dataflow, Spark, Flink, etc.
● Executed locally with Direct Runner for dev/testing
Beam Pipeline
● Pipeline = A directed acyclic graph where the nodes are the Transforms and the edges are the PCollections
● General Structure of a Pipeline:
  ○ Reads one or more data sources as input PCollections
  ○ Applies one or more Transforms on the PCollections
  ○ Writes the resulting PCollections to one or more data sinks
● Executed as a single unit
● Run in batch or streaming mode
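The read, transform, write structure above can be sketched in plain Python. This is not Beam code: `read_source`, `run_pipeline`, `write_sink`, `split_words` and `to_upper` are hypothetical names used only to show how each step consumes one collection and produces a new one. A linear chain like this is the simplest case of a directed acyclic graph.

```python
from typing import Callable, Iterable, List

# Hypothetical helper names; this models the shape of a pipeline, not the Beam API.
Transform = Callable[[Iterable], Iterable]

def read_source(lines: List[str]) -> tuple:
    """Stand-in for reading a data source into an input collection."""
    return tuple(lines)

def run_pipeline(source: tuple, transforms: List[Transform]) -> tuple:
    """Apply each transform in order; every step yields a new collection."""
    collection = source
    for transform in transforms:
        collection = tuple(transform(collection))
    return collection

def write_sink(collection: tuple, sink: list) -> None:
    """Stand-in for writing the final collection to a data sink."""
    sink.extend(collection)

def split_words(lines):
    # One input line can produce many output words.
    return (word for line in lines for word in line.split())

def to_upper(words):
    # One output per input word.
    return (word.upper() for word in words)

sink: list = []
result = run_pipeline(read_source(["hello beam", "hello dataflow"]),
                      [split_words, to_upper])
write_sink(result, sink)
print(sink)
```

In a real Beam pipeline the runner, not a Python loop, decides how and where each step executes.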
PCollection
● PCollection = A collection of data elements
● Elements can be of any type (String, Int, Array, etc.)
● PCollections are distributed across machines
● PCollections are immutable
● Created from a data source or a Transform
● Written to a data sink or passed to another Transform
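A small plain-Python sketch may help make immutability concrete. It assumes a tuple as a stand-in for a PCollection (real PCollections are distributed and are not Python tuples), and `apply_transform` is a hypothetical helper, not a Beam function.

```python
# Plain-Python model (not the Beam API): a tuple stands in for an immutable PCollection.
words = ("apple", "fig", "banana")

def apply_transform(collection, fn):
    """Return a new collection; the input collection is left untouched."""
    return tuple(fn(element) for element in collection)

lengths = apply_transform(words, len)

print(words)    # the original collection is unchanged
print(lengths)  # a new collection produced by the transform
```

Because a transform never edits its input in place, the same input collection can safely feed several downstream transforms.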
Transform Types
● Element-wise:
  ○ maps 1 input to (1, 0, many) outputs
  ○ Examples: ParDo, Map, FlatMap
● Aggregation:
  ○ reduces many inputs to (1, fewer) outputs
  ○ Examples: GroupByKey, CoGroupByKey
● Composite: combines element-wise and aggregation
  ○ GroupByKey -> ParDo
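The difference between element-wise and aggregation transforms can be illustrated with ordinary Python collections. This is only a sketch of the semantics, not Beam code: `group_by_key` is a hypothetical helper that mimics the shape of GroupByKey, and the word list is made up for illustration.

```python
from collections import defaultdict

words = ["data", "flow", "beam", "pipeline", "a"]

# Element-wise, one output per input (the shape of Map).
pairs = [(len(w), w) for w in words]

# Element-wise, zero or many outputs per input (the shape of FlatMap):
# the one-letter word yields nothing, every other word yields its letters.
letters = [ch for w in words if len(w) > 1 for ch in w]

# Aggregation: gather all values that share a key (the shape of GroupByKey).
def group_by_key(key_value_pairs):
    groups = defaultdict(list)
    for key, value in key_value_pairs:
        groups[key].append(value)
    return dict(groups)

grouped = group_by_key(pairs)
print(grouped)
```

Here the values in each group keep their input order; a distributed runner generally makes no such promise.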
Transform Properties
● Serializable
● Parallelizable
● Idempotent
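The slide lists these properties without elaboration, so the following plain-Python sketch rests on an assumed reading: a transform function that depends only on its input element can be run on separate bundles of data in any grouping (parallelizable) and can be re-run on the same bundle with the same result (safe to retry). The function `word_length` and the bundle split are hypothetical, and serialization is not demonstrated here.

```python
words = ["data", "flow", "beam", "pipeline"]

def word_length(word):
    """A pure function: the output depends only on the input element."""
    return (word, len(word))

# Parallelizable: processing two bundles separately and merging the results
# gives the same answer as processing the whole collection at once.
bundles = [words[:2], words[2:]]
per_bundle = [[word_length(w) for w in bundle] for bundle in bundles]
merged = [pair for bundle in per_bundle for pair in bundle]
whole = [word_length(w) for w in words]

# Safe to retry: reprocessing a bundle produces identical output.
first_try = [word_length(w) for w in bundles[0]]
retry = [word_length(w) for w in bundles[0]]

print(merged == whole, first_try == retry)
```

A function that wrote to a shared counter or appended to a global list would break both checks, which is why side effects inside transforms are worth avoiding.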
ParDo
● ParDo = “Parallel Do”
● Maps 1 input to (1, 0, many) outputs
● Takes as input a PCollection
● Applies the user-defined function (a DoFn) to each element of the input PCollection
● Outputs results as a new PCollection
● Typical usage: filtering, formatting, extracting parts of data, performing computations on data elements
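The ParDo behavior described above can be modeled in a few lines of plain Python. This is a sketch of the semantics only and does not import Beam: `ExtractLongWords` and `par_do` are hypothetical names, with the class playing the role of the user-defined function whose `process` method yields zero, one, or many outputs per input element.

```python
class ExtractLongWords:
    """Hypothetical stand-in for a user-defined function applied by ParDo."""

    def __init__(self, min_length):
        self.min_length = min_length

    def process(self, element):
        # Yield nothing for lines with no long words,
        # or one (word, length) pair per long word.
        for word in element.split():
            if len(word) >= self.min_length:
                yield (word, len(word))

def par_do(collection, fn):
    """Apply fn.process to every element and flatten the results into a new collection."""
    return tuple(output for element in collection for output in fn.process(element))

lines = ("check your credits", "no quiz", "dataflow concepts")
result = par_do(lines, ExtractLongWords(min_length=5))
print(result)
```

The second line contributes no outputs (filtering), while the first and third each contribute two (extraction plus a computation on each element).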
ParDo Example Source File: https://github.com/cs327e-fall2018/snippets/blob/master/word_length.py
Aggregation Example Source File: https://github.com/cs327e-fall2018/snippets/blob/master/group_words_by_length.py
BigQuery Data Sink Example Source File: https://github.com/cs327e-fall2018/snippets/blob/master/word_length_bq_out.py
How to “Dataflow”
1. Start with some initial working code.
2. Test and debug each new line of code.
3. Write temporary and final PCollections to log files.
4. Test and debug end-to-end pipeline locally before running on Dataflow.
5. If you get stuck, make a Piazza post that has enough details for the instructors to reproduce the error and help you troubleshoot.
6. Start assignments early. The Beam Python documentation is sparse and learning Beam requires patience and experimentation.
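Step 3 can be pictured with a plain-Python sketch that snapshots each intermediate collection to its own text file. Nothing here is Beam API: `write_snapshot` is a hypothetical helper, and the file names and temporary directory are assumptions chosen for illustration. In a Beam pipeline, a text-file sink would typically play this role.

```python
import os
import tempfile

def write_snapshot(collection, path):
    """Write one element per line so an intermediate result can be inspected."""
    with open(path, "w") as f:
        for element in collection:
            f.write(f"{element}\n")
    return collection  # pass the collection through unchanged

log_dir = tempfile.mkdtemp()

# Snapshot the input, transform it, then snapshot the output.
words = write_snapshot(("data", "flow", "beam"), os.path.join(log_dir, "words.txt"))
lengths = tuple((w, len(w)) for w in words)
write_snapshot(lengths, os.path.join(log_dir, "lengths.txt"))

# Read a snapshot back to confirm what the step produced.
with open(os.path.join(log_dir, "lengths.txt")) as f:
    logged = f.read().splitlines()
print(logged)
```

Keeping one file per intermediate step makes it easier to see which step introduced a bad record when the end-to-end output looks wrong.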
Dataflow Set Up https://github.com/cs327e-fall2018/snippets/wiki/Dataflow-Setup-Guide
Milestone 7 http://www.cs.utexas.edu/~scohen/milestones/Milestone7.pdf
