Programming and Debugging Large-Scale Data Processing Workflows
Christopher Olston and many others
Yahoo! Research
Context
• Elaborate processing of large data sets, e.g.:
  – web search pre-processing
  – cross-dataset linkage
  – web information extraction
[Diagram: ingestion → processing → storage & serving]
Overview
[Storage & processing stack:]
• workflow manager, e.g. Nova
• dataflow programming framework, e.g. Pig
• distributed sorting & hashing, e.g. Map-Reduce
• scalable file system, e.g. GFS
Debugging aides:
• Before: example data generator
• During: instrumentation framework (covered in detail: Inspector Gadget)
• After: provenance metadata manager
Pig: A High-Level Dataflow Language and Runtime for Hadoop

Web browsing sessions with "happy endings":

  Visits = load '/data/visits' as (user, url, time);
  Visits = foreach Visits generate user, Canonicalize(url), time;
  Pages = load '/data/pages' as (url, pagerank);
  VP = join Visits by url, Pages by url;
  UserVisits = group VP by user;
  Sessions = foreach UserVisits generate flatten(FindSessions(*));
  HappyEndings = filter Sessions by BestIsLast(*);
  store HappyEndings into '/data/happy_endings';
vs. map-reduce: less code!

"The [Hofmann PLSA E/M] algorithm was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig."
-- Prasenjit Mukherjee, Mahout project

1/20 the lines of code; 1/16 the development time
[Bar charts: lines of code, and development time in minutes, Hadoop vs. Pig]
Pig performs on par with raw Hadoop
vs. SQL: step-by-step style; lower-level control

"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data."
-- Jasmine Novak, Engineer, Yahoo!

"PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP .. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on top of Map-Reduce] doesn't)."
-- Ricky Ho, Adobe Software
Conceptually: A Graph of Data Transformations

Find users who tend to visit "good" pages.

  Load Visits(user, url, time)          Load Pages(url, pagerank)
  Transform to (user, Canonicalize(url), time)
  Join url = url
  Group by user
  Transform to (user, Average(pagerank) as avgPR)
  Filter avgPR > 0.5
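The graph above can be simulated on a handful of rows to see what each operator produces. A minimal Python sketch follows; the data values and the toy canonicalization rule are illustrative assumptions, not Pig's actual Canonicalize UDF:

```python
# In-memory simulation of the dataflow graph: Load, Transform, Join,
# Group, Average, Filter. Canonicalize is a toy URL-normalizing stand-in.
def canonicalize(url):
    url = url.replace("http://", "")
    if not url.startswith("www."):
        url = "www." + url
    return url.split("/")[0]

visits = [("Amy", "cnn.com", "8am"),
          ("Amy", "http://www.snails.com", "9am"),
          ("Fred", "www.snails.com/index.html", "11am")]
pages = {"www.cnn.com": 0.9, "www.snails.com": 0.4}

# Transform: canonicalize urls
visits = [(user, canonicalize(url), t) for (user, url, t) in visits]

# Join on url, then Group by user (collect each user's pageranks)
by_user = {}
for (user, url, t) in visits:
    if url in pages:
        by_user.setdefault(user, []).append(pages[url])

# Transform to average pagerank, then Filter avgPR > 0.5
good_users = {u: sum(prs) / len(prs) for u, prs in by_user.items()
              if sum(prs) / len(prs) > 0.5}
print(good_users)  # Amy qualifies (avg pagerank 0.65); Fred (0.4) does not
```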
Illustrated!

  Load Visits(user, url, time):
    (Amy, cnn.com, 8am)
    (Amy, http://www.snails.com, 9am)
    (Fred, www.snails.com/index.html, 11am)
  Load Pages(url, pagerank):
    (www.cnn.com, 0.9)
    (www.snails.com, 0.4)
  Transform to (user, Canonicalize(url), time):
    (Amy, www.cnn.com, 8am)
    (Amy, www.snails.com, 9am)
    (Fred, www.snails.com, 11am)
  Join url = url:
    (Amy, www.cnn.com, 8am, 0.9)
    (Amy, www.snails.com, 9am, 0.4)
    (Fred, www.snails.com, 11am, 0.4)
  Group by user:
    (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
    (Fred, { (Fred, www.snails.com, 11am, 0.4) })
  Transform to (user, Average(pagerank) as avgPR):
    (Amy, 0.65)
    (Fred, 0.4)
  Filter avgPR > 0.5:
    (Amy, 0.65)

"ILLUSTRATE lets me check the output of my lengthy batch jobs and their custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive."
-- Russell Jurney, LinkedIn
(Naïve Algorithm)

  Load Visits(user, url, time):
    (Amy, cnn.com, 8am)
    (Amy, http://www.snails.com, 9am)
    (Fred, www.snails.com/index.html, 11am)
  Load Pages(url, pagerank):
    (www.youtube.com, 0.9)
    (www.frogs.com, 0.4)
  Transform to (user, Canonicalize(url), time):
    (Amy, www.cnn.com, 8am)
    (Amy, www.snails.com, 9am)
    (Fred, www.snails.com, 11am)
  Join url = url: (empty -- the sampled Pages rows share no urls with Visits)
  Group by user: (empty)
  Transform to (user, Average(pagerank) as avgPR): (empty)
  Filter avgPR > 0.5: (empty)
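The failure mode above -- independently sampled inputs that share no join keys, leaving every downstream operator empty -- and its join-aware repair can be sketched in a few lines. The sampling strategy shown is a deliberate simplification, not the example-generator's actual algorithm:

```python
# Why naive input sampling breaks ILLUSTRATE-style example generation.
visits = [("Amy", "www.cnn.com"), ("Amy", "www.snails.com"),
          ("Fred", "www.snails.com")]
pages = [("www.youtube.com", 0.9), ("www.frogs.com", 0.4),
         ("www.cnn.com", 0.9), ("www.snails.com", 0.4)]

def join(vs, ps):
    ranks = dict(ps)
    return [(user, url, ranks[url]) for (user, url) in vs if url in ranks]

# Naive: sample each input independently. Here the Pages sample happens
# to share no urls with Visits, so the join -- and everything
# downstream -- is empty, showing the user nothing.
naive_pages = pages[:2]
naive = join(visits, naive_pages)

# Join-aware: pick Visits rows first, then pull only Pages rows that
# share their join keys, so every operator has output to display.
keys = {url for (_, url) in visits}
aware_pages = [p for p in pages if p[0] in keys]
aware = join(visits, aware_pages)
```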
Pig Project Status
• Productized at Yahoo (~12-person team)
  – 1000s of jobs/day
  – 70% of Hadoop jobs
• Open-source (the Apache Pig Project)
• Offered on Amazon Elastic Map-Reduce
• Used by LinkedIn, Twitter, Yahoo, ...
Next: NOVA

[Storage & processing stack:]
• workflow manager, e.g. Nova
• dataflow programming framework, e.g. Pig ✔
• distributed sorting & hashing, e.g. Map-Reduce
• scalable file system, e.g. GFS
Debugging aides:
• Before: example data generator ✔
• During: instrumentation framework
• After: provenance metadata manager
Why a Workflow Manager?
• Modularity: a workflow connects N dataflow modules
  – Written independently, and re-used in other workflows
  – Scheduled independently
• Optimization: optimize across modules
  – Share read costs among side-by-side modules
  – Pipeline data between end-to-end modules
• Continuous processing: push new data through
  – Selective re-running
  – Incremental algorithms ("view maintenance")
• Manageability: help humans keep tabs on execution
  – Alerts
  – Metadata (e.g. data provenance)
Example Workflow

[Diagram:]
  RSS feed → (NEW) news articles
  news articles (ALL) → template detection → news site templates (ALL)
  news articles (NEW) + templates (ALL) → template tagging → (NEW)
  → shingling → (NEW) shingle hashes
  shingle hashes (NEW) + shingle hashes seen (ALL) → de-duping
  → (NEW) unique articles
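The de-duping stage above is a natural example of incremental ("view maintenance") processing: each run consumes only the NEW shingle hashes and checks them against the ALL set seen so far. A minimal sketch, with a toy word-shingle hash in place of real shingling, and a simplified "any overlap means duplicate" rule:

```python
# Toy incremental de-duper: NEW articles in, NEW unique articles out,
# with the ALL set of shingle hashes seen carried across runs.
def shingle_hashes(text, k=3):
    """Hashes of all k-word shingles of the text (toy stand-in)."""
    words = text.split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def dedupe_batch(new_articles, seen_hashes):
    unique = []
    for article in new_articles:
        hashes = shingle_hashes(article)
        if not (hashes & seen_hashes):   # no shingle overlap -> unique
            unique.append(article)
        seen_hashes |= hashes            # update the ALL state in place
    return unique

seen = set()
run1 = dedupe_batch(["the quick brown fox jumps over the dog"], seen)
run2 = dedupe_batch(["the quick brown fox jumps over the dog again"], seen)
# run2's article shares shingles with run1's, so it is filtered out
```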
Data Passes Through Many Sub-Systems

[Diagram: datum X and datum Y flow through ingestion, a low-latency processor, Nova / Pig / Map-Reduce / GFS, and serving]
Metadata queries, e.g.: provenance of X?
Ibis Project

[Diagram: data processing sub-systems send metadata to an integrated metadata manager (Ibis); users pose metadata queries and receive answers]

• Benefits:
  – Provide uniform view to users
  – Factor out metadata management code
  – Decouple metadata lifetime from data/sub-system lifetime
• Challenges:
  – Overhead of shipping metadata
  – Disparate data/processing granularities
What's Hard About Multi-Granularity Provenance?
• Inference: Given relationships expressed at one granularity, answer queries about other granularities (the semantics are tricky here!)
• Efficiency: Implement inference without resorting to materializing everything in terms of the finest granularity (e.g. cells)
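To make the inference problem concrete, here is a toy sketch (the containment and derivation relations are illustrative assumptions, not Ibis's actual model): derivations are recorded at mixed granularities -- one sub-system logged at file level, another at row level -- and a row-level provenance query must conservatively widen to file-level facts where nothing finer was recorded.

```python
# Toy multi-granularity provenance store.
# contains: finer-grained item -> the file containing it
contains = {"rowA3": "fileA", "rowB7": "fileB", "rowC1": "fileC"}
# derived_from: output item -> input items, at whatever granularity
# the producing sub-system chose to record.
derived_from = {
    "fileB": ["fileA"],   # coarse: a Map-Reduce job logged file level
    "rowC1": ["rowB7"],   # fine: Pig logged row level
}

def provenance(item):
    """Transitively trace inputs, widening to the containing file's
    derivation when nothing is recorded at the item's own granularity."""
    if item in derived_from:
        sources = derived_from[item]
    elif item in contains and contains[item] in derived_from:
        # Conservative inference: any row in fileB may derive
        # from any of fileB's recorded inputs.
        sources = derived_from[contains[item]]
    else:
        return set()
    result = set(sources)
    for s in sources:
        result |= provenance(s)
    return result

print(provenance("rowC1"))  # rowB7, plus fileA via rowB7's containing file
```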
Next: INSPECTOR GADGET

[Storage & processing stack:]
• workflow manager, e.g. Nova ✔
• dataflow programming framework, e.g. Pig ✔
• distributed sorting & hashing, e.g. Map-Reduce
• scalable file system, e.g. GFS
Debugging aides:
• Before: example data generator ✔
• During: instrumentation framework
• After: provenance metadata manager ✔
Motivated by User Interviews
• Interviewed 10 Yahoo dataflow programmers (mostly Pig users; some users of other dataflow environments)
• Asked them how they (wish they could) debug
Summary of User Interviews

  # of requests | feature
  7             | crash culprit determination
  5             | row-level integrity alerts
  4             | table-level integrity alerts
  4             | data samples
  3             | data summaries
  3             | memory use monitoring
  3             | backward tracing (provenance)
  2             | forward tracing
  2             | golden data/logic testing
  2             | step-through debugging
  2             | latency alerts
  1             | latency profiling
  1             | overhead profiling
  1             | trial runs
Our Approach
• Goal: a programming framework for adding these behaviors, and others, to Pig
• Precept: avoid modifying Pig or tampering with data flowing through Pig
• Approach: perform Pig script rewriting
  – insert special UDFs that look like no-ops to Pig
Pig w/ Inspector Gadget

[Diagram: each Pig operator (load, filter, join, group, count, store) is wrapped by an IG agent; all agents communicate with a central IG coordinator]
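The agent/coordinator pattern above can be sketched in a few lines: each agent is a pass-through wrapper that forwards every record unchanged (a no-op from the dataflow's point of view) while reporting observations to a shared coordinator. The class and method names here are illustrative, not Inspector Gadget's actual API:

```python
# Sketch of IG-style instrumentation: pass-through agents around each
# dataflow step, reporting record counts to a central coordinator.
class Coordinator:
    def __init__(self):
        self.counts = {}

    def observe(self, tag, record):
        self.counts[tag] = self.counts.get(tag, 0) + 1

def agent(tag, records, coordinator):
    """Looks like a no-op to the dataflow: yields records unchanged,
    but reports each one to the coordinator as a side effect."""
    for r in records:
        coordinator.observe(tag, r)
        yield r

coord = Coordinator()
visits = [("Amy", "www.cnn.com"), ("Amy", "www.snails.com"),
          ("Fred", "www.snails.com")]

# load -> agent -> filter -> agent -> store
loaded = agent("after-load", visits, coord)
filtered = agent("after-filter",
                 (v for v in loaded if v[1] == "www.snails.com"), coord)
stored = list(filtered)
# The coordinator now knows 3 records entered and 2 survived the
# filter, without the pipeline's data having been altered.
print(coord.counts)
```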