Re-Engineering Software Engineering in a Data-Centric World

Miryung Kim
University of California, Los Angeles

Confluence
Interdisciplinary thinking via confluences, George Varghese @ SIGCOMM 2014 Keynote

Confluence: Interdisciplinary Thinking
Inflection Point

Confluence: Impressionism
Inflection Point

Confluence: Data Analytics and SE
Inflection Point
ML, Big Data, AI

Takeaway Message: A Case for Software Engineering for Data Analytics (SE4DA)
• Bug finding is a huge problem in data analytics.
• SE4DA is underserved; somehow people have gravitated to applying data analytics to SE.
• SE4DA requires re-thinking software engineering techniques.

There is a huge opportunity for data analytics.

Data analytics are in high demand, yet …

Bugs are huge problems in data analytics.
• Data analytics used by thousands of scientists produce misleading or wrong results [BBC News]
• The widespread harm ranges from a wrong medical diagnosis to an incorrect interpretation of stock history [Dataversity]
• Predictably inaccurate: The prevalence and perils of bad big data. [Deloitte]

Growth of Data Analytics Papers in SE
[Figure: Data Analytics (AI, Big Data, ML) Growth in ASE Papers, 2016 to 2019; series: Data Analytics, Rest]

SE4DA is under-investigated. (SE4DA: 13, DA4SE: 105)
• SE4DA (4%): Improving SE for data analytics
• DA4SE (37%): Applying data analytics to SE
• Rest (59%)

Outline: Making a Case for Software Engineering for Data Analytics (SE4DA)
① Studies, Data Scientists: Shift to data-centric SW development
② Differences between traditional SW vs. data-centric SW dev process
③ Tools: Debugging & testing for big data analytics
④ Open problems in SE4DA

Part 1. Data Scientists in Software Teams: State of the Art and Challenges
Miryung Kim, Thomas Zimmermann, Rob DeLine, Andrew Begel

① Data Scientists ② Challenges ③ Difference ④ Tools

The Emerging Roles of Data Scientists on Software Teams
We are at a tipping point where there is large-scale telemetry, machine, quality, and user data. Data scientist is an emerging role in SW teams. To understand working styles and challenges, we conducted the first in-depth interview study and the largest-scale survey of professional data scientists.

Methodology for Studying “Data Scientists”
In-Depth Interviews [ICSE’16]:
• 5 women and 11 men from eight different Microsoft organizations
Survey [TSE 2018]: 793 responses
• demographics/self-perception
• skills and tool usage
• working styles
• time spent
• challenges and best practices
Backgrounds: Bio Informatics, Finance, Physics, Economics, Math, Computer Science, Cog Sci, Statistics, ML

Time Spent on Activities
Hours spent on certain activities (self reported, survey, N=532)

What is a “Data Scientist”?
Clustering based on relative time spent in activities
532 data scientists at Microsoft
9 Distinct Categories

Category 1: Data Shaper
• Analyzing and preparing data
• Post-graduate degrees
• Algorithms, machine learning, and optimizations
• Less familiar with front-end programming

Category 2: Platform Builder
• Instrument code to collect data
• Big data and distributed systems
• Back-end and front-end programming
• SQL, C, C++ and C#

Category 3: Data Analyzer
• Familiar with statistics
• Not familiar with front-end programming
• Difficulty with data transformation
• R Studio or statistical analysis

Common challenges: Data scientists find it difficult to ensure “correctness”
• Validation is a major challenge. “Honestly, we don’t have a good method for this.”
• “Just because the math is right, doesn’t mean that the answer is right.”
• Explainability is important: “to gain insights, you must go one level deeper.”

Part 2. How is Traditional Development Different from Big Data Analytics Development?
[Interactions’12] [ICSE-SEIP’19] [NIPS’15] [TSE’19] [ICSE’16] [TSE’18]

Traditional vs. Big Data Analytics Development
Traditional development:
1 Develop
2 Run
3 Test
4 Debug
5 Repeat
Big data analytics development:
1 Develop locally
2 Test locally with sample data
3 Execute the job on the cloud, hoping that it will work
4 Several hours later, the job crashes or produces wrong output
5 Repeat

Traditional vs. Big Data Analytics Development
1. Data is huge, remote, and distributed.
Steps shown: 1 Develop locally; 2 Test with Sample

Traditional vs. Big Data Analytics Development
2. Writing tests is hard. We don’t even know the full input and don’t know the expected output.
3. Failures are hard to define.
Steps shown: 2 Test with Sample; 4 The job crashes or produces wrong output

Traditional vs. Big Data Analytics Development
4. System stack is complex with little visibility.
[Figure: Filter, Map, Reduce]
Step shown: 3 Execute the job on the cloud

Traditional vs. Big Data Analytics Development
5. Gap between logical vs. physical execution
[Figure: dataflow over Zipcode and Trips inputs: Map, Map, Filter, Join (⨝), Map, ReduceByKey]
Step shown: 3 Execute the job on the cloud
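The gap between the logical program and its physical execution can be illustrated with a rough sketch. This is not from the talk. It assumes a simplified, linearized version of the operator chain on the slide and cuts it into stages wherever an operator needs a shuffle. The names `WIDE_OPERATORS` and `split_into_stages` are hypothetical, and a real Spark scheduler handles more cases (multiple inputs, map-side combining).

```python
from typing import List

# Assumption: these operators move data between workers (a shuffle),
# so a new physical stage starts at each of them.
WIDE_OPERATORS = {"join", "reduceByKey"}


def split_into_stages(operators: List[str]) -> List[List[str]]:
    """Cut a linear chain of logical operators into physical stages."""
    stages: List[List[str]] = []
    current: List[str] = []
    for op in operators:
        # A shuffle operator closes the stage built so far.
        if op in WIDE_OPERATORS and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


# The developer writes one chain of five operators...
logical_plan = ["map", "filter", "join", "map", "reduceByKey"]
# ...but the cluster runs it as several stages.
physical_stages = split_into_stages(logical_plan)
```

An error reported against "stage 0" therefore refers to a group of fused operators, not to a single line of the logical program.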

Traditional vs. Big Data Analytics Development
6. Data tracing is hard.
Task 31 failed 3 times; aborting job
ERROR Executor: Exception in task 31 in stage 0 (TID 31)
java.lang.NumberFormatException
Steps shown: 3 Execute the job on the cloud; 4 The job crashes or produces wrong output; 5 Repeat
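The error message above names a task and a stage, but not the record that caused the failure. A minimal sketch of one way to recover that record, in plain Python standing in for the cluster job: wrap the per-record function and keep the inputs that raise. The record format `zipcode,distance` and the names `parse_trip` and `run_with_tracing` are hypothetical, and this is not how any particular tool does it.

```python
from typing import Callable, Iterable, List, Tuple


def parse_trip(line: str) -> Tuple[str, int]:
    """Parse a hypothetical 'zipcode,distance' record."""
    zipcode, distance = line.split(",")
    return zipcode, int(distance)  # raises ValueError on malformed numbers


def run_with_tracing(
    records: Iterable[str],
    fn: Callable[[str], Tuple[str, int]],
) -> Tuple[List[Tuple[str, int]], List[Tuple[int, str, str]]]:
    """Apply fn to every record, collecting failures instead of aborting."""
    results: List[Tuple[str, int]] = []
    failures: List[Tuple[int, str, str]] = []
    for index, record in enumerate(records):
        try:
            results.append(fn(record))
        except ValueError as exc:
            # Keep the position and the raw record so the culprit is visible.
            failures.append((index, record, str(exc)))
    return results, failures


sample = ["90024,12", "90095,7", "90210,n/a", "10001,3"]
parsed, culprits = run_with_tracing(sample, parse_trip)
```

Here `culprits` holds the one malformed record and its position, which is the information the stack trace alone does not give.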

Part 3. Debugging and Testing for Big Data Analytics
Tyson Condie, Ari Ekmekji, Muhammad Ali Gulzar, Miryung Kim, Matteo Interlandi, Shaghayegh Mardani, Todd Millstein, Madanlal Musuvathi, Kshitij Shah, Sai Deep Tetali, Seunghyun Yoo

Insights from Debugging and Testing for Apache Spark
• Designing interactive debug primitives requires deep understanding of internal execution model, job scheduling, and materialization.
• Providing traceability requires modifying a runtime.
• Abstraction is a powerful force in simplifying program paths.

Enabling interactive debugging requires us to re-think a traditional debugger
• Pausing the entire computation on the cluster could reduce throughput.
• It is clearly infeasible for a user to inspect billions of records through a regular watchpoint.

BigDebug: Interactive Debug Primitives for Big Data Analytics [ICSE 2016]
① Simulated Breakpoint
② On Demand Watchpoint
③ Realtime Repair
④ Backward Tracing
[Figure: program DAG with Stage 1 and Stage 2 of Map, Filter, and Reduce operators, annotated with the four primitives; other labels: Stored Data Records, age < 0]
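The idea of a watchpoint that does not pause the computation can be sketched in a few lines. This is an illustration only, not BigDebug's API: the helper `watch` is hypothetical, the records are made up, and the `age < 0` label on the slide is read here as a guard predicate, which is an assumption. Records flow through unchanged while only those matching the guard are set aside for inspection.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def watch(
    records: Iterable[Record],
    guard: Callable[[Record], bool],
    captured: List[Record],
) -> Iterator[Record]:
    """Pass every record downstream; copy only guard matches aside."""
    for record in records:
        if guard(record):
            captured.append(record)
        yield record  # the pipeline keeps running, nothing is paused


people: List[Record] = [
    {"name": "a", "age": 31},
    {"name": "b", "age": -4},
    {"name": "c", "age": 58},
]
suspicious: List[Record] = []
ages = [p["age"] for p in watch(people, lambda p: p["age"] < 0, suspicious)]
```

The user inspects `suspicious`, a small subset, instead of the full stream of records.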

Titian: Data Provenance for Apache Spark [VLDB 2016]
[Figure: program DAG (Stage 1, Stage 2) of Map, Filter, and Reduce operators; a Lineage Table per stage across Worker 1, Worker 2, and Worker 3, combined by joins (⨝)]
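A toy version of lineage-based backward tracing may make the figure easier to read. This is a sketch in plain Python, not Titian's implementation, which instruments the Spark runtime. It assumes each stage records a table of (input id, output id) pairs, and that tracing an output back to its inputs joins those tables from the last stage to the first. The data and the names `lineage1`, `lineage2`, and `trace_back` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

lines = ["90024,12", "90095,7", "90024,5", "10001,3"]

# Stage 1 (map): parse each line and record (input id, stage-1 output id).
stage1: Dict[int, Tuple[str, int]] = {}
lineage1: List[Tuple[int, int]] = []
for input_id, line in enumerate(lines):
    zipcode, distance = line.split(",")
    stage1[input_id] = (zipcode, int(distance))
    lineage1.append((input_id, input_id))

# Stage 2 (reduce by key): sum distances per zipcode and record
# (stage-1 output id, output key).
totals: Dict[str, int] = defaultdict(int)
lineage2: List[Tuple[int, str]] = []
for out_id, (zipcode, distance) in stage1.items():
    totals[zipcode] += distance
    lineage2.append((out_id, zipcode))


def trace_back(output_key: str) -> List[str]:
    """Join the lineage tables backward: output key -> stage 1 -> input lines."""
    stage1_ids = {src for src, dst in lineage2 if dst == output_key}
    input_ids = sorted(src for src, dst in lineage1 if dst in stage1_ids)
    return [lines[i] for i in input_ids]
```

For example, `trace_back("90024")` returns the two input lines that contributed to that zipcode's total.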
