DSC 102: Systems for Scalable Analytics
Arun Kumar
Topic 4: ML Data Preparation and Model Selection
Chapter 8 (Sections 8.1–8.4) of the MLSys Book


  1. DSC 102: Systems for Scalable Analytics. Arun Kumar. Topic 4: ML Data Preparation and Model Selection. Chapter 8 (Sections 8.1–8.4) of the MLSys Book.

  2. DSC 102 will get you thinking about the fundamentals of scalable analytics systems:
     1. "Systems": What resources does a computer have? How to store and compute efficiently over large data? What is cloud computing?
     2. "Scalability": How to scale and parallelize data-intensive computations?
     3. Scalable systems for "Analytics":
        3.1. Source: data acquisition & preparation for ML
        3.2. Build: dataflow & deep learning systems
        3.3. Deploy: deploying ML models
     4. Hands-on experience with tools for scalable analytics

  3. The Lifecycle of ML-based Analytics: Data acquisition → Data preparation → Feature Engineering → Model Selection → Training & Inference → Model Serving → Monitoring

  4. Data Science in the Real World. Q: How do real-world data scientists spend their time? [Chart: CrowdFlower Data Science Report 2016] https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

  5. Data Science in the Real World. [Chart: CrowdFlower Data Science Report 2016, continued]

  6. Data Science in the Real World. [Chart: Kaggle State of ML and Data Science Survey 2018]

  7. Data Science in the Real World. [Chart: IDC-Alteryx State of Data Science and Analytics Report 2019]

  8. Sourcing Stage of Data Science
     ❖ Data science does not exist in a vacuum: it must interplay with the data-generating process and the prediction application
     ❖ Sourcing: the stage of the data science lifecycle where you go from raw datasets to "analytics/ML-ready" datasets
     ❖ What makes sourcing challenging/time-consuming?
        ❖ Data access/availability constraints
        ❖ Heterogeneity of data sources/formats/types
        ❖ Messy, incomplete, ambiguous, and/or erroneous data
        ❖ Poor data governance in the organization
        ❖ Bespoke/diverse kinds of prediction applications
        ❖ Evolution of the data-generating process/application
        ❖ Large scale of data

  9. Sourcing Stage of Data Science
     ❖ Sourcing: the stage of the data science lifecycle where you go from raw datasets to "analytics/ML-ready" datasets
     ❖ At a high level, roughly 5 kinds of activities, forming a pipeline from raw data sources/repos to analytics/ML-ready data:
        1. Acquiring
        2. Organizing
        3. Cleaning
        4. Labeling (sometimes)
        5. Feature Engineering (aka Feature Extraction)

  10. Acquiring Data
      [Pipeline recap: Raw data sources/repos → 1. Acquiring → 2. Organizing → 3. Cleaning → 4. Labeling (sometimes) → 5. Feature Engineering (aka Feature Extraction) → Analytics/ML-ready data]

  11. Acquiring Data: Data Sources
      ❖ Modern data-driven applications tend to have multitudes of data storage repositories and sources
      ❖ Data scientists must know how to access and get datasets!
         ❖ Structured data: exported from RDBMSs (e.g., Redshift), often with SQL
         ❖ Semistructured data: exported from "NoSQL" stores (e.g., MongoDB)
         ❖ Log files, text files, documents, multimedia files, etc.: typically stored on HDFS, S3, etc.
         ❖ Graph/network data: managed by, e.g., Neo4j
      Ad: Take DSC 104 to learn about semistructured and graph databases
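A minimal sketch of pulling from two of these source types. An in-memory SQLite database stands in for a warehouse like Redshift, and a JSON string stands in for a document-store export; the `customers` table and all field names are invented for illustration:

```python
import json
import sqlite3

# Structured source: rows exported from an RDBMS with SQL.
# (In-memory SQLite stands in for a warehouse like Redshift; the
# `customers` table and its columns are invented for illustration.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Aisha Williams", "San Diego"), (2, "Bo Chen", "Austin")])
rows = conn.execute(
    "SELECT id, name FROM customers WHERE city = 'San Diego'").fetchall()

# Semistructured source: a JSON record, e.g., exported from a "NoSQL"
# document store such as MongoDB.
record = json.loads('{"user": {"id": 1, "tags": ["movies", "sports"]}}')
tags = record["user"]["tags"]
```

The same two access patterns (SQL queries for structured data, key-path navigation for semistructured records) carry over to the real systems named on the slide.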

  12. Acquiring Data: Examples
      ❖ Example: Recommendation system (e.g., Netflix)
         Prediction app: identify top movies to display for each user
         Data sources: user data, movie data, movie images, past click logs
      ❖ Example: Social media analytics for social science
         Prediction app: predict which tweets will go viral
         Data sources: entity graph data, tweets as JSON dictionaries, structured metadata

  13. Acquiring Data
      ❖ Modern data-driven applications tend to have multitudes of data storage repositories and sources
      ❖ Data scientists must know how to access and get datasets!
      ❖ Potential challenges and mitigations:
         ❖ Access control: learn the organization's data security and authentication policies
         ❖ Data heterogeneity: do you really need all data sources/types?
         ❖ Data volume: do you really need all the data?
         ❖ Scale: avoid sequential file copying
         ❖ Manual errors: use automated "data pipeline" services such as Airflow (covered later)
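To illustrate the "avoid sequential file copying" tip: a sketch that copies a batch of toy files with a thread pool so the I/O-bound transfers overlap instead of running one after another. The file names and temp directories are invented stand-ins for large raw data files:

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Toy setup: a few small files standing in for large raw data files.
src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
for i in range(4):
    (src / f"part-{i}.csv").write_text(f"row{i}\n")

# Copy concurrently instead of one file after another; for I/O-bound
# transfers, a thread pool lets the copies overlap.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda p: shutil.copy(p, dst / p.name), src.iterdir()))

copied = sorted(p.name for p in dst.iterdir())
```

Production pipelines would use a tool like Airflow (or distcp-style utilities) for this orchestration; the point here is only that the transfers need not be sequential.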

  14. Organizing Data
      [Pipeline recap: Raw data sources/repos → 1. Acquiring → 2. Organizing → 3. Cleaning → 4. Labeling (sometimes) → 5. Feature Engineering (aka Feature Extraction) → Analytics/ML-ready data]

  15. (Re-)Organizing Data
      ❖ Given diverse data sources/file formats, the data scientist must reorganize them into a usable format for analytics/ML
      ❖ Organization is specific to the analytics/ML task at hand
      ❖ Might need SQL, MapReduce (covered later), and file handling
      ❖ Examples of usable organization:
         ❖ Prediction app: fraud detection in banking. Joins to denormalize the tables into a large single-table CSV file, say, on HDFS
         ❖ Prediction app: image captioning on social media. Flatten/fuse JSON records and extract image tensors into a large binary file with 1 image tensor and 1 string per line
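A sketch of the "joins to denormalize" pattern from the fraud-detection example, using SQLite with invented banking tables: a JOIN fuses the normalized tables into one wide result, which is then written out as a single CSV:

```python
import csv
import io
import sqlite3

# Hypothetical normalized banking tables; a JOIN denormalizes them
# into one wide table, then we serialize that table as a single CSV.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (acct_id INTEGER, cust_id INTEGER, balance REAL)")
conn.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
conn.execute("INSERT INTO accounts VALUES (10, 1, 2500.0)")
conn.execute("INSERT INTO customers VALUES (1, 'Aisha Williams')")

rows = conn.execute("""
    SELECT a.acct_id, c.name, a.balance
    FROM accounts a JOIN customers c ON a.cust_id = c.cust_id
""").fetchall()

# Write the denormalized rows as CSV (an in-memory buffer here; in
# practice this would be a large file on HDFS or S3).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["acct_id", "name", "balance"])
writer.writerows(rows)
csv_text = buf.getvalue()
```

At scale the same JOIN would run inside the RDBMS or a MapReduce/dataflow system rather than in a driver script.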

  16. (Re-)Organizing Data: Tips
      ❖ Data re-organization these days often involves a lot of coding (Python, SQL, Java) and scripting (bash)
      ❖ Some suggested best practices:
         ❖ Documentation: maintain notes/READMEs with your code
         ❖ Automation: use scripts (meta-programs) to automate orchestration of data re-organization code
         ❖ Provenance: manage metadata on where your data records/variables come from and why they are there
         ❖ Versioning: you might do data re-organization many times; manage metadata on what version has what and when

  17. (Re-)Organizing Data: Schematization
      ❖ Increasingly, "ML platforms" in industry are imposing more discipline on what re-organized data must look like
      ❖ Lightweight and flexible schemas are becoming common
      ❖ Makes it easier to automate data validation
      https://www.tensorflow.org/tfx/guide
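A hand-rolled sketch of what such a lightweight schema check might look like; real platforms such as TFX infer and enforce far richer schemas (value domains, distributions, drift), and the fields below are hypothetical:

```python
# A minimal, hand-rolled take on lightweight schemas for data
# validation: each field maps to an expected Python type.
schema = {"user_id": int, "age": int, "city": str}

def validate(record, schema):
    """Return a list of (field, problem) pairs; empty means the record conforms."""
    problems = []
    for field, ftype in schema.items():
        if field not in record:
            problems.append((field, "missing"))
        elif not isinstance(record[field], ftype):
            problems.append((field, "wrong type"))
    return problems

ok = validate({"user_id": 1, "age": 27, "city": "San Diego"}, schema)
bad = validate({"user_id": "1", "city": "San Diego"}, schema)
```

Even this toy version shows the payoff: validation becomes a mechanical pass over records rather than ad hoc eyeballing.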

  18. (Re-)Organizing Data
      ❖ Custom ML platforms are proliferating in industry, each with its own approach to organizing and cataloging ML data!
      https://eng.uber.com/michelangelo/

  19. Data Cleaning
      [Pipeline recap: Raw data sources/repos → 1. Acquiring → 2. Organizing → 3. Cleaning → 4. Labeling (sometimes) → 5. Feature Engineering (aka Feature Extraction) → Analytics/ML-ready data]

  20. Data Cleaning
      ❖ Real-world datasets often have errors, ambiguity, incompleteness, inconsistency, and other quality issues
      ❖ Data cleaning: the process of fixing data quality issues so that errors do not cascade into and corrupt analytics/ML results
      ❖ Diverse sources/causes of data quality issues:
         ❖ Human-generated data: mistakes, misunderstandings
         ❖ Hardware-generated data: noise, failures
         ❖ Software-generated data: bugs, errors, semantic issues
         ❖ Attribute encoding/formatting conventions (e.g., dates)
         ❖ Attribute unit/semantics conventions (e.g., km vs. mi)
         ❖ Data integration: duplicate entities, value differences
         ❖ Evolution of data schemas in the application

  21. Data Cleaning Task: Missing Values
      ❖ A long-standing problem studied in statistics and the DB/AI communities
      ❖ Various assumptions on the "missingness" property, in terms of correlations between missing and observed values in the dataset:
         ❖ Missing Completely at Random (MCAR): no (causal) relationships between missing and non-missing values
         ❖ Missing at Random (MAR): systematic relationships between missing values and observed values
         ❖ Missing Not at Random (MNAR): missingness itself depends on the value that is missing
      ❖ Common recipe: add a 0/1 missingness variable and impute the missing values
         ❖ Statistical approaches: use distributional properties
         ❖ ML/DL-based approaches: self-supervised imputation
         ❖ Some ML packages offer these at scale (e.g., Dask-ML)
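The "add a 0/1 missingness variable and impute" recipe can be sketched on a toy numeric column, with simple mean imputation as the statistical baseline (the column and values are invented):

```python
from statistics import mean

# A toy numeric column; None marks a missing value. Under MCAR, mean
# imputation is a common baseline; the 0/1 indicator column lets a
# downstream model learn from the missingness pattern itself, which
# matters when the data is MAR or MNAR rather than MCAR.
ages = [27, None, 35, None, 40]

observed = [v for v in ages if v is not None]
fill = mean(observed)  # mean of 27, 35, 40 is 34

age_imputed = [v if v is not None else fill for v in ages]
age_missing = [1 if v is None else 0 for v in ages]
```

Libraries like Dask-ML apply the same idea (fit an imputer on the observed values, transform the full column) in parallel over large partitioned datasets.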

  22. Data Cleaning Task: Entity Matching
      ❖ A common cleaning task for multi-source datasets
      ❖ Duplicates of real-world entities can arise when using data drawn from multiple sources
      ❖ Often need to match and deduplicate entities in the unified data; otherwise, query/ML answers can be wrong!
      ❖ Aka entity deduplication / record linkage / entity linkage
      Customers1: FullName       | Age | City      | State
                  Aisha Williams | 27  | San Diego | CA
      Customers2: LastName | FirstName | MI | Age | Zipcode
                  Williams | Aisha     | R  | 27  | 92122
      Q: Are these the same person ("entity")?

  23. General Workflow of Entity Matching
      ❖ 3 main stages: Blocking -> Pairwise check -> Clustering
      ❖ Pairwise check:
         ❖ Given 2 records, how likely is it that they are the same entity? SOTA: "entity embeddings" + deep learning
      ❖ Blocking:
         ❖ Pairwise checking over a whole table is too costly: O(n^2) pairs
         ❖ Create "blocks" (subsets of records); do pairwise checks only within each block
         ❖ Domain-specific heuristics prune "obvious" non-matches using similarity/distance metrics (e.g., edit distance on Name)
      ❖ Clustering:
         ❖ Given the pairwise scores, consolidate records into entities
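A toy sketch of the blocking and pairwise stages, using zip code as a (hypothetical) blocking key and a plain string-similarity ratio in place of the learned entity-embedding matchers used in practice; the records and the 0.8 threshold are invented:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Toy records drawn from two sources (fields invented for illustration).
records = [
    {"id": 0, "name": "Aisha Williams", "zip": "92122"},
    {"id": 1, "name": "Aisha R. Williams", "zip": "92122"},
    {"id": 2, "name": "Bo Chen", "zip": "78701"},
]

# Blocking: only compare records sharing a cheap key (here, zip code),
# shrinking the O(n^2) all-pairs step to pairs within each block.
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)

# Pairwise check: a simple string-similarity stand-in for the learned
# "entity embedding" matchers used in state-of-the-art systems.
def match_score(a, b):
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        if match_score(a, b) > 0.8:
            matches.append((a["id"], b["id"]))
```

The clustering stage would then take these pairwise matches and consolidate the transitively connected records into one entity each.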

  24. Data Cleaning
      Q: How can we even hope to automate data cleaning?
      ❖ Many approaches studied by the DB and AI communities:
         ❖ Integrity constraints: e.g., if ZipCode is the same across customer records, State must be the same too
         ❖ Business logic/rules: programs encoding domain knowledge
         ❖ Supervised ML: e.g., predict missing values
      ❖ Unfortunately, data quality issues are often so peculiar and specific to the dataset/application that human intervention (by the data scientist) is often the only reliable way in practice!
      ❖ Crowdsourcing/expertsourcing is another alternative
      Data cleaning in practice is "death by a thousand cuts"! :)
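The ZipCode/State integrity constraint mentioned above can be checked mechanically, which is the sense in which such cleaning is automatable; the customer records below are invented:

```python
from collections import defaultdict

# Integrity-constraint check: if ZipCode is the same across customer
# records, State must be the same too. "TT" simulates a typo'd state.
customers = [
    {"name": "Aisha Williams", "zip": "92122", "state": "CA"},
    {"name": "A. Williams",    "zip": "92122", "state": "CA"},
    {"name": "Bo Chen",        "zip": "78701", "state": "TT"},
    {"name": "B. Chen",        "zip": "78701", "state": "TX"},
]

# Group the states observed for each zip code.
states_by_zip = defaultdict(set)
for c in customers:
    states_by_zip[c["zip"]].add(c["state"])

# Zip codes violating the constraint (more than one state observed).
violations = sorted(z for z, states in states_by_zip.items() if len(states) > 1)
```

The check only flags the inconsistency; deciding which value is correct is exactly where the human intervention the slide warns about usually comes in.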
