 
DSC 102 Systems for Scalable Analytics
Arun Kumar
Topic 4: ML Data Preparation and Model Selection
Chapter 8, 8.1, 8.2, 8.3, 8.4 of MLSys Book
DSC 102 will get you thinking about the fundamentals of scalable analytics systems
1. “Systems”: What resources does a computer have? How to store and compute efficiently over large data? What is cloud computing?
2. “Scalability”: How to scale and parallelize data-intensive computations?
3. Scalable Systems for “Analytics”:
   3.1. Source: Data acquisition & preparation for ML
   3.2. Build: Dataflow & Deep Learning systems
   3.3. Deploying ML models
4. Hands-on experience with tools for scalable analytics
The Lifecycle of ML-based Analytics
[Figure: lifecycle diagram with stages Data acquisition, Data preparation, Feature Engineering, Training & Inference, Model Selection, Model Serving, Monitoring]
Data Science in the Real World
Q: How do real-world data scientists spend their time?
[Figure: CrowdFlower Data Science Report 2016]
https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
Data Science in the Real World
Q: How do real-world data scientists spend their time?
[Figure: Kaggle State of ML and Data Science Survey 2018]
Data Science in the Real World
Q: How do real-world data scientists spend their time?
[Figure: IDC-Alteryx State of Data Science and Analytics Report 2019]
Sourcing Stage of Data Science
❖ Data science does not exist in a vacuum. It must interplay with the data-generating process and prediction application
❖ Sourcing: The stage of the data science lifecycle where you go from raw datasets to “analytics/ML-ready” datasets
❖ What makes sourcing challenging/time-consuming?
   ❖ Data access/availability constraints
   ❖ Heterogeneity of data sources/formats/types
   ❖ Messy, incomplete, ambiguous, and/or erroneous data
   ❖ Poor data governance in organization
   ❖ Bespoke/diverse kinds of prediction applications
   ❖ Evolution of data-generating process/application
   ❖ Large scale of data
Sourcing Stage of Data Science
❖ Sourcing: The stage of the data science lifecycle where you go from raw datasets to “analytics/ML-ready” datasets
❖ At a high level, roughly 5 kinds of activities:
   1. Acquiring
   2. Organizing
   3. Cleaning
   4. Labeling (Sometimes)
   5. Feature Engineering (aka Feature Extraction)
[Figure: Raw data sources/repos → Analytics/ML-ready data]
Acquiring Data
[Figure: sourcing pipeline (1. Acquiring, 2. Organizing, 3. Cleaning, 4. Labeling (Sometimes), 5. Feature Engineering) from raw data sources/repos to analytics/ML-ready data]
Acquiring Data: Data Sources
❖ Modern data-driven applications tend to have multitudes of data storage repositories and sources
❖ Data scientist must know how to access and get datasets!
   ❖ Structured data: Exported from RDBMSs (e.g., Redshift), often with SQL
   ❖ Semistructured data: Exported from “NoSQL” stores (e.g., MongoDB)
   ❖ Log files, text files, documents, multimedia files, etc.: Typically stored on HDFS, S3, etc.
   ❖ Graph/network data: Managed by systems such as Neo4j
[Figure: Raw data sources/repos]
Ad: Take DSC 104 to learn semistructured and graph databases
Acquiring Data: Examples
Example: Recommendation System (e.g., Netflix)
   Prediction App: Identify top movies to display for user
   Data Sources: User data and past click logs; Movie data; Movie images
Example: Social media analytics for social science
   Prediction App: Predicts which tweets will go viral
   Data Sources: Entity Dictionaries; Graph data; Tweets as JSON; Structured metadata
Acquiring Data
❖ Modern data-driven applications tend to have multitudes of data storage repositories and sources
❖ Data scientist must know how to access and get datasets!
Potential challenges and mitigation:
❖ Access control: Learn organization’s data security and authentication policies
❖ Data heterogeneity: Do you really need all data sources/types?
❖ Data volume: Do you really need all data?
❖ Scale: Avoid sequential file copying
❖ Manual errors: Use automated “data pipeline” services such as Airflow (later)
[Figure: Raw data sources/repos]
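To illustrate the “avoid sequential file copying” tip, here is a minimal sketch that copies several files concurrently with a thread pool. The function name `copy_files_parallel`, the file names, and the worker count are hypothetical choices for illustration; real deployments would more likely rely on the bulk-transfer tools of the storage system in use.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_files_parallel(sources, dest_dir, max_workers=8):
    """Copy each source file into dest_dir using a pool of worker threads."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    def copy_one(src):
        # shutil.copy2 returns the destination path of the copied file.
        return Path(shutil.copy2(src, dest_dir / Path(src).name))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input sources.
        return list(pool.map(copy_one, sources))


# Demo on small temporary files (hypothetical stand-ins for raw data shards).
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "raw"
    src_dir.mkdir()
    sources = []
    for i in range(5):
        path = src_dir / f"part-{i}.csv"
        path.write_text(f"id,value\n{i},{i * 10}\n")
        sources.append(path)

    copied = copy_files_parallel(sources, Path(tmp) / "staged")
    copied_names = sorted(p.name for p in copied)
    copied_contents_ok = all(
        c.read_text() == s.read_text() for c, s in zip(copied, sources)
    )
```

Threads suit this case because file copying is I/O-bound; the copies overlap while each thread waits on the disk or network.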
Organizing Data
[Figure: sourcing pipeline (1. Acquiring, 2. Organizing, 3. Cleaning, 4. Labeling (Sometimes), 5. Feature Engineering) from raw data sources/repos to analytics/ML-ready data]
(Re-)Organizing Data
❖ Given diverse data sources/file formats, data scientist must reorganize them into a usable format for analytics/ML
❖ Organization is specific to the analytics/ML task at hand
❖ Might need SQL, MapReduce (later), and file handling
❖ Examples of usable organization:
   Prediction App: Fraud detection in banking
      Joins to denormalize; Flatten JSON records → Large single-table CSV file, say, on HDFS
   Prediction App: Image captioning on social media
      Fuse JSON records; Extract image tensors → Large binary file with 1 image tensor and 1 string per line
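The fraud-detection example (flatten JSON records, join to denormalize, emit one single-table CSV) can be sketched with the Python standard library alone. All table contents, field names, and the helpers `flatten` and `denormalize` below are hypothetical; at scale the same steps would typically run in SQL or a dataflow system.

```python
import csv
import io
import json

# Hypothetical raw inputs: a normalized accounts table and nested JSON transactions.
accounts = [
    {"account_id": "A1", "owner": "Aisha", "state": "CA"},
    {"account_id": "A2", "owner": "Ravi", "state": "NY"},
]
transactions_json = """
[
  {"txn_id": 1, "account_id": "A1", "amount": 25.0,
   "merchant": {"name": "CoffeeCo", "category": "food"}},
  {"txn_id": 2, "account_id": "A2", "amount": 900.0,
   "merchant": {"name": "GadgetHub", "category": "electronics"}}
]
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def denormalize(transactions, accounts):
    """Left-join each flattened transaction with its account row."""
    by_id = {a["account_id"]: a for a in accounts}
    rows = []
    for txn in transactions:
        row = flatten(txn)
        account = by_id.get(row["account_id"], {})
        for key, value in account.items():
            if key != "account_id":
                row[f"account_{key}"] = value
        rows.append(row)
    return rows


rows = denormalize(json.loads(transactions_json), accounts)

# Write one wide table as CSV (in memory here; a real pipeline would target HDFS or S3).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```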
(Re-)Organizing Data: Tips
❖ Data re-organization these days often involves a lot of coding (Python, SQL, Java) and scripting (bash)
Some suggested best practices:
❖ Documentation: Maintain notes/READMEs with your code
❖ Automation: Use scripts (meta-programs) to automate orchestration of data re-org. code
❖ Provenance: Manage metadata on where your data records/variables come from and why they are there
❖ Versioning: You might do data re-org. many times; manage metadata on what version has what and when
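One lightweight way to act on the provenance and versioning tips is to record a small manifest entry each time a re-organized dataset is produced. The sketch below derives a version id from a content hash; the field names, the `make_manifest_entry` helper, and the source URI are assumptions for illustration, not a standard format.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_manifest_entry(data: bytes, source: str, note: str) -> dict:
    """Describe one dataset version: what it is, where it came from, and when."""
    return {
        # Identical content always yields the same version id.
        "version": hashlib.sha256(data).hexdigest()[:12],
        "num_bytes": len(data),
        "source": source,
        "note": note,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


# Two hypothetical outputs of a data re-organization script.
v1 = b"id,amount\n1,25.0\n"
v2 = b"id,amount\n1,25.0\n2,900.0\n"

manifest = [
    make_manifest_entry(v1, "hypothetical://bank/transactions", "initial export"),
    make_manifest_entry(v2, "hypothetical://bank/transactions", "added second batch"),
]
manifest_json = json.dumps(manifest, indent=2)
```

Storing `manifest_json` next to the data (or in a catalog) answers “what version has what and when” without opening the data files themselves.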
(Re-)Organizing Data: Schematization
❖ Increasingly, “ML platforms” in industry are imposing more discipline on what re-organized data must look like
❖ Lightweight and flexible schemas becoming common
❖ Makes it easier to automate data validation
https://www.tensorflow.org/tfx/guide
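To make “lightweight schema plus automated validation” concrete, here is a minimal plain-Python sketch. It is not the API of TFX or any other ML platform; the `SCHEMA` layout, the field names, and the `validate` helper are hypothetical.

```python
# Hypothetical lightweight schema: expected type, whether required, optional range.
SCHEMA = {
    "user_id": {"type": int, "required": True},
    "age": {"type": int, "required": False, "min": 0, "max": 130},
    "state": {"type": str, "required": True},
}


def validate(record, schema):
    """Return a list of human-readable schema violations for one record."""
    errors = []
    for field, rule in schema.items():
        value = record.get(field)
        if value is None:
            if rule.get("required", False):
                errors.append(f"{field}: missing required value")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(
                f"{field}: expected {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: {value} is below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: {value} is above maximum {rule['max']}")
    return errors


good = {"user_id": 1, "age": 27, "state": "CA"}
bad = {"user_id": "1", "age": 200}

good_errors = validate(good, SCHEMA)
bad_errors = validate(bad, SCHEMA)
```

Because the schema is plain data, the same check can run automatically on every new batch before it reaches training.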
(Re-)Organizing Data
❖ Custom ML platforms proliferating in industry, each with its own approach to organizing and cataloging ML data!
https://eng.uber.com/michelangelo/
Data Cleaning
[Figure: sourcing pipeline (1. Acquiring, 2. Organizing, 3. Cleaning, 4. Labeling (Sometimes), 5. Feature Engineering) from raw data sources/repos to analytics/ML-ready data]
Data Cleaning
❖ Real-world datasets often have errors, ambiguity, incompleteness, inconsistency, and other quality issues
❖ Data cleaning: The process of fixing data quality issues to ensure errors do not cascade/corrupt analytics/ML results
❖ Diverse sources/causes of data quality issues:
   ❖ Human-generated data: Mistakes, misunderstandings
   ❖ Hardware-generated data: Noise, failures
   ❖ Software-generated data: Bugs, errors, semantic issues
   ❖ Attribute encoding/formatting conventions (e.g., dates)
   ❖ Attribute unit/semantics conventions (e.g., km vs mi)
   ❖ Data integration: Duplicate entities, value differences
   ❖ Evolution of data schemas in application
Data Cleaning Task: Missing Values
❖ Long-standing problem studied in statistics and DB/AI
❖ Various assumptions on “missingness” property in terms of correlations of missing vs observed values in dataset:
   ❖ Missing Completely at Random (MCAR): No (causal) relationships for missing vs non-missing values
   ❖ Missing at Random (MAR): Systematic relationships between missing values and observed values
   ❖ Missing Not at Random (MNAR): Missingness itself depends on the value missing
❖ Add 0/1 missingness variable and impute missing values:
   ❖ Statistical approaches: distributional properties
   ❖ ML/DL-based approaches: self-supervised
❖ Some ML packages offer these at scale (e.g., Dask-ML)
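The “add a 0/1 missingness variable and impute” recipe can be sketched as follows, using mean imputation as one simple statistical approach. The helper name `impute_with_indicator` and the sample ages are hypothetical, and `None` stands in for a missing value.

```python
from statistics import mean


def impute_with_indicator(values):
    """Return (imputed values, 0/1 missingness indicator) using mean imputation."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    # The indicator preserves the fact that a value was missing,
    # which a downstream model can use as a feature of its own.
    indicator = [1 if v is None else 0 for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, indicator


ages = [27, None, 35, 40, None]
ages_imputed, ages_missing = impute_with_indicator(ages)
```

Keeping the indicator column matters most when data is not MCAR, since the imputed value alone would hide any signal carried by the missingness itself.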
Data Cleaning Task: Entity Matching
❖ A common cleaning task for multi-source datasets
❖ Duplications of real-world entities can arise when using data drawn from multiple sources
❖ Often need to match and deduplicate entities in unified data; otherwise, query/ML answers can be wrong!
❖ Aka entity deduplication/record linkage/entity linkage

Customers1
FullName        Age  City       State
Aisha Williams  27   San Diego  CA

Customers2
LastName  FirstName  MI  Age  Zipcode
Williams  Aisha      R   27   92122

Q: Are these the same person (“entity”)?
General Workflow of Entity Matching
❖ 3 main stages: Blocking -> Pairwise check -> Clustering
❖ Pairwise check:
   ❖ Given 2 records, how likely is it that they are the same entity? SOTA: “Entity embeddings” + deep learning
❖ Blocking:
   ❖ Pairwise check cost for a whole table is too high: O(n^2)
   ❖ Create “blocks”/subsets of records; pairwise only within
   ❖ Domain-specific heuristics for “obvious” non-matches using similarity/distance metrics (e.g., edit dist. on Name)
❖ Clustering:
   ❖ Given pairwise scores, consolidate records into entities
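The three stages can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the state-of-the-art approach: blocking uses exact zipcode equality, the pairwise check uses `difflib.SequenceMatcher` string similarity in place of learned entity embeddings, and clustering uses union-find over pairs above a hypothetical threshold of 0.85. All records are made up.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical unified customer records.
records = [
    {"id": 0, "name": "Aisha Williams", "zipcode": "92122"},
    {"id": 1, "name": "Aisha R Williams", "zipcode": "92122"},
    {"id": 2, "name": "Aisha Wiliams", "zipcode": "92122"},
    {"id": 3, "name": "Bob Chen", "zipcode": "92122"},
    {"id": 4, "name": "Aisha Williams", "zipcode": "10001"},
]


def block_by(records, key):
    """Blocking: group records so pairwise checks happen only within a group."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def similarity(a, b):
    """Pairwise check: string similarity of names, in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def match_entities(records, key="zipcode", threshold=0.85):
    """Return (clusters of record ids, number of pairwise comparisons made)."""
    parent = {r["id"]: r["id"] for r in records}
    comparisons = 0
    for block in block_by(records, key).values():
        for a, b in combinations(block, 2):
            comparisons += 1
            if similarity(a, b) >= threshold:
                # Clustering: merge the two records' groups.
                parent[find(parent, a["id"])] = find(parent, b["id"])
    clusters = defaultdict(set)
    for r in records:
        clusters[find(parent, r["id"])].add(r["id"])
    return sorted(sorted(c) for c in clusters.values()), comparisons


clusters, comparisons = match_entities(records)
all_pairs = len(records) * (len(records) - 1) // 2
```

Blocking cuts the comparisons here from 10 (all pairs) to 6, but note the trade-off: record 4 is never compared with record 0 because it sits in a different block, so a too-strict blocking key can miss true matches.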
Data Cleaning
Q: How can we even hope to automate data cleaning?
❖ Many approaches studied by DB and AI communities:
   ❖ Integrity constraints: E.g., if ZipCode is same across customer records, State must be same too
   ❖ Business logic/rules: domain knowledge programs
   ❖ Supervised ML: E.g., predict missing values
❖ Unfortunately, data quality issues are often so peculiar and specific to dataset/application that human intervention (by data scientist) is often the only reliable way in practice! ☺
❖ Crowdsourcing/expertsourcing is another alternative
Data cleaning in practice is “death by a thousand cuts”! :)
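The integrity-constraint example (same ZipCode implies same State) is a functional dependency, and checking it can be sketched in a few lines. The helper `fd_violations` and the customer records are hypothetical; the check only flags suspicious groups and still leaves the decision of which value is correct to a person or to further rules.

```python
from collections import defaultdict


def fd_violations(records, lhs, rhs):
    """Return {lhs value: sorted rhs values} where one lhs value maps to several rhs values."""
    seen = defaultdict(set)
    for record in records:
        seen[record[lhs]].add(record[rhs])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# Hypothetical customer records; the third row conflicts with the first two.
customers = [
    {"name": "Aisha Williams", "zipcode": "92122", "state": "CA"},
    {"name": "Bob Chen", "zipcode": "92122", "state": "CA"},
    {"name": "Carla Diaz", "zipcode": "92122", "state": "AZ"},
    {"name": "Dev Patel", "zipcode": "10001", "state": "NY"},
]

violations = fd_violations(customers, "zipcode", "state")
```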