Advanced ML in Google Cloud (2), Abhay Agarwal (MS Design ’19)




SLIDE 1

Advanced ML in Google Cloud (2)

Abhay Agarwal (MS Design ’19)

CS341: Project in Mining Massive Datasets

SLIDE 2

Agenda

  • ‘Productizing’ analytics
  • Data wrangling
  • Data fundamentals
  • Data Studio vs. Datalab vs. Colab
SLIDE 3

‘Productizing’

  • What does it mean to ‘productize’ your ML?
SLIDE 4

Pitfalls in Productizing

  • My algorithm has 95% accuracy -- is it ready for production?
  • My algorithm has 95% accuracy and 95% precision -- is it ready for production?
  • My algorithm has 95% accuracy, 95% precision, and my training data is roughly sampled from real examples -- is it ready for production?
  • My algorithm has 95% accuracy, 95% precision, training data sampled from real examples, and my algorithm tests hypotheses that match the use cases -- is it ready for production?
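
One reason the answer to these questions can still be "no" is that headline metrics can hide a failure mode. The sketch below is only an illustration: the confusion-matrix counts are invented, and recall is a metric the slide does not name. It shows a classifier at roughly 95% accuracy and 95% precision that still misses 43 of 100 positive cases.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts: 1000 examples, 100 of them truly positive.
metrics = classification_metrics(tp=57, fp=3, fn=43, tn=897)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Whether the missed positives matter depends on the use case, which is the point of the last question on the slide.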

SLIDE 5

Data wrangling

SLIDE 6

DATA COLLECTION FUNDAMENTALS

SLIDE 7

Key Concepts

  • Freshness
  • Quality
  • Structure
  • Cost
  • Quantity

SLIDE 8

Quantity

  • Breadth
      • Number of entities or observations
      • E.g., people, companies, stars, shopping trips, …
      • Ideally: comprehensive
  • Depth
      • Data gathered on each entity or observation
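
As a rough illustration of the two dimensions, breadth can be read as the number of rows in a table and depth as the number of attributes recorded per row. The records and field names below are hypothetical, not taken from any real dataset.

```python
# Hypothetical records: each dict is one entity, each key one attribute.
records = [
    {"country": "A", "population": 10_000_000, "gdp_per_capita": 4_200},
    {"country": "B", "population": 55_000_000, "gdp_per_capita": 12_800},
    {"country": "C", "population": 3_000_000, "life_expectancy": 79.1},
]


def breadth(rows):
    """Number of entities or observations in the dataset."""
    return len(rows)


def depth(rows):
    """Number of distinct attributes gathered across all entities
    (the identifying "country" field is counted as an attribute)."""
    return len({key for row in rows for key in row})


print("breadth:", breadth(records))
print("depth:", depth(records))
```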

SLIDE 9

Breadth and Depth

[Figure: breadth vs. depth of a dataset, illustrated with the World Bank Development Indicators]

SLIDE 10

Structure

  • Structured
  • Unstructured
  • Semi-structured
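
One way to picture the three categories is a fixed-column table, a self-describing nested record, and free text. The sample data and field names below are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured = io.StringIO("customer_id,city\n1,Palo Alto\n2,Oakland\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing; fields may vary or nest per record.
semi_structured = json.loads(
    '{"customer_id": 1, "orders": [{"sku": "A1", "qty": 2}]}'
)

# Unstructured: free text with no schema; any fields must be extracted.
unstructured = "Customer 1 called about a late order and asked for a refund."

print(rows[0]["city"])
print(semi_structured["orders"][0]["qty"])
print(len(unstructured.split()), "words, no fields")
```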

SLIDE 11

Graph Data

  • Graphs arise naturally in many settings
  • Many interesting techniques, e.g., PageRank, community detection

Moz.com
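
As a concrete example of one technique named on this slide, here is a minimal power-iteration sketch of PageRank. The four-edge graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration; this is not a production implementation.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    `links` maps every node to the list of nodes it links to;
    each node must appear as a key, even if it has no outgoing links.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical toy web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

In this toy graph, C collects links from both A and B, so it ends up with the highest rank.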

SLIDE 12

Data Quality

  • Errors
      • E.g., human labeling mistakes
  • Missing data
      • E.g., missing addresses in customer records
  • Bias
      • Sample bias, measurement bias, prejudice/stereotype
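
For the missing-data case, a quick first check is to measure how often a field is empty. The sketch below uses hypothetical customer records and treats an absent key, `None`, or a blank string as missing; real data may need other rules.

```python
# Hypothetical customer records; None or "" marks a missing address.
customers = [
    {"id": 1, "name": "Ada", "address": "12 Main St"},
    {"id": 2, "name": "Grace", "address": None},
    {"id": 3, "name": "Alan", "address": ""},
    {"id": 4, "name": "Edsger", "address": "7 Canal Rd"},
]


def missing_rate(records, field):
    """Fraction of records where `field` is absent, None, or blank."""
    missing = sum(
        1 for record in records
        if not str(record.get(field) or "").strip()
    )
    return missing / len(records)


print("address missing rate:", missing_rate(customers, "address"))
```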

SLIDE 13

Data Quality: Sample Bias

  • Day driving vs. night driving
  • Tank recognition

SLIDE 14

Data Quality: Prejudice/Stereotype Bias

Algorithmic Law Enforcement

The Economist, August 20, 2016

But what about perpetuating bias against minorities?

SLIDE 15

Data Quality: Measurement Bias

SLIDE 16

Data Freshness

The rate of data collection must match the rate of change of the underlying phenomenon
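
A toy illustration of why the rates must match: the hypothetical signal below flips between 0 and 1 at every step. Collected at every step, all changes are visible; collected at every second step, the data looks constant.

```python
def changes_observed(values, every):
    """Count value changes visible when reading only every `every`-th value."""
    sampled = values[::every]
    return sum(1 for prev, cur in zip(sampled, sampled[1:]) if prev != cur)


# Hypothetical signal that flips at every time step (24 readings).
signal = [0, 1] * 12

print("collect every step:  ", changes_observed(signal, every=1))
print("collect every 2 steps:", changes_observed(signal, every=2))
```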

SLIDE 17

Data manipulation in Google Cloud

  • Data Studio
  • Datalab
  • Colab
  • (offline!)
SLIDE 18

Data Studio

  • Data Studio - glorified spreadsheets with a few integrations to Google Cloud to pull data
  • Use cases: Excel-like functions, simple visualizations (e.g., geographic)
SLIDE 19

Datalab

  • Datalab - hosted Jupyter instance with preset libraries
  • Use cases: Python scripting, visualization, ML pipelining, some long-running scripting, versioned scripts and models

SLIDE 20

Colab

  • Colab - shared, no-setup version of Datalab that is designed around sharing
  • Use cases: creating publicly accessible work, collaboration, but no long-running scripting