Finding that dress at scale @Rent The Runway
Saurabh Bhatnagar

Bio
- 17 years in ML/data
- Prev: Responsible for personalization and ML at RTR
- Prev: Founded Data Science at Barnes & Noble
- Prev: Consulted at HP, Unilever, …
- Now: Founder, Virevol AI
- @analyticsaurabh
- www.sanealytics.com

Rent The Runway
- Democratize luxury fashion
- eCommerce rental model
- Closet in the cloud
- 8m registered users
- Optional Unlimited membership programs
- 1,500 dresses dry cleaned every hour
- Biggest dry cleaner in the world
Team: 1 Sr Data Scientist + 2 Jr Data Scientists!

What you will learn
How to scale using
- Strategy
- How to bet on the right infra stack
- Software engineering and tests/checks for ML
- Maintain complex ML jungle
- Practical lessons you can take to work on Monday

Image search

Instagram + Humans => Dress review

Reverse Logistics

Scale?
- Typical scope creep: You need a team of 60 data scientists + huge infra team
- RTR story: One Sr Data Scientist + 2 Jr over 5 years!
- Artisanal hand-rolled ML

Fashion is unsolved
- RTR != Netflix
- High stakes
- It is visual
- Preferences change
- Underlying reason for buying is poorly understood
- Supply side challenges

Teams
N (N-1) / 2
Take home lesson: The complexity of communication grows quadratically with the number of teams involved: N teams have N (N-1) / 2 pairwise channels
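To make the pairwise count on this slide concrete, here is a minimal Python sketch. The function name `communication_channels` is made up for illustration and is not from the talk.

```python
def communication_channels(n_teams: int) -> int:
    """Number of distinct pairs among n_teams teams: N * (N - 1) / 2."""
    if n_teams < 0:
        raise ValueError("n_teams must be non-negative")
    return n_teams * (n_teams - 1) // 2


if __name__ == "__main__":
    # Each added team must talk to every team already involved.
    for n in (2, 3, 5, 10):
        print(f"{n} teams -> {communication_channels(n)} channels")
```

Going from 5 to 10 teams doubles the headcount of teams but more than quadruples the number of pairs that need to stay in sync.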

Scale: You need a strategy
- KISS: An exercise to exonerate complexity is an exercise in simplicity
- Study opportunity, pitch, demo, get team, agree on deliverables
- Tie metrics to $$$$, obsess over product, set expectations
- Metric will be wrong over time
- Simple linear baseline first, deploy, then iterate (80/20)
- Research vs “Can’t fail” expectations, overcommunicate
- Need more backend/frontend engineers, UX design, project, etc. to drive algorithmic success than ML Engineers
- Mentor

Good fences make good neighbours
- ML SaaS, i.e. SLAs
- Latency SLA, uptime SLAs: Engineering caches last served + default user
- Separation of concerns/clear action on failure:
  - For Engineering: Just restart, it will go back to the previous version of the model on start
  - For ETL team: We can create jobs to put data on the pipe, but ownership stays with that team
- Reliability via tests and checks (ML flavor)
- Graceful fallbacks... And fallbacks to fallbacks (cold start)
- Recompute models daily (depends), continuous deployment
- Software architect for parts to be switchable (even languages)
- Data analysis (R/shiny, SQL, Tableau, python) vs ML
- Reports tracking what ML can’t do, guardrails
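The "fallbacks to fallbacks" idea can be sketched as a chain of sources tried in order, ending at a default user. This is only an illustration under assumptions: `recommend_with_fallbacks`, `personalized_model`, `last_served_cache` and the sample data are hypothetical names, not RTR's actual services.

```python
from typing import Callable, Dict, List, Sequence

Recs = List[str]


def recommend_with_fallbacks(
    user_id: str,
    sources: Sequence[Callable[[str], Recs]],
    default: Recs,
) -> Recs:
    """Try each source in order; fall back to the default user's recs."""
    for source in sources:
        try:
            recs = source(user_id)
        except Exception:
            # A failing source must never take the page down.
            continue
        if recs:
            return recs
    return default


# Hypothetical stand-ins for the real model and the engineering cache.
MODEL_SCORES: Dict[str, Recs] = {"alice": ["dress_12", "dress_7"]}
LAST_SERVED: Dict[str, Recs] = {"bob": ["dress_3"]}
DEFAULT_USER_RECS: Recs = ["dress_1", "dress_2"]


def personalized_model(user_id: str) -> Recs:
    return MODEL_SCORES[user_id]  # raises KeyError for unseen users (cold start)


def last_served_cache(user_id: str) -> Recs:
    return LAST_SERVED.get(user_id, [])


SOURCES = [personalized_model, last_served_cache]

if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, recommend_with_fallbacks(user, SOURCES, DEFAULT_USER_RECS))
```

The caller always gets a non-empty answer, which is what makes a latency or uptime SLA enforceable independently of the model's health.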

Data infra through time (not accurate)
- Look ma, I can store files… yay!
- OK, too many files, directories, organization (databases)
- Hadoop: Look ma, I can store files on multiple computers… yay!
- Spark: Need to organize for ML on multiple machines… needs a lot of infra, JVM
- GPU: Look ma, I can process a LOT of data really fast on one box
- Future? Streaming GPU databases, ML framework, ...
Lesson: It is hard to pick tech that lasts 5 years! Be switchable by design

Scaling: When not to use GPUs
- Those network costs add up. Keep data transfer to a minimum
- Example, sparse SVD/CF: Y = R ( U I )
  - Symbols: r = fraction of non-zero entries, u = users per batch, i = items, k = latent factors, g = memory in GB
  - Memory needed, in 32-bit floats: r * u * i + u * k + i * k <= g * 8e9 / 32
  - Solving for users per batch: u <= (g * 8e9 / 32 - i * k) / (i * r + k)
  - With g=1, r=1%, k=100: if i=1e4, u ~1.2m; if i=1e6, u <= 14k
- AWS C5.18xlarge = 72 CPUs, 144 GB => 180m vs 3.5m users per batch
- 1080 Ti GPU = 3,584 cores, 11 GB => 14m or 262k users per batch
- Spark cluster, if we’re talking petabytes (are we, though? See num of items, hashing tricks)
- Other reasons: To keep network costs low, where does data already live? Algo not parallelizable?
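The batch-size bound on this slide is easy to check numerically. The sketch below assumes the memory model r * u * i + u * k + i * k floats within a budget of g GB at 32 bits per float; the function name `max_users_per_batch` and its defaults (r = 1%, k = 100) are illustrative choices taken from the slide's example, not an API from the talk.

```python
def max_users_per_batch(
    mem_gb: float,
    n_items: float,
    density: float = 0.01,
    k: int = 100,
    bits_per_float: int = 32,
) -> int:
    """Largest u with density*u*i + u*k + i*k <= mem_gb * 8e9 / bits_per_float."""
    budget_floats = mem_gb * 8e9 / bits_per_float
    # Item factors (i * k) are resident for every batch; the rest scales with u.
    users = (budget_floats - n_items * k) / (n_items * density + k)
    return max(0, int(users))


if __name__ == "__main__":
    machines = {"1 GB": 1, "1080 Ti (11 GB)": 11, "C5.18xlarge (144 GB)": 144}
    for name, gb in machines.items():
        for items in (1e4, 1e6):
            users = max_users_per_batch(gb, items)
            print(f"{name:>22}  items={items:.0e}  users/batch={users:,}")
```

The item count dominates: moving from 1e4 to 1e6 items shrinks the batch by roughly 50x on every machine, which is why the slide points at hashing tricks before reaching for a cluster.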

[Figure: 1m items x 1m items]

ML as a Service Software at scale

[Diagram: DeepDress ML Lib on a shared Data Bus]
- Train user fit recommendations
- Train user style recommendations
- Train user event recommendations
- Serve style recommendations (gRPC)
- Train review language model (spacy)
- Serve image search (flask)
- Dress allocation solver
- ...

[Diagram: Membership recos pipeline]
- Train Membership recos (GPU)
- XFL S3
- Engg S3
- Serve Membership recos (CPU, gRPC, python)
- Update recos server (CPU)
- JAVA cache server
- XFL Kafka
- Engg Kafka

Software at scale: Reuse and flexibility
- DeepDress library has shared data buses, embeddings, models. Library can load latest embeddings/models across the ecosystem (over S3)
- Versioning and default users/products are important for fallbacks
- Languages are irrelevant, the problem you’re solving is important
- However, for reliability, base in one language (python), glue for others. We have R and C++ bindings via feather/arrow
- gRPC + JAVA to serve (RTR backend stack is only in JAVA)
- Data: Abstracted Bus. Can be Disk, S3 or Kafka or something else in the future
Lesson: People who forget relational databases are condemned to reinvent them
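One way to picture the abstracted Bus is a small interface with interchangeable backends. The sketch below is an assumption-laden illustration, not the DeepDress API: `Bus`, `MemoryBus`, `DiskBus`, `save_embeddings` and `load_latest` are hypothetical names, and only in-memory and local-disk backends are shown (an S3 or Kafka backend would implement the same two methods).

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, Tuple


class Bus(ABC):
    """Minimal key/value transport; callers never see the backend."""

    @abstractmethod
    def put(self, key: str, obj: Any) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Any: ...


class MemoryBus(Bus):
    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def put(self, key: str, obj: Any) -> None:
        self._store[key] = obj

    def get(self, key: str) -> Any:
        return self._store[key]


class DiskBus(Bus):
    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def _path(self, key: str) -> Path:
        # Flatten the key so it maps to a single file under root.
        return self._root / (key.replace("/", "__") + ".pkl")

    def put(self, key: str, obj: Any) -> None:
        self._path(key).write_bytes(pickle.dumps(obj))

    def get(self, key: str) -> Any:
        return pickle.loads(self._path(key).read_bytes())


def save_embeddings(bus: Bus, name: str, version: int, embeddings: Any) -> None:
    """Store a versioned copy, then move the 'latest' pointer."""
    bus.put(f"{name}/v{version}", embeddings)
    bus.put(f"{name}/latest", version)


def load_latest(bus: Bus, name: str) -> Tuple[int, Any]:
    version = bus.get(f"{name}/latest")
    return version, bus.get(f"{name}/v{version}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for bus in (MemoryBus(), DiskBus(tmp)):
            save_embeddings(bus, "dress2vec", 1, {"dress_1": [0.1, 0.2]})
            save_embeddings(bus, "dress2vec", 2, {"dress_1": [0.3, 0.4]})
            print(type(bus).__name__, load_latest(bus, "dress2vec"))
```

Keeping older versions addressable is what lets a restart, or a failed check, roll back to the previous model without any special tooling.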

DeepDress AI library
- Bus: S3/Disk/Kafka/DoubleDecker
- DataLoader: Orders, Person, People, Reviews, Photos
- Model: CollaborativeFiltering, CarouselNet, DressNet, Dress2Vec, FitModel, RModel, FulfillmentSolver …
  - Load/Save models, embeddings, auto checkpoints to Bus
- Checks: HoldOutCheck, SelfDriftCheck, MetricDriftCheck, ...
- Tests
- Metrics
- Rcpp
- Utils
Built on top of PyTorch, numpy, pandas and some R/C++; external libs like spaCy, PuLP not required

Simplify your workflow (KISS)
- Understand and improve model, reuse, no ensembles!
- Work with product to figure out better ways to capture data and improve model
- This isn’t Kaggle, you have influence on data, UX and roadmap.
- North star ($)

[Diagram: DressNet v3. Labels: User Embedding, Item Embedding, Dress2Vec, DressReviews2Vec, Item Vector, ... ReLU ..., BCE Loss]

Validation as integration test
Regular deterministic code:
- Inputs -> Some known function -> Output
- Test: Make sure output works for some expected inputs… unit tests, fuzz tests, random tests, integration tests
ML code:
- Inputs -> Black box ML (function + data) -> Output
- Test: Change in data changes assumptions.. Could be upstream ETL problem but blind to it.

HoldoutCheck
- Train: 90% of users with full history + 10% of users with last k missing
- Test: For those 10%, check against those last k
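A rough sketch of the split and check described on this slide follows. Everything here is an assumption for illustration: the function names (`holdout_split`, `hit_rate`, `holdout_check`), the hit-rate metric, and the popularity recommender used as a stand-in model are not from the DeepDress library.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Histories = Dict[str, List[str]]


def holdout_split(
    histories: Histories, k: int = 1, holdout_frac: float = 0.1, seed: int = 0
) -> Tuple[Histories, Histories]:
    """Keep full history for most users; hide the last k items for the rest."""
    rng = random.Random(seed)
    users = sorted(histories)
    rng.shuffle(users)
    n_holdout = max(1, int(len(users) * holdout_frac))
    held_out = set(users[:n_holdout])
    train: Histories = {}
    test: Histories = {}
    for user, items in histories.items():
        if user in held_out and len(items) > k:
            train[user] = list(items[:-k])
            test[user] = list(items[-k:])
        else:
            train[user] = list(items)
    return train, test


def hit_rate(recommend: Callable[[str], List[str]], test: Histories, top_n: int = 10) -> float:
    """Share of held-out users with at least one hidden item in their top_n."""
    if not test:
        return 0.0
    hits = sum(1 for user, truth in test.items() if set(recommend(user)[:top_n]) & set(truth))
    return hits / len(test)


def holdout_check(recommend: Callable[[str], List[str]], test: Histories, min_hit_rate: float) -> bool:
    """Pass only if the model recovers enough of the hidden items."""
    return hit_rate(recommend, test) >= min_hit_rate


if __name__ == "__main__":
    # Synthetic rental histories, purely for demonstration.
    histories = {f"u{n}": [f"d{(n + j) % 5}" for j in range(4)] for n in range(20)}
    train, test = holdout_split(histories, k=1)
    popularity = Counter(item for items in train.values() for item in items)
    top_items = [item for item, _ in popularity.most_common(3)]
    print("hit rate:", hit_rate(lambda user: top_items, test, top_n=3))
```

The model is trained on `train` only, so the check behaves like an integration test: it fails when the data feeding the model changes, even if no code did.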

Other useful checks
- SelfDriftCheck: Did the prediction metrics change compared to last n day moving average?
- MetricDriftCheck: Compare to another business metric ($$$). Does this metric still track reality?
- IntegrationCheck: Scrape website and see if that’s what we sent
Tests/Checks are a way to encode our assumptions for building that model, choosing that metric and assuming those relationships in data
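As an illustration of the self-drift idea, the sketch below compares today's metric with the moving average of the last n days. The function name `self_drift_check`, the 7-day window and the 10% relative tolerance are assumptions chosen for the example, not values from the talk.

```python
from typing import Sequence


def self_drift_check(
    history: Sequence[float], today: float, n: int = 7, tolerance: float = 0.10
) -> bool:
    """True if today's metric is within `tolerance` (relative) of the n-day average."""
    window = list(history[-n:])
    if not window:
        return True  # nothing to compare against yet
    average = sum(window) / len(window)
    if average == 0:
        return today == 0
    return abs(today - average) / abs(average) <= tolerance


if __name__ == "__main__":
    daily_hit_rate = [0.31, 0.30, 0.29, 0.30, 0.32, 0.30, 0.31]
    for value in (0.30, 0.20):
        status = "ok" if self_drift_check(daily_hit_rate, value) else "DRIFT"
        print(f"today={value:.2f} -> {status}")
```

A failed check would typically block deployment of the freshly trained model and leave the previous version serving.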

Strategy: SLAs, $, tracking
Infra: GPU, glue
MLaaS: Embedding DB/API

Virevol AI Automating and augmenting retail

Keep in touch @analyticsaurabh www.virevol.com www.sanealytics.com www.RentTheRunway.com
