
Bootstrapping Labels for One-Hundred Million Images, by Jimmy Whitaker


  1. Bootstrapping Labels for One-Hundred Million Images Jimmy Whitaker

  2. We are drowning in data
     Data Never Sleeps 2.0 - DOMO (2014)
     4/5/16, GTC 2016

  3. Ripe Opportunities
     • Many problems to solve
     • Limitless amounts of image data
     • Deep Learning pushing the state of the art everywhere
     • GPUs making everything possible

  4. The Problem
     • Deep Learning is data-driven
     • ImageNet has 1.2 million training examples
     • Few large, labeled image datasets exist
     • It’s expensive to label data
     • Our datasets are 100M+ images
     • Few qualified to label
     • Highly sensitive customer data
     • Necessary subject matter expertise

  5. Ever labeled data?
     • Not as easy as it seems:
       • It’s repetitive
       • Less accurate over time
     • One day computers will do it all for you?
     • But not yet
     • Can some of this effort be automated?

  6. Many Approaches
     • Mechanical Turk
       • Costly
       • Time consuming
     • Clustering
       • Expensive
       • How many clusters
       • What features to use?
     • Pre-trained classifiers
       • What if pre-trained classifiers don’t work well on data?
     • Active learning
       • Iterative labeling
       • Open problem
     Can we combine these into something useful?

  7. The Goal
     • Inspired by Image Similarity experience and Jeremy Howard TED talk
     • Use machines to filter the noise
     • Reduce repetitive tasks
     • Leverage human labeler
     • Understand the data
     • Label iteratively
     • Allow exploration

  8. Our Approach

  9. Our Approach
     Compare image hashes to filter duplicate images
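The slide does not say which image hash is used. As a minimal sketch only, the example below assumes an average hash (one bit per pixel, set when the pixel is brighter than the image mean) compared by Hamming distance. The names `average_hash`, `hamming`, `filter_duplicates`, and the `max_distance` threshold are hypothetical, and images are assumed to be already downscaled to the same small grayscale grid.

```python
from typing import Dict, List, Sequence

# Assumption: each image is a small grayscale grid (rows of 0-255 ints),
# already resized to a common shape such as 8x8.
Gray = Sequence[Sequence[int]]


def average_hash(img: Gray) -> int:
    """Pack one bit per pixel: 1 if the pixel is above the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def filter_duplicates(images: Dict[str, Gray], max_distance: int = 2) -> List[str]:
    """Keep an image only if its hash is far from every hash kept so far."""
    kept: List[str] = []
    kept_hashes: List[int] = []
    for name, img in images.items():
        h = average_hash(img)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(name)
            kept_hashes.append(h)
    return kept


if __name__ == "__main__":
    a = [[10, 200], [10, 200]]
    b = [[12, 198], [11, 201]]   # near-duplicate of a
    c = [[200, 10], [200, 10]]   # mirrored pattern, a different image
    print(filter_duplicates({"a": a, "b": b, "c": c}))
```

This pairwise comparison is quadratic in the number of kept images, so at the scale the talk describes it would presumably need some form of hash bucketing or indexing; that part is not sketched here.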

  10. Our Approach

  11. Our Approach

  12. Our Approach

  13. Our Approach
      Prevents over-focusing on one portion of feature space
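The diagram for this slide did not survive extraction, so the mechanism behind "prevents over-focusing" is not known. One plausible reading, offered purely as an assumption, is that each labeling batch is drawn across clusters rather than from a single dense region. The sketch below uses a hypothetical `spread_batch` helper that takes precomputed cluster assignments and picks images round-robin.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Dict, Hashable, List


def spread_batch(assignments: Dict[str, Hashable], batch_size: int) -> List[str]:
    """Pick up to batch_size images, cycling over clusters one image at a time.

    `assignments` maps an image name to a cluster id (assumed to come from
    some earlier clustering step that is not shown here).
    """
    if batch_size <= 0:
        return []
    by_cluster = defaultdict(list)
    for name, cluster in assignments.items():
        by_cluster[cluster].append(name)

    batch: List[str] = []
    # Each round takes at most one image from every cluster, so a large
    # cluster cannot fill the batch before small clusters are represented.
    for round_ in zip_longest(*by_cluster.values()):
        for name in round_:
            if name is not None:
                batch.append(name)
                if len(batch) == batch_size:
                    return batch
    return batch


if __name__ == "__main__":
    assignments = {"a1": 0, "a2": 0, "a3": 0, "a4": 0, "b1": 1, "c1": 2}
    print(spread_batch(assignments, 4))
```

With these toy assignments, cluster 0 holds four of the six images, yet the batch still includes the single images from clusters 1 and 2 before taking a second image from cluster 0.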

  14. Our Approach

  15. Our Approach
      Label images on the boundary of the class
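Labeling on the class boundary resembles uncertainty sampling from active learning, which the deck lists among the approaches it draws on. The slides do not specify how the boundary is measured, so the sketch below assumes a binary classifier that outputs a probability per image and treats scores nearest a 0.5 threshold as boundary cases. The `boundary_candidates` name and the `threshold` parameter are hypothetical.

```python
from typing import Dict, List


def boundary_candidates(
    scores: Dict[str, float], k: int, threshold: float = 0.5
) -> List[str]:
    """Return the k image names whose scores lie closest to the threshold.

    `scores` maps an image name to the classifier's probability that the
    image belongs to the class currently being labeled (an assumption).
    """
    ranked = sorted(scores, key=lambda name: abs(scores[name] - threshold))
    return ranked[:k]


if __name__ == "__main__":
    scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45}
    # "b" and "d" are the least certain, so they go to the human labeler first.
    print(boundary_candidates(scores, k=2))
```

Confident predictions such as "a" and "c" are left for later, which is where the reduction in repetitive labeling effort would come from under this reading.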

  16. Our Approach
      Improve CNN features for labeled classes

  17. GUI

  18. Hardware
      • Cirrascale GB5670
      • 56 CPU cores
      • 8x NVIDIA K80
      • 512 GB DDR4
      • 1 TB SSD

  19. Benefits
      • Create large, labeled datasets
      • High quality
      • Allows data exploration
      • Dramatic time reduction
      • ~3-5x faster initially
      • Multiplicative efficiency gains
      • Flexible framework
      • Perform data science with images

  20. CONFIDENTIAL
