slide-1
SLIDE 1

Bootstrapping Labels for One-Hundred Million Images

Jimmy Whitaker

slide-2
SLIDE 2

We are drowning in data

4/5/16 GTC 2016 2

Data Never Sleeps 2.0 - DOMO (2014)

slide-3
SLIDE 3

Ripe Opportunities

  • Many problems to solve
  • Limitless amounts of image data
  • Deep Learning pushing the state of the art everywhere
  • GPUs making everything possible


slide-4
SLIDE 4

The Problem

  • Deep Learning is data-driven
  • ImageNet has 1.2 million training examples
  • Few large, labeled image datasets exist
  • It’s expensive to label data
  • Our datasets are 100M+ images
  • Few people are qualified to label them
  • Highly sensitive customer data
  • Subject matter expertise is required


slide-5
SLIDE 5

Ever labeled data?

  • Not as easy as it seems:
  • It’s repetitive
  • Accuracy declines over time
  • One day computers will do it all for you? Not yet.
  • Can some of this effort be automated?


slide-6
SLIDE 6

Many Approaches

  • Mechanical Turk
  • Costly
  • Time consuming
  • Clustering
  • Expensive
  • How many clusters?
  • What features to use?
  • Pre-trained classifiers
  • What if pre-trained classifiers don’t work well on the data?
  • Active learning
  • Iterative labeling
  • Open problem

Can we combine these into something useful?
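The clustering pain points above can be made concrete: even a toy k-means forces two up-front choices, the number of clusters k and the feature representation. A minimal pure-Python sketch on 1-D features, purely illustrative (a real pipeline would cluster high-dimensional CNN features):

```python
def kmeans_1d(xs, k, iters=20):
    """Toy k-means on 1-D points. Both k and the feature choice
    (here, raw scalars) must be picked before seeing any labels."""
    centers = xs[:k]                       # naive init: first k points
    for _ in range(iters):
        # assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            groups[j].append(x)
        # recompute each center as the mean of its group
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers
```

With a bad k the groups blur together, and with bad features even the right k separates nothing, which is exactly the objection raised on the slide.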

slide-7
SLIDE 7

The Goal

  • Inspired by Image Similarity experience and Jeremy Howard’s TED talk
  • Use machines to filter the noise
  • Reduce repetitive tasks
  • Leverage the human labeler
  • Understand the data
  • Label iteratively
  • Allow exploration


slide-8
SLIDE 8

Our Approach


slide-9
SLIDE 9

Our Approach


Compare Image Hashes to filter Duplicate Images
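The hash-based duplicate filter can be sketched in a few lines. This is an illustrative stand-in, not the deck's actual implementation: it assumes images are already decoded to 8x8 grayscale thumbnails (flat lists of 64 values) and uses a simple average hash with a Hamming-distance threshold.

```python
def average_hash(pixels):
    """64-bit hash: bit is 1 where the pixel is >= the mean brightness.
    `pixels` is a flat list of 64 grayscale values (an 8x8 thumbnail)."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def filter_duplicates(images, threshold=5):
    """Keep one representative per near-duplicate group: an image is
    kept only if its hash is far from every hash already kept."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(hamming(h, h2) > threshold for h2 in hashes):
            kept.append(img)
            hashes.append(h)
    return kept
```

Unlike exact-byte deduplication, a perceptual hash like this also catches resized or lightly re-encoded copies, which matters at the 100M-image scale the deck describes.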

slide-10
SLIDE 10

Our Approach


slide-11
SLIDE 11

Our Approach


slide-12
SLIDE 12

Our Approach


slide-13
SLIDE 13

Our Approach


Prevents over-focusing on one portion of feature space
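One way to spread labeling effort across feature space instead of concentrating in one region is greedy farthest-point selection: repeatedly pick the feature vector farthest from everything chosen so far. The deck doesn't name its sampling method, so this is an assumed sketch:

```python
def squared_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_sample(features, k):
    """Greedily pick k indices, each maximally far from those
    already chosen, so samples cover the feature space."""
    chosen = [0]                      # start from the first vector
    while len(chosen) < k:
        best_i, best_d = None, -1.0
        for i in range(len(features)):
            if i in chosen:
                continue
            # distance from candidate i to its nearest chosen point
            d = min(squared_dist(features[i], features[j]) for j in chosen)
            if d > best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
    return chosen
```

The O(n * k) loop is fine for a batch of candidates; at full dataset scale one would subsample first.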

slide-14
SLIDE 14

Our Approach


slide-15
SLIDE 15

Our Approach


Label Images on the boundary of the class
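"Label images on the boundary of the class" is classic uncertainty sampling: rank unlabeled images by how close the current classifier's score sits to the decision threshold, and send the closest ones to the human labeler. A minimal sketch, assuming single positive-class probabilities and a 0.5 threshold (both assumptions, not stated in the deck):

```python
def margin(prob):
    """Distance of a positive-class probability from the 0.5 boundary;
    small margin means the classifier is unsure."""
    return abs(prob - 0.5)

def boundary_candidates(probs, n):
    """Indices of the n images nearest the class boundary, i.e. the
    ones whose labels would be most informative."""
    order = sorted(range(len(probs)), key=lambda i: margin(probs[i]))
    return order[:n]
```

Images the model already scores near 0 or 1 are skipped, which is where the repetitive-labeling time savings come from.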

slide-16
SLIDE 16

Our Approach


Improve CNN features for labeled classes
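This retraining step closes the loop: newly labeled examples improve the model, which in turn improves the next round of boundary sampling. As a toy stand-in for fine-tuning CNN features, here is a tiny logistic head trained by SGD on fixed feature vectors; a real pipeline would fine-tune the network itself, and every name here is illustrative.

```python
import math

def train_head(feats, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on (feature, label) pairs by SGD.
    Stand-in for the CNN fine-tuning step described on the slide."""
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                      # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Positive-class probability for a feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

After each labeling round the refreshed scores feed back into the boundary-sampling step, which is what makes the iterative gains multiplicative rather than additive.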

slide-17
SLIDE 17

GUI


slide-18
SLIDE 18

Hardware

  • Cirrascale GB5670
  • 56 CPU Cores
  • 8x NVIDIA Tesla K80
  • 512GB DDR4
  • 1 TB SSD


slide-19
SLIDE 19

Benefits

  • Create Large, Labeled Datasets
  • High quality
  • Allows data exploration
  • Dramatic time reduction
  • ~3-5x faster initially
  • Multiplicative efficiency gains
  • Flexible framework
  • Perform data science with images


slide-20
SLIDE 20

CONFIDENTIAL 20