Sharing is Caring in the Land of The Long Tail
Samy Bengio


SLIDE 1

Sharing is Caring in the Land of The Long Tail

Samy Bengio

SLIDE 2

Real life setting

“Real problems rarely come packaged as 1M images uniformly belonging to a set of 1000 classes…”

SLIDE 3

The long tail

  • Well-known phenomenon where a small number of generic objects/entities/words appear very often and most others appear more rarely.
  • Also known as Zipf's law, power law, or Pareto distribution.
  • The web is littered with this kind of distribution:
    • the frequency of each unique query on search engines,
    • the occurrences of each unique word in text documents,
    • etc.
SLIDE 4

Example of a long tail

[Log-log plot of word frequencies in Wikipedia: frequent words such as “the” at the head; rare words such as “anyways”, “trickiest”, and “h-plane” in the tail.]

SLIDE 5

Representation sharing

  • How do we design a classifier or a ranker when data follows a long tail distribution?
  • If we train one model per class, it is hard for poor classes to be well trained.
  • How come we humans are able to recognize objects we have seen only once or even never?
  • Most likely answer: representation sharing: all class models share/learn a joint representation.
  • Poor classes can then benefit from knowledge learned from semantically similar but richer classes.
  • Extreme case: zero-shot setting!
SLIDE 6

Outline

In this talk, I will cover the following ideas:

  • Wsabie: a joint embedding space of images and labels
  • The many facets of text embeddings
  • Zero-shot setting through embeddings
  • Incorporate Knowledge Graph constraints
  • Use of a language model

I will NOT cover the following important issues:

  • Prediction time issues for extreme classification
  • Memory issues
SLIDE 7

Wsabie

[Diagram: a 100-dimensional embedding space containing both images and labels such as Obama, Eiffel Tower, Shark, Dolphin, Lion, ...]

Learn to embed images & labels to optimize top-ranked items.

Wsabie: J. Weston et al., ECML 2010, IJCAI 2011

SLIDE 8

Wsabie: summary

Label i (a one-hot vector, e.g. 000100) is embedded via W; image x (real values) is embedded via V; similarity is sim(i, x) = <W_i, V_x>.

Triplet loss: sim([dolphin image], dolphin) > sim([dolphin image], obama) + 1

Trained by stochastic gradient descent and smart sampling of negative examples.
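The triplet objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the WARP-weighted implementation from the paper: the dimensions, random initialization, and variable names (`V`, `W`, `sim`, `triplet_loss`) are chosen for clarity only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the 100-d space from the slides).
d_img, d_emb, n_labels = 8, 4, 5

V = rng.normal(size=(d_emb, d_img))    # image embedding matrix
W = rng.normal(size=(n_labels, d_emb)) # one embedding row per label

def sim(label, x):
    """sim(i, x) = <W_i, V_x> as on the slide."""
    return W[label] @ (V @ x)

def triplet_loss(x, pos, neg, margin=1.0):
    """Hinge loss: pushes sim(pos, x) above sim(neg, x) + margin."""
    return max(0.0, margin - sim(pos, x) + sim(neg, x))

x = rng.normal(size=d_img)
loss = triplet_loss(x, pos=0, neg=1)  # non-negative by construction
```

In training, `neg` would be drawn by the slide's "smart sampling" of negative labels, and gradients of this loss would update both `W` and `V`.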

SLIDE 9

Wsabie: experiments - results

Method                  ImageNet 2010          Web
                        prec@1    prec@10      prec@1    prec@10
approx kNN              1.55%     0.41%        0.30%     0.34%
One-vs-Rest             2.27%     1.02%        0.52%     0.29%
Wsabie                  4.03%     1.48%        1.03%     0.44%
Ensemble of 10 Wsabies  10.03%    3.02%

ImageNet 2010: 16,000 labels and 4M images. Web: 109,000 labels and 16M images.

SLIDE 10

Wsabie: embeddings

Label          Nearest Neighbors
barack obama   barak obama, obama, barack, barrack obama, bow wow
david beckham  beckham, david beckam, alessandro del piero, del piero
santa          santa claus, papa noel, pere noel, santa clause, joyeux noel
dolphin        delphin, dauphin, whale, delfin, delfini, baleine, blue whale
cows           cattle, shire, dairy cows, kuh, horse, cow, shire horse, kone
rose           rosen, hibiscus, rose flower, rosa, roze, pink rose, red rose
eiffel tower   eiffel, tour eiffel, la tour eiffel, big ben, paris, blue mosque
ipod           i pod, ipod nano, apple ipod, ipod apple, new ipod
f18            f 18, eurofighter, f14, fighter jet, tomcat, mig 21, f 16

SLIDE 11

Wsabie: annotations

delfini, orca, dolphin, mar, delfin, dauphin, whale, cancun, killer whale, sea world

blue whale, whale shark, great white shark, underwater, white shark, shark, manta ray, dolphin, requin, blue shark, diving

barrack obama, barak obama, barack hussein obama, barack obama, james marsden, jay z, obama, nelly, falco, barack

eiffel, paris by night, la tour eiffel, tour eiffel, eiffel tower, las vegas strip, eifel, tokyo tower, eifel tower

SLIDE 12

“Why not an embedding of text only?”
SLIDE 13

Skip-Gram (Word2Vec)

Learn dense embedding vectors from an unannotated text corpus, e.g. Wikipedia.

[Diagram: from the sentence “an exceptionally large male tiger shark can grow up to”, words are embedded (E) and used to predict nearby words (W); candidate terms shown include bull shark, tuna, tiger shark, Obama, wing chair.]

http://code.google.com/p/word2vec Tomas Mikolov, Kai Chen, Greg Corrado, Jeff Dean (ICLR 2013)
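The skip-gram objective trains word embeddings to predict the words surrounding each center word. A minimal sketch of how (center, context) training pairs are generated from the slide's example sentence, with a hypothetical `skipgram_pairs` helper:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs for the skip-gram objective."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "an exceptionally large male tiger shark can grow up to".split()
pairs = skipgram_pairs(sentence, window=2)
```

The actual word2vec tool then learns embeddings so that these pairs get high predicted probability (via hierarchical softmax or negative sampling).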

SLIDE 14

Skip-Gram Wikipedia

[t-SNE visualization of ImageNet labels; skip-gram trained on Wikipedia, 155K terms. Example clusters: sharks (tiger shark, bull shark, blacktip shark, shark, oceanic whitetip shark, sandbar shark, dusky shark, blue shark, requiem shark, great white shark, lemon shark) and cars (car, cars, muscle car, sports car, compact car, autocar, automobile, pickup truck, racing car, passenger car, dealership). Broader regions: transportation, dogs, birds, musical instruments, aquatic life, insects, animals, clothing, food, reptiles.]

SLIDE 15

Embeddings are powerful

E(Rome) - E(Italy) + E(Germany) ≈ E(Berlin)
E(hotter) - E(hot) + E(big) ≈ E(bigger)

[Diagram: word pairs Berlin-Germany, Rome-Italy and bigger-big, hotter-hot related by parallel vector offsets.]
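The analogy arithmetic above can be illustrated with hand-crafted toy vectors. Real skip-gram embeddings are learned from text; the two dimensions here (roughly "is-capital" and "country id") are purely an assumption for the sake of the example.

```python
import numpy as np

# Toy vectors, NOT learned embeddings: axis 0 ~ "is a capital city",
# axis 1 ~ an arbitrary country identifier.
E = {
    "Italy":   np.array([0.0, 1.0]),
    "Rome":    np.array([1.0, 1.0]),
    "Germany": np.array([0.0, 2.0]),
    "Berlin":  np.array([1.0, 2.0]),
    "France":  np.array([0.0, 3.0]),
    "Paris":   np.array([1.0, 3.0]),
}

# E(Rome) - E(Italy) + E(Germany) should land near E(Berlin).
query = E["Rome"] - E["Italy"] + E["Germany"]

# Nearest neighbour search, excluding the three input words.
candidates = [w for w in E if w not in ("Rome", "Italy", "Germany")]
answer = min(candidates, key=lambda w: np.linalg.norm(E[w] - query))
```

With learned embeddings the same nearest-neighbour query recovers analogies like hot : hotter :: big : bigger.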

SLIDE 16

Let’s go back to images!

SLIDE 17

Deep convolutional models for images

[Diagram: a deep convolutional network mapping the input through Layer 1 ... Layer 7.]

But what about the long tail of classes? What about using our semantic embeddings for that?

SLIDE 18

ConSE: Convex Combination of Semantic Embeddings [Norouzi et al, ICLR’2014]

[Diagram: a classifier producing p(Lion|x), p(Apple|x), p(Orange|x), p(Tiger|x), p(Bear|x) for an input image x.]

SLIDE 19

ConSE: Convex Combination of Semantic Embeddings

f(x) = Σ_i p(y_i|x) s(y_i)

where s(y) is the embedding position of y, from Skip-Gram for instance:

f(x) = p(Lion|x)s(Lion) + p(Apple|x)s(Apple) + p(Orange|x)s(Orange) + p(Tiger|x)s(Tiger) + p(Bear|x)s(Bear)

Do a nearest neighbor search around f(x) to find the corresponding label.

SLIDE 20

ConSE(T): Convex Combination of Semantic Embeddings

In practice, consider the average of only a few labels:

top(T) = {i | p(y_i|x) is among the top T probabilities}

f(x) = (1/Z) Σ_{i ∈ top(T)} p(y_i|x) s(y_i)
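The ConSE(T) combination and the nearest-neighbour prediction step fit in a few lines. This is a sketch with random stand-ins: `s` plays the role of skip-gram label embeddings and `p` the role of classifier outputs; the sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, d = 6, 4

s = rng.normal(size=(n_classes, d))  # s(y_i): label embeddings (e.g. skip-gram)
p = rng.random(n_classes)
p /= p.sum()                         # p(y_i|x): classifier output probabilities

def conse(p, s, T):
    """ConSE(T): convex combination of the top-T label embeddings."""
    top = np.argsort(p)[-T:]         # indices of the T largest probabilities
    # 1/Z renormalises over the kept labels.
    return (p[top, None] * s[top]).sum(axis=0) / p[top].sum()

f_x = conse(p, s, T=3)

# Zero-shot step: the predicted label is the nearest embedding to f(x),
# searched over a label set that may include classes never seen in training.
pred = int(np.argmin(np.linalg.norm(s - f_x, axis=1)))
```

With T equal to the number of classes this reduces to the full convex combination from the previous slide.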

SLIDE 21

ConSE(T): experiments on ImageNet

  • Model trained with 1.2M ILSVRC 2012 images from 1,000 classes.
  • Evaluated on images from the same classes.
  • Results are measured as hit@k.

[Figure legend: Training, 2-hops, 3-hops label sets.]
SLIDE 22

ConSE(T): experiments

[Results figure.]

SLIDE 23

Knowledge Graph

SLIDE 24

Multiclass Classifiers

[Diagram: a GoogLeNet model with a softmax or logistic output layer.]

SLIDE 25

Object labels have rich relations

[Diagram: labels Dog, Cat, Corgi, Puppy with Exclusion (Dog vs. Cat), Hierarchical (Corgi is a Dog), and Overlap (Puppy overlaps the others) relations.]

SLIDE 26

Visual Model + Knowledge Graph

[Diagram: a visual model outputs scores, e.g. Corgi 0.9, Puppy 0.8, Dog 0.9, Cat 0.1; a knowledge graph encodes Exclusion and Hierarchical relations among Corgi, Puppy, Dog, Cat; joint inference combines both.]

Hierarchy and Exclusion (HEX) Graph [Deng et al, ECCV 2014]

SLIDE 27

HEX Classification Model

Input scores x ∈ R^n, binary label vector y ∈ {0,1}^n.

Pr(y|x) = (1/Z(x)) Π_i φ_i(x_i, y_i) Π_{i,j} ψ_{i,j}(y_i, y_j)

Unary (same as logistic regression):

φ_i(x_i, y_i) = sigmoid(x_i) if y_i = 1, and 1 - sigmoid(x_i) if y_i = 0

Pairwise (set illegal configurations to zero):

ψ_{i,j}(y_i, y_j) = 0 if (y_i, y_j) violates the constraints, 1 otherwise

All illegal configurations have probability zero.
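For a handful of labels the HEX distribution can be computed by brute force, which makes the model easy to see. A sketch using the slide's Dog/Cat/Corgi example; the edge lists and function names are illustrative, and real HEX inference uses efficient message passing rather than enumeration.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

labels = ["dog", "cat", "corgi"]
subsumption = [("corgi", "dog")]  # hierarchy: corgi implies dog
exclusion = [("dog", "cat")]      # exclusion: not both dog and cat

def legal(y):
    """y maps label -> 0/1; illegal configurations get probability zero."""
    for child, parent in subsumption:
        if y[child] == 1 and y[parent] == 0:
            return False
    for a, b in exclusion:
        if y[a] == 1 and y[b] == 1:
            return False
    return True

def hex_probs(x):
    """Brute-force Pr(y|x) over all legal binary label vectors."""
    scores = {}
    for bits in itertools.product([0, 1], repeat=len(labels)):
        y = dict(zip(labels, bits))
        if not legal(y):
            continue
        # Unary factors: sigmoid(x_i) if y_i = 1, else 1 - sigmoid(x_i).
        phi = 1.0
        for l in labels:
            phi *= sigmoid(x[l]) if y[l] else 1.0 - sigmoid(x[l])
        scores[bits] = phi
    Z = sum(scores.values())  # Z(x) sums over legal configurations only
    return {y: v / Z for y, v in scores.items()}

probs = hex_probs({"dog": 0.9, "cat": 0.1, "corgi": 0.8})
```

Configurations such as (dog=1, cat=1) or (corgi=1, dog=0) simply never appear in the normalised distribution.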

SLIDE 28

Exp: Learning with weak labels

  • ILSVRC 2012: “relabel” or “weaken” a portion of fine-grained leaf labels to basic level labels.
  • Evaluate on fine-grained recognition.

[Diagram: original ILSVRC 2012 leaf labels (e.g. Husky, Corgi, under Dog, under Animal) are relabeled to basic-level labels such as Dog for training; the test set keeps the leaf labels.]

SLIDE 29

Exp: Learning with weak labels

  • ILSVRC 2012: “relabel” or “weaken” a portion of fine-grained leaf labels to basic level labels.
  • Evaluate on fine-grained recognition.
  • Consistently outperforms baselines.

[Results table: top-1 accuracy (top-5 accuracy).]

SLIDE 30

What about textual descriptions?

  • We have considered the long tail of objects.
  • What about more complex descriptions, involving multi-word descriptions, or captions?
  • We can use language models to help.
SLIDE 31

Neural Image Caption Generator

Vision (Deep CNN) → Language (Generating RNN)

Example outputs:

  1. Two pizzas sitting on top of a stove top oven.
  2. A pizza sitting on top of a pan on top of a stove.

A group of people shopping at an outdoor market. There are many vegetables at the fruit stand.

[Vinyals et al, CVPR 2015]

SLIDE 32

NIC: objective

  • Let I be an image (pixels).
  • Let S be the corresponding sentence (sequence of words).
  • Likelihood of producing the right sentence given the image:

log p(S|I) = Σ_{t=0}^{N} log p(S_t | I, S_0, ..., S_{t-1})

  • We maximize the likelihood of producing the right sentence given the image:

θ* = arg max_θ Σ_{(I,S)} log p(S|I; θ)
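The chain-rule decomposition above is just a sum of per-word log probabilities. A sketch of computing log p(S|I) for one caption, where `fake_decoder_step` is a stand-in for the real image-conditioned RNN decoder (it returns an arbitrary distribution, since the model itself is not part of this example):

```python
import numpy as np

vocab = ["<start>", "two", "pizzas", "on", "a", "stove", "<end>"]
caption = ["two", "pizzas", "on", "a", "stove", "<end>"]

rng = np.random.default_rng(0)

def fake_decoder_step(prefix):
    """Stand-in for p(S_t | I, S_0..S_{t-1}); ignores the prefix here."""
    logits = rng.normal(size=len(vocab))
    return np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary

log_p = 0.0
prefix = ["<start>"]
for word in caption:
    p = fake_decoder_step(prefix)
    log_p += np.log(p[vocab.index(word)])  # add log p(S_t | I, S_0..S_{t-1})
    prefix.append(word)
# log_p is the caption log-likelihood log p(S|I) maximised during training.
```

Training sums this quantity over all (I, S) pairs and follows its gradient with respect to the CNN and RNN parameters.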

SLIDE 33

NIC: model

[Diagram: a Convolutional Neural Net embeds the image; a Recurrent Neural Net then emits P(word 1), P(word 2), ..., P(<end>), fed with word 1 ... word N.]

SLIDE 34

Examples

[Figure: sample captions rated from “Describes without errors” to “Describes with minor errors”, “Somewhat related to the image”, and “Unrelated to the image”:]

  • Two dogs play in the grass.
  • Two hockey players are fighting over the puck.
  • A skateboarder does a trick on a ramp.
  • A little girl in a pink hat is blowing bubbles.
  • A herd of elephants walking across a dry grass field.
  • A group of young people playing a game of frisbee.
  • A close up of a cat laying on a couch.
  • A red motorcycle parked on the side of the road.
  • A dog is jumping to catch a frisbee.
  • A yellow school bus parked in a parking lot.
  • A person riding a motorcycle on a dirt road.
  • A refrigerator filled with lots of food and drinks.

SLIDE 35

It doesn’t always work…

Human: A blue and black dress ... No! I see white and gold!
Our model: A close up of a vase with flowers.

SLIDE 36

Scheduled Sampling [NIPS 2015]

SLIDE 37

Scheduled Sampling

SLIDE 38

Scheduled Sampling

[Plot: sampling probability (0.1 to 1.0) vs. training step (200 to 1000) for the exponential decay, inverse sigmoid decay, and linear decay schedules.]
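The three schedules in the plot set ε_i, the probability of feeding the true previous token (rather than the model's own sample) at training step i. A sketch with illustrative constants; the specific values of k, c, and ε_0 are assumptions, tuned per task in practice.

```python
import math

def exponential_decay(i, k=0.995):
    """epsilon_i = k^i, with k < 1."""
    return k ** i

def inverse_sigmoid_decay(i, k=100.0):
    """epsilon_i = k / (k + exp(i / k)); stays near 1 early, then drops."""
    return k / (k + math.exp(i / k))

def linear_decay(i, eps0=1.0, c=0.001, eps_min=0.0):
    """epsilon_i = max(eps_min, eps0 - c*i)."""
    return max(eps_min, eps0 - c * i)

# All three start near 1 (mostly teacher forcing) and decay toward 0
# (mostly sampling from the model), matching the curves on the slide.
for i in (0, 500, 1000):
    for schedule in (exponential_decay, inverse_sigmoid_decay, linear_decay):
        assert 0.0 <= schedule(i) <= 1.0
```

At each decoding step during training, a coin flip with probability ε_i decides whether the ground-truth or the sampled previous word is fed to the RNN.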

SLIDE 39

Conclusions

  • The long tail problem happens in most of the interesting tasks.
  • Sharing approaches can help “poor” classes to generalize thanks to “rich” classes.
  • At the extreme, semantic embeddings can represent classes with zero training examples.
  • Sharing approaches are interesting not only for text and images, but also for complete sentences.
  • Other recent approaches: represent text using characters; good for long tail words.
  • … but never underestimate the long tail …
SLIDE 40

Open source machine learning: tensorflow.org

SLIDE 41

Google Brain Residency Programme

New one-year immersion program in deep learning research

  • Learn to conduct deep learning research w/ experts in our team
  • Fixed one-year employment with salary, benefits, ...
  • Goal after one year is to have conducted several research projects
  • Interesting problems, TensorFlow, and access to computational resources
  • Apply before January 15, 2016.

For more information: g.co/brainresidency
Contact us: brain-residency@google.com