Phrase-based Image Captioning
Rémi Lebret, Pedro O. Pinheiro, Ronan Collobert
Idiap Research Institute / EPFL
ICML, 9 July 2015
◮ Objective: Generate descriptive sentences given a sample image.
Rémi Lebret (Idiap Research Institute / EPFL) Phrase-based Image Captioning ICML 2015 2 / 18
◮ Recent models based on Deep CNN + RNN [Vinyals et al., Karpathy &
A given image i ∈ I
Ground-truth descriptions s ∈ S, each with the chunking tags shown on the slide:
◮ a man riding a skateboard up the side of a wooden ramp (VP NP PP NP PP NP)
◮ a man is grinding a ramp on a skateboard (VP NP PP NP)
◮ man riding on edge of an oval ramp with a skate board (VP NP PP NP PP NP)
◮ a man in a helmet skateboarding before an audience (PP NP PP NP)
◮ a man on a skateboard is doing a trick (PP NP VP NP)
Phrase types:
◮ Noun phrases (NP)
◮ Verbal phrases (VP)
◮ Prepositional phrases (PP)
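One way to hold chunked captions in code is a list of (tag, phrase) pairs per caption. The sketch below is illustrative only: the chunk boundaries are hand-annotated assumptions rather than the output of the chunker used for the slides, and `tag_sequence` and `phrase_counts` are hypothetical helper names.

```python
from collections import Counter

# Hand-annotated chunks for two example captions (an assumption made for
# illustration; a real pipeline would obtain them from a chunker).
chunked_captions = [
    [("NP", "a man"), ("VP", "is grinding"), ("NP", "a ramp"),
     ("PP", "on"), ("NP", "a skateboard")],
    [("NP", "a man"), ("PP", "on"), ("NP", "a skateboard"),
     ("VP", "is doing"), ("NP", "a trick")],
]

def tag_sequence(caption):
    """Return the syntactic skeleton of a caption as a string of tags."""
    return " ".join(tag for tag, _ in caption)

def phrase_counts(captions):
    """Count how often each (tag, phrase) pair occurs across captions."""
    return Counter(chunk for caption in captions for chunk in caption)

counts = phrase_counts(chunked_captions)
print(tag_sequence(chunked_captions[0]))
print(counts[("NP", "a man")])
```

Counts of this kind are what a frequency threshold on phrases would be applied to.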
◮ Two datasets: Flickr30k + COCO (≈ 560k training sentences).
[Figure: syntactic structures of the training sentences as chunking-tag sequences (NP VP NP PP NP O, NP VP NP O, NP PP NP VP NP O, ...), with their cumulative distribution function.]
◮ Describing images:
◮ $I$ = set of training images
◮ $C$ = set of all phrases used to describe $I$
◮ $U = (u_{c_1}, \ldots, u_{c_{|C|}}) \in \mathbb{R}^{m \times |C|}$
◮ $V \in \mathbb{R}^{m \times n}$

Example descriptions of one image:
◮ A man in a helmet skateboarding before an audience.
◮ Man riding on edge of an oval ramp with a skate board.
◮ A man riding a skateboard up the side of a wooden ramp.
◮ A man on a skateboard is doing a trick.
◮ A man is grinding a ramp on a skateboard.

Example phrases: a man, a wooden ramp, riding, a skate board, is grinding, with

◮ Image: pre-trained CNN representation $z_i \in \mathbb{R}^n$.
◮ Phrase: representation $u_c$ for a phrase $c = \{w_1, \ldots, w_K\}$, obtained by averaging pre-trained word vector representations $x_w \in \mathbb{R}^m$:
$$u_c = \frac{1}{K} \sum_{k=1}^{K} x_{w_k}$$
◮ Score between the image $i$ and a phrase $c$:
$$f_\theta(c, i) = u_c^T V z_i$$
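A minimal NumPy sketch of the two formulas above, the averaged phrase vector and the bilinear score. The dimensions are toy values and the word vectors, `V` and `z_i` are random stand-ins (assumptions for illustration), not pre-trained features; `phrase_vector` and `score` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6  # toy sizes for the word-vector and image-feature dimensions

# Hypothetical word vectors x_w in R^m (random stand-ins for pre-trained ones).
word_vectors = {w: rng.normal(size=m) for w in ["a", "wooden", "ramp", "man"]}

def phrase_vector(phrase):
    """u_c: average of the word vectors of the words in the phrase."""
    return np.mean([word_vectors[w] for w in phrase.split()], axis=0)

def score(u_c, V, z_i):
    """Bilinear score f(c, i) = u_c^T V z_i."""
    return float(u_c @ V @ z_i)

V = rng.normal(size=(m, n))  # stand-in for the trainable matrix V
z_i = rng.normal(size=n)     # stand-in for a CNN image representation

u_c = phrase_vector("a wooden ramp")
print(score(u_c, V, z_i))
```

Because the score is linear in `u_c`, all phrases can be scored against one image at once by stacking the phrase vectors into a matrix and computing a single matrix-vector product.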
◮ Training with negative sampling: for each image $i$ and each phrase $c_j$ describing it, negative phrases $c_k$ are sampled, and $\theta = \{U, V\}$ minimizes the logistic loss
$$\theta \mapsto \sum_{i \in I} \sum_{c_j \in C_i} \Big( \log\big(1 + e^{-u_{c_j}^T V z_i}\big) + \sum_{c_k \in C^-} \log\big(1 + e^{+u_{c_k}^T V z_i}\big) \Big)$$
where $C_i \subseteq C$ is the set of phrases describing image $i$ and $C^-$ is the set of sampled negative phrases.
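The training criterion can be sketched for a single target phrase as follows. This assumes a logistic loss of the form log(1 + e^(-score)) for a phrase that describes the image and log(1 + e^(+score)) for each sampled negative phrase, which matches the signs of the two exponents on the slide; the function names are hypothetical and the scores are plain floats rather than model outputs.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def negative_sampling_loss(positive_score, negative_scores):
    """Logistic loss for one target phrase and its sampled negatives.

    positive_score: score f(c_j, i) of a phrase that describes the image.
    negative_scores: scores f(c_k, i) of sampled phrases that do not.
    """
    loss = softplus(-positive_score)
    loss += sum(softplus(s) for s in negative_scores)
    return loss

# A well-separated example has a lower loss than a mis-ranked one.
print(negative_sampling_loss(5.0, [-5.0]))
print(negative_sampling_loss(-5.0, [5.0]))
```

Summing this quantity over all images and all their phrases gives the full objective; the gradient then pushes the scores of describing phrases up and the scores of sampled negatives down.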
◮ Bilinear model gives the $L$ most likely phrases $c_j$.
◮ Generating sentences from this set using $l \in \{1, \ldots, L\}$ phrases.
◮ Prior knowledge on chunking tags $t \in \{\text{NP}, \text{VP}, \text{PP}\}$.
◮ Ranking to find the sentence that is closest to the sample image.
◮ Leveraging the score between the image $i$ and a phrase $c$: $f_\theta(c, i) = u_c^T V z_i$.
◮ Averaging phrase scores $f_\theta(c_j, i)$ $\forall c_j \in s$:
$$\frac{1}{|s|} \sum_{c_j \in s} f_\theta(c_j, i)$$
where $|s|$ is the number of phrases in sentence $s$.
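The ranking step can be sketched as follows, with each candidate sentence given as a list of phrases and scored by the mean of its image-phrase scores. The scores and candidate sentences below are made-up values for illustration, and `sentence_score` and `rank_sentences` are hypothetical names.

```python
def sentence_score(phrases, phrase_scores):
    """Average the image-phrase scores of the phrases in a candidate sentence."""
    return sum(phrase_scores[c] for c in phrases) / len(phrases)

def rank_sentences(candidates, phrase_scores):
    """Sort candidate sentences (lists of phrases) by decreasing average score."""
    return sorted(
        candidates,
        key=lambda s: sentence_score(s, phrase_scores),
        reverse=True,
    )

# Hypothetical scores f(c, i) for one image.
phrase_scores = {"a man": 2.0, "is riding": 1.0, "a skateboard": 3.0, "a plate": -1.0}
candidates = [
    ["a man", "is riding", "a plate"],
    ["a man", "is riding", "a skateboard"],
]
best = rank_sentences(candidates, phrase_scores)[0]
print(" ".join(best))
```

Averaging rather than summing keeps the ranking from systematically favoring longer or shorter sentences.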
◮ COCO dataset: 82,783/5,000/5,000 images, 5 sentences per image.
◮ Only phrases occurring at least 10 times:
  ◮ 8,982 NP (73%)
  ◮ 3,083 VP (75%)
  ◮ 189 PP (99%)
◮ Image features: VGG ConvNet pre-trained on ImageNet (4096-D vector).
◮ Word features: Hellinger PCA of a word co-occurrence matrix, built over English Wikipedia
◮ Trainable parameters $\theta$:
  ◮ $V \in \mathbb{R}^{400 \times 4096}$ → initialized randomly.
  ◮ $U \in \mathbb{R}^{400 \times |C|}$ → initialized by averaging word features + fine-tuned.
◮ 15 negative samples.
◮ Transition probabilities between phrases from COCO dataset.
◮ No smoothing.
◮ Subset of top-ranked phrases: 20 best NP, 5 best VP and 5 best PP.
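Estimating transition probabilities between phrases without smoothing amounts to normalizing counts. The sketch below uses first-order (bigram) transitions as a simplifying assumption, since the slides do not state the order of the model; the `<s>` and `</s>` boundary markers and the toy captions are also assumptions.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Maximum-likelihood estimate of P(next | previous), with no smoothing."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for prev, nexts in pair_counts.items()
    }

# Hypothetical training captions, already split into phrases.
captions = [
    ["a man", "is riding", "a skateboard"],
    ["a man", "is grinding", "a ramp"],
    ["a man", "is riding", "a ramp"],
]
probs = transition_probabilities(captions)
print(probs["a man"])
```

Without smoothing, a transition never seen in training is simply absent from the table and so has probability zero, which restricts generation to phrase sequences observed in the training captions.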
Phrase         BEFORE                  AFTER
A GREY CAT     A GREY DOG              A GRAY CAT
               A GREY AND BLACK CAT    A GREY AND BLACK CAT
               A GRAY CAT              A BROWN CAT
               A GREY ELEPHANT         A GREY AND WHITE CAT
               A YELLOW CAT            GREY AND WHITE CAT
HOME PLATE     A HOME PLATE            A HOME PLATE
               A PLATE                 HOME BASE
               ANOTHER PLATE           THE PITCH
               A RED PLATE             THE BATTER
               A DINNER PLATE          A BASEBALL PITCH
A HALF PIPE    A PIPE                  A PIPE
               A HALF                  THE RAMP
               A SMALL CLOCK           A HAND RAIL
               A LARGE CLOCK           A SKATE BOARD RAMP
               A SMALL PLATE           AN EMPTY POOL
◮ Generate image captions by inferring the phrases that best describe each image.
◮ Simple model that is very fast to train and test.
◮ We achieve results similar to CNN+RNN models.
◮ Enriching phrase representations with visual features.
◮ Leveraging unsupervised data.
◮ More complex language models.