SLIDE 1

Adversarial Reward Learning for Visual Storytelling

Xin Wang, Wenhu Chen, Yuan-Fang Wang, William Yang Wang

Presented by Maria Fabiano

SLIDE 2

Outline

1. Motivation
2. AREL Model Overview
3. Policy Model
4. Reward Model
5. AREL Objective
6. Data
7. Training and Testing
8. Evaluation
9. Critique

SLIDE 3

Motivation

The authors explore how well a computer can create a story from a set of images. Prior to this paper, little research had been done on visual storytelling. Visual storytelling requires a deeper understanding of images than image captioning: the model must understand more complicated visual scenarios, relate sequential images, and associate implicit concepts in the images (e.g., emotions).

SLIDE 4

Motivation: Problems with Previous Storytelling Approaches

  • RL
    ○ Hand-crafted rewards (e.g., METEOR) are too biased or too sparse to drive the policy search
    ○ Fail to learn implicit semantics (coherence, expressiveness, etc.)
    ○ Require extensive feature and reward engineering
  • GANs
    ○ Prone to unstable or vanishing gradients
  • IRL
    ○ Maximum-margin approaches, probabilistic approaches

SLIDE 5

AREL Model Overview: Adversarial REward Learning

  • Policy model: produces the story sequence from an image sequence
  • Reward model: learns an implicit reward function from human-annotated stories and sampled predictions
  • The two models are trained alternately via SGD

SLIDE 6

Policy Model

Takes an image sequence and sequentially chooses words from the vocabulary to create a story.

  • Images go through a pre-trained CNN
  • An encoder (bidirectional GRUs) extracts high-level features of the images
  • Five decoders (single-layer GRUs with shared weights) create five substories
  • The substories are concatenated into the full story
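To make the architecture concrete, here is a minimal PyTorch sketch of the pipeline described above. All names and dimensions (feat_dim=2048 for ResNet-style features, hid_dim=512) are illustrative assumptions, not the authors' code; the "five decoders with shared weights" are modeled as one GRU module reused for each image.

```python
import torch
import torch.nn as nn

class PolicyModel(nn.Module):
    """Sketch of the described policy: CNN features -> BiGRU encoder
    -> one shared GRU decoder applied to each of the five images."""

    def __init__(self, vocab_size, feat_dim=2048, hid_dim=512):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid_dim, bidirectional=True, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hid_dim)
        # A single decoder with shared weights plays the role of the
        # "five decoders": the same parameters produce every substory.
        self.decoder = nn.GRU(hid_dim + 2 * hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feats, substory_tokens):
        # img_feats: (batch, 5, feat_dim) from a pre-trained CNN (e.g., ResNet)
        # substory_tokens: (batch, 5, T) word ids, one row per substory
        ctx, _ = self.encoder(img_feats)               # (batch, 5, 2*hid_dim)
        logits = []
        for i in range(img_feats.size(1)):             # one substory per image
            emb = self.embed(substory_tokens[:, i])               # (batch, T, hid_dim)
            c = ctx[:, i : i + 1].expand(-1, emb.size(1), -1)     # image context
            h, _ = self.decoder(torch.cat([emb, c], dim=-1))
            logits.append(self.out(h))                 # (batch, T, vocab)
        # The five substories are concatenated to form the full story.
        return torch.stack(logits, dim=1)              # (batch, 5, T, vocab)
```

Reusing one decoder module is what "shared weights" means in practice: every substory is generated by the same parameters, only conditioned on a different image context.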

SLIDE 7

(Partial) Reward Model

Aims to derive a human-like reward from human-annotated stories and sampled predictions.
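A hedged sketch of what such a partial reward model could look like: a 1-D CNN over the substory's word embeddings, pooled to a scalar reward per substory. The pooling choice (mean) and all dimensions are assumptions, since, as the critique below notes, the paper does not specify the pooling; the visual-context conditioning is also omitted for brevity.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch of a partial reward model: embed a substory, convolve,
    pool, and project to a scalar reward. Mean pooling is an assumption."""

    def __init__(self, vocab_size, emb_dim=300, n_filters=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.score = nn.Linear(n_filters, 1)

    def forward(self, substory_tokens):
        # substory_tokens: (batch, T) word ids of one substory
        emb = self.embed(substory_tokens).transpose(1, 2)  # (batch, emb_dim, T)
        h = torch.relu(self.conv(emb))                     # (batch, n_filters, T)
        pooled = h.mean(dim=2)                             # assumed mean pooling
        return self.score(pooled).squeeze(-1)              # one reward per substory
```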

SLIDE 8

Adversarial Reward Learning: Reward Boltzmann Distribution

We achieve the optimal reward function R* when the Reward-Boltzmann distribution pθ equals the actual data distribution p*.

  • W = story
  • Rθ = reward function
  • Zθ = partition function (a normalizing constant)
  • pθ = approximate data distribution
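Written out, the Reward-Boltzmann distribution implied by the legend above takes the standard Boltzmann form (a transcription from these definitions, not the paper's exact notation):

```latex
p_\theta(W) = \frac{\exp\big(R_\theta(W)\big)}{Z_\theta},
\qquad
Z_\theta = \sum_{W} \exp\big(R_\theta(W)\big)
```

The optimal reward R* is reached exactly when pθ matches the true story distribution p*.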

SLIDE 9

Adversarial Reward Learning

We want the Reward-Boltzmann distribution pθ to get close to the actual data distribution p*.

  • Adversarial objective: a min-max two-player game. Maximize the similarity of pθ with the empirical distribution pe while minimizing the similarity of pθ with the data generated by the policy πβ. Meanwhile, πβ wants to maximize its similarity with pθ.
  • Distribution similarity is measured using KL-divergence (the two objectives are written out below).
  • The objective of the reward model is to distinguish between human-annotated stories and machine-generated stories.
    ○ Minimize KL-divergence with pe and maximize KL-divergence with πβ
  • The objective of the policy is to create stories indistinguishable from human-written stories.
    ○ Minimize KL-divergence with pθ
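The bullets above can be transcribed into standard KL notation as follows (a hedged formalization, assuming "similarity with pθ" means the KL-divergence to pθ):

```latex
% Reward model: stay close to the empirical distribution p_e,
% move away from the policy's distribution \pi_\beta
\min_{\theta}\ \mathrm{KL}\big(p_e \,\|\, p_\theta\big) - \mathrm{KL}\big(\pi_\beta \,\|\, p_\theta\big)

% Policy model: become indistinguishable from p_\theta
\min_{\beta}\ \mathrm{KL}\big(\pi_\beta \,\|\, p_\theta\big)
```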

SLIDE 10

Data

  • VIST dataset of Flickr photos aligned to stories
  • One sample is a story for five images from a photo album
  • The same album is paired with five different stories as references
  • Vocabulary of 9,837 words (a word must appear more than three times in the training set)

SLIDE 11

Training and Testing

1. Create a baseline model, XE-ss (cross-entropy loss with scheduled sampling), with the same architecture as the policy model
   a. Scheduled sampling uses a sampling probability to decide which action to take (see the sketch below)
2. Use XE-ss to initialize the policy model
3. Train with the AREL framework
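To illustrate step 1a: at each decoding step, scheduled sampling flips a (typically decaying) coin between the ground-truth token and the model's own previous prediction. A minimal sketch; the function name and signature are hypothetical:

```python
import random

def next_decoder_input(gold_token, predicted_token, epsilon):
    """Scheduled sampling: with probability epsilon feed the ground-truth
    (gold) token to the decoder; otherwise feed the model's own previous
    prediction. epsilon usually decays as training progresses."""
    return gold_token if random.random() < epsilon else predicted_token
```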

SLIDE 12

Training and Testing

  • Objective of the policy model: maximize similarity with pθ
  • Objective of the reward model: distinguish between human-generated and machine-generated stories
  • Alternate between training the policy and the reward model using SGD (a sketch follows this list)
    ○ N = 50 or 100
  • For testing, the policy uses beam search to create the story
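One plausible reading of the alternating schedule, sketched below: run N SGD steps on the reward model, then N on the policy, and repeat. The per-model updates are passed in as callables, since the exact interleaving and update details are assumptions, not the paper's code.

```python
def train_arel(policy_step, reward_step, batches, rounds, N=50):
    """Hedged sketch of alternating AREL training (N = 50 or 100 per the slide).
    policy_step / reward_step are caller-supplied SGD updates for each model."""
    for _ in range(rounds):
        for _ in range(N):
            reward_step(next(batches))   # reward: human vs. machine stories
        for _ in range(N):
            policy_step(next(batches))   # policy: maximize similarity with p_theta
```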

SLIDE 13

SLIDE 14

Automatic Evaluation

  • AREL achieves SOTA on all automatic metrics except ROUGE; however, the gains are very small, and the scores are very close to those of the baseline model and a vanilla GAN

[Chart: per-metric "Gain" and "Range of new methods"; the plotted gains fall between 0.2 and 2.2 points]

SLIDE 15

Human Evaluation

AREL greatly outperforms all other models in human evaluations:

  • Turing test
  • Relevance
  • Expressiveness
  • Concreteness

[Figure: comparison of Turing test results]

SLIDE 16

Critique: The “Good”

  • AREL: a novel framework of adversarial reward learning for telling stories
  • SOTA on the VIST dataset under automatic metrics
  • Empirically shows that automatic metrics are not great for training or evaluation
  • Comprehensive human evaluation via Mechanical Turk
    ○ Better results on relevance, expressiveness, and concreteness
    ○ Clear description of how the human evaluation was conducted

SLIDE 17

Critique: The “Not so Good”

  • Motivation: an interesting problem to solve, but what are the practical applications?
    ○ Limited to five photos per story
  • XE-ss: not mentioned until the evaluation section, yet it initializes AREL
  • Partial rewards: more discussion and motivation needed for this approach
  • Missing details
    ○ The type of pooling in the reward model is not specified (average? max?)
    ○ Is the pre-trained ResNet fine-tuned?
  • Data bias (gender and event): the model amplifies the largest majority’s influence
  • Small gains on automatic evaluation metrics, and XE-ss performs similarly to AREL; no direct comparison of human evaluation between AREL and previous methods
  • Possible human-evaluation improvements
    ○ Ask annotators why they judged a sentence to be machine-generated or not
    ○ Use rankings instead of pairwise comparisons
  • Decoder shared weights: maybe something specific about an image’s position calls for different weights (e.g., the structure of a narrative: setting, problem, rising action, climax, falling action, resolution)