Language to Image Generation



SLIDE 1
SLIDE 2

“Generate a bird with wings that are black and a white belly”
“Generate a bird with wings that are blue and a red belly”
“Generate a bird with wings that are red and a yellow belly”

ARTIFICIAL IMAGINATION

Language to Image Generation

SLIDE 3

Language-to-Image generation with GANs

SLIDE 4

Language-to-Image generation with GANs

SLIDE 5

Propose AttnGANs to improve image generation

SLIDE 6

[Figure: AttnGAN architecture. Components: Text Encoder (sentence feature, word features); Attentional Generative Network with z ~ N(0, I), c, F^ca, F0, F1, F2, attention models F1^attn and F2^attn, hidden features h0, h1, h2, generators G0, G1, G2 (outputs 64x64x3, 128x128x3, 256x256x3), and discriminators D0, D1, D2; Deep Attentional Multimodal Similarity Model (DAMSM) with Image Encoder (local image features). Legend: Conv3x3, Joining, Upsampling, FC with reshape, Residual. Example caption: "this bird is red with white and has a very short beak". Input: training pairs.]

SLIDE 7

Attentional Generative Network:

  • Takes multi-level conditions (global-level sentence feature and fine-grained word features) as input.
  • Generates images from low-to-high resolutions at multiple stages.
SLIDE 8

▪ In the first stage:
  ▪ based on the sentence feature, an image with basic color and shape is generated by generator G0;
  ▪ hidden features h0 are decoded from the sentence feature.
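A minimal sketch of the "FC with reshape" step that produces h0, under stated assumptions: the sizes, the random weight matrix `W_fc`, the ReLU activation, and the helper name `initial_hidden_features` are hypothetical, and the sentence-conditioned vector `c` is taken as given rather than computed from a real text encoder.

```python
import numpy as np

def initial_hidden_features(z, c, W_fc, channels, size):
    """Concatenate noise z and condition c, apply one FC layer, reshape to a feature map."""
    x = np.concatenate([z, c])        # joint input vector
    h = np.maximum(W_fc @ x, 0.0)     # fully connected layer; ReLU is an assumption
    return h.reshape(channels, size, size)

rng = np.random.default_rng(0)
z_dim, c_dim, channels, size = 100, 100, 8, 4   # illustrative sizes only
z = rng.standard_normal(z_dim)                  # z ~ N(0, I)
c = rng.standard_normal(c_dim)                  # stand-in for the sentence-conditioned vector
W_fc = rng.standard_normal((channels * size * size, z_dim + c_dim)) * 0.01
h0 = initial_hidden_features(z, c, W_fc, channels, size)
print(h0.shape)  # (8, 4, 4)
```

In a full model, h0 would then pass through upsampling blocks before G0 renders the 64x64 image; those blocks are omitted here.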

SLIDE 9

▪ In the following stages, attention models are built.
▪ For each region feature of the previously generated image, compute its word-context vector.
▪ Concatenate the previous image region features (e.g., h0) with the word-context vectors to generate an image at higher resolution.
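The attention step above can be sketched as follows. This is an illustration, not the authors' code: it assumes the word features have already been projected into the same space as the region features, and the function name `word_context_vectors` and all sizes are hypothetical.

```python
import numpy as np

def word_context_vectors(h, e):
    """h: (D, N) region features; e: (D, T) word features in the same space.
    Returns (D, N) word-context vectors, one per image region."""
    scores = h.T @ e                                      # (N, T) region-word dot products
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    beta = np.exp(scores)
    beta /= beta.sum(axis=1, keepdims=True)               # softmax over words, per region
    return e @ beta.T                                     # weighted sum of word features

rng = np.random.default_rng(0)
D, N, T = 16, 64, 11                      # illustrative sizes only
h0 = rng.standard_normal((D, N))          # stand-in for previous-stage region features
e = rng.standard_normal((D, T))           # stand-in for word features
c = word_context_vectors(h0, e)
joined = np.concatenate([h0, c], axis=0)  # input to the next, higher-resolution stage
print(c.shape, joined.shape)  # (16, 64) (32, 64)
```

Each region ends up with its own mixture of words, so a region on the wing can attend to "red" while a region on the head attends to "beak".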

SLIDE 10

The conditional GAN loss: $L_{GAN} = V(D_0, G_0) + V(D_1, G_1) + V(D_2, G_2)$

SLIDE 11

[Figure: AttnGAN architecture diagram, repeated from slide 6.]

SLIDE 12

❖ Text encoder (LSTM) extracts word features $e_1, e_2, \ldots, e_T$.
❖ Image encoder (CNN) extracts image region features $v_1, v_2, \ldots, v_N$, where N = 288.
❖ Attention mechanism: for the i-th word, compute its region-context vector $c_i$,
  • $\bar{s}_{i,j}$ is the dot product between features of the i-th word and the j-th image region;
❖ Compute the similarity score $R(c_i, e_i)$ between word and image from the cosine similarity between $e_i$ and $c_i$;
❖ Compute the similarity score between the sentence (D) and the image (Q) from the fine-grained word-region pair information.
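A rough sketch of these steps, with several assumptions made explicit: the smoothing factors `gamma1` and `gamma2`, the log-sum-exp aggregation of the per-word scores into one sentence-image score, and the function names are illustrative choices, not taken from the slide. Random arrays stand in for real encoder outputs.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def sentence_image_score(e, v, gamma1=5.0, gamma2=5.0):
    """e: (D, T) word features; v: (D, N) region features.
    Returns the sentence-image score and the per-word cosine scores."""
    s_bar = e.T @ v                            # (T, N): dot product of word i and region j
    alpha = softmax(gamma1 * s_bar, axis=1)    # attention over regions, per word (assumed form)
    c = v @ alpha.T                            # (D, T): region-context vector c_i per word
    cos = (c * e).sum(axis=0) / (np.linalg.norm(c, axis=0) * np.linalg.norm(e, axis=0))
    r_qd = np.log(np.exp(gamma2 * cos).sum()) / gamma2   # assumed aggregation over words
    return r_qd, cos

rng = np.random.default_rng(0)
D, T, N = 16, 9, 288                # N = 288 regions as on the slide; D and T are illustrative
e = rng.standard_normal((D, T))
v = rng.standard_normal((D, N))
r_qd, cos = sentence_image_score(e, v)
print(cos.shape)  # (9,)
```

The log-sum-exp form lets the best-matching words dominate the sentence-level score while still being differentiable.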

SLIDE 13

❖ The DAMSM loss: maximize the similarity score between the images and their corresponding text descriptions (ground truth), i.e.,

  • M is the number of training pairs.

❖ The DAMSM loss provides a fine-grained word-region matching loss for training the generator.
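One way to turn "maximize the similarity of matched pairs" into a loss is a softmax over candidate descriptions, sketched below. This is an assumption about the form, not the formula from the slide (which did not survive extraction): `gamma3`, the one-directional image-to-text term, and the name `matching_loss` are all illustrative.

```python
import numpy as np

def matching_loss(scores, gamma3=10.0):
    """scores[i, j]: similarity between image i and description j, for M pairs.
    Description i is the ground-truth match for image i."""
    logits = gamma3 * scores
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log P(D_j | Q_i)
    return -np.trace(log_p)    # sum over i of -log P(D_i | Q_i)

M = 4
good = np.eye(M)               # matched pairs score 1, mismatched pairs score 0
bad = np.full((M, M), 0.5)     # every pair looks equally similar
print(matching_loss(good) < matching_loss(bad))  # True
```

Minimizing this loss raises the score of each matched pair relative to the mismatched pairs, which is the behavior the bullet describes.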

SLIDE 14

The final objective function: $L = L_{GAN} + \lambda L_{DAMSM}$
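Reading the objective as L = L_GAN + λ·L_DAMSM, with L_GAN summed over the three generator stages, the combination can be sketched as below. The function name and all numeric values are placeholders for illustration, not values reported in the talk.

```python
def total_generator_loss(stage_gan_losses, damsm_loss, lam):
    """L = L_GAN + lambda * L_DAMSM, where L_GAN sums the per-stage GAN terms."""
    l_gan = sum(stage_gan_losses)
    return l_gan + lam * damsm_loss

# Placeholder numbers: three stage losses, one DAMSM loss, and a weight lambda.
loss = total_generator_loss([0.9, 0.7, 0.6], damsm_loss=2.0, lam=5.0)
print(loss)
```

The weight λ controls how strongly the text-image matching term influences the generator relative to the adversarial terms.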

SLIDE 15

Datasets          CUB-2011            MS-COCO
                  train     test      train     test
# samples         8,855     2,933     80,000    40,000
captions/image    10        10        5         5

SLIDE 16

Dataset   GAN-INT-CLS [1]   GAWWN [2]    StackGAN [3]   StackGAN-v2 [4]   PPGN [5]     Our AttnGAN
CUB       2.88 ± .04        3.62 ± .07   3.70 ± .04     3.82 ± .06        \            4.36 ± .03
COCO      7.88 ± .07        \            8.45 ± .03     \                 9.58 ± .21   25.89 ± .47

  • On the CUB dataset, our AttnGAN achieves an inception score of 4.36, which significantly outperforms the previous best inception score of 3.82.
  • On the COCO dataset, our AttnGAN boosts the best reported inception score from 9.58 to 25.89, a 170.25% relative improvement.
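The relative improvements follow directly from the scores quoted above; a short check (the helper name `relative_improvement` is ours):

```python
def relative_improvement(previous_best, new):
    """Percentage gain of `new` over `previous_best`."""
    return (new - previous_best) / previous_best * 100.0

cub = relative_improvement(3.82, 4.36)
coco = relative_improvement(9.58, 25.89)
print(f"CUB: {cub:.2f}%  COCO: {coco:.2f}%")  # CUB: 14.14%  COCO: 170.25%
```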

[1] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
[2] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
[3] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
[4] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv:1710.10916, 2017.
[5] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.

SLIDE 17

A higher inception score means better image quality and diversity. A higher R-precision rate means the generated image is better conditioned on the text.

[Figure: the inception score and the corresponding R-precision rate of AttnGAN models on CUB.]

  • “AttnGAN1” architecture has one attention model and generates images of 128x128 resolution;
  • “AttnGAN2” architecture has two attention models and generates images of 256x256 resolution.
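For readers unfamiliar with R-precision, a generic sketch follows: rank candidate descriptions by cosine similarity to the image embedding and count how many of the top R are relevant. The embeddings here are random stand-ins, and the candidate count, dimensions, and function name are assumptions for illustration rather than the evaluation setup used in the talk.

```python
import numpy as np

def r_precision(image_vec, text_vecs, relevant, R=1):
    """Fraction of the top-R candidate texts (by cosine similarity) that are relevant."""
    img = image_vec / np.linalg.norm(image_vec)
    txt = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    sims = txt @ img                     # cosine similarity of each candidate to the image
    top = np.argsort(-sims)[:R]          # indices of the R most similar candidates
    return len(set(top.tolist()) & set(relevant)) / R

rng = np.random.default_rng(0)
text_vecs = rng.standard_normal((100, 32))                     # 100 candidate descriptions
image_vec = text_vecs[7] + 0.01 * rng.standard_normal(32)      # image close to description 7
print(r_precision(image_vec, text_vecs, relevant={7}))  # 1.0
```

A generator that ignores the text would produce images whose embeddings do not single out the true description, giving a low rate.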
SLIDE 18
SLIDE 19
SLIDE 20

this bird red white a very short beak

SLIDE 21

A fruit stand display with bananas and kiwi.

SLIDE 22
SLIDE 23
SLIDE 24
SLIDE 25
SLIDE 26
SLIDE 27
SLIDE 28
SLIDE 29

  • A herd of cows that are grazing on the grass.
  • A stop sign flying in the sky.
  • An old clock next to a light post in front of a steeple.
  • The girl is surfing a small wave in the water.
  • A red bus is floating on a lake.

  • I think it's a herd of cattle grazing on a lush green field.
  • I think it's a clock tower in the middle of the street.
  • I think it's a red and white sign.
  • I think it's a young girl riding a wave on a surfboard in the water.
  • I think it's a boat that is sitting on a bus.

What Microsoft CaptionBot sees… https://www.captionbot.ai/

SLIDE 30

https://github.com/taoxugit/AttnGAN https://www.captionbot.ai/

SLIDE 31
SLIDE 32

The conditional GAN loss: $L_{GAN} = V(D_0, G_0) + V(D_1, G_1) + V(D_2, G_2)$