Self-Attention For Generative Models
Ashish Vaswani and Anna Huang
Joint work with: Noam Shazeer, Niki Parmar, Lukasz Kaiser, Illia Polosukhin, Llion Jones, Justin Gilmer, David Bieber, Jonathan Frankle, Jakob Uszkoreit, and
others.
Learning representations of variable-length data: the basic building block of sequence-to-sequence learning (neural machine translation, summarization, QA, …).
RNNs: the model of choice for learning variable-length representations. A natural fit for sentences and sequences of pixels. LSTMs, GRUs and variants dominate recurrent models.
Sequential computation inhibits parallelization. No explicit modeling of long- and short-range dependencies. We want to model hierarchy. RNNs (with sequence-aligned states) seem wasteful!
CNNs: trivial to parallelize (per layer). Exploit local dependencies. 'Interaction distance' between positions is linear or logarithmic. Long-distance dependencies require many layers.
Attention between encoder and decoder is crucial in NMT. Why not use attention for representations?
Self-attention: constant 'path length' between any two positions. Gating/multiplicative interactions. Trivial to parallelize (per layer). Can it replace sequential computation entirely?
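For reference, a minimal NumPy sketch of scaled dot-product self-attention, softmax(QKᵀ/√d)·V, the form used in "Attention Is All You Need"; the projection matrices and shapes here are illustrative placeholders:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    # x: (length, d_model); wq/wk/wv: (d_model, d) projection matrices.
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)             # (length, length) pairwise scores
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    # Every pair of positions interacts in a single step: constant path length.
    return weights @ v
```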
Classification & regression with self-attention: Parikh et al. (2016), Lin et al. (2016). Self-attention with RNNs: Long et al. (2016), Shao, Gouws et al. (2017). Recurrent attention: Sukhbaatar et al. (2015).
FLOPs (length = 1000, dim = 1000, kernel_width = 3):

Self-Attention   O(length² · dim)                 = 4·10⁹
RNN (LSTM)       O(length · dim²)                 = 16·10⁹
Convolution      O(length · dim² · kernel_width)  = 6·10⁹
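A quick sanity check of the leading-order terms (the slide's figures include per-architecture constant factors, such as the LSTM's gates, which this sketch omits):

```python
length, dim, kernel_width = 1000, 1000, 3

# Leading-order FLOP counts, constants dropped.
print(f"self-attention ~ {length**2 * dim:.1e}")                 # grows with length²
print(f"rnn (lstm)     ~ {length * dim**2:.1e}")                 # grows with dim²
print(f"convolution    ~ {length * dim**2 * kernel_width:.1e}")  # dim² times kernel
```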
[Figure: self-attention over the sentence "The cat stuck out its tongue and licked its …"]
[Figure: attention for "I kicked the ball" — who? did what? to whom?]
Convolution: different linear transformations by relative position.
Multi-head attention: parallel attention layers with different linear transformations on input and output.
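A minimal NumPy sketch of that idea; the head count and weight shapes are assumptions for illustration:

```python
import numpy as np

def multi_head_attention(x, wq, wk, wv, wo, num_heads=8):
    # Parallel attention heads, each seeing a different linear
    # transformation of the input; outputs are concatenated and projected.
    length, d_model = x.shape
    d_head = d_model // num_heads
    q, k, v = x @ wq, x @ wk, x @ wv                  # (length, d_model)
    # Split into heads: (num_heads, length, d_head).
    split = lambda t: t.reshape(length, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax per head
    heads = w @ v                                     # (num_heads, length, d_head)
    concat = heads.transpose(1, 0, 2).reshape(length, d_model)
    return concat @ wo                                # output transformation
```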
Results
Machine translation results (WMT 2014, BLEU):

Model         EN-DE   EN-FR
GNMT (orig)   24.6    39.9
ConvSeq2Seq   25.2    40.5
Transformer*  28.4    41.8

*Transformer models trained >3x faster than the others.
Attention Is All You Need (NeurIPS 2017). Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (equal contribution).
Open-source implementations: tensor2tensor, Sockeye.
Residuals carry positional information to higher layers, among other information.
[Figure: attention distributions — with residuals; without residuals; without residuals, with timing signals]
Training details:
- ADAM optimizer with a learning rate warmup (warmup + exponential decay; see the sketch below)
- Dropout during training at every layer, just before adding the residual
- Layer-norm
- Attention dropout (for some experiments)
- Checkpoint averaging
- Label smoothing
- Auto-regressive decoding with beam search and length biasing
- …
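The exact decay varies by implementation; the schedule published in "Attention Is All You Need" uses linear warmup followed by inverse-square-root decay. A minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for `warmup_steps` steps, then decay ∝ 1/sqrt(step),
    # scaled by 1/sqrt(d_model), as in the paper's published recipe.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```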
Results
Generating Wikipedia by Summarizing Long Sequences (Liu, Saleh, et al., submission to ICLR 2018):

Model                        ROUGE
seq2seq-attention            12.7
Transformer-ED (L=500)       34.2
Transformer-DMCA (L=11000)   36.2
https://en.wikipedia.org/wiki/Self-similarity
Starry Night (Van Gogh, June 1889)
Motifs repeat, immediately and also at a distance
Model the joint distribution of pixels, turning image generation into a sequence modeling problem. Assigning probabilities allows measuring generalization.
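Concretely, raster-ordered pixels x₁…xₙ are modeled autoregressively with the chain rule; the cross-entropy numbers reported later in the deck are this negative log-likelihood per dimension:

```latex
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
```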
RNNs and CNNs are state-of-the-art (PixelRNN, PixelCNN). CNNs incorporating gating now match RNNs in quality. CNNs are much faster due to parallelization.
van den Oord et al. (2016), Salimans et al. (2017), Kalchbrenner et al. (2016)
Long-range dependencies matter for images (e.g. symmetry), and likely become increasingly important with increasing image size. Modeling long-range dependencies with CNNs requires either many layers (likely making training harder) or large kernels (at large parameter/computational cost).
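To make the many-layers cost concrete, a rough count of the depth needed before the first and last positions of a flattened 32×32×3 image can interact (a sketch assuming stride-1, undilated convolutions, where the receptive field grows by kernel_width − 1 per layer):

```python
k = 3                     # kernel width
length = 32 * 32 * 3      # CIFAR-10 image flattened to a sequence (3072)

# Receptive field after n stride-1 layers is 1 + n * (k - 1).
layers_needed = -(-(length - 1) // (k - 1))  # ceil((length - 1) / (k - 1))
print(layers_needed)      # 1536
```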
Texture Synthesis by Non-parametric Sampling (Efros and Leung, 1999)
A Non-local Algorithm for Image Denoising (Buades, Coll, and Morel, 2005)
Non-local Neural Networks (Wang et al., 2018)
Self-attention: Parikh et al. (2016), Lin et al. (2016), Vaswani et al. (2017). Autoregressive image generation: van den Oord et al. (2016), Salimans et al. (2017).
FLOPs (length = 3072 for a 32×32×3 image):

Self-Attention   O(length² · dim)
RNN (LSTM)       O(length · dim²)
Convolution      O(length · dim² · kernel_width)
Restrict the attention windows to be local neighborhoods. A good assumption for images because of spatial locality.
[Figure: local 1D and local 2D attention neighborhoods around a query position (x, y)]
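A simplified NumPy sketch of local 1D attention, where queries attend only within their own block; the Image Transformer's actual scheme also attends to a preceding memory block and applies causal masking, and the block size here is an arbitrary assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def local_attention_1d(q, k, v, block_size=256):
    # q, k, v: (length, dim). Attention is computed block-by-block,
    # so cost is O(length * block_size * dim) instead of O(length² * dim).
    length, dim = q.shape
    out = np.zeros_like(v)
    for start in range(0, length, block_size):
        end = min(start + block_size, length)
        logits = q[start:end] @ k[start:end].T / np.sqrt(dim)
        out[start:end] = softmax(logits) @ v[start:end]
    return out
```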
Tasks: super-resolution; unconditional and conditional image generation.
Image Transformer
Parmar*, Vaswani*, Uszkoreit, Kaiser, Shazeer, Ku, and Tran. ICML 2018
Model                         CIFAR-10 (test)   ImageNet (validation)
PixelRNN                      3.00              3.86
Gated PixelCNN                3.03              3.83
PixelCNN++                    2.92 (dmol)       –
PixelSNAIL                    2.85              3.8
Image Transformer, 1D local   2.90 (xent)       3.77
Image Transformer, 1D local   2.90 (dmol)       3.78

Cross entropy of various models on the CIFAR-10 and ImageNet datasets.
[Figure: CelebA super-resolution samples — input, local 1D, local 2D, and ground truth, at τ = 0.8, 0.9, 1.0]
% Fooled (higher is better)          τ = n/a   τ = 1.0       τ = 0.9      τ = 0.8
ResNet                               4.0       –             –            –
srez GAN                             8.5       –             –            –
PixelRecursive (Dahl et al., 2017)   –         10.4          10.25        –
Image Transformer, 1D local          –         35.94 ± 3.0   33.5 ± 3.5   29.6 ± 4.0
Image Transformer, 2D local          –         36.11 ± 2.5   34 ± 3.5     30.64 ± 4.0

Human eval performance for the Image Transformer on CelebA. The fraction of humans fooled is significantly better than the previous state of the art.
Music Transformer (ICLR 2019) by Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu and Douglas Eck. Blog post: https://magenta.tensorflow.org/music-transformer
(Image from Simon & Oore, 2016)
[Figure: language (text, speech) and music as sequences, modeled by an RNN over inputs x with hidden states h_t]
Event-based representation: Note on, Note off, Note velocity, Advance clock.
Prior work: Performance RNN (Simon & Oore, 2016).
Continuations of a given motif: RNN-LSTM vs. Transformer vs. Music Transformer.
TimeShift100 TimeShift100 TimeShift30 NoteOn60 TimeShift20 NoteOn62 TimeShift90 NoteOff62 NoteOff60 TimeShift90
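To illustrate what this encoding captures, a small parser for such an event stream; the token spellings and the abstract time units are assumptions for illustration:

```python
def decode_events(events):
    # TimeShift<t> advances the clock; NoteOn<p>/NoteOff<p> start and
    # end a pitch. Returns (pitch, start_time, end_time) triples.
    time, onsets, notes = 0, {}, []
    for ev in events:
        if ev.startswith("TimeShift"):
            time += int(ev[len("TimeShift"):])
        elif ev.startswith("NoteOn"):
            onsets[int(ev[len("NoteOn"):])] = time
        elif ev.startswith("NoteOff"):
            pitch = int(ev[len("NoteOff"):])
            notes.append((pitch, onsets.pop(pitch), time))
    return notes

print(decode_events(
    "TimeShift100 TimeShift100 TimeShift30 NoteOn60 TimeShift20 "
    "NoteOn62 TimeShift90 NoteOff62 NoteOff60 TimeShift90".split()))
# [(62, 250, 340), (60, 230, 340)]
```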
Convolution: different linear transformations by relative position.
Multihead attention + convolution?
Relative attention modulates the logits by relative position: alongside Q·Kᵀ, compute Q·Erᵀ, where Er holds an embedding for each pairwise relative distance (entries indexed by query position and relative distance).
Model            Position representation   BLEU EN-DE   BLEU EN-FR
Transformer Big  Absolute                  27.9         41.3
Transformer Big  Relative                  29.2         41.5
Per layer, at L=2048 and D=512:
Previous work: instantiate relative embeddings for every pair of positions (absolute-by-relative, gathered into absolute-by-absolute), requiring O(L²D) memory: 8.5 GB.
Our work: multiply Q directly by the relative embeddings Er, then skew the result (pad, reshape, slice) so that Srel = skew(Q·Erᵀ) is indexed by absolute positions, requiring O(LD) memory: 4.2 MB.
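A NumPy sketch of the skew step, following the pad-reshape-slice procedure described in the Music Transformer paper (batch and head dimensions omitted):

```python
import numpy as np

def skew(rel_logits):
    # rel_logits = Q @ Er.T, shape (L, L); column r holds the logit for
    # relative distance r - (L - 1), so the last column is distance 0.
    L = rel_logits.shape[0]
    padded = np.pad(rel_logits, [(0, 0), (1, 0)])  # dummy column on the left
    srel = padded.reshape(L + 1, L)[1:]            # realign rows to absolute positions
    # srel[i, j] = q_i · e_(j - i) for j <= i; entries with j > i are
    # junk that causal masking discards.
    return srel
```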
Relational inductive biases, deep learning, and graph networks (Battaglia et al., 2018).
Self-Attention with Relative Position Representations (Shaw et al., 2018).
Neural Message Passing for Quantum Chemistry (Gilmer et al., 2017). Slide credit: Justin Gilmer.
Multiple towers: run several smaller MPNNs in parallel, mixing their node states with a mixing network after each message pass. With the same total node dimension d this gives a >2x speedup when d=200, with no loss in performance when used with the matrix-multiply message function.
Slide credit: Justin Gilmer
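For context, a minimal sketch of one message-passing step with a matrix-multiply message function; the tanh update here is an assumption for brevity (the MPNN paper uses a GRU update):

```python
import numpy as np

def mpnn_step(adj, h, w_msg, w_update):
    # adj: (n, n) adjacency matrix; h: (n, d) node states.
    # Message function: a linear map of neighbor states, summed over
    # neighbors via the adjacency matrix.
    messages = adj @ (h @ w_msg)
    # Update function: combine each node's state with its messages.
    return np.tanh(h @ w_update + messages)
```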
Code: with Justin Gilmer, Jonathan Frankle, and David Bieber.
Constant 'path length' between any two positions. Unbounded memory. Trivial to parallelize (per layer). Models self-similarity. Relative attention provides expressive timing, equivariance, and extends naturally to graphs.
Non-Autoregressive Transformer (Gu, Bradbury, et al., 2018).
Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee, Mansimov, and Cho, 2018).
Fast Decoding in Sequence Models Using Discrete Latent Variables (ICML 2018). Kaiser, Roy, Vaswani, Parmar, Bengio, Uszkoreit, Shazeer.
Towards a Better Understanding of Vector Quantized Autoencoders (2018). Roy, Vaswani, Parmar, Neelakantan.
Blockwise Parallel Decoding for Deep Autoregressive Models (NeurIPS 2018). Stern, Shazeer, Uszkoreit.
Improving Language Understanding by Generative Pre-Training (Radford, Narasimhan, Salimans, and Sutskever).
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin, Chang, Lee, and Toutanova).
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (ICML 2018). Shazeer, Stern.
Memory-Efficient Adaptive Optimization for Large-Scale Learning (2019). Anil, Gupta, Koren, Singer.
Mesh-TensorFlow: Deep Learning for Supercomputers (NeurIPS 2018). Shazeer, Cheng, Parmar, Tran, Vaswani, Koanantakool, Hawkins, Lee, Hong, Young, Sepassi, Hechtman. Code (5 billion parameters).
Generating Wikipedia by Summarizing Long Sequences (ICLR 2018). Liu, Saleh, Pot, Goodrich, Sepassi, Shazeer, Kaiser.
Universal Transformers (ICLR 2019). Dehghani*, Gouws*, Vinyals, Uszkoreit, Kaiser.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019). Dai, Yang, Yang, Carbonell, Le, Salakhutdinov.
A Time-Restricted Self-Attention Layer for ASR (ICASSP 2018). Povey, Hadian, Ghahremani, Li, Khudanpur.
Character-Level Language Modeling with Deeper Self-Attention (2018). Al-Rfou*, Choe*, Guo*, Constant*, Jones*.
Self-supervision and classification for images and video
Understanding transfer
Multitask learning
Long-range attention