Alternative Architectures

Philipp Koehn 15 October 2020


Alternative Architectures

  • We introduced one translation model

– attentional seq2seq model – core organizing feature: recurrent neural networks

  • Other core neural architectures

– convolutional neural networks – attention

  • But first: look at various components of neural architectures


components


Components of Neural Networks

  • Neural networks originally inspired by the brain

– a neuron receives signals from other neurons
– if sufficiently activated, it sends signals
– feed-forward layers are roughly based on this

  • Computation graph

– any function possible, as long as it is partially differentiable
– not limited by appeals to biological validity

  • Deep learning may be a better name


Feed-Forward Layer

  • Classic neural network component
  • Given an input vector x, multiply by a matrix M and add a bias vector b

Mx + b

  • Adding a non-linear activation function

y = activation(Mx + b)

  • Notation

y = FF_activation(x) = activation(Mx + b)
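A minimal NumPy sketch of such a layer (the tanh activation and the variable names are illustrative choices, not prescribed by the slides):

```
import numpy as np

def feed_forward(x, M, b, activation=np.tanh):
    """Feed-forward layer: y = activation(Mx + b)."""
    return activation(M @ x + b)

# toy example: map a 4-dimensional input to a 3-dimensional output
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 4))   # weight matrix
b = np.zeros(3)               # bias vector
x = rng.normal(size=4)        # input vector
print(feed_forward(x, M, b))  # 3-dimensional output y
```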


Feed-Forward Layer

  • Historic neural network designs: several feed-forward layers

– input layer
– hidden layers
– output layer

  • Powerful tools for a wide range of machine learning problems
  • Matrix multiplication with a bias (Mx + b) is also called an affine transform

– appeals to its geometrical properties
– straight lines in the input remain straight lines in the output


Factored Decomposition

  • One challenge: very large input and output vectors
  • Number of parameters in matrix M = |x| × |y|

⇒ Need to reduce size of matrix

  • Solution: first reduce to smaller representation

(Diagram: direct mapping from x to y with matrix M vs. factored mapping from x through a bottleneck v to y with matrices A and B)


Factored Decomposition: Math


  • Intuition

– given a high-dimensional vector x
– first map it into a lower-dimensional vector v (matrix A)
– then map to the output vector y (matrix B)

v = Ax
y = Bv = BAx

  • Example

– |x| = 20,000, |y| = 50,000 → M = 1,000,000,000
– |v| = 100 → A = 20,000 × 100 = 2,000,000, B = 100 × 50,000 = 5,000,000
– reduction from 1,000,000,000 to 7,000,000
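A small sketch of the factored mapping and its parameter savings, using the sizes from the example above (random matrices stand in for trained parameters):

```
import numpy as np

nx, ny, nv = 20_000, 50_000, 100          # |x|, |y|, |v|

rng = np.random.default_rng(0)
A = rng.normal(size=(nv, nx))             # maps x to the bottleneck v
B = rng.normal(size=(ny, nv))             # maps v to the output y

x = rng.normal(size=nx)
v = A @ x                                 # v = Ax
y = B @ v                                 # y = Bv = BAx

print(A.size + B.size)                    # 7,000,000 parameters
print(nx * ny)                            # 1,000,000,000 for a direct matrix M
```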


Factored Decomposition: Interpretation

  • Vector v is a bottleneck feature
  • Forced to capture salient features
  • One example: word embeddings


basic mathematical operations


Concatenation

  • Often multiple input vectors to a processing step
  • For instance, in a recurrent neural network

– input word – previous state

  • Combined in feed-forward layer

y = activation(M1x1 + M2x2 + b)

  • Another view

x = concat(x1, x2)
y = activation(Mx + b)

  • Splitting hairs here, but concatenation is useful generally
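A quick NumPy check that the two views are the same computation, i.e. splitting M into M1 and M2 matches concatenating the inputs (sizes are arbitrary):

```
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=5), rng.normal(size=3)
M1, M2 = rng.normal(size=(4, 5)), rng.normal(size=(4, 3))
b = rng.normal(size=4)

# view 1: one matrix per input vector
y1 = np.tanh(M1 @ x1 + M2 @ x2 + b)

# view 2: concatenate the inputs, use one matrix M = [M1 M2]
y2 = np.tanh(np.concatenate([M1, M2], axis=1) @ np.concatenate([x1, x2]) + b)

print(np.allclose(y1, y2))  # True
```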


Addition

  • Adding vectors: very simplistic, but often done
  • Example: compute sentence embeddings s from word embeddings w1, ..., wn

s = Σi wi

  • Reduces a varying-length sentence representation into a fixed-size vector
  • Maybe weight the words, e.g., by attention
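A sketch of both variants, summing word embeddings into a sentence vector and weighting them first (the embeddings and weights here are random placeholders):

```
import numpy as np

rng = np.random.default_rng(0)
n, dim = 7, 16                        # sentence length, embedding size
W = rng.normal(size=(n, dim))         # word embeddings w1 ... wn

s = W.sum(axis=0)                     # s = Σi wi

alpha = rng.random(n)                 # e.g. attention weights (here arbitrary)
alpha /= alpha.sum()                  # normalized to sum to 1
s_weighted = alpha @ W                # s = Σi αi wi

print(s.shape, s_weighted.shape)      # (16,) (16,)
```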


Multiplication

  • Another elementary mathematical operation
  • Three ways to multiply vectors

– element-wise multiplication:
  v ⊙ u = (v1, v2)ᵀ ⊙ (u1, u2)ᵀ = (v1 × u1, v2 × u2)ᵀ
– dot product, used for a simple version of the attention mechanism:
  v · u = vᵀu = v1 × u1 + v2 × u2
– third possibility: the outer product vuᵀ, not commonly done
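The three products in NumPy for two small vectors:

```
import numpy as np

v = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])

print(v * u)           # element-wise: [3. 8.]
print(v @ u)           # dot product: 11.0 (used in simple attention)
print(np.outer(v, u))  # outer product v uᵀ: a 2x2 matrix, rarely used here
```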


Maximum

  • Goal: reduce the dimensionality of representation
  • Example: detect if a face is in an image

– any region of the image may have a positive match
– represent different regions with elements in a vector
– maximum value indicates whether any region contains a face

  • Max pooling

– given: n-dimensional vector
– goal: reduce to n/k-dimensional vector
– method: break up the vector into blocks of k elements, map each block to a single value (its maximum)
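A minimal max-pooling sketch over a vector; the block size k is an illustrative parameter:

```
import numpy as np

def max_pool(x, k):
    """Reduce an n-dimensional vector to n/k values: maximum over blocks of k."""
    assert len(x) % k == 0, "vector length must be divisible by k"
    return x.reshape(-1, k).max(axis=1)

x = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])
print(max_pool(x, k=2))   # [0.9 0.3 0.8]
```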


Max Out

  • Max out

– first branch out into multiple feed-forward layers: W1x + b1, W2x + b2
– element-wise maximum: maxout(x) = max(W1x + b1, W2x + b2)

  • ReLU activation is a maxout layer: maximum of a feed-forward layer and 0

ReLU(x) = max(Wx + b, 0)
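A sketch of a two-branch maxout layer, with ReLU as the special case of taking the maximum with zero (parameters are random placeholders):

```
import numpy as np

def maxout(x, W1, b1, W2, b2):
    """Element-wise maximum of two feed-forward branches."""
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

def relu_layer(x, W, b):
    """ReLU over a feed-forward layer: max(Wx + b, 0)."""
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
b1, b2 = np.zeros(3), np.zeros(3)
print(maxout(x, W1, b1, W2, b2))
print(relu_layer(x, W1, b1))
```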


processing sequences


Recurrent Neural Networks

  • Already described recurrent neural networks at length

– propagate a state s
– over time steps t
– receiving an input xt at each turn

st = f(st−1, xt)

(state may be computed as a feed-forward layer)

  • More successful

– gated recurrent units (GRU) – long short-term memory cells (LSTM)

  • Good fit for sequences, like words in a sentence

– humans also receive word by word – most recent words most relevant → closer to current state

  • But computationally problematic: very long computation chains
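A bare-bones recurrent loop over a sentence, with a plain feed-forward state update; a GRU or LSTM cell would replace the update inside the loop:

```
import numpy as np

def rnn(inputs, Ws, Wx, b):
    """Plain RNN: st = tanh(Ws st−1 + Wx xt + b) over all time steps."""
    s = np.zeros(Ws.shape[0])
    states = []
    for x_t in inputs:                     # one long sequential chain
        s = np.tanh(Ws @ s + Wx @ x_t + b)
        states.append(s)
    return states

rng = np.random.default_rng(0)
dim_s, dim_x, length = 8, 4, 6
inputs = rng.normal(size=(length, dim_x))  # word embeddings x1 ... x6
states = rnn(inputs, rng.normal(size=(dim_s, dim_s)),
             rng.normal(size=(dim_s, dim_x)), np.zeros(dim_s))
print(len(states), states[-1].shape)       # 6 (8,)
```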


Alternative Sequence Processing

  • Convolutional neural networks
  • Attention


convolutional neural networks


Convolutional Neural Networks (CNN)

  • Popular in image processing
  • Regions of an image are reduced into increasingly smaller representations

– a matrix spanning part of the image is reduced to a single value
– overlapping regions


CNNs for Language

(Diagram: word embeddings are combined by stacked feed-forward layers, pyramid-style, into a single representation)

  • Map words into fixed-sized sentence representation


Hierarchical Structure and Language

  • Syntactic and semantic theories of language

– language is recursive
– central: verb
– dependents: subject, objects, adjuncts
– their dependents: adjectives, determiners
– also nested: relative clauses

  • How to compute sentence embeddings is an active research topic


Convolutional Neural Networks

  • Key step

– take a high dimensional input representation – map to lower dimensional representation

  • Several repetitions of this step
  • Examples

– map 50×50 pixel area into scalar value – combine 3 or more neighboring words into a single vector

  • Machine translation

– encode input sentence into single vector – decode this vector into a sentence in the output language


attention


Attention

  • Machine translation is a structured prediction task

– output is not a single label – output structure needs to be built, word by word

  • Relevant information for each word prediction varies
  • Human translators pay attention to different parts of the input sentence when translating

⇒ Attention mechanism


Computing Attention

  • Attention mechanism in neural translation model (Bahdanau et al., 2015)

– previous hidden state si−1
– input word embedding hj
– trainable parameters b, Wa, Ua, va

a(si−1, hj) = vaᵀ tanh(Wa si−1 + Ua hj + b)

  • Other ways to compute attention

– Dot product: a(si−1, hj) = si−1ᵀ hj
– Scaled dot product: a(si−1, hj) = (1 / |hj|) si−1ᵀ hj
– General: a(si−1, hj) = si−1ᵀ Wa hj
– Local: a(si−1) = Wa si−1
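A sketch of these scoring functions side by side; all parameters are random placeholders and the scaled variant divides by the vector size |hj| as on the slide:

```
import numpy as np

rng = np.random.default_rng(0)
dim = 8
s_prev = rng.normal(size=dim)          # previous decoder state si−1
h_j = rng.normal(size=dim)             # input word representation hj

# additive attention (Bahdanau et al., 2015)
W_a, U_a = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
v_a, b = rng.normal(size=dim), np.zeros(dim)
additive = v_a @ np.tanh(W_a @ s_prev + U_a @ h_j + b)

dot = s_prev @ h_j                     # dot product
scaled_dot = dot / dim                 # scaled by |hj|
general = s_prev @ W_a @ h_j           # general (bilinear) form

print(additive, dot, scaled_dot, general)
```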


Attention of Luong et al. (2015)

  • Luong et al. (2015) demonstrate good results with the dot product

a(si−1, hj) = si−1ᵀ hj

  • No trainable parameters
  • Additional changes
  • Currently more popular


Attention of Luong et al. (2015)

(Diagram: decoder architectures of Luong et al. (2015) and Bahdanau et al. (2015), showing how the RNN decoder state si, attention weights αij over encoder states hj, the weighted sum as input context ci, softmax, and argmax connect the output word embedding Eyi−1 to the output word prediction ti and output word yi)


Attention of Luong et al. (2015)

Luong et al. (2015)
– Attention: αij = softmax(FF(si−1, hj))
– Input context: ci = Σj αij hj
– Output word: p(yt|y<t, x) = softmax(W FFtanh(si−1, ci))
– Decoder state: si = FFtanh(si−1, Eyi−1)

Bahdanau et al. (2015)
– Attention: αij = softmax(FF(si−1, hj))
– Input context: ci = Σj αij hj
– Output word: p(yt|y<t, x) = softmax(W FFtanh(si−1, Eyi−1, ci))
– Decoder state: si = FFtanh(si−1, Eyi−1, ci)


Multi-Head Attention

  • Add redundancy

– say, 16 attention weights – each based on its own parameters

  • Formally, for each head k compute an association between

– decoder state si−1 at time step i
– encoder state hj for the jth input word
– using the softmax of some parameterized function ak

αᵏij = softmax(aᵏ(si−1, hj))

  • Average the attention weights

αij = (1 / K) Σk αᵏij   (K = number of heads)

  • Multi-head attention is a form of ensembling
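A sketch of the head-averaged weights described above; a bilinear score with its own random matrix per head stands in for the parameterized functions ak:

```
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
dim, n_words, n_heads = 8, 5, 4
s_prev = rng.normal(size=dim)               # decoder state si−1
H = rng.normal(size=(n_words, dim))         # encoder states h1 ... h5

alphas = []
for _ in range(n_heads):                    # each head has its own parameters
    W_k = rng.normal(size=(dim, dim))
    scores = H @ (W_k @ s_prev)             # ak(si−1, hj) for all j
    alphas.append(softmax(scores))          # αᵏij

alpha = np.mean(alphas, axis=0)             # average over the heads
print(alpha, alpha.sum())                   # distribution over the 5 input words
```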


Fine-Grained Attention

  • Why just use a single scalar value to weight entire vectors?

– learn weights for each element – computation of attention values returns vector instead of scalar

  • Architecturally, still a feed-forward neural network (or any of its variants)

a(si−1, hj) = FFk(si−1, hj)

  • Softmax is now applied over each dimension d

αᵈij = exp(aᵈ(si−1, hj)) / Σk exp(aᵈ(si−1, hk))
  • Input context is now computed by an element-wise multiplication

ci = Σj αij × hj


Self Attention

  • Finally, a very different take on attention
  • Motivation so far: need for alignment between input words and output words
  • Now: refine representation of input words in the encoder

– representation of an input word mostly depends on itself
– but also informed by the surrounding context
– previously: recurrent neural networks (considers left or right context)
– now: attention mechanism

  • Self attention:

Which of the surrounding words is most relevant to refine representation?


Self Attention

  • Formal definition (based on a sequence of vectors hj, packed into a matrix H)

self-attention(H) = softmax(H Hᵀ / |h|) H
  • Association between every word representation hj and any other context word hk

– computed by dot product
– results in a vector of raw association values HHᵀ

  • Scaled by the size of the word representation vectors |h|, then normalized with softmax

softmax(H Hᵀ / |h|)
  • Resulting vector of normalized association values used to weigh context words


Self Attention

  • More familiar math, using word representation vectors hj
  • Raw association (HHᵀ / |h|)

ajk = (1 / |h|) hj hkᵀ

  • Normalized association (softmax)

αjk = exp(ajk) / Σκ exp(ajκ)

  • Weighted sum

self-attention(hj) = Σk αjk hk

  • More on this later (Transformer)
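A direct transcription of these formulas into NumPy, with one row of H per word and the division by |h| as on the slides:

```
import numpy as np

def self_attention(H):
    """self-attention(H) = softmax(H Hᵀ / |h|) H, softmax taken row-wise."""
    scores = H @ H.T / H.shape[1]               # raw associations ajk
    scores -= scores.max(axis=1, keepdims=True) # for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)   # normalized associations αjk
    return alpha @ H                            # weighted sum over context words

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                     # 6 words, |h| = 8
print(self_attention(H).shape)                  # (6, 8)
```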


convolutional machine translation


Convolutional Machine Translation

  • First end-to-end neural machine translation model of the modern era

[Kalchbrenner and Blunsom, 2013]

  • Encoder

(Diagram: input words → input word embeddings → K2 layer → K3 layer → L3 layer)

– always two convolutional layers, with different sizes
– here: K2 and K3

  • Decoder similar


Refinement

(Diagram: input words and their input word embeddings pass through K2 and K3 encoder layers, a transfer layer, then K3 and K2 decoder layers; an RNN and softmax over output word embeddings produce the output word predictions)

  • Convolutions do not result in a single sentence embedding but a sequence
  • Decoder is also informed by a recurrent neural network


CNNs With Attention

[Gehring et al. 2017]

  • Combination of

– convolutional neural networks – attention

  • Sequence-to-sequence attention, mainly as before
  • Recurrent neural networks replaced by convolutional layers


Encoder

(Diagram: input words → input word embeddings → encoder convolution 1 → encoder convolution 2 → encoder convolution 3)

  • Stacked encoder convolutions
  • Not shortening representations
  • But: faster processing due to more parallelism


Encoder: Math

  • Start with input word embeddings Exj

h0,j = E xj

  • Progress through

– sequence of layer encodings hd,j
– at different depth d
– until maximum depth D

hd,j = f(hd−1,j−k, ..., hd−1,j+k)

  • Details

– function f is feed-forward layer with shortcut connection
– final representation hD,j may only be informed by partial sentence context
– all words at one depth can be processed in parallel → fast
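A rough sketch of one such layer, assuming k = 1 (a window of 3 neighboring vectors), a tanh feed-forward combination, a shortcut connection, and zero padding at the sentence boundaries; these details are illustrative assumptions:

```
import numpy as np

def conv_layer(H, W, b, k=1):
    """hd,j = f(hd−1,j−k, ..., hd−1,j+k) with a shortcut connection."""
    n, dim = H.shape
    padded = np.vstack([np.zeros((k, dim)), H, np.zeros((k, dim))])
    out = np.empty_like(H)
    for j in range(n):                          # every position is independent
        window = padded[j:j + 2 * k + 1].reshape(-1)
        out[j] = np.tanh(W @ window + b) + H[j] # feed-forward + shortcut
    return out

rng = np.random.default_rng(0)
n, dim, k = 7, 16, 1
H0 = rng.normal(size=(n, dim))                  # h0,j = E xj
W = 0.1 * rng.normal(size=(dim, (2 * k + 1) * dim))
H1 = conv_layer(H0, W, np.zeros(dim), k)        # all positions computable in parallel
print(H1.shape)                                 # (7, 16)
```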


Decoder

(Diagram: output word embeddings and the input context pass through decoder convolutions 1, 2, and 3; a feed-forward layer and softmax produce the output word prediction)

  • Decoder state computed by convolutional layers over previous output words
  • Each convolutional state also informed by the input context (using attention)


Decoder: Math

  • Recall: recurrent neural network decoder

si = f(si−1, Eyi−1, ci)

– decoder state si
– embedding of previous output word Eyi−1
– input context ci

  • Now

– state computation not depending on previous state si−1 (not recurrent)
– conditioned on the sequence of the κ most recent previous words

si = f(Eyi−κ, ..., Eyi−1, ci)

  • Stacked convolutions

s1,i = f(Eyi−κ, ..., Eyi−1, ci)
sd,i = f(sd−1,i−κ−1, ..., sd−1,i, ci)   for d > 0, d ≤ D̂


Attention

  • Attention mechanism fundamentally unchanged
  • Input context ci computed based on association a(si−1, hj) between

– encoder state hj – decoder state si−1

  • Now

– encoder state hD,j
– decoder state sD̂,i−1

  • Refinement when computing the context vector ci:

shortcut connection between encoder state hD,j and input word embedding xj


transformer


Self Attention: Transformer

  • Self-attention in encoder

– refine word representation based on relevant context words – relevance determined by self attention

  • Self-attention in decoder

– refine output word predictions based on relevant previous output words – relevance determined by self attention

  • Also regular attention to encoder states in decoder
  • Currently most successful model

(maybe with self attention only in the encoder, but a regular recurrent decoder)


Encoder

(Diagram: input words xj, e.g. "<s> the house is big . </s>", and their positions j are embedded (Ewxj, Epj) and added to form positional input word embeddings Ewxj + Epj; self-attention with weighted sums, add & norm shortcut connections (ĥj), and feed-forward refinement produce the encoder states hj)

Sequence of self-attention layers


Self Attention Layer

  • Given: input word representations hj, packed into a matrix H = (h1, ..., hj)
  • Self attention

self-attention(H) = softmax(H Hᵀ / |h|) H
  • Shortcut connection

self-attention(hj) + hj

  • Layer normalization

ĥj = layer-normalization(self-attention(hj) + hj)

  • Feed-forward step with ReLU activation function

relu(W ĥj + b)

  • Again, shortcut connection and layer normalization

layer-normalization(relu(W ĥj + b) + ĥj)
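Putting these steps together as a sketch of one layer; the layer normalization here uses per-vector mean and standard deviation and omits the usual gain and bias parameters for brevity:

```
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(H):
    scores = H @ H.T / H.shape[1]                 # softmax(H Hᵀ / |h|) H
    scores -= scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ H

def self_attention_layer(H, W, b):
    H_hat = layer_norm(self_attention(H) + H)     # shortcut + layer normalization
    ff = np.maximum(H_hat @ W.T + b, 0.0)         # feed-forward step with ReLU
    return layer_norm(ff + H_hat)                 # shortcut + layer normalization again

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                       # 6 words, |h| = 8
W, b = 0.1 * rng.normal(size=(8, 8)), np.zeros(8)
print(self_attention_layer(H, W, b).shape)        # (6, 8)
```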


Stacked Self Attention Layers

  • Stack several such layers (say, D = 6)
  • Start with input word embedding

h0,j = Exj

  • Stacked layers

hd,j = self-attention-layer(hd−1,j)


Decoder

(Diagram: output words yi and their positions i are embedded and added to form positional output word embeddings; stacked decoder layers apply self-attention over previous output words with add & norm shortcut connections, attention over the encoder states h with add & norm, and feed-forward refinement to produce the decoder states si)

Decoder computes attention-based representations of the output in several layers, initialized with the embeddings of the previous output words


Self-Attention in the Decoder

  • Same idea as in the encoder
  • Output words are initially encoded by word embeddings si = Eyi.
  • Self attention is computed over previous output words

– association of a word si is limited to words sk (k ≤ i)
– resulting representation s̃i

self-attention(S) = softmax(S Sᵀ / |h|) S


Attention in the Decoder

  • Original intuition of attention mechanism: focus on relevant input words
  • Computed with the dot product S̃Hᵀ

  • Compute attention between the decoder states S̃ and the final encoder states H

attention(S̃, H) = softmax(S̃ Hᵀ / |h|) H
  • Note: attention mechanism formally mirrors self-attention


Full Decoder

(Diagram: input words pass through a stack of encoder layers; output word embeddings pass through a stack of decoder layers, followed by softmax and argmax to produce each output word prediction)


Full Decoder

  • Self-attention

self-attention(S) = softmax(S Sᵀ / |h|) S

– shortcut connections
– layer normalization
– feed-forward layer

  • Attention

attention(S̃, H) = softmax(S̃ Hᵀ / |h|) H

– shortcut connections
– layer normalization
– feed-forward layer

  • Multiple stacked layers


Mix and Match

  • Encoder may be multiple layers of either

– recurrent neural networks – self-attention layers

  • Decoder may be multiple layers of either

– recurrent neural networks – self-attention layers

  • Also possible: self-attention encoder, recurrent neural network decoder
  • Even better: both self-attention and recurrent neural network, merged at the end
