SLIDE 1

Time-aware Large Kernel Convolutions

Vasileios Lioutas and Yuhong Guo ICML | 2020

SLIDE 2

Brief Overview

  • In this work, we introduce a novel sequence modeling approach called TaLK convolutions that is not based on self-attention.
  • Experiments on machine translation, abstractive summarization and language modeling suggest that this method can yield comparable results to other competitive self-attention and convolution based methods.
  • The proposed method has O(n) time complexity and uses an adaptive summation convolution kernel.

SLIDE 3

Introduction

  • Sequence modeling is a fundamental task in ML
  • Many applications, such as machine translation, POS tagging, sentiment classification, video processing, time-series, etc.

[Karpathy, 2015]

  • It's the process of learning how to combine timesteps to form representations of higher abstraction.

SLIDE 4

Sequence Modeling Approaches

SLIDE 5

Sequence Modeling Approaches

SLIDE 6

Sequence Modeling Approaches

SLIDE 7

Comparison

SLIDE 8

Motivation

Currently, self-attention is considered vital for modern sequence learning approaches.

  • Self-attention is expensive: it has quadratic time complexity.
  • It is hard to deploy on devices with limited hardware (e.g. edge devices).
  • Dynamic Convolutions [Wu et al. 2019] showed that you can achieve good results using a limited context window.
  • However, they still rely on a special type of attention (i.e. dynamic value-based attention).

SLIDE 9

Research Questions

  • Q1: Is (self-)attention critical to get good performance?
  • Q2: Can we reduce the time complexity to O(n) using a parallelizable non-autoregressive method?

SLIDE 10

One-dimensional Large Kernel Convolution

  • One of the simplest ways to model a sequence of representations is to aggregate the appropriate number of vector representations together: $y_i = \sum_{j=i-l}^{i+r} x_j$, where $l$ and $r$ are the left and right offsets (boundaries).
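As a rough illustration (not the authors' code), the sketch below performs this aggregation naively with fixed offsets l and r; it makes the O(n · (l + r)) cost of the direct approach explicit.

```python
import numpy as np

def naive_window_aggregate(x, l, r):
    """Sum the l vectors to the left and r vectors to the right of each timestep.

    x: (n, d) array of timestep representations. Runs in O(n * (l + r)) time.
    """
    n, d = x.shape
    y = np.zeros_like(x)
    for i in range(n):
        lo = max(0, i - l)          # clip the window at the sequence start
        hi = min(n, i + r + 1)      # clip the window at the sequence end
        y[i] = x[lo:hi].sum(axis=0)
    return y
```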

SLIDE 11

One-dimensional Large Kernel Convolution

SLIDE 12

Summed-area Table

  • Applying the previous aggregation naively can be slow because we compute the same partial sums again and again.
  • To address this issue we can use the summed-area table (integral image operation).
  • Let $S$ be the summed-area table computed using $S_0 = 0$ and $S_i = S_{i-1} + x_i$.
  • Given the left and right offsets, we can compute the aggregation using the summed-area table in $O(1)$ time: $y_i = S_{i+r} - S_{i-l-1}$.
  • The summed-area table itself can be efficiently parallelized with $O(\log n)$ complexity using the parallel prefix sum algorithm.
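A minimal sketch of the idea (illustrative, not the paper's implementation): build the summed-area table once with a prefix sum, after which any window sum costs only two table lookups.

```python
import numpy as np

def summed_area_table(x):
    """S[0] = 0 and S[i] = x[0] + ... + x[i-1]; shape (n + 1, d)."""
    n, d = x.shape
    S = np.zeros((n + 1, d))
    np.cumsum(x, axis=0, out=S[1:])   # on GPU this is the parallel prefix sum
    return S

def window_sum(S, i, l, r):
    """Sum of x[i - l .. i + r] in O(1) via two table lookups."""
    return S[i + r + 1] - S[i - l]

x = np.random.randn(8, 4)             # toy sequence: 8 timesteps, 4 channels
S = summed_area_table(x)
assert np.allclose(window_sum(S, 3, 2, 1), x[1:5].sum(axis=0))
```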

SLIDE 13

Time-aware Large Kernel Generation

  • So far, we assumed that the left and right offsets are given.
  • Ideally, we want to learn to generate these offsets for each input timestep.
  • We can’t directly predict the index which corresponds to the offset (boundary) word: indexes are positive unbounded integers.
  • We address this issue using relative offsets.
  • We generate these relative offsets using $\alpha_i^\ell = \sigma(f^\ell(x_i))$ and $\alpha_i^r = \sigma(f^r(x_i))$, where $\sigma$ is the sigmoid function and $f^\ell, f^r$ are learned linear functions, so each relative offset lies in $[0, 1]$.
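A hypothetical sketch of the offset generation, assuming (as described above) sigmoid-activated linear functions of each timestep representation; W_left and W_right are made-up parameter names standing in for the learned projections.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relative_offsets(x, W_left, W_right):
    """Map each timestep x_i to a pair of relative offsets in [0, 1].

    x: (n, d) representations; W_left, W_right: (d,) learned projection vectors.
    """
    alpha_left = sigmoid(x @ W_left)    # fraction of the maximum left expansion
    alpha_right = sigmoid(x @ W_right)  # fraction of the maximum right expansion
    return alpha_left, alpha_right
```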

SLIDE 14

Offsets Interpolation

  • Convert the relative offsets to absolute by using $a_i^\ell = i - \alpha_i^\ell \cdot \ell_{max}$ and $a_i^r = i + \alpha_i^r \cdot r_{max}$, where $\ell_{max}$ and $r_{max}$ are the maximum allowed tokens to the left and to the right.
  • We can’t directly use the absolute indexes because they are real values.
  • We use linear interpolation on the summed-area table to approximately generate $S_{a_i^\ell}$ and $S_{a_i^r}$ directly.
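A sketch of the interpolation step under the same assumptions as the earlier snippets: each real-valued boundary is evaluated on the summed-area table by linearly interpolating between the two nearest integer entries.

```python
import numpy as np

def interp_table(S, a):
    """Linearly interpolate the summed-area table S at a real-valued index a."""
    lo = int(np.floor(a))
    frac = a - lo
    hi = min(lo + 1, len(S) - 1)
    return (1.0 - frac) * S[lo] + frac * S[hi]

def talk_window_sum(S, i, alpha_l, alpha_r, l_max, r_max):
    """Approximate sum over the window [i - alpha_l * l_max, i + alpha_r * r_max]."""
    n = len(S) - 1
    left = np.clip(i - alpha_l * l_max, 0.0, n)         # real-valued left boundary
    right = np.clip(i + alpha_r * r_max + 1.0, 0.0, n)  # exclusive right boundary
    return interp_table(S, right) - interp_table(S, left)
```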

SLIDE 15

Output Normalization

  • The proposed method as described works well when used with shallow models.
  • Aggregating many representations together, however, can lead to disproportionate magnitudes in the representation values passed to the next layers.
  • Solution: normalize the aggregated output by the maximum window length.
  • To further increase performance, we apply dropout to the generated relative offsets.
  • A dropped relative offset is set to zero, which effectively cancels the expansion of the window towards that direction.
  • This forces the model to produce smaller windows, making it more robust to the number of tokens needed to model a timestep.
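An illustrative sketch of these two tricks (the function names are ours, not the paper's): divide the aggregated vector by the maximum window length, and randomly zero relative offsets during training.

```python
import numpy as np

def normalize_output(aggregated, l_max, r_max):
    """Scale the aggregated representation by the maximum possible window length."""
    return aggregated / float(l_max + r_max + 1)

def offset_dropout(alpha, p=0.1, training=True):
    """Randomly zero relative offsets (an ndarray), cancelling expansion in that direction."""
    if not training:
        return alpha
    mask = (np.random.rand(*alpha.shape) >= p).astype(alpha.dtype)
    return alpha * mask
```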

SLIDE 16

Multi-headed Kernels

  • Similar to multi-head self-attention (MHSA), we introduce multiple heads.
  • We tie every $R$ consecutive channels together and group the $d$ channels into $H$ groups, where $R = d / H$; channels in the same group share the same generated offsets.
  • This helps to further increase the expressivity and diversity of the representation of each timestep.
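A minimal sketch of the channel grouping, assuming d channels split into H heads of R = d / H tied channels each.

```python
import numpy as np

def group_heads(x, num_heads):
    """Reshape (n, d) into (n, H, d // H) so the R = d // H channels in a head share one kernel."""
    n, d = x.shape
    assert d % num_heads == 0, "d must be divisible by the number of heads"
    return x.reshape(n, num_heads, d // num_heads)
```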

SLIDE 17

The TaLK Convolution Operation

SLIDE 18

Architecture & Implementation

  • We implemented our own CUDA primitives to support the TaLK Convolution operation.
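For reference, here is a plain-Python (non-CUDA) sketch of what a single-head TaLK convolution forward pass computes, combining the pieces from the previous slides; the actual CUDA primitives parallelize the prefix sum and the per-timestep loop.

```python
import numpy as np

def talk_conv_forward(x, alpha_l, alpha_r, l_max, r_max):
    """Reference sketch of a single-head TaLK convolution over a (n, d) sequence."""
    n, d = x.shape
    S = np.zeros((n + 1, d))
    np.cumsum(x, axis=0, out=S[1:])                    # summed-area table (prefix sums)
    out = np.empty_like(x)
    for i in range(n):                                 # parallelized across timesteps on GPU
        left = np.clip(i - alpha_l[i] * l_max, 0.0, n)
        right = np.clip(i + alpha_r[i] * r_max + 1.0, 0.0, n)
        out[i] = _interp(S, right) - _interp(S, left)  # interpolated window sum
    return out / (l_max + r_max + 1)                   # normalize by max window length

def _interp(S, a):
    """Linear interpolation of the table S at a real-valued index a."""
    lo = int(np.floor(a))
    frac = a - lo
    hi = min(lo + 1, len(S) - 1)
    return (1.0 - frac) * S[lo] + frac * S[hi]
```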

SLIDE 19

Computational Complexity

SLIDE 20

Machine Translation

SLIDE 21

Abstractive Summarization & Language Modeling

SLIDE 22

Model Ablation

SLIDE 23

Encoding Inference Speed Comparison

SLIDE 24

Conclusion

  • We introduced a new way of doing sequence modeling that has O(n) time complexity.
  • The results show that the proposed method can perform on par with transformers and dynamic convolutions without using self-attention or a variant of it.
  • In the future, we will do more research on how to apply TaLK Convolutions in a non-contiguous way.

github.com/lioutasb/TaLKConvolutions

Thank you!
