SLIDE 1

Deep Learning

Barun Patra

SLIDE 2

Index

  • Introduction to Neural Nets
  • Activations
      ○ Sigmoid
      ○ Tanh
      ○ ReLU (derivatives)
  • Gradients
  • Initialization
  • Regularization
      ○ Dropout
      ○ Batch Norm
  • Convolutional Networks
      ○ Inspiration
      ○ Kernels
      ○ Idea
      ○ As used in NLP
  • Paper Discussion
SLIDE 3

Introduction

Image from Stanford’s CS231n supplementary notes

slide-4
SLIDE 4

Representational Power

  • A single hidden layer NN can approximate any continuous function (universal approximation)
  • http://neuralnetworksanddeeplearning.com/chap4.html

So why do we use deep neural networks?

  • Depth is sometimes the more intuitive structure (e.g. for images)
  • Works well in practice
SLIDE 5

Commonly Used Activations: Sigmoid

  • Historically used
  • Has a nice interpretation as a neuron's firing rate
  • Tendency to saturate and kill the gradient
  • In the regions where the neuron's output is near 1 or 0, the gradient is ≈ 0
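A minimal numpy sketch (not from the slides; values illustrative) of the saturation claim: the derivative sigmoid'(x) = sigmoid(x)·(1 - sigmoid(x)) peaks at 0.25 and vanishes for large |x|.

```python
import numpy as np

def sigmoid(x):
    # Logistic function.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.2e}")
# The gradient peaks at 0.25 (x=0) and is ~4.5e-05 by x=10: once a unit
# saturates near 0 or 1, almost no gradient flows back through it.
```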
SLIDE 6

Commonly Used Activations: Tanh

  • Still saturates, killing the gradient
  • But the gradient is ≠ 0 where the function's output is 0 (it is maximal there)
SLIDE 7

Commonly Used Activations: ReLU

  • Does not saturate (for positive inputs)
  • Faster to compute
  • Can cause units of the network to die: a unit stuck at output 0 receives zero gradient and never recovers
  • Converges faster in practice
SLIDE 8

Commonly Used Activations: Leaky ReLU & Maxout

  • Leaky ReLU: f(x) = max(x, αx) for a small α, so negative inputs keep a small non-zero gradient
  • Maxout generalizes Leaky ReLU: f(x) = max(w1·x + b1, w2·x + b2)
  • Doubles the number of parameters (see the sketch below)
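A small sketch of both (not from the slides; alpha and the weight shapes are illustrative). Maxout takes the max over two affine maps, which is why the parameter count doubles; Leaky ReLU is the special case max(x, α·x).

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(x, alpha * x): the small negative slope keeps units from dying.
    return np.where(x > 0, x, alpha * x)

def maxout(x, W1, b1, W2, b2):
    # Max over two affine maps: generalizes (leaky) ReLU,
    # at the cost of doubling the parameters per unit.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # batch of 4, 8 features
W1, W2 = rng.standard_normal((2, 8, 16))   # two weight sets -> 2x params
b1, b2 = np.zeros(16), np.zeros(16)
print(leaky_relu(np.array([-2.0, 3.0])))   # [-0.02  3.]
print(maxout(x, W1, b1, W2, b2).shape)     # (4, 16)
```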
SLIDE 9

Backpropagation and Gradient Computation

  • Let z(i) be the output of the ith layer, and s(i) its input
  • Let f be the activation being applied (elementwise)
  • Let w(i)_jk be the weight connecting the jth unit to the kth unit of the ith layer
  • We then have: s(i) = W(i) z(i-1) and z(i) = f(s(i)); the chain rule propagates the error backwards through these equations, delta(i) = (W(i+1)^T delta(i+1)) ⊙ f'(s(i))
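A minimal numpy sketch of these equations (not from the slides; shapes illustrative) for a two-layer net with sigmoid activations and squared error.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # input z(0)
y = np.array([1.0, 0.0])            # target
W1 = rng.standard_normal((3, 4)) * 0.1
W2 = rng.standard_normal((2, 3)) * 0.1

# Forward: s(i) = W(i) z(i-1), z(i) = f(s(i))
s1 = W1 @ x;  z1 = sigmoid(s1)
s2 = W2 @ z1; z2 = sigmoid(s2)

# Backward: delta(i) = (W(i+1)^T delta(i+1)) * f'(s(i)),
# with f'(s) = sigmoid(s) * (1 - sigmoid(s)) = z * (1 - z).
delta2 = (z2 - y) * z2 * (1 - z2)        # output layer, squared error
delta1 = (W2.T @ delta2) * z1 * (1 - z1)

# Gradients: dL/dW(i) = outer(delta(i), z(i-1))
grad_W2 = np.outer(delta2, z1)
grad_W1 = np.outer(delta1, x)
```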
SLIDE 10

Backpropagation and Gradient Computation

SLIDE 11

Backpropagation and Gradient Computation

SLIDE 12

Backpropagation and Activation

  • Why does sigmoid learn slowly?

Taken from “Understanding the difficulty of training deep feedforward neural networks”, Glorot and Bengio

SLIDE 13

Babysitting your gradient:

  • For a few examples (4-5 in a batch), compute the numerical gradient
  • Compare the gradient from backprop with the numerical gradient
      ○ Use relative error instead of absolute error
  • Rule of thumb (a sketch of the check follows below):
      ○ relative error > 1e-2 usually means the gradient is probably wrong
      ○ 1e-2 > relative error > 1e-4 should make you feel uncomfortable
      ○ 1e-4 > relative error is usually okay for objectives with kinks; but if there are no kinks (e.g. tanh nonlinearities and softmax), then 1e-4 is too high
      ○ 1e-7 and less: you should be happy
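A sketch of the check itself (not from the slides; f is any scalar loss, grad_f its analytic gradient, and eps is an illustrative step size).

```python
import numpy as np

def grad_check(f, grad_f, w, eps=1e-5):
    """Compare analytic grad_f(w) to a centered-difference estimate."""
    analytic = grad_f(w)
    numeric = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        numeric.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    # Relative, not absolute, error: scale-free comparison.
    rel = np.abs(analytic - numeric) / np.maximum(
        np.abs(analytic) + np.abs(numeric), 1e-12)
    return rel.max()

f = lambda w: np.sum(np.tanh(w) ** 2)
grad_f = lambda w: 2 * np.tanh(w) * (1 - np.tanh(w) ** 2)
print(grad_check(f, grad_f, np.linspace(-2, 2, 5)))  # tiny (~1e-10): happy
```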

SLIDE 14

Initialization: Glorot Uniform / Xavier

  • Do not start with all 0's (nothing would break the symmetry between units in a layer)
  • Sample from the Glorot uniform distribution W ~ U[-sqrt(6/(n_in + n_out)), +sqrt(6/(n_in + n_out))], or from a Gaussian with the same variance 2/(n_in + n_out)
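A minimal sketch of both variants (not from the slides; n_in and n_out are the layer's fan-in and fan-out).

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=np.random.default_rng()):
    # W ~ U[-limit, limit] with limit = sqrt(6 / (n_in + n_out)),
    # which gives Var[W] = limit**2 / 3 = 2 / (n_in + n_out).
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def glorot_normal(n_in, n_out, rng=np.random.default_rng()):
    # Gaussian with the same variance, 2 / (n_in + n_out).
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = glorot_uniform(256, 128)
print(W.var(), 2.0 / (256 + 128))  # empirical vs. target variance
```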
SLIDE 15
Initialization: Anything Goes?

  • Consider a network with linear neurons
  • Let z(i) be the output of the ith layer, and s(i) its input
  • Let the input x have 0 mean and variance Var[x]
  • Let all the weights be i.i.d. with 0 mean. Then:

Var[z(i)] = n(i-1)·Var[W(i)]·Var[z(i-1)] = Var[x] · ∏_{j=1..i} n(j-1)·Var[W(j)], where n(j-1) is the fan-in of layer j

SLIDE 16

Initialization: Anything Goes?

  • Similarly, for the backward pass of a net of depth d:

Var[∂L/∂s(i)] = Var[∂L/∂s(d)] · ∏_{j>i} n(j)·Var[W(j)]
SLIDE 17

Initialization: Anything Goes?

  • For information to flow, we want the variances preserved in both directions: Var[z(i)] = Var[z(i-1)] and Var[∂L/∂s(i)] = Var[∂L/∂s(i+1)]
  • And hence: n(i-1)·Var[W(i)] = 1 and n(i)·Var[W(i)] = 1; as a compromise, Var[W(i)] = 2 / (n(i-1) + n(i))
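A quick empirical sketch (not from the slides; depth and width illustrative) of why the scaling matters: with linear layers, a naive std=1 init multiplies the variance by n·Var[W] = 100 per layer, while Glorot scaling keeps it roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 100))  # batch of 1000, width n = 100

for name, std in [("naive std=1", 1.0),
                  ("glorot", np.sqrt(2.0 / (100 + 100)))]:
    z = x
    for _ in range(10):                      # 10 linear layers
        W = rng.normal(0.0, std, (100, 100))
        z = z @ W                            # linear neurons, no bias
    print(name, z.var())
# naive: variance grows by n*Var[W] = 100 per layer (~1e20 after 10 layers)
# glorot: n*Var[W] = 1, so the variance stays near 1
```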
SLIDE 18

Initialization: Sigmoid and ReLU

  • The linear assumption is good enough for tanh
  • For sigmoid and ReLU, small modifications are needed
  • The modification for ReLU (Var[W(i)] = 2/n(i-1)) + some other tricks:
      ○ By He, Zhang, Ren and Sun: https://arxiv.org/pdf/1502.01852.pdf
      ○ Surpassed human-level performance on ImageNet classification

SLIDE 19

Regularization (Motivation):

Neural nets have a strong tendency to overfit

SLIDE 20

Regularization (Motivation):

Effect of L2 Regularization

SLIDE 21

Regularization (Methodology):

  • L1 Regularization
  • L2 Regularization
  • Introducing noise
  • Max-norm constraints on the weights
  • Early stopping using a validation set
  • Dropout
  • Batch Normalization
SLIDE 22

Dropout:

  • Each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes.
  • http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf
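A sketch of the commonly used "inverted" dropout variant (not from the slides; the paper itself rescales activations at test time instead). Units are zeroed with probability 1 - p_keep during training and the survivors are rescaled so the expected activation is unchanged.

```python
import numpy as np

def dropout_forward(h, p_keep=0.5, train=True, rng=np.random.default_rng()):
    # Inverted dropout: zero each unit with prob (1 - p_keep) and
    # rescale the survivors by 1/p_keep, so E[output] = h.
    if not train:
        return h                        # identity at test time
    mask = (rng.random(h.shape) < p_keep) / p_keep
    return h * mask

h = np.ones((2, 6))
print(dropout_forward(h))               # some units zeroed, rest scaled to 2.0
print(dropout_forward(h, train=False))  # unchanged at test time
```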
SLIDE 23

Batch Normalization:

  • Normalizing the input helps in training
  • What if we could normalize the input to every layer of the network?
  • For each layer with d-dim input x = (x(1) ... x(d)), we want each dimension normalized: x̂(k) = (x(k) - E[x(k)]) / sqrt(Var[x(k)])
  • But normalizing like this may change what the layer represents
  • To overcome that, the transformation inserted in the network must be able to represent the identity transform: y(k) = γ(k)·x̂(k) + β(k)
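A training-time forward-pass sketch (not from the slides; eps and shapes illustrative, and the running statistics used at inference are omitted).

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, d). Normalize each dimension over the batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Learned scale/shift: with gamma = sqrt(var) and beta = mu,
    # the layer can recover the identity transform.
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(5.0, 3.0, (32, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.var(axis=0).round(3))  # ~0 mean, ~1 var
```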

SLIDE 24

Batch Normalization:

  • Leads to faster training
  • Less dependence on initialization
SLIDE 25

Some practical advice:

  • Gradient check on small data
  • Overfit without regularization on small data

  • Decay learning rate with time
  • Regularize
  • Always check learning curves
SLIDE 26

Introduction to Convolutional Networks

  • What are these convolutions and kernels?
  • https://docs.gimp.org/en/plug-in-convmatrix.html
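A sketch of a 2D convolution with a 3x3 kernel (not from the slides; the edge-detect kernel is a standard example of the kind the GIMP convolution-matrix plug-in above applies).

```python
import numpy as np

def conv2d(image, kernel):
    # 'Valid' convolution: slide the flipped kernel over the image.
    # (True convolution flips the kernel; CNNs usually skip the flip.)
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel[::-1, ::-1])
    return out

edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]], dtype=float)  # classic edge-detect kernel
img = np.zeros((8, 8)); img[:, 4:] = 1.0      # image with a vertical edge
print(conv2d(img, edge))                      # responds only at the edge
```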
SLIDE 27

Introduction to Convolutional Networks

  • Animation at http://cs231n.github.io/convolutional-networks/
SLIDE 28

Kinds of features learnt:

SLIDE 29

Convolutional Networks in NLP

  • Gives a good generalization of unigram, bigram, etc. features in embedding space (see the sketch below)
  • With more layers, the receptive field increases

Taken from Hierarchical Question Answering using Co Attention
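A sketch of a width-2 convolution over word embeddings (not from the slides; all dimensions illustrative): each filter scores every bigram window, i.e. acts as a learned bigram feature detector.

```python
import numpy as np

rng = np.random.default_rng(0)
sent = rng.standard_normal((7, 50))          # 7 words, 50-dim embeddings
filters = rng.standard_normal((10, 2, 50))   # 10 filters of width 2

# out[t, f] scores the bigram (word t, word t+1) with filter f.
out = np.array([[np.sum(sent[t:t+2] * filters[f])
                 for f in range(10)]
                for t in range(sent.shape[0] - 1)])
print(out.shape)  # (6, 10): one score per bigram window per filter
```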

SLIDE 30

Relation Extraction with Conv Networks:

SLIDE 31
Issue 1 with the Previous Task (Mintz et al., 2009):

  • Assumption: every sentence containing the two entities expresses the relation
  • Issue: the assumption is too strong
  • Solution: use a Multi-Instance Multi-Label model

Taken from (Zeng et al., 2015)

SLIDE 32

Issue 2 with the Previous Task:

  • Used handcrafted features + other NLP tools like dependency parsers
      ○ These have poor performance as sentence length increases
      ○ Long sentences form nearly 50% of the corpus being used to extract the relations
  • Solution: use Deep Learning!
      ○ Enter Convolutional Networks

SLIDE 33

The Model (Overview):

Taken from (Zeng et al., 2015)

SLIDE 34

The Model (Embedding):

  • Train word2vec (skip-gram model) [Why not CBOW?]
  • Use positional embeddings (distances from the two entities); see the sketch below:
      ○ Capture the notion of the distance of the word from the entities
      ○ The same word, at different locations in the sentence, might have different semantics
      ○ A proxy for LSTM embeddings
  • Final dimension for one word: R^(embed_dim + 2·embed_position)
  • Final dimension of the embedding layer: R^(|Sentence| × (embed_dim + 2·embed_position))
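A sketch of the embedding layer (not from the paper's code; table sizes, the distance-clipping range, and the sentence are illustrative): each word vector is concatenated with two position embeddings, one per entity.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, embed_position = 50, 5
word_emb = rng.standard_normal((10_000, embed_dim))          # word2vec table
pos_emb = rng.standard_normal((2 * 60 + 1, embed_position))  # distances -60..60

def embed(word_ids, e1_pos, e2_pos):
    # Row for word t: [word vector ; pos emb of (t - e1) ; pos emb of (t - e2)]
    rows = []
    for t, w in enumerate(word_ids):
        d1 = np.clip(t - e1_pos, -60, 60) + 60
        d2 = np.clip(t - e2_pos, -60, 60) + 60
        rows.append(np.concatenate([word_emb[w], pos_emb[d1], pos_emb[d2]]))
    return np.stack(rows)  # |Sentence| x (embed_dim + 2*embed_position)

x = embed([4, 17, 42, 7, 99], e1_pos=1, e2_pos=3)
print(x.shape)  # (5, 60)
```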

SLIDE 35

The Model (Convolution):

  • Convolution with kernel width w
  • Each filter W ∈ R^(w × n_dim_vector)
  • N filters (a hyperparameter)
  • Zero padding to ensure every word gets convolved (see the sketch below)
  • Final layer dimension: R^(N × (|S| + w - 1))
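A sketch of the layer (not from the paper's code; sizes illustrative). "Full" zero padding of w - 1 rows on each side makes every word appear at every offset within a window, giving N × (|S| + w - 1) outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d, w, N = 5, 60, 3, 100          # sentence len, dims, kernel width, filters
x = rng.standard_normal((S, d))
W = rng.standard_normal((N, w, d))

padded = np.vstack([np.zeros((w - 1, d)), x, np.zeros((w - 1, d))])
# One score per filter per window position: output is N x (S + w - 1).
c = np.array([[np.sum(padded[t:t+w] * W[f]) for t in range(S + w - 1)]
              for f in range(N)])
print(c.shape)  # (100, 7)
```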

SLIDE 36

The Model (Pooling + Softmax):

  • Pooling is done in a piecewise manner (see the sketch below)
  • Idea: pool separately over the three parts of the sentence split by the two entities
      ○ Remember ReVerb?
  • Less coarse than a single max pool over the whole sentence
  • The final dimension is R^(num_filters × 3)
  • Tanh non-linearity
  • Softmax to get a probability over all relations
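A sketch of the piecewise max pooling (not from the paper's code; entity window positions illustrative): the convolution output is split into three segments at the two entities and each segment is max-pooled separately.

```python
import numpy as np

def piecewise_max_pool(c, e1, e2):
    # c: (num_filters, T) convolution output; e1 < e2 are the window
    # indices of the two entities. Max-pool each of the 3 segments.
    segs = [c[:, :e1 + 1], c[:, e1 + 1:e2 + 1], c[:, e2 + 1:]]
    return np.concatenate([s.max(axis=1, keepdims=True) for s in segs],
                          axis=1)  # (num_filters, 3)

c = np.random.default_rng(0).standard_normal((100, 7))
p = np.tanh(piecewise_max_pool(c, e1=2, e2=4))  # tanh, as on the slide
print(p.shape)  # (100, 3); flattened and fed to a softmax over relations
```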

SLIDE 37

The Data:

  • A bag contains all sentences mentioning a given pair of entities
  • A bag is labeled r if there is at least one sentence in it which expresses r
  • Potentially multiple identical bags with different labels [unclear]
SLIDE 38

The Objective Function and Training:

  • Trained with mini-batches of bags, using AdaDelta
  • For each bag, only the sentence with the highest predicted probability for the bag's label contributes to the loss (the multi-instance assumption in the paper; see the sketch below)
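A sketch of the per-bag loss under that assumption (not from the paper's code; the probabilities are illustrative softmax outputs).

```python
import numpy as np

def bag_loss(sentence_probs, label):
    # sentence_probs: (num_sentences, num_relations), softmax outputs.
    # Multi-instance assumption: take the sentence most confident in
    # the bag's label, and apply cross-entropy only to that sentence.
    best = np.argmax(sentence_probs[:, label])
    return -np.log(sentence_probs[best, label])

probs = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
print(bag_loss(probs, label=1))  # uses sentence 1 only: -log(0.6) ~ 0.51
```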
SLIDE 39

Inference:

  • Given a bag and a relation r
  • The bag is marked as containing r if there exists at least one sentence in the bag with predicted relation r

SLIDE 40

Experiment Setup:

  • Dataset: Freebase relations aligned with the NYT corpus
  • Training: sentences from 2005-06
  • Testing: sentences from 2007
  • Held-out evaluation: extracted relations compared against Freebase
  • Manual evaluation: evaluation by humans
  • Word2vec trained on the NYT corpus; entity tokens concatenated with ##
  • Grid search over hyperparameters
SLIDE 41

Results (Held-out Evaluation):

  • Half of the Freebase relations used for testing [Doubt]
  • Relations extracted from the test articles compared against the Freebase relations
  • Results compared against Mintz, MultiR and Multi-Instance Multi-Label (MIML) learning
SLIDE 42

Results (Manual Evaluation):

  • Chose entity pairs where at least one entity was not present in Freebase as candidates (to avoid overlap with the held-out set)
  • Top N relations extracted, and precision computed
  • Since not all true relations are known, recall is not reported (pseudo-recall?)
SLIDE 43

Results (Ablation Study):

SLIDE 44

Problems:

  • Analysis of where PCNN improves over MultiR/MIML is lacking [Surag]
  • No coreference resolution [Rishab]
  • No alternatives to 3 segment piecewise convolution [Haroun]
  • Suffers from incomplete Freebase [Daraksha]
  • Does not consider overlapping relations [Daraksha]
  • A lot of training examples not being used [Shantanu]
  • No comparison with other architectures [Akg]
SLIDE 45

Extensions:

  • LSTMs [a lot of people] / more convolutional layers
  • Bootstrapping along with MultiR
  • Use other lexical features:
      ○ A comparison with handcrafted features and kernel-based approaches could be done to see what the architecture fails to capture [Anshul]
      ○ Consequently, kernel features could be added [Haroun, Dinesh Raghu]
      ○ Critiquing the critics [Arindam]
  • Recursive auto-encoders [Dinesh R.] (?)
  • Better embeddings [Gagan, Happy]
  • Using alternative Knowledge Bases for evaluation [Happy]
  • Taking all the sentences which express the relation instead of only the one with the max probability [Shantanu]

SLIDE 46

Extensions (Contd.):

  • Binning sentences by length (but how many weights to train?) [Anshul]
  • Adding logic-based constraints while learning (Confidence(y1) => Confidence(y2)) [Prachi]
  • Instead of the at-least-1 paradigm, an at-least-k paradigm [Yashoteja]
  • Attention-based LSTM, CNN [Prachi] (?)
  • A probability distribution over the sentences in a bag, with every sentence predicting a relation, or NONE [Yashoteja]