SLIDE 1

Deep Learning

Barun Patra

SLIDE 2

Index

  • Introduction to Neural Nets
  • Activations
      ○ Sigmoid
      ○ Tanh
      ○ ReLU (derivatives)
  • Gradients
  • Initialization
  • Regularization
      ○ Dropout
      ○ Batch Norm
  • Convolutional Networks
      ○ Inspiration
      ○ Kernels
      ○ Idea
      ○ As used in NLP
  • Paper Discussion
SLIDE 3

Introduction

Image from Stanford’s CS231n supplementary notes

slide-4
SLIDE 4

Representational Power

  • A single hidden layer NN can approximate any continuous function (universal approximation)
  • http://neuralnetworksanddeeplearning.com/chap4.html

So why do we use deep neural networks?

  • Depth is sometimes the more intuitive structure (e.g. for images)
  • Works well in practice
SLIDE 5

Commonly Used Activations: Sigmoid

  • Historically used
  • Has a nice interpretation as a neuron's firing rate
  • Tendency to saturate and kill the gradient
  • In the regions where the neuron's output is near 1 or 0, the gradient is ≈ 0
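A minimal numpy sketch (not from the slides; values illustrative) of the saturation claim: the derivative sigmoid'(x) = sigmoid(x)·(1 - sigmoid(x)) peaks at 0.25 and vanishes for large |x|.

```python
import numpy as np

def sigmoid(x):
    # Logistic function.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.2e}")
# The gradient peaks at 0.25 (x=0) and is ~4.5e-05 by x=10: once a unit
# saturates near 0 or 1, almost no gradient flows back through it.
```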
SLIDE 6

Commonly Used Activations: Tanh

  • Still saturates, killing the gradient
  • But the gradient is ≠ 0 where the function's output is 0 (it is maximal there)
SLIDE 7

Commonly Used Activations: ReLU

  • Does not saturate (for positive inputs)
  • Faster to compute
  • Can cause units of the network to die: a unit stuck at output 0 receives zero gradient and never recovers
  • Converges faster in practice
SLIDE 8

Commonly Used Activations: Leaky ReLU & Maxout

  • Leaky ReLU: f(x) = max(x, αx) for a small α, so negative inputs keep a small non-zero gradient
  • Maxout generalizes Leaky ReLU: f(x) = max(w1·x + b1, w2·x + b2)
  • Doubles the number of parameters (see the sketch below)
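A small sketch of both (not from the slides; alpha and the weight shapes are illustrative). Maxout takes the max over two affine maps, which is why the parameter count doubles; Leaky ReLU is the special case max(x, α·x).

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(x, alpha * x): the small negative slope keeps units from dying.
    return np.where(x > 0, x, alpha * x)

def maxout(x, W1, b1, W2, b2):
    # Max over two affine maps: generalizes (leaky) ReLU,
    # at the cost of doubling the parameters per unit.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # batch of 4, 8 features
W1, W2 = rng.standard_normal((2, 8, 16))   # two weight sets -> 2x params
b1, b2 = np.zeros(16), np.zeros(16)
print(leaky_relu(np.array([-2.0, 3.0])))   # [-0.02  3.]
print(maxout(x, W1, b1, W2, b2).shape)     # (4, 16)
```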
SLIDE 9

Backpropagation and Gradient Computation

  • Let z(i) be the output of the ith layer, and s(i) its input
  • Let f be the activation being applied (elementwise)
  • Let w(i)_jk be the weight connecting the jth unit to the kth unit of the ith layer
  • We then have: s(i) = W(i) z(i-1) and z(i) = f(s(i)); the chain rule propagates the error backwards through these equations, delta(i) = (W(i+1)^T delta(i+1)) ⊙ f'(s(i))
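A minimal numpy sketch of these equations (not from the slides; shapes illustrative) for a two-layer net with sigmoid activations and squared error.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # input z(0)
y = np.array([1.0, 0.0])            # target
W1 = rng.standard_normal((3, 4)) * 0.1
W2 = rng.standard_normal((2, 3)) * 0.1

# Forward: s(i) = W(i) z(i-1), z(i) = f(s(i))
s1 = W1 @ x;  z1 = sigmoid(s1)
s2 = W2 @ z1; z2 = sigmoid(s2)

# Backward: delta(i) = (W(i+1)^T delta(i+1)) * f'(s(i)),
# with f'(s) = sigmoid(s) * (1 - sigmoid(s)) = z * (1 - z).
delta2 = (z2 - y) * z2 * (1 - z2)        # output layer, squared error
delta1 = (W2.T @ delta2) * z1 * (1 - z1)

# Gradients: dL/dW(i) = outer(delta(i), z(i-1))
grad_W2 = np.outer(delta2, z1)
grad_W1 = np.outer(delta1, x)
```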
SLIDE 10

Backpropagation and Gradient Computation

SLIDE 11

Backpropagation and Gradient Computation

SLIDE 12

Backpropagation and Activation

  • Why does sigmoid learn slowly?

Taken from “Understanding the difficulty of training deep feedforward neural networks”, Glorot and Bengio

SLIDE 13

Babysitting your gradient:

  • For a few examples (4-5 in a batch), compute the numerical gradient
  • Compare the gradient from backprop with the numerical gradient
      ○ Use relative error instead of absolute error
  • Rule of thumb (a sketch of the check follows below):
      ○ relative error > 1e-2 usually means the gradient is probably wrong
      ○ 1e-2 > relative error > 1e-4 should make you feel uncomfortable
      ○ 1e-4 > relative error is usually okay for objectives with kinks; but if there are no kinks (e.g. tanh nonlinearities and softmax), then 1e-4 is too high
      ○ 1e-7 and less: you should be happy
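A sketch of the check itself (not from the slides; f is any scalar loss, grad_f its analytic gradient, and eps is an illustrative step size).

```python
import numpy as np

def grad_check(f, grad_f, w, eps=1e-5):
    """Compare analytic grad_f(w) to a centered-difference estimate."""
    analytic = grad_f(w)
    numeric = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        numeric.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    # Relative, not absolute, error: scale-free comparison.
    rel = np.abs(analytic - numeric) / np.maximum(
        np.abs(analytic) + np.abs(numeric), 1e-12)
    return rel.max()

f = lambda w: np.sum(np.tanh(w) ** 2)
grad_f = lambda w: 2 * np.tanh(w) * (1 - np.tanh(w) ** 2)
print(grad_check(f, grad_f, np.linspace(-2, 2, 5)))  # tiny (~1e-10): happy
```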

SLIDE 14

Initialization: Glorot Uniform / Xavier

  • Do not start with all 0's (nothing would break the symmetry between units in a layer)
  • Sample from the Glorot uniform distribution W ~ U[-sqrt(6/(n_in + n_out)), +sqrt(6/(n_in + n_out))], or from a Gaussian with the same variance 2/(n_in + n_out)
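A minimal sketch of both variants (not from the slides; n_in and n_out are the layer's fan-in and fan-out).

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=np.random.default_rng()):
    # W ~ U[-limit, limit] with limit = sqrt(6 / (n_in + n_out)),
    # which gives Var[W] = limit**2 / 3 = 2 / (n_in + n_out).
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def glorot_normal(n_in, n_out, rng=np.random.default_rng()):
    # Gaussian with the same variance, 2 / (n_in + n_out).
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = glorot_uniform(256, 128)
print(W.var(), 2.0 / (256 + 128))  # empirical vs. target variance
```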
SLIDE 15
Initialization: Anything Goes?

  • Consider a network with linear neurons
  • Let z(i) be the output of the ith layer, and s(i) its input
  • Let the input x have 0 mean and variance Var[x]
  • Let all the weights be i.i.d. with 0 mean. Then:

Var[z(i)] = n(i-1)·Var[W(i)]·Var[z(i-1)] = Var[x] · ∏_{j=1..i} n(j-1)·Var[W(j)], where n(j-1) is the fan-in of layer j

SLIDE 16

Initialization: Anything Goes?

  • Similarly, for the backward pass of a net of depth d:

Var[∂L/∂s(i)] = Var[∂L/∂s(d)] · ∏_{j>i} n(j)·Var[W(j)]
SLIDE 17

Initialization: Anything Goes?

  • For information to flow, we want the variances preserved in both directions: Var[z(i)] = Var[z(i-1)] and Var[∂L/∂s(i)] = Var[∂L/∂s(i+1)]
  • And hence: n(i-1)·Var[W(i)] = 1 and n(i)·Var[W(i)] = 1; as a compromise, Var[W(i)] = 2 / (n(i-1) + n(i))
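A quick empirical sketch (not from the slides; depth and width illustrative) of why the scaling matters: with linear layers, a naive std=1 init multiplies the variance by n·Var[W] = 100 per layer, while Glorot scaling keeps it roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 100))  # batch of 1000, width n = 100

for name, std in [("naive std=1", 1.0),
                  ("glorot", np.sqrt(2.0 / (100 + 100)))]:
    z = x
    for _ in range(10):                      # 10 linear layers
        W = rng.normal(0.0, std, (100, 100))
        z = z @ W                            # linear neurons, no bias
    print(name, z.var())
# naive: variance grows by n*Var[W] = 100 per layer (~1e20 after 10 layers)
# glorot: n*Var[W] = 1, so the variance stays near 1
```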
SLIDE 18

Initialization: Sigmoid and ReLU

  • The linear assumption is good enough for tanh
  • For sigmoid and ReLU, small modifications are needed
  • The modification for ReLU (Var[W(i)] = 2/n(i-1)) + some other tricks:
      ○ By He, Zhang, Ren and Sun: https://arxiv.org/pdf/1502.01852.pdf
      ○ Surpassed human-level performance on ImageNet classification

SLIDE 19

Regularization (Motivation):

Neural nets have a strong tendency to overfit

SLIDE 20

Regularization (Motivation):

Effect of L2 Regularization

SLIDE 21

Regularization (Methodology):

  • L1 Regularization
  • L2 Regularization
  • Introducing noise
  • Max-norm constraints on the weights
  • Early stopping using a validation set
  • Dropout
  • Batch Normalization
SLIDE 22

Dropout:

  • Each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes.
  • http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf
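A sketch of the commonly used "inverted" dropout variant (not from the slides; the paper itself rescales activations at test time instead). Units are zeroed with probability 1 - p_keep during training and the survivors are rescaled so the expected activation is unchanged.

```python
import numpy as np

def dropout_forward(h, p_keep=0.5, train=True, rng=np.random.default_rng()):
    # Inverted dropout: zero each unit with prob (1 - p_keep) and
    # rescale the survivors by 1/p_keep, so E[output] = h.
    if not train:
        return h                        # identity at test time
    mask = (rng.random(h.shape) < p_keep) / p_keep
    return h * mask

h = np.ones((2, 6))
print(dropout_forward(h))               # some units zeroed, rest scaled to 2.0
print(dropout_forward(h, train=False))  # unchanged at test time
```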
SLIDE 23

Batch Normalization:

  • Normalizing the input helps in training
  • What if we could normalize the input to every layer of the network?
  • For each layer with d-dim input x = (x(1) ... x(d)), we want each dimension normalized: x̂(k) = (x(k) - E[x(k)]) / sqrt(Var[x(k)])
  • But normalizing like this may change what the layer represents
  • To overcome that, the transformation inserted in the network must be able to represent the identity transform: y(k) = γ(k)·x̂(k) + β(k)
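A training-time forward-pass sketch (not from the slides; eps and shapes illustrative, and the running statistics used at inference are omitted).

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, d). Normalize each dimension over the batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Learned scale/shift: with gamma = sqrt(var) and beta = mu,
    # the layer can recover the identity transform.
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(5.0, 3.0, (32, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.var(axis=0).round(3))  # ~0 mean, ~1 var
```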

SLIDE 24

Batch Normalization:

  • Leads to faster training
  • Less dependence on initialization
SLIDE 25

Some practical advice:

  • Gradient check on small data
  • Overfit without regularization on small data

  • Decay learning rate with time
  • Regularize
  • Always check learning curves
SLIDE 26

Introduction to Convolutional Networks

  • What are these convolutions and kernels?
  • https://docs.gimp.org/en/plug-in-convmatrix.html
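A sketch of a 2D convolution with a 3x3 kernel (not from the slides; the edge-detect kernel is a standard example of the kind the GIMP convolution-matrix plug-in above applies).

```python
import numpy as np

def conv2d(image, kernel):
    # 'Valid' convolution: slide the flipped kernel over the image.
    # (True convolution flips the kernel; CNNs usually skip the flip.)
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel[::-1, ::-1])
    return out

edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]], dtype=float)  # classic edge-detect kernel
img = np.zeros((8, 8)); img[:, 4:] = 1.0      # image with a vertical edge
print(conv2d(img, edge))                      # responds only at the edge
```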
SLIDE 27

Introduction to Convolutional Networks

  • Animation at http://cs231n.github.io/convolutional-networks/
SLIDE 28

Kinds of features learnt:

SLIDE 29

Convolutional Networks in NLP

  • Gives a good generalization of unigram, bigram, etc. features in embedding space (see the sketch below)
  • With more layers, the receptive field increases

Taken from Hierarchical Question Answering using Co Attention
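A sketch of a width-2 convolution over word embeddings (not from the slides; all dimensions illustrative): each filter scores every bigram window, i.e. acts as a learned bigram feature detector.

```python
import numpy as np

rng = np.random.default_rng(0)
sent = rng.standard_normal((7, 50))          # 7 words, 50-dim embeddings
filters = rng.standard_normal((10, 2, 50))   # 10 filters of width 2

# out[t, f] scores the bigram (word t, word t+1) with filter f.
out = np.array([[np.sum(sent[t:t+2] * filters[f])
                 for f in range(10)]
                for t in range(sent.shape[0] - 1)])
print(out.shape)  # (6, 10): one score per bigram window per filter
```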

SLIDE 30

Relation Extraction with Conv Networks:

SLIDE 31
Issue 1 with the Previous Task (Mintz et al., 2009):

  • Assumption: every sentence containing the two entities expresses the relation
  • Issue: the assumption is too strong
  • Solution: use a Multi-Instance Multi-Label model

Taken from (Zeng et al., 2015)

SLIDE 32

Issue 2 with the Previous Task:

  • Used handcrafted features + other NLP tools like dependency parsers
      ○ These have poor performance as sentence length increases
      ○ Long sentences form nearly 50% of the corpus being used to extract the relations
  • Solution: use Deep Learning!
      ○ Enter Convolutional Networks

SLIDE 33

The Model (Overview):

Taken from (Zeng et al., 2015)

SLIDE 34

The Model (Embedding):

  • Train word2vec (skip-gram model) [Why not CBOW?]
  • Use positional embeddings (distances from the two entities); see the sketch below:
      ○ Capture the notion of the distance of the word from the entities
      ○ The same word, at different locations in the sentence, might have different semantics
      ○ A proxy for LSTM embeddings
  • Final dimension for one word: R^(embed_dim + 2·embed_position)
  • Final dimension of the embedding layer: R^(|Sentence| × (embed_dim + 2·embed_position))
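A sketch of the embedding layer (not from the paper's code; table sizes, the distance-clipping range, and the sentence are illustrative): each word vector is concatenated with two position embeddings, one per entity.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, embed_position = 50, 5
word_emb = rng.standard_normal((10_000, embed_dim))          # word2vec table
pos_emb = rng.standard_normal((2 * 60 + 1, embed_position))  # distances -60..60

def embed(word_ids, e1_pos, e2_pos):
    # Row for word t: [word vector ; pos emb of (t - e1) ; pos emb of (t - e2)]
    rows = []
    for t, w in enumerate(word_ids):
        d1 = np.clip(t - e1_pos, -60, 60) + 60
        d2 = np.clip(t - e2_pos, -60, 60) + 60
        rows.append(np.concatenate([word_emb[w], pos_emb[d1], pos_emb[d2]]))
    return np.stack(rows)  # |Sentence| x (embed_dim + 2*embed_position)

x = embed([4, 17, 42, 7, 99], e1_pos=1, e2_pos=3)
print(x.shape)  # (5, 60)
```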

SLIDE 35

The Model (Convolution):

  • Convolution with kernel width w
  • Each filter W ∈ R^(w × n_dim_vector)
  • N filters (a hyperparameter)
  • Zero padding to ensure every word gets convolved (see the sketch below)
  • Final layer dimension: R^(N × (|S| + w - 1))
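A sketch of the layer (not from the paper's code; sizes illustrative). "Full" zero padding of w - 1 rows on each side makes every word appear at every offset within a window, giving N × (|S| + w - 1) outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d, w, N = 5, 60, 3, 100          # sentence len, dims, kernel width, filters
x = rng.standard_normal((S, d))
W = rng.standard_normal((N, w, d))

padded = np.vstack([np.zeros((w - 1, d)), x, np.zeros((w - 1, d))])
# One score per filter per window position: output is N x (S + w - 1).
c = np.array([[np.sum(padded[t:t+w] * W[f]) for t in range(S + w - 1)]
              for f in range(N)])
print(c.shape)  # (100, 7)
```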

SLIDE 36

The Model (Pooling + Softmax):

  • Pooling is done in a piecewise manner (see the sketch below)
  • Idea: pool separately over the three parts of the sentence split by the two entities
      ○ Remember ReVerb?
  • Less coarse than a single max pool over the whole sentence
  • The final dimension is R^(num_filters × 3)
  • Tanh non-linearity
  • Softmax to get a probability over all relations
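A sketch of the piecewise max pooling (not from the paper's code; entity window positions illustrative): the convolution output is split into three segments at the two entities and each segment is max-pooled separately.

```python
import numpy as np

def piecewise_max_pool(c, e1, e2):
    # c: (num_filters, T) convolution output; e1 < e2 are the window
    # indices of the two entities. Max-pool each of the 3 segments.
    segs = [c[:, :e1 + 1], c[:, e1 + 1:e2 + 1], c[:, e2 + 1:]]
    return np.concatenate([s.max(axis=1, keepdims=True) for s in segs],
                          axis=1)  # (num_filters, 3)

c = np.random.default_rng(0).standard_normal((100, 7))
p = np.tanh(piecewise_max_pool(c, e1=2, e2=4))  # tanh, as on the slide
print(p.shape)  # (100, 3); flattened and fed to a softmax over relations
```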

SLIDE 37

The Data:

  • A bag contains all sentences mentioning a given pair of entities
  • A bag is labeled r if there is at least one sentence in it which expresses r
  • Potentially multiple identical bags with different labels [unclear]
SLIDE 38

The Objective Function and Training:

  • Trained with mini-batches of bags, using AdaDelta
  • For each bag, only the sentence with the highest predicted probability for the bag's label contributes to the loss (the multi-instance assumption in the paper; see the sketch below)
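A sketch of the per-bag loss under that assumption (not from the paper's code; the probabilities are illustrative softmax outputs).

```python
import numpy as np

def bag_loss(sentence_probs, label):
    # sentence_probs: (num_sentences, num_relations), softmax outputs.
    # Multi-instance assumption: take the sentence most confident in
    # the bag's label, and apply cross-entropy only to that sentence.
    best = np.argmax(sentence_probs[:, label])
    return -np.log(sentence_probs[best, label])

probs = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
print(bag_loss(probs, label=1))  # uses sentence 1 only: -log(0.6) ~ 0.51
```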
SLIDE 39

Inference:

  • Given a bag and a relation r
  • The bag is marked as containing r if there exists at least one sentence in the bag with predicted relation r

SLIDE 40

Experiment Setup:

  • Dataset: Freebase relations aligned with the NYT corpus
  • Training: sentences from 2005-06
  • Testing: sentences from 2007
  • Held-out evaluation: extracted relations compared against Freebase
  • Manual evaluation: evaluation by humans
  • Word2vec trained on the NYT corpus; entity tokens concatenated with ##
  • Grid search over hyperparameters
SLIDE 41

Results (Held-out Evaluation):

  • Half of the Freebase relations used for testing [Doubt]
  • Relations extracted from the test articles compared against the Freebase relations
  • Results compared against Mintz, MultiR and Multi-Instance Multi-Label (MIML) learning
SLIDE 42

Results (Manual Evaluation):

  • Chose entity pairs where at least one entity was not present in Freebase as candidates (to avoid overlap with the held-out set)
  • Top N relations extracted, and precision computed
  • Since not all true relations are known, recall is not reported (pseudo-recall?)
SLIDE 43

Results (Ablation Study):

SLIDE 44

Problems:

  • Analysis of where PCNN improves over MultiR/MIML is lacking [Surag]
  • No coreference resolution [Rishab]
  • No alternatives to 3 segment piecewise convolution [Haroun]
  • Suffers from incomplete Freebase [Daraksha]
  • Does not consider overlapping relations [Daraksha]
  • A lot of training examples not being used [Shantanu]
  • No comparison with other architectures [Akg]
SLIDE 45

Extensions:

  • LSTMs [a lot of people] / more convolutional layers
  • Bootstrapping along with MultiR
  • Use other lexical features:
      ○ A comparison with handcrafted features and kernel-based approaches could be done to see what the architecture fails to capture [Anshul]
      ○ Consequently, kernel features could be added [Haroun, Dinesh Raghu]
      ○ Critiquing the critics [Arindam]
  • Recursive auto-encoders [Dinesh R.] (?)
  • Better embeddings [Gagan, Happy]
  • Using alternative Knowledge Bases for evaluation [Happy]
  • Taking all the sentences which express the relation instead of only the one with the max probability [Shantanu]

SLIDE 46

Extensions (Contd.):

  • Binning sentences by length (but how many weights to train?) [Anshul]
  • Adding logic-based constraints while learning (Confidence(y1) => Confidence(y2)) [Prachi]
  • Instead of the at-least-1 paradigm, an at-least-k paradigm [Yashoteja]
  • Attention-based LSTM, CNN [Prachi] (?)
  • A probability distribution over the sentences in a bag, with every sentence predicting a relation, or NONE [Yashoteja]