Deep Learning
Barun Patra
Index
○ Introduction to Neural Nets
    ○ Activations
        ○ Sigmoid
        ○ Tanh
        ○ ReLU (Derivatives)
    ○ Dropout
    ○ Batch Norm
○ Convolutional Networks
    ○ Inspiration
    ○ Kernels
    ○ Idea
    ○ As used in NLP
○ Paper Discussion
Image from Stanford’s CS231n supplementary notes
So why do we use deep neural networks?
Leaky ReLU
Generalizing Leaky ReLU: Maxout
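A minimal NumPy sketch of both ideas (shapes and names are my own, not from the slides): Leaky ReLU keeps a small slope for negative inputs, and Maxout takes the max over k affine pieces, of which Leaky ReLU is the special case max(x, αx).

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Pass positives through unchanged; scale negatives by a small alpha.
    return np.where(x > 0, x, alpha * x)

def maxout(x, W, b):
    # Maxout: compute k affine pieces and take the elementwise maximum.
    # W: (k, d_in, d_out), b: (k, d_out), x: (d_in,)
    pieces = np.einsum('kio,i->ko', W, x) + b  # (k, d_out)
    return pieces.max(axis=0)                  # (d_out,)
```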
Let $w^{(i)}_{jk}$ be the weight connecting the j-th and the k-th unit in the i-th layer.
Taken from “Understanding the difficulty of training deep feedforward neural networks”, Glorot and Bengio
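The fix proposed in that paper is what is now called Xavier/Glorot initialization; a hedged sketch of the uniform variant, which keeps activation and gradient variance roughly constant across layers (layer sizes below are illustrative):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Glorot & Bengio: W ~ U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W1 = xavier_uniform(784, 256)  # e.g. the first layer of an MNIST-sized MLP
```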
○ Use relative error instead of absolute error
○ relative error > 1e-2 usually means the gradient is probably wrong
○ 1e-2 > relative error > 1e-4 should make you feel uncomfortable
○ 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
○ 1e-7 and less: you should be happy. (A sketch of this check follows.)
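A small sketch of the gradient check these thresholds refer to, using a centered difference and the relative-error formula (the toy objective is mine):

```python
import numpy as np

def relative_error(analytic, numeric, eps=1e-8):
    # |a - n| / max(eps, |a| + |n|): a scale-free comparison of two gradients.
    return np.abs(analytic - numeric) / np.maximum(eps, np.abs(analytic) + np.abs(numeric))

def numerical_gradient(f, x, h=1e-5):
    # Centered difference (f(x+h) - f(x-h)) / 2h, one coordinate at a time.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fp = f(x)
        x[i] = old - h; fm = f(x)
        x[i] = old
        grad[i] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

x = np.random.randn(3, 4)
num = numerical_gradient(lambda x: np.sum(x ** 2), x)
print(relative_error(2 * x, num).max())  # smooth objective: expect < 1e-7
```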
○ By He, Zhang and Ren: https://arxiv.org/pdf/1502.01852.pdf
○ Surpassed human-level performance on ImageNet classification (its initialization is sketched below)
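A minimal sketch of the initialization proposed in that paper; the ReLU-specific factor of 2 is the key difference from Xavier:

```python
import numpy as np

def he_normal(fan_in, fan_out, rng=np.random.default_rng(0)):
    # He et al.: W ~ N(0, 2/fan_in), preserving variance through ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
```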
Strong tendency of a Neural Net to overfit
Effect of L2 Regularization
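For concreteness, a sketch of what L2 regularization adds to the loss and to the update (hyperparameters are illustrative): the penalty (λ/2)·||W||² shows up as a weight-decay term λ·W in the gradient step.

```python
import numpy as np

def l2_penalty(weights, lam=1e-4):
    # Add (lam/2) * sum of squared weights to the data loss.
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def sgd_step(W, grad_W, lr=0.1, lam=1e-4):
    # d/dW of the penalty is lam * W, i.e. weights decay toward zero.
    return W - lr * (grad_W + lam * W)
```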
Dropout forces each hidden unit to learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes.
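A minimal sketch of (inverted) dropout, the variant most implementations use: units are kept with probability p at train time and pre-scaled by 1/p, so test time is a plain forward pass.

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True, rng=np.random.default_rng(0)):
    if not train:
        return h  # test time: no masking, no rescaling needed
    mask = (rng.random(h.shape) < p) / p  # keep-mask, pre-scaled by 1/p
    return h * mask
```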
The learned scale and shift parameters (γ, β) allow a Batch Norm layer to represent the identity transform, so the normalization can be undone if that is what the network needs.
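A sketch of the train-time Batch Norm transform showing why: with γ = sqrt(var + ε) and β = mean, the layer collapses to the identity.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)   # per-feature batch mean
    var = x.var(axis=0)   # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # learned scale/shift can undo normalization
```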
small data.
Taken from “Hierarchical Question Answering using Co-Attention”
Taken from (Zeng et al., 2015)
○ Have poor performance with increased sentence length
○ Long sentences form nearly 50% of the corpus being used to extract the relations
○ Enter Convolutional Networks
Taken from (Zeng et al., 2015)
○ Capture the notion of distance of the word from the entities
○ The same word, at different locations in the sentence, might have different semantics
○ A proxy to LSTM embeddings (a minimal sketch follows)
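A hypothetical sketch of such position features (sizes and names are illustrative, not the paper's code): each word is looked up in a word-embedding table plus two position tables indexed by its clipped distance to each entity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, embed_word, embed_position, max_dist = 10_000, 50, 5, 60
E_word = rng.normal(size=(vocab, embed_word))
E_pos = rng.normal(size=(2 * max_dist + 1, embed_position))  # distances in [-60, 60]

def encode(word_ids, e1_idx, e2_idx):
    rows = []
    for i, w in enumerate(word_ids):
        d1 = np.clip(i - e1_idx, -max_dist, max_dist) + max_dist
        d2 = np.clip(i - e2_idx, -max_dist, max_dist) + max_dist
        rows.append(np.concatenate([E_word[w], E_pos[d1], E_pos[d2]]))
    return np.stack(rows)  # (n, embed_word + 2*embed_position)
```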
Each word is thus represented by a vector of size (embed_word + 2*embed_position); the resulting sentence matrix gets convolved with filters of width k, producing (n - k + 1) windows per filter.
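Continuing the same sketch, the convolution-plus-max-pooling step: filters of width k slide over the (n, d) sentence matrix, giving n - k + 1 windows per filter, and max-pooling over the windows yields a fixed-length vector for any sentence length.

```python
import numpy as np

def conv_max_pool(S, filters):
    # S: (n, d) sentence matrix; filters: (f, k, d). Returns an (f,) vector.
    n, d = S.shape
    f, k, _ = filters.shape
    out = np.empty((f, n - k + 1))
    for j in range(n - k + 1):
        window = S[j:j + k]  # (k, d) slice of k consecutive words
        out[:, j] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return out.max(axis=1)  # max over the n - k + 1 windows
```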
○ Remember ReVerb?
relations
predicted r
candidate (to avoid overlap with the held-out set)
○ A comparison with handcrafted features and kernel-based approaches could be done to see what the architecture fails to capture [Anshul]
○ Consequently, kernel features could be added [Haroun, Dinesh Raghu]
○ Critiquing the critics [Arindam]
the max probability [Shantanu]
Confidence(y2)) [Prachi]
predicting a relation, or NONE [Yashoteja]