INF5820: Language technological applications
Course summary
Andrey Kutuzov, Lilja Øvrelid, Stephan Oepen, Taraka Rama & Erik Velldal
University of Oslo
20 November 2018
Today
◮ Exam preparations
◮ Collectively summing up
◮ Results of obligatory assignment(s)
◮ Current trends, beyond INF5820:
  ◮ Cutting edge in word embedding pre-training
  ◮ Transfer and multi-task learning
  ◮ Adversarial learning
  ◮ Transformers
  ◮ And more. . .
Exam
◮ When: Monday November 26, 09:00 AM (4 hours).
◮ Where: Store fysiske lesesal, Fysikkbygningen
◮ How:
  ◮ No aids (no textbooks, etc.)
  ◮ Pen and paper (not Inspera)
  ◮ Not a programming exam
  ◮ Focus on conceptual understanding
  ◮ Could still involve equations, but no complicated calculations by hand
  ◮ Details of use cases we’ve considered (in lectures or assignments) are also relevant
Neural Network Methods for NLP
(The Great Wave off Kanagawa by Katsushika Hokusai)
What has changed?
◮ We’re still within the realm of supervised machine learning. But:
◮ A shift from linear models with discrete representations of manually specified features,
◮ to non-linear models with distributed and learned representations.
◮ We’ll consider two main themes running through the semester: architectures and representations.
Architectures and model design
◮ Linear classifiers, feed-forward networks (MLPs and CNNs) and RNNs.
◮ Various instantiations of 1d CNNs:
  ◮ Multi-channel, stacked / hierarchical, graph CNNs
  ◮ Other choices: pooling strategy, window sizes, number of filters, stride. . .
◮ Variations beyond simple RNNs:
  ◮ (Bi)LSTM + GRU (gating), attention and stacking.
  ◮ Variations of how RNNs can be used: acceptors, transducers, conditioned generation (encoder-decoder / seq.-to-seq.); a small acceptor sketch follows below.
  ◮ Various ways of performing sequence labeling with RNNs
Various aspects of modeling common to all the neural architectures:
◮ dimensionalities, regularization, initialization, handling OOVs, activation functions, batches, loss functions, learning rate, optimizer, . . .
◮ Embedding pre-training and text pre-processing
◮ Backpropagation, vanishing / exploding gradients
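As a concrete illustration of the acceptor pattern mentioned above, here is a minimal PyTorch sketch (not taken from the lectures; the class name, vocabulary size and dimensions are all made up): a BiLSTM reads the embedded sequence, and only its concatenated final hidden states feed a linear classifier.

import torch
import torch.nn as nn

class BiLSTMAcceptor(nn.Module):
    """Embed tokens, run a BiLSTM over the sequence, classify from the final states."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # the final forward and backward states are concatenated
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, token_ids):               # (batch, seq_len)
        embedded = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.bilstm(embedded)     # h_n: (2, batch, hidden_dim)
        final = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.out(final)                  # (batch, n_classes)

# Toy usage on a batch of two sequences of token ids
logits = BiLSTMAcceptor()(torch.randint(0, 10000, (2, 7)))

A transducer would instead apply the output layer at every time step, and conditioned generation would feed the encoded input to a decoder RNN.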
Representations
◮ An important part of the neural ‘revolution’ in NLP: the input representations provided to the learner.
◮ Traditional feature vectors: high-dimensional, sparse, categorical and discrete. Based on manually specified feature templates.
◮ Word embeddings: low-dimensional, dense, continuous and distributed. Often learned automatically, e.g. as a language model.
◮ Main benefit of using embeddings rather than one-hot encodings (contrasted in the sketch below):
  ◮ Information-sharing between features, counteracts data sparseness.
  ◮ Can be computed from unlabelled data.
◮ We’ve also considered various tasks for intrinsic evaluation of distributional word vectors.
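A small NumPy sketch of the contrast drawn above, with an invented three-word vocabulary: the one-hot vector is as long as the vocabulary and almost entirely zeros, while the embedding lookup returns a short dense vector whose values can be learned from unlabelled data.

import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}                  # toy vocabulary
vocab_size, emb_dim = len(vocab), 4

# One-hot: as long as the vocabulary, all zeros except one position
one_hot = np.zeros(vocab_size)
one_hot[vocab["cat"]] = 1.0                             # [0., 1., 0.]

# Embedding: a row of a dense matrix (random here; normally learned, e.g. as a language model)
embedding_matrix = np.random.randn(vocab_size, emb_dim)
dense = embedding_matrix[vocab["cat"]]                  # 4 continuous values

# Similar words can get similar rows, so information is shared between features;
# with one-hot inputs, every word is equally distant from every other word.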
Representation learning
◮ With neural network models, our main interest is not always in the final classification outcome itself.
◮ Rather, we might be interested in the learned internal representations.
◮ Examples?
  ◮ Embeddings in neural models
  ◮ Pre-trained or learned from scratch (with one-hot input)
  ◮ Static (frozen) or dynamic.
  ◮ The pooling layer of a CNN or the final hidden state of an RNN provides a fixed-length representation of an arbitrary-length sequence (sketched below).
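A hedged PyTorch sketch of the last point (all dimensions are illustrative): max-over-time pooling of 1d-CNN feature maps turns a sequence of any length into one fixed-length vector, exactly the kind of learned internal representation one might reuse downstream.

import torch
import torch.nn as nn

emb_dim, n_filters, width = 100, 64, 3
conv = nn.Conv1d(emb_dim, n_filters, kernel_size=width)

def encode(embedded):
    """(batch, seq_len, emb_dim) -> (batch, n_filters), for any seq_len >= width."""
    feature_maps = torch.relu(conv(embedded.transpose(1, 2)))
    pooled, _ = feature_maps.max(dim=2)     # max-over-time pooling
    return pooled

short = encode(torch.randn(1, 5, emb_dim))
long_ = encode(torch.randn(1, 50, emb_dim))
assert short.shape == long_.shape == (1, n_filters)   # fixed length either way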
Specialized NN architectures
◮ Focus of manual engineering shifted from features to architecture decisions and hyper-parameters.
◮ The elimination of feature engineering is only partially true:
  ◮ Need for specialized NN architectures that extract higher-level features:
  ◮ CNNs and RNNs.
◮ Pitch: layers and architectures are like Lego bricks – mix and match.
◮ Examples of things you could be asked to reflect on:
  ◮ When would you use each architecture?
  ◮ What are some of the ways we’ve combined the various bricks?
  ◮ When choosing to apply a non-hierarchical CNN, what assumptions are you implicitly making about the nature of your task or data?
  ◮ Why could it make sense to run a CNN over the word-by-word vector outputs of an RNN (e.g. a BiLSTM)? One possible combination is sketched below.
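One possible reading of that last question, as a small PyTorch sketch with made-up dimensions: the CNN filters slide over the word-by-word outputs of the BiLSTM, so each window already contains context-enriched vectors rather than isolated word embeddings.

import torch
import torch.nn as nn

emb_dim, hidden, n_filters = 100, 128, 64
bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
conv = nn.Conv1d(2 * hidden, n_filters, kernel_size=3, padding=1)

embedded = torch.randn(2, 9, emb_dim)        # (batch, seq_len, emb_dim)
contextual, _ = bilstm(embedded)             # (batch, seq_len, 2 * hidden)
windows = conv(contextual.transpose(1, 2))   # filters see windows of context-aware vectors
sentence_vec, _ = torch.relu(windows).max(dim=2)   # (batch, n_filters) sentence representation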
INF5820: Experiment Design

Methodology
◮ Small, elite group of I:ST finishers;
◮ Engage everyone from start to finish;
◮ an Olympic twist: friendly competition;
◮ acquire practical skills and intuitions.

Main Results
◮ We are very happy with the results from the experiment (so far);
◮ we commonly apply two key metrics in internal evaluation:
  ◮ retention rate: 9/9; survival rate: 9/9.
F1 = 1.0. Hooray!

Post Mortem Debugging
◮ Please submit your views through the on-line course evaluation!
INF5820: Evaluation Protocol
INF5820: Empirical Results
◮ Total points: Stig Berggren
◮ Average rank: Celina Moldestad, Filip Stefaniuk
Congratulations! To Everyone!
The Olympic Spirit (Laboratory this Thursday)
Looking Ahead: Yoav Goldberg in 2018
Contents
1. Results of obligatory assignments
2. Recent topics in Deep Learning NLP not covered in this course
   1. Multi-task learning
   2. Adversarial generators
   3. Transformers
   4. Pre-trained language models
1. Multi-task learning
◮ Sharing parameters between models trained on multiple tasks:
  ◮ tying the weights of different layers [Collobert et al., 2011] (a minimal sketch follows below)
◮ Conceptually, even using pre-trained word embeddings is multi-task learning or semi-supervised learning.
◮ See Chapter 20 of [Goldberg, 2017]
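A minimal sketch of the weight-tying idea (PyTorch; the two task names and all sizes are invented for illustration): one shared encoder feeds two task-specific heads, and the gradients from both tasks update the shared parameters.

import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """Hard parameter sharing: one encoder, one output head per task."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128,
                 n_pos_tags=17, n_ner_labels=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)   # tied across tasks
        self.pos_head = nn.Linear(hidden, n_pos_tags)                # task-specific
        self.ner_head = nn.Linear(hidden, n_ner_labels)              # task-specific

    def forward(self, token_ids, task):
        states, _ = self.encoder(self.embed(token_ids))
        head = self.pos_head if task == "pos" else self.ner_head
        return head(states)

model = SharedEncoderMTL()
batch = torch.randint(0, 10000, (4, 12))
loss = model(batch, task="pos").sum() + model(batch, task="ner").sum()  # toy joint loss
loss.backward()   # gradients from both tasks update the shared encoder and embeddings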
1. Multi-task learning
◮ A way to inject linguistic information (inductive bias) into models.
◮ Human eye-tracking data helps document classification [Barrett et al., 2018]
◮ Dedicated benchmarks for multi-task learning are appearing:
  ◮ Natural Language Decathlon [McCann et al., 2018]
2. Adversarial generators
◮ Generative Adversarial Networks (GANs): several neural networks contesting with each other [Goodfellow et al., 2014]
◮ For example, one generates a text and another tries to tell the generated text from natural text.
◮ Eventually, the first network learns to ‘deceive’ the second one...
◮ ...that is, to generate natural-looking text. (A toy training loop is sketched below.)
‘GANs were poorly understood and hard to get to work in the beginning and only took off once researchers figured out the right tricks and learned how to make them work.’ (Yann LeCun)
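A heavily simplified training-loop sketch of that game (PyTorch, toy continuous data rather than text, since applying GANs to discrete text needs extra tricks): the discriminator D learns to separate real from generated samples, while the generator G learns to fool D.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))  # noise -> fake sample
D = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 8) + 3.0                 # stand-in for 'natural' data
    fake = G(torch.randn(64, 16))

    # 1) Discriminator: label real as 1, generated as 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator: try to make the discriminator output 1 for generated samples
    g_loss = bce(D(G(torch.randn(64, 16))), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()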
2. Adversarial generators
(Figure from [Goodfellow et al., 2014])
2. Adversarial generators
◮ Adding ‘adversarial’ examples to the training data makes models more robust (one classic perturbation recipe is sketched below).
◮ One can find cases where the model is accurate for the wrong reasons.
◮ Shown to be useful for question answering systems [Mudrakarta et al., 2018]
...but ‘How fast are the bricks speaking on either side of the building?’ still produces the same answer!
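The adversarial questions above are crafted by editing text; a common continuous-space analogue, not covered in the lectures, is the fast gradient sign method, sketched here on a toy classifier: perturb the input in the direction that increases the loss, then mix the result back into the training data.

import torch
import torch.nn as nn

# Toy classifier over continuous features (think: averaged word embeddings)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, y, epsilon=0.05):
    """Perturb x a small step in the direction that increases the loss."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + epsilon * x.grad.sign()).detach()

x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))
x_adv = fgsm(x, y)                       # adversarial variants of the inputs...
train_x = torch.cat([x, x_adv])          # ...added to the training data
train_y = torch.cat([y, y])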
3. Transformers
◮ The Transformer idea: no recurrent encoders and decoders, only multi-head self-attention [Vaswani et al., 2017].
◮ A transduction model computing representations of input and output without using sequence-aligned RNNs or convolutions.
◮ Brought major improvements in machine translation.
(The core self-attention operation is sketched below.)
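The core operation is compact enough to sketch directly; below is single-head scaled dot-product self-attention in the style of [Vaswani et al., 2017], with illustrative projection sizes and no masking.

import math
import torch
import torch.nn as nn

d_model, d_k = 512, 64
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

def self_attention(x):                      # x: (batch, seq_len, d_model)
    q, k, v = W_q(x), W_k(x), W_v(x)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)              # how much each position attends to each other one
    return weights @ v                      # (batch, seq_len, d_k)

out = self_attention(torch.randn(2, 10, d_model))

The full Transformer runs several such heads in parallel, adds residual connections, layer normalization and position-wise feed-forward layers, and injects positional encodings to recover the order information an RNN would get for free.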
3. Transformers
Some code walk-throughs
◮ https://nlp.seas.harvard.edu/2018/04/03/attention.html
◮ https://github.com/tensorflow/tensor2tensor
Recently, transformers were combined with bidirectional pre-trained language models in BERT [Devlin et al., 2018].
4. Pre-trained language models
Language models can provide contextualized word embeddings, with different representations in different contexts (see the toy illustration below).
◮ Embeddings from Language MOdels (ELMo) use LSTMs [Peters et al., 2018]
◮ Bidirectional Encoder Representations from Transformers (BERT) uses bidirectional transformers [Devlin et al., 2018]
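A toy PyTorch illustration of what ‘contextualized’ means (an untrained BiLSTM over an invented vocabulary, so only the mechanism is shown, not actual ELMo or BERT behaviour): a static embedding gives ‘bank’ the same vector everywhere, while the recurrent states for ‘bank’ differ between the two sentences.

import torch
import torch.nn as nn

vocab = {"the": 0, "river": 1, "bank": 2, "deposit": 3, "money": 4, "at": 5}
embed = nn.Embedding(len(vocab), 16)                 # static lookup: one vector per word type
bilstm = nn.LSTM(16, 16, batch_first=True, bidirectional=True)

def contextual(sentence):
    ids = torch.tensor([[vocab[w] for w in sentence]])
    states, _ = bilstm(embed(ids))                   # one vector per token, context-dependent
    return states[0]

vec_a = contextual(["the", "river", "bank"])[2]                     # 'bank' near 'river'
vec_b = contextual(["deposit", "money", "at", "the", "bank"])[4]    # 'bank' near 'money'

print(torch.equal(embed(torch.tensor(2)), embed(torch.tensor(2))))  # True: static vector never changes
print(torch.allclose(vec_a, vec_b))                                 # False: contextual vectors differ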
4. Pre-trained language models
ELMo embeddings seem to improve any NLP task you apply them to:
‘ImageNet for NLP’ (Sebastian Ruder)
4. Pre-trained language models

Modes of usage
1. ‘as is’: contextualized representations are fed into the overarching architecture like the old-school ‘static’ embeddings;
2. the whole model is fine-tuned on target-task data.
(Both modes are sketched schematically below.)

Layers of ELMo reflect language tiers
◮ word embedding layer: morphology;
◮ the first LSTM layer: syntax;
◮ the second LSTM layer: semantics (including word senses).

More info
◮ https://allennlp.org/elmo
◮ https://github.com/allenai/bilm-tf
◮ https://github.com/google-research/bert
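A schematic PyTorch sketch of the two modes, using a plain LSTM as a stand-in for the pre-trained model rather than any real ELMo or BERT API: in mode 1 the pre-trained weights are frozen and only the task head is trained; in mode 2 the optimizer also updates the pre-trained weights.

import torch
import torch.nn as nn

pretrained_encoder = nn.LSTM(100, 256, batch_first=True)  # stand-in for pre-trained ELMo/BERT weights
task_head = nn.Linear(256, 5)                              # e.g. a 5-way classifier on top

# Mode 1: 'as is' -- freeze the pre-trained model, use its outputs as features
for p in pretrained_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)

# Mode 2: fine-tuning -- gradients from the target task also update the pre-trained weights
for p in pretrained_encoder.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(
    list(pretrained_encoder.parameters()) + list(task_head.parameters()), lr=2e-5)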
Deep Learning in NLP: the future is bright!
(From Min-Yen Kan's keynote speech at COLING-2018)
New and exciting research is coming; stay tuned to arXiv.org and the ACL Anthology!
References I
Barrett, M., Bingel, J., Hollenstein, N., Rei, M., and Søgaard, A. (2018). Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 302–312. Association for Computational Linguistics.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
References II
Goldberg, Y. (2017). Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

McCann, B., Keskar, N. S., Xiong, C., and Socher, R. (2018). The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
References III
Mudrakarta, P. K., Taly, A., Sundararajan, M., and Dhamdhere, K. (2018). Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1896–1906. Association for Computational Linguistics.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
References IV
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.