SLIDE 1

INF5820: Language technological applications Course summary

Andrey Kutuzov, Lilja Øvrelid, Stephan Oepen, Taraka Rama & Erik Velldal

University of Oslo

20 November 2018

SLIDE 2

Today

◮ Exam preparations
◮ Collectively summing up
◮ Results of obligatory assignment(s)
◮ Current trends, beyond INF5820:
  ◮ Cutting edge in word embedding pre-training
  ◮ Transfer and multi-task learning
  ◮ Adversarial learning
  ◮ Transformers
  ◮ And more...

SLIDE 3

Exam

◮ When: Monday November 26, 09:00 AM (4 hours).
◮ Where: Store fysiske lesesal, Fysikkbygningen
◮ How:
  ◮ No aids (no textbooks, etc.)
  ◮ Pen and paper (not Inspera)
  ◮ Not a programming exam
  ◮ Focus on conceptual understanding
  ◮ Could still involve equations, but no complicated calculations by hand
  ◮ Details of use cases we’ve considered (in lectures or assignments) are also relevant

SLIDE 4

Neural Network Methods for NLP

(Image: The Great Wave off Kanagawa by Katsushika Hokusai)

SLIDE 5-6

What has changed?

◮ We’re still within the realm of supervised machine learning. But:
◮ A shift from linear models with discrete representations of manually specified features,
◮ to non-linear models with distributed and learned representations.
◮ We’ll consider two main themes running through the semester: architectures and representations.

SLIDE 7-14

Architectures and model design

◮ Linear classifiers, feed-forward networks (MLPs and CNNs), and RNNs.
◮ Various instantiations of 1d CNNs:
  ◮ Multi-channel, stacked / hierarchical, graph CNNs
  ◮ Other choices: pooling strategy, window sizes, number of filters, stride...
◮ Variations beyond simple RNNs:
  ◮ (Bi)LSTM + GRU (gating), attention and stacking.
  ◮ Variations of how RNNs can be used: acceptors, transducers, conditioned generation (encoder-decoder / seq.-to-seq.)
  ◮ Various ways of performing sequence labeling with RNNs (see the sketch below)

Various aspects of modeling common to all the neural architectures:

◮ dimensionalities, regularization, initialization, handling OOVs, activation functions, batches, loss functions, learning rate, optimizer, ...
◮ Embedding pre-training and text pre-processing
◮ Backpropagation, vanishing / exploding gradients
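To make the sequence-labeling point concrete, here is a minimal sketch of a BiLSTM tagger in PyTorch. The framework choice, class name and all sizes are illustrative assumptions, not code from the course:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM sequence labeler: one tag per input token."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional: each position sees both left and right context.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Per-token classification over the two concatenated directions.
        self.out = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, token_ids):                  # (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))
        return self.out(states)                    # (batch, seq_len, n_tags)

# Hypothetical usage with made-up sizes:
tagger = BiLSTMTagger(vocab_size=10000, emb_dim=100, hidden_dim=64, n_tags=17)
logits = tagger(torch.randint(0, 10000, (2, 12)))  # two 12-token sentences
```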

SLIDE 15-17

Representations

◮ An important part of the neural ‘revolution’ in NLP: the input representations provided to the learner.
◮ Traditional feature vectors: High-dimensional, sparse, categorical and discrete. Based on manually specified feature templates.
◮ Word embeddings: Low-dimensional, dense, continuous and distributed. Often learned automatically, e.g. as a language model.
◮ Main benefit of using embeddings rather than one-hot encodings (contrasted in the sketch below):
  ◮ Information-sharing between features, which counteracts data sparseness.
  ◮ Can be computed from unlabelled data.
◮ We’ve also considered various tasks for intrinsic evaluation of distributional word vectors.
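As a small illustration of that contrast, a PyTorch sketch (the sizes and the random ‘pre-trained’ matrix are stand-ins of my own, not real word2vec vectors):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 5, 3

# One-hot encoding: every word gets its own axis; all words are equally
# distant from each other, so no information is shared between them.
one_hot = torch.eye(vocab_size)                  # (5, 5)

# Dense embeddings: rows can be pre-trained on unlabelled data (the random
# matrix below is a stand-in for, e.g., word2vec vectors), so related
# words can receive related vectors.
pretrained = torch.randn(vocab_size, emb_dim)
embed = nn.Embedding.from_pretrained(pretrained, freeze=False)

ids = torch.tensor([0, 3, 3, 1])                 # a 4-token 'sentence'
print(one_hot[ids].shape)                        # torch.Size([4, 5])
print(embed(ids).shape)                          # torch.Size([4, 3])
```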

SLIDE 18-19

Representation learning

◮ With neural network models, our main interest is not always in the final classification outcome itself.
◮ Rather, we might be interested in the learned internal representations.
◮ Examples?
  ◮ Embeddings in neural models:
    ◮ Pre-trained or learned from scratch (with one-hot input)
    ◮ Static (frozen) or dynamic.
  ◮ The pooling layer of a CNN or the final hidden state of an RNN provides a fixed-length representation of an arbitrary-length sequence (sketched below).
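A tiny sketch of that last point, with made-up sizes: whatever the input length, the final hidden state has the same dimensionality:

```python
import torch
import torch.nn as nn

# The final hidden state of an RNN is a fixed-length vector
# regardless of how long the input sequence is.
lstm = nn.LSTM(input_size=50, hidden_size=32, batch_first=True)

short_seq = torch.randn(1, 5, 50)    # 5 tokens
long_seq = torch.randn(1, 40, 50)    # 40 tokens

_, (h_short, _) = lstm(short_seq)
_, (h_long, _) = lstm(long_seq)
print(h_short.shape, h_long.shape)   # both torch.Size([1, 1, 32])
```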

SLIDE 20-21

Specialized NN architectures

◮ Focus of manual engineering shifted from features to architecture decisions and hyper-parameters.
◮ The elimination of feature-engineering is only partially true:
  ◮ Need for specialized NN architectures that extract higher-level features: CNNs and RNNs.
◮ Pitch: layers and architectures are like Lego bricks – mix and match.
◮ Examples of things you could be asked to reflect on:
  ◮ When would you use each architecture?
  ◮ What are some of the ways we’ve combined the various bricks?
  ◮ When choosing to apply a non-hierarchical CNN, what assumptions are you implicitly making about the nature of your task or data?
  ◮ Why could it make sense to run a CNN over the word-by-word vector outputs of an RNN (e.g. a BiLSTM)? (See the sketch below.)
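As a hedged answer sketch to that last reflection question (all sizes are made up): the BiLSTM first produces contextualized per-token vectors, and the CNN then extracts local patterns over those vectors:

```python
import torch
import torch.nn as nn

# BiLSTM gives contextualized vectors for each token; the 1d CNN then
# extracts local 'n-gram over context' features from them.
lstm = nn.LSTM(input_size=50, hidden_size=32, batch_first=True,
               bidirectional=True)
# Conv1d expects (batch, channels, seq_len); channels = 2 * hidden size.
conv = nn.Conv1d(in_channels=64, out_channels=16, kernel_size=3, padding=1)

tokens = torch.randn(4, 20, 50)           # batch of 4, 20 tokens each
states, _ = lstm(tokens)                  # (4, 20, 64)
features = conv(states.transpose(1, 2))   # (4, 16, 20)
pooled, _ = features.max(dim=2)           # (4, 16) fixed-length summary
```

The implicit design choice: the RNN supplies long-range context, while the convolution and max-pooling extract position-invariant local features from it.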

SLIDE 22-31

INF5820: Experiment Design

Methodology

◮ Small, elite group of I:ST finishers;
◮ Engage everyone from start to finish;
◮ an Olympic twist: friendly competition;
◮ acquire practical skills and intuitions.

Main Results

◮ We are very happy with the results from the experiment (so far);
◮ we commonly apply two key metrics in internal evaluation:
  ◮ retention rate: 9/9; survival rate: 9/9.

F1 = 1.0. Hooray!

Post Mortem Debugging

◮ Please submit your views through the on-line course evaluation!

SLIDE 32

INF5820: Evaluation Protocol

SLIDE 33-38

INF5820: Empirical Results

◮ Total points: Stig Berggren
◮ Average rank: Celina Moldestad, Filip Stefaniuk

Congratulations! To Everyone!

SLIDE 39

The Olympic Spirit (Laboratory this Thursday)

SLIDE 40

Looking Ahead: Yoav Goldberg in 2018

SLIDE 41

Contents

1. Results of obligatory assignments
2. Recent topics in Deep Learning NLP not covered in this course:
   1. Multi-task learning
   2. Adversarial generators
   3. Transformers
   4. Pre-trained language models

SLIDE 42-44

1. Multi-task learning

◮ Sharing parameters between models trained on multiple tasks:
  ◮ tying the weights of different layers [Collobert et al., 2011] (sketched below)
◮ Conceptually, even using pre-trained word embeddings is multi-task learning or semi-supervised learning.
◮ See Chapter 20 of [Goldberg, 2017]
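A minimal sketch of such weight tying (hard parameter sharing), with two hypothetical tasks and made-up sizes; this follows the spirit of, not the exact setup in, [Collobert et al., 2011]:

```python
import torch
import torch.nn as nn

# One shared encoder, two task-specific heads.
embed = nn.Embedding(10000, 100)
encoder = nn.LSTM(100, 64, batch_first=True)   # shared across tasks

pos_head = nn.Linear(64, 17)   # task 1: e.g. POS tagging
ner_head = nn.Linear(64, 9)    # task 2: e.g. named entity recognition

tokens = torch.randint(0, 10000, (2, 12))
states, _ = encoder(embed(tokens))   # one shared representation
pos_logits = pos_head(states)        # each task's loss backpropagates
ner_logits = ner_head(states)        # into the same shared weights
```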

SLIDE 45-47

1. Multi-task learning

◮ Way to inject linguistic information (inductive bias) into models.
◮ Human eye-tracking data helps document classification [Barrett et al., 2018]
◮ Dedicated benchmarks for multi-task learning are appearing:
  ◮ Natural Language Decathlon [McCann et al., 2018]

SLIDE 48-51

2. Adversarial generators

◮ Generative Adversarial Networks (GANs): several neural networks contesting with each other [Goodfellow et al., 2014]
◮ For example, one generates a text and another tries to tell the generated text from natural text (a toy sketch follows below).
◮ Eventually, the first network learns to ‘deceive’ the second one...
◮ ... that is, to generate natural-looking text.

‘GANs were poorly understood and hard to get to work in the beginning and only took off once researchers figured out the right tricks and learned how to make them work.’ (Yann LeCun)
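A toy sketch of the adversarial game on 2-d data rather than text (GANs over text need extra tricks for the discrete outputs); all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

# G maps noise to fake samples; D scores real vs. fake.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(16, 2) + 3.0               # stand-in for natural data

# Discriminator step: label real samples 1, generated samples 0.
fake = G(torch.randn(16, 8)).detach()
d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make D call fresh generations 'real'.
g_loss = bce(D(G(torch.randn(16, 8))), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```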

SLIDE 52

2. Adversarial generators

(Figure from [Goodfellow et al., 2014])

SLIDE 53-57

2. Adversarial generators

◮ Adding ‘adversarial’ examples to the training data makes models more robust (one common recipe is sketched below).
◮ One can find cases where the model is accurate for the wrong reasons.
◮ Shown to be useful for question answering systems [Mudrakarta et al., 2018]

...but ‘How fast are the bricks speaking on either side of the building?’ still produces the same answer!
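One common recipe, sketched here under assumptions of my own, is an FGSM-style perturbation of (embedded) inputs; note this is a generic robustness trick, not the method of [Mudrakarta et al., 2018]:

```python
import torch
import torch.nn as nn

# Perturb the input in the gradient direction and train on the
# perturbed version as well.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 10, requires_grad=True)   # e.g. embedded inputs
y = torch.tensor([0, 1, 0, 1])

loss_fn(model(x), y).backward()              # gradients w.r.t. the input

epsilon = 0.1
x_adv = (x + epsilon * x.grad.sign()).detach()   # worst-case nudge
adv_loss = loss_fn(model(x_adv), y)   # add to the training objective
```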

SLIDE 58-60

3. Transformers

◮ Transformer idea: no separate encoders and decoders, only multi-head self-attention [Vaswani et al., 2017].
◮ A transduction model computing representations of input and output without using sequence-aligned RNNs or convolutions.
◮ Brought major improvements in machine translation.

SLIDE 61-62

3. Transformers

Some code walk-throughs

◮ https://nlp.seas.harvard.edu/2018/04/03/attention.html
◮ https://github.com/tensorflow/tensor2tensor

Recently, transformers were combined with bidirectional pre-trained language models in BERT [Devlin et al., 2018].
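Complementing those walk-throughs, a minimal self-attention sketch using PyTorch’s nn.MultiheadAttention (sizes are illustrative; the full Transformer of [Vaswani et al., 2017] wraps this in positional encodings, residual connections, layer normalization and feed-forward blocks):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=8)

x = torch.randn(10, 2, 64)       # (seq_len, batch, embed_dim)
# Self-attention: queries, keys and values all come from the same sequence.
out, weights = attn(x, x, x)
print(out.shape)                 # torch.Size([10, 2, 64])
print(weights.shape)             # torch.Size([2, 10, 10]), avg over heads
```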

SLIDE 63-65

4. Pre-trained language models

Language models can provide contextualized word embeddings, with different representations in different contexts.

◮ Embeddings from Language MOdels (ELMo) use LSTMs [Peters et al., 2018]
◮ Bidirectional Encoder Representations from Transformers (BERT) use bidirectional transformers [Devlin et al., 2018]

SLIDE 66-67

4. Pre-trained language models

ELMo embeddings seem to improve any NLP task you apply them to:
‘ImageNet for NLP’ (Sebastian Ruder)

SLIDE 68-71

4. Pre-trained language models

Modes of usage

1. ‘as is’: contextualized representations are fed into the overarching architecture like the old-school ‘static’ embeddings;
2. the whole model is fine-tuned on target task data.

(Both modes are sketched below.)

Layers of ELMo reflect language tiers

◮ word embedding layer: morphology;
◮ the first LSTM layer: syntax;
◮ the second LSTM layer: semantics (including word senses).

More info

◮ https://allennlp.org/elmo
◮ https://github.com/allenai/bilm-tf
◮ https://github.com/google-research/bert
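A generic sketch of the two modes; the encoder here is a hypothetical stand-in, where a real contextualizer would come from the ELMo/BERT resources linked above:

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, pretrained_encoder, freeze=True):
        super().__init__()
        self.encoder = pretrained_encoder
        if freeze:                       # mode 1: use 'as is'
            for p in self.encoder.parameters():
                p.requires_grad = False  # encoder weights stay fixed
        # mode 2 (freeze=False): encoder is fine-tuned with the task head
        self.head = nn.Linear(64, 2)

    def forward(self, x):
        return self.head(self.encoder(x))

# Hypothetical stand-in encoder with a matching output size:
clf = Classifier(nn.Sequential(nn.Linear(100, 64), nn.Tanh()), freeze=True)
out = clf(torch.randn(4, 100))
```

Freezing keeps the pre-trained knowledge intact and trains quickly; fine-tuning tends to help when enough target-task data is available.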

SLIDE 72

Deep Learning in NLP: the future is bright!

(From Min-Yen Kan’s keynote speech at COLING-2018)

New and exciting research is coming; stay tuned to arXiv.org and the ACL Anthology!

SLIDE 73

References I

Barrett, M., Bingel, J., Hollenstein, N., Rei, M., and Søgaard, A. (2018). Sequence classification with human attention. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 302–312. Association for Computational Linguistics.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

SLIDE 74

References II

Goldberg, Y. (2017). Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

McCann, B., Keskar, N. S., Xiong, C., and Socher, R. (2018). The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

SLIDE 75

References III

Mudrakarta, P. K., Taly, A., Sundararajan, M., and Dhamdhere, K. (2018). Did the model understand the question? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1896–1906. Association for Computational Linguistics.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

SLIDE 76

References IV

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.