

SLIDE 1

Lesson 10 Deep learning for NLP: Multilingual Word Sequence Modeling

December 15, 2016

EPFL Doctoral Course EE-724, Nikolaos Pappas, Idiap Research Institute, Martigny

Human Language Technology: Application to Information Access

SLIDE 2

Outline of the talk

  • 1. Recap: Word Representation Learning
  • 2. Multilingual Word Representations
  • Alignment models
  • Evaluation tasks
  • 3. Multilingual Word Sequence Modeling
  • Essentials: RNN, LSTM, GRU
  • Machine Translation
  • Document Classification
  • 4. Summary

* Figure from Lebret's thesis, EPFL, 2016

SLIDE 3

Disclaimer

  • Research highlights rather than in-depth analysis
  • By no means exhaustive (progress is too fast!)
  • Tried to keep the most representative ones
  • Focus on feature learning and two major NLP tasks
  • Not enough time to cover other exciting tasks:
  • Question answering
  • Relation classification
  • Paraphrase detection
  • Summarization
SLIDE 4

Recap: Learning word representations from text

  • Why should we care about them?
  • tackles the curse of dimensionality
  • captures semantic and analogy relations of words
  • captures general knowledge in an unsupervised way

king - man + woman ≈ queen
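A quick way to check such analogies in practice is with pre-trained word2vec vectors. A minimal sketch, assuming gensim is installed and the publicly released GoogleNews vectors have been downloaded locally (the file name below is illustrative):

    # Assumes gensim and a local copy of the pre-trained GoogleNews word2vec binary.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # king - man + woman ≈ queen
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))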

SLIDE 5

Recap: Learning word representations from text

  • How can we benefit from them?
  • study linguistic properties of words
  • inject general knowledge on downstream tasks
  • transfer knowledge across languages or modalities
  • compose representations of word sequences

SLIDE 6

Recap: Learning word representations from text

  • Which method to use for learning them?
  • neural versus count-based methods

➡ neural ones implicitly do SVD over a PMI matrix
➡ similar to count-based when using the same tricks

  • neural methods appear to have the edge (word2vec)

➡ efficient and scalable objective + toolkit
➡ intuitive formulation (= predict words in context)

SLIDE 7

Recap: Continuous Bag-of-Words (CBOW)

SLIDE 8

Recap: Continuous Bag-of-Words (CBOW)
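For reference, the CBOW objective the figure depicts can be sketched as follows (standard notation, not taken from the slide): the embeddings of the 2c context words are averaged and used to predict the centre word with a softmax:

    \hat{h}_t = \frac{1}{2c} \sum_{-c \le j \le c,\; j \ne 0} E\, x_{t+j}, \qquad
    p(w_t \mid \text{context}) = \mathrm{softmax}(W \hat{h}_t), \qquad
    \mathcal{L} = -\sum_t \log p(w_t \mid \text{context})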

SLIDE 9

Recap: Learning word representations from text

  • What else can we do with word embeddings?
  • dependency-based embeddings: Levy and Goldberg 2014
  • retrofitted-to-lexicons embeddings: Faruqui et al. 2014
  • sense-aware embeddings: Li and Jurafsky 2015
  • visually-grounded embeddings: Lazaridou et al. 2015
  • multilingual embeddings: Gouws et al. 2015

SLIDE 10

Outline of the talk

  • 1. Recap: Word Representation Learning
  • 2. Multilingual Word Representations
  • Alignment models
  • Evaluation tasks
  • 3. Multilingual Word Sequence Modeling
  • Essentials: RNN, LSTM, GRU
  • Machine Translation
  • Document Classification
  • 4. Summary

* Figure from Gouws et al., 2015.

SLIDE 11

Learning cross-lingual word representations

  • Monolingual embeddings capture semantic, syntactic and analogy relations between words
  • Goal: capture these relationships across two or more languages

* Figure from Gouws et al., 2015.

SLIDE 12

Supervision of cross-lingual alignment methods

  • Parallel sentences for MT: Guo et al., 2015
    Sentence-by-sentence and word alignments
  • Parallel sentences: Gouws et al., 2015
    Sentence-by-sentence alignments
  • Parallel documents: Søgaard et al., 2015
    Documents with topic or label alignments
  • Bilingual dictionary: Ammar et al., 2016
    Word-by-word translations
  • No parallel data: Faruqui and Dyer, 2014
    Really!

Annotation cost: low → high

SLIDE 13

Cross-lingual alignment with no parallel data
SLIDE 14

Cross-lingual alignment with parallel sentences
SLIDE 15

Cross-lingual alignment with parallel sentences

(Gouws et al., 2016)

SLIDE 16

Cross-lingual alignment with parallel sentences for MT

SLIDE 17

Unified framework for analysis of cross-lingual methods

  • Minimize a monolingual objective
  • Constrain/regularize with a bilingual objective (see the sketch below)
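One generic way to write this unified view (my own hedged notation; the exact bilingual term differs per method) is a joint loss with a monolingual objective per language plus a cross-lingual regularizer that pulls aligned words or sentences together in the shared space:

    J(\theta) = \mathcal{L}^{(l_1)}_{mono}(\theta) + \mathcal{L}^{(l_2)}_{mono}(\theta) + \lambda\, \Omega_{bi}(\theta)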
SLIDE 18

Evaluation: Cross-lingual document classification and translation

(Gouws et al., 2015)

SLIDE 19

Bonus: Multilingual visual sentiment concept matching

(Pappas et al., 2016)

concept = adjective-noun-phrase (ANP)

SLIDE 20

Multilingual visual sentiment concept ontology

(Jou et al., 2015)

SLIDE 21

Word embedding model

(Pappas et al., 2016)

SLIDE 22

Multilingual visual sentiment concept retrieval

(Pappas et al., 2016)

SLIDE 23

Multilingual visual sentiment concept clustering

(Pappas et al., 2016)

SLIDE 24

Multilingual visual sentiment concept clustering

(Pappas et al., 2016)

SLIDE 25

Discovering interesting clusters: Multilingual

(Pappas et al., 2016)

SLIDE 26

Discovering interesting clusters: Western vs. Eastern

(Pappas et al., 2016)

SLIDE 27

Discovering interesting clusters: Monolingual

(Pappas et al., 2016)

SLIDE 28

Evaluation: Multilingual visual sentiment concept analysis

  • Aligned embeddings are better than translation in concept retrieval, clustering and sentiment prediction

SLIDE 29

Conclusion

  • Aligned embeddings are cheaper than translation and usually work better than translation in several multilingual or cross-lingual NLP tasks without parallel data
  • document classification: Gouws et al., 2015
  • named entity recognition: Al-Rfou et al., 2014
  • dependency parsing: Guo et al., 2015
  • concept retrieval and clustering: Pappas et al., 2016
SLIDE 30

Outline of the talk

  • 1. Recap: Word Representation Learning
  • 2. Multilingual Word Representations
  • Alignment models
  • Evaluation tasks
  • 3. Multilingual Word Sequence Modeling
  • Essentials: RNN, LSTM, GRU
  • Machine Translation
  • Document Classification
  • 4. Summary

* Figure from Colah’s blog, 2015.

SLIDE 31

Language Modeling

  • Computes the probability of a sequence of words, or simply the "likelihood of a text": P(w1, w2, …, wt)
  • N-gram models with Markov assumption (see the factorization below)
  • Where is it useful?
  • speech recognition
  • machine translation
  • POS tagging and parsing
  • What are its limitations?
  • unrealistic assumption
  • huge memory needs
  • back-off models
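The Markov-assumption factorization referred to above is typically written as, for an n-gram model:

    P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})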
SLIDE 32

Recurrent Neural Network (RNN)

  • Neural language model (see the update equations below)
  • What are its main limitations?
  • vanishing gradient problem (error doesn't propagate far)
  • fails to capture long-term dependencies
  • tricks: gradient clipping, identity initialization + ReLUs
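For reference, the simple recurrent language model behind this slide is usually written as (standard notation, not copied from the slide):

    h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b), \qquad
    P(w_{t+1} \mid w_1, \dots, w_t) = \mathrm{softmax}(W_{hy} h_t)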
SLIDE 33

Long Short Term Memory (LSTM)

  • Long short-term memory nets are able to learn long-term dependencies: Hochreiter and Schmidhuber 1997

Simple RNN:

* Figure from Colah's blog, 2015.

SLIDE 34

Long Short Term Memory (LSTM)

  • Long short-term memory nets are able to learn long-term dependencies: Hochreiter and Schmidhuber 1997
  • Ability to remove or add information to the cell state, regulated by "gates" (equations sketched below)

* Figure from Colah’s blog, 2015.
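The gating the bullets refer to is usually written as follows (standard LSTM equations, in one common convention):

    f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)
    \tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t)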

SLIDE 35

Gated Recurrent Unit (GRU)

  • Gated RNN by Chung et al., 2014 combines the forget and input gates into a single "update gate"
  • keeps memories to capture long-term dependencies
  • allows error messages to flow at different strengths

z_t: update gate, r_t: reset gate, h_t: regular RNN update

* Figure from Colah's blog, 2015.

* Figure from Colah’s blog, 2015.
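In one common convention (matching the gate names above, with \tilde{h}_t the regular RNN-style candidate update):

    z_t = \sigma(W_z x_t + U_z h_{t-1}), \quad r_t = \sigma(W_r x_t + U_r h_{t-1})
    \tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1})), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t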

SLIDE 36

Deep Bidirectional Models

  • Shown here for an RNN, but it applies to LSTMs and GRUs too

(Irsoy and Cardie, 2014)

SLIDE 37

Convolutional Neural Network (CNN)

(Collobert et al., 2011) (Kim, 2014)

  • Typically good for images
  • Convolutional filter(s) is (are) applied every k words (see the sketch below)
  • Similar to Recursive NNs, but without constraining to grammatical phrases only, as in Socher et al., 2011
  • no need for a parser (!)
  • less linguistically motivated?
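Concretely, with filter width k over the concatenated word embeddings x_{i:i+k-1}, each filter produces one feature per position and is then max-pooled over time (Kim-2014-style formulation, notation mine):

    c_i = f(w \cdot x_{i:i+k-1} + b), \qquad \hat{c} = \max_i c_i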
SLIDE 38

Hierarchical Models

(Tang et al., 2015)

  • Word-level and sentence-level modeling with any type of NN layers

SLIDE 39

Attention Mechanism for Machine Translation

  • Chooses "where to look", i.e. learns to assign a relevance to each input position given the encoder hidden state for that position and the previous decoder state
  • learns a soft bilingual alignment model

(Bahdanau et al., 2015)
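Sketched in the notation of Bahdanau et al., 2015: source position j is scored against the previous decoder state s_{i-1}, the scores are normalized into weights, and the context vector is their weighted sum:

    e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad c_i = \sum_j \alpha_{ij} h_j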

SLIDE 40

Attention Mechanism for Document Classification

  • Operates on the input word sequence (or on intermediate hidden states: Pappas and Popescu-Belis 2016)
  • Learns to focus on relevant parts of the input with respect to the target labels (see the pooling sketch below)
  • learns a soft extractive summarization model

(Pappas and Popescu-Belis, 2014)
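A minimal numpy sketch of this kind of relevance-weighted pooling over word or sentence vectors; this is a generic formulation for illustration, not the exact model of the cited papers:

    import numpy as np

    def attention_pool(H, W, b, v):
        """Relevance-weighted average of the rows of H (n x d).

        W, b, v are learned parameters of a small scoring network: each row
        of H gets a scalar score, the scores are softmax-normalized, and the
        pooled vector is the weighted sum of the rows.
        """
        scores = np.tanh(H @ W + b) @ v          # (n,)
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()              # attention weights, sum to 1
        return alpha @ H, alpha                  # pooled vector (d,), weights (n,)

    # Toy usage: 5 "sentence vectors" of dimension 4 with random parameters.
    rng = np.random.default_rng(0)
    H = rng.normal(size=(5, 4))
    W, b, v = rng.normal(size=(4, 4)), np.zeros(4), rng.normal(size=4)
    doc_vector, weights = attention_pool(H, W, b, v)
    print(doc_vector.shape, weights.round(2))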

SLIDE 41

Outline of the talk

  • 1. Recap: Word Representation Learning
  • 2. Multilingual Word Representations
  • Alignment models
  • Evaluation tasks
  • 3. Multilingual Word Sequence Modeling
  • Essentials: RNN, LSTM, GRU
  • Machine Translation
  • Document Classification
  • 4. Summary

* Figure from Colah’s blog, 2015.

SLIDE 42

RNN encoder-decoder for Machine Translation

  • GRU as hidden layer
  • Maximize the log likelihood of the target sequence given the source sequence (see the objective below)
  • WMT 2014 (EN→FR)

(Cho et al., 2014)
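The training objective mentioned above, written out for N source/target sentence pairs (x_n, y_n) and model parameters θ:

    \max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)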

SLIDE 43

Sequence to sequence learning for Machine Translation

  • LSTM hidden layers instead of GRU
  • 4 layers deep instead of a shallow encoder-decoder

(Sutskever et al., 2014)

SLIDE 44

Sequence to sequence learning for Machine Translation

(Sutskever et al., 2014)

  • WMT 2014 (EN→FR)
  • PCA projection of the hidden state of the last encoder layer
SLIDE 45

Jointly learning to align and translate for Machine Translation

(Bahdanau et al., 2015)

  • Limitation: can we compress all the needed information in the last encoder state?
  • Idea: use all the hidden states of the encoder
  • length proportional to that of the sentence!
  • compute a weighted average of all the hidden states
SLIDE 46

Jointly learning to align and translate for Machine Translation

(Bahdanau et al., 2015)

  • WMT 2014 (EN→FR)
SLIDE 47

Effective approaches to attention-based NMT

(Luong et al., 2015)

  • Global and local attention
  • Input-feeding approach
  • Stacked LSTM instead of single-layer
SLIDE 48

Multi-source NMT

(Zoph and Knight, 2016)

  • Train p(e|f, g) model directly on trilingual data
  • Use it to decode e given any (f, g) pair
  • Take local-attention NMT model and concatenate context from multiple sources

SLIDE 49

Multi-source NMT

(Zoph and Knight, 2016)

  • Multi-source training improves over individual French-English and German-English pairs
  • Best: basic concatenation with attention
SLIDE 50

Multi-source NMT

(Zoph and Knight, 2016)

  • Multi-source training improves over individual French-English and German-English pairs
  • Best: basic concatenation with attention
SLIDE 51

Multi-target NMT

(Dong et al., 2015)

  • Multi-task learning framework for multiple target language translation
  • Optimization for a one-to-many model
SLIDE 52

Multi-target NMT

(Dong et al., 2015)

  • Improves over NMT and Moses baselines
  • over the WMT 2013 test set
  • but also on larger datasets
  • Faster and better convergence in multiple language translation

SLIDE 53

Multi-way, Multilingual NMT

(Firat et al., 2016)

  • Encoder-decoder model with multiple encoders and decoders shared across pairs
  • share knowledge across langs
  • universal space for all langs
  • good for low-resource langs
  • Attention is pair-specific, hence expensive: O(L^2)
  • instead, share attention across all pairs!

Figure: n-th encoder and m-th decoder at timestep t / φ makes encoder and decoder states compatible with the attention mechanism / f_adp makes the context vector compatible with the decoder → all these transformations support different types of encoders/decoders for different languages!

SLIDE 54

Multi-way, Multilingual NMT

(Firat et al., 2016)

  • Consistent improvements for low-resource languages
  • the lower the training data, the bigger the improvement
  • In large-scale translation, improves only translation to English
  • hypothesis: EN always appears as source or target language for all pairs → better decoder?

SLIDE 55

Multi-way, Multilingual NMT

(Firat et al., 2016)

  • Consistent improvements for low-resource languages
  • the lower the training data, the bigger the improvement
  • In large-scale translation, improves only translation to English
  • hypothesis: EN always appears as source or target language for all pairs → better decoder?

SLIDE 56

Google's Neural Machine Translation System "Monster"

(Wu et al., 2016)

  • An encoder, a decoder and an attention network
  • Plus 8 layers deep with residual connections
  • Plus refinement with Reinforcement Learning
  • Plus sub-word units… Plus…
SLIDE 57

Google's Neural Machine Translation System "Monster"

(Wu et al., 2016)

  • EN→FR training takes 6 days on 96 GPUs (!!!) and 3 more days for refinement…
SLIDE 58

Future of NMT and other possibilities

  • Multi-task learning: Training multiple pairs of languages jointly and with other tasks → image captioning, speech recognition!

(Luong, Cho, Manning tutorial, 2016)

  • Larger context: Modeling sequences larger than sentences, as in document classification, will be key
  • understanding long-term dependencies
  • leveraging structural information of the input
  • being able to reason over it to solve any task

→ Effective Attention / Memory?

SLIDE 59

Outline of the talk

  • 1. Recap: Word Representation Learning
  • 2. Multilingual Word Representations
  • Alignment models
  • Evaluation tasks
  • 3. Multilingual Word Sequence Modeling
  • Essentials: RNN, LSTM, GRU
  • Machine Translation
  • Document Classification
  • 4. Summary

* Figure from Colah’s blog, 2015.

SLIDE 60

Paragraph vectors for Document Classification

  • Learning vectors of paragraphs, inspired by word2vec
  • trained without supervision on a large corpus
  • preferably from a domain similar to the target
  • Two methods: with or without word ordering

(Le et al., 2014)
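A minimal sketch of training paragraph vectors with gensim's Doc2Vec implementation (gensim 4.x API assumed; the toy corpus and hyperparameters are illustrative, not those of Le and Mikolov):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = ["the movie was great", "the plot made no sense", "a wonderful film"]
    tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

    # dm=1 is the distributed-memory variant (uses word ordering),
    # dm=0 would be the distributed bag-of-words variant (ignores it).
    model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

    print(model.dv[0][:5])                                # paragraph vector of doc 0
    print(model.infer_vector("great film".split())[:5])   # vector for an unseen text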

SLIDE 61

Paragraph vectors for Document Classification

  • Learned paragraph vectors + logistic regression
  • Outperformed previous method on sentence-level and document-level sentiment classification

(Le et al., 2014)

SLIDE 62

Convolutional neural network for Document Classification

(Kim, 2014)

  • Used multiple filter widths
  • Dropout regularization (randomly dropping a portion of hidden units during back-propagation)

SLIDE 63

Convolutional neural network for Document Classification

(Kim, 2014)

  • Not all baseline methods used dropout though
SLIDE 64

Modeling and Summarizing Documents with a Convolutional Network

(Denil et al., 2014)

  • Similar to Kim, 2014, however different:
  • k-max pooling instead of max pooling
  • Two layers of convolutions

SLIDE 65

Modeling and Summarizing Documents with a Convolutional Network

(Denil et al., 2014)

SLIDE 66

Modeling and Summarizing Documents with a Convolutional Network

(Denil et al., 2014)

SLIDE 67

Modeling and Summarizing Documents with a Convolutional Network

(Denil et al., 2014)

SLIDE 68

Gated recurrent neural network for Document Classification

(Tang et al., 2015)

SLIDE 69

Gated recurrent neural network for Document Classification

(Tang et al., 2015)

SLIDE 70

Standard Pipeline for Document Classification

  • Feature engineering: BOW, n-grams, topic models, etc.
  • Feature learning: auto-encoders, convolutional, recurrent, recursive NNs

(Pappas and Popescu-Belis, 2014)

SLIDE 71

Multiple-instance Learning for Document Classification

(Pappas and Popescu-Belis, 2014)

SLIDE 72

How to combine vectors? Structural assumptions

(Pappas and Popescu-Belis, 2014)

SLIDE 73

Joint learning of an instance relevance mechanism and a classifier

(Pappas and Popescu-Belis, 2014)

SLIDE 74

Joint differentiable objective for solving with SGD

(Pappas and Popescu-Belis, 2014)

SLIDE 75

Observations on aspect rating prediction

(Pappas and Popescu-Belis, 2014)

  • The proposed mechanism is superior to the alternatives
  • all text regions are useful, but to a different extent
  • Benefit regardless of the input features used
  • Reaches state-of-the-art without using:
  • structured output learning
  • segmented text
SLIDE 76

Comparison with neural network models

(Pappas and Popescu-Belis, 2016)

  • This mechanism can be used as a parametric pooling function of NNs
  • operating on intermediate hidden states
  • Works better than Dense and GRU neural methods + average pooling
  • Outperforms RCNN and uses far fewer parameters
SLIDE 77

Hierarchical attention networks for Document Classification

(Yang et al., 2016)

  • Very similar hierarchical structure to Tang et al., 2015, except for average pooling
  • attention mechanism at the word and document levels

SLIDE 78

Hierarchical attention networks for Document Classification

(Yang et al., 2016)

SLIDE 79

Reflections on Multilingual Document Classification

  • What are the present limitations?
  • Current evaluation datasets contain a small number of target classes and examples
  • RCV1/RCV2 → 6,000 documents, 2 langs, 4 labels
  • TED corpus → 12,078 documents, 12 langs, 15 labels
  • Requires the labels to be common across languages
  • The data are not enough to train SOA neural architectures
  • Observation: currently there are several domains which support multiple languages, but only monolingual classification is possible

SLIDE 80

New dataset: Deutsche Welle corpus (600k docs, 8 langs)

SLIDE 81

Conclusion

  • Multilingual word embeddings are useful for tasks where there is a lack of parallel data
  • Word sequence modeling is advancing quickly with the establishment of neural methods
  • Machine Translation
  • Document Classification
  • Multilingual Neural Machine Translation
  • is useful for low-resourced languages
  • transfers knowledge in large-scale settings
  • Multilingual Document Classification
  • several large resources available, but with disjoint labels
  • could possibly benefit from NMT lessons
SLIDE 82

References (1/3)

  • Le, Quoc V., and Tomas Mikolov. "Distributed Representations of Sentences and Documents." In ICML, vol. 14, pp. 1188-1196. 2014.
  • Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882, 2014.
  • Denil, Misha, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Freitas. "Modelling, visualising and summarising documents with a single convolutional neural network." arXiv preprint arXiv:1406.3830, 2014.
  • Yang, Zichao, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. "Hierarchical attention networks for document classification." In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.
  • Tang, Duyu, Bing Qin, and Ting Liu. "Document modeling with gated recurrent neural network for sentiment classification." In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1422-1432, 2015.
  • Firat, Orhan, Kyunghyun Cho, Baskaran Sankaran, Fatos T. Yarman Vural, and Yoshua Bengio. "Multi-way, multilingual neural machine translation." Computer Speech & Language, 2016.
  • Pappas, Nikolaos, Miriam Redi, Mercan Topkara, Brendan Jou, Hongyi Liu, Tao Chen, and Shih-Fu Chang. "Multilingual visual sentiment concept matching." In International Conference on Multimedia Retrieval, 2016.
  • Pappas, Nikolaos, and Andrei Popescu-Belis. "Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis." In Conference on Empirical Methods in Natural Language Processing, 2014.
  • Pappas, Nikolaos, and Andrei Popescu-Belis. "Explicit Document Modeling through Weighted Multiple-Instance Learning", under review.
  • Yoav Goldberg. "A primer on neural network models for natural language processing." arXiv preprint arXiv:1510.00726, 2015.
  • Ian Goodfellow, Aaron Courville, and Yoshua Bengio. "Deep learning". Book in preparation for MIT Press, 2015.
SLIDE 83

References (2/3)

  • Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv preprint arXiv:1609.08144, 2016.
  • Zoph, Barret, and Kevin Knight. "Multi-Source Neural Translation." arXiv preprint arXiv:1601.00710, 2016.
  • Dong, Daxiang, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. "Multi-task learning for multiple language translation." In Proceedings of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing, pp. 1723-1732. 2015.
  • Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025, 2015.
  • Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473, 2014.
  • Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." In Advances in Neural Information Processing Systems, pp. 3104-3112. 2014.
  • Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078, 2014.
  • Irsoy, Ozan, and Claire Cardie. "Deep recursive neural networks for compositionality in language." In Advances in Neural Information Processing Systems, pp. 2096-2104. 2014.
  • Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555, 2014.
  • Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
  • Levy, Omer, and Yoav Goldberg. "Dependency-Based Word Embeddings." In ACL (2), pp. 302-308. 2014.
SLIDE 84

References (3/3)

  • Pappas, Nikolaos, Miriam Redi, Mercan Topkara, Brendan Jou, Hongyi Liu, Tao Chen, and Shih-Fu Chang. "Multicultural Visual Concept Retrieval and Clustering.", under review, 2016.
  • Klementiev, Alexandre, Ivan Titov, and Binod Bhattarai. "Inducing crosslingual distributed representations of words." 2012.
  • Gouws, Stephan, Yoshua Bengio, and Greg Corrado. "BilBOWA: Fast bilingual distributed representations without word alignments." 2014.
  • Hermann, Karl Moritz, and Phil Blunsom. "Multilingual distributed representations without word alignment." arXiv preprint arXiv:1312.6173, 2013.
  • Faruqui, Manaal, and Chris Dyer. "Improving vector space word representations using multilingual correlation." Association for Computational Linguistics, 2014.
  • Søgaard, Anders, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. "Inverted indexing for cross-lingual nlp." In The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), 2015.
  • Ammar, Waleed, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. "Massively multilingual word embeddings." arXiv preprint arXiv:1602.01925, 2016.
  • Guo, Jiang, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. "Cross-lingual dependency parsing based on distributed representations." In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol. 1, pp. 1234-1244, 2015.
  • Lazaridou, Angeliki, Nghia The Pham, and Marco Baroni. "Combining language and vision with a multimodal skip-gram model." arXiv preprint arXiv:1501.02598, 2015.
  • Li, Jiwei, and Dan Jurafsky. "Do multi-sense embeddings improve natural language understanding?" arXiv preprint arXiv:1506.01070, 2015.
  • Faruqui, Manaal, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. "Retrofitting word vectors to semantic lexicons." 2014.
SLIDE 85

Resources (1/2)

➡ Online courses

  • Coursera course on "Neural networks for machine learning" by Geoffrey Hinton
    https://www.coursera.org/learn/neural-networks
  • Coursera course on "Machine learning" by Andrew Ng
    https://www.coursera.org/learn/machine-learning
  • Stanford CS224d "Deep learning for NLP" by Richard Socher
    http://cs224d.stanford.edu/

➡ Conference tutorials

  • Richard Socher and Christopher Manning, "Deep learning for NLP", EMNLP 2013 tutorial.
    http://nlp.stanford.edu/courses/NAACL2013/
  • David Jurgens and Mohammad Taher Pilehvar, "Semantic Similarity Frontiers: From Concepts to Documents", EMNLP 2015 tutorial.
    http://www.emnlp2015.org/tutorials.html#t1
  • Mitesh M. Khapra, Sarath Chandar, "Multilingual and Multimodal Language Processing", NAACL 2016 tutorial.
    http://naacl.org/naacl-hlt-2016/t2.html

SLIDE 86

Resources (2/2)

➡ Deep learning toolkits

  • Theano http://deeplearning.net/software/theano
  • Torch http://www.torch.ch/
  • Tensorflow http://www.tensorflow.org/
  • Keras http://keras.io/

➡ Pre-trained word vectors and codes

  • Word2vec toolkit and vectors
    https://code.google.com/p/word2vec/
  • GloVe code and vectors
    http://nlp.stanford.edu/projects/glove/
  • Hellinger PCA
    https://github.com/rlebret/hpca
  • Online word vector evaluation
    http://wordvectors.org/