SLIDE 1

Language Modeling for Codeswitching

Hila Gonen, PhD student at Yoav Goldberg's lab, Bar-Ilan University

SLIDE 2

Outline

  • Background
    • Codeswitching
    • Language Modeling and Perplexity
  • New Evaluation Method
    • Definition
    • Creation of data set
  • Incorporation of Monolingual Data
  • Discriminative Training
  • Conclusion

SLIDE 3

Codeswitching

“the alternation of two languages within a single discourse, sentence or constituent”

(Poplack, 1980)

English – Spanish:
"that es su tío that has lived with him like I don't know how like ya several years..."
(that his uncle who has lived with him like, I don't know how, like several years already...)

French – Arabic:
"mais les filles ta3na ysedkou n'import quoi ana hada face book jamais cheftou khlah kalbi"
(Our girls believe anything, I have never seen this Facebook before.)

SLIDE 4

Codeswitching and its challenges

  • Very popular, mainly among bilingual communities
  • Extremely limited data
  • Non-standard platforms (spoken data, social media)
  • An important challenge for automatic speech recognition (ASR) systems

SLIDE 5

ASR with monolingual models

  • Output of IBM models: [example transcriptions shown as an image on the slide]
SLIDE 6

Language Modeling

  • The task of assigning a probability to a given sentence.
  • Useful for machine translation and for automatic speech recognition (ASR): the system produces several candidates and the LM scores them.
  • Given a word sequence, the LM estimates, for each word in the vocabulary, the probability that it follows the sequence.
  • Standard training: lots of unlabeled text is used; the training examples are all sentence prefixes, each paired with the word that follows (see the sketch below).
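
As a concrete illustration (not part of the original slides), here is a minimal sketch of both uses: estimating next-word probabilities and scoring a whole sentence via the chain rule. The toy bigram model and corpus are assumptions for the example only.

```python
# Minimal sketch: a toy bigram LM that estimates next-word probabilities
# and scores a sentence by the chain rule. Corpus and model are toys,
# not the LSTM models used in this talk.
from collections import Counter, defaultdict

corpus = [
    "<s> I love chocolate </s>".split(),
    "<s> I love winter </s>".split(),
    "<s> I love chocolate cheesecakes </s>".split(),
]

bigrams = defaultdict(Counter)
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigrams[prev][word] += 1

def next_word_probs(prev):
    """P(w | prev) for every word observed after `prev` in training."""
    counts = bigrams[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sentence_prob(sent):
    """P(sentence) as the product of P(w_i | w_{i-1}) (chain rule)."""
    p = 1.0
    for prev, word in zip(sent, sent[1:]):
        p *= next_word_probs(prev).get(word, 0.0)
    return p

print(next_word_probs("love"))                             # chocolate 2/3, winter 1/3
print(sentence_prob("<s> I love chocolate </s>".split()))  # 1 * 1 * 2/3 * 1/2 = 1/3
```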

SLIDE 7

Language Modeling

Given the prefix "I love", the model estimates the probability of each word in the vocabulary to follow, e.g.:

  chocolate 0.6, cheesecakes 0.1, winter 0.08, strawberries 0.06, me 0.04, …

SLIDE 8

Automatic Speech Recognition (ASR)

  • Language models are traditionally used in the decoding process
  • The ASR system produces candidates for a given acoustic signal
  • The LM is used to rank the candidates: it needs to differentiate between "good" and "bad" sentences
  • ASR systems are hard to set up and tune, and are not standardized

SLIDE 9

Previous Work – LM for CS

  • Artificial CS data (Vu et al. 2012, Pratapa et al. 2018)
  • Syntactic constraints (Li and Fung 2012, 2014)
  • Factored LMs (Adel et al. 2013, 2014, 2015)
  • Most previous work depends on an ASR system (this conflates LM performance with other aspects of the ASR system, and makes the evaluation procedure hard to replicate and results hard to compare fairly)
  • No previous work compares to another

SLIDE 10

Previous Work – LM for CS


We want to evaluate the language model independently of an ASR system.

SLIDE 11

Perplexity (Standard LM Evaluation)

Given a language model $M$ and a test sequence of words $w_1, \dots, w_N$, the perplexity of $M$ over the sequence is defined as:

$$\mathrm{PP}(w_1, \dots, w_N) = P_M(w_1, \dots, w_N)^{-\frac{1}{N}}$$

where $P_M(w_1, \dots, w_N)$ is the probability the model assigns to the sequence.

The lower the perplexity, the better the LM is.
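
A small sketch of this definition, computed in log space for numerical stability (the per-token probabilities are made up for illustration):

```python
import math

def perplexity(token_probs):
    """PP = P(w_1..w_N)^(-1/N). `token_probs[i]` is the model's
    P(w_i | w_1..w_{i-1}); the product is taken in log space."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

print(perplexity([0.6, 0.5, 0.7, 0.4]))   # confident model: ~1.86
print(perplexity([0.1, 0.2, 0.1, 0.05]))  # unsure model: 10.0
```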

SLIDE 12

Shortcomings of Perplexity

  • Not always well aligned with the quality of a language model (Tran et al. 2018)
  • Better perplexities often do not translate to better word-error-rate (WER) scores (Huang et al. 2018)
  • Does not penalize assigning high probability to highly implausible sentences
  • Strong dependence on the vocabulary (e.g. word-based vs. char-based models)

SLIDE 13

Shortcomings of Perplexity - Example

  • We train a simple model on some data
  • We then measure the effect of the vocabulary:
    • We add words to the vocabulary
    • We train a model in the same manner, on the same data
    • The added words are never trained
  • This alone worsens perplexity by 2.37 points
  • Adding words, with no change in the training procedure, significantly changes perplexity – why? (see the sketch below)
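
A plausible mechanism, sketched with toy numbers (the talk's actual model and vocabulary sizes are not reproduced here): untrained vocabulary entries still receive softmax mass, so the probability of every true word drops and perplexity rises.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Trained logits for a 5-word toy vocabulary; the true next word is index 0.
trained = [3.0, 1.0, 0.5, 0.0, -1.0]
print(f"P(true word), original vocab: {softmax(trained)[0]:.3f}")  # ~0.778

# Add 100 words that are never trained (their logits stay at 0, the init).
# They still get softmax mass, so the true word's probability drops --
# and perplexity rises -- with no change to the training procedure.
extended = trained + [0.0] * 100
print(f"P(true word), extended vocab: {softmax(extended)[0]:.3f}")  # ~0.160
```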

SLIDE 14

Shortcomings of Perplexity - Example


We do not want to evaluate the language model with perplexity

SLIDE 15

New Evaluation Method

We seek a method that meets the following requirements:

  1. Prefers LMs that prioritize correct sentences.
  2. Does not depend on the vocabulary of the LM.
  3. Is independent of an ASR system.

SLIDE 16

New Evaluation Method

  • We suggest a method that simulates the task of an LM in ASR
  • The test data consists of sets of sentences:
    • A single gold sentence in each set
    • ~30 similar-sounding alternatives in each set
  • The LM should identify the gold sentence in each set
  • We use accuracy as our metric (see the sketch below)
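
In code, the evaluation reduces to an argmax per set. The sketch below assumes `score` is any sentence-level score, e.g. the LM's log-probability; the names are illustrative, not the authors' implementation.

```python
def ranking_accuracy(score, test_sets):
    """Each test set pairs one gold sentence with ~30 similar-sounding
    alternatives; the LM is credited when the gold gets the top score."""
    correct = 0
    for gold, alternatives in test_sets:
        best = max([gold] + alternatives, key=score)
        correct += (best == gold)
    return correct / len(test_sets)

# Toy usage with a stand-in scorer (a real LM's log-prob goes here):
toy_sets = [("smelly gato", ["smell y que to", "smelly que to"])]
print(ranking_accuracy(lambda s: -len(s.split()), toy_sets))  # 1.0
```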

SLIDE 17

New Evaluation Method

This method meets all of our requirements:

  1. Prefers LMs that prioritize correct sentences.
  2. Does not depend on the vocabulary of the LM.
  3. Is independent of an ASR system.

SLIDE 18

Codeswitching Corpus

Gold data:

  • Bangor Miami Corpus – transcripts of conversations by Spanish speakers in Florida, all of whom are bilingual in English
  • 45,621 sentences, split into train/dev/test
  • All three types: English, Spanish and CS sentences

Examples:

  • So I asked what was happening
  • Quieres un vaso de agua ? (Do you want a glass of water?)
  • Que by the way se vino ilegal (Who, by the way, came over illegally)

SLIDE 19

Our Created Data

How do we obtain similar-sounding sentences to build the sets? We create them!

For each gold sentence, we create alternative sentences of all types:

  • English sentences
  • Spanish sentences
  • CS sentences

We do this using finite state transducers (FSTs) – to be explained.

SLIDE 20

Examples from the Dataset

SLIDE 21

Dataset Statistics

SLIDE 22

Finite State Transducers (FSTs)

  • Similar to FSAs (finite state automata), but with an additional component of output (transitions have both input and output labels)
  • Capable of transforming one string into another
  • An FST can convert a string $x$ into a string $y$ if there is a path with $x$ as its input labels and $y$ as its output labels
  • Composition – FSTs can be composed
  • Weighted FSTs – transitions can be labelled with weights

SLIDE 23

Finite State Transducers (FSTs)

Formally, an FST is a 6-tuple $(Q, \Sigma, \Gamma, I, F, \delta)$ such that:

  • $Q$ – the set of states (finite)
  • $\Sigma$ – input alphabet (finite)
  • $\Gamma$ – output alphabet (finite)
  • $I$ – initial states (subset of $Q$)
  • $F$ – final states (subset of $Q$)
  • $\delta$ – transition function

SLIDE 24

FSTs – Toy Example

The transducer has one rewriting transition, sad:happy, and identity transitions (x:x) for every other word. It therefore rewrites "sad" as "happy" and copies everything else:

  The girl is sad → The girl is happy
  This is a sad story → This is a happy story
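
The toy transducer can be sketched in a few lines; a real FST also tracks states and epsilon transitions, which this word-level stand-in skips.

```python
# Word-level stand-in for the toy FST: one rewriting transition
# (sad:happy) plus identity transitions (x:x) for all other words.
REWRITES = {"sad": "happy"}

def toy_transduce(sentence):
    return " ".join(REWRITES.get(w, w) for w in sentence.split())

print(toy_transduce("The girl is sad"))      # The girl is happy
print(toy_transduce("This is a sad story"))  # This is a happy story
```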

SLIDE 25

Dataset Creation

We implement the creation of the dataset with Carmel, an FST toolkit, composing three FSTs:

  1. An FST for converting a sentence into a sequence of phonemes
  2. An FST that allows minor changes in the phoneme sequence
  3. An FST for decoding a sequence of phonemes into a sentence (the inverse of 1)

SLIDE 26

1. Sentence to Phonemes

We use pronunciation dictionaries for both languages:

  book__en   →  B UH K
  cat__en    →  K AE T
  libro__sp  →  L IY B R OW
  gato__sp   →  G AA T OW

SLIDE 27

2. Change Phoneme Sequence

We allow minor changes in the phoneme sequence to increase flexibility (the allowed changes are shown as a figure on the slide; an illustrative sketch follows):
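
The exact set of allowed changes appears only in the slide image; as an illustration, assume random substitutions between acoustically confusable phonemes. The SIMILAR table below is a hypothetical stand-in, not the authors' edit set.

```python
import random

# Hypothetical confusion table; the authors' actual edit set is on the slide.
SIMILAR = {"G": ["K"], "K": ["G"], "AA": ["EY"], "OW": ["UW"]}

def perturb(phonemes, p=0.5, seed=0):
    """Randomly swap each phoneme for a similar-sounding one with prob. p."""
    rng = random.Random(seed)
    return [rng.choice(SIMILAR[ph]) if ph in SIMILAR and rng.random() < p
            else ph for ph in phonemes]

print(perturb("S M EH L IY G AA T OW".split()))
```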

SLIDE 28

3. Phonemes to Sentence

We use the same pronunciation dictionaries, now in the decoding direction. A single phoneme sequence can decode into several sentences:

  S M EH L IY K EY T UW → smell y que to
  S M EH L IY K EY T UW → smelly que to

To favor frequent words over infrequent ones, we add unigram probabilities to the edges of the transducer

SLIDE 29

A worked example of the full pipeline:

Gold sentence: smelly gato
Phoneme sequence (sentence-to-phonemes FST, using smelly:EN → S M EH L IY and gato:SP → G AA T OW): S M EH L IY G AA T OW

Changed phoneme sequence (changing-phonemes FST, with G → K, AA → EY, OW → UW): S M EH L IY K EY T UW

Alternative sentences (phonemes-to-sentence FST):
  smell y que to (smell:EN, y:SP, que:SP, to:EN)
  smelly que to
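
A pure-Python sketch of the same pipeline on this example. The real system composes weighted FSTs in Carmel; here each step is a plain function, and the phoneme change is hard-coded rather than produced by the perturbation transducer. The dictionary entries are the toy ones from the slides.

```python
PRON = {  # toy pronunciation dictionary (word -> phoneme string)
    "smelly__en": "S M EH L IY", "smell__en": "S M EH L",
    "y__sp": "IY", "gato__sp": "G AA T OW",
    "que__sp": "K EY", "to__en": "T UW",
}

def to_phonemes(words):
    """Step 1: sentence -> phoneme sequence."""
    return " ".join(PRON[w] for w in words)

def decode(phonemes, prefix=()):
    """Step 3: all segmentations of a phoneme sequence into words."""
    if not phonemes:
        yield prefix
        return
    for word, pron in PRON.items():
        if phonemes == pron or phonemes.startswith(pron + " "):
            yield from decode(phonemes[len(pron):].lstrip(), prefix + (word,))

gold = ["smelly__en", "gato__sp"]
seq = to_phonemes(gold)                          # S M EH L IY G AA T OW
alt_seq = seq.replace("G AA T OW", "K EY T UW")  # step 2, hard-coded here
for alt in decode(alt_seq):
    print(" ".join(alt))  # smelly__en que__sp to__en / smell__en y__sp que__sp to__en
```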

SLIDE 30

Dataset Creation – cont.

Implementation details:

  • We can create monolingual and CS alternatives regardless of the type of the source sentence
  • When creating a code-switched alternative, we only convert a sampled part of the gold sentence
  • For CS alternatives, we use heuristics to encourage sentences to include both languages and to differ from each other (e.g. preferring more words from the less dominant language)
  • We randomly choose 250 sets in which the gold sentence is code-switched and 750 in which it is monolingual

SLIDE 31

So far…

A new evaluation method that enables comparison of a wide range of models:

  • Directly penalizes preferring "bad" sentences
  • Does not depend on the vocabulary
  • Independent of an ASR system

The dataset is created using FSTs, making the approach applicable to any language or language pair.

SLIDE 32

Baseline LM

Standard architecture:

  • A 2-layer LSTM followed by a softmax layer
  • Auto-batching
  • SGD optimization
  • Learning rate decreased according to dev performance
  • Gradient clipping, weight decay
  • LSTM dropout – the same dropout mask at each time step, including the recurrent layers (Gal and Ghahramani, 2016)

Parameters matter a lot (less so when only changing the training data). A sketch follows.
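
A sketch of such a baseline in PyTorch. Hyperparameters are placeholders, and nn.LSTM's built-in dropout is the standard variant, not the Gal and Ghahramani dropout used in the talk.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """2-layer LSTM followed by a softmax layer over the vocabulary."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                            dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(tokens))  # (batch, seq_len, hidden)
        return self.out(hidden)                    # next-word logits

vocab_size = 10_000
model = LSTMLanguageModel(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, weight_decay=1e-6)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # dev-driven LR decay
loss_fn = nn.CrossEntropyLoss()

# One toy training step: predict every next token from its prefix.
batch = torch.randint(0, vocab_size, (8, 20))
logits = model(batch[:, :-1])
loss = loss_fn(logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)  # gradient clipping
optimizer.step()
# After each epoch: scheduler.step(dev_loss) to decay the learning rate.
```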

SLIDE 33

Baseline LM

Data: Codeswitching corpus only (Bangor Miami Corpus)

[Baseline accuracy shown as a table on the slide.]

How can we improve over the baseline?

SLIDE 34

Monolingual Data

  • Data for code-switching is relatively scarce
  • Monolingual data is easy to obtain
  • We use the OpenSubtitles2018 corpus of subtitles of movies and TV series (Tiedemann, 2009)


How do we efficiently incorporate monolingual data when training a CS LM?

SLIDE 35

Take 1

We train a language model on both the monolingual and the CS data. The CS data is used at the end of each epoch – ALL:CS-Last.

SLIDE 36

Take 1

We train a language model on both the monolingual and the CS data. The CS data is used at the end of each epoch – ALL:CS-Last.

[Results comparing this model to the baseline, shown as a table on the slide.]

SLIDE 37

Take 2

We train a language model on both the monolingual and the CS data. All sentences are shuffled together – ALL:Shuffled. (A sketch of both mixing strategies follows.)
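
The two mixing strategies amount to a different per-epoch ordering of the same sentences; a minimal sketch (function and variable names are ours, not the authors'):

```python
import random

def epoch_order(monolingual, cs, strategy, seed=0):
    """ALL:CS-Last -- monolingual data (shuffled) first, CS data at the end.
    ALL:Shuffled -- all sentences shuffled together."""
    rng = random.Random(seed)
    if strategy == "cs_last":
        data = list(monolingual)
        rng.shuffle(data)
        return data + list(cs)
    if strategy == "shuffled":
        data = list(monolingual) + list(cs)
        rng.shuffle(data)
        return data
    raise ValueError(strategy)
```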

SLIDE 38

Take 2

We train a language model on both the monolingual and the CS data. All sentences are shuffled together – ALL:Shuffled.

[Results comparing this model to the baseline, shown as a table on the slide.]

SLIDE 39

The better approach – Fine-tuning

  • We pre-train a model with the English and Spanish monolingual sentences
  • This essentially trains two monolingual models, but with full sharing of parameters
  • We then use the small amount of available codeswitched data to further train the model (see the sketch below)
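
As a sketch (epoch counts and helper names are placeholders, not the talk's settings), the recipe is just two training phases over the same shared-parameter model:

```python
def pretrain_then_finetune(model, monolingual, cs_data, train_epoch,
                           pretrain_epochs=20, finetune_epochs=5):
    """Phase 1: pre-train one model (full parameter sharing) on the
    English + Spanish monolingual sentences. Phase 2: continue training
    on the small codeswitched corpus."""
    for _ in range(pretrain_epochs):
        train_epoch(model, monolingual)
    for _ in range(finetune_epochs):
        train_epoch(model, cs_data)
    return model
```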

SLIDE 40

The better approach – Fine-tuning

[Results comparing the fine-tuned model to the baseline, shown as a table on the slide.]

SLIDE 41

Breaking down the Results

Let’s look at the results more carefully: What is our accuracy when the gold sentence is CS and when it is monolingual?

SLIDE 42

Breaking down the Results


Most of the improvement stems from sets with monolingual sentences as gold. This is not surprising! Recall that during the pretraining phase, the model is exposed to monolingual data only.

SLIDE 43

Breaking down the Results


Can we do better than that?

SLIDE 44

Discriminative Training

Yes! We can now help the model with negative examples that we create, of all types:

  • Recall that for our new evaluation, the LM only needs to identify the correct sentence
  • There is no need for the standard probabilistic setting anymore
  • We can score whole sentences, and add negative examples

SLIDE 45

Discriminative Training


  • Assign the gold sentence a higher score than the rest
  • Require the difference between the scores to be at least as large as the WER
  • The farther a sentence is from the gold one, the lower its score should be
  • Formally, let $s_0$ be the gold sentence and $s_1, \dots, s_k$ the other sentences in the set. The new loss (see the sketch below):

$$\mathcal{L} = \sum_{i=1}^{k} \max\Bigl(0,\; \mathrm{WER}(s_0, s_i) - \bigl(\mathrm{score}(s_0) - \mathrm{score}(s_i)\bigr)\Bigr)$$
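
A sketch of this loss (the exact formulation in the paper may differ in details such as scaling; WER is computed at the word level, and the scores here are stand-in tensors rather than real model outputs):

```python
import torch

def wer(gold, hyp):
    """Word error rate: word-level edit distance divided by gold length."""
    g, h = gold.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(g) + 1)]
    for i in range(len(g) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(g) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (g[i - 1] != h[j - 1]))  # substitution
    return d[-1][-1] / len(g)

def discriminative_loss(gold_score, alt_scores, margins):
    """Hinge loss: the gold score must exceed each alternative's score
    by a margin equal to that alternative's WER from the gold sentence."""
    losses = [torch.clamp(m - (gold_score - s), min=0.0)
              for s, m in zip(alt_scores, margins)]
    return torch.stack(losses).sum()

gold = "smelly gato"
alts = ["smell y que to", "smelly que to"]
margins = [wer(gold, a) for a in alts]  # farther sentences need bigger gaps
loss = discriminative_loss(torch.tensor(2.0),
                           [torch.tensor(1.8), torch.tensor(1.9)], margins)
print(loss.item())
```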
SLIDE 46

Creating Data for Discriminative Training

Done using the same technique: composition of FSTs.

  • We use the CS training set as our gold sentences
  • We also create sentences from a random subset of the monolingual data
  • We get a training set that is 10 times bigger than the original one
  • Fine-tuning is done in the same manner

SLIDE 47

Discriminative Training

SLIDE 48

Discriminative Training

Dramatic improvements in cases where the gold sentence is CS:

SLIDE 49

Improvements as a Function of Size of Data

As expected, we see that the less CS data we have, the more important it is to add monolingual data:

SLIDE 50

Some Examples

Examples of sentences that the FINE-TUNED-DISCRIMINATIVE model identifies correctly while the FINE-TUNED-LM model does not:

SLIDE 51

Some Examples

Examples of sentences that the FINE-TUNED-DISCRIMINATIVE model fails to identify

SLIDE 52

Conclusion

  • Perplexity can be replaced with our new ranking-based evaluation method:
    • Well-suited to the ASR motivation
    • Independent of the vocabulary and of ASR systems
  • The evaluation data is created using FSTs with the help of pronunciation dictionaries
  • Fine-tuning improves performance when little high-quality data is available
  • Discriminative training is extremely helpful in our scenario

SLIDE 53

Any Preguntas?

SLIDE 54

Thank you!
