Incorporating Dialectal Variability for Socially Equitable Language - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Incorporating Dialectal Variability for Socially Equitable Language Identification

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

slide-2
SLIDE 2

McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction,” Journal of Computing Sciences in Colleges 20(3), 2005.
slide-3
SLIDE 3

“This paper describes […] how even the most simple of these methods using data obtained from the World Wide Web achieve accuracy approaching 100% on a test suite comprised of ten European languages”

McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction,” Journal of Computing Sciences in Colleges 20(3), 2005.
slide-4
SLIDE 4

Whose language are we identifying?

slide-8
SLIDE 8

Global platforms attract global diversity in a language

English

slide-9
SLIDE 9

Global platforms attract global diversity in a language

English

125M speakers, 90M speakers, 79M speakers, 60M speakers, 251M speakers

slide-10
SLIDE 10

Global platforms attract global diversity in a language

English, French, Spanish, Arabic

125M speakers, 90M speakers, 79M speakers, 60M speakers, 251M speakers

slide-15
SLIDE 15

Current language detection methods perform significantly worse in less-developed countries

Figure: estimated LID accuracy for English tweets, plotted against the Human Development Index (education, life expectancy, income) of the text’s origin country. A bracket on the plot is labeled 23%. Annotations: “More Dialect” and “Less Dialect” (Labov, 1964; Ash, 2002).

slide-20
SLIDE 20

Practical Motivation: Epidemic Detection

Diagram labels: Keyword Filter (“flu”, “sick”); Language Detection; non-English; NLP; Which symptoms?

slide-21
SLIDE 21

Practical Motivation: Epidemic Detection

Diagram labels: Keyword Filter (“flu”, “sick”); NLP; Which symptoms?; Language Detection; non-English?
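
A minimal sketch of why the language detection step matters in a pipeline like this one, assuming a keyword filter followed by a language check before any further NLP. The detector `toy_detect_language`, its tiny word list, and the example tweets are all hypothetical stand-ins, not any real LID system or data.

```python
# Toy illustration only: the detector, word list, and tweets are hypothetical.
STANDARD_ENGLISH_WORDS = {"i", "have", "the", "flu", "and", "feel", "sick", "am", "so", "today"}
KEYWORDS = {"flu", "sick"}


def keyword_filter(tweets):
    """Keep tweets that mention at least one health keyword."""
    return [t for t in tweets if KEYWORDS & set(t.lower().split())]


def toy_detect_language(tweet, threshold=0.5):
    """Label a tweet 'en' if enough of its words appear in a small standard-English list."""
    words = tweet.lower().split()
    if not words:
        return "non-en"
    known = sum(w in STANDARD_ENGLISH_WORDS for w in words)
    return "en" if known / len(words) >= threshold else "non-en"


def pipeline(tweets):
    """Keyword filter, then language detection; only 'en' tweets reach the NLP step."""
    candidates = keyword_filter(tweets)
    return [t for t in candidates if toy_detect_language(t) == "en"]


tweets = [
    "I have the flu and feel sick",       # standard spelling
    "da flu got me feelin sick af yall",  # nonstandard spelling, same topic
    "nice weather today",                 # no keyword, filtered out early
]
kept = pipeline(tweets)
print(kept)
```

In this toy run the second tweet passes the keyword filter but is labeled non-English, so it never reaches the symptom analysis even though it is on topic.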

slide-22
SLIDE 22

Failing to recognize a language silences its speakers’ voices

slide-25
SLIDE 25

Current language detection methods perform significantly worse in less-developed countries

Figure: estimated accuracy for English tweets, plotted against the Human Development Index of the text’s origin country. Annotations: “More Dialect” and “Less Dialect” (Labov, 1964; Ash, 2002).

Our goal is to make language ID performance equal for all languages across all dialects

This is a universal NLP issue!

slide-28
SLIDE 28

Key Problems: Current methods struggle in the global setting because

Data: no corpora capture global variation in lexicon and dialect

Model: current models make simplistic assumptions about how multilinguals communicate

slide-29
SLIDE 29

Our approach

NLP methodologies capable of handling linguistic variation

Better social representation through network-based sampling

slide-37
SLIDE 37

Our Data Solution: Improve linguistic representation through network-based sampling

Bootstrap dialectal corpora using existing classifiers to find monolingual individuals

Sample from the geolocated Twitter social network to include text from people at all locations

Figure: network nodes labeled by language (eng, eng, eng, eng, eng, eng, fra)
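
The bootstrapping idea can be sketched as follows, under stated assumptions: run existing classifiers over each user's tweets, and if nearly all votes name one language, treat the user as monolingual and label all of that user's tweets with it. The function names, the `min_agreement` cutoff, and the two toy classifiers are hypothetical; the slides do not give the actual criteria.

```python
from collections import Counter


def bootstrap_monolingual_users(user_tweets, classifiers, min_agreement=0.95):
    """Return {user: language} for users whose tweets are consistently one language.

    user_tweets: dict mapping user id -> list of tweet strings
    classifiers: list of callables, each mapping a tweet string to a language code
    min_agreement: assumed cutoff on the share of (tweet, classifier) votes
                   that must name the same language
    """
    monolingual = {}
    for user, tweets in user_tweets.items():
        votes = Counter(clf(t) for t in tweets for clf in classifiers)
        if not votes:
            continue
        language, count = votes.most_common(1)[0]
        if count / sum(votes.values()) >= min_agreement:
            monolingual[user] = language
    return monolingual


def label_all_tweets(user_tweets, monolingual):
    """Label every tweet of a monolingual user with that user's language."""
    return [(t, lang) for user, lang in monolingual.items() for t in user_tweets[user]]


# Hypothetical stand-ins for existing classifiers.
def clf_a(tweet):
    return "fra" if {"je", "le", "la"} & set(tweet.lower().split()) else "eng"


def clf_b(tweet):
    return "fra" if {"je", "vais", "est"} & set(tweet.lower().split()) else "eng"


user_tweets = {
    "u1": ["hello there", "wagwan fam", "good morning"],
    "u2": ["je vais bien", "see you soon"],
}
mono = bootstrap_monolingual_users(user_tweets, [clf_a, clf_b])
labeled = label_all_tweets(user_tweets, mono)
```

Here `u1` is kept as a monolingual English user, so every tweet from `u1` enters the corpus labeled `eng`, while `u2` is excluded because the votes are split between two languages.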

slide-42
SLIDE 42

Build strategically diverse corpora and synthesize code-switched examples

Dimensions of diversity: Topical, Social, Geographic, Multilingual
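
One way to synthesize code-switched training examples, sketched here as an assumption since the slides do not describe the procedure: join one monolingual sentence from each of two languages and carry a language label for every token. The function name and the token-level labeling by whitespace are illustrative choices.

```python
import random


def synthesize_code_switched(sentences_a, lang_a, sentences_b, lang_b, rng):
    """Join one sentence from each language and label every token by its source."""
    pair = [(rng.choice(sentences_a), lang_a), (rng.choice(sentences_b), lang_b)]
    rng.shuffle(pair)  # either language may come first
    tokens, labels = [], []
    for sentence, lang in pair:
        words = sentence.split()
        tokens.extend(words)
        labels.extend([lang] * len(words))
    return tokens, labels


rng = random.Random(0)  # seeded so the sketch is repeatable
tokens, labels = synthesize_code_switched(
    ["Je vais commander à emporter."], "fra",
    ["I'm too lazy to cook."], "eng",
    rng,
)
for token, label in zip(tokens, labels):
    print(token, label)
```

Each synthetic example keeps the within-sentence word order intact, so the only switch point is the sentence boundary.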

slide-48
SLIDE 48

Our model solution: treat language identification as a character-based sequence-to-sequence task.

Encoder: encodes the whole sentence using its characters

Decoder: decodes each word’s language from the sentence encoding

Legend: represents a multi-layer recurrent neural network

Input: Je vais commander à emporter. I’m too lazy to cook.

Input characters: J e _ ... o k .

Output: Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng .

Jaech et al. 2016; Samih et al. 2016
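
The input and output framing of this task can be sketched without any neural network: the source side is the sentence as a character sequence (spaces shown as `_`), and the target side is one language label per token. The helper names below are hypothetical, and unlike the slide's output this sketch leaves punctuation attached to the preceding token rather than labeling it separately.

```python
def to_char_sequence(text, space_symbol="_"):
    """Turn a sentence into the character sequence an encoder would read."""
    return [space_symbol if ch == " " else ch for ch in text]


def to_target_sequence(segments):
    """Build one language label per whitespace-separated token.

    segments: list of (text, language) pairs in reading order.
    """
    tokens, labels = [], []
    for text, lang in segments:
        for token in text.split():
            tokens.append(token)
            labels.append(lang)
    return tokens, labels


segments = [("Je vais commander à emporter.", "Fra"), ("I'm too lazy to cook.", "Eng")]
source = to_char_sequence(" ".join(text for text, _ in segments))
tokens, labels = to_target_sequence(segments)

print(source[:3], source[-3:])
print(labels)
```

A model trained on pairs like `(source, labels)` reads characters and emits a label per word, which is what lets it mark a language switch inside a single message.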

slide-52
SLIDE 52

Equilid vs off-the-shelf

Figure: three bar charts of Macro F1 (axis from 25 to 100).
70 Languages on Twitter: langid.py, CLD2, Our Method
Geo-diverse Tweets: langid.py, CLD2, Our Method
Multilingual Tweets: Polyglot, CLD2
Legend: Our Method

Lui et al. 2013, 2014

slide-53
SLIDE 53

Equilid even outperforms systems specifically tuned for each dataset

Figure: two bar charts of Macro F1 (axis from 50 to 100) comparing Our Method with Jaech et al. (2016).
70 Languages on Twitter: 92 and 91.2
TweetLID: 79.6 and 78.7
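
For readers unfamiliar with the metric, Macro F1 is the unweighted mean of per-language F1 scores, so a rare language counts as much as a common one. The sketch below is a plain-Python illustration with invented labels; how the scores on the slides were actually computed is not stated here.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-language F1 scores."""
    languages = sorted(set(gold) | set(predicted))
    scores = []
    for lang in languages:
        pairs = list(zip(gold, predicted))
        tp = sum(g == lang and p == lang for g, p in pairs)
        fp = sum(g != lang and p == lang for g, p in pairs)
        fn = sum(g == lang and p != lang for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Invented example: 8 English and 2 French items, one French item mislabeled.
gold = ["eng"] * 8 + ["fra"] * 2
predicted = ["eng"] * 8 + ["eng", "fra"]

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(accuracy, macro_f1(gold, predicted))
```

In this example accuracy is 0.9 while Macro F1 is about 0.80, because the single error falls on the smaller class and halves its recall.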

slide-56
SLIDE 56

Case Study: Do our solutions provide socially equitable language identification for health-related queries?

Data: 1M tweets containing any of 385 English terms from established lexicons for influenza, psychological well-being, and social health

Lamb et al. (2013); Smith et al. (2016); Preotiuc-Pietro et al. (2015); Park et al. (2016)

Task: does the language identification system recognize every tweet as English?
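
The task can be sketched as a per-country recognition rate: since every tweet matched an English lexicon term, the share labeled English by a LID system serves as its estimated accuracy for that country, which can then be lined up against the country's Human Development Index. The country names, HDI values, and predictions below are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def english_recognition_rate(records):
    """Share of tweets labeled English, per origin country.

    records: iterable of (country, predicted_language) pairs, where every
             tweet is assumed to be English because it matched the lexicon.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for country, predicted in records:
        totals[country] += 1
        hits[country] += predicted == "en"
    return {country: hits[country] / totals[country] for country in totals}


# Invented data: two hypothetical countries with made-up HDI values.
hdi = {"country_a": 0.90, "country_b": 0.50}
records = [
    ("country_a", "en"), ("country_a", "en"), ("country_a", "en"), ("country_a", "und"),
    ("country_b", "en"), ("country_b", "hi"), ("country_b", "en"), ("country_b", "tl"),
]
rates = english_recognition_rate(records)
for country in sorted(hdi, key=hdi.get):
    print(country, hdi[country], rates[country])
```

An equitable system would show roughly the same rate at every HDI level; a rate that rises with HDI is the disparity the case study looks for.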

slide-57
SLIDE 57

Equilid raises the bar for socially equitable language identification

Figure: estimated accuracy for English tweets, plotted against the Human Development Index of the text’s origin country.

slide-58
SLIDE 58

Social Equality doesn’t stop at Language Identification

Methodologies capable of handling language as it is used

Better social representation in our data

slide-60
SLIDE 60

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky

Be equitable!

https://github.com/davidjurgens/equilid