Incorporating Dialectal Variability for Socially Equitable Language Identification
David Jurgens, Yulia Tsvetkov, and Dan Jurafsky
McNamee, P., “Language identification: a solved problem suitable for undergraduate instruction” Journal
[Figure: speaker populations labeled 125M, 90M, 79M, 60M, and 251M speakers; languages shown: English, French, Spanish, Arabic]
[Figure: estimated LID accuracy for English tweets plotted against the Human Development Index (education, life expectancy, income) of the text’s origin country. The plot is annotated with 23% and with the labels “More Dialect” and “Less Dialect” (Labov, 1964; Ash, 2002).]
[Diagram: text-processing pipeline with the components Keyword Filter (“flu”, “sick”), Language Detection (non-English?), and NLP (Which symptoms?)]
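The pipeline in the diagram can be sketched roughly as follows. This is a minimal illustration and not the authors' system: `keyword_filter`, `run_pipeline`, and `toy_detector` are hypothetical names, the order of the two filters is assumed, and the detector is a deliberately naive stand-in that mislabels an invented informal tweet, to show how a language-detection error can remove text before the NLP stage ever sees it.

```python
from typing import Callable, Iterable, List, Set, Tuple

KEYWORDS = {"flu", "sick"}  # the two keywords shown on the slide


def tokens_of(tweet: str) -> Set[str]:
    """Lowercased whitespace tokens with trailing punctuation stripped."""
    return {tok.strip(".,!?").lower() for tok in tweet.split()}


def keyword_filter(tweets: Iterable[str]) -> List[str]:
    """Keep tweets that mention at least one keyword."""
    return [tweet for tweet in tweets if tokens_of(tweet) & KEYWORDS]


def run_pipeline(
    tweets: Iterable[str], detect_language: Callable[[str], str]
) -> Tuple[List[str], List[str]]:
    """Apply the keyword filter, then language detection.

    Returns (kept, dropped): tweets passed on to the NLP stage and
    tweets discarded because they were not labeled English.
    """
    kept, dropped = [], []
    for tweet in keyword_filter(tweets):
        if detect_language(tweet) == "eng":
            kept.append(tweet)
        else:
            dropped.append(tweet)
    return kept, dropped


def toy_detector(tweet: str) -> str:
    """Hypothetical stand-in for a real LID system (assumption).

    Labels a tweet English only if it contains a common
    standard-English function word.
    """
    standard_words = {"i", "the", "have", "am", "is"}
    return "eng" if tokens_of(tweet) & standard_words else "und"


tweets = [
    "I have the flu again.",
    "im sick af cant even rn",  # invented example of informal English
    "Nice weather today",
]
kept, dropped = run_pipeline(tweets, toy_detector)
```

Here the informal tweet matches a keyword but is dropped by the toy detector, so a downstream "Which symptoms?" analysis would never count it.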
[Figure: estimated accuracy for English tweets plotted against the Human Development Index of the text’s origin country, labeled “More Dialect” and “Less Dialect” (Labov, 1964; Ash, 2002).]
Our goal is to make language ID performance equal for all languages across all dialects.
This is a universal NLP issue!
Data: No corpora capture global variation in lexicon and dialect
Model: makes simplistic assumptions about how multilinguals communicate
NLP methodologies capable of handling linguistic variation
Better social representation through network-based sampling
Our Data Solution: Improve linguistic representation through network-based sampling
Bootstrap dialectal corpora using existing classifiers to find monolingual individuals
Sample from the geolocated Twitter social network to include text from people at all locations
[Diagram: classifier labels eng, eng, eng, eng, eng, eng, fra]
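The bootstrapping step can be sketched as follows, under stated assumptions. The slides do not give the criterion for calling someone monolingual, so the `min_share` and `min_tweets` thresholds, and the function names, are hypothetical choices for illustration only. The input is assumed to be the per-tweet labels that existing classifiers assigned to each user.

```python
from collections import Counter
from typing import Dict, List, Optional


def dominant_language(
    labels: List[str], min_share: float, min_tweets: int
) -> Optional[str]:
    """Return the user's language if one label dominates, else None."""
    if len(labels) < min_tweets:
        return None  # too little evidence to decide
    language, count = Counter(labels).most_common(1)[0]
    return language if count / len(labels) >= min_share else None


def find_monolingual_users(
    user_labels: Dict[str, List[str]],
    min_share: float = 0.95,  # assumed threshold, not from the slides
    min_tweets: int = 5,      # assumed threshold, not from the slides
) -> Dict[str, str]:
    """Map each user judged monolingual to their dominant language."""
    result = {}
    for user, labels in user_labels.items():
        language = dominant_language(labels, min_share, min_tweets)
        if language is not None:
            result[user] = language
    return result


user_labels = {
    "user_a": ["eng"] * 20,
    "user_b": ["eng"] * 6 + ["fra"],  # the label mix shown in the diagram
    "user_c": ["fra"] * 3,            # too few tweets to decide
}
monolingual = find_monolingual_users(user_labels)
```

With these assumed thresholds only `user_a` qualifies. A lower `min_share` would also accept `user_b` and treat the single `fra` label as classifier noise.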
Topical, Social, Geographic, Multilingual
Our model solution: treat language identification as a character-based sequence-to-sequence task.
Encoder: encodes the whole sentence using its characters
Decoder: decodes each word’s language from the sentence encoding
[Diagram: the Encoder reads the characters “J e _ … .”; the Decoder outputs Fra Fra Fra Fra Fra . Eng Eng Eng Eng Eng . Each block represents a multi-layer recurrent neural network.]
Jaech et al. 2016; Samih et al. 2016
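To make the input and output of this formulation concrete, here is a small data-preparation sketch. It is not the model itself: it only shows how a sentence might be turned into a character sequence for the encoder and a per-word language sequence for the decoder. The helper names are hypothetical, the example sentence is invented, punctuation tokens are left out for brevity, and the use of "_" for spaces is an assumption based on the "J e _" shown in the diagram.

```python
from typing import List, Tuple

SPACE = "_"  # assumed space marker, inferred from "J e _" on the slide


def to_char_sequence(sentence: str) -> List[str]:
    """Encoder input: one token per character, spaces replaced by SPACE."""
    return [SPACE if ch == " " else ch for ch in sentence]


def make_example(
    words: List[str], languages: List[str]
) -> Tuple[List[str], List[str]]:
    """Pair a character sequence with one language label per word."""
    if len(words) != len(languages):
        raise ValueError("need exactly one language label per word")
    return to_char_sequence(" ".join(words)), list(languages)


# Invented code-switched sentence for illustration.
words = ["Je", "suis", "fatigué", "I", "am", "tired"]
languages = ["fra", "fra", "fra", "eng", "eng", "eng"]
source, target = make_example(words, languages)
```

The source side is much longer than the target side (one token per character against one label per word), which is why the task is framed as sequence-to-sequence rather than as tagging each input token.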
[Figure: Macro F1 (axis 25 to 100) in three panels. 70 Languages on Twitter: langid.py, CLD2, Our Method. Geo-diverse Tweets: langid.py, CLD2, Our Method. Multilingual Tweets: Polyglot, CLD2, Our Method.]
Lui et al. 2013, 2014
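For readers unfamiliar with the metric on these charts, the following sketch computes macro F1: the F1 score is computed separately for each language and then averaged with equal weight, so small languages count as much as large ones. The function names and toy labels are illustrative, and averaging over the languages present in the gold labels is an assumption about the evaluation setup.

```python
from typing import Dict, List


def f1_per_language(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """F1 for every language that appears in the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    scores = {}
    for language in sorted(set(gold)):
        tp = sum(g == language and p == language for g, p in pairs)
        fp = sum(g != language and p == language for g, p in pairs)
        fn = sum(g == language and p != language for g, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero here
        # because each language occurs at least once in the gold labels.
        scores[language] = 2 * tp / (2 * tp + fp + fn)
    return scores


def macro_f1(gold: List[str], predicted: List[str]) -> float:
    """Unweighted mean of the per-language F1 scores."""
    scores = f1_per_language(gold, predicted)
    return sum(scores.values()) / len(scores)


# Toy labels: one French tweet is mistaken for English.
gold = ["eng", "eng", "eng", "eng", "fra", "fra"]
predicted = ["eng", "eng", "eng", "eng", "eng", "fra"]
score = macro_f1(gold, predicted)
```

In this toy case plain accuracy is 5/6, while macro F1 is lower at 7/9, because the single error costs the smaller language half of its recall.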
[Figure: Macro F1 (axis 50 to 100) comparing Jaech et al. (2016) and Our Method. 70 Languages on Twitter: 92 and 91.2. TweetLID: 79.6 and 78.7.]
1M tweets with any of 385 English terms from established lexicons for influenza, psychological well-being, and social health
Lamb et al. (2013); Smith et al. (2016); Preotiuc-Pietro et al. (2015); Park et al. (2016)
Task: does the language identification system recognize every tweet as English?
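One way to score this task is sketched below: if every tweet in the sample is taken to be English, the share of tweets that a system labels English serves as an estimate of its accuracy, and grouping by origin country allows a comparison against each country's Human Development Index. The function name, the record layout, and the placeholder country names and labels are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def english_rate_by_country(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, float]:
    """Share of tweets labeled "eng", per origin country.

    Each record is (country, predicted_language). All tweets are
    assumed to be English, so this share estimates accuracy.
    """
    totals: Dict[str, int] = defaultdict(int)
    recognized: Dict[str, int] = defaultdict(int)
    for country, predicted in records:
        totals[country] += 1
        if predicted == "eng":
            recognized[country] += 1
    return {country: recognized[country] / totals[country] for country in totals}


# Invented placeholder data, not results from the study.
records = [
    ("country_a", "eng"), ("country_a", "eng"),
    ("country_a", "eng"), ("country_a", "und"),
    ("country_b", "eng"), ("country_b", "fra"),
]
rates = english_rate_by_country(records)
```

Plotting these per-country rates against HDI would give a chart of the same kind as the accuracy figures in this deck.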
[Figure: estimated accuracy for English tweets plotted against the Human Development Index of the text’s origin country]
Methodologies capable of handling language as it is used
Better social representation in
David Jurgens, Yulia Tsvetkov, and Dan Jurafsky
https://github.com/davidjurgens/equilid