Character Eyes: Seeing Language through Character-Level Taggers - PowerPoint PPT Presentation



SLIDE 1

Character Eyes: Seeing Language through Character-Level Taggers

Yuval Pinter Marc Marone Jacob Eisenstein

@yuvalpi @ruyimarone @jacobeisenstein

https://github.com/ruyimarone/character-eyes

Blackbox NLP 2019

SLIDE 2

Taggers

The/DET cat/Nsg walked/Vpast fast/RB

SLIDE 3

Neural Taggers

The/DET cat/Nsg walked/Vpast fast/RB

SLIDE 4

Character-level Neural Taggers

The/DET cat/Nsg walked/Vpast fast/RB (input read as characters: T h e | c a t | w a l k e d | f a s t)

SLIDE 5

Character-level Recurrent Neural Taggers

The/DET cat/Nsg walked/Vpast fast/RB (input read as characters: T h e | c a t | w a l k e d | f a s t)


SLIDE 7

Recurrent Taggers – Good at Finding Morphemes?

The/DET cat/Nsg walked/Vpast fast/RB (characters: T h e | c a t | w a l k e d | f a s t)

Agglutination


SLIDE 9

Recurrent Taggers – Good at Prefixes and Suffixes?

thecat/Nsg;def walked/Vpast fast/RB (characters: t h e c a t | w a l k e d | f a s t)

Prefixing morphology (e.g. Coptic)


SLIDE 11

Recurrent Taggers – Can They Handle diSCoNtinUiTY?

The/DET cat/Nsg waeldk/Vpast fast/RB (characters: T h e | c a t | w a e l d k | f a s t)

Introflexive morphology (Hebrew, Arabic)


SLIDE 13

Main Idea(s)

[Figure: character sequences (“w a l k e d”, “w a e l d k”, “t h e c a t”) between a Language and a Model]

measure how models encode different linguistic patterns

SLIDE 14

Main Idea(s)

characterize languages based on model analysis; help engineer language-aware systems


SLIDE 18

Analysis Primitive – Unit Decomposition

The/DET cat/Nsg walked/Vpast fast/RB (characters: T h e | c a t | w a l k e d | f a s t)

[Figure: activation traces of hidden unit #n and hidden unit #m over the character sequence]

  • Assumption: units are “in charge” of tracking morphemes that help predict POS
  • Hypothesis: easy for agglutinations, difficult for introflexions
  • Hypothesis: unit’s direction affects ease of tracking suffixes vs. prefixes


SLIDE 21

Evidence?

  • Turkish is an agglutinative language
    ○ ev ‘house’; evler ‘houses’; evleriniz ‘your houses’; evlerinizden ‘from your houses’

[Figure: activation traces for Unit 124 and Unit 3 (→)]


SLIDE 25

Model & Data

  • Universal Dependencies (n=24)
    ○ POS tags + Morphosyntactic Descriptions
  • Linguistic diversity – morph. synthesis:
    ○ 5 agglutinative languages
    ○ 2 introflexive languages
    ○ 3 isolating, 14 fusional
  • Linguistic diversity – affixation (all 24):
    ○ 1 prefixing language
    ○ 2 non-affixing
    ○ 2 equally pre- and suffixing
    ○ 19 suffixing

Source for language classes: WALS


SLIDE 27

Model & Data

  • Universal Dependencies (n=24)
    ○ POS tags + Morphosyntactic Descriptions
    ○ Linguistic diversity (synthesis + affixation)
  • Word → Tag: Bidirectional LSTM + MLP
    ○ (Not analyzed)
    ○ No word embeddings
  • Char → Word: Bidirectional LSTM
    ○ Char embedding size: 256
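A minimal sketch of the char → word step described above, with a plain tanh RNN cell standing in for the LSTM. Only the char embedding size (256) comes from the slide; the per-direction hidden size, the weight initializations, and the alphabet are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 256   # char embedding size (from the slide)
HID = 64    # per-direction hidden size (an assumption)

# One embedding vector per character; one weight set per direction
emb = {c: rng.normal(size=EMB) for c in "abcdefghijklmnopqrstuvwxyz"}
weights = {d: (rng.normal(scale=0.01, size=(HID, EMB)),
               rng.normal(scale=0.01, size=(HID, HID))) for d in ("fwd", "bwd")}

def run(word, direction):
    """Read the word's characters in one direction; return the final hidden state."""
    W_x, W_h = weights[direction]
    h = np.zeros(HID)
    for c in (word if direction == "fwd" else reversed(word)):
        h = np.tanh(W_x @ emb[c] + W_h @ h)  # tanh RNN in place of the LSTM cell
    return h

def word_vector(word):
    # Concatenate the final forward and backward states into the word
    # representation that the (not analyzed) word-level tagger would consume.
    return np.concatenate([run(word, "fwd"), run(word, "bwd")])

assert word_vector("walked").shape == (2 * HID,)
```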


slide-33
SLIDE 33

0.42

Analysis Metrics

  • Run model on training data words
  • Collect activation levels for each unit
  • Aggregate to single measure

(e.g. average absolute or max-delta)

  • Bin per unit over parts of speech
  • Mutual Information metric – POS

Discrimination Index, or PDI

○ (Higher PDI = better discriminator)

16

Unit 42 [0.0,0.1) [0.1,0.2) … [0.9,1.0) NOUN 8 2 … 40 VERB 20 … 4 … … … … … ADJ 10 10 … 10
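The PDI above is a mutual-information metric over per-unit count tables like this one. A sketch of one plausible formulation, the mutual information (in bits) between POS tag and activation bin, computed from a `{tag: [bin counts]}` table; the paper's exact binning and normalization may differ:

```python
from math import log2

def pdi(counts):
    """Mutual information I(POS; bin) from a {tag: [bin counts]} table."""
    total = sum(sum(row) for row in counts.values())
    p_tag = {t: sum(row) / total for t, row in counts.items()}
    n_bins = len(next(iter(counts.values())))
    p_bin = [sum(row[j] for row in counts.values()) / total for j in range(n_bins)]
    mi = 0.0
    for t, row in counts.items():
        for j, c in enumerate(row):
            if c:  # zero cells contribute nothing to the sum
                p = c / total
                mi += p * log2(p / (p_tag[t] * p_bin[j]))
    return mi

# A unit whose activation bin fully determines the tag is a perfect discriminator...
assert abs(pdi({"NOUN": [10, 0], "VERB": [0, 10]}) - 1.0) < 1e-9
# ...while one whose activations are independent of the tag carries no information.
assert abs(pdi({"NOUN": [10, 10], "VERB": [10, 10]})) < 1e-9
```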


SLIDE 35

Analysis Metrics

  • Aggregate across units by
    ○ Summing total mass
    ○ Reporting % of forward units before mass median

[Figure: activation-bin tables for Units 40, 41, and 42, with the mass median marked]
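The two aggregations can be sketched as follows, under one plausible reading of "mass median": rank all units by PDI, walk down the ranking until half the total mass is covered, and report the forward share among those top units (the paper's exact procedure may differ):

```python
def aggregate(pdi_fwd, pdi_bwd):
    """Return (total PDI mass, share of forward units among the highest-PDI
    units that together account for half the mass)."""
    units = sorted([(v, "fwd") for v in pdi_fwd] + [(v, "bwd") for v in pdi_bwd],
                   reverse=True)
    total = sum(v for v, _ in units)
    mass, top = 0.0, []
    for v, d in units:
        if mass >= total / 2:  # stop once half the mass is covered
            break
        mass += v
        top.append(d)
    return total, top.count("fwd") / len(top)

# Toy values: forward units dominate, so they make up all of the top half-mass
total, fwd_share = aggregate([0.9, 0.7, 0.1], [0.3, 0.2, 0.1])
assert abs(total - 2.3) < 1e-9
assert fwd_share == 1.0
```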

SLIDE 36

Findings (Cherry Pick)

  • Coptic: agglutinative, prefixing
    ○ Large mass (easy to distinguish POS based on char sequence)
    ○ Forward-heavy (71%)
  • English: fusional, suffixing
    ○ Small mass (hard to capture POS)
    ○ Backward-heavy (80%)


SLIDE 39

Findings (General Trends)

  • 4/5 agglutinatives hold 4/6 top total-mass positions
  • 2/2 introflexives in bottom 2/4 spots (Persian and Hindi below, both fusional w/ non-Latin charsets)

[Figure: Total PDI mass per language]


SLIDE 43

Direction Balance Study

  • Some languages might not need two equal LSTM directions
  • What if… they don’t need one of them at all?
  • What if they need them in a different balance? Somewhere in the middle?
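The study varies how a fixed budget of recurrent units is split between the two directions. A toy sketch of such an asymmetric encoder (a plain tanh RNN in place of the LSTM; the budget of 128, the weights, and the character featurization are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
TOTAL = 128  # fixed budget of recurrent units to split between directions

def word_vector(word, n_forward):
    """Encode a word with n_forward left-to-right units and TOTAL - n_forward
    right-to-left units; either side may be empty (fully unidirectional)."""
    parts = []
    for n, seq in ((n_forward, word), (TOTAL - n_forward, word[::-1])):
        if n == 0:
            continue
        W_x = rng.normal(scale=0.01, size=(n, 16))
        W_h = rng.normal(scale=0.01, size=(n, n))
        h = np.zeros(n)
        for c in seq:
            x = np.full(16, (ord(c) % 64) / 64.0)  # crude stand-in for a char embedding
            h = np.tanh(W_x @ x + W_h @ h)
        parts.append(h)
    return np.concatenate(parts)

# Every balance, from fully backward to fully forward, yields the same output size
for n_fwd in (0, 32, 64, 96, 128):
    assert word_vector("walked", n_fwd).shape == (TOTAL,)
```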


SLIDE 49

Balance Study – Results

  • Can unidirectional models outperform bidirectionals?
  • Yes.
    ○ Especially on agglutinative languages and on suffixing languages
    ○ Fully-forward better than fully-backward
    ○ MAJOR caveat – 128x128 > 2*(64x64)
  • Is there a sweet spot in the middle?
  • Not that we can tell.

[Figure: tagging accuracy change, grouped by morphology (Agglutinative / Introflexive / Fusional) and by affixation (Strongly Suffixing / Little Affixation / Prefixing)]


SLIDE 57

Summary + Open Ends

  • Introduced PDI to aggregate information from hidden units, some applicability to language characterization
    ○ Extensible to any <instance, unit> metric on any neural classifier
  • Found substantial differences between differently-balanced recurrent models
  • Are we quantifying data instead of languages?
  • Affixing: many languages (e.g. English) have higher PDI for backward units, but fare better with more forward units. Is this:
    ○ A saturation effect?
    ○ A fault in assuming PDI measures unit importance?

SLIDE 58

Thank You!

uvp@gatech.edu mmarone6@gatech.edu https://github.com/ruyimarone/character-eyes