SLIDE 1

Learning Word Embeddings for Low-resource Languages by PU Learning

Chao Jiang, Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang

SLIDE 2

Word Embeddings are useful

  • Many success stories:
  • Named entity recognition
  • Document ranking
  • Sentiment analysis
  • Question answering
  • Image captioning
  • Pre-trained word vectors have been widely used:
  • GloVe [Pennington+14]: 3,900+ citations
  • Word2Vec [Mikolov+13]: 7,600+ citations

SLIDE 3

Existing English embeddings are trained on large collections of text

  • Word2Vec is trained on the Google News dataset (100 billion tokens).
  • GloVe is trained on a crawled corpus (840 billion tokens).

SLIDE 4

How about other languages?

SLIDE 5

How about other languages?

  • # Wikipedia articles in different languages
  • English: ~ 2.5 M
  • German: ~ 800 K
  • French: ~ 700 K
  • Czech: ~100 K
  • Danish: ~95K
  • Chichewa: 58

High-resource languages: 23 languages have more than 100K articles.
Low-resource languages: 60 languages have 10K to 100K articles.
Very low-resource languages: 183 languages have fewer than 10K articles.

SLIDE 6

Sparsity of the co-occurrence matrix

  • Word embeddings are trained based on co-occurrence statistics
  • When the training corpus is small:
  • Many word pairs are unobserved
  • The co-occurrence matrix is very sparse
  • Example: the text8 dataset
  • 17,000,000 tokens and 71,000 distinct words
  • The co-occurrence matrix has more than 5,000,000,000 entries; > 99% are zeros.
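As a quick size check on those numbers: a vocabulary of 71,000 distinct words gives

$$71{,}000 \times 71{,}000 \approx 5.0 \times 10^9$$

matrix entries.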

SLIDE 7

Zeros in the co-occurrence matrix

  • True zeros
  • Word pairs that are unlikely to co-occur
  • Missing entries
  • Word pairs that can co-occur
  • Unobserved in the training data

[Figure: a center-word × context-word co-occurrence matrix over words (alien, table, …, cake, space); among the zero entries, some are true zeros and others are missing entries.]

SLIDE 8

Motivation

SLIDE 9

Our contributions

1. Propose a PU-Learning framework for training word embeddings
2. Design an efficient learning algorithm to deal with all negative pairs
3. Demonstrate that unobserved word pairs provide valuable information

SLIDE 10

PU-Learning for Training Word Embeddings

SLIDE 11

PU Learning Framework

1. Pre-processing: building the co-occurrence matrix
2. Matrix factorization by PU-Learning
3. Post-processing

SLIDE 12

Step 1 – Building the co-occurrence matrix

  • Count word co-occurrence statistics
  • We follow [Levy+15] to scale the co-occurrence counts by the PPMI metric

[Figure: sliding a context window over "… the black cat likes milk …" produces the pairs (cat, the), (cat, black), (cat, likes), (cat, milk), …; the counts fill a center-word × context-word matrix, which is then scaled by PPMI.]
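To make Step 1 concrete, here is a minimal Python sketch of building a PPMI-scaled co-occurrence matrix. This is our own illustration, not the authors' code: the name build_ppmi is hypothetical, a dense matrix is used for clarity, and a real implementation would use sparse storage.

```python
import numpy as np

def build_ppmi(tokens, window=2):
    """Count center/context co-occurrences within +-window, then scale by PPMI."""
    vocab = {w: i for i, w in enumerate(dict.fromkeys(tokens))}
    n = len(vocab)
    C = np.zeros((n, n))
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                C[vocab[center], vocab[tokens[j]]] += 1.0
    total = C.sum()
    joint = C / total                  # P(center, context)
    pw = C.sum(axis=1) / total         # P(center)
    pc = C.sum(axis=0) / total         # P(context)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / np.outer(pw, pc))
    # PPMI = max(PMI, 0); unobserved pairs stay exactly zero.
    return np.where(C > 0, np.maximum(pmi, 0.0), 0.0)

A = build_ppmi("the black cat likes milk".split(), window=2)
```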

SLIDE 13

[Figure: the resulting center-word × context-word matrix (center words: cat, dog, …, table, happy; context words: black, blue, …, yellow, milk) contains frequent zeros.]

SLIDE 14

Step 2 - PU-Learning for matrix factorization

[Figure: the PPMI matrix A is factorized into low-rank word and context matrices W and H, so that each entry of A is approximated by an inner product of their rows.]
slide-15
SLIDE 15

Step 2 - PU-Learning for matrix factorization

  • The objective combines a weighting function, a reconstruction error, and a regularization term:

$$\min_{W,H}\ \sum_{i,j} C_{ij}\,\big(A_{ij} - \mathbf{w}_i^\top \mathbf{h}_j\big)^2 \;+\; \lambda\,\big(\lVert W\rVert_F^2 + \lVert H\rVert_F^2\big)$$

  • $C_{ij}$: weighting function; $(A_{ij} - \mathbf{w}_i^\top \mathbf{h}_j)^2$: reconstruction error; the $\lambda$ term: regularization

SLIDE 16

Step 2 – Weighting function

[Figure: the weighting function assigns a separate weight to the frequent zeros in the matrix.]
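One way to write this weighting, consistent with Slide 27 where 𝜍 is the weight for zero entries (the unit weight on observed entries is an illustrative assumption, not taken from the slides):

$$C_{ij} = \begin{cases} 1 & \text{if } A_{ij} > 0 \quad (\text{observed pair}) \\ \varsigma & \text{if } A_{ij} = 0 \quad (\text{true zero or missing}) \end{cases}$$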

SLIDE 17

Step 2 - PU-Learning for matrix factorization

  • We consider all entries
  • Both positive and zero entries

SLIDE 18

Step 2 - PU-Learning for matrix factorization

  • We design an efficient coordinate descent algorithm (see paper for details)
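As a rough illustration of what such an update looks like, below is a naive dense Python sketch of coordinate descent on the weighted objective above. It is entirely our own sketch: pu_mf, rho, and lam are hypothetical names, rho plays the role of the zero-entry weight 𝜍, and the paper's algorithm is far more efficient because it exploits the shared weight on all zero entries instead of materializing the full matrix.

```python
import numpy as np

def pu_mf(A, rank=4, rho=0.1, lam=0.01, iters=20, seed=0):
    """Naive coordinate descent for weighted (PU) matrix factorization.

    Minimizes  sum_ij C_ij (A_ij - w_i . h_j)^2 + lam (||W||^2 + ||H||^2),
    where C_ij = 1 for observed (positive) entries and rho for zeros.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = 0.1 * rng.standard_normal((m, rank))
    H = 0.1 * rng.standard_normal((n, rank))
    Cw = np.where(A > 0, 1.0, rho)       # confidence weights C_ij
    R = A - W @ H.T                      # residual matrix
    for _ in range(iters):
        for k in range(rank):
            # Remove the rank-1 contribution of coordinate k from the residual.
            R += np.outer(W[:, k], H[:, k])
            # Closed-form weighted least-squares update for each coordinate.
            W[:, k] = ((Cw * R) @ H[:, k]) / (Cw @ (H[:, k] ** 2) + lam)
            H[:, k] = ((Cw * R).T @ W[:, k]) / (Cw.T @ (W[:, k] ** 2) + lam)
            # Put the updated rank-1 term back.
            R -= np.outer(W[:, k], H[:, k])
    return W, H

A = np.array([[0.8, 0.0, 0.2],
              [0.0, 0.5, 0.0],
              [0.1, 0.0, 0.7]])
W, H = pu_mf(A, rank=2)
```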


SLIDE 23

Step 3 – Post-processing

SLIDE 24

Experiments

SLIDE 25

Results on English


Simulating the low-resource setting: embeddings are trained on a subset of Wikipedia with 32M tokens.

[Charts: analogy task on the Google dataset; word similarity task on WS353.]

SLIDE 26

Results on Danish (more results in paper)

[Charts: analogy task on the Google dataset; word similarity task on WS353.]

Danish Wikipedia with 64M tokens. Test sets were translated by Google Translate (with 90% accuracy, as verified by native speakers).

SLIDE 27

  • Weight for zero entries in the co-occurrence matrix
  • Zero entries can be true zeros or missing
  • 𝜍 reflects how confident we are that the zero entries are true zeros
SLIDE 28

Take home messages

  • A PU-Learning framework for learning word embeddings in the low-resource setting
  • Unobserved word pairs provide valuable information

Thanks!