Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning



SLIDE 1

1/27

Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning

Guillaume Wisniewski Nicolas Pécheux Souhir Gahbiche-Braham François Yvon

Université Paris-Sud & LIMSI-CNRS

October 28, 2014

SLIDE 2

2/27

Context

▶ Supervised machine learning techniques have established new performance standards for many NLP tasks

▶ Their success crucially depends on the availability of annotated in-domain data

▶ This is not such a common situation (e.g. for under-resourced languages)

▶ What can we do then?

SLIDE 3

3/27

Context

▶ Unsupervised learning
▶ Crawl data (e.g. Wiktionary)

SLIDE 4

4/27

Context


[Figure: "Resource-rich language" → Transfer → "Less-resourced language"]

▶ Cross-lingual transfer (weakly supervised learning)

Example

[Figure: word-aligned sentence pair — French "Un marché pour la recherche scientifique" and English "Making a Market for Scientific Research" — with the English POS tags (VERB, DET, NOUN, ADP, NOUN, NOUN) projected onto the French words through the alignment links]
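The projection step shown in the example can be written out in a few lines. This is an illustrative sketch, not the authors' implementation; the function name `project_tags`, the example tags, and the alignment links are invented for the example:

```python
# Hypothetical sketch of direct POS projection across word alignments.
# Alignment links are (source_index, target_index) pairs.

def project_tags(source_tags, alignment, target_len):
    """Copy the source-side POS tag to each aligned target word."""
    projected = [None] * target_len  # unaligned words stay unlabeled
    for src_i, tgt_i in alignment:
        projected[tgt_i] = source_tags[src_i]
    return projected

# English side tagged by a supervised tagger; the links are illustrative
# (a→Un, Market→marché, for→pour, Research→recherche, Scientific→scientifique).
en_tags = ["VERB", "DET", "NOUN", "ADP", "ADJ", "NOUN"]
links = [(1, 0), (2, 1), (3, 2), (5, 4), (4, 5)]
print(project_tags(en_tags, links, 6))
# → ['DET', 'NOUN', 'ADP', None, 'NOUN', 'ADJ']
```

Words with no alignment link (here "la") receive no projected label, which is one source of the partial supervision discussed later.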

SLIDE 6

5/27

State of the art

▶ In most cases this only results in partially annotated data
▶ Alternative ML techniques need to be designed

State of the art

▶ Partially observed CRFs [Täckström et al., 2013]
▶ Posterior regularization [Ganchev and Das, 2013]
▶ Expectation maximization [Wang and Manning, 2014]

SLIDE 7

6/27

Contributions

1. We cast this problem in the framework of ambiguous learning [Bordes et al., 2010, Cour et al., 2011]
2. We present a novel method to learn from ambiguous supervision data
3. We show significant improvements over the prior state of the art
4. We conduct a detailed analysis that allows us to identify the limits of transfer-based methods and of their evaluation

SLIDE 8

7/27

Part I Projecting Labels across Aligned Corpora

SLIDE 9

8/27

Hypothesis

▶ In this work we focus on POS tagging

Strong assumption

Syntactic categories in the source language can be directly mapped to those in the target language.

Universal tagset [Petrov et al., 2012]

{ Noun, Verb, Adj, Adv, Pron, Det, Adp, Num, Conj, Prt, ‘.’, X }

▶ All annotations are mapped to this universal tagset
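As a concrete illustration of such a mapping (here from Penn-Treebank-style English tags, following the published Petrov et al. mapping tables; the helper name `to_universal` is ours):

```python
# Illustrative mapping from fine-grained (Penn Treebank-style) tags to the
# 12-tag universal tagset of Petrov et al. (2012). Partial table only.
UNIVERSAL_MAP = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET",
    "IN": "ADP", "CD": "NUM", "CC": "CONJ",
    "PRP": "PRON", "RP": "PRT", ".": ".",
}

def to_universal(tags):
    # Anything without a known mapping falls back to the catch-all X.
    return [UNIVERSAL_MAP.get(t, "X") for t in tags]

print(to_universal(["DT", "NN", "VBZ", "JJ", "FW"]))
# → ['DET', 'NOUN', 'VERB', 'ADJ', 'X']
```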

SLIDE 10

9/27

Type and token constraints

Transfer-based methods only deliver partial and noisy supervision

▶ Heuristic filtering rules [Yarowsky et al., 2001]
▶ Graph-based projection [Das and Petrov, 2011]
▶ Combination with monolingual information [Täckström et al., 2013]

Type and token constraints [Täckström et al., 2013]

1. Type constraints, from a dictionary
2. Token constraints, projected through alignment links



SLIDE 13

10/27

Type constraints

From tag dictionaries

▶ Automatically extracted from Wiktionary
▶ Built from the projected labels across the aligned corpora

[Figure: two occurrences of "marché", aligned once to "market" (NOUN) and once to "walked" (VERB) ⇒ the projected dictionary entry for "marché" is {NOUN, VERB}]

▶ We use the intersection of the two above
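One plausible reading of this intersection, sketched in code (the helper `intersect_dicts` and the toy dictionaries are illustrative, not the authors' code): for each word type, keep only the tags allowed by both the Wiktionary dictionary and the projected one.

```python
# Sketch: combine a Wiktionary-derived tag dictionary with one built from
# projected labels, by intersecting the allowed tag sets per word type.

def intersect_dicts(d1, d2):
    out = {}
    for word in d1.keys() & d2.keys():   # word types present in both
        tags = d1[word] & d2[word]
        if tags:                          # drop words with disjoint tag sets
            out[word] = tags
    return out

wiktionary = {"marché": {"NOUN", "VERB"}, "la": {"DET", "PRON"}}
projected  = {"marché": {"NOUN", "VERB", "ADP"}, "la": {"DET"}}
print(intersect_dicts(wiktionary, projected))
```

Intersecting trades coverage for precision: noisy tags that only one source proposes (here ADP for "marché") are filtered out.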

SLIDE 14

11/27

Token constraints

  • 1. Use the type constraints

[Figure: English "Making a Market for Scientific Research", tagged VERB DET NOUN ADP NOUN NOUN; French "Un marché pour la recherche scientifique" with type-constraint tag sets — Un {ADJ, DET, NOUN, PRON}, marché {NOUN, VERB}, pour {ADP, NOUN}, la {DET, NOUN, PRON}, recherche {NOUN, VERB}, scientifique {NOUN, ADJ}]

SLIDE 15

11/27

Token constraints

  • 2. Use the alignment links from the parallel corpora

[Figure: the same aligned sentence pair as in step 1, now with the alignment links between the English and French words drawn in]

SLIDE 16

11/27

Token constraints

  • 3. Tag the source side (resource-rich)

[Figure: the same aligned sentence pair as in step 1, with the English (source) side tagged by a supervised tagger]

SLIDE 17

11/27

Token constraints

  • 4. Project labels if licensed by type constraints

[Figure: the same aligned sentence pair as in step 1, with each source tag projected to the aligned French word whenever the type constraints license it]
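Steps 1–4 above can be sketched as follows. This is a hedged reconstruction, not the authors' code; `token_constraints` and all example data are invented. A projected source tag is kept as a token constraint only when the target word's type constraints license it; otherwise the type set remains as ambiguous supervision.

```python
# Sketch of the token-constraint construction (steps 1-4 above).

def token_constraints(target_words, projected_tags, type_dict, all_tags):
    constraints = []
    for word, proj in zip(target_words, projected_tags):
        allowed = type_dict.get(word, all_tags)   # step 1: type constraints
        if proj is not None and proj in allowed:  # step 4: licensed projection
            constraints.append({proj})            # confident token constraint
        else:
            constraints.append(set(allowed))      # ambiguous supervision
    return constraints

ALL = {"NOUN", "VERB", "DET", "ADP", "ADJ", "PRON"}
type_dict = {"marché": {"NOUN", "VERB"}, "la": {"DET", "PRON"}}
words = ["Un", "marché", "la"]
proj  = ["DET", "NOUN", "ADP"]   # "ADP" clashes with la's type set
print(token_constraints(words, proj, type_dict, ALL))
```

In the toy run, "marché" gets the single projected tag NOUN, while "la" keeps its ambiguous set {DET, PRON} because the (noisy) projected ADP is not licensed.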

SLIDE 18

12/27

Part II Modeling Sequences under Ambiguous Supervision

SLIDE 19

13/27

Problem

[Figure: "Un marché pour la recherche scientifique" with the remaining label sets — Un {ADJ, DET, NOUN, PRON}, marché {NOUN}, pour {ADP}, la {DET, NOUN, PRON}, recherche {NOUN}, scientifique {NOUN}]

▶ Gold labels: a set of possible labels, of which only one is true
▶ How can we learn from ambiguous supervision?
▶ This can be cast in the framework of ambiguous learning [Bordes et al., 2010, Cour et al., 2011]

SLIDE 21

14/27

History-based model: inference

[Figure: x = "Un marché pour la …", partial output y = DET, NOUN, ADP, ?]

y*_i = arg max_{y ∈ {NOUN, VERB, …}} F(x, y, y*_{i−1}, y*_{i−2}, …)

Principle

▶ Structured prediction is reduced to a sequence of multi-class classification problems

▶ At each step, the decision is based on the input structure and on the partially tagged sequence so far

SLIDE 22

15/27

History-based model: training

▶ Linear classifier: y*_i = arg max_{y ∈ Y} w^T φ(x, i, y, h_i)
▶ Perceptron-like update

Full supervision

if y*_i ≠ ŷ_i then w_{t+1} ← w_t − φ(x, i, y*_i, h_i) + φ(x, i, ŷ_i, h_i)

▶ Heighten the gold label's score at the cost of the wrongly predicted one
▶ Theoretical guarantees for similar problems under mild assumptions [Bordes et al., 2010, Cour et al., 2011]

SLIDE 24

15/27

History-based model: training

▶ Linear classifier: y*_i = arg max_{y ∈ Y} w^T φ(x, i, y, h_i)
▶ Perceptron-like update

Ambiguous supervision

if y*_i ∉ Ŷ_i then w_{t+1} ← w_t − φ(x, i, y*_i, h_i) + Σ_{ŷ ∈ Ŷ_i} φ(x, i, ŷ, h_i)

▶ Heighten the gold labels' scores at the cost of the wrongly predicted one
▶ Theoretical guarantees for similar problems under mild assumptions [Bordes et al., 2010, Cour et al., 2011]
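The update rule can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `phi` here returns a tiny sparse feature dict and the weights live in a plain dict, whereas the real model uses a much richer feature set.

```python
# Perceptron-like update under ambiguous supervision: if the predicted tag
# falls outside the allowed set, demote its features and promote the
# features of every allowed tag (the sum over Ŷ_i on the slide).

def ambiguous_update(w, phi, x, i, h, y_pred, allowed):
    if y_pred not in allowed:
        for f, v in phi(x, i, y_pred, h).items():
            w[f] = w.get(f, 0.0) - v        # demote the wrong prediction
        for y_hat in allowed:
            for f, v in phi(x, i, y_hat, h).items():
                w[f] = w.get(f, 0.0) + v    # promote every allowed label
    return w

def phi(x, i, y, h):
    # Toy feature map: one indicator feature per candidate tag.
    return {("tag", y): 1.0}

w = ambiguous_update({}, phi, ["marché"], 0, (), "DET", {"NOUN", "VERB"})
print(w)
```

When the prediction is inside the allowed set, no update is made, so the model is never penalized for choosing any of the ambiguous gold labels.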

SLIDE 25

16/27

Part III Experiments

SLIDE 26

17/27

Experimental setup

▶ Experiments on 10 languages from different families
▶ English as the source side

Our method needs:

▶ Parallel corpora: Europarl, NIST, OpenSubtitles
▶ An English POS tagger: Wapiti
▶ A crawled dictionary: Wiktionary
▶ Labeled test data: CoNLL'07, UDT v2.0, treebanks
▶ A standard feature set

SLIDE 27

18/27

Results

Error rates (in %):

      CRF    HBAL   ∆      [1]    [2]    [3]    Unsup.
ar    33.9   27.9   -6.0   49.9   —      —      —
cs    11.6   10.4   -1.2   19.3   18.9   —      —
de    12.2    8.8   -3.4    9.6    9.5   14.2   18.7
el    10.9    8.1   -2.8    9.4   10.5   20.8   28.2
es    10.7    8.2   -2.5   12.8   10.9   13.6   18.7
fi    12.9   13.3   +0.4   —      —      —      —
fr    11.6   10.2   -1.4   12.5   11.6   —      —
id    16.3   11.3   -5.0   —      —      —      —
it    10.4    9.1   -1.3   10.1   10.2   13.5   31.9
sv    11.6   10.1   -1.5   10.8   11.1   13.9   29.9

CRF: partially supervised CRF baseline [Täckström et al., 2013]
HBAL: our history-based model
[1]: [Ganchev and Das, 2013]
[2]: [Täckström et al., 2013]
[3]: [Li et al., 2012]
Unsup.: unsupervised results reported in [1]

SLIDE 28

19/27

Part IV Discussion

SLIDE 29

20/27

Discussion

A closer look at the Spanish results:

State of the art: 10.9%
Our model (HBAL): 8.2%
Our model trained on supervised data (HBSL): 2.4%

Our method still falls short of a fully supervised model!

SLIDE 33

21/27

Why such a large gap?

Noisy constraints

▶ Type-constraint precision on the test data is 94%
▶ I.e., using our type constraints as hard constraints at decoding time yields an error rate of at least 6%
▶ In this setting, HBSL gets 7.3%
▶ Noisy dictionaries… but not only?

Out-of-domain evaluation

1. Tokenization differs
2. Domain differs
3. Annotation conventions differ
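The hard-constraint decoding referred to above can be sketched as follows (the helper `constrained_argmax` and the scores are illustrative): since a candidate tag is kept only if the dictionary licenses it, every dictionary error becomes an unavoidable tagging error, which bounds accuracy by the dictionary's precision.

```python
# Type constraints as hard constraints at decoding time: restrict the
# argmax over tags to those the dictionary allows for the word.

def constrained_argmax(word, scores, type_dict, all_tags):
    allowed = type_dict.get(word, all_tags)   # unknown words: no restriction
    return max(allowed, key=lambda t: scores.get(t, float("-inf")))

type_dict = {"marché": {"NOUN", "VERB"}}
scores = {"NOUN": 0.2, "VERB": 0.5, "DET": 0.9}  # model prefers DET
print(constrained_argmax("marché", scores, type_dict, set(scores)))
# → VERB  (the dictionary overrides the model's top choice)
```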
SLIDE 37

22/27

The annotation convention problem

▶ Several independently designed information sources are combined
▶ They follow conflicting annotation conventions

Example

[Figure: examples of conflicting annotation conventions — numbers (tagged NUM vs. X), foreign names (tagged NOUN vs. X), and Spanish "poco" vs. English "few" (tagged variously ADJ, DET, PRON, NOUN across sources)]
SLIDE 38

23/27

Impact of annotation and train/test mismatches

Fixing some annotation mismatches in type constraints

              ar     cs     de     el     es     fi     fr     id     it     sv
HBAL          27.9   10.4   8.8    8.1    8.2    13.3   10.2   11.3   9.1    10.1
HBAL + match  24.1   7.6    8.0    7.3    7.4    12.2   7.4    9.8    8.3    8.8
∆             -3.8   -2.8   -0.8   -0.8   -0.8   -1.1   -2.8   -1.5   -0.8   -1.3

Supervised experiments for Spanish

train set    train labels                          test error rate
UDT          manual                                2.4%
Europarl     HBSL                                  4.2%
Europarl     FreeLing                              6.1%
Europarl     cross-lingual transfer (ambiguous)    8.2%

▶ Performance may be underestimated

SLIDE 39

24/27

Part V Conclusion

SLIDE 40

25/27

Conclusion

▶ We introduce a new, simple and efficient learning criterion
▶ Its performance surpasses the best reported results
▶ Are these results close to the best achievable performance?
▶ Evaluation in such settings must be conducted with great care
▶ Additional gains might be more easily obtained by fixing systematic biases than by designing more sophisticated weakly supervised learners

SLIDE 41

26/27

Thank you for your attention

Questions?

Tools and resources available from http://perso.limsi.fr/wisniews/weakly

SLIDE 42

27/27

References

Bordes, A., Usunier, N., and Weston, J. (2010). Label ranking under ambiguous supervision for learning semantic correspondences. In ICML, pages 103–110.

Cour, T., Sapp, B., and Taskar, B. (2011). Learning from partial labels. Journal of Machine Learning Research, 12:1501–1536.

Das, D. and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 600–609, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ganchev, K. and Das, D. (2013). Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1996–2006, Seattle, Washington, USA. Association for Computational Linguistics.

Li, S., Graça, J. V., and Taskar, B. (2012). Wiki-ly supervised part-of-speech tagging. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '12), pages 1389–1398, Stroudsburg, PA, USA. Association for Computational Linguistics.

Petrov, S., Das, D., and McDonald, R. (2012). A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Täckström, O., Das, D., Petrov, S., McDonald, R., and Nivre, J. (2013). Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1:1–12.

Wang, M. and Manning, C. D. (2014). Cross-lingual projected expectation regularization for weakly supervised learning. Transactions of the Association for Computational Linguistics, 2:55–66.

Yarowsky, D., Ngai, G., and Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research, HLT '01, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.