1/27
Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning
Guillaume Wisniewski Nicolas Pécheux Souhir Gahbiche-Braham François Yvon
Université Paris-Sud & LIMSI-CNRS
October 28, 2014
2/27
Context
▶ Supervised Machine Learning techniques have established new performance standards
▶ Success crucially depends on the availability of annotated data
▶ Not so common a situation (e.g. under-resourced languages)
▶ What can we do then?
3/27
▶ Unsupervised learning
▶ Crawl data (e.g. Wiktionary)
4/27
Structured Prediction
[Figure: transfer from a resource-rich language to a less-resourced language]
▶ Cross-lingual transfer (weakly supervised learning)
5/27
▶ In most cases this only results in partially annotated data
▶ Alternative ML techniques need to be designed
▶ Partially observed CRF [Täckström et al., 2013]
▶ Posterior regularization [Ganchev and Das, 2013]
▶ Expectation maximization [Wang and Manning, 2014]
6/27
7/27
8/27
▶ In this work we focus on POS tagging
▶ All annotations are mapped to the universal tagset of [Petrov et al., 2012]
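The mapping step can be sketched as a simple lookup table. The fine-grained tags and the mapping below are illustrative examples, not the official mapping tables:

```python
# Illustrative sketch: map language-specific tags to a universal tagset.
# The Penn-Treebank-style tags and the mapping below are made-up examples,
# not the actual tables of Petrov et al. (2012).
UNIVERSAL_MAP = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB",
    "JJ": "ADJ",
    "DT": "DET",
}

def to_universal(tags):
    """Replace each fine-grained tag by its universal equivalent ('X' if unknown)."""
    return [UNIVERSAL_MAP.get(t, "X") for t in tags]
```

For instance, `to_universal(["DT", "NN", "VBD"])` yields `["DET", "NOUN", "VERB"]`.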
9/27
▶ Heuristic filtering rules [Yarowsky et al., 2001]
▶ Graph-based projection [Das and Petrov, 2011]
▶ Combine with monolingual information
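The basic projection step behind these approaches can be sketched as follows; the alignment representation and function names are assumptions made for illustration:

```python
# Sketch of cross-lingual projection: copy the POS tag predicted on the
# source (English) side onto the aligned target word. Unaligned target
# words stay unannotated (None) -- hence the data is only partially annotated.
def project_tags(src_tags, tgt_len, alignment):
    """alignment: list of (src_index, tgt_index) word-alignment pairs."""
    tgt_tags = [None] * tgt_len
    for s, t in alignment:
        tgt_tags[t] = src_tags[s]
    return tgt_tags
```

The heuristic filtering and graph-based methods cited above then refine these raw, noisy projections.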
10/27
▶ Automatically extracted from Wiktionary
▶ Built from the projected labels across the aligned corpora
▶ We use the intersection of the two above
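Taking the intersection of the two constraint sources can be sketched as below; the dictionary entries are made up for illustration:

```python
# Sketch: each source maps a word type to a set of allowed POS tags.
# Keeping the intersection (when non-empty) yields tighter, more reliable
# type constraints than either source alone. Entries are illustrative.
def intersect_constraints(wiktionary, projected):
    merged = {}
    for word in wiktionary.keys() & projected.keys():
        common = wiktionary[word] & projected[word]
        if common:                # keep only words on which the sources agree
            merged[word] = common
    return merged
```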
11/27
12/27
13/27
▶ Gold labels: a set of possible labels, of which only one is true
▶ How to learn from ambiguous supervision?
▶ Can be cast in the framework of ambiguous learning
14/27
▶ Structured prediction is reduced to a sequence of multi-class decisions:
   y∗_i = arg max_{y ∈ {NOUN, VERB, ...}} score(x, i, y, y∗_{i−1}, y∗_{i−2}, ...)
▶ At each step, the decision is taken based on the input and on the previously predicted labels
15/27
▶ Linear classifier: y∗_i = arg max_{y ∈ Y} wᵀφ(x, i, y, h_i)
▶ Perceptron-like update whenever y∗_i ∉ Ŷ_i:
   w ← w − φ(x, i, y∗_i, h_i) + (1/|Ŷ_i|) Σ_{ŷ_i ∈ Ŷ_i} φ(x, i, ŷ_i, h_i)
▶ Heightens the gold labels' scores at the cost of the wrongly predicted one
▶ Theoretical guarantees for similar problems under mild assumptions
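A minimal sketch of this update is given below. The feature map is a toy stand-in (the real system uses a rich feature set over words and history), and the averaging over the candidate set is one natural instantiation of the update:

```python
# Sketch of the ambiguous perceptron-like update: when the predicted tag
# is not in the allowed set Y_i, demote the features of the wrong
# prediction and boost the averaged features of all allowed tags.
from collections import defaultdict

def phi(x, i, y, history):
    """Toy feature map: word/tag and previous-tag/tag pairs."""
    feats = [("w", x[i], y)]
    if history:
        feats.append(("prev", history[-1], y))
    return feats

def update(w, x, i, history, predicted, allowed):
    """One ambiguous-learning step; w is a defaultdict(float)."""
    if predicted in allowed:
        return            # prediction is compatible with the constraints: no update
    for f in phi(x, i, predicted, history):
        w[f] -= 1.0
    for y in allowed:     # average over the candidate gold labels
        for f in phi(x, i, y, history):
            w[f] += 1.0 / len(allowed)
```

Note the contrast with the standard perceptron: since the true label is unknown, the positive mass is spread uniformly over the whole candidate set instead of going to a single gold label.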
16/27
17/27
▶ Experiments on 10 languages from different families
▶ English as the source side
▶ Parallel corpora
▶ English POS tagger
▶ Crawled dictionary
▶ Labeled test data
▶ Standard feature set
18/27
[Results table: comparison with the unsupervised baseline [1]]
19/27
20/27
21/27
▶ Type constraints precision on test data is 94%
▶ I.e., using our type constraints as hard constraints at decoding
▶ In this setting HBSL gets 7.3%
▶ Noisy dictionaries… not only?
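Using the type constraints as hard constraints at decoding simply restricts each position's arg max to the word's dictionary entry. A sketch, with made-up dictionary entries and tag inventory:

```python
# Sketch: restrict each position's candidate tags to the word's dictionary
# entry (falling back to all tags for out-of-dictionary words), then pick
# the best-scoring allowed tag. Names and data are illustrative.
ALL_TAGS = ["NOUN", "VERB", "DET", "ADJ"]

def constrained_tag(words, dictionary, score):
    out = []
    for i, word in enumerate(words):
        candidates = dictionary.get(word, ALL_TAGS)
        out.append(max(candidates, key=lambda y: score(words, i, y)))
    return out
```

Since a wrong dictionary entry can never be overridden, the 94% precision of the constraints directly bounds the accuracy achievable in this hard-constraint setting.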
22/27
▶ Several independently designed information sources are combined
▶ They follow conflicting annotation conventions
23/27
               ar    cs    de    el    es    fi    fr    id    it    sv
HBAL          27.9  10.4   8.8   8.1   8.2  13.3  10.2  11.3   9.1  10.1
HBAL + match  24.1   7.6   8.0   7.3   7.4  12.2   7.4   9.8   8.3   8.8
∆              3.8   2.8   0.8   0.8   0.8   1.1   2.8   1.5   0.8   1.3
▶ Performance may be underestimated
24/27
25/27
▶ We introduce a new, simple and efficient learning criterion
▶ Performance surpasses best reported results
▶ Results close to the best achievable performance?
▶ Evaluation of such settings must be taken with great care
▶ Additional gains might be more easily obtained by fixing the supervision sources
26/27
Tools and resources available from http://perso.limsi.fr/wisniews/weakly
27/27
Bordes, A., Usunier, N., and Weston, J. (2010). Label ranking under ambiguous supervision for learning semantic correspondences. In ICML, pages 103–110.

Cour, T., Sapp, B., and Taskar, B. (2011). Learning from partial labels. Journal of Machine Learning Research, 12:1501–1536.

Das, D. and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 600–609, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ganchev, K. and Das, D. (2013). Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1996–2006, Seattle, Washington, USA. Association for Computational Linguistics.

Li, S., Graça, J. V., and Taskar, B. (2012). Wiki-ly supervised part-of-speech tagging. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1389–1398, Stroudsburg, PA, USA. Association for Computational Linguistics.

Petrov, S., Das, D., and McDonald, R. (2012). A universal part-of-speech tagset. In Calzolari, N., Choukri, K., Declerck, T., Doğan, M. U., Maegaard, B., Mariani, J., Odijk, J., and Piperidis, S., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Täckström, O., Das, D., Petrov, S., McDonald, R., and Nivre, J. (2013). Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1:1–12.

Wang, M. and Manning, C. D. (2014). Cross-lingual projected expectation regularization for weakly supervised learning. Transactions of the Association for Computational Linguistics, 2:55–66.

Yarowsky, D., Ngai, G., and Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research, HLT '01, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.