

SLIDE 1

Low-Resource Natural Language Processing

Behnam Sabeti Sharif University of Technology October 2019

SLIDE 2

Who am I?

Behnam Sabeti

Ph.D. Candidate at Sharif University of Technology. Project Manager and NLP Expert at Miras Technologies International. Does all kinds of NLP work, especially on Persian.

behnamsabeti

SLIDE 3

NLP @ Miras

  • Our focus in the Miras NLP team is on developing text-processing services for Persian:

  • Document classification
  • Named entity recognition
  • Sentiment analysis
  • Emotion analysis
  • Challenge:
  • Data!

SLIDE 4

Dataset               Size (documents)
IMDB                  50 K
SST                   10 K
Sentiment140          160 K
Amazon Product Data   142.8 M

SLIDE 5

Problem?

  • Deep learning models are data-hungry
  • The Persian NLP community is not large
  • We do not have enough public resources
  • Funding is also limited, so we cannot afford to build huge resources either

SLIDE 6

  • Get More Data
  • Get Better Data
  • Use Related Data
  • Problem Modeling

SLIDE 7

  • Get Better Data
  • Use Related Data
  • Problem Modeling

SLIDE 8

Solutions

  • Self Supervision
    • Case study: Emotion Analysis
  • Weak Supervision
    • Case study: Document Classification
  • Transfer Learning
    • Case study: Named Entity Recognition
  • Multi-Task Learning
    • Case study: Satire Detection
  • Active Learning
    • Case study: Sentiment Analysis

SLIDE 9

Self Supervision

  • Straightforward (document, label) modeling is not always your best choice.
  • Model your problem in a setting where labels are easy to acquire:
  • Self-supervision
  • Labels are already in your data (see the word-embedding sketch below):
  • Language modeling
  • Word embedding
  • Emotion analysis
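Word embeddings illustrate the idea: the supervision signal (which words appear near which) is already present in raw text. Below is a minimal sketch using gensim's Word2Vec; the toy corpus and hyperparameters are illustrative, not from the talk.

```python
from gensim.models import Word2Vec

# Toy corpus; the real setting is millions of unlabeled Persian sentences.
corpus = [
    ["low", "resource", "nlp", "needs", "creative", "labeling"],
    ["emoji", "are", "free", "emotion", "labels"],
    ["language", "models", "learn", "from", "raw", "text"],
]

# The context window IS the supervision: no human labels anywhere.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=20)
print(model.wv.most_similar("emoji", topn=3))
```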

SLIDE 10

Case Study: Emotion Analysis

  • An emoji is a good indicator of emotion
  • Instead of manually labeling your data, use emojis
  • Your dataset needs no hand-labeling effort!

Emoji Prediction ⟹ Emotion Analysis
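A sketch of how such a dataset can be assembled: extract the emoji from each tweet as a free label, then map it to an emotion. The emoji-to-emotion map and helper below are simplified assumptions, not the talk's actual mapping.

```python
import re

# Hypothetical mapping; a real system would cover far more emojis.
EMOJI_TO_EMOTION = {"😂": "joy", "😍": "love", "😢": "sadness", "😡": "anger"}
EMOJI_RE = re.compile("|".join(map(re.escape, EMOJI_TO_EMOTION)))

def make_pair(tweet: str):
    """Return (text-without-emoji, emotion) or None if no known emoji."""
    found = EMOJI_RE.findall(tweet)
    if not found:
        return None
    text = EMOJI_RE.sub("", tweet).strip()
    return text, EMOJI_TO_EMOTION[found[0]]

tweets = ["I really love autumn! 😍 come sooner...", "traffic again 😡"]
dataset = [p for p in map(make_pair, tweets) if p]
print(dataset)  # [('I really love autumn!  come sooner...', 'love'), ...]
```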

SLIDE 11

Image: medium.com/@bjarkefelbo/what-can-we-learn-from-emojis

SLIDE 12

DeepMoji Model

  • Predict Emoji
  • Map Emoji to Emotion

Image: medium.com/huggingface/understanding-emotions-from-keras-to-pytorch

SLIDE 13


Example tweet (translated from Persian): "I really love autumn! Such a wonderful season! Come sooner..."

SLIDE 14

Weak Supervision

  • Provide noisy labels using a set of heuristics or domain knowledge (see the sketch after this list)
  • Use other weak classifiers
  • Constraints
  • Data transformation
  • Think of a transformation on your data:
  • Reduce the effort in the annotation process
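A minimal sketch of heuristic labeling functions combined by majority vote, in the spirit of label-model frameworks such as Snorkel (this is plain Python, not that library's API). The keyword rules are illustrative assumptions.

```python
from collections import Counter

ABSTAIN = None

def lf_keyword_sport(doc):    # domain-knowledge heuristic
    return "sport" if any(w in doc for w in ("goal", "match", "league")) else ABSTAIN

def lf_keyword_economy(doc):  # another noisy heuristic
    return "economy" if any(w in doc for w in ("bank", "risk", "loan")) else ABSTAIN

def weak_label(doc, lfs):
    """Majority vote over the labeling functions that did not abstain."""
    votes = [v for v in (lf(doc) for lf in lfs) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

lfs = [lf_keyword_sport, lf_keyword_economy]
print(weak_label("the bank loan risk model", lfs))  # economy
```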

SLIDE 15

Case Study: Document Classification

  • Latent Dirichlet Allocation (LDA) is a generative model for topic modeling:
  • It computes a set of topics: each topic is a distribution over words
  • It computes the distribution of each document over topics
  • Instead of manually labeling documents, annotate topics!
  • With this transformation you can get a pretty good result by labeling just a handful of topics (see the sketch below)
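A sketch of this transformation with gensim's LDA: the annotator labels a few topics, and each document inherits the label of its most probable topic. The toy corpus, topic count, and topic-to-label map are illustrative assumptions.

```python
from gensim import corpora
from gensim.models import LdaModel

raw_documents = [
    "the central bank reported rising loan risk",
    "the league match ended with a stunning goal",
]  # stand-in corpus; in practice a large unlabeled collection

texts = [doc.lower().split() for doc in raw_documents]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# In a real setting num_topics would be a few dozen; the annotator
# labels only these topics, never the individual documents.
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))

topic_to_label = {0: "economy", 1: "sports"}  # hand-assigned, hypothetical

def label_document(bow):
    """Propagate the label of the document's most probable topic."""
    topic_id, _ = max(lda.get_document_topics(bow), key=lambda t: t[1])
    return topic_to_label.get(topic_id)

print([label_document(b) for b in bow_corpus])
```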

SLIDE 16

Image: m-cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models

SLIDE 17


Sample document (translated from Persian): "Under the current conditions, according to a report by the Economist Intelligence Unit, the biggest risks threatening Iran's economy are banking-sector risk and political risk. Asre Bank: the economic studies department of the Tehran Chamber of Commerce has examined the country risk model in a report. According to this report, the country risk model is a model designed by the Economist Intelligence Unit to measure and compare the credit risk of different countries. This interactive tool makes it possible to quantify the risk of financial transactions, including bank loans, trade finance, and investment in securities…"

SLIDE 18

Transfer Learning

  • Train on a task for which you have enough data
  • Fine-tune the trained model on a new task (for which limited data is available)
  • The source and target tasks need to have common characteristics:
  • Source: Language Modeling, Target: Document Classification
  • Source: Emotion Detection, Target: Satire Detection
  • Source: Document Classification, Target: Sentiment Analysis
  • Document Classification: a word-based task
  • Sentiment Analysis: a phrase-level, semantics-based task

SLIDE 19

Image: machinelearningmastery.com/transfer-learning-for-deep-learning

SLIDE 20

Pre-Trained Models

  • Train your own model on a source task, or use a pre-trained model
  • Pre-trained models are a good choice because they are trained on HUGE datasets (see the loading sketch after the list below)

  • Language modeling pre-trained models:
  • BERT
  • GPT
  • XLNet
  • XLM
  • CTRL
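A minimal loading sketch with the Hugging Face transformers library. bert-base-multilingual-cased is one public checkpoint whose pre-training data covers Persian; any comparable pre-trained model would work.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# Contextual features come for free from someone else's HUGE dataset.
inputs = tokenizer("a sample sentence", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
print(out.last_hidden_state.shape)  # (1, num_tokens, 768)
```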

SLIDE 21

Image: jalammar.github.io/illustrated-bert

SLIDE 22


Image: medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models

SLIDE 23

Case Study: Named Entity Recognition

  • Target task: Named Entity Recognition
  • Extract locations, persons, organizations, events, and times from text
  • Source: multilingual BERT model (see the fine-tuning sketch below)
  • Data: 50K hand-labeled sentences with NER tags
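A sketch of the transfer setup, assuming the Hugging Face transformers library: the pre-trained multilingual BERT body gets a fresh token-classification head sized to the tag set. The tag list and the freezing step are illustrative assumptions, not the talk's exact recipe.

```python
from transformers import AutoModelForTokenClassification

NER_TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # assumed tag set

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(NER_TAGS),  # a fresh, randomly initialized head
)

# A common low-resource trick: freeze the pre-trained body at first
# and train only the new head, then optionally unfreeze and fine-tune.
for param in model.bert.parameters():
    param.requires_grad = False
```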

SLIDE 24

SLIDE 25

Multi-Task Learning

  • Train multiple tasks together:
  • More data
  • Synergistic effects in training
  • Tasks: tweet reconstruction + emoji prediction + satire detection
  • General features
  • Emotion features
  • Satire features
  • Entails a multi-objective loss function (see the sketch after this list)
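A sketch of such a model in PyTorch: a shared LSTM encoder with one head per task and a weighted multi-objective loss. All sizes, task weights, and layer choices are illustrative assumptions, not the talk's architecture.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder, one head per task."""
    def __init__(self, vocab=30000, emb=128, hidden=256, n_emojis=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)  # shared features
        self.reconstruct = nn.Linear(hidden, vocab)  # per-token reconstruction
        self.emoji = nn.Linear(hidden, n_emojis)     # emotion signal
        self.satire = nn.Linear(hidden, 2)           # the low-resource target

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        last = h[:, -1]  # last hidden state as the tweet representation
        return self.reconstruct(h), self.emoji(last), self.satire(last)

ce = nn.CrossEntropyLoss()

def multi_task_loss(outs, token_y, emoji_y, satire_y, w=(0.2, 0.3, 0.5)):
    """Weighted sum of the three objectives; weights are illustrative."""
    rec, emo, sat = outs
    l_rec = ce(rec.reshape(-1, rec.size(-1)), token_y.reshape(-1))
    return w[0] * l_rec + w[1] * ce(emo, emoji_y) + w[2] * ce(sat, satire_y)
```

The design intent: the abundant reconstruction and emoji data shape the shared encoder, so the scarce satire labels only have to train a small head.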

SLIDE 26

Image: medium.com/manash-en-blog/multi-task-learning-in-keras-implementation-of-multi-task-classification-loss

SLIDE 27

Case Study: Satire Detection

  • Satire dataset: 2K tweets
  • Emotion dataset: 300K tweets
  • Reconstruction tweets: as many as you have! (200M)

SLIDE 28

SLIDE 29

Satire Model Performance (F1)

Single task   55 %
Multi task    68 %

SLIDE 30

Active Learning

  • How to select samples for annotation?
  • Random
  • Annotate as much as you can
  • Smart
  • Annotate “Better” samples

SLIDE 31

SLIDE 32

Image: www.datacamp.com/community/tutorials/active-learning

SLIDE 33

Active Learning

  • How to select samples for annotation?
  • Random
  • Smart:
  • Select samples the current model is most uncertain about (Least Confident, LC)
  • Select samples with a low margin between the top two category labels (Margin)
  • Select samples with the highest entropy (Entropy)
  • Goal: better performance with fewer samples (see the sketch after this list)
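The three strategies in code, applied to the D1-D3 example on the following slides. Plain numpy; base-2 logarithms are assumed, since they reproduce the slides' entropy values, and the missing D3 probability is taken as 0.0 because each row must sum to 1.

```python
import numpy as np

# Class probabilities for three unlabeled documents (the D1-D3 example).
P = np.array([
    [0.50, 0.45, 0.05],   # D1
    [0.40, 0.30, 0.30],   # D2
    [0.95, 0.05, 0.00],   # D3
])

least_confident = 1 - P.max(axis=1)         # highest wins -> D2 (0.60)
top2 = np.sort(P, axis=1)[:, -2:]
margin = top2[:, 1] - top2[:, 0]            # lowest wins -> D1 (0.05)
safe = np.clip(P, 1e-12, 1.0)               # avoid log(0)
entropy = -(P * np.log2(safe)).sum(axis=1)  # highest wins -> D2

print(entropy.round(2))  # [1.23 1.57 0.29], matching the slide
```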

SLIDE 34

Current model predictions:

     Positive  Neutral  Negative
D1   0.50      0.45     0.05
D2   0.40      0.30     0.30
D3   0.95      0.05     0.00


SLIDE 36

     Positive  Neutral  Negative
D1   0.50      0.45     0.05
D2   0.40      0.30     0.30   ← Least Confident (lowest top probability)
D3   0.95      0.05     0.00


SLIDE 38

     Positive  Neutral  Negative
D1   0.50      0.45     0.05   ← Margin (smallest gap between top two: 0.05)
D2   0.40      0.30     0.30
D3   0.95      0.05     0.00

SLIDE 39

     Positive  Neutral  Negative
D1   0.50      0.45     0.05
D2   0.40      0.30     0.30
D3   0.95      0.05     0.00

$\mathrm{Entropy}(P) = -\sum_i p_i \log p_i$

SLIDE 40

     Positive  Neutral  Negative  Entropy
D1   0.50      0.45     0.05      1.23
D2   0.40      0.30     0.30      1.57
D3   0.95      0.05     0.00      0.29

$\mathrm{Entropy}(P) = -\sum_i p_i \log p_i$

SLIDE 41

     Positive  Neutral  Negative  Entropy
D1   0.50      0.45     0.05      1.23
D2   0.40      0.30     0.30      1.57   ← Entropy (highest)
D3   0.95      0.05     0.00      0.29

$\mathrm{Entropy}(P) = -\sum_i p_i \log p_i$

SLIDE 42

Case Study: Sentiment Analysis

  • Model: LSTM
  • Embedding: embedding layer
  • Data: 100K hand-labeled Digikala comments (positive, neutral, negative)
  • Test scenarios (see the experiment sketch after this list):
  • Train on all data
  • Active learning:
  • Entropy
  • Margin
  • Least Confident
  • Random
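A sketch of how such a comparison can be run: grow the labeled pool in batches chosen by each strategy and record the score after each round. `train` and `f1_on_test` are hypothetical stand-ins for the talk's LSTM training and evaluation code, not actual APIs.

```python
import numpy as np

def informativeness(probs, strategy, rng):
    """Score pool samples; higher = selected first."""
    if strategy == "entropy":
        return -(probs * np.log2(np.clip(probs, 1e-12, 1.0))).sum(axis=1)
    if strategy == "margin":
        s = np.sort(probs, axis=1)
        return -(s[:, -1] - s[:, -2])  # small margin = informative
    if strategy == "lc":
        return 1.0 - probs.max(axis=1)
    return rng.random(len(probs))      # random baseline

def active_learning_curve(X, y, strategy, seed=1000, batch=500, rounds=10):
    rng = np.random.default_rng(0)
    labeled = rng.choice(len(X), size=seed, replace=False)
    curve = []
    for _ in range(rounds):
        model = train(X[labeled], y[labeled])  # hypothetical helper
        curve.append(f1_on_test(model))        # hypothetical helper
        pool = np.setdiff1d(np.arange(len(X)), labeled)
        scores = informativeness(model.predict_proba(X[pool]), strategy, rng)
        labeled = np.concatenate([labeled, pool[np.argsort(scores)[-batch:]]])
    return curve
```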

SLIDE 43

SLIDE 44

Summary

  • Problem Modeling
    • Self Supervision
      • Model your problem in a way that labels are easy to get (usually available alongside your data)
    • Weak Supervision
      • Transform data into a new space for less annotation effort
  • Use Related Data
    • Transfer Learning
      • Transfer model knowledge between tasks
    • Multi-Task Learning
      • Use related tasks for more data and synergistic effects
  • Get Better Data
    • Active Learning
      • Smart selection of samples for annotation

SLIDE 45

ZAAL

Natural Language Processing Services for Persian www.getzaal.com

SLIDE 46

Thank You
