
Word Embeddings - Natural Language Processing VU (706.230) - Andi Rexha



1. Word Embeddings - Natural Language Processing VU (706.230) - Andi Rexha - 02/04/2020

2. Agenda
● Traditional NLP: text preprocessing, bag-of-words model, external resources, sequential classification, other tasks (MT, LM)
● Word Embeddings 1: topic modeling, neural embeddings, Word2Vec, GloVe, fastText
● Word Embeddings 2: ELMo, ULMFit, BERT, RoBERTa, DistilBERT, multilinguality

3. Traditional NLP

4. Preprocessing
How to preprocess text?
● How do we (humans) split the text to analyse it?
  ○ "Divide et impera" approach:
    ■ Word split
    ■ Sentence split
    ■ Paragraphs, etc.
  ○ Is there any other information that we can collect?
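A minimal sketch of this paragraph/sentence/word splitting, assuming NLTK is installed and its tokenizer models have been downloaded; the example text is only illustrative:

```python
# Sketch: paragraph -> sentence -> word splitting with NLTK.
# Assumes: pip install nltk; newer NLTK versions may also need the "punkt_tab" resource.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

text = "There are different examples that we might use! This is a second sentence."

paragraphs = [p for p in text.split("\n\n") if p.strip()]   # paragraph split
for paragraph in paragraphs:
    for sentence in sent_tokenize(paragraph):               # sentence split
        tokens = word_tokenize(sentence)                    # word split
        print(tokens)
```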

5. Preprocessing (2)
Other preprocessing steps:
● Morphological:
  ○ Stemming/Lemmatization
● Grammatical:
  ○ Part-of-Speech Tagging (PoS)
  ○ Chunking/Constituency Parsing
  ○ Dependency Parsing

6. Preprocessing (3)
Morphological:
● Stemming:
  ○ The process of reducing inflected words to a common root (stem):
    ■ producing => produc; produced => produc
    ■ are => are
● Lemmatization:
  ○ Mapping words to their common lemma (dictionary form):
    ■ am, is, are => be
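A small sketch of the difference, using NLTK's Porter stemmer and WordNet lemmatizer (assumes the "wordnet" and "omw-1.4" resources have been downloaded; the word lists mirror the slide's examples):

```python
# Sketch: stemming vs. lemmatization with NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["producing", "produced", "produces"]:
    print(w, "->", stemmer.stem(w))                      # all reduce to the same stem

for w in ["am", "is", "are"]:
    print(w, "->", lemmatizer.lemmatize(w, pos="v"))     # all map to the lemma "be"
```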

7. Preprocessing (4)
Grammatical:
● Part-of-Speech Tagging (PoS):
  ○ Assign a grammatical tag to each word
● Sentence: "There are different examples that we might use!"
  ○ Preprocessing: lemmatization, then PoS tagging [the tagged sentence is shown on the slide]
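A sketch of PoS-tagging the example sentence with NLTK (assumes the "punkt" and "averaged_perceptron_tagger" resources have been downloaded; the printed tags are indicative, not taken from the slide):

```python
# Sketch: PoS tagging with NLTK's default perceptron tagger.
from nltk import pos_tag, word_tokenize

sentence = "There are different examples that we might use!"
tags = pos_tag(word_tokenize(sentence))
print(tags)
# e.g. [('There', 'EX'), ('are', 'VBP'), ('different', 'JJ'), ('examples', 'NNS'), ...]
```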

8. Preprocessing (5)
● Parsing:
  ○ Shallow Parsing (Chunking):
    ■ Adds a tree structure on top of the PoS tags
    ■ First identifies the constituents and then their relations
  ○ Deep Parsing (Dependency Parsing):
    ■ Parses the sentence into its grammatical structure
    ■ "Head"-"Dependent" form
    ■ It is a directed acyclic graph (mostly implemented as a tree)

9. Preprocessing (6)
● Sentence: "There are different examples that we might use!"
● Constituency Parsing: [parse tree shown on the slide]
● Dependency Parsing: [dependency graph shown on the slide]
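A sketch of both shallow chunks (noun phrases) and a head-dependent parse, using spaCy as a stand-in tool (the slide does not prescribe a parser; assumes the "en_core_web_sm" model is installed):

```python
# Sketch: noun-phrase chunks and dependency arcs with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")

print([chunk.text for chunk in doc.noun_chunks])          # shallow parsing: noun phrases
for token in doc:                                         # dependency parsing: head-dependent pairs
    print(f"{token.text:10} <-{token.dep_:10}- {token.head.text}")
```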

10. Bag-of-words Model
● Use the preprocessed text in Machine Learning tasks
  ■ How to encode the features?
● A major paradigm in NLP and IR: Bag-of-Words (BoW):
  ○ The text is treated as the set (bag) of its words
  ○ Grammatical dependencies are ignored
  ○ Feature encoding:
    ■ Dictionary based (nominal features)
    ■ One-hot encoded / frequency encoded

11. Bag-of-words Model (2)
● Sentences: [example sentences shown on the slide]
● Features: [the extracted dictionary of features]
● Representation of the features for Machine Learning: [the resulting document-term matrix]
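A minimal sketch of this dictionary-based encoding with scikit-learn's CountVectorizer (the two sentences are illustrative stand-ins for the ones on the slide; `get_feature_names_out` assumes scikit-learn >= 1.0):

```python
# Sketch: frequency-encoded and one-hot BoW features.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "There are different examples that we might use!",
    "We might use different words.",
]

freq_vec = CountVectorizer()                  # frequency-encoded features
X_freq = freq_vec.fit_transform(sentences)
onehot_vec = CountVectorizer(binary=True)     # one-hot (presence/absence) features
X_onehot = onehot_vec.fit_transform(sentences)

print(freq_vec.get_feature_names_out())       # the dictionary (feature names)
print(X_freq.toarray())                       # document-term matrix (counts)
print(X_onehot.toarray())                     # document-term matrix (0/1)
```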

12. Feature encoding
● PoS tagging:
  ○ Word + PoS tag as part of the dictionary:
    ■ Example: John-PN
● Chunking:
  ○ Use noun phrases:
    ■ Example: the bank account
● Dependency Parsing:
  ○ Word + dependency path as part of the dictionary:
    ■ Example: use-nsubj-acl:relcl
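A sketch of building such combined dictionary features, in the spirit of the "John-PN" example; it uses spaCy's tags and labels rather than the slide's tagset, so the exact feature strings are an assumption:

```python
# Sketch: word+PoS, word+dependency, and noun-phrase features as dictionary counts.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("There are different examples that we might use!")

word_pos_features = Counter(f"{t.text}-{t.pos_}" for t in doc)     # e.g. "examples-NOUN"
word_dep_features = Counter(f"{t.text}-{t.dep_}" for t in doc)     # e.g. "use-relcl"
noun_phrase_features = Counter(chunk.text for chunk in doc.noun_chunks)

print(word_pos_features, word_dep_features, noun_phrase_features, sep="\n")
```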

13. External Resources
● The feature space is sparse (high dimensional):
  ○ We miss linguistic knowledge:
    ■ Synonyms, antonyms, hyponyms, hypernyms, ...
    ■ Enrich the features of our examples with their synonyms
    ■ Set a negative weight for antonyms
● External resources to mitigate sparsity:
  ○ WordNet: a lexical database for English which groups words into synsets
  ○ Wiktionary: a free multilingual dictionary enriched with relations between words
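A sketch of pulling synonyms and antonyms from WordNet through NLTK, which could then be added to the BoW features with positive or negative weights (assumes the "wordnet" resource has been downloaded; "good" is just an example word):

```python
# Sketch: collecting synonym/antonym lemmas from WordNet synsets.
from nltk.corpus import wordnet as wn

synonyms, antonyms = set(), set()
for synset in wn.synsets("good"):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
        for ant in lemma.antonyms():
            antonyms.add(ant.name())

print(sorted(synonyms)[:10])   # candidate features to add with a positive weight
print(sorted(antonyms))        # candidate features to add with a negative weight
```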

14. Sequential Classification
● We need to classify a sequence of tokens:
  ○ Information Extraction:
    ■ Example: extract the names of companies from documents (open domain)
● How to model it?
  ○ Classify each token as part or not part of the information:
    ■ The classification of the current token depends on the classification of the previous one
    ■ Sequential classifier
    ■ Still not enough; we need to encode the output
    ■ We need to know where every "annotation" starts and ends

15. Sequential Classification (2)
● Why do we need a tagging scheme?
  ○ Example: I work for TU Graz Austria!
● BILOU: Beginning, Inside, Last, Outside, Unit
● BIO (most used): Beginning, Inside, Outside
● BILOU has been shown to perform better on some datasets [1]
● Example: "The Know Center GmbH is a spinoff of TUGraz."
  ○ BILOU: The-O; Know-B; Center-I; GmbH-L; is-O; a-O; spinoff-O; of-O; TUGraz-U
  ○ BIO: The-O; Know-B; Center-I; GmbH-I; is-O; a-O; spinoff-O; of-O; TUGraz-B
● Sequential classifiers: Hidden Markov Models, CRFs, etc.
[1] Named Entity Recognition: https://www.aclweb.org/anthology/W09-1119.pdf
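A small sketch that reproduces the slide's example by encoding entity spans in both schemes; the `encode` helper and its span format are hypothetical, introduced only for illustration:

```python
# Sketch: encoding entity spans with the BIO and BILOU schemes.
def encode(tokens, spans, scheme="BIO"):
    """spans: list of (start, end) token indices (end exclusive) marking entities."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if scheme == "BILOU" and end - start == 1:
            tags[start] = "U"                  # single-token entity (Unit)
            continue
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
        if scheme == "BILOU":
            tags[end - 1] = "L"                # last token of the entity
    return list(zip(tokens, tags))

tokens = ["The", "Know", "Center", "GmbH", "is", "a", "spinoff", "of", "TUGraz", "."]
spans = [(1, 4), (8, 9)]                       # "Know Center GmbH", "TUGraz"
print(encode(tokens, spans, "BIO"))
print(encode(tokens, spans, "BILOU"))
```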

16. Sentiment Analysis
● Assign a sentiment to a piece of text:
  ○ Binary (like/dislike)
  ○ Rating based (e.g. 1-5)
● Assign the sentiment to a target phrase:
  ○ Usually involving features around the target
● External resources:
  ○ SentiWordNet: http://sentiwordnet.isti.cnr.it/
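A sketch of looking up prior sentiment scores in SentiWordNet via NLTK (assumes the "sentiwordnet" and "wordnet" resources have been downloaded; "good" is an example word, not from the slide):

```python
# Sketch: positive/negative/objective scores for the adjective senses of a word.
from nltk.corpus import sentiwordnet as swn

for senti_synset in swn.senti_synsets("good", "a"):   # "a" = adjective senses
    print(senti_synset.synset.name(),
          "pos:", senti_synset.pos_score(),
          "neg:", senti_synset.neg_score(),
          "obj:", senti_synset.obj_score())
```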

17. Language Model
● Generating the next token of a sequence
● Usually based on co-occurrence counts of words within a window:
  ○ Statistics are collected and the next word is predicted based on this information
  ○ Mainly, it models the probability of the sequence:
    ■ P(w_1, ..., w_n) = prod_i P(w_i | w_1, ..., w_{i-1})
● In traditional approaches, solved with an n-gram approximation:
  ○ P(w_i | w_1, ..., w_{i-1}) is approximated by P(w_i | w_{i-n+1}, ..., w_{i-1})
  ○ Usually solved by combining n-grams of different sizes and weighting them
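A tiny sketch of the count-based idea for n = 2 (a bigram model); the corpus is a toy example and no smoothing or interpolation is applied, which real systems would add:

```python
# Sketch: maximum-likelihood bigram language model from raw co-occurrence counts.
from collections import Counter

corpus = "the cat is walking in the bedroom . a dog was running in a room .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_next("cat", "the"))                              # P(cat | the)
print(max(unigrams, key=lambda w: p_next(w, "the")))     # most likely word after "the"
```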

18. Machine Translation
● Translate text from one language to another
● Different approaches:
  ○ Rule based:
    ■ Usually by using a dictionary
  ○ Statistical (involving bilingual aligned corpora):
    ■ IBM models (1-6) for alignment and training
  ○ Hybrid:
    ■ A combination of the two previous techniques

19. Traditional NLP: End

20. Dense Word Representation

21. From Sparse to Dense
● Topic Modeling
● Since LSA (Latent Semantic Analysis):
  ○ These methods use low-rank approximations to decompose large matrices that capture statistical information about a corpus
● Other methods came later:
  ○ pLSA (Probabilistic Latent Semantic Analysis):
    ■ Uses a probabilistic model instead of SVD (Singular Value Decomposition)
● LDA (Latent Dirichlet Allocation):
  ○ A Bayesian version of pLSA
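A minimal sketch of the LSA idea, i.e. a truncated SVD of a (here TF-IDF weighted) term-document matrix with scikit-learn; the documents and the choice of 2 components are arbitrary toy choices:

```python
# Sketch: LSA as low-rank decomposition of the document-term matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets reacted",
    "investors sold shares on the stock market",
]

X = TfidfVectorizer().fit_transform(docs)     # sparse term weights
lsa = TruncatedSVD(n_components=2)            # low-rank approximation
doc_topics = lsa.fit_transform(X)             # dense "topic" vector per document

print(doc_topics)                             # each row: a 2-dimensional dense representation
```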

22. Neural embeddings
● Language models suffer from the "curse of dimensionality":
  ○ The word sequence we want to predict is likely to differ from the ones seen during training
  ○ Seeing "The cat is walking in the bedroom" in training should help us generate "A dog was running in the room":
    ■ Similar semantics and grammatical roles
● A Neural Probabilistic Language Model:
  ○ Bengio et al. (2003) implemented the earlier idea of distributed word representations:
    ■ Learned a language model and, with it, embeddings for the words

23. Neural embeddings (2)
● Bengio's architecture:
  ○ Approximate the next-word function with a window approach
  ○ Model the approximation with a neural network
  ○ Input layer in one-hot-encoded form
  ○ Two hidden layers (the first is essentially a randomly initialized embedding lookup)
  ○ A tanh intermediate layer

24. Neural embeddings (3)
● A final softmax layer:
  ○ Outputs the probability of the next word in the sequence
● Learned representations for about 18K words, with almost 1M words in the corpus
● IMPORTANT linguistic theory (the distributional hypothesis):
  ○ Words that tend to occur in similar linguistic contexts tend to resemble each other in meaning
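A compact PyTorch sketch in the spirit of this window-based neural language model; the toy corpus, layer sizes, and training loop are arbitrary illustrative choices, not Bengio et al.'s actual setup:

```python
# Sketch: embedding lookup -> tanh hidden layer -> softmax over the vocabulary.
import torch
import torch.nn as nn

corpus = "the cat is walking in the bedroom a dog was running in a room".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}

n = 3  # context window size
data = [([w2i[w] for w in corpus[i:i + n]], w2i[corpus[i + n]])
        for i in range(len(corpus) - n)]
X = torch.tensor([ctx for ctx, _ in data])
y = torch.tensor([nxt for _, nxt in data])

class WindowLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, hidden=32, window=n):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # word feature vectors
        self.hidden = nn.Linear(window * emb_dim, hidden)   # tanh intermediate layer
        self.out = nn.Linear(hidden, vocab_size)            # softmax output (via the loss)

    def forward(self, ctx):
        e = self.emb(ctx).flatten(1)                        # concatenate the window embeddings
        return self.out(torch.tanh(self.hidden(e)))

model = WindowLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):                                        # tiny training loop
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# The learned word embeddings are the rows of model.emb.weight.
```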

25. Word2vec
● A shallow, two-layer neural model that computes dense vector representations of words
● Two different architectures:
  ○ Continuous Bag-of-Words Model (CBOW) (faster):
    ■ Predict the middle word from a window of context words
  ○ Skip-gram Model (better with small amounts of data):
    ■ Predict the context words given the middle word
● Models the probability of words co-occurring with the current (candidate) word
● The learned embedding is the output of the hidden layer
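A minimal sketch of training both architectures with gensim (assuming the 4.x API, where the `sg` flag switches between CBOW and skip-gram); the two sentences are toy placeholders:

```python
# Sketch: CBOW (sg=0) vs. skip-gram (sg=1) training with gensim.
from gensim.models import Word2Vec

sentences = [
    "the cat is walking in the bedroom".split(),
    "a dog was running in a room".split(),
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1,
                    negative=5, sample=1e-5)  # negative sampling + frequent-word subsampling

print(skipgram.wv["cat"][:5])                 # the learned dense vector (first 5 dimensions)
print(skipgram.wv.most_similar("cat", topn=3))
```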

26. Word2vec (2)
● Skip-gram: [architecture diagram shown on the slide]
● CBOW: [architecture diagram shown on the slide]

27. Word2vec (3)
● The output is a softmax function
● Three new techniques:
  1. Subsampling of frequent words:
    ■ Each word w_i in the training set is discarded with probability
      P(w_i) = 1 - sqrt(t / f(w_i)),
      where f(w_i) is the frequency of word w_i and t (around 10^-5) is a threshold
    ■ Rare words are more likely to be kept
    ■ Accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words

28. Word2vec (4)
  2. Hierarchical Softmax:
    ○ A tree approximation of the softmax, using a sigmoid at every step
    ○ Intuition: at every step, decide whether to go left or right
    ○ O(log(n)) instead of O(n)
  3. Negative sampling:
    ○ Alternative to hierarchical softmax (works better)
    ○ Boosts infrequent terms, squeezes the probability of frequent terms
    ○ Update the weights of only a small number of sampled terms, drawn with probability
      P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)
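A small sketch of that smoothed unigram sampling distribution on made-up frequencies, just to show how raising counts to the 3/4 power shifts probability mass toward rarer words:

```python
# Sketch: negative-sampling distribution P(w) ~ f(w)^(3/4).
from collections import Counter

counts = Counter({"the": 1000, "cat": 50, "bedroom": 5})    # toy word frequencies
powered = {w: c ** 0.75 for w, c in counts.items()}
total = sum(powered.values())
neg_sampling_prob = {w: p / total for w, p in powered.items()}

print(neg_sampling_prob)   # rare words get a larger share than under the raw frequencies
```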

29. Word2vec (5)
● A serendipitous effect of Word2vec is the (approximate) linearity of analogies between embeddings
● The famous example: (King - Man) + Woman = Queen
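A sketch of running that analogy query with pretrained vectors via gensim's downloader; the specific GloVe model is an arbitrary small choice (any pretrained word vectors would do), and it is fetched from the internet on first use:

```python
# Sketch: King - Man + Woman with pretrained word vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is expected to appear among the top results
```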
