  1. CMSC 676: Classification using word embedding techniques. Presented by Prachi Bhalerao (prachib1@umbc.edu)

  2. Introduction
  • Need: Finding similarity between words and extracting as much semantic/contextual information as possible is vital; this need runs across the complete NLP spectrum.
  • What is word / sentence embedding? A technique for representing words as vectors of real numbers, which helps in comparing the semantics of different words and in representing data efficiently (dimensionality reduction).
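To make the definition concrete, here is a minimal sketch of treating words as vectors and comparing them geometrically. The 4-dimensional vectors are hand-made placeholders, not learned embeddings:

```python
import numpy as np

# Hand-made placeholder vectors (real embeddings are learned, and
# typically have 100-300 dimensions rather than 4).
vec = {
    "king":  np.array([0.90, 0.80, 0.10, 0.20]),
    "queen": np.array([0.85, 0.75, 0.20, 0.30]),
    "apple": np.array([0.10, 0.20, 0.90, 0.70]),
}

def cosine(a, b):
    # Cosine similarity: near 1.0 for similar directions, near 0 for unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec["king"], vec["queen"]))  # high: semantically close
print(cosine(vec["king"], vec["apple"]))  # lower: semantically distant
```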

  3. Types of word embeddings
  1. Frequency-based word embeddings: count vector, TF-IDF vector
  2. Prediction-based word embeddings: Word2Vec, fastText, GloVe, etc.
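A minimal sketch of the two frequency-based embeddings named above, using scikit-learn (the library choice is an assumption; the slides name none):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "word embeddings map words to vectors",
    "vectors of real numbers represent words",
]

# Count vector: one column per vocabulary term, cells hold raw counts.
counts = CountVectorizer()
print(counts.fit_transform(docs).toarray())

# TF-IDF vector: counts reweighted so that terms appearing in every
# document contribute less than document-specific terms.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```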

  4. Word embedding techniques: Word2Vec
  • CBOW model: predicts the target word from a given context
  • Skip-gram model: predicts the context from a given word
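The difference between the two variants comes down to a single training flag in most implementations. A minimal sketch with gensim (an assumed implementation; the toy corpus is far too small to yield meaningful vectors and only shows the API):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 selects CBOW: predict the target word from its surrounding context.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 selects skip-gram: predict the context words from the target word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                   # (50,)
print(skipgram.wv.similarity("cat", "dog"))   # cosine similarity in [-1, 1]
```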

  5. Word embedding techniques: FastText
  • FastText represents each word as a bag of character n-grams, so it can build vectors even for words it never saw during training.
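A minimal sketch of that subword behaviour with gensim's FastText class (an assumed implementation; the toy corpus only demonstrates the API):

```python
from gensim.models import FastText

sentences = [
    ["word", "embeddings", "represent", "words"],
    ["fasttext", "builds", "vectors", "from", "character", "ngrams"],
]

model = FastText(sentences, vector_size=50, window=3, min_count=1)

# Because each word is a bag of character n-grams, fastText can compose
# a vector even for a misspelled or unseen word that shares n-grams
# with the training vocabulary.
print(model.wv["embeddingz"].shape)  # out-of-vocabulary, still (50,)
```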

  6. Example 1: Sub-topic detection
  • A technique based on sentence embeddings for detecting sub-topics is proposed.
  • Latent Dirichlet Allocation (LDA) is used to obtain the topics.
  • Topic Word Embedding (TWE) is used to train on the Weibo data set under a topic.
  • Taking the cosine between the word embeddings and the topic embedding as the weight value, the word embeddings of the target words are weighted and summed. This extends the topic information into the word embeddings and enhances their semantics.
  • The p-means method merges each blog into a sentence embedding, which serves as the blog's feature vector (see the sketch after this list).
  • Finally, sub-topic clusters are obtained through k-means.
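A hedged sketch of the cosine-weighting, p-means, and k-means steps, using numpy and scikit-learn. The topic vector and word vectors are random placeholders standing in for the LDA/TWE outputs, and the choice of means (average, max, min) is one common p-means configuration, not necessarily the paper's:

```python
import numpy as np
from sklearn.cluster import KMeans

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def topic_weight(word_vecs, topic_vec):
    # Weight each word vector by its cosine similarity to the topic
    # embedding: words close to the topic contribute more, which injects
    # topic information into the word embeddings.
    w = np.array([cosine(v, topic_vec) for v in word_vecs])
    return w[:, None] * word_vecs

def p_means(word_vecs):
    # Merge a blog's word vectors into one sentence embedding by
    # concatenating several power means (here: mean, max, min).
    return np.concatenate([
        word_vecs.mean(axis=0), word_vecs.max(axis=0), word_vecs.min(axis=0),
    ])

rng = np.random.default_rng(0)
topic_vec = rng.normal(size=50)                        # placeholder topic embedding
blogs = [rng.normal(size=(8, 50)) for _ in range(20)]  # placeholder word vectors

features = np.stack([p_means(topic_weight(b, topic_vec)) for b in blogs])

# Sub-topic clusters via k-means over the sentence embeddings.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(labels)
```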

  7. Example 2: Named entity recognition
  • Word vectors obtained from the word2vec and fastText embedding approaches are applied to the task of named entity recognition (NER).
  • Given a tokenized text, the task is to predict which words belong to which predefined category.
  • A word2vec model was trained; classification was then performed using a greedy implementation of the linear support vector algorithm (see the sketch after this list).
  • Addressing cluster granularity and unlabelled corpus size: the reported measures are achieved by adding clusters at granularity 1000, built from word2vec models trained on the various data sets, to the NER classifier.
  • Results:
    1. Performance of the NER model improved with the growth of the unlabelled data set, but only up to a limit (around 300,000 types in this paper), at which it even started to drop. One possible explanation for the stagnating performance on the larger data set is that other training settings need to be employed for optimal training (e.g. higher vector dimensionality or more training iterations).
    2. Combining multiple cluster granularities led to the best improvement; it did not improve performance for the smaller data sets.
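A hedged sketch of the classification step: each token is represented by its word2vec vector and classified greedily with a linear SVM (scikit-learn's LinearSVC as a stand-in). The tiny labelled corpus is a placeholder, and the paper's cluster features are omitted:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

sentences = [
    ["alice", "works", "at", "acme", "in", "berlin"],
    ["bob", "visited", "berlin", "and", "acme"],
]
tags = [
    ["PER", "O", "O", "ORG", "O", "LOC"],
    ["PER", "O", "LOC", "O", "ORG"],
]

# Train word2vec on the (placeholder) tokenized text.
w2v = Word2Vec(sentences, vector_size=25, window=2, min_count=1)

# Each token's feature vector is simply its embedding.
X = np.stack([w2v.wv[tok] for sent in sentences for tok in sent])
y = [t for sent in tags for t in sent]

# Greedy per-token prediction with a linear support vector classifier.
clf = LinearSVC().fit(X, y)
print(clf.predict(np.stack([w2v.wv["berlin"], w2v.wv["acme"]])))
```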

  8. Challenges
  1. Homographs: different words sharing the same spelling (e.g. 'Apple', 'like'). The embedding takes the average of the contexts of all words with the same spelling, so the different senses collapse into a single vector (see the sketch after this list). In practice this can significantly impact the performance of ML systems, posing a potential problem for conversational agents and text classifiers.
  2. Inflection: alterations of a word to express different grammatical categories, e.g. inflected forms of verbs (past tense or participle). Some word inflections appear less frequently than others in certain contexts, so the algorithm sees fewer examples of those less common words in context and learns 'less similar' vectors for them.
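A hedged sketch of the homograph problem, with a toy corpus as a placeholder: a static embedding keeps exactly one vector per surface form, so both senses of "apple" collapse into one point:

```python
from gensim.models import Word2Vec

corpus = [
    ["i", "ate", "a", "fresh", "apple"],           # fruit sense
    ["apple", "released", "a", "new", "phone"],    # company sense
]
model = Word2Vec(corpus, vector_size=20, window=2, min_count=1)

# One vocabulary entry, one vector: the model averages the contexts of
# both senses instead of keeping them apart.
print(model.wv.key_to_index["apple"])  # a single index for both senses
print(model.wv["apple"][:5])
```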

  9. References
  • Yu Xie et al., 'A Method based on Sentence Embeddings for the Sub-Topics Detection', 2019 J. Phys.: Conf. Ser. 1168 052004; https://iopscience.iop.org/article/10.1088/1742-6596/1168/5/052004/pdf
  • Scharolta Katharina Siencnik, 'Adapting word2vec to Named Entity Recognition', Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015); https://www.ep.liu.se/ecp/109/030/ecp15109030.pdf
  • Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, Iryna Gurevych, 'Classification and Clustering of Arguments with Contextualized Word Embeddings', ACL 2019; https://arxiv.org/abs/1906.09821
  • Inderjit Dhillon, Rahul Kumar, 'Enhanced word clustering for hierarchical text classification', ACM SIGKDD; https://dl.acm.org/doi/abs/10.1145/775047.775076
