Word counts with bag-of-words
Katharine Jarmul, Founder, kjamistan

DataCamp: Introduction to Natural Language Processing in Python

Bag-of-words
- Basic method for finding topics in a text
- Need to first create tokens using tokenization
- ... and then count up all the tokens
- The more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text

Bag-of-words example
Text: "The cat is in the box. The cat likes the box. The box is over the cat."
Bag of words (stripped punctuation):
- "The": 3, "box": 3, "cat": 3, "the": 3
- "is": 2
- "in": 1, "likes": 1, "over": 1
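The counts in this example can be reproduced with the standard library alone. The sketch below is only an illustration: it assumes a word is any run of ASCII letters (a simplification, not how a full tokenizer works), and the names `tokens` and `bag_of_words` are made up for this example.

```python
import re
from collections import Counter

text = ("The cat is in the box. The cat likes the box. "
        "The box is over the cat.")

# Assumption: a word is any run of letters. This drops punctuation and
# keeps case, so "The" and "the" are counted as different tokens.
tokens = re.findall(r"[A-Za-z]+", text)

# Counter maps each distinct token to the number of times it occurs.
bag_of_words = Counter(tokens)

# Print tokens by descending count, breaking ties alphabetically.
for word, count in sorted(bag_of_words.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{word}: {count}")
```

Because case is preserved, "The" and "the" end up as separate entries, exactly as in the slide's tally.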

Bag-of-words in Python
In [1]: from nltk.tokenize import word_tokenize
In [2]: from collections import Counter
In [3]: counter = Counter(word_tokenize(
   ...:     """The cat is in the box. The cat likes the box. The box is over the cat."""))
In [4]: counter
Out[4]: Counter({'.': 3, 'The': 3, 'box': 3, 'cat': 3, 'in': 1, ... 'the': 3})
In [5]: counter.most_common(2)
Out[5]: [('The', 3), ('box', 3)]
Note: several tokens tie at a count of 3, so which two appear in the result depends on how the Counter orders ties.

Let's practice!

Simple text preprocessing

Why preprocess?
- Helps make for better input data when performing machine learning or other statistical methods
- Examples:
  - Tokenization to create a bag of words
  - Lowercasing words
  - Lemmatization/stemming: shorten words to their root stems
  - Removing stop words, punctuation, or unwanted tokens
- Good to experiment with different approaches

Preprocessing example
Input text: Cats, dogs and birds are common pets. So are fish.
Output tokens: cat, dog, bird, common, pet, fish
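One way to see how the input text could turn into those output tokens is a toy pipeline in plain Python. Everything here is an assumption made for illustration: the three-word `STOP_WORDS` set, the `naive_stem` helper (which only strips a trailing "s"), and the `preprocess` function are hypothetical stand-ins, not a real stop word list or lemmatizer.

```python
import re

# Assumption: a tiny hand-written stop word list, for illustration only.
STOP_WORDS = {"and", "are", "so"}

def naive_stem(word):
    """Toy stemmer: strip one trailing 's' from words longer than three letters."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

def preprocess(text):
    # Lowercase, then keep alphabetic runs only (this drops punctuation).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words, then reduce each remaining token to a crude stem.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Cats, dogs and birds are common pets. So are fish.")
print(tokens)  # ['cat', 'dog', 'bird', 'common', 'pet', 'fish']
```

A real pipeline would swap in a proper stop word list and a stemmer or lemmatizer; the trailing-"s" rule breaks on words such as "glass" or "bus".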

Text preprocessing with Python
In [1]: from collections import Counter
In [2]: from nltk.corpus import stopwords
In [3]: from nltk.tokenize import word_tokenize
In [4]: text = """The cat is in the box. The cat likes the box. The box is over the cat."""
In [5]: tokens = [w for w in word_tokenize(text.lower())
   ...:           if w.isalpha()]
In [6]: no_stops = [t for t in tokens
   ...:             if t not in stopwords.words('english')]
In [7]: Counter(no_stops).most_common(2)
Out[7]: [('cat', 3), ('box', 3)]

Introduction to gensim

What is gensim?
- Popular open-source NLP library
- Uses top academic models to perform complex tasks
  - Building document or word vectors
  - Performing topic identification and document comparison

What is a word vector?

Gensim example
(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)

Creating a gensim dictionary
In [1]: from gensim.corpora.dictionary import Dictionary
In [2]: from nltk.tokenize import word_tokenize
In [3]: my_documents = ['The movie was about a spaceship and aliens.',
   ...:                 'I really liked the movie!',
   ...:                 'Awesome action scenes, but boring characters.',
   ...:                 'The movie was awful! I hate alien films.',
   ...:                 'Space is cool! I liked the movie.',
   ...:                 'More space films, please!',]
In [4]: tokenized_docs = [word_tokenize(doc.lower())
   ...:                   for doc in my_documents]
In [5]: dictionary = Dictionary(tokenized_docs)
In [6]: dictionary.token2id
Out[6]:
{'!': 11,
 ',': 17,
 '.': 7,
 'a': 2,
 'about': 4,
 ... }

Creating a gensim corpus
In [7]: corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
In [8]: corpus
Out[8]:
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 ... ]
- gensim models can be easily saved, updated, and reused
- Our dictionary can also be updated
- This more advanced and feature-rich bag-of-words can be used in future exercises
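To make the (token id, count) pairs less opaque, here is a plain-Python sketch of the same idea: give every distinct token an integer id, then describe each document as sorted (id, count) pairs. This is not gensim's implementation, and the ids it assigns will not match gensim's. The shortened `docs` list and the `doc_to_bow` helper are hypothetical names used only for this illustration.

```python
from collections import Counter

# Assumption: documents are already tokenized and lowercased
# (a shortened, punctuation-free stand-in for tokenized_docs).
docs = [
    ["i", "really", "liked", "the", "movie"],
    ["more", "space", "films", "please"],
    ["space", "is", "cool", "i", "liked", "the", "movie"],
]

# Give each new token the next free integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def doc_to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

corpus = [doc_to_bow(doc) for doc in docs]
print(corpus[1])  # [(5, 1), (6, 1), (7, 1), (8, 1)]
```

Storing ids rather than strings keeps each document compact, and tokens that do not occur in a document take no space in its representation.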

Tf-idf with gensim

What is tf-idf?
- Term frequency - inverse document frequency
- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond just stopwords
- These words should be down-weighted in importance
- Example from astronomy: "Sky"
- Ensures most common words don't show up as key words
- Keeps document-specific frequent words weighted high

Tf-idf formula
$$ w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\left(\frac{N}{\mathrm{df}_i}\right) $$
- $w_{i,j}$ = tf-idf weight for token $i$ in document $j$
- $\mathrm{tf}_{i,j}$ = number of occurrences of token $i$ in document $j$
- $\mathrm{df}_i$ = number of documents that contain token $i$
- $N$ = total number of documents
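A small worked example may help connect the formula to numbers. The sketch below applies it literally to three invented astronomy-flavoured documents, assuming the natural logarithm (the slide does not state a base). The `docs` list and the `tfidf` helper are hypothetical. Library implementations such as gensim's may use a different log base or normalize the resulting vectors, so their values need not match these.

```python
import math
from collections import Counter

# Assumption: three tiny, pre-tokenized documents invented for illustration.
docs = [
    ["the", "sky", "is", "blue"],
    ["the", "sky", "has", "stars"],
    ["the", "telescope", "shows", "stars"],
]

N = len(docs)  # total number of documents

# df[token] = number of documents that contain the token at least once.
df = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Weight each token in doc as tf * log(N / df), using the natural log."""
    tf = Counter(doc)  # tf[token] = occurrences of the token in this document
    return {token: count * math.log(N / df[token]) for token, count in tf.items()}

weights = tfidf(docs[0])
for token, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {weight:.3f}")
```

In the first document, "the" occurs in all three documents and gets weight log(3/3) = 0; "sky" occurs in two and gets log(3/2); "blue" and "is" occur in one and get log(3). Words shared across the corpus are down-weighted, while document-specific ones stay high.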

Tf-idf with gensim
In [10]: from gensim.models.tfidfmodel import TfidfModel
In [11]: tfidf = TfidfModel(corpus)
In [12]: tfidf[corpus[1]]
Out[12]:
[(0, 0.1746298276735174),
 (1, 0.1746298276735174),
 (9, 0.29853166221463673),
 (10, 0.7716931521027908),
 ... ]
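The output is a list of (token id, weight) pairs, so finding a document's most important terms comes down to sorting by weight. The sketch below uses only the four pairs visible in the output above; the elided entries are left out rather than guessed, and the names `weights` and `sorted_weights` are chosen for illustration.

```python
# The four (term_id, weight) pairs visible in the slide's output; the
# remaining entries are elided there and are not reproduced here.
weights = [
    (0, 0.1746298276735174),
    (1, 0.1746298276735174),
    (9, 0.29853166221463673),
    (10, 0.7716931521027908),
]

# Sort by weight, highest first, so the most distinctive terms lead.
sorted_weights = sorted(weights, key=lambda pair: pair[1], reverse=True)

# Show the two highest-weighted term ids among the visible pairs.
for term_id, weight in sorted_weights[:2]:
    print(term_id, round(weight, 3))
```

Looking each id up in the dictionary built earlier would turn the ranked ids back into words.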
