Building a bag of words model
FEATURE ENGINEERING FOR NLP IN PYTHON
Rounak Banik, Data Scientist

Recap of data format for ML algorithms
For any ML algorithm:
- Data must be in tabular form
- Training features must be numerical

Bag of words model
- Extract word tokens
- Compute the frequency of word tokens
- Construct a word vector out of these frequencies and the vocabulary of the corpus

Bag of words model example
Corpus:
    "The lion is the king of the jungle"
    "Lions have lifespans of a decade"
    "The lion is an endangered species"

Bag of words model example
Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The
    "The lion is the king of the jungle"
    [0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]
    "Lions have lifespans of a decade"
    [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
    "The lion is an endangered species"
    [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
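The three steps (tokenize, count, build vectors) can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and a case-insensitive sort with lowercase forms first on ties, which is one ordering that reproduces the vocabulary order shown on the slide; the variable names are illustrative only.

```python
corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Step 1: extract word tokens (plain whitespace split, case preserved)
tokenized = [doc.split() for doc in corpus]

# Step 2: build the vocabulary. Assumed ordering: case-insensitive sort,
# lowercase form first on ties, to match the column order on the slide.
vocabulary = sorted(
    {token for doc in tokenized for token in doc},
    key=lambda w: (w.lower(), w != w.lower()),
)

# Step 3: one count per vocabulary entry for each document
bow_vectors = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Because case is preserved, "the" and "The" (and "lion" and "Lions") occupy separate columns, which is what motivates the preprocessing step on the next slide.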

Text preprocessing
- Lions, lion → lion
- The, the → the
- No punctuation
- No stop words
- Leads to smaller vocabularies
- Reducing the number of dimensions helps improve performance
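To see how much preprocessing shrinks the vocabulary, here is a rough sketch on the lion corpus. The stop-word set and the lemma lookup below are tiny hypothetical stand-ins written for this example; a real pipeline would normally take both from an NLP library.

```python
import string

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Hypothetical, hand-written resources for illustration only
STOP_WORDS = {"the", "is", "of", "a", "an", "have"}
LEMMAS = {"lions": "lion", "lifespans": "lifespan"}


def preprocess(text):
    """Lowercase, strip punctuation, lemmatize via lookup, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [LEMMAS.get(token, token) for token in text.split()]
    return [token for token in tokens if token not in STOP_WORDS]


raw_vocabulary = {token for doc in corpus for token in doc.split()}
clean_vocabulary = {token for doc in corpus for token in preprocess(doc)}

print(len(raw_vocabulary), len(clean_vocabulary))
print(sorted(clean_vocabulary))
```

On this corpus the vocabulary drops from 15 entries to 7, so each document vector has fewer than half as many dimensions.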

Bag of words model using sklearn
    import pandas as pd

    corpus = pd.Series([
        'The lion is the king of the jungle',
        'Lions have lifespans of a decade',
        'The lion is an endangered species'
    ])

Bag of words model using sklearn
    # Import CountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer

    # Create CountVectorizer object
    vectorizer = CountVectorizer()

    # Generate matrix of word vectors
    bow_matrix = vectorizer.fit_transform(corpus)
    print(bow_matrix.toarray())

    array([[0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 3],
           [0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
           [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)

Let's practice!

Building a BoW Naive Bayes classifier

Spam filtering
    message | label
    WINNER!! As a valued network customer you have been selected to receive a $900 prize reward! To claim call 09061701461 | spam
    Ah, work. I vaguely remember that. What does it feel like? | ham

Steps
1. Text preprocessing
2. Building a bag-of-words model (or representation)
3. Machine learning

Text preprocessing using CountVectorizer
CountVectorizer arguments:
- lowercase : False, True
- strip_accents : 'unicode', 'ascii', None
- stop_words : 'english', list, None
- token_pattern : regex
- tokenizer : function
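A quick way to inspect what these arguments do is to call the vectorizer's analyzer directly on a sentence. The sentence below is made up for illustration, and the sketch assumes scikit-learn's built-in English stop-word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    lowercase=True,          # "The" and "the" become the same token
    strip_accents="ascii",   # "café" becomes "cafe"
    stop_words="english",    # drop words on the built-in English list
)

# build_analyzer() returns the full preprocess + tokenize + filter callable
analyze = vectorizer.build_analyzer()
tokens = analyze("The café is the place")
print(tokens)
```

Accent stripping and lowercasing run before tokenization, and stop words are filtered from the resulting tokens.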

Building the BoW model
    # Import CountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer

    # Create CountVectorizer object
    vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)

    # Import train_test_split
    from sklearn.model_selection import train_test_split

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.25)

Building the BoW model
    ...
    ...
    # Generate training BoW vectors
    X_train_bow = vectorizer.fit_transform(X_train)

    # Generate test BoW vectors
    X_test_bow = vectorizer.transform(X_test)
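The training set is passed to `fit_transform` while the test set only goes through `transform`, so the vocabulary is learned from the training data alone. The sketch below uses made-up toy messages (not the course dataset) to show the consequence: test vectors have the same columns as training vectors, and words never seen in training are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages invented for this illustration
train_messages = ["free prize call now", "see you at work"]
test_messages = ["free lunch at work"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training messages
train_bow = vectorizer.fit_transform(train_messages)

# transform reuses that vocabulary; "lunch" was never seen, so it is dropped
test_bow = vectorizer.transform(test_messages)

print(train_bow.shape, test_bow.shape)
print("lunch" in vectorizer.vocabulary_)
print(int(test_bow.sum()))
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.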

Training the Naive Bayes classifier
    # Import MultinomialNB
    from sklearn.naive_bayes import MultinomialNB

    # Create MultinomialNB object
    clf = MultinomialNB()

    # Train clf
    clf.fit(X_train_bow, y_train)

    # Compute accuracy on test set
    accuracy = clf.score(X_test_bow, y_test)
    print(accuracy)

    0.760051
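Once trained, the same vectorizer and classifier can label a new message. The following is a self-contained sketch on four invented messages, far too few to say anything about accuracy; it only shows how the pieces connect.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data invented for this illustration
messages = [
    "win a free prize now",
    "claim your free reward call now",
    "are we meeting at work today",
    "call me when you get home",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(messages)

clf = MultinomialNB()
clf.fit(X_bow, labels)

# New messages must go through the already-fitted vectorizer
new_message = ["free prize call"]
prediction = clf.predict(vectorizer.transform(new_message))[0]
print(prediction)
```

Here "free" and "prize" occur only in the spam examples, so the classifier assigns the new message to the spam class.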

Building n-gram models

BoW shortcomings
    review | label
    'The movie was good and not boring' | positive
    'The movie was not good and boring' | negative
- Exactly the same BoW representation!
- Context of the words is lost.
- Sentiment depends on the position of 'not'.
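The claim that both reviews share one representation is easy to check with word counts alone. A minimal sketch, assuming whitespace tokenization:

```python
from collections import Counter

positive = "The movie was good and not boring"
negative = "The movie was not good and boring"

# A bag of words is just a multiset of tokens: order is discarded
positive_bag = Counter(positive.split())
negative_bag = Counter(negative.split())

same_bag = positive_bag == negative_bag
print(same_bag)
```

The two bags are equal because both sentences contain the same seven words once each; only the order differs, and order is exactly what the model throws away.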

n-grams
- Contiguous sequence of n elements (or words) in a given document.
- n = 1 → bag-of-words
    'for you a thousand times over'
- n = 2, n-grams:
    [
      'for you',
      'you a',
      'a thousand',
      'thousand times',
      'times over'
    ]

n-grams
    'for you a thousand times over'
- n = 3, n-grams:
    [
      'for you a',
      'you a thousand',
      'a thousand times',
      'thousand times over'
    ]
- Captures more context.
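The bigram and trigram lists above can be generated with a sliding window over the tokens. A small sketch, where `ngrams` is an illustrative helper written for this example rather than a library function:

```python
def ngrams(text, n):
    """Return all contiguous runs of n whitespace-separated words in text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "for you a thousand times over"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

A sentence of k words yields k - n + 1 n-grams, so the six-word sentence gives five bigrams and four trigrams.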

Applications
- Sentence completion
- Spelling correction
- Machine translation correction

Building n-gram models using scikit-learn
Generates only bigrams:
    bigrams = CountVectorizer(ngram_range=(2,2))
Generates unigrams, bigrams and trigrams:
    ngrams = CountVectorizer(ngram_range=(1,3))

Shortcomings
- Curse of dimensionality
- Higher order n-grams are rare
- Keep n small
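To get a feel for how quickly the feature space grows, the sketch below counts distinct n-grams in the three-sentence lion corpus as the maximum n increases. It assumes lowercasing and whitespace tokenization; `distinct_ngrams` is an illustrative helper, not a library function.

```python
corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]


def distinct_ngrams(docs, n):
    """Collect the set of distinct n-grams across all documents."""
    found = set()
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found


# Number of features when using all n-grams up to max_n
feature_counts = {
    max_n: sum(len(distinct_ngrams(corpus, n)) for n in range(1, max_n + 1))
    for max_n in (1, 2, 3)
}
print(feature_counts)
```

Even on three short sentences, going from unigrams to unigrams plus bigrams plus trigrams triples the number of features, and most of the added n-grams occur only once.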
