Building a bag of words model - Feature Engineering for NLP in Python


  1. Building a bag of words model - FEATURE ENGINEERING FOR NLP IN PYTHON - Rounak Banik, Data Scientist

  2. Recap of data format for ML algorithms: for any ML algorithm, the data must be in tabular form and the training features must be numerical.

  3. Bag of words model: extract word tokens, compute the frequency of each word token, and construct a word vector from these frequencies and the vocabulary of the corpus.

  4. Bag of words model example. Corpus:
      "The lion is the king of the jungle"
      "Lions have lifespans of a decade"
      "The lion is an endangered species"

  5. Bag of words model example. Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The
      "The lion is the king of the jungle" → [0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]
      "Lions have lifespans of a decade" → [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
      "The lion is an endangered species" → [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
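      Note (not from the slides): a minimal sketch that reproduces the vectors above by counting each vocabulary word per sentence, using the vocabulary order shown on this slide.

      # Minimal sketch: rebuild the word vectors above by hand
      corpus = [
          "The lion is the king of the jungle",
          "Lions have lifespans of a decade",
          "The lion is an endangered species",
      ]
      vocabulary = ["a", "an", "decade", "endangered", "have", "is", "jungle",
                    "king", "lifespans", "lion", "Lions", "of", "species", "the", "The"]

      for sentence in corpus:
          tokens = sentence.split()
          vector = [tokens.count(word) for word in vocabulary]
          print(vector)
      # First sentence -> [0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]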

  6. Text preprocessing: Lions, lion → lion; The, the → the; no punctuation; no stop words. This leads to smaller vocabularies, and reducing the number of dimensions helps improve performance.
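      Note (not from the slides): a minimal preprocessing sketch in plain Python, assuming a tiny hand-picked stop word list; the course leaves the exact tooling open, and mapping Lions → lion would additionally need lemmatization, which this sketch does not do.

      import string

      stop_words = {"the", "is", "of", "a", "an", "have"}   # illustrative only

      def preprocess(sentence):
          # Lowercase, strip punctuation, drop stop words
          sentence = sentence.lower()
          sentence = sentence.translate(str.maketrans("", "", string.punctuation))
          return [tok for tok in sentence.split() if tok not in stop_words]

      print(preprocess("The lion is the king of the jungle"))
      # ['lion', 'king', 'jungle']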

  7. Bag of words model using sklearn

      import pandas as pd

      corpus = pd.Series([
          'The lion is the king of the jungle',
          'Lions have lifespans of a decade',
          'The lion is an endangered species'
      ])

  8. Bag of words model using sklearn

      # Import CountVectorizer
      from sklearn.feature_extraction.text import CountVectorizer

      # Create CountVectorizer object
      vectorizer = CountVectorizer()

      # Generate matrix of word vectors
      bow_matrix = vectorizer.fit_transform(corpus)
      print(bow_matrix.toarray())

      array([[0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 3],
             [0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0],
             [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)
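      Note (not from the slides): a follow-up sketch that maps each column of bow_matrix to its word. The default CountVectorizer lowercases text and drops single-character tokens such as 'a', which is why there are 13 columns here instead of the 15 words listed on slide 5. get_feature_names_out() is the newer scikit-learn API; older releases expose get_feature_names() instead.

      # Inspect the learned vocabulary, one entry per column of bow_matrix
      print(vectorizer.get_feature_names_out())
      # ['an' 'decade' 'endangered' 'have' 'is' 'jungle' 'king' 'lifespans'
      #  'lion' 'lions' 'of' 'species' 'the']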

  9. Let's practice! - FEATURE ENGINEERING FOR NLP IN PYTHON

  10. Building a BoW Naive Bayes classifier - FEATURE ENGINEERING FOR NLP IN PYTHON - Rounak Banik, Data Scientist

  11. Spam filtering
      - message: "WINNER!! As a valued network customer you have been selected to receive a $900 prize reward! To claim call 09061701461" → label: spam
      - message: "Ah, work. I vaguely remember that. What does it feel like?" → label: ham
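      Note (not from the slides): the later slides index df['message'] and df['label']; a hypothetical sketch of such a DataFrame, built here from just the two example messages above (the real exercise uses a full SMS dataset).

      import pandas as pd

      # Hypothetical DataFrame with the column names the later slides assume
      df = pd.DataFrame({
          "message": [
              "WINNER!! As a valued network customer you have been selected "
              "to receive a $900 prize reward! To claim call 09061701461",
              "Ah, work. I vaguely remember that. What does it feel like?",
          ],
          "label": ["spam", "ham"],
      })
      print(df.shape)   # (2, 2)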

  12. Steps
      1. Text preprocessing
      2. Building a bag-of-words model (or representation)
      3. Machine learning

  13. Text preprocessing using CountVectorizer. CountVectorizer arguments:
      lowercase: False, True
      strip_accents: 'unicode', 'ascii', None
      stop_words: 'english', list, None
      token_pattern: regex
      tokenizer: function
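      Note (not from the slides): an illustrative sketch of the arguments listed above; the token_pattern and tokenizer shown here are made-up examples, not values used by the course.

      from sklearn.feature_extraction.text import CountVectorizer

      vectorizer = CountVectorizer(
          lowercase=True,                   # fold case before counting
          strip_accents="unicode",          # or 'ascii', or None
          stop_words="english",             # built-in list; a custom list also works
          token_pattern=r"\b[a-zA-Z]+\b",   # example regex: keep alphabetic tokens only
      )

      # Alternatively, pass a custom tokenizer (token_pattern is then ignored):
      # vectorizer = CountVectorizer(tokenizer=lambda doc: doc.split())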

  14. Building the BoW model

      # Import CountVectorizer
      from sklearn.feature_extraction.text import CountVectorizer

      # Create CountVectorizer object
      vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english',
                                   lowercase=False)

      # Import train_test_split
      from sklearn.model_selection import train_test_split

      # Split into training and test sets
      X_train, X_test, y_train, y_test = train_test_split(df['message'],
                                                          df['label'],
                                                          test_size=0.25)

  15. Building the BoW model (continued)

      # Generate training BoW vectors
      X_train_bow = vectorizer.fit_transform(X_train)

      # Generate test BoW vectors
      X_test_bow = vectorizer.transform(X_test)
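      Note (not from the slides): fit_transform is used on the training data only, so the vocabulary is fixed there; transform on the test data reuses that vocabulary, and words seen only at test time are silently dropped. A toy sketch of that behaviour:

      from sklearn.feature_extraction.text import CountVectorizer

      toy = CountVectorizer()
      toy.fit_transform(["the lion sleeps", "the lion hunts"])
      print(toy.get_feature_names_out())                     # ['hunts' 'lion' 'sleeps' 'the']
      # 'tiger' never appeared in training, so it gets no column
      print(toy.transform(["the tiger sleeps"]).toarray())   # [[0 0 1 1]]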

  16. Training the Naive Bayes classifier

      # Import MultinomialNB
      from sklearn.naive_bayes import MultinomialNB

      # Create MultinomialNB object
      clf = MultinomialNB()

      # Train clf
      clf.fit(X_train_bow, y_train)

      # Compute accuracy on test set
      accuracy = clf.score(X_test_bow, y_test)
      print(accuracy)

      0.760051
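      Note (not from the slides): a usage sketch classifying a new, made-up message with the trained pipeline; the predicted label depends on the model actually trained.

      # Vectorize a new message with the fitted vectorizer, then predict
      new_message = ["Congratulations! You have won a prize, call now to claim"]
      new_bow = vectorizer.transform(new_message)
      print(clf.predict(new_bow))   # e.g. ['spam'], depending on the trained model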

  17. Let's practice! - FEATURE ENGINEERING FOR NLP IN PYTHON

  18. Building n-gram models - FEATURE ENGINEERING FOR NLP IN PYTHON - Rounak Banik, Data Scientist

  19. BoW shortcomings
      - review: 'The movie was good and not boring' → label: positive
      - review: 'The movie was not good and boring' → label: negative
      Both reviews get exactly the same BoW representation! The context of the words is lost, and the sentiment depends on the position of 'not'.
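      Note (not from the slides): a small sketch confirming that the two reviews above produce identical bag-of-words vectors, which is why BoW cannot separate them.

      from sklearn.feature_extraction.text import CountVectorizer

      reviews = ["The movie was good and not boring",
                 "The movie was not good and boring"]
      vec = CountVectorizer()
      bow = vec.fit_transform(reviews).toarray()
      print((bow[0] == bow[1]).all())   # True: identical vectors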

  20. n-grams: a contiguous sequence of n elements (or words) in a given document. n = 1 → bag-of-words.
      'for you a thousand times over'
      n = 2, n-grams: ['for you', 'you a', 'a thousand', 'thousand times', 'times over']

  21. n-grams
      'for you a thousand times over'
      n = 3, n-grams: ['for you a', 'you a thousand', 'a thousand times', 'thousand times over']
      Captures more context.
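      Note (not from the slides): a minimal sketch that generates word n-grams with plain Python (zip over shifted token lists); it reproduces the bigrams and trigrams shown above.

      def ngrams(sentence, n):
          # Build n-grams by zipping the token list against its shifted copies
          tokens = sentence.split()
          return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

      print(ngrams("for you a thousand times over", 2))
      # ['for you', 'you a', 'a thousand', 'thousand times', 'times over']
      print(ngrams("for you a thousand times over", 3))
      # ['for you a', 'you a thousand', 'a thousand times', 'thousand times over']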

  22. Applications: sentence completion, spelling correction, machine translation correction.

  23. Building n-gram models using scikit-learn

      # Generates only bigrams
      bigrams = CountVectorizer(ngram_range=(2, 2))

      # Generates unigrams, bigrams and trigrams
      ngrams = CountVectorizer(ngram_range=(1, 3))
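      Note (not from the slides): a usage sketch inspecting the bigram features CountVectorizer produces. The default tokenizer drops single-character tokens such as 'a', so the output differs slightly from the hand-built bigram list on slide 20.

      from sklearn.feature_extraction.text import CountVectorizer

      sent = ["for you a thousand times over"]
      bigrams = CountVectorizer(ngram_range=(2, 2))
      bigrams.fit_transform(sent)
      print(bigrams.get_feature_names_out())
      # ['for you' 'thousand times' 'times over' 'you thousand']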

  24. Shortcomings: curse of dimensionality; higher-order n-grams are rare; keep n small.
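      Note (not from the slides): a sketch of the dimensionality blow-up on the three-sentence corpus from earlier, comparing feature counts for unigrams only versus unigrams plus bigrams and trigrams (counts computed with the default tokenizer, which drops 'a').

      from sklearn.feature_extraction.text import CountVectorizer

      corpus = ["The lion is the king of the jungle",
                "Lions have lifespans of a decade",
                "The lion is an endangered species"]
      for lo, hi in [(1, 1), (1, 2), (1, 3)]:
          n_features = CountVectorizer(ngram_range=(lo, hi)).fit_transform(corpus).shape[1]
          print((lo, hi), n_features)
      # (1, 1) 13   (1, 2) 27   (1, 3) 39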

  25. Let's practice! - FEATURE ENGINEERING FOR NLP IN PYTHON
