Tokenization and lemmatization

Tokenization and Lemmatization FEATURE ENGINEERING - PowerPoint PPT Presentation



  1. Tokenization and Lemmatization
     FEATURE ENGINEERING FOR NLP IN PYTHON
     Rounak Banik, Data Scientist

  2. Text sources
     - News articles
     - Tweets
     - Comments

  3. Making text machine friendly
     - Dogs, dog
     - reduction, REDUCING, Reduce
     - don't, do not
     - won't, will not

  4. Text preprocessing techniques
     - Converting words into lowercase
     - Removing leading and trailing whitespaces
     - Removing punctuation
     - Removing stopwords
     - Expanding contractions
     - Removing special characters (numbers, emojis, etc.)
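
Several of the steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the course's implementation: the `CONTRACTIONS` table and the `preprocess` helper are hypothetical names, and a realistic contraction list would be far longer.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table (an assumption for this sketch).
CONTRACTIONS = {"don't": "do not", "won't": "will not", "i'm": "i am"}


def preprocess(text):
    """Lowercase, trim, expand contractions, then drop punctuation and digits."""
    text = text.strip().lower()
    # Expand contractions first, while the apostrophes are still present.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove ASCII punctuation and digits.
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Collapse whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("  Don't REDUCE the 3 dogs!  "))  # do not reduce the dogs
```

Contractions are expanded before punctuation is stripped; doing it the other way round would turn "don't" into "dont" and the lookup would miss it.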

  5. Tokenization
     "I have a dog. His name is Hachi."
     Tokens: ["I", "have", "a", "dog", ".", "His", "name", "is", "Hachi", "."]
     "Don't do this."
     Tokens: ["Do", "n't", "do", "this", "."]
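
To make the splitting rule concrete, here is a toy regex tokenizer that reproduces the two examples above. It is only a sketch: `TOKEN_PATTERN` and `tokenize` are hypothetical names, and spaCy's actual tokenizer uses far richer rules than this.

```python
import re

# Alternatives are tried left to right:
#   \w+(?=n't)  a word stem directly followed by "n't" (e.g. "Do" in "Don't")
#   n't         the negation clitic itself
#   '\w+        other clitics such as "'m" or "'s"
#   \w+         ordinary words
#   [^\w\s]     any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("I have a dog. His name is Hachi."))
print(tokenize("Don't do this."))  # ['Do', "n't", 'do', 'this', '.']
```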

  6. Tokenization using spaCy

         import spacy

         # Load the en_core_web_sm model
         nlp = spacy.load('en_core_web_sm')

         # Initialize string
         string = "Hello! I don't know what I'm doing here."

         # Create a Doc object
         doc = nlp(string)

         # Generate list of tokens
         tokens = [token.text for token in doc]
         print(tokens)

         ['Hello','!','I','do',"n't",'know','what','I',"'m",'doing','here','.']

  7. Lemmatization
     Convert word into its base form
     - reducing, reduces, reduced, reduction → reduce
     - am, are, is → be
     - n't → not
     - 've → have
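
The mappings above can be read as a lookup from surface form to base form. The sketch below does exactly that with a dictionary built only from the slide's examples; `LEMMA_TABLE` and `lemmatize` are hypothetical names, and a real lemmatizer (such as spaCy's) relies on much larger tables and rules.

```python
# Hypothetical lookup table containing only the examples from the slide.
LEMMA_TABLE = {
    "reducing": "reduce", "reduces": "reduce",
    "reduced": "reduce", "reduction": "reduce",
    "am": "be", "are": "be", "is": "be",
    "n't": "not", "'ve": "have",
}


def lemmatize(tokens):
    """Map each token to its base form; unknown tokens are only lowercased."""
    return [LEMMA_TABLE.get(token.lower(), token.lower()) for token in tokens]


print(lemmatize(["Do", "n't", "do", "this", "."]))  # ['do', 'not', 'do', 'this', '.']
```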

  8. Lemmatization using spaCy

         import spacy

         # Load the en_core_web_sm model
         nlp = spacy.load('en_core_web_sm')

         # Initialize string
         string = "Hello! I don't know what I'm doing here."

         # Create a Doc object
         doc = nlp(string)

         # Generate list of lemmas
         lemmas = [token.lemma_ for token in doc]
         print(lemmas)

         ['hello','!','-PRON-','do','not','know','what','-PRON-','be','do','here','.']

  9. Let's practice!

  10. Text cleaning

  11. Text cleaning techniques
      - Unnecessary whitespaces and escape sequences
      - Punctuations
      - Special characters (numbers, emojis, etc.)
      - Stopwords

  12. isalpha()

          "Dog".isalpha()      # True
          "3dogs".isalpha()    # False
          "12347".isalpha()    # False
          "!".isalpha()        # False
          "?".isalpha()        # False

  13. A word of caution
      - Abbreviations: U.S.A, U.K, etc.
      - Proper nouns: word2vec and xto10x.
      - Write your own custom function (using regex) for the more nuanced cases.
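
One possible shape for such a custom function is sketched below. The two patterns and the `is_word_like` helper are assumptions made for illustration: they accept dotted abbreviations and names that start with a letter and contain digits, while still rejecting tokens such as "3dogs" or "12347".

```python
import re

# Dotted abbreviations such as U.S.A, U.K or U.S.A. (hypothetical pattern).
ABBREVIATION = re.compile(r"(?:[A-Za-z]\.)+[A-Za-z]\.?")
# Names that start with a letter and may contain digits, e.g. word2vec.
ALNUM_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_word_like(token):
    """Return True for alphabetic tokens plus the two nuanced cases above."""
    return bool(
        token.isalpha()
        or ABBREVIATION.fullmatch(token)
        or ALNUM_NAME.fullmatch(token)
    )


tokens = ["U.S.A", "word2vec", "3dogs", "!", "Dog"]
print([t for t in tokens if is_word_like(t)])  # ['U.S.A', 'word2vec', 'Dog']
```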

  14. Removing non-alphabetic characters

          string = """
          OMG!!!! This is like the best thing ever \t\n.
          Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
          """

          import spacy

          # Generate list of tokens
          nlp = spacy.load('en_core_web_sm')
          doc = nlp(string)
          lemmas = [token.lemma_ for token in doc]

  15. Removing non-alphabetic characters

          ...
          # Remove tokens that are not alphabetic
          a_lemmas = [lemma for lemma in lemmas
                      if lemma.isalpha() or lemma == '-PRON-']

          # Print string after text cleaning
          print(' '.join(a_lemmas))

          'omg this be like the good thing ever wow such an amazing song -PRON- be hooked top definitely'

  16. Stopwords
      - Words that occur extremely commonly
      - E.g. articles, be verbs, pronouns, etc.

  17. Removing stopwords using spaCy

          # Get list of stopwords
          stopwords = spacy.lang.en.stop_words.STOP_WORDS

          string = """
          OMG!!!! This is like the best thing ever \t\n.
          Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
          """

  18. Removing stopwords using spaCy

          ...
          # Remove stopwords and non-alphabetic tokens
          a_lemmas = [lemma for lemma in lemmas
                      if lemma.isalpha() and lemma not in stopwords]

          # Print string after text cleaning
          print(' '.join(a_lemmas))

          'omg like good thing wow amazing song hooked definitely'
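
The filtering step can be tried without spaCy installed. In the sketch below, `lemmas` is a hand-written list standing in for spaCy's output on the example string, and `STOPWORDS` is a hypothetical seven-word subset chosen for this example; both are assumptions, not values taken from the library.

```python
# Hand-written stand-in for the lemma list spaCy would produce (assumption).
lemmas = [
    "omg", "!", "this", "be", "like", "the", "good", "thing", "ever", ".",
    "wow", ",", "such", "an", "amazing", "song", "!", "-PRON-", "be",
    "hooked", ".", "top", "5", "definitely", ".", "?",
]

# Hypothetical subset of a stop-word list; a real list is much larger.
STOPWORDS = {"this", "be", "the", "ever", "such", "an", "top"}


def clean(lemmas, stopwords):
    """Keep only alphabetic lemmas that are not stop words."""
    return " ".join(
        lemma for lemma in lemmas
        if lemma.isalpha() and lemma not in stopwords
    )


print(clean(lemmas, STOPWORDS))
# omg like good thing wow amazing song hooked definitely
```

Note that `'-PRON-'.isalpha()` is `False` because of the hyphens, so the pronoun placeholder is dropped by the `isalpha()` check itself, before the stop-word test is even reached.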

  19. Other text preprocessing techniques
      - Removing HTML/XML tags
      - Replacing accented characters (such as é)
      - Correcting spelling errors
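
The first two techniques can be sketched with the standard library, as below. `strip_tags` and `strip_accents` are hypothetical helper names; the regex-based tag removal is a simplifying assumption that suits simple markup and is not a substitute for a real HTML parser. Spelling correction is not covered here.

```python
import re
import unicodedata
from html import unescape


def strip_tags(text):
    """Remove anything that looks like a tag, then decode HTML entities."""
    return unescape(re.sub(r"<[^>]+>", "", text))


def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


raw = "<p>Caf&eacute; <b>r\u00e9sum\u00e9</b></p>"
print(strip_accents(strip_tags(raw)))  # Cafe resume
```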

  20. A word of caution
      Always use only those text preprocessing techniques that are relevant to your application.

  21. Let's practice!

  22. Part-of-speech tagging

  23. Applications
      - Word-sense disambiguation
        "The bear is a majestic animal"
        "Please bear with me"
      - Sentiment analysis
      - Question answering
      - Fake news and opinion spam detection

  24. POS tagging
      Assigning every word its corresponding part of speech.
      "Jane is an amazing guitarist."
      POS tagging:
      - Jane → proper noun
      - is → verb
      - an → determiner
      - amazing → adjective
      - guitarist → noun

  25. POS tagging using spaCy

          import spacy

          # Load the en_core_web_sm model
          nlp = spacy.load('en_core_web_sm')

          # Initialize string
          string = "Jane is an amazing guitarist"

          # Create a Doc object
          doc = nlp(string)

  26. POS tagging using spaCy

          ...
          # Generate list of tokens and pos tags
          pos = [(token.text, token.pos_) for token in doc]
          print(pos)

          [('Jane', 'PROPN'), ('is', 'VERB'), ('an', 'DET'), ('amazing', 'ADJ'), ('guitarist', 'NOUN')]
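
One way such a tag list might be turned into numeric features is to count each tag. The sketch below assumes the `pos` list printed above (copied here by hand so it runs without spaCy); `pos_counts` is a hypothetical helper name.

```python
from collections import Counter

# The (token, tag) pairs from the slide output, copied by hand.
pos = [("Jane", "PROPN"), ("is", "VERB"), ("an", "DET"),
       ("amazing", "ADJ"), ("guitarist", "NOUN")]


def pos_counts(tagged):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag in tagged)


counts = pos_counts(pos)
print(counts["PROPN"], counts["NOUN"])  # 1 1
```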

  27. POS annotations in spaCy
      - PROPN → proper noun
      - DET → determiner
      - spaCy annotations at https://spacy.io/api/annotation

  28. Let's practice!

  29. Named entity recognition

  30. Applications
      - Efficient search algorithms
      - Question answering
      - News article classification
      - Customer service

  31. Named entity recognition
      Identifying and classifying named entities into predefined categories.
      Categories include person, organization, country, etc.
      "John Doe is a software engineer working at Google. He lives in France."
      Named entities:
      - John Doe → person
      - Google → organization
      - France → country (geopolitical entity)

  32. NER using spaCy

          import spacy

          string = "John Doe is a software engineer working at Google. He lives in France."

          # Load model and create Doc object
          nlp = spacy.load('en_core_web_sm')
          doc = nlp(string)

          # Generate named entities
          ne = [(ent.text, ent.label_) for ent in doc.ents]
          print(ne)

          [('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]
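
For downstream use it can be convenient to group the extracted entities by label. The sketch below assumes the `ne` list printed above (copied by hand so it runs without spaCy); `group_entities` is a hypothetical helper name.

```python
from collections import defaultdict

# The (text, label) pairs from the slide output, copied by hand.
ne = [("John Doe", "PERSON"), ("Google", "ORG"), ("France", "GPE")]


def group_entities(entities):
    """Collect entity texts under their label."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


print(group_entities(ne))
# {'PERSON': ['John Doe'], 'ORG': ['Google'], 'GPE': ['France']}
```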

  33. NER annotations in spaCy
      - More than 15 categories of named entities
      - NER annotations at https://spacy.io/api/annotation#named-entities

  34. A word of caution
      - Not perfect
      - Performance dependent on training and test data
      - Train models with specialized data for nuanced cases
      - Language specific

  35. Let's practice!
