Introduction to NLP feature engineering


  1. Introduction to NLP feature engineering
     FEATURE ENGINEERING FOR NLP IN PYTHON
     Rounak Banik, Data Scientist

  2. Numerical data
     Iris dataset
     sepal length  sepal width  petal length  petal width  class
     6.3           2.9          5.6           1.8          Iris-virginica
     4.9           3.0          1.4           0.2          Iris-setosa
     5.6           2.9          3.6           1.3          Iris-versicolor
     6.0           2.7          5.1           1.6          Iris-versicolor
     7.2           3.6          6.1           2.5          Iris-virginica

  3. One-hot encoding
     sex
     female
     male
     female
     male
     female
     ...

  4. One-hot encoding
     sex          one-hot encoding
     female   →
     male     →
     female   →
     male     →
     female   →
     ...          ...

  5. One-hot encoding
     sex          sex_female  sex_male
     female   →   1           0
     male     →   0           1
     female   →   1           0
     male     →   0           1
     female   →   1           0
     ...          ...         ...

  6. One-hot encoding with pandas
     # Import the pandas library
     import pandas as pd
     # Perform one-hot encoding on the 'sex' feature of df
     df = pd.get_dummies(df, columns=['sex'])
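The `get_dummies` call on the slide can be tried on a small frame. This is a minimal sketch: the sample rows mirror the `sex` column shown earlier, and the `encoded` name and the `dtype=int` argument are assumptions added here so that the result shows 0/1 values like the slide (recent pandas versions return booleans by default).

```python
import pandas as pd

# Hypothetical sample data mirroring the 'sex' column from the slides
df = pd.DataFrame({"sex": ["female", "male", "female", "male", "female"]})

# dtype=int requests 0/1 columns; without it, newer pandas returns True/False
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)

print(encoded)
```

The original `sex` column is dropped and replaced by one column per category, `sex_female` and `sex_male`.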

  7. Textual data
     Movie Review Dataset
     review                                               class
     This movie is for dog lovers. A very poignant ...    positive
     The movie is forgettable. The plot lacked ...        negative
     A truly amazing movie about dogs. A gripping ...     positive

  8. Text pre-processing
     Converting to lowercase
       Example: Reduction to reduction
     Converting to base-form
       Example: reduction to reduce
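The two pre-processing steps can be sketched in plain Python. This is only an illustration: `BASE_FORMS` is a hypothetical hand-written lookup standing in for a real base-form converter, and `preprocess` is a name introduced here, not something from the slides.

```python
# Hypothetical lookup table standing in for a real base-form converter
BASE_FORMS = {"reduction": "reduce"}


def preprocess(text):
    """Lowercase the text, split on whitespace, and map known words to a base form."""
    tokens = text.lower().split()
    # Words missing from the lookup are kept unchanged
    return [BASE_FORMS.get(token, token) for token in tokens]


print(preprocess("Reduction helps"))  # ['reduce', 'helps']
```

Lowercasing first means the lookup only needs lowercase keys, which is why the two steps are usually applied in this order.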

  9. Vectorization
     review                                               class
     This movie is for dog lovers. A very poignant ...    positive
     The movie is forgettable. The plot lacked ...        negative
     A truly amazing movie about dogs. A gripping ...     positive

  10. Vectorization
      0     1     2     ...  n     class
      0.03  0.71  0.00  ...  0.22  positive
      0.45  0.00  0.03  ...  0.19  negative
      0.14  0.18  0.00  ...  0.45  positive
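To make the idea of turning reviews into numeric columns concrete, here is a minimal bag-of-words sketch in plain Python. It is an assumption-laden simplification: it produces raw word counts, whereas the fractional values on the slide suggest a weighted scheme that the slides do not name. The helper names `tokenize` and `bag_of_words` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            row[index[tok]] += 1
        vectors.append(row)
    return vocab, vectors


reviews = ["This movie is for dog lovers.", "The movie is forgettable."]
vocab, vectors = bag_of_words(reviews)
print(vocab)
print(vectors)
```

Each review becomes a row with one column per vocabulary word, which is the same shape as the table on the slide.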

  11. Basic features
      Number of words
      Number of characters
      Average length of words
      Tweets

  12. POS tagging
      Word   POS
      I      Pronoun
      have   Verb
      a      Article
      dog    Noun

  13. Named Entity Recognition
      Does noun refer to person, organization or country?
      Noun       NER
      Brian      Person
      DataCamp   Organization

  14. Concepts covered
      Text Preprocessing
      Basic Features
      Word Features
      Vectorization

  15. Let's practice!

  16. Basic feature extraction

  17. Number of characters
      "I don't know." # 13 characters
      # Compute the number of characters
      text = "I don't know."
      num_char = len(text)
      # Print the number of characters
      print(num_char)
      13
      # Create a 'num_chars' feature
      df['num_chars'] = df['review'].apply(len)

  18. Number of words
      # Split the string into words
      text = "Mary had a little lamb."
      words = text.split()
      # Print the list containing words
      print(words)
      ['Mary', 'had', 'a', 'little', 'lamb.']
      # Print number of words
      print(len(words))
      5

  19. Number of words
      # Function that returns number of words in string
      def word_count(string):
          # Split the string into words
          words = string.split()
          # Return length of words list
          return len(words)
      # Create num_words feature in df
      df['num_words'] = df['review'].apply(word_count)

  20. Average word length
      # Function that returns average word length
      def avg_word_length(x):
          # Split the string into words
          words = x.split()
          # Compute length of each word and store in a separate list
          word_lengths = [len(word) for word in words]
          # Compute average word length
          avg_word_length = sum(word_lengths)/len(words)
          # Return average word length
          return(avg_word_length)

  21. Average word length
      # Create a new feature avg_word_length
      df['avg_word_length'] = df['review'].apply(avg_word_length)
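The three basic features can be computed together on a small frame. This is a sketch under stated assumptions: the two sample reviews are the strings used on the earlier slides, and the guard for empty strings in `avg_word_length` is an addition made here to avoid a division by zero, not part of the slide code.

```python
import pandas as pd


def word_count(text):
    """Number of whitespace-separated words."""
    return len(text.split())


def avg_word_length(text):
    """Mean length of whitespace-separated words; 0.0 for empty text (added guard)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


# Hypothetical two-row dataset reusing the example strings from the slides
df = pd.DataFrame({"review": ["I don't know.", "Mary had a little lamb."]})

df["num_chars"] = df["review"].apply(len)
df["num_words"] = df["review"].apply(word_count)
df["avg_word_length"] = df["review"].apply(avg_word_length)

print(df)
```

Note that `split()` keeps punctuation attached to words, so "lamb." counts as five characters.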

  22. Special features

  23. Hashtags and mentions
      # Function that returns number of hashtags
      def hashtag_count(string):
          # Split the string into words
          words = string.split()
          # Create a list of hashtags
          hashtags = [word for word in words if word.startswith('#')]
          # Return number of hashtags
          return len(hashtags)
      hashtag_count("@janedoe This is my first tweet! #FirstTweet #Happy")
      2
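The slide shows only the hashtag counter. A mention counter can plausibly follow the same pattern; the sketch below assumes mentions are words starting with `@`, and `mention_count` is a name introduced here for illustration.

```python
def mention_count(string):
    """Count whitespace-separated words that start with '@' (assumed mention marker)."""
    words = string.split()
    mentions = [word for word in words if word.startswith("@")]
    return len(mentions)


print(mention_count("@janedoe This is my first tweet! #FirstTweet #Happy"))  # 1
```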

  24. Other features
      Number of sentences
      Number of paragraphs
      Words starting with an uppercase
      All-capital words
      Numeric quantities
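The features listed above can be approximated with simple heuristics. The sketch below is one possible reading, not the course's implementation: all function names are hypothetical, sentences are assumed to end with `.`, `!` or `?` followed by whitespace or the end of the text, and paragraphs are assumed to be separated by blank lines.

```python
import re


def sentence_count(text):
    """Count terminators (., !, ?) followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text))


def paragraph_count(text):
    """Count non-empty blocks separated by blank lines."""
    return len([p for p in re.split(r"\n\s*\n", text) if p.strip()])


def capitalized_word_count(text):
    """Count words whose first character is an uppercase letter."""
    return sum(1 for word in text.split() if word[0].isupper())


def all_caps_word_count(text):
    """Count words longer than one character written entirely in capitals."""
    return sum(1 for word in text.split() if len(word) > 1 and word.isupper())


def numeric_count(text):
    """Count integer or decimal quantities such as 2 or 3.50."""
    return len(re.findall(r"\d+(?:\.\d+)?", text))


sample = "WOW! I paid 3.50 dollars for 2 tickets. Was it worth it?"
print(sentence_count(sample), capitalized_word_count(sample),
      all_caps_word_count(sample), numeric_count(sample))
```

The lookahead in `sentence_count` keeps the decimal point in "3.50" from being counted as a sentence end. Abbreviations such as "e.g." would still be miscounted by this heuristic.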

  25. Let's practice!

  26. Readability tests

  27. Overview of readability tests
      Determine readability of an English passage
      Scale ranging from primary school up to college graduate level
      A mathematical formula utilizing word, syllable and sentence count
      Used in fake news and opinion spam detection

  28. Readability test examples
      Flesch reading ease
      Gunning fog index
      Simple Measure of Gobbledygook (SMOG)
      Dale-Chall score

  29. Readability test examples
      Flesch reading ease
      Gunning fog index
      Simple Measure of Gobbledygook (SMOG)
      Dale-Chall score

  30. Flesch reading ease
      One of the oldest and most widely used tests
      Dependent on two factors:
        The greater the average sentence length, the harder the text is to read
          "This is a short sentence."
          "This is a longer sentence with more words and it is harder to follow than the first sentence."
        The greater the average number of syllables in a word, the harder the text is to read
          "I live in my home."
          "I reside in my domicile."
      The higher the score, the greater the readability

  31. Flesch reading ease score interpretation
      Reading ease score   Grade level
      90-100               5
      80-90                6
      70-80                7
      60-70                8-9
      50-60                10-12
      30-50                College
      0-30                 College Graduate
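The slides describe the two factors behind the score but do not state the formula. The sketch below uses the commonly published Flesch reading ease formula and maps a score to the grade levels in the table above. The function names are hypothetical, the counts are passed in directly because syllable counting is out of scope here, and assigning boundary scores (for example exactly 90) to the easier band is an assumption, since the table's ranges overlap at their edges.

```python
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Commonly published formula: longer sentences and more syllables lower the score."""
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# (lower bound, grade level) pairs taken from the interpretation table, easiest first
GRADE_BANDS = [
    (90, "5"), (80, "6"), (70, "7"), (60, "8-9"),
    (50, "10-12"), (30, "College"), (0, "College Graduate"),
]


def flesch_grade_level(score):
    """Return the table's grade level; scores below 0 fall back to the hardest band (assumption)."""
    for lower_bound, grade in GRADE_BANDS:
        if score >= lower_bound:
            return grade
    return "College Graduate"


# Hypothetical passage: 100 words, 5 sentences, 150 syllables
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3), flesch_grade_level(score))
```

For this hypothetical passage the score is about 59.6, which lands in the 50-60 band.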

  32. Gunning fog index
      Developed in 1954
      Also dependent on average sentence length
      The greater the percentage of complex words, the harder the text is to read
      The higher the index, the lower the readability

  33. Gunning fog index interpretation
      Fog index   Grade level
      17          College graduate
      16          College senior
      15          College junior
      14          College sophomore
      13          College freshman
      12          High school senior
      11          High school junior
      10          High school sophomore
      9           High school freshman
      8           Eighth grade
      7           Seventh grade
      6           Sixth grade
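As with Flesch, the slides give the ingredients of the Gunning fog index but not the formula. The sketch below uses the commonly published formula and looks up the grade level from the table above. The function names are hypothetical, the counts are supplied by the caller (the definition of a "complex" word is left outside the sketch), and rounding the index to the nearest table row and clamping it to the 6 to 17 range are assumptions.

```python
def gunning_fog(num_words, num_sentences, num_complex_words):
    """Commonly published formula: 0.4 * (avg sentence length + percentage of complex words)."""
    words_per_sentence = num_words / num_sentences
    percent_complex = 100 * num_complex_words / num_words
    return 0.4 * (words_per_sentence + percent_complex)


# Grade levels taken from the interpretation table
FOG_GRADES = {
    17: "College graduate", 16: "College senior", 15: "College junior",
    14: "College sophomore", 13: "College freshman", 12: "High school senior",
    11: "High school junior", 10: "High school sophomore", 9: "High school freshman",
    8: "Eighth grade", 7: "Seventh grade", 6: "Sixth grade",
}


def fog_grade_level(index):
    """Round to the nearest table row and clamp to the table's range (both assumptions)."""
    nearest = min(17, max(6, round(index)))
    return FOG_GRADES[nearest]


# Hypothetical passage: 100 words, 5 sentences, 10 complex words
index = gunning_fog(100, 5, 10)
print(round(index, 2), fog_grade_level(index))
```

For this hypothetical passage the index is 12, which the table reads as high school senior level.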

  34. The textatistic library
      # Import the Textatistic class
      from textatistic import Textatistic
      # Create a Textatistic Object
      readability_scores = Textatistic(text).scores
      # Generate scores
      print(readability_scores['flesch_score'])
      print(readability_scores['gunningfog_score'])
      21.14
      16.26

  35. Let's practice!
