Basic Natural Language Processing - PowerPoint PPT Presentation



slide-1
SLIDE 1

Basic Natural Language Processing

slide-2
SLIDE 2

Why NLP?

  • Understanding Intent: Search Engines
  • Question Answering: Azure QnA, Bots, Watson
  • Digital Assistants: Cortana, Siri, Alexa
  • Translation Systems: Azure Language Translation, Google Translate
  • News Digest: Flipboard, Facebook, Twitter
  • Other uses: Pollect, Crime mapping, Earthquake prediction
slide-3
SLIDE 3

Understanding human language is hard

NLP requires input from:

  • Linguistics
  • Computer Science
  • Mathematics
  • Statistics
  • Machine Learning
  • Psychology
  • Databases

Human -> Computer: Natural Language (U)nderstanding
Computer -> Human: Natural Language (G)eneration

slide-4
SLIDE 4

THE KEY: Changing uncertainty to certainty

I am changing this sentence to numbers

1 2 3 4 5 6 7

You are changing too many sentences!

8 ? 3 ? 9 ?

“Vectorizing”

Remember: There is no ambiguity with numbers!
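
The word-to-number idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own method: the helper names `tokenize`, `build_vocabulary`, and `vectorize` are hypothetical, ids are assigned in order of first appearance, and any word missing from the vocabulary is returned as `None` (the "?" on the slide). It does not try to reproduce the slide's exact numbering for the second sentence.

```python
import re

def tokenize(sentence):
    # Lowercase and keep only alphabetic words (punctuation is dropped).
    return re.findall(r"[a-z']+", sentence.lower())

def build_vocabulary(sentence):
    # Assign each new word the next integer id, starting at 1.
    vocab = {}
    for word in tokenize(sentence):
        vocab.setdefault(word, len(vocab) + 1)
    return vocab

def vectorize(sentence, vocab):
    # Unknown words map to None, the "?" case on the slide.
    return [vocab.get(word) for word in tokenize(sentence)]

vocab = build_vocabulary("I am changing this sentence to numbers")
first = vectorize("I am changing this sentence to numbers", vocab)
second = vectorize("You are changing too many sentences!", vocab)
```

Only "changing" is shared between the two sentences, so it is the only word in the second sentence that receives a known id under this fixed vocabulary.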

slide-5
SLIDE 5

Challenges in NLP: Syntax vs. Semantics

  • Syntax: Lamb a Mary had little
  • Semantics: Merry hat hey lid tell lam
  • Semantics: Colorless orange liquid
  • Semantics: Address, number, resent
slide-6
SLIDE 6

Challenges in NLP: Ambiguity pt 1

  • CC Attachment: I like swimming in warm lakes and rivers
  • Ellipsis and Parallelism: I gave Steven a shovel and Joseph a ruler
  • Metonymy: Sydney is essential to this class
  • Phonetic: My toes are getting number
  • PP Attachment: You ate spaghetti with meatballs / pleasure / a fork / Jillian
slide-7
SLIDE 7

Challenges in NLP: Ambiguity pt 2

  • Referential: Sharon complimented Lisl. She had been kind all day.
  • Reflexive: Brandon brought himself an apple
  • Sense: Julia took the math quiz
  • Subjectivity: Karen believes that the economy will stay strong
  • Syntactic: Call a dentist for Wayne
slide-8
SLIDE 8

Challenges in NLP: Others

  • Parsing N-grams: United States of America, Hot dog
  • Typos: John Hopkins vs Johns Hopkins
  • Non-standard language: (208)929-6136 vs 208-929-6136, Cause = because
  • Sarcasm: I love rotting apples
slide-9
SLIDE 9
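
Edit Distance: How we Spellcheck

  • Each box is computed from the box above, the box to the left, or the box diagonally up-left
  • Coming from above or from the left always costs +1 (an insertion or a deletion)
  • Coming from the diagonal costs +0 if the letters match, +1 if they don’t (a substitution)
  • Each box takes the smallest of those three totals
  • Score is the box at the bottom-right

          S  T  R  E  N  G  T  H
       0  1  2  3  4  5  6  7  8
    T  1  1  1  2  3  4  5  6  7
    R  2  2  2  1  2  3  4  5  6
    E  3  3  3  2  1  2  3  4  5
    N  4  4  4  3  2  1  2  3  4
    D  5  5  5  4  3  2  2  3  4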
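
The table-filling procedure can be sketched in Python as follows. This is a minimal version assuming the standard Levenshtein costs (a move from above or from the left costs 1, and a diagonal move costs 0 on a match and 1 otherwise). The function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(source, target):
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]

    # First column and first row: distance from or to the empty string.
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # from above: deletion
                table[i][j - 1] + 1,         # from the left: insertion
                table[i - 1][j - 1] + cost,  # diagonal: match or substitution
            )
    # The score is the bottom-right box.
    return table[-1][-1]

score = edit_distance("TREND", "STRENGTH")
```

A spellchecker built on this idea would typically compute the distance from a misspelled word to each dictionary word and suggest the closest ones.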

slide-10
SLIDE 10

Semantic Relationships

  • Measuring how words are related to each other
  • Birdcage will be more similar to Dog Kennel than it will be to Bird
  • There are many systems for drawing out semantic relationships, but WordNet is one of the most commonly used
  • Similarity metric:
  • Sim(V,W) = -ln(pathlength(V,W))
  • Sim(Run, Miracle) = -ln(7)
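
The path-based similarity can be sketched on a toy graph. Everything in `EDGES` below is a hypothetical, hand-made taxonomy for illustration, not real WordNet data, and the sketch assumes that pathlength counts the nodes on the shortest path (edges + 1), which is one common convention.

```python
import math
from collections import defaultdict, deque

# Hypothetical toy taxonomy (NOT real WordNet data).
EDGES = [
    ("birdcage", "enclosure"), ("dog kennel", "enclosure"),
    ("enclosure", "artifact"), ("artifact", "entity"),
    ("bird", "animal"), ("animal", "organism"), ("organism", "entity"),
]

ADJACENT = defaultdict(list)
for a, b in EDGES:
    ADJACENT[a].append(b)
    ADJACENT[b].append(a)

def path_length(start, goal):
    # Breadth-first search; length counts nodes, so identical words give 1.
    queue, seen = deque([(start, 1)]), {start}
    while queue:
        node, length = queue.popleft()
        if node == goal:
            return length
        for neighbor in ADJACENT[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, length + 1))
    raise ValueError("no path between the two words")

def similarity(v, w):
    return -math.log(path_length(v, w))

near = similarity("birdcage", "dog kennel")
far = similarity("birdcage", "bird")
```

In this toy graph, "birdcage" and "dog kennel" share a direct parent, while "birdcage" and "bird" only meet at the root, so the first pair scores higher.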
slide-11
SLIDE 11

Preprocessing: Stopwords and punctuation

Why do we want to get rid of them?

  • “And”, “If”, “But”, “.”, “,”
  • Will almost ALWAYS be your most frequent words
  • Tell you nothing about what’s going on

Don’t get rid of them if you are focused on Natural Language Generation!
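
A small sketch of why stopwords and punctuation dominate raw counts, and how to filter them. The `STOPWORDS` set here is a tiny illustrative list, not a standard one, and the example sentence is made up.

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"and", "if", "but", "the", "is", "are", "a", "of", "to"}
PUNCTUATION = set(string.punctuation)

def tokenize(text):
    # Words become tokens; each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS and t not in PUNCTUATION]

text = "The coffee is good, and the beans are fresh, but the cup is small."
tokens = tokenize(text)
most_common_before = Counter(tokens).most_common(3)
kept = remove_stopwords(tokens)
```

Before filtering, the top counts belong to "the", "is", and the comma. After filtering, only content words remain.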

slide-12
SLIDE 12

Preprocessing: Porter’s Algorithm

Stemming:

  • Strip a word down to its barest form
  • Transformational rule, ex: ‘Alleviation’ – ‘ation’ + ‘ate’ = ‘Alleviate’

Measure:

  • A ‘measure’ of a word is an indication of how many syllables are in it.
  • Consonants = ‘C’, Vowels = ‘V’
  • Every sequence of ‘VC’ is counted as +1
  • Intellectual = (VC)C(VC)C(VC)CV(VC) = 4
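
The measure can be sketched in Python by turning a word into a string of C and V labels and counting the "VC" pairs. The function name `measure` is chosen for illustration, and the treatment of "y" (a vowel only when it follows a consonant) follows Porter's original definition, which the slide does not spell out.

```python
def measure(word):
    word = word.lower()

    def is_consonant(i):
        ch = word[i]
        if ch in "aeiou":
            return False
        if ch == "y":
            # "y" counts as a vowel only when it follows a consonant.
            return i == 0 or not is_consonant(i - 1)
        return True

    pattern = "".join(
        "C" if is_consonant(i) else "V" for i in range(len(word))
    )
    # Each vowel-to-consonant step adds 1 to the measure.
    return pattern.count("VC")

m = measure("intellectual")
```

For "intellectual" the pattern is VCCVCCVCCVVC, which contains four "VC" pairs, matching the slide.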

slide-13
SLIDE 13

Stemming: Sample Rules

  • If m>0:
  • ies -> i: Abilities = Abiliti
  • ational -> ate: National = National, Recreational = Recreate
  • sses -> ss: Sunglasses = Sunglass
  • biliti -> ble: Abiliti = Able
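
A rough sketch of applying suffix rules like the ones above. It is a single pass over four rules, not the full multi-step Porter algorithm. The `RULES` table and the `apply_rules` name are illustrative, and the choice of which rules require m>0 on the remaining stem (here only the "ational" and "biliti" rules) follows Porter's original rule list, which is an assumption relative to the slide.

```python
def measure(stem):
    def is_consonant(i):
        ch = stem[i]
        if ch in "aeiou":
            return False
        if ch == "y":
            return i == 0 or not is_consonant(i - 1)
        return True
    pattern = "".join("C" if is_consonant(i) else "V" for i in range(len(stem)))
    return pattern.count("VC")

# (suffix, replacement, requires m>0 on the stem left after removing suffix)
RULES = [
    ("sses", "ss", False),
    ("ies", "i", False),
    ("ational", "ate", True),
    ("biliti", "ble", True),
]

def apply_rules(word):
    word = word.lower()
    for suffix, replacement, needs_measure in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if needs_measure and measure(stem) == 0:
                return word  # stem too short: leave the word unchanged
            return stem + replacement
    return word

results = {w: apply_rules(w) for w in
           ["abilities", "national", "recreational", "sunglasses"]}
```

"National" is left alone because the stem "n" has measure 0, while "recreational" becomes "recreate" because the stem "recre" has measure 1.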
slide-14
SLIDE 14

Stemming: Example

  • Original Word: “Computational”
  • Computational – ‘ational’ + ‘ate’ = Computate
  • Computate – ‘ate’ = Comput
  • Final Word: “Comput”
  • Original Word: “Computer”
  • Computer – ‘er’ = Comput
  • Final Word: “Comput”
slide-15
SLIDE 15

Sentence Boundary Recognition

Problems with things like Dr., A.M., U.S.A.

Use a decision tree to estimate the boundary

Features:

  • Punctuation
  • Formatting
  • Fonts
  • Spaces
  • Capitalization
  • Known Abbreviations
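
As a rough illustration of how these features interact, here is a hand-written rule-based splitter. It is a stand-in for the learned decision tree the slide describes, not an implementation of one, and it uses only three of the features (punctuation, capitalization, known abbreviations). The abbreviation list is a small made-up sample.

```python
import re

# Small illustrative abbreviation list (an assumption, not exhaustive).
KNOWN_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "a.m.", "p.m.", "u.s.a."}

def split_sentences(text):
    sentences, start = [], 0
    tokens = list(re.finditer(r"\S+", text))
    for i, token in enumerate(tokens):
        word = token.group()
        # Feature 1: punctuation that can end a sentence.
        if not word.endswith((".", "!", "?")):
            continue
        # Feature 2: known abbreviations do not end a sentence.
        if word.lower() in KNOWN_ABBREVIATIONS:
            continue
        # Feature 3: the next word must be capitalized (or the text ends).
        following = tokens[i + 1].group() if i + 1 < len(tokens) else None
        if following is None or following[0].isupper():
            sentences.append(text[start:token.end()].strip())
            start = token.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

parts = split_sentences(
    "Dr. Manette arrived at 9 a.m. on Monday. "
    "He left the U.S.A. years ago. It was late."
)
```

A known weakness of these fixed rules is an abbreviation that really does end a sentence, which is one reason a trained classifier is preferred.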
slide-16
SLIDE 16

N-Gram Modeling

Words that have a separate meaning when combined with other words

The best way to highlight the importance of context

Examples:

  • Unigram: Apple
  • Bigram: Hot Dog
  • Trigram: George Bush Sr.

I’ll meet you in Times {?????}
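
A minimal sketch of extracting n-grams and using bigram counts to fill in the blank. The three-sentence `CORPUS` is made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up toy corpus, for illustration only.
CORPUS = [
    "we walked through times square",
    "times square is crowded tonight",
    "the times reported the story",
]

def train_bigram_counts(sentences):
    following = defaultdict(Counter)
    for sentence in sentences:
        for previous, current in ngrams(sentence.split(), 2):
            following[previous][current] += 1
    return following

def most_likely_next(word, following):
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

counts = train_bigram_counts(CORPUS)
guess = most_likely_next("times", counts)
```

In this toy corpus "times" is followed by "square" twice and "reported" once, so "square" is the guess.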

slide-17
SLIDE 17

Preprocessing Checklist

  • Remove Extraneous Text
  • Convert sentences to lower case
  • Tokenize Sentences
  • Tokenize Words
  • Remove Stopwords & Punctuation
  • Stemming / Lemmatizing
  • Identify N-Grams
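
The checklist can be strung together as one small pipeline. This is only a sketch under several assumptions: "extraneous text" is taken to mean HTML tags, the sentence splitter is a naive regex, `crude_stem` is a deliberately simple stand-in for a real stemmer such as Porter's, and the stopword list is a tiny sample.

```python
import re

STOPWORDS = {"the", "are", "is", "a", "and", "i"}  # tiny illustrative list

def crude_stem(word):
    # Stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text).strip()   # 1. remove extraneous text
    text = text.lower()                            # 2. lower case
    sentences = re.split(r"(?<=[.!?])\s+", text)   # 3. tokenize sentences
    output = []
    for sentence in sentences:
        words = re.findall(r"[a-z0-9']+", sentence)        # 4. tokenize words
        words = [w for w in words if w not in STOPWORDS]   # 5. drop stopwords
        words = [crude_stem(w) for w in words]             # 6. stem
        bigrams = list(zip(words, words[1:]))              # 7. n-grams (n=2)
        output.append({"tokens": words, "bigrams": bigrams})
    return output

result = preprocess("<p>The dogs are barking. I love hot dogs!</p>")
```

The word-tokenizing regex keeps only letters, digits, and apostrophes, so punctuation is dropped in step 4 rather than in a separate pass.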

slide-18
SLIDE 18

Words to Numbers

  • Corpus creation: create a library of all words in the original dataset
  • Vectorizing: changing words to numbers, often a raw count
  • TF-IDF: Term Frequency × Inverse Document Frequency
  • Example:
  • “This” is mentioned 3 times in a given review, and the review has 27 words in it
  • Term frequency = 3 / 27 = 1/9
  • TF-IDF multiplies this by the inverse document frequency, which shrinks toward zero for words that appear in most documents
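
A minimal sketch of the calculation, using the 3-out-of-27 example. The three documents are made up, and the inverse document frequency is taken as ln(N / df), which is one common variant among several.

```python
import math

def term_frequency(term, document):
    # Share of the document's words that are this term.
    return document.count(term) / len(document)

def inverse_document_frequency(term, documents):
    # ln(N / df): an assumed variant; libraries often add smoothing.
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing) if containing else 0.0

def tf_idf(term, document, documents):
    return term_frequency(term, document) * inverse_document_frequency(
        term, documents
    )

# Made-up corpus: the first review has 27 words, 3 of which are "this".
review = ["this"] * 3 + ["coffee"] * 24
documents = [review, ["this", "tea"], ["great", "coffee"]]

tf = term_frequency("this", review)
score = tf_idf("this", review, documents)
```

Here the term frequency is 1/9, and because "this" appears in 2 of the 3 documents the final score is 1/9 × ln(3/2). A word that appeared in every document would score 0.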
slide-19
SLIDE 19

Bayes Theorem

P(A|B) = P(B|A) P(A) / P(B)
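
A quick numeric sketch of the theorem. The three probabilities are hypothetical values picked for illustration (A = "the review is positive", B = "the review contains the word 'bean'") and do not come from any dataset.

```python
def posterior(p_b_given_a, p_a, p_b):
    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    if p_b == 0:
        raise ValueError("P(B) must be greater than zero")
    return p_b_given_a * p_a / p_b

# Hypothetical values, for illustration only.
p_positive = 0.6             # P(A)
p_bean_given_positive = 0.3  # P(B|A)
p_bean = 0.2                 # P(B)

p_positive_given_bean = posterior(p_bean_given_positive, p_positive, p_bean)
```

With these numbers, seeing the word raises the estimated probability of a positive review from 0.6 to about 0.9.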

slide-20
SLIDE 20

Predicting the next { … }

Example from Charles Dickens:

  • P(“Darnay looked at Dr. Manette”)
  • Use maximum likelihood estimates for the n-gram probabilities
  • Unigram: P(w) = c(w) / V, where V is the total number of words in the corpus
  • Bigram: P(w2 | w1) = c(w1, w2) / c(w1)
  • Values
  • P(“Darnay”) = 533 / 598633 = .00089
  • P(“looked” | “Darnay”) = 3 / 676 = .0044
  • P(“at” | “looked”) = 77 / 312 = .247
  • P(“Dr. Manette” | “at”) = 2 / 4512 = .000443
  • Bigram probability (the product of the values above)
  • P(“Darnay looked at Dr. Manette”) ≈ 4.28 × 10^-10
  • P(“at Dr. Manette Darnay looked”) = 0, because it contains a word pair never seen in the text
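
The chain of multiplications can be checked with a short script. The counts are copied as printed on the slide, and "Dr. Manette" is treated as a single token as the slide does. The data layout and the name `sentence_probability` are choices made for this sketch.

```python
TOTAL_WORDS = 598633

# Counts as printed on the slide.
UNIGRAM_COUNT = {"Darnay": 533, "at": 4512}

# (previous word, next word) -> (pair count, count used as the denominator)
BIGRAM_COUNT = {
    ("Darnay", "looked"): (3, 676),
    ("looked", "at"): (77, 312),
    ("at", "Dr. Manette"): (2, 4512),
}

def sentence_probability(tokens):
    # Start with the unigram probability of the first word.
    probability = UNIGRAM_COUNT.get(tokens[0], 0) / TOTAL_WORDS
    # Multiply by each bigram probability; unseen pairs contribute 0.
    for previous, current in zip(tokens, tokens[1:]):
        pair, context = BIGRAM_COUNT.get((previous, current), (0, 1))
        probability *= pair / context
    return probability

seen = sentence_probability(["Darnay", "looked", "at", "Dr. Manette"])
unseen = sentence_probability(["at", "Dr. Manette", "Darnay", "looked"])
```

Multiplying the unrounded fractions gives roughly 4.3 × 10^-10, slightly above the figure obtained from the rounded values. The reordered sentence scores exactly 0 because the pair ("Dr. Manette", "Darnay") has no count, which is the usual motivation for smoothing.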
slide-21
SLIDE 21

The Bag of Words Approach

  • P(Positive Review | Words Contained)
  • Look at the unordered words of a document to determine underlying characteristics
  • Coffee reviews with the word ‘bean’ tend to be far more positive
  • Common in sentiment and feature analysis
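
One common way to estimate P(Positive Review | Words Contained) from unordered words is a Naive Bayes classifier. The sketch below is a minimal version with add-one smoothing. The four training reviews are made up, and the slide does not say which classifier the author had in mind, so treat the choice of Naive Bayes as an assumption.

```python
import math
from collections import Counter

# Made-up training reviews, for illustration only.
TRAINING = [
    ("great bean flavor rich aroma", "positive"),
    ("fresh bean smooth taste", "positive"),
    ("bitter stale taste", "negative"),
    ("weak watery flavor", "negative"),
]

def train(examples):
    word_counts, label_counts = {}, Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(text, model):
    word_counts, label_counts, vocabulary = model
    total_documents = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(label_counts[label] / total_documents)
        label_total = sum(word_counts[label].values())
        for word in text.split():
            if word in vocabulary:
                score += math.log(
                    (word_counts[label][word] + 1)
                    / (label_total + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING)
label = predict("bean aroma", model)
```

Because the model only sums per-word scores, "bean aroma" and "aroma bean" receive the same prediction, which is what "bag of words" means in practice.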