Natural Language Processing: Artificial Intelligence @ Allegheny



slide-1
SLIDE 1

Natural Language Processing

Artificial Intelligence @ Allegheny College
Janyl Jumadinova
March 6, 2020 (Lab Discussion)

Credit: NLP Stanford

slide-2
SLIDE 2

NLP

Natural Language Processing: understand, interpret, and manipulate natural language

slide-3
SLIDE 3

Question Answering: IBM’s Watson


slide-4
SLIDE 4

Information Extraction


slide-5
SLIDE 5

Sentiment Extraction

2016 Election

Source: Washington Post

slide-6
SLIDE 6

Machine Translation


slide-7
SLIDE 7

Language Technology


slide-9
SLIDE 9

Ambiguity makes NLP hard

  • Teacher Strikes Idle Kids
  • Red Tape Holds Up New Bridges
  • Juvenile Court to Try Shooting Defendant
  • Local High School Dropouts Cut in Half

slide-10
SLIDE 10

Other NLP Difficulties


slide-12
SLIDE 12

Progress

What tools do we need?

  • Knowledge about language
  • Knowledge about the world
  • A way to combine knowledge sources

How we generally do this:

  • Probabilistic models built from language data
  • P(“maison” → “house”) → high
  • P(“L’avocat général” → “the general avocado”) → low
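One way to picture "probabilistic models built from language data" is a relative-frequency estimate over word-aligned translation pairs. The sketch below is only an illustration: the function name `estimate_translation_probs` and the toy pair counts are invented for this example and do not come from a real corpus or from the lab materials.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(aligned_pairs):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    probs = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        probs[src][tgt] = count / source_counts[src]
    return probs


# Hypothetical toy data (made up for illustration, not measured from any corpus).
toy_pairs = (
    [("maison", "house")] * 9
    + [("maison", "home")] * 1
    + [("avocat", "lawyer")] * 7
    + [("avocat", "avocado")] * 3
)

probs = estimate_translation_probs(toy_pairs)
print(probs["maison"]["house"])    # high relative frequency
print(probs["avocat"]["avocado"])  # lower relative frequency
```

Real systems condition on much more context than a single word, which is how they can prefer "lawyer" over "avocado" next to "général".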

slide-13
SLIDE 13

Basic Text Processing

Word tokenization

Every NLP task needs to do text normalization:

  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text

slide-14
SLIDE 14

How Many Words?

  • N: all words (the number of word tokens in the text)
  • V: distinct words (the vocabulary of word types)
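The difference between N and V can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization and lowercasing; the function name `count_words` is hypothetical.

```python
def count_words(text):
    """Return (N, size of V): number of tokens and number of distinct types."""
    tokens = text.lower().split()  # naive whitespace tokenization (an assumption)
    vocabulary = set(tokens)       # V: the distinct words
    return len(tokens), len(vocabulary)


n_tokens, n_types = count_words("the cat sat on the mat and the dog sat too")
print(n_tokens, n_types)
```

Repeated words such as "the" and "sat" count every time toward N but only once toward V.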

slide-15
SLIDE 15

Basic Text Processing

Normalization

Every NLP task needs to do text normalization:

  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text

slide-18
SLIDE 18

Issues in Tokenization

  • Finland’s capital → Finland? Finlands? Finland’s?
  • what’re, I’m, isn’t → what are, I am, is not
  • Hewlett-Packard → Hewlett Packard?
  • state-of-the-art → state of the art?
  • Lowercase → lower-case? lowercase? lower case?
  • San Francisco → one token or two?

Language Issues: French, German, Japanese, Chinese, ...

Normalization: merging different forms of a token into a canonical normalized form.

  • ex.: “Mr.”, “Mr”, “mister”, and “Mister” into a single form.
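A minimal sketch of normalization, assuming simple lookup tables: the dictionaries `CANONICAL` and `CONTRACTIONS` and the function `normalize` are hypothetical names, and the tables cover only the examples from the slide. Punctuation attached to other words is not handled here.

```python
# Map surface variants to one canonical form (entries taken from the slide's example).
CANONICAL = {"mr.": "mr", "mr": "mr", "mister": "mr"}

# Expand the contractions listed on the slide.
CONTRACTIONS = {"what're": "what are", "i'm": "i am", "isn't": "is not"}


def normalize(text):
    """Lowercase, expand known contractions, and map variants to a canonical token."""
    text = text.lower().replace("\u2019", "'")  # treat curly and straight apostrophes alike
    tokens = []
    for raw in text.split():
        expanded = CONTRACTIONS.get(raw, raw)
        for token in expanded.split():
            tokens.append(CANONICAL.get(token, token))
    return tokens


print(normalize("Mister Smith isn't here"))
print(normalize("Mr. Smith") == normalize("mr Smith"))
```

Lowercasing everything is itself a design choice: it merges "Mister" and "mister" but also discards case distinctions that some tasks need.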

slide-19
SLIDE 19

Basic Text Processing

Stemming

Every NLP task needs to do text normalization:

  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text

slide-20
SLIDE 20

Stemming

  • Reduce terms to their stems in information retrieval
  • Stemming is crude chopping of affixes
  • Stemming is language dependent
  • Example: automate(s), automatic, automation all reduced to automat.
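The "crude chopping of affixes" can be sketched as a toy suffix stripper. This is not Porter's algorithm; the suffix list and the name `crude_stem` are assumptions chosen only so that the slide's example works.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ion", "es", "ic", "e", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["automate", "automates", "automatic", "automation"]:
    print(w, "->", crude_stem(w))
```

A stripper this blunt also over-stems unrelated words, which is part of why real stemmers apply ordered rule sets with conditions on the remaining stem.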

slide-21
SLIDE 21

Porter’s Algorithm

Most common English stemmer.


slide-24
SLIDE 24

Sentence Segmentation

!, ? are relatively unambiguous

Period “.” is quite ambiguous:

  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • Numbers like .02 or 4.3

Build a binary classifier that decides, for each “.”, whether it ends a sentence:

  • Classifiers: hand-written rules, regular expressions, or machine learning
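A hand-written-rules version of such a classifier might look like the sketch below. The abbreviation list and the function names `is_sentence_end` and `split_sentences` are assumptions for illustration; a real system would need a much larger abbreviation list or a trained model.

```python
import re

# Hypothetical, deliberately small abbreviation list.
ABBREVIATIONS = {"inc", "dr", "mr", "mrs", "ms", "prof", "etc"}


def is_sentence_end(text, i):
    """Binary decision for the period at text[i]: True if it ends a sentence."""
    # Rule 1: a period directly followed by a digit is a decimal point (4.3, .02).
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False
    # Rule 2: a period right after a known abbreviation is not a boundary.
    match = re.search(r"(\w+)$", text[:i])
    if match and match.group(1).lower() in ABBREVIATIONS:
        return False
    return True


def split_sentences(text):
    """Split on !, ?, and on periods that the classifier accepts as boundaries."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in "!?" or (ch == "." and is_sentence_end(text, i)):
            sentences.append(text[start : i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith paid 4.3 dollars to Acme Inc. yesterday. Was it worth it? Yes!"))
```

These rules fail when an abbreviation really does end a sentence, which is one reason machine-learned classifiers are used for this task.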

slide-25
SLIDE 25

Information Extraction (IE)

  • Find and understand limited relevant parts of texts
  • Gather information from many pieces of text
  • Produce a structured representation of relevant information

slide-27
SLIDE 27

Information Extraction

Goals:

  • Organize information so that it is useful to people
  • Put information in a semantically precise form that allows further inferences to be made by computer algorithms

Roughly: Who did what to whom when?

slide-28
SLIDE 28

Low-level information extraction


slide-29
SLIDE 29

Named Entity Recognition (NER)

A very important sub-task: find and classify names in text


slide-31
SLIDE 31

Named Entity Recognition (NER)

The uses:

  • Named entities can be indexed, linked, etc.
  • Sentiment can be attributed to companies or products
  • A lot of IE relations are associations between named entities
  • For question answering, answers are often named entities

slide-33
SLIDE 33

Named Entity Recognition (NER)

  • Data {(c, d)} of paired observations d and hidden classes c
  • Features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict
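To make "features" concrete, the sketch below computes a few pieces of evidence about one token d that a classifier could weigh when predicting its class c (for example PERSON or ORGANIZATION). The feature names and the function `token_features` are hypothetical choices for illustration, not the feature set of any particular NER system.

```python
def token_features(tokens, i):
    """Return a dict of simple evidence about tokens[i] and its left neighbor."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    return {
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        "prev_word": prev_word.lower(),
    }


tokens = "Dr. Jumadinova teaches at Allegheny College".split()
print(token_features(tokens, 4))
```

A trained model would then learn a weight for each (feature, class) combination, so that, say, a capitalized word following "at" counts as evidence for a location or organization.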

slide-34
SLIDE 34

Parts of Speech (POS)


slide-36
SLIDE 36

POS Tagging

Words often have more than one POS:

  • The back door (adjective, JJ)
  • On my back (noun, NN)
  • Win the voters back (adverb, RB)
  • Promised to back the bill (verb, VB)

The POS tagging problem is to determine the POS tag for a particular instance of a word.

slide-37
SLIDE 37

POS Tagging

Input: Plays well with others
Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
Output: Plays/VBZ well/RB with/IN others/NNS

Tags come from the Penn Treebank tag set.
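The ambiguity line above implies several candidate tag sequences for a four-word input. The sketch below enumerates them and checks that the slide's output is one of them; the helper names `all_taggings` and `parse_tagged` are hypothetical, and no actual tagger is implemented here.

```python
from itertools import product

# Candidate Penn Treebank tags per word, copied from the slide's ambiguity line.
candidates = [
    ("Plays", ["NNS", "VBZ"]),
    ("well", ["UH", "JJ", "NN", "RB"]),
    ("with", ["IN"]),
    ("others", ["NNS"]),
]


def all_taggings(candidates):
    """Enumerate every combination of one candidate tag per word."""
    words = [word for word, _ in candidates]
    tag_options = [tags for _, tags in candidates]
    return [list(zip(words, tags)) for tags in product(*tag_options)]


def parse_tagged(tagged):
    """Parse 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in tagged.split()]


taggings = all_taggings(candidates)
chosen = parse_tagged("Plays/VBZ well/RB with/IN others/NNS")
print(len(taggings), chosen in taggings)
```

A tagger's job is to pick the one sequence that fits the context; the number of candidates multiplies with sentence length, so real taggers score sequences rather than enumerate them.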

slide-38
SLIDE 38

Sentiment Analysis


slide-39
SLIDE 39

Sentiment Analysis

  • https://www.nltk.org/howto/sentiment.html
  • https://nlp.stanford.edu/sentiment/
  • https://textblob.readthedocs.io/en/dev/

slide-40
SLIDE 40

Sentiment analysis has many other names

  • Opinion extraction
  • Opinion mining
  • Sentiment mining
  • Subjectivity analysis

slide-41
SLIDE 41

Sentiment Analysis

Sentiment analysis is the detection of attitudes: “enduring, affectively colored beliefs, dispositions towards objects or persons”

slide-44
SLIDE 44

Attitudes

Holder (source) of attitude

Target (aspect) of attitude

Type of attitude

  • From a set of types: like, love, hate, value, desire, etc.
  • Or (more commonly) simple weighted polarity: positive, negative, neutral, together with strength

Text containing the attitude

  • Sentence or entire document

slide-47
SLIDE 47

Sentiment analysis

  • Simplest task: Is the attitude of this text positive or negative?
  • More complex: Rank the attitude of this text from 1 to 5
  • Advanced: Detect the target, source, or complex attitude types
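The simplest task can be sketched with a word-list approach. The two tiny lexicons and the function `polarity` below are made up for illustration; tools such as NLTK or TextBlob use far larger lexicons or trained models.

```python
# Hypothetical toy lexicons, not taken from any published resource.
POSITIVE = {"good", "great", "love", "excellent", "wonderful"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "boring"}


def polarity(text):
    """Label text by counting positive words minus negative words."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("I love this great movie!"))
print(polarity("What a boring, terrible plot."))
```

Counting words ignores negation ("not good") and says nothing about the holder or target of the attitude, which is what the more advanced tasks address.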

slide-48
SLIDE 48

NLTK

$ python3
>>> import nltk
>>> nltk.download()

slide-51
SLIDE 51

NLTK Basic Pre-Processing

Tokenize using Python

  1. urllib module to crawl the webpage
  2. BeautifulSoup to strip the HTML tags from the text
  3. convert the text into tokens using the split() function

Remove Stop Words

  1. get English stop words from nltk
  2. remove stop words before plotting

Frequency Analysis

  1. nltk’s FreqDist to calculate the frequency distribution
  2. plot function to produce a graph
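The three stages above can be sketched end to end. To keep the example self-contained and runnable offline, it swaps in standard-library stand-ins: an inline HTML string instead of a page fetched with urllib, `html.parser` instead of BeautifulSoup, a small hand-written stop-word set instead of NLTK's English list, and `collections.Counter` instead of `FreqDist`. The comments note where the lab's tools would go; all function and variable names here are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text between tags (stand-in for BeautifulSoup's get_text())."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Stand-in for nltk.corpus.stopwords.words("english"); deliberately tiny.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def word_frequencies(html):
    """Tokenize with split(), drop stop words, and count what remains."""
    tokens = html_to_text(html).lower().split()
    kept = [token for token in tokens if token not in STOP_WORDS]
    return Counter(kept)  # with NLTK: nltk.FreqDist(kept), then .plot(...)


# With urllib: html = urllib.request.urlopen(url).read()
sample_html = (
    "<html><body><h1>Language</h1>"
    "<p>The study of language is the study of meaning</p></body></html>"
)
freq = word_frequencies(sample_html)
print(freq.most_common(3))
```

Removing stop words before counting keeps function words like "the" and "of" from dominating the frequency plot.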

slide-52
SLIDE 52

WRITERS’ ROOM Lab 03 motivation

Credit: Casey Fiesler
