Introduction to Artificial Intelligence
CoreNLP, Semantic Analysis, Naive Bayes Classifier

Janyl Jumadinova
November 18, 2016
CoreNLP

◮ Reference: http://stanfordnlp.github.io/CoreNLP/
◮ Package available in /opt/corenlp/
◮ Run:
  java -cp "/opt/corenlp/stanford-corenlp-3.7.0/*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -file input.txt
CoreNLP Annotators

http://stanfordnlp.github.io/CoreNLP/annotators.html

◮ tokenize: Creates tokens from the given text.
◮ ssplit: Separates a sequence of tokens into sentences.
◮ pos: Creates part-of-speech (POS) tags for tokens.
◮ ner: Performs named entity recognition (NER) classification.
CoreNLP Annotators

http://stanfordnlp.github.io/CoreNLP/annotators.html

◮ lemma: Creates word lemmas for tokens.
  – The goal of lemmatization (as with stemming) is to reduce related forms of a word to a common base form.
  – Lemmatization usually uses a vocabulary and morphological analysis of words to:
    - remove inflectional endings only, and
    - return the base or dictionary form of a word, which is known as the lemma.
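To make the contrast concrete, here is a toy sketch (not CoreNLP's implementation) comparing a dictionary-based lemmatizer, which maps known inflected forms to their dictionary form, with a crude suffix-stripping stemmer; the tiny `LEMMA_DICT` vocabulary is made up for illustration.

```python
# Toy lemmatizer vs. naive stemmer (illustration only, not CoreNLP's code).

# A lemmatizer consults a vocabulary of known inflected forms.
LEMMA_DICT = {
    "am": "be", "are": "be", "is": "be", "was": "be", "were": "be",
    "better": "good", "running": "run", "ran": "run", "cars": "car",
}

def lemmatize(word):
    """Return the dictionary form (lemma) if known, else the word itself."""
    return LEMMA_DICT.get(word.lower(), word.lower())

def stem(word):
    """Naive stemmer: blindly chop common inflectional endings."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(lemmatize("better"))                    # "good" -- a stemmer cannot recover this
print(lemmatize("running"), stem("running"))  # "run runn"
```

Note how morphological analysis lets the lemmatizer map "better" to "good", while suffix stripping alone produces non-words like "runn".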
Sentiment Analysis

◮ https://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
◮ http://www.alchemyapi.com/developers/getting-started-guide/twitter-sentiment-analysis
◮ www.sentiment140.com
Sentiment analysis has many other names
◮ Opinion extraction
◮ Opinion mining
◮ Sentiment mining
◮ Subjectivity analysis
Sentiment analysis is the detection of attitudes
◮ “enduring, affectively colored beliefs, dispositions towards objects or persons”
Attitudes

◮ Holder (source) of attitude
◮ Target (aspect) of attitude
◮ Type of attitude
  – From a set of types: like, love, hate, value, desire, etc.
  – Or (more commonly) simple weighted polarity: positive, negative, neutral, together with strength
◮ Text containing the attitude
  – Sentence or entire document
Sentiment analysis

◮ Simplest task: Is the attitude of this text positive or negative?
◮ More complex: Rank the attitude of this text from 1 to 5.
◮ Advanced: Detect the target, source, or complex attitude types.
Baseline Algorithm
◮ Tokenization
◮ Feature extraction
◮ Classification using different classifiers
  – Naive Bayes
  – MaxEnt
  – SVM
Sentiment Tokenization Issues
◮ Deal with HTML and XML markup
◮ Twitter/Facebook/... mark-up (names, hashtags)
◮ Capitalization (preserve for words in all caps)
◮ Phone numbers, dates
◮ Emoticons
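A minimal regex-based sketch of a sentiment-aware tokenizer covering a few of the issues above (this is an illustration, not CoreNLP's tokenizer): emoticons, @-mentions, and #hashtags survive as single tokens, and capitalization is preserved only for all-caps words, a common emphasis signal.

```python
import re

# One alternation per token type; earlier branches win.
TOKEN_RE = re.compile(r"""
    [<>]?[:;=8][\-o*']?[\)\](\[dDpP/\\]   # emoticons like :-) ;D :(
  | @\w+                                  # @-mentions
  | \#\w+                                 # hashtags
  | \w+(?:'\w+)?                          # words, with optional apostrophe
  | [^\w\s]                               # any other single non-space char
""", re.VERBOSE)

def tokenize(text):
    tokens = []
    for tok in TOKEN_RE.findall(text):
        # Lowercase everything except all-caps words (>= 2 letters).
        if tok.isupper() and len(tok) >= 2:
            tokens.append(tok)
        else:
            tokens.append(tok.lower())
    return tokens

print(tokenize("GREAT movie!! :-) #oscars @bob"))
# ['GREAT', 'movie', '!', '!', ':-)', '#oscars', '@bob']
```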
Extracting Features for Sentiment Classification

◮ How to handle negation:
  “I didn’t like this movie” vs. “I really like this movie”
◮ Which words to use?
  – Only adjectives
  – All words
Negation
Add the prefix NOT_ to every word between a negation word and the following punctuation.
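This heuristic can be sketched in a few lines; the negation-word and punctuation sets below are small placeholders chosen for illustration.

```python
# Prepend "NOT_" to every token between a negation word and the next
# punctuation mark, so "like" and "NOT_like" become distinct features.
NEGATION_WORDS = {"not", "no", "never", "n't", "didn't", "don't", "isn't"}
PUNCTUATION = {".", ",", ";", "!", "?", ":"}

def mark_negation(tokens):
    marked, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False          # punctuation ends the negation scope
            marked.append(tok)
        elif negating:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
            if tok.lower() in NEGATION_WORDS:
                negating = True       # start marking from the next token
    return marked

print(mark_negation(["i", "didn't", "like", "this", "movie", ",", "but", "ok"]))
# ['i', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'ok']
```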
Naive Bayes Algorithm
◮ Simple (“naive”) classification method based on Bayes rule
◮ Relies on a very simple representation of the document:
  – Bag of words
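The bag-of-words representation can be sketched in one line: a document is reduced to unordered word counts, and word order is discarded.

```python
from collections import Counter

def bag_of_words(tokens):
    # Map each word type to its count in the document; order is lost.
    return Counter(tokens)

doc = "great acting great plot boring ending".split()
print(bag_of_words(doc))
# Counter({'great': 2, 'acting': 1, 'plot': 1, 'boring': 1, 'ending': 1})
```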
Naive Bayes Algorithm
For a document d and a class c
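A sketch of the standard multinomial Naive Bayes derivation, in LaTeX (standard notation assumed: $C$ the set of classes, $x_1,\dots,x_n$ the document's words, $V$ the vocabulary):

```latex
% MAP estimate: choose the most probable class for document d
c_{MAP} = \arg\max_{c \in C} P(c \mid d)
        = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}  % Bayes rule
        = \arg\max_{c \in C} P(d \mid c)\,P(c)               % P(d) is the same for all c

% Bag-of-words + conditional independence ("naive") assumptions:
c_{NB} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)

% Word likelihoods estimated with add-one (Laplace) smoothing:
\hat{P}(w_i \mid c) =
  \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|}
```

The smoothing term keeps a single unseen word from zeroing out the whole product.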
Binarized (Boolean feature) Multinomial Naive Bayes

Intuition:
◮ Word occurrence may matter more than word frequency.
◮ The occurrence of the word “fantastic” tells us a lot.
◮ The fact that it occurs 5 times may not tell us much more.

Boolean Multinomial Naive Bayes clips all the word counts in each document at 1.
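A minimal sketch of Boolean multinomial Naive Bayes: before counting, each word's count within a document is clipped at 1 (i.e., only the set of word types is used), and training and scoring then proceed as in ordinary multinomial Naive Bayes with add-one smoothing. The tiny training corpus is made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, vocab."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in docs:
        priors[label] += 1
        for word in set(tokens):       # clip per-document counts at 1
            counts[label][word] += 1
            vocab.add(word)
    return priors, counts, vocab

def classify(tokens, priors, counts, vocab):
    """Return the class with the highest log-probability score."""
    total_docs = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label in priors:
        score = math.log(priors[label] / total_docs)
        denom = sum(counts[label].values()) + len(vocab)  # add-one smoothing
        for word in set(tokens):
            if word in vocab:          # ignore words never seen in training
                score += math.log((counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [("fantastic great fantastic fantastic".split(), "+"),
        ("great plot good acting".split(), "+"),
        ("boring boring awful".split(), "-"),
        ("awful plot terrible".split(), "-")]
model = train(docs)
print(classify("fantastic acting".split(), *model))  # "+"
```

Note that the repeated "fantastic" and "boring" contribute only once each to the class counts, which is exactly the clipping the slide describes.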
Neural Networks and Deep Learning: Next!

◮ http://nlp.stanford.edu/sentiment/
◮ java -cp "/opt/corenlp/stanford-corenlp-3.7.0/*" -Xmx2g edu.stanford.nlp.sentiment.SentimentPipeline -file input.txt