Natural Language Processing with Deep Learning
Sentiment Analysis with Machine Learning

Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception
Agenda
- Introduction to Machine Learning
- Sentiment Analysis
- Feature Extraction
- Breaking the curse of dimensionality!
Notation
§ $a$ → a value or a scalar
§ $\boldsymbol{a}$ → an array or a vector
- the $i$-th element of $\boldsymbol{a}$ is the scalar $a_i$
§ $\boldsymbol{A}$ → a set of arrays or a matrix
- the $i$-th vector of $\boldsymbol{A}$ is $\boldsymbol{a}_i$
- the $j$-th element of the $i$-th vector of $\boldsymbol{A}$ is the scalar $a_{i,j}$
Linear Algebra – Recap
§ Transpose
- $\boldsymbol{a}$ is in $1 \times d$ dimensions → $\boldsymbol{a}^T$ is in $d \times 1$ dimensions
- $\boldsymbol{B}$ is in $e \times d$ dimensions → $\boldsymbol{B}^T$ is in $d \times e$ dimensions
§ Inverse of the square matrix $\boldsymbol{Z}$ is $\boldsymbol{Z}^{-1}$
§ Dot product
- $\boldsymbol{a} \cdot \boldsymbol{b}^T = c$, dimensions: $1 \times d$ · $d \times 1$ = a scalar
- $\boldsymbol{a} \cdot \boldsymbol{C} = \boldsymbol{d}$, dimensions: $1 \times d$ · $d \times e$ = $1 \times e$
- $\boldsymbol{B} \cdot \boldsymbol{C} = \boldsymbol{D}$, dimensions: $l \times m$ · $m \times n$ = $l \times n$
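To make the shape rules concrete, here is a minimal NumPy sketch (the library choice is an assumption; the slides name no library):

```python
import numpy as np

a = np.random.rand(1, 4)   # a: 1×d row vector (d = 4)
B = np.random.rand(3, 4)   # B: e×d matrix (e = 3)
C = np.random.rand(4, 5)   # C: d×e matrix (e = 5)

print(a.T.shape)        # transpose: (4, 1), i.e. d×1
print(B.T.shape)        # transpose: (4, 3), i.e. d×e
print((a @ a.T).shape)  # dot product 1×d · d×1 -> (1, 1), a scalar
print((a @ C).shape)    # 1×d · d×e -> (1, 5)
print((B @ C).shape)    # l×m · m×n -> (3, 5)

Z = np.random.rand(4, 4)
print(np.linalg.inv(Z).shape)  # inverse of a square matrix: (4, 4)
```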
Statistical Learning
§ Given $n$ observed data points:

$\boldsymbol{X} = [\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n]$

accompanied with output (label) values:

$\boldsymbol{y} = [y_1, y_2, \dots, y_n]$

and each data point is defined as a vector with $d$ dimensions (features):

$\boldsymbol{x}_i = [x_{i,1}, x_{i,2}, \dots, x_{i,d}]$
Statistical Learning
§ Statistical learning assumes that there exists a TRUE function ($f_{\text{TRUE}}$) that has generated these data:

$\boldsymbol{y} = f_{\text{TRUE}}(\boldsymbol{X}) + \epsilon$

§ $f_{\text{TRUE}}$
- The true but unknown function that produces the data
- A fixed function
§ $\epsilon > 0$
- Called the irreducible error
- Rooted in the constraints of gathering data, and of measuring and quantifying features
Example $f_{\text{TRUE}}$

[Figure: the income example. The blue surface is $f_{\text{TRUE}}$; the red points are data points $\boldsymbol{X}$ with two features (Seniority, Years of Education); the output $\boldsymbol{y}$ is Income; $\epsilon$ is the differences between the data points and the surface.]
Machine Learning Model
§ A machine learning (ML) model tries to estimate $f_{\text{TRUE}}$ by defining a function $f$:

$\hat{\boldsymbol{y}} = f(\boldsymbol{X})$

such that $\hat{\boldsymbol{y}}$ (predicted outputs) is close to $\boldsymbol{y}$ (real outputs).
§ The difference between the values of $\hat{\boldsymbol{y}}$ and $\boldsymbol{y}$ is the reducible error
- Can be reduced by better models, i.e. better estimations of $f_{\text{TRUE}}$
Generalization
§ The aim of machine learning is to create a model using observed experiences (training data) that generalizes to the problem domain, namely performs well on unobserved instances (test data)
Learning the model – Splitting the dataset
§ Data points are split into:
- Training set: for training the model
- Validation set: for tuning the model's hyper-parameters
- Test set: for evaluating the model's performance
§ Common train – validation – test splitting sizes
- 60%, 20%, 20%
- 70%, 15%, 15%
- 80%, 10%, 10%

[Figure: the observed data points are first split into a training set and a test set; the training portion is then split again into the final training set and a validation set.]
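A minimal sketch of such a split, assuming scikit-learn and an illustrative 80/10/10 ratio (neither is fixed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)   # 1000 data points, 5 features
y = np.random.rand(1000)      # labels

# First split off the test set (10%), then carve the validation
# set (10% of the total) out of the remaining 90%.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1 / 9, random_state=42)  # 1/9 of 90% = 10%

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```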
Learning the model
§ Example dataset: Student Alcohol Consumption
http://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION#
- Pstatus: parent's cohabitation status ('T' - living together, 'A' - apart)
- romantic: in a romantic relationship
- Walc: weekend alcohol consumption (from 1 - very low to 5 - very high)
§ sex, age, Pstatus, and romantic are the features / variables ($\boldsymbol{X}$); Walc is the label / output variable ($\boldsymbol{y}$)

sex  age  Pstatus  romantic  Walc
F    18   A        no        1
F    17   T        no        1
F    15   T        no        3
F    15   T        yes       1
F    16   T        no        2
M    16   T        no        2
M    16   T        no        1
F    17   A        no        1
M    15   A        no        1
M    15   T        no        1
F    15   T        no        2
F    15   T        no        1
M    15   T        no        3
M    15   T        no        2
M    15   A        yes       1
F    16   T        no        2
F    16   T        no        2
F    16   T        no        1
M    17   T        no        4
Learning the model
§ The first 13 rows of the dataset form the train set; the remaining 6 rows form the test set
§ At prediction time, the test-set labels are hidden from the model (Walc = ?); the true values $\boldsymbol{y}$ = (2, 1, 2, 2, 1, 4) are held out
§ Train: an ML model is trained on the train set
§ Predict: the trained model predicts the test-set labels, here $\hat{\boldsymbol{y}}$ = (1, 1, 2, 2, 3, 4)
§ Evaluation: comparing the predictions $\hat{\boldsymbol{y}}$ with the held-out labels $\boldsymbol{y}$ gives the generalization error
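A compact end-to-end sketch of this train/predict/evaluate loop on the slide's toy data; scikit-learn, the ordinal encoding, and the decision-tree model are illustrative assumptions (the slides do not fix a particular model):

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# (sex, age, Pstatus, romantic) -> Walc, as in the slides
rows = [("F",18,"A","no",1), ("F",17,"T","no",1), ("F",15,"T","no",3),
        ("F",15,"T","yes",1), ("F",16,"T","no",2), ("M",16,"T","no",2),
        ("M",16,"T","no",1), ("F",17,"A","no",1), ("M",15,"A","no",1),
        ("M",15,"T","no",1), ("F",15,"T","no",2), ("F",15,"T","no",1),
        ("M",15,"T","no",3),  # rows above: train set
        ("M",15,"T","no",2), ("M",15,"A","yes",1), ("F",16,"T","no",2),
        ("F",16,"T","no",2), ("F",16,"T","no",1), ("M",17,"T","no",4)]

X = [[r[0], r[1], r[2], r[3]] for r in rows]
y = [r[4] for r in rows]

# Encode the categorical features as numbers
X = OrdinalEncoder().fit_transform(X)

X_train, y_train = X[:13], y[:13]   # train set
X_test, y_test = X[13:], y[13:]     # test set (labels hidden at prediction time)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("predictions:", y_pred)
print("accuracy:", accuracy_score(y_test, y_pred))  # generalization quality
```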
Tuning hyper-parameters – Model selection
§ Decide on several sets of the model's hyper-parameters to explore
§ Train a separate model per set, using the training set
§ Among the trained models, select the best-performing one based on the evaluation results on the validation set
§ Take the selected model and evaluate it on the test set → final model performance
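A sketch of this selection loop; the synthetic data, the ridge-regression candidates, and the hyper-parameter grid are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
X_train, X_val, X_test = X[:800], X[800:900], X[900:]
y_train, y_val, y_test = y[:800], y[800:900], y[900:]

# Train one candidate model per hyper-parameter setting on the train set
candidates = {a: Ridge(alpha=a).fit(X_train, y_train)
              for a in [0.01, 0.1, 1.0, 10.0]}

# Select the best-performing model on the validation set
best_alpha, best_model = min(
    candidates.items(),
    key=lambda kv: mean_squared_error(y_val, kv[1].predict(X_val)))

# Evaluate the selected model once on the test set -> final performance
print(best_alpha, mean_squared_error(y_test, best_model.predict(X_test)))
```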
ML models
§ Parametric models
- The model is defined as a function (or a family of functions) consisting of a set of parameters
- Functions such as linear regression, logistic regression, naïve Bayes, and neural networks
- The problem of finding the ML model is reduced to finding the optimum values for the parameters
§ Non-parametric models
- There is no assumption about the form of the function
- The model is directly learned from data
- ML models such as SVM, k-NN, smoothing splines, Gaussian processes

Term of the day! Inductive bias: all assumptions we consider in defining and creating an ML model; our prior knowledge about what $f_{\text{TRUE}}$ should be.
A sample ML model: Linear Regression
§ $f$ is defined as a linear regression function:

$\hat{y} = f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$

where $\boldsymbol{w} = [w_0, w_1, \dots, w_d]$ is the set of model parameters
§ In the "income" example:

$\widehat{\text{income}} = f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \times \text{education} + w_2 \times \text{seniority}$
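A minimal sketch of fitting such a function on synthetic income-style data (scikit-learn and the made-up coefficients are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Two features, as in the income example: education, seniority
X = rng.uniform(0, 20, size=(200, 2))
income = 20 + 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 2, 200)

model = LinearRegression().fit(X, income)
print(model.intercept_, model.coef_)  # w0 and [w1, w2]; close to 20, [3.0, 1.5]
print(model.predict([[12, 5]]))       # predicted income for one data point
```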
A trained Linear Regression model

[Figure: the fitted linear regression plane over the two features of the income example.]
Loss Function
§ Optimization of parameters is done by first defining a loss function
§ A loss function measures the discrepancies between the predicted outputs $\hat{\boldsymbol{y}}$ and the real ones $\boldsymbol{y}$
§ E.g. Mean Square Error (MSE) – a common regression loss function:

$\mathcal{L}(y_i, \hat{y}_i; \boldsymbol{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

§ Loss functions for classification: next lectures
Good to know! What is Mean Absolute Error and how is it different from MSE?
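A direct NumPy transcription of MSE, plus the "good to know" MAE for comparison, applied to the toy predictions from the earlier slides:

```python
import numpy as np

def mse(y, y_hat):
    """Mean Square Error between real and predicted outputs."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean Absolute Error: penalizes errors linearly, not quadratically."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(np.abs(y - y_hat))

print(mse([2, 1, 2, 2, 1, 4], [1, 1, 2, 2, 3, 4]))  # 0.8333...
print(mae([2, 1, 2, 2, 1, 4], [1, 1, 2, 2, 3, 4]))  # 0.5
```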
Optimization
§ Next, the training data is used to find an optimum set of parameters $\boldsymbol{w}^*$ by optimizing the loss function:

$\boldsymbol{w}^* = \operatorname*{argmin}_{\boldsymbol{w}} \mathcal{L}(y_i, \hat{y}_i; \boldsymbol{w})$

with MSE:

$\boldsymbol{w}^* = \operatorname*{argmin}_{\boldsymbol{w}} \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(\boldsymbol{x}_i; \boldsymbol{w})\right)^2$

§ How to optimize:
- Stochastically, e.g. using Stochastic Gradient Descent (SGD) → next lecture
- Analytically, e.g. in linear regression → Deep Learning book, section 5.1.4
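A sketch of the analytic route for linear regression, using the closed form $\boldsymbol{w}^* = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$ derived in section 5.1.4 of the Deep Learning book (the synthetic data is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 20 + 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.1, 200)

# Prepend a column of ones so the bias w0 becomes part of w
Xb = np.hstack([np.ones((200, 1)), X])

# Normal equations: the w that minimizes MSE, computed analytically
w = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print(w)  # approximately [20.0, 3.0, 1.5]
```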
ML models… cont.
§ Model capacity
- High capacity: more flexible, more parameters, higher variance, lower bias, prone to overfitting
- Low capacity: less flexible, fewer parameters, lower variance, higher bias, prone to underfitting

Terms of the day!
(Statistical) bias indicates the amount of assumptions taken to define a model. Higher bias means more assumptions and less flexibility, as in linear regression.
Variance: the extent to which the estimated parameters of a model vary when the data points change (are resampled).
Overfitting: when the model fits the training data too exactly, namely when it also captures the noise in the data.
Learning Curve

[Figure: error vs. model capacity for train and test sets. Models: black – $f_{\text{TRUE}}$; orange – linear regression; blue and green – two smoothing spline models. The train-set error keeps decreasing with capacity, while the test-set error is high in the underfit (low-capacity) region and in the overfit (high-capacity) region, with the "sweet spot!" in between.]
Regularization
§ A regularization method introduces additional information (assumptions) to avoid overfitting by decreasing variance
§ E.g. adding the squared L2 norm of the parameters to the loss function:

$\mathcal{L}(y_i, \hat{y}_i; \boldsymbol{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lVert \boldsymbol{w} \rVert_2^2$

where

$\lVert \boldsymbol{w} \rVert_2^2 = \sum_i w_i^2$
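A direct transcription of this L2-regularized loss (in practice the penalty term is usually scaled by a strength coefficient; here it is added unscaled, as on the slide):

```python
import numpy as np

def l2_regularized_mse(y, y_hat, w):
    """MSE plus the squared L2 norm of the parameters."""
    y, y_hat, w = np.asarray(y), np.asarray(y_hat), np.asarray(w)
    return np.mean((y - y_hat) ** 2) + np.sum(w ** 2)

w = np.array([0.5, -1.2, 2.0])
print(l2_regularized_mse([1.0, 2.0], [1.1, 1.8], w))  # 0.025 + 5.69 = 5.715
```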
Common Evaluation Metrics
§ Classification
- Accuracy = (# of correct predictions) / (# of samples)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F-measure = (2 · precision · recall) / (precision + recall)
§ Regression
- MSE
- R-squared
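A quick check of these formulas with scikit-learn (the example labels are made up for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # TP=3, FP=1, FN=1, TN=3

print(accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean = 0.75
```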
k-fold Cross Validation
§ A rigorous evaluation method
- avoids bias in the train/test splitting
§ How to (k = 5 or 10)
- Split the data into k equal-size folds
- Repeat k times:
  - Use one left-out fold for testing
  - Use the rest of the k-1 folds for training
- The final performance is the average of the evaluation results of the k models
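A minimal sketch with scikit-learn's cross_val_score (the library, the model, and the synthetic data are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# 5 folds: each fold serves once as the test set, the rest for training;
# the final performance is the average over the 5 runs.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())
```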
Agenda
- Introduction to Machine Learning
- Sentiment Analysis
- Feature Extraction
- Breaking the curse of dimensionality!
A tough Example!
"This past Saturday, I bought a Nokia phone and my girlfriend bought a Motorola phone with Bluetooth. We called each other when we got home. The voice on my phone was clear, better than my previous Samsung phone. The battery life was however short. My girlfriend was quite happy with her phone. I wanted a phone with good sound quality. So my purchase was a real disappointment. I returned the phone yesterday." [1]
Text-Level Sentiment Analysis
§ Text- or document-level sentiment analysis assumes that the whole text expresses one sentiment about one opinion target
- Not like the previous example!
§ We approach sentiment prediction with ML
Problem definition
§ A dataset consists of $n$ text documents and their sentiments (outputs)
§ Possible sentiment values:
- [-1, 0, 1] → [negative, neutral, positive] (classification problem)
- Real-valued numbers, e.g. a stock price (regression problem)
Sentiment Analysis with ML

       features   sentiment
$d_1$  ...        $y_{d_1}$
$d_2$  ...        $y_{d_2}$
...    ...        ...
$d_n$  ...        $y_{d_n}$
$d$    ...        ?

Create ML model → Predict
Agenda
- Introduction to Machine Learning
- Sentiment Analysis
- Feature Extraction
- Breaking the curse of dimensionality!
Dictionary
§ To extract features, first a dictionary with $m$ words (terms) is defined:

$[t_1, t_2, \dots, t_m]$
Document-Term Matrix
§ The features are based on the terms in the dictionary
- Bag of Words (BoW) representations of documents
§ $x_{t,d}$ is the feature value – the weight of term $t$ in document $d$:

       $t_1$          $t_2$          ...  $t_m$          sentiment
$d_1$  $x_{t_1,d_1}$  $x_{t_2,d_1}$  ...  $x_{t_m,d_1}$  $y_{d_1}$
$d_2$  $x_{t_1,d_2}$  $x_{t_2,d_2}$  ...  $x_{t_m,d_2}$  $y_{d_2}$
...    ...            ...            ...  ...            ...
$d_n$  $x_{t_1,d_n}$  $x_{t_2,d_n}$  ...  $x_{t_m,d_n}$  $y_{d_n}$
Term Weightings
§ A term weighting method measures the importance of a term in a document
§ One common method is to count the number of occurrences of a term in a document ⟹ term count:

$x_{t,d} = \mathrm{tc}_{t,d} = $ # of occurrences of $t$ in $d$

§ Using a logarithm to dampen the raw counts is shown to be more effective ⟹ term frequency:

$x_{t,d} = \mathrm{tf}_{t,d} = \log(1 + \mathrm{tc}_{t,d})$
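A minimal pure-Python sketch of both weightings on a single toy document (the example sentence is made up):

```python
import math
from collections import Counter

doc = "the phone was clear better than my previous phone".split()

tc = Counter(doc)                                 # term counts
tf = {t: math.log(1 + c) for t, c in tc.items()}  # dampened term frequency

print(tc["phone"], round(tf["phone"], 3))  # 2, 1.099
```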
On informativeness of less frequent terms
§ Terms that do not appear often usually carry more information than highly frequent ones
- e.g., JKU in a large news corpus
§ Inverse document frequency (idf) is a well-known method to measure how often a word appears in a collection:

$\mathrm{idf}_t = \log\left(\frac{N}{\mathrm{df}_t + 1}\right)$

- $\mathrm{df}_t$ is the document frequency of $t$, namely the number of documents that contain term $t$; $N$ is the number of documents in the collection
- A higher $\mathrm{idf}_t$ means that the term appears less often in the collection, and is therefore more informative (important)
- e.g., JKU has a high idf, while the has a very low idf
Term weightings
§ The tf-idf term weighting is the product of $\mathrm{tf}_{t,d}$ and $\mathrm{idf}_t$:

$x_{t,d} = \text{tf-idf}_{t,d} = \log(1 + \mathrm{tc}_{t,d}) \times \log\left(\frac{N}{\mathrm{df}_t + 1}\right)$

§ A well-known term weighting method!
- increases with the number of occurrences within a document (tf)
- increases with the rarity of the term in the collection (idf)
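A from-scratch sketch over a toy corpus, following the slide's formulas with the +1 smoothing in the idf denominator (the corpus is made up):

```python
import math
from collections import Counter

docs = [
    "the battery life was short".split(),
    "the voice on my phone was clear".split(),
    "my girlfriend was happy with her phone".split(),
]
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(t for d in docs for t in set(d))

def tf_idf(term, doc):
    tc = doc.count(term)                # term count
    tf = math.log(1 + tc)               # dampened term frequency
    idf = math.log(N / (df[term] + 1))  # rarity in the collection
    return tf * idf

print(tf_idf("battery", docs[0]))  # rare term -> relatively high weight
print(tf_idf("was", docs[0]))      # appears in all docs -> weight <= 0
```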
Wrap-up!
§ Use any of the term weightings to create the document-term matrix (as on the previous slides)
§ The rest is standard machine learning!
- Model training
- Hyper-parameter tuning and model selection
- Evaluation
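A sketch of the whole wrap-up in scikit-learn; the tiny labeled corpus and the logistic-regression classifier are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = ["the voice was clear and great",
              "battery life was short, a real disappointment",
              "quite happy with this phone",
              "I returned the phone, bad sound quality"]
train_labels = [1, -1, 1, -1]   # 1 = positive, -1 = negative

# tf-idf document-term matrix + a standard classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

print(model.predict(["the sound was clear and the battery great"]))
```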
Agenda
- Introduction to Machine Learning
- Sentiment Analysis
- Feature Extraction
- Breaking the curse of dimensionality!
Supervised Sentiment Analysis
§ The feature vectors of the document-term matrix are
- sparse (a lot of zeros)
- in a very high dimension: the number of terms $m$ is typically ~20K-500K, while the number of documents $n$ is often only ~10K-100K
Curse of dimensionality
§ The curse of dimensionality happens when the amount of data does not suffice to support the high dimensionality of the feature space
§ It causes
- Data sparsity
- Issues in measuring "closeness"
Curse of dimensionality
§ Why low-dimensional vectors?
- Easier to store and load (efficiency)
- More efficient when used as features in ML models
- Better generalization, since the noise in the data is reduced
- Able to capture higher-order relations:
  - Synonyms like car and automobile can be merged into the same dimension
  - Polysemes like bank (financial institution) and bank (river bank) can be separated into different dimensions
Feature (Dimensionality) reduction
§ Feature selection
- keep some important features and get rid of the rest!
§ Dimensionality reduction
- project the data from a high- to a low-dimensional space: $N \times M \Rightarrow N \times d$
Feature selection
§ During pre-processing (see the sketch below)
- Remove stop words or very common words
  - tf-idf does this in a "soft" way (why?)
- Remove very rare words
  - Usually done when creating the dictionary
- Stemming & lemmatization
§ Feature definition
- Use only the words of a domain-specific lexicon as features
§ Post-processing
- Keep important features using some informativeness measure
- Subset selection
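One way to realize the pre-processing items is through scikit-learn's vectorizer options; a minimal sketch with illustrative parameter values (none of them prescribed by the slides):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",  # remove stop words
    max_df=0.9,            # remove very common words (in >90% of documents)
    min_df=2,              # remove very rare words (in fewer than 2 documents)
)
docs = ["the battery was short", "the voice was clear",
        "battery and voice were fine"]
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the resulting dictionary
print(X.shape)                             # documents × kept features
```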
Dimensionality reduction with LSA
§ Latent Semantic Analysis (LSA)
- A common method in Information Retrieval to capture semantics
- Based on Singular Value Decomposition (SVD)
Semantics matters!
Singular Value Decomposition
§ An $N \times M$ matrix $\boldsymbol{X}$ can be factorized into three matrices:

$\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^T$

§ $\boldsymbol{U}$ (left singular vectors) is an $N \times M$ unitary matrix
§ $\boldsymbol{\Sigma}$ is an $M \times M$ diagonal matrix, whose diagonal entries
- are the singular values,
- show the importance of the corresponding $M$ dimensions in $\boldsymbol{X}$,
- are all positive and sorted from large to small values
§ $\boldsymbol{V}^T$ (right singular vectors) is an $M \times M$ unitary matrix

* The definition of SVD is simplified here. Refer to https://en.wikipedia.org/wiki/Singular_value_decomposition for the exact definition
Singular Value Decomposition

[Figure: the original matrix $\boldsymbol{X}$ ($N \times M$) = left singular vectors $\boldsymbol{U}$ ($N \times M$) · singular values $\boldsymbol{\Sigma}$ ($M \times M$) · right singular vectors $\boldsymbol{V}^T$ ($M \times M$).]
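Checking the factorization with NumPy (np.linalg.svd returns the diagonal of $\boldsymbol{\Sigma}$ as a vector; the matrix sizes are arbitrary):

```python
import numpy as np

X = np.random.rand(6, 4)  # N=6, M=4

# Thin SVD: U is N×M, S holds the M singular values, Vt is M×M
U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape, S.shape, Vt.shape)           # (6, 4) (4,) (4, 4)
print(S)                                    # sorted large to small, all >= 0
print(np.allclose(X, U @ np.diag(S) @ Vt))  # True: X = U Σ V^T
```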
Latent Semantic Indexing (at training time)
§ Step 1: apply SVD to the (sparse) term-document matrix $\boldsymbol{X}$ of the training data ($N$ terms, $M$ documents):

$N \times M$ term-document matrix $\boldsymbol{X}$ = term vectors $\boldsymbol{U}$ · singular values $\boldsymbol{\Sigma}$ · document vectors $\boldsymbol{V}^T$

→ Not the document-term matrix! Although it is also possible to start with the document-term matrix, we follow the typical definition!
§ Step 2: keep only the top $k$ singular values in $\boldsymbol{\Sigma}$ and set the rest to zero, called $\tilde{\boldsymbol{\Sigma}}$
§ Truncate the $\boldsymbol{U}$ and $\boldsymbol{V}^T$ matrices according to the changes in $\boldsymbol{\Sigma}$, called $\tilde{\boldsymbol{U}}$ and $\tilde{\boldsymbol{V}}^T$ respectively
§ If we multiply the truncated matrices, we obtain the rank-$k$ least-squares approximation of the original matrix:

$\tilde{\boldsymbol{X}} = \tilde{\boldsymbol{U}} \tilde{\boldsymbol{\Sigma}} \tilde{\boldsymbol{V}}^T$
Latent Semantic Indexing (at training time)

[Figure: the smoothed term-document matrix $\tilde{\boldsymbol{X}}$ ($N \times M$) = truncated term vectors $\tilde{\boldsymbol{U}}$ ($N \times k$) · truncated singular values $\tilde{\boldsymbol{\Sigma}}$ ($k \times k$) · truncated document vectors $\tilde{\boldsymbol{V}}^T$ ($k \times M$).]

§ The $\tilde{\boldsymbol{V}}$ matrix contains the dense low-dimensional document vectors
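A NumPy sketch of the truncation; the matrix sizes and k = 10 are arbitrary illustrative choices:

```python
import numpy as np

X = np.random.rand(500, 100)  # term-document matrix: N=500 terms, M=100 docs
k = 10                        # number of latent dimensions to keep

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-k singular values and the matching singular vectors
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

X_k = U_k @ np.diag(S_k) @ Vt_k  # rank-k least-squares approximation of X
doc_vectors = Vt_k.T             # M×k: dense low-dimensional document vectors
print(X_k.shape, doc_vectors.shape)  # (500, 100) (100, 10)
```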