Natural Language Processing with Deep Learning: Sentiment Analysis with Machine Learning



SLIDE 1

Natural Language Processing with Deep Learning
Sentiment Analysis with Machine Learning

Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of Computational Perception

SLIDE 2

Agenda

  • Introduction to Machine Learning
  • Sentiment Analysis
  • Feature Extraction
  • Breaking the curse of dimensionality!
SLIDE 3

Agenda

  • Introduction to Machine Learning
  • Sentiment Analysis
  • Feature Extraction
  • Breaking the curse of dimensionality!
SLIDE 4

Notation

§ b → a value or a scalar
§ c → an array or a vector
  • the j-th element of c is the scalar c_j
§ D → a set of arrays or a matrix
  • the j-th vector of D is d_j
  • the k-th element of the j-th vector of D is the scalar d_{j,k}

SLIDE 5

Linear Algebra – Recap

§ Transpose
  • b is in 1×d dimensions → b^T is in d×1 dimensions
  • B is in e×d dimensions → B^T is in d×e dimensions
§ The inverse of the square matrix T is T^{-1}
§ Dot product
  • b · c^T = a scalar      dimensions: 1×d · d×1 = 1×1
  • b · C = a vector d      dimensions: 1×d · d×e = 1×e
  • B · C = a matrix D      dimensions: l×m · m×n = l×n

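These shape rules can be checked quickly in NumPy; a minimal sketch, where the sizes d, e, l, m, n are arbitrary example values:

```python
import numpy as np

d, e, l, m, n = 4, 3, 2, 5, 6        # arbitrary example sizes

b = np.random.rand(1, d)             # a 1xd row vector
B = np.random.rand(e, d)             # an exd matrix
print(b.T.shape, B.T.shape)          # (d, 1) and (d, e): the transposes

c = np.random.rand(1, d)
C = np.random.rand(d, e)
print((b @ c.T).shape)               # (1, 1): a scalar
print((b @ C).shape)                 # (1, e): a vector

L = np.random.rand(l, m)
M = np.random.rand(m, n)
print((L @ M).shape)                 # (l, n): a matrix

T = np.random.rand(d, d)
print(np.linalg.inv(T).shape)        # (d, d): inverse of a square matrix
```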

SLIDE 6

Statistical Learning

§ Given O observed data points:

    Y = [y^(1), y^(2), ..., y^(O)]

  accompanied with output (label) values:

    z = [z_1, z_2, ..., z_O]

  and each data point is defined as a vector with m dimensions (features):

    y^(j) = [y_1^(j), y_2^(j), ..., y_m^(j)]

SLIDE 7

Statistical Learning

§ Statistical learning assumes that there exists a TRUE function (g_true) that has generated these data:

    z = g_true(Y) + ϑ

§ g_true
  • The true but unknown function that produces the data
  • A fixed function
§ ϑ > 0
  • Called the irreducible error
  • Rooted in the constraints in gathering data, and in measuring and quantifying features

SLIDE 8

Example g_true

[Figure: income predicted from Seniority and Years of Education]
  • g_true → the blue surface
  • Y → the red points with two features: Seniority, Years of Education
  • z → Income
  • ϑ → the differences between the data points and the surface

SLIDE 9

Machine Learning Model

§ A machine learning (ML) model tries to estimate g_true by defining a function g:

    ẑ = g(Y)

  such that ẑ (the predicted outputs) is close to z (the real outputs).
§ The difference between the values of ẑ and z is the reducible error
  • Can be reduced by better models, i.e. better estimations of g_true

SLIDE 10

Generalization

§ The aim of machine learning is to create a model using observed experiences (training data) that generalizes to the problem domain, namely performs well on unobserved instances (test data)

SLIDE 11

Learning the model – Splitting the dataset

§ Data points are split into:
  • Training set: for training the model
  • Validation set: for tuning the model's hyper-parameters
  • Test set: for evaluating the model's performance
§ Common train – validation – test split sizes:
  • 60%, 20%, 20%
  • 70%, 15%, 15%
  • 80%, 10%, 10%

[Figure: the observed data points split into a training set, a validation set, and a test set]

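One common way to produce such a split in practice is scikit-learn's train_test_split; a minimal sketch of the 60% / 20% / 20% option above (the toy arrays are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

Y = np.random.rand(100, 4)             # toy feature matrix: 100 data points, 4 features
z = np.random.randint(1, 6, size=100)  # toy labels

# First split off 20% as the test set, then split the remaining 80% into train/validation.
Y_rest, Y_test, z_rest, z_test = train_test_split(Y, z, test_size=0.20, random_state=0)
Y_train, Y_val, z_train, z_val = train_test_split(Y_rest, z_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% equals 20% of all data, giving a 60/20/20 split overall
```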

SLIDE 12

Learning the model

Dataset: Student Alcohol Consumption
http://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION#

  • Pstatus: parents' cohabitation status ('T' - living together, 'A' - apart)
  • romantic: in a romantic relationship
  • Walc: weekend alcohol consumption (from 1 - very low to 5 - very high)

sex  age  Pstatus  romantic  Walc
F    18   A        no        1
F    17   T        no        1
F    15   T        no        3
F    15   T        yes       1
F    16   T        no        2
M    16   T        no        2
M    16   T        no        1
F    17   A        no        1
M    15   A        no        1
M    15   T        no        1
F    15   T        no        2
F    15   T        no        1
M    15   T        no        3
M    15   T        no        2
M    15   A        yes       1
F    16   T        no        2
F    16   T        no        2
F    16   T        no        1
M    17   T        no        4

Features / Variables (Y): sex, age, Pstatus, romantic      Label / Output Variable (Z): Walc

SLIDE 13

Learning the model

[Same table, now split: the first 13 rows form the Train Set, the last 6 rows form the Test Set]

SLIDE 14

Learning the model

[The Walc labels of the Test Set rows (z = 2, 1, 2, 2, 1, 4) are held out and shown as "?"; only the Train Set labels are available for learning]

SLIDE 15

Learning the model

Train Set → Train ML Model

[The ML model is trained on the Train Set; the Test Set labels remain hidden]

SLIDE 16

Learning the model

Train Set → Train ML Model → Predict

[The trained model predicts ẑ = 1, 1, 2, 2, 3, 4 for the Test Set rows]

SLIDE 17

Learning the model

Train Set → Train ML Model → Predict → Evaluation – Generalization error

[The predictions ẑ = 1, 1, 2, 2, 3, 4 are compared against the true Test Set labels z = 2, 1, 2, 2, 1, 4]

SLIDE 18

Tuning hyper-parameters – Model selection

§ Decide on the exploration of several sets of the model's hyper-parameters
§ Train a separate model for each set using the training set
§ Among the trained models, select the best performing one based on the evaluation results on the validation set
§ Take the selected model and evaluate it on the test set → final model performance

SLIDE 19

ML models

§ Parametric models
  • The model is defined as a function (or a family of functions) consisting of a set of parameters
  • Functions such as linear regression, logistic regression, naïve Bayes, and neural networks
  • The problem of finding the ML model is reduced to finding the optimum values for the parameters
§ Non-parametric models
  • There is no assumption about the form of the function
  • The model is directly learned from data
  • ML models such as SVM, k-NN, smoothing splines, and Gaussian processes

Term of the day! Inductive bias: all the assumptions we consider in defining and creating an ML model; our prior knowledge about what g_true should be.

SLIDE 20

A sample ML model: Linear Regression

§ g is defined as a linear regression function:

    ẑ = g(y; x) = x_0 + x_1·y_1 + x_2·y_2 + ... + x_m·y_m

  where x = [x_0, x_1, ..., x_m] is the set of model parameters
§ In the "income" example:

    income = g(y; x) = x_0 + x_1 × education + x_2 × seniority

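A minimal scikit-learn sketch of fitting such a linear regression; the two feature columns stand in for the income example's education and seniority features, and all numbers are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: each row is one data point y = [education, seniority] (made-up values)
Y = np.array([[12.0, 1.0], [16.0, 5.0], [16.0, 10.0], [18.0, 20.0]])
z = np.array([25.0, 40.0, 55.0, 80.0])     # made-up incomes

g = LinearRegression().fit(Y, z)
print(g.intercept_, g.coef_)               # the learned parameters x_0 and [x_1, x_2]
print(g.predict([[14.0, 8.0]]))            # prediction for a new data point
```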

SLIDE 21

A trained Linear Regression model

SLIDE 22

Loss Function

§ Optimization of the parameters is done by first defining a loss function
§ A loss function measures the discrepancies between the predicted outputs ẑ and the real ones z
§ E.g. Mean Square Error (MSE) – a common regression loss function:

    L(z_j, ẑ_j; x) = (1/O) Σ_{j=1}^{O} (z_j − ẑ_j)²

Loss functions for classification: next lectures
Good to know! What is Mean Absolute Error and how is it different from MSE?

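MSE written out in NumPy, with Mean Absolute Error alongside as a pointer for the "good to know" question (toy values):

```python
import numpy as np

z = np.array([1.0, 1.0, 3.0, 1.0, 2.0])       # real outputs
z_hat = np.array([1.2, 0.8, 2.5, 1.4, 2.1])   # predicted outputs

mse = np.mean((z - z_hat) ** 2)   # squared errors: large deviations are penalized strongly
mae = np.mean(np.abs(z - z_hat))  # absolute errors: less sensitive to outliers than MSE
print(mse, mae)
```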

SLIDE 23

Optimization

§ Next, the training data is used to find an optimum set of parameters x* by optimizing the loss function:

    x* = argmin_x L(z_j, ẑ_j; x)

  For MSE:

    x* = argmin_x (1/O) Σ_{j=1}^{O} (z_j − g(y^(j); x))²

§ How to optimize:
  • Stochastically, e.g. using Stochastic Gradient Descent (SGD) → next lecture
  • Analytically, e.g. in linear regression → Deep Learning book 5.1.4

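For linear regression with MSE, the analytic route mentioned above has a closed-form (normal-equation) solution; a small NumPy sketch on synthetic data:

```python
import numpy as np

Y = np.random.rand(100, 3)                    # 100 data points, 3 features
z = Y @ np.array([2.0, -1.0, 0.5]) + 0.3      # synthetic outputs with known parameters

Y_b = np.hstack([np.ones((100, 1)), Y])       # prepend a column of ones for the bias x_0
x_star = np.linalg.solve(Y_b.T @ Y_b, Y_b.T @ z)  # x* = (Y^T Y)^{-1} Y^T z
print(x_star)                                 # approximately [0.3, 2.0, -1.0, 0.5]
```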

SLIDE 24

ML models… cont.

Model Capacity
  • High capacity: more flexible, more parameters, higher variance, lower bias, prone to overfitting
  • Low capacity: less flexible, fewer parameters, lower variance, higher bias, prone to underfitting

Terms of the day!
(Statistical) Bias indicates the amount of assumptions taken to define a model. Higher bias means more assumptions and less flexibility, as in linear regression.
Variance: to what extent the estimated parameters of a model vary when the values of the data points change (are resampled).
Overfitting: when the model fits the training data too closely, namely when it also captures the noise in the data.

SLIDE 25

Learning Curve

[Figure: error vs. model capacity for the train set and the test set, marking the underfitting region, the overfitting region, and the sweet spot in between]

Models: black → g_true, orange → linear regression, blue and green → two smoothing spline models

SLIDE 26

Regularization

§ A regularization method introduces additional information (assumptions) to avoid overfitting by decreasing variance
§ E.g. adding the squared L2 norm of the parameters to the loss function:

    L(z_j, ẑ_j; x) = (1/O) Σ_{j=1}^{O} (z_j − ẑ_j)² + ||x||₂²

  where ||x||₂² = Σ_j x_j²

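Linear regression with this squared-L2 penalty is ridge regression; a minimal scikit-learn sketch (the alpha weight on the penalty term is an assumption, not part of the slide's formula):

```python
import numpy as np
from sklearn.linear_model import Ridge

Y = np.random.rand(50, 10)     # toy features
z = np.random.rand(50)         # toy outputs

model = Ridge(alpha=1.0).fit(Y, z)   # alpha scales the ||x||_2^2 term added to the MSE loss
print(np.sum(model.coef_ ** 2))      # the penalized quantity: squared L2 norm of the parameters
```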

SLIDE 27

Common Evaluation Metrics

§ Classification
  • Accuracy = # of correct predictions / # of samples
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F-measure = 2 · precision · recall / (precision + recall)
§ Regression
  • MSE
  • R-squared

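The same metrics computed with scikit-learn on toy labels (binary case for precision, recall, and F-measure):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error, r2_score)

z_true = [1, 0, 1, 1, 0, 1]
z_pred = [1, 0, 0, 1, 1, 1]
print(accuracy_score(z_true, z_pred))    # correct predictions / all samples
print(precision_score(z_true, z_pred))   # TP / (TP + FP)
print(recall_score(z_true, z_pred))      # TP / (TP + FN)
print(f1_score(z_true, z_pred))          # 2 * precision * recall / (precision + recall)

z_real = [2.0, 1.0, 3.5]
z_hat = [2.2, 0.8, 3.0]
print(mean_squared_error(z_real, z_hat), r2_score(z_real, z_hat))
```
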
SLIDE 28

k-fold Cross Validation

§ A rigorous evaluation method
  • avoids bias in the train/test split
§ How to
  • Split the data into k equal-size folds (k = 5 or 10)
  • Repeat k times:
    • Use one left-out fold for testing
    • Use the remaining k−1 folds for training
  • The final performance is the average of the evaluation results of the k models

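A sketch of 5-fold cross validation with scikit-learn; the model and data are placeholders:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

Y = np.random.rand(100, 5)             # toy features
z = np.random.randint(0, 2, size=100)  # toy binary labels

scores = cross_val_score(LogisticRegression(), Y, z, cv=5)  # k = 5 folds
print(scores.mean())   # final performance: the average over the k held-out folds
```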

SLIDE 29

Agenda

  • Introduction to Machine Learning
  • Sentiment Analysis
  • Feature Extraction
  • Breaking the curse of dimensionality!
SLIDE 30

A tough Example!

"This past Saturday, I bought a Nokia phone and my girlfriend bought a Motorola phone with Bluetooth. We called each other when we got home. The voice on my phone was clear, better than my previous Samsung phone. The battery life was however short. My girlfriend was quite happy with her phone. I wanted a phone with good sound quality. So my purchase was a real disappointment. I returned the phone yesterday."

[1]

SLIDE 31

Text-Level Sentiment Analysis

§ Text- or document-level sentiment analysis assumes that the whole text expresses one sentiment about one opinion target
  • Not like the previous example!
§ We approach sentiment prediction with ML

SLIDE 32

Problem definition

§ A dataset consists of N text documents and their sentiments (outputs)
§ Possible sentiment values:
  • [-1, 0, 1] → [negative, neutral, positive] (classification problem)
  • Real-valued numbers, e.g. a stock price (regression problem)

SLIDE 33

Sentiment Analysis with ML

[Table: documents e_1 ... e_N with their feature vectors and sentiments z_{e_1} ... z_{e_N}; a new document e has unknown sentiment "?"]

Create ML Model → Predict

SLIDE 34

Agenda

  • Introduction to Machine Learning
  • Sentiment Analysis
  • Feature Extraction
  • Breaking the curse of dimensionality!
SLIDE 35

Dictionary

§ To extract features, first a dictionary with O words (terms) is defined:

    [t_1, t_2, ..., t_O]

SLIDE 36

Document-Term Matrix

§ The features are based on the terms in the dictionary
  • Bag of Words (BoW) representations of documents
§ y_{t,e} is the feature value → the weight of term t in document e

[Table: rows are the documents e_1 ... e_N, columns are the dictionary terms t_1 ... t_O; each cell holds the weight y_{t,e}, and the last column holds the sentiment z_e]

SLIDE 37

Term Weightings

§ A term weighting method measures the importance of a term in a document
§ One common method is to count the number of occurrences of a term in a document ⟹ term count:

    y_{t,e} = tc_{t,e} = # of occurrences of t in e

§ Using a logarithm to dampen the raw counts has been shown to be more effective ⟹ term frequency:

    y_{t,e} = tf_{t,e} = log(1 + tc_{t,e})

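A small sketch of tc and tf for one toy document (plain whitespace tokenization is assumed):

```python
import numpy as np
from collections import Counter

doc = "the phone was good , the battery was not good"
tc = Counter(doc.split())                       # term counts: occurrences of each term in the document
tf = {t: np.log(1 + c) for t, c in tc.items()}  # dampened counts: tf = log(1 + tc)
print(tc["good"], tf["good"])                   # 2 and log(3)
```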

SLIDE 38

On the informativeness of less frequent terms

§ Terms that do not appear often usually carry more information than highly frequent ones
  • e.g., JKU in a large news corpus
§ Inverse document frequency (idf) is a well-known method to measure how often a word appears in a collection:

    idf_t = log(N / (df_t + 1))

  • df_t is the document frequency of t, namely the number of documents that contain the term t
  • A higher idf_t means that the term appears less often in the collection, and is therefore more informative (important)
  • e.g., JKU has a high idf, while the has a very low idf

SLIDE 39

Term weightings

§ The tf−idf_{t,e} term weighting is the product of tf_{t,e} and idf_t:

    y_{t,e} = tf−idf_{t,e} = log(1 + tc_{t,e}) × log(N / df_t)

§ A well-known term weighting method!
  • increases with the number of occurrences within a document
  • increases with the rarity of the term in the collection

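The two factors written out for a toy collection (log bases and smoothing differ between implementations; this follows the slide's formula, and the example documents are made up):

```python
import numpy as np

docs = [["jku", "is", "in", "linz"],
        ["the", "phone", "is", "good"],
        ["the", "battery", "is", "bad"]]
N = len(docs)

def tf_idf(term, doc):
    tc = doc.count(term)                    # occurrences of the term within the document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return np.log(1 + tc) * np.log(N / df)  # tf * idf

print(tf_idf("jku", docs[0]), tf_idf("is", docs[0]))  # the rare term scores higher than the common one
```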

SLIDE 40

Wrap-up!

§ Use any term weighting to create the document-term matrix
§ The rest is standard machine learning!
  • Model training
  • Hyper-parameter tuning and model selection
  • Evaluation

[Table: the document-term matrix with term weights y_{t,e} and the sentiment column, as on the previous slides]

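The whole wrap-up as one scikit-learn sketch: a tf-idf document-term matrix feeding a standard classifier (the example texts and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["the sound quality was great",
         "the battery life was a real disappointment",
         "quite happy with this phone",
         "I returned the phone"]
labels = [1, -1, 1, -1]   # 1 = positive, -1 = negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)                      # builds the document-term matrix, then trains
print(model.predict(["good sound quality"]))
```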

SLIDE 41

Agenda

  • Introduction to Machine Learning
  • Sentiment Analysis
  • Feature Extraction
  • Breaking the curse of dimensionality!
SLIDE 42

Supervised Sentiment Analysis

§ The feature vectors are
  • sparse (a lot of zeros)
  • of a very high dimension
    O ~ [20K - 500K]    N ~ [10K - 100K]

[Table: the sparse document-term matrix with N documents and O term features, plus the sentiment column]

SLIDE 43

Curse of dimensionality

§ The curse of dimensionality arises when the amount of data does not suffice to support the sparsity in dimensionality
§ It causes
  • Data sparsity
  • Issues in measuring "closeness"

SLIDE 44

Curse of dimensionality

§ Why low-dimensional vectors?
  • Easier to store and load (efficiency)
  • More efficient when used as features in ML models
  • Better generalization, since the noise in the data is reduced
  • Able to capture higher-order relations:
    • Synonyms like car and automobile can be merged into the same dimension
    • Polysemous words like bank (financial institution) and bank (bank of a river) can be separated into different dimensions

SLIDE 45

Feature (Dimensionality) reduction

§ Feature selection
  • keep some important features and get rid of the rest!
§ Dimensionality reduction
  • project the data from a high- to a low-dimensional space: M×N ⟹ M×d

SLIDE 46

Feature selection

§ During pre-processing
  • Remove stop words or very common words
    • tf−idf does this in a "soft" way (why?)
  • Remove very rare words
    • Usually done when creating the dictionary
  • Stemming & lemmatization
§ Feature definition
  • Use only the words in a domain-specific lexicon as features
§ Post-processing
  • Keep important features using some informativeness measure
  • Subset selection

SLIDE 47

Dimensionality reduction with LSA

§ Latent Semantic Analysis (LSA)
  • A common method in Information Retrieval to capture semantics
  • Based on Singular Value Decomposition (SVD)

Semantics matters!

SLIDE 48

Singular Value Decomposition

§ An N×M matrix Y can be factorized into three matrices:

    Y = V Σ W^T

§ V (left singular vectors) is an N×M unitary matrix
§ Σ is an M×M diagonal matrix, whose diagonal entries
  • are eigenvalues,
  • show the importance of the corresponding M dimensions in Y,
  • are all positive and sorted from large to small values
§ W^T (right singular vectors) is an M×M unitary matrix

* The definition of SVD is simplified. Refer to https://en.wikipedia.org/wiki/Singular_value_decomposition for the exact definition

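The factorization in NumPy; np.linalg.svd returns the diagonal of Σ as a vector of singular values, already sorted from large to small (toy matrix sizes):

```python
import numpy as np

Y = np.random.rand(6, 4)                          # an NxM matrix (N = 6, M = 4)
V, sigma, Wt = np.linalg.svd(Y, full_matrices=False)
print(V.shape, sigma.shape, Wt.shape)             # (6, 4), (4,), (4, 4)
print(np.allclose(Y, V @ np.diag(sigma) @ Wt))    # True: Y = V Sigma W^T
```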

SLIDE 49

Singular Value Decomposition

[Figure: the original N×M matrix Y written as the product of the N×M left singular vectors V, the M×M diagonal matrix of eigenvalues Σ, and the M×M right singular vectors W^T]

SLIDE 50

Latent Semantic Indexing

§ Step 1: apply SVD on the term-document matrix of the training data

☞ Not the document-term matrix! Although it is also possible to start with the document-term matrix, we follow the typical definition!

[Figure: the sparse N×M term-document matrix Y (N terms, M documents) factorized into the term vectors V, the eigenvalues Σ, and the document vectors W^T]

At Training Time

SLIDE 51

Latent Semantic Indexing

§ Step 2: keep only the top k eigenvalues in Σ and set the rest to zero, giving Σ̃
§ Truncate the V and W^T matrices based on the changes in Σ, giving Ṽ and W̃^T respectively
§ If we multiply the truncated matrices, we have the rank-k least-squares approximation to the original matrix:

    Ỹ = Ṽ Σ̃ W̃^T

At Training Time

SLIDE 52

Latent Semantic Indexing

[Figure: the smoothed N×M term-document matrix Ỹ as the product of the truncated term vectors Ṽ (N×k), the truncated eigenvalues Σ̃ (k×k), and the truncated document vectors W̃^T (k×M)]

§ The W̃ matrix contains the dense low-dimensional document vectors

At Training Time

SLIDE 53

Latent Semantic Indexing

§ Given a high-dimensional document vector e in N×1 dimensions, we want to project it into the low-dimensional space, resulting in a new vector ẽ with k×1 dimensions
§ This is done through the following calculation:

    ẽ = Σ̃⁻¹ Ṽ^T e

§ Exercise: check that the chain of dot products from e to ẽ results in the correct dimension

At Inference Time (Validation or Test)
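A sketch of the whole LSI flow in NumPy, training-time truncation plus the inference-time projection above (the matrix, k, and all sizes are arbitrary toy values):

```python
import numpy as np

Y = np.random.rand(1000, 200)   # toy term-document matrix: N = 1000 terms, M = 200 documents
k = 50

# Training time: SVD and truncation to the top-k singular values/vectors
V, sigma, Wt = np.linalg.svd(Y, full_matrices=False)
V_k, sigma_k, Wt_k = V[:, :k], sigma[:k], Wt[:k, :]
# the kxM matrix Wt_k holds the dense low-dimensional document vectors

# Inference time: project a new high-dimensional document vector e (Nx1) down to kx1
e = np.random.rand(1000)
e_low = np.diag(1.0 / sigma_k) @ V_k.T @ e   # the projection Sigma_k^{-1} V_k^T e
print(e_low.shape)                           # (50,) -> the kx1 projection
```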