Natural Language Processing with Deep Learning – Word Embeddings



SLIDE 1

Institute of Computational Perception

Natural Language Processing with Deep Learning Word Embeddings

Navid Rekab-Saz

navid.rekabsaz@jku.at

SLIDE 2

Agenda

  • Introduction
  • Count-based word representation
  • Prediction-based word embedding
SLIDE 3

Agenda

  • Introduction
  • Count-based word representation
  • Prediction-based word embedding
SLIDE 4

Distributional Representation

§ An entity is represented with a vector of 𝑒 dimensions
§ Distributed Representations
  • Each dimension (unit) is a feature of the entity
  • Units in a layer are not mutually exclusive
  • Two units can be “active” at the same time

[Figure: a distributed representation 𝒚 = (y1, y2, y3, …, y_𝑒) with 𝑒 dimensions]

SLIDE 5

Word Embedding Model

[Figure: an embedding model maps each word w1, w2, …, wO to an 𝑒-dimensional vector]

When vector representations are dense, they are often called embeddings, e.g. word embeddings.

SLIDE 6

Word embeddings projected to a two-dimensional space

SLIDE 7

Word Embeddings – Nearest neighbors

  • frog: frogs, toad, litoria, leptodactylidae, rana
  • book: books, foreword, author, published, preface
  • asthma: bronchitis, allergy, allergies, arthritis, diabetes

https://nlp.stanford.edu/projects/glove/
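The nearest-neighbor lists above can be reproduced with pretrained GloVe vectors. A minimal sketch (not part of the slides), assuming the gensim library and its downloadable "glove-wiki-gigaword-100" vectors:

```python
# Querying nearest neighbors of a word by cosine similarity of GloVe embeddings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained 100-dimensional GloVe vectors

for word in ["frog", "book", "asthma"]:
    neighbors = vectors.most_similar(word, topn=5)        # highest-cosine vocabulary words
    print(word, "->", [neighbor for neighbor, _ in neighbors])
```

The exact neighbor lists depend on the corpus and dimensionality of the downloaded vectors, so they may differ slightly from the slide.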

SLIDE 8

Word Embeddings – Linear substructures

§ Analogy task:
  • man to woman is like king to ? (queen)

  𝒚_woman − 𝒚_man + 𝒚_king = 𝒚*,   𝒚* ≈ 𝒚_queen

https://nlp.stanford.edu/projects/glove/
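A minimal sketch of the analogy computation (not part of the slides), again assuming gensim's pretrained "glove-wiki-gigaword-100" vectors; most_similar adds the positive vectors, subtracts the negative ones, and returns the nearest vocabulary word:

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
# y_woman - y_man + y_king, then look up the closest vocabulary word
print(vectors.most_similar(positive=["woman", "king"], negative=["man"], topn=1))
# expected to rank "queen" first
```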

SLIDE 9

Agenda

  • Introduction
  • Count-based word representation
  • Prediction-based word embedding
SLIDE 10

Intuition for Computational Semantics

“You shall know a word by the company it keeps!”

– J. R. Firth, A synopsis of linguistic theory 1930–1955 (1957)

SLIDE 11

Tesgüino (Nida, 1975)

[Figure: context words observed around "tesgüino": drink, fermented, bottle, out of corn, sacred, beverage, Mexico, alcoholic]

SLIDE 12

Ale

[Figure: context words observed around "ale": drink, bar, grain, medieval, pale, bottle, brew, fermentation, alcoholic]

SLIDE 13

Tesgüino ←→ Ale

Algorithmic intuition:

Two words are related when they have common context words

SLIDE 14

Word-Document Matrix – recap

§ 𝔻 is a set of documents (plays of Shakespeare): 𝔻 = [d1, d2, …, dN]
§ 𝕎 is the set of words (the vocabulary) in the dictionary: 𝕎 = [w1, w2, …, wO]
§ Words as rows and documents as columns
§ Values: term count tc_{w,d}
§ Matrix size: O×N

            d1              d2             d3             d4
            As You Like It  Twelfth Night  Julius Caesar  Henry V
  battle     1               1              8             15
  soldier    2               2             12             36
  fool      37              58              1              5
  clown      6             117              …              …
  …          …               …              …              …

SLIDE 15

Cosine

§ Cosine is the normalized dot product of two vectors
  • Its result is between -1 and +1

  cos(𝒚, 𝒛) = (𝒚 / ‖𝒚‖) · (𝒛 / ‖𝒛‖) = Σ_i y_i z_i / ( √(Σ_i y_i²) · √(Σ_i z_i²) )

§ Example: 𝒚 = (1, 1, 0), 𝒛 = (4, 5, 6)

  cos(𝒚, 𝒛) = (1·4 + 1·5 + 0·6) / ( √(1²+1²+0²) · √(4²+5²+6²) ) = 9 / (√2 · √77) ≈ 9 / 12.4 ≈ 0.73

SLIDE 16

Word-Document Matrix

(using the word–document matrix from SLIDE 14)

§ Similarity between two words:

  similarity(soldier, clown) = cos(𝒚_soldier, 𝒚_clown)

SLIDE 17

Context

§ Context can be
  • a document
  • a paragraph, a tweet
  • a window of (2–10) context words on each side of the word
§ Word-Context matrix
  • Every word is a unit (dimension): ℂ = [c1, c2, …, cM]
  • Matrix size: O×M
  • Usually ℂ = 𝕎 and therefore M = O
SLIDE 18

Word-Context Matrix

Example contexts from a corpus:

  sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of,
  their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
  well suited to programming on the digital computer. In finding the optimal R-stage policy from
  for the purpose of gathering data and information necessary for the study authorized in the

                   c1         c2        c3     c4      c5      c6
                   aardvark   computer  data   pinch   result  sugar
  w1 apricot                                    1               1
  w2 pineapple                                  1               1
  w3 digital                   2         1              1
  w4 information               1         6              4

§ Window context of 7 words
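A minimal sketch (not part of the slides) of how such a word-context count matrix is collected from a corpus with a symmetric context window; the toy corpus and the window size of 2 are illustrative assumptions:

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """Count how often each context word appears within `window` positions of each word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

corpus = ["a pinch of apricot jam with sugar",
          "data and information for the study"]
print(dict(cooccurrence_counts(corpus)["apricot"]))
# {'pinch': 1, 'of': 1, 'jam': 1, 'with': 1}
```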

SLIDE 19

Word-to-Word Relations

(using the word–context matrix from SLIDE 18)

§ First-order co-occurrence relation
  • Each cell of the word-context matrix
  • Words that appear in the proximity of each other
  • Like drink to beer, and drink to wine
§ Second-order similarity relation
  • Cosine similarity between the representation vectors
  • Words that appear in similar contexts
  • Like beer to wine, tesgüino to ale, and frog to toad
SLIDE 20

Pointwise Mutual Information

§ Problem with raw counting methods
  • Biased towards highly frequent words (“and”, “the”) although they do not carry much information
§ Pointwise Mutual Information (PMI)
  • Rooted in information theory
  • A better measure for the first-order relation in the word-context matrix, reflecting the informativeness of co-occurrences
  • Joint probability of two events (random variables) divided by their marginal probabilities

  PMI(Y, Z) = log [ q(Y, Z) / ( q(Y) · q(Z) ) ]

SLIDE 21

Pointwise Mutual Information

  PMI(w, c) = log [ q(w, c) / ( q(w) · q(c) ) ]

  q(w, c) = #(w, c) / T
  q(w) = Σ_{j=1..M} #(w, c_j) / T
  q(c) = Σ_{i=1..O} #(w_i, c) / T
  T = Σ_{i=1..O} Σ_{j=1..M} #(w_i, c_j)

§ Positive Pointwise Mutual Information (PPMI): PPMI(w, c) = max(PMI(w, c), 0)

SLIDE 22

Pointwise Mutual Information

(using the word–context matrix from SLIDE 18, with T = 19)

  q(w = information, c = data) = 6/19 = .32
  q(w = information) = 11/19 = .58
  q(c = data) = 7/19 = .37

  PPMI(w = information, c = data) = max(0, log( .32 / (.58 · .37) )) = .39
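A minimal NumPy sketch (not part of the slides) that reproduces the PPMI value above; the zero entries of the count matrix are an assumption filling in the cells the slide leaves empty, and the natural logarithm is used to match the .39 result:

```python
import numpy as np

words = ["apricot", "pineapple", "digital", "information"]
contexts = ["aardvark", "computer", "data", "pinch", "result", "sugar"]
counts = np.array([
    [0, 0, 0, 1, 0, 1],   # apricot
    [0, 0, 0, 1, 0, 1],   # pineapple
    [0, 2, 1, 0, 1, 0],   # digital
    [0, 1, 6, 0, 4, 0],   # information
], dtype=float)

T = counts.sum()                        # total number of co-occurrences (19)
q_wc = counts / T                       # joint probabilities q(w, c)
q_w = q_wc.sum(axis=1, keepdims=True)   # marginals q(w)
q_c = q_wc.sum(axis=0, keepdims=True)   # marginals q(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(q_wc / (q_w * q_c))
ppmi = np.maximum(pmi, 0.0)
ppmi = np.nan_to_num(ppmi)              # zero-count cells become 0

print(round(ppmi[words.index("information"), contexts.index("data")], 2))   # 0.39
```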

SLIDE 23

Singular Value Decomposition – Recap

§ An O×M matrix 𝒀 can be factorized into three matrices:

  𝒀 = 𝑽 𝜯 𝑾ᵀ

§ 𝑽 (left singular vectors) is an O×M unitary matrix
§ 𝜯 is an M×M diagonal matrix; its diagonal entries
  • are eigenvalues,
  • show the importance of the corresponding M dimensions in 𝒀,
  • are all positive and sorted from large to small values
§ 𝑾ᵀ (right singular vectors) is an M×M unitary matrix

* The definition of SVD is simplified. Refer to https://en.wikipedia.org/wiki/Singular_value_decomposition for the exact definition.

SLIDE 24

Singular Value Decomposition – Recap

[Figure: the original O×M matrix 𝒀 written as the product of the O×M left singular vectors 𝑽, the M×M diagonal matrix of eigenvalues 𝜯, and the M×M right singular vectors 𝑾ᵀ]

SLIDE 25

Applying SVD to Word-Context Matrix

[Figure: the (sparse) O×M word-context matrix 𝒀 decomposed into the O×M word vectors 𝑽, the M×M eigenvalues 𝜯, and the M×M context vectors 𝑾ᵀ]

§ Step 1: create a sparse PPMI matrix of size O×M
§ Apply SVD

SLIDE 26

Applying SVD to Term-Context Matrix

§ Step 2: keep only the top 𝑒 eigenvalues in 𝜯 and set the rest to zero
§ Truncate the 𝑽 and 𝑾ᵀ matrices based on the changes in 𝜯; the truncated matrices are denoted 𝑽̃ and 𝑾̃ᵀ respectively

SLIDE 27

Applying SVD to Term-Context Matrix

[Figure: after truncation, the O×𝑒 truncated word vectors 𝑽̃, the 𝑒×𝑒 truncated eigenvalues 𝜯̃, and the 𝑒×M truncated context word vectors 𝑾̃ᵀ]

§ The 𝑽̃ matrix contains the dense low-dimensional word vectors
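A minimal sketch (not part of the slides) of the truncated SVD step on the small PPMI matrix from the previous sketch, keeping 𝑒 = 2 dimensions; numpy.linalg.svd returns the left singular vectors (the slides' 𝑽), the diagonal values (the slides' 𝜯), and the right singular vectors (the slides' 𝑾ᵀ):

```python
import numpy as np

counts = np.array([[0, 0, 0, 1, 0, 1],    # apricot
                   [0, 0, 0, 1, 0, 1],    # pineapple
                   [0, 2, 1, 0, 1, 0],    # digital
                   [0, 1, 6, 0, 4, 0]], dtype=float)
p = counts / counts.sum()
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p / (p.sum(1, keepdims=True) * p.sum(0, keepdims=True)))
ppmi = np.nan_to_num(np.maximum(pmi, 0.0))           # sparse PPMI word-context matrix

e = 2
left, diag, right = np.linalg.svd(ppmi, full_matrices=False)
word_vectors = left[:, :e]                           # dense low-dimensional word vectors
context_vectors = right[:e, :]                       # truncated context vectors

# apricot and pineapple have identical PPMI rows, so their dense vectors coincide
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine(word_vectors[0], word_vectors[1]))      # ≈ 1.0
```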

SLIDE 28

Agenda

  • Introduction
  • Count-based word representation
  • Prediction-based word embedding
SLIDE 29

Word Embedding with Neural Networks

Recipe for creating (dense) word embeddings with neural networks:

§ Design a neural network architecture!
§ Loop over the training data (w, c) for some epochs
  • Pass the word w as input and execute the forward pass
  • Calculate the probability of observing the context word c at the output: q(c|w)
  • Optimize the network to maximize this likelihood

Details come next!

SLIDE 30

Training Data 𝒟

[Figure: (word, context) training pairs extracted from a sentence with a window of size 2]

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
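A minimal sketch (not part of the slides) of how the (word, context) training pairs 𝒟 are extracted with a window of size 2; the example sentence is an assumption:

```python
def skipgram_pairs(tokens, window=2):
    """All (input word, context word) pairs within the given window."""
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```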

SLIDE 31

Neural embeddings – architecture

Train sample: (Tesgüino, drink)

[Figure: skip-gram architecture. A 1×O one-hot input layer is mapped by the encoder embedding matrix 𝑭 (O×𝑒) to a 1×𝑒 hidden layer with linear activation, and then by the decoder embedding matrix 𝑽ᵀ (𝑒×O) to a 1×O output layer with softmax, which gives q(drink|Tesgüino). Training alternates a forward pass and backpropagation.]

https://web.stanford.edu/~jurafsky/slp3/

SLIDE 32

[Figure: embedding vectors, with "ale" and "tesgüino" highlighted]

SLIDE 33

[Figure: embedding vectors, with "ale" and "tesgüino" highlighted]

SLIDE 34

[Figure: embedding vectors and decoding vectors, with "ale" and "tesgüino" highlighted]

SLIDE 35

[Figure: embedding vectors of "ale" and "tesgüino" and the decoding vector of "drink"]

SLIDE 36

[Figure: embedding vectors of "ale" and "tesgüino" and the decoding vector of "drink"]

SLIDE 37

[Figure: embedding vectors of "ale" and "tesgüino" and the decoding vector of "drink"]

  • Train sample: (Tesgüino, drink)
  • Update vectors to maximize q(drink|Tesgüino)

SLIDE 38

Neural embeddings – prediction

§ Since the hidden layer is linear, the output vector is:

  𝒂 = 𝒇_w 𝑽ᵀ

§ Probability distribution over the output words using softmax (sketched below):

  q(c|w) = softmax(𝒂)_c = exp(𝒇_w 𝒗_cᵀ) / Σ_{c̃∈𝕎} exp(𝒇_w 𝒗_c̃ᵀ)

§ In this example:

  q(drink|Tesgüino) = exp(𝒇_Tesgüino 𝒗_drinkᵀ) / Σ_{c̃∈𝕎} exp(𝒇_Tesgüino 𝒗_c̃ᵀ)

The denominator (normalization) is expensive!
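A minimal NumPy sketch (not part of the slides) of this forward pass; the vocabulary size, embedding dimension, and random matrices 𝑭 and 𝑽 are toy assumptions standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
O, e = 1000, 50                               # toy vocabulary size and embedding dimension
F = rng.normal(scale=0.1, size=(O, e))        # encoder embeddings (one row f_w per word)
V = rng.normal(scale=0.1, size=(O, e))        # decoder embeddings (one row v_c per context word)

def q_context_given_word(w):
    a = F[w] @ V.T                            # logits a = f_w V^T, one per vocabulary word
    a -= a.max()                              # shift for numerical stability
    exp_a = np.exp(a)
    return exp_a / exp_a.sum()                # softmax: q(c|w) for every c

probs = q_context_given_word(42)
print(probs.shape, probs.sum())               # (1000,) 1.0 -- normalizing over all O words is the expensive part
```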

SLIDE 39

Neural embeddings – loss

§ Loss function: negative log-likelihood (NLL) over all training samples 𝒟

  ℒ = −𝔼_{(w,c)∼𝒟} [ log q(c|w) ]

  ℒ = −𝔼_{(w,c)∼𝒟} [ log ( exp(𝒇_w 𝒗_cᵀ) / Σ_{c̃∈𝕎} exp(𝒇_w 𝒗_c̃ᵀ) ) ]

  ℒ = −𝔼_{(w,c)∼𝒟} [ 𝒇_w 𝒗_cᵀ − log Σ_{c̃∈𝕎} exp(𝒇_w 𝒗_c̃ᵀ) ]
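A minimal sketch (not part of the slides) of an empirical estimate of this loss over a handful of toy (word, context) index pairs; it uses the log-softmax form from the last line above:

```python
import numpy as np

rng = np.random.default_rng(0)
O, e = 1000, 50
F = rng.normal(scale=0.1, size=(O, e))        # encoder embeddings
V = rng.normal(scale=0.1, size=(O, e))        # decoder embeddings

def log_q(w, c):
    a = F[w] @ V.T                                            # logits over the vocabulary
    m = a.max()
    return (a[c] - m) - np.log(np.exp(a - m).sum())           # f_w v_c^T - log-sum-exp, stabilized

pairs = [(42, 7), (42, 99), (13, 7)]          # toy samples standing in for D
loss = -np.mean([log_q(w, c) for w, c in pairs])
print(loss)                                   # negative log-likelihood estimate
```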

SLIDE 40

Neural embeddings – word embedding

§ word2vec uses the encoding matrix 𝑭 as the final word embeddings
§ Other possibilities (see the sketch below)
  • Mean of the vectors of 𝑭 and 𝑽 for each word
  • Concatenation of the vectors in 𝑭 and 𝑽 for each word
  • This results in vectors with 2𝑒 dimensions

[Figure: the encoder matrix 𝑭 (O×𝑒) and the decoder matrix 𝑽ᵀ (𝑒×O)]
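A minimal sketch (not part of the slides) of the three extraction options, using toy 𝑭 and 𝑽 matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
O, e = 1000, 50
F = rng.normal(size=(O, e))                   # encoder embeddings
V = rng.normal(size=(O, e))                   # decoder embeddings

w = 42
emb_word2vec = F[w]                           # word2vec's choice: the row of F
emb_mean = (F[w] + V[w]) / 2                  # mean of encoder and decoder vectors
emb_concat = np.concatenate([F[w], V[w]])     # concatenation: 2e dimensions
print(emb_word2vec.shape, emb_mean.shape, emb_concat.shape)   # (50,) (50,) (100,)
```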

SLIDE 41

word2vec (Skip-Gram) with Negative Sampling

§ word2vec is an efficient and effective algorithm that proposes the Negative Sampling method to define the loss
§ In Negative Sampling, instead of q(c|w), the network estimates q(z=1|w,c): the probability that the co-occurrence of (w, c) comes from a genuine distribution, namely from the training corpus
§ q(z=1|w,c) is defined using the sigmoid σ:

  q(z=1|w,c) = σ(𝒇_w · 𝒗_cᵀ)

SLIDE 42

word2vec (Skip-Gram) with Negative Sampling

§ When two words (w, c) appear in the training data (genuine distribution), they form a positive sample
§ Negative Sampling aims to distinguish between the co-occurrence probability of w in a positive sample, q(z=1|w,c), and its co-occurrences in l negative samples with context words c̃, q(z=1|w,c̃)

SLIDE 43

word2vec (Skip-Gram) with Negative Sampling

§ Negative samples are drawn from the noisy distribution 𝒟̃
  • 𝒟̃ represents a randomly created corpus → words co-occur randomly!
  • Why can a random co-occurrence be assumed to be a negative sample?
§ The noisy distribution 𝒟̃ is defined using a smoothed unigram distribution of the words in the corpus
  • In word2vec, 𝒟̃ is smoothed by raising the unigram counts to the power of β = 0.75 → Context Distribution Smoothing (see the sketch below)
§ The number of negative samples l is usually between 2 and 20
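A minimal sketch (not part of the slides) of the smoothed noise distribution and of drawing l negative context words from it; the unigram counts are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
unigram_counts = rng.integers(1, 1000, size=1000).astype(float)   # toy per-word corpus counts

beta = 0.75
noise_probs = unigram_counts ** beta          # context distribution smoothing
noise_probs /= noise_probs.sum()              # normalize to a probability distribution

l = 5                                         # number of negative samples per positive pair
negatives = rng.choice(len(noise_probs), size=l, p=noise_probs)
print(negatives)                              # indices of the sampled negative context words
```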

SLIDE 44

word2vec with Negative Sampling – Objective Function

§ The objective function
  • increases the probability for the positive sample (w, c)
  • decreases the probability for the l negative samples (w, c̃)
§ Loss function:

  ℒ = −𝔼_{(w,c)∼𝒟} [ log q(z=1|w,c) ] − Σ_{c̃∼𝒟̃, 1..l} log ( 1 − q(z=1|w,c̃) )

  ℒ = −𝔼_{(w,c)∼𝒟} [ log σ(𝒇_w 𝒗_cᵀ) ] − Σ_{c̃∼𝒟̃, 1..l} log σ(−𝒇_w 𝒗_c̃ᵀ)

  (first term: positive samples; second term: negative samples)
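A minimal NumPy sketch (not part of the slides) of this loss for one positive pair and l negatives; the indices and the random 𝑭 and 𝑽 matrices are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
O, e, l = 1000, 50, 5
F = rng.normal(scale=0.1, size=(O, e))        # encoder embeddings f_w
V = rng.normal(scale=0.1, size=(O, e))        # decoder embeddings v_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w, c = 42, 7                                  # positive pair (w, c) from the corpus
negatives = rng.integers(0, O, size=l)        # negative context words, ideally drawn from the noise distribution

loss = (-np.log(sigmoid(F[w] @ V[c]))                          # positive-sample term
        - np.sum(np.log(sigmoid(-F[w] @ V[negatives].T))))     # one term per negative sample
print(loss)
```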

SLIDE 45

[Figure: the embedding vector of "tesgüino" and the decoding vector of "drink"]

  • Train sample: (Tesgüino, drink)

SLIDE 46

[Figure: the embedding vector of "tesgüino" and the decoding vector of "drink"]

  • Train sample: (Tesgüino, drink)
  • Sample l negative context words

SLIDE 47

[Figure: the embedding vector of "tesgüino" and the decoding vector of "drink"]

  • Train sample: (Tesgüino, drink)
  • Sample l negative context words
  • Update vectors to
    • Maximize q(z=1|Tesgüino, drink)
    • Minimize q(z=1|Tesgüino, c̃)

SLIDE 48

Negative Sampling – final words!

§ Negative Sampling turns the problem from multi-class classification into binary classification
§ Negative Sampling is a biased approximation of softmax
  • Noise Contrastive Estimation (the parent of Negative Sampling) is an unbiased approximation of softmax
§ Softmax is a good choice for training language models, namely to estimate q(c|w)
§ word2vec and Negative Sampling aim to train good embeddings