Text Representation: Bag-of-Words and Word Embeddings


SLIDE 1

Text Representation

Bag-of-Words and Word Embeddings

Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

  • Prof. Mike Hughes

[Figure preview: a count vector over a large (fixed-size) vocabulary, built from an unordered “bag” of vocab symbols; shown in full on the Bag-of-words representation slide below.]

SLIDE 2

PROJECT 2: Text Sentiment Classification


SLIDE 3

Example Text Reviews + Labels


Food was so gooodd. I could eat their bruschetta all day it is devine.
The Songs Were The Best And The Muppets Were So Hilarious.
VERY DISAPPOINTING. there was NO SPEAKERPHONE!!!!

SLIDE 4

Issues our representation might need to handle


Food was so gooodd. I could eat their bruschetta all day it is devine.
The Songs Were The Best And The Muppets Were So Hilarious.
VERY DISAPPOINTING. there was NO SPEAKERPHONE!!!!

Misspellings? Unfamiliar words? Punctuation? Capitalization?

SLIDE 5

Sentiment Analysis

  • Question: How to represent text reviews?


Friendly staff, good tacos, and fast service. What more can you look for at taco bell?

φ(x_n)?

Need to produce a feature vector of the same length for every sentence, whether it has 2 words or 200 words. Raw sentences vary in length and content.

SLIDE 6

Proposal:
  1) Define a fixed vocabulary (size F)
  2) Feature representation: count how often each term in the vocabulary appears in each review

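To make the proposal concrete, here is a minimal pure-Python sketch of the counting step; the small vocabulary and the tokenization rule are stand-ins, not the course's actual setup:

```python
# Minimal bag-of-words counting sketch (hypothetical vocabulary).
vocabulary = ["the", "and", "or", "dinosaur", "best", "were", "so", "hilarious"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def bow_count_vector(review, word_to_index):
    """Map one review string to a count vector of length F = len(vocabulary)."""
    counts = [0] * len(word_to_index)
    for token in review.lower().split():
        token = token.strip(".,!?")          # crude punctuation handling
        if token in word_to_index:           # out-of-vocabulary words are excluded
            counts[word_to_index[token]] += 1
    return counts

review = "The Songs Were The Best And The Muppets Were So Hilarious."
print(bow_count_vector(review, word_to_index))
# [3, 1, 0, 0, 1, 2, 1, 1]  -> 'the' appears 3 times, 'were' twice, etc.
```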

SLIDE 7


Bag-of-words representation

[Figure: the original sentence “The Songs Were The Best And The Muppets Were So Hilarious.” is split into an unordered “bag” of vocab symbols (the, songs, were, and, best, so, hilarious, muppets), then mapped through a predefined vocabulary (0: the, 1: and, 2: or, 3: dinosaur, …, 5005: hilarious) to a count vector φ(x_n) over the large (fixed-size) vocabulary: the appears 3 times, were twice, and best, and, so, hilarious once each. Out-of-vocabulary words such as “muppets” are excluded.]

SLIDE 8

Bag of words example


Food was so gooodd. I could eat their bruschetta all day it is devine.
The Songs Were The Best And The Muppets Were So Hilarious.
So Hilarious were the Muppets and the songs were the best
VERY DISAPPOINTING. there was NO SPEAKERPHONE!!!!

review (abbreviated)                     food  the  eat  was/were  best/good  disappoint  no  so
Food was so gooodd. ...                    1    0    1      1         0           0        0   1
The Songs Were The Best ...                0    3    0      2         1           0        0   1
So Hilarious were the Muppets ...          0    3    0      2         1           0        0   1
VERY DISAPPOINTING. there was NO ...       0    0    0      1         0           1        1   0

The second and third reviews contain the same words in a different order, so they get identical count vectors: bag-of-words ignores word order.

Most entries in BoW features will be zero. Can use sparse matrices to store/process efficiently. Each column of the BoW feature array is interpretable.
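A short sketch of how this looks with sklearn's CountVectorizer, which returns exactly this kind of sparse count matrix (reviews taken from the example above):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "Food was so gooodd. I could eat their bruschetta all day it is devine.",
    "The Songs Were The Best And The Muppets Were So Hilarious.",
    "VERY DISAPPOINTING. there was NO SPEAKERPHONE!!!!",
]

vectorizer = CountVectorizer()          # builds the vocabulary from the data
X = vectorizer.fit_transform(reviews)   # scipy.sparse matrix, shape (3, F)

print(type(X))                          # sparse: only nonzero entries stored
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])  # columns are interpretable terms
print(X.toarray()[1])                   # dense view of one review's counts
```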

SLIDE 9

BoW: Key Design Decisions for Project B


SLIDE 10


Sentiment Analysis

  • Question: How to represent text reviews?

Friendly staff, good tacos, and fast service. What more can you look for at taco bell?

Option 1: Bag-of-words count vectors
Option 2: Word embedding vectors

SLIDE 11

Word Embeddings (word2vec)


Goal: map each word in vocabulary to high-dimensional vector

  • Preserve semantic meaning in this new vector space

The embedding map is implemented as a simple lookup table:

  • In: vocabulary word as string (“walked”)
  • Out: 50-dimensional vector of reals

Only words in the predefined vocabulary can be mapped to a vector.
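A minimal sketch of such a lookup table, using made-up 4-dimensional vectors in place of real 50-dimensional pretrained values:

```python
import numpy as np

# Hypothetical pretrained lookup table: word -> vector (real tables map
# each vocabulary word to e.g. a 50-dimensional vector of reals).
embedding = {
    "walked":   np.array([0.1, -1.3,  2.0, 0.7]),
    "walking":  np.array([0.2, -1.1,  2.1, 0.9]),
    "dinosaur": np.array([3.0,  0.4, -2.2, 1.5]),
}

def embed(word, table):
    """In: vocabulary word as string. Out: its vector, or None if out of vocabulary."""
    return table.get(word)  # only in-vocabulary words can be mapped

print(embed("walked", embedding))
print(embed("muppets", embedding))  # None: out of vocabulary
```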

SLIDE 12

Word Embeddings (word2vec)


Goal: map each word in vocabulary to high-dimensional vector

  • Preserve semantic meaning in this new vector space

vec(swimming) – vec(swim) + vec(walk) = vec(walking)
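The analogy can be checked numerically: compute the query vector and find the nearest word by cosine similarity. A sketch with toy 3-dimensional vectors (real embeddings only satisfy such analogies approximately):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embedding table; real word2vec/GloVe vectors are 50+ dimensional.
emb = {
    "swim":     np.array([1.0, 0.0, 0.0]),
    "swimming": np.array([1.0, 1.0, 0.0]),
    "walk":     np.array([0.0, 0.0, 1.0]),
    "walking":  np.array([0.0, 1.0, 1.0]),
}

query = emb["swimming"] - emb["swim"] + emb["walk"]
best = max(emb, key=lambda w: cosine(emb[w], query))
print(best)  # walking
```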

SLIDE 13


Word Embeddings (word2vec)

Goal: map each word in vocabulary to high-dimensional vector

  • Preserve semantic meaning in this new vector space
SLIDE 14

How to learn the embedding?

Training: reward embeddings that predict nearby words in the sentence.

[Figure: an embedding matrix W with one row per word in the fixed vocabulary (typically 1,000-100k words; e.g. tacos, staff, dinosaur, hammer) and one column per embedding dimension (typically 100-1000); entries are real values such as 3.2, -4.1, 7.1. Goal: learn the weights W.]

Credit: https://www.tensorflow.org/tutorials/representation/word2vec
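To illustrate "predict nearby words", here is a hedged sketch of how skip-gram-style training pairs are typically built from a sentence; the window size and pairing scheme are assumptions, and actual word2vec training involves much more than this:

```python
# Build (center word, nearby word) training pairs from a sentence,
# skip-gram style: training rewards embeddings whose center word
# predicts its neighbors.
def training_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "friendly staff good tacos and fast service".split()
for center, context in training_pairs(sentence)[:5]:
    print(center, "->", context)
```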

SLIDE 15


Example: Word embedding features

Food was so gooodd. I could eat their bruschetta all day it is devine.
The Songs Were The Best And The Muppets Were So Hilarious.
VERY DISAPPOINTING. there was NO SPEAKERPHONE!!!!

[Table: each word mapped to a 50-dimensional embedding (dim1, dim2, …, dim50) with dense real-valued entries such as +1.2, -3.2, +20.1, -6.8, +5.8, -22.5.]

Entries will be dense and real-valued (negative or positive). Each column of the feature array might be difficult to interpret.
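One common way to turn per-word embeddings into the fixed-length per-review feature vector φ(x_n) is to average the vectors of the in-vocabulary words. A minimal sketch, where the random embedding table is a stand-in for real pretrained vectors:

```python
import numpy as np

D = 50  # embedding dimension used on the slide

def review_features(review, embedding, dim=D):
    """Average the embeddings of in-vocabulary words; zeros if none match."""
    vectors = [embedding[w] for w in review.lower().split() if w in embedding]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Hypothetical lookup table mapping words to D-dimensional vectors.
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=D) for w in ["food", "was", "so", "good"]}

phi = review_features("Food was so gooodd.", embedding)
print(phi.shape)  # (50,) -- same length no matter how long the review is
```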

SLIDE 16

GloVe: Key Design Decisions for Project B


SLIDE 17


What features are best? What classifier is best? What hyperparameters are best?

PROJECT 2: Text Sentiment Classification

SLIDE 18

Lab: Bag of Words

  • Parts 1-3: Pure Python to build BoW features
  • Part 4: How to use BoW features with a classifier
  • Part 5: sklearn CountVectorizer
  • Part 6: Doing grid search with sklearn pipelines (sketched below)
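For Part 6, a minimal sketch of wiring CountVectorizer and a classifier into a sklearn Pipeline and grid searching over both; the classifier choice and parameter grid values here are illustrative, not the lab's:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-name prefixes ("bow__", "clf__") route each parameter to its pipeline step.
param_grid = {
    "bow__min_df": [1, 2, 5],          # drop rare vocabulary terms
    "clf__C": [0.01, 0.1, 1.0, 10.0],  # regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=5)
# search.fit(train_reviews, train_labels)   # lists of strings and 0/1 labels
# print(search.best_params_, search.best_score_)
```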
