Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] - PowerPoint PPT Presentation



SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Text Mining

SLIDE 2

Overview

Featurization Traditional approaches Word embeddings and representational learning A look ahead Tooling

SLIDE 3

Text is everywhere

Medical records Product reviews Repair notes Facebook posts Book recommendations Tweets Declarations Legislation, court decisions Emails Websites …

SLIDE 4

But it is unstructured

Or “semi-structured”, rather, though still:

No direct “feature vector” representation
Linguistic structure: language, relationships between words, importance of words, negations, etc.
Text is dirty: grammatical and spelling errors, abbreviations, homographs
Text is intended for communication between people

SLIDE 5

The trick for unstructured data

It all boils down to featurization

Approaches to convert the unstructured data to a structured feature vector: “feature engineering”

Just as for computer vision, we can split approaches into a “before deep learning” and “after deep learning” era

SLIDE 6

Basic Text Featurization

SLIDE 7

So, how to featurize?

The goal is to take a collection of documents – each of which is a relatively free-form sequence of words – and turn it into the familiar feature-vector representation

A collection of documents: a corpus (plural: corpora)
A document is composed of individual tokens or terms (words, but can be broader as well, e.g. punctuation tokens, abbreviations, smileys, …)
Mostly: each document is one instance, but sentences can also form instances

SLIDE 8

Bag of words

The “keep it simple approach”

Simply treat every document (instance) as a collection of individual tokens (words, most often)

Ignore grammar, word order, sentence structure, and (usually) punctuation
Treat every word in a document as a potentially important keyword of the document

Each document is represented by a sequence of ones (the token is present in the document) and zeros (the token is not present in the document)

I.e. one feature per word

Inexpensive to generate, though leads to an explosion of features
Can work in some settings with careful filtering
Alternatively, frequencies instead of binary features can be used (“term frequency matrix”)

Vocabulary: this, is, an, a, example, second

“This is an example”        => 1 1 1 0 1 0
“This is a second example”  => 1 1 0 1 1 1
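The binary bag-of-words featurization above can be sketched in a few lines of plain Python (an illustrative sketch, not the course's code; the vocabulary here is built in first-seen order, which differs slightly from the slide's ordering):

```python
# Illustrative sketch: binary bag-of-words featurization.
def bag_of_words(docs):
    vocab = []
    for doc in docs:
        for token in doc.lower().split():
            if token not in vocab:
                vocab.append(token)  # vocabulary in first-seen order
    # one binary feature per vocabulary word
    vectors = [[1 if term in doc.lower().split() else 0 for term in vocab]
               for doc in docs]
    return vocab, vectors

docs = ["This is an example", "This is a second example"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['this', 'is', 'an', 'example', 'a', 'second']
print(vectors)  # [[1, 1, 1, 1, 0, 0], [1, 1, 0, 1, 1, 1]]
```

Replacing the `1 if … else 0` test with a token count gives the term-frequency variant mentioned above.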

SLIDE 9

Normalization, stop-word removal, stemming

“Can work in some settings with careful filtering”

The case should be normalized

E.g. every term is in lowercase

Words should be stemmed (stemming)

Suffixes are removed (only the root is kept)
E.g., noun plurals are transformed to singular forms
Porter’s stemming algorithm: basically suffix-stripping
More complex: lemmatization (e.g. better → good, flies/flight → fly)
If context is required: part-of-speech (PoS) tagging (see later)

Stop-words should be removed (stop word removal)

A stop-word is a very common word in English (or whatever language is being parsed)
Typically words such as “the”, “and”, “of”, “on” … are removed
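These three pre-processing steps can be sketched as follows (an illustrative sketch only: the stop-word list and suffix rules are tiny made-up examples; a real pipeline would use e.g. NLTK's stop-word lists and Porter stemmer):

```python
# Illustrative sketch: case normalization, stop-word removal, crude stemming.
STOP_WORDS = {"the", "and", "of", "on", "is", "a", "an", "in"}  # toy list

def crude_stem(token):
    # naive suffix-stripping, checked longest-first (NOT the Porter algorithm)
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = text.lower().split()                        # case normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [crude_stem(t) for t in tokens]               # stemming

print(preprocess("The cats are jumping on the mats"))
# → ['cat', 'are', 'jump', 'mat']
```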

SLIDE 10

(Normalized) term frequency

Recall that we can use bag of words with counts (word frequencies)

Nice, as this differentiates between how many times a word is used
“Document-term matrix”

However:

Documents can be of various lengths
Words occur with different frequencies
Words should not be too common or too rare
Both an upper and a lower limit on the number (or fraction) of documents in which a word may occur

So the raw term frequencies are best normalized in some way

Such as by dividing each by the total number of words in the document
Or by the frequency of the specific term in the corpus
And as always: think about the production context; what do we do when we encounter a previously unseen term?

SLIDE 11

TF-IDF

Term frequency (TF) is calculated per term t per document d

Number of times term appears divided by document length

However, frequency in the corpus also plays a role: terms should not be too rare, but also not too common. So we need a measure of sparseness

Inverse document frequency (IDF) is calculated per term t over the corpus c

One plus logarithm of total number of documents divided by documents containing term

TF(t, d) = |{w ∈ d : w = t}| / |d|

IDF(t) = 1 + log( |c| / |{d ∈ c : t ∈ d}| )

SLIDE 12

TF-IDF

SLIDE 13

TF-IDF

Gives you a weight for each term in each document
Perfect for our feature matrix
Rewards terms that occur frequently in the document
But penalizes terms that occur frequently in the whole collection

A vector of weights per document is obtained:

TFIDF(t, d) = TF(t, d) × IDF(t)
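The two formulas above can be implemented directly (an illustrative sketch with a made-up three-document toy corpus; production code would use e.g. scikit-learn's TfidfVectorizer, whose smoothing details differ slightly):

```python
# Illustrative sketch of the slide's formulas:
#   TF(t, d)    = |{w in d : w = t}| / |d|
#   IDF(t)      = 1 + log(|c| / |{d in c : t in d}|)
#   TFIDF(t, d) = TF(t, d) * IDF(t)
import math

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, corpus):
    containing = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["jazz", "saxophonist", "born", "kansas"],
          ["jazz", "trumpet", "player"],
          ["classical", "violin", "player"]]

# "jazz" appears in 2 of the 3 documents, so it is weighted down
# relative to the rarer "saxophonist":
print(tfidf("jazz", corpus[0], corpus))
print(tfidf("saxophonist", corpus[0], corpus))
```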

SLIDE 14

TF-IDF: example

15 prominent jazz musicians and excerpts of their biographies from Wikipedia
Nearly 2,000 features after stemming and stop-word removal!
Consider the sample phrase “Famous jazz saxophonist born in Kansas who played bebop and latin”

SLIDE 15

Dealing with high dimensionality

Feature selection will often need to be applied
Fast and scalable classification or clustering techniques will need to be used

E.g. linear Naive Bayes and Support Vector Machines have proven to work well in this setting
Using clustering techniques based on non-negative matrix factorization

Also recall “the hashing trick” from pre-processing: collapse the high number of features into n hashed features
Use dimensionality reduction techniques like t-SNE or UMAP
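The hashing trick can be sketched as follows (illustrative only; the bucket count of 16 is a made-up toy value, and CRC32 stands in for the hash functions real libraries such as scikit-learn's HashingVectorizer use):

```python
# Illustrative sketch of "the hashing trick": every token is hashed into one
# of a fixed number of buckets, so the feature vector never grows with the
# vocabulary. Distinct tokens may collide in the same bucket; that is the
# price paid for the fixed size.
import zlib

def hashed_features(tokens, n_buckets=16):
    vec = [0] * n_buckets
    for token in tokens:
        # a stable hash (Python's built-in hash() is salted per process)
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        vec[bucket] += 1
    return vec

print(hashed_features("the quick brown fox jumps".split()))
```

A nice side effect: a previously unseen term still maps to some bucket, so the production-time question above has a built-in (if lossy) answer.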

SLIDE 16

N-gram sequences

What if word order is important?

A next step from the previous techniques is to use sequences of adjacent words as terms
Adjacent pairs are commonly called bi-grams

Example: “The quick brown fox jumps”

Would be transformed into {quick_brown, brown_fox, fox_jumps}
Can be combined together with 1-grams: {quick, brown, fox, jumps}

But: N-grams greatly increase the size of the feature set
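N-gram generation is a one-liner (an illustrative sketch; the example assumes "the" has already been dropped by stop-word removal, matching the slide's output):

```python
# Illustrative sketch: generate n-gram terms from a token sequence.
def ngrams(tokens, n):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "quick brown fox jumps".split()  # after stop-word removal of "the"
print(ngrams(tokens, 2))  # → ['quick_brown', 'brown_fox', 'fox_jumps']
# combined with 1-grams:
print(ngrams(tokens, 1) + ngrams(tokens, 2))
```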

SLIDE 17

Natural language processing (NLP)

Key idea: use machine learning and statistical approaches to learn a language from data
Better suited to deal with

Contextual information (“This camera sucks” vs. “This vacuum cleaner really sucks”)
Negations
Sarcasm

Best known tasks:

PoS (Part of Speech) tagging: noun, verb, subject, …
Named entity recognition

SLIDE 18

Part of speech

SLIDE 19

Named entity recognition

SLIDE 20

Named entity recognition

SLIDE 21

Word Embeddings and Representational Learning

SLIDE 22

Vector space models

Represent an item (a document, a word) as a vector of numbers. Such a vector could for instance correspond to the documents in which the word occurs:

banana → [0 1 0 1 0 0 2 0 1 0 1 0]  (non-zero entries at Doc2, Doc4, Doc7, Doc9, Doc11)

Recall: before we had the opposite representation: a vector describing word presence per document

SLIDE 23

Vector space models

The vector can also correspond to the neighboring word context:

banana → [1 1 1 1 1]  (yellow, on, grows, tree, africa)

“yellow banana grows on trees in africa”

SLIDE 24

Word embeddings

The goal is to construct a dense vector of real values per word
The vector dimension is typically much smaller than the number of items (the vocabulary size)
You can imagine the vectors as coordinates for items in the “embedding space”
In other words: for each item (word), we obtain a representation (a vector of real values)
Distance metrics can be used to define a notion of relatedness between items in this space

What defines a “good” embedding is a good thinking point

E.g. in essence, dimensionality reduction techniques also embed in a reduced feature space
What is a good embedding in e.g. unsupervised vs. supervised contexts? For text, structured data, imagery?

SLIDE 25

Word embeddings

Man is to woman as king is to ____ ?
Good is to best as smart is to ____ ?
China is to Beijing as Russia is to ____ ?

[king] – [man] + [woman] ≈ [queen]

It turns out that using the idea of context leads to embeddings that can be used for similarity and analogy tasks
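The analogy arithmetic can be made concrete with tiny made-up vectors (an illustrative sketch only: real embeddings have hundreds of dimensions learned from data, and these 3-d values are invented for the example):

```python
# Illustrative sketch: solve "man is to woman as king is to ?" via
# king - man + woman, then pick the nearest vocabulary vector by
# cosine similarity. All vectors below are MADE UP.
import math

vectors = {
    "man":   [0.0, 1.0, 1.0],
    "woman": [0.0, -1.0, 1.0],
    "king":  [1.0, 1.0, 1.0],
    "queen": [1.0, -1.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def analogy(a, b, c):
    # target = c - a + b, e.g. king - man + woman
    target = [vectors[c][i] - vectors[a][i] + vectors[b][i] for i in range(3)]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "woman", "king"))  # → queen
```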

SLIDE 26

How to construct word embeddings?

Matrix factorization based

Non-negative matrix factorization
GloVe (Word-NeighboringWord)
See https://nlp.stanford.edu/projects/glove/

Neural network based: word2vec

word2vec was released by Google in 2013
See https://code.google.com/archive/p/word2vec/
Neural network-based implementation that learns a vector representation per word

Background information

https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
https://iksinc.wordpress.com/tag/continuous-bag-of-words-cbow/
https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/
https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e

SLIDE 27

word2vec

Word2vec: convert each term to a vector representation

Works at the term level, not at the document level (by default)
Such a vector comes to represent, in some abstract way, the “meaning” of a word
It is possible to learn word vectors that capture the relationships between words in a surprisingly expressive way
“Man is to woman as uncle is to (?)”

SLIDE 28

word2vec

Here, the word vectors have a dimension of 1000, so it is common to apply a second-step PCA or t-SNE to project them to two dimensions

SLIDE 29

word2vec

The general idea is the assumption that a word is correlated with and defined by its context

This was also the basic idea behind N-grams
Moreover, we’re going to predict a word based on its context

The methods of learning:

Continuous Bag-of-Words (CBOW)
Continuous Skip-Gram (CSG)

SLIDE 30

word2vec: CBOW

The context words form the input layer: train the word against its context

Each word is encoded in one-hot form
If the vocabulary size is V, these will be V-dimensional vectors with just one of the elements set to one and the rest all zeros
There is a single hidden layer and an output layer
The training objective is to predict the output word (the focus word) given the input context words
The activation function of the hidden layer units is a weighted average
The final layer is a softmax layer

SLIDE 31

word2vec: CBOW

Let’s take an example with just one context word

We want to get word vectors of size N = 4

Network layout:

Input: V units
W (V×N) matrix of weights: after training, we extract the vectors per word from here
Hidden layer: N units, no activation

SLIDE 32

word2vec: CBOW

Now we introduce more context words

Network layout:

Input: C×V units
W (V×N) matrix of weights: the same weight matrix is applied for every context word
Hidden layer: (C×V) × (V×N) gives a (C×N) output in the hidden layer
To collapse this to (N), we average over each column
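The hidden-layer computation can be sketched in a few lines (an illustrative sketch with made-up toy weights; a trained word2vec model would have learned W from data):

```python
# Illustrative sketch of the CBOW hidden layer: each one-hot context word
# selects one row of the weight matrix W (V x N), and the C selected rows
# are averaged into a single N-dimensional hidden vector.
V, N = 5, 4  # vocabulary size, embedding size
W = [[0.1 * (r + c) for c in range(N)] for r in range(V)]  # toy weights

def hidden_vector(context_indices):
    rows = [W[i] for i in context_indices]  # one-hot lookup = row selection
    C = len(rows)
    return [sum(row[j] for row in rows) / C for j in range(N)]  # average

# a context of C=2 words, with vocabulary indices 1 and 3
print(hidden_vector([1, 3]))
```

This also makes the later point concrete: after training, getting a word's vector is nothing more than picking its row out of W.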

SLIDE 33

word2vec: CBOW

SLIDE 34

word2vec: CBOW

So after training, this boils down to a simple one-hot lookup in a weight matrix

SLIDE 35

word2vec: CBOW

Using the weights of the hidden layer is sufficient to obtain the word vectors after training

Can apply division by context size to obtain averages, though this is not necessary
The whole context and desired focus word are presented as one training example
Note, though, that the ordering of the context does not matter: this model assumes independence between the inputs
This, together with the averaging activation function, allows for some “smoothing”: less training data is required

SLIDE 36

word2vec: CSG

The focus word forms the input layer: train the context against the word

Also encoded in one-hot form
The activation function is a simple pass-through
The training objective is to predict the context words

SLIDE 37

word2vec: CSG

The hidden layer here is a simple pass-through (no activation function)
We can hence also use the weights W after training
Easier to implement
Better quality vectors, but more training data required

Hence, in both cases, after training the network we can basically throw it away and only retain the hidden weights to get our vectors

The size of the hidden layer determines the word vector dimensionality
Approach comparable to auto-encoders
So: a kind of dimensionality reduction with context

SLIDE 38

Auto-encoders

Architecture which squeezes an input through a low-dimensionality bottleneck

The desired output can be the same as the input, or a denoised version, or context (as done here), …
Forces the network to learn a compressed representation
Use the full network after training, e.g. for image denoising (in combination with a CNN setup)
Or only use the compressed representation

SLIDE 39

word2vec

This is just the basics; good implementations utilize:

Random context word selection
Subsampling to skip retraining frequent words
Negative sampling: don’t always modify all the weights

SLIDE 40

GloVe: Global Vectors for Word Representation

Pennington et al. argue that the approach used by word2vec is suboptimal, since it doesn’t fully exploit statistical information regarding word co-occurrences

They demonstrate a Global Vectors (GloVe) model which combines the benefits of the word2vec skip-gram model on word analogy tasks with the benefits of matrix factorization methods that can exploit global statistical information

Similar output as word2vec

SLIDE 41

fastText

fastText is a Facebook library for text classification and representation. It transforms text into continuous vectors that can later be used on any language-related task

fastText uses a hash table for either word or character n-grams

One of the key features of fastText word representations is the ability to produce vectors for any word, even made-up ones: fastText word vectors are built from the vectors of the character substrings contained in them. This allows building vectors even for misspelled words or concatenations of words
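The subword idea can be sketched by showing the character n-gram decomposition step (an illustrative sketch; the boundary markers and the 3-to-4 n-gram range follow the common fastText convention, but the vector-summing step is omitted here):

```python
# Illustrative sketch of fastText-style subword units: a word is decomposed
# into character n-grams (with "<" and ">" as boundary markers), and its
# vector is then built from the vectors of those substrings.
def char_ngrams(word, n_min=3, n_max=4):
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams

print(char_ngrams("where"))
# a misspelled word still shares many n-grams with the original,
# so its vector lands nearby in the embedding space:
shared = set(char_ngrams("where")) & set(char_ngrams("wheere"))
print(shared)
```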

SLIDE 42

par2vec, doc2vec

https://medium.com/@amarbudhiraja/understanding-document-embeddings-of-doc2vec-bfe7237a26da

SLIDE 43

Using RNNs

One annoying aspect of the standard word2vec approach is the independence assumption on the inputs
Therefore, (bidirectional) RNN-based models are also often applied to extract embeddings

SLIDE 44

Categorical to vec

The concept of representational learning and embeddings has even been applied to sparse, high-cardinality categorical variables

https://arxiv.org/abs/1604.06737 https://www.r-bloggers.com/exploring-embeddings-for-categorical-variables-with-keras/

The idea is to represent a categorical variable with n continuous variables

Let’s say you want to model the effect of day of the week on an outcome. Usually you would one-hot encode the variable, which means that you create 6 variables (one for each day of the week, minus 1) and set each variable to 1 or 0 depending on the value. You end up with a 6-dimensional space to represent a weekday

So what is the advantage of mapping the variable into a continuous space? With embeddings you can reduce the dimensionality of your feature space, which should reduce overfitting in prediction problems

Use a simple neural network with only the categorical as input → embedding → outcome

Picked up by fast.ai: https://www.fast.ai/2018/04/29/categorical-embeddings/
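The "categorical input → embedding → outcome" network can be sketched without any framework (an illustrative, made-up setup: a 2-d embedding table for the 7 weekday categories and a linear output layer, updated with a single SGD step; in practice you would use an Embedding layer in e.g. Keras, as in the linked posts):

```python
# Illustrative sketch: a learnable embedding for a 7-level categorical,
# trained end-to-end with the prediction task (toy squared-error SGD).
import random

random.seed(0)
EMB_DIM = 2
# one 2-d vector per weekday, randomly initialized
embedding = [[random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)] for _ in range(7)]
w = [0.5, -0.5]  # weights of the linear output layer

def predict(day):
    e = embedding[day]
    return sum(wi * ei for wi, ei in zip(w, e))

def sgd_step(day, target, lr=0.1):
    err = predict(day) - target  # gradient of 0.5 * squared error
    for j in range(EMB_DIM):
        w[j] -= lr * err * embedding[day][j]
        embedding[day][j] -= lr * err * w[j]  # only this day's row is updated

before = predict(3)
sgd_step(day=3, target=1.0)
print(before, "->", predict(3))  # the prediction moves toward the target
```

Note the key property: only the embedding row of the observed category receives a gradient, so rows for similar categories end up close together as training progresses.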

SLIDE 45

Categorical to vec

https://www.r-bloggers.com/exploring-embeddings-for-categorical-variables-with-keras/

SLIDE 46

Categorical to vec

SLIDE 47

Generalizing embeddings

An embedding is simply a layer where a categorical input is mapped to a vector of weights

Weights are trained using e.g. standard stochastic gradient descent and used upstream in the network
Used in various contexts and with different data formats

More generally, many dimensionality reduction techniques (t-SNE, UMAP, PCA, auto encoders) could also be considered as techniques which can be used to create an embedding or representation

SLIDE 48

More Examples

SLIDE 49

Embeddings and sentiments

SLIDE 50

Embeddings and bias

https://arstechnica.com/science/2017/04/princeton-scholars-figure-out-why-your-ai-is-racist/

“Man is to Computer Programmer as Woman is to Homemaker?”

“Using the IAT as a model, Caliskan and her colleagues created the Word-Embedding Association Test (WEAT), which analyzes chunks of text to see which concepts are more closely associated than others”

SLIDE 51

Listing embeddings

https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e

SLIDE 52

Twitter

https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html

SLIDE 53

Analyzing news sources

SLIDE 54

node2vec

SLIDE 55

A Look Ahead

SLIDE 56

A lot has happened in NLP

For a “long” time, the common approach was to use word2vec or GloVe, enabling downstream tasks such as classification

Using a neural network on top of the embeddings, or a traditional approach

In the majority of cases, it’s common to use pre-trained word embeddings
Still quite common, e.g. fastText is still being used a lot in industry

SLIDE 57

A lot has happened in NLP

Since 2018 or so, newer approaches started to focus on deeper approaches to better deal with unseen words, more nuanced context, and larger data sets

Stepping away a bit from context-to-prediction towards learning directly from text itself (“language modeling”), using architectures that enable different tasks (transfer learning), e.g. question answering, translation, text generation, classification, and so on

In what follows, we give a high-level overview of the main names, key concepts and resources

SLIDE 58

Recap

Let’s recap the “history” so far:

Basics: bag of words, n-grams and TF-IDF
Embeddings: word2vec and friends, GloVe, fastText
Neural approaches: RNN-based, or even (1d) CNN-based

Combinations of both also allow extracting some notion of an embedding

This led to a couple of variants

Bidirectional RNNs: process the input text sequence both from left to right and from right to left
Character-level RNNs: predict at the level of characters to better deal with underrepresented or out-of-vocabulary words
Encoder-decoder models: two stacks of RNNs: the first encodes the input sequence to an intermediate state, whereas the second decodes this to an output sequence (e.g. for translation)
Attention mechanisms: a means to weight the contextual impact of each input on each output prediction
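The attention idea in that list can be sketched in its simplest (dot-product) form (an illustrative sketch with tiny made-up 2-d vectors; real models learn the queries, keys and values and add scaling):

```python
# Illustrative sketch of dot-product attention: each output position scores
# every input position against a query, normalizes the scores with a softmax,
# and takes the correspondingly weighted average of the input vectors.
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, inputs):
    # score each input vector against the query with a dot product
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in inputs]
    weights = softmax(scores)
    # weighted average of the input vectors
    dim = len(inputs[0])
    context = [sum(w * vec[j] for w, vec in zip(weights, inputs))
               for j in range(dim)]
    return weights, context

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(query=[1.0, 0.0], inputs=inputs)
print(weights)  # highest weights on inputs most aligned with the query
print(context)
```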

SLIDE 59

ELMo

SLIDE 60

ELMo

ELMo (Embeddings from Language Models) generates embeddings for a word based on the context it appears in, thus generating slightly different embeddings for each of its occurrences

Learning different representations depending on the context, e.g. “stick” has multiple meanings depending on where it’s used

ELMo gained its language understanding from being trained to predict the next word in a sequence of words (language modeling)

Uses bi-directional RNNs to encode context

SLIDE 61

ULMFiT

ULMFiT (Universal Language Model Fine-Tuning) was proposed and designed by fast.ai’s Jeremy Howard and DeepMind’s Sebastian Ruder

Introduced the idea of transfer learning in text
The method involves fine-tuning a pretrained language model, trained on the Wikitext-103 dataset, to a new dataset in such a manner that it does not forget what it previously learned

SLIDE 62

Transformer

The Transformer architecture is at the core of almost all recent major developments in NLP. It was introduced in 2017 by Google

The Transformer applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position
Uses an encoder-decoder structure

SLIDE 63

Transformer

Google released an improved version of the Transformer in 2018, called the Universal Transformer
And later: Transformer-XL, a novel NLP architecture that helps machines understand context beyond a fixed-length limitation

SLIDE 64

Transformer

The release of the Transformer paper and code, and the results it achieved on tasks such as machine translation, was impressive

The OpenAI Transformer

A fine-tunable pre-trained model based on the Transformer
But: ELMo’s language model was bi-directional, whereas the OpenAI Transformer only trains a forward language model
Could we build a transformer-based model whose language model looks both forward and backward?

OpenAI: “It turns out we don’t need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can do with just the decoder of the transformer.”


SLIDE 65

BERT

Bidirectional Encoder Representations from Transformers: considers context from both sides

At the time of its release, BERT was producing state-of-the-art results on 11 Natural Language Processing (NLP) tasks

SLIDE 66

GPT-2

Generative Pre-Training 2

From OpenAI

SLIDE 67

GPT-2

“These findings, combined with earlier results on synthetic imagery, audio, and video, imply that technologies are reducing the cost of generating fake content and waging disinformation campaigns. The public at large will need to become more skeptical of text they find online, just as the “deep fakes” phenomenon calls for more skepticism about images. Today, malicious actors—some of which are political in nature—have already begun to target the shared online commons, using things like “robotic tools, fake accounts and dedicated teams to troll individuals with hateful commentary or smears that make them afraid to speak, or difficult to be heard or believed”.

We should consider how research into the generation of synthetic images, videos, audio, and text may further combine to unlock new as-yet-unanticipated capabilities for these actors, and should seek to create better technical and non-technical countermeasures. Furthermore, the underlying technical innovations inherent to these systems are core to fundamental artificial intelligence research, so it is not possible to control research in these domains without slowing down the progress of AI as a whole. Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code.


SLIDE 68

A lot has happened in NLP

http://jalammar.github.io/illustrated-bert/
http://ruder.io/10-exciting-ideas-of-2018-in-nlp/
https://github.com/mratsim/Arraymancer/issues/268
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
http://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/
https://towardsdatascience.com/beyond-word-embeddings-part-2-word-vectors-nlp-modeling-from-bow-to-bert-4ebd4711d0ec

Stanford CS224N: NLP with Deep Learning | Winter 2019
Write with Transformer

SLIDE 69

A lot has happened in NLP

SLIDE 70

Tooling

SLIDE 71

Tooling

Natural Language Toolkit (NLTK – http://www.nltk.org)
MITIE: library and tools for information extraction (https://github.com/mit-nlp/MITIE)
R: tm, topicmodels and nlp packages, http://tidytextmining.com/
Gensim: https://radimrehurek.com/gensim/

Good implementations of word2vec, GloVe, fastText
Pretrained word vectors
Topic modeling with LDA (https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)

Some text mining tools in Spark and scikit-learn
https://spacy.io/
https://nlu.rasa.com/
https://allennlp.org/: newer, based on PyTorch

vaderSentiment: Valence Aware Dictionary and sEntiment Reasoner (https://github.com/cjhutto/vaderSentiment): simple sentiment analysis tool that works well on social media

Hugging Face Transformers

SLIDE 72

Tooling

https://spacy.io/universe/project/scattertext
