Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Text Mining
Overview
Featurization
Traditional approaches
Word embeddings and representational learning
A look ahead
Tooling
2
Text is everywhere
Medical records
Product reviews
Repair notes
Facebook posts
Book recommendations
Tweets
Declarations
Legislation, court decisions
Emails
Websites
…
3
But it is unstructured
Or "semi-structured", rather, though still:
No direct "feature vector" representation
Linguistic structure:
Language
Relationship between words
Importance of words
Negations, etc.
Text is dirty
Grammatical and spelling errors, abbreviations, homographs
Text is intended for communication between people
4
The trick for unstructured data
It all boils down to featurization
Approaches to convert the unstructured data to a structured feature vector: "feature engineering"
Just as for computer vision, we can split approaches into a "before and after deep learning" era
5
The basics
6
So, how to featurize?
Let's start simple: the goal is to take a collection of documents – each of which is a relatively free-form sequence of words – and turn it into the familiar feature-vector representation
A collection of documents is called a corpus (plural: corpora)
A document is composed of individual tokens or terms (words, but these can be broader as well, e.g. punctuation tokens, abbreviations, smileys, …)
Mostly: each document is one instance, but sentences can also form instances
Which features will be used is still to be determined
7
Bag of words
The "keep it simple approach"
Simply treat every document (instance) as a collection of individual words
Ignore grammar, word order, sentence structure, and (usually) punctuation Treat every word in a document as a potentially important keyword of the document
Each document is represented by a sequence of ones (the token is present in the document) and zeros (the token is not present in the document)
I.e. one feature per word
Inexpensive to generate, though it leads to an explosion of features
Can work in some settings
Alternatively: frequencies instead of binary features can be used
Vocabulary: this, is, an, a, example, second

“This is an example” => 1 1 1 0 1 0
“This is a second example” => 1 1 0 1 1 1
8
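As an illustrative sketch (my addition, assuming scikit-learn is available), this is how such a binary bag-of-words representation can be built; the custom token_pattern keeps one-letter words like "a" so the output matches the example above, with features in alphabetical order:

# Bag of words with scikit-learn; binary=True gives 1/0 presence features,
# omit it to get raw frequencies instead
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is an example", "This is a second example"]
vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['a' 'an' 'example' 'is' 'second' 'this']
print(X.toarray())                         # one 0/1 row per document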
Normalization, stop-word removal, stemming
The case should be normalized
E.g. every term is in lowercase
Words should be stemmed (stemming)
Suffixes are removed (only the root is kept)
E.g. noun plurals are transformed to singular forms
Porter's stemming algorithm: basically suffix-stripping
More complex: lemmatization (e.g. better → good, flies/flight → fly)
Context is required: part of speech (PoS) tagging (see later)
Stop-words should be removed (stop word removal)
A stop-word is a very common word in English (or whatever language is being parsed)
Typical words such as the, and, of, and on are removed
9
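A minimal sketch (my addition) of these pre-processing steps using NLTK, assuming the punkt and stopwords resources have been downloaded:

# Lowercasing, stop-word removal and Porter stemming with NLTK
# One-time setup: nltk.download("punkt"); nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

text = "The flies were flying around the trees"
tokens = [t.lower() for t in word_tokenize(text)]               # case normalization
tokens = [t for t in tokens if t.isalpha() and t not in stops]  # stop-word removal
print([stemmer.stem(t) for t in tokens])                        # suffix stripping, e.g. 'tree' for 'trees'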
(Normalized) term frequency
Recall that we can use bag of words with counts (word frequencies)
Nice, as this differentiates based on how many times a word is used
“Document-term matrix”
However:

Documents have various lengths
Words have different frequencies
Words should not be too common or too rare: both an upper and a lower limit on the number (or fraction) of documents in which a word may occur

So the raw term frequencies are best normalized in some way

Such as by dividing each by the total number of words in the document
Or by the frequency of the specific term in the corpus
Also think about the production context: what do we do when we encounter a previously unseen term?
10
TF-IDF
Term frequency (TF) is calculated per term t per document d:

TF(t, d) = |{w ∈ d : w = t}| / |d|

(the number of times the term appears in the document, divided by the document length)

However, frequency in the corpus also plays a role: terms should not be too rare, but also not too common, so we need a measure of sparseness. Inverse document frequency (IDF) is calculated per term t over the corpus c:

IDF(t) = 1 + log( |c| / |{d ∈ c : t ∈ d}| )

(one plus the logarithm of the total number of documents divided by the number of documents containing the term)
11
TF-IDF
12
TF-IDF
TFIDF(t, d) = TF(t, d) × IDF(t)
Gives you a weight for each term in each document
Perfect for our feature matrix
Rewards terms that occur frequently in the document
But penalizes them if they occur frequently in the whole collection
A vector of weights per document is obtained
13
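As a sketch (my own, directly following the formulas above; note it uses the natural logarithm), TF-IDF can be computed by hand:

# TF-IDF computed by hand, following the definitions on the previous slides
import math

corpus = [
    ["famous", "jazz", "saxophonist"],
    ["jazz", "musician", "born", "in", "kansas"],
    ["famous", "classical", "pianist"],
]

def tf(term, doc):
    # times the term appears in the document, divided by document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # one plus the (natural) log of total documents over documents containing the term
    containing = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

doc = corpus[0]
for term in doc:
    print(term, round(tfidf(term, doc, corpus), 3))  # a weight per term per document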
TF-IDF: example
15 prominent jazz musicians and excerpts of their biographies from Wikipedia
Nearly 2,000 features after stemming and stop-word removal!
Consider the sample phrase “Famous jazz saxophonist born in Kansas who played bebop and latin”
14
Dealing with high dimensionality
Feature selection will often need to be applied
Fast and scalable classification or clustering techniques will need to be used
E.g. Naive Bayes and linear Support Vector Machines have proven to work well in this setting
Clustering techniques based on non-negative matrix factorization can be used
Also recall from pre-processing the “hashing trick”: collapse the high number of features to n hashed features (see the sketch below)
Use dimensionality reduction techniques like t-SNE or UMAP (Uniform Manifold Approximation and Projection, McInnes, 2018, https://github.com/lmcinnes/umap)
15
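For instance (an illustration of the hashing trick mentioned above, assuming scikit-learn), HashingVectorizer maps any vocabulary onto a fixed number of features:

# The hashing trick: collapse an unbounded vocabulary to n hashed features
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=1024)  # fixed dimensionality, no stored vocabulary
X = vectorizer.transform(["Famous jazz saxophonist born in Kansas"])
print(X.shape)  # (1, 1024), no matter how many distinct words the corpus contains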
N-gram sequences
What if word order is important?
A next step from the previous techniques is to use sequences of adjacent words as terms
Adjacent pairs are commonly called bi-grams
Example: “The quick brown fox jumps”
Would be transformed into {quick_brown, brown_fox, fox_jumps}
Can be combined together with 1-grams: {quick, brown, fox, jumps}
But: N-grams greatly increase the size of the feature set
16
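A short sketch (my addition, again with scikit-learn) of combining 1-grams and bi-grams:

# Extracting 1-grams and bi-grams together
from sklearn.feature_extraction.text import CountVectorizer

# add stop_words="english" to drop "the", as in the example above
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["The quick brown fox jumps"])
print(vectorizer.get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'fox jumps' 'jumps' 'quick' 'quick brown' 'the' 'the quick']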
Natural language processing (NLP)
A complete field of research
Key idea: use machine learning and statistical approaches to learn a language from data
Better suited to deal with:
Contextual information (“This camera sucks” vs. “This vacuum cleaner really sucks”)
Negations
Sarcasm
Best known tasks:
PoS (Part of Speech) tagging: noun, verb, subject, …
Named entity recognition
17
Part of speech
18
Named entity recognition
19
Named entity recognition
20
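A minimal sketch (my addition) of both tasks using spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the example sentence is made up:

# PoS tagging and named entity recognition with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Charlie Parker was a famous saxophonist born in Kansas City.")

print([(token.text, token.pos_) for token in doc])   # part-of-speech tag per token
print([(ent.text, ent.label_) for ent in doc.ents])  # entities, e.g. PERSON, GPE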
Word embeddings and representational learning
21
Vector space models

Represent an item (a document, a word) as a vector of numbers. Such a vector could for instance correspond to the documents in which the word occurs:

banana => 0 1 0 1 0 0 2 0 1 0 1 0
(the non-zero entries mark Doc2, Doc4, Doc7, Doc9 and Doc11)

22
Vector space models

The vector can also correspond to the neighboring word context:

“yellow banana grows on trees in africa”
banana => 1 1 2 1 1
(context features such as (yellow, -1), (on, +2), (grows, +1), (tree, +3), (africa, +5))

23
Word embeddings
A dense vector of real values. The vector dimension is typically much smaller than the number of items or the vocabulary size
E.g. a typical dimension for some learning tasks: 128
You can imagine the vectors as coordinates for items in the embedding space. Some distance metric defines a notion of relatedness between items in this space
24
Word embeddings
Man is to woman as king is to ____ ?
Good is to best as smart is to ____ ?
China is to Beijing as Russia is to ____ ?

It turns out that the word-context based vector model we just looked at is good for such analogy tasks
[king] – [man] + [woman] ≈ [queen]
25
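A sketch (my addition) of such analogy queries with gensim's downloader API; the pretrained model name is an assumption, any pretrained vectors work:

# Analogy queries on pretrained word vectors via gensim
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use
# [king] - [man] + [woman] ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically the top result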
How to construct word embeddings?
Matrix factorization based
Non-negative matrix factorization
GloVe (factorizes a word/neighboring-word matrix)
See https://nlp.stanford.edu/projects/glove/
Neural network based: word2vec
word2vec was released by Google in 2013
See https://code.google.com/archive/p/word2vec/
A neural network-based implementation that learns a vector representation per word
Background information
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
https://iksinc.wordpress.com/tag/continuous-bag-of-words-cbow/
https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/
https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e
26
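A minimal training sketch (my addition, assuming gensim >= 4.0 and a toy corpus):

# Training word2vec with gensim
from gensim.models import Word2Vec

sentences = [
    ["yellow", "banana", "grows", "on", "trees", "in", "africa"],
    ["jazz", "musician", "plays", "the", "saxophone"],
]

model = Word2Vec(
    sentences,
    vector_size=128,  # embedding dimensionality (named `size` before gensim 4.0)
    window=2,         # context window around the focus word
    sg=0,             # 0 = CBOW, 1 = skip-gram
    min_count=1,
)
print(model.wv["banana"].shape)  # (128,)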
word2vec
Word2vec: convert each term to a vector representation
Works at term level, not at document level (by default)
Such a vector comes to represent in some abstract way the “meaning” of a word
It is possible to learn word vectors that capture the relationships between words in a surprisingly expressive way
“Man is to woman as uncle is to (?)”
27
word2vec
Here, the word vectors have a dimension of 1000
A second-step PCA or t-SNE is commonly used to project to two dimensions
28
word2vec
The general idea is the assumption that a word is correlated with and defined by its context
This was also the basic idea behind N-grams
Moreover, we're going to predict a word based on its context
The methods of learning:
Continuous Bag-of-Words (CBOW)
Continuous Skip-Gram (CSG)
29
word2vec: CBOW
The context words form the input layer: train word against context
Each word is encoded in one-hot form
If the vocabulary size is V these will be V-dimensional vectors with just one of the elements set to one, and the rest all zeros
There is a single hidden layer and an output layer
The training objective is to predict the output word (the focus word) given the input context words
The activation function of the hidden layer units is a weighted average
The final layer is a softmax layer
30
word2vec: CBOW
Let's take an example with just one context word
We want to get word vectors of size N=4
Network layout:
Input: V units
W, a (V×N) matrix of weights: after training, we extract the vectors per word in V here
Hidden layer: N units, no activation
31
word2vec: CBOW
Now we introduce more context words
Network layout:
Input: C×V units
W, a (V×N) matrix of weights: the same weight matrix is applied for every context word
Hidden layer: (C×V) × (V×N) gives a (C×N) output in the hidden layer
To collapse this to (N), we average over each column
32
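To make the mechanics concrete, a numpy sketch (my addition, toy sizes and random weights) of one CBOW forward pass as just described:

# One CBOW forward pass: one-hot lookup, averaging, softmax
import numpy as np

V, N, C = 6, 4, 2                # vocabulary size, embedding size, context size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))      # input->hidden weights: the word vectors we are after
W_out = rng.normal(size=(N, V))  # hidden->output weights

context_ids = [1, 3]             # one-hot input times W is just a row lookup
h = W[context_ids].mean(axis=0)  # hidden layer: average over the context vectors
scores = h @ W_out               # output layer
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
print(probs)                     # predicted distribution for the focus word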
word2vec: CBOW
33
word2vec: CBOW
So after training, this boils down to a simple one-hot lookup in a weight matrix
34
word2vec: CBOW
Using the weights of the hidden layer is sufficient to obtain the word vectors after training
Can apply division by the context size to obtain averages, though this is not necessary
The whole context and the desired focus word are presented as one training example
Though note that the ordering of the context does not matter! This model assumes independence in inputs
This, together with the averaging activation function, allows for some “smoothing”: less training data is required
35
word2vec: CSG
The focus word forms the input layer: train context against word
Also encoded in one-hot form
The activation function is a simple pass-through
The training objective is to predict the context
36
word2vec: CSG
The hidden layer here is a simple pass-through (no activation function)
We can hence also use the weights W after training
Easier to implement
Better quality vectors, but more training data required
Hence, in both cases, after training the network, we can basically throw it away and only retain the hidden weights to get our vectors
The size of the hidden layer determines the word vector dimensionality
Approach comparable to auto-encoders
So: a kind of dimensionality reduction with context
37
Auto-encoders
Architecture which squeezes an input through a low-dimensionality bottleneck
The desired output can be the same as the input, or a denoised version, or context (as done here), …
Forces the network to learn a compressed representation
Use the full network after training, e.g. for image denoising (in combination with a CNN setup)
Or only use the compressed representation, throwing away the decoding part (as done here)
38
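A minimal sketch (my addition, Keras assumed; layer sizes are illustrative) of such a bottleneck architecture:

# A dense autoencoder: input squeezed through a low-dimensional bottleneck
from tensorflow import keras

inputs = keras.Input(shape=(784,))
encoded = keras.layers.Dense(32, activation="relu")(inputs)      # the bottleneck
decoded = keras.layers.Dense(784, activation="sigmoid")(encoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)  # keeps only the compressed representation
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x, x, ...)  # desired output equals the input (or a denoised version)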
word2vec
This is just the basics, good implementations utilize
Random context word selection
Subsampling to skip retraining frequent words
Negative sampling: don't always modify all the weights
39
GloVe: Global Vectors for Word Representation
Pennington et al. argue that the approach used by word2vec is suboptimal since it doesn't fully exploit statistical information regarding word co-occurrences
They demonstrate a Global Vectors (GloVe) model which combines the benefits of the word2vec skip-gram model when it comes to word analogy tasks, with the benefits of matrix factorization methods that can exploit global statistical information
Similar output as word2vec
40
fastText
FastText is a library for text classification and representation. It transforms text into continuous vectors that can later be used on any language-related task (https://github.com/facebookresearch/fastText)
fastText uses a hashtable for either word or character n-grams
One of the key features of fastText word representations is the ability to produce vectors for any word, even made-up ones. Indeed, fastText word vectors are built from vectors of the character substrings contained in them. This allows building vectors even for misspelled words or concatenations of words
41
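A usage sketch (my addition), assuming the fasttext Python bindings and a plain-text corpus file data.txt:

# Training fastText vectors and querying a misspelled word
import fasttext

model = fasttext.train_unsupervised("data.txt", model="skipgram")
# vectors are built from character n-gram vectors, so even
# out-of-vocabulary or misspelled words get a vector
print(model.get_word_vector("saxophonst"))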
Using RNNs
One annoying aspect of the standard word2vec approach is the independence in inputs
Therefore, (bidirectional) RNN based models are also often applied to extract embeddings
Memory over the sequence dimension
… and many others (hierarchical approaches, and so on)
42
Shared representations
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
43
par2vec, doc2vec
https://medium.com/@amarbudhiraja/understanding-document-embeddings-of-doc2vec-bfe7237a26da
44
Categorical to vec
The concept of representational learning and embeddings has even been applied to sparse, high-cardinality categorical variables
https://arxiv.org/abs/1604.06737 https://www.r-bloggers.com/exploring-embeddings-for-categorical-variables-with-keras/
The idea is to represent a categorical variable with n continuous variables
Let's say you want to model the effect of day of the week on an outcome. Usually you would one-hot encode the variable, which means that you create 6 variables (one for each day of the week, minus one) and set each variable to 1 or 0 depending on the value. You end up with a 6-dimensional space to represent a weekday
So, what is the advantage of mapping the variables into a continuous space? With embeddings you can reduce the dimensionality of your feature space, which should reduce overfitting in prediction problems
Use a simple neural network with only the categorical as input -> embedding -> outcome (see the sketch below)
Picked up by fast.ai: https://www.fast.ai/2018/04/29/categorical-embeddings/
45
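A sketch (my addition, Keras assumed; the embedding size of 3 is an arbitrary choice) of such a network:

# A day-of-week embedding: categorical input -> embedding -> outcome
from tensorflow import keras

day_in = keras.Input(shape=(1,))                                 # weekday index 0..6
emb = keras.layers.Embedding(input_dim=7, output_dim=3)(day_in)  # 7 categories -> 3 dims
flat = keras.layers.Flatten()(emb)
out = keras.layers.Dense(1)(flat)                                # predict the outcome

model = keras.Model(day_in, out)
model.compile(optimizer="adam", loss="mse")
# after training, model.layers[1].get_weights()[0] holds one 3-d vector per weekday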
Categorical to vec
https://www.r-bloggers.com/exploring-embeddings-for-categorical-variables-with-keras/
46
Some examples
47
Embeddings and sentiments
48
Embeddings and bias
Princeton researchers discover why AI becomes racist and sexist (https://arstechnica.com/science/2017/04/princeton-scholars-figure-out-why-your-ai-is-racist/)
“Man is to Computer Programmer as Woman is to Homemaker?”
“Using the IAT as a model, Caliskan and her colleagues created the Word-Embedding Association Test (WEAT), which analyzes chunks of text to see which concepts are more closely associated than others.”
49
Listing embeddings
https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e
50
Listing embeddings
https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e
51
Analyzing news sources
52
A look ahead
53
A lot has happened in NLP
Focus on better embedding models (e.g. working well with unseen words, new words, misspelled words) and transfer learning in the textual domain
ELMo, ULMFiT, Transformer, BERT
http://jalammar.github.io/illustrated-bert/
http://ruder.io/10-exciting-ideas-of-2018-in-nlp/
https://github.com/mratsim/Arraymancer/issues/268
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
http://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/
https://towardsdatascience.com/beyond-word-embeddings-part-2-word-vectors-nlp-modeling-from-bow-to-bert-4ebd4711d0ec

“BERT builds on top of a number of clever ideas that have been bubbling up in the NLP community recently – including but not limited to Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al).”
54
A lot has happened in NLP
https://towardsdatascience.com/beyond-word-embeddings-part-2-word-vectors-nlp-modeling-from-bow-to-bert-4ebd4711d0ec
Bag of words, N-grams, TF-IDF, distributional embeddings (based on mutual information), word2vec, GloVe, fastText
Newer techniques incorporate RNNs, attention mechanisms, more context, transfer learning, one-dimensional CNNs
55
A lot has happened in NLP
Learning different representations depending on the context, e.g. “stick” has multiple meanings depending on where it's used
E.g. approaches include learning on the whole of Wikipedia and then fine-tuning on your domain
“ELMo is a model that generates embeddings for a word based on the context it appears in, thus generating slightly different embeddings for each of its occurrences.”
“ELMo gained its language understanding from being trained to predict the next word in a sequence of words – a task called Language Modeling.”
“ULM-FiT introduced methods to effectively utilize a lot of what the model learns during pre-training – more than just embeddings, and more than contextualized embeddings. ULM-FiT introduced a language model and a process to effectively fine-tune that language model for various tasks.”
56
A lot has happened in NLP
The Transformer uses an encoder-decoder structure
The OpenAI Transformer:
“The release of the Transformer paper and code, and the results it achieved on tasks such as machine translation started to make some in the field think of them as a replacement to LSTMs. This was compounded by the fact that Transformers deal with long-term dependencies better than LSTMs.”
“It turns out we don't need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can do with just the decoder of the transformer.”
57
A lot has happened in NLP
“The OpenAI transformer gave us a fine-tunable pre-trained model based on the Transformer. But something went missing in this transition from LSTMs to Transformers. ELMo's language model was bi-directional, but the OpenAI transformer only trains a forward language model. Could we build a transformer-based model whose language model looks both forward and backwards (in the technical jargon – “is conditioned on both left and right context”)?”
BERT applies a similar approach, but uses attention transformers instead of bi-directional RNNs (LSTMs) to encode context, together with a masking system
58
A lot has happened in NLP
Stanford CS224N: NLP with Deep Learning | Winter 2019
https://www.youtube.com/watch?v=kEMJRjEdNzM&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z
59
Tooling
60
Tooling
Natural Language Toolkit (NLTK – http://www.nltk.org)
MITIE: library and tools for information extraction (https://github.com/mit-nlp/MITIE)
R: tm, topicmodels and nlp packages, http://tidytextmining.com/
Gensim: https://radimrehurek.com/gensim/
Good implementations of word2vec, GloVe Pretrained word vectors Topic modeling with LDA (https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)
Some text mining tools in Spark and scikit-learn
https://spacy.io/
https://nlu.rasa.com/
https://allennlp.org/: newer, based on PyTorch
vaderSentiment: Valence Aware Dictionary and sEntiment Reasoner (https://github.com/cjhutto/vaderSentiment): a simple sentiment analysis tool that works well on social media
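A usage sketch (my addition, assuming pip install vaderSentiment):

# Simple sentiment scoring with vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("This camera sucks"))
# {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}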