Something Old, Something New: A Talk about NLP for the Curious

SLIDE 1

@EVANAHARI, YOW! AUSTRALIA 2016

Something Old, Something New

A Talk about NLP for the Curious

SLIDE 2

Jabberwocky

SLIDE 3

– Lewis Carroll, from Through the Looking-Glass, and What Alice Found There, 1871

“’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.”

SLIDE 4
SLIDE 5
SLIDE 6

Why are these monkeys following me? Arrfff! LOL

SLIDE 7

Challenges

  • Mistakes
  • Slang & sparse words
  • Ambiguity types:
    • Lexical
    • Syntax level
    • Referential
SLIDE 8

Human Language

  • The cortical speech center is unique to humans
  • Evolved over hundreds of thousands of years:
    • Vocabulary
    • Grammar
    • Speed
  • An advanced processing unit:
    • Sounds
    • Meaning of words
    • Grammar constructs
    • Matching against a knowledge base
    • Understanding context and humor!
SLIDE 9

Human Language Processing

  • Phonology − organization of sounds
  • Morphology − construction of words
  • Syntax − creation of valid sentences/phrases and identifying the structural roles of words in them
  • Semantics − finding the meaning of words/phrases/sentences
  • Pragmatics − situational meaning of sentences
  • Discourse − how the order of sentences affects interpretation
  • World knowledge − mapping to general world knowledge
  • Context awareness − the hardest part…?

SLIDE 10

Natural Language Processing

  • Computers generating language
  • Computers understanding human language 


  • Lexical analysis
  • Syntactic analysis
  • Semantic analysis
  • Discourse integration
  • Pragmatic analysis
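A minimal sketch of the first two of these stages using NLTK (one of the toolkits on the cheat sheet later); the sentence is only an example, and the commented download() calls fetch the tokenizer and tagger models on first use:

    import nltk

    # one-time model downloads (tokenizer and POS tagger)
    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    sentence = "Hello everyone who is eager to learn NLP!"
    tokens = nltk.word_tokenize(sentence)   # lexical analysis: split into word tokens
    print(nltk.pos_tag(tokens))             # a first syntactic step: part-of-speech tags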

SLIDE 11

– J. R. Firth, 1957

“You shall know a word by the company it keeps.”

SLIDE 12

Language Models

  • Represent language in a mathematical way

A language model is a function that captures the statistical characteristics of the word-sequence distribution in a language

  • Dimensionality challenge

A 10-word sequence from a 100,000-word vocabulary —> 100,000^10 = 10^50 possible sequences

  • Large sample set vs processing time & cost vs accuracy
SLIDE 13

Bag-of-words

  • Not suited for huge vocabularies
  • Semantics are not considered
  • Order of words is lost

Vocabulary: Happy, birthday, to, you, dear, “name”
One-hot encoding: Happy = [100000], birthday = [010000], to = [001000],
you = [000100], dear = [000010], “name” = [000001]

Sample text:
  Happy birthday to you        = [111100]
  Happy birthday to you        = [111100]
  Happy birthday dear “name”   = [110011]
  Happy birthday to you        = [111100]

Term frequency over the whole text = [443311]
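A minimal sketch of the term-frequency counting above in plain Python (vocabulary and lyrics taken from the slide's example):

    from collections import Counter

    vocabulary = ["happy", "birthday", "to", "you", "dear", "name"]
    text = ("happy birthday to you "
            "happy birthday to you "
            "happy birthday dear name "
            "happy birthday to you")

    counts = Counter(text.split())
    term_frequency = [counts[word] for word in vocabulary]
    print(term_frequency)  # [4, 4, 3, 3, 1, 1]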

SLIDE 14

n-grams

“Hello everyone who is eager to learn NLP!”

  • “gram”: a unit, e.g. letter, phoneme, word, …
  • uni-gram: Hello, everyone, who, is, …
  • bi-gram: Hello-everyone, everyone-who, who-is, …
  • n-gram: n-length sequences of units
  • k-skip-gram: skip k units
  • bi-skip-tri-gram: Hello-is-learn, everyone-eager-NLP
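A small sketch of extracting n-grams and skip-grams from the example sentence, in plain Python; the skip-gram helper follows the slide's reading of "skip k units" (a fixed gap of k tokens between consecutive units):

    def ngrams(tokens, n):
        # contiguous windows of n units
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def skipgrams(tokens, n, k):
        # n units taken k tokens apart ("k-skip-n-gram" as used on the slide)
        step = k + 1
        span = (n - 1) * step + 1
        return [tuple(tokens[i:i + span:step]) for i in range(len(tokens) - span + 1)]

    tokens = "Hello everyone who is eager to learn NLP".split()
    print(ngrams(tokens, 2))        # [('Hello', 'everyone'), ('everyone', 'who'), ...]
    print(skipgrams(tokens, 3, 2))  # [('Hello', 'is', 'learn'), ('everyone', 'eager', 'NLP')]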
SLIDE 15

n-gram Probabilistic Model

  • Given a sequence of words, what is the likelihood of the next one?
  • Using counts of n-grams extracted from a training data set, we can predict the next word x_i from relative frequencies
  • Simple; only the preceding n-1 words determine the probability
  • Difficult to handle infrequent words and expressions
  • Smoothing (e.g. Good-Turing, Katz back-off, etc.)
  • Use additional sampling (bi-grams, tri-grams, skip-grams)

P(x_i | x_{i-(n-1)}, …, x_{i-1}) = count(x_{i-(n-1)}, …, x_{i-1}, x_i) / count(x_{i-(n-1)}, …, x_{i-1})
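A minimal bigram (n = 2) version of this estimate in plain Python; the tiny corpus is made up for illustration:

    from collections import Counter

    corpus = ("happy birthday to you happy birthday to you "
              "happy birthday dear name").split()

    bigram_counts = Counter(zip(corpus, corpus[1:]))
    prefix_counts = Counter(corpus[:-1])  # counts of the conditioning word

    def p_next(prev, word):
        # P(word | prev) = count(prev, word) / count(prev)
        return bigram_counts[(prev, word)] / prefix_counts[prev]

    print(p_next("birthday", "to"))    # 2/3
    print(p_next("birthday", "dear"))  # 1/3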

SLIDE 16

Example use:
 Named Entity Recognition (NER)

Examples:

  • Grammar based: “…live in <city>”
  • Co-occurrence based: “new+york”, “san+francisco”, …

Common pattern: inference by applying various models
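A toy sketch of the grammar-based pattern above using Python's re module; the pattern and sentence are illustrative only, not a robust NER approach:

    import re

    # naive grammar-based rule: capitalized word(s) following "live in"
    pattern = re.compile(r"live in ((?:[A-Z][a-z]+ ?)+)")

    text = "Many developers live in San Francisco and work remotely."
    match = pattern.search(text)
    if match:
        print(match.group(1).strip())  # San Francisco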

SLIDE 17

Naive Bayes Probabilistic Model

[Diagram: the features Round, Red, and HasLeaf each add (+) independent evidence for the class “Apple”]

SLIDE 18

Example Use: Text Classification

Sample data (color mentioned in a fruit text, and whether the text was about an apple):

  Color    Apple?
  Red      No
  Green    Yes
  Yellow   Yes
  Red      Yes
  Red      Yes
  Green    Yes
  Yellow   No
  Yellow   No
  Red      Yes
  Yellow   Yes
  Red      No
  Green    Yes
  Green    Yes
  Yellow   No

Frequency table:

  Feature      No     Yes    Total
  Green        0      4      4/14 = 0.29
  Yellow       3      2      5/14 = 0.36
  Red          2      3      5/14 = 0.36
  Grand Total  5      9      14
               5/14   9/14
               0.36   0.64

Incoming fruit text says “red” - is it about an apple?

P(Yes | Red) = P(Red | Yes) * P(Yes) / P(Red)
P(Red | Yes) = 3/9 = 0.33
P(Yes) = 9/14 = 0.64
P(Red) = 5/14 = 0.36
P(Yes | Red) = 0.33 * 0.64 / 0.36 = 0.60

60% chance it’s about an apple!
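The same calculation as a minimal plain-Python sketch, using the counts from the tables above:

    # (color, is-it-about-an-apple) pairs from the sample data
    data = [("Red", "No"), ("Green", "Yes"), ("Yellow", "Yes"), ("Red", "Yes"),
            ("Red", "Yes"), ("Green", "Yes"), ("Yellow", "No"), ("Yellow", "No"),
            ("Red", "Yes"), ("Yellow", "Yes"), ("Red", "No"), ("Green", "Yes"),
            ("Green", "Yes"), ("Yellow", "No")]

    n = len(data)
    yes_colors = [color for color, label in data if label == "Yes"]

    p_yes = len(yes_colors) / n                                   # P(Yes) = 9/14
    p_red = sum(1 for color, _ in data if color == "Red") / n     # P(Red) = 5/14
    p_red_given_yes = yes_colors.count("Red") / len(yes_colors)   # P(Red | Yes) = 3/9

    # Bayes' theorem: P(Yes | Red) = P(Red | Yes) * P(Yes) / P(Red)
    print(p_red_given_yes * p_yes / p_red)  # ≈ 0.60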

SLIDE 19

Naive Bayes

Things to Consider:

  • Easy and fast, good for multi-class problems, better than most
  • Does not handle unknown categories well; needs smoothing
  • Needs less training data, but it must be representative
  • Assumes attributes are truly independent
SLIDE 20

Combining Models

Things to Consider:

  • How many models can you afford?
  • How good are your models (i.e. training data)?
  • Latency vs accuracy?
SLIDE 21

Bag of Words

[Diagram: one-hot vectors, e.g. [0 0 0 1] and [0 1 0 0]; one dimension per vocabulary word]

SLIDE 22

Continuous Bag of Words (Embeddings)

[Diagram: dense, continuous-valued vectors, e.g. [2 3 8 1] and [7 5 6 2], in place of the one-hot vectors]

SLIDE 23
SLIDE 24

Distributed Representation

  • A word is a point in a multi-dimensional vector space, where each dimension is a feature of the word
  • Who decides the features?
    • HUMAN: hand-picked features such as gender, plurality, semantic characteristics
    • COMPUTER: learned features with continuous values
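A toy illustration of the human-chosen variant: each dimension is a hand-picked feature, and similar words end up close together. The features and numbers are invented for illustration:

    import numpy as np

    # hand-picked dimensions: [is_female, is_plural, is_royal]
    words = {
        "king":   np.array([0.0, 0.0, 1.0]),
        "queen":  np.array([1.0, 0.0, 1.0]),
        "queens": np.array([1.0, 1.0, 1.0]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(words["queen"], words["queens"]))  # high: differ only in plurality
    print(cosine(words["king"], words["queens"]))   # lower: differ in two features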


SLIDE 25

Neural Net Language Model

  • A model based on the capabilities of a neural network (NN) is an NNLM
  • Rely on the NN to discover the features of a distributed representation
  • Extrapolation makes it possible to keep a dense model, even for very large data sets

SLIDE 26

Mikolov et al.’s CBOW vs Continuous Skip-gram

  • CBOW: predict a term based on its context (near-terms)
    • w-2, w-1, w+1, w+2 —> w
    • fast to train
    • higher accuracy for frequent words
    • conditioning on context needs larger data sets
  • Continuous Skip-gram: predict context (near-terms) based on a word
    • w —> w-2, w-1, w+1, w+2
    • k-skip-n-gram: k and n determine complexity (training time vs accuracy)
    • helps create more samples from a smaller data set (data sparsity, rare terms)
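Both variants are available in common word2vec implementations; a minimal sketch with gensim (assuming gensim 4.x, where the sg flag selects skip-gram and vector_size sets the embedding dimension):

    from gensim.models import Word2Vec

    sentences = [
        ["hello", "everyone", "who", "is", "eager", "to", "learn", "nlp"],
        ["happy", "birthday", "to", "you"],
    ]  # in practice: a large tokenized corpus

    cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW
    skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram

    print(skip.wv["nlp"][:5])                   # first few dimensions of the learned vector
    print(skip.wv.most_similar("nlp", topn=3))  # nearest neighbours in the vector space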
SLIDE 27

Diagram borrowed from Mikolov et al.’s paper

SLIDE 28
NN-based Probabilistic Prediction Model

  • 1. Probability of the next term, via the chain rule of probability:

    P(w_1, w_2, …, w_{t-1}, w_t) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) … P(w_t | w_1, …, w_{t-1})

    Approximate t with n to gain the simplicity of n-grams.

  • 2. d-dimensional feature vectors C_{w_{t-i}} (column w_{t-i} of parameter matrix C), concatenated into the input x; C_k contains the learned features for word k:

    x = (C_{w_{t-n+1},1}, …, C_{w_{t-n+1},d}, …, C_{w_{t-1},1}, …, C_{w_{t-1},d})

  • 3. Use a standard NN for probabilistic classification (softmax):

    P(w_t = k | w_{t-n+1}, …, w_{t-1}) = e^{a_k} / Σ_{i=1}^{N} e^{a_i}

    where a_k = b_k + Σ_{i=1}^{h} W_{ki} tanh(c_i + Σ_{j=1}^{(n-1)d} V_{ij} x_j)
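A compact numpy sketch of this forward pass; the sizes (vocabulary N, embedding dimension d, hidden units h, order n) and the random weights are made up, so the output only illustrates the shapes and the softmax step:

    import numpy as np

    N, d, h, n = 10, 4, 8, 3                     # vocab size, embedding dim, hidden units, order
    rng = np.random.default_rng(0)

    C = rng.normal(size=(N, d))                  # word feature matrix (one row per word)
    V = rng.normal(size=(h, (n - 1) * d))        # input-to-hidden weights
    c = np.zeros(h)
    W = rng.normal(size=(N, h))                  # hidden-to-output weights
    b = np.zeros(N)

    context = [3, 7]                             # indices of w_{t-n+1}, ..., w_{t-1}
    x = np.concatenate([C[w] for w in context])  # step 2: concatenated feature vectors

    a = b + W @ np.tanh(c + V @ x)               # step 3: activations a_k
    p = np.exp(a) / np.exp(a).sum()              # softmax: P(w_t = k | context)
    print(p.shape, round(p.sum(), 6))            # (10,) 1.0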

SLIDE 29

Diagram borrowed from Bengio et al.’s paper

SLIDE 30

NLP is not New…


ABBYY, Angoss, Attensity, AUTINDEX, Autonomy, Averbis, Basis Technology, Clarabridge, Complete Discovery Source, Endeca Technologies, Expert System S.p.A., FICO Score, General Sentiment, IBM LanguageWare, IBM SPSS, Insight, LanguageWare, Language Computer Corporation, Lexalytics, LexisNexis, Luminoso, Mathematica, MeaningCloud, Medallia, Megaputer Intelligence, NetOwl, RapidMiner, SAS Text Miner and Teragram, Semantria, Smartlogic, StatSoft, Sysomos, WordStat, Xpresso, …

SLIDE 31

…but Getting Hot (Again)

  • Big text data sets available
  • Distributed processing tech & capacity cheaper
  • ML-based training economically possible (and more accurate)
  • Open source movement
  • Large upswing potential…

No animals were harmed during this photo shoot

SLIDE 32

Cheat Sheet

  • openNLP - Java, Apache, familiar, easier, older
  • coreNLP - Java, Stanford, popular, good tool span
  • NLTK - Python, rich in resources, easiest
  • spaCy - up and coming, Python, promising..
  • fastText - nothing new..?
  • Spark - “ML framework”, custom implementation, large scale
  • Deeplearning4j - word2vec (Java, Scala)
  • TensorFlow (SyntaxNet) - separated optimization & more tuning knobs, better syntax parsing model, very recently large scale too

SLIDE 33
Summary and Questions

  • Language is key to our species’ success
  • Our multi-step processing is complex, and our brains are forgiving
  • A language model represents word-sequence distributions within a language
  • Bag-of-words and n-grams are common representations
  • Naive Bayes is common for probabilistic models
  • Distributed representations are dense and powerful
  • NNLMs are based on learned word features
  • Positive NLP trends: more open source tools and frameworks, and generated distributed representations available to all

SLIDE 34

@EVANAHARI, YOW! AUSTRALIA 2016

Jabberwocky

Vote!