Fast & Effective: Natural Language Understanding, Mike Conover


slide-1
SLIDE 1

Fast & Effective: Natural Language Understanding

Mike Conover, Ph.D. Principal Data Scientist

slide-2
SLIDE 2

SkipFlag

  • Smart Knowledge Base
  • Instant Answers
  • Expert Identification
  • Intelligent Bot
slide-3
SLIDE 3

Smart Knowledge Base

  • Entity Graph
  • Projects & Jargon
  • Relevant Articles
  • Documentation
  • Source Code
slide-4
SLIDE 4

Prototype Rapidly:

Or how to solve open research problems in a production environment on deadline.

slide-5
SLIDE 5

Reflections

Exercise is good for you.

slide-6
SLIDE 6

Reflections

Start with the model the state of the art claims to beat and implement that.

slide-7
SLIDE 7

Containers & Model Deployment

slide-8
SLIDE 8

Tiered Metadata Architecture

  • Compute-local data access
  • Memory-constrained environments
  • Fast bulk write
slide-9
SLIDE 9

Language in the Wild

Common Crawl

  • Petabyte Scale Web Crawl
  • Available for Free on S3

Twitter

Cornucopia of Malformed Text

Wikipedia

  • Linked Structured
  • Taxonomic
slide-10
SLIDE 10

Word Embeddings

slide-11
SLIDE 11

“All models are wrong, but some are useful.”

George Box

slide-12
SLIDE 12

Who Needs Grammar, Anyway?

Term weights (M’s of dimensions): Azimuth .5, Declination .9, Percolate .01

LSA / LDA, etc.

Reduced representation (100’s of dimensions): Orienteering .9, Physics 0.1
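The sparse side of this slide, with one dimension per vocabulary term, can be illustrated with a minimal bag-of-words sketch. The three-term vocabulary and the helper name `bag_of_words` are assumptions made for illustration; the LSA / LDA reduction step is not shown.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map text to a count vector over a fixed vocabulary, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical three-term vocabulary; a real one has millions of entries.
vocabulary = ["azimuth", "declination", "percolate"]

a = bag_of_words("azimuth declination azimuth", vocabulary)
b = bag_of_words("declination azimuth azimuth", vocabulary)
# a and b are identical: the representation keeps counts and discards grammar.
```

Because only counts survive, any reordering of the same words yields the same vector, which is the sense in which grammar is not needed here.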

slide-13
SLIDE 13

Targets of Interest

Document Feature

Classification

Ranking Feature Engineering

Clusters

EDA

slide-14
SLIDE 14

Semantic Structure

Gender: King, Queen, Man, Woman

Geography: Italy, Rome, Japan, Tokyo

Superlatives: Good, Better, Best
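Relationships like these are usually demonstrated with vector arithmetic: subtract one word vector, add another, and look for the nearest neighbour. The sketch below uses hand-made 2-d toy vectors (an assumption, not real embeddings) so the arithmetic is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors, purely illustrative (axis 0: royalty, axis 1: gender).
toy = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "prince": [0.8, 0.8],
    "man":    [0.1, 0.9],
    "woman":  [0.1, 0.1],
}

# "man is to woman as king is to ...?"
result = analogy("man", "woman", "king", toy)
```

With real pre-trained vectors the same function applies unchanged; only the dictionary of vectors differs.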

slide-15
SLIDE 15

Embedding Vectors

The sky above the port was the color of television, tuned to a dead channel.

Figure: one vector per word ("the", "sky", "above", ... "channel") laid out along the embedding dimension and combined into a document vector.
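The slide does not say how the word vectors are combined into a document vector. A common baseline, assumed here, is to average them. The toy 2-d vectors and the helper name `document_vector` are illustrative only.

```python
def document_vector(tokens, vectors):
    """Average the vectors of known tokens; tokens without a vector are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token had an embedding
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Toy 2-d embeddings; "port" is deliberately missing to show the skip.
toy = {"the": [0.0, 1.0], "sky": [1.0, 0.0], "above": [0.5, 0.5]}

tokens = "the sky above the port".split()
doc = document_vector(tokens, toy)
```

Averaging keeps the document vector in the same space as the word vectors, so documents and words can be compared with the same similarity function.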

slide-16
SLIDE 16

Word Embeddings

GloVe Vectors

  • Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB)
  • Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB)
  • Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB)
  • Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB)

Word2Vec

  • Google News (100B tokens, 3M vocab, 300d)
  • Freebase (100B words, 1.4M vocab, 300d)

Corpus, Casing, Dimensionality, Size

https://nlp.stanford.edu/projects/glove/
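The GloVe downloads are plain text, one row per token: the token followed by its space-separated vector components. A minimal loader sketch follows; the function name `load_glove` is an assumption, and an in-memory sample stands in for a real file.

```python
import io

def load_glove(lines):
    """Parse GloVe's text format: a token followed by its float components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Stand-in for an open file; real files hold one row per vocabulary entry.
sample = io.StringIO("the 0.1 -0.2 0.3\nsky 0.4 0.5 -0.6\n")
vectors = load_glove(sample)
```

With a downloaded file, the same function can be fed an open file handle, for example `open("glove.6B.50d.txt", encoding="utf-8")`. Loading the larger sets this way needs several gigabytes of memory.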

slide-17
SLIDE 17

Build Your Own Embeddings

  • Word2Vec
  • Doc2Vec
  • Poincaré Embeddings
  • LDA / LSA

Out of the Box

slide-18
SLIDE 18

TensorFlow Embedding Projector

Text Images Music .. Get Crazy

slide-19
SLIDE 19

Compositional Embeddings

  • Domain-Specific Corpora
  • Initialize with Pre-trained Embeddings
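One way to read "initialize with pre-trained embeddings" is sketched below: words found in a pre-trained table start from those vectors, and domain-specific words that are missing start from small random values before training on the domain corpus. The function name, the uniform range, and the sample vocabulary are all assumptions for illustration.

```python
import random

def init_embeddings(vocab, pretrained, dim, seed=0):
    """Build a starting embedding table for a domain vocabulary."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            # Copy so later fine-tuning does not modify the pre-trained source.
            table[word] = list(pretrained[word])
        else:
            # Out-of-vocabulary domain term: small random initialization.
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

pretrained = {"model": [0.2, -0.4, 0.1]}
table = init_embeddings(["model", "skipflag"], pretrained, dim=3)
```

The training loop that then updates this table on the domain corpus is out of scope for the sketch.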

slide-20
SLIDE 20

Cut to the Chase

FastText

  • Multiclass Classification
  • Subword Embeddings

https://github.com/facebookresearch/fastText

Bojanowski, Piotr, et al. "Enriching word vectors with subword information." arXiv (2016)

trichlorodifluorene

fluor .. trichl
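The subword idea can be sketched as character n-gram extraction: the word is wrapped in boundary markers and every n-gram in a length range is collected, so a rare word like the one above still shares pieces with more common ones. The 3 to 6 range follows the cited paper; the function itself is an illustrative sketch, not fastText's implementation (which also hashes the n-grams into buckets).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = "<" + word + ">"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

grams = char_ngrams("trichlorodifluorene")
# Both fragments shown on the slide are among the extracted n-grams.
has_fluor = "fluor" in grams
has_trichl = "trichl" in grams
```

A word's vector is then built from the vectors of its n-grams, which is why unseen words can still receive a usable embedding.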

slide-21
SLIDE 21

Embed All the Things!

StarSpace

  • Text Classification
  • Graph Embeddings
  • Similarity / Ranking
  • Image Classification

https://github.com/facebookresearch/StarSpace

Wu, L., et al. "StarSpace: Embed All The Things!" arXiv (2017)
slide-22
SLIDE 22

Fine-Grained Structure

displaCy

slide-23
SLIDE 23

Breakdown

displaCy

www.sadtromebone.com

slide-24
SLIDE 24

Piece by Piece

Keyphrase Extraction

  • RAKE Algorithm
  • SegPhrase / AutoPhrase

graham_askew | a | biomechanics_professor | at the | university_of_leeds | in | england | leads research | to | understand | better | how | the | chambered_nautilus | moves

  • F. Diaz. "Query expansion with locally-trained word embeddings." arXiv (2016)
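A simplified sketch of the RAKE idea: candidate phrases are runs of words between stopwords or punctuation, each word is scored by degree divided by frequency, and a phrase scores the sum of its words. The tiny stopword list is an assumption chosen for this one sentence; RAKE is normally run with a much larger list.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "at", "the", "in", "to", "how", "of", "and", "is"}

def rake(text, stopwords=STOPWORDS):
    """Rank candidate phrases: runs of non-stopwords between stopwords or punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9']+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences in the phrase, including itself

    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = ("Graham Askew, a biomechanics professor at the University of Leeds in England, "
        "leads research to understand better how the chambered nautilus moves.")
ranked = rake(text)
```

Unlike the segmentation shown on the slide, this sketch splits "university of leeds" at the stopword "of"; joining such adjacent keywords is a refinement left out here.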
slide-25
SLIDE 25

Taking Sentences Apart

displaCy

Zeroth Law: This only works in practice, never in theory.

slide-26
SLIDE 26

Learning to Rank with Neural Nets

Sometimes Good Enough Isn’t Good Enough

http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Severyn, Aliaksei, and Alessandro Moschitti. "Learning to rank short text pairs with convolutional deep neural networks." SIGIR 2015.

slide-28
SLIDE 28

Pete Skomoroch Sam Shah Scott Blackburn Matt Hayes

slide-29
SLIDE 29

Fast & Effective: Natural Language Understanding

Mike Conover, Ph.D. Principal Data Scientist

slide-30
SLIDE 30

Cut to the Chase

Emoji Space

https://github.com/facebookresearch/fastText

Bojanowski, Piotr, et al. "Enriching word vectors with subword information." arXiv (2016)

slide-31
SLIDE 31

Build Your Own Embeddings

Paragraph Vectors (Doc2Vec)

Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International Conference on Machine Learning. 2014.

slide-32
SLIDE 32

Ship It!

1. Prototype (2 Weeks)

Literature Review, Operationalization, Strawman Models

2. Productionize (2 Weeks)

Schedule, Service, Profiling

3. Harden (Ongoing)

Kaggle Challenge, Compute Footprint

slide-33
SLIDE 33

Instant Answers