Language as an Interface - Spencer Kelly



SLIDE 1

Language as an Interface

Spencer Kelly

SLIDE 2

The pope is catholic.

language as an interface language as data

introduction

SLIDE 3

(@spencermountain)

introduction

SLIDE 4

introduction

SLIDE 5

problem

SLIDE 6

problem

SLIDE 7

problem

SLIDE 8

london in the rain

4-gram: london in the rain
3-gram: london in the, in the rain
2-gram: london in, in the, the rain
1-gram: london, in, the, rain

10 requests per keystroke

4-words : 10
5-words : 15
6-words : 21
7-words : 28
8-words : 36
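The counts above follow from enumerating every contiguous n-gram of the typed phrase: an n-word phrase has n(n+1)/2 of them. A minimal sketch, assuming each n-gram costs one request; the function names `allNgrams` and `ngramCount` are illustrative and not taken from the talk:

```javascript
// Enumerate every contiguous n-gram of a phrase, longest first.
function allNgrams(text) {
  const words = text.trim().split(/\s+/).filter(Boolean);
  const grams = [];
  for (let size = words.length; size >= 1; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      grams.push(words.slice(start, start + size).join(" "));
    }
  }
  return grams;
}

// Closed form for the number of contiguous n-grams in an n-word phrase.
function ngramCount(n) {
  return (n * (n + 1)) / 2;
}

const grams = allNgrams("london in the rain");
console.log(grams.length); // 10
console.log(ngramCount(8)); // 36
```

The quadratic growth is the point of the slide: every extra word typed adds more lookups than the previous one did.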

problem

SLIDE 9

london in the rain

[n-gram list for “london in the rain”]

#1 Stopwords blacklist: in the
#2 Edge gram filter: “rain”, “london”
#3 Redundancy check: “london in the rain”
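One possible reading of the first two filters, sketched under assumptions: the stopword blacklist drops n-grams made up only of stopwords, and the edge gram filter drops n-grams that begin or end with a stopword. The slide does not spell out either rule, so both definitions, the stopword list, and all function names here are hypothetical. The redundancy check is left out because its rule cannot be recovered from the slide.

```javascript
// Hypothetical stopword list; a real one would be much longer.
const STOPWORDS = new Set(["in", "the", "a", "an", "of", "on", "at", "to"]);

// Every contiguous n-gram as an array of words, longest first.
function contiguousGrams(words) {
  const grams = [];
  for (let size = words.length; size >= 1; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      grams.push(words.slice(start, start + size));
    }
  }
  return grams;
}

// Filter #1 (assumed meaning): keep grams with at least one non-stopword.
function notAllStopwords(gram) {
  return gram.some((word) => !STOPWORDS.has(word));
}

// Filter #2 (assumed meaning): keep grams whose first and last words
// are not stopwords.
function hasCleanEdges(gram) {
  return !STOPWORDS.has(gram[0]) && !STOPWORDS.has(gram[gram.length - 1]);
}

function candidateGrams(text) {
  const words = text.toLowerCase().trim().split(/\s+/).filter(Boolean);
  return contiguousGrams(words)
    .filter(notAllStopwords)
    .filter(hasCleanEdges)
    .map((gram) => gram.join(" "));
}

console.log(candidateGrams("london in the rain"));
// [ 'london in the rain', 'london', 'rain' ]
```

Under these assumptions the ten candidate lookups shrink to three before any request is made.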

problem

SLIDE 10
  • NLTK - excellent, huge, Python
  • Stanford parser - excellent, huge, Java
  • Freeling - excellent, huge, C++
  • Illinois tagger - excellent, huge, Java

Or an offsite API? Alchemy, TextRazor, OpenCalais, Embedly, Zemanta

When all you’ve got is a jackhammer..

niche

SLIDE 11

Can it be hacked? tldr: yes.

niche

SLIDE 12

Zipf’s law

The top 10 words account for 25% of language.
The top 100 words account for 50% of language.
The top 50,000 words account for 95% of language.
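Coverage figures like these can be measured on any corpus by counting word frequencies and summing the largest counts. A minimal sketch; the function name `topKCoverage`, the tokenizing regex, and the toy sample sentence are assumptions for illustration, and a toy sample will not reproduce the percentages on the slide:

```javascript
// Share of all word tokens covered by the k most frequent distinct words.
function topKCoverage(text, k) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  if (words.length === 0) return 0;

  // Count occurrences of each distinct word.
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  // Sum the k largest counts and divide by the total token count.
  const sorted = [...counts.values()].sort((a, b) => b - a);
  const covered = sorted.slice(0, k).reduce((sum, count) => sum + count, 0);
  return covered / words.length;
}

// Toy input only; run this on a large corpus to see a Zipf-like curve.
const sample = "the cat sat on the mat and the dog sat on the cat";
console.log(topKCoverage(sample, 1).toFixed(2));
console.log(topKCoverage(sample, 4).toFixed(2));
```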

niche

SLIDE 13

How big is a language?

Shakespeare - 35,000
WordNet - 200,000 !
OED - 600,000 !

niche

SLIDE 14

602 kb

uncompressed

An average person will ever hear 50,000 different words

~4 lookups in binary search

niche

SLIDE 15

first, let’s kill the nouns 70%

process

180 kb

uncompressed

SLIDE 16

Noun: Tomato, Tomatoes; Toronto, Torontonian
Verb: Speak, Spoke, Speaking, will speak, have spoken, had spoken ...
Adjective: nice, nicer, nicest
Adverb: quickly, quicklier, quickliest

“awesome” “awesomeify”

improveify your vocabularies

“quickly” “quick”

n/2.3

Each word

*not handsome
*not truly
*not is
*not economics

“tomatoey” “tomato”
“speak” “speaker”
“aggressive” “aggressiveness”
“civil” “civilize”
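The idea on this slide, storing one base word and deriving its other forms by rule, can be sketched with a couple of toy suffix rules. Everything below is an assumption for illustration: the rules, the irregulars table, and the function names are not the talk's actual rule set, and they over-generate (for example "photo" or "big"), which is why an exceptions list like the "*not" entries above is needed in practice.

```javascript
// Hypothetical sample of irregular plurals that no suffix rule can derive.
const IRREGULAR_PLURALS = { person: "people", child: "children" };

// Derive a plural from a base noun with a few toy suffix rules.
function pluralize(noun) {
  if (Object.prototype.hasOwnProperty.call(IRREGULAR_PLURALS, noun)) {
    return IRREGULAR_PLURALS[noun];
  }
  if (/(s|x|z|ch|sh)$/.test(noun)) return noun + "es"; // box -> boxes
  if (/[^aeiou]y$/.test(noun)) return noun.slice(0, -1) + "ies"; // city -> cities
  if (/[^aeiou]o$/.test(noun)) return noun + "es"; // tomato -> tomatoes
  return noun + "s";
}

// Derive comparative and superlative forms from a base adjective.
function compare(adjective) {
  const stem = adjective.endsWith("e") ? adjective.slice(0, -1) : adjective;
  return { comparative: stem + "er", superlative: stem + "est" };
}

console.log(pluralize("tomato")); // tomatoes
console.log(compare("nice")); // { comparative: 'nicer', superlative: 'nicest' }
```

With rules like these, the lexicon only needs "tomato" and "nice"; the derived forms cost no storage.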

niche

SLIDE 17

then, let’s conjugate our verbs

process

110 kb

uncompressed

SLIDE 18

process

jQuery - 256kb
d3js - 330kb
react - 653kb
lodash - 503kb
the whole english language - 110kb (uncompressed)

SLIDE 19

Ok, let’s roll our own POS tagger..

(what could go rong?)

process

SLIDE 20

1) Lexicon
2) Suffix regexes
3) Sentence-level markov chain
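The first two stages can be sketched as a lookup followed by a list of suffix patterns, with a noun fallback when nothing matches. This is a minimal sketch, not the library's implementation: the lexicon entries, the suffix patterns, the tag names, and the functions `tagWord` and `tagSentence` are all hypothetical. The sentence-level third stage is not shown here.

```javascript
// Stage 1: a tiny, hypothetical lexicon of known words.
const LEXICON = {
  she: "Pronoun",
  could: "Verb",
  the: "Det",
  walk: "Verb",
  is: "Verb",
};

// Stage 2: hypothetical suffix patterns, tried in order.
const SUFFIX_RULES = [
  [/ly$/, "Adverb"],
  [/(ing|ed|ize)$/, "Verb"],
  [/(ous|ful|ive|able)$/, "Adjective"],
  [/(tion|ness|ment)$/, "Noun"],
];

function tagWord(word) {
  const lower = word.toLowerCase();
  if (Object.prototype.hasOwnProperty.call(LEXICON, lower)) {
    return LEXICON[lower];
  }
  for (const [pattern, tag] of SUFFIX_RULES) {
    if (pattern.test(lower)) return tag;
  }
  return "Noun"; // Unknown words default to Noun.
}

function tagSentence(sentence) {
  const words = sentence.match(/[A-Za-z']+/g) || [];
  return words.map((word) => ({ word, tag: tagWord(word) }));
}

console.log(tagSentence("She could walk the walk").map((t) => t.tag).join(" - "));
// Pronoun - Verb - Verb - Det - Verb
```

Note that the word-level stages tag both occurrences of "walk" as Verb; fixing the second one needs sentence context.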

SLIDE 21

Suffix rules

process

SLIDE 22

Grammar rules - markov

She could walk the walk.

before: Verb - Det - Verb
after: Verb - Det - Noun
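The before/after pair can be reproduced with a single hand-written correction pass over the tag sequence. Note the substitution: this is a fixed rewrite rule ("a Verb directly after a Det becomes a Noun"), not a trained Markov chain, and the function name `applyGrammarRules` and the tag spellings are assumptions for illustration.

```javascript
// Re-tag a Verb as a Noun when it directly follows a determiner.
function applyGrammarRules(tags) {
  const fixed = tags.slice(); // Leave the input array untouched.
  for (let i = 1; i < fixed.length; i++) {
    if (fixed[i - 1] === "Det" && fixed[i] === "Verb") {
      fixed[i] = "Noun";
    }
  }
  return fixed;
}

// Word-level tags for "She could walk the walk".
const before = ["Pronoun", "Verb", "Verb", "Det", "Verb"];
console.log(applyGrammarRules(before).join(" - "));
// Pronoun - Verb - Verb - Det - Noun
```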

process

SLIDE 23

“Unreasonable effectiveness” of rule-based taggers:

  • a 1,000 word lexicon - 45% precision
  • fallback to [Noun] - 70% precision
  • a little regex - 74% precision
  • a little grammar in it - 81% precision

process

SLIDE 24

t.text("keep on rocking in the free world")
t.negate()
// "don't keep on rocking in the free world."

outcome
SLIDE 25

t.text("it is a cool library")
t.toValleyGirl()
// "so, it is like, a cool library."

outcome
SLIDE 26

We gave the monkeys the bananas,

..because they were ripe.
..because they were hungry.

outcome
SLIDE 27

list of letters: We gave the monkeys the bananas
POS-tagging: [Pr] [Verb] [Dt] [Noun] [Dt] [Noun]
Dependency parser: We give [Noun] [Noun]
Knowledge engine: [act / transfer / voluntary] [genus / monkey] [plant / banana]

outcome
SLIDE 28

#TODOFML

  • Mutable/Immutable API
  • Speed, performance testing
  • Romance-language verb conjugations
  • ‘bl.ocks.org’ of demos and docs

outcome
SLIDE 29

npm install --wooyeah @spencermountain

Slack group, mailing list, github, Toronto/coffee
