language as an interface

Language as an Interface - Spencer Kelly



  1. Language as an Interface Spencer Kelly

  2. introduction The pope is catholic. language as data language as an interface

  3. introduction (@spencermountain)

  4. introduction

  5. problem

  6. problem

  7. problem

  8. problem
     4-gram: london in the rain
     3-gram: london in the, in the rain
     2-gram: london in, in the, the rain
     1-gram: london, in, the, rain
     4 words: 10 requests per keystroke
     5 words: 15
     6 words: 21
     7 words: 28
     8 words: 36
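The request counts on this slide follow from enumerating every contiguous n-gram of the typed phrase: a phrase of n words has n(n + 1) / 2 of them. A minimal sketch of that enumeration; the function names `allNgrams` and `ngramCount` are made up for illustration and do not come from the talk.

```javascript
// Enumerate every contiguous n-gram of a phrase, longest first.
function allNgrams(text) {
  const words = text.trim().split(/\s+/);
  const grams = [];
  for (let size = words.length; size >= 1; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      grams.push(words.slice(start, start + size).join(' '));
    }
  }
  return grams;
}

// Closed form for the number of n-grams (one lookup each): n(n + 1) / 2.
function ngramCount(wordCount) {
  return (wordCount * (wordCount + 1)) / 2;
}

console.log(allNgrams('london in the rain').length); // 10
console.log([4, 5, 6, 7, 8].map(ngramCount)); // [ 10, 15, 21, 28, 36 ]
```

The quadratic growth is the problem the slide points at: every extra word typed adds more lookups than the word before it did.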

  9. problem
     Stopwords (blacklist), Edge gram (filter), Redundancy (check)
     [slide table, steps #1 to #3, applied to the n-grams of “london in the rain”: london in the rain, london in the, in the rain, london in, in the, the rain, london, in, the, rain]
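The slide only names the three filters, so the sketch below is one possible reading and not the speaker's implementation. It assumes that "stopwords" and "edge gram" mean rejecting a candidate that starts or ends with a stopword, and that "redundancy" means skipping a candidate that was already requested on an earlier keystroke. The stopword list and all identifiers are hypothetical.

```javascript
// Hypothetical stopword blacklist; the talk's actual list is not shown.
const STOPWORDS = new Set(['in', 'the', 'a', 'of', 'on']);

// Candidates already sent on earlier keystrokes (assumed "redundancy" check).
const alreadyRequested = new Set();

// Return only the candidates still worth a lookup.
function candidatesToRequest(grams) {
  const fresh = [];
  for (const gram of grams) {
    const words = gram.split(' ');
    // Assumed "stopwords" / "edge gram" filter: a stopword on either edge,
    // which also covers candidates made up only of stopwords.
    if (STOPWORDS.has(words[0]) || STOPWORDS.has(words[words.length - 1])) continue;
    // Assumed "redundancy" check: never request the same candidate twice.
    if (alreadyRequested.has(gram)) continue;
    alreadyRequested.add(gram);
    fresh.push(gram);
  }
  return fresh;
}
```

Under these assumptions, feeding in the six n-grams of "london in the" leaves only "london", and then feeding in the ten n-grams of "london in the rain" leaves "london in the rain" and "rain": two lookups where there were ten.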

  10. niche When all you’ve got is a jackhammer..
      ● NLTK - excellent, huge, python
      ● Stanford parser - excellent, huge, java
      ● Freeling - excellent, huge, C++
      ● Illinois tagger - excellent, huge, java
      Or an offsite API?
      ● Alchemy, TextRazor, OpenCalais, Embedly, Zemanta

  11. niche Can it be hacked? tldr: yes.

  12. niche Zipf’s law
      The top 10 words account for 25% of language.
      The top 100 words account for 50% of language.
      The top 50,000 words account for 95% of language.

  13. niche How big is a language?
      Shakespeare - 35,000
      WordNet - 200,000 !
      OED - 600,000 !

  14. niche An average person will only ever hear 50,000 different words
      602 kb uncompressed
      ~4 lookups in binary search

  15. process first, let’s kill the nouns (70%): 180 kb uncompressed

  16. niche improveify your vocabularies
      Noun         Verb         Adjective   Adverb
      Tomato       Speak        nice        quickly
      Tomatoes     Spoke        nicer       quicklier
      Toronto      Speaking     nicest      quickliest
      Torontonian  will speak
                   have spoken
                   had spoken
                   ...
      *not is, *not handsome, *not truly, *not economics
      “tomatoey” / “tomato”, “aggressiveness” / “aggressive”, “civil” / “civilize”, “quickly” / “quick”, “speaker” / “speak”, “awesomeify” / “awesome”
      Each word: n/2.3
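The word pairs on this slide suggest storing only root words and deriving the rest by rule. A minimal sketch of that idea follows. The root list, the suffix list, and the reading of the slide's "*not" entries as exceptions are all assumptions made for illustration, not the speaker's data or code.

```javascript
// Hypothetical root lexicon; a real one would hold thousands of entries.
const ROOTS = new Set(['tomato', 'aggressive', 'quick', 'speak', 'awesome']);

// Hypothetical derivational suffixes, tried in order.
const DERIVATIONAL_SUFFIXES = ['ness', 'ify', 'ey', 'ly', 'er'];

// Words assumed to be exceptions (the slide's "*not" examples).
const EXCEPTIONS = new Set(['is', 'handsome', 'truly', 'economics']);

// Map a word to its stored root, or null when no rule leads to a known root.
function toRoot(word) {
  const lower = word.toLowerCase();
  if (ROOTS.has(lower) || EXCEPTIONS.has(lower)) return lower;
  for (const suffix of DERIVATIONAL_SUFFIXES) {
    if (!lower.endsWith(suffix)) continue;
    const stem = lower.slice(0, -suffix.length);
    if (ROOTS.has(stem)) return stem; // only accept stems that are known roots
  }
  return null;
}

console.log(toRoot('tomatoey')); // tomato
console.log(toRoot('awesomeify')); // awesome
console.log(toRoot('truly')); // truly (exception, returned unchanged)
```

Checking the stripped stem against the root list is the design choice that keeps the rules from inventing roots: a word that merely ends in "er" or "ly" is left alone unless what remains is already known.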

  17. process then, let’s conjugate our verbs 110 kb uncompressed

  18. process 110 kb uncompressed
      react - 653kb
      lodash - 503kb
      d3js - 330kb
      jQuery - 256kb
      the whole english language - 110kb

  19. process Ok, let’s roll our own POS tagger.. (what could go rong?)

  20. 1) Lexicon 2) Suffix regexes 3) Sentence-level markov chain

  21. process Suffix rules
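The slide body did not survive extraction, so the rules below are illustrative guesses at what suffix regexes for part-of-speech tagging can look like. They are not the rules used in the talk, and `guessTagBySuffix` is a hypothetical name.

```javascript
// Hypothetical suffix patterns, checked in order; the first match wins.
const SUFFIX_RULES = [
  [/ing$/, 'Verb'],
  [/ed$/, 'Verb'],
  [/ly$/, 'Adverb'],
  [/(ous|ful|able|est)$/, 'Adjective'],
  [/(tion|ness|ment)$/, 'Noun'],
];

// Guess a tag from the word ending alone, or return null when nothing matches.
function guessTagBySuffix(word) {
  const lower = word.toLowerCase();
  for (const [pattern, tag] of SUFFIX_RULES) {
    if (pattern.test(lower)) return tag;
  }
  return null;
}

console.log(guessTagBySuffix('rocking')); // Verb
console.log(guessTagBySuffix('quickly')); // Adverb
console.log(guessTagBySuffix('london')); // null
```

Rules like these are cheap but blunt ("ceiling" would be tagged Verb here), which is why they sit behind the lexicon as a second step.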

  22. process Grammar rules - markov
      She could walk the walk.
      before: Verb - Det - Verb
      after: Verb - Det - Noun
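The before/after pair on this slide can be reproduced with a single hand-written transition rule: a word tagged Verb that directly follows a determiner is retagged as a Noun. This is a simplified stand-in for the sentence-level markov step the talk mentions, not a trained model, and `applyDeterminerRule` is a hypothetical name.

```javascript
// Retag a Verb as a Noun when the previous tag is a determiner.
function applyDeterminerRule(tags) {
  return tags.map((tag, i) =>
    tag === 'Verb' && tags[i - 1] === 'Det' ? 'Noun' : tag
  );
}

// "walk the walk"
console.log(applyDeterminerRule(['Verb', 'Det', 'Verb'])); // [ 'Verb', 'Det', 'Noun' ]
```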

  23. process “Unreasonable effectiveness” of rule-based taggers
      ● a 1,000 word lexicon - 45% precision
      ● fallback to [Noun] - 70% precision
      ● a little regex - 74% precision
      ● a little grammar in it - 81% precision
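To see why falling back to Noun helps so much, score a lexicon-only tagger against a few hand-labelled tokens and then score it again with the fallback switched on. The sketch below does that on a toy sample. The lexicon, the labels, and the resulting numbers are invented for illustration and are not the talk's figures; "precision" is assumed here to mean the share of tokens tagged correctly.

```javascript
// Tiny hypothetical lexicon and hand-labelled sample, for illustration only.
const TINY_LEXICON = { she: 'Pronoun', could: 'Verb', the: 'Det' };
const GOLD = [
  ['she', 'Pronoun'],
  ['could', 'Verb'],
  ['walk', 'Verb'],
  ['the', 'Det'],
  ['walk', 'Noun'],
];

// Lexicon lookup; unknown words get `fallback` (null means "no guess").
function tagWord(word, fallback) {
  return Object.prototype.hasOwnProperty.call(TINY_LEXICON, word)
    ? TINY_LEXICON[word]
    : fallback;
}

// Share of tokens whose predicted tag matches the hand label.
function shareCorrect(gold, fallback) {
  const correct = gold.filter(([word, tag]) => tagWord(word, fallback) === tag).length;
  return correct / gold.length;
}

console.log(shareCorrect(GOLD, null)); // 0.6, lexicon only
console.log(shareCorrect(GOLD, 'Noun')); // 0.8, lexicon, then fall back to Noun
```

The fallback wins because most words missing from a small lexicon of common words are nouns, so guessing Noun is right more often than not guessing at all.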

  24. outcome
      t.text("keep on rocking in the free world")
      t.negate() // "don't keep on rocking in the free world."

  25. outcome
      t.text("it is a cool library")
      t.toValleyGirl() // "so, it is like, a cool library."

  26. outcome We gave the monkeys the bananas, ..because they were ripe. ..because they were hungry.

  27. outcome
      Knowledge engine: [act / transfer / voluntary] [genus / monkey] [plant / banana]
      Dependency parser: We give [Noun] [Noun]
      POS-tagging: [Pr] [Verb] [Dt] [Noun] [Dt] [Noun]
      list of letters: We gave the monkeys the bananas

  28. outcome #TODOFML
      ● Mutable/Immutable API
      ● Speed, performance testing
      ● Romance-language verb conjugations
      ● ‘bl.ocks.org’ of demos and docs

  29. npm install --wooyeah Slack group, mailing list, github, Toronto/coffee @spencermountain
