  1. Subwords, Seriously? Ken Church KennethChurch@baidu.com CLSW-2020

  2. Tokenization
  • Modern Deep Nets: BERT and ERNIE
  • Two modes:
    • Known words (W): directional → directional
    • Unknown words (OOVs): unidirectional → un ##idi ##re ##ction ##al
      • Subwords, byte pair encoding (BPE)
      • No word-formation rules that derive new words from other words
  • Proposal: add a third case between known and unknown: almost known (AK); see the sketch after this slide.
    • Many OOVs are near known words (K): unidirectional → uni + directional
    • Near: OOV → pre K | OOV → K suf
      • where K is a known word
      • and pre and suf are on a short list of prefixes and suffixes
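The "almost known" case can be prototyped in a few lines. The sketch below is a minimal illustration under assumed inputs (a toy vocabulary and hand-picked affix lists), not the author's implementation; in practice the vocabulary W and the affix lists would come from the tokenizer's lexicon.

```python
# Minimal sketch of the "almost known" (AK) case: try to rewrite an OOV as
# pre + K or K + suf, where K is a known word and pre/suf come from short
# hand-picked lists. The vocabulary and affix lists below are toy placeholders.

PREFIXES = ["uni", "un", "re", "pre", "anti"]
SUFFIXES = ["ing", "ed", "er", "al", "s"]

def almost_known(oov, vocab):
    """Return ('pre', prefix, stem) or ('suf', stem, suffix) if the OOV
    decomposes into an affix plus a known word, otherwise None."""
    for pre in PREFIXES:
        stem = oov[len(pre):]
        if oov.startswith(pre) and stem in vocab:
            return ("pre", pre, stem)
    for suf in SUFFIXES:
        stem = oov[:-len(suf)]
        if oov.endswith(suf) and stem in vocab:
            return ("suf", stem, suf)
    return None  # fall back to subwords (BPE / WordPiece)

vocab = {"directional", "zipped", "dog", "bark"}
print(almost_known("unidirectional", vocab))  # ('pre', 'uni', 'directional')
print(almost_known("barking", vocab))         # ('suf', 'bark', 'ing')
```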

  3. Example from PubMed (Medical Abstracts)
  • ERNIE/BERT baseline (48 tokens): un ##idi ##re ##ction ##al mixed l ##ym ##ph ##oc ##yte cultures ( ml ##c ) were set up using bo ##vine peripheral blood l ##ym ##ph ##ocytes ( p ##bl ) as respond ##er cells and auto ##log ##ous cell lines transformed in vitro by t .
  • Proposed (35 tokens): UNI- DIRECTIONAL MIXED lymphocyte CULTURES ( M- LC ) WERE SET UP USING BO- VINE PERIPHERAL BLOOD lymphocytes ( pbl ) AS RESPOND -ER CELLS AND autologous CELL LINES TRANSFORMED IN VITRO BY T .
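For comparison, the baseline side of this example can be approximated with any off-the-shelf WordPiece tokenizer. The sketch below is an assumption-laden illustration: it uses Hugging Face's bert-base-uncased vocabulary, which is not necessarily the ERNIE/BERT vocabulary behind the 48-token count above, so the exact splits and counts may differ.

```python
# Sketch: approximate the baseline (WordPiece) tokenization of the PubMed
# sentence with a standard BERT tokenizer. Token counts depend on the model's
# vocabulary and may not match the 48 shown on the slide exactly.
from transformers import BertTokenizer

text = ("unidirectional mixed lymphocyte cultures (mlc) were set up using "
        "bovine peripheral blood lymphocytes (pbl) as responder cells and "
        "autologous cell lines transformed in vitro by t.")

tok = BertTokenizer.from_pretrained("bert-base-uncased")
pieces = tok.tokenize(text)
print(len(pieces))
print(pieces[:8])  # expected to start with something like ['un', '##idi', ...]
```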

  4. Observation: Many OOVs are near known words
  • Example:
    • s = unidirectional (almost known)
    • w = directional (known)
    • w ∈ dict (unlike s)
  • When s is near w, there are opportunities to infer the sound and meaning of s from w.
  • Claim: these inferences are safer than backing off to subwords (spelling).
  • Many applications:
    • Sound: g2p (grapheme to phoneme) for TTS (text to speech)
    • Meaning: translation
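The g2p application can be illustrated with the CMU Pronouncing Dictionary: if s = uni + w and w is in the dictionary, a pronunciation for s can be composed from the affix and the stem. This is a sketch, not the author's system; the prefix phones below are a hand-coded assumption for illustration only.

```python
# Sketch of the g2p application: pronounce an almost-known OOV (uni + stem)
# by concatenating an assumed affix pronunciation with the stem's entry in
# the CMU Pronouncing Dictionary (via NLTK). Illustration only.
import nltk
nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

PRON = cmudict.dict()                              # word -> list of phone lists
PREFIX_PRON = {"uni": ["Y", "UW2", "N", "IH0"]}    # assumed phones for "uni-"

def g2p_almost_known(prefix, stem):
    """Compose a pronunciation for prefix + stem; None if either part is unknown."""
    if prefix not in PREFIX_PRON or stem not in PRON:
        return None
    return PREFIX_PRON[prefix] + PRON[stem][0]

print(g2p_almost_known("uni", "directional"))
```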

  5. Spoiler Alert: Morphology and Semantics as Vector Rotations
  • WordNet Semantics:
    • Some semantic relations: synonymy, antonymy, is-a
    • Collect seeds for training:
      • is-a: <car, vehicle>, ...
      • synonym: <good, honest>, <good, proficient>, ...
      • antonym: <good, bad>, <good, evil>, ...
    • Learn rotations:
      • vec(car) R_isa ≈ vec(vehicle)
      • vec(good) R_syn ≈ vec(honest)
      • vec(good) R_ant ≈ vec(bad)
    • Thus, x R y ⟹ vec(x) R ≈ vec(y)
      • words ⟹ vectors
      • functions on words (predicates, relations) ⟹ rotations
    • What is the meaning of not? ¬x ⟹ vec(x) R_not, by analogy with vec(un + x) ≈ vec(x) R_un
  • Morphology:
    • Some morphological relations:
      • unidirectional → uni + directional
      • unzipped → un + zipped
      • dogs → dog + s
      • barking → bark + ing
    • Collect seeds for training: <uni + x, x>, <un + x, x>, <x + s, x>, <x + ing, x>
    • Learn rotations R:
      • vec(uni + x) R_uni ≈ vec(x)
      • vec(un + x) R_un ≈ vec(x)
      • vec(x + s) R_s ≈ vec(x)
      • vec(x + ing) R_ing ≈ vec(x)
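The "learn rotations" step can be sketched as follows. The slides do not say how the rotations are estimated; the code below uses orthogonal Procrustes (one standard way to fit a rotation from seed pairs) on placeholder random embeddings, so the embedding table, the seed pairs, and the fitting method are all assumptions for illustration.

```python
# Sketch: learn a rotation R from seed pairs <x, y> so that vec(x) @ R ≈ vec(y),
# using orthogonal Procrustes (one standard way to constrain the map to a
# rotation). Embeddings here are random placeholders; real embeddings
# (word2vec, GloVe, BERT, ...) would replace them.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def learn_rotation(vec, seed_pairs):
    """Fit an orthogonal R minimizing ||X R - Y||_F over the seed pairs."""
    X = np.stack([vec[a] for a, b in seed_pairs])
    Y = np.stack([vec[b] for a, b in seed_pairs])
    R, _ = orthogonal_procrustes(X, Y)
    return R

rng = np.random.default_rng(0)
words = ["unidirectional", "directional", "uniform", "form", "unicycle", "cycle"]
vec = {w: rng.normal(size=50) for w in words}

# Seeds for the "uni-" relation: <uni + x, x>
R_uni = learn_rotation(vec, [("unidirectional", "directional"), ("uniform", "form")])

# Apply to a held-out almost-known word: is vec(unicycle) @ R_uni near vec(cycle)?
pred = vec["unicycle"] @ R_uni
cos = pred @ vec["cycle"] / (np.linalg.norm(pred) * np.linalg.norm(vec["cycle"]))
print(round(float(cos), 3))  # meaningful only with real embeddings, not random ones
```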

  6. Motivations: Black Boxes vs. Gray Boxes
  • Modern Deep Nets (Black Boxes): un ##idi ##re ##ction ##al
    • Desiderata:
      • End-to-end performance: system test ≫ unit test
      • Intermediate representations considered harmful
      • Small vocabularies (V): space and time grow with V
      • Generalization to other tasks, domains, languages, etc.
        • Morphology tends to be language specific
      • Optimization ≫ annotation ≫ creating lists by hand
      • Linguistic resources considered harmful
    • Non-desiderata:
      • Linguistic generalizations
      • https://en.wikipedia.org/wiki/Frederick_Jelinek: "Every time I fire a linguist, the performance of the speech recognizer goes up."
    • BPE (byte pair encoding):
      • Generous definition: an optimization to find a small vocabulary of tokens with broad coverage
      • Not-so-generous definition: BPE ≈ spelling
      • Spelling ≫ sound & meaning: spelling is observable
  • Traditional Linguistics (Gray Boxes): UNI- DIRECTIONAL
    • Intermediate representations: unit test ≫ system test
    • Capture relevant linguistic generalizations
    • Example of an intermediate representation: morphology
    • Relevant linguistic generalizations: Sound (S) and Meaning (M)
      • unidirectional ~ directional
      • Capture generalizations associated with the stem, directional:
        • S(unidirectional) ~ S(directional)
        • M(unidirectional) ~ M(directional)
      • Capture generalizations associated with the affix, uni:
        • S(unidirectional) ~ S(uni) (vowel)
        • M(unidirectional) ~ M(uni) (one)
    • Sound & meaning ≫ spelling: deep representations are more insightful than superficial observations

  7. A Pendulum Swung Too Far (Church, 2011)

  8. A Pendulum Swung Too Far (Church, 2011)
  • 1950s: Empiricism (Shannon, Skinner, Firth, Harris)
  • 1970s: Rationalism (Chomsky and Minsky)
  • 1990s: Empiricism (IBM Speech Group, AT&T Bell Labs)
  • 2010s: A return to Rationalism?
  • [Figure: the pendulum alternates between generations of grandparents and grandchildren]
  • Fads come, and fads go:
    • 2010s: Deep Nets
    • 2030s: DARPA AI Next ("We don't need more cat detectors")

  9. Jurafsky: Interspeech-2016, NAACL-2009 (https://www.superlectures.com/interspeech2016/)
  • Jurafsky uses the history of ketchup (and ice cream) to shed light on currently popular methods in speech and language.
  • He traces the etymology of "ketchup" to an Asian fish sauce; advances in (sailing) technology made it possible to replace anchovies with less expensive tomatoes and sugar from the West.
  • The ice cream story combines fruit syrups (sharbat) from Persia with gunpowder from China and advances in refrigeration technology.
  • Big tent, better together: Humanities + Engineering + Stats.

  10. The Speech Invasion
  • At speech meetings (Interspeech-2016, as opposed to NAACL-2009), Jurafsky credits speech researchers for transferring currently popular techniques from speech to language.

  11. What happened in 1988? (https://www.superlectures.com/interspeech2016/)

  12. Jurafsky's story is nice and simple, but history is "complicated."
  • IMHO, speech did unto language what was done unto them (https://www.superlectures.com/interspeech2016/).
  • What happened in 1975? The same thing that happened to language in 1988 (and to hedge funds in the 1990s, and to politics in 2016)?

  13. Robert Mercer, ACL Lifetime Achievement 2014 (http://techtalks.tv/talks/closing-session/60532/): End-to-end vs. Representation

  14. A Unified (Dystopian) Perspective: The World Would Be Better Off Without People
  • More on firing linguists...
  • Self-driving cars: the most dangerous thing about a car is the driver. Let's get rid of drivers.
  • Hedge funds: the weak spot in an investment fund is the fund manager. Let's get rid of fund managers.
  • Speech, machine translation, CL, deep nets: the most dangerous thing is the researchers. Let's get rid of researchers (and especially the linguists).
  • Politics: government would work better without politicians. See the discussion of Brexit and the 2016 US election in https://en.wikipedia.org/wiki/Robert_Mercer
  • In these difficult times, it would be good if the world were more tolerant of one another, and willing to love one another through thick and thin.
