Subwords, Seriously? Ken Church, KennethChurch@baidu.com, CLSW-2020

SLIDE 1

Subwords, Seriously?

Ken Church KennethChurch@baidu.com CLSW-2020

SLIDE 2

Tokenization

  • Modern Deep Nets
  • BERT and ERNIE
  • Two modes:
  • Known Words (W):
  • directional → directional
  • Unknown Words (OOVs):
  • unidirectional → un ##idi ##re ##ction ##al
  • Subwords, byte pair encoding (BPE)
  • No word formation rules that derive new words from other words
  • Proposal:
  • Add a 3rd case between known and unknown:
  • almost known (AK)
  • Many OOVs are near known words (W)
  • unidirectional → uni- directional
  • Near: OOV → pre W | OOV → W suf
  • Where W is a known word
  • And pre and suf are on a short list of prefixes and suffixes

SLIDE 3

Example from PubMed (Medical Abstracts)

ERNIE/BERT (Baseline): 48 Tokens

un ##idi ##re ##ction ##al mixed l ##ym ##ph ##oc ##yte cultures ( ml ##c ) were set up using bo ##vine peripheral blood l ##ym ##ph ##ocytes ( p ##bl ) as respond ##er cells and auto ##log ##ous cell lines transformed in vitro by t .

Proposed (Fewer Tokens): 35 Tokens

UNI- DIRECTIONAL MIXED lymphocyte CULTURES ( M- LC ) WERE SET UP USING BO- VINE PERIPHERAL BLOOD lymphocytes ( pbl ) AS RESPOND -ER CELLS AND autologous CELL LINES TRANSFORMED IN VITRO BY T .

SLIDE 4

Observation: Many OOVs are near known words

  • Example
  • s = unidirectional (almost known)
  • w = directional (known)
  • w ∈ dict (unlike s)
  • When s is near w
  • There are opportunities to infer the sound and meaning of s from w
  • Claim:
  • These inferences are safer than backing off to subwords (spelling)
  • Many applications:
  • Sound: g2p (grapheme to phoneme) for tts (text to speech)
  • Meaning: Translation

SLIDE 5

Spoiler Alert: Morphology and Semantics as Vector Rotations

Morphology

  • Some morphological relations
  • unidirectional → uni + directional
  • unzipped → un + zipped
  • dogs → dog + s
  • barking → bark + ing
  • Collect seeds for training:
  • < uni + x, x >
  • < un + x, x >
  • < x + s, x >
  • < x + ing, x >
  • Learn rotations R
  • vec(uni + x) R_uni ≈ vec(x)
  • vec(un + x) R_un ≈ vec(x)
  • vec(x + s) R_s ≈ vec(x)
  • vec(x + ing) R_ing ≈ vec(x)

WordNet Semantics

  • Some semantic relations:
  • synonymy, antonymy, is-a
  • Collect seeds for training
  • is-a: < car, vehicle >, …
  • synonym: < good, honest >, < good, proficient >, …
  • antonym: < good, bad >, < good, evil >, …
  • Learn Rotations:
  • vec(car) R_isa ≈ vec(vehicle)
  • vec(good) R_syn ≈ vec(honest)
  • vec(good) R_ant ≈ vec(bad)
  • Thus, x R y ⟹ vec(x) R ≈ vec(y)
  • words ⟹ vectors
  • functions on words (predicates, relations) ⟹ rotations
  • What is the meaning of not?
  • ¬x ⟹ vec(x) R_not
  • by analogy with: vec(un + x) ≈ vec(x) R_un

SLIDE 6

Motivations: Black Boxes vs. Gray Boxes

Modern Deep Nets (Black Boxes) un ##idi ##re ##ction ##al

  • Desiderata
  • End-to-End Performance: System test ≫ Unit test
  • Intermediate representations considered harmful
  • Small vocabularies (V): Space & Time grow with V
  • Generalization to other tasks, domains, languages, etc.
  • Morphology tends to be language specific
  • Optimization ≫ Annotation ≫ Creating lists by hand
  • Linguistic resources considered harmful
  • Non-Desiderata
  • Linguistic Generalizations
  • https://en.wikipedia.org/wiki/Frederick_Jelinek
  • Every time I fire a linguist,
  • the performance of the speech recognizer goes up
  • BPE (Byte Pair Encoding)
  • Generous Definition:
  • An optimization to find a small vocabulary of tokens with broad coverage
  • Not so generous definition:
  • BPE ≈ Spelling
  • Spelling ≫ Sound & Meaning:
  • Spelling is observable

Traditional Linguistics (Gray Boxes) UNI- DIRECTIONAL

  • Intermediate representations: Unit test ≫ System test
  • Capture relevant linguistic generalizations
  • Example of an intermediate representation
  • Morphology
  • Relevant Linguistic Generalizations
  • Sound (S)
  • Meaning (M)
  • π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š ~ π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š
  • Capture generalizations associated with stem: π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š
  • 𝑇(π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š) ~ 𝑇(π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š)
  • 𝑁(π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š) ~ 𝑁(π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š)
  • Capture generalizations associated with affix: π‘£π‘œπ‘—
  • 𝑇(π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š) ~ 𝑇(π‘£π‘œπ‘—)

(vowel)

  • 𝑁 π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š ~ 𝑁 π‘£π‘œπ‘—

(one)

  • Sound & Meaning ≫ Spelling
  • Deep representations are more insightful than superficial observations

SLIDE 7

A Pendulum Swung Too Far (Church, 2011)

SLIDE 8

A Pendulum Swung Too Far (Church, 2011)

  • 1950s: Empiricism
  • Shannon, Skinner, Firth, Harris
  • 1970s: Rationalism
  • Chomsky
  • Minsky
  • 1990s: Empiricism
  • IBM Speech Group
  • AT&T Bell Labs
  • 2010s: A Return to Rationalism?
  • 2010s: Deep Nets
  • 2030s: DARPA AI Next
  • “We don’t need more cat detectors”

Fads come, and fads go

Grandparents and Grandchildren

SLIDE 9

Jurafsky: Interspeech-2016, NAACL-2009

https://www.superlectures.com/interspeech2016/

  • Jurafsky uses the history of ketchup (& ice cream)
  • to shed light on currently popular methods in speech and language
  • He traces the etymology of “ketchup” from an Asian fish sauce
  • Advances in (sailing) technology made it possible to replace anchovies with less expensive tomatoes and sugar from the West
  • The ice cream story combines fruit syrups (Sharbat) from Persia with gun powder from China and advances in refrigeration technology

Big Tent

Better Together: Humanities + Engineering + Stats

SLIDE 10

The Speech Invasion

  • At speech meetings (Interspeech-2016, as opposed to NAACL-2009),
  • Jurafsky credits speech researchers for transferring currently popular techniques from speech to language.

SLIDE 11

What happened in 1988?

https://www.superlectures.com/interspeech2016/

SLIDE 12
  • Jurafsky’s story is nice & simple,
  • But history is “complicated”
  • IMHO,
  • speech did unto language
  • what was done unto them

What happened in 1975? The same thing that happened to language in 1988

(and to hedge funds in 1990s, and politics in 2016)?

https://www.superlectures.com/interspeech2016/

SLIDE 13

Robert Mercer ACL Lifetime Achievement

http://techtalks.tv/talks/closing-session/60532/

End-to-end vs. Representation

2014

SLIDE 14

A Unified (Dystopian) Perspective:

The World Would Be Better Off Without People

  • More on firing linguists…
  • Self-driving cars:
  • The most dangerous thing about a car is the driver.
  • Let's get rid of drivers.
  • Hedge funds:
  • The weak spot in an investment fund is the fund manager.
  • Let's get rid of fund managers.
  • Speech, Machine Translation, CL, Deep Nets:
  • The most dangerous thing is the researchers.
  • Let's get rid of researchers (and especially the linguists)
  • Politics:
  • Government would work better without politicians.
  • See discussion of Brexit and the 2016 US Election in https://en.wikipedia.org/wiki/Robert_Mercer
  • In these difficult times,
  • it would be good if the world were more tolerant of one another,
  • and willing to love one another through thick and thin.

SLIDE 15

Intolerance: Reviewing the Reviewers (Again)

https://www.cambridge.org/core/journals/natural-language-engineering/article/emerging-trends-reviewing-the-reviewers-again/10CDC1D71E1AEB21456CFBDA187CBCB6#fndtn-information

  • My most recent EMNLP submission was rejected with this remark:
  • “I recommend reading several recent ACL or EMNLP papers prior to submitting, to get a sense of the conventions of the field.”
  • Maybe this reviewer was trying to be helpful, but probably not
  • During the rebuttal period,
  • I wanted to mention some of my experience
  • (former president of ACL and
  • co-creator of EMNLP),
  • but could not see how to do that
  • within the restrictions of the blind reviewing process.
  • Not-ok reviews: intolerance
  • no one from your < stereotype >
  • does good work
  • Officer-on-deck: responsible for his watch
  • Bad if he knows about it;
  • Worse if he does not.
  • Constructive suggestion
  • for not-ok reviews:
  • social media
  • (negative feedback loops)
  • If you have received a not-ok review,
  • please help PCs improve by sharing
  • The process will improve over time
  • if reviewers teach authors
  • how to write better submissions,
  • and authors teach reviewers
  • how to write more constructive reviews.

SLIDE 16

SLIDE 17

On firing linguists…

  • Finally, they removed the dictionary lookup HMM,
  • taking for the pronunciation of each word its spelling.
  • Thus, a word like t-h-r-o-u-g-h was assumed to have a pronunciation like tuh huh ruh oh uu guh huh.

  • After training, the system learned that
  • with words like l-a-t-e the front end often missed the e.
  • Similarly, it learned that g's and h's were often silent.
  • This crippled system was still able to recognize
  • 43% of 100 test sentences correctly as compared with
  • 35% for the original Raleigh system.

1995

SLIDE 18

Sound & Meaning >> Spelling

1937 2012

https://www.icsi.berkeley.edu/icsi/news/2012/07/fillmore-lifetime-achievement-award

SLIDE 19

Motivations: Black Boxes vs. Gray Boxes

Modern Deep Nets (Black Boxes) un ##idi ##re ##ction ##al

  • Desiderata
  • End-to-End Performance: System test ≫ Unit test
  • Intermediate representations considered harmful
  • Small vocabularies (V): Space & Time grow with V
  • Generalization to other tasks, domains, languages, etc.
  • Morphology tends to be language specific
  • Optimization ≫ Annotation ≫ Creating lists by hand
  • Linguistic resources considered harmful
  • Non-Desiderata
  • Linguistic Generalizations
  • https://en.wikipedia.org/wiki/Frederick_Jelinek
  • Every time I fire a linguist,
  • the performance of the speech recognizer goes up
  • BPE (Byte Pair Encoding)
  • Generous Definition:
  • An optimization to find a small vocabulary of tokens with broad coverage
  • Not so generous definition:
  • BPE ≈ Spelling
  • Spelling ≫ Sound & Meaning:
  • Spelling is observable

Traditional Linguistics (Gray Boxes) UNI- DIRECTIONAL

  • Intermediate representations: Unit test ≫ System test
  • Capture relevant linguistic generalizations
  • Example of an intermediate representation
  • Morphology
  • Relevant Linguistic Generalizations
  • Sound (S)
  • Meaning (M)
  • π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š ~ π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š
  • Generalizations associated with stem: π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š
  • 𝑇(π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š) ~ 𝑇(π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š)
  • 𝑁(π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š) ~ 𝑁(π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š)
  • Generalizations associated with affix: π‘£π‘œπ‘— βˆ’
  • 𝑇(π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š) ~ 𝑇(π‘£π‘œπ‘—βˆ’)

(vowel)

  • 𝑁 π‘£π‘œπ‘—π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š ~ 𝑁 π‘£π‘œπ‘— βˆ’

(one)

  • Sound & Meaning ≫ Spelling
  • Deep representations are more insightful than superficial observations

Seriously?

SLIDE 20

Unknown Words (OOVs) Are Often Similar to Known Words

  • Prefix
  • unidirectional ~ directional
  • unidimensional ~ dimensional
  • bipotassium ~ potassium
  • Suffix
  • downloadable ~ download
  • Prefix Swap
  • electrometric ~ geometric
  • Suffix Swap
  • telephony ~ telephone
  • schizophrenic ~ schizophrenia
  • Rhyme
  • retch ~ stretch / sketch / fetch
  • Compound
  • houseboat ~ house + boat
  • houseboat ~ boat + house
  • Case-based Reasoning
  • if unidirectional ~ directional,
  • then sound(unidirectional) ~ sound(directional)
  • & meaning(unidirectional) ~ meaning(directional)
  • Assume short lists of affixes, plus a large dictionary
  • Dictionary entries
  • directional → IPA(directional)
  • directional → vec(directional)
  • directional → Chinese(directional)
  • Task:
  • Connect the dots between OOV (unidirectional)
  • and nearby known word (directional)
  • Inference
  • What is sound(unidirectional)?
  • What is meaning(unidirectional)?

SLIDE 21

Inferences with Almost Known Words (AK): Connect Dots → Sound and Meaning

Connect Dots: Almost Known ~ Known: unidirectional ~ directional

  • Prefix
  • unidirectional ~ directional
  • unidimensional ~ dimensional
  • bipotassium ~ potassium
  • Suffix
  • downloadable ~ download
  • Prefix Swap
  • dipotassium ~ bipotassium
  • Suffix Swap
  • telephony ~ telephone
  • schizophrenic ~ schizophrenia
  • Rhyme
  • retch ~ stretch / sketch / fetch
  • Compound
  • houseboat ~ house + boat
  • houseboat ~ boat + house

Sound (S) and Meaning (M): S(AK) ~ S(K); M(AK) ~ M(K)

  • Case-based Reasoning: What is S(w)? What is M(w)?
  • Plan A: reason by table lookup (w is known)
  • Plan B: reason by interpolation (w is almost known)
  • Plan C: reason from first principles (w is OOV)
  • Assume a large lexicon with sound and meaning (for known words)
  • directional → IPA(directional)
  • directional → vec(directional)
  • directional → Chinese(directional)
  • Task: Given unidirectional ~ directional
  • What is S(unidirectional)?
  • What is M(unidirectional)?
  • Assumptions:
  • S(unidirectional) ~ S(directional)
  • M(unidirectional) ~ M(directional)
  • Representations:
  • words ⇒ vectors, and relations ⇒ rotations (functions on vectors)

SLIDE 22

Heuristic: Minimize Splits

  • Fewer splits are better
  • More splits are more risky
  • Splits make use of (imperfect) compositionality assumptions
  • Ambiguity:
  • uni- directional
  • un- idi- re- ction- al
  • BERT/ERNIE tokenizers do not combine subwords and words
  • Two modes: known words & OOVs
  • No almost-known words:
  • Combinations of words & affixes

SLIDE 23

Recall Example from PubMed (Medical Abstracts)

ERNIE/BERT (Baseline): 48 Tokens

un ##idi ##re ##ction ##al mixed l ##ym ##ph ##oc ##yte cultures ( ml ##c ) were set up using bo ##vine peripheral blood l ##ym ##ph ##ocytes ( p ##bl ) as respond ##er cells and auto ##log ##ous cell lines transformed in vitro by t .

Proposed (Fewer Tokens): 35 Tokens

UNI- DIRECTIONAL MIXED lymphocyte CULTURES ( M- LC ) WERE SET UP USING BO- VINE PERIPHERAL BLOOD lymphocytes ( pbl ) AS RESPOND -ER CELLS AND autologous CELL LINES TRANSFORMED IN VITRO BY T .

SLIDE 24

Proposed Method → Fewer Subwords: Known, Unknown & Almost Known

Tokens per Abstract (PubMed)

            BERT/ERNIE Baseline    Proposed
Known       188.1                  188.1
Prefix      0.0                    5.5
Suffix      0.0                    3.5
Subwords    54.1                   7.3
Totals      242.2                  204.4

  • Almost-Known (words & affixes)
  • OOV → pre W
  • OOV → W suf
  • Where W is a known word
  • And pre and suf are members of a short list of prefixes and suffixes
  • Short lists were learned automatically from training data
  • Short list → High coverage
  • 59 affixes → 50% coverage
  • -ly, ed, cyto-, p-, lipo-, ac-, re-, ion, m-, un-, a-, h-, ase, c-, -ation, t-, mono-, nucleo-, in-, as-, l-, ing, id, intra-, sub-, r-, -al, g-, i-, hepato-, ory, up-, fr-, ad-, radio-, bio-, na-, immuno-, -in, non-, ity, e-, ization, de-, anti-, histo-, f-, able, ic, glyco-, -pre-, para-, bo-, s-, lympho-, ine, d-, dp, sero-

SLIDE 25

Inferences. Input: almost known word (unidirectional)

Output: Sound

  • If unidirectional is like directional
  • and we have IPA(directional)
  • What is IPA(unidirectional)?

Output: Meaning (vec/translation)

  • If unidirectional is like directional
  • and we have vec(directional)
  • What is vec(unidirectional)?
  • If unidirectional is like directional
  • and we have Chinese(directional)
  • What is Chinese(unidirectional)?

Coker, Church and Liberman (1990); Bilingual Lexicon Induction; Word2Vec Analogies & Knowledge Graph Completion

SLIDE 26

Coker, Church and Liberman (1990): Morphology and Rhyming: Two Powerful Alternatives to Letter-to-Sound Rules for Speech Synthesis

Morphology → Sound & Meaning; Rhyming → Sound, but not Meaning

SLIDE 27

Coker, Church and Liberman (1990): Morphology and Rhyming: Two Powerful Alternatives to Letter-to-Sound Rules for Speech Synthesis

Avoid spelling (when possible)

SLIDE 28

Coker, Church and Liberman (1990): Morphology and Rhyming: Two Powerful Alternatives to Letter-to-Sound Rules for Speech Synthesis

SLIDE 29

Short Lists: Construct by Hand (or by Machine Learning) Spelling, IPA, Vectors, etc.

  • prefixes:
  • be, bi, black, co, com, con, contra, de, di, dis, electro, en, ex, extra, fort, grand, hyper, hypo, im, in, inter, intra, mis, off, on, out, over, par, per, pre, pro, re, sub, super, sur, trans, tri, un, under, uni, wood, abdul, ab, lipo, i, cyber, meta, hexa, mono, quad, penta, sex, semi, hemi, after, tetra, anti, ante, agri, alta, non, mal, cor, bio, fore, for, micro, macro, multi, pri, tele, mari, smart, west, east, south, north, back, geo, auto, nano, para, power, mid, photo, phono, poly, techno, media, hand, neuro, petro, high, info, down, up, ever, poli, ultra, counter, aero, gene, metro, nova, hydro, radi, chemo, mni, arch, math, short, long, video, mano, terra, cardio, pseudo, amino, carbo, nutri, ab, thio, methyl, pheno, immuno, bacterio, baro, methylo, strepto, self
  • suffixes:
  • a, able, ably, al, ally, als, an, ance, and, ary, ate, ated, ating, ation, ations, ative, bury, d, e, ed, ee, ence, ent, ents, er, ers, er’s, es, ess, est, et, ey, ful, fully, ia, ial, ian, ians, ic, ical, ically, ier, ies, ied, iment, ility, ine, iness, ing, ingly, ings, ion, ions, ious, is, ise, ism, ist, ists, ite, ities, ity, ive, ized, less, lessly, like, ly, man, man’s, men, ment, ments, ness, or, ous, r, s, ’s, sion, son’s, stone, tion, tions, tive, way, y, y’s, n’t, ’d, son, sen, ability, able, nostics, nostic, sky, ski, tuple, owski, owsky, owicz, ewicz, ewski, ewsky, opoulos, berg, ville, land, ick, wood, field, town, ton, ford, burg, erman, stein, ington, itz, berger, ization, izing, meier, isation, ising, istic, ible, ified, ification, ologist, ology, ifying, ography, ance, ient, ience, iance
  • onsets (for Rhyming):
  • c, m, d, p, b, r, s, h, l, f, t, w, g, n, v, pr, ch, j, st, tr, br, k, sh, gr, cr, cl, th, sp, fr, bl, fl, pl, q, wh, dr, sc, sl, str, gl, sw, ph, z, sn, sm, sk, wr, thr, kn, scr, tw, sq, sch, spr, chr, rh, shr, kr, spl, ps, gh, kl, kh, gn, dw, gw, ts, phr, pt, x, tch, ll, vr, chl, sph, schw, schl, phl, dh, thw, sv, cz, bh, hr, vl, kw, schn, dm, psh, dl, bj, zw, tl, sf, schr, mn, dv
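One way the onset list can be used: find rhymes by stripping the onset and matching the remaining rime against a dictionary, as in retch ~ stretch / sketch / fetch. A toy sketch (ONSETS and DICT below are small illustrative subsets, not the real lists):

```python
# Toy rhyme finder: a word's rime is what remains after its onset.
ONSETS = ["str", "sk", "f", "r", "st"]
DICT = ["retch", "stretch", "sketch", "fetch", "reach"]

def rime(word):
    """Strip the longest matching onset; the rest is the rime."""
    for onset in sorted(ONSETS, key=len, reverse=True):
        if word.startswith(onset):
            return word[len(onset):]
    return word          # vowel-initial word: no onset to strip

def rhymes(word, dictionary=DICT):
    r = rime(word)
    return [w for w in dictionary if w != word and rime(w) == r]

print(rhymes("retch"))   # ['stretch', 'sketch', 'fetch']
```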

SLIDE 30

We used to create lists of affixes by hand. These days, there is more emphasis on training.

(IMHO, training short lists seems somewhat pointless)

Training (Learn T ≈ 1000 Affixes)

  • Input: training list of words, W
  • Split each word, w in W, into 2 pieces: w[0:k] + w[k:], for all k
  • If the first piece is in W, then w is evidence the other piece is a suffix
  • If the second piece is in W, then w is evidence the other piece is a prefix
  • Sort affixes by evidence
  • Output top T

Inference (Tokenization)

  • Input string s
  • If s ∈ dict, output s
  • Else if s → pre w or s → w suf,
  • where w ∈ dict
  • and pre/suf are lists of T affixes
  • output best analysis (sorted by evidence)
  • Otherwise, use subwords (from ERNIE/BERT)
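The two procedures above can be sketched as follows. This is a minimal reconstruction under my own reading of the slide, not the actual implementation; `learn_affixes` and `tokenize` are hypothetical names, and details like tie-breaking surely differ.

```python
from collections import Counter

def learn_affixes(words, top_t=10):
    """Split each word at every k; a piece counts as affix evidence
    when the other piece is itself a training word."""
    words = set(words)
    evidence = Counter()
    for w in words:
        for k in range(1, len(w)):
            first, second = w[:k], w[k:]
            if first in words:
                evidence[("suf", second)] += 1   # w = known word + suffix
            if second in words:
                evidence[("pre", first)] += 1    # w = prefix + known word
    return [affix for affix, _ in evidence.most_common(top_t)]

def tokenize(s, dictionary, affixes, subword_fallback=None):
    if s in dictionary:                          # known word
        return [s]
    for kind, a in affixes:                      # almost known: pre+W or W+suf
        if kind == "pre" and s.startswith(a) and s[len(a):] in dictionary:
            return [a + "-", s[len(a):]]
        if kind == "suf" and s.endswith(a) and s[:-len(a)] in dictionary:
            return [s[:-len(a)], "-" + a]
    return subword_fallback(s) if subword_fallback else [s]   # OOV

train = ["directional", "unidirectional", "dimensional", "zip", "zipped"]
affixes = learn_affixes(train)
print(tokenize("unidimensional", set(train), affixes))   # ['uni-', 'dimensional']
```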

SLIDE 31

Inferences. Input: almost known word (unidirectional)

Output: Sound

  ✓ If unidirectional is like directional
  ✓ and we have IPA(directional)
  ✓ What is IPA(unidirectional)?

Output: Meaning (vec/translation)

  • If unidirectional is like directional
  • and we have vec(directional)
  • What is vec(unidirectional)?
  • If unidirectional is like directional
  • and we have Chinese(directional)
  • What is Chinese(unidirectional)?

Coker, Church and Liberman (1990); Bilingual Lexicon Induction; Word2Vec Analogies & Knowledge Graph Completion

SLIDE 32

Word2vec

https://usc-isi-i2.github.io/slides/part-3.pdf

Words: Points in a Vector Space. Analogies: Vector Translations.

  • man is to woman
  • as king is to queen
  • dog is to dogs
  • as cat is to cats
  • Paris is to France
  • as London is to England
  • and Rome is to Italy
  • slow is to slower
  • as fast is to faster
  • and long is to longer
  • slower is to slowest
  • as fast is to fastest
  • and long is to longest

𝑀𝑓𝑑(π‘›π‘π‘œ) + 𝑀𝑓𝑑(π‘₯π‘π‘›π‘π‘œ) β‰ˆ 𝑀𝑓𝑑(π‘™π‘—π‘œπ‘•) + 𝑀𝑓𝑑(π‘Ÿπ‘£π‘“π‘“π‘œ)

Analogies from Word2Vec Paper (2013)
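The analogy equation can be illustrated with hand-made 2-d vectors (not real word2vec embeddings); `analogy` is a hypothetical helper that returns the vocabulary word nearest to vec(b) − vec(a) + vec(c):

```python
import numpy as np

# Toy vectors chosen so the man:woman :: king:queen analogy holds exactly.
vec = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}

def analogy(a, b, c, vocab=vec):
    """a is to b as c is to ? -> nearest word to vec(b) - vec(a) + vec(c)."""
    target = vec[b] - vec[a] + vec[c]
    return min(vocab, key=lambda w: np.linalg.norm(vec[w] - target))

print(analogy("man", "woman", "king"))   # queen
```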

SLIDE 33

Morphology as Vector Translation

man is to woman as king is to queen

vec(x) − vec(x + y) ≈ vec(z) − vec(z + y)

Analogies from Word2Vec Paper

SLIDE 34

Vector Translations and Rotations

  • Translation
  • Y = X + b
  • Rotation
  • Y = A X
  • Least Squares Regression
  • Y ~ X
  • Y ≈ A X + b
  • Task: learn f(X) ≈ Y
  • vec(uni + x) ~ vec(x)
  • Training, input seeds: < x, y >
  • < directional, unidirectional >
  • < dimensional, unidimensional >
  • …
  • Use regression (machine learning) to estimate A, b
  • Notation
  • X: Input Vector
  • Y: Output Vector
  • A, b: Constants
  • A: Rotation Matrix
  • b: bias
  • Shapes
  • K: Hidden/Latent Dimensions (~300)
  • S: Seeds (Training Examples)
  • X: S × K
  • Y: S × K
  • A: K × K
  • b: K
  • Repeat for more pairs of words
  • to train different f's for different relations of interest
  • such as: prefixes, suffixes, WordNet relations, etc.
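A minimal sketch of the regression step, on synthetic seed vectors rather than real embeddings: stacking a column of ones onto X lets a single least-squares call estimate the rotation A and bias b jointly. Shapes follow the slide (S seeds, K latent dimensions).

```python
import numpy as np

rng = np.random.default_rng(0)
S, K = 50, 4                       # seeds x latent dims (slide: K ~ 300)
A_true = rng.standard_normal((K, K))
b_true = rng.standard_normal(K)

X = rng.standard_normal((S, K))    # rows: input seed vectors
Y = X @ A_true + b_true            # rows: output seed vectors

# Append a column of ones so lstsq fits A and b in one shot.
X1 = np.hstack([X, np.ones((S, 1))])
coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
A_hat, b_hat = coef[:-1], coef[-1]

print(np.allclose(A_hat, A_true), np.allclose(b_hat, b_true))
```

With noisy, real seed pairs the fit is approximate rather than exact, but the shapes and the estimation step are the same.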

SLIDE 35

Translation (Pun Intended): vec(e) ~ vec(s)

where e is an English word and s is a Spanish word

Translating Vectors Translating Words

SLIDE 36

BLI (Bilingual Lexicon Induction)

Standard BLI

  • Training: Learn R̂ from S seeds: < e, c >
  • R̂ = argmin_R ‖ F_s R − C_s ‖
  • Inference:
  • trans(e) = vec⁻¹(C, vec(F, e) R̂)
  • Notation
  • F: embedding of English words, e
  • C: embedding of Chinese words, c
  • F_s, C_s: embeddings for the S seed words
  • R̂: rotation that maps English into Chinese
  • R̂⁻¹: rotation that maps Chinese into English
  • vec(F, w): find the vec for word w in embedding F
  • vec⁻¹(C, v): find the word for vec v in embedding C

BLI for almost-known words

  • trans(uni + directional) ~ trans(directional)
  • e = x + y
  • vec(F, e) = vec(F, x) R + b; often assume b = 0
  • trans(e) = vec⁻¹(C, vec(F, x) R̂)
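When R is constrained to be orthogonal, argmin ‖F_s R − C_s‖ is the classic orthogonal Procrustes problem, which has a closed-form SVD solution; this is one standard way BLI rotations are fit, sketched here on synthetic seed matrices rather than real embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
S, K = 100, 5
F_s = rng.standard_normal((S, K))                      # English seed vectors
R_true, _ = np.linalg.qr(rng.standard_normal((K, K)))  # a true orthogonal map
C_s = F_s @ R_true                                     # Chinese seed vectors

# Orthogonal Procrustes: R = U V^T, where F_s^T C_s = U S V^T.
U, _, Vt = np.linalg.svd(F_s.T @ C_s)
R_hat = U @ Vt

print(np.allclose(R_hat, R_true))   # True
```

With noisy seed dictionaries the recovery is approximate, and real systems add normalization and retrieval tricks on top.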

SLIDE 37

Applications of Vector Rotations and Translations:

Embedding Multiple Point Clouds into Comparable Coordinates (under appropriate rigid body assumptions)

BLI (Bilingual Lexicon Induction); Language Translation; ICP (Iterative Closest Point): Robotics; Point Set Registration

By Dllu - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=27950915

http://ais.informatik.uni-freiburg.de/teaching/ss11/robotics/slides/17-icp.pdf

SLIDE 38

Matrix Completion Applications

Knowledge Graph Completion: WordNet, Cyc, Freebase. Matrix Completion (Recommender Systems).

  • Related to
  • Collaborative Filtering
  • Imputation
  • Example: Netflix Competition
  • Matrix R:
  • users × movies → ratings
  • Mostly missing values
  • Missing at random
  • R̂ ≈ H W, where
  • H ∈ ℝ^(users × hidden dims)
  • W ∈ ℝ^(hidden dims × movies)

“An overview of embedding models of entities and relationships for knowledge base completion” (Nguyen, 2017)
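A toy illustration of why the factorization helps (mine, not from the slide): if the ratings matrix is exactly rank 1, i.e. R = H W with a single hidden dimension, a missing entry is pinned down by the observed entries around it.

```python
import numpy as np

h = np.array([1.0, 2.0, 3.0])        # users x 1 hidden dim
w = np.array([2.0, 1.0, 4.0, 3.0])   # 1 hidden dim x movies
R = np.outer(h, w)                   # full ratings matrix (users x movies)

# Pretend R[2, 3] is missing; for rank 1, R[i,j] = R[i,k] * R[l,j] / R[l,k].
pred = R[2, 0] * R[0, 3] / R[0, 0]
print(pred, R[2, 3])
```

Real recommenders fit higher-rank H and W to the observed entries (e.g. by alternating least squares), but the underlying low-rank assumption is the same.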

SLIDE 39

Knowledge Graph Completion

  • triples: < h, r, t >
  • h: head (node)
  • t: tail (node)
  • r: relation (edge)
  • Morphology
  • h: unidirectional
  • t: directional
  • r: uni
  • WordNet
  • h: car
  • t: vehicle
  • r: is_a

SLIDE 40

Massive Literature

An overview of embedding models of entities and relationships for knowledge base completion (Nguyen, 2017)

pip install pykg2vec

https://github.com/Sujit-O/pykg2vec

SLIDE 41

pip install pykg2vec

https://github.com/Sujit-O/pykg2vec

Datasets

  • https://pykg2vec.readthedocs.io/en/latest/dataset.html#

  • FreebaseFB15k
  • WordNet18
  • WordNet18RR
  • YAGO3_10
  • DeepLearning50a
  • (or define your own)

Algorithms

  • https://pykg2vec.readthedocs.io/en/latest/algos.html

  • Variations on vector translations and rotations: Y ~ X

  • Latent Distance Models
  • TransE, TransH, TransR, TransD, TransM, KG2E, RotatE

  • Semantic Matching Models
  • RESCAL, DistMult, Complex, TuckER
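A minimal TransE-style score, to make the "vector translation" family above concrete: a triple < h, r, t > is plausible when vec(h) + vec(r) ≈ vec(t), i.e. when ‖h + r − t‖ is small. The vectors here are hand-built toys, not trained embeddings.

```python
import numpy as np

ent = {"unidirectional": np.array([2.0, 1.0]),
       "directional":    np.array([1.0, 1.0]),
       "car":            np.array([0.0, 3.0])}
rel = {"uni": np.array([-1.0, 0.0])}   # head + relation should land on tail

def score(h, r, t):
    """TransE-style plausibility: higher (closer to 0) is better."""
    return -np.linalg.norm(ent[h] + rel[r] - ent[t])

good = score("unidirectional", "uni", "directional")   # ~0: plausible
bad = score("unidirectional", "uni", "car")            # far: implausible
print(good, bad)
```

Training fits the entity and relation vectors so that observed triples score higher than corrupted ones; the other models listed vary the transformation and the scoring function.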

SLIDE 42

A “Few” Results

An overview of embedding models of entities and relationships for knowledge base completion (Nguyen, 2017)

SLIDE 43

WordNet: Fantastic Resource

WN18RR: Standard Train/Valid/Test Split

WN18RR (train.txt)

Triples    Relation
34,796     _hypernym
29,715     _derivationally_related_form
7,402      _member_meronym
4,816      _has_part
3,116      _synset_domain_topic_of
2,921      _instance_hypernym
1,299      _also_see
1,138      _verb_group
923        _member_of_domain_region
629        _member_of_domain_usage
80         _similar_to

WordNet (my own extraction from NLTK)

Triples    Relation
329,402    hyponym
329,396    hypernym
157,992    synonym
74,717     derivationally_related_form
49,073     morph
8,023      pertainym
7,979      antonym
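The kind of tally shown in the first table can be computed by counting relations in WN18RR's tab-separated < head, relation, tail > triples. The inline sample below stands in for the real train.txt, and the entity names are illustrative (the released files use synset identifiers):

```python
from collections import Counter

# Stand-in for open("train.txt") lines: head <TAB> relation <TAB> tail.
sample = [
    "dog.n.01\t_hypernym\tcanine.n.02",
    "car.n.01\t_hypernym\tvehicle.n.01",
    "car.n.01\t_has_part\twheel.n.01",
]

counts = Counter(line.split("\t")[1] for line in sample)
print(counts.most_common())   # [('_hypernym', 2), ('_has_part', 1)]
```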

SLIDE 44

Six Types of Word Pairs

Source: WordNet extracted from NLTK (except for rhymes)

  • Morph
  • ejecting/eject
  • shortages/shortage
  • owns/own
  • juries/jury
  • Antonym
  • general/specific
  • relative/absolute
  • meaningless/meaningful
  • literate/illiterate
  • Synonym
  • ring/band
  • surround/ring
  • look/see
  • thankfully/gratefully
  • Hypernym (is-a)
  • singer/instrumentalist
  • dupe/person
  • dress/clothing
  • uniform/clothing
  • Hyponym (is-a inverse)
  • instrumentalist/singer
  • person/dupe
  • clothing/dress
  • clothing/uniform
  • Rhyme (spelling)
  • page/stage
  • founded/wounded
  • chick/lick
  • granted/planted

= ≠ < >

Sound v. Meaning

44 CLSW-2020

slide-45
SLIDE 45

Rotations (R) vs. Additions (A)

Method

  • Baseline:
  • cos(vec(x), vec(y))
  • Add (A)
  • cos(vec(x) + A, vec(y))
  • Rotation (R)
  • cos(vec(x) R, vec(y))
  • For each WordNet relation (and embedding)
  • Train R & A on training set
  • Report cosines on test set

Conclusion: R is often better than A
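The method can be sketched with synthetic vectors (numpy only; the dimensions, noise level, and "embeddings" are stand-ins, and since the toy relation is constructed as a rotation, R is expected to win here by design; on real WordNet pairs it wins often, not always):

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def learn_offset(X, Y):
    # A: a single additive offset, the mean difference over training pairs
    return (Y - X).mean(axis=0)

def learn_rotation(X, Y):
    # R: orthogonal Procrustes solution to min ||X R - Y||_F
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy stand-ins for embedding vectors of word pairs <x, y> under one relation
rng = np.random.default_rng(0)
dim, n = 50, 300
R_true = np.linalg.qr(rng.standard_normal((dim, dim)))[0]  # hidden relation
X = rng.standard_normal((n, dim))
Y = X @ R_true + 0.05 * rng.standard_normal((n, dim))
Xtr, Ytr, Xte, Yte = X[:200], Y[:200], X[200:], Y[200:]

A = learn_offset(Xtr, Ytr)
R = learn_rotation(Xtr, Ytr)

baseline = np.mean([cos(x, y) for x, y in zip(Xte, Yte)])
add = np.mean([cos(x + A, y) for x, y in zip(Xte, Yte)])
rot = np.mean([cos(x @ R, y) for x, y in zip(Xte, Yte)])
print(f"baseline={baseline:.2f}  add={add:.2f}  rotation={rot:.2f}")
```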

CLSW-2020 45

[Chart: Improvement]

slide-46
SLIDE 46

Dueling Methods vs. Dueling Tasks

  • Much of the literature compares methods
  • Rotations (R) vs. Additions (A)
  • But we don't have to choose
  • And does it matter who's topping the leaderboard?

  • More interesting question
  • What matters?
  • Sound? Meaning? Spelling?
  • Cosines also vary by embedding
  • FastText embeddings (spelling)
  • Other embeddings (collocations)

CLSW-2020 46

slide-47
SLIDE 47

Similarity Metrics Matter:

Different metrics for different tasks

  • Synonyms
  • WordNet Synsets
  • Collocations
  • bread and butter
  • Word2Vec / PMI
  • Pointwise Mutual Information
  • (Church and Hanks, 1990)
  • Failure mode
  • Not exactly synonyms
  • Antonyms are also similar under PMI
  • good ≈ bad ???
  • Similar: anything that can be compared and contrasted

  • Spelling
  • FastText
  • baby and babies
  • Failure mode: rhymes
  • babies and rabies
  • Spelling captures meaning
  • But also sound, etymology and more
  • Lots of alternatives
  • Currently working on back-translation similarity
  • Random walks over bilingual dictionaries
  • Words are similar if they have similar translations in other languages

CLSW-2020 47

slide-48
SLIDE 48

Cosines are taking advantage of collocation and spelling (not sound and meaning)

48 CLSW-2020

[Chart axes: Sound vs. Meaning]

slide-49
SLIDE 49

Morphology and Semantics as Vector Rotations

Morphology

  • Some morphological relations
  • unidirectional → uni + directional
  • unzipped → un + zipped
  • dogs → dog + s
  • barking → bark + ing
  • Collect seeds for training:
  • < uni + x, x >
  • < un + x, x >
  • < x + s, x >
  • < x + ing, x >
  • Learn rotations R
  • vec(uni + x) R_uni ≈ vec(x)
  • vec(un + x) R_un ≈ vec(x)
  • vec(x + s) R_s ≈ vec(x)
  • vec(x + ing) R_ing ≈ vec(x)

WordNet Semantics

  • Some semantic relations:
  • synonymy, antonymy, is-a
  • Collect seeds for training
  • is-a: < car, vehicle >, …
  • synonym: < good, honest >, < good, proficient >, …
  • antonym: < good, bad >, < good, evil >, …
  • Learn rotations:
  • vec(car) R_isa ≈ vec(vehicle)
  • vec(good) R_syn ≈ vec(honest)
  • vec(good) R_ant ≈ vec(bad)
  • Thus, x R y ⟹ vec(x) R ≈ vec(y)
  • words ⟹ vectors
  • functions on words (predicates, relations) ⟹ rotations
  • What is the meaning of not?
  • ¬x ⟹ vec(x) R_neg
  • by analogy with: vec(un + x) ≈ vec(x) R_un

49 CLSW-2020

slide-50
SLIDE 50

Conclusions

  • Proposal:
  • Add a 3rd case between known (K) and OOV
  • almost known (AK)
  • Many words are near known words
  • unidirectional → uni + directional
  • BERT/ERNIE: un ##idi ##re ##ction ##al (seriously?)
  • Currently, BERT/ERNIE do not support word formation rules
  • that generate words from other words
  • Training:
  • Input seeds: X, Y (for each relation)
  • Learn rotation/translation: Y ~ X
  • Inference:
  • Input almost known word
  • unidirectional
  • Connect dots
  • unidirectional ~ directional
  • under the uni relation/rotation
  • Apply uni rotation to predict
  • vec(unidirectional) from vec(directional) and
  • sound(unidirectional) from sound(directional)
  • Precedents for vector rotations and translations
  • Word2vec analogies
  • Bilingual Lexicon Induction (BLI)
  • Knowledge Graph Completion
  • Robotics (Iterative Closest Point)
  • Compositionality Assumptions
  • Suppose w is an OOV,
  • but w = x + y and x is known
  • w ∉ dict, but w = x + y and x ∈ dict or y ∈ dict
  • Then we can infer much of the unknown word from the nearby known word (x)
  • spell(w) ~ spell(x)
  • IPA(w) ~ IPA(x)
  • vec(w) ~ vec(x)
  • Chinese(w) ~ Chinese(x)
  • French(w) ~ French(x)
  • While these assumptions are far from perfect,
  • they are probably better than the alternatives
  • (backing off to spelling)
  • System Test vs. Unit Test
  • End to end: Black Box (System Test)
  • Intermediate Representations: Gray Box (Unit Test)
  • Capturing relevant linguistic generalizations
  • Intermediate representations support unit testing

CLSW-2020 50

slide-51
SLIDE 51

Backup

51 CLSW-2020

slide-52
SLIDE 52

Compositionality Assumptions

  • Suppose w is an OOV, but w = x + y and x is known
  • w ∉ dict, but w = x + y and x ∈ dict or y ∈ dict
  • Then we can infer much of the unknown word from nearby known word(s)
  • spell(w) ~ spell(x)
  • IPA(w) ~ IPA(x)
  • vec(w) ~ vec(x)
  • Chinese(w) ~ Chinese(x)
  • French(w) ~ French(x)
  • While these assumptions are far from perfect,
  • they are probably better than the alternatives (backing off to spelling)
  • Robustness opportunities: more apps, more decompositions
  • Common failure mode: w ≠ x + y (though it appears to decompose that way)
  • Decompositions should work across tapes: spell, IPA, vec, Chinese, French
  • Heuristic: use multiple tapes to verify one another
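A minimal sketch of the multi-tape heuristic (the word lists and pronunciations below are made up for illustration): accept a decomposition only when the stem is attested on more than one tape.

```python
# Accept w -> pre + stem only when the stem is attested on two "tapes":
# a spelling dictionary and a pronunciation dictionary (both toy data here).
SPELL = {"directional", "zipped", "dog", "bark"}
IPA = {"directional": "dɪrɛkʃənəl", "zipped": "zɪpt", "dog": "dɔɡ"}
PREFIXES = ["uni", "un", "re"]

def decompose(w):
    for pre in PREFIXES:
        if w.startswith(pre):
            stem = w[len(pre):]
            if stem in SPELL and stem in IPA:  # tapes agree on the stem
                return pre, stem
    return None  # no trusted decomposition; treat w as a true OOV

print(decompose("unidirectional"))  # ('uni', 'directional')
print(decompose("unbarking"))       # None: 'barking' is not attested on either tape
```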

52 CLSW-2020

slide-53
SLIDE 53

Breaking Symmetries: VAD. Diversity: Antonyms ≫ Synonyms

[Chart: Antonyms vs. Synonyms]

53 CLSW-2020

slide-54
SLIDE 54

Six Types of Word Pairs

Source: WordNet (except for rhymes)

  • Morph
  • ejecting/eject
  • shortages/shortage
  • owns/own
  • juries/jury
  • Antonym
  • general/specific
  • relative/absolute
  • meaningless/meaningful
  • literate/illiterate
  • Synonym
  • ring/band
  • surround/ring
  • look/see
  • thankfully/gratefully
  • Hypernym (is-a)
  • singer/instrumentalist
  • dupe/person
  • dress/clothing
  • uniform/clothing
  • Hyponym (is-a inverse)
  • instrumentalist/singer
  • person/dupe
  • clothing/dress
  • clothing/uniform
  • Rhyme
  • page/stage
  • founded/wounded
  • chick/lick
  • granted/planted
  • WordNet
  • Traditional Semantics
  • Synsets: equiv relations
  • reflexive, symmetric, transitive
  • Equiv rel: effective for synonyms
  • but not antonyms, is-a
  • Cosine Similarity
  • Symmetric: challenge for is-a
  • Evidence: distributional stats (PMI)
  • (and sometimes spelling)
  • Distributional Stats
  • Challenge: Synonyms β‰ˆ Antonyms
  • Spelling: helpful in many cases
  • But not for rhymes
  • Rhymes have similar sound
  • But not meaning

54 CLSW-2020

slide-55
SLIDE 55

55 CLSW-2020

slide-56
SLIDE 56

Applications

(Work in progress; Collaboration would be most appreciated)

  • Better Tokenization for ERNIE/BERT
  • Intuition:
  • Net should have an easier time figuring out the meaning of an almost known word (unidirectional)
  • if it is tokenized to emphasize similarity with a nearby known word (directional)
  • Hypo: Better Tok → Better Perf on GLUE
  • Fear: the BERT tokenizer may have been tuned for GLUE…
  • But we are using ERNIE for many commercially important killer apps
  • Sound:
  • g2p for tts:
  • Terminology
  • g2p: grapheme to phoneme
  • tts: text to speech (speech synthesis)
  • Input spelling → output IPA (phonemes)
  • g2p(uni + directional) is like g2p(directional)
  • Assume compositionality (same phones for stem)
  • g2p++: duration modeling for Text-to-Speech
  • Input spelling → output phones with durations
  • dur(uni + directional) is like dur(directional)
  • Assume compositionality (same durations for stem)
  • Meaning:
  • Better word2vec vectors
  • Intuition: vec(uni + directional) is like vec(directional)
  • Chinese Word Segmentation
  • BLI (bilingual lexicon induction)
  • Standard BLI:
  • Training: Learn R from S seeds: < e, c >
  • argmin_R ‖E_S R − C_S‖
  • Inference: trans(e) = vec⁻¹(C, vec(E, e) R)
  • Notation
  • E: embedding of English words, e
  • C: embedding of Chinese words, c
  • E_S, C_S: embeddings for the S seed words
  • R: rotation that maps English into Chinese
  • R⁻¹: rotation that maps Chinese into English
  • vec(E, x): find the vector for word x in embedding E
  • vec⁻¹(C, v): find the word for vector v in embedding C
  • BLI for almost known words
  • trans(uni + directional) is like trans(directional)
  • Opportunity: stem invariance / compositionality
  • When OOV → stem + affix, the stem is often unchanged
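The proposed three-way split (known / almost known / OOV) can be sketched as a tokenizer. The vocabulary and affix lists here are toy stand-ins, and the fallback is a placeholder for a real BPE tokenizer:

```python
VOCAB = {"directional", "lymphocyte", "respond", "cultures"}  # toy vocabulary
PREFIXES = ["uni", "un", "bo"]  # short list of prefixes
SUFFIXES = ["er", "s"]          # short list of suffixes

def tokenize(word, bpe=lambda w: [w]):  # `bpe` stands in for a real BPE model
    if word in VOCAB:
        return [word]                             # known (K)
    for pre in PREFIXES:
        if word.startswith(pre) and word[len(pre):] in VOCAB:
            return [pre + "-", word[len(pre):]]   # almost known (AK): pre + K
    for suf in SUFFIXES:
        if word.endswith(suf) and word[:-len(suf)] in VOCAB:
            return [word[:-len(suf)], "-" + suf]  # almost known (AK): K + suf
    return bpe(word)                              # unknown (OOV): subwords

print(tokenize("unidirectional"))  # ['uni-', 'directional']
print(tokenize("responder"))       # ['respond', '-er']
```

This reproduces the PubMed example's splits (UNI- DIRECTIONAL, RESPOND -ER) while passing known words through untouched.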

63 CLSW-2020

slide-57
SLIDE 57

Applications for Almost-Known Words

  • Better Tokenization for ERNIE/BERT
  • Intuition:
  • Net should have an easier time figuring out the meaning of an almost known word (unidirectional)
  • if it is tokenized to emphasize similarity with a nearby known word (directional)
  • Hypo: Better Tok → Better Perf on GLUE
  • Fear: the BERT tokenizer may have been tuned for GLUE…
  • But we are using ERNIE for many commercially important killer apps

  • Sound:
  • g2p for tts:
  • Terminology
  • g2p: grapheme to phoneme
  • tts: text to speech (speech synthesis)
  • Input spelling → output IPA (phonemes)
  • g2p(uni + directional) ~ g2p(directional)
  • Assume compositionality (same phones for stem)
  • Meaning:
  • Better word2vec vectors
  • vec(uni + directional) ~ vec(directional)
  • vec(uni + directional) ≈ A_uni vec(directional) + b_uni
  • Chinese Word Segmentation
  • BLI (bilingual lexicon induction)
  • Standard BLI: trans(e) = vec⁻¹(C, vec(E, e) R)
  • Almost-Known:
  • trans(uni + directional) ≈ vec⁻¹(C, (A_uni vec(E, directional) + b_uni) R)
  • Opportunity:
  • stem invariance / compositionality
  • When OOV → stem + affix,
  • the stem is often unchanged

64 CLSW-2020

slide-58
SLIDE 58

What do we mean by "like"?

g2p (Grapheme to Phoneme)

  • π‘•π‘ž2(π‘£π‘œπ‘— + π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š) is like 𝑕2π‘ž(π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š)
  • Strong Compositionality
  • 𝐽𝑄𝐡 π‘£π‘œπ‘— + π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š β‰ˆ 𝐽𝑄𝐡 π‘£π‘œπ‘— +

𝐽𝑄𝐡(π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š)

  • Start with dictionary such as CMU Dict: 𝐽𝑄𝐡 π‘₯
  • Training:
  • Learn 𝐽𝑄𝐡 π‘žπ‘ π‘“ and 𝐽𝑄𝐡 𝑑𝑣𝑔 from CMU dictionary
  • Find 𝐽𝑄𝐡 π‘žπ‘ π‘“ and 𝐽𝑄𝐡 𝑑𝑣𝑔 that maximizes

performance of following inference procedure:

  • Inference: 𝑑 β†’ 𝐽𝑄𝐡(𝑑)
  • If 𝑑 is known (𝑑 in dictionary), return 𝐽𝑄𝐡 𝑑
  • If 𝑑 is almost known: 𝑑 β†’ π‘žπ‘ π‘“ π‘₯|π‘₯ 𝑑𝑣𝑔
  • 𝐽𝑄𝐡 𝑑 β‰ˆ 𝐽𝑄𝐡 π‘žπ‘ π‘“ + 𝐽𝑄𝐡(π‘₯)
  • 𝐽𝑄𝐡 𝑑 β‰ˆ 𝐽𝑄𝐡 π‘₯ + 𝐽𝑄𝐡(𝑑𝑣𝑔)
  • Otherwise, fall back to g2p programs for OOVs
  • https://github.com/petronny/g2p
  • https://github.com/Kyubyong/g2p
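The inference procedure can be sketched as follows (the mini-dictionary and affix pronunciations are made up; a real system would use the CMU dictionary and a trained g2p model as the final fallback):

```python
IPA = {"directional": "dɪrɛkʃənəl", "dog": "dɔɡ"}  # stand-in for CMU Dict
PRE_IPA = {"uni": "junɪ"}                           # learned IPA(pre)
SUF_IPA = {"s": "z"}                                # learned IPA(suf)

def g2p(w):
    if w in IPA:                                    # known: dictionary lookup
        return IPA[w]
    for pre, p in PRE_IPA.items():                  # almost known: w -> pre x
        if w.startswith(pre) and w[len(pre):] in IPA:
            return p + IPA[w[len(pre):]]
    for suf, p in SUF_IPA.items():                  # almost known: w -> x suf
        if w.endswith(suf) and w[:-len(suf)] in IPA:
            return IPA[w[:-len(suf)]] + p
    raise KeyError(w)  # true OOV: defer to a g2p program

print(g2p("unidirectional"))  # junɪdɪrɛkʃənəl
print(g2p("dogs"))            # dɔɡz
```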

Word2Vec

  • 𝑀𝑓𝑑(π‘£π‘œπ‘— + π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š) is like 𝑀𝑓𝑑(π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š)
  • Strong Compositionality
  • 𝑀𝑓𝑑 π‘£π‘œπ‘— + π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š β‰ˆ 𝑀𝑓𝑑 π‘£π‘œπ‘— +

𝑀𝑓𝑑 π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š

  • Related Word/Intuition: Analogies (see next slide)
  • Alternatives
  • 𝑀𝑓𝑑 π‘£π‘œπ‘— + π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š β‰ˆ 𝑀𝑓𝑑 π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š 𝑆WXY
  • Related Work/Intuition: BLI
  • Prefixes are like translation
  • Learn 𝑆WXY from S seeds < 𝑑, π‘£π‘œπ‘— + 𝑑 >
  • Training: learn vectors/rotations for affixes
  • Inference: 𝑑 β†’ 𝑀𝑓𝑑(𝑑)
  • If 𝑑 is known (𝑑 in dictionary), return 𝑀𝑓𝑑 𝑑
  • If 𝑑 is almost known: 𝑑 β†’ π‘žπ‘ π‘“ π‘₯ | 𝑑 β†’ π‘₯ 𝑑𝑣𝑔
  • 𝑀𝑓𝑑 𝑑 β‰ˆ 𝑀𝑓𝑑(π‘₯)𝑆§‒Ž
  • 𝑀𝑓𝑑 𝑑 β‰ˆ 𝑀𝑓𝑑 π‘žπ‘ π‘“ + 𝑀𝑓𝑑 π‘₯
  • Otherwise, fall back to subwords (fasttext)

65 CLSW-2020

slide-59
SLIDE 59

Translating Almost Known Words

Standard BLI (Bilingual Lexicon Induction)

  • Training: Learn R from S seeds: < e, c >
  • argmin_R ‖E_S R − C_S‖
  • Inference:
  • trans(e) = vec⁻¹(C, vec(E, e) R)
  • Notation
  • E: embedding of English words, e
  • C: embedding of Chinese words, c
  • E_S, C_S: embeddings for the S seed words
  • R: rotation that maps English into Chinese
  • R⁻¹: rotation that maps Chinese into English
  • vec(E, x): find the vector for word x in embedding E
  • vec⁻¹(C, v): find the word for vector v in embedding C
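A toy sketch of the training and inference steps above (numpy only; the "embeddings" are random stand-ins, constructed so the two spaces really are related by a rotation, and words are just row indices):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_seeds, n_vocab = 40, 100, 500
R_true = np.linalg.qr(rng.standard_normal((dim, dim)))[0]
E = rng.standard_normal((n_vocab, dim))  # English embedding (toy)
C = E @ R_true                           # matched Chinese embedding (toy)

# Training: argmin_R ||E_S R - C_S|| over seed pairs (orthogonal Procrustes)
E_S, C_S = E[:n_seeds], C[:n_seeds]
U, _, Vt = np.linalg.svd(E_S.T @ C_S)
R = U @ Vt

def trans(i):
    # Inference: rotate vec(E, i) into Chinese space, return the nearest word
    return int(np.argmin(np.linalg.norm(C - E[i] @ R, axis=1)))

# Word 400 was not a seed, but its translation is still recovered.
print(trans(400))  # 400
```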

BLI for almost known words

  • π‘’π‘ π‘π‘œπ‘‘ π‘£π‘œπ‘— + π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š is like π‘’π‘ π‘π‘œπ‘‘ π‘’π‘—π‘ π‘“π‘‘π‘’π‘—π‘π‘œπ‘π‘š
  • Use procedure on previous slide for 𝑀𝑓𝑑(𝑑)
  • Two approaches: addition and rotation
  • 𝑀𝑓𝑑 𝑑 β‰ˆ 𝑀𝑓𝑑 π‘žπ‘ π‘“ + 𝑀𝑓𝑑 π‘₯
  • 𝑀𝑓𝑑 𝑑 β‰ˆ 𝑀𝑓𝑑(π‘₯)𝑆§‒Ž
  • Two approaches Γ  two solutions:
  • π‘’π‘ π‘π‘œπ‘‘(𝑑) = 𝑀𝑓𝑑ˆ‰ 𝐷, 𝑀𝑓𝑑 𝐹, π‘žπ‘ π‘“ 𝑆 + 𝑀𝑓𝑑 𝐹, π‘₯ 𝑆
  • π‘’π‘ π‘π‘œπ‘‘(𝑑) = 𝑀𝑓𝑑ˆ‰(𝐷, 𝑀𝑓𝑑(𝐹, 𝑓)𝑆§‒Ž𝑆)
  • What is the Chinese translation of English plural?
  • 𝑀𝑓𝑑ˆ‰(𝐷, 𝑦 + π‘žπ‘šπ‘£π‘ π‘π‘š) β‰ˆ 𝑀𝑓𝑑ˆ‰(𝐷, 𝑦)
  • Opportunity: Sim and diffs across languages Γ  Insight into vectors
  • English has (some) gender and (more) number
  • French has (more) gender
  • Chinese has tones

66 CLSW-2020

slide-60
SLIDE 60

Overview / Conclusions

✓ SVAIL:
  ✓ Machine Learning
  ✓ Speech, Language & Systems
✓ Unifying Themes:
  ✓ Understanding successes
    ✓ Black Box & Gray Box
  ✓ Prepare for the future: months/years/decades
✓ Systems
  ✓ Deep Dive: Parallelism Planning
✓ Black Box & Gray Box
  ✓ Black Box:
    ✓ End-to-End: Unit Testing (with no units)
  ✓ Gray Box:
    ✓ Visualizing ERNIE and BERT
✓ Speech & Language
  ✓ Deep Dive: Dementia Challenge
  ✓ Deep Dive: Many OOVs are near known words

67 CLSW-2020

slide-61
SLIDE 61
  • 161 pairs < un + x, x >; Margins: 30 Vs, 77 As, 27 Ds
  • V → un + x has more valence than x; v otherwise

Antonyms: un + x vs. x

# of words  Case  word1          word2            word3         word4         word5
63          vAd   unacceptable   unattached       unaware       unbalanced    unbelief
59          vad   unable         unacknowledged   unattractive  unauthorized  unavailable
11          Vad   unarmed        unconcerned      uncritical    unfreeze      unintended
10          VaD   unafraid       unaltered        unambiguous   unbreakable   unbroken
8           VAD   unconditional  uncover          undefeated    unfold        unlimited
5           vAD   uncommon       unforgiving      unscrew       untraceable   unveil
4           vaD   unchangeable   unrepentant      untie         untouched
1           VAd   unpack

68 CLSW-2020

slide-62
SLIDE 62

Breaking Symmetries: Antonyms (Spoiler Alert: Not yet successful…)

  • good bad
  • bad ≠ good
  • Hypothesis:
  • Symmetries might be problematic for rotations
  • Training will be more successful if training data moves in a consistent direction,
  • e.g., from positive to negative
  • Lookup words in VAD lexicon
  • antonym.v: selects antonyms < x, y > where V(x) > V(y)
  • antonym.a: selects antonyms < x, y > where A(x) > A(y)
  • antonym.d: selects antonyms < x, y > where D(x) > D(y)
  • VAD Lexicon: Borrows concepts from emotion literature
  • https://saifmohammad.com/WebPages/nrc-vad.html
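A toy version of the antonym.v selector (the VAD scores below are illustrative, not the actual NRC-VAD values): each antonym pair is oriented so training data moves in one consistent direction, from positive to negative valence.

```python
VAD = {  # word: (Valence, Arousal, Dominance); illustrative scores
    "love": (0.98, 0.60, 0.65), "hate": (0.01, 0.85, 0.57),
    "peace": (0.95, 0.15, 0.63), "war": (0.03, 0.90, 0.73),
}

def antonym_v(pairs):
    """Orient each antonym pair <x, y> so that V(x) > V(y)."""
    out = []
    for x, y in pairs:
        if x in VAD and y in VAD:  # skip pairs missing from the lexicon
            out.append((x, y) if VAD[x][0] > VAD[y][0] else (y, x))
    return out

print(antonym_v([("hate", "love"), ("peace", "war")]))
# [('love', 'hate'), ('peace', 'war')]
```

antonym.a and antonym.d are the same filter on index 1 (Arousal) and 2 (Dominance).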

CLSW-2020 69

slide-63
SLIDE 63

NRC Valence, Arousal, and Dominance (NRC-VAD) Lexicon https://saifmohammad.com/WebPages/nrc-vad.html

70 CLSW-2020

[Chart annotation: Polarity, Shock]

slide-64
SLIDE 64

Breaking Symmetries VAD: Valence, Arousal, and Dominance

Antonyms with Very Positive Delta V

Delta V  More V      Less V
0.97     love        hate
0.95     happily     sadly
0.94     joyful      sorrowful
0.94     truth       falsehood
0.93     joy         sorrow
0.93     pleasure    pain
0.92     respectful  disrespectful
0.92     peace       war
0.91     happiness   sadness
0.91     generous    stingy

Antonyms with Very Negative Delta V

Delta V  Less V         More V
−0.97    hate           love
−0.95    sadly          happily
−0.94    sorrowful      joyful
−0.94    falsehood      truth
−0.93    sorrow         joy
−0.93    pain           pleasure
−0.92    disrespectful  respectful
−0.92    war            peace
−0.91    sadness        happiness
−0.91    stingy         generous

71 CLSW-2020


slide-65
SLIDE 65

Breaking Symmetries VAD: Valence, Arousal, and Dominance

More Delta V (Valence)    More Delta A (Arousal)  More Delta D (Dominance)
love/hate                 agitation/calmness      strong/weak
happily/sadly             stormy/calm             success/failure
joyful/sorrowful          war/peace               superior/inferior
truth/falsehood           noisy/quiet             successful/unsuccessful
joy/sorrow                supernatural/natural    rich/poor
pleasure/pain             shout/whisper           effective/ineffective
respectful/disrespectful  irritate/soothe         brave/cowardly
peace/war                 restless/restful        brave/timid
happiness/sadness         lively/dull             secure/insecure
generous/stingy           paranormal/normal       strongly/weakly

72 CLSW-2020


slide-66
SLIDE 66

Breaking Symmetries: Antonyms & Synonyms

Antonyms

More Delta V (Valence)    More Delta A (Arousal)  More Delta D (Dominance)
love/hate                 agitation/calmness      strong/weak
happily/sadly             stormy/calm             success/failure
joyful/sorrowful          war/peace               superior/inferior
truth/falsehood           noisy/quiet             successful/unsuccessful
joy/sorrow                supernatural/natural    rich/poor
pleasure/pain             shout/whisper           effective/ineffective
respectful/disrespectful  irritate/soothe         brave/cowardly
peace/war                 restless/restful        brave/timid
happiness/sadness         lively/dull             secure/insecure
generous/stingy           paranormal/normal       strongly/weakly

Synonyms

More Delta V (Valence)  More Delta A (Arousal)  More Delta D (Dominance)
awesome/awful           fuck/bed                blunt/dull
blessed/goddamned       quarrel/words           loud/trashy
amazing/awful           betray/grass            president/chair
smart/hurt              violate/break           founder/flop
blessed/damn            firearm/piece           first/low
tenderness/soreness     rescue/saving           chairwoman/chair
fantastic/grotesque     arrest/stay             flashy/trashy
terrific/terrifying     corrupt/cloud           peaked/poorly
blessed/infernal        infuriate/incense       combat/scrap
extraordinary/sinful    mess/pot                chairman/chair

Compelling contrasts. Synonyms should be compared (not contrasted).

73


slide-67
SLIDE 67

Antonyms: un + x vs. x

  • 161 pairs < un + x, x > in both WordNet and the NRC VAD Lexicon
  • Usually, un + x is more negative (less V, A & D),
  • but lots of exceptions (especially A)
  • More V (Valence): 30 pairs
  • unafraid, unaltered, unambiguous, unarmed, unbreakable, unbroken, unconcerned, unconditional, uncover, uncritical, undefeated, unequivocal, unfold, unfreeze, unintended, unintentionally, unlimited, unlock, unopposed, unpack, unpredictable, unpretentious, unpunished, unquestionable, unrestrained, unrestricted, unselfish, untroubled, unwind, unwrap
  • More D (Dominance): 27 pairs
  • unafraid, unaltered, unambiguous, unbreakable, unbroken, unchangeable, uncommon, unconditional, uncover, undefeated, unequivocal, unfold, unforgiving, unlimited, unpredictable, unpunished, unquestionable, unrepentant, unrestrained, unrestricted, unscrew, untie, untouched, untraceable, untroubled, unveil, unwrap
  • More A (Arousal): 77 pairs
  • unacceptable, unattached, unaware, unbalanced, unbelief, unbutton, uncertain, uncertainty, unclean, unclear, uncomfortable, uncommon, unconditional, unconscious, unconstitutional, unconventional, unconvincing, uncover, undefeated, undefined, undignified, undress, undue, uneasy, uneducated, unequal, unethical, unexpected, unfairness, unfaithful, unfamiliar, unfavorable, unfold, unforgiving, unfriendly, ungrateful, unhealthy, unholy, unhook, uninformed, unkind, unlawful, unlike, unlimited, unmanageable, unmask, unnatural, unorganized, unorthodox, unpack, unpleasant, unpredictable, unprotected, unpublished, unreasonable, unregulated, unreliable, unrestrained, unsanitary, unscrew, unscrupulous, unsettled, unstable, unsteady, unsupported, unsure, unsympathetic, untidy, untraceable, untrustworthy, unusual, unveil, unwary, unwillingly, unworthy, unwrap, unwritten

74 CLSW-2020


slide-68
SLIDE 68

Rotation (Usually) Helps, but not for ant.[VAD]

75 CLSW-2020

[Chart: Improvement]

slide-69
SLIDE 69

Rotation (Usually) Helps

76 CLSW-2020

[Chart: Improvement]