Paving the Way to a Large-scale Pseudosense-annotated Dataset



slide-1
SLIDE 1

Paving the Way to a Large-scale Pseudosense-annotated Dataset

slide-2
SLIDE 2

The problem: Paucity of manually-annotated data

  • POS tagged sentences
  • Treebanks
  • Sense-annotated data

– SemCor (Miller et al., 1993)
– MASC (Ide et al., 2010)

slide-3
SLIDE 3

A Solution:

Automatic generation of sense-annotated data

  • Bootstrapping (Yarowsky, 1995)
  • Exploiting parallel data (Chan and Ng, 2005)
  • Topic signatures (Martínez et al., 2008)
  • Crowdsourcing (Snow et al., 2008)
  • Pseudowords (Gale et al., 1992; Schütze, 1992)
slide-8
SLIDE 8

What is a pseudoword?

airplane + river → airplaneriver, written airplane*river

slide-15
SLIDE 15

How can pseudowords be used to generate annotated data?

Pseudoword: airplane*river

Original sentences:

The Wright brothers invented the airplane.
The Nile is the longest river in the world.

After replacing each word with the pseudoword (the replaced word serves as the sense label):

The Wright brothers invented the airplane*river.
The Nile is the longest airplane*river in the world.
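
The substitution on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' tooling: the function name `pseudosense_annotate` and the choice of case-sensitive, whole-word matching are assumptions made for the example.

```python
import re


def pseudosense_annotate(sentences, pseudosenses):
    """Replace each pseudosense word with the pseudoword.

    Returns (rewritten sentence, original word) pairs; the original
    word plays the role of the sense label. Matching is whole-word
    and case-sensitive, which is an assumption of this sketch.
    """
    pseudoword = "*".join(pseudosenses)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in pseudosenses) + r")\b"
    )
    annotated = []
    for sentence in sentences:
        # One training instance per occurrence of a pseudosense word.
        for match in pattern.finditer(sentence):
            rewritten = (
                sentence[: match.start()] + pseudoword + sentence[match.end():]
            )
            annotated.append((rewritten, match.group(1)))
    return annotated


examples = pseudosense_annotate(
    [
        "The Wright brothers invented the airplane.",
        "The Nile is the longest river in the world.",
    ],
    ["airplane", "river"],
)
for text, label in examples:
    print(label, "->", text)
```

Each pair keeps the hidden original word, so a WSD system can be trained and scored on the ambiguous pseudoword without any manual annotation.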

slide-16
SLIDE 16

Applications of pseudowords

  • Evaluation of

– Word Sense Disambiguation: Gale et al. (1992) and Schütze (1992)
– Word Sense Induction: Bordag (2006) and Di Marco and Navigli (2013)
– Selectional Preferences: Erk (2007), Bergsma et al. (2008), and Chambers and Jurafsky (2010)
– Information Retrieval: Schütze and Pedersen (1995) and Sanderson and van Rijsbergen (1999)

slide-18
SLIDE 18
Some constraints on pseudosenses

  • Monosemy

bank*airplane: They pulled the canoe up on the bank.

By 1905, the Wright Flyer III was capable of fully controllable, stable airplane for substantial periods. The Wright brothers credited Otto Lilienthal as a major inspiration for their decision to pursue manned flight. In 1906, Alberto Santos Dumont made what was claimed to be the first airplane flight unassisted by catapult and set the first world record recognized by the Aéro-Club de France by flying 220 metres (720 ft) in less than 22 seconds. It had movable tail surfaces controlling both yaw and pitch, a form of roll control supplied either by wing warping or by ailerons and controlled by its pilot with a joystick and rudder bar. It was an important predecessor of his later Bleriot XI Channel-crossing aircraft of the summer of 1909. World War II served as a testbed for the use of the airplane as a weapon. Airplane demonstrated its potential as mobile observation platforms, then proved themselves to be machines of war capable of causing casualties to the enemy. The earliest known aerial victory with a synchronized machine gun-armed fighter aircraft occurred in 1915, by German Luftstreitkräfte Leutnant Kurt Wintgens. Alcock and Brown crossed the Atlantic non-stop for the first time in 1919. The first international commercial flights took place between the United States and Canada in 1919. Airplane had a presence in all the major battles of World War II. They were an essential component of the military strategies of the period, such as the German Blitzkrieg or the American and Japanese aircraft carrier campaigns of the Pacific War.

  • Sufficient frequency
slide-20
SLIDE 20

Why are random pseudowords not good?

  • Homonymous distinctions, as in cm (Curium vs. Centimeter)

airplane*river

slide-21
SLIDE 21

Why are random pseudowords not good?

deficiency

lack, deficiency -- (the state of needing something that is absent or unavailable; "water is the critical deficiency in desert regions")
insufficiency, inadequacy, deficiency -- (lack of an adequate quantity or number; "the inadequacy of unemployment benefits")

airplane*river

slide-22
SLIDE 22

We need semantically-aware pseudowords

deficiency

lack, deficiency -- (the state of needing something that is absent or unavailable; "water is the critical deficiency in desert regions")
insufficiency, inadequacy, deficiency -- (lack of an adequate quantity or number; "the inadequacy of unemployment benefits")

lack*shortfall

slide-23
SLIDE 23

We need semantically-aware pseudowords

  • Category-based pseudowords: Nakov and Hearst (2003)
  • WordNet-based pseudowords: Otrusina and Smrz (2010)

lack*shortfall

slide-25
SLIDE 25

Challenges ahead of pseudoword generation

  • Semantic awareness

– E.g.: lack*shortfall

  • Coverage

– Many distinct semantically-aware pseudowords
– Ideally a pseudoword for each ambiguous word in the lexicon

slide-29
SLIDE 29

Our idea: Similarity-based pseudowords

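[Diagram: an ambiguous word has sense 1, sense 2, ..., sense n. Each sense is paired with pseudosense 1, pseudosense 2, ..., pseudosense n, and the corresponding similarity-based pseudoword is pseudosense 1 * pseudosense 2 * ... * pseudosense n.]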

slide-31
SLIDE 31

Personalized PageRank (Haveliwala, 2002)

  • Used for semantic similarity by Agirre et al. (2009)
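
As a rough illustration of what Personalized PageRank computes, the sketch below runs plain power iteration on a tiny hand-made graph, with all teleport mass returning to a single seed node. The graph, its node names, the damping factor of 0.85, and the iteration count are assumptions chosen for the example; the slides do not state the settings actually used.

```python
def personalized_pagerank(graph, seed, damping=0.85, iterations=50):
    """Power iteration where the random surfer always restarts at `seed`.

    `graph` maps every node to the list of nodes it links to.
    """
    nodes = list(graph)
    rank = {node: 1.0 if node == seed else 0.0 for node in nodes}
    for _ in range(iterations):
        # Teleport step: the restart probability goes entirely to the seed.
        new_rank = {
            node: (1.0 - damping) if node == seed else 0.0 for node in nodes
        }
        for node in nodes:
            neighbours = graph[node]
            if not neighbours:
                # Dangling node: return its mass to the seed.
                new_rank[seed] += damping * rank[node]
                continue
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


# Toy undirected graph (hypothetical; not the real WordNet graph).
toy_graph = {
    "horoscope": ["prediction"],
    "prediction": ["horoscope", "prophecy", "forecast"],
    "prophecy": ["prediction"],
    "forecast": ["prediction", "meteorology"],
    "meteorology": ["forecast"],
}
scores = personalized_pagerank(toy_graph, "horoscope")
for node in sorted(scores, key=scores.get, reverse=True):
    print(f"{node:12s} {scores[node]:.3f}")
```

Nodes close to the seed accumulate more probability mass than distant ones, which is what makes the resulting scores usable as a similarity ranking.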

slide-32
SLIDE 32

horoscope -- (a prediction of someone's future based on the relative positions of the planets)
horoscope -- (a diagram of the positions of the planets and signs of the zodiac at a particular time and place)

slide-33
SLIDE 33

Similarity-based pseudowords

#1

slide-35
SLIDE 35

Similarity-based pseudowords

Synset                                                     Score
{prediction, foretelling, forecasting, prognostication}   0.194
{horoscope}                                                0.174
{prognosis, forecast}                                      0.031
{extropy}                                                  0.029
{statement}                                                0.025
{prophecy, divination}                                     0.023
{meteorology, weather_forecasting}                         0.020
{fortunetelling}                                           0.018
{meteorology}                                              0.011
{oracle}                                                   0.008
. . .                                                      . . .
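
One plausible way to turn such a ranked list into a pseudosense, combining it with the monosemy and sufficient-frequency constraints from the earlier slides, is to walk down the ranking and take the first word that satisfies both. The sketch below is an assumption about how the pieces fit together, not the authors' stated procedure. The synsets and scores are the top three rows of the list above; the monosemy set and the corpus counts are invented purely for illustration.

```python
def pick_pseudosense(ranked_synsets, target, monosemous, counts, min_freq=1000):
    """Return the first word in rank order that is monosemous and frequent.

    The target word itself is skipped, since a word cannot stand in
    for its own sense.
    """
    for synset, _score in ranked_synsets:
        for word in synset:
            if word == target:
                continue
            if word in monosemous and counts.get(word, 0) >= min_freq:
                return word
    return None


ranked = [
    (("prediction", "foretelling", "forecasting", "prognostication"), 0.194),
    (("horoscope",), 0.174),
    (("prognosis", "forecast"), 0.031),
]

# Invented values for the example; real ones would come from WordNet
# and from corpus counts.
toy_monosemous = {"foretelling", "prognostication", "prognosis"}
toy_counts = {
    "prediction": 50000,
    "foretelling": 120,
    "forecasting": 9000,
    "prognostication": 40,
    "prognosis": 3000,
    "forecast": 80000,
}

chosen = pick_pseudosense(ranked, "horoscope", toy_monosemous, toy_counts)
print(chosen)
```

With these toy values the highest-ranked synset is passed over because its monosemous members are too rare, so the search falls through to a lower-ranked synset.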

slide-38
SLIDE 38

Similarity-based approach

  • Preserves the semantic relationship among senses.
  • Larger search space, hence higher coverage.
  • Does not need sense-annotated data.

[Figure: search space. Hyponym, hypernym, meronym, and siblings compared with all of WordNet.]

slide-42
SLIDE 42

15,935 pseudowords for all polysemous nouns in WordNet 3.0

  • Corpus: English Gigaword (Graff and Cieri, 2003)
  • minFreq = 1000
  • Available at http://lcl.uniroma1.it/pseudowords/

slide-45
SLIDE 45

Evaluating pseudowords

slide-46
SLIDE 46

Evaluation 1: Disambiguation difficulty of pseudowords

slide-47
SLIDE 47

Disambiguation difficulty of pseudowords

  • 20 nouns of the Senseval-3 English Lexical Sample task (Mihalcea et al., 2004)
  • Pseudosense-annotated dataset

– English Gigaword corpus (Graff and Cieri, 2003)
– Preserved sense distribution

  • Baseline: 20 random pseudowords
  • WSD system: IMS (Zhong and Ng, 2010)
slide-49
SLIDE 49

[Figure: recall with pseudowords plotted against recall with real words, both on a 40 to 100 scale, for similarity-based and random pseudowords.]

Correlation: 0.74 (similarity-based), 0.54 (random)

slide-50
SLIDE 50

Evaluation 2: Representativeness of pseudosenses

slide-51
SLIDE 51

Percentage of similarity-based pseudosenses obtained from different types of WordNet relations

[Figure: percentage of pseudosenses drawn from synonyms, from distance = 1, and from distance > 1, plotted against minimum pseudosense frequency (in sentences).]

slide-52
SLIDE 52

Sampling pseudowords for evaluation

  • 110 pseudowords

– 10 for each polysemy degree from 2 to 12

  • Only 50 nouns (0.3%) in WordNet 3.0 have polysemy degree > 12

slide-58
SLIDE 58

Representativeness of pseudosenses

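Ambiguous word: representative
Pseudoword: negotiator*spokesperson*congressman*case_in_point

  • A person who represents others
    Synset: representative. Pseudosense: negotiator. Ratings: 3, 3
  • An advocate who represents someone else’s policy
    Synset: spokesperson, interpreter, representative, voice. Pseudosense: spokesperson. Ratings: 4, 4
  • A member of the U.S. House of Representatives
    Synset: congressman, congresswoman, representative. Pseudosense: congressman. Ratings: 4, 4
  • An item of information that is typical of a group
    Synset: example, illustration, instance, representative. Pseudosense: case_in_point. Ratings: 4, 3

Averages: 3.75, 3.5

Rating scale: 1: completely unrelated; 2: somewhat related; 3: good substitute; 4: perfect substitute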
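
The two trailing figures on this slide, 3.75 and 3.5, are consistent with column-wise means of the two ratings given to each of the four senses. The sketch below checks that arithmetic; treating the two numbers per sense as two separate sets of judgments is an assumption, since the slide does not label the columns.

```python
# One (first rating, second rating) pair per sense, in slide order.
ratings = [
    (3, 3),  # A person who represents others
    (4, 4),  # An advocate who represents someone else's policy
    (4, 4),  # A member of the U.S. House of Representatives
    (4, 3),  # An item of information that is typical of a group
]


def column_means(rows):
    """Average each rating column over all senses."""
    columns = list(zip(*rows))
    return [sum(column) / len(column) for column in columns]


means = column_means(ratings)
print(means)
```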

slide-60
SLIDE 60

Representativeness of pseudosenses

[Figure: representativeness score (0.00 to 4.00) by polysemy degree (2 to 12).]

slide-61
SLIDE 61

Evaluation 3: Distinguishability of pseudosenses

slide-64
SLIDE 64

Distinguishability of pseudosenses

philanthropist*benefactor

donor

  • 1. donor, giver, presenter, bestower, conferrer (person who makes a gift of property)
  • 2. donor ((medicine) someone who gives blood or tissue or an organ to be used in another person (the host))

slide-69
SLIDE 69

[spokesperson, case_in_point, negotiator, congressman]

  • A person who represents others: representative
  • An advocate who represents someone else’s policy: spokesperson, interpreter, representative, voice
  • A member of the U.S. House of Representatives: congressman, congresswoman, representative
  • An item of information that is typical of a group: example, illustration, instance, representative

4/4 = 1
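
A natural reading of "4/4 = 1" is that the score is the fraction of senses for which the shuffled pseudosenses were matched back to the right sense. The sketch below computes a score under that reading; the function name, the shortened sense labels, and the scoring rule itself are assumptions, as the slide shows only the final ratio.

```python
def distinguishability(gold, guessed):
    """Fraction of senses whose pseudosense was matched correctly."""
    correct = sum(
        1 for sense, pseudosense in gold.items()
        if guessed.get(sense) == pseudosense
    )
    return correct / len(gold)


# Sense labels are shortened versions of the glosses on the slide.
gold = {
    "person who represents others": "negotiator",
    "advocate for someone else's policy": "spokesperson",
    "member of the U.S. House of Representatives": "congressman",
    "typical item of information": "case_in_point",
}

# A matching that gets every sense right, as on the slide.
perfect_guess = dict(gold)

# A hypothetical matching that swaps two of the pseudosenses.
swapped_guess = dict(gold)
swapped_guess["person who represents others"] = "spokesperson"
swapped_guess["advocate for someone else's policy"] = "negotiator"

print(distinguishability(gold, perfect_guess))
print(distinguishability(gold, swapped_guess))
```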

slide-70
SLIDE 70

Distinguishability scores

[Figure: distinguishability score (0.00 to 1.00) by polysemy degree (2 to 12).]

slide-71
SLIDE 71

Conclusions

  • Similarity-based pseudowords

– Semantic awareness
– Coverage

  • Three evaluation experiments
slide-72
SLIDE 72

Thanks!

http://lcl.uniroma1.it/pseudowords/

slide-74
SLIDE 74

Category-based Pseudowords (Nakov and Hearst, 2003)

  • MeSH
  • Eye

– A01: Body Region (e.g., thumb)
– A09: Sense Organ (e.g., pupils)