Abbreviation Expansion in Lexical Annotation of Schema


SLIDE 1

Maciej Gawinecki

International Doctorate School in Information and Communication Technologies Università degli Studi di Modena e Reggio Emilia

Abbreviation Expansion in Lexical Annotation of Schema

SLIDE 2

Schemata Integration: Finding the Same Meaning

==?

SLIDE 3

Schemata Integration: Finding the Same Meaning

==?

annotation from WordNet synonyms: measure, quantity

no entry in WordNet!

SLIDE 4

Schemata Integration: Finding the Same Meaning

“QuantiTY”

==

annotation from WordNet synonyms: measure, quantity

annotation from WordNet synonyms: measure, amount

SLIDE 5

Why do we care?

  • We need to make abbreviations meaningful
  • to improve effectiveness of lexical annotation and thus schema mapping discovery
  • Most data integration tools ignore the problem
  • User-defined dictionary (COMA++, Cupid) is not scalable
  • one abbreviation -- several expansions
  • vocabulary evolves -- dictionary must be updated
  • schema/domain expert still needed
  • We propose an effective and scalable solution
SLIDE 6

Automatic Abbreviation Expansion

  • Given two schemata to integrate, identify character sequences that are abbreviations and determine expansions
  • 1. Identify abbreviations
  • 2. Determine expansions

“WareHouSE” “Unit Of Measure” “QuantiTY” “IDentifier” “Information” “Purchase Order”

SLIDE 7

Abbreviation Identification

  • Determining whether a given word has been used as an abbreviation in the given schema label
  • Heuristic #1: non-dictionary words are abbreviations
  • False negatives: legitimate English words may be used as abbreviations
  • Some words (standard schema abbreviations) are always used as abbreviations in schema labels
  • Heuristic #2: standard schema abbreviations and non-dictionary words are abbreviations!
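
Heuristic #2 can be sketched in a few lines of Python. This is a minimal illustration, not the presented implementation: the word lists are tiny hypothetical stand-ins for a real English dictionary and a real list of standard schema abbreviations, and the function name is an assumption.

```python
# Hypothetical, tiny word lists used only for illustration.
ENGLISH_WORDS = {"order", "book", "pub", "code", "number", "unit"}
STANDARD_SCHEMA_ABBRS = {"id", "ref", "nbr"}


def is_abbreviation(token, dictionary=ENGLISH_WORDS, standard=STANDARD_SCHEMA_ABBRS):
    """Heuristic #2: a token is an abbreviation if it is a standard schema
    abbreviation or if it is not a dictionary word."""
    word = token.lower()
    return word in standard or word not in dictionary


print(is_abbreviation("WHSE"))   # non-dictionary word
print(is_abbreviation("Ref"))    # standard schema abbreviation
print(is_abbreviation("Pub"))    # dictionary word, so not flagged
```

The last call shows the false-negative case: a legitimate English word used as an abbreviation is missed unless it is also listed as a standard schema abbreviation.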

SLIDE 8

Tokenizing Labels

  • Reason: A label can be an abbreviation or it may be multi-word and contain abbreviation(s)
  • Word boundaries
  • punctuation, camel case, e.g. fragileInd
  • no boundaries, e.g. WHSECODE
  • Tokenization methods
  • simple
  • greedy [Feild 2006]

– isolating the longest prefixing/suffixing dictionary word/standard schema abbreviation
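
The two tokenization methods can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of [Feild 2006]: `simple_tokenize` and `greedy_tokenize` are hypothetical names, and `vocabulary` stands for any set of dictionary words and standard schema abbreviations.

```python
import re


def simple_tokenize(label):
    """Split a label on non-letter characters and camel-case boundaries."""
    tokens = []
    for part in re.split(r"[^A-Za-z]+", label):
        # Runs of capitals (e.g. "ID") or an optional capital plus lowercase letters.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return tokens


def greedy_tokenize(token, vocabulary):
    """Repeatedly isolate the longest prefix or suffix found in `vocabulary`;
    whatever cannot be split further is kept as a single token."""
    word = token.lower()
    if not word:
        return []
    if word in vocabulary:
        return [word]
    for n in range(len(word) - 1, 0, -1):  # try the longest pieces first
        if word[:n] in vocabulary:
            return [word[:n]] + greedy_tokenize(word[n:], vocabulary)
        if word[-n:] in vocabulary:
            return greedy_tokenize(word[:-n], vocabulary) + [word[-n:]]
    return [word]


print(simple_tokenize("fragileInd"))           # camel-case boundary
print(simple_tokenize("WHSECODE"))             # no boundary to split on
print(greedy_tokenize("WHSECODE", {"code"}))   # suffix "code" is isolated
```

The sketch also makes the risk of the greedy method visible: with a vocabulary that contains very short entries, such as `{"au"}`, the token `auth` is split into `au` and `th`.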

SLIDE 9

Abbreviation Expansion

  • The task of finding a relevant expansion for a given identified abbreviation
  • There can be more than one expansion candidate for an abbreviation
  • e.g. PO can be expanded to:

– Purchase Order
– Parents Of
– Post Office
– etc.

SLIDE 10

Types of Abbreviations in Schema

  • Standard schema abbreviations
  • describe how a value of an element is represented
  • e.g. Ref (Reference), Nbr (Number)
  • Standard domain abbreviations
  • denote important and repeating domain concepts
  • e.g. UOM (Unit of Measure)
  • Ad hoc abbreviations [Ratinov 2004]
  • created to save space, from phrases that would not be abbreviated in a normal context
  • e.g. WHSE (Warehouse), bk (book)
SLIDE 11

Where can I find Expansions?

  • We manually expanded abbreviations in several open-source schemata
  • Observations
  • for standard schema abbreviations:

– user-defined dictionary (external source)

  • for standard domain abbreviations:

– online abbreviation dictionary (external source)

  • for ad hoc abbreviations:

– context of abbreviation (internal source)
– complementary schema (internal source)

SLIDE 12

Internal Sources: Context Source

  • Label of containing class (for attribute) or schema (for class)

(figure labels: abbreviation, expansion)

SLIDE 13

Internal Sources: Complementary Schema

(figure labels: abbreviation, expansion)
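
Both internal sources amount to searching other labels for words that could expand the abbreviation: the containing class or schema label for the context source, and the labels of the other schema for the complementary schema. The slides do not state the matching rule, so the sketch below assumes one (same first letter, and the abbreviation's letters occur in the word in order). All names are hypothetical, and only single-word expansions are handled.

```python
def may_expand(abbr, word):
    """Assumed rule: `word` is longer than `abbr`, starts with the same letter,
    and contains the letters of `abbr` in the same order."""
    abbr, word = abbr.lower(), word.lower()
    if not abbr or len(word) <= len(abbr) or abbr[0] != word[0]:
        return False
    letters = iter(word)
    # `ch in letters` consumes the iterator, so order is respected.
    return all(ch in letters for ch in abbr)


def candidates_from_labels(abbr, labels):
    """Collect words from already tokenized labels (space-separated) that may
    expand `abbr`, in order of first appearance."""
    found = []
    for label in labels:
        for word in label.split():
            if may_expand(abbr, word) and word not in found:
                found.append(word)
    return found


# Labels here are made up for illustration.
print(candidates_from_labels("WHSE", ["Warehouse Code", "Unit Price"]))
print(candidates_from_labels("QTY", ["Order Quantity", "Unit Price"]))
```

Acronyms such as PO, whose expansion spans several words, would need a multi-word variant of the matching rule.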

SLIDE 14

External Sources: Online Abbreviation Dictionary

(figure callouts: expansion; category, where abbr. and expansion co-occur; popularity of expansion in given category, decreasing)

SLIDE 15

Online Abbreviation Dictionary: Selecting Expansion

  • Expansion is more relevant when
  • it is more popular?
  • it shares more domains of usage with both schemata?
  • it is more popular in domains of usage shared with both schemata!

SLIDE 16

Online Abbreviation Dictionary: Scoring Relevance of Expansion

SLIDE 17

Online Abbreviation Dictionary: Scoring Relevance of Expansion

prevalent WNDs: sociology, metrology, commerce

  • 1. Compute schema prevalent WordNet Domains [Bergamaschi 2008]
SLIDE 18

Online Abbreviation Dictionary: Scoring Relevance of Expansion

prevalent WNDs: sociology, metrology, commerce
corresponding WNDs: commerce, book_keeping, economy

  • 1. Compute schema prevalent WordNet Domains [Bergamaschi 2008]
  • 2. Get WordNet Domains of expansion
SLIDE 19

Online Abbreviation Dictionary: Scoring Relevance of Expansion

prevalent WNDs: sociology, metrology, commerce
corresponding WNDs: commerce, book_keeping, economy

  • 1. Compute schema prevalent WordNet Domains [Bergamaschi 2008]
  • 2. Get WordNet Domains of expansion
  • 3. Discover shared domains between schemata & expansion
SLIDE 20

Online Abbreviation Dictionary: Scoring Relevance of Expansion

prevalent WNDs: sociology, metrology, commerce
corresponding WNDs: commerce, book_keeping, economy

  • 1. Compute schema prevalent WordNet Domains [Bergamaschi 2008]
  • 2. Get WordNet Domains of expansion
  • 3. Discover shared domains between schemata & expansion
  • 4. Sum up popularity of expansion in shared domains

popularity in shared domains: 0.7
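
Steps 3 and 4 can be sketched as a set intersection followed by a sum. The function name and data layout are assumptions. In the example, the domain names and the 0.7 total come from the slide, while the popularity values for `book_keeping` and `economy` are made up for illustration.

```python
def relevance(popularity_by_domain, prevalent_domains):
    """Sum the popularity of an expansion over the WordNet Domains it shares
    with the prevalent domains of the schemata."""
    shared = set(prevalent_domains) & set(popularity_by_domain)
    return sum(popularity_by_domain[d] for d in shared)


prevalent = {"sociology", "metrology", "commerce"}
# Popularity of one candidate expansion per corresponding domain;
# only the commerce value is taken from the slide.
popularity = {"commerce": 0.7, "book_keeping": 0.2, "economy": 0.1}

print(relevance(popularity, prevalent))  # only "commerce" is shared
```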

SLIDE 21

Combining Sources Together

  • Sources are complementary in providing expansions
  • No objective criteria for distinguishing ad hoc abbreviations from (domain) standard abbreviations!
  • However
  • some types of abbreviations may be more relevant in general
  • and thus corresponding sources may be considered as more relevant!

SLIDE 22

Relevance of Sources

  • Assumption #1
  • Standard schema abbreviations should always be expanded to the same expansion
  • User-defined dictionary is the most relevant
  • Assumption #2
  • Ad hoc abbreviations are more frequent than domain-specific abbreviations
  • Context and complementary schema reflect user intention better than the online dictionary
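
One way to turn this relevance order into a ranking is a weighted sum of per-source scores. The slides do not give the combination formula, so the sketch below is an assumption: the weights, source names, and scores are hypothetical and only encode the order user-defined dictionary > context and complementary schema > online dictionary.

```python
# Hypothetical weights that only reflect the relevance order of the sources.
SOURCE_WEIGHTS = {
    "user_dict": 1.0,
    "context": 0.8,
    "complementary": 0.8,
    "online_dict": 0.5,
}


def combine(scores_by_source, weights=SOURCE_WEIGHTS):
    """scores_by_source maps source -> {expansion: score in [0, 1]}.
    Returns (expansion, combined score) pairs, best first."""
    combined = {}
    for source, scores in scores_by_source.items():
        for expansion, score in scores.items():
            combined[expansion] = combined.get(expansion, 0.0) + weights[source] * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Illustrative scores for the abbreviation PO.
ranked = combine({
    "context": {"Purchase Order": 1.0},
    "online_dict": {"Purchase Order": 0.7, "Parent Of": 0.2},
})
print(ranked[0][0])  # the candidate supported by the more relevant sources
```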

SLIDE 23

Example of Expansion

(figure: candidate expansions of PO per source; arrow labelled "decreasing relevance of expansion")

  • sources: user-def. dict., context, complementary schema, online dict.
  • candidates: Purchase Order, Parent Of
  • scores shown: 1.0, 0.7, 0.2, 0.8, 1.0, 0.05
  • combining scores of expansions: Purchase Order 1.0
SLIDE 24

Evaluation Methodology

  • Implemented on top of the MOMIS data integration system [Bergamaschi 1999]
  • Dataset
  • 2 relational schemata of the Amalgam integration benchmark

– www.cs.toronto.edu/~miller/amalgam
– 168 labels with 52 abbreviations

  • Evaluation of identification and expansion methods done separately
  • output of identification gives different input for expansion

SLIDE 25

Evaluation Criteria of Identification

  • Variable
  • CORRECTNESS: % of correctly identified labels
  • Correctly identified label
  • correctly tokenized
  • all abbreviations identified
  • Reference for output
  • manually tokenized labels and identified abbrs.
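
The CORRECTNESS variable can be computed as below. This is a sketch under assumptions: the data layout and function name are hypothetical, and "all abbreviations identified" is read as "every reference abbreviation appears in the output".

```python
def identification_correctness(predicted, reference):
    """Both arguments map label -> (tokens, abbreviations).
    A label counts as correct when its tokens equal the reference tokens and
    every reference abbreviation was identified. Returns a percentage."""
    correct = 0
    for label, (ref_tokens, ref_abbrs) in reference.items():
        if label not in predicted:
            continue
        tokens, abbrs = predicted[label]
        if list(tokens) == list(ref_tokens) and set(ref_abbrs) <= set(abbrs):
            correct += 1
    return 100.0 * correct / len(reference)


# Made-up example with two labels, one of them tokenized incorrectly.
reference = {
    "authID": (["auth", "ID"], ["auth", "ID"]),
    "WHSECODE": (["WHSE", "CODE"], ["WHSE"]),
}
predicted = {
    "authID": (["au", "th", "ID"], ["th", "ID"]),
    "WHSECODE": (["WHSE", "CODE"], ["WHSE"]),
}
print(identification_correctness(predicted, reference))
```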
SLIDE 26

Experiments for Identification

  • 3 experiments, each with a different tokenization method
  • simple (ST)
  • greedy + WordNet (GT/WN) dictionary to identify dict. words during tokenization
  • greedy + Ispell (GT/Ispell) English words list to identify dict. words during tokenization
  • All experiments used WordNet for classifying abbreviations!

SLIDE 27

Results: Identification Correctness

  • ST (92%) ~ GT/Ispell (93%)
  • reason: relatively few labels in dataset without word boundaries, e.g. bktittle
  • GT/WN much worse (70%)
  • reason: WordNet contains many short abbreviations forcing incorrect tokenization, e.g. au (gold) in authID
  • General problem: legitimate English words!
  • e.g. Pub is used for Publication but is a dictionary word and it is not a standard schema abbreviation

SLIDE 28

Evaluation Criteria of Expansion

  • Variable
  • CORRECTNESS: % of correctly expanded abbrs.
  • Input
  • manually tokenized labels and identified abbreviations
  • Reference for output
  • manually expanded abbreviations
SLIDE 29

Experiments for Expansion

  • Single sources
  • External sources
  • Internal sources
  • All sources together
SLIDE 30

Results: Expansion Correctness

  • Single source: user-defined dictionary: 42% correct
  • errors in domain and ad hoc abbreviations
  • Single source: online abbreviation dictionary: 19% correct
  • errors in ad hoc and standard schema abbrs.
  • Internal sources: 25% correct
  • very good in ad hoc abbreviations
  • All sources: 83% (~ 42%+19%+25%) correct
  • constituent sources are complementary
SLIDE 31

Conclusions

  • Abbreviations:
  • obstacle for data integration
  • Solution
  • complementary sources of expansions for different types of abbreviations
  • Results
  • 83% correct expansions (42% with the user-defined dictionary only!) and better scalability
  • Detailed experimental results and data used
  • http://www.ibspan.waw.pl/~gawinec/abbr
SLIDE 32

Thank you!