SLIDE 1

Background Annotation Pattern-based approach Learning-based approach Conclusions and future work

Towards a learning approach for abbreviation detection and resolution

Klaar Vanopstal, Bart Desmet, Véronique Hoste

LT3, Language and Translation Technology Team
University College Ghent
{klaar.vanopstal,bart.desmet,véronique.hoste}@hogent.be

Department of Applied Mathematics & Computer Science
Ghent University
Krijgslaan 281 (S9), 9000 Gent, Belgium

May 19, 2010

LT3, Language and Translation Technology Team University College Ghent

SLIDE 6

1 Background
2 Annotation
3 Pattern-based approach
4 Learning-based approach
5 Conclusions and future work

SLIDE 7

Problem

  • Information explosion ⇒ growing number of (bio)medical abbreviations.
  • New abbreviations are created; they are not always known to the reader.

⇒ automatic detection and resolution

SLIDE 8

Use

  • information retrieval
  • information extraction
  • named entity recognition (NER)
  • anaphora resolution

SLIDE 11

Corpus

English

  • AbbRE: reliable standard but limited size
  • Medstract: publicly available and commonly used

Dutch: no resources available

Abstracts from 2 medical journals:

  • Nederlands Tijdschrift voor Geneeskunde (NTvG); 29,978 words
  • Belgisch Tijdschrift voor Geneeskunde (TvG); 36,757 words

⇒ total of 66,739 words

SLIDE 14

Different types of abbreviations included in annotations:

  • Truncation
    Example: adm for administration
  • First letter initialization
    Example: AAA for abdominal aortic aneurysm
  • Opening letter initialization
    Example: HeLa for Henrietta Lacks

SLIDE 17

  • Syllabic initialization
    Example: BZD for benzodiazepine
  • Substitution initialization
    Example: Fe for iron
  • Combination of letters and numbers
    Example: CXCR4 for chemokine receptor fusin

SLIDE 19

Labels

  • 1. ABBR: Dutch abbreviations which have a full form in their local context
    Example: Hoge-resolutie-computertomografie (HRCT)
    EN: High resolution computed tomography (HRCT)
  • 2. ABBR DE: Dutch abbreviations with full form in abstract (not in local context)
    Example: de pathofysiologie van het CFS
    EN: the pathophysiology of CFS

SLIDE 21

  • 3. DEF: Dutch full forms which define an abbreviation in their local context
    Example: Hoge-resolutie-computertomografie (HRCT)
    EN: High resolution computed tomography (HRCT)
  • 4. ABBR IN COMP: part of a compound word; no definition in the abstract
    Example: HIV-patiënten
    EN: HIV patients

SLIDE 23

  • 5. ABBR IN COMP DE: part of a compound word; full form in abstract
    Example: ernstige reumatoïde artritis (RA)-vasculitis. Bij de ziekte van Wegener en RA-vasculitis...
    EN: severe rheumatoid arthritis (RA) vasculitis. Wegener’s disease and RA vasculitis...
  • 6. ABBR NO DEF: abbreviations without full form
    Example: AIDS, HIV

SLIDE 25

  • 7. ABBR EN: English abbreviation with Dutch/English definition in local context
    Example: endosonografie (EUS)
    EN: endoscopic ultrasound (EUS)
  • 8. DEF EN: English full form which accompanies an English abbreviation
    Example: Mini Mental State Examination (MMSE)

⇒ Kappa score (inter-annotator agreement): 0.89
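
The slide reports only the kappa value and does not say which variant was used or how many annotators took part. As a minimal sketch, assuming Cohen's kappa for two annotators who label the same tokens, the statistic can be computed as below. The function name and the toy label sequences are invented for illustration and are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations (hypothetical, for illustration only).
annotator_1 = ["ABBR", "DEF", "O", "O", "ABBR_NO_DEF", "O"]
annotator_2 = ["ABBR", "DEF", "O", "O", "ABBR", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because most tokens in an abstract carry no abbreviation label at all.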

SLIDE 26

Label              NTvG     TvG
ABBR               11.60    14.25
ABBR DE            30.62    22.55
ABBR IN COMP        7.14    22.43
ABBR IN COMP DE    16.85     4.96
ABBR NO DEF        27.65    29.12
ABBR EN             6.14     6.69
TOTAL %             3.36     2.19

Table: Labels and their frequencies in the corpus (%)

SLIDE 27

                  NTvG      TvG
def: loc          17.74%    20.94%
def: broad        47.47%    27.50%
def: loc/broad    65.21%    48.45%

Table: Abbreviations and defined abbreviations in the corpus

⇒ Between 35% and 52% of the abbreviations are undefined (100% minus the def: loc/broad row: 34.79% for NTvG, 51.55% for TvG)

SLIDE 31

Challenges

  • English abbreviations with Dutch full form: no match
    Example: HAART = krachtige antiretrovirale therapie
    EN: HAART = highly active antiretroviral therapy
  • Parenthetical patterns
    Example: gunstige uitkomst (score 5)
    EN: favourable outcome (score 5)
  • Syllabic initialization
    Example: CVS = chronische-vermoeidheidssyndroom
    EN: CFS = chronic fatigue syndrome

SLIDE 34

Pattern-based approach - Related research

⇒ Use of patterns to detect abbreviations:

  • short uppercase words
  • typical patterns: “long form (short form)” or “short form (long form)”

identification of definitions:

  • window of 2*N words (Taghva & Gilbreth, 1999)
  • or 3*N words (Stanford Medical Abbreviation Method (Chang & Schütze, 2006))
  • text markers: () “ =
  • linguistic cues: “short”, “or” (Park & Byrd, 2001)
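
The window idea above can be sketched in a few lines. The slides do not define N; this sketch assumes N is the number of characters in the abbreviation and that the window covers only the tokens to the left of it. The function name `candidate_window` and the `factor` parameter are hypothetical.

```python
def candidate_window(tokens, abbr_index, factor=3):
    """Return up to factor*N tokens that precede the abbreviation.

    Assumption: N is the number of characters in the abbreviation, and the
    definition is searched for in the left context only.
    """
    n = len(tokens[abbr_index])
    start = max(0, abbr_index - factor * n)
    return tokens[start:abbr_index]


# Tokenized fragment from the annotation examples in this deck.
tokens = ["ernstige", "reumatoïde", "artritis", "(", "RA", ")"]
window = candidate_window(tokens, tokens.index("RA"), factor=2)
print(window)
```

With a two-letter abbreviation and a factor of 2, the window holds at most four tokens, so a short abbreviation never pulls in a long stretch of unrelated text.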

SLIDE 35

+ use of NLP tools to refine the search space of the definitions (Pustejovsky et al., 2001) and/or to tackle the problem of function word matching

Example: ADL = activiteiten van het dagelijkse leven
EN: daily life activities

SLIDE 38

2 steps:

  • Abbreviation detection
  • Definition matching

SLIDE 42

Step 1: abbreviation detection:

  • capital letters / combinations of capital letters with 1-3 lowercased letters or numbers
    Example: QSRL, pANCA, CDG1A
  • window of 3*N words
  • text markers: () = “ ’

⇒ list of candidate definitions
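
The orthographic rule in step 1 is stated only informally, so the following is one possible reading rather than the authors' implementation. It assumes a candidate consists of letters and digits only, contains at least two capitals (or is a single capital letter), and has at most three lowercase letters or digits. The function name is hypothetical.

```python
import re

ALNUM_RE = re.compile(r"[A-Za-z0-9]+")


def is_abbreviation_candidate(token):
    """Orthographic test for abbreviation candidates (assumed reading)."""
    if not ALNUM_RE.fullmatch(token):
        return False
    capitals = sum(ch.isupper() for ch in token)
    others = len(token) - capitals  # lowercase letters and digits
    if len(token) == 1:
        return capitals == 1
    return capitals >= 2 and others <= 3


# Tokens taken from the examples and the error analysis in this deck.
tokens = ["QSRL", "pANCA", "CDG1A", "A", "IV", "mmHg", "min", "Het"]
candidates = [t for t in tokens if is_abbreviation_candidate(t)]
print(candidates)
```

Under these assumptions the sketch also flags the single letter in "hepatitis A" and the Roman numeral "IV", while missing "mmHg" and "min".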

SLIDE 45

Step 2: definition matching:

  • list of candidate definitions
  • matching: first letter of abbreviation - words in candidate definition

⇒ matching word + rest of the 3*N sequence = definition
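
A minimal sketch of this matching step follows. The slide does not say which word is chosen when several words start with the right letter; the sketch assumes the leftmost one in the window. The function name is hypothetical, and the determiner "een" in the first call is added context, not corpus text.

```python
def match_definition(window_tokens, abbreviation):
    """Return the candidate definition for the abbreviation, or None.

    Assumption: the leftmost word in the window whose first letter equals
    the first letter of the abbreviation (case-insensitive) starts the
    definition; the definition runs to the end of the window.
    """
    first = abbreviation[0].lower()
    for i, word in enumerate(window_tokens):
        if word[:1].lower() == first:
            return " ".join(window_tokens[i:])
    return None


good = match_definition(["een", "Hoge-resolutie-computertomografie"], "HRCT")
mislinked = match_definition(["het", "hepatitis-A-virus"], "HAV")
missed = match_definition(["krachtige", "antiretrovirale", "therapie"], "HAART")
print(good, mislinked, missed, sep=" | ")
```

Under the leftmost-match assumption, the determiner "het" is pulled into the definition of HAV, and the English abbreviation HAART finds no match in its Dutch full form.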

SLIDE 46

Abbreviations    precision    recall    FB1
TvG              83.89        78.64     81.18
NTvG             82.05        83.07     82.56

Definitions      precision    recall    FB1
TvG              74.49        83.36     78.68
NTvG             68.03        85.50     75.77

Table: Results of the pattern-based approach
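
The FB1 column is the F-measure with β = 1, the harmonic mean of precision and recall. As a short check, the sketch below recomputes it from the precision and recall values in the table; the function name `f_beta` is an illustrative choice.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure; with beta=1 this is the harmonic mean of P and R."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "abbreviations TvG": (83.89, 78.64),
    "abbreviations NTvG": (82.05, 83.07),
    "definitions TvG": (74.49, 83.36),
    "definitions NTvG": (68.03, 85.50),
}
scores = {name: round(f_beta(p, r), 2) for name, (p, r) in rows.items()}
print(scores)
```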

SLIDE 52

Error Analysis

Errors in abbreviation detection step

  • Titles printed in capital letters
  • Roman numerals confused with capitalized i, v or x
  • single letters which are not abbreviations (e.g. hepatitis A)
  • abbreviations with word-internal capital letters (e.g. mmHg (EN: Torr))
  • abbreviations with no typical orthographical characteristics (e.g. min)

SLIDE 58

Errors in definition matching step

  • error percolation
  • mislinked words (e.g. het hepatitis-A-virus (HAV))
  • function words (e.g. op evidentie gebaseerde zorg (EBZ) (EN: evidence-based medicine (EBM)))
  • English abbreviations with a Dutch definition
  • contractions (e.g. therapiegebonden secundaire myelodysplasie (t-MDS) en acute leukemie (t-AL) (EN: the incidence of therapy-related secondary myelodysplasia (t-MDS) and acute leukemia (t-AL)))

SLIDE 62

Learning-based approach - Related research

Often in combination with pattern-based techniques, e.g. Stanford Medical Abbreviation Method (2006), Chang et al. (2002)

Pattern-based detection of abbreviations + learning-based matching with definitions

examples of features:

  • % of characters aligned at beginning of word
  • % of characters aligned on syllable boundary
  • number of words that were skipped (negative weight)
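
To make two of these features concrete, the sketch below aligns the letters of an abbreviation to a definition and reports the share of letters aligned at a word start and the number of skipped words. The alignment heuristic (prefer the next word start, otherwise the next occurrence anywhere) is an assumption made for illustration and is not the procedure of the cited systems; the syllable-boundary feature is not covered. Spaces and hyphens are treated as word separators.

```python
import re


def alignment_features(abbreviation, definition):
    """Return (share of letters at a word start, skipped words) or None."""
    if not abbreviation:
        return None
    text = definition.lower()
    # Word spans; whitespace and hyphens separate words.
    words = [m.span() for m in re.finditer(r"[^\s-]+", text)]
    starts = [s for s, _ in words]
    position = 0
    aligned = []
    for letter in abbreviation.lower():
        at_word_start = [s for s in starts
                         if s >= position and text[s] == letter]
        found = at_word_start[0] if at_word_start else text.find(letter, position)
        if found == -1:
            return None  # this letter cannot be aligned
        aligned.append(found)
        position = found + 1
    share_at_start = sum(i in starts for i in aligned) / len(aligned)
    skipped = sum(1 for s, e in words
                  if not any(s <= i < e for i in aligned))
    return share_at_start, skipped


print(alignment_features("AAA", "abdominal aortic aneurysm"))
print(alignment_features("ADL", "activiteiten van het dagelijkse leven"))
print(alignment_features("HRCT", "Hoge-resolutie-computertomografie"))
```

In the ADL example the two function words "van" and "het" are skipped, which is the situation the negative weight on skipped words is meant to penalize.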

SLIDE 65

Own approach

Preprocessing steps:

  • tokenization
  • POS tagging + NP chunking (Daelemans & van den Bosch, 2005)

SLIDE 68

Learning experiments

  • YamCha (Kudo & Matsumoto, 2003): open source sequence tagger using support vector machines (SVM)
  • 10-fold cross-validation
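
The slides do not say at which level the data were split for cross-validation. As a minimal sketch, assuming the split is made over whole abstracts, ten folds can be built as below. The function name and the placeholder abstract identifiers are hypothetical, and the sketch does not call YamCha itself.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item is in exactly one test fold."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    # Round-robin assignment keeps fold sizes within one item of each other.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Placeholder identifiers standing in for abstracts.
abstracts = [f"abstract_{i:03d}" for i in range(25)]
splits = list(k_fold_splits(abstracts, k=10))
print(len(splits), [len(test) for _, test in splits])
```

Splitting by abstract rather than by token keeps an abbreviation and its definition in the same fold, which seems the safer choice for this task.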

SLIDE 70

Feature vector:

  • token
  • POS
  • name initials
  • sentence-initial position
  • morphological features (initial capital letter, completely capitalized, internal capital letters, lowercased, Roman numeral, punctuation, hyphens, exclusively consonants)
  • prefix and suffix information
  • symbolic word shape feature: all morphological (binary) features
  • feature to match 1st letter of abbreviation against words in 3*N sequence
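
The binary morphological features can be sketched as a small function over a token. The exact definitions used by the authors are not given, so the feature names, the vowel set (which counts "y" as a vowel) and the encoding of the word shape as a string of 0/1 flags are all assumptions.

```python
import re

VOWELS = set("aeiouy")  # assumption: "y" counts as a vowel
ROMAN_RE = re.compile(
    r"(?=[MDCLXVI])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def morphological_features(token):
    """Binary orthographic features for one token (assumed definitions)."""
    letters = [ch for ch in token if ch.isalpha()]
    return {
        "initial_capital": token[:1].isupper(),
        "all_capitals": bool(letters) and all(ch.isupper() for ch in letters),
        "internal_capital": any(ch.isupper() for ch in token[1:]),
        "lowercased": bool(letters) and all(ch.islower() for ch in letters),
        "roman_numeral": ROMAN_RE.fullmatch(token) is not None,
        "punctuation": any(not ch.isalnum() and ch != "-" for ch in token),
        "hyphen": "-" in token,
        "only_consonants": bool(letters)
        and not any(ch.lower() in VOWELS for ch in letters),
    }


def word_shape(token):
    """Concatenate the binary features into one symbolic shape string."""
    return "".join("1" if v else "0" for v in morphological_features(token).values())


for tok in ["BZD", "IV", "mmHg", "HIV-patiënten"]:
    print(tok, word_shape(tok))
```

A single shape string lets the learner treat a combination of orthographic properties as one symbol, in addition to seeing each property on its own.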

SLIDE 71

Results

Abbreviations    precision    recall    FB1
TvG              95.31        92.26     93.76
NTvG             96.76        90.97     93.78

Definitions      precision    recall    FB1
TvG              86.92        78.18     82.32
NTvG             87.19        78.00     82.34

Table: Ten-fold cross-validation results of the learning experiments.

SLIDE 77

Conclusions

annotated dataset of +/- 67,000 words (Dutch, medical)

2 approaches: pattern-based and classification-based

classification-based approach outperforms the pattern-based approach on both tasks:

  • abbreviation detection: 93% F-score
  • definition matching: 82% F-score

SLIDE 84

Future work

  • incorporate information from error analysis into learning approach
  • apply decompounding techniques (syllabic initializations)
  • cross-lingual matching: external sources + MT techniques
  • undefined abbreviations: external sources
  • F-scores per label (now focus on abbreviations and definitions)
  • English corpus
