SLIDE 1

Luís Sarmento @ META Simposium - For a Proactive Translatology (Université de Montréal, Québec, Canadá, 7-9 April 2005)

A simple and robust algorithm for extracting terminology

Luís Sarmento
Linguateca
www.linguateca.pt / las@letras.up.pt

SLIDE 2

  • Exponential growth of multi-lingual written information, especially in [...]
  • Need for [...]
      • Information Retrieval
      • Technical Writing
      • Translation
  • But [...] is constantly evolving and so is its [...].

SLIDE 3

  • Terminology resources
      • Short life-cycles, constant need for update
      • Expensive to produce and maintain
      • Need to keep up with emergent domains
  • What we need:
      • Easy-to-use terminology extraction software
      • Computing-aware terminology specialists

“Build & Go” terminology resources

SLIDE 4

1. Obtain a specific domain corpus
      • “Do-it-yourself” / web search / specialist
2. Extract terminology (semi-automatically)
3. Validate results using corpora
      • Consult specialist, if possible...
4. Use terminology for IR, Translation, etc.
5. IF/WHEN more terminology resources are necessary, go back to Step 1

SLIDE 5

  • Statistical
      • Rationale: find word sequences that differ from “common-language”
      • Simple and portable but requires “common-language” corpus for comparison: [...]!
  • Syntactic
      • Rationale: find word sequences that have a specific POS pattern
      • Good precision and coverage, but complex and requires [...]. Difficult to port to other languages.
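
The statistical rationale can be illustrated with a minimal sketch. It assumes a simple score, the ratio between an n-gram's relative frequency in the domain corpus and its add-one smoothed relative frequency in a "common-language" corpus. The slides do not say which statistic such methods use, so the function names and the smoothing choice here are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinctiveness(domain_tokens, reference_tokens, n=2):
    """Score each domain n-gram by how much more frequent it is in the
    domain corpus than in the reference ("common-language") corpus.

    The add-one smoothing is an assumption made so that n-grams absent
    from the reference corpus do not cause a division by zero.
    """
    dom = Counter(ngrams(domain_tokens, n))
    ref = Counter(ngrams(reference_tokens, n))
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())
    scores = {}
    for gram, count in dom.items():
        dom_rel = count / dom_total
        ref_rel = (ref[gram] + 1) / (ref_total + 1)
        scores[gram] = dom_rel / ref_rel
    return scores


domain = "the axon of the neuron and the axon".split()
reference = "the cat and the dog and the cat".split()
scores = distinctiveness(domain, reference)
# Domain-specific sequences should outrank common-language ones.
print(scores[("the", "axon")] > scores[("and", "the")])
```

The sketch also shows the drawback noted on the slide: nothing can be scored without a reference corpus.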

SLIDE 6

  • Morphological:
      • Rationale: find words that look like terms based on roots or suffixes.
      • Good precision for [...] domains but requires [...].
  • Hybrid:
      • Rationale: try to combine any of the previous approaches and use other heuristics
      • May lead to good results but usually lacks [...]

SLIDE 7

  • The situation:
      • Large amounts of text available on-line
      • High [...] - should be explored!
      • Multi-lingual corpora (comparable, not parallel)
  • What is required:
      • [...] algorithms
          • Large amounts of text to be processed
      • High [...] algorithms
          • High coverage comes from redundancy
      • “[...]” algorithms
          • Easy to port to other languages: spare the programmers!

SLIDE 8

  • We still need human intervention
      • at least domain specialists for validation
      • “Fully automated” methods are never fully automated
  • Human intervention in resource building is advisable and feasible
      • But it cannot be too difficult/boring
      • [...] is more important than coverage!

SLIDE 9

  • The Corpógrafo is a complete web-based terminology extraction environment.
  • We assume user intervention:
      • the “need for speed”
      • good precision
      • easy to understand!
  • Need to perform reasonably well in many languages.
  • We cannot afford POS tagging:
      • too complex, too slow, too expensive, too dependent

SLIDE 10

  • Collect N-grams from the corpus
  • Ask user to check if they are terms.
  • Advantages:
      • No linguistic resources needed
      • Fast and portable
  • Disadvantages
      • Too noisy
      • Users obviously find it inappropriate
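
This baseline can be sketched in a few lines. The tokenizer below (lowercasing, then splitting into word and punctuation tokens) and the sample sentence are assumptions made for illustration; the slides do not describe how the Corpógrafo tokenizes text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens.
    This is a simplifying assumption, not the tokenizer used in the talk."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def count_ngrams(tokens, n):
    """Count every contiguous n-gram, with no filtering at all."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


text = "The axon of the neuron carries the signal. The axon of the cell is long."
candidates = count_ngrams(tokenize(text), 2)

# Every frequent n-gram is shown to the user, including ones such as
# "of the" that can never be terms: this is the noise problem.
for gram, count in candidates.most_common(5):
    print(" ".join(gram), count)
```

Because nothing is filtered, sequences built from function words compete with real terms for the top of the list.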

SLIDE 11

  • Specific domain corpus (neurology)
  • Texts taken from the web (pdf, word, html)
  • 6 languages (PT, EN, FR, ES, IT, DE)
  • English section: 29192 tks.

  % of tokens   Count   2-gram
  0.137         40      in a
  0.137         40      by the
  0.143         42      is the
  0.157         46      [...]
  0.157         46      the axon
  0.164         48      the neuron
  0.174         51      of a
  0.178         52      and the
  0.202         59      on the
  0.222         65      [...]
  0.222         65      the brain
  0.243         71      from the
  0.404         118     the cell
  0.414         121     to the
  0.832         243     in the
  1.137         332     of the
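
The first figure in each row appears to be the raw count expressed as a percentage of the 29192 tokens in the English section, truncated (not rounded) to three decimals. This is inferred from the numbers themselves and is not stated on the slide. A small sketch of that arithmetic, with a hypothetical function name:

```python
def relative_frequency(count, corpus_tokens=29192):
    """Return the count as a percentage of corpus tokens, truncated to
    three decimals. Integer arithmetic avoids floating-point surprises."""
    thousandths = count * 100_000 // corpus_tokens
    return f"{thousandths // 1000}.{thousandths % 1000:03d}"


# Counts taken from the list above.
for count in (40, 42, 65, 243, 332):
    print(count, relative_frequency(count))
```

For example, 332 occurrences out of 29192 tokens gives 1.137%.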
SLIDE 12

  • Results can be easily improved.
  • We could start by describing what a term is and trying to find n-grams that respect that description
      • Ex: a term must end with “*ology”, etc.
  • However, it is very difficult to say what a term might be for every domain.
  • But it is much easier to say what a term is NOT!
  • And it is also much more “stable”...

SLIDE 13

  • Define 3 n-gram exclusion lists:
      • List of [...]: tokens that cannot start terms
      • List of [...]: tokens that cannot end terms
      • List of [...]: tokens that cannot be part of the term in any position
  • Find N-grams conforming to these restrictions
  • Let redundancy do the rest
  • Similar approach in Merkel & Andersson, 2000
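
The three restrictions above can be sketched as follows. The contents of the lists are hypothetical: the slides do not publish them, so the sets below only illustrate the kind of tokens they might hold.

```python
from collections import Counter

# Hypothetical exclusion lists, for illustration only.
CANNOT_START = {"the", "a", "an", "of", "and", "in", "to", "is", "by", "from", "on"}
CANNOT_END = {"the", "a", "an", "of", "and", "in", "to", "is", "by", "from", "on"}
CANNOT_APPEAR = {".", ",", ";", ":", "(", ")"}


def is_candidate(gram):
    """Apply the three restrictions: first token, last token, any position."""
    if gram[0] in CANNOT_START or gram[-1] in CANNOT_END:
        return False
    return not any(tok in CANNOT_APPEAR for tok in gram)


def extract_candidates(tokens, n):
    """Count only the n-grams that pass all three exclusion lists."""
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(g for g in grams if is_candidate(g))


tokens = ("the central nervous system controls the nervous system , "
          "and the spinal cord").split()
print(extract_candidates(tokens, 2).most_common())
print(extract_candidates(tokens, 3).most_common())
```

Note that "of" is blocked only at the edges, so candidates with an internal preposition can still pass. Sorting by frequency then implements "let redundancy do the rest": real terms recur, while accidental sequences mostly do not.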

SLIDE 14

  • Most of the elements in these lists are:
      • prepositions
      • pronouns
      • punctuation
      • certain very frequent words
  • Can be easily compiled through trial-and-error strategy
  • Very stable among different knowledge domains
      • but may be easily changed, if necessary
  • Easy to adapt to other non-agglutinative languages

SLIDE 15

  % of tokens   Count   2-gram
  0.041         12      nerve cell
  0.041         12      myelin sheath
  0.041         12      amino acids
  0.041         12      developing circuits
  0.041         12      nervous systems
  0.044         13      endoplasmic reticulum
  0.044         13      nerve fibers
  0.044         13      membrane proteins
  0.047         14      schwann cells
  0.047         14      action potentials
  0.054         16      central nervous
  0.054         16      plasma membrane
  0.068         20      synaptic cleft
  0.099         29      glial cells
  0.109         32      action potential
  0.116         34      spinal cord
  0.126         37      nerve cells
  0.133         39      electrical activity
  0.157         46      cell body
  0.222         65      nervous system

  % of tokens   Count   3-gram
  0.010         3       refinement of neural
  0.010         3       complexes of integral
  0.010         3       primary cell walls
  0.010         3       rate of transmission
  0.010         3       action potential will
  0.010         3       – the messengers
  0.010         3       nmda receptor activation
  0.010         3       primary visual cortex
  0.010         3       can be divided
  0.010         3       evoked nt secretion
  0.013         4       induction of ltp/ltd
  0.013         4       synthesis of proteins
  0.013         4       nodes of ranvier
  0.017         5       pattern of activity
  0.017         5       signaling between nerve
  0.017         5       name of glial
  0.017         5       nuclear pore complexes
  0.027         8       integral membrane proteins
  0.027         8       peripheral nervous system
  0.054         16      central nervous system

SLIDE 16

  • Precision increases
      • user intervention is easier: +100 terms/hr
  • Still some problems:

1. Some very frequent words that bring false candidates (ex: “can”)
2. Plural/Singular divide term occurrences
3. Encapsulated terms still difficult to separate: “nervous system” and “central nervous system”

  • But still possible to improve results

SLIDE 17

  • Goal: improve precision and solve some of the previous problems (1 & 2)
  • If we “force” N-Grams to be NP, then singularization is trivial in [...] languages
  • Impose 1 additional restriction: N-Grams must be preceded by certain words
      • Ex: “a”, “the”, “one”, “as”, etc.
      • Similar restrictions in “PT”, “ES”, “FR”, “IT”
  • Simple enough to be easily implemented
  • Again: let redundancy do the rest!
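
A minimal sketch of this extra restriction for English. The trigger words come from the examples on the slide plus "an"; the edge stop list, the punctuation set and the naive singularization rule are assumptions, since the slides do not spell them out.

```python
from collections import Counter

TRIGGERS = {"a", "an", "the", "one", "as"}            # words that must precede the n-gram
EDGE_STOP = {"a", "an", "the", "of", "and", "in", "to", "is", "by"}  # hypothetical
PUNCT = {".", ",", ";", ":"}                           # hypothetical


def singularize(word):
    """Naive English singularization (an assumption, with known gaps)."""
    if word.endswith(("xes", "ches", "shes", "sses")):
        return word[:-2]
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def extract_after_triggers(tokens, n):
    """Count n-grams that directly follow a trigger word, merging
    plural and singular forms by singularizing the last token."""
    counts = Counter()
    for i in range(1, len(tokens) - n + 1):
        if tokens[i - 1] not in TRIGGERS:
            continue
        gram = list(tokens[i:i + n])
        if gram[0] in EDGE_STOP or gram[-1] in EDGE_STOP:
            continue
        if any(tok in PUNCT for tok in gram):
            continue
        gram[-1] = singularize(gram[-1])
        counts[tuple(gram)] += 1
    return counts


tokens = ("the nerve cells surround the nerve cell and "
          "a glial cell supports nerve cells").split()
print(extract_after_triggers(tokens, 2).most_common())
```

Occurrences that do not follow a trigger word are simply skipped, which trades some coverage for precision. On a large corpus, redundancy is expected to make up for the loss.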

SLIDE 18

  % of tokens   Count   2-gram
  0.023         7       nmda receptor
  0.023         7       schwann cell
  0.023         7       respiratory chain
  0.027         8       human brain
  0.027         8       endoplasmic reticulum
  0.030         9       protein synthesis
  0.030         9       peripheral nervous
  0.034         10      neural circuit
  0.037         11      developing circuit
  0.041         12      myelin sheath
  0.051         15      central nervous
  0.054         16      glial cell
  0.054         16      plasma membrane
  0.068         20      synaptic cleft
  0.075         22      electrical activity
  0.089         26      nerve cell
  0.102         30      action potential
  0.106         31      spinal cord
  0.119         35      nervous system
  0.157         46      cell body

  % of tokens   Count   3-gram
  0.006         2       developing neural circuit
  0.006         2       activity-dependent synaptic modification
  0.006         2       cytochrome b gene
  0.006         2       activity-induced synaptic modification
  0.006         2       induction of ltp
  0.006         2       xenopus retinotectal system
  0.006         2       refinement of neural
  0.006         2       evoked nt secretion
  0.006         2       activation of nmda
  0.010         3       rate of transmission
  0.010         3       development of neural
  0.010         3       energy of atp
  0.010         3       primary visual cortex
  0.013         4       induction of ltp/ltd
  0.013         4       nuclear pore complex
  0.013         4       synthesis of proteins
  0.020         6       integral membrane protein
  0.020         6       node of ranvier
  0.030         9       peripheral nervous system
  0.051         15      central nervous system

SLIDE 19

  • Further increases in precision
      • Easier for user validation!
  • Plural/Singular division solved
  • Easy to implement
      • “Multi-lingual”
  • Even faster: no long N-grams list to sort!
      • 29K tokens in less than 2s on a standard Intel P4 machine
  • But we haven’t yet solved the term encapsulation problem. Possible solution:
      • use “bigger first” N-gram search strategy
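
The slides name the “bigger first” strategy without describing it, so the following is only one possible reading, not the author's implementation. It searches the longest n-grams first and, once a long n-gram is accepted, stops counting the shorter n-grams inside those same token positions. The threshold `min_count`, the stop list and all function names are assumptions.

```python
STOP = {"the", "a", "of", "and", "."}  # hypothetical edge exclusion list


def simple_filter(gram):
    """Reject n-grams that start or end with a stop token."""
    return gram[0] not in STOP and gram[-1] not in STOP


def bigger_first(tokens, is_candidate, max_n=3, min_count=2):
    """Accept n-grams from longest to shortest. Token positions covered
    by an accepted longer n-gram are not reused for shorter ones."""
    covered = [False] * len(tokens)
    accepted = {}
    for n in range(max_n, 1, -1):
        starts_by_gram = {}
        for i in range(len(tokens) - n + 1):
            if any(covered[i:i + n]):
                continue  # overlaps a longer, already accepted term
            gram = tuple(tokens[i:i + n])
            if is_candidate(gram):
                starts_by_gram.setdefault(gram, []).append(i)
        for gram, starts in starts_by_gram.items():
            if len(starts) >= min_count:
                accepted[gram] = len(starts)
                for i in starts:
                    covered[i:i + n] = [True] * n
    return accepted


tokens = ("the central nervous system and the central nervous system "
          "protect the nervous system and the nervous system").split()
print(bigger_first(tokens, simple_filter))
```

Under this reading, "nervous system" is counted only where it occurs on its own, so it is no longer inflated by occurrences inside "central nervous system", and the fragment "central nervous" is not proposed at all.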

SLIDE 20

  • We have presented a simple algorithm:
      • Easy to implement by developers
      • Easy to understand by users
      • Fast enough to process large corpora (> 1M)
      • Filters can be adapted to specific domains
      • Can be easily ported to many languages
      • Improvable by simple trial-and-error methods
  • But it still needs some work to deal with encapsulated terms (in development)