Luís Sarmento @ META Simposium - For a Proactive Translatology (Université de Montréal, Québec, Canadá, 7-9 April 2005)
A simple and robust algorithm for extracting terminology
Luís Sarmento
Linguateca
www.linguateca.pt / las@letras.up.pt
- Exponential growth of multilingual written information, especially in [...]
- Need for:
  - Information Retrieval
  - Technical Writing
  - Translation
- But [...] is constantly evolving and so is its [...].
- Terminology resources
  - Short life-cycles, constant need for update
  - Expensive to produce and maintain
  - Need to keep up with emergent domains
- What we need:
  - Easy-to-use terminology extraction software
  - Computing-aware terminology specialists
  - “Build & Go” terminology resources
1. Obtain a specific domain corpus
   - “Do-it-yourself” / web search / specialist
2. Extract terminology (semi-automatically)
3. Validate results using corpora
   - Consult specialist, if possible...
4. Use terminology for IR, Translation, etc.
5. If/when more terminology resources are necessary, go back to Step 1
- Statistical
  - Rationale: find word sequences that differ from “common-language”
  - Simple and portable but requires “common-language” corpus for comparison: [...]!
- Syntactic
  - Rationale: find word sequences that have a specific POS pattern
  - Good precision and coverage, but complex and requires [...]. Difficult to port to other languages.
- Morphological:
  - Rationale: find words that look like terms based on roots or suffixes.
  - Good precision for [...] domains but requires [...].
- Hybrid:
  - Rationale: try to combine any of the previous approaches and use other heuristics
  - May lead to good results but usually lacks [...]
- The situation:
  - Large amounts of text available on-line
  - High [...] - should be explored!
  - Multilingual corpora (comparable, not parallel)
- What is required:
  - [...] algorithms
    - Large amounts of text to be processed
  - High [...] algorithms
    - High coverage comes from redundancy
  - “[...]” algorithms
    - Easy to port to other languages: spare the programmers!
- We still need human intervention
  - at least domain specialists for validation
  - “Fully automated” methods are never fully automated
- Human intervention in resource building is advisable and feasible
  - But it cannot be too difficult/boring
  - [...] is more important than coverage!
- The Corpógrafo is a complete web-based terminology extraction environment.
- We assume user intervention:
  - the “need for speed”
  - good precision
  - easy to understand!
- Need to perform reasonably well in many languages.
- We cannot afford POS tagging:
  - too complex, too slow, too expensive, too dependent
- Collect N-grams from the corpus
- Ask user to check if they are terms.
- Advantages:
  - No linguistic resources needed
  - Fast and portable
- Disadvantages:
  - Too noisy
  - Users obviously find it inappropriate
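The baseline above can be sketched in a few lines of Python. This is only an illustration, not the Corpógrafo implementation: the tokenizer, the function names and the choice to express frequency as a percentage of corpus tokens are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation marks."""
    # Assumption: words may contain internal hyphens or slashes (e.g. "ltp/ltd").
    return re.findall(r"\w+(?:[-/]\w+)*|[^\w\s]", text.lower())


def count_ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rank_ngrams(tokens, n, top=20):
    """Return (n-gram, count, percentage of corpus tokens), most frequent first."""
    total = len(tokens)
    return [
        (" ".join(gram), count, 100.0 * count / total)
        for gram, count in count_ngrams(tokens, n).most_common(top)
    ]


if __name__ == "__main__":
    sample = "The axon of the neuron. The axon of the cell."
    for gram, count, pct in rank_ngrams(tokenize(sample), 2, top=5):
        print(f"{pct:6.3f}  {count:3d}  {gram}")
```

With no filtering at all, function-word sequences such as "of the" compete with real terms for the top positions, which is the noise problem listed under Disadvantages.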
- Specific domain corpus (neurology)
- Texts taken from the web (pdf, word, html)
- 6 languages (PT, EN, FR, ES, IT, DE)
- English section: 29192 tokens

| Freq. (%) | Count | N-gram |
|---|---|---|
| 1.137 | 332 | of the |
| 0.832 | 243 | in the |
| 0.414 | 121 | to the |
| 0.404 | 118 | the cell |
| 0.243 | 71 | from the |
| 0.222 | 65 | the brain |
| 0.222 | 65 | [...] |
| 0.202 | 59 | on the |
| 0.178 | 52 | and the |
| 0.174 | 51 | of a |
| 0.164 | 48 | the neuron |
| 0.157 | 46 | the axon |
| 0.157 | 46 | [...] |
| 0.143 | 42 | is the |
| 0.137 | 40 | by the |
| 0.137 | 40 | in a |
- Results can be easily improved.
- We could start by describing what a term is and trying to find n-grams that respect that description
  - Ex: a term must end with “*ology”, etc.
- However, it is very difficult to say what a term might be for every domain.
- But it is much easier to say what a term is NOT!
- And it is also much more “stable”...
- Define 3 n-gram exclusion lists:
  - List of [...]: tokens that cannot start terms
  - List of [...]: tokens that cannot end terms
  - List of [...]: tokens that cannot be part of the term in any position
- Find N-grams conforming to these restrictions
- Let redundancy do the rest
- Similar approach in Merkel & Andersson, 2000
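A minimal sketch of the three exclusion lists, assuming a pre-tokenized, lowercased corpus. The list contents and all names (`CANNOT_START`, `CANNOT_END`, `CANNOT_CONTAIN`, `candidate_ngrams`) are hypothetical placeholders, not the lists actually used in the Corpógrafo.

```python
from collections import Counter

# Illustrative lists only; real lists would be compiled by trial and error.
CANNOT_START = {"the", "a", "of", "in", "to", "and", "is", "by", "from", "on"}
CANNOT_END = {"the", "a", "of", "in", "to", "and", "is", "by", "from", "on"}
CANNOT_CONTAIN = {".", ",", ";", ":", "(", ")"}


def is_candidate(ngram, cannot_start=CANNOT_START, cannot_end=CANNOT_END,
                 cannot_contain=CANNOT_CONTAIN):
    """Apply the three restrictions: first token, last token, any token."""
    if ngram[0] in cannot_start or ngram[-1] in cannot_end:
        return False
    return not any(token in cannot_contain for token in ngram)


def candidate_ngrams(tokens, n):
    """Count only the n-grams that pass all three exclusion lists."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if is_candidate(ngram):
            counts[" ".join(ngram)] += 1
    return counts


if __name__ == "__main__":
    tokens = "the central nervous system of the brain .".split()
    print(candidate_ngrams(tokens, 3))
```

Because "of" is only excluded at the edges, a sequence like "nodes of ranvier" is still accepted, while "system of the" is rejected.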
- Most of the elements in these lists are:
  - prepositions
  - pronouns
  - punctuation
  - certain very frequent words
- Can be easily compiled through trial-and-error strategy
- Very stable among different knowledge domains
  - but may be easily changed, if necessary
- Easy to adapt to other non-agglutinative languages
Bigrams:

| Freq. (%) | Count | N-gram |
|---|---|---|
| 0.222 | 65 | nervous system |
| 0.157 | 46 | cell body |
| 0.133 | 39 | electrical activity |
| 0.126 | 37 | nerve cells |
| 0.116 | 34 | spinal cord |
| 0.109 | 32 | action potential |
| 0.099 | 29 | glial cells |
| 0.068 | 20 | synaptic cleft |
| 0.054 | 16 | plasma membrane |
| 0.054 | 16 | central nervous |
| 0.047 | 14 | action potentials |
| 0.047 | 14 | schwann cells |
| 0.044 | 13 | membrane proteins |
| 0.044 | 13 | nerve fibers |
| 0.044 | 13 | endoplasmic reticulum |
| 0.041 | 12 | nervous systems |
| 0.041 | 12 | developing circuits |
| 0.041 | 12 | amino acids |
| 0.041 | 12 | myelin sheath |
| 0.041 | 12 | nerve cell |

Trigrams:

| Freq. (%) | Count | N-gram |
|---|---|---|
| 0.054 | 16 | central nervous system |
| 0.027 | 8 | peripheral nervous system |
| 0.027 | 8 | integral membrane proteins |
| 0.017 | 5 | nuclear pore complexes |
| 0.017 | 5 | name of glial |
| 0.017 | 5 | signaling between nerve |
| 0.017 | 5 | pattern of activity |
| 0.013 | 4 | nodes of ranvier |
| 0.013 | 4 | synthesis of proteins |
| 0.013 | 4 | induction of ltp/ltd |
| 0.010 | 3 | evoked nt secretion |
| 0.010 | 3 | can be divided |
| 0.010 | 3 | primary visual cortex |
| 0.010 | 3 | nmda receptor activation |
| 0.010 | 3 | – the messengers |
| 0.010 | 3 | action potential will |
| 0.010 | 3 | rate of transmission |
| 0.010 | 3 | primary cell walls |
| 0.010 | 3 | complexes of integral |
| 0.010 | 3 | refinement of neural |
- Precision increases
  - user intervention is easier: +100 terms/hr
- Still some problems:
  1. Some very frequent words that bring false candidates (ex: “can”)
  2. Plural/singular divide term occurrences
  3. Encapsulated terms still difficult to separate:
     - “nervous system” and “central nervous system”
- But still possible to improve results
- Goal: improve precision and solve some of the previous problems (1 & 2)
- If we “force” N-grams to be NP, then singularization is trivial in [...] languages
- Impose 1 additional restriction: N-grams must be preceded by certain words
  - Ex: “a”, “the”, “one”, “as”, etc.
  - Similar restrictions in “PT”, “ES”, “FR”, “IT”
- Simple enough to be easily implemented
- Again: let redundancy do the rest!
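The additional restriction can be sketched as below: an n-gram is kept only when the token just before it belongs to a small set of left-context words, and its last token is then singularized. The left-context words are the English examples from the slide; the exclusion lists, the function names and the deliberately naive `singularize` rule are assumptions made for illustration.

```python
from collections import Counter

LEFT_CONTEXT = {"a", "the", "one", "as"}   # examples given on the slide
EDGE_STOP = {"the", "a", "of", "and", "in", "to", "is", "can"}  # hypothetical
ANYWHERE_STOP = {".", ",", ";", ":"}                            # hypothetical


def singularize(token):
    """Very naive English singularization (assumption, not the original rule)."""
    if token.endswith("xes"):
        return token[:-2]
    if token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def np_candidates(tokens, n):
    """Count n-grams that follow a left-context word and pass the exclusion lists."""
    counts = Counter()
    for i in range(1, len(tokens) - n + 1):
        if tokens[i - 1] not in LEFT_CONTEXT:
            continue
        ngram = list(tokens[i:i + n])
        if ngram[0] in EDGE_STOP or ngram[-1] in EDGE_STOP:
            continue
        if any(token in ANYWHERE_STOP for token in ngram):
            continue
        ngram[-1] = singularize(ngram[-1])  # merge plural and singular forms
        counts[" ".join(ngram)] += 1
    return counts


if __name__ == "__main__":
    tokens = "the nerve cells and the nerve cell of a nerve cell .".split()
    print(np_candidates(tokens, 2))
```

Only positions that follow a left-context word are inspected, so far fewer n-grams are generated and sorted than in the unrestricted search.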
Bigrams:

| Freq. (%) | Count | N-gram |
|---|---|---|
| 0.157 | 46 | cell body |
| 0.119 | 35 | nervous system |
| 0.106 | 31 | spinal cord |
| 0.102 | 30 | action potential |
| 0.089 | 26 | nerve cell |
| 0.075 | 22 | electrical activity |
| 0.068 | 20 | synaptic cleft |
| 0.054 | 16 | plasma membrane |
| 0.054 | 16 | glial cell |
| 0.051 | 15 | central nervous |
| 0.041 | 12 | myelin sheath |
| 0.037 | 11 | developing circuit |
| 0.034 | 10 | neural circuit |
| 0.030 | 9 | peripheral nervous |
| 0.030 | 9 | protein synthesis |
| 0.027 | 8 | endoplasmic reticulum |
| 0.027 | 8 | human brain |
| 0.023 | 7 | respiratory chain |
| 0.023 | 7 | schwann cell |
| 0.023 | 7 | nmda receptor |

Trigrams:

| Freq. (%) | Count | N-gram |
|---|---|---|
| 0.051 | 15 | central nervous system |
| 0.030 | 9 | peripheral nervous system |
| 0.020 | 6 | node of ranvier |
| 0.020 | 6 | integral membrane protein |
| 0.013 | 4 | synthesis of proteins |
| 0.013 | 4 | nuclear pore complex |
| 0.013 | 4 | induction of ltp/ltd |
| 0.010 | 3 | primary visual cortex |
| 0.010 | 3 | energy of atp |
| 0.010 | 3 | development of neural |
| 0.010 | 3 | rate of transmission |
| 0.006 | 2 | activation of nmda |
| 0.006 | 2 | evoked nt secretion |
| 0.006 | 2 | refinement of neural |
| 0.006 | 2 | xenopus retinotectal system |
| 0.006 | 2 | induction of ltp |
| 0.006 | 2 | activity-induced synaptic modification |
| 0.006 | 2 | cytochrome b gene |
| 0.006 | 2 | activity-dependent synaptic modification |
| 0.006 | 2 | developing neural circuit |
- Further increases in precision
  - Easier for user validation!
- Plural/singular division solved
- Easy to implement
  - “Multilingual”
- Even faster: no long N-grams list to sort!
  - 29K tokens in less than 2s on a standard Intel P4 machine
- But we haven’t yet solved the term encapsulation problem. Possible solution:
  - use “bigger first” N-gram search strategy
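The slides do not describe the “bigger first” strategy in detail. One possible reading, sketched here purely as an assumption, is to search for the longest n-grams first and to exclude token positions already claimed by an accepted longer candidate when counting shorter ones. The function name, the `min_count` threshold and the `is_candidate` callback are hypothetical.

```python
def bigger_first(tokens, max_n, min_n, min_count, is_candidate):
    """Search long n-grams first; shorter ones only count outside accepted spans."""
    covered = [False] * len(tokens)   # positions claimed by accepted candidates
    accepted = {}
    for n in range(max_n, min_n - 1, -1):
        starts_by_ngram = {}
        for i in range(len(tokens) - n + 1):
            if any(covered[i:i + n]):
                continue              # overlaps an already accepted longer n-gram
            ngram = tuple(tokens[i:i + n])
            if is_candidate(ngram):
                starts_by_ngram.setdefault(ngram, []).append(i)
        for ngram, starts in starts_by_ngram.items():
            if len(starts) >= min_count:
                accepted[" ".join(ngram)] = len(starts)
                for i in starts:
                    for j in range(i, i + n):
                        covered[j] = True
    return accepted


if __name__ == "__main__":
    text = ("central nervous system x central nervous system y "
            "nervous system z nervous system")
    noise = {"x", "y", "z"}
    result = bigger_first(text.split(), max_n=3, min_n=2, min_count=2,
                          is_candidate=lambda g: not any(t in noise for t in g))
    print(result)
```

Under this reading, "nervous system" is counted only where it occurs on its own, so it is reported separately from the occurrences embedded in "central nervous system".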
- We have presented a simple algorithm:
  - Easy to implement by developers
  - Easy to understand by users
  - Fast enough to process large corpora (> 1M)
  - Filters can be adapted to specific domains
  - Can be easily ported to many languages
  - Improvable by simple trial-and-error methods
- But it still needs some work to deal with