
A simple and robust algorithm for extracting terminology

Luís Sarmento, Linguateca, www.linguateca.pt / las@letras.up.pt


  1. A simple and robust algorithm for extracting terminology
     Luís Sarmento, Linguateca
     www.linguateca.pt / las@letras.up.pt
     Luís Sarmento @ META Simposium - For a Proactive Translatology (Université de Montréal, Québec, Canadá, 7-9 April 2005)

  2. [title illegible]
     - Exponential growth of multi-lingual written information, especially in [illegible]
     - Need for [illegible]:
       - Information Retrieval
       - Technical Writing
       - Translation
     - But [illegible] is constantly evolving and so is its [illegible].

  3. [title illegible]
     - Terminology resources
       - Short life-cycles, constant need for update
       - Expensive to produce and maintain
       - Need to keep up with emergent domains
     - What we need:
       - [illegible]
       - Easy-to-use terminology extraction software
       - Computing-aware terminology specialists
       - “Build & Go” terminology resources

  4. [title illegible]
     1. Obtain a specific domain corpus
        - “Do-it-yourself” / web search / specialist
     2. Extract terminology (semi-automatically)
     3. Validate results using corpora
        - Consult specialist, if possible...
     4. Use terminology for IR, Translation, etc...
     5. IF/WHEN more terminology resources are necessary, go back to Step 1

  5. [title illegible]
     - Statistical
       - Rationale: find word sequences that differ from “common-language”
       - Simple and portable but requires “common-language” corpus for comparison: [illegible]!
     - Syntactic
       - Rationale: find word sequences that have a specific POS pattern
       - Good precision and coverage, but complex and requires [illegible]. Difficult to port to other languages.
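The statistical rationale on slide 5 (word sequences that differ from “common-language”) can be sketched as a frequency-ratio comparison between a domain corpus and a reference corpus. The slides do not say which measure was used, so the add-one-smoothed ratio below, the function name `domain_specificity`, and the toy token lists are all assumptions made for illustration.

```python
from collections import Counter


def domain_specificity(domain_tokens, reference_tokens, smoothing=1.0):
    """Rank domain words by how much more frequent they are than in a reference corpus.

    Hypothetical helper: the score is the ratio of smoothed relative
    frequencies (domain / reference). Scores above 1 suggest the word is
    more typical of the domain than of "common language".
    """
    domain_counts = Counter(domain_tokens)
    reference_counts = Counter(reference_tokens)
    domain_total = sum(domain_counts.values())
    reference_total = sum(reference_counts.values())
    vocab_size = len(set(domain_counts) | set(reference_counts))

    scores = {}
    for word, count in domain_counts.items():
        # Additive smoothing keeps words unseen in the reference corpus
        # from causing a division by zero.
        p_domain = (count + smoothing) / (domain_total + smoothing * vocab_size)
        p_reference = (reference_counts[word] + smoothing) / (
            reference_total + smoothing * vocab_size
        )
        scores[word] = p_domain / p_reference
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data, not taken from the presentation.
    domain = "the corpus the term the corpus".split()
    reference = "the cat the dog the house the".split()
    for word, score in domain_specificity(domain, reference):
        print(f"{word}\t{score:.2f}")
```

The same scoring could be applied to multi-word sequences by passing n-grams instead of single tokens. Either way the approach depends on having a reference corpus available, which is the requirement the slide points out.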

  6. [title illegible]
     - Morphological:
       - Rationale: find words that look like terms based on roots or suffixes.
       - Good precision for [illegible] domains but requires [illegible].
     - Hybrid:
       - Rationale: try to combine any of the previous approaches and use other heuristics
       - May lead to good results but usually lacks [illegible]
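As a rough illustration of the morphological rationale on slide 6 (words that look like terms based on roots or suffixes), the sketch below flags words ending in one of a few suffixes. The suffix list, the minimum stem length, and the function names are hypothetical and not taken from the presentation; a real system would need a curated, domain-specific list.

```python
# Hypothetical suffix list, loosely modelled on medical vocabulary.
EXAMPLE_SUFFIXES = ("itis", "ectomy", "ology", "osis")


def looks_like_term(word, suffixes=EXAMPLE_SUFFIXES, min_stem=3):
    """Return True if the word ends in a listed suffix and has a long enough stem."""
    lowered = word.lower()
    return any(
        lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem
        for suffix in suffixes
    )


def morphological_candidates(tokens, suffixes=EXAMPLE_SUFFIXES):
    """Collect distinct lower-cased candidate terms in order of first appearance."""
    seen = set()
    candidates = []
    for token in tokens:
        lowered = token.lower()
        if lowered not in seen and looks_like_term(lowered, suffixes):
            seen.add(lowered)
            candidates.append(lowered)
    return candidates


if __name__ == "__main__":
    sample = "The patient had gastritis and an appendectomy ; cardiology confirmed Gastritis"
    print(morphological_candidates(sample.split()))
```

The suffix list is exactly the kind of language- and domain-dependent resource that makes this approach harder to port.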

  7. [title illegible]
     - The situation:
       - Large amounts of text available on-line
       - High [illegible] – should be explored!
       - Multi-lingual corpora (comparable, not parallel)
     - What is required:
       - [illegible] algorithms
         - Large amounts of text to be processed
       - High [illegible] algorithms
         - High coverage comes from redundancy
       - “[illegible]” algorithms
         - Easy to port to other languages: spare the programmers!

  8. [title illegible]
     - We still need human intervention
       - at least domain specialists for validation
       - “Fully automated” methods are never fully automated
     - Human intervention in resource building is advisable and feasible
       - But it cannot be too difficult/boring
       - [illegible] is more important than coverage!

  9. [title illegible]
     - The Corpógrafo is a complete web-based terminology extraction environment.
     - We assume user intervention:
       - the “need for speed”
       - good precision
       - easy to understand!
     - Need to perform reasonably well in many languages.
     - We cannot afford POS tagging:
       - too complex, too slow, too expensive, too dependent

  10. [title illegible]
      - Collect N-grams from the corpus
      - Ask user to check if they are terms.
      - Advantages:
        - No linguistic resources needed
        - Fast and portable
      - Disadvantages:
        - Too noisy
        - Users obviously find it inappropriate
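The baseline on slide 10 (collect N-grams, then let the user validate them) can be sketched in a few lines. The tokenisation rule, the `max_n` and `min_count` parameters, and the function name `collect_ngrams` are assumptions for illustration; the slides do not specify these details.

```python
import re
from collections import Counter


def collect_ngrams(text, max_n=3, min_count=2):
    """Count word n-grams of length 1..max_n and keep those seen at least min_count times.

    Hypothetical helper: n-grams never cross punctuation or line breaks,
    which is one simple way to reduce obviously bad candidates.
    """
    counts = Counter()
    # Split on punctuation so n-grams stay within one text segment.
    for segment in re.split(r"[.!?;:,()\n]+", text.lower()):
        tokens = re.findall(r"\w+(?:-\w+)*", segment)
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return [(gram, count) for gram, count in counts.most_common() if count >= min_count]


if __name__ == "__main__":
    # Toy text, not taken from the presentation.
    sample = "Term extraction is hard. Term extraction needs a corpus."
    for gram, count in collect_ngrams(sample, max_n=2):
        print(f"{count}\t{gram}")
```

Even on a tiny input the candidate list mixes useful sequences with frequent but uninformative ones, which is consistent with the "too noisy" disadvantage listed on the slide.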
