
The Impact of Distributional Metrics in the Quality of Relational Triples


  1. The Impact of Distributional Metrics in the Quality of Relational Triples. Hernani Costa, Hugo Gonçalo Oliveira¹, Paulo Gomes. hpcosta@student.dei.uc.pt, {hroliv,pgomes}@dei.uc.pt. Cognitive & Media Systems Group, CISUC, University of Coimbra. Lisbon, August 16, 2010. ¹ Supported by FCT scholarship grant SFRH/BD/44955/2008. Costa, Gonçalo Oliveira & Gomes (CISUC), LaTeCH 2010, Lisbon, August 16, 2010.

  2. Outline: 1 Introduction (Information Extraction; Information Retrieval; Research Goals). 2 Approach. 3 Experimentation (Set-up; Metrics adaptation; Results; Additional experimentation). 4 Concluding remarks.

  8. Introduction. Knowledge bases (e.g. WordNet) are useful resources for NLP. Their creation and maintenance involve intensive human effort. Automatic creation/enrichment from textual resources is an alternative: ▶ Higher coverage, easier update, but... ▶ Precision is lower ▶ Evaluation once again requires intensive human labour!

  13. Introduction: Information Extraction. Information extraction (IE): automatic extraction of structured information from natural language. Example: “Car is a vehicle with 4 wheels and an engine, used for carrying a small number of passengers.” ▶ vehicle HYPERNYM OF car ▶ wheel PART OF car ▶ engine PART OF car ▶ carrying people PURPOSE OF car
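
The example on the slide above can be sketched in code. The following is only an illustration of pattern-based triple extraction, not the authors' IE system (which the slides say relies on grammars): the `Triple` type, the `extract_triples` function, the two regular expressions and the naive plural stripping are all assumptions made for this sketch. It covers only the HYPERNYM OF and PART OF triples, not PURPOSE OF.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """A relational triple: arg1 RELATION arg2 (hypothetical helper type)."""
    arg1: str
    relation: str
    arg2: str


def extract_triples(sentence: str) -> list[Triple]:
    """Extract hypernymy and part-of triples with two hand-written patterns."""
    triples: list[Triple] = []

    # Pattern 1 (assumed): "X is a Y ..." yields "Y HYPERNYM_OF X".
    definition = re.match(r"(?P<hypo>\w+) is an? (?P<hyper>\w+)", sentence)
    if definition is None:
        return triples
    subject = definition.group("hypo").lower()
    triples.append(Triple(definition.group("hyper").lower(), "HYPERNYM_OF", subject))

    # Pattern 2 (assumed): "with A and B" yields "A PART_OF X", "B PART_OF X".
    parts = re.search(r"\bwith (?P<parts>[^,.]+)", sentence)
    if parts is not None:
        for chunk in re.split(r"\s+and\s+", parts.group("parts")):
            # Drop a leading determiner or number, e.g. "4 wheels" -> "wheels".
            head = re.sub(r"^(?:an?|the|\d+)\s+", "", chunk.strip())
            # Very naive singularisation standing in for real lemmatisation.
            if head.endswith("s"):
                head = head[:-1]
            triples.append(Triple(head.lower(), "PART_OF", subject))

    return triples


sentence = ("Car is a vehicle with 4 wheels and an engine, "
            "used for carrying a small number of passengers.")
for triple in extract_triples(sentence):
    print(triple.arg1, triple.relation, triple.arg2)
```

Hand-written patterns like these tend to be brittle, which is in line with the deck's remark that automatic extraction has lower precision.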

  16. Introduction: Information Retrieval. Information retrieval (IR): locating specific information in natural language resources. Approaches are based on the occurrence of words in documents. Distributional similarity metrics: ▶ Cocitation (Small (1973)) ▶ LSA (Deerwester et al. (1990)) ▶ Lin’s (Lin (1998)) ▶ PMI-IR (Turney (2001)) ▶ σ (Kozima and Furugori (1993)) ▶ ...

  22. Introduction: Research Goals. Goals: 1 Use IR metrics to improve IE precision ▶ Adapt distributional metrics to determine word similarity ▶ Wandmacher et al. (2007) and Cederberg and Widdows (2003) used LSA to weight hypernymy triples ▶ What about other semantic relations? ▶ What metrics should be used? ▶ New combined metrics? 2 Help manual evaluation

  23. Approach. [Diagram of the IE system. Recoverable labels: Corpus; Grammars; Extraction of relational triples; Removal of triples with stopwords; Lemmatisation; Additional extraction of triples; Metrics application.]

  25. Experimentation: Set-up. CETEMPúblico² corpus (annotated version): ▶ 28,000 documents ▶ 30,100 unique context words (nouns, verbs and adjectives) ▶ term-document matrix. Triples obtained: ▶ Extracted: 20,308 ▶ Discarded: 5,844 ▶ Inferred: 2,492 ▶ Final triple set: 16,956. ² http://www.linguateca.pt/cetempublico/
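
As a rough illustration of the set-up, the sketch below builds a small term-document matrix and checks the triple counts reported on the slide. The toy documents and the `term_document_matrix` helper are invented for this example, and the slides do not say how the authors built or weighted their matrix. Reading the final set as "extracted minus discarded plus inferred" is an assumption, although it agrees with the reported figures.

```python
from collections import Counter


def term_document_matrix(docs: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Return (vocabulary, matrix), where matrix[t][d] counts term t in document d."""
    vocab = sorted({word for doc in docs for word in doc})
    row_of = {word: row for row, word in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]
    for d, doc in enumerate(docs):
        for word, count in Counter(doc).items():
            matrix[row_of[word]][d] = count
    return vocab, matrix


# Toy, already lemmatised documents (invented for illustration only).
docs = [
    ["car", "vehicle", "wheel"],
    ["car", "engine", "car"],
    ["vehicle", "engine"],
]
vocab, matrix = term_document_matrix(docs)
for word, row in zip(vocab, matrix):
    print(f"{word:8s} {row}")

# Triple counts reported on the slide.
extracted, discarded, inferred, final = 20_308, 5_844, 2_492, 16_956
# Assumed reading: discarded triples are removed and inferred ones are added.
assert extracted - discarded + inferred == final
```

Each row of such a matrix records in which documents a word occurs, which is the kind of information occurrence-based similarity metrics work from.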

  27. Experimentation: Metrics adaptation. Similarity between two documents. For instance, Cocitation, first presented as a similarity metric between scientific papers (Small (1973)): $\mathrm{Cocitation}(d_i, d_j) = \frac{P(d_i \cap d_j)}{P(d_i \cup d_j)}$ (1) ▶ $d_i$, $d_j$ represent two documents
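
Equation (1) can be sketched in a few lines, assuming both probabilities are estimated as relative frequencies over the same collection, so that the normalising constant cancels and the ratio reduces to a count of shared items over a count of all items. The `cocitation` function and the toy sets are hypothetical. Applying it to two words through the sets of documents in which each occurs is one possible word-level adaptation, assumed here because the authors' own adaptation is not shown in this excerpt.

```python
def cocitation(items_i: set, items_j: set) -> float:
    """Ratio of shared items to all items, following the form of equation (1).

    For two papers, the sets would hold the ids of the papers citing each one.
    For two words (assumed adaptation), they hold the ids of the documents
    in which each word occurs.
    """
    union = items_i | items_j
    if not union:
        # Neither item occurs anywhere; define the similarity as 0.
        return 0.0
    return len(items_i & items_j) / len(union)


# Toy occurrence sets (invented): document ids in which each word appears.
docs_with_car = {0, 1}
docs_with_vehicle = {0, 2}
docs_with_engine = {1, 2}

print(cocitation(docs_with_car, docs_with_vehicle))
print(cocitation(docs_with_car, docs_with_car))
```

A score like this could then be attached to a triple such as vehicle HYPERNYM OF car by comparing its two arguments, in the spirit of the stated goal of using IR metrics to improve IE precision.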
