GeoDISCO: Encyclopedic Geographical Discourse in France from the Enlightenment to Wikipedia


  1. GeoDISCO: Encyclopedic Geographical Discourse in France from the Enlightenment to Wikipedia D. Vigier, T. Joliveau, L. Moncla, K. McDonough, A. Brenon ludovic.moncla@liris.cnrs.fr GIR’19

  2. What is this project about?
     The GeoDISCO project asks: how does geographical discourse develop in French encyclopedias between the 18th century and today?
     1. Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers, par une Société de Gens de lettres (1751-1772), edited by Diderot and d’Alembert (https://artfl-project.uchicago.edu/)
     2. Encyclopedia Universalis (2018 digital edition)
     3. French Wikipedia (July 2018)

  3. Overview: three main axes
     1. Named Entity Recognition and Classification (NER) in the EDDA (the Encyclopédie of Diderot and d’Alembert)
        - Improving NER with a linguistic approach
     2. Toponym Disambiguation
        - Building a network of relations between toponyms
     3. Extracting explicit locations in EDDA
        - Extracting and interpreting geographical coordinates from EDDA articles

  4. Improving NER with a linguistic approach
     Corpus analysis using the TXM platform (http://textometrie.ens-lyon.fr/)
     • The goal is to find specific patterns in order to improve the PERDIDO NER rules
     Methodology
     • Identify the most frequent proper nouns in the geography subcorpus, based on the POS tagging (TreeTagger)
     • Manually select the 30 most frequent occurrences for several types of entities: country, city, region, person
     • For each list, compute the co-occurrences, ordered by specificity score
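
The methodology above can be illustrated with a minimal Python sketch that counts which words occur at positions -2 to +2 around a list of target entities. It uses raw counts only, not TXM's specificity score, and the function name, the whitespace tokenization and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

def positional_cooccurrences(tokens, targets, window=2):
    """Count the words found at each offset in [-window, +window] around target tokens."""
    targets = {t.lower() for t in targets}
    counts = {off: Counter() for off in range(-window, window + 1) if off != 0}
    for i, tok in enumerate(tokens):
        if tok.lower() not in targets:
            continue
        for off in counts:
            j = i + off
            if 0 <= j < len(tokens):
                counts[off][tokens[j].lower()] += 1
    return counts

# Hypothetical, pre-tokenized sample text (not taken from the corpus).
tokens = "Lyon , ville de France , sur le Rhône ; Grenoble , ville de France".split()
counts = positional_cooccurrences(tokens, targets=["France"])
for offset in sorted(counts):
    print(offset, counts[offset].most_common(3))
```

Ranking each list by a specificity measure rather than by raw frequency would be the next step toward the table shown on the following slide.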

  5. Improving NER with a linguistic approach
     Most important co-occurrences, by position relative to the entity (table columns: Place [country, city, region] and Person)
     • Position -1: de, en, le / de, à, dans / de, en, le / par, de, selon, sous, suivant
     • Position -2: ville, cour, coutume, France, saint, roi, de, dans, bourg, parlement, prévot, ..., duc, comte, ..., empereur, pape, ..., royaume, rivière, roi, ...
     • Position +1: punctuation mark (all four columns), numeric value, et, I, II, IV, ...
     • Position +2: dans, au, capitale, Sicile, géographie, numeric value, royaume, ..., Valais, Baptiste, ...
     • prepositions, list of nouns, . . .

  6. Improving NER with a linguistic approach

  7. Toponym Disambiguation and Historical Texts: Challenges
     There is no gazetteer for the 18th-century world
     • Modern or historical gazetteers contain lots of noise
     • Typical ranking solutions (e.g. by population) do not apply well in these cases
     • EDDA’s complex structure means that one geography article may refer to more than one place
     • Sometimes it is impossible to match a toponym to a record in any resource
       - articles explicitly refuse to pin down a location for a toponym
       - there is no existing gazetteer record for a place that was nonetheless documented in the past
     We propose to use toponym relations and attributes internal to the corpus itself for toponym resolution

  8. Building the Network
     EDDA contains 20.7 million words in 44 632 entries, among them 14 457 articles classified as ’geographie’
     Nodes
     • get the list of headwords from the article metadata
     • normalize headwords
       - prepositions, punctuation marks, and/or alternate names or spellings
       - e.g. ’Brassaw, ou Gronstat’, ’Adiazzo, Adiazze ou Ajaccio’
     Edges
     • relationship between nodes
     • extract place names with a custom version of the Perdido geoparser
     • a new edge is created between the current node and the node corresponding to each toponym found in the article content
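
As a rough illustration of the node and edge construction described above, the following Python sketch splits headwords into their alternate names and links each article to the headwords it mentions. It is a simplified stand-in: `naive_toponyms` replaces the Perdido geoparser with a capitalized-word heuristic, preposition handling is omitted, and all function names and the two sample articles are hypothetical.

```python
import re
from collections import defaultdict

def normalize_headword(headword):
    """Split a headword on ',' and 'ou' into lowercase name variants."""
    parts = re.split(r"\s*,\s*ou\s+|\s*,\s*|\s+ou\s+", headword)
    return [p.strip(" .;").lower() for p in parts if p.strip(" .;")]

def naive_toponyms(text):
    """Stand-in for a geoparser: returns every capitalized word."""
    return re.findall(r"\b[A-ZÀ-Ý][\w'-]+", text)

def build_edges(articles, find_toponyms=naive_toponyms):
    """articles: {headword: text}. Returns {node: set of nodes cited in its text}."""
    alias_to_node = {}
    for headword in articles:
        names = normalize_headword(headword)
        for name in names:
            # Every variant points to the first name, used as the node id.
            alias_to_node.setdefault(name, names[0])
    edges = defaultdict(set)
    for headword, text in articles.items():
        source = normalize_headword(headword)[0]
        for toponym in find_toponyms(text):
            target = alias_to_node.get(toponym.lower())
            if target and target != source:
                edges[source].add(target)
    return dict(edges)

# Hypothetical miniature corpus.
articles = {
    "Adiazzo, Adiazze ou Ajaccio": "Ville de Corse.",
    "Corse": "Île de la mer Méditerranée, dont Ajaccio est une ville.",
}
edges = build_edges(articles)
print(edges)
```

Using the first variant as the node identifier is an arbitrary choice here; the slides do not say which variant the project keeps.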

  9. In-degree centrality
     Rank  Node       Score
     1     france     0.1130
     2     italie     0.0853
     3     allemagne  0.0814
     4     afrique    0.0481
     5     espagne    0.0462
     6     naples     0.0211
     7     pologne    0.0199
     8     paris      0.0183
     9     océan      0.0161
     10    perse      0.0158
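
For readers unfamiliar with the measure, here is a small Python sketch of in-degree centrality on a directed citation graph. It assumes the common normalization by n - 1 (the number of other nodes); the slides do not state which normalization was used, and the toy graph is invented.

```python
def in_degree_centrality(edges):
    """edges: {source: set(targets)}. Returns {node: in_degree / (n - 1)}."""
    nodes = set(edges)
    for targets in edges.values():
        nodes |= set(targets)
    in_degree = {node: 0 for node in nodes}
    for source, targets in edges.items():
        for target in set(targets):
            if target != source:  # ignore self-references
                in_degree[target] += 1
    scale = 1 / (len(nodes) - 1) if len(nodes) > 1 else 0.0
    return {node: degree * scale for node, degree in in_degree.items()}

# Hypothetical toy graph: each article points to the places it mentions.
edges = {"lyon": {"france", "rhône"}, "grenoble": {"france"}, "paris": {"france"}}
scores = in_degree_centrality(edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

A node scores highly when many other articles mention it, which is consistent with countries dominating the ranking above.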

  10. Betweenness centrality
      Rank  Node              Score
      1     mer méditerranée  0.0373
      2     france            0.0223
      3     allemagne         0.0223
      4     natolie           0.0220
      5     monde             0.0136
      6     italie            0.0131
      7     lycie             0.0129
      8     lycus             0.0126
      9     issus             0.0122
      10    europe            0.0102

  12. Using the Network for Disambiguation
      Our hypothesis is that the quantitative citation network reveals qualitative relations.
      • We compute an ego-centered network
      • We compute the betweenness centrality measure of this ego-network
      • The node with the highest value is selected as the most related
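
The three steps above can be sketched in plain Python as follows. Several choices are assumptions not stated on the slide: the ego network is taken at radius 1 and treated as undirected, betweenness is left unnormalized and computed with Brandes' algorithm, and the ego itself is excluded from the final selection. The toy graph is invented.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm on an unweighted, undirected graph {node: set(neighbors)}."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)   # BFS distance from s
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # each pair was counted twice

def ego_network(edges, ego):
    """Undirected radius-1 ego network: the ego, its neighbors, and the links among them."""
    adj = {}
    for source, targets in edges.items():
        for target in targets:
            adj.setdefault(source, set()).add(target)
            adj.setdefault(target, set()).add(source)
    keep = {ego} | adj.get(ego, set())
    return {v: adj.get(v, set()) & keep for v in keep}

def most_related(edges, ego):
    scores = betweenness(ego_network(edges, ego))
    scores.pop(ego, None)  # assumption: the ego is not a candidate answer
    return max(scores, key=scores.get) if scores else None

# Hypothetical toy citation graph.
edges = {
    "cezimbra": {"portugal", "lisbonne", "océan"},
    "lisbonne": {"portugal", "tage"},
    "océan": {"portugal"},
}
print(most_related(edges, "cezimbra"))
```

In this toy graph the node that links the other neighbors of the ego together is the one returned, which is the intuition behind selecting by betweenness.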

  13. Using the Network for Disambiguation
      83 out of 100 responses are correct
      For a city, the method returns the name of the country to which it belongs
      • aziruth → egypte
      • cezimbra → portugal
      • . . .
      For a country, it returns the name of a neighboring location
      • pérou → egypte
      • vénézuéla → grenade (la nouvelle)
      • . . .
      In 18% of correct answers (15 out of 83), the returned name is not present in the content of the article
      • isaurie → natolie
      • salé → mer méditerranée
      • walcheren → flessingue
      • . . .

  14. Classification of nodes
      city          6 378
      unclassified  5 041
      hydronym      1 193
      country       1 033
      mountain        174

  15. Extracting explicit locations in EDDA
      Some articles of EDDA are explicitly located

  16. Extracting Geographic coordinates in EDDA
      Many kinds of location expressions
      • By absolute geographic coordinates
        - Long. 22. 30. latit. 45. 33
      • By distances to other locations
        - à 5 lieues au midi & au dessous de Lyon, à 15 au nord-ouest de Grenoble, & à 108 au sud-est de Paris.
      • By spatial relations
        - On a line: sur le bord oriental du Rhône
        - Within an area: province de France
        - Adjacent to another entity: bornée à l’occident par le Rhône
      • By logical relations
        - Grenoble en est la capitale
      • . . .

  17. Extracting Geographic coordinates in EDDA
      Problems and constraints
      • Irregular expressions
      • References to different kinds of entities
        - Point: pair of coordinates
        - Area: 4 latitude and longitude references
      • Several pairs of coordinates
        - More than one place in the article
        - Several suppositions of one location according to different sources or authors
      Examples
      - long. 62. 50. lat. 3. 28.
      - Lat. 42. 8. long.. 67. 35.
      - Long. 36. 4. lat. 40. 48. (D. J.)
      - Long. 40. 5. latit. 62. 6.
      - Lat. 14. 20 - 16. 15. long. 58. 30 - 59.
      - Lat. 42 degrés, 20 minutes long. 306 degrés, 50 & quelques minutes.
      - Lat. 37 degrés long. 27 & demi
      - long. 135. 20. lat. mérid.
      - Long. 18 d. 26’. 6". lat. 48 d. 57’. 43". (CONCHITE)
      - entre le 32 & le 41 de long. & le 10 & le 20 de lat
      - à 12 d. de long. & à 33. de latit
      - la long. à 103. 50. & la lat. à 26
      - Long. 110 d. & lat. 46. 45. selon Uluhbeg; & long. 116. & lat. 45. selon Nassiredden.
      - Abulféda lui donne 78 d. 4’. de long. . elle a, selon quelques-uns, 43 d. 30’. de latit. septentrionale. (GIUND)

  18. Looking for textual patterns: comparison of two methods
      Interactive exploration + manual extraction
      • CQL queries (TXM)
        - [word="Longitude"]|[word="Longit"]|[word="Long"]|[word="longitude"]|[word="longit"]|[word="long"]
        - [word="Latitude"]|[word="Latit"]|[word="Lat"]|[word="latitude"]|[word="latit"]|[word="lat"]
      • (-) Laborious and tedious rearrangement of the results
      • (+) Useful knowledge of the different cases found in EDDA
      Automatic annotation of the most frequent recurring patterns
      • (+) Automatic retrieval
      • (-) Misses some specific cases
        - situé entre le 45 & 47 degré de long. & entre le 15 & 23 degré de lat.
        - sous le troisième degré de long et sous le 20e de lat
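
To give an idea of what automatic annotation of the most frequent pattern might look like, here is a hedged Python sketch using a single regular expression. It only covers the "keyword, degrees, optional minutes" form (e.g. `Long. 22. 30. latit. 45. 33`); ranges, forms written with "degrés" or "d.", and hemisphere indications are deliberately left out, and the pattern and function name are illustrative assumptions rather than the project's actual rules.

```python
import re

# Keyword (long/longit/longitude or lat/latit/latitude), optional dots,
# degrees, then optionally ". minutes".
COORD = re.compile(
    r"\b(long|lat)(?:it(?:ude)?)?\.*\s*(\d{1,3})(?:\s*\.\s*(\d{1,2})\b)?",
    re.IGNORECASE,
)

def extract_coordinates(text):
    """Return the first longitude and latitude found, in decimal degrees."""
    found = {}
    for kind, degrees, minutes in COORD.findall(text):
        value = int(degrees) + int(minutes or 0) / 60
        found.setdefault(kind.lower(), value)  # keep the first occurrence only
    return found

print(extract_coordinates("Long. 22. 30. latit. 45. 33"))
print(extract_coordinates("Lat. 42. 8. long.. 67. 35."))
```

Expressions such as "entre le 45 & 47 degré de long." are not matched by this pattern, which mirrors the "specific cases" limitation noted on the slide.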

  19. Georeferencing the place names
      • Still at the very beginning
        - 4702 articles have coordinates (merging the results of the 2 methods)
      • Shortcomings
        - Longitude missing (sometimes)
        - North or South indication missing for latitude (quite often)
      • Prime meridian?
        - Officially in France since Richelieu: the Ferro meridian, in fact located by Delisle at 20° west of the Paris meridian
        - d’Alembert, in the EDDA articles Latitude and Méridien: the authors sometimes use a local meridian
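
The prime meridian question can be made concrete with a short Python sketch that converts a longitude to the Greenwich reference. It assumes the article counts longitude eastward from the Ferro meridian over 0-360°, with Ferro taken as exactly 20° west of Paris and the Paris meridian at 2°20′14.025″ east of Greenwich; as the slide notes, some authors use a local meridian instead, in which case this conversion does not apply. The function and constant names are hypothetical.

```python
# Paris meridian east of Greenwich, in decimal degrees (about 2.3372).
PARIS_EAST_OF_GREENWICH = 2 + 20 / 60 + 14.025 / 3600
# Conventional offset of the Ferro meridian west of Paris.
FERRO_WEST_OF_PARIS = 20.0

def ferro_to_greenwich(longitude_ferro):
    """Convert a longitude counted eastward from Ferro (0-360) to Greenwich (-180..180)."""
    lon = longitude_ferro - FERRO_WEST_OF_PARIS + PARIS_EAST_OF_GREENWICH
    return (lon + 180) % 360 - 180

# "long. 306 degrés, 50 ... minutes", one of the example expressions above.
print(round(ferro_to_greenwich(306 + 50 / 60), 2))
# A longitude of 20 under this convention falls on the Paris meridian.
print(round(ferro_to_greenwich(20), 4))
```

Latitudes need no such shift, but the missing North/South indication mentioned above still has to be resolved separately.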

  20. https://arcg.is/1STjfW

  21. https://arcg.is/1STjfW

  22. Thank you for your attention
      Contact
      • Ludovic Moncla, ludovic.moncla@liris.cnrs.fr
      • Thierry Joliveau, thierry.joliveau@univ-st-etienne.fr
