SLIDE 1

ludovic.moncla@liris.cnrs.fr GIR’19

GeoDISCO: Encyclopedic Geographical Discourse in France from the Enlightenment to Wikipedia

  • D. Vigier, T. Joliveau, L. Moncla, K. McDonough, A. Brenon
SLIDE 2

GeoDISCO GIR’19 – 2/22

What is this project about?

The GeoDISCO project asks: how does geographical discourse develop in French encyclopedias between the 18th century and today?

  • 1. Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers, par une Société de Gens de lettres (1751-1772), edited by Diderot and d’Alembert, abbreviated EDDA below [1]
  • 2. Encyclopædia Universalis (2018 digital edition)
  • 3. French Wikipedia (July 2018)

[1] https://artfl-project.uchicago.edu/
SLIDE 3

Overview

Three main axes

  • 1. Named Entity Recognition and Classification in the EDDA
  • Improving NER with a linguistic approach
  • 2. Toponym Disambiguation
  • Building a network of relations between toponyms
  • 3. Extracting explicit locations in EDDA
  • Extracting and interpreting geographical coordinates from EDDA articles
SLIDE 4

Improving NER with a linguistic approach

Corpus analysis using the TXM platform (http://textometrie.ens-lyon.fr/)

  • The goal is to find specific patterns in order to improve the PERDIDO NER rules

Methodology

  • Identify the most frequent proper nouns in the geography subcorpus, based on the POS tagging (TreeTagger)
  • Manually select the 30 most frequent occurrences for several types of entities: country, city, region, person
  • For each list, compute the co-occurrences ordered by specificity score
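
The co-occurrence step can be sketched as follows. This is only a minimal illustration and not the TXM workflow: it counts raw frequencies at each position around a pivot word instead of computing TXM's specificity score, and the function name, the token list and the pivot set are hypothetical examples.

```python
from collections import Counter, defaultdict

def positional_cooccurrences(tokens, pivots, offsets=(-2, -1, 1, 2)):
    """Count the words found at each offset around every pivot occurrence."""
    counts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok.lower() not in pivots:
            continue
        for off in offsets:
            j = i + off
            if 0 <= j < len(tokens):
                counts[off][tokens[j].lower()] += 1
    return counts

# Hypothetical toy input: a tokenized sentence and a hand-picked list of cities.
tokens = "ville de France , à 5 lieues de Lyon , capitale du Lyonnois".split()
cities = {"lyon", "grenoble"}
cooc = positional_cooccurrences(tokens, cities)
for off in (-2, -1, 1, 2):
    print(off, cooc[off].most_common(3))
```

Ranking by a specificity score rather than raw counts would additionally require the frequency of each word in the whole subcorpus.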
SLIDE 5

Improving NER with a linguistic approach

Most important co-occurrences

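Co-occurrences by position relative to the entity (places: country, city, region; persons)

Position -1
  • country: de, en, le
  • city: de, à, dans
  • region: de, en, le
  • person: par, de, selon, sous, suivant

Position -2
  • country: de, dans, bourg, royaume, rivière, roi, ...
  • city: ville, cour, parlement, prévot, ...
  • region: coutume, France, duc, comte, ...
  • person: saint, roi, empereur, pape, ...

Position +1
  • country: punctuation mark
  • city: punctuation mark, numeric value
  • region: punctuation mark, et
  • person: punctuation mark, I, II, IV, ...

Position +2
  • country: dans, au, capitale, royaume, ...
  • city: numeric value
  • region / person: Sicile, géographie, Valais, Baptiste, ...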

  • prepositions, list of nouns, . . .
SLIDE 6

Improving NER with a linguistic approach

SLIDE 7

Toponym Disambiguation and Historical Texts: Challenges

There is no gazetteer for the 18th-c. world

  • Modern or historical gazetteers contain lots of noise
  • Typical ranking solutions do not apply well in these cases (e.g. population)
  • EDDA’s complex structure means that one geography article may refer to more than one place

  • Sometimes it is impossible to match a record in any resource to a toponym
  • articles explicitly refuse to pin down a location for a toponym
  • there is no existing gazetteer record for a place that was nonetheless documented in the past

We propose to use toponym relations and attributes internal to the corpus of documents itself for toponym resolution
SLIDE 8

Building the Network

EDDA contains 20.7 million words in 44 632 entries, among them 14 457 articles classified as ’geographie’

Nodes

  • get the list of headwords from the article metadata
  • normalize headwords
  • handle prepositions, punctuation marks, and alternate names or spellings
  • e.g. ’Brassaw, ou Gronstat’, ’Adiazzo, Adiazze ou Ajaccio’

Edges

  • relationship between nodes
  • extract place names with a custom version of the Perdido geoparser
  • a new edge is created between the current node and the node corresponding to each toponym found in the article content
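
The node and edge construction can be sketched as below. This is a minimal sketch under stated assumptions: the splitting rule for headwords only covers the comma and "ou" cases shown in the examples, `naive_toponyms` is a hypothetical stand-in for the Perdido geoparser (it simply treats capitalized words as place names), and the three toy articles are invented for illustration.

```python
import re
from collections import defaultdict

def normalize_headword(headword):
    """Split a headword into lowercase variants, e.g. 'Brassaw, ou Gronstat'."""
    parts = re.split(r"\s*(?:,\s*ou\b|,|\bou\b)\s*", headword)
    return [p.strip(" .;").lower() for p in parts if p.strip(" .;")]

def build_network(articles, extract_toponyms):
    """articles: {headword: text}; extract_toponyms: callable text -> place names."""
    node_of = {}                       # variant spelling -> node identifier
    for headword in articles:
        variants = normalize_headword(headword)
        for v in variants:
            node_of.setdefault(v, variants[0])
    edges = defaultdict(set)           # directed: citing node -> cited nodes
    for headword, text in articles.items():
        variants = normalize_headword(headword)
        if not variants:
            continue
        source = node_of[variants[0]]
        for name in extract_toponyms(text):
            target = node_of.get(name.lower())
            if target is not None and target != source:
                edges[source].add(target)
    return node_of, edges

def naive_toponyms(text):
    # Stand-in for a geoparser: every capitalized word counts as a place name.
    return re.findall(r"\b[A-ZÀ-ÖØ-Ý][\w-]+", text)

# Invented toy articles, for illustration only.
articles = {
    "Brassaw, ou Gronstat": "ville de Transylvanie",
    "Transylvanie": "province de Hongrie",
    "Hongrie": "royaume d'Europe",
}
node_of, edges = build_network(articles, naive_toponyms)
print(dict(edges))
```

Toponyms that match no headword (here "Europe") create no edge, since only headwords become nodes.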

SLIDE 9

In-degree centrality

Rank  Node       Score
1     france     0.1130
2     italie     0.0853
3     allemagne  0.0814
4     afrique    0.0481
5     espagne    0.0462
6     naples     0.0211
7     pologne    0.0199
8     paris      0.0183
9     océan      0.0161
10    perse      0.0158
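
For reference, in-degree centrality can be computed as in the sketch below. The slides do not state which normalization was used; dividing by n - 1 (the maximum possible in-degree) is an assumption here, and the small graph is an invented example.

```python
def in_degree_centrality(edges, nodes):
    """Fraction of the other nodes that point to each node (normalized by n - 1)."""
    indeg = {node: 0 for node in nodes}
    for source, targets in edges.items():
        for target in targets:
            indeg[target] += 1
    n = len(nodes)
    scale = 1 / (n - 1) if n > 1 else 0.0
    return {node: d * scale for node, d in indeg.items()}

# Invented toy graph: citing node -> cited nodes.
toy_edges = {"paris": {"france"}, "lyon": {"france"}, "naples": {"italie"}}
toy_nodes = {"paris", "lyon", "naples", "france", "italie"}
scores = in_degree_centrality(toy_edges, toy_nodes)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[:2])
```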

SLIDE 10

Betweenness centrality

Rank  Node              Score
1     mer méditerranée  0.0373
2     france            0.0223
3     allemagne         0.0223
4     natolie           0.0220
5     monde             0.0136
6     italie            0.0131
7     lycie             0.0129
8     lycus             0.0126
9     issus             0.0122
10    europe            0.0102

SLIDE 11

SLIDE 12

Using the Network for Disambiguation

Our hypothesis is that the quantitative citation network reveals qualitative relations.

  • We compute an ego-centered network
  • We compute the betweenness centrality measure of this ego-network
  • The node with the highest value is selected as the most related.
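
The three steps can be sketched as follows. Several choices here are assumptions, since the slides do not specify them: the ego network is taken as the radius-1 neighbourhood, edges are treated as undirected, betweenness is computed with Brandes' algorithm without normalization, and the ego itself is excluded from the ranking. The toy edges are invented.

```python
from collections import deque

def ego_network(edges, ego):
    """Undirected radius-1 neighbourhood of `ego` in a directed edge dict."""
    undirected = {}
    for u, targets in edges.items():
        for v in targets:
            undirected.setdefault(u, set()).add(v)
            undirected.setdefault(v, set()).add(u)
    keep = {ego} | undirected.get(ego, set())
    return {u: undirected.get(u, set()) & keep for u in keep}

def betweenness(adj):
    """Brandes' algorithm for unweighted graphs (each pair is counted twice)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                       # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                       # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

def most_related(edges, ego):
    scores = betweenness(ego_network(edges, ego))
    scores.pop(ego, None)                  # assumption: the ego is not a candidate
    return max(scores, key=scores.get) if scores else None

# Invented toy edges: citing node -> cited nodes.
toy = {
    "cezimbra": {"portugal", "lisbonne", "tage"},
    "lisbonne": {"portugal"},
    "tage": {"portugal"},
}
print(most_related(toy, "cezimbra"))
```

In this toy graph "portugal" is the only neighbour lying on a shortest path between two other nodes, so it is returned.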
SLIDE 13

Using the Network for Disambiguation

83 out of 100 responses are correct

  • For a city, the method returns the name of the country to which it belongs
  • aziruth → egypte
  • cezimbra → portugal
  • ...
  • For a country, it returns the name of a neighboring location
  • pérou → egypte
  • vénézuéla → grenade (la nouvelle)
  • ...
  • In 18% of correct answers (15 out of 83), the returned name is not present in the content of the article
  • isaurie → natolie
  • salé → mer méditerranée
  • walcheren → flessingue
  • ...

SLIDE 14

Classification of nodes

Class         Nodes
city          6 378
unclassified  5 041
hydronym      1 193
country       1 033
mountain        174

SLIDE 15

Extracting explicit locations in EDDA

Some articles of EDDA are explicitly located

SLIDE 16

Extracting Geographic coordinates in EDDA

Many kinds of location expressions

  • By absolute geographic coordinates
  • Long. 22. 30. latit. 45. 33
  • By distances to other locations
  • à 5 lieues au midi & au-dessous de Lyon, à 15 au nord-ouest de Grenoble, & à 108 au sud-est de Paris.

  • By spatial relations
  • On a line: sur le bord oriental du Rhône
  • Within an area: province de France
  • Adjacent to another entity: bornée à l’occident par le Rhône
  • By logical relations
  • Grenoble en est la capitale
  • . . .
SLIDE 17

Extracting Geographic coordinates in EDDA

Problems and constraints

  • Irregular expressions
  • References to different kinds of entities
  • Point: pair of coordinates
  • Area: 4 latitude and longitude references
  • Several pairs of coordinates
  • More than one place in the article
  • Several hypotheses for one location, according to different sources or authors

Examples

  • long. 62. 50. lat. 3. 28.
  • Lat. 42. 8. long.. 67. 35.
  • Long. 36. 4. lat. 40. 48. (D. J.)
  • Long. 40. 5. latit. 62. 6.
  • Lat. 14. 20 - 16. 15. long. 58. 30 - 59.
  • Lat. 42 degrés, 20 minutes long. 306 degrés, 50 & quelques minutes.

  • Lat. 37 degrés long. 27 & demi
  • long. 135. 20. lat. mérid.
  • Long. 18 d. 26 ’. 6". lat. 48 d. 57 ’. 43". CONCHITE
  • entre le 32 & le 41 de long. & le 10 & le 20 de lat
  • à 12 d. de long . & à 33. de latit
  • la long. à 103. 50. & la lat. à 26
  • Long. 110 d. & lat. 46. 45. selon Uluhbeg; & long . 116. & lat. 45. selon Nassiredden.
  • Abulféda lui donne 78 d. 4 ’. de long. . elle a, selon quelques - uns, 43 d. 30 ’. de latit. septentrionale. (GIUND)
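
As an illustration of how the most regular of these forms could be handled, here is a minimal sketch that parses the dotted notation ("Long. 22. 30. latit. 45. 33") into decimal degrees. The pattern and function name are hypothetical, not the project's extraction rules. It deliberately ignores the other forms listed above (ranges, "d." and prime marks, spelled-out numbers, "& demi", several hypotheses), and it can produce false positives on phrases such as "long 3 lieues".

```python
import re

# Keyword (long., longit., longitude, lat., latit., latitude), then degrees
# and optional minutes separated by a dot.
COORD = re.compile(
    r"\b(?P<kind>long|lat)(?:it(?:ude)?)?\.*\s*"
    r"(?P<deg>\d{1,3})"
    r"(?:\s*\.\s*(?P<min>\d{1,2})\b)?",
    re.IGNORECASE,
)

def parse_coordinates(text):
    """Return the first longitude and latitude found, in decimal degrees."""
    found = {}
    for m in COORD.finditer(text):
        kind = m.group("kind").lower()
        value = int(m.group("deg")) + int(m.group("min") or 0) / 60
        found.setdefault(kind, value)   # keep only the first value of each kind
    return found

for example in ("Long. 22. 30. latit. 45. 33",
                "Lat. 42. 8. long.. 67. 35.",
                "long. 135. 20. lat. mérid."):
    print(example, "->", parse_coordinates(example))
```

The last example yields a longitude only, because no number follows "lat."; the hemisphere indication "mérid." is not interpreted.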

SLIDE 18

Looking for textual patterns

Comparison of two methods

Interactive exploration + manual extraction

  • CQL Queries (TXM)
  • [word="Longitude"]|[word="Longit"]|[word="Long"]|[word="longitude"]|[word="longit"]|[word="long"]
  • [word="Latitude"]|[word="Latit"]|[word="Lat"]|[word="latitude"]|[word="latit"]|[word="lat"]
  • Laborious and tedious rearrangement of the results
  • Yields useful knowledge of the different cases found in EDDA

Automatic annotation of the most frequent recurring patterns

  • Automatic retrieval
  • Missing some specific cases
  • situé entre le 45 & 47 degré de long. & entre le 15 & 23 degré de lat.
  • sous le troisième degré de long et sous le 20e de lat
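
A missed case such as the "entre le ... & ..." range could be covered by an extra pattern. The sketch below is hypothetical and only handles ranges written with digits; the second missed example, with the spelled-out ordinal "troisième", would still need a number-word lexicon.

```python
import re

# "le 45 & 47 degré de long." or "le 32 & le 41 de long."
RANGE = re.compile(
    r"\ble\s+(\d{1,3})\s*&\s*(?:le\s+)?(\d{1,3})\s+(?:degr[ée]s?\s+)?de\s+(long|lat)",
    re.IGNORECASE,
)

def parse_ranges(text):
    """Return {'long': (low, high), 'lat': (low, high)} for digit-based ranges."""
    return {kind.lower(): (int(low), int(high))
            for low, high, kind in RANGE.findall(text)}

print(parse_ranges("situé entre le 45 & 47 degré de long. & entre le 15 & 23 degré de lat."))
print(parse_ranges("entre le 32 & le 41 de long. & le 10 & le 20 de lat"))
```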
SLIDE 19

Georeferencing the place names

  • Still at the very beginning
  • 4702 articles have coordinates (merging the results of the 2 methods)
  • Shortcomings
  • Longitude missing (sometimes)
  • North or South precision missing for latitude (quite often)
  • Prime meridian?
  • Officially in France since Richelieu: Ferro Meridian
  • In practice placed by Delisle at 20° west of the Paris Meridian
  • d’Alembert in the articles Latitude and Méridien of EDDA: the authors sometimes use a local meridian
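
Under the Delisle convention, a Ferro-based longitude can be converted to a Greenwich-based one as sketched below. The assumptions are that the article counts longitude eastward from Ferro on a 0 to 360 scale, and that the Paris meridian lies 2°20′14.025″ east of Greenwich; an article using a local meridian would need a different offset. The constant and function names are illustrative.

```python
PARIS_EAST_OF_GREENWICH = 2 + 20 / 60 + 14.025 / 3600   # degrees
FERRO_WEST_OF_PARIS = 20.0                               # Delisle's convention

def ferro_to_greenwich(lon_ferro):
    """Eastward longitude from Ferro -> Greenwich longitude in [-180, 180)."""
    lon = lon_ferro - FERRO_WEST_OF_PARIS + PARIS_EAST_OF_GREENWICH
    return (lon + 180) % 360 - 180

# "Long. 22. 30." read as 22°30' east of Ferro
print(round(ferro_to_greenwich(22 + 30 / 60), 3))
# "long. 306 degrés, 50 ... minutes" lands west of Greenwich
print(round(ferro_to_greenwich(306 + 50 / 60), 3))
```

With these assumptions, 22°30′ from Ferro corresponds to roughly 4.84° E of Greenwich.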

SLIDE 20

https://arcg.is/1STjfW

SLIDE 21

https://arcg.is/1STjfW

SLIDE 22

Thank you for your attention

CONTACT

  • Ludovic Moncla: ludovic.moncla@liris.cnrs.fr
  • Thierry Joliveau: thierry.joliveau@univ-st-etienne.fr