UPEM geocoding and clustering methods applied to EUPRO FP3




SLIDE 1

RISIS / Working with geographical data

UPEM geocoding and clustering methods

applied to EUPRO FP3 subdataset

Lionel Villard, Michel Revollo 10/09/2015

SLIDE 2

Main goals

Analyzing the geographical distribution of FP3 addresses and measuring aggregation effects by identifying the geographical spaces where a high density of activity takes place.

Goals / Sources / Geocoding / Filtering / Clustering / Boundaries / Naming / Further challenges

SLIDE 3

1/ Selection of attributes, cleaning step and external data

The addresses were already clean; no further treatment was needed.

Addition of external data:
• Two-letter country code and English country name (ISO 3166-1 alpha-2 standard)


SLIDE 4

We chose to use two different attributes:
• sAddress_orig: complete addresses, possibly including building names, postal codes, cities and countries
  • 19 710 objects
  • 5 without an address (excluded)
  • 4 % with only a country in the address (excluded)
• sCity and ISO country names:
  • 95,8 % with a city name

We also tried using the postal code, but it was not accurate with the batchgeocode geocoding engine.


SLIDE 5

2/ Automatic geocoding step

Automatic retrieval of the results of the batch geocode web application, in two steps:
• Complete addresses
• Cities with ISO country names

Returned information:
• Cleaned address
• Longitude and latitude coordinates
• Accuracy of the coordinates


SLIDE 6

Examples of results for addresses

Examples of results for cities


SLIDE 7

3/ Filtering accuracies and sources

Results from geocoded addresses are prioritized over results from geocoded cities (better precision, e.g. building level).

Accuracy  % Geocoded Addr  Accuracy label
1         0%               Country level
2         0%               Region (state, province, prefecture, etc.) level
3         51%              Sub-region (county, municipality, etc.) level
4         0%               Town (city, village) level
5         13%              Post code (zip code) level
6         0%               Street level
7         0%               Intersection level
8         6%               Address level
9         37%              Premise (building name, property name, shopping center, etc.) level
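The prioritization rule can be sketched as a simple fallback: take the address-level result when one exists, otherwise the city-level result. This is a minimal illustration only; the `GeoResult` type, the `pick_result` function and the sample coordinates are hypothetical and not taken from the slides, which do not say whether an accuracy threshold was also applied.

```python
from typing import NamedTuple, Optional


class GeoResult(NamedTuple):
    lat: float
    lon: float
    accuracy: int  # 1 (country level) to 9 (premise level), as in the table above
    source: str    # "addresses" or "cities"


def pick_result(address_hit: Optional[GeoResult],
                city_hit: Optional[GeoResult]) -> Optional[GeoResult]:
    """Prefer the full-address result; fall back to the city result."""
    if address_hit is not None:
        return address_hit
    return city_hit


# Illustrative values only (not from the dataset).
addr = GeoResult(48.84, 2.59, 9, "addresses")
city = GeoResult(48.85, 2.58, 3, "cities")
best = pick_result(addr, city)        # address result wins
fallback = pick_result(None, city)    # no address result: city is used
```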

SLIDE 8

Source     Accuracy  Nb Geocoded Addr  % Geoloc By Accuracy
addresses  3         8018              47%
addresses  5         2469              15%
addresses  8         1106              7%
addresses  9         5390              32%
Total                16983             100%
cities     3         1445              96%
cities     5         19                1%
cities     8         2                 0%
cities     9         37                2%
Total                1503              100%

Total geocoded addresses: (16983 + 1503) / 19715 = 93,8%

Sources of addresses, accuracies and geocoded addresses
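The shares and the overall geocoding rate can be recomputed from the counts on this slide. The short sketch below only redoes that arithmetic; the variable and function names are illustrative.

```python
# Counts per accuracy level, copied from the slide.
address_counts = {3: 8018, 5: 2469, 8: 1106, 9: 5390}
city_counts = {3: 1445, 5: 19, 8: 2, 9: 37}
total_records = 19715


def shares(counts):
    """Percentage of each accuracy level, rounded to the nearest integer."""
    total = sum(counts.values())
    return {acc: round(100 * n / total) for acc, n in counts.items()}


geocoded = sum(address_counts.values()) + sum(city_counts.values())
rate = 100 * geocoded / total_records

print(shares(address_counts))  # {3: 47, 5: 15, 8: 7, 9: 32}
print(shares(city_counts))     # {3: 96, 5: 1, 8: 0, 9: 2}
print(f"{rate:.1f}%")          # 93.8%
```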


SLIDE 9

Top  iso_ctry_code_alpha2  TotalNbAddr  Geocoded Addr  % Geocoded
1    FR                    3174         2989           94,2%
2    DE                    2966         2800           94,4%
3    GB                    2610         2518           96,5%
4    IT                    2036         1957           96,1%
5    ES                    1587         1460           92,0%
6    NL                    1370         1291           94,2%
7    BE                    1023         920            89,9%
8    GR                    923          857            92,8%
9    DK                    764          719            94,1%
10   PT                    652          613            94,0%

Top 10 : geocoded addresses per country


SLIDE 10

4 / Clustering step

A method based on a combination of two sequential approaches.

1 / Identification of the initial clusters with a density-based algorithm (DBSCAN, 1996) that is able to identify the areas where activities are concentrated. The clusters are defined by two parameters fixed before the calculation: all points of a cluster are surrounded by at least X points in a circle with a diameter of Y km.

Question addressed: where are the areas in which activity is the most intense?
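A minimal pure-Python sketch of DBSCAN on geographic points may help make the two parameters concrete. It is not the authors' implementation. It assumes that Y is used as the neighbourhood radius `eps_km` (the usual DBSCAN convention; halve it if the diameter is meant literally), that the X threshold `min_pts` counts the point itself, and that distances are great-circle (haversine) distances. As in standard DBSCAN, only core points must satisfy the density condition; border points are attached to a neighbouring cluster. The sample coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force scan over all points; the result includes point i itself.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: attach, do not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: expand
    return labels


# Four nearby points and one distant point (illustrative coordinates).
pts = [(48.85, 2.35), (48.86, 2.34), (48.84, 2.36), (48.87, 2.33), (41.90, 12.50)]
labels = dbscan(pts, eps_km=25, min_pts=3)
```

The brute-force neighbour search is quadratic in the number of points, which is acceptable for a sketch; a spatial index would be the natural choice for larger datasets.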


SLIDE 11

2 / In a second step, we compare two different dimensions of the relation between the initial clusters:

2.1 How intense are the relations between the initial clusters (less than 100 km between the centroids)? RI / Relative Interconnectivity

2.2 Will the final cluster have a collaboration profile similar to that of the two initial clusters taken separately (to avoid large variations in the density of links in the final cluster)? RC / Relative Closeness

(Not relevant in our prototype: no relations between addresses.)
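Only the 100 km centroid criterion is stated explicitly, so the sketch below is limited to it: it lists pairs of initial clusters whose centroids are less than 100 km apart, as candidates for comparison. RI and RC are not computed. The centroid is assumed here to be the arithmetic mean of latitudes and longitudes, and all names and coordinates are illustrative.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def centroid(points):
    """Mean latitude and longitude (an assumption; fine for compact clusters)."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def close_pairs(clusters, max_km=100.0):
    """clusters: dict id -> list of (lat, lon).

    Returns the pairs of cluster ids whose centroids are less than max_km apart.
    """
    cents = {cid: centroid(pts) for cid, pts in clusters.items()}
    return [(a, b) for a, b in combinations(sorted(cents), 2)
            if haversine_km(cents[a], cents[b]) < max_km]


# Illustrative clusters: 1 and 2 are about 58 km apart, 3 is far away.
clusters = {
    1: [(48.85, 2.35), (48.80, 2.13)],
    2: [(48.40, 2.70)],
    3: [(41.90, 12.50)],
}
pairs = close_pairs(clusters)
```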


SLIDE 12

5 / Drawing cluster boundaries

Main goal: convert points grouped by a unique cluster key into areas delimited by boundaries.

Using the Minimum Convex Polygon (MCP, or convex hull) tool of the Geospatial Modelling Environment software (Hawthorne Beyer, 2014).
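The slides rely on the Geospatial Modelling Environment for this step. As an illustration of what a convex hull per cluster key involves, here is a small sketch using Andrew's monotone chain algorithm, a swapped-in standard technique rather than the tool's own code. It treats coordinates as planar (x, y) pairs, which is an approximation for longitude and latitude; the function names are hypothetical.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def hulls_by_cluster(labelled_points):
    """labelled_points: iterable of (cluster_key, (x, y)); returns key -> hull."""
    groups = {}
    for key, pt in labelled_points:
        groups.setdefault(key, []).append(pt)
    return {key: convex_hull(pts) for key, pts in groups.items()}


# Illustrative input: one cluster with an interior point, one with two points.
data = [("A", (0, 0)), ("A", (2, 0)), ("A", (2, 2)), ("A", (0, 2)), ("A", (1, 1)),
        ("B", (5, 5)), ("B", (6, 6))]
hulls = hulls_by_cluster(data)
```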

SLIDE 13

6 / Naming step

Main goals:
• finding a relevant name for each cluster (a readable and easily understandable name that does not depend on the data)
• identifying the core cities of the clusters

Method: geographical intersection of two layers
• Populated Places: layer of points for cities produced by the Natural Earth project (Fourth Edition, Oct. 2009-2012, mainly members of the North American Cartographic Information Society); it contains many capitals, major cities and towns, plus a sampling of smaller towns in sparsely inhabited regions
• Clusters' shapes: layer of shapes for clusters


SLIDE 14

Figure panels: all 7323 cities with population (2012); all cluster shapes; selected cities with population inside cluster shapes.

Geoprocessing : intersection
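The intersection of the city points with the cluster shapes amounts to a point-in-polygon test. The sketch below uses the classic ray-casting (even-odd) rule on planar coordinates; it is an illustration under that assumption, not the geoprocessing tool used by the authors, and the data structures and sample values are hypothetical.

```python
def point_in_polygon(pt, polygon):
    """Ray casting (even-odd rule). polygon: list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def cities_in_clusters(cities, cluster_shapes):
    """cities: list of (name, population, (x, y)); cluster_shapes: id -> polygon.

    Returns id -> list of (name, population) for the cities inside each shape.
    """
    selected = {cid: [] for cid in cluster_shapes}
    for name, population, xy in cities:
        for cid, polygon in cluster_shapes.items():
            if point_in_polygon(xy, polygon):
                selected[cid].append((name, population))
    return selected


# Placeholder shapes and cities (not real data).
shapes = {1: [(0, 0), (4, 0), (4, 4), (0, 4)]}
cities = [("City A", 1000, (1, 1)), ("City B", 500, (5, 5))]
selected = cities_in_clusters(cities, shapes)
```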


SLIDE 15

Building the cluster's name by a popularity criterion: the names of the core cities are ordered by population.
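The naming rule can be sketched as a sort by population followed by a join. The " / " separator matches the cluster names shown in this deck; the `max_names` cap and the population figures below are assumptions made for the example, since the slides do not say how many core cities are kept.

```python
def cluster_name(core_cities, max_names=3, sep=" / "):
    """core_cities: list of (name, population); most populous city first."""
    ranked = sorted(core_cities, key=lambda c: c[1], reverse=True)
    return sep.join(name for name, _ in ranked[:max_names])


# Placeholder populations, used only to fix the ordering.
name = cluster_name([("Versailles", 100), ("Paris", 1000)])
print(name)  # Paris / Versailles
```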

IdClusterD  ClustAddr  ClustName
1           1034       Athens / Piraievs
2           362        Lisbon
3           77         Valencia
4           466        Madrid
5           120        Thessaloniki
6           197        Barcelona
7           562        Rome / Vatican City
8           121        Toulouse
9           112        Montpellier
10          75         Pisa
11          97         Florence
12          88         Genoa
13          82         Bologna
14          150        Turin
15          159        Grenoble
16          308        Milan
17          130        Lyon
18          272        Munich
19          79         Vienna
20          2552       Paris / Versailles

Examples of cluster names

IdClusterD  ClustAddr  ClustName                     stOrg                                                           NbOrgAdd  Pc
42          1662       Kobenhavn / Malmo / Roskilde  Technical University of Denmark - Danmarks Tekniske             97        5,84%
42          1662       Kobenhavn / Malmo / Roskilde  University of Copenhagen - Koebenhavns Universitet (KU)         91        5,48%
25          1110       Brussels / Namur              Katholieke Universiteit Leuven                                  108       9,73%
25          1110       Brussels / Namur              Universite catholique de Louvain                                73        6,58%
1           1034       Athens / Piraievs             National Technical University of Athens (NTUA)                  87        8,41%
7           562        Rome / Vatican City           Universitá di Roma La Sapienza, University of Rome La Sapienza  40        7,12%
30          531        Essen / Wuppertal             Ruhr-Universität Bochum                                         29        5,46%
4           466        Madrid                        UPM Universidad Politecnica de Madrid/Madrid Polytechnical      55        11,80%
4           466        Madrid                        CSIC - Consejo Superior de Investigaciones Cientificas/Higher   52        11,16%
4           466        Madrid                        UCM Universidad Complutense de Madrid                           49        10,52%

Main organisations in proportion of addresses in the clusters


SLIDE 16

See the maps! Two thresholds have been tested (minimal number of addresses within 25 km to begin a cluster): 75 addresses and 100 addresses.

Proportion of addresses inside and outside clusters

                 Threshold 100     Threshold 75
Clust (inside)   9298   50,3%      10446  56,5%
Hclut (outside)  9187   49,7%      8039   43,5%
Total            18485             18485


SLIDE 17

7 / Further work

• Quality check: are there differences (distances) between geocoded cities and geocoded addresses?
• Analytical dimensions: combining this geographical information with other attributes (temporal, sectoral...) to analyse the geographical dynamics of FP projects.
• Merging close clusters: with relations between addresses, we would be able to compare close clusters and merge them if they have similar characteristics (in terms of relations).
• Generalisation: applying this process to all FPs.
