Variability of country names and identifiers in datasets Reconciling practical and cultural perspectives International Cartographic Conference, Dresden Laura Kostanski | Sara Jane Farmer | Rob Atkinson August 2013 GOVERNMENT AND COMMERCIAL


slide-1
SLIDE 1

International Cartographic Conference, Dresden

Variability of country names and identifiers in datasets – Reconciling practical and cultural perspectives

GOVERNMENT AND COMMERCIAL SERVICES THEME

Laura Kostanski | Sara‐Jane Farmer | Rob Atkinson, August 2013

slide-2
SLIDE 2

Today’s Presentation

  • Overview
  • Cultural Reasons for Multiple Country Names
  • Impact of Cultural Reasons
  • Multiple Country Name Datasets
  • Reconciling Information
  • Spatial Identifier Reference Framework (SIRF) Approach
slide-3
SLIDE 3

Overview

  • There are multiple country name datasets in use
  • e.g. ISO 3166, UNSTATS, Alexandria Digital Library, CIA World Factbook, UN‐FAO
  • Multiple stakeholders in creation and use of data using these names
  • e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups.
  • Time spent accessing and reconciling data is costly and delays production of results from analysis

  • The same issues apply to most, perhaps all, identifiers of spatial objects
  • Preview of how we might tackle this problem
slide-4
SLIDE 4

Context

  • CSIRO. UNSDI Gazetteer for Social Protection in Indonesia
slide-5
SLIDE 5

Data Analysis

Utopia Way Inc. investigated files in the data.un.org dataset. …

Country names were discovered in multiple fields, such as:

  • country of birth,
  • country of citizenship,
  • country or area,
  • country or territory,
  • country or territory of asylum or residence,
  • country or territory of origin,
  • reference area.

The investigation identified significant issues with country name alignments and mismatches. An automated matching process was set up to explore the extent of the issue. In all, 21,195,188 rows of data were analysed.
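The slides do not describe how the automated matching worked. As a minimal sketch of the idea, assuming a plain exact-match audit against one reference list, the following counts how many rows of a country field match and which raw values do not. The reference set, the function name and the sample rows are all hypothetical.

```python
from collections import Counter

# Hypothetical reference list; a real audit would load the full name list
# of whichever standard the data is supposed to follow.
REFERENCE_NAMES = {"Yemen", "Brunei Darussalam", "Côte d'Ivoire"}


def audit_country_field(values, reference=REFERENCE_NAMES):
    """Split raw field values into exact matches and mismatches, with row counts."""
    matched, unmatched = Counter(), Counter()
    for value in values:
        if value in reference:
            matched[value] += 1
        else:
            unmatched[value] += 1
    return matched, unmatched


# Illustrative rows, not taken from data.un.org.
rows = ["Yemen", "YEMEN", "Brunei", "Yemen ", "Côte d'Ivoire"]
matched, unmatched = audit_country_field(rows)
print(sum(matched.values()), "rows matched")
for name, count in unmatched.most_common():
    print(repr(name), count)
```

Printing the unmatched values with `repr` makes trailing spaces visible, which is useful because several of the mismatches reported in the analysis are whitespace or capitalisation variants.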

slide-6
SLIDE 6

Common “Errors”

Index error: Examples

  • Withdrawn countries with no ISO3166 code: “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People's Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
  • Abbreviation: “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
  • Added markers: “+” added to the end of region names, to differentiate them from country names. “MDG_” added to region names, e.g. “MDG_Southern Asia”.
  • Capitalisation: “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”.
  • Brackets “()” or “[]” instead of commas: “Virgin Islands (British)” for “British Virgin Islands”.
  • Standards confusion: The ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
  • Use of familiar names: Brunei, Ivory Coast, China, Libya
  • Issues with character translation: Cote d'Ivoire, Åland Islands, Curaçao, Réunion
  • Misspellings: Double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
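Several of these error classes are mechanical and can be reduced before any matching is attempted. The sketch below is one possible way to do that, not the process used in the analysis: it builds a comparison key by removing the added markers, expanding the listed abbreviations, rewriting brackets as commas, stripping diacritics and ignoring case and extra spaces. The function name `match_key` and the sample inputs are assumptions made for illustration.

```python
import re
import unicodedata

# Abbreviations named on the slide, expanded before comparison.
ABBREVIATIONS = [
    (r"\bRep\.", "Republic"),
    (r"\bSt\.", "Saint"),
    (r"\bIsds\b", "Islands"),
    (r"\bIs\.", "Island"),
    (r"\s*&\s*", " and "),
]


def match_key(name: str) -> str:
    """Reduce a raw country or region label to a comparison key."""
    name = name.strip()
    name = re.sub(r"^MDG_", "", name)  # marker prefixed to region names
    name = re.sub(r"\+$", "", name)    # marker appended to region names
    for pattern, full in ABBREVIATIONS:
        name = re.sub(pattern, full, name)
    # "Virgin Islands (British)" -> "Virgin Islands, British"
    name = re.sub(r"\s*[\(\[]\s*(.+?)\s*[\)\]]", r", \1", name)
    # Decompose accented characters and drop the combining marks.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Collapse runs of whitespace and ignore case.
    return " ".join(name.split()).casefold()


print(match_key("Virgin Islands (British)"))
print(match_key("MDG_Southern Asia"))
```

Two values refer to the same label when their keys are equal. Familiar names (“Brunei”, “Ivory Coast”), withdrawn countries and genuine misspellings such as “South Asia” are not fixed by this kind of normalisation; they would still need an explicit alias table.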

slide-7
SLIDE 7

Long names, short names

slide-8
SLIDE 8

Organisation: Name of Data Set

  • United Nations Statistics Division: Country and Region Codes for Statistical Use
  • Working Group on Country Names, United Nations Group of Experts on Geographic Names: List of Country Names
  • Terminology Section, Department for General Assembly and Conference Management: Multilingual Terminology Database (UNTERM)
  • International Organization for Standardization (ISO): ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3)
  • Food and Agriculture Organisation of the United Nations: Global Administrative Unit Layers (GAUL)
  • United Nations Geospatial Information Working Group (UNGIWG): Second Administrative Level Boundaries (SALB)
  • National Geospatial Intelligence Agency: Federal Information Processing Standard (FIPS) 10‐4: Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
  • NATO: Standards Agreement (STANAG) 1059

Data sets providing country names

slide-9
SLIDE 9

Two Aspects of Country Name Datasets

1: Development of datasets

Why is there a proliferation of country name sources?

  • Cultural issues
  • Development practices

2: Usage

How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) assist in reducing the ambiguity associated with multiple, heterogeneous country name sources?

  • Can we do better? What do we need to do it?
slide-10
SLIDE 10

Cultural Issues

  • Toponyms provide communities with identity (Toponymic Identity is both reflected and reinforced)
  • Country names are the highest‐order toponyms
  • Problems are similar at lower levels, compounded by scale (size of problem) and higher rates of change (e.g. electoral boundaries, urban growth)

slide-11
SLIDE 11

Endonym/Exonym

Above and beyond associations with an individual’s attachment to the Endonym of their country, there are often multiple Exonyms used by other languages.

e.g. Deutschland = Germany or Allemagne
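In data terms, an endonym and its exonyms are different labels for one identifier. A minimal sketch of that relationship, assuming a hypothetical lookup table keyed by language tag and name and using the ISO 3166 alpha-2 code as the shared identifier:

```python
# Hypothetical name table: every label, endonym or exonym, points at one identifier.
NAME_TO_CODE = {
    ("de", "Deutschland"): "DE",  # endonym
    ("en", "Germany"): "DE",      # English exonym
    ("fr", "Allemagne"): "DE",    # French exonym
}


def names_for(code):
    """All known labels for an identifier, in alphabetical order."""
    return sorted(name for (_lang, name), c in NAME_TO_CODE.items() if c == code)


def same_country(a, b):
    """True when both labels are known and resolve to the same identifier."""
    codes = {name: code for (_lang, name), code in NAME_TO_CODE.items()}
    return a in codes and codes.get(a) == codes.get(b)


print(names_for("DE"))
print(same_country("Deutschland", "Allemagne"))
```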
slide-12
SLIDE 12

Other Cultural Country Naming Considerations

Formal/Informal naming applications

(particularly prevalent in the social media world, e.g. ‘Oz’ for Australia)

Political/Non‐Political Usage

e.g. ‘Commonwealth of Australia’

Change over time

e.g. Czechoslovakia

Non‐standardised international conventions

e.g. Saint or St? The or none?

slide-13
SLIDE 13

The Impact

All of these cultural mores impact on the ability of people and organisations to record country name information in a standardised, transparent manner. Thus, there exists a proliferation of country name lists which are officially promoted by international agencies. This impact is then intensified in usage,

slide-14
SLIDE 14

Options

Suggested improvements to the indices and standards include:

1. Improve access to source data

a. Make the UN’s regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
b. Make the UN’s economic status list available as a csv file online.

2. Lobby to improve content

a. ISO to create a region (Africa, West Africa, North America etc.) code standard.
b. ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).

3. Policy

a. Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online development data should attempt to adhere to.

4. Better citation mechanisms

a. Standardised metadata and identifiers that “resolve”, i.e. link back to data
b. Shared infrastructure to link all the information together
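Option 1a asks for assignment and withdrawal dates because a name can only be matched correctly if the lookup knows which year the data refers to. The sketch below illustrates such a date-aware lookup. The record layout and function name are hypothetical, and the dates are the political dates of each state's existence, used here as stand-ins for official code assignment and withdrawal dates.

```python
from datetime import date
from typing import NamedTuple, Optional


class NameRecord(NamedTuple):
    code: str
    name: str
    valid_from: date
    valid_to: Optional[date]  # None means the entry is still in use


# Illustrative rows only; a real list would come from the published csv file.
RECORDS = [
    NameRecord("CS", "Czechoslovakia", date(1918, 10, 28), date(1992, 12, 31)),
    NameRecord("CZ", "Czech Republic", date(1993, 1, 1), None),
    NameRecord("SK", "Slovakia", date(1993, 1, 1), None),
]


def codes_for(name: str, on: date):
    """Codes whose name matches and whose validity period covers the given date."""
    return [
        r.code
        for r in RECORDS
        if r.name == name
        and r.valid_from <= on
        and (r.valid_to is None or on <= r.valid_to)
    ]


print(codes_for("Czechoslovakia", date(1985, 6, 1)))
print(codes_for("Czechoslovakia", date(2000, 1, 1)))
```

A row dated 1985 that says “Czechoslovakia” resolves to a code, while the same name in a row dated 2000 returns nothing and can be flagged for review.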

slide-15
SLIDE 15

Spatial Identifier Reference Framework

CSIRO has been working with stakeholders including UN, National agencies and others on a set of standards and infrastructure services to support discovering and linking multiple sources of spatial references. This is being presented in more detail in:

6D.3 Spatial Identifier Reference Framework (SIRF): Realising the potential of SDI Using Spatial Identifiers to Link Multiple Information Systems (#633)

Paul Box, Robert Atkinson, Laura Kostanski

S6‐D ‐ SDI, Tuesday, August 27, 2013, 04:30 p.m. ‐ 05:45 p.m., Room: Conference Level ‐ C1

slide-16
SLIDE 16

One real world feature: a bus station

Represented in multiple systems using different names, and classified and represented in different ways.

[Diagram]

Gazetteer entries:

  • Merak (Gazetteer Entry) in Terminus Dataset (Gazetteer)
  • Merak, Stasiun Bis (Gazetteer Entry) in Gazetir Indonesia (Gazetteer)

Identifier | Feature Type | Footprint
Merak, Stasiun Bis | Transport | Point
Merak | Terminal | Polygon

Sources shown: BIG National Gazetteer of Indonesia; Department of Transport Bus Terminals

Spatial Identifier Reference Framework: links gazetteers (based on same feature in different gazetteers) used in web applications and other online resources.

Relations shown: Same as; Used in

Linked resources:

  • Passenger Travel Stats Application
  • Online Public Transport Map
  • Navigation application

Currently systems are disconnected and difficult to integrate
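The “Same as” and “Used in” relations in the diagram can be read as a small graph. The following is a sketch of that reading, not SIRF's actual data model or API: the identifier strings are invented, and which application uses which gazetteer entry is an assumption, since the diagram's arrows did not survive in this transcript.

```python
from collections import defaultdict

# Hypothetical identifiers of the form "<gazetteer>/<entry>".
SAME_AS = [("terminus/Merak", "gazetir-indonesia/Merak, Stasiun Bis")]

# Assumed assignment of linked resources to entries, for illustration only.
USED_IN = {
    "terminus/Merak": ["Passenger Travel Stats Application"],
    "gazetir-indonesia/Merak, Stasiun Bis": [
        "Online Public Transport Map",
        "Navigation application",
    ],
}


def equivalents(entry):
    """All entries reachable through 'same as' links, including the entry itself."""
    graph = defaultdict(set)
    for a, b in SAME_AS:
        graph[a].add(b)
        graph[b].add(a)
    seen, stack = {entry}, [entry]
    while stack:
        for neighbour in graph[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def linked_resources(entry):
    """Resources that use the entry or any entry declared the same as it."""
    return sorted({res for e in equivalents(entry) for res in USED_IN.get(e, [])})


print(linked_resources("terminus/Merak"))
```

Starting from either gazetteer entry, the lookup reaches all three linked resources, which is the integration the disconnected systems lack today.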

slide-17
SLIDE 17

Identifiers

This is the “tricky part”. Let’s start with the practical implication…

Catchment Boundary Area Geometry
1123343 33535.4 151.3344,‐35.330…….

Catchment ExtractionRate Storage
1123343 730 300

slide-18
SLIDE 18

“Distributed” references

Catchment ExtractionRate Storage
1123343 730 300

Internet

  • How to ask for this entity
  • How to deliver this entity

Catchment Boundary Area Geometry
1123343 33535.4 151.3344,‐35.330…….
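The two tables describe the same catchment in different systems, and the shared identifier is what allows them to be combined over the internet. A minimal sketch of that idea follows, using the values shown on the slide (the truncated geometry is left out). The base URI, dictionary names and function names are hypothetical.

```python
# Two systems hold different attributes of the same catchment,
# keyed by the shared identifier shown on the slide.
boundaries = {"1123343": {"area": 33535.4}}
usage = {"1123343": {"extraction_rate": 730, "storage": 300}}

# Hypothetical base URI; any stable, resolvable scheme would serve.
BASE_URI = "https://example.org/id/catchment/"


def uri_for(catchment_id):
    """'How to ask for this entity': one stable address per identifier."""
    return BASE_URI + catchment_id


def describe(catchment_id):
    """'How to deliver this entity': merge what each system knows about it."""
    record = {"uri": uri_for(catchment_id)}
    record.update(boundaries.get(catchment_id, {}))
    record.update(usage.get(catchment_id, {}))
    return record


print(describe("1123343"))
```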

slide-19
SLIDE 19

[Diagram repeated from Slide 16: one real world feature, a bus station, represented in multiple systems using different names, with the following labels added]

URI Describe Link Discover Provenance SDI resource access

slide-20
SLIDE 20

Thank you

For more information: Rob.atkinson@csiro.au
