

SLIDE 1

How much duplicate botanical data is available for digitization reuse?

Íñigo Granzow-de la Cerda1 & Ben Anhalt2

1Autonomous University of Barcelona 2Biodiversity Institute, University of Kansas

TDWG 2011, New Orleans 17 Oct.

SLIDE 2

Addressing a major challenge to herbarium digitization: Cost of specimen data acquisition

Mainly caused by:

• Populating locality information
• Georeferencing

Addressed by:
• Minimizing the number of fields populated

but …

How far are we willing to go in minimizing data acquisition while still remaining useful?

SLIDE 3

Addressing a major challenge to herbarium digitization: Cost of data acquisition

Alternatively: HARVESTING data that have already been acquired elsewhere

SLIDE 4

• Historically, plant collectors have distributed (and still distribute) duplicates of specimens among peer institutions (for id/verification by specialists, specimen exchanges, etc.)

• Institutions undergo digitization projects independently of each other

• So specimens become databased regardless of whether their duplicates have already been digitized by one or more peer institutions

Addressing a major challenge to herbarium digitization: Cost of data acquisition

SLIDE 5

• The overall amount of specimen duplication is largely unknown

• The search for duplicates has been attempted through a filter-push network architecture

• We have developed a more rudimentary but more direct way of doing it:

SGR
(Scatter, Gather, and Reconciling of Specimen data)

Addressing the main challenge to herbarium digitization

SLIDE 6

• After decades of databasing efforts, it is likely that a significant volume of specimen duplication has accumulated in aggregated metadata such as GBIF

• Redundancy is good, despite the perceived inefficiency of repeatedly capturing data from the same specimen

• Latecomers to the databasing process have an advantage: many of their specimens (if duplicates) have already been digitized by someone else

What can metadata do for you?

SLIDE 7

GBIF can serve your data acquisition needs in two ways:

• 1. Data from duplicate specimens already acquired by prior databasing effort(s) can be used as a source for populating part of your own records: specimen by specimen, harvesting data field by field. This includes the fields that are most expensive to acquire (e.g. locality data and assigning geocoordinates).

• 2. Help identify which of your specimens are UNIQUE (absent from existing metadata). These are the most valuable specimens in your collection, the ones to be prioritized for full digitization.

What can metadata do for you?
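The first use case above can be pictured as a simple field-level merge. This is a hypothetical sketch, not SGR's actual code; the Darwin Core-style field names and the fill-only-empty-fields policy are assumptions for illustration.

```python
# Hypothetical sketch of use 1: populate empty, expensive-to-capture
# fields of a skeletal local record from a matched GBIF duplicate.
# Field names (Darwin Core style) and merge policy are assumptions.

EXPENSIVE_FIELDS = ["locality", "decimalLatitude", "decimalLongitude"]

def harvest_fields(local_record, duplicate):
    """Copy values from the duplicate into any empty expensive field."""
    merged = dict(local_record)
    for field in EXPENSIVE_FIELDS:
        if not merged.get(field) and duplicate.get(field):
            merged[field] = duplicate[field]
    return merged

# Illustrative records (fabricated for the sketch):
skeletal = {"recordedBy": "Smith 1234", "locality": ""}
gbif_dup = {"recordedBy": "Smith 1234", "locality": "Chiapas: Ocosingo",
            "decimalLatitude": "16.9", "decimalLongitude": "-91.9"}
print(harvest_fields(skeletal, gbif_dup)["locality"])  # prints Chiapas: Ocosingo
```

Fields already populated locally are never overwritten, so harvesting only fills gaps.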

SLIDE 8

What is the level of duplication out there?

| # recs. in GBIF | Institution/project database |
| 3,916,545 | BfN/NetPhyD Bundesamt fuer Naturschutz / Netzwerk Phytodiversitaet Deutschland |
| 3,741,903 | MO Tropicos |
| 1,217,931 | O Oslo Bot. Mus. VXL |
| 1,118,715 | ANTHOS GBIF Spain/Fundación Biodiversidad |
| 924,217 | NY Herbarium |
| 658,511 | S S-Vascular |
| 595,642 | NSW Royal Bot. Gdn., Sydney |
| 588,872 | NHN Nationaal Herbarium Nederland |
| 538,851 | KNA Plant |
| 509,255 | MNHN Paris |
| 490,550 | O V |
| 424,133 | US Botany |
| 359,134 | K RBG Kew Herbarium |

31,343,738 records in all ca. 650 botanical collections with > 1,000 records in GBIF

SOURCE: 5 botanical databases among the 13 with most records in GBIF (July 2011 release), global in scope, and sufficiently DarwinCore-compliant for the fields of interest

SLIDE 9

Countries sampled: Australia, Canada, Costa Rica, Guyana, Mexico, South Africa

| Institution | Per-country record counts (4-5 of the 6 countries) | Total records |
| S (Naturh. Riksmus. Stockholm) | 8,400; 4,700; 6,100; 15,600 | 34,800 |
| MO (MOBot/Tropicos) | 5,500; 15,300; 20,600; 17,400 | 58,800 |
| K (Royal Bot. Garden Kew) | 12,200; 2,200; 4,500; 13,200; 25,000 | 57,100 |
| NY (NY Bot. Garden) | 3,700; 3,600; 3,400; 23,500 | 34,200 |
| US (Smithsonian Inst.) | 3,100; 10,600; 3,600; 32,700 | 50,000 |
| Total per country | Australia 32,900; Canada 20,000; Costa Rica 16,400; Guyana 32,100; Mexico 75,500; South Africa 58,000 | |

SOURCE: sample data sets, each consisting of 30-60K botanical records, belonging to 4-5 of 6 selected countries, such that:

  • # recs. had to be >2k and <35k for any given country
  • 3 countries with highly diverse floras (MX, CR, ZA),
  • 2 large in size (CDN, AU) and
  • 1 relatively small, with a rich flora but less well-collected (GY)
  • countries to which any of the institutions belonged were excluded

What is the level of duplication out there?

SLIDE 10

What is the level of duplication out there?

TARGET: Lucene index of GBIF records

The SGR search algorithm was run for each source dataset, targeting the fields:

  • Collector name
  • Collector’s field number
  • Collection date
  • Taxon name

Generated a Matching Index
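The real SGR runs fuzzy queries against a Lucene index of GBIF records; purely as an illustration of the idea, the sketch below joins source records against a GBIF-like index on an exact, normalized key built from the same four fields. The field names and the exact-match simplification are assumptions, not SGR's implementation.

```python
# Illustrative only: SGR scores fuzzy Lucene matches; this sketch
# approximates the idea with an exact normalized key over the same
# four fields (collector name, collector's field number, date, taxon).
from collections import defaultdict

FIELDS = ("collector", "collectorNumber", "date", "taxon")

def match_key(rec):
    """Normalized matching key over the four SGR fields."""
    return tuple(str(rec.get(f, "")).strip().lower() for f in FIELDS)

def build_index(gbif_records):
    """Index GBIF-like records by matching key (a crude 'Matching Index')."""
    index = defaultdict(list)
    for rec in gbif_records:
        index[match_key(rec)].append(rec)
    return index

def find_duplicates(source_records, index):
    """Pair each source record with its candidate GBIF duplicates."""
    return [(rec, index.get(match_key(rec), [])) for rec in source_records]

# Fabricated example records:
gbif = [{"collector": "Smith", "collectorNumber": "101",
         "date": "1987-05-02", "taxon": "Quercus rubra"}]
source = [{"collector": "SMITH ", "collectorNumber": "101",
           "date": "1987-05-02", "taxon": "Quercus rubra"}]
pairs = find_duplicates(source, build_index(gbif))
print(len(pairs[0][1]))  # prints 1: one candidate duplicate found
```

Normalizing case and whitespace before keying is what lets "SMITH " match "Smith"; a Lucene-backed version would tolerate far more variation.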

SLIDE 11

SLIDE 12

What is the level of duplication out there?

| Institution | Total records | % duplication | Multiplicity of duplicates (avg.) |
| S (Naturh. Riksmus. Stockholm) | 34,800 | 10.4% | 2.62 |
| MO (MOBot/Tropicos) | 58,800 | 12.1% | 1.4 |
| K (Royal Bot. Garden Kew) | 57,100 | 15.4% | 1.94 |
| NY (NY Bot. Garden) | 34,200 | 34.8% | 2.43 |

SLIDE 13

• Why is matching not any higher?

• Matches are often missed because datasets from some collections don't show data for key fields, such as:

  • Collector name (US)
  • Collector's # (MNHN/P, SANBI, NHN)
  • Collection date (DUKE, HBG)
  • Some attach coll. # to coll. name (LD)

• However, multiple copies exist (sometimes > 10) for many specimens, including intra-collection duplication: an average of 1.4 to 2.6 duplicates across target records

What is the level of duplication out there?

SLIDE 14

• But even in the absence of full matches (e.g., because collector field numbers differ), essential data can still be harvested for identical localities (including georeferences)

Maximizing SGR functionality
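One way to picture this fallback (a hypothetical sketch under assumed field names, not SGR's implementation): relax the key to collector + date + taxon, ignoring the field number, and harvest locality and georeference data from any record that still matches.

```python
# Hypothetical sketch: partial matching that drops the collector's field
# number, so locality/georeference data can still be harvested when the
# field numbers differ. Field names are assumptions.
PARTIAL_FIELDS = ("collector", "date", "taxon")

def partial_key(rec):
    """Relaxed key: collector + date + taxon, no field number."""
    return tuple(str(rec.get(f, "")).strip().lower() for f in PARTIAL_FIELDS)

def harvest_locality(local_record, gbif_records):
    """Copy locality fields from the first partial match, if any."""
    enriched = dict(local_record)
    for cand in gbif_records:
        if partial_key(cand) == partial_key(local_record):
            for field in ("locality", "decimalLatitude", "decimalLongitude"):
                enriched.setdefault(field, cand.get(field))
            break
    return enriched

# Fabricated example: same gathering, different field numbers (55 vs 55b).
rec = {"collector": "Jones", "collectorNumber": "55",
       "date": "1990-03-14", "taxon": "Acacia karroo"}
gbif = [{"collector": "Jones", "collectorNumber": "55b",
         "date": "1990-03-14", "taxon": "Acacia karroo",
         "locality": "Eastern Cape", "decimalLatitude": "-33.3",
         "decimalLongitude": "26.5"}]
print(harvest_locality(rec, gbif)["locality"])  # prints Eastern Cape
```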

SLIDE 15

• The matching algorithm will improve and become smarter (e.g. by running consecutive analyses with different algorithms to minimize missed matches and false matches, respectively)

• The data quality of what goes into GBIF will improve (to maximize fully DwC-compliant data on key fields)

Maximizing SGR functionality
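The consecutive-analyses idea could look roughly like this (a hypothetical sketch, not SGR's code): a strict first pass to keep false matches low, then a relaxed second pass over whatever the first pass missed. Field names and the two key definitions are assumptions.

```python
# Hypothetical two-pass sketch of "consecutive analyses with different
# algorithms": pass 1 matches strictly on all four fields; pass 2 relaxes
# the key for records pass 1 missed. Field names are assumptions.

def strict_key(rec):
    return tuple(str(rec.get(f, "")).strip().lower()
                 for f in ("collector", "collectorNumber", "date", "taxon"))

def relaxed_key(rec):
    return tuple(str(rec.get(f, "")).strip().lower()
                 for f in ("collector", "date", "taxon"))

def two_pass_match(source_records, gbif_records):
    """Return {source index: matched GBIF record} after both passes."""
    strict = {strict_key(g): g for g in gbif_records}
    relaxed = {relaxed_key(g): g for g in gbif_records}
    matches = {}
    for i, rec in enumerate(source_records):
        if strict_key(rec) in strict:        # pass 1: minimize false matches
            matches[i] = strict[strict_key(rec)]
        elif relaxed_key(rec) in relaxed:    # pass 2: minimize missed matches
            matches[i] = relaxed[relaxed_key(rec)]
    return matches

# Fabricated example: field numbers differ (9 vs 9a), so pass 2 catches it.
gbif = [{"collector": "Lee", "collectorNumber": "9", "date": "1975-01-20",
         "taxon": "Salix alba"}]
src = [{"collector": "Lee", "collectorNumber": "9a", "date": "1975-01-20",
        "taxon": "Salix alba"}]
print(len(two_pass_match(src, gbif)))  # prints 1 (matched in pass 2)
```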

SLIDE 16

A caveat:

• In order to benefit from SGR, collections need to be pre-catalogued first: generate a skeletal database that includes 4-6 minimal fields to act as a source for SGR.

(You still want to know, roughly, what you have in your collection.)

Maximizing SGR functionality
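A skeletal pre-catalogue record might look like the following; the particular field list is an assumption for illustration (the slides only say 4-6 minimal fields).

```python
# Hypothetical skeletal record: just enough fields for SGR-style matching.
# The specific field list is an assumption (the slides say 4-6 fields).
SKELETAL_FIELDS = ("catalogNumber", "collector", "collectorNumber",
                   "date", "taxon")

def is_skeletal_complete(rec):
    """True if the record carries every minimal field needed for matching."""
    return all(str(rec.get(f, "")).strip() for f in SKELETAL_FIELDS)

# Fabricated example record:
rec = {"catalogNumber": "XX-000123", "collector": "Anhalt",
       "collectorNumber": "77", "date": "2001-06-09",
       "taxon": "Carex pensylvanica"}
print(is_skeletal_complete(rec))  # prints True
```

A completeness check like this is what lets a latecomer institution verify its pre-catalogue is ready to serve as an SGR source.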

SLIDE 17

• The main contribution of SGR is not just that it helps populate records in your collection

But… its real contribution to science is:

• Allowing curators to identify specimens that do not have duplicates anywhere else, so as to prioritize resources toward populating those records for incorporation into existing metadata

Maximizing SGR functionality

SLIDE 18

And thanks to:

National Science Foundation – BRC
Biodiversity Institute, University of Kansas
MNH University of Michigan Herbarium

Thank you