How much duplicate botanical data is available for digitization reuse?
Íñigo Granzow-de la Cerda1 & Ben Anhalt2
1Autonomous University of Barcelona 2Biodiversity Institute, University of Kansas
TDWG 2011, New Orleans 17 Oct.
# recs. in GBIF   Institution/project   Database
3,916,545         BfN/NetPhyD           Bundesamt fuer Naturschutz / Netzwerk Phytodiversitaet Deutschland
3,741,903         MO                    Tropicos
1,217,931         O                     Oslo Bot. Mus. VXL
1,118,715         ANTHOS                GBIF Spain/Fundación Biodiversidad
924,217           NY                    Herbarium
658,511           S                     S-Vascular
595,642           NSW                   Royal Bot. Gdn., Sydney
588,872           NHN                   Nationaal Herbarium Nederland
538,851           KNA                   Plant
509,255           MNHN                  Paris
490,550           O                     V
424,133           US                    Botany
359,134           K                     RBG Kew Herbarium
31,343,738 records in all; ca. 650 botanical collections with > 1,000 records in GBIF
SOURCE: 5 botanical databases among the 13 with most records in GBIF (July 2011 release), global in scope, and sufficiently DarwinCore-compliant for fields of interest
Institution                     Australia   Canada   Costa Rica   Guyana   Mexico   South Africa   Total records
S (Naturh.Riksmus. Stockholm)       8,400    4,700            -        -    6,100         15,600          34,800
MO (MOBot/Tropicos)                 5,500   15,300            -   20,600        -         17,400          58,800
K (Royal Bot. Garden Kew)          12,200        -        2,200    4,500   13,200         25,000          57,100
NY (NY Bot. Garden)                 3,700        -        3,600    3,400   23,500              -          34,200
US (Smithsonian Inst.)              3,100        -       10,600    3,600   32,700              -          50,000
Total                              32,900   20,000       16,400   32,100   75,500         58,000
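As a consistency check on the sample table above, the following sketch assigns each listed count to a country column and confirms that every row total and column total is reproduced. The placement of the counts is an assumption inferred from the totals, since each institution covers only 4-5 of the 6 countries; the variable and function names are illustrative.

```python
# Country columns in the order listed in the table.
COUNTRIES = ["Australia", "Canada", "Costa Rica", "Guyana", "Mexico", "South Africa"]

# Assumed placement of each count; None marks a country not sampled
# for that institution (inferred from the totals, not stated in the source).
SAMPLE = {
    "S":  [8400, 4700, None, None, 6100, 15600],
    "MO": [5500, 15300, None, 20600, None, 17400],
    "K":  [12200, None, 2200, 4500, 13200, 25000],
    "NY": [3700, None, 3600, 3400, 23500, None],
    "US": [3100, None, 10600, 3600, 32700, None],
}

# Totals as listed in the table.
ROW_TOTALS = {"S": 34800, "MO": 58800, "K": 57100, "NY": 34200, "US": 50000}
COLUMN_TOTALS = [32900, 20000, 16400, 32100, 75500, 58000]


def row_sums(sample):
    """Sum the sampled counts for each institution."""
    return {inst: sum(v for v in counts if v is not None)
            for inst, counts in sample.items()}


def column_sums(sample):
    """Sum the sampled counts for each country column."""
    return [sum(counts[i] for counts in sample.values() if counts[i] is not None)
            for i in range(len(COUNTRIES))]


rows_ok = row_sums(SAMPLE) == ROW_TOTALS
columns_ok = column_sums(SAMPLE) == COLUMN_TOTALS
```

Both checks hold only under this placement of the counts, which is why it is used here.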
SOURCE: sample data sets, each consisting of 30-60K botanical records, belonging to 4-5 of 6 selected countries such that
TARGET: Lucene index of GBIF records
The SGR search algorithm was run for each source dataset, targeting the fields:
Generated a Matching Index
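To make the source-against-target comparison concrete, here is a minimal sketch of field-based duplicate matching. It is not the SGR algorithm and does not use Lucene: it swaps in a brute-force Jaccard token overlap as a stand-in score. The field names (`collector`, `collector_number`), the 0.8 threshold, and the sample records are all hypothetical, since the list of targeted fields is not reproduced here.

```python
import re


def tokens(value):
    """Lowercase a field value and split it into a set of word tokens."""
    return set(re.findall(r"\w+", value.lower()))


def match_score(source, target, fields):
    """Mean Jaccard token overlap across the compared fields, from 0.0 to 1.0."""
    scores = []
    for field in fields:
        a = tokens(source.get(field, ""))
        b = tokens(target.get(field, ""))
        if not a and not b:
            continue  # empty on both sides: contributes no evidence
        scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores) if scores else 0.0


def find_duplicates(source_records, target_records, fields, threshold=0.8):
    """Map each source id to the target ids scoring at or above the threshold."""
    matches = {}
    for src in source_records:
        hits = [tgt["id"] for tgt in target_records
                if match_score(src, tgt, fields) >= threshold]
        if hits:
            matches[src["id"]] = hits
    return matches


# Made-up records for illustration only (hypothetical field names).
source = [{"id": "s1", "collector": "J. Smith", "collector_number": "1234"}]
target = [
    {"id": "t1", "collector": "Smith, J.", "collector_number": "1234"},
    {"id": "t2", "collector": "A. Jones", "collector_number": "88"},
]
duplicates = find_duplicates(source, target, ["collector", "collector_number"])
```

A full-text index such as Lucene avoids the all-pairs comparison used here, which would be impractical against tens of millions of target records.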
Institution                     Total    % duplication   Multiplicity of duplicates, average
S (Naturh.Riksmus. Stockholm)   34,800   10.4%           2.62
MO (MOBot/Tropicos)             58,800   12.1%           1.4
K (Royal Bot. Garden Kew)       57,100   15.4%           1.94
NY (NY Bot. Garden)             34,200   34.8%           2.43
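The percentages above can be turned into approximate record counts. The sketch below assumes that "% duplication" is the share of source records with at least one duplicate found in the target, and that the percentages pair with the institutions in the order listed; neither is stated explicitly, and the names are illustrative.

```python
# Totals and duplication percentages as listed for the four source datasets.
RESULTS = {
    "S":  {"total": 34800, "pct_duplication": 10.4},
    "MO": {"total": 58800, "pct_duplication": 12.1},
    "K":  {"total": 57100, "pct_duplication": 15.4},
    "NY": {"total": 34200, "pct_duplication": 34.8},
}


def estimated_duplicated(total, pct):
    """Estimated number of source records with at least one duplicate."""
    return round(total * pct / 100)


# Approximate count of duplicated records per institution.
estimates = {inst: estimated_duplicated(r["total"], r["pct_duplication"])
             for inst, r in RESULTS.items()}

# Pooled rate across the four datasets, weighted by dataset size.
pooled_pct = 100 * sum(estimates.values()) / sum(r["total"] for r in RESULTS.values())
```

Weighting by dataset size keeps the pooled figure from being dominated by the smaller datasets, which a plain mean of the four percentages would allow.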
collections don’t show data for key fields, such as
, SANBI, NHN)
Some attach the collector number to the collector name (LD)
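Where a collector number is appended to the collector name, the two can often be separated before matching. The following is a rough sketch under the assumption that the number is a trailing run of digits with an optional single letter suffix; the pattern, the function name, and the example strings are hypothetical and do not come from any particular database.

```python
import re

# Name must end in a non-digit, non-space character; the number is a
# trailing run of digits with an optional single letter suffix.
TRAILING_NUMBER = re.compile(r"^(.*?[^\d\s])\s*(\d+[A-Za-z]?)$")


def split_collector(value):
    """Return (name, number); number is None when no trailing number is found."""
    value = value.strip()
    match = TRAILING_NUMBER.match(value)
    if match is None:
        return value, None
    return match.group(1), match.group(2)


with_number = split_collector("J. Smith 1234")
without_number = split_collector("J. Smith")
```

Values that do not fit the assumed pattern are returned unchanged, so they can be reviewed by hand rather than split incorrectly.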
average of 1.4 to 2.6 duplicates across target records
(you still want to know, roughly, what you have in your collection)
And thanks to:
National Science Foundation – BRC Biodiversity Institute, University of Kansas MNH University of Michigan Herbarium