

SLIDE 1

How much duplicate botanical data is available for digitization reuse?

Íñigo Granzow-de la Cerda1 & Ben Anhalt2

1Autonomous University of Barcelona 2Biodiversity Institute, University of Kansas

TDWG 2011, New Orleans 17 Oct.

SLIDE 2

Addressing a major challenge to herbarium digitization: Cost of specimen data acquisition

Mainly caused by:

• Populating locality information
• Georeferencing

Addressed by:
• Minimizing the number of fields populated

but …

How far are we willing to go in minimizing data acquisition while still remaining useful?

SLIDE 3

Addressing a major challenge to herbarium digitization: Cost of data acquisition

Alternatively: HARVESTING data that have already been acquired elsewhere

SLIDE 4

• Historically, plant collectors have distributed (and still distribute) duplicates of specimens among peer institutions (for id/verification by specialists, specimen exchanges, etc.)

• Institutions undergo digitization projects independently of each other

• So specimens become databased regardless of whether their duplicates have already been digitized by one or more peer institutions

Addressing a major challenge to herbarium digitization: Cost of data acquisition

SLIDE 5

• The overall amount of specimen duplication is largely unknown

• The search for duplicates has been attempted through a filter-push network architecture

• We have developed a more rudimentary but more direct way of doing it:

SGR
(Scatter, Gather, and Reconciling of Specimen data)

Addressing the main challenge to herbarium digitization

SLIDE 6

• After decades of databasing efforts, it is likely that a significant volume of specimen duplication has accumulated in aggregated metadata such as GBIF

• Redundancy is good, despite the perceived inefficiency of repeatedly capturing data from the same specimen

• Latecomers to the databasing process have an advantage: many of their specimens (if duplicates) have already been digitized by someone else

What can metadata do for you?

SLIDE 7

GBIF can serve your data acquisition needs in two ways:

• 1. Data from duplicate specimens already acquired by prior databasing effort(s) can be used as a source for populating part of your own records: specimen by specimen, harvesting data field by field. This includes the fields that are most expensive to acquire (e.g. locality data and assigning geocoordinates).

• 2. Help identify which of your specimens are UNIQUE (absent from existing metadata). These are the most valuable specimens in your collection, the ones to be prioritized for full digitization.

What can metadata do for you?
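The first use case above can be pictured as a simple field-level merge. This is a hypothetical sketch, not SGR's actual code; the Darwin Core-style field names and the fill-only-empty-fields policy are assumptions for illustration.

```python
# Hypothetical sketch of use 1: populate empty, expensive-to-capture
# fields of a skeletal local record from a matched GBIF duplicate.
# Field names (Darwin Core style) and merge policy are assumptions.

EXPENSIVE_FIELDS = ["locality", "decimalLatitude", "decimalLongitude"]

def harvest_fields(local_record, duplicate):
    """Copy values from the duplicate into any empty expensive field."""
    merged = dict(local_record)
    for field in EXPENSIVE_FIELDS:
        if not merged.get(field) and duplicate.get(field):
            merged[field] = duplicate[field]
    return merged

# Illustrative records (fabricated for the sketch):
skeletal = {"recordedBy": "Smith 1234", "locality": ""}
gbif_dup = {"recordedBy": "Smith 1234", "locality": "Chiapas: Ocosingo",
            "decimalLatitude": "16.9", "decimalLongitude": "-91.9"}
print(harvest_fields(skeletal, gbif_dup)["locality"])  # prints Chiapas: Ocosingo
```

Fields already populated locally are never overwritten, so harvesting only fills gaps.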

SLIDE 8

What is the level of duplication out there?

| # recs. in GBIF | Institution/project database |
| 3,916,545 | BfN/NetPhyD Bundesamt fuer Naturschutz / Netzwerk Phytodiversitaet Deutschland |
| 3,741,903 | MO Tropicos |
| 1,217,931 | O Oslo Bot. Mus. VXL |
| 1,118,715 | ANTHOS GBIF Spain/Fundación Biodiversidad |
| 924,217 | NY Herbarium |
| 658,511 | S S-Vascular |
| 595,642 | NSW Royal Bot. Gdn., Sydney |
| 588,872 | NHN Nationaal Herbarium Nederland |
| 538,851 | KNA Plant |
| 509,255 | MNHN Paris |
| 490,550 | O V |
| 424,133 | US Botany |
| 359,134 | K RBG Kew Herbarium |

31,343,738 records in all ca. 650 botanical collections with > 1,000 records in GBIF

SOURCE: 5 botanical databases among the 13 with most records in GBIF (July 2011 release), global in scope, and sufficiently DarwinCore-compliant for the fields of interest

SLIDE 9

Countries sampled: Australia, Canada, Costa Rica, Guyana, Mexico, South Africa

| Institution | Per-country record counts (4-5 of the 6 countries) | Total records |
| S (Naturh. Riksmus. Stockholm) | 8,400; 4,700; 6,100; 15,600 | 34,800 |
| MO (MOBot/Tropicos) | 5,500; 15,300; 20,600; 17,400 | 58,800 |
| K (Royal Bot. Garden Kew) | 12,200; 2,200; 4,500; 13,200; 25,000 | 57,100 |
| NY (NY Bot. Garden) | 3,700; 3,600; 3,400; 23,500 | 34,200 |
| US (Smithsonian Inst.) | 3,100; 10,600; 3,600; 32,700 | 50,000 |
| Total per country | Australia 32,900; Canada 20,000; Costa Rica 16,400; Guyana 32,100; Mexico 75,500; South Africa 58,000 | |

SOURCE: sample data sets, each consisting of 30-60K botanical records, belonging to 4-5 of 6 selected countries, such that:

  • # recs. had to be >2k and <35k for any given country
  • 3 countries with highly diverse floras (MX, CR, ZA),
  • 2 large in size (CDN, AU) and
  • 1 relatively small, with a rich flora but less well-collected (GY)
  • countries to which any of the institutions belonged were excluded

What is the level of duplication out there?

SLIDE 10

What is the level of duplication out there?

TARGET: Lucene index of GBIF records

The SGR search algorithm was run for each source dataset, targeting the fields:

  • Collector name
  • Collector’s field number
  • Collection date
  • Taxon name

Generated a Matching Index
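The real SGR runs fuzzy queries against a Lucene index of GBIF records; purely as an illustration of the idea, the sketch below joins source records against a GBIF-like index on an exact, normalized key built from the same four fields. The field names and the exact-match simplification are assumptions, not SGR's implementation.

```python
# Illustrative only: SGR scores fuzzy Lucene matches; this sketch
# approximates the idea with an exact normalized key over the same
# four fields (collector name, collector's field number, date, taxon).
from collections import defaultdict

FIELDS = ("collector", "collectorNumber", "date", "taxon")

def match_key(rec):
    """Normalized matching key over the four SGR fields."""
    return tuple(str(rec.get(f, "")).strip().lower() for f in FIELDS)

def build_index(gbif_records):
    """Index GBIF-like records by matching key (a crude 'Matching Index')."""
    index = defaultdict(list)
    for rec in gbif_records:
        index[match_key(rec)].append(rec)
    return index

def find_duplicates(source_records, index):
    """Pair each source record with its candidate GBIF duplicates."""
    return [(rec, index.get(match_key(rec), [])) for rec in source_records]

# Fabricated example records:
gbif = [{"collector": "Smith", "collectorNumber": "101",
         "date": "1987-05-02", "taxon": "Quercus rubra"}]
source = [{"collector": "SMITH ", "collectorNumber": "101",
           "date": "1987-05-02", "taxon": "Quercus rubra"}]
pairs = find_duplicates(source, build_index(gbif))
print(len(pairs[0][1]))  # prints 1: one candidate duplicate found
```

Normalizing case and whitespace before keying is what lets "SMITH " match "Smith"; a Lucene-backed version would tolerate far more variation.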

SLIDE 11

SLIDE 12

What is the level of duplication out there?

| Institution | Total records | % duplication | Multiplicity of duplicates (avg.) |
| S (Naturh. Riksmus. Stockholm) | 34,800 | 10.4% | 2.62 |
| MO (MOBot/Tropicos) | 58,800 | 12.1% | 1.4 |
| K (Royal Bot. Garden Kew) | 57,100 | 15.4% | 1.94 |
| NY (NY Bot. Garden) | 34,200 | 34.8% | 2.43 |

SLIDE 13

• Why is matching not any higher?

• Matches are often missed because datasets from some collections don't show data for key fields, such as:

  • Collector name (US)
  • Collector's # (MNHN/P, SANBI, NHN)
  • Collection date (DUKE, HBG)
  • Some attach coll. # to coll. name (LD)

• However, multiple copies exist (sometimes > 10) for many specimens, including intra-collection duplication: an average of 1.4 to 2.6 duplicates across target records

What is the level of duplication out there?

SLIDE 14

• But even in the absence of full matches (e.g., because collector field numbers differ), essential data can still be harvested for identical localities (including georeferences)

Maximizing SGR functionality
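One way to picture this fallback (a hypothetical sketch under assumed field names, not SGR's implementation): relax the key to collector + date + taxon, ignoring the field number, and harvest locality and georeference data from any record that still matches.

```python
# Hypothetical sketch: partial matching that drops the collector's field
# number, so locality/georeference data can still be harvested when the
# field numbers differ. Field names are assumptions.
PARTIAL_FIELDS = ("collector", "date", "taxon")

def partial_key(rec):
    """Relaxed key: collector + date + taxon, no field number."""
    return tuple(str(rec.get(f, "")).strip().lower() for f in PARTIAL_FIELDS)

def harvest_locality(local_record, gbif_records):
    """Copy locality fields from the first partial match, if any."""
    enriched = dict(local_record)
    for cand in gbif_records:
        if partial_key(cand) == partial_key(local_record):
            for field in ("locality", "decimalLatitude", "decimalLongitude"):
                enriched.setdefault(field, cand.get(field))
            break
    return enriched

# Fabricated example: same gathering, different field numbers (55 vs 55b).
rec = {"collector": "Jones", "collectorNumber": "55",
       "date": "1990-03-14", "taxon": "Acacia karroo"}
gbif = [{"collector": "Jones", "collectorNumber": "55b",
         "date": "1990-03-14", "taxon": "Acacia karroo",
         "locality": "Eastern Cape", "decimalLatitude": "-33.3",
         "decimalLongitude": "26.5"}]
print(harvest_locality(rec, gbif)["locality"])  # prints Eastern Cape
```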

SLIDE 15

• The matching algorithm will improve and become smarter (e.g. by running consecutive analyses with different algorithms to minimize missed matches and false matches, respectively)

• The data quality of what goes into GBIF will improve (to maximize fully DwC-compliant data on key fields)

Maximizing SGR functionality
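The consecutive-analyses idea could look roughly like this (a hypothetical sketch, not SGR's code): a strict first pass to keep false matches low, then a relaxed second pass over whatever the first pass missed. Field names and the two key definitions are assumptions.

```python
# Hypothetical two-pass sketch of "consecutive analyses with different
# algorithms": pass 1 matches strictly on all four fields; pass 2 relaxes
# the key for records pass 1 missed. Field names are assumptions.

def strict_key(rec):
    return tuple(str(rec.get(f, "")).strip().lower()
                 for f in ("collector", "collectorNumber", "date", "taxon"))

def relaxed_key(rec):
    return tuple(str(rec.get(f, "")).strip().lower()
                 for f in ("collector", "date", "taxon"))

def two_pass_match(source_records, gbif_records):
    """Return {source index: matched GBIF record} after both passes."""
    strict = {strict_key(g): g for g in gbif_records}
    relaxed = {relaxed_key(g): g for g in gbif_records}
    matches = {}
    for i, rec in enumerate(source_records):
        if strict_key(rec) in strict:        # pass 1: minimize false matches
            matches[i] = strict[strict_key(rec)]
        elif relaxed_key(rec) in relaxed:    # pass 2: minimize missed matches
            matches[i] = relaxed[relaxed_key(rec)]
    return matches

# Fabricated example: field numbers differ (9 vs 9a), so pass 2 catches it.
gbif = [{"collector": "Lee", "collectorNumber": "9", "date": "1975-01-20",
         "taxon": "Salix alba"}]
src = [{"collector": "Lee", "collectorNumber": "9a", "date": "1975-01-20",
        "taxon": "Salix alba"}]
print(len(two_pass_match(src, gbif)))  # prints 1 (matched in pass 2)
```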

SLIDE 16

A caveat:

• In order to benefit from SGR, collections need to be pre-catalogued first: generate a skeletal database that includes 4-6 minimal fields to act as a source for SGR.

(You still want to know, roughly, what you have in your collection.)

Maximizing SGR functionality
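A skeletal pre-catalogue record might look like the following; the particular field list is an assumption for illustration (the slides only say 4-6 minimal fields).

```python
# Hypothetical skeletal record: just enough fields for SGR-style matching.
# The specific field list is an assumption (the slides say 4-6 fields).
SKELETAL_FIELDS = ("catalogNumber", "collector", "collectorNumber",
                   "date", "taxon")

def is_skeletal_complete(rec):
    """True if the record carries every minimal field needed for matching."""
    return all(str(rec.get(f, "")).strip() for f in SKELETAL_FIELDS)

# Fabricated example record:
rec = {"catalogNumber": "XX-000123", "collector": "Anhalt",
       "collectorNumber": "77", "date": "2001-06-09",
       "taxon": "Carex pensylvanica"}
print(is_skeletal_complete(rec))  # prints True
```

A completeness check like this is what lets a latecomer institution verify its pre-catalogue is ready to serve as an SGR source.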

SLIDE 17

• The main contribution of SGR is not just that it helps populate records in your collection

But… its real contribution to science is:

• Allowing curators to identify specimens that do not have duplicates anywhere else, so as to prioritize resources toward populating those records for incorporation into existing metadata

Maximizing SGR functionality

SLIDE 18

And thanks to:

National Science Foundation – BRC
Biodiversity Institute, University of Kansas
MNH University of Michigan Herbarium

Thank you