Automating Authority Work
Mike Monaco Coordinator, Cataloging Services May 14, 2018
Automating authority work, or, Be your own authority control vendor
Mike Monaco
Coordinator, Cataloging Services, The University of Akron
mmonaco@uakron.edu
Ohio Valley Group of Technical Services Librarians 2018 Conference, May 13-15, 2018
Hesburgh Libraries, The University of Notre Dame, South Bend, Indiana
John Carroll University (2001-2004): part-time AV cataloger
Akron-Summit County Public Library (2001-2004): substitute librarian
Cleveland Public Library (2004-2016): catalog librarian
The University of Akron (2016- ): Coordinator, Cataloging Services
University Libraries
○ Bierce Library
○ Science & Technology Library
○ Archival Services
Separate units:
○ Wayne College Library
○ Akron Law Library
○ Center for the History
○ Full authority control for tangible items only
○ Shift to batches of e-resources over time made authority work for batches overwhelming
○ 2013: budget 80:20 electronic:tangible
○ 2018: ratio is about 95:5
Automated authority control within the ILS Working with an authority control vendor
Grabbing the “low-hanging fruit” for batches of records, when traditional authority work is not practical (the item is not in hand)
Could the “Headings used for the first time” report export a list of the headings, so we could batch search OCLC for records?
1. Before loading, correct variant headings (with MarcEdit)
2. After loading, extract headings from the report (with an SQL query or the ILS’s output)
3. Separate names and subjects (in a spreadsheet or text editor)
4. Remove extraneous data (with a RegEx-capable editor)
5. Batch search for authority records (in Connexion Client)
6. Load the authority records
MarcEdit can check name and subject fields against LC authorities in the Linked Data Service, and automatically correct headings that match a variant (“Use for”) heading*. *NB: The process is imperfect!
Because this is an extra step, we’ve been comparing record sets from various vendors to determine which vendors’ records benefit most from it.
Selected Vendor Loads (March 2017-March 2018)
Record Source           Records/load  Invalid headings/load  Variants changed/load  Invalid:Record  Variant:Record
Alexander Street Press  293           117                    18                     0.399279        0.060566
EBSCO                   76992         75668                  181                    0.982807        0.002348
Films on Demand         2509          1019                   175                    0.406314        0.069911
Kanopy                  9960          4946                   397                    0.496628        0.039815
Proquest EBC            13086         2309                   101                    0.17647         0.0077
World Share             31            7                      0.7                    0.232114        0.023772
https://mmonaco-uakron.tinytake.com/sf/MjUwMDQxMF83NTIyNTY0
Hundreds or even thousands of entries after batch loads...
Be sure to import as Unicode (UTF-8) if your ILS encodes characters as Unicode rather than MARC-8!
Sorting A-Z arranges the headings by field group tag and MARC tag (a = names, b = other names, d = subjects):
a100-b730 : names used as names
d600-d630 : names used as subjects
d650-     : subjects
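The name/subject split above can also be scripted. A minimal sketch; the line layout (group tag and MARC tag at the start of each line) is an assumption, since the exact report format varies by ILS:

```python
import re

def classify(line):
    """Classify a headings-report line by its field group tag + MARC tag.
    Assumes each line starts like 'a100 ...' or 'd650 ...'; adjust the
    regex to match your ILS's actual report layout."""
    m = re.match(r'([abd])(\d{3})', line)
    if not m:
        return 'unknown'
    group, tag = m.group(1), m.group(2)
    if group in ('a', 'b'):
        return 'name'                  # a100-b730: names used as names
    if group == 'd' and tag in ('600', '610', '611', '630'):
        return 'name-as-subject'       # d600-d630: names used as subjects
    return 'subject'                   # d650-: topical/geographic subjects
```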
You can’t feed this raw data into a batch search in Connexion Client
In EditPad (or another RegEx-enabled editor), strip out MARC tags, delimiters, punctuation, etc.
(.*\|a)
(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)
(\|e.*|\|4.*|\|0.*)
(\|x.*|\|v.*|\|z.*)
(\|.|\|$)
(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’ | ‘| be | that |\.{3}| near )
(.*\|a) Everything before |a
(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.) AACR2 abbreviations
(\|e.*|\|4.*|\|0.*) Relator terms, URIs
(\|.|\|$) Any remaining delimiters and subfield codes
(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’ | ‘| be | that |\.{3}| near ) Punctuation, operators, and stopwords that foil OCLC searches
(\|x.*|\|v.*|\|z.*) Subdivisions
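The same cleanup can be scripted instead of run as successive find-and-replace passes. A minimal sketch, assuming each match is replaced with a space (so stopword removal doesn’t fuse adjacent words) and whitespace is collapsed afterwards:

```python
import re

# The six replace patterns from the slides, applied in order.
PATTERNS = [
    r'(.*\|a)',                       # everything before |a
    r'(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. '
    r'|\|dfl\. ca\. |\|dfl\.)',       # AACR2 abbreviations
    r'(\|e.*|\|4.*|\|0.*)',           # relator terms, URIs
    r'(\|x.*|\|v.*|\|z.*)',           # subdivisions
    r'(\|.|\|$)',                     # remaining delimiters and subfield codes
    r'(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so '
    r'| with | to | by |”|’| be | that |\.{3}| near )',  # punctuation, stopwords
]

def clean_heading(line):
    """Reduce an exported heading to a bare search string."""
    for pat in PATTERNS:
        line = re.sub(pat, ' ', line)
    return ' '.join(line.split())
```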
https://mmonaco-uakron.tinytake.com/sf/MjU4ODk4OF83Nzg3NTMy
List unauthorized tags report
1. Open the .txt file in an editor
2. Delete the header of the report
3. Find/Replace to delete the page headers (“Tags With UNAUTHORIZED Headings / Produced on Sat Jul 1 17:00:11 2017”)
4. Separate name and topical headings
5. RegEx to remove other data
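Step 3 (deleting the repeating page headers) can also be scripted. A sketch, using the header text shown above; the exact wording and line breaks in your report may differ:

```python
import re

# Matches the report's page-header lines wherever they recur.
PAGE_HEADER = re.compile(
    r'^Tags With UNAUTHORIZED Headings.*$|^Produced on .*$',
    re.MULTILINE,
)

def strip_page_headers(text):
    """Blank out every repeating page-header line in the report text."""
    return PAGE_HEADER.sub('', text)
```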
(.*\|a)
(\|e.*|\|4.*|\|0.*) Misses “|?UNAUTHORIZED” by itself. Only captures it if preceded by |e |4 |0
(\|e.*|\|4.*|\|0.*|\|\?.*) \|\?.* captures “|?” followed by anything
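A quick check of the difference between the two patterns, on a hypothetical heading:

```python
import re

old = r'(\|e.*|\|4.*|\|0.*)'          # misses a bare |?UNAUTHORIZED
new = r'(\|e.*|\|4.*|\|0.*|\|\?.*)'   # also captures |? and everything after it

line = 'Smith, John|?UNAUTHORIZED'
after_old = re.sub(old, '', line)     # unchanged: no |e, |4, or |0 present
after_new = re.sub(new, '', line)     # the UNAUTHORIZED flag is stripped
```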
Portions of “|?UNAUTHORIZED” that wrapped to new line
The report output also garbles any diacritics
Name/Title headings are especially likely to get broken up. Here, the delimiter was even separated from the subfield code “t”
1. FIND: \* REPLACE: [nothing], to delete asterisks
2. Use EditPad’s “Extras” to delete blank lines, duplicate lines, etc.
3. Depending on the number of items, you might close up split lines by hand.
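If you’d rather script the cleanup than do it in EditPad, steps 1-2 amount to this sketch (step 3, rejoining split lines, still takes human judgment):

```python
def tidy(lines):
    """Delete asterisks, blank lines, and duplicate lines,
    keeping the first occurrence of each heading in order."""
    seen = set()
    out = []
    for line in lines:
        line = line.replace('*', '').strip()
        if line and line not in seen:
            seen.add(line)
            out.append(line)
    return out
```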
https://mmonaco-uakron.tinytake.com/sf/MjU4OTAyMF83Nzg3NjE2
“Use default index” settings:
nw: for names/titles
su: for topics/geographic terms
Maximum number of matches to download: 1 (Tools > Options > Batch)
NOTE: Your local save file has a maximum capacity of 10,000 records, so don’t search more than that many strings!
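One way to stay under that cap, if you are preparing search files with a script, is to split the strings into batches first. A sketch; the 10,000 default mirrors the save-file limit:

```python
def batch(strings, size=10_000):
    """Split search strings into batches no larger than the
    Connexion local save file capacity (10,000 records)."""
    return [strings[i:i + size] for i in range(0, len(strings), size)]
```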
III (Innovative Interfaces) requires name headings that are to be used as subjects to be loaded separately from name headings to be used as names! SirsiDynix does not have this issue.
I’m very pleased with the hit rate on names!
                   Total headings extracted  Unique hits in batch search  Success rate (ARs found for heading)
Names              36,244                    21,760                       60%
Names as Subjects  3,795                     896                          23.6%
Subjects           29,147                    1,516                        5.2%
Main issue: Name/Title headings often not established
Main issues:
○ Sierra treats subdivided subject headings as single headings, inflating the report (version 3.4 should correct this issue!)
○ Music headings (instruments, “Arranged”) are often valid but not established in an AR
Typically this process (excluding Validate Headings in MarcEdit) takes less than an hour and resolves about 36% of unauthorized headings, averaging over a thousand ARs a week. However, several “known issues” in Sierra made me place this project on hold until they are fixed. So consider:
1. the number of new headings you normally see in a report,
2. the quality of your incoming bib records (can the headings be authorized?), and
3. how effectively your ILS can use authority records
If you decide to test this process at your institution or discover any refinements, please let me know! mmonaco@uakron.edu
Help with regular expressions:
Regular Expressions 101: http://regex101.com
RegEx-enabled text editors:
EditPad Lite: https://www.editpadlite.com
EmEditor: https://www.emeditor.com/
pgAdmin (free software): https://www.pgadmin.org/download/
This presentation: https://events.library.nd.edu/ovgtsl2018/talk/monaco.shtml