Automating Authority Work: Automating authority work, or, Be your own authority control vendor

Mike Monaco, Coordinator, Cataloging Services, May 14, 2018
Ohio Valley Group of Technical Services Librarians 2018 Conference


slide-1
SLIDE 1

Automating Authority Work

Mike Monaco Coordinator, Cataloging Services May 14, 2018

slide-2
SLIDE 2

Automating authority work,
or,
Be your own authority control vendor

Mike Monaco
Coordinator, Cataloging Services
The University of Akron
mmonaco@uakron.edu

Ohio Valley Group of Technical Services Librarians 2018 Conference, May 13-15, 2018
Hesburgh Libraries, The University of Notre Dame
South Bend, Indiana

slide-3
SLIDE 3

Who are you?

  • John Carroll University (2001-2004): Part-time AV cataloger
  • Akron-Summit County Public Library (2001-2004): Substitute librarian
  • Cleveland Public Library (2004-2016): Catalog librarian
  • The University of Akron (2016- ): Coordinator, Cataloging Services

slide-4
SLIDE 4

The University of Akron Libraries

University Libraries

  • Bierce Library
  • Science & Technology Library
  • Archival Services

(Separate Units)

  • Wayne College Library
  • Akron Law Library
  • Center for the History of Psychology
slide-5
SLIDE 5

Authority control at UA

  • 1995 migration and vendor (BNA) supplied one-time authority processing
  • Local authority work put on hold in expectation of contracting with a vendor… which never happened
  • Authority work resumed early 2000s
      ○ Full authority control for tangible items only
      ○ Shift to batches of e-resources over time made authority work for batches overwhelming
      ○ 2013: Budget 80:20 electronic:tangible
      ○ 2018: ratio is about 95:5

slide-6
SLIDE 6

What this is NOT about

  • Automated authority control within the ILS
  • Working with an authority control vendor

slide-7
SLIDE 7

What this IS

  • Grabbing the “low-hanging fruit” for batches of records
  • When traditional authority work is not practical (the item is not in-hand or headings reports are too vast to address individually)
slide-8
SLIDE 8

Wouldn’t it be nice if...

The “Headings used for the first time” report could export a list of the headings, and we could batch search OCLC for records?

slide-9
SLIDE 9

The tool box

  • MarcEdit
  • OCLC Connexion Client
  • Excel (or other program for sorting textual lists)
  • pgAdmin (or similar for a SQL query, III/Sierra only)
  • A rudimentary grasp of Regular Expressions
  • EditPad (or similar RegEx-compatible text editor: Google Sheets, EmEditor)
slide-10
SLIDE 10

The process

1. Before loading, correct variant headings (with MarcEdit)
2. After loading, extract headings from report (with SQL query or ILS’s output)
3. Separate names and subjects (in a spreadsheet or text editor)
4. Remove extraneous data (with RegEx-capable editor)
5. Batch search for authority records (in Connexion Client)
6. Load authority records

slide-11
SLIDE 11

Validate Headings

MarcEdit can check name and subject fields against LC authorities in the Linked Data Service, and automatically correct headings that match a variant (“Use for”) heading*. *NB: The process is imperfect!
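
The idea behind correcting a variant heading can be sketched in a few lines of Python. This is not MarcEdit's implementation and it does not call the Linked Data Service; it assumes a small hand-built lookup table (the hypothetical `VARIANT_TO_AUTHORIZED`) standing in for the “Use for” references in authority records, and a made-up `normalize` helper for matching.

```python
import re

# Hypothetical sample data: normalized variant ("Use for") form -> authorized form.
VARIANT_TO_AUTHORIZED = {
    "clemens samuel langhorne 1835 1910": "Twain, Mark, 1835-1910",
}

def normalize(heading: str) -> str:
    """Lowercase and turn punctuation into spaces so minor differences still match."""
    return " ".join(re.sub(r"[^\w\s]", " ", heading.lower()).split())

def correct_heading(heading: str) -> str:
    """Return the authorized form if the heading matches a known variant."""
    return VARIANT_TO_AUTHORIZED.get(normalize(heading), heading)

print(correct_heading("Clemens, Samuel Langhorne, 1835-1910."))  # variant: replaced
print(correct_heading("Monaco, Mike"))                           # no match: unchanged
```

Loose matching of this kind can also change a heading that merely resembles a variant, which is one way an automated correction can go wrong.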

slide-12
SLIDE 12

Because this is an extra step, we’ve been comparing record sets from various vendors to determine which ones really benefit.
slide-13
SLIDE 13

Selected Vendor Loads (March 2017-March 2018)

Record source           Records    Invalid headings  Variants changed  Invalid heading:  Variant:
                        per load   per load          per load          record ratio      record ratio
Alexander Street Press  293        117               18                0.399279          0.060566
EBSCO                   76992      75668             181               0.982807          0.002348
Films on Demand         2509       1019              175               0.406314          0.069911
Kanopy                  9960       4946              397               0.496628          0.039815
ProQuest EBC            13086      2309              101               0.17647           0.0077
World Share             31         7                 0.7               0.232114          0.023772
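
The two ratio columns can be approximated from the per-load counts. The short Python sketch below divides invalid headings and changed variants by records per load; the `LOADS` dictionary and `ratios` function are illustrative names, and the figures are copied from the vendor load table. The results land close to, but not exactly on, the published ratios, presumably because the table averages ratios across individual loads (an assumption, since the slides do not say).

```python
# (records per load, invalid headings per load, variants changed per load)
LOADS = {
    "Alexander Street Press": (293, 117, 18),
    "EBSCO": (76992, 75668, 181),
    "Films on Demand": (2509, 1019, 175),
    "Kanopy": (9960, 4946, 397),
    "ProQuest EBC": (13086, 2309, 101),
}

def ratios(records: float, invalid: float, variants: float) -> tuple:
    """Return (invalid heading : record ratio, variant : record ratio)."""
    return invalid / records, variants / records

for source, counts in LOADS.items():
    invalid_ratio, variant_ratio = ratios(*counts)
    print(f"{source:24} invalid {invalid_ratio:.4f}  variant {variant_ratio:.4f}")
```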

slide-14
SLIDE 14

III Sierra

slide-15
SLIDE 15

SQL query of Headings used for the first time report

https://mmonaco-uakron.tinytake.com/sf/MjUwMDQxMF83NTIyNTY0

slide-16
SLIDE 16

Headings used for the first time

Hundreds or even thousands of entries after batch loads...

slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19

SQL query*

slide-20
SLIDE 20

Results...

slide-21
SLIDE 21

In Excel...

Be sure to import as Unicode (UTF-8) if your ILS is encoding characters as Unicode rather than MARC8!
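
A small Python illustration of why the import encoding matters. It assumes the export is UTF-8 and that the wrong guess is the legacy Windows code page cp1252; the sample name is made up for the demonstration.

```python
name = "Brontë, Charlotte"
raw = name.encode("utf-8")      # bytes as a Unicode-encoding ILS would export them

print(raw.decode("utf-8"))      # read with the right encoding: Brontë, Charlotte
print(raw.decode("cp1252"))     # read with the wrong encoding: BrontÃ«, Charlotte
```

A garbled heading like the second one will not match its authority record in a later batch search.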

slide-22
SLIDE 22

Sort the terms

Sorting A-Z arranges the headings by field group tag and MARC tag (a=names, b=other names, d=subject). So:

  • a100-b730 : names used as names
  • d600-d630 : names used as subjects
  • d650- : subjects
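
The same separation can be sketched in Python instead of a spreadsheet. The sketch assumes each exported row begins with the field group tag immediately followed by the three-digit MARC tag (for example `a100` or `d650`); the sample rows and the `classify` function are hypothetical.

```python
from collections import defaultdict

def classify(row: str) -> str:
    """Bucket a row by its leading field group tag + MARC tag (e.g. 'a100', 'd650')."""
    key = row[:4].lower()
    if "a100" <= key <= "b730":
        return "names"
    if "d600" <= key <= "d630":
        return "names as subjects"
    if key.startswith("d") and key >= "d650":
        return "subjects"
    return "other"

rows = [
    "d6500 |aCats|vFiction.",
    "a1001 |aTwain, Mark,|d1835-1910.",
    "d6001 |aTwain, Mark,|d1835-1910|xHumor.",
    "b7001 |aKemble, E. W.|q(Edward Windsor),|d1861-1933.",
]

groups = defaultdict(list)
for row in sorted(rows):          # A-Z sort, as in the spreadsheet
    groups[classify(row)].append(row)

for label, members in groups.items():
    print(label, len(members))
```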

slide-23
SLIDE 23

Notice

You can’t feed this raw data into a batch search in Connexion Client

slide-24
SLIDE 24

In EditPad

(or other RegEx-enabled editor)
Strip out MARC tags, delimiters, punctuation, etc.

slide-25
SLIDE 25

Find/replace using RegEx

(.*\|a)
(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)
(\|e.*|\|4.*|\|0.*)
(\|x.*|\|v.*|\|z.*)
(\|.|\|$)
(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’ | ‘| be | that |\.{3}| near )

slide-26
SLIDE 26

Names

(.*\|a) Everything before |a

slide-27
SLIDE 27

(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.) AACR2 abbreviations: b., d., fl., ca.
slide-28
SLIDE 28

(\|e.*|\|4.*|\|0.*) Relator terms, URIs

slide-29
SLIDE 29

(\|.|\|$) Any remaining delimiters and subfield codes

slide-30
SLIDE 30

(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’ | ‘| be | that |\.{3}| near ) Punctuation, operators, and stopwords that foil OCLC searches
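
The name-heading patterns can be chained in Python as a rough equivalent of the find/replace passes. The regular expressions are copied from the slides; everything else is an assumption made for illustration: the sample row format, the function name `clean_name_heading`, and the replacement values (the prefix is deleted, every other match becomes a space, and runs of spaces are collapsed at the end).

```python
import re

# Patterns copied from the slides, in the order they are applied to names.
STRIP_PREFIX = re.compile(r".*\|a")
DATE_ABBREVIATIONS = re.compile(
    r"\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\."
)
RELATORS_AND_URIS = re.compile(r"\|e.*|\|4.*|\|0.*")
REMAINING_DELIMITERS = re.compile(r"\|.|\|$")
NOISE = re.compile(
    r";|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by "
    r"|”|’ | ‘| be | that |\.{3}| near "
)

def clean_name_heading(line: str) -> str:
    """Reduce one exported name heading to a plain search string."""
    text = STRIP_PREFIX.sub("", line, count=1)   # everything before |a
    for pattern in (DATE_ABBREVIATIONS, RELATORS_AND_URIS,
                    REMAINING_DELIMITERS, NOISE):
        text = pattern.sub(" ", text)            # assumed replacement: a space
    return " ".join(text.split())                # collapse repeated spaces

print(clean_name_heading("a1001 |aTwain, Mark,|d1835-1910,|eauthor."))
print(clean_name_heading("a1001 |aChaucer, Geoffrey,|dd. 1400"))
```

Order matters in this sketch: the date abbreviations have to go before the generic delimiter pattern, otherwise `|dd. 1400` would lose only its `|d` and leave a stray `d.` in the search string.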

slide-31
SLIDE 31

Names as subjects

(\|x.*|\|v.*|\|z.*) Subdivisions
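
One practical effect of dropping subdivisions is that several subject-use rows can collapse into a single name to search. The Python sketch below applies only the subdivision pattern from the slide and then removes duplicates; the sample rows and the `drop_subdivisions` helper are hypothetical.

```python
import re

# Subdivision pattern copied from the slide: |x, |v, |z and everything after.
SUBDIVISIONS = re.compile(r"\|x.*|\|v.*|\|z.*")

def drop_subdivisions(heading: str) -> str:
    """Cut a heading off at its first |x, |v, or |z subdivision."""
    return SUBDIVISIONS.sub("", heading).rstrip()

rows = [
    "d6001 |aTwain, Mark,|d1835-1910|xCriticism and interpretation.",
    "d6001 |aTwain, Mark,|d1835-1910|xHumor.",
    "d6001 |aShakespeare, William,|d1564-1616|vBibliography.",
]

unique = sorted({drop_subdivisions(row) for row in rows})
for heading in unique:
    print(heading)
```

Three rows become two distinct names, so the batch search list shrinks accordingly.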

slide-32
SLIDE 32

Converting SQL output to batch searchable text file with RegEx

https://mmonaco-uakron.tinytake.com/sf/MjU4ODk4OF83Nzg3NTMy

slide-33
SLIDE 33

Name headings

slide-34
SLIDE 34

(.*\|a)

slide-35
SLIDE 35

(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)

slide-36
SLIDE 36

(\|e.*|\|4.*|\|0.*)

slide-37
SLIDE 37

(\|.|\|$)

slide-38
SLIDE 38

(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’| be | that |\.{3}| near )

slide-39
SLIDE 39

Names (can be skipped)

(.*\|a)
(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)
(\|e.*|\|4.*|\|0.*)
(\|x.*|\|v.*|\|z.*)
(\|.|\|$)
(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’| be | that |\.{3}| near )

slide-40
SLIDE 40

Names as subjects

(.*\|a)
(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)
(\|e.*|\|4.*|\|0.*)
(\|x.*|\|v.*|\|z.*)
(\|.|\|$)
(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’| be | that |\.{3}| near )

slide-41
SLIDE 41

Topical subjects (can be skipped)

(.*\|a)
(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)
(\|e.*|\|4.*|\|0.*)
(\|x.*|\|v.*|\|z.*)
(\|.|\|$)
(;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’| be | that |\.{3}| near )

slide-42
SLIDE 42

Sirsi/Dynix Symphony

slide-43
SLIDE 43

List unauthorized tags report

slide-44
SLIDE 44
slide-45
SLIDE 45
slide-46
SLIDE 46

Slightly different procedure to clean these up

1. Open .txt file in editor
2. Delete header of report
3. Find/Replace to delete page headers (“Tags With UNAUTHORIZED Headings / Produced on Sat Jul 1 17:00:11 2017”)
4. Separate name and topical headings
5. RegEx to remove other data

slide-47
SLIDE 47

So far so good...

(.*\|a)

slide-48
SLIDE 48

Uh oh...

(\|e.*|\|4.*|\|0.*) Misses “|?UNAUTHORIZED” when it stands by itself. The pattern only captures it when it follows |e, |4, or |0.

slide-49
SLIDE 49

(\|e.*|\|4.*|\|0.*|\|\?.*) \|\?.* captures “|?” followed by anything
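
The difference between the two patterns is easy to check in Python. Both regular expressions are copied from the slides; the sample report lines are assumptions about what the Symphony output looks like.

```python
import re

ORIGINAL = re.compile(r"\|e.*|\|4.*|\|0.*")
EXTENDED = re.compile(r"\|e.*|\|4.*|\|0.*|\|\?.*")

bare = "Twain, Mark,|d1835-1910|?UNAUTHORIZED"
with_relator = "Twain, Mark,|d1835-1910,|eauthor.|?UNAUTHORIZED"

print(ORIGINAL.sub("", bare))          # flag survives: nothing precedes it to match
print(ORIGINAL.sub("", with_relator))  # flag removed along with |eauthor.
print(EXTENDED.sub("", bare))          # flag removed by the added \|\?.* alternative
```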

slide-50
SLIDE 50

Problems

slide-51
SLIDE 51

Spaces

slide-52
SLIDE 52

Scraps

Portions of “|?UNAUTHORIZED” that wrapped to new line

slide-53
SLIDE 53

Asterisks

The report output replaces any diacritics with asterisks

slide-54
SLIDE 54

Line breaks

Name/Title headings are especially likely to get broken up. Here, the delimiter was even separated from the subfield code “t”

slide-55
SLIDE 55

Workaround

1. FIND: \* REPLACE: [nothing] to delete asterisks
2. Use EditPad’s “Extras” to delete blank lines, duplicate lines, etc.
3. Depending on the number of items, you might close up split lines by hand.
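
The first two workaround steps can also be scripted. This Python sketch deletes asterisks, drops blank lines, and removes duplicates while keeping the original order; the `tidy` function and the sample lines are hypothetical, and closing up split lines is still left to hand editing.

```python
def tidy(lines):
    """Delete asterisks, then drop blank and duplicate lines, preserving order."""
    seen = set()
    cleaned = []
    for line in lines:
        line = line.replace("*", "").strip()
        if not line or line in seen:
            continue
        seen.add(line)
        cleaned.append(line)
    return cleaned

report = ["Bront*, Charlotte", "", "Bront*, Charlotte", "  Twain, Mark  ", "*"]
print(tidy(report))
```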

slide-56
SLIDE 56

Searching in batches

slide-57
SLIDE 57

Searching a batch of terms in Connexion Client

https://mmonaco-uakron.tinytake.com/sf/MjU4OTAyMF83Nzg3NjE2

slide-58
SLIDE 58

Batch searching

“Use default index” settings:

  • nw: for names/titles
  • su: for topics/geographic terms

Maximum number of matches to download: 1 (Tools>Options>Batch)

slide-59
SLIDE 59

Batch searching

NOTE: Your local save file has a maximum capacity of 10,000 records, so don’t search more than that many strings!
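
One way to stay under that limit is to split the cleaned search strings into groups of at most 10,000 before running each batch. The Python sketch below shows the splitting only; the `batches` helper and the generated sample strings are hypothetical.

```python
from typing import Iterator, List

MAX_PER_BATCH = 10_000  # local save file capacity stated on the slide

def batches(search_strings: List[str], size: int = MAX_PER_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices holding no more than `size` search strings."""
    for start in range(0, len(search_strings), size):
        yield search_strings[start:start + size]

strings = [f"heading {n}" for n in range(25_000)]
sizes = [len(batch) for batch in batches(strings)]
print(sizes)  # [10000, 10000, 5000]
```

Each slice could then be saved as its own text file and searched in a separate batch run.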

slide-60
SLIDE 60

Successful name searches (of 1941 entries)

slide-61
SLIDE 61

Names, names as subjects, and subjects

III requires name headings that are to be used as subjects to be loaded separately from name headings to be used as names! SirsiDynix does not have this issue.

slide-62
SLIDE 62

A four month test

I’m very pleased with the hit rate on names!

                    Total headings extracted   Hits in batch search   Success rate (ARs found for heading)
Names               36,244                     21,760                 60 %
Names as Subjects   3,795                      896                    23.6 %
Subjects            29,147                     1,516                  5.2 %
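
The success rates are simply hits divided by headings extracted. A short Python check, using the counts from the four-month test (the `RESULTS` dictionary and `success_rate` function are illustrative names):

```python
# (total headings extracted, hits in batch search)
RESULTS = {
    "Names": (36_244, 21_760),
    "Names as Subjects": (3_795, 896),
    "Subjects": (29_147, 1_516),
}

def success_rate(extracted: int, hits: int) -> float:
    """Percentage of extracted headings for which an authority record was found."""
    return 100 * hits / extracted

for category, (extracted, hits) in RESULTS.items():
    print(f"{category:18} {success_rate(extracted, hits):5.1f} %")
```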

slide-63
SLIDE 63

A four month test

Main issue: Name/Title headings often not established

                    Total headings extracted   Hits in batch search   Success rate (ARs found for heading)
Names               36,244                     21,760                 60 %
Names as Subjects   3,795                      896                    23.6 %
Subjects            29,147                     1,516                  5.2 %

slide-64
SLIDE 64

A four month test

Main issues:

  • Sierra treats subdivided subject headings as single headings, inflating report (3.4 should correct this issue!)
  • Music headings (instruments, Arranged) often valid but not established in an AR

                    Total headings extracted   Unique hits in batch search   Success rate (ARs found for heading)
Names               36,244                     21,760                        60 %
Names as Subjects   3,795                      896                           23.6 %
Subjects            29,147                     1,516                         5.2 %

slide-65
SLIDE 65

Is it worth it?

Typically this process (excluding Validate Headings in MarcEdit) takes less than an hour; during the test it resolved 36% of unauthorized headings, averaging over a thousand ARs a week. However, several “known issues” in Sierra made me place this project on hold until they are fixed.

So consider:

1. the number of new headings you normally see in a report,
2. the quality of your incoming bib records (can the headings be authorized?), and
3. the capability of your ILS to use authority records effectively.

slide-66
SLIDE 66

Questions?

If you decide to test this process at your institution or discover any refinements, please let me know! mmonaco@uakron.edu

slide-67
SLIDE 67

Useful tools

Help with regular expressions:

  • Regular Expressions 101 http://regex101.com

RegEx-enabled text editors:

  • EditPad Lite https://www.editpadlite.com
  • EmEditor https://www.emeditor.com/

pgAdmin (free software): https://www.pgadmin.org/download/

This presentation: https://events.library.nd.edu/ovgtsl2018/talk/monaco.shtml