1
AUTOMATING PROCESSING AND INTAKE IN THE INSTITUTIONAL REPOSITORY WITH PYTHON
John Beatty
INTRODUCTION: What are we doing here?
Populating the Institutional Repository
Year 1: Faculty scholarship
Year 2: Law
2
What are we doing here?
4
It’s true I had zero Python programming knowledge at the start of this project. But I was starting with some knowledge:
5
The Law Journal Project
6
issue/volume
8
November
9
What’s so special about the Law Review?
10
in the system as a section rather than individual pieces
individual metadata
author data
page
11
issue are all in one file
in HeinOnline
2 book reviews at most; splitting and metadata creation was done by hand
reviews an issue, so automation helpful
12
an entire issue
have them
13
the prior year’s Court of Appeals term
14
the top of a page
note
15
New York or United States law
16
Previous Solution
worker
Our Situation
massive LSP migration
because of retirement
17
and split PDFs
18
Proposed:
1-2 weeks
Actual:
(Thanksgiving week)
(November 26-30)
(December 3-21, January 3-11)
Note: Processing time included a LOT
19
First Steps
22
PDF files
pdfminer.six, for Python 3 compatibility)
23
specifically says not to rely on the text extraction functions
PDF files, so can be relied on for text extraction. But it doesn’t have the manipulation functions.
24
Workflow
25
Just One Problem:
allow the code to consistently identify the metadata elements.
start page (printed)
Commons batch spreadsheets
26
Use appropriate dsplit-XX.py to extract
good enough to trust that it’s right.
27
Open the CSV file and check it against the original PDF. Fix titles, authors, and most importantly, start and ending pages for the PDF split.
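Part of this hand check can be backed up by a quick sanity pass over the corrected CSV before any splitting happens. The following is a minimal sketch, not the code from the project: the column names `start_page` and `end_page` are hypothetical, and it assumes rows are listed in page order and that one piece may begin on the page where the previous one ends.

```python
import csv
import io


def check_page_ranges(rows):
    """Return a list of problems found in hand-corrected page ranges."""
    problems = []
    previous_end = None
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        try:
            start = int(row["start_page"])  # hypothetical column name
            end = int(row["end_page"])      # hypothetical column name
        except (KeyError, ValueError, TypeError):
            problems.append(f"row {line_no}: missing or non-numeric page")
            continue
        if end < start:
            problems.append(
                f"row {line_no}: end page {end} precedes start page {start}"
            )
        # A piece may start on the page where the previous one ends,
        # so only a start strictly before the previous end is flagged.
        if previous_end is not None and start < previous_end:
            problems.append(
                f"row {line_no}: start page {start} falls before "
                f"previous end page {previous_end}"
            )
        previous_end = end
    return problems


# Made-up sample data standing in for a hand-corrected CSV file.
sample = (
    "title,author,start_page,end_page\n"
    "First Article,A. Author,1,24\n"
    "Second Article,B. Author,20,40\n"
    "Third Article,C. Author,45,41\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
problems = check_page_ranges(rows)
for problem in problems:
    print(problem)
```

A check like this only catches mechanical slips; titles and authors still need a human eye against the original PDF.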
28
Feed hand-corrected CSV and original PDF back to dsplit-XX.py to split. For extra fun, hand-correct a couple of volumes, then use a bash script to run through them all while you get coffee.
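Before the split itself, the printed page numbers in the CSV have to be turned into positions within the PDF file. Here is a rough sketch of that conversion, under loud assumptions: `plan_splits` and the column names are hypothetical, the first page of the volume PDF is taken to carry the printed number `first_printed_page`, and pagination is assumed to be continuous with no unnumbered pages. The real dsplit-XX.py scripts may handle this differently.

```python
def plan_splits(rows, first_printed_page):
    """Convert printed page ranges to zero-based, end-exclusive PDF indices.

    Assumes the first page of the PDF carries the printed number
    first_printed_page and that numbering is continuous.
    """
    plan = []
    for row in rows:
        start = int(row["start_page"]) - first_printed_page
        end = int(row["end_page"]) - first_printed_page + 1
        if start < 0 or end <= start:
            raise ValueError(f"bad page range for {row['title']!r}")
        plan.append((row["title"], start, end))
    return plan


# Made-up rows standing in for a hand-corrected CSV.
rows = [
    {"title": "First Article", "start_page": "101", "end_page": "124"},
    {"title": "Book Review", "start_page": "125", "end_page": "130"},
]
plan = plan_splits(rows, first_printed_page=101)
for title, start, end in plan:
    print(title, start, end)
```

Each `(title, start, end)` tuple can then be handed to whichever PDF library does the actual page extraction.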
29
Feed that CSV file to dc-convert.py. Copy everything back to the main
exported Excel file into DC spreadsheet.
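As an illustration of what a conversion step like this involves, the sketch below maps corrected CSV rows onto batch-style columns and emits tab-separated text that pastes cleanly into a spreadsheet. It is not the actual dc-convert.py: the output column names (`author1_fname`, `fpage`, and so on) and the input column names are placeholders, and the real headers should be copied from the exported batch spreadsheet.

```python
import csv
import io

# Placeholder headers; copy the real ones from the exported batch sheet.
BATCH_COLUMNS = ["title", "author1_fname", "author1_lname", "fpage", "lpage"]


def split_author(name):
    """Split 'Last, First' or 'First Last' into (first, last)."""
    name = name.strip()
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
    else:
        # Everything before the final space is treated as the first name.
        first, _, last = name.rpartition(" ")
    return first, last


def to_batch_rows(rows):
    """Map corrected CSV rows onto the batch columns."""
    batch = []
    for row in rows:
        first, last = split_author(row["author"])
        batch.append({
            "title": row["title"].strip(),
            "author1_fname": first,
            "author1_lname": last,
            "fpage": row["start_page"],
            "lpage": row["end_page"],
        })
    return batch


def to_tsv(batch_rows):
    """Render batch rows as tab-separated text for pasting."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=BATCH_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(batch_rows)
    return buffer.getvalue()


# Made-up rows standing in for a hand-corrected CSV.
rows = [
    {"title": " A Title ", "author": "Smith, Jane",
     "start_page": "1", "end_page": "10"},
    {"title": "Another Title", "author": "Mary Ann Jones",
     "start_page": "11", "end_page": "30"},
]
batch = to_batch_rows(rows)
print(to_tsv(batch))
```

Author-name splitting is a heuristic here; suffixes, multiple authors, and institutional authors would all need hand correction.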
30
Open split PDFs in Box preview. Check page split. Double-check metadata. Add
31
Main Python code—Contains all reusable code
Most of these code segments are called by external files that act as command-line interfaces
33
single file. Takes a path and combines all files in filename order.
proper columns to be cut and pasted into DC upload sheets.
the last page of the previous file in the directory
and adds it to the end of the first file (quickly replaced by dir-shift.py)
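Combining "in filename order" is only as reliable as the filenames themselves. The sketch below shows the ordering step alone, with the PDF merge left to whichever library is in use. The helper name `files_in_filename_order` and the sample filenames are hypothetical.

```python
import tempfile
from pathlib import Path


def files_in_filename_order(directory, pattern="*.pdf"):
    """Return matching files sorted by filename (plain string order)."""
    return sorted(Path(directory).glob(pattern), key=lambda p: p.name)


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("issue-10.pdf", "issue-02.pdf", "issue-01.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    ordered = [p.name for p in files_in_filename_order(tmp)]

print(ordered)
```

Because the sort is a plain string comparison, zero-padded names such as `issue-02.pdf` keep their intended sequence, while an unpadded `issue-2.pdf` would sort after `issue-10.pdf`.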
34
What else can I do with this thing I built?
36
and replace, then copied into Excel file
38
https://github.com/johnrbeatty/journaltools
https://digitalcommons.law.buffalo.edu/law_librarian_other_scholarship/