LEVERAGING CORRESPONDENCE MANAGEMENT SYSTEMS
(FOR DIGITAL OBJECT METADATA)
BRIAN THOMAS ELECTRONIC RECORDS SPECIALIST TEXAS STATE LIBRARY AND ARCHIVES COMMISSION
DISCLAIMER
This presentation and any subsequent discussion represents work and perspectives on work completed at the Texas State Library and Archives Commission by the presenter. Opinions and perspectives provided by this presenter are their own and may not indicate the official stance of the agency.
CTS: THE CORRESPONDENCE TRACKING SYSTEM
Some details
database
have a corresponding image file or PDF
Could the content in the database be extracted in a way that captured the elements of the Governor’s staff interface? Could it then be paired with the individual images themselves in the preservation/access system for staff research? And possibly indexed for some linked data fun?
FROM: HTTPS://WWW.YOUTUBE.COM/WATCH?V=AOF5LCT5JD0
IF YOU HAVE A HAMMER, EVERYTHING LOOKS LIKE A NAIL
About me and the tools at my disposal
1. I had been working on database preservation
2. I love virtualization
3. I had also been using Python extensively for API and data manipulation in other projects
4. Therefore almost all work for this project was done with Python in a virtual machine
5. I like the new Doctor
Courtesy https://imgur.com/gallery/NIgUNZZ
OVERVIEW OF THE WORK
1. Preserve database
2. Study database structure
3. Export and manipulate data
4. Export data to valid sidecar files
5. Final data manipulation
6. Fix miscellaneous problems
THE ACTUAL STEPS
Studio and Database Visualization Toolkit to understand data structure
would have worked with
individual files
(x2, see later explanation)
transforms
STEP ONE
STEP 1: PRESERVE THE DATABASE
Running SQL Server
unmediated format
software to review structure and contents
Run XML export?
SQL Server Management Studio available here: https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017
Run Database Preservation Toolkit
○ captures all database content and most functions
○ SIARD Suite app converted databases to SIARD
and seeks to automate conversion, more detailed SIARD2 standard
SIARD2.1 standard
○ https://www.bar.admin.ch/bar/en/home/archiving/tools/siard-suite.html
IN SQL SERVER MANAGEMENT STUDIO
IN DATABASE VISUALIZATION TOOLKIT
WHAT IT SHOULD HAVE LOOKED LIKE
STEP TWO
STEP 2: STUDY THE DATABASE STRUCTURE
elements
reconstruct the information elements from all tables
represented in linked tables
STEP THREE
STEP 3: EXPORT AND MANIPULATE DATA
DBVTK export function
PANDAS
data
a.
Use an outer, inner, left/right join?
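The join question above can be explored on a small sample before committing to one choice. The following is a minimal sketch in pandas; the table and column names (letters, senders, sender_id) are hypothetical, since the slides do not list the actual CTS tables.

```python
import pandas as pd

# Hypothetical tables; the real CTS table and column names are not given in the slides.
letters = pd.DataFrame({
    "letter_id": [1, 2, 3],
    "subject": ["Roads", "Schools", "Parks"],
    "sender_id": [10, 11, 99],
})
senders = pd.DataFrame({
    "sender_id": [10, 11, 12],
    "sender_name": ["A. Smith", "B. Jones", "C. Lee"],
})

# Left join: keep every letter, even when no sender row matches.
left = letters.merge(senders, on="sender_id", how="left")

# Inner join: keep only letters that have a matching sender.
inner = letters.merge(senders, on="sender_id", how="inner")

# Outer join with indicator: the _merge column shows which side each row came from.
outer = letters.merge(senders, on="sender_id", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]
```

Comparing the row counts of the three results is a quick way to see how many records each join would silently drop or leave half-empty.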
STEP FOUR
STEP 4: EXPORT DATA TO VALID SIDECAR FILES
first
○
I didn’t the first time and spent over a day correcting the results
data into a metadata file per ???
○
Make sure it appends data, not
entries for the same thing
valid XML
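One way to keep later rows from overwriting earlier ones is to group every row for an image before writing anything. The sketch below uses only the standard library; the CSV columns, the "metadata" and "entry" element names, and the ".xml" sidecar naming are assumptions for illustration, not the schema used in the project.

```python
import csv
import io
import tempfile
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

# Hypothetical export: one row per database entry; several rows may share an image.
CSV_TEXT = """filename,subject,sender
200101110167.tif,Roads,A. Smith
200101110167.tif,Bridges & Tolls,A. Smith
200101110168.tif,Schools,B. Jones
"""

def group_rows(csv_text):
    """Collect all rows for each image so later rows add to earlier ones."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["filename"]].append(row)
    return grouped

def write_sidecars(grouped, out_dir):
    """Write one XML sidecar per image and return the paths written."""
    paths = []
    for filename, rows in grouped.items():
        root = ET.Element("metadata", {"file": filename})
        for row in rows:
            entry = ET.SubElement(root, "entry")
            for column, value in row.items():
                if column != "filename":
                    # ElementTree escapes characters such as "&" when serializing.
                    ET.SubElement(entry, column).text = value
        path = Path(out_dir) / (filename + ".xml")
        ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()
sidecars = write_sidecars(group_rows(CSV_TEXT), out_dir)
```

Running this on a handful of rows first makes it easy to open the resulting files by hand and confirm that repeated entries were appended rather than replaced.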
STEP FIVE
STEP 5: FINAL DATA MANIPULATION
○ 95 data points
○ TEI too simple
○ Qualified Dublin Core not a good fit
○ Yes!
chosen/written schema
STEP SIX
PROBLEM ONE: MISSING IMAGES AND DB ENTRIES
Outgoing/incoming correspondence not logged? Log name is correct?
files? Never scanned?
filename?
PROBLEM ONE: SOLUTION(S)
missing
missing scans so metadata is preserved
searchable
PROBLEM TWO: CAPITALIZATION ERRORS
the same way they capitalized the images
is sensitive to exact filename
Solution
image/metadata files without matches (with filepath)
and compare.
filename to rename the metadata file
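The capitalization fix described above could be sketched as follows. This is an illustration only: it assumes that each metadata file shares its base name with its image and that the image's capitalization is the one to keep. The function name and sample filenames are hypothetical.

```python
from pathlib import PurePath

def plan_renames(image_names, metadata_names):
    """Map each metadata file to the name it needs to match its image exactly.

    Matching ignores case; the image's capitalization is treated as correct.
    """
    images_by_lower = {
        PurePath(name).stem.lower(): PurePath(name).stem for name in image_names
    }
    renames = {}
    for meta in metadata_names:
        path = PurePath(meta)
        wanted_stem = images_by_lower.get(path.stem.lower())
        # Only plan a rename when a match exists and the capitalization differs.
        if wanted_stem is not None and wanted_stem != path.stem:
            renames[meta] = wanted_stem + path.suffix
    return renames

images = ["Letter_0167.TIF", "letter_0168.tif"]
metadata = ["letter_0167.xml", "letter_0168.xml", "letter_0999.xml"]
renames = plan_renames(images, metadata)
```

Returning a plan rather than renaming immediately leaves room to review the list before any file is touched.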
PROBLEM THREE: SAME IMAGE IN MULTIPLE PLACES
because…
Solution
image/metadata files without matches (with filepath)
filenames, drop the filepath and compare.
location with the correct filepath
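Dropping the filepath and comparing bare filenames might look like the sketch below. The paths are invented examples, and the assumption that a metadata file and its image differ only in extension is mine rather than something stated in the slides.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def find_targets(unmatched_metadata, unmatched_images):
    """For each unmatched metadata path, list where it would sit next to its image."""
    images_by_stem = defaultdict(list)
    for image in unmatched_images:
        path = PurePosixPath(image)
        images_by_stem[path.stem].append(path)

    targets = {}
    for meta in unmatched_metadata:
        meta_path = PurePosixPath(meta)
        # Compare on the bare name only; the folder is ignored at this stage.
        matches = images_by_stem.get(meta_path.stem, [])
        if matches:
            targets[meta] = [str(match.parent / meta_path.name) for match in matches]
    return targets

metadata = ["2001/01/0111/200101110167.xml"]
images = ["2001/01/0112/200101110167.tif", "2001/02/0201/200101110167.tif"]
targets = find_targets(metadata, images)
```

Each metadata file can map to more than one target, which covers the case where the same image was stored in several folders.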
PROBLEM FOUR: MISFILED/MISNAMED FILES
2001/01/0111
200101110167.tif
Solution
generic metadata file suggesting a search for the correct metadata based on the content of the file
correct at the point that it finds errors.
PROBLEM FIVE: LOGGED PHONE CALLS
search?
Solution
files but with the file extension changed to match the icon file extension
image that can be uploaded
LESSONS LEARNED/COULD HAVE DONE BETTER
more internal stakeholder/staff requests
100% of what they said they did)
http://4.bp.blogspot.com/-pOMrxILoPV8/TgOfWqGU8SI/AAAAAAAAAlU/XXDsDr4BaS8/s1600/mistake3.jpg
NOW LET’S DISCUSS...
1.
How could this have been done better?
2.
What situations are other people facing?
3.
What limitations do you have to work around?
4.
Any other thoughts?
Courtesy NBC.com (https://www.nbc.com/saturday-night-live/video/coffee-talk/n10457)
BRIAN THOMAS
NON-GOVERNMENTAL EMAIL: BRIAN.THE.ARCHIVIST@GMAIL.COM
GOVERNMENTAL EMAIL: BTHOMAS@TSL.TEXAS.GOV
WORK PHONE: 512-475-3374
MERGING SPREADSHEETS USING PYTHON/PANDAS
EXPORTING TO XML FROM CSV USING PYTHON
XML ENCAPSULATION AND VALIDATION USING PYTHON
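As a minimal sketch of one part of this topic, the check below confirms only that a sidecar is well-formed XML, using the standard library. It does not validate against a schema, which would need a schema-aware library, and the sample element names are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if the text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None

# An escaped ampersand parses; a bare one does not.
good = "<metadata><subject>Roads &amp; Bridges</subject></metadata>"
bad = "<metadata><subject>Roads & Bridges</subject></metadata>"
good_result = is_well_formed(good)
bad_result = is_well_formed(bad)
```

Unescaped characters copied straight from a database field are a common reason a sidecar fails this check.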
BATCH FILING IN WINDOWS COMMAND LINE
printer works wonders)
○ “mid” function to pull source directory
○ “mid” function to pull full filename
○ “mid” function to pull subdirectory 1, 2, etc.
○ “concat” function to assemble parts for a Windows “mkdir” PowerShell command
■ Don’t forget to dedupe
○ “concat” function to assemble parts for a Windows “move” command to file into new directories
paste as values to remove formulas
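The same spreadsheet logic could also be sketched in Python, with string slicing standing in for “mid” and string formatting for “concat”. The folder layout (year, month, then month plus day, taken from the first eight digits of the filename) is an assumption inferred from the 2001/01/0111 and 200101110167.tif example, and the C:\scans source folder is hypothetical.

```python
from pathlib import PureWindowsPath

def target_dir(filename):
    """Derive year/month/monthday folders from a YYYYMMDDnnnn-style name (assumed layout)."""
    stem = PureWindowsPath(filename).stem
    return PureWindowsPath(stem[0:4], stem[4:6], stem[4:8])

def build_commands(source_dir, filenames):
    """Return deduplicated mkdir commands followed by one move command per file."""
    mkdirs, moves = [], []
    for name in filenames:
        folder = target_dir(name)
        mkdir = f'mkdir "{folder}"'
        if mkdir not in mkdirs:  # the "don't forget to dedupe" step
            mkdirs.append(mkdir)
        source = PureWindowsPath(source_dir) / name
        moves.append(f'move "{source}" "{folder}"')
    return mkdirs + moves

commands = build_commands("C:\\scans", ["200101110167.tif", "200101110168.tif"])
```

The function only builds the command text, so the list can be reviewed or saved to a batch file before anything is moved.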
MASS MANIPULATION WITH STYLESHEETS AND PYTHON
XSL transform engine
Example de-dupe transform
COMPARING DIRECTORIES USING PYTHON
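A directory comparison along these lines could be sketched as below. It assumes images and metadata live in parallel folder trees and share base names, with ".tif" and ".xml" as placeholder extensions; the demo builds a throwaway tree in a temporary folder.

```python
import tempfile
from pathlib import Path

def stems_with_paths(root, suffix):
    """Map each relative path (extension removed) to its full path under root."""
    root = Path(root)
    return {
        path.relative_to(root).with_suffix("").as_posix(): path
        for path in root.rglob("*" + suffix)
    }

def unmatched(image_root, metadata_root, image_suffix=".tif", metadata_suffix=".xml"):
    """Return (images without metadata, metadata without images) as full paths."""
    images = stems_with_paths(image_root, image_suffix)
    metadata = stems_with_paths(metadata_root, metadata_suffix)
    images_only = sorted(str(images[key]) for key in images.keys() - metadata.keys())
    metadata_only = sorted(str(metadata[key]) for key in metadata.keys() - images.keys())
    return images_only, metadata_only

# Build a small demo tree: a.tif has metadata, b.tif does not, c.xml has no image.
base = Path(tempfile.mkdtemp())
for rel in ["images/2001/a.tif", "images/2001/b.tif",
            "metadata/2001/a.xml", "metadata/2001/c.xml"]:
    path = base / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()

images_only, metadata_only = unmatched(base / "images", base / "metadata")
```

The two lists keep full paths, so they can feed directly into the capitalization and duplicate-location checks.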