CORRESPONDENCE MANAGEMENT SYSTEMS (FOR DIGITAL OBJECT METADATA) - - PowerPoint PPT Presentation



slide-1
SLIDE 1

LEVERAGING CORRESPONDENCE MANAGEMENT SYSTEMS

(FOR DIGITAL OBJECT METADATA)

BRIAN THOMAS ELECTRONIC RECORDS SPECIALIST TEXAS STATE LIBRARY AND ARCHIVES COMMISSION

slide-2
SLIDE 2

DISCLAIMER

This presentation and any subsequent discussion represents work and perspectives on work completed at the Texas State Library and Archives Commission by the presenter. Opinions and perspectives provided by this presenter are their own and may not indicate the official stance of the agency.

slide-3
SLIDE 3

CTS: THE CORRESPONDENCE TRACKING SYSTEM

Some details

  • 1. Completely homegrown system
  • 2. Interface written in Visual Basic 6
  • 3. Running against a MS SQL Server database
  • 4. The database itself is a record
  • 5. Covers physical mail, webmail, phone calls
  • 6. Each mail/webmail item was supposed to have a corresponding image file or PDF

slide-4
SLIDE 4

WHAT IF…

The content in the database could be extracted in a way that captured the elements of the Governor’s staff interface? And then paired with the individual images themselves in the preservation/access system for staff research? And possibly indexed for some linked data fun?

slide-5
SLIDE 5

FROM: HTTPS://WWW.YOUTUBE.COM/WATCH?V=AOF5LCT5JD0

slide-6
SLIDE 6

IF YOU HAVE A HAMMER, EVERYTHING LOOKS LIKE A NAIL

About me and the tools at my disposal

  • 1. I had been working on database preservation
  • 2. I love virtualization
  • 3. I had also been using Python extensively for API and data manipulation in other projects
  • 4. Therefore almost all work for this project was done with Python in a virtual machine
  • 5. I like the new Doctor

Courtesy https://imgur.com/gallery/NIgUNZZ

slide-7
SLIDE 7

OVERVIEW OF THE WORK

  • 1. Preserve database
  • 2. Study database structure
  • 3. Export and manipulate data
  • 4. Export data to valid sidecar files
  • 5. Final data manipulation
  • 6. Fix miscellaneous problems

slide-8
SLIDE 8

THE ACTUAL STEPS

  • Get SQL Server 2018 running
  • Preserve the database into SIARD format
  • Review tables in SQL Server Management Studio and Database Visualization Toolkit to understand data structure
  • Review fields in CTS GUI to see what staff would have worked with
  • Determine how tables should be connected
  • Export tables to CSV format
  • Use Python pandas to merge tables
  • Replace illegal characters in spreadsheets
  • Use Python script to export metadata into individual files
  • Use Python script to create valid XML
  • Use Python script to validate the XML
  • Fix broken XML, re-validate until all good
  • Transform metadata export to desired schema (x2, see later explanation)
  • Use Python script to remove artifacts from transforms
  • Use Python to correct file naming/pairing errors
  • Re-upload files with sidecar metadata
slide-9
SLIDE 9

STEP ONE

Preserve the database

slide-10
SLIDE 10

STEP 1: PRESERVE THE DATABASE

Running SQL Server

  • First step, see the database in its actual unmediated format
  • Take SQL dump and import it into SQL Server
  • Use SQL Server Management Studio or similar software to review structure and contents
  • Maybe can export directly to a spreadsheet? Run XML export?

SQL Server Management Studio available here: https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017

Run Database Preservation Toolkit

  • SIARD format, XML-based
    ○ captures all database content and most functions
  • Invented by Swiss Federal Archives
    ○ SIARD Suite app converted databases to SIARD
  • Database Preservation Toolkit is a product of E-ARK and seeks to automate conversion, using the more detailed SIARD2 standard
  • http://www.database-preservation.com/
  • Later Swiss Federal Archives released a tool for the SIARD2.1 standard
    ○ https://www.bar.admin.ch/bar/en/home/archiving/tools/siard-suite.html

slide-11
SLIDE 11

IN SQL SERVER MANAGEMENT STUDIO

slide-12
SLIDE 12

IN DATABASE VISUALIZATION TOOLKIT

slide-13
SLIDE 13

WHAT IT SHOULD HAVE LOOKED LIKE

slide-14
SLIDE 14

STEP TWO

Study database structure

slide-15
SLIDE 15

STEP 2: STUDY THE DATABASE STRUCTURE

  • 1. Review staff GUI for essential elements
  • 2. Find elements in database tables
  • 3. Develop a plan on how to reconstruct the information elements from all tables
  • 4. Beware programmatic joins not represented in linked tables

slide-16
SLIDE 16

STEP THREE

Export and manipulate data

slide-17
SLIDE 17

STEP 3: EXPORT AND MANIPULATE DATA

  • 1. Export each table to CSV using a DBVTK export function
  • 2. Load individual CSVs using Python pandas
  • 3. Merge CSV files on shared column data
    a. Use an outer, inner, left/right join?
  • 4. Iteratively save, slice and dice the output
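
The load-and-merge step might look roughly like the sketch below. The table and column names (a mail table, a constituent table, `constituent_id`) are hypothetical stand-ins, since the slides do not show the real CTS schema, and the CSV text is inlined so the example runs on its own. The `indicator=True` option is one way to answer the join question: it shows which rows would be lost under an inner join.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two tables exported to CSV; the real CTS
# table and column names are not given in the slides.
MAIL_CSV = """mail_id,constituent_id,subject
1,10,Road repair
2,11,School funding
3,99,Thank you
"""
CONSTITUENT_CSV = """constituent_id,name
10,A. Smith
11,B. Jones
12,C. Lee
"""

def merge_tables(left_csv, right_csv, key, how="left"):
    """Load two CSV exports and merge them on a shared column."""
    # dtype=str keeps identifiers exactly as exported (no 0011 -> 11).
    left = pd.read_csv(io.StringIO(left_csv), dtype=str)
    right = pd.read_csv(io.StringIO(right_csv), dtype=str)
    # indicator=True adds a "_merge" column: both / left_only / right_only.
    return left.merge(right, on=key, how=how, indicator=True)

merged = merge_tables(MAIL_CSV, CONSTITUENT_CSV, "constituent_id", how="left")
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(len(unmatched), "mail rows had no matching constituent")
```

A left join keeps every mail row even when the lookup table has no match; switching `how` to `"inner"` or `"outer"` and comparing row counts is a quick way to see what each choice drops or adds.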
slide-18
SLIDE 18

STEP FOUR

Export data to valid sidecar files

slide-19
SLIDE 19

STEP 4: EXPORT DATA TO VALID SIDECAR FILES

  • Eliminate the illegal characters from the CSV(s) first
    ○ I didn’t the first time and spent over a day correcting the results
  • Load each CSV and run a script to export that data into a metadata file per ???
    ○ Make sure it appends data, not overwrites. You may have multiple entries for the same thing
  • Run a script to encapsulate the data to create valid XML
  • Run another script to validate your XML
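
A minimal sketch of the character clean-up, assuming "illegal characters" means the control characters that XML 1.0 does not allow (the slides do not list them). The sample CSV text is made up. Characters such as `&` and `<` are legal once escaped, so they are left for the XML export step to handle.

```python
import re

# Control characters and non-characters that XML 1.0 forbids. Tab, line
# feed and carriage return are allowed, so they are not in this class.
# Assumption: this is what "illegal characters" means on the slide.
XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

def strip_illegal(text):
    """Remove XML-illegal characters; return (cleaned_text, count_removed)."""
    return XML_ILLEGAL.subn("", text)

# Made-up CSV content containing a vertical tab and a NUL byte.
dirty = "id,subject\n1,Road\x0b repair\x00\n2,School funding\n"
cleaned, removed = strip_illegal(dirty)
print(repr(cleaned), removed)
```

Running this over the whole file text before any CSV parsing leaves commas, quotes and line breaks untouched, so the table structure is preserved.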


slide-20
SLIDE 20
slide-21
SLIDE 21

STEP FIVE

Final data manipulation

slide-22
SLIDE 22

STEP 5: FINAL DATA MANIPULATION

  • Check existing XML schemas for fit
    ○ 95 data points
    ○ TEI too simple
    ○ Qualified Dublin Core not a good fit
  • Write your own?
    ○ Yes!
  • Run XSLTs against XML files to match chosen/written schema
  • Run more XSLTs to de-dupe content
  • Re-arrange XML into correct directory structure
  • Pair with files in-system or re-upload files
slide-23
SLIDE 23

STEP SIX

Fix miscellaneous problems

slide-24
SLIDE 24

PROBLEM ONE: MISSING IMAGES AND DB ENTRIES

  • Everything should have been there
  • Paper correspondence only sampled
  • Some images had no metadata. Outgoing/incoming correspondence not logged? Log name is correct?
  • Some metadata had no images. Missing files? Never scanned?
  • 353,674 mail entries without any logged scan. Never scanned? Forgot to add filename?
  • Yes to all
slide-25
SLIDE 25

PROBLEM ONE: SOLUTION(S)

  • Develop a script to identify what might be missing
  • Including specific filepaths for processing
  • Create a cute no-scan placeholder file for missing scans so metadata is preserved
  • Leave items without metadata as is. Still text searchable

slide-26
SLIDE 26

PROBLEM TWO: CAPITALIZATION ERRORS

  • False negatives for matching XML because…
  • Staff did not capitalize database entries the same way they capitalized the images
  • Problem because metadata pairing process is sensitive to exact filename

Solution

  • Use comparative script to generate a list of image/metadata files without matches (with filepath)
  • Use a script to de-capitalize listed filenames and compare.
  • If there is a match, use the image version of the filename to rename the metadata file
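
The de-capitalize-and-compare idea can be sketched as a function that only plans the renames, leaving the actual renaming to a later step. The file names below are invented for illustration.

```python
import os

def plan_renames(image_names, metadata_names):
    """Return {old_metadata_name: new_metadata_name} for case-only mismatches."""
    # Index image stems by their lowercase form.
    image_stems = {}
    for name in image_names:
        stem = os.path.splitext(name)[0]
        image_stems.setdefault(stem.lower(), stem)

    renames = {}
    for name in metadata_names:
        stem, ext = os.path.splitext(name)
        match = image_stems.get(stem.lower())
        # Only plan a rename when the names differ by capitalization alone.
        if match is not None and match != stem:
            renames[name] = match + ext
    return renames

# Invented file names: one case mismatch, one exact match, one orphan.
images = ["Smith_Letter01.tif", "jones02.tif"]
metadata = ["smith_letter01.xml", "jones02.xml", "lee03.xml"]
print(plan_renames(images, metadata))
```

Reviewing the planned renames as a list before touching any files keeps the step reversible; each pair could then be applied with `os.rename`.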

slide-27
SLIDE 27

PROBLEM THREE: SAME IMAGE IN MULTIPLE PLACES

  • False negatives for matching XML because…
  • The file is in another folder altogether
  • And it is in multiple places

Solution

  • Use comparative script to generate a list of image/metadata files without matches (with filepath)
  • Use a script to de-capitalize listed filenames, drop the filepath and compare.
  • If there is a match, copy the file to a new location with the correct filepath
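
One possible reading of this solution, as a sketch: match on the lowercased file name alone, then plan a copy of the image into the folder where the metadata expects it. The paths are invented, and the choice to copy the image (rather than the metadata) is an assumption.

```python
import posixpath

def plan_copies(unmatched_metadata, all_images):
    """Plan (source_image, target_path) copies for metadata lacking a local image."""
    # Index every image by lowercase stem, ignoring its folder.
    by_stem = {}
    for image in all_images:
        stem = posixpath.splitext(posixpath.basename(image))[0].lower()
        by_stem.setdefault(stem, []).append(image)

    copies = []
    for meta in unmatched_metadata:
        folder, filename = posixpath.split(meta)
        stem = posixpath.splitext(filename)[0]
        sources = by_stem.get(stem.lower())
        if sources:
            # The same image may sit in several folders; one copy is enough.
            source = sources[0]
            ext = posixpath.splitext(source)[1]
            copies.append((source, posixpath.join(folder, stem + ext)))
    return copies

# Invented paths: the wanted image exists twice, both in the wrong folder.
unmatched = ["2001/06/0611/200106110167.xml"]
images = [
    "2001/07/0702/200106110167.TIF",
    "2001/08/0815/200106110167.TIF",
    "2001/06/0611/200106110168.tif",
]
print(plan_copies(unmatched, images))
```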

slide-28
SLIDE 28

PROBLEM FOUR: MISFILED/MISNAMED FILES

  • Files put in the wrong directory
  • E.g. 200106110167.tif filed in directory 2001/01/0111
  • Files misnamed
  • E.g. 200106110167.tif misnamed as 200101110167.tif

Solution

  • If no matches in metadata, generate a generic metadata file suggesting that the reader look for the correct metadata based on the content of the file
  • SIP creator tool catches duplicate names; correct them at the point where it finds errors.
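
Misfiled items can in principle be flagged by deriving the expected directory from the file name. The sketch below assumes a YYYY/MM/MMDD layout and a name that starts with YYYYMMDD; both are inferred only from the single example on this slide and may not hold for the whole collection.

```python
import re

# Assumed naming pattern: YYYYMMDD followed by a sequence number.
NAME_PATTERN = re.compile(r"^(\d{4})(\d{2})(\d{2})\d+$")

def expected_directory(filename):
    """Derive the assumed YYYY/MM/MMDD directory from a file name, or None."""
    stem = filename.rsplit(".", 1)[0]
    match = NAME_PATTERN.match(stem)
    if match is None:
        return None
    year, month, day = match.groups()
    return f"{year}/{month}/{month}{day}"

def is_misfiled(filename, directory):
    """True when the name implies a different directory than the actual one."""
    expected = expected_directory(filename)
    return expected is not None and expected != directory

print(expected_directory("200106110167.tif"))
print(is_misfiled("200106110167.tif", "2001/01/0111"))
```

A misnamed file that happens to sit in the directory its wrong name implies would pass this check, which is why the generic metadata file is still needed as a fallback.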

slide-29
SLIDE 29

PROBLEM FIVE: LOGGED PHONE CALLS

  • 771,825 logged phone calls
  • No document for these
  • Need an object to pair metadata to OR
  • Upload metadata only and rely on text search?
  • Create an HTML version of metadata?

Solution

  • Find a cool icon
  • Use a script to generate a list of metadata files but with the file extension changed to match the icon file extension
  • Use a script to mass copy the icon into an image that can be uploaded
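
The two scripted steps can be combined in one short sketch: for each metadata file, copy the icon to the same name with the icon's extension. Directory and file names here are hypothetical, and placeholder bytes stand in for a real icon.

```python
import shutil
import tempfile
from pathlib import Path

def copy_icon_for_metadata(metadata_dir, icon_path):
    """Give every .xml file a same-named copy of the icon to pair with."""
    icon_path = Path(icon_path)
    created = []
    for meta in sorted(Path(metadata_dir).glob("*.xml")):
        # Same stem as the metadata file, extension taken from the icon.
        target = meta.with_suffix(icon_path.suffix)
        if not target.exists():
            shutil.copyfile(icon_path, target)
            created.append(target.name)
    return created

# Throwaway demo layout with hypothetical names.
work = Path(tempfile.mkdtemp())
icon = work / "icon" / "phone.png"
icon.parent.mkdir()
icon.write_bytes(b"not a real image")  # placeholder bytes, not a real PNG
calls = work / "calls"
calls.mkdir()
for name in ["call_0001.xml", "call_0002.xml"]:
    (calls / name).write_text("<record/>", encoding="utf-8")

created = copy_icon_for_metadata(calls, icon)
print(created)
```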

slide-30
SLIDE 30

LESSONS LEARNED/COULD HAVE DONE BETTER

  • Expand the conversation to account for more internal stakeholder/staff requests
  • Don’t trust that anybody did 100% of what they said they did
  • Direct database SQL queries?
  • Before-the-fact contingency planning

http://4.bp.blogspot.com/-pOMrxILoPV8/TgOfWqGU8SI/AAAAAAAAAlU/XXDsDr4BaS8/s1600/mistake3.jpg

slide-31
SLIDE 31

NOW LET’S DISCUSS...

1. How could this have been done better?
2. What situations are other people facing?
3. What limitations do you have to work around?
4. Any other thoughts?

Courtesy NBC.com (https://www.nbc.com/saturday-night-live/video/coffee-talk/n10457)

slide-32
SLIDE 32

CONTACT INFORMATION

BRIAN THOMAS
NON-GOVERNMENTAL EMAIL: BRIAN.THE.ARCHIVIST@GMAIL.COM
GOVERNMENTAL EMAIL: BTHOMAS@TSL.TEXAS.GOV
WORK PHONE: 512-475-3374

slide-33
SLIDE 33

SOME USEFUL SCRIPTS/TRICKS

slide-34
SLIDE 34

MERGING SPREADSHEETS USING PYTHON/PANDAS

slide-35
SLIDE 35

EXPORTING TO XML FROM CSV USING PYTHON
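
The script shown on this slide is not part of the transcript. A minimal sketch of the idea, assuming one metadata file per scanned item and a hypothetical `scan_file` column that names the image, could look like this. It writes loose `<entry>` fragments in append mode, so repeated rows for the same item are all kept; wrapping them in a root element is left to the encapsulation step.

```python
import csv
import io
import tempfile
from pathlib import Path
from xml.sax.saxutils import escape

# Hypothetical merged export; "scan_file" stands in for whatever column
# names the image each row describes.
MERGED_CSV = """scan_file,subject,note
200106110167.tif,Road repair,Replied & closed
200106110168.tif,School funding,Forwarded
200106110167.tif,Road repair,Follow-up call
"""

def row_to_fragment(row):
    """Render one CSV row as an <entry> fragment with escaped text."""
    # Assumes the column names are already valid XML element names.
    fields = "".join(
        f"  <{name}>{escape(value)}</{name}>\n" for name, value in row.items()
    )
    return f"<entry>\n{fields}</entry>\n"

def export_fragments(csv_text, key, out_dir):
    """Append one fragment per row to the file named after the key column."""
    out_dir = Path(out_dir)
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = out_dir / (Path(row[key]).stem + ".xml")
        # Mode "a" appends, so a second row for the same item is kept.
        with open(target, "a", encoding="utf-8") as handle:
            handle.write(row_to_fragment(row))

out_dir = tempfile.mkdtemp()
export_fragments(MERGED_CSV, "scan_file", out_dir)
for path in sorted(Path(out_dir).iterdir()):
    print(path.name, path.read_text(encoding="utf-8").count("<entry>"))
```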

slide-36
SLIDE 36

XML ENCAPSULATION AND VALIDATION USING PYTHON
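
The original script is not in the transcript. As a rough sketch under stated assumptions: encapsulation wraps the loose fragments in a single root element (the `record` tag name is made up), and validation here means only a well-formedness check with the standard library parser, not validation against a schema.

```python
import xml.etree.ElementTree as ET

def encapsulate(fragments, root_tag="record"):
    """Wrap loose fragments in one root element with an XML declaration."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"<{root_tag}>\n{fragments}</{root_tag}>\n"
    )

def is_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError as err:
        return False, str(err)
    return True, None

good = encapsulate("<entry><subject>Road repair</subject></entry>\n")
# An unescaped ampersand is a typical cause of broken XML.
bad = encapsulate("<entry><subject>Fish & chips</subject></entry>\n")
print(is_well_formed(good))
print(is_well_formed(bad)[0])
```

The error message returned for a failing file includes the line and column reported by the parser, which helps with the fix-and-re-validate loop.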

slide-37
SLIDE 37

BATCH FILING IN WINDOWS COMMAND LINE

  • Print file list to a text file (Karen’s Directory Printer works wonders)
  • In Excel or another spreadsheet program
    ○ “mid” function to pull source directory
    ○ “mid” function to pull full filename
    ○ “mid” function to pull subdirectory 1, 2, etc.
    ○ “concat” function to assemble parts for a Windows “mkdir” PowerShell command
      ■ Don’t forget to dedupe
    ○ “concat” function to assemble parts for Windows “move” cmd to file into new directories
  • Copy finished mkdir and move commands and paste as values to remove formulas
  • Copy mkdir and move to PowerShell and cmd, respectively. Wait… … …
slide-38
SLIDE 38

MASS MANIPULATION WITH STYLESHEETS AND PYTHON

  • XSL transform engine
  • Example de-dupe transform
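
The stylesheet and transform engine shown on this slide are not in the transcript. The sketch below is not XSLT; it is a standard-library Python stand-in that illustrates the same de-dupe idea, dropping child elements that exactly repeat an earlier sibling. The element names are hypothetical.

```python
import xml.etree.ElementTree as ET

def dedupe_children(root):
    """Remove child elements that exactly repeat an earlier sibling."""
    seen = set()
    removed = 0
    for child in list(root):
        # The serialized element (tag, attributes, text, descendants)
        # serves as its identity; surrounding whitespace is ignored.
        key = ET.tostring(child, encoding="unicode").strip()
        if key in seen:
            root.remove(child)
            removed += 1
        else:
            seen.add(key)
    return removed

xml_text = (
    "<record>"
    "<entry><subject>Road repair</subject></entry>"
    "<entry><subject>Road repair</subject></entry>"
    "<entry><subject>School funding</subject></entry>"
    "</record>"
)
root = ET.fromstring(xml_text)
removed = dedupe_children(root)
print(removed, len(root))
```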

slide-39
SLIDE 39

COMPARING DIRECTORIES USING PYTHON
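
The comparison script itself is not in the transcript. A minimal sketch, assuming images and sidecar metadata live in parallel directory trees and pair on an identical relative path and file name (extension aside), might look like this. The matching is deliberately case-sensitive, mirroring the pairing process described earlier; the extensions and paths are assumptions.

```python
import tempfile
from pathlib import Path

def stems_by_relpath(root, suffixes):
    """Map 'subdir/stem' to the relative path of each file with a wanted suffix."""
    root = Path(root)
    found = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            rel = path.relative_to(root)
            found[rel.with_suffix("").as_posix()] = rel.as_posix()
    return found

def compare_trees(image_root, metadata_root):
    """Return (images without metadata, metadata without images)."""
    images = stems_by_relpath(image_root, {".tif", ".pdf"})
    metadata = stems_by_relpath(metadata_root, {".xml"})
    no_metadata = sorted(images[k] for k in images.keys() - metadata.keys())
    no_image = sorted(metadata[k] for k in metadata.keys() - images.keys())
    return no_metadata, no_image

# Throwaway demo trees with invented file names.
base = Path(tempfile.mkdtemp())
for rel in [
    "images/2001/06/0611/200106110167.tif",
    "images/2001/06/0611/200106110168.tif",
    "metadata/2001/06/0611/200106110167.xml",
    "metadata/2001/06/0611/200106110169.xml",
]:
    target = base / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()

no_meta, no_image = compare_trees(base / "images", base / "metadata")
print(no_meta)
print(no_image)
```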