Automatic Processing of Residual Functional Capacity Form Images


slide-1
SLIDE 1

Keeping Informed: Automatic Processing of Residual Functional Capacity Form Images

JULIA PORCINO AND CHUNXIAO ZHOU HIP’19 SEPTEMBER 20-21, 2019

slide-2
SLIDE 2

Acknowledgements

This research was supported by the Intramural Research Program of the National Institutes of Health and the US Social Security Administration.

All opinions expressed here are the authors' and not those of the US government. We have no conflicts of interest to disclose.

slide-3
SLIDE 3

Background

slide-4
SLIDE 4

US Social Security Administration (SSA)

Disability Programs:

  • Work disability
  • Cash & Health Insurance
  • >10 million beneficiaries
  • 2-3 million new applications

Adjudication Process:

  • Manual review
  • External medical records and evidence
  • Internal administrative & case processing data

[Figure: Number of Beneficiaries (Millions), 1970-2015. Series: Total, Disabled Workers, Spouses, Children]

SSA Office of the Chief Actuary: https://www.ssa.gov/oact/STATS/DIbenies.html

slide-5
SLIDE 5

Residual Functional Capacity (RFC) Forms

Function as it relates to work

  • Mental and Physical RFCs
  • Checkboxes and free text
  • Currently: electronic database
  • Historically: “paper” form
slide-6
SLIDE 6

Motivation

Why are we interested in historical RFC Forms?

  • Update current databases with historical form data
  • Assess change in function over time
  • Comparison to other sources of function

Millions of paper forms

  • Forms used since 1980s
  • Want automatic way to extract information
slide-7
SLIDE 7

Challenges

slide-8
SLIDE 8

SSA Data

SSA stores all documents as TIF images

  • Limitations with existing software

RFC forms come from templates that can be edited

  • Base content (generally) remains consistent
  • Layout varies greatly
slide-9
SLIDE 9

RFC Form Variation

Number of checkboxes per section:

slide-10
SLIDE 10

Sections per page:

slide-11
SLIDE 11

Section Spans Two Pages:

slide-12
SLIDE 12

Distance between rows and columns:

slide-13
SLIDE 13

Handwriting

slide-14
SLIDE 14

Methods

slide-15
SLIDE 15

Automatic Data Extraction

Steps:

➢ Checkbox Detection
➢ Checkbox Matching
    ➢ Templates
    ➢ Template Matching Algorithm
➢ Record Output

slide-16
SLIDE 16

Checkbox Detection

Use Python’s OpenCV to detect checkboxes based on size and shape

Ratio of black to white pixels at the center of a checkbox indicates whether it is marked
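The marked/unmarked decision can be illustrated with a minimal sketch in plain Python. It assumes the checkbox has already been located and cropped to a grayscale patch (the size-and-shape detection done with OpenCV is not shown). The function names, the size of the central window, and both thresholds are illustrative assumptions, not values from the talk.

```python
def center_dark_ratio(patch, center_fraction=0.5, dark_below=128):
    """Fraction of dark pixels in the central window of a grayscale patch.

    patch: list of rows, each a list of 0-255 intensities (0 = black).
    center_fraction and dark_below are assumed values for illustration.
    """
    height, width = len(patch), len(patch[0])
    # Margins that leave a centered window covering `center_fraction` of each side,
    # so the printed border of the box does not count as a mark.
    margin_y = int(height * (1 - center_fraction) / 2)
    margin_x = int(width * (1 - center_fraction) / 2)
    window = [row[margin_x:width - margin_x]
              for row in patch[margin_y:height - margin_y]]
    pixels = [value for row in window for value in row]
    dark = sum(1 for value in pixels if value < dark_below)
    return dark / len(pixels)


def is_marked(patch, min_dark_ratio=0.2):
    """Treat a box as marked when enough of its center is dark (assumed cutoff)."""
    return center_dark_ratio(patch) >= min_dark_ratio


# Synthetic 8x8 patches: a blank interior and one with a dark 2x2 blob in the middle.
empty_box = [[255] * 8 for _ in range(8)]
marked_box = [row[:] for row in empty_box]
for y in range(3, 5):
    for x in range(3, 5):
        marked_box[y][x] = 0

print(is_marked(empty_box), is_marked(marked_box))
```

Looking only at the center is also what makes off-center marks easy to miss, one of the error sources listed later in the deck.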

slide-17
SLIDE 17

Checkbox Matching

Checkbox Position:

  • Euclidean Coordinates: $(x_i, y_i, p_i)$
  • Row-Column Coordinates (RCC): $(r_i, c_i)$

Checkbox Alignment:

  • $|x_i - x_j| < e_c \Rightarrow c_i = c_j$
  • $|y_i - y_j| < e_r \Rightarrow r_i = r_j$
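One way to turn the alignment rule into row-column coordinates is sketched below: checkboxes whose x-coordinates differ by less than a column tolerance share a column, and likewise for y-coordinates and rows. The function names and the tolerance values are assumptions for illustration. The page component of the position is ignored, so the sketch covers a single page.

```python
def group_by_tolerance(values, tolerance):
    """Map each value to a 1-based group index.

    Consecutive sorted values closer than `tolerance` share a group, so
    groups are chained (single linkage) rather than compared to one anchor.
    """
    index_of = {}
    group = 0
    previous = None
    for value in sorted(set(values)):
        if previous is None or value - previous >= tolerance:
            group += 1
        index_of[value] = group
        previous = value
    return index_of


def to_rcc(boxes, col_tolerance=10, row_tolerance=10):
    """Convert (x, y) pixel centers to (row, column) coordinates."""
    col_of = group_by_tolerance([x for x, _ in boxes], col_tolerance)
    row_of = group_by_tolerance([y for _, y in boxes], row_tolerance)
    return [(row_of[y], col_of[x]) for x, y in boxes]


# Six slightly misaligned checkbox centers forming a 2 x 3 grid.
boxes = [(100, 50), (203, 52), (300, 49), (101, 120), (199, 118), (302, 121)]
rcc = to_rcc(boxes)
print(rcc)
```

Because RCC depends only on relative alignment, the same coordinates result when row and column spacing changes between form layouts, which is the variation described in the Challenges slides.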

slide-18
SLIDE 18

Section Break Row-Column Coordinates

RCC when no break occurs:

Before: [(1,1), (2,2), (2,3), (3,2), (3,3)] After: {}

RCC when break occurs after 1st row:

Before: [(1,1)] After: [(1,1), (1,2), (2,1), (2,2)]

RCC when break occurs after 2nd row:

Before: [(1,1), (2,2), (2,3)] After: [(1,1), (1,2)]
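The three cases above can be reproduced with a short sketch. It assumes that each part of a split section is renumbered so its rows and columns start at 1 with no gaps. That rule is inferred from the examples rather than stated on the slide, and the function names are hypothetical.

```python
def renumber(coords):
    """Dense-rank rows and columns so both start at 1 with no gaps."""
    rows = {r: i for i, r in enumerate(sorted({r for r, _ in coords}), start=1)}
    cols = {c: i for i, c in enumerate(sorted({c for _, c in coords}), start=1)}
    return [(rows[r], cols[c]) for r, c in coords]


def split_at_break(coords, break_after_row=None):
    """Split a section's RCC list at a page break after the given row."""
    if break_after_row is None:
        return renumber(coords), []
    before = [(r, c) for r, c in coords if r <= break_after_row]
    after = [(r, c) for r, c in coords if r > break_after_row]
    return renumber(before), renumber(after)


section = [(1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
print(split_at_break(section))      # no break
print(split_at_break(section, 1))   # break after 1st row
print(split_at_break(section, 2))   # break after 2nd row
```

Calling `split_at_break` for every row index of a section enumerates all of its possible breaks.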

slide-19
SLIDE 19

Templates

3 Types of Templates:

  • Section Template TS
      • Simplest type of template
      • Combined with other sections to match form
  • Form Template TF
      • Consider entire form F to be one section S
      • Reduces ambiguity across sections
  • Break Template TSK
      • Encodes all possible section breaks

slide-20
SLIDE 20

Template Matching Algorithm

slide-21
SLIDE 21

Record Output

Environmental Limitations:

File Name | Extreme Cold       | Extreme Heat | Wetness            | Humidity
SAMPLE    | Avoid Concentrated | Unlimited    | Avoid Concentrated | Unlimited

SAMPLE.tif: [form image]
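A record like the one for SAMPLE could be produced from matched checkboxes roughly as follows. This is a sketch only: the label dictionaries stand in for a section template and are hypothetical (the real form may well have more rating columns), and CSV is an assumed output format, not one named in the talk.

```python
import csv
import io

# Hypothetical section template: RCC row -> item label, RCC column -> rating label.
ROW_LABELS = {1: "Extreme Cold", 2: "Extreme Heat", 3: "Wetness", 4: "Humidity"}
COLUMN_LABELS = {1: "Unlimited", 2: "Avoid Concentrated"}


def build_record(file_name, marked_rcc):
    """Translate marked (row, column) checkboxes into labelled field values."""
    record = {"File Name": file_name}
    for row, column in marked_rcc:
        record[ROW_LABELS[row]] = COLUMN_LABELS[column]
    return record


def to_csv(records):
    """Serialize records with one column per template row label."""
    fieldnames = ["File Name"] + list(ROW_LABELS.values())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Marked checkboxes for the sample form, expressed as row-column coordinates.
marked = [(1, 2), (2, 1), (3, 2), (4, 1)]
record = build_record("SAMPLE", marked)
csv_text = to_csv([record])
print(csv_text)
```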

slide-22
SLIDE 22

Tasks

TASK              | PURPOSE                                                                              | PHYSICAL RFCs* | MENTAL RFCs*
Validation        | Evaluate templates and matching algorithm performance against original form images  | 10000          | 5000
Comparison        | Evaluate template matching (RCC) against location matching (Euclidean)              | 4914           | 2364
Sample Generation | Perform data entry for entire sample                                                 | 497646         | 98408

*Refers to number of images in sample

slide-23
SLIDE 23

Results

slide-24
SLIDE 24

Performance Metrics

  • Performance across 3 tasks for Physical RFC (PRFC) and Mental RFC (MRFC)
  • Comparison of Template vs. Location Matching

slide-25
SLIDE 25

Error Analysis

Recall Errors:

  • Missed checkboxes
  • Image interference
  • Scan noise
  • Handwriting
  • False positives

Precision Errors:

  • Checkboxes appear marked when not
  • Image interference/Scan noise
  • Checkboxes not marked in center
  • Handwriting
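For reference, recall and precision over marked checkboxes can be computed as in the sketch below, using the standard definitions. How the authors actually scored their output is not stated in the slides, and the example coordinates are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over sets of marked checkbox positions."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of reported marks that are really marked.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real marks that were reported.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up example: four real marks; the extractor finds two of them,
# misses two, and reports one box that is not actually marked.
actual = {(1, 2), (2, 1), (3, 2), (4, 1)}
predicted = {(1, 2), (2, 1), (3, 1)}
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

In this example the two missed marks lower recall, and the one spurious mark lowers precision.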
slide-26
SLIDE 26

Next Steps

Checkbox Identification:

  • Train models to identify checkboxes
  • Deep learning models

Checkbox Matching:

  • Add automation to template generation
  • Learn to identify column/row headings

Generalization:

  • Apply methods to other data
  • Checkboxes in medical records
slide-27
SLIDE 27

Conclusion

Successfully used novel templates to extract checkbox data

Good performance comes from specificity of task and strong assumptions

  • Grid-like structure of checkboxes
  • No ambiguity in forms

Able to achieve good performance with basic computer vision

  • Necessitated by limited computing resources
  • Errors came from missing checkboxes (handwriting, scan noise, etc.)
  • More advanced methods (e.g., deep learning) could help improve checkbox identification or may be necessary for other applications (e.g., medical records)

slide-28
SLIDE 28

Thank you! Questions?

Contact Information: julia.porcino@nih.gov