
Automatic Processing of Residual Functional Capacity Form Images



  1. Keeping Informed: Automatic Processing of Residual Functional Capacity Form Images. Julia Porcino and Chunxiao Zhou. HIP’19, September 20-21, 2019

  2. Acknowledgements This research was supported by the Intramural Research Program of the National Institutes of Health and the US Social Security Administration. All opinions expressed here are the authors' and not those of the US government. We have no conflicts of interest to disclose.

  3. Background

  4. US Social Security Administration (SSA) Disability Programs: o Work disability o Cash & Health Insurance o >10 million beneficiaries o 2-3 million new applications Adjudication Process: o Manual review o External medical records and evidence o Internal administrative & case processing data [Figure: Number of Beneficiaries (Millions) by year, 1970-2015, for Total, Disabled Workers, Spouses, and Children. Source: SSA Office of the Chief Actuary: https://www.ssa.gov/oact/STATS/DIbenies.html]

  5. Residual Functional Capacity (RFC) Forms Function as it relates to work o Mental and Physical RFCs o Checkboxes and free text o Currently: electronic database o Historically: “paper” form

  6. Motivation Why are we interested in historical RFC Forms? o Update current databases with historical form data o Assess change in function over time o Comparison to other sources of function Millions of paper forms o Forms used since 1980s o Want automatic way to extract information

  7. Challenges

  8. SSA Data SSA stores all documents as TIF images o Limitations with existing software RFC forms come from templates that can be edited o Base content (generally) remains consistent o Layout varies greatly

  9. RFC Form Variation Number of checkboxes per section:

  10. Sections per page:

  11. Section Spans Two Pages:

  12. Distance between rows and columns:

  13. Handwriting

  14. Methods

  15. Automatic Data Extraction Steps: ➢ Checkbox Detection ➢ Checkbox Matching ➢ Templates ➢ Template Matching Algorithm ➢ Record Output

  16. Checkbox Detection Use OpenCV in Python to detect checkboxes based on size and shape. The ratio of black to white pixels at the center of a checkbox indicates whether it is marked.
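
The marked-checkbox test on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the image is a 2D list of grayscale values and that bounding boxes have already been found (for example with OpenCV's `cv2.findContours` and `cv2.boundingRect`). The names `center_dark_ratio` and `is_marked`, and the `margin`, `dark_threshold`, and `min_ratio` values, are hypothetical.

```python
def center_dark_ratio(image, box, margin=0.25, dark_threshold=128):
    """Return the fraction of dark pixels in the central region of a checkbox.

    image: 2D list of grayscale values (0 = black, 255 = white).
    box:   (x, y, w, h) bounding box of a detected checkbox.
    margin and dark_threshold are illustrative assumptions.
    """
    x, y, w, h = box
    dx, dy = int(w * margin), int(h * margin)  # skip the printed border
    dark = total = 0
    for row in image[y + dy : y + h - dy]:
        for value in row[x + dx : x + w - dx]:
            total += 1
            dark += value < dark_threshold
    return dark / total if total else 0.0


def is_marked(image, box, min_ratio=0.2):
    """Treat a checkbox as marked when enough central pixels are dark."""
    return center_dark_ratio(image, box) >= min_ratio
```

Looking only at the center keeps the printed box outline from counting as ink, which is also why marks placed off-center can be missed.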

  17. Checkbox Matching Checkbox Position: ◦ Euclidean Coordinates ◦ $(x_i, y_i, p_i)$ ◦ Row-Column Coordinates (RCC) ◦ $(r_i, c_i)$ Checkbox Alignment: ◦ $|x_i - x_j| < e_c \Rightarrow c_i = c_j$ ◦ $|y_i - y_j| < e_r \Rightarrow r_i = r_j$
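
One way to turn pixel positions into row-column coordinates under the alignment rule above is to sort each axis and start a new row or column whenever the gap to the previous value reaches the tolerance. The sketch below assumes this chained grouping; `assign_indices` and `to_rcc` are hypothetical names, and the tolerances in the usage note are made up.

```python
def assign_indices(values, tolerance):
    """Give each coordinate a 1-based group index.

    Values are visited in sorted order; a value closer than `tolerance`
    to the previous one joins the same group (same row or same column).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    indices = [0] * len(values)
    group = 0
    previous = None
    for i in order:
        if previous is None or values[i] - previous >= tolerance:
            group += 1
        indices[i] = group
        previous = values[i]
    return indices


def to_rcc(points, col_tolerance, row_tolerance):
    """Convert (x, y) checkbox centers to (row, column) coordinates."""
    cols = assign_indices([x for x, _ in points], col_tolerance)
    rows = assign_indices([y for _, y in points], row_tolerance)
    return list(zip(rows, cols))
```

For example, `to_rcc([(100, 50), (300, 152), (402, 150)], 10, 10)` places the second and third boxes in the same row even though their vertical positions differ by two pixels.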

  18. Section Break Row-Column Coordinates RCC when no break occurs: Before: [(1,1), (2,2), (2,3), (3,2), (3,3)] After: {} RCC when break occurs after 1st row: Before: [(1,1)] After: [(1,1), (1,2), (2,1), (2,2)] RCC when break occurs after 2nd row: Before: [(1,1), (2,2), (2,3)] After: [(1,1), (1,2)]
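
The three cases on this slide are consistent with splitting the section at the break row and renumbering rows and columns within each part so they start at 1 with no gaps. The sketch below reproduces the slide's examples under that inferred rule; it is not confirmed as the authors' procedure, and `renumber` and `split_at_break` are hypothetical names.

```python
def renumber(cells):
    """Re-index (row, column) pairs so rows and columns are 1, 2, 3, ..."""
    row_rank = {r: i + 1 for i, r in enumerate(sorted({r for r, _ in cells}))}
    col_rank = {c: i + 1 for i, c in enumerate(sorted({c for _, c in cells}))}
    return [(row_rank[r], col_rank[c]) for r, c in cells]


def split_at_break(cells, break_after_row):
    """Split a section's RCC list at a page break and renumber each part."""
    before = [(r, c) for r, c in cells if r <= break_after_row]
    after = [(r, c) for r, c in cells if r > break_after_row]
    return renumber(before), renumber(after)


section = [(1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
print(split_at_break(section, 1))
print(split_at_break(section, 2))
```

Renumbering each part independently matters because a page that holds only the tail of a section has no way to know which rows or columns appeared on the previous page.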

  19. Templates 3 Types of Templates: o Section Template $T_S$ o Simplest type of template o Combined with other sections to match form o Form Template $T_F$ o Consider entire form F to be one section S o Reduces ambiguity across sections o Break Template $T_{SK}$ o Encodes all possible section breaks

  20. Template Matching Algorithm

  21. Record Output SAMPLE.tif: File Name: SAMPLE. Environmental Limitations: Extreme Cold: Avoid Concentrated; Extreme Heat: Unlimited; Wetness: Avoid Concentrated; Humidity: Unlimited
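
A record like the one on this slide could be written as one CSV row per form image. This is a sketch under assumptions: the slide does not state the output format, the pairing of values to columns is read off the flattened slide table, and `write_records` is a hypothetical helper.

```python
import csv
import io


def write_records(records, fieldnames):
    """Serialize extracted checkbox values as CSV text, one row per image."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Column names and values mirror the SAMPLE.tif example (assumed pairing).
fields = ["File Name", "Extreme Cold", "Extreme Heat", "Wetness", "Humidity"]
sample = {
    "File Name": "SAMPLE",
    "Extreme Cold": "Avoid Concentrated",
    "Extreme Heat": "Unlimited",
    "Wetness": "Avoid Concentrated",
    "Humidity": "Unlimited",
}
print(write_records([sample], fields))
```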

  22. Tasks
      TASK | PURPOSE | PHYSICAL RFCs* | MENTAL RFCs*
      Validation | Evaluate templates and matching algorithm performance against original form images | 10000 | 5000
      Comparison | Evaluate template matching (RCC) against location matching (Euclidean) | 4914 | 2364
      Sample Generation | Perform data entry for entire sample | 497646 | 98408
      *Refers to number of images in sample

  23. Results

  24. Performance Metrics Performance across 3 tasks for Physical RFC (PRFC) and Mental RFC (MRFC). Comparison of Template vs. Location Matching

  25. Error Analysis Recall Errors: o Missed checkboxes o Image interference o Scan noise o Handwriting o False positives Precision Errors: o Checkboxes appear marked when they are not o Image interference/Scan noise o Checkboxes not marked in center o Handwriting
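
The slides do not define how precision and recall were computed. Assuming the standard definitions applied to the set of checkboxes reported as marked, a minimal sketch looks like this; `precision_recall` is a hypothetical helper, and checkboxes are identified here by their row-column coordinates.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over sets of marked checkboxes.

    predicted: checkboxes the pipeline reported as marked.
    actual:    checkboxes marked on the original form image.
    Returns 1.0 for a metric whose denominator is empty (a convention,
    chosen here only for illustration).
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall
```

Under these definitions, a missed checkbox lowers recall, while an unmarked checkbox that appears marked lowers precision.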

  26. Next Steps Checkbox Identification: o Train models to identify checkboxes o Deep learning models Checkbox Matching: o Add automation to template generation o Learn to identify column/row headings Generalization: o Apply methods to other data o Checkboxes in medical records

  27. Conclusion Successfully used novel templates to extract checkbox data Good performance comes from specificity of task and strong assumptions o Grid-like structure of checkboxes o No ambiguity in forms Able to achieve good performance with basic computer vision o Necessitated by limited computing resources o Errors came from missing checkboxes (handwriting, scan noise, etc.) o More advanced methods (e.g., deep learning) could help improve checkbox identification or may be necessary for other applications (e.g., medical records)

  28. Thank you! Questions? Contact Information: julia.porcino@nih.gov
