Tabular Data Extraction: Epidemiology Table Classification and Factor Alignment


Dec 03, 2022


SLIDE 1

Tabular Data Extraction

Epidemiology Table Classification and Factor Alignment

Garrick Sherman

SLIDE 2

Last semester...

Worked with Dr. Andrew Leakey

○ Plant biologist
○ The effects of carbon dioxide on photosynthesis

Data is “locked away”
Goals

○ Extract data from articles
○ Keep data associated with articles
○ Add structure to data

SLIDE 3

Last semester...

Look for a set of search terms
Parse HTML tables into CSV files

○ Also extract table captions

Identify columns, “subtables,” and captions about the search terms
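
The column-identification step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper names `matching_columns` and `extract_columns`, the substring matching rule, and the sample table are all assumptions made for the example.

```ruby
# Hypothetical sketch: find the columns of a parsed table whose header
# mentions any search term. A table is an array of rows (arrays of
# strings) with the header in the first row.
def matching_columns(rows, search_terms)
  header = rows.first || []
  terms = search_terms.map(&:downcase)
  header.each_index.select do |i|
    cell = header[i].to_s.downcase
    terms.any? { |term| cell.include?(term) }
  end
end

# Keep only the selected columns of every row.
def extract_columns(rows, indexes)
  rows.map { |row| row.values_at(*indexes) }
end

# Made-up table, for illustration only.
table = [
  ["Species", "CO2 (ppm)", "Photosynthesis rate", "Notes"],
  ["Soybean", "550", "21.3", "field"],
  ["Maize", "550", "30.1", "field"]
]
cols = matching_columns(table, ["co2", "photosynthesis"])
subset = extract_columns(table, cols)
```

Plain substring matching on headers is the simplest possible rule; captions and "subtables" would need their own handling.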

SLIDE 4

Last semester...

Column-based

○ 53 columns extracted
○ Recall: 0.1130 or 0.6279
○ Precision: 0.3774 or 0.5094

Subtables

○ 23 extracted
○ Recall: 0.1356
○ Precision: 0.3158
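
For reference, precision and recall as reported here are usually computed as in the sketch below. The function name `precision_recall` and the identifiers in the sample data are hypothetical; the slides do not give the underlying counts (20 correct out of 53 extracted columns, for instance, would yield 0.3774, but that is only one reading).

```ruby
require "set"

# Precision: share of extracted items that are relevant.
# Recall: share of relevant items that were extracted.
def precision_recall(extracted, relevant)
  extracted = extracted.to_set
  relevant = relevant.to_set
  hits = (extracted & relevant).size
  precision = extracted.empty? ? 0.0 : hits.to_f / extracted.size
  recall = relevant.empty? ? 0.0 : hits.to_f / relevant.size
  [precision, recall]
end

# Made-up column identifiers, not the project's data.
extracted = ["t1:c2", "t1:c3", "t2:c1", "t3:c4"]
relevant = ["t1:c2", "t2:c1", "t2:c5", "t4:c1", "t4:c2"]
precision, recall = precision_recall(extracted, relevant)
```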

SLIDE 5

This semester...

Epidemiology journals

○ 11 high-impact breast cancer journals
  ■ e.g. British Medical Journal, Cancer, International Journal of Breast Cancer, etc.

Classify table as containing summary sample characteristics

Align factors

○ e.g. “Marital Status” and “Married”

SLIDE 6

{"Age at diagnosis (years):"=>
  {"<40"=>["9 (146)", "5 (12)"],
   "40-49"=>["26 (437)", "29 (71)"],
   "50-59"=>["37 (631)", "39 (94)"],
   "≥60"=>["29 (500)", "27 (66)"]},
 "Marital status:"=>
  {"Living with partner"=>["79 (180)"],
   "Living alone"=>["21 (48)"]},
 "Metropolitan classification:"=>
  {"Metropolitan area"=>["59 (1001)", "49 (119)"],
   "Non-metropolitan area"=>["41 (711)", "51 (124)"]},
 ….
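
A structure like this (factor, then option, then cell values) is easy to walk. The sketch below uses a fragment copied from the slide; the helpers `clean_label`, `factor_options`, and `parse_cell` are hypothetical names introduced for illustration, and the slide does not say which number in a cell is the percentage and which is the count.

```ruby
# Fragment of the structure shown on the slide (factor => option => cells).
table = {
  "Age at diagnosis (years):" => {
    "<40" => ["9 (146)", "5 (12)"],
    "40-49" => ["26 (437)", "29 (71)"]
  },
  "Marital status:" => {
    "Living with partner" => ["79 (180)"],
    "Living alone" => ["21 (48)"]
  }
}

# Normalise a factor label: drop surrounding space and the trailing colon.
def clean_label(label)
  label.strip.sub(/:\z/, "")
end

# Map each cleaned factor name to its list of option names.
def factor_options(table)
  table.each_with_object({}) do |(factor, options), out|
    out[clean_label(factor)] = options.keys
  end
end

# Split a cell such as "9 (146)" into its two numbers. Which one is the
# percentage and which the count is left to the caller (an assumption).
def parse_cell(cell)
  m = cell.match(/\A\s*([\d.]+)\s*\(([\d.]+)\)\s*\z/)
  m && [m[1].to_f, m[2].to_f]
end

facets = factor_options(table)
first_cell = parse_cell(table["Age at diagnosis (years):"]["<40"].first)
```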

SLIDE 7

This semester...

Goal:

○ Automated metadata extraction
○ Faceted search
  ■ Find studies of related populations

SLIDE 8

Dataset

First table

○ ~1,500 first tables
○ Train: 1,001
○ Test: 497

NXML format
Fresh codebase

○ But same table parsing approaches

SLIDE 9

Training

Manual annotation

○ Classify based on first 10 lines (or more, if needed) and caption
○ Final tally:
  ■ 41.36% sample characteristics
  ■ 58.64% other

Would certainly be improved with domain expertise

SLIDE 10

Classification

Information gain

○ Tokens from factor and options
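
As a rough illustration of information gain over such tokens, the sketch below scores a binary "token present" feature against the table label. It is a textbook formulation under stated assumptions: the function names, the binary-presence feature, and the toy documents are all hypothetical and not taken from the project.

```ruby
# Shannon entropy (in bits) of a list of class counts.
def entropy(counts)
  total = counts.sum.to_f
  return 0.0 if total.zero?
  counts.reject(&:zero?).sum do |c|
    prob = c / total
    -prob * Math.log2(prob)
  end
end

# Information gain of a binary "token present" feature.
# docs: non-empty array of [tokens, label] pairs.
def information_gain(docs, token)
  labels = docs.map { |_, label| label }.uniq
  count_by_label = lambda do |subset|
    labels.map { |l| subset.count { |_, label| label == l } }
  end
  present, absent = docs.partition { |tokens, _| tokens.include?(token) }
  base = entropy(count_by_label.call(docs))
  remainder = [present, absent].sum do |subset|
    (subset.size.to_f / docs.size) * entropy(count_by_label.call(subset))
  end
  base - remainder
end

# Toy tables: tokens drawn from factor and option names (made up).
docs = [
  [%w[age marital status education], :sample],
  [%w[age stage grade], :sample],
  [%w[hazard ratio ci], :other],
  [%w[odds ratio age], :other]
]
gain_ratio = information_gain(docs, "ratio")
gain_age = information_gain(docs, "age")
```

In this toy set "ratio" separates the two classes perfectly, so it scores higher than "age", which appears in both.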

SLIDE 11

Classification

Test results:

○ 177 predicted positive
  ■ Random sample of 50
  ■ Precision: 85.71%
○ 300 predicted negative
  ■ Random sample of 50
  ■ Precision: 76.00%
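
The conclusion names a Naive Bayes classifier. A minimal sketch of one is given below, assuming a multinomial model with add-one smoothing; the slides do not specify the variant, smoothing, or features, so the class `NaiveBayes`, its methods, and the training tokens are illustrative assumptions only.

```ruby
# Minimal multinomial Naive Bayes with add-one smoothing (assumed variant).
class NaiveBayes
  def initialize
    @doc_counts = Hash.new(0) # label => number of training tables
    @token_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => token => count
    @vocab = {}
  end

  def train(tokens, label)
    @doc_counts[label] += 1
    tokens.each do |t|
      @token_counts[label][t] += 1
      @vocab[t] = true
    end
    self
  end

  # Log-probability score of each label, up to a shared constant.
  def scores(tokens)
    total_docs = @doc_counts.values.sum.to_f
    @doc_counts.each_with_object({}) do |(label, n), out|
      counts = @token_counts[label]
      denom = counts.values.sum + @vocab.size
      out[label] = Math.log(n / total_docs) +
                   tokens.sum { |t| Math.log((counts[t] + 1.0) / denom) }
    end
  end

  def classify(tokens)
    scores(tokens).max_by { |_, score| score }.first
  end
end

# Made-up training tokens, not the annotated dataset.
nb = NaiveBayes.new
nb.train(%w[age marital status education], :sample_characteristics)
nb.train(%w[age menopausal status stage], :sample_characteristics)
nb.train(%w[hazard ratio ci], :other)
nb.train(%w[odds ratio ci adjusted], :other)
prediction = nb.classify(%w[marital status age])
```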

SLIDE 12

Factor Alignment

Alignment approaches

○ Literal
○ Percentage-based
○ Name-inclusive

Evaluation

○ Choose 10 randomly, calculate precision
○ Report average precision
○ Has some drawbacks

SLIDE 13

Factor Alignment

○ Histology (N = 20)
  ■ Ductal, Lobular, Other
○ Morphological type
  ■ Ductal, Lobular, Other, Unknown
○ Histological type
  ■ Ductal, Lobular, Other, NA

○ Histological type
  ■ Ductal, Lobular, Ductulolobular, Medullary
○ Histology
  ■ Ductal, Lobular, Medullary
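
One possible reading of the literal and percentage-based approaches is sketched below, using the option lists from the first group above. This is an interpretation, not the author's method: the slides do not define the approaches, so treating "literal" as identical option sets and "percentage-based" as Jaccard overlap above a threshold (0.5 here, chosen arbitrarily) are assumptions, as are all function names. Name-inclusive alignment is left out because the slides give no definition to work from.

```ruby
require "set"

# Parse "Ductal,Lobular,Other" into a set of lowercase option names.
def option_set(text)
  text.split(",").map { |o| o.strip.downcase }.reject(&:empty?).to_set
end

# Literal (assumed meaning): the option sets are identical.
def literal_match?(a, b)
  a == b
end

# Share of options the two factors have in common (Jaccard overlap).
def overlap(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Percentage-based (assumed meaning): overlap meets a threshold.
def percentage_match?(a, b, threshold = 0.5)
  overlap(a, b) >= threshold
end

histology = option_set("Ductal,Lobular,Other")
morphological = option_set("Ductal,Lobular,Other,Unknown")
histological = option_set("Ductal,Lobular,Other,NA")

exact = literal_match?(histology, morphological)
shared = overlap(histology, morphological)
aligned = percentage_match?(morphological, histological)
```

Under this reading the three factors fail a literal comparison because of "Unknown" and "NA", but still align on overlap.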

SLIDE 14

Factor Alignment

Results

○ Literal: 0.9167
○ Percentage: 0.8624
○ Name-based: 0.9500

SLIDE 15

Conclusion

Naive Bayes classifier works well because data is independent

Simple methods of factor alignment are effective

Automated approaches can help resolve table structure and contents

Potential applications for faceted search