Tabular Data Extraction
Epidemiology Table Classification and Factor Alignment
Garrick Sherman
Last semester...
- Worked with Dr. Andrew Leakey
○ Plant biologist
○ The effects of carbon dioxide on photosynthesis
- Data is “locked away”
- Goals
○ Extract data from articles
○ Keep data associated with articles
○ Add structure to data
Last semester...
- Look for a set of search terms
- Parse HTML tables into CSV files
○ Also extract table captions
- Identify columns, “subtables,” and captions about the search terms
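A minimal sketch of the column-identification step, assuming a table has already been parsed into rows of strings with the header in the first row. The helper name `matching_columns`, the case-insensitive substring match, and the placeholder cell values are assumptions for illustration, not the original implementation.

```ruby
# Return the columns whose header mentions any search term.
# rows: array of arrays of strings; the first row is assumed to be the header.
def matching_columns(rows, search_terms)
  header, *body = rows
  return {} if header.nil?

  # Case-insensitive substring match (an assumed matching rule).
  patterns = search_terms.map { |term| /#{Regexp.escape(term)}/i }
  header.each_with_index.with_object({}) do |(cell, i), found|
    next unless patterns.any? { |pattern| pattern.match?(cell.to_s) }

    found[cell] = body.map { |row| row[i] }
  end
end

# Placeholder table; the cell values are made up.
rows = [
  ["Species", "Photosynthesis rate", "Notes"],
  ["s1", "p1", "n1"],
  ["s2", "p2", "n2"]
]
p matching_columns(rows, ["photosynthesis"])
```

The same matching rule could be applied to captions or to subtable labels; only the column case is shown here.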
Last semester...
- Column-based
○ 53 columns extracted
○ Recall: 0.1130 or 0.6279
○ Precision: 0.3774 or 0.5094
- Subtables
○ 23 extracted
○ Recall: 0.1356
○ Precision: 0.3158
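For reference, precision and recall figures like these are ratios of counts. The sketch below shows the arithmetic; the slides report only the ratios, so the counts used here (20 correct, 177 relevant) are hypothetical values chosen for illustration.

```ruby
# Precision: share of extracted items that are relevant.
def precision(correct, extracted)
  extracted.zero? ? 0.0 : correct.to_f / extracted
end

# Recall: share of relevant items that were extracted.
def recall(correct, relevant)
  relevant.zero? ? 0.0 : correct.to_f / relevant
end

extracted = 53   # columns extracted (from the slide)
correct   = 20   # hypothetical: not reported in the slides
relevant  = 177  # hypothetical: not reported in the slides

puts format("precision=%.4f recall=%.4f",
            precision(correct, extracted), recall(correct, relevant))
```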
This semester...
- Epidemiology journals
○ 11 high-impact breast cancer journals
■ e.g. British Medical Journal, Cancer, International Journal of Breast Cancer, etc.
- Classify each table by whether it contains summary sample characteristics
- Align factors
○ e.g. “Marital Status” and “Married”
{"Age at diagnosis (years):"=>
  {"<40"=>["9 (146)", "5 (12)"],
   "40-49"=>["26 (437)", "29 (71)"],
   "50-59"=>["37 (631)", "39 (94)"],
   "≥60"=>["29 (500)", "27 (66)"]},
 "Marital status:"=>
  {"Living with partner"=>["79 (180)"],
   "Living alone"=>["21 (48)"]},
 "Metropolitan classification:"=>
  {"Metropolitan area"=>["59 (1001)", "49 (119)"],
   "Non-metropolitan area"=>["41 (711)", "51 (124)"]}, ….
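The cells in the hash above are still raw strings. A small sketch of one way to split them, assuming each cell has the shape "percent (count)"; that reading is inferred from the values, not stated in the slides, and `parse_cell` is a hypothetical helper.

```ruby
# Matches cells such as "9 (146)"; assumed to mean "percent (count)".
CELL_PATTERN = /\A\s*(\d+(?:\.\d+)?)\s*\(\s*(\d+)\s*\)\s*\z/

# Returns a hash with :percent and :count, or nil if the cell has another shape.
def parse_cell(cell)
  match = CELL_PATTERN.match(cell)
  return nil unless match

  { percent: match[1].to_f, count: match[2].to_i }
end

factors = {
  "Marital status:" => {
    "Living with partner" => ["79 (180)"],
    "Living alone" => ["21 (48)"]
  }
}

# Parse every cell while keeping the factor => option nesting intact.
parsed = factors.transform_values do |options|
  options.transform_values { |cells| cells.map { |cell| parse_cell(cell) } }
end
p parsed
```

Cells that do not fit the pattern come back as `nil`, so they can be reviewed rather than silently misread.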
This semester...
- Goal:
○ Automated metadata extraction
○ Faceted search
■ Find studies of related populations
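One way faceted search could sit on top of the extracted factor hashes is an inverted index from (factor, option) pairs to studies. This is only a sketch: the study identifiers and both function names are hypothetical, and the slides do not describe an actual search implementation.

```ruby
require "set"

# Map each [factor, option] pair to the set of studies that report it.
def build_facet_index(studies)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  studies.each do |study_id, factors|
    factors.each do |factor, options|
      options.each_key { |option| index[[factor, option]] << study_id }
    end
  end
  index
end

# Studies that match every requested facet (set intersection).
def search(index, facets)
  facets.map { |facet| index.fetch(facet, Set.new) }
        .reduce { |a, b| a & b } || Set.new
end

# Hypothetical study ids; the hash shape follows the extracted tables.
studies = {
  "study_a" => { "Marital status:" => { "Living alone" => ["21 (48)"] } },
  "study_b" => { "Metropolitan classification:" =>
                 { "Metropolitan area" => ["59 (1001)", "49 (119)"] } }
}
index = build_facet_index(studies)
p search(index, [["Marital status:", "Living alone"]])
```

For this to find related populations across papers, the factor names would first need to be aligned, since different articles label the same factor differently.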
Dataset
- First table
○ ~1,500 first tables
○ Train: 1,001
○ Test: 497
- NXML format
- Fresh codebase
○ But same table parsing approaches
Training
- Manual annotation
○ Classify based on first 10 lines (or more, if needed) and caption
○ Final tally:
■ 41.36% sample characteristics
■ 58.64% other
- Would certainly be improved with domain expertise
Classification
- Information gain
○ Tokens from factor and options
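Information gain scores a token by how much knowing whether it is present reduces uncertainty about the class. A minimal sketch for the two-class case, assuming each table has been reduced to a list of tokens; the toy documents and labels below are invented for illustration.

```ruby
# Shannon entropy (in bits) of a list of class counts.
def entropy(counts)
  total = counts.sum.to_f
  return 0.0 if total.zero?

  counts.reject(&:zero?).sum do |count|
    prob = count / total
    -prob * Math.log2(prob)
  end
end

# docs: array of [tokens, label] pairs.
# Information gain of splitting the docs on presence of `token`.
def information_gain(docs, token)
  return 0.0 if docs.empty?

  labels = docs.map { |_, label| label }.uniq
  class_counts = lambda do |subset|
    labels.map { |l| subset.count { |_, label| label == l } }
  end
  with, without = docs.partition { |tokens, _| tokens.include?(token) }
  remaining = [with, without].sum do |subset|
    subset.size.to_f / docs.size * entropy(class_counts.call(subset))
  end
  entropy(class_counts.call(docs)) - remaining
end

# Toy training data (invented).
docs = [
  [%w[age marital status total], :sample_characteristics],
  [%w[age education total], :sample_characteristics],
  [%w[hazard ratio total], :other],
  [%w[odds ratio total], :other]
]
%w[age ratio total].each { |t| puts "#{t}: #{information_gain(docs, t)}" }
```

In this toy set, a token that appears in every table carries no information, while a token that appears in only one class separates the classes completely.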
Classification
- Test results:
○ 177 predicted positive
■ Random sample of 50
■ Precision: 85.71%
○ 300 predicted negative
■ Random sample of 50
■ Precision: 76.00%
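The conclusion names Naive Bayes as the classifier. The sketch below is a generic multinomial Naive Bayes with add-one smoothing over tokens, not the code used for these results; the class name, the smoothing choice, and the toy training data are all assumptions.

```ruby
# Minimal multinomial Naive Bayes over token lists (illustrative only).
class NaiveBayes
  def initialize
    @doc_counts = Hash.new(0)                                      # label => tables seen
    @token_counts = Hash.new { |hash, key| hash[key] = Hash.new(0) } # label => token => count
    @vocabulary = {}
  end

  def train(tokens, label)
    @doc_counts[label] += 1
    tokens.each do |token|
      @token_counts[label][token] += 1
      @vocabulary[token] = true
    end
  end

  # Returns the label with the highest log posterior (nil if untrained).
  def classify(tokens)
    total_docs = @doc_counts.values.sum.to_f
    @doc_counts.keys.max_by do |label|
      label_total = @token_counts[label].values.sum
      log_prior = Math.log(@doc_counts[label] / total_docs)
      # Add-one (Laplace) smoothing so unseen tokens do not zero out a class.
      log_likelihood = tokens.sum do |token|
        Math.log((@token_counts[label][token] + 1.0) / (label_total + @vocabulary.size))
      end
      log_prior + log_likelihood
    end
  end
end

# Toy training data (invented).
nb = NaiveBayes.new
nb.train(%w[age marital status], :sample_characteristics)
nb.train(%w[age education], :sample_characteristics)
nb.train(%w[hazard ratio], :other)
nb.train(%w[odds ratio], :other)
p nb.classify(%w[age status])
```

Working in log space avoids underflow when a table contributes many tokens.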
Factor Alignment
- Alignment approaches
○ Literal
○ Percentage-based
○ Name-inclusive
- Evaluation
○ Choose 10 randomly, calculate precision
○ Report average precision
○ Has some drawbacks
Factor Alignment
- #1
○ Histology (N = 20)
■ Ductal, Lobular, Other
○ Morphological type
■ Ductal, Lobular, Other, Unknown
○ Histological type
■ Ductal, Lobular, Other, NA
- #2
○ Histological type
■ Ductal, Lobular, Ductulolobular, Medullary
○ Histology
■ Ductal, Lobular, Medullary
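The slides name three alignment approaches without defining them. The sketch below shows one possible reading, applied to the factors from example #1: literal means identical option sets, percentage-based means option overlap above a threshold, and name-inclusive adds the factor-name tokens to the comparison. The Jaccard measure, the 0.5 threshold, and all function names are assumptions.

```ruby
require "set"

Factor = Struct.new(:name, :options)

# Normalized option set for a factor.
def option_set(factor)
  factor.options.map { |option| option.strip.downcase }.to_set
end

# Word tokens from the factor name, ignoring parentheticals like "(N = 20)".
def name_tokens(factor)
  factor.name.downcase.gsub(/\(.*?\)/, " ").scan(/[a-z]+/).to_set
end

# Jaccard overlap: shared items divided by all items.
def jaccard(a, b)
  union = a | b
  union.empty? ? 0.0 : (a & b).size.to_f / union.size
end

def literal_match?(a, b)
  option_set(a) == option_set(b)
end

def percentage_match?(a, b, threshold: 0.5)
  jaccard(option_set(a), option_set(b)) >= threshold
end

def name_inclusive_match?(a, b, threshold: 0.5)
  jaccard(option_set(a) | name_tokens(a), option_set(b) | name_tokens(b)) >= threshold
end

histology     = Factor.new("Histology (N = 20)", %w[Ductal Lobular Other])
morphological = Factor.new("Morphological type", %w[Ductal Lobular Other Unknown])

p literal_match?(histology, morphological)
p percentage_match?(histology, morphological)
p name_inclusive_match?(histology, morphological)
```

Under this reading, the literal rule is the strictest on options, while the name-inclusive rule can reject pairs whose options overlap but whose names share no words.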
Factor Alignment
- Results
○ Literal: 0.9167
○ Percentage: 0.8624
○ Name-based: 0.9500
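These figures come from the sampled evaluation described earlier: choose 10 at random, compute precision for each, and report the average. A sketch of that procedure under one interpretation, where each sampled item is a group of proposed alignments and a human judgment marks each alignment correct or not. The function name, the `judge` callable, and the toy groups are hypothetical.

```ruby
# Mean of per-group precision over a random sample of groups.
# groups: array of arrays of proposed alignments.
# judge:  callable returning true when an alignment is judged correct.
def mean_sampled_precision(groups, judge, sample_size: 10, rng: Random.new)
  sample = groups.sample(sample_size, random: rng)
  return 0.0 if sample.empty?

  per_group = sample.map do |group|
    correct = group.count { |alignment| judge.call(alignment) }
    group.empty? ? 0.0 : correct.to_f / group.size
  end
  per_group.sum / per_group.size
end

# Toy judgments (invented): true marks a correct alignment.
groups = [[true, true], [true, false]]
judge = ->(alignment) { alignment }
puts mean_sampled_precision(groups, judge, sample_size: 2)
```

Note that this is a mean of precisions, not the ranked "average precision" used in information retrieval, and with only 10 sampled groups the estimate can vary noticeably between runs.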
Conclusion
- Naive Bayes classifier works well, plausibly because the token features are close to independent
- Simple methods of factor alignment are effective
- Automated approaches can help resolve table structure and contents
- Potential applications for faceted search