Using Secondary Information to inform Evidence-Based Discovery

SLIDE 1

Using Secondary Information to inform Evidence-Based Discovery

Catherine (Cathy) Blake

Associate Director - Center for Informatics Research in Science and Scholarship (CIRSS)
Associate Professor - Graduate School of Library and Information Science, with courtesy appointments in Computer Science and Medical Information Science
University of Illinois at Urbana-Champaign

clblake@illinois.edu

SLIDE 2

Motivation

  • Relentless increase in electronic text

– Life Sciences

  • 22 million citations
  • 5,200 journals
  • 12,000 new articles each week

– Chemistry

  • > 110,000 articles in 1 year
  • Consequences

– Hundreds of thousands of relevant articles
– Implicit connections between literature go unnoticed

We need to shift from Retrieval to Synthesis

[Figure: MEDLINE articles per year, in thousands (axis 250 to 1,000), by year (1950 to 2010)]

SLIDE 3

Scientists as a User Population

                            Medical                               Public Health
Hypothesis projection       Reliability of palpatory procedures   Smoking and impotence
Analysis design             Qualitative                           Quantitative
My data collection methods  Prospective                           Retrospective
                            (Interviews, Observations)            (Interviews, Artifacts)

SLIDE 4

[Diagram labels: Iteration; Collaboration; Analysis; Extraction; Context Information; Hypothesis Projection; Retrieval; Corpus (MEDLINE, Embase); Verification; Facts]

Manual Synthesis

Steps: Select, Extract, Analyze, Verify

"Guesswork guided by scientifically trained intuition" (Rescher, 1978)

SLIDE 5

[Diagram labels: Iteration; Collaboration; Synthetic Estimate; Analysis; Extraction; Context Information; Hypothesis Projection; Retrieval; Corpus (MEDLINE, Embase); Verification; Facts]

Information Synthesis

SLIDE 6

Meta-Analysis vs. Information Synthesis

Systematic Review

[Diagram labels: External database; Entire study; Primary info.; Secondary information; Key; Information Synthesis]

  • Traditional analysis

– Same study design
– Medicine = RCT
– Epidemiology = cohort

  • Information Synthesis

– Any study that includes required information
– Use a synthetic estimate for missing information

SLIDE 7

Using a Synthetic Estimate

What are people with breast cancer exposed to? (Studies with breast cancer patients)

What are people in a similar population exposed to? (Database of risk factors: BRFSS)

Are these rates significantly different?

Facts for each study

  • number of patients
  • age of patients
  • geographic location
  • risk-factor exposure …

Codebook

  • question asked
  • age, gender
  • % responses
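
As a rough illustration of the third question on this slide (are the two exposure rates significantly different?), the sketch below applies a two-proportion z-test. The slides do not say which statistical test was used, so the choice of test, the function name `two_proportion_z_test`, and all counts are assumptions made for illustration only.

```python
import math


def two_proportion_z_test(exposed_a, total_a, exposed_b, total_b):
    """Return (z, two-sided p-value) for H0: both groups share one exposure rate.

    Group A stands in for patients reported in a study; group B stands in for
    a comparison population (for example, survey respondents). This is a
    textbook pooled z-test, not necessarily the method used in the talk.
    """
    rate_a = exposed_a / total_a
    rate_b = exposed_b / total_b
    # Pooled rate under the null hypothesis that the two rates are equal.
    pooled = (exposed_a + exposed_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 120 of 400 patients exposed vs. 2,500 of 10,000
# respondents in the comparison population.
z, p = two_proportion_z_test(120, 400, 2500, 10000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would suggest that the exposure rate among patients differs from the rate in the comparison population; with small samples an exact test may be more appropriate.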


SLIDE 8

SLIDE 9

METIS Information Extractor

Semantic grammar based on words, numbers, and semantic types in the Unified Medical Language System (UMLS)

Information extracted:

  • risk factor exposure (tobacco and alcohol)
  • gender
  • age (min, max, mean)
  • start and end dates
  • number of subjects with medical condition
  • geographical location

{term;'age'} {term;'of'} {number;10<n1<110} {term;'to'} {number;10<n2<110}

The age of breast cancer subjects ranged between 20 to 64 years old.

{semantic type: neoplastic process, or disease}
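
The age pattern above can be approximated with a regular expression. This is only a sketch: METIS uses a semantic grammar over UMLS semantic types, which a plain regex does not capture, and the names `AGE_RANGE` and `extract_age_range` are hypothetical. The bounds check mirrors the `10 < n < 110` constraint in the pattern.

```python
import re

# Rough regex stand-in for: 'age' 'of' ... <number> 'to' <number>.
# The UMLS semantic-type slot (e.g. a disease name) is matched loosely by ".*?".
AGE_RANGE = re.compile(
    r"\bage\s+of\b.*?\b(\d{1,3})\s+to\s+(\d{1,3})\b",
    re.IGNORECASE,
)


def extract_age_range(sentence):
    """Return (min_age, max_age) if the sentence states a plausible age range."""
    for match in AGE_RANGE.finditer(sentence):
        low, high = int(match.group(1)), int(match.group(2))
        # Keep only numbers inside the range allowed by the grammar rule.
        if 10 < low < 110 and 10 < high < 110:
            return low, high
    return None


sentence = "The age of breast cancer subjects ranged between 20 to 64 years old."
print(extract_age_range(sentence))  # → (20, 64)
```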

SLIDE 10

                         Recall  Prec.   Recall  Prec.
(1) Number of subjects   0.65    0.90    0.53    0.95
(2) Tobacco Use
    Table Rows           0.92    0.88    0.98    0.87
    Table Column         0.82    0.82    0.47    0.47
(3) Age
    Minimum              0.90    0.90    0.70    0.70
    Maximum              1.00    1.00    0.80    0.80
    Mean                 0.50    0.50    0.60    0.60
(4) Location             0.83    0.83    0.71    0.71
(5) Timeframe
    Start Year           0.90    0.90    0.70    0.70
    End Year             1.00    1.00    0.60    0.60
Average                  0.84    0.86    0.68    0.71

METIS Info Extractor
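
The Average row in the figures on this slide is consistent with an unweighted mean over the nine extracted fields. The sketch below checks that using the reported values, alongside the standard definitions of precision and recall. The variable names are hypothetical, and the two Recall/Prec. column pairs are numbered 1 and 2 only because the slide does not label them.

```python
from statistics import mean


def precision(true_pos, false_pos):
    """Fraction of extracted items that are correct."""
    return true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0


def recall(true_pos, false_neg):
    """Fraction of correct items that were extracted."""
    return true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0


# (field, recall_1, precision_1, recall_2, precision_2) as reported on the slide.
reported = [
    ("Number of subjects", 0.65, 0.90, 0.53, 0.95),
    ("Tobacco table rows", 0.92, 0.88, 0.98, 0.87),
    ("Tobacco table column", 0.82, 0.82, 0.47, 0.47),
    ("Age minimum", 0.90, 0.90, 0.70, 0.70),
    ("Age maximum", 1.00, 1.00, 0.80, 0.80),
    ("Age mean", 0.50, 0.50, 0.60, 0.60),
    ("Location", 0.83, 0.83, 0.71, 0.71),
    ("Start year", 0.90, 0.90, 0.70, 0.70),
    ("End year", 1.00, 1.00, 0.60, 0.60),
]

# Unweighted mean of each numeric column, rounded to two decimals.
averages = [round(mean(row[i] for row in reported), 2) for i in range(1, 5)]
print(averages)  # → [0.84, 0.86, 0.68, 0.71]
```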

SLIDE 11

Synthetic Estimate Evaluation

[Figure: Tobacco Consumption. Control rate (axis 0.2 to 1), Actual vs. Estimated, by article identifier (1 to 4 and Average)]

[Figure: Alcohol Consumption. Control rate (axis 0.2 to 1), Actual vs. Estimated, by article identifier (1 to 4 and Average)]

SLIDE 12

SLIDE 13

Findings thus far …

  • To what extent can information synthesis tasks be automated?

– METIS Info extractor: ~60-70% precision and recall
– Synthetic estimate is close to values in the traditional studies

  • How do effect-sizes compare with a traditional meta-analysis?

– Similar effect-sizes
– More work required to explore publication bias

  • Could this be used to detect risk factors sooner?

– Risk factors are reported as secondary information before primary information

  • How much effort would this save?

– Given full text: 31 years

SLIDE 14

Acknowledgements

  • Using Scientific Text to Identify Breast Cancer Risk-Factors

– California Breast Cancer Research Program

  • Towards Evidence-Based Discovery (NSF)

– This material is based upon work supported by the National Science Foundation under Grant No. (1115774). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

  • Sociotechnical Data Analytics (IMLS)

– This project is made possible by a grant from the U.S. Institute of Museum and Library Services (IMLS), Laura Bush 21st Century Librarian Program, Grant Number RE-05-12-0054-12.

  • Thanks to user groups, annotators and academic mentors

– Particularly to Dr. Adams, Dr. Tengs, Dr. Catherine Carpenter, Dr. Wanda Pratt, Nora Williams and Craig Evans

SLIDE 15

Questions and comments most welcome

Cathy Blake clblake@illinois.edu