
DIMACS Workshop Opening-Closing Comments, Stephen E. Fienberg (PowerPoint presentation)


  1. DIMACS Workshop Opening-Closing Comments
     Stephen E. Fienberg
     Department of Statistics & Center for Automated Learning and Discovery
     Carnegie Mellon University, Pittsburgh, PA, U.S.A.

  2. Some Integrative Themes
     • Integrating diverse data sources
     • Privacy/confidentiality
     • Data across time and space
     • Signal detection and setting cutoffs
     • Datamining to the rescue?
     • Models and methods of inference

  3. Integrating Diverse Data Sources
     • Public health data/non-traditional data
       – Grocery store sales
       – Pharmacy sales
       – School attendance records
     • Matching records/identifiers?
       – Fellegi–Sunter and modern Bayesian embellishments
       – Capture-recapture methods for estimating population totals of exposure and infection
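The capture-recapture idea mentioned on this slide can be sketched with the classical two-list (Lincoln-Petersen) estimator; the counts below and the Chapman bias-corrected variant are illustrative, not figures from the talk:

```python
# Illustrative two-list capture-recapture estimate of a population total.
# The list sizes and overlap below are hypothetical.

def lincoln_petersen(n1, n2, m):
    """Estimate total population size N from two overlapping lists.

    n1, n2 -- sizes of the two lists (e.g., hospital and pharmacy records)
    m      -- number of individuals appearing on both lists
    """
    if m == 0:
        raise ValueError("no overlap: estimator undefined")
    return n1 * n2 / m

def chapman(n1, n2, m):
    """Chapman's bias-corrected variant, often preferred for small samples."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

print(lincoln_petersen(200, 150, 30))        # -> 1000.0
print(round(chapman(200, 150, 30), 2))       # -> 978.06
```

With more than two lists (as in the query-140 example later in the deck), log-linear or Bayesian models replace this simple ratio, but the estimand is the same.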

  4. What Do the Following Populations Have in Common?
     • People in the U.S.
     • Fish
     • Penguins
     • People infected with HIV
     • Homeless
     • Prostitutes in Glasgow
     • Adolescent injuries in Pittsburgh, PA
     • Italians with diabetes
     • WWW
     • Atrocities in Kosovo

  5. Multiple List Data for Query 140 (n = 159)
     [Table: counts of web pages for query 140, cross-classified by presence/absence on six search-engine lists (Northern Light, Lycos, HotBot, Excite, Infoseek, AltaVista); the cell layout is not recoverable from this transcript.]

  6. Simple Models Often Work
     • Let the y_ij be independent random variables with p_ij = Pr{y_ij = 1} for page i observed in list j, where
         log{ p_ij / (1 − p_ij) } = θ_i + β_j,   i = 1, 2, ..., N;  j = 1, 2, ..., k.
     • If we take into account individual heterogeneity, represented by the {θ_i}, the samples are "independent."
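The slide's logit model can be sketched numerically; the θ and β values below are hypothetical, chosen only to show how p_ij is computed and how, given θ_i, the probability that page i is missed by every list factors as a product:

```python
import math

def capture_prob(theta_i, beta_j):
    """p_ij = Pr{y_ij = 1}, with log-odds theta_i + beta_j as on the slide."""
    return 1.0 / (1.0 + math.exp(-(theta_i + beta_j)))

# Hypothetical parameters: two pages (individual effects theta)
# and three lists/search engines (list effects beta).
thetas = [0.5, -1.0]
betas = [0.0, 0.8, -0.4]

for i, th in enumerate(thetas):
    probs = [round(capture_prob(th, b), 3) for b in betas]
    print("page", i, "capture probs:", probs)

# Conditional on theta_i, captures across lists are independent, so the
# probability that page i appears on no list is the product of (1 - p_ij).
p_missed = [
    math.prod(1 - capture_prob(th, b) for b in betas) for th in thetas
]
print("P(missed by all lists):", [round(p, 3) for p in p_missed])
```

Summing the missed-by-all probabilities over the population is what links this model to estimating the unobserved part of N.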

  7. Posterior Distribution of N for Query 140
     [Figure: posterior density of N (plotted for N from 0 to 2500) given n = 159 observed pages; the observed n, posterior median, and quartiles (Q1, Q3) are marked; GL* average = 165, GL* max = 322.]

  8. Privacy/Confidentiality
     • Matching records raises major issues of privacy and confidentiality
       – Can we integrate sources without identifiers?
       – Role of intermediaries for linkage, followed by application of disclosure limitation methods
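One way an intermediary might link records without the sources exchanging raw identifiers is keyed hashing of normalized identifiers. This is a minimal sketch under stated assumptions (the key, names, and normalization rule are invented); real privacy-preserving record linkage needs considerably more, e.g. error-tolerant encodings and formal disclosure limitation:

```python
import hmac
import hashlib

# Shared secret held by the trusted intermediary (hypothetical value).
LINKAGE_KEY = b"example-secret-key"

def pseudonymize(identifier: str) -> str:
    """Keyed hash of a normalized identifier, so sources never share raw IDs."""
    normalized = identifier.strip().lower()
    return hmac.new(LINKAGE_KEY, normalized.encode(), hashlib.sha256).hexdigest()

# Each source pseudonymizes locally; the intermediary matches on hashes only.
source_a = {pseudonymize(x): "pharmacy" for x in ["Alice Smith", "Bob Jones"]}
source_b = {pseudonymize(x): "school" for x in ["alice smith", "Carol Lee"]}
matches = source_a.keys() & source_b.keys()
print(len(matches))  # -> 1 ("Alice Smith" matches "alice smith" after normalization)
```

Exact keyed hashing only links identifiers that normalize identically, which is why Fellegi–Sunter-style probabilistic matching is harder to make privacy-preserving.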

  9. Conceptual Confidentiality Kernel
     [Diagram: data sources feed a data merger (record linkage) inside a confidentiality kernel, with confidentiality checks I and II; a "disclosure risk low?" decision routes output either to data users or to detection/warning.]

  10. Time and Space
      • Recording the timing of occurrence of events is a crucial component of the data
      • Data result in multivariate time series or point processes for events/purchases/reports
        – Multiple products purchased
        – Doctor visits
        – School absences
      • Spatial information makes data sparser
      • Crude counts versus individual records
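Turning timestamped event records into the multivariate daily count series the slide describes can be sketched with the standard library; the event stream below is hypothetical:

```python
from collections import Counter
from datetime import date

# Hypothetical event stream: (day, category) records, e.g. pharmacy sales.
events = [
    (date(2003, 2, 1), "analgesics"),
    (date(2003, 2, 1), "cough_cold"),
    (date(2003, 2, 1), "analgesics"),
    (date(2003, 2, 2), "cough_cold"),
    (date(2003, 2, 2), "cough_cold"),
]

# Multivariate daily time series: one count vector per category.
counts = Counter(events)
days = sorted({d for d, _ in events})
categories = sorted({c for _, c in events})
series = {c: [counts[(d, c)] for d in days] for c in categories}
print(series)  # -> {'analgesics': [2, 0], 'cough_cold': [1, 2]}
```

Adding a spatial key (store, ZIP code) to the tuple gives the sparser space-time cross-classification the slide warns about.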

  11. Supermarket Sales Records
      [Hierarchy: All Products (50,000 items) ... Dairy, Health & Beauty (2,050), Produce ...; within Health & Beauty: Analgesics (650), Cough & Cold (850), Stomach (550).]

  12. Confounding Natural Periodicities
      [Figure not reproduced in this transcript.]

  13. Signal Detection
      • Adverse events → discovery of cause
        – e.g., detecting the signature of an outbreak in response to an anthrax attack
        – What about alternative explanations?
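A common scheme for detecting an outbreak-like shift in a daily count series is a one-sided CUSUM; the baseline, reference value k, and threshold h below are illustrative assumptions, not parameters from the talk:

```python
def cusum_alarms(counts, baseline, k=0.5, h=4.0):
    """One-sided CUSUM: flag days where the cumulative excess of the
    counts over (baseline + k) crosses the threshold h."""
    s, alarms = 0.0, []
    for day, x in enumerate(counts):
        s = max(0.0, s + (x - baseline - k))
        if s > h:
            alarms.append(day)
            s = 0.0  # reset the statistic after an alarm
    return alarms

# Hypothetical daily counts with a spike on days 4-6.
daily = [3, 2, 4, 3, 9, 10, 11, 3, 2]
print(cusum_alarms(daily, baseline=3.0))  # -> [4, 5, 6]
```

Natural periodicities (slide 12) would first have to be removed from the baseline, or the detector will confound weekday and seasonal cycles with signal.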

  14. Setting Detection Cutoffs
      • Fixed thresholds?
      • Tradeoff between false positives and false negatives
      • Nature of follow-up?
        – Back to privacy issues again
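The false-positive/false-negative tradeoff can be made concrete by sweeping a cutoff over detector scores; the scores and ground-truth labels below are hypothetical:

```python
def error_rates(scores, labels, cutoff):
    """False-positive and false-negative rates at a fixed cutoff.

    labels: 1 = true outbreak day, 0 = normal day.
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    return fp / labels.count(0), fn / labels.count(1)

# Hypothetical detector scores and ground truth for seven days.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9]
labels = [0,   0,   1,    1,   1,   0,   1]
for cutoff in (0.3, 0.5, 0.75):
    print(cutoff, error_rates(scores, labels, cutoff))
```

Raising the cutoff trades false alarms for missed outbreaks; which error is costlier depends on the follow-up the slide asks about.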

  15. What Are We Looking For?
      • Anticipating specific problems, e.g., in response to a smallpox vaccination campaign
      • Surveillance systems to measure everything

  16. Datamining to the Rescue?
      • Bad news:
        – For broad-based screening and surveillance, p >> n and we encounter the curse of dimensionality
        – Model selection over large numbers of features has major problems
      • Good news:
        – For prediction we may be willing to settle for black-box (or at least gray-box) predictions
        – Datamining methods may turn out to be useful here, but the jury is still out

  17. Models and Inference Methods
      • Black-box approaches (including simple "robust" methods) versus models for the underlying phenomena
      • Frequentist vs. Bayesian methods
        – Specifying a likelihood is hard
        – Picking priors, based on real information or for smoothing, is relatively easy
      • First get statistical tools that work, then figure out how to move them into the field or to approximate them
