Implications of Big Data for Statistics Instruction, 17 Nov 2013 (PDF document)

  1. Teaching Introductory Business Statistics to Undergraduates in an Era of Big Data
     Mark L. Berenson, Montclair State University
     MSMESB Mini-Conference, DSI-Baltimore, November 17, 2013

     Implications of Big Data for Statistics Instruction
     "The integration of business, Big Data and statistics is both necessary and long overdue." Kaiser Fung (Significance, August 2013)

     Computer Scientists and Statisticians Must Coordinate to Accomplish a Common Goal: Making Reliable Decisions from the Available Data
     • Computer Scientist's Concern is Data Management
     • Statistician's Concern is Data Analysis
     • Computer Scientist's Interest is in Quantity of Data
     • Statistician's Interest is in Quality of Data
     • Computer Scientist's Decisions are Based on Frequency of Counts
     • Statistician's Decisions are Based on Magnitude of Effect
     Kaiser Fung (Significance, August 2013)

     Bigger n Doesn't Necessarily Mean Better Results
     • 128,053,180 was the USA population in 1936
     • 78,000,000 were Voting Age Eligible (61.0%)
     • 27,752,648 voted for Roosevelt (60.8%)
     • 16,681,862 voted for Landon (36.5%)
     • 10,000,000 received mailed surveys from Literary Digest
     • 2,300,000 responded to the mailed survey

     The Case for Inference in an Era of Big Data
     "The potential for randomized web testing is almost limitless." (A/B Testing) Ian Ayres, Super Crunchers, 2007

     Re-Engineering the Inference Topic in the Business Statistics Core-Required Course
     • Probability Sampling in Surveys and Randomization in Experiments
     – C.I.E. of the Population Mean
     – C.I.E. of the Population Proportion
     – Concept of Effect Size for Comparing Two Groups
     – C.I.E. of the Difference in Two Independent Group Means
     – C.I.E. of the Standardized Mean Difference Effect Size
     – C.I.E. of the Population Point Biserial Correlation Effect Size
     – C.I.E. of the Difference in Two Independent Group Proportions
     – Phi-Coefficient Measure of Association in 2x2 Tables
     – C.I.E. of the Population Odds Ratio Effect Size
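The outline above lists the C.I.E. of the difference in two independent group proportions, which is exactly the interval an A/B web test reports. The following minimal Python sketch is not part of the presentation: the conversion counts, traffic numbers and function name are hypothetical, and the interval uses the standard large-sample normal approximation.

from scipy.stats import norm

def diff_prop_ci(x_e, n_e, x_c, n_c, conf=0.95):
    """Large-sample C.I.E. of the difference in two independent group proportions."""
    p_e, p_c = x_e / n_e, x_c / n_c
    se = (p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c) ** 0.5
    z = norm.ppf(1 - (1 - conf) / 2)
    diff = p_e - p_c
    return diff, (diff - z * se, diff + z * se)

# Hypothetical A/B web test: conversions out of visitors for variant E and control C
diff, (lo, hi) = diff_prop_ci(x_e=5320, n_e=100_000, x_c=5000, n_c=100_000)
print(f"estimated lift = {diff:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")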

  2. The Case for Inference in an Era of Big Data
     "Any large organization that is not exploiting both regression and randomization is presumptively missing value. Especially in mature industries, where profit margins narrow, firms 'competing on analytics' will increasingly be driven to use both tools to stay ahead. … Randomization and regression are the twin pillars of Super Crunching." Ian Ayres, Super Crunchers, 2007

     The Case for Inference in an Era of Big Data
     "Testing is a road that never ends. Tastes change. What worked yesterday will not work tomorrow. A system of periodic retesting with randomized trials is a way to ensure that your marketing efforts remain optimized." Ian Ayres, Super Crunchers, 2007

     The Problem with Hypothesis Testing in an Era of Big Data
     • H. Jeffreys (1939) and D. V. Lindley (1957) point out that any observed trivial difference will become statistically significant if the sample sizes are large enough.

     The Case for Effect Size Measures to Replace Hypothesis Testing (NHST) in an Era of Big Data
     The use of NHST "has caused scientific research workers to pay undue attention to the results of the tests of significance that they perform on their data and too little attention on the magnitude of the effects they are investigating." Frank Yates (JASA, 1951)

     The Case for Effect Size Measures to Replace Hypothesis Testing (NHST) in an Era of Big Data
     "In many experiments, it seems obvious that the different treatments must produce some difference, however small, in effect. Thus the hypothesis that there is no difference is unrealistic. The real problem is to obtain estimates of the size of the differences." William G. Cochran and Gertrude M. Cox, Experimental Designs, 2nd Ed. (1957)

     The Case for Effect Size Measures to Replace Hypothesis Testing (NHST) in an Era of Big Data
     "Estimates of appropriate effect sizes and [their] confidence intervals are the minimum expectations for all APA journals." Publication Manual of the APA (2010)
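The Jeffreys/Lindley point quoted above can be demonstrated with a short simulation: given a large enough sample, a practically trivial true difference yields a vanishingly small p-value while the effect size stays negligible. This Python sketch is illustrative only; the sample size, means and seed are invented.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)
n = 2_000_000                                            # "Big Data" sample size per group
treatment = rng.normal(loc=100.1, scale=15.0, size=n)    # trivially higher true mean
control = rng.normal(loc=100.0, scale=15.0, size=n)

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
sigma_hat = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
d = (treatment.mean() - control.mean()) / sigma_hat      # standardized mean difference

print(f"p-value = {p_value:.1e}")   # far below 0.05, so "statistically significant"
print(f"effect size d = {d:.4f}")   # roughly 0.007, far below Cohen's .2 cut point for "small"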

  3. What an Effect Size Measures
     • When comparing differences in the means of two groups, the effect size quantifies the magnitude of that difference.
     • When studying the association between two variables, the effect size measures the strength of the relationship between them.

     Why Effect Size is Important
     Knowing the magnitude of an effect enables an assessment of the practical importance of the results.

     Early Researchers in the Study of Effect Size
     • Jacob Cohen (NYU)
     • Gene Glass (Johns Hopkins)
     • Larry Hedges (University of Chicago)
     • Ingram Olkin (Stanford)
     • Robert Rosenthal (Harvard)
     • Donald Rubin (Harvard)
     -------------------------------------------------
     • Ken Kelley (Notre Dame)

     Why the C.I.E. is Superior to the NHST
     • A confidence interval estimate is superior to a hypothesis test because it gives the same information and also provides a measure of precision.

     A C.I.E. is an Effect Size Measure
     • If common scales are being used to measure the outcome variables, a regular confidence interval estimate (of the unstandardized mean difference) provides a representation of the effect size.

     Necessity for Standardization
     • If unfamiliar scales are being used to measure the outcome variables, then in order to make comparisons with results from other, similar studies done using different scales, a transformation to standardized units will be more informative, and a confidence interval estimate of the standardized mean difference provides a representation of the effect size.
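To make the "common scales" slide concrete: when both groups are measured in the same familiar units, the ordinary confidence interval for the unstandardized mean difference is itself the effect size report. The sketch below uses a Welch-style t interval, consistent with the unequal-variances stance taken later in the deck; the spending data, sample sizes and function name are my own illustration, not from the slides.

import numpy as np
from scipy import stats

def mean_diff_ci(y_e, y_c, conf=0.95):
    """C.I.E. of (mu_E - mu_C) without assuming equal variances (Welch interval)."""
    y_e, y_c = np.asarray(y_e, float), np.asarray(y_c, float)
    diff = y_e.mean() - y_c.mean()
    v_e, v_c = y_e.var(ddof=1) / y_e.size, y_c.var(ddof=1) / y_c.size
    se = np.sqrt(v_e + v_c)
    # Welch-Satterthwaite degrees of freedom
    df = (v_e + v_c) ** 2 / (v_e ** 2 / (y_e.size - 1) + v_c ** 2 / (y_c.size - 1))
    t = stats.t.ppf(1 - (1 - conf) / 2, df)
    return diff, (diff - t * se, diff + t * se)

# Hypothetical weekly spending (dollars) for treatment E and control C
rng = np.random.default_rng(17)
spend_e = rng.normal(52.0, 12.0, size=400)
spend_c = rng.normal(50.0, 12.0, size=400)
diff, (lo, hi) = mean_diff_ci(spend_e, spend_c)
print(f"mean difference = ${diff:.2f}, 95% CI = (${lo:.2f}, ${hi:.2f})")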

  4. What Standardization Achieves
     • A standardized effect size removes the scale of the outcome variable from the effect estimate, producing a dimensionless standardized effect that can be compared across different but related outcome variables in other studies.

     Cohen's Effect Size Classifications
     • Cohen (1992) developed effect size cut points of .2, .5 and .8, respectively, for small, medium and large effects for standardized mean differences.
     • Cohen classified effect size cut points of .1, .3 and .5, respectively, for small, medium and large effects for correlations.
     • Cohen described a medium effect size as one in which the researcher can visually see the gains from treatment E above and beyond those of treatment C.

     C.I.E. for the Difference in Means of Two Independent Groups (assuming unequal variances and equal sample sizes)
     As the sample sizes increase, t_{\alpha/2} approaches Z_{\alpha/2}, so that for very large n_E and n_C an approximate (1 - \alpha)100% confidence interval estimate of the difference in the population means (\mu_E - \mu_C) is given by

         (\bar{Y}_E - \bar{Y}_C) \pm Z_{\alpha/2} \sqrt{\frac{S_E^2 + S_C^2}{n^*}}

     when the equal sample sizes n_E and n_C are represented by n^*.

     Bonett's Effect Size C.I.E. (assuming unequal variances and equal sample sizes)
     Bonett's (2008) approximate (1 - \alpha)100% confidence interval estimate of the population standardized mean difference effect size \delta is given by

         \hat{\delta} \pm Z_{\alpha/2} \sqrt{\widehat{Var}(\hat{\delta})}

     where the estimate of \delta is

         \hat{\delta} = \frac{\bar{Y}_E - \bar{Y}_C}{\hat{\sigma}}, \qquad \hat{\sigma} = \left(\frac{S_E^2 + S_C^2}{2}\right)^{1/2}

     and the variance of the statistic \hat{\delta} proposed by Bonett, when the equal sample sizes n_E and n_C are given by n^*, reduces to

         \widehat{Var}(\hat{\delta}) = \frac{\hat{\delta}^2 (S_E^4 + S_C^4)}{8 (n^* - 1) \hat{\sigma}^4} + \frac{2}{n^* - 1}

     Rosenthal's Effect Size Confidence Interval for the Point Biserial Correlation Coefficient
     When the sample sizes are equal, the point biserial correlation coefficient r_{pb} is obtained from

         r_{pb} = \frac{\bar{Y}_E - \bar{Y}_C}{2 S_Y}, \qquad \text{where } \bar{Y} = \frac{\bar{Y}_E + \bar{Y}_C}{2} \text{ and } S_Y = \left[\frac{1}{n_E + n_C} \sum_i \sum_j (Y_{ij} - \bar{Y})^2\right]^{1/2}

     Also, for very large sample sizes, r_{pb} and \hat{\delta} are related as follows:

         r_{pb} = \frac{\hat{\delta}}{(\hat{\delta}^2 + 4)^{1/2}} \qquad \text{and} \qquad \hat{\delta} = \frac{2 r_{pb}}{(1 - r_{pb}^2)^{1/2}}

     Rosenthal's Effect Size Confidence Interval for the Point Biserial Correlation Coefficient
     An approximate (1 - \alpha)100% confidence interval estimate of the population point biserial coefficient of correlation \rho_{pb} is obtained using the Fisher Z transformation, where

         Z_r = 0.5 \ln\left(\frac{1 + r_{pb}}{1 - r_{pb}}\right)

     and the standard error is

         S_{Z_r} = \left(\frac{1}{n_E + n_C - 3}\right)^{1/2}

     The confidence interval limits for Z_\rho = 0.5 \ln[(1 + \rho_{pb}) / (1 - \rho_{pb})] are obtained from

         Z_r \pm Z_{\alpha/2} S_{Z_r}

     and the approximate (1 - \alpha)100% confidence interval estimate of the population \rho_{pb} is obtained by taking the antilogs of the above lower and upper limits; the conversion for each limit is given by

         \rho_{pb} = \frac{e^{2Z} - 1}{e^{2Z} + 1}
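A minimal Python sketch of Bonett's (2008) equal-sample-size interval as written above; the function name, the 95% default and the simulated data are my own choices, not part of the presentation.

import numpy as np
from scipy.stats import norm

def bonett_smd_ci(y_e, y_c, conf=0.95):
    """C.I.E. of the standardized mean difference delta, equal n, unequal variances."""
    y_e, y_c = np.asarray(y_e, float), np.asarray(y_c, float)
    n_star = y_e.size                                  # equal sample sizes: n_E = n_C = n*
    s2_e, s2_c = y_e.var(ddof=1), y_c.var(ddof=1)      # S_E^2 and S_C^2
    sigma_hat = np.sqrt((s2_e + s2_c) / 2)
    delta_hat = (y_e.mean() - y_c.mean()) / sigma_hat
    # s2_e**2 and s2_c**2 are S_E^4 and S_C^4 in the slide's variance expression
    var_hat = (delta_hat**2 * (s2_e**2 + s2_c**2) / (8 * (n_star - 1) * sigma_hat**4)
               + 2 / (n_star - 1))
    z = norm.ppf(1 - (1 - conf) / 2)
    half_width = z * np.sqrt(var_hat)
    return delta_hat, (delta_hat - half_width, delta_hat + half_width)

# Hypothetical equal-n samples for treatment E and control C
rng = np.random.default_rng(2008)
y_e = rng.normal(105, 20, size=250)
y_c = rng.normal(100, 15, size=250)
d_hat, (lo, hi) = bonett_smd_ci(y_e, y_c)
print(f"delta_hat = {d_hat:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")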

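And a companion sketch of the Fisher Z interval for the point biserial correlation and the large-sample r_pb-to-delta conversion shown above, again assuming equal group sizes; the names and example data are hypothetical.

import numpy as np
from scipy.stats import norm

def fisher_z_to_r(z):
    """Invert the Fisher Z transformation: r = (e^{2Z} - 1) / (e^{2Z} + 1)."""
    return (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)

def point_biserial_ci(y_e, y_c, conf=0.95):
    """C.I.E. of rho_pb via Fisher Z, assuming equal group sizes as on the slide."""
    y_e, y_c = np.asarray(y_e, float), np.asarray(y_c, float)
    grand_mean = (y_e.mean() + y_c.mean()) / 2
    s_y = np.sqrt(np.mean((np.concatenate([y_e, y_c]) - grand_mean) ** 2))
    r_pb = (y_e.mean() - y_c.mean()) / (2 * s_y)
    z_r = 0.5 * np.log((1 + r_pb) / (1 - r_pb))          # Fisher Z of r_pb
    se = 1 / np.sqrt(y_e.size + y_c.size - 3)            # S_{Z_r}
    z_crit = norm.ppf(1 - (1 - conf) / 2)
    return r_pb, (fisher_z_to_r(z_r - z_crit * se), fisher_z_to_r(z_r + z_crit * se))

def r_to_delta(r_pb):
    """Large-sample conversion from r_pb to the standardized mean difference."""
    return 2 * r_pb / np.sqrt(1 - r_pb ** 2)

# Hypothetical equal-n samples for treatment E and control C
rng = np.random.default_rng(7)
y_e = rng.normal(105, 20, size=250)
y_c = rng.normal(100, 15, size=250)
r, (lo, hi) = point_biserial_ci(y_e, y_c)
print(f"r_pb = {r:.3f}, 95% CI = ({lo:.3f}, {hi:.3f}), implied delta = {r_to_delta(r):.3f}")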