modeling Dongmei Li Department of Public Health Sciences Office of - PowerPoint PPT Presentation

Hypothesis Test
A. Hypotheses. $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$ (two-sided) [or $H_a: \mu_1 > \mu_2$ (right-sided), or $H_a: \mu_1 < \mu_2$ (left-sided)]
B. Test statistic. $t_{\text{stat}} = \dfrac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}}$, where $SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$, with $df = df_{\text{Welch}}$
C. P-value. Convert the $t_{\text{stat}}$ to a P-value with a t table or software. Interpret.
D. Significance level (optional). Compare P to a pre-specified α level.
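A minimal Python sketch of step B, assuming only summary statistics are available. The input values are the group means, standard deviations, and sample sizes quoted on the pooled-variance slides of this deck. The Welch-Satterthwaite formula for the degrees of freedom is the standard one and is an assumption here, since the slide only names $df_{\text{Welch}}$; the function name `welch_t` is illustrative.

```python
import math

def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Unequal-variance (Welch) t statistic from summary statistics."""
    v1 = s1 ** 2 / n1  # squared standard error contributed by group 1
    v2 = s2 ** 2 / n2  # squared standard error contributed by group 2
    se = math.sqrt(v1 + v2)
    t_stat = (mean1 - mean2) / se
    # Welch-Satterthwaite approximation (assumed; not spelled out on the slide)
    df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, t_stat, df_welch

se, t_stat, df_welch = welch_t(245.05, 36.64, 20, 210.30, 48.34, 20)
print(f"SE = {se:.3f}, t = {t_stat:.2f}, Welch df = {df_welch:.1f}")
```

The Welch df is generally not a whole number; the conservative alternative is the smaller of $n_1 - 1$ and $n_2 - 1$.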

Hypothesis Test – Example
A. Hypotheses. $H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 \neq \mu_2$
B. Test statistic. In prior analyses we calculated a sample mean difference of 34.75 mg/dL, SE = 13.563, and $df_{\text{conserv}} = 19$.
$t_{\text{stat}} = \dfrac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}} = \dfrac{34.75}{13.563} = 2.56$ with 19 df
C. P-value. P = 0.019 → good evidence against $H_0$ ("significant difference").
D. Significance level (optional). The evidence against $H_0$ is significant at α = 0.05 but not at α = 0.01.
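Step C can be sketched without a statistics package by integrating the t density numerically. This is only an illustration of what "software" does here, not the method used in the slides; the helper names `t_pdf` and `two_sided_p` and the choice of Simpson's rule are assumptions.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided P-value: 1 minus the area between -|t| and +|t|."""
    b = abs(t_stat)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # the density is symmetric, so the central area is twice this value
    return 1.0 - 2.0 * area_0_to_t

t_stat = 34.75 / 13.563
p_value = two_sided_p(t_stat, 19)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```

The result should fall between 0.01 and 0.05, in line with step D.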

Equal Variance t Procedure
• Also called the pooled variance t procedure
• Not as robust as the prior method, but…
• Historically important
• Calculated by software programs
• Leads to advanced ANOVA techniques

Pooled variance procedure
We start by calculating this pooled estimate of variance:
$s^2_{\text{pooled}} = \dfrac{(df_1)(s_1^2) + (df_2)(s_2^2)}{df_1 + df_2}$
where $s_i^2$ is the variance in group $i$ and $df_i = n_i - 1$

• The pooled variance is used to calculate this standard error estimate: $SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{s^2_{\text{pooled}} \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}$
• Confidence interval: $(\bar{x}_1 - \bar{x}_2) \pm (t_{df,\,1-\alpha/2})(SE_{\bar{x}_1 - \bar{x}_2})$
• Test statistic: $t_{\text{stat}} = \dfrac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}}$
• All with $df = df_1 + df_2 = (n_1 - 1) + (n_2 - 1)$

Pooled Variance t Confidence Interval
Data:
Group   n_i   s_i     xbar_i
1       20    36.64   245.05
2       20    48.34   210.30
$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{1839.623 \left( \dfrac{1}{20} + \dfrac{1}{20} \right)} = 13.56$
$df = (20 - 1) + (20 - 1) = 38$
95% CI for $\mu_1 - \mu_2$: $(\bar{x}_1 - \bar{x}_2) \pm (t_{38,\,.975})(SE_{\bar{x}_1 - \bar{x}_2}) = (245.05 - 210.30) \pm (2.02)(13.56) = 34.75 \pm 27.39 = (7.36,\ 62.14)$
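The pooled calculation above can be reproduced from the summary table with a short Python sketch. The function name `pooled_t_interval` is illustrative, and the critical value 2.02 is taken from the slide rather than computed.

```python
import math

def pooled_t_interval(mean1, s1, n1, mean2, s2, n2, t_crit):
    """Pooled variance, SE, df, and confidence interval from summary statistics."""
    df1, df2 = n1 - 1, n2 - 1
    s2_pooled = (df1 * s1 ** 2 + df2 * s2 ** 2) / (df1 + df2)
    se = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    margin = t_crit * se
    return s2_pooled, se, df1 + df2, (diff - margin, diff + margin)

# t_crit = 2.02 is the tabled value t(38, .975) quoted on the slide
s2_pooled, se, df, ci = pooled_t_interval(
    245.05, 36.64, 20, 210.30, 48.34, 20, t_crit=2.02
)
print(f"pooled variance = {s2_pooled:.3f}, SE = {se:.2f}, df = {df}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the code carries the unrounded SE through to the interval, the limits can differ from the slide's hand calculation in the second decimal place.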

Pooled Variance t Test
Data:
Group   n_i   s_i     xbar_i
1       20    36.64   245.05
2       20    48.34   210.30
$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{1839.623 \left( \dfrac{1}{20} + \dfrac{1}{20} \right)} = 13.56$
$df = (20 - 1) + (20 - 1) = 38$
$H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 \neq \mu_2$
$t_{\text{stat}} = \dfrac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}} = \dfrac{34.75}{13.56} = 2.56$; $df = 38$
$P = 0.014$

How to do it in Excel?
• Data set: Presentation3_FCL.xlsx
• First, run Levene's test to check whether the two groups have equal variances

How to do it in Excel?
• Levene's test shows no significant difference in variance between the groups

How to do it in Excel?
• Next, use "t-Test: Two-Sample Assuming Equal Variances" for the test

How to do it in Excel?
• Click OK to get the results

Excel results
• p-value = 0.014
• Significant difference in fasting cholesterol levels between Type A personality subjects and Type B personality subjects.

How to do it in JMP?
• Data set: Presentation3_FCL.jmp
• Analyze --- Fit Y by X

How to do it in JMP?
• Select Means/ANOVA/Pooled t for the equal variance t test
• Select t Test for the unequal variance t test

Results from JMP
• p-value < 0.05
• Significant difference in fasting cholesterol levels between Type A personality subjects and Type B personality subjects.

Conditions for Inference
Conditions required for t procedures:
"Validity conditions"
a. Good information ("no information bias")
b. Good sample ("no selection bias")
c. "No confounding"
"Sampling conditions"
a. Independence
b. Normal sampling distribution

ANOVA

Illustrative Example: Data
Pets as moderators of a stress response. This chapter follows the analysis of data from a study in which heart rates (bpm) of participants were monitored after being exposed to a psychological stressor. Participants were randomized to one of three groups:
• Group 1 - monitored in the presence of a pet dog
• Group 2 - monitored in the presence of a human friend
• Group 3 - monitored with neither dog nor human friend present

Illustrative Example: Data

Descriptive Statistics
• Data are described and explored before moving to inferential calculations
• Here are summary statistics by group:

Side-by-Side Boxplots

Analysis of Variance
• One-way ANalysis Of VAriance (ANOVA)
  ◦ Categorical explanatory variable
  ◦ Quantitative response variable
  ◦ Test group means for a significant difference
• Statistical hypotheses
  ◦ $H_0: \mu_1 = \mu_2 = \dots = \mu_k$
  ◦ $H_a$: at least one of the $\mu_i$ differs
• Method: compare variability between groups to variability within groups (F statistic)

Analysis of Variance, cont.
R. A. Fisher (1890-1962)
The F in the F statistic stands for "Fisher"

Mean Square Between: Graphically

Mean Square Between: Example

Mean Square Within: Graphically

Mean Square Within: Example

The F statistic and ANOVA table
• Data are arranged to form an ANOVA table
• The F statistic is the ratio of the MSB to the MSW (a "signal-to-noise" ratio): $F_{\text{stat}} = \dfrac{MSB}{MSW} = \dfrac{1193.843}{84.793} = 14.08$
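The mean squares behind the F statistic can be sketched in a few lines of Python. The raw study data are not reproduced in this transcript, so the heart rates below are hypothetical values chosen only to make the arithmetic easy to follow; the function name `one_way_anova` is also illustrative. The last line checks the ratio quoted on the slide.

```python
def one_way_anova(groups):
    """Return (MSB, MSW, F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # between-group sum of squares: group means around the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # within-group sum of squares: observations around their own group mean
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_b, df_w = k - 1, n_total - k
    msb, msw = ssb / df_b, ssw / df_w
    return msb, msw, msb / msw, df_b, df_w

# Hypothetical heart rates (bpm), NOT the study data
groups = [[70, 74, 78], [85, 88, 91], [80, 83, 86]]
msb, msw, f_stat, df_b, df_w = one_way_anova(groups)
print(f"MSB = {msb:.3f}, MSW = {msw:.3f}, F = {f_stat:.2f}, df = ({df_b}, {df_w})")

# Ratio from the slide's ANOVA table
slide_f = 1193.843 / 84.793
print(f"slide F = {slide_f:.2f}")
```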

$F_{\text{stat}}$ and P-value
• The $F_{\text{stat}}$ has numerator and denominator degrees of freedom: $df_1$ and $df_2$, respectively (corresponding to $df_B$ and $df_W$)
• Convert the $F_{\text{stat}}$ to a P-value with a computer program
• The P-value corresponds to the area in the right tail beyond the $F_{\text{stat}}$

$F_{\text{stat}}$ and P-value
P < 0.001

How to do one-way ANOVA in Excel?
• Data set: Presentation3_pet.xlsx

How to do one-way ANOVA in Excel?

Analysis results from Excel
• One-way ANOVA shows a significant difference in mean heart rate among the three groups.

How to do one-way ANOVA in JMP?
• Data set: Presentation3_pet.jmp
• Analyze --- Fit Y by X

How to do one-way ANOVA in JMP?

Analysis results from JMP
• Pairwise comparisons using Tukey's method show significant differences among all three groups.

Correlation and simple linear regression

Data type for correlation and regression
• Quantitative response variable Y ("dependent variable")
• Quantitative explanatory variable X ("independent variable")
• Historically important public health data set used to illustrate techniques (Doll, 1955)
  ◦ n = 11 countries
  ◦ Explanatory variable = per capita cigarette consumption in 1930 (CIG1930)
  ◦ Response variable = lung cancer mortality per 100,000 (LUNGCA)

Data, cont.

Scatterplot
Bivariate $(x_i, y_i)$ points plotted as a scatterplot.

Inspect the scatterplot's:
• Form: Can the relation be described with a straight line or some other type of line?
• Direction: Do points tend to trend upward or downward?
• Strength of association: Do points adhere closely to an imaginary trend line?
• Outliers (if any): Are there any striking deviations from the overall pattern?

Judging Correlational Strength
• Correlational strength refers to the degree to which points adhere to a trend line
• The eye is not a good judge of strength.
• The top plot appears to show a weaker correlation than the bottom plot. However, these are plots of the same data set. (The perception of a difference is an artifact of axis scaling.)

Correlation
• Correlation coefficient r quantifies the linear relationship with a number between −1 and 1.
• When all points fall on a line with an upward slope, r = 1. When all data points fall on a line with a downward slope, r = −1.
• When data points trend upward, r is positive; when data points trend downward, r is negative.
• The closer r is to 1 or −1, the stronger the correlation.

Examples of correlations

Calculating r (Pearson Correlation)
• Formula: $r = \dfrac{1}{n - 1} \sum z_X z_Y$
• The correlation coefficient tracks the degree to which X and Y "go together."
• Recall that z scores quantify the amount a value lies above or below its mean in standard deviation units.
• When z scores for X and Y track in the same direction, their products are positive and r is positive (and vice versa).
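The z-score description above corresponds to the usual formula $r = \frac{1}{n-1}\sum z_X z_Y$. A minimal Python sketch follows; the five data points are hypothetical and are not the Doll (1955) values, and the function name `pearson_r` is illustrative.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation as the average product of z scores (n - 1 divisor)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}")

# Swapping the roles of X and Y gives the same r
r_swapped = pearson_r(ys, xs)
```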

Calculating r, Example

Scatter plot and r in Excel
• Data set: Presentation3_CIGLungCA.xlsx
• Scatter plot: select the data in columns B and C --- Insert --- Scatter Plot --- Add x and y axis labels
• Correlation: Data analysis --- correlation
[Figure: "Lung cancer mortality vs. cigarette consumption"; x-axis: CIG1930, y-axis: Lung cancer mortality]

Scatter plot and r in JMP
• Data set: Presentation3_CIGLungCA.jmp
• Scatter plot: Graph --- Scatter plot matrix

Interpretation of r
1. Direction. The sign of r indicates the direction of the association: positive (r > 0), negative (r < 0), or no association (r ≈ 0).
2. Strength. The closer r is to 1 or −1, the stronger the association.
3. Coefficient of determination. The square of the correlation coefficient ($r^2$) is called the coefficient of determination. This statistic quantifies the proportion of the variance in Y [mathematically] "explained" by X. For the illustrative data, r = 0.737 and $r^2$ = 0.54. Therefore, 54% of the variance in Y is explained by X.

Notes, cont.
4. Reversible relationship. With correlation, it does not matter whether variable X or Y is specified as the explanatory variable; calculations come out the same either way. [This will not be true for regression.]
5. Outliers. Outliers can have a profound effect on r. This figure has an r of 0.82 that is fully accounted for by the single outlier.

Notes, cont.
6. Linear relations only. Correlation applies only to linear relationships. This figure shows a strong non-linear relationship, yet r = 0.00.
7. Correlation does not necessarily mean causation. Beware lurking variables (next slide).

Confounded Correlation
A near-perfect negative correlation (r = −0.987) was seen between cholera mortality and elevation above sea level during a 19th-century epidemic. We now know that cholera is transmitted by water. The observed relationship between cholera and elevation was confounded by the lurking variable, proximity to polluted water.

Hypothesis Test
Random selection from a random scatter can result in an apparent correlation. We conduct the hypothesis test to guard against identifying too many random correlations.

Hypothesis Test
A. Hypotheses. Let ρ represent the population correlation coefficient. $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$ (two-sided) [or $H_a: \rho > 0$ (right-sided) or $H_a: \rho < 0$ (left-sided)]
B. Test statistic. $t_{\text{stat}} = \dfrac{r}{SE_r}$, where $SE_r = \sqrt{\dfrac{1 - r^2}{n - 2}}$ and $df = n - 2$
C. P-value. Convert the $t_{\text{stat}}$ to a P-value with software or a t table.

Hypothesis Test – Illustrative Example
A. $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$ (two-sided)
B. Test statistic. $SE_r = \sqrt{\dfrac{1 - 0.737^2}{11 - 2}} = 0.2253$; $t_{\text{stat}} = \dfrac{0.737}{0.2253} = 3.27$; $df = 11 - 2 = 9$
C. .005 < P < .01 by Table C. P = .0097 by computer. The evidence against $H_0$ is highly significant.
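The arithmetic in step B can be checked with a short Python sketch using the values from the illustrative example (r = 0.737, n = 11). The function name `correlation_t` is illustrative.

```python
import math

def correlation_t(r, n):
    """Standard error, t statistic, and df for testing H0: rho = 0."""
    se_r = math.sqrt((1 - r ** 2) / (n - 2))
    t_stat = r / se_r
    return se_r, t_stat, n - 2

se_r, t_stat, df = correlation_t(0.737, 11)
print(f"SE_r = {se_r:.4f}, t = {t_stat:.2f}, df = {df}")
```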

Exercise (True/False)
1. Correlation coefficient r quantifies the relationship between quantitative variables X and Y.
2. The closer r is to 1, the stronger the linear relation between X and Y.
3. If r is close to zero, X and Y are unrelated.
4. The value of r changes when the units of measure are changed.

Regression
• Regression describes the relationship in the data with a line that predicts the average change in Y per unit X.
• The best-fitting line is found by minimizing the sum of squared residuals, as shown in this figure.

Regression Line, cont.
• The regression line equation is $\hat{y} = a + bx$, where $\hat{y}$ ≡ predicted value of Y, a ≡ the intercept of the line, and b ≡ the slope of the line
• Equations to calculate a and b:
  ◦ SLOPE: $b = r \dfrac{s_Y}{s_X}$
  ◦ INTERCEPT: $a = \bar{y} - b\bar{x}$
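A minimal least-squares sketch in Python, assuming the standard formulas $b = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $a = \bar{y} - b\bar{x}$. The data are hypothetical, not the Doll (1955) values, and the function name `fit_line` is illustrative.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx          # slope: average change in Y per unit X
    a = mean_y - b * mean_x  # intercept: the line passes through (x-bar, y-bar)
    return a, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f} x")
```

Unlike correlation, the fit is not reversible: regressing X on Y generally gives a different line.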

Regression Line, cont.
Slope b is the key statistic produced by the regression

Calculate regression line in Excel
• Data analysis --- regression

Calculate regression line in JMP
• Analyze --- Fit Y by X

Conditions for Inference
Inference about the regression line requires these conditions:
• Linearity
• Independent observations
• Normality at each level of X
• Equal variance at each level of X

Conditions for Inference
This figure illustrates Normal and equal variation around the regression line at all levels of X

Assessing Conditions
• The scatterplot should be visually inspected for linearity, Normality, and equal variance
• Plotting the residuals from the model can be helpful in this regard.
• The table lists residuals for the illustrative data
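Residuals are the observed values minus the values predicted by the fitted line. The sketch below, on hypothetical data rather than the illustrative data set, computes them and confirms that least-squares residuals sum to zero; the helper name `residuals` is illustrative.

```python
def residuals(xs, ys):
    """Fit y-hat = a + b*x by least squares and return (a, b, residuals)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    # residual = observed y minus predicted y
    return a, b, [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, res = residuals(xs, ys)
for x, e in zip(xs, res):
    print(f"x = {x}: residual = {e:+.2f}")
```

Plotting these residuals against X is the usual way to look for curvature or unequal spread.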

Assessing Conditions, cont.
• A stemplot of the residuals shows no major departures from Normality:
|-1|6
|-0|2336
| 0|01366
| 1|4
x10
• This residual plot shows more variability at higher X values (but the data are very sparse)

Residual Plots
With a little experience, you can get good at reading residual plots. Here's an example of linearity with equal variance.

Quantitative response variable modeling
Dongmei Li, Department of Public Health Sciences, Office of Public Health Studies, University of Hawaii at Mānoa
Outline: T-test, ANOVA, Correlation and simple linear regression
