
slide-1
SLIDE 1

Statistical Analysis of Corpus Data with R

A short introduction to regression and linear models
Designed by Marco Baroni¹ and Stefan Evert²

¹ Center for Mind/Brain Sciences (CIMeC), University of Trento
² Institute of Cognitive Science (IKW), University of Osnabrück

Baroni & Evert (Trento/Osnabrück) SIGIL: Linear Models

slide-2
SLIDE 2

Outline

1. Regression
   Simple linear regression
   General linear regression

2. Linear statistical models
   A statistical model of linear regression
   Statistical inference

3. Generalised linear models

slide-4
SLIDE 4

Linear regression

Can a random variable Y be predicted from a random variable X?
☞ focus on the linear relationship between the variables

Linear predictor: Y ≈ β0 + β1·X
◮ β0 = intercept of the regression line
◮ β1 = slope of the regression line

Least-squares regression minimises the prediction error

  Q = ∑_{i=1}^n (yi − (β0 + β1·xi))²

for data points (x1, y1), …, (xn, yn)
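In R, a least-squares line is fitted with `lm()`. A minimal sketch on simulated data (the data, seed, and variable names are illustrative, not from the slides):

```r
# Simulated data: y depends linearly on x plus Gaussian noise
set.seed(42)
x <- runif(100, 0, 10)           # predictor values
y <- 2 + 0.5 * x + rnorm(100)    # true intercept 2, true slope 0.5

# lm() finds the beta0, beta1 that minimise Q = sum((y - (b0 + b1*x))^2)
model <- lm(y ~ x)
coef(model)                      # estimated intercept and slope

# the minimised prediction error Q at the fitted coefficients
Q <- sum(residuals(model)^2)
```

`coef(model)` returns the estimates of β0 and β1; the sum of squared residuals is the minimised Q.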

slide-6
SLIDE 6

Simple linear regression

Coefficients of the least-squares line:

  β̂1 = ( ∑_{i=1}^n xi·yi − n·x̄n·ȳn ) / ( ∑_{i=1}^n xi² − n·x̄n² )
  β̂0 = ȳn − β̂1·x̄n

(x̄n, ȳn denote the sample means of the xi and yi)

Mathematical derivation of the regression coefficients:
◮ the minimum of Q(β0, β1) satisfies ∂Q/∂β0 = ∂Q/∂β1 = 0
◮ this leads to the normal equations (a system of 2 linear equations):

  −2 ∑_{i=1}^n [yi − (β0 + β1·xi)] = 0    ⟹    β0·n + β1·∑_{i=1}^n xi = ∑_{i=1}^n yi
  −2 ∑_{i=1}^n xi·[yi − (β0 + β1·xi)] = 0    ⟹    β0·∑_{i=1}^n xi + β1·∑_{i=1}^n xi² = ∑_{i=1}^n xi·yi

◮ the regression coefficients β̂0, β̂1 are the unique solution of this system
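The closed-form coefficients can be computed directly and checked against `lm()`. A sketch on simulated data (illustrative values):

```r
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
n <- length(x)

# closed-form solution of the normal equations
b1 <- (sum(x * y) - n * mean(x) * mean(y)) / (sum(x^2) - n * mean(x)^2)
b0 <- mean(y) - b1 * mean(x)

# agrees with lm() up to floating-point error
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```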

slide-8
SLIDE 8

The Pearson correlation coefficient

Measuring the “goodness of fit” of the linear prediction:
◮ variation among the observed values of Y = sum of squares S²y
◮ closely related to the (sample estimate of the) variance of Y:

  S²y = ∑_{i=1}^n (yi − ȳn)²

◮ residual variation w.r.t. the linear prediction: S²resid = Q

Pearson correlation = amount of variation “explained” by X:

  R² = 1 − S²resid / S²y = 1 − [ ∑_{i=1}^n (yi − β0 − β1·xi)² ] / [ ∑_{i=1}^n (yi − ȳn)² ]

☞ correlation vs. slope of the regression line: R² = β̂1(y ∼ x) · β̂1(x ∼ y)
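Both identities can be verified numerically in R; `cor()` computes the Pearson correlation, so `cor(x, y)^2` should equal R². A sketch on simulated data (illustrative values):

```r
set.seed(7)
x <- rnorm(80)
y <- 3 - 1.5 * x + rnorm(80)
fit_yx <- lm(y ~ x)

# R^2 = 1 - S2_resid / S2_y
R2 <- 1 - sum(residuals(fit_yx)^2) / sum((y - mean(y))^2)

# equals the squared Pearson correlation ...
cor(x, y)^2
# ... and the product of the two regression slopes
coef(fit_yx)[["x"]] * coef(lm(x ~ y))[["y"]]
```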

slide-9
SLIDE 9

Multiple linear regression

Linear regression with multiple predictor variables:

  Y ≈ β0 + β1·X1 + · · · + βk·Xk

minimises

  Q = ∑_{i=1}^n (yi − (β0 + β1·xi1 + · · · + βk·xik))²

for data points (x11, …, x1k, y1), …, (xn1, …, xnk, yn)

Multiple linear regression fits a k-dimensional hyperplane instead of a regression line.

slide-10
SLIDE 10

Multiple linear regression: The design matrix

Matrix notation of the linear regression problem: y ≈ Zβ

“Design matrix” Z of the regression data:

  Z = [ 1  x11  x12  · · ·  x1k
        1  x21  x22  · · ·  x2k
        ⋮
        1  xn1  xn2  · · ·  xnk ]

  y = (y1, y2, …, yn)′        β = (β0, β1, β2, …, βk)′

☞ A′ denotes the transpose of a matrix; y and β are column vectors

slide-13
SLIDE 13

General linear regression

Matrix notation of the linear regression problem: y ≈ Zβ
Residual error: Q = (y − Zβ)′(y − Zβ)
System of normal equations satisfying ∇β Q = 0:

  Z′Zβ = Z′y

This leads to the regression coefficients:

  β̂ = (Z′Z)⁻¹ Z′y
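The matrix solution can be reproduced directly in R by building the design matrix and solving the normal equations; `lm()` gives the same coefficients. A sketch on simulated data (illustrative values):

```r
set.seed(3)
n <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# design matrix Z with a leading column of ones for the intercept
Z <- cbind(1, x1, x2)

# solve the normal equations Z'Z beta = Z'y
beta_hat <- solve(t(Z) %*% Z, t(Z) %*% y)

# same result as the formula interface
coef(lm(y ~ x1 + x2))
```

In practice `lm()` uses a numerically more stable QR decomposition rather than inverting Z′Z, but the fitted coefficients coincide.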

slide-14
SLIDE 14

General linear regression

Predictor variables can also be functions of the observed variables
➜ the regression only has to be linear in the coefficients β

E.g. polynomial regression with design matrix

  Z = [ 1  x1  x1²  · · ·  x1^k
        1  x2  x2²  · · ·  x2^k
        ⋮
        1  xn  xn²  · · ·  xn^k ]

corresponding to the regression model

  Y ≈ β0 + β1·X + β2·X² + · · · + βk·X^k
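In R, this design matrix corresponds to `poly(x, k, raw = TRUE)` inside a formula (with `raw = TRUE` giving the plain powers shown above; the default builds orthogonal polynomials instead). A sketch on simulated data (illustrative values):

```r
set.seed(5)
x <- seq(-2, 2, length.out = 100)
y <- 1 + x - 0.5 * x^2 + rnorm(100, sd = 0.3)

# cubic polynomial regression: Y ~ b0 + b1*X + b2*X^2 + b3*X^3
fit <- lm(y ~ poly(x, 3, raw = TRUE))
coef(fit)    # four coefficients: intercept plus one per power of x
```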

slide-17
SLIDE 17

Linear statistical models

Linear statistical model (ε = random error):

  Y = β0 + β1·x1 + · · · + βk·xk + ε        ε ∼ N(0, σ²)

☞ x1, …, xk are not treated as random variables!
◮ ∼ = “is distributed as”; N(µ, σ²) = normal distribution

Mathematical notation:

  Y | x1, …, xk ∼ N(β0 + β1·x1 + · · · + βk·xk, σ²)

Assumptions:
◮ the error terms εi are i.i.d. (independent, identically distributed)
◮ the error terms follow normal (Gaussian) distributions
◮ equal (but unknown) variance σ² = homoscedasticity

slide-19
SLIDE 19

Linear statistical models

Probability density function for the simple linear model:

  Pr(y | x) = 1 / (2πσ²)^{n/2} · exp( −1/(2σ²) · ∑_{i=1}^n (yi − β0 − β1·xi)² )

◮ y = (y1, …, yn) = observed values of Y (sample size n)
◮ x = (x1, …, xn) = observed values of X

The log-likelihood has a familiar form:

  log Pr(y | x) = C − 1/(2σ²) · ∑_{i=1}^n (yi − β0 − β1·xi)²  ∝  Q

➥ the MLE parameter estimates β̂0, β̂1 coincide with those of least-squares linear regression
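The equivalence of MLE and least squares can be illustrated numerically: maximising the log-likelihood over (β0, β1) is the same as minimising Q, so a general-purpose optimiser recovers the `lm()` coefficients. A sketch (illustrative data; `optim()` minimises, so we pass Q directly):

```r
set.seed(9)
x <- rnorm(40)
y <- 2 + 3 * x + rnorm(40)

# up to constants, the negative log-likelihood is Q / (2 sigma^2),
# so maximising the likelihood over (b0, b1) means minimising Q
Qfun <- function(b) sum((y - b[1] - b[2] * x)^2)
mle <- optim(c(0, 0), Qfun)

mle$par           # numerical optimum ...
coef(lm(y ~ x))   # ... agrees with the least-squares solution
```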

slide-22
SLIDE 22

Statistical inference for linear models

Model comparison with ANOVA techniques:
◮ Is the variance reduced significantly by taking a specific explanatory factor into account?
◮ intuitive: proportion of variance explained (like R²)
◮ mathematical: F statistic ➜ p-value

Parameter estimates β̂0, β̂1, … are random variables:
◮ t-tests (H0 : βj = 0) and confidence intervals for βj
◮ confidence intervals for new predictions

Categorical factors: dummy coding with binary variables
◮ e.g. a factor x with levels low, med, high is represented by three binary dummy variables xlow, xmed, xhigh
◮ one parameter for each factor level: βlow, βmed, βhigh
◮ NB: βlow is “absorbed” into the intercept β0; the model parameters are usually βmed − βlow and βhigh − βlow
☞ mathematical basis for standard ANOVA
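All three points map directly onto standard R functions: `summary()` reports the t-tests for each coefficient, `anova()` compares nested models with an F test, and factors are dummy-coded automatically with the first level absorbed into the intercept. A sketch on simulated data (illustrative values):

```r
set.seed(11)
f <- factor(rep(c("low", "med", "high"), each = 30),
            levels = c("low", "med", "high"))
y <- c(rnorm(30, mean = 1), rnorm(30, mean = 2), rnorm(30, mean = 4))

fit <- lm(y ~ f)
coef(fit)              # (Intercept) = beta_low; fmed, fhigh = differences from beta_low
summary(fit)           # t-tests for H0: beta_j = 0, with standard errors
anova(lm(y ~ 1), fit)  # F test: does the factor reduce the variance significantly?
```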

slide-23
SLIDE 23

Interaction terms

Standard linear models assume an independent, additive contribution from each predictor variable xj (j = 1, …, k).
Joint effects of variables can be modelled by adding interaction terms to the design matrix (+ parameters).

Interaction of numerical variables (interval scale):
◮ interaction term for variables xi and xj = product xi · xj
◮ e.g. in multivariate polynomial regression: Y = p(x1, …, xk) + ε with a polynomial p over k variables

Interaction of categorical factor variables (nominal scale):
◮ interaction of xi and xj coded by one dummy variable for each combination of a level of xi with a level of xj
◮ alternative codings exist, e.g. to have separate parameters for the independent additive effects of xi and xj

Interaction of a categorical factor with a numerical variable
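In R's formula notation, `x:g` denotes an interaction term and `x * g` expands to `x + g + x:g`. A sketch of a factor-by-numeric interaction, where the slope of x differs between two groups (illustrative data):

```r
set.seed(13)
n <- 120
x <- rnorm(n)
g <- factor(rep(c("a", "b"), length.out = n))

# slope of x differs between the groups => genuine interaction effect
y <- ifelse(g == "a", 1 + 2 * x, 1 - x) + rnorm(n, sd = 0.5)

fit_add <- lm(y ~ x + g)   # additive model
fit_int <- lm(y ~ x * g)   # adds the interaction term x:g
anova(fit_add, fit_int)    # model comparison: the interaction should matter here
coef(fit_int)              # x:gb = difference in slope for group b
```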

slide-24
SLIDE 24

Generalised linear models

Linear models are a flexible analysis tool, but they …
◮ only work for a numerical response variable (interval scale)
◮ assume independent (i.i.d.) Gaussian error terms
◮ assume equal variance of the errors (homoscedasticity)
◮ cannot limit the range of predicted values

Linguistic frequency data are problematic in all four respects:
☞ each data point yi = frequency fi in one text sample
◮ the fi are discrete variables with a binomial distribution (or a more complex distribution if there are non-randomness effects)
☞ a linear model uses relative frequencies pi = fi/ni
◮ the Gaussian approximation is not valid for small text size ni
◮ the sampling variance depends on the text size ni and the “success probability” πi (= the relative frequency predicted by the model)
◮ model predictions must be restricted to the range 0 ≤ pi ≤ 1

➥ Generalised linear models (GLM)

slide-28
SLIDE 28

Generalised linear model for corpus frequency data

Sampling family (binomial):

  fi ∼ B(ni, πi)

Link function (success probability πi ↔ log-odds θi):

  πi = 1 / (1 + e^{−θi})

Linear predictor:

  θi = β0 + β1·xi1 + · · · + βk·xik

➥ Estimation and ANOVA are based on likelihood ratios
☞ iterative methods are needed for parameter estimation
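This model is fitted in R with `glm(…, family = binomial)`, where the response for frequency data is a two-column matrix of successes and failures; estimation uses iteratively reweighted least squares, and `anova(…, test = "Chisq")` gives the likelihood-ratio tests. A sketch on simulated frequency data (sample sizes, seed, and effect sizes are illustrative):

```r
set.seed(17)
k <- 50
n <- sample(200:1000, k)            # text sizes n_i
x <- rnorm(k)                       # one explanatory variable
theta <- -2 + x                     # linear predictor (log-odds scale)
p <- 1 / (1 + exp(-theta))          # success probabilities via the logistic link
f <- rbinom(k, size = n, prob = p)  # observed frequencies f_i ~ B(n_i, p_i)

# binomial GLM: response = (successes, failures)
fit <- glm(cbind(f, n - f) ~ x, family = binomial)
coef(fit)                  # estimates of beta0, beta1 on the log-odds scale
anova(fit, test = "Chisq") # likelihood-ratio based analysis of deviance
```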