Probability and Statistics for Computer Science: Correlation is not Causation

Probability and Statistics for Computer Science
"Correlation is not Causation" but Correlation is so beautiful!
Credit: wikipedia
Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 9.1.2020

• Please use the "#" sign in your chat to indicate a formal question or comment.
• Please keep your mic muted for the sound quality of the Zoom.
• Please check out the websites and the Code & simulation Notebook in the chat.

Last time
• Location parameters: mean ($\mu$), median, mode
• Scale parameters: standard deviation ($\sigma$), variance ($\sigma^2$), interquartile range (iqr)
• Standardizing data: $x \rightarrow \hat{x}$

Objectives
• Median, interquartile range, box plot and outlier
• Scatter plots, correlation coefficient; heatmap, 3D bar, time series plots
• Visualizing & summarizing relationships

Median
• To organize the data we first sort it
• Then:
  if the number of items N is odd, median = middle item's value
  if the number of items N is even, median = mean of the middle 2 items' values
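
A minimal sketch of the sort-then-pick rule above. The function name and the sample values are illustrative, not from the lecture.

```python
def median(values):
    """Median by the slide's rule: sort, then take the middle item(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd N: the single middle item
        return ordered[mid]
    # even N: mean of the two middle items
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([7, 1, 5])       # sorted: [1, 5, 7]
even_result = median([7, 1, 5, 3])   # sorted: [1, 3, 5, 7]
```

The standard library's `statistics.median` follows the same rule and is the better choice in practice.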

Properties of Median
• Scaling data scales the median: $\mathrm{median}(\{k \cdot x_i\}) = k \cdot \mathrm{median}(\{x_i\})$
• Translating data translates the median: $\mathrm{median}(\{x_i + c\}) = \mathrm{median}(\{x_i\}) + c$
• (Handwritten note) $\mathrm{median} = \arg\min_{\mu} \sum_i |x_i - \mu|$
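
A quick numerical check of these properties on made-up data. It also checks the standard fact that the median minimizes the total absolute deviation, using a brute-force grid search (the grid and its step size are arbitrary choices for this sketch).

```python
from statistics import median

data = [2, 9, 4, 7, 30]   # illustrative values
k, c = -3, 10             # arbitrary scale and shift

# scaling and translating the data scales and translates the median
scaled_ok = median([k * x for x in data]) == k * median(data)
shifted_ok = median([x + c for x in data]) == median(data) + c


def total_abs_dev(mu):
    """Sum of absolute deviations of the data from a candidate center mu."""
    return sum(abs(x - mu) for x in data)


# brute-force search over candidate centers 0.0, 0.1, ..., 30.0
candidates = [i / 10 for i in range(0, 301)]
best_center = min(candidates, key=total_abs_dev)
```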

Percentile
• The k-th percentile is the value relative to which k% of the data items have smaller or equal numbers
• The median is roughly the 50th percentile
• Example: for {1, 2, 3, 4, 5, 6, 7, 12}, 75th percentile = ? (handwritten answer: 6)
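
One way to turn the definition above into code is a nearest-rank rule: return the smallest data value such that at least k% of the items are less than or equal to it. This is a sketch of that reading, with illustrative data; library routines that interpolate between items can return a different number for the same input.

```python
import math


def percentile(values, k):
    """Smallest data value v such that at least k% of the items are <= v."""
    if not values:
        raise ValueError("percentile of empty data is undefined")
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    ordered = sorted(values)
    n = len(ordered)
    # number of items that must be <= v
    rank = math.ceil(k * n / 100)
    return ordered[rank - 1]


data = [1, 2, 3, 4, 5, 6, 7, 12]
p75 = percentile(data, 75)   # 6 of the 8 items (75%) are <= this value
p50 = percentile(data, 50)
```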

Interquartile range
• iqr = (75th percentile) - (25th percentile)
• Scaling data scales the interquartile range: $\mathrm{iqr}(\{k \cdot x_i\}) = |k| \cdot \mathrm{iqr}(\{x_i\})$
• Translating data does NOT change the interquartile range: $\mathrm{iqr}(\{x_i + c\}) = \mathrm{iqr}(\{x_i\})$
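
A small check of the two iqr properties. Quartile conventions differ between tools; this sketch assumes the interpolating "inclusive" method of `statistics.quantiles`, so the quartile values themselves may not match a nearest-rank percentile. The data, scale, and shift are made up.

```python
import math
from statistics import quantiles


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 6, 7, 12]
k, c = -2, 100

base_iqr = iqr(data)
scaled_iqr = iqr([k * x for x in data])    # expected: |k| * base_iqr
shifted_iqr = iqr([x + c for x in data])   # expected: base_iqr
```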

Box plots
[Figure: box plots of vehicle death by region; axis label: DEATH]
• Boxplots
  - Simpler than histogram
  - Good for outliers
  - Easier to use for comparison
Data from https://www2.stetson.edu/~jrasp/data.htm

Boxplots details, outliers
• How to define outliers? (the default): points more than 1.5 iqr beyond the box
[Figure labels: Outlier (> 1.5 iqr), Whisker (< 1.5 iqr), Box, Interquartile Range (iqr), Median]
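
A sketch of the default 1.5 iqr rule, assuming the common convention that the fences are measured from the box edges (the first and third quartiles). The quartile method and the data are assumptions for illustration.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return items lying more than `whisker` * iqr beyond the box edges."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    return [v for v in values if v < low_fence or v > high_fence]


data = [1, 2, 3, 4, 5, 6, 7, 12]
outliers = boxplot_outliers(data)   # here q1 = 2.75, q3 = 6.25, fences at -2.5 and 11.5
```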

Q. TRUE or FALSE: the mean is more sensitive to outliers than the median
A. True (circled)
B. False

Q. TRUE or FALSE: the interquartile range is more sensitive to outliers than the std.
A. True
B. False (circled)

Sensitivity of summary statistics to outliers
• mean and standard deviation are very sensitive to outliers
• median and interquartile range are not sensitive to outliers
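
To see the difference concretely, the sketch below replaces one item of a made-up data set with an extreme value and compares the four summaries before and after. The quartile method is an assumption.

```python
from statistics import mean, median, pstdev, quantiles


def iqr(values):
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8]
dirty = clean[:-1] + [800]   # same data with one extreme item

stats = {"mean": mean, "std": pstdev, "median": median, "iqr": iqr}
# each entry maps a statistic's name to (value on clean data, value on dirty data)
summary = {name: (f(clean), f(dirty)) for name, f in stats.items()}

for name, (before, after) in summary.items():
    print(f"{name:>6}: {before:8.2f} -> {after:8.2f}")
```

On this data the median and iqr do not move at all, while the mean and standard deviation change by orders of magnitude.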

Modes
• Modes are peaks in a histogram
• If there is more than one mode, we should be curious as to why

Multiple modes
• We have seen the "iris" data, which looks to have several peaks
Data: "iris" in R

Example: bimodal distribution
• Modes may indicate multiple populations
Data: erythrocytes (red blood cells) in healthy humans. Piagnerelli, JCP 2007

Tails and Skews
[Figure: histograms illustrating tails and outliers; handwritten annotations not recoverable]
Credit: Prof. Forsyth

Q. How is this skewed?
A. Left
B. Right
Mean = ? (handwritten: 46), Median = 47

Looking at relationships in data
• Finding relationships between features in a data set or many data sets is one of the most important tasks in data analysis

Relationship between data features
• Example: does the weight of people relate to their height?
• x: HEIGHT, y: WEIGHT

Scatter plot
• Body Fat data set

Scatter plot
• Scatter plot with density

Scatter plot
• Outliers removed & standardized

Correlation
[Handwritten notes, partly illegible: "covariance", "ch 13"]

Correlation seen from scatter plots
[Figure: three scatter plots labeled zero correlation, positive correlation, negative correlation]
Credit: Prof. Forsyth

What kind of Correlation?
• Lines of code in a database and number of bugs
• Frequency of hand washing and number of germs on your hands
• GPA and hours spent playing video games
• Earnings and happiness
Credit: Prof. David Varodayan

Correlation doesn't mean causation
• Shoe size is correlated with reading skills, but that doesn't mean making feet grow will make a person read faster.

Correlation Coefficient
• Given a data set $\{(x_i, y_i)\}$ consisting of items $(x_1, y_1), \ldots, (x_N, y_N)$
• Standardize the coordinates of each feature:
$$\hat{x}_i = \frac{x_i - \mathrm{mean}(\{x_i\})}{\mathrm{std}(\{x_i\})}, \qquad \hat{y}_i = \frac{y_i - \mathrm{mean}(\{y_i\})}{\mathrm{std}(\{y_i\})}$$
• Define the correlation coefficient as:
$$\mathrm{corr}(\{(x_i, y_i)\}) = \frac{1}{N}\sum_{i=1}^{N} \hat{x}_i \hat{y}_i$$

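Correlation Coefficient
$$\hat{x}_i = \frac{x_i - \mathrm{mean}(\{x_i\})}{\mathrm{std}(\{x_i\})}, \qquad \hat{y}_i = \frac{y_i - \mathrm{mean}(\{y_i\})}{\mathrm{std}(\{y_i\})}$$
$$\mathrm{corr}(\{(x_i, y_i)\}) = \frac{1}{N}\sum_{i=1}^{N} \hat{x}_i \hat{y}_i = \mathrm{mean}(\{\hat{x}_i \hat{y}_i\})$$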
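
The definition translates almost line for line into code. This sketch assumes the population standard deviation (dividing by N), which matches the 1/N in the formula; the function names and sample data are illustrative.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Map each item to standard coordinates: subtract the mean, divide by the std."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("standard deviation is zero; cannot standardize")
    return [(v - m) / s for v in values]


def corr(xs, ys):
    """Correlation coefficient: mean of the products of standardized coordinates."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_hat, y_hat = standardize(xs), standardize(ys)
    return mean([a * b for a, b in zip(x_hat, y_hat)])


xs = [1, 2, 3, 4, 5]
r_up = corr(xs, [2, 4, 6, 8, 10])     # y increases linearly with x
r_down = corr(xs, [10, 8, 6, 4, 2])   # y decreases linearly with x
```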

Q: Correlation Coefficient
• Which of the following describe(s) the correlation coefficient correctly?
A. It's unitless
B. It's defined in standard coordinates
C. Both A & B (circled)
$$\mathrm{corr}(\{(x_i, y_i)\}) = \frac{1}{N}\sum_{i=1}^{N} \hat{x}_i \hat{y}_i$$

A visualization of correlation coefficient
https://rpsychologist.com/d3/correlation/
In a data set $\{(x_i, y_i)\}$ consisting of items $(x_1, y_1), \ldots, (x_N, y_N)$:
$\mathrm{corr}(\{(x_i, y_i)\}) > 0$ shows positive correlation
$\mathrm{corr}(\{(x_i, y_i)\}) < 0$ shows negative correlation
$\mathrm{corr}(\{(x_i, y_i)\}) = 0$ shows no correlation

The Properties of Correlation Coefficient
• The correlation coefficient is symmetric: $\mathrm{corr}(\{(x_i, y_i)\}) = \mathrm{corr}(\{(y_i, x_i)\})$
• Translating the data does NOT change the correlation coefficient

The Properties of Correlation Coefficient
• Scaling the data may change the sign of the correlation coefficient:
$$\mathrm{corr}(\{(a x_i + b,\; c y_i + d)\}) = \mathrm{sign}(ac)\,\mathrm{corr}(\{(x_i, y_i)\})$$
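
A numerical check of symmetry, translation invariance, and the sign rule under scaling. The data and the constants a, b, c, d are arbitrary values chosen for this sketch.

```python
import math
from statistics import mean, pstdev


def corr(xs, ys):
    """Correlation coefficient computed in standard coordinates."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])


xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [3.0, 1.0, 6.0, 5.0, 9.0]
r = corr(xs, ys)

r_swapped = corr(ys, xs)                                       # symmetry
r_shifted = corr([x + 100 for x in xs], [y - 50 for y in ys])  # translation

a, b, c, d = -2.0, 3.0, 5.0, -1.0                              # a * c < 0
r_affine = corr([a * x + b for x in xs], [c * y + d for y in ys])
sign_ac = math.copysign(1.0, a * c)
```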

The Properties of Correlation Coefficient
• The correlation coefficient is bounded within [-1, 1]
$\mathrm{corr}(\{(x_i, y_i)\}) = 1$ if and only if $\hat{x}_i = \hat{y}_i$
$\mathrm{corr}(\{(x_i, y_i)\}) = -1$ if and only if $\hat{x}_i = -\hat{y}_i$

Q. Which of the following has correlation coefficient equal to 1?
[Figure: three scatter plots (left, middle, right)]
A. Left and right
B. Left
C. Middle

Concept of Correlation Coefficient's bound
• The correlation coefficient can be written as
$$\mathrm{corr}(\{(x_i, y_i)\}) = \frac{1}{N}\sum_{i=1}^{N} \hat{x}_i \hat{y}_i = \sum_{i=1}^{N} \frac{\hat{x}_i}{\sqrt{N}} \frac{\hat{y}_i}{\sqrt{N}}$$
• It's the inner product of two vectors
$$\left[\frac{\hat{x}_1}{\sqrt{N}}, \ldots, \frac{\hat{x}_N}{\sqrt{N}}\right] \quad \text{and} \quad \left[\frac{\hat{y}_1}{\sqrt{N}}, \ldots, \frac{\hat{y}_N}{\sqrt{N}}\right]$$

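Inner product
• Inner product's geometric meaning: $\nu_1 \cdot \nu_2 = |\nu_1|\,|\nu_2| \cos(\theta)$, where $\theta$ is the angle between $\nu_1$ and $\nu_2$
• Lengths of both vectors
$$\nu_1 = \left[\frac{\hat{x}_1}{\sqrt{N}}, \ldots, \frac{\hat{x}_N}{\sqrt{N}}\right], \qquad \nu_2 = \left[\frac{\hat{y}_1}{\sqrt{N}}, \ldots, \frac{\hat{y}_N}{\sqrt{N}}\right]$$
are 1, because $\frac{1}{N}\sum_i \hat{x}_i^2 = 1$ for standardized data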

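Bound of correlation coefficient
$$|\mathrm{corr}(\{(x_i, y_i)\})| = |\cos(\theta)| \le 1$$
$$\nu_1 = \left[\frac{\hat{x}_1}{\sqrt{N}}, \ldots, \frac{\hat{x}_N}{\sqrt{N}}\right], \qquad \nu_2 = \left[\frac{\hat{y}_1}{\sqrt{N}}, \ldots, \frac{\hat{y}_N}{\sqrt{N}}\right]$$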
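
The geometric argument can be checked numerically: build the two vectors, confirm each has length 1, and read the correlation off as their inner product. The sample data are made up, and the population standard deviation (1/N) is assumed so that the lengths come out to exactly 1.

```python
import math
from statistics import mean, pstdev


def unit_vector(values):
    """Standardized coordinates divided by sqrt(N)."""
    m, s, n = mean(values), pstdev(values), len(values)
    return [(v - m) / (s * math.sqrt(n)) for v in values]


xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [3.0, 1.0, 6.0, 5.0, 9.0]

v1, v2 = unit_vector(xs), unit_vector(ys)
length1 = math.sqrt(sum(a * a for a in v1))
length2 = math.sqrt(sum(b * b for b in v2))

# inner product of two unit vectors = cos(theta) = correlation coefficient
r = sum(a * b for a, b in zip(v1, v2))
# clamp before acos to guard against rounding slightly outside [-1, 1]
theta = math.acos(max(-1.0, min(1.0, r)))
```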

The Properties of Correlation Coefficient
• Symmetric
• Translation invariant
• Scaling may only change the sign
• Bounded within [-1, 1]

Using correlation to predict
• Caution! Correlation is NOT Causation
Credit: Tyler Vigen

How do we go about the prediction?
• Outliers removed & standardized

Using correlation to predict
• Given a correlated data set $\{(x_i, y_i)\}$, we can predict a value $y_0^{p}$ that goes with a value $x_0$
• In standard coordinates $\{(\hat{x}_i, \hat{y}_i)\}$, we can predict a value $\hat{y}_0^{p}$ that goes with a value $\hat{x}_0$

Q: Which coordinates will you use for the predictor using correlation?
A. Standard coordinates
B. Original coordinates
C. Either

Linear predictor and its error
• We will assume that our predictor is linear: $\hat{y}^{p} = a\,\hat{x} + b$
• We denote the prediction at each $\hat{x}_i$ in the data set as $\hat{y}_i^{p}$: $\hat{y}_i^{p} = a\,\hat{x}_i + b$
• The error in the prediction is denoted $u_i$: $u_i = \hat{y}_i - \hat{y}_i^{p} = \hat{y}_i - a\,\hat{x}_i - b$

Require the mean of error to be zero
We would try to make the mean of the error equal to zero, so that it is also centered around 0 like the standardized data:
$$\mathrm{mean}(\{u_i\}) = \mathrm{mean}(\{\hat{y}_i - a\,\hat{x}_i - b\}) = \mathrm{mean}(\{\hat{y}_i\}) - a\,\mathrm{mean}(\{\hat{x}_i\}) - b = -b = 0 \;\Rightarrow\; b = 0$$

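Require the variance of error to be minimal
Minimize, with $b = 0$ and therefore $\mathrm{mean}(\{u_i\}) = 0$:
$$\mathrm{var}(\{u_i\}) = \mathrm{mean}(\{(u_i - \mathrm{mean}(\{u_i\}))^2\}) = \mathrm{mean}(\{u_i^2\}) = \mathrm{mean}(\{(\hat{y}_i - a\,\hat{x}_i)^2\})$$
$$= \mathrm{mean}(\{\hat{y}_i^2\}) - 2a\,\mathrm{mean}(\{\hat{x}_i \hat{y}_i\}) + a^2\,\mathrm{mean}(\{\hat{x}_i^2\}) = 1 - 2ar + a^2$$
where $r = \mathrm{corr}(\{(x_i, y_i)\})$. Setting the derivative to zero:
$$\frac{d\,\mathrm{var}(\{u_i\})}{da} = -2r + 2a = 0 \;\Rightarrow\; a = r$$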
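
As a sanity check on this step, the sketch below computes the error variance for a range of slopes a (with b = 0) on made-up data, compares it with the closed form 1 - 2ar + a^2, and confirms by grid search that the minimum sits at a = r. The grid and its step size are arbitrary choices.

```python
import math
from statistics import mean, pstdev, pvariance


def standardize(values):
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [3.0, 1.0, 6.0, 5.0, 9.0]
x_hat, y_hat = standardize(xs), standardize(ys)
r = mean([a * b for a, b in zip(x_hat, y_hat)])


def error_variance(a):
    """Variance of the errors u_i = y_hat_i - a * x_hat_i (intercept b = 0)."""
    return pvariance([y - a * x for x, y in zip(x_hat, y_hat)])


# grid search over slopes -1.000, -0.999, ..., 1.000
slopes = [i / 1000 for i in range(-1000, 1001)]
best_slope = min(slopes, key=error_variance)
```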

Here is the linear predictor!
$$\hat{y}^{p} = a\,\hat{x} + b, \qquad a = r, \quad b = 0 \quad\Rightarrow\quad \hat{y}^{p} = r\,\hat{x}$$
where $r$ is the correlation coefficient

Prediction Formula
• In standard coordinates:
$$\hat{y}_0^{p} = r\,\hat{x}_0 \quad \text{where } r = \mathrm{corr}(\{(x_i, y_i)\})$$
• In original coordinates:
$$\frac{y_0^{p} - \mathrm{mean}(\{y_i\})}{\mathrm{std}(\{y_i\})} = r\,\frac{x_0 - \mathrm{mean}(\{x_i\})}{\mathrm{std}(\{x_i\})}$$
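
A minimal sketch of the prediction formula in original coordinates: standardize x0, multiply by r, then map back to the scale of y. The function name and the data are hypothetical, and the population standard deviation (1/N) is assumed.

```python
import math
from statistics import mean, pstdev


def predict(xs, ys, x0):
    """Predict y for a new x0 using the correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    r = mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])
    x0_hat = (x0 - mx) / sx      # x0 in standard coordinates
    y0_hat = r * x0_hat          # prediction in standard coordinates
    return my + sy * y0_hat      # back to original coordinates


# perfectly linear illustrative data: y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
y_at_10 = predict(xs, ys, 10.0)
y_at_mean = predict(xs, ys, mean(xs))
```

For perfectly linear data r is 1 and the predictor reproduces the line; in general the prediction at the mean of x is always the mean of y.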
