Probability and Statistics for Computer Science



slide-1
SLIDE 1


Probability and Statistics for Computer Science

“The statement that ‘The average US family has 2.6 children’ invites mockery”

  • Prof. Forsyth reminds us about critical thinking

Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 8.27.2020
Credit: Wikipedia

slide-2
SLIDE 2

Last lecture

✺ Welcome/Orientation
✺ Big picture of the contents
✺ Lecture 1 - Data Visualization & Summary (I)
✺ Some feedback

slide-3
SLIDE 3

Warm up question:

✺ What kind of data is a letter grade?
✺ What do you usually ask about the stats of an exam with numerical scores?

slide-4
SLIDE 4

Objectives

✺ Grasp Summary Statistics
✺ Learn more Data Visualization for Relationships

slide-5
SLIDE 5

Summarizing 1D continuous data

For a data set $\{x\}$, also written $\{x_i\}$, we summarize with:

✺ Location parameters
✺ Scale parameters

slide-6
SLIDE 6

Summarizing 1D continuous data

✺ Mean

$$\operatorname{mean}(\{x_i\}) = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Geometrically, the mean is the centroid of the data: if every data item is a unit weight placed at its value, the mean is the point where the data set balances.

slide-7
SLIDE 7

Properties of the mean

✺ Scaling data scales the mean
✺ Translating the data translates the mean

$$\operatorname{mean}(\{k \cdot x_i\}) = k \cdot \operatorname{mean}(\{x_i\})$$

$$\operatorname{mean}(\{x_i + c\}) = \operatorname{mean}(\{x_i\}) + c$$
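The definition of the mean and its two properties can be checked numerically. The sketch below is only an illustration: the data values, the scale `k`, and the shift `c` are made up for the example and do not come from the lecture.

```python
import math

def mean(xs):
    """Arithmetic mean: the sum of the items divided by their count."""
    return sum(xs) / len(xs)

data = [2.0, 4.0, 4.0, 6.0, 9.0]   # made-up example values
k, c = 3.0, 1.5                    # arbitrary scale factor and shift

m = mean(data)
scaled_mean = mean([k * x for x in data])    # mean of {k * x_i}
shifted_mean = mean([x + c for x in data])   # mean of {x_i + c}

print(m)                                     # 5.0
print(math.isclose(scaled_mean, k * m))      # scaling the data scales the mean
print(math.isclose(shifted_mean, m + c))     # translating the data translates the mean
```

`math.isclose` is used instead of `==` because floating-point sums can differ in the last digits.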

slide-8
SLIDE 8

Less obvious properties of the mean

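✺ The signed distances from the mean sum to 0

$$\sum_{i=1}^{N} \left(x_i - \operatorname{mean}(\{x_i\})\right) = 0$$

✺ Among all real values $\mu$, the mean is the one that minimizes the sum of the squared distances to the data items

$$\underset{\mu}{\operatorname{argmin}} \sum_{i=1}^{N} (x_i - \mu)^2 = \operatorname{mean}(\{x_i\})$$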
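Both properties can be seen on a small example. The sketch below uses made-up data and a brute-force scan over a grid of candidate values of μ, which is an illustration rather than a proof. The grid range and step are arbitrary assumptions.

```python
def mean(xs):
    return sum(xs) / len(xs)

def sum_sq(xs, mu):
    """Sum of squared distances from the candidate value mu to each item."""
    return sum((x - mu) ** 2 for x in xs)

data = [1.0, 3.0, 4.0, 8.0]   # made-up example values
m = mean(data)                # 4.0

# Property 1: the signed distances from the mean sum to 0.
signed_total = sum(x - m for x in data)

# Property 2: scan candidates 0.00, 0.01, ..., 10.00 and keep the one
# with the smallest sum of squared distances.
candidates = [i / 100 for i in range(0, 1001)]
best = min(candidates, key=lambda mu: sum_sq(data, mu))

print(signed_total)   # 0.0
print(best, m)        # the best candidate coincides with the mean
```

The same result follows analytically: setting the derivative of the sum of squares with respect to μ to zero gives exactly the first property.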

slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11

Q1:

✺ What is $\operatorname{mean}(\operatorname{mean}(\{x_i\}))$?

  • A. $\operatorname{mean}(\{x_i\})$
  • B. unsure
  • C. 0
slide-12
SLIDE 12

Standard Deviation (σ)

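✺ The standard deviation

$$\operatorname{std}(\{x_i\}) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(x_i - \operatorname{mean}(\{x_i\})\right)^2}$$

$$\operatorname{std}(\{x_i\}) = \sqrt{\operatorname{mean}\left(\{(x_i - \operatorname{mean}(\{x_i\}))^2\}\right)}$$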
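A minimal sketch of this definition in Python, with made-up data. Note that the formula divides by N, which matches `statistics.pstdev` in the standard library. `statistics.stdev` divides by N − 1 and would give a different number.

```python
import math
import statistics

def std(xs):
    """Standard deviation as defined above: square root of the mean
    squared distance from the mean (divides by N, not N - 1)."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up example values
s = std(data)

print(s)                        # 2.0
print(statistics.pstdev(data))  # the library's population std agrees
```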
slide-13
SLIDE 13
Q2: Can the standard deviation of a dataset be -1?

  • A. YES
  • B. NO
slide-14
SLIDE 14

Properties of the standard deviation

✺ Scaling data scales the standard deviation
✺ Translating the data does NOT change the standard deviation

$$\operatorname{std}(\{k \cdot x_i\}) = |k| \cdot \operatorname{std}(\{x_i\})$$

$$\operatorname{std}(\{x_i + c\}) = \operatorname{std}(\{x_i\})$$

slide-15
SLIDE 15

Standard deviation: Chebyshev’s inequality (1st look)

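✺ At most $\frac{N}{k^2}$ items are k or more standard deviations (σ) away from the mean

✺ Rough justification: assume mean = 0. Place $N - \frac{N}{k^2}$ items at 0, $\frac{0.5N}{k^2}$ items at $-k\sigma$, and $\frac{0.5N}{k^2}$ items at $+k\sigma$. The standard deviation of this data set is exactly σ, so moving any more items out to distance kσ would push it above σ:

$$\operatorname{std} = \sqrt{\frac{1}{N}\left[\left(N - \frac{N}{k^2}\right) 0^2 + \frac{N}{k^2} (k\sigma)^2\right]} = \sigma$$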
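The bound can be checked empirically. The sketch below draws a made-up skewed sample (an exponential distribution, chosen only as an assumption for illustration) and compares the fraction of items at least k standard deviations from the mean with 1/k². The bound holds for any data set when the standard deviation divides by N.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
data = [random.expovariate(1.0) for _ in range(10_000)]  # made-up skewed sample

m = statistics.fmean(data)
s = statistics.pstdev(data)   # population std (divides by N)

def fraction_far(xs, m, s, k):
    """Fraction of items at least k standard deviations from the mean."""
    return sum(1 for x in xs if abs(x - m) >= k * s) / len(xs)

fractions = {k: fraction_far(data, m, s, k) for k in (1.5, 2, 3)}
for k, frac in fractions.items():
    print(f"k={k}: fraction={frac:.4f}, bound 1/k^2={1 / k**2:.4f}")
```

For most real data the observed fraction is far below the bound. Chebyshev's inequality is a worst-case statement.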

slide-16
SLIDE 16

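Variance (σ²)

✺ Variance = (standard deviation)²
✺ Scaling and translating behave similarly to the standard deviation

$$\operatorname{var}(\{x_i\}) = \frac{1}{N} \sum_{i=1}^{N} \left(x_i - \operatorname{mean}(\{x_i\})\right)^2$$

$$\operatorname{var}(\{k \cdot x_i\}) = k^2 \cdot \operatorname{var}(\{x_i\})$$

$$\operatorname{var}(\{x_i + c\}) = \operatorname{var}(\{x_i\})$$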
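The scaling and translation rules for the standard deviation and the variance can be checked together. This is a sketch with made-up data. A negative `k` is chosen on purpose to show where the absolute value and the square matter.

```python
import math
import statistics

data = [1.0, 2.0, 4.0, 7.0]   # made-up example values
k, c = -3.0, 5.0              # arbitrary (negative) scale and shift

scaled = [k * x for x in data]
shifted = [x + c for x in data]

std0 = statistics.pstdev(data)      # population std (divides by N)
var0 = statistics.pvariance(data)   # population variance (divides by N)

print(var0)                                                       # 5.25
print(math.isclose(var0, std0 ** 2))                              # variance = std squared
print(math.isclose(statistics.pstdev(scaled), abs(k) * std0))     # std scales by |k|
print(math.isclose(statistics.pvariance(scaled), k ** 2 * var0))  # variance scales by k^2
print(math.isclose(statistics.pstdev(shifted), std0))             # shift changes neither
print(math.isclose(statistics.pvariance(shifted), var0))
```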

slide-17
SLIDE 17

Q3: Standard deviation

✺ What is the value of $\operatorname{std}(\operatorname{mean}(\{x_i\}))$?

  • A. 0
  • B. 1
  • C. unsure
slide-18
SLIDE 18

Standard Coordinates/normalized data

✺ The mean tells where the data set is and the standard deviation tells how spread out it is. If we are interested only in comparing the shape, we could define:

$$\hat{x}_i = \frac{x_i - \operatorname{mean}(\{x_i\})}{\operatorname{std}(\{x_i\})}$$

✺ We say $\{\hat{x}_i\}$ is in standard coordinates

slide-19
SLIDE 19

Q4: Mean of standard coordinates

✺ μ of $\{\hat{x}_i\}$ is:

  • A. 1
  • B. 0
  • C. unsure

$$\hat{x}_i = \frac{x_i - \operatorname{mean}(\{x_i\})}{\operatorname{std}(\{x_i\})}$$

slide-20
SLIDE 20

Q5: Standard deviation (σ) of standard coordinates

✺ σ of $\{\hat{x}_i\}$ is:

  • A. 1
  • B. 0
  • C. unsure

$$\hat{x}_i = \frac{x_i - \operatorname{mean}(\{x_i\})}{\operatorname{std}(\{x_i\})}$$

slide-21
SLIDE 21

Q6: Variance of standard coordinates

✺ Variance of $\{\hat{x}_i\}$ is:

  • A. 1
  • B. 0
  • C. unsure

$$\hat{x}_i = \frac{x_i - \operatorname{mean}(\{x_i\})}{\operatorname{std}(\{x_i\})}$$

slide-22
SLIDE 22

Q7: Estimate the range of data in standard coordinates

✺ Estimate as closely as possible: 90% of the data is within:

  • A. [-10, 10]
  • B. [-100, 100]
  • C. [-1, 1]
  • D. [-4, 4]
  • E. others

$$\hat{x}_i = \frac{x_i - \operatorname{mean}(\{x_i\})}{\operatorname{std}(\{x_i\})}$$

slide-23
SLIDE 23

Summary stats of standard Coordinates/normalized data

slide-24
SLIDE 24

Standard Coordinates/normalized data to μ=0, σ=1, σ²=1

✺ Data in standard coordinates always has mean = 0, standard deviation = 1, and variance = 1.
✺ Such data is unitless, so plots based on it are sometimes more comparable
✺ We see such normalization very often in statistics
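A minimal sketch of converting data to standard coordinates, assuming the population standard deviation (dividing by N) as in the definitions earlier in the lecture. The function name `standardize` and the height values are made up for illustration.

```python
import math
import statistics

def standardize(xs):
    """Map each item to (x - mean) / std, using the population std."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    if s == 0:
        # All items are equal, so dividing by the std is undefined.
        raise ValueError("standard coordinates are undefined when std is 0")
    return [(x - m) / s for x in xs]

heights_cm = [150.0, 160.0, 165.0, 172.0, 183.0]   # made-up example values
z = standardize(heights_cm)

print(math.isclose(statistics.fmean(z), 0.0, abs_tol=1e-9))   # mean is 0
print(math.isclose(statistics.pstdev(z), 1.0))                # std is 1
print(math.isclose(statistics.pvariance(z), 1.0))             # variance is 1
```

The results no longer carry the unit (cm), which is why standardized data from different sources can be plotted on the same axes.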

slide-25
SLIDE 25

Median

✺ To organize the data we first sort it
✺ Then, if the number of items N is odd, median = the middle item's value
✺ If the number of items N is even, median = the mean of the middle 2 items' values
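The procedure above can be sketched directly in Python. The two small data sets are made up, and the result is compared with `statistics.median`, which follows the same rule.

```python
import statistics

def median(xs):
    """Sort, then take the middle item (odd N) or the mean of the
    two middle items (even N)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

odd = [7, 1, 5]        # made-up example, N = 3
even = [7, 1, 5, 3]    # made-up example, N = 4

print(median(odd))     # 5
print(median(even))    # 4.0
```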

slide-26
SLIDE 26

Properties of Median

✺ Scaling data scales the median
✺ Translating data translates the median

$$\operatorname{median}(\{k \cdot x_i\}) = k \cdot \operatorname{median}(\{x_i\})$$

$$\operatorname{median}(\{x_i + c\}) = \operatorname{median}(\{x_i\}) + c$$

slide-27
SLIDE 27

Percentile

✺ The kth percentile is the value relative to which k% of the data items have smaller or equal values
✺ The median is roughly the 50th percentile
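One way to turn this definition into code is to return the smallest data value that has at least k% of the items at or below it. This is only one of several conventions in use (many libraries interpolate between neighbouring items instead), so the sketch below is an assumption about how to read the definition. The scores are made up.

```python
import math
import statistics

def percentile(xs, k):
    """Smallest data value v such that at least k% of the items are <= v."""
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    s = sorted(xs)
    rank = math.ceil(k * len(s) / 100)   # how many items must be <= v
    return s[rank - 1]

scores = [55, 61, 68, 70, 74, 79, 83, 88, 92, 97]   # made-up exam scores

print(percentile(scores, 25))       # 68
print(percentile(scores, 50))       # 74
print(statistics.median(scores))    # 76.5
```

With this convention the 50th percentile (74) and the median (76.5) differ slightly on an even-sized data set, which is why the median is only roughly the 50th percentile.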

slide-28
SLIDE 28

Q8: Scaling effect on percentiles

✺ Scaling data scales the percentile

  • A. True
  • B. False
slide-29
SLIDE 29

Q9: Translating effect on percentiles

✺ Translating data does NOT change the percentile

  • A. True
  • B. False
slide-30
SLIDE 30

Interquartile range

✺ iqr = (75th percentile) - (25th percentile)
✺ Scaling data scales the interquartile range
✺ Translating data does NOT change the interquartile range

$$\operatorname{iqr}(\{k \cdot x_i\}) = |k| \cdot \operatorname{iqr}(\{x_i\})$$

$$\operatorname{iqr}(\{x_i + c\}) = \operatorname{iqr}(\{x_i\})$$
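A short sketch of the interquartile range and its two properties. It relies on `statistics.quantiles` with its default quartile convention, which is an assumption: other tools compute quartiles slightly differently and may give slightly different numbers. The data, `k`, and `c` are made up.

```python
import statistics

def iqr(xs):
    """75th percentile minus 25th percentile, using the quartile
    convention of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

data = [1, 2, 4, 7, 11, 16, 22]   # made-up example values
k, c = -2, 100                    # arbitrary (negative) scale and shift

base = iqr(data)
scaled = iqr([k * x for x in data])
shifted = iqr([x + c for x in data])

print(base)      # 14.0
print(scaled)    # 28.0, i.e. |k| * base
print(shifted)   # 14.0, unchanged by translation
```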

slide-31
SLIDE 31

Box plots

✺ Boxplots
✺ Simpler than a histogram
✺ Good for outliers
✺ Easier to use for comparison

[Figure: box plots of vehicle deaths by region (axis: DEATH)]

Data from https://www2.stetson.edu/~jrasp/data.htm

slide-32
SLIDE 32

Boxplots details, outliers

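✺ How to define outliers? (the default)

[Figure: annotated box plot labeling the whisker, box, median, outlier, and interquartile range (iqr). A point more than 1.5 iqr beyond the box is marked as an outlier; points within 1.5 iqr are covered by the whisker.]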
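The default rule can be sketched as follows: a point is flagged when it lies more than 1.5 iqr below the 25th percentile or above the 75th percentile. The function name `iqr_outliers`, the data values, and the use of `statistics.quantiles` for the quartiles are assumptions made for illustration. Plotting tools may compute quartiles slightly differently.

```python
import statistics

def iqr_outliers(xs, whisker=1.5):
    """Return the items lying more than `whisker` * iqr outside the box."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [x for x in xs if x < low or x > high]

values = [12, 14, 15, 15, 16, 18, 19, 45]   # made-up example values

print(iqr_outliers(values))            # [45]
print(iqr_outliers([1, 2, 3, 4, 5]))   # []
```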

slide-33
SLIDE 33

Discussion

✺ Pick a group to debate

slide-34
SLIDE 34

Sensitivity of summary statistics to outliers

✺ mean and standard deviation are very sensitive to outliers
✺ median and interquartile range are not sensitive to outliers
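The difference is easy to see by replacing one item of a small data set with an extreme value. The sketch below uses made-up numbers and the standard library's conventions for the population standard deviation and quartiles.

```python
import statistics

def summary(xs):
    """Location and scale summaries of a 1D data set."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": statistics.fmean(xs),
        "std": statistics.pstdev(xs),
        "median": statistics.median(xs),
        "iqr": q3 - q1,
    }

clean = [10, 11, 12, 13, 14, 15, 16]   # made-up example values
dirty = clean[:-1] + [1000]            # largest item replaced by an outlier

s_clean = summary(clean)
s_dirty = summary(dirty)

for name in ("mean", "std", "median", "iqr"):
    print(f"{name:>6}: clean={s_clean[name]:.2f}  with outlier={s_dirty[name]:.2f}")
```

In this example the median and the interquartile range do not move at all, while the mean and the standard deviation change by more than an order of magnitude.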

slide-35
SLIDE 35

Modes

✺ Modes are peaks in a histogram
✺ If there is more than 1 mode, we should be curious as to why

slide-36
SLIDE 36

Multiple modes

✺ We have seen the “iris” data, which looks to have several peaks

Data: “iris” in R

slide-37
SLIDE 37

Example: bimodal distribution

✺ Modes may indicate multiple populations

Data: Erythrocyte cells in healthy humans, Piagnerelli, JCP 2007

slide-38
SLIDE 38

Tails and Skews

Credit: Prof.Forsyth

slide-39
SLIDE 39

Looking at relationships in data

✺ Finding relationships between features in a data set or many data sets is one of the most important tasks in data science

slide-40
SLIDE 40

Heatmap

✺ Display matrix of data via gradient of color(s)

[Figure: summarization of 4 locations’ annual mean temperature by month]

slide-41
SLIDE 41

3D bar chart

✺ A transparent 3D bar chart is good for a small number of samples across categories

slide-42
SLIDE 42

Relationship between data feature and time

✺ Example: How does Amazon’s stock change over 1 year?

Take out the pair of features x: Day, y: AMZN

slide-43
SLIDE 43

Relationship between data features

✺ Example: does the weight of people relate to their height?
✺ x: HEIGHT, y: WEIGHT

slide-44
SLIDE 44

The visual way for continuous features

✺ Time series plot
✺ Scatter plot

slide-45
SLIDE 45

Time Series Plot: Stock of Amazon

slide-46
SLIDE 46

Scatter plot

✺ A most effective tool for geographic data and 2D data in general. It should be your first step with a new 2D dataset.

slide-47
SLIDE 47

Scatter plot

✺ Body Fat data set

slide-48
SLIDE 48

Scatter plot

✺ Scatter plot with density

slide-49
SLIDE 49

Scatter plot

✺ Outliers removed & data standardized

slide-50
SLIDE 50

Scatter plot

✺ Coupled with a heatmap to show a 3rd feature

slide-51
SLIDE 51

Correlation seen from scatter plots

[Figure panels: positive correlation, negative correlation, zero correlation]

Credit: Prof.Forsyth

slide-52
SLIDE 52

What kind of Correlation?

✺ lines of code in a database and number of bugs
✺ GPA and hours spent playing video games
✺ earnings and happiness

Credit: Prof. David Varodayan

slide-53
SLIDE 53

Correlation doesn’t mean causation

✺ Shoe size is correlated with reading skills, but that doesn’t mean making feet grow will make a person read faster.

slide-54
SLIDE 54

Assignments

✺ HW1 due Thurs. Sept. 3.
✺ Quiz 1 (open 4:30pm today until Sat.)
✺ Reading up to Chapter 2.1
✺ Next time: the quantitative part of the correlation coefficient

slide-55
SLIDE 55

Additional References

✺ Charles M. Grinstead and J. Laurie Snell, "Introduction to Probability"
✺ Morris H. DeGroot and Mark J. Schervish, "Probability and Statistics"

slide-56
SLIDE 56

See you next time

See You!