
Data Exploration

Tyler Moore

CSE 7338 Computer Science & Engineering Department, SMU, Dallas, TX

Lecture 5

Outline

1. Data exploration

Guide to analyzing data

Type of Data                Exploration                                             Statistics                    RByEx
1 numerical variable        [plot: ecdf(br$logbreach), log(#records breached)]      One-way t-test, Wilcox test   6.3
1 categorical variable      [bar plot: counts for CARD, HACK, PHYS, STAT]           –                             3.1
  # categories = 2          –                                                       prop.test                     6.2
1 categorical, 1 numerical  [box plots: log(#records breached) by Organization      anova, Permutation            10
                            Type, and by Breach type (FALSE/TRUE)]
  # categories = 2          –                                                       2-way t, Wilcox test, Perm.   6.4
2 categorical variables     [mosaic plot: TOH, organization type vs. breach type]   χ2 test                       3.2–3.5

R resources

“Beginner’s Guide to R” is good for introducing how to program in R, make plots, etc. Chapters available at http://lyle.smu.edu/~tylerm/courses/econsec/rbegin.html

“R by example” is good for introducing how to do data exploration and statistics in R (and is therefore useful for this part of the course). Chapters available at http://lyle.smu.edu/~tylerm/courses/econsec/rbyex.html


Exploring data sets

  • Ideally, a dataset consists of a table in which each record corresponds to a row.
  • Each column is a variable associated with that record.
  • Two primary variable types: numerical and categorical.
  • Other types include dates and strings.


Format of data sets

time        numbreach  numrecords  firm                                             orgtype  hacktype  city
2005-01-18       3500        3500  George Mason University                          EDU      HACK      Fairfax
2005-01-22      15790       15790  University of California, San Diego              EDU      HACK      San Diego
2005-02-12      45000       45000  University of Northern Colorado                  EDU      PORT      Greeley
2005-02-15     163000      163000  Science Applications International Corp. (SAIC)  BSO      STAT      San Diego
2005-02-18         85          85  ChoicePoint                                      BSO      INSD      Alpharetta
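To make the row-per-record, column-per-variable layout concrete, here is a minimal sketch that rebuilds only the five rows shown above as an R data frame. The name `br_sample` is hypothetical; the full course data set (called `br` on later slides) is not reproduced here.

```r
# Hypothetical reconstruction of the five sample rows; not the full data set.
br_sample <- data.frame(
  time       = as.Date(c("2005-01-18", "2005-01-22", "2005-02-12",
                         "2005-02-15", "2005-02-18")),
  numbreach  = c(3500, 15790, 45000, 163000, 85),
  numrecords = c(3500, 15790, 45000, 163000, 85),
  firm       = c("George Mason University",
                 "University of California, San Diego",
                 "University of Northern Colorado",
                 "Science Applications International Corp. (SAIC)",
                 "ChoicePoint"),
  orgtype    = factor(c("EDU", "EDU", "EDU", "BSO", "BSO")),
  hacktype   = factor(c("HACK", "HACK", "PORT", "STAT", "INSD")),
  city       = c("Fairfax", "San Diego", "Greeley", "San Diego", "Alpharetta"),
  stringsAsFactors = FALSE
)

str(br_sample)            # one row per record, one column per variable
sapply(br_sample, class)  # Date, numeric, character, and factor (categorical) columns
```

Storing the categorical columns as factors is a choice made for this sketch; it lets functions such as `table()` and `boxplot()` treat them as groups.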

Exploration Case 1: One numerical variable

Visualization options

1. Box plots
2. Strip charts
3. CDFs

Statistical tests

1. Confidence interval for median value: 1-way Wilcoxon test

Questions of interest

1. What is the distribution of numerical values?
2. What are the summary statistics of the numerical values (mean, median)?

Box plots

[Figure: box plot of log(# breached records); y-axis from 2 to 8]

    br$logbreach <- log(br$numbreach, 10)
    boxplot(br$logbreach, ylab = 'log(# breached records)')


Strip charts

[Figure: stacked strip chart of log(#records breached); x-axis from 2 to 8]

    stripchart(br$logbreach, method = 'stack',
               pch = 19, xlab = "log(# records breached)")

Cumulative distribution functions

[Figure: ecdf(br$numbreach); x from 0e+00 to 1e+08, Fn(x) from 0.0 to 1.0]

[Figure: ecdf(br$logbreach); x from 2 to 8, Fn(x) from 0.0 to 1.0]

    plot(ecdf(br$logbreach))
    abline(v = 4, col = 'red', lty = 'dashed')
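The dashed line at 4 marks 10^4 records on the log scale. As a rough sketch of how to read a value off the curve: `ecdf()` returns a function, so it can be evaluated at that threshold directly. The vector below is synthetic and only stands in for `br$logbreach`.

```r
set.seed(1)
# Synthetic stand-in for br$logbreach (log10 of records breached); not the course data.
logbreach <- rnorm(500, mean = 4, sd = 1.2)

Fn <- ecdf(logbreach)   # Fn is a step function: Fn(x) = share of values <= x
Fn(4)                   # share of breaches with at most 10^4 records
mean(logbreach <= 4)    # the same quantity computed directly

plot(Fn, xlab = "log(#records breached)")
abline(v = 4, col = "red", lty = "dashed")
```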

Get confidence interval for single numerical variable

If the data is normally distributed, you can use a one-way (one-sample) t-test.

The data we study usually is not normally distributed, so you must use a non-parametric test such as the Wilcoxon test instead.
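A minimal sketch of both options on synthetic, skewed data (the vector `x` is made up for illustration and is not the breach data). With `conf.int = TRUE`, `wilcox.test()` reports an interval for the pseudomedian, which coincides with the median when the distribution is symmetric.

```r
set.seed(42)
# Synthetic skewed sample standing in for a numerical variable.
x <- rlnorm(200, meanlog = 1, sdlog = 0.8)

# Parametric: one-sample t-test gives a confidence interval for the mean.
tt <- t.test(x, conf.level = 0.95)
tt$conf.int

# Non-parametric: one-sample Wilcoxon test gives an interval for the (pseudo)median.
wt <- wilcox.test(x, conf.int = TRUE, conf.level = 0.95)
wt$estimate
wt$conf.int
```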


Exploration case 2: One numerical variable and one categorical variable

Visualization options

1. Box plots
2. Strip charts
3. CDFs

Statistical tests

1. When number of categories = 2, compare the difference in median values for categories: 2-way (two-sample) Wilcoxon test

Questions of interest

1. Does the distribution of the numerical variable change for different categories?
2. Is the difference in median values across categories statistically significant?

Box plots

[Figure: box plots of log(#records breached) by Organization Type (BSF, BSO, BSR, EDU, GOV, MED, NGO); y-axis from 2 to 8]

    par(las = 1)
    # method = 'stack' on the slide is a stripchart() argument, not a boxplot() one, so it is dropped
    boxplot(logbreach ~ orgtype, pch = 19,
            ylab = "log(# records breached)",
            xlab = "Organization Type", data = br)

Box plots

[Figure: box plots of log(#records breached) by Breach Type, in alphabetical order (CARD, DISC, HACK, INSD, PHYS, PORT, STAT, UNKN); y-axis from 2 to 8]

Box plots

[Figure: box plots of log(#records breached) by Breach Type, sorted (CARD, INSD, PHYS, UNKN, DISC, STAT, HACK, PORT); y-axis from 2 to 8]

    # suppose we want to see the box plots sorted by the number o...
    br$hacktype <- reorder(br$hacktype, br$logbreach, median, na.rm = TRUE)
    # method = 'stack' on the slide is a stripchart() argument, so it is dropped
    boxplot(logbreach ~ hacktype, pch = 19,
            ylab = "log(# records breached)",
            xlab = "Breach Type", data = br)
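The effect of `reorder()` is easiest to see on a small example. The sketch below uses a made-up data frame `toy` (not the course data) with three breach types whose medians are deliberately well separated, and shows the factor levels before and after sorting by median.

```r
set.seed(7)
# Made-up data: three breach types with medians near 5, 3 and 4.
toy <- data.frame(
  hacktype  = factor(rep(c("HACK", "CARD", "PORT"), each = 30)),
  logbreach = c(rnorm(30, 5, 0.3), rnorm(30, 3, 0.3), rnorm(30, 4, 0.3))
)

levels(toy$hacktype)   # default order is alphabetical

# Re-level the factor so categories are ordered by their median logbreach.
toy$hacktype <- reorder(toy$hacktype, toy$logbreach, median, na.rm = TRUE)
levels(toy$hacktype)   # now ordered from lowest to highest median

boxplot(logbreach ~ hacktype, data = toy,
        xlab = "Breach Type", ylab = "log(# records breached)")
```

Because `boxplot()` draws groups in factor-level order, re-leveling the factor is all that is needed to sort the plot.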


Two-way Wilcoxon tests

See R console
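The console session itself is not part of these notes. As a rough sketch of what such a test can look like, the example below runs a two-sample Wilcoxon test on made-up data, with a hypothetical logical column `is_hack` splitting the records into two categories.

```r
set.seed(3)
# Made-up data: a numerical variable split into two categories (FALSE / TRUE).
toy <- data.frame(
  is_hack   = rep(c(FALSE, TRUE), each = 60),
  logbreach = c(rnorm(60, 3.5, 1), rnorm(60, 4.5, 1))
)

# Two-sample Wilcoxon (rank-sum) test via the formula interface.
res <- wilcox.test(logbreach ~ is_hack, data = toy, conf.int = TRUE)
res$p.value    # small p-value: the two groups differ in location
res$conf.int   # interval for the location shift, first group (FALSE) minus second (TRUE)
res$estimate   # estimated location shift
```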


Exploration case 3: One categorical variable

Visualization options

1. Bar plots

Statistical tests

1. When number of categories = 2, get confidence interval for proportions using prop.test

Questions of interest

1. Are the differences in proportion across categories statistically significant?
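A bar plot of a single categorical variable starts from its frequency table. The sketch below uses a short made-up vector `hacktype` (not the course data) to show the counting, sorting and plotting steps.

```r
# Made-up categorical variable for illustration.
hacktype <- factor(c("HACK", "PORT", "HACK", "CARD", "PORT",
                     "HACK", "STAT", "PORT", "HACK"))

counts <- table(hacktype)                 # frequency of each category
counts <- sort(counts, decreasing = TRUE) # largest category first
counts

prop.table(counts)                        # the same counts as proportions

barplot(counts, las = 1, ylab = "# breaches")
```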

Proportional test

    > #Task: get confidence interval for proportion of a single
    > # categorical variable with two possible values (e.g., breach, no breach)
    > #Solution: prop.test
    > #How many hospitals are there?
    > #http://www.aha.org/research/rc/stat-studies/fast-facts.shtml
    > #What proportion of them have suffered a breach
    > # (assuming only hospitals included in the category "MED")?
    > prop.test(length(br$orgtype[br$orgtype=="MED"]),
    + 5754-length(br$orgtype[br$orgtype=="MED"]),
    + p=.3,alternative="two.sided",correct=F)

            1-sample proportions test without continuity correction

    data:  length(br$orgtype[br$orgtype == "MED"]) out of 5754 - length(br$orgtype[br$orgtype == "MED"]), null probability 0.3
    X-squared = 208.51, df = 1, p-value < 2.2e-16
    alternative hypothesis: true p is not equal to 0.3
    95 percent confidence interval:
     0.1930776 0.2159367
    sample estimates:
            p
    0.2042696
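In `prop.test(x, n, ...)`, `x` is the number of successes and `n` the number of trials. The call above passes 5754 minus the MED count as `n`, so the reported estimate 0.2043 corresponds to 976 / 4778, where 976 is the MED row total of the contingency table later in these notes. The sketch below reproduces that estimate and, assuming 5754 is meant as the total number of hospitals and each MED record is a distinct hospital, shows the call with `n` set to that total.

```r
x <- 976    # MED breaches (row total of the MED row in the contingency table)
n <- 5754   # hospital count used in the slide's call; assumed here to be the total

# Same arguments as the slide: n is reduced by x, giving 976 / 4778.
slide <- prop.test(x, n - x, p = 0.3, alternative = "two.sided", correct = FALSE)
slide$estimate

# Assumption: 5754 is the number of trials, giving 976 / 5754.
res <- prop.test(x, n, p = 0.3, alternative = "two.sided", correct = FALSE)
res$estimate
res$conf.int
```

Which denominator is appropriate depends on what the 5754 figure counts, which the slide does not spell out.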

Exploration case 4: Two categorical variables

Visualization options

1. Mosaic plots
2. Contingency tables

Statistical tests

1. χ2 test

Questions of interest

1. Are differences in proportion caused by chance, or are some groups under- and over-represented in other categories?


Contingency Tables

    > TOH<-table(br$orgtype,br$hacktype)
    > TOH
          CARD DISC HACK INSD PHYS PORT STAT UNKN
      BSF   22   75   97   82   53  136   24   27
      BSO    3   67  176   51   47  112   21   15
      BSR   32   43  190   65   35   65   15   17
      EDU    1  207  225   23   53  125   48   13
      MED    1  116   53  166  183  343   87   27
      GOV  176 99 68 100 162 23 20
      NGO  7 28 8 10 32 5 3
    > mosaicplot(TOH,col=rainbow(ncol(TOH)))

The GOV and NGO rows retain only seven of their eight counts, so they are listed after the complete rows and left unaligned.
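As a small illustration of the same steps, the sketch below builds a contingency table from a made-up data frame `toy` (eight invented records, not the course data), then adds margins and row proportions before drawing the mosaic plot.

```r
# Made-up records: two categorical variables per row.
toy <- data.frame(
  orgtype  = c("EDU", "EDU", "MED", "MED", "MED", "GOV", "EDU", "MED"),
  hacktype = c("HACK", "HACK", "PORT", "PHYS", "PORT", "DISC", "PORT", "PORT")
)

tab <- table(toy$orgtype, toy$hacktype)   # rows: orgtype, columns: hacktype
tab

addmargins(tab)               # adds row and column totals
prop.table(tab, margin = 1)   # each row rescaled to proportions summing to 1

mosaicplot(tab, col = rainbow(ncol(tab)), main = "toy")
```

In the mosaic plot, tile areas are proportional to the cell counts, which is why the row proportions are a useful companion to the raw table.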

Mosaic plot

[Figure: mosaic plot of TOH; organization types BSF, BSO, BSR, EDU, GOV, MED, NGO against breach types CARD, DISC, HACK, INSD, PHYS, PORT, STAT, UNKN]

Sorted mosaic plot

    brt <- br
    brt$hacktype <- reorder(brt$hacktype, brt$hacktype, length)
    brt$orgtype <- reorder(brt$orgtype, brt$orgtype, length)
    TOH.srt <- table(brt$orgtype, brt$hacktype)
    par(las = 1)
    mosaicplot(TOH.srt, col = rainbow(ncol(TOH.srt)))

Sorted mosaic plot

[Figure: mosaic plot of TOH.srt; organization types NGO, BSR, BSO, BSF, GOV, EDU, MED against breach types CARD, UNKN, STAT, INSD, PHYS, DISC, HACK, PORT, each sorted by frequency]


Chi-squared test

    > TOH.cs<-chisq.test(TOH.srt)
    > #check for significance.
    > TOH.cs

            Pearson's Chi-squared test

    data:  TOH.srt
    X-squared = 841.0608, df = 42, p-value < 2.2e-16

    > mosaicplot(TOH.srt,shade=T)
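The 42 degrees of freedom above equal (7 - 1) × (8 - 1) for the 7 × 8 table. The sketch below runs the same kind of test on a made-up 2 × 2 table `toy` (invented counts) and pulls out the expected counts and standardized residuals, the quantities a shaded mosaic plot colors by.

```r
# Made-up 2 x 2 contingency table for illustration.
toy <- matrix(c(40, 10,
                15, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(orgtype  = c("EDU", "MED"),
                              hacktype = c("HACK", "PORT")))

cs <- chisq.test(toy, correct = FALSE)   # no continuity correction
cs$statistic   # chi-squared statistic
cs$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
cs$p.value

cs$expected    # counts expected if the two variables were independent
cs$stdres      # standardized residuals: positive = over-represented cell

mosaicplot(toy, shade = TRUE, main = "toy")
```

Cells with large positive or negative standardized residuals are the ones that drive a significant result, which is what the shading highlights.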

Mosaic plot with residuals

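[Figure: mosaic plot of TOH.srt shaded by standardized residuals, with legend bins <−4, −4:−2, −2:0, 0:2, 2:4, >4; organization types NGO, BSR, BSO, BSF, GOV, EDU, MED against breach types CARD, UNKN, STAT, INSD, PHYS, DISC, HACK, PORT]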
