slide-1
SLIDE 1

Data Visualization

Cohen Chapter 2

EDUC/PSY 6600

slide-2
SLIDE 2

Always plot your data first!

"Always." - Severus Snape

2 / 29

slide-3
SLIDE 3

Always plot your data first!

"Always." - Severus Snape

Why?

- Outliers and impossible values
- Determine correct statistical approach
- Assumptions and diagnostics
- Discover new relationships

2 / 29

slide-4
SLIDE 4

The Visualization Paradox

- Often the most informative aspect of analysis
- Communicates the "data story" the best
- Most abused area of quantitative science
- Figures can be very misleading

Misleading Graphs

3 / 29

slide-5
SLIDE 5

Much better

4 / 29

slide-6
SLIDE 6

Keys to Good Viz's

- Graphical method should match level of measurement
- Label all axes and include figure caption
- Simplicity and clarity
- Avoid ‘chartjunk’

5 / 29

slide-7
SLIDE 7

Keys to Good Viz's

- Graphical method should match level of measurement
- Label all axes and include figure caption
- Simplicity and clarity
- Avoid ‘chartjunk’
- Unless there are 3 or more variables, avoid 3D figures (and even then, avoid it)
- Black & white, grayscale/pattern fine for most simple figures

5 / 29

slide-8
SLIDE 8

Data Visualizations

Takes practice -- try a bunch of stuff

6 / 29

slide-9
SLIDE 9

Data Visualizations

Takes practice -- try a bunch of stuff

Resources

- Edward Tufte's books
- "R for Data Science" by Grolemund and Wickham
- "Data Visualization for Social Science" by Healy

6 / 29

slide-10
SLIDE 10

Frequency Distributions

- Counting the number of occurrences of unique events
- Categorical or continuous, just like with tableF() and table1()
- Can see central tendency (continuous data) or most common value (categorical data)
- Can see range and extremes

──────────────────────────────────────────────────────────────
 x        Freq  CumFreq  Percent  CumPerc  Valid   CumValid
 1        265   265      26.50%   26.50%   27.32%  27.32%
 2        222   487      22.20%   48.70%   22.89%  50.21%
 3        242   729      24.20%   72.90%   24.95%  75.15%
 4        241   970      24.10%   97.00%   24.85%  100.00%
 Missing  30    1000     3.00%    100.00%
──────────────────────────────────────────────────────────────

7 / 29
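
The Percent and Valid columns in the table above differ only in their denominator: Percent divides by all 1000 rows, Valid divides by the 970 rows that are not missing. A minimal base R sketch, reusing the counts shown on the slide (the object names are made up for illustration), recomputes those columns:

```r
# Counts copied from the frequency table on the slide; object names are illustrative.
freq      <- c("1" = 265, "2" = 222, "3" = 242, "4" = 241)
n_missing <- 30

n_valid <- sum(freq)              # rows with an observed value
n_total <- n_valid + n_missing    # all rows, including missing

cum_freq    <- cumsum(freq)
percent     <- 100 * freq / n_total       # share of all rows
cum_percent <- 100 * cum_freq / n_total
valid       <- 100 * freq / n_valid       # share of non-missing rows only
cum_valid   <- 100 * cum_freq / n_valid

freq_table <- data.frame(
  x        = names(freq),
  Freq     = as.vector(freq),
  CumFreq  = as.vector(cum_freq),
  Percent  = round(as.vector(percent), 2),
  CumPerc  = round(as.vector(cum_percent), 2),
  Valid    = round(as.vector(valid), 2),
  CumValid = round(as.vector(cum_valid), 2)
)
print(freq_table)
```

When a variable has many missing values, the gap between Percent and Valid grows, so it is worth stating which one a report uses.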

slide-11
SLIDE 11

Frequencies and Viz's Together ❤

Bar Graph

8 / 29

slide-12
SLIDE 12

Frequencies and Viz's Together ❤

Bar Graph
Histogram

8 / 29

slide-13
SLIDE 13

What does DISTRIBUTION mean?

The way that the data points are scattered

9 / 29

slide-14
SLIDE 14

What does DISTRIBUTION mean?

The way that the data points are scattered

For Continuous

- General shape
- Exceptions (outliers)
- Modes (peaks)
- Center & spread (chap 3)
- Histogram

For Categorical

- Counts of each
- Percent or Rate (adjusts for an ‘out of’ to compare)
- Bar chart
- Pie chart (avoid!)

9 / 29
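
For categorical data, raw counts can mislead when groups differ in size, which is why the slide suggests a percent or rate. The sketch below uses invented data (two class sections of different sizes, not the Ihno data) to show counts next to row percentages:

```r
# Hypothetical data: coffee habit by class section (sizes differ on purpose).
section <- c(rep("A", 40), rep("B", 10))
coffee  <- c(rep("yes", 20), rep("no", 20),   # section A: 20 of 40 drink coffee
             rep("yes", 8),  rep("no", 2))    # section B: 8 of 10 drink coffee

# Raw counts: section A has more coffee drinkers in absolute terms.
counts <- table(section, coffee)

# Row percentages: each count divided by its own section total,
# which puts both sections on the same "out of 100" footing.
row_pct <- 100 * prop.table(counts, margin = 1)

print(counts)
print(round(row_pct, 1))
```

Section A has the larger count of coffee drinkers, yet section B has the higher percentage; the comparison flips once the "out of" is taken into account.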

slide-15
SLIDE 15

Let's Apply This To the Ihno Dataset

10 / 29

slide-16
SLIDE 16

Reminder

11 / 29

slide-17
SLIDE 17

Read in the Data

library(tidyverse)    # the easy button
library(rio)          # read in Excel files
library(furniture)    # nice tables

data_raw <- rio::import("Ihno_dataset.xls") %>%
  dplyr::rename_all(tolower)    # converts all variable names to lower case

12 / 29

slide-18
SLIDE 18

Read in the Data

library(tidyverse)    # the easy button
library(rio)          # read in Excel files
library(furniture)    # nice tables

data_raw <- rio::import("Ihno_dataset.xls") %>%
  dplyr::rename_all(tolower)    # converts all variable names to lower case

And Clean It

data_clean <- data_raw %>%
  dplyr::mutate(majorF = factor(major,
                                levels = c(1, 2, 3, 4, 5),
                                labels = c("Psychology", "Premed",
                                           "Biology", "Sociology",
                                           "Economics"))) %>%
  dplyr::mutate(coffeeF = factor(coffee,
                                 levels = c(0, 1),
                                 labels = c("Not a regular coffee drinker",
                                            "Regularly drinks coffee")))

12 / 29
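
The cleaning step turns numeric codes into labelled factors. A small base R sketch of what `factor()` does with `levels` and `labels`, using an invented vector of codes in place of the real `major` column (the stray code 6 and the NA are added on purpose):

```r
# Hypothetical numeric codes standing in for the `major` column.
major <- c(1, 3, 2, 5, 1, 4, NA, 6)

# Each level is paired with the label in the same position.
majorF <- factor(major,
                 levels = c(1, 2, 3, 4, 5),
                 labels = c("Psychology", "Premed", "Biology",
                            "Sociology", "Economics"))

print(majorF)

# Codes that are not listed in `levels` (here, 6) silently become NA,
# so a quick count of missing values after recoding is a useful check.
print(table(majorF, useNA = "ifany"))
```

Comparing the number of NAs before and after recoding is a cheap way to catch codes that were left out of `levels`.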

slide-19
SLIDE 19

Frequency Distributions

data_clean %>%
  furniture::tableF(majorF)

##
## ─────────────────────────────────────────
##  majorF     Freq CumFreq Percent CumPerc
##  Psychology 29   29      29.00%  29.00%
##  Premed     25   54      25.00%  54.00%
##  Biology    21   75      21.00%  75.00%
##  Sociology  15   90      15.00%  90.00%
##  Economics  10   100     10.00%  100.00%
## ─────────────────────────────────────────

data_clean %>%
  furniture::tableF(phobia)

##
## ─────────────────────────────────────
##  phobia Freq CumFreq Percent CumPerc
##  0      12   12      12.00%  12.00%
##  1      15   27      15.00%  27.00%
##  2      12   39      12.00%  39.00%
##  3      16   55      16.00%  55.00%
##  4      21   76      21.00%  76.00%
##  5      11   87      11.00%  87.00%
##  6      1    88      1.00%   88.00%
##  7      4    92      4.00%   92.00%
##  8      4    96      4.00%   96.00%
##  9      1    97      1.00%   97.00%
##  10     3    100     3.00%   100.00%
## ─────────────────────────────────────

13 / 29

slide-20
SLIDE 20

Frequency Viz's

For viz's, we will use ggplot2

This provides the most powerful, beautiful framework for data visualizations

14 / 29

slide-21
SLIDE 21

Frequency Viz's

For viz's, we will use ggplot2

This provides the most powerful, beautiful framework for data visualizations

- It is built on making layers
- Each plot has a "geom" function
- e.g. geom_bar() for bar charts, geom_histogram() for histograms, etc.

14 / 29
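
To make the "layers" idea concrete, the sketch below builds a plot in two steps and inspects it. It assumes ggplot2 is installed and uses a tiny invented data frame (`toy`) in place of `data_clean`; it is an illustration of the layering idea, not part of the original deck.

```r
library(ggplot2)

# Hypothetical stand-in for data_clean with one labelled factor.
toy <- data.frame(
  majorF = factor(c("Psychology", "Premed", "Psychology",
                    "Biology", "Psychology"))
)

# Step 1: data plus an aesthetic mapping. This sets up the axis but draws nothing.
p_blank <- ggplot(toy) + aes(majorF)

# Step 2: adding a geom puts one layer of bars on top of that canvas.
p_bar <- p_blank + geom_bar()

# Number of layers in each plot object.
n_layers_blank <- length(p_blank$layers)
n_layers_bar   <- length(p_bar$layers)
print(c(blank = n_layers_blank, bar = n_layers_bar))

# The counts geom_bar() computed for each bar (one row per factor level).
bar_counts <- layer_data(p_bar)$count
print(bar_counts)
```

A plot with a mapping but no geom renders as an empty panel, which is why the geom call is the step that actually produces the bars.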

slide-22
SLIDE 22

Bar Charts

data_clean %>%
  ggplot() +
  aes(majorF)

15 / 29

slide-23
SLIDE 23

Bar Charts

data_clean %>%
  ggplot() +
  aes(majorF)

data_clean %>%
  ggplot() +
  aes(majorF) +
  geom_bar()

15 / 29

slide-24
SLIDE 24

Bar Charts

data_clean %>%
  ggplot() +
  aes(coffee) +
  geom_bar()

16 / 29

slide-25
SLIDE 25

Histograms

data_clean %>%
  ggplot() +
  aes(phobia) +
  geom_histogram()

17 / 29

slide-26
SLIDE 26

Histograms (change number of bins)

data_clean %>%
  ggplot() +
  aes(phobia) +
  geom_histogram(bins = 8)

18 / 29

slide-27
SLIDE 27

Histograms (change bins to size 5)

data_clean %>%
  ggplot() +
  aes(phobia) +
  geom_histogram(binwidth = 5)

19 / 29
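
`bins` fixes how many bars are drawn, while `binwidth` fixes how wide each bar is and lets the number of bars follow from the range of the data. The base R sketch below mimics that trade-off with `cut()` on made-up scores. ggplot2 places its bin edges somewhat differently, so treat this as an illustration of the idea, not a reproduction of `geom_histogram()`.

```r
# Made-up scores on a 0-10 scale (not the Ihno data).
scores <- c(0, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 8, 10)

# Fixed bin WIDTH: edges every 5 units; the number of bins follows from the range.
edges_width <- seq(0, 10, by = 5)
by_width <- table(cut(scores, breaks = edges_width, include.lowest = TRUE))

# Fixed NUMBER of bins: 8 equal-width bins; the width follows from the range.
n_bins <- 8
edges_bins <- seq(min(scores), max(scores), length.out = n_bins + 1)
by_bins <- table(cut(scores, breaks = edges_bins, include.lowest = TRUE))

print(by_width)   # few, wide bins: coarse picture
print(by_bins)    # more, narrower bins: finer picture
```

Every score lands in exactly one bin either way; only the resolution of the picture changes, which is why it pays to try more than one setting.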

slide-28
SLIDE 28

Histograms

data_clean %>%
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4)

20 / 29

slide-29
SLIDE 29

Histograms-by-a-Factor (columns)

data_clean %>%
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4) +
  facet_grid(. ~ coffeeF)

21 / 29

slide-30
SLIDE 30

Histograms-by-a-Factor (rows)

data_clean %>%
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4) +
  facet_grid(coffeeF ~ .)

22 / 29
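
Faceting draws one histogram per level of the factor on shared bins, so the panels can be compared directly. In `facet_grid()`, the left side of the `~` gives the rows and the right side gives the columns. The base R sketch below shows the counts behind such panels, using invented scores and group labels (not the Ihno data) and hand-picked bin edges:

```r
# Invented quiz scores and coffee groups, for illustration only.
mathquiz <- c(10, 14, 18, 22, 25, 29, 31, 33, 38, 41, 45, 47)
coffeeF  <- factor(rep(c("Not a regular coffee drinker",
                         "Regularly drinks coffee"), times = 6))

# Shared bin edges of width 4, analogous to binwidth = 4.
edges <- seq(8, 48, by = 4)
bins  <- cut(mathquiz, breaks = edges, include.lowest = TRUE)

# One column of counts per group: the numbers each panel would display.
panel_counts <- table(bins, coffeeF)
print(panel_counts)
```

Because both groups are cut on the same edges, a difference between the columns reflects the data and not the binning.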

slide-31
SLIDE 31

Deciles (break into 10% chunks)

data_clean %>%
  dplyr::pull(statquiz) %>%
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90))

## 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 4.0 6.0 6.0 7.0 7.0 8.0 8.0 8.0 8.1

23 / 29
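
Typing nine probabilities by hand is error-prone; `seq()` can generate the same vector. A minimal sketch on an invented vector (`1:11`, chosen so the arithmetic is easy to check by eye):

```r
# Invented scores; 1:11 keeps the results easy to verify.
scores <- 1:11

# Same probabilities as c(.10, .20, ..., .90), generated instead of typed.
decile_probs <- seq(0.1, 0.9, by = 0.1)

deciles <- quantile(scores, probs = decile_probs)
print(deciles)
```

With eleven evenly spaced values, each decile falls on one of the scores from 2 through 10, which makes this a handy sanity check before running the same call on real data.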

slide-32
SLIDE 32

Deciles - with missing values

data_clean %>%
  dplyr::pull(mathquiz) %>%
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90))

Error in quantile.default(., probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, : missing values and NaN's not allowed if 'na.rm' is FALSE

24 / 29

slide-33
SLIDE 33

Deciles - na.rm = TRUE

data_clean %>%
  dplyr::pull(mathquiz) %>%
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90),
           na.rm = TRUE)

##  10%  20%  30%  40%  50%  60%  70%  80%  90%
## 15.0 21.0 25.2 28.0 30.0 32.0 33.8 37.2 41.0

25 / 29
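
The same behaviour can be reproduced on a small invented vector, which also makes it easy to see how many values `na.rm = TRUE` drops. This is a sketch in base R; the scores are made up.

```r
# Invented scores with two missing values.
scores <- c(12, 18, NA, 25, 30, NA, 41)

# Without na.rm, quantile() stops with an error.
# tryCatch() captures the message so the script keeps running.
msg <- tryCatch(
  {
    quantile(scores, probs = 0.5)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With na.rm = TRUE the NAs are dropped before the quantile is computed.
med_ok <- quantile(scores, probs = 0.5, na.rm = TRUE)
print(med_ok)

# Worth reporting alongside the result: how many values were dropped.
n_dropped <- sum(is.na(scores))
print(n_dropped)
```

The error is a safeguard: it forces an explicit decision about missing values instead of quietly computing on a smaller sample.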

slide-34
SLIDE 34

Quartiles (break into 4 chunks)

data_clean %>%
  dplyr::pull(statquiz) %>%
  quantile(probs = c(0, .25, .50, .75, 1))

##   0%  25%  50%  75% 100%
##    1    6    7    8   10

26 / 29
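
The probabilities `c(0, .25, .50, .75, 1)` are also what `quantile()` uses when `probs` is left out, and the 0%, 50%, and 100% entries are simply the minimum, median, and maximum. A short check on invented scores:

```r
# Invented scores, for illustration only.
scores <- c(1, 4, 6, 6, 7, 7, 8, 8, 9, 10)

quartiles <- quantile(scores, probs = c(0, .25, .50, .75, 1))
defaults  <- quantile(scores)   # probs omitted: same five cut points

print(quartiles)

# The outer and middle cut points line up with familiar summaries.
same_as_default <- identical(quartiles, defaults)
checks <- c(
  min    = quartiles[["0%"]]   == min(scores),
  median = quartiles[["50%"]]  == median(scores),
  max    = quartiles[["100%"]] == max(scores)
)
print(same_as_default)
print(checks)
```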

slide-35
SLIDE 35

Percentiles

data_clean %>%
  dplyr::pull(statquiz) %>%
  quantile(probs = c(.01, .05, .173, .90))

##    1%    5% 17.3%   90%
##  2.98  3.00  5.00  8.10

27 / 29
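
`quantile()` goes from a percentage to a score. The reverse question, what share of scores fall at or below a given value, can be sketched with base R's `ecdf()`. The scores below are invented for illustration.

```r
# Invented scores, for illustration only.
scores <- c(1, 3, 3, 5, 5, 6, 7, 7, 8, 10)

# Percentage -> score: the value below which about 90% of scores fall.
p90 <- quantile(scores, probs = 0.90)
print(p90)

# Score -> percentage: ecdf() returns a function giving the
# proportion of scores at or below its argument.
pct_rank <- ecdf(scores)
rank_of_5 <- pct_rank(5)
print(rank_of_5)

# The same proportion computed directly, as a cross-check.
direct <- mean(scores <= 5)
print(direct)
```

Multiplying the proportion by 100 gives the percentile rank of that score within this sample.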

slide-36
SLIDE 36

Questions?

28 / 29

slide-37
SLIDE 37

Next Topic

Center and Spread

29 / 29