Surveying Catholic sisters in 1967, Julia Silge, Data Scientist at Stack Overflow

SLIDE 1

DataCamp Supervised Learning in R: Case Studies

Surveying Catholic sisters in 1967

SUPERVISED LEARNING IN R: CASE STUDIES

Julia Silge

Data Scientist at Stack Overflow

SLIDE 2

Conference of Major Superiors of Women Sisters' Survey

Fielded in 1967 with over 600 questions
Responses from over 130,000 sisters in almost 400 congregations
Data is freely available

SLIDE 3

Opinions and attitudes in the 1960s

Response                    Code
Disagree very much          1
Disagree somewhat           2
Neither agree nor disagree  3
Agree somewhat              4
Agree very much             5

Check out the survey's codebook.

SLIDE 4

Opinions and attitudes in the 1960s

> sisters67
# A tibble: 77,112 x 67
     age sister  v116  v117  v118  v119  v120  v121  v122  v123  v124  v125
   <dbl>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1  60.0      1     1     1     3     5     1     1     3     5     3     1
 2  70.0      2     2     2     4     4     1     3     1     5     4     1
 3  60.0      3     1     1     3     2     2     3     1     1     3     1
 4  60.0      4     5     1     2     4     1     3     4     3     3     4
 5  50.0      5     2     3     3     3     2     2     1     5     2     5
 6  40.0      7     4     3     2     5     4     3     1     5     2     5
 7  50.0      9     5     4     5     4     4     5     3     5     4     2
 8  40.0     10     5     4     3     5     1     3     5     5     5     4
 9  30.0     11     2     2     3     5     1     3     3     5     1     3
10  30.0     12     4     1     5     5     1     4     3     5     1     5
# ... with 77,102 more rows, and 55 more variables: v126 <int>, v127 <int>,
#   v128 <int>, v129 <int>, v130 <int>, v131 <int>, v132 <int>, v133 <int>,
#   v134 <int>, v135 <int>, v136 <int>, v137 <int>, v138 <int>, v139 <int>,
#   v140 <int>, v141 <int>, v142 <int>, v143 <int>, v144 <int>, v145 <int>,
#   v146 <int>, v147 <int>, v148 <int>, v149 <int>, v150 <int>, v151 <int>,
#   v152 <int>, v153 <int>, v154 <int>, v155 <int>, v156 <int>, v157 <int>,
#   v158 <int>, v159 <int>, v160 <int>, v161 <int>, v162 <int>, v163 <int>,
#   v164 <int>, v165 <int>, v166 <int>, v167 <int>, v168 <int>, v169 <int>,
#   v170 <int>, v171 <int>, v172 <int>, v173 <int>, v174 <int>, v175 <int>,
#   v176 <int>, v177 <int>, v178 <int>, v179 <int>, v180 <int>

SLIDE 6

Opinions and attitudes in the 1960s

"Catholics should boycott indecent movies."
"In the past 25 years, this country has moved dangerously close to socialism."
"I would rather be called an idealist than a practical person."

SLIDE 7

Tidy your data

> sisters67 %>%
+     select(-sister) %>%
+     gather(key, value, -age)
# A tibble: 5,012,280 x 3
     age key   value
   <dbl> <chr> <int>
 1  60.0 v116      1
 2  70.0 v116      2
 3  60.0 v116      1
 4  60.0 v116      5
 5  50.0 v116      2
 6  40.0 v116      4
 7  50.0 v116      5
 8  40.0 v116      5
 9  30.0 v116      2
10  30.0 v116      4
# ... with 5,012,270 more rows
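
The reshaping step can be tried on a small made-up table. This is a minimal sketch, assuming the dplyr and tidyr packages are installed; `toy` and its values are invented for illustration and are not the survey data. It also shows `pivot_longer()`, which tidyr now recommends over `gather()`. The row count follows from the shape: dropping `sister` and keeping `age` as an identifier leaves 65 question columns, and 77,112 x 65 = 5,012,280.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for sisters67: same column layout, made-up values
toy <- tibble(
  age    = c(60, 70, 30),
  sister = 1:3,
  v116   = c(1L, 2L, 4L),
  v117   = c(1L, 2L, 1L)
)

# Same pipeline as on the slide: one row per (respondent, question) pair
tidy_toy <- toy %>%
  select(-sister) %>%
  gather(key, value, -age)

# Equivalent reshape with pivot_longer(); the row order differs, the content does not
tidy_toy_longer <- toy %>%
  select(-sister) %>%
  pivot_longer(-age, names_to = "key", values_to = "value")

nrow(tidy_toy)  # 3 respondents x 2 questions = 6 rows
```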

SLIDE 8

Let's practice!

SUPERVISED LEARNING IN R: CASE STUDIES

SLIDE 9

Exploratory data analysis with tidy data

SUPERVISED LEARNING IN R: CASE STUDIES

Julia Silge

Data Scientist at Stack Overflow

SLIDE 10

Counting agreement

> tidy_sisters %>%
+     count(value)
# A tibble: 5 x 2
  value       n
  <int>   <int>
1     1 1303555
2     2  844311
3     3  645401
4     4 1108859
5     5 1110154
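
The counts are easier to compare as shares of all responses. A small base R sketch, using only the five counts printed above; the variable names `counts` and `shares` are made up for illustration.

```r
# Response counts copied from the count(value) output above
counts <- c(`1` = 1303555, `2` = 844311, `3` = 645401,
            `4` = 1108859, `5` = 1110154)

# Total number of responses
total <- sum(counts)  # 5012280, the row count of the tidy data

# Share of each response code among all responses
shares <- round(counts / total, 3)
shares
```

The total matches the number of rows in the tidy data frame, which is a quick sanity check that no responses were dropped during reshaping.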

SLIDE 11

Overall agreement with age

> tidy_sisters %>%
+     group_by(age) %>%
+     summarise(value = mean(value))
# A tibble: 9 x 2
    age value
  <dbl> <dbl>
1  20.0  2.86
2  30.0  2.81
3  40.0  2.83
4  50.0  2.94
5  60.0  3.10
6  70.0  3.26
7  80.0  3.42
8  90.0  3.51
9 100    3.60

SLIDE 12

Agreement on questions by age

tidy_sisters %>%
    filter(key %in% paste0("v", 153:170)) %>%
    group_by(key, value) %>%
    summarise(age = mean(age)) %>%
    ggplot(aes(value, age, color = key)) +
    geom_line(alpha = 0.5, size = 1.5) +
    geom_point(size = 2) +
    facet_wrap(~key)

SLIDE 15

Freedom of speech

"People who don't believe in God have as much right to freedom of speech as anyone else."

SLIDE 17

Conservatism and heritage

"I like conservatism because it represents a stand to preserve our glorious heritage."

SLIDE 19

Vietnam War

"Catholics as a group should consider active opposition to US participation in Vietnam."

SLIDE 22

Let's explore!

SUPERVISED LEARNING IN R: CASE STUDIES

SLIDE 23

Predicting age with supervised machine learning

SUPERVISED LEARNING IN R: CASE STUDIES

Julia Silge

Data Scientist at Stack Overflow

SLIDE 24

Build models

"rpart"
"xgbLinear"
"gbm"

SLIDE 25

Choosing between multiple models

library(caret)

## CART
sisters_cart <- train(age ~ ., method = "rpart", data = training)

## xgboost
sisters_xg <- train(age ~ ., method = "xgbLinear", data = training)

## gbm
sisters_gbm <- train(age ~ ., method = "gbm", data = training)

SLIDE 27

Why three data partitions?

Don't overestimate how well your model is performing!

SLIDE 28

Why three data partitions?

> validation %>%
+     mutate(prediction = predict(sisters_xg, validation)) %>%
+     rmse(truth = age, estimate = prediction)
[1] 13.27101

> testing %>%
+     mutate(prediction = predict(sisters_xg, testing)) %>%
+     rmse(truth = age, estimate = prediction)
[1] 13.36945
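
The slides do not show how the three partitions were created. One way to build them, sketched here in base R on made-up data: the 60/20/20 proportions, the seed, and the `toy` data frame are all assumptions for illustration, not the course's actual split.

```r
set.seed(1234)

# Made-up data in the shape of the survey: an age and one response column
n <- 1000
toy <- data.frame(
  age  = sample(seq(20, 100, by = 10), n, replace = TRUE),
  v116 = sample(1:5, n, replace = TRUE)
)

# Shuffle the row indices once, then cut them into three disjoint pieces
shuffled <- sample(n)
n_train  <- floor(0.6 * n)  # assumed 60% for training
n_valid  <- floor(0.2 * n)  # assumed 20% for validation, the rest for testing

training   <- toy[shuffled[seq_len(n_train)], ]
validation <- toy[shuffled[(n_train + 1):(n_train + n_valid)], ]
testing    <- toy[shuffled[(n_train + n_valid + 1):n], ]

c(nrow(training), nrow(validation), nrow(testing))
```

Because every row lands in exactly one partition, the testing rows play no part in fitting or in model selection, so an RMSE computed on them is not flattered by either step.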

SLIDE 29

Let's practice!

SUPERVISED LEARNING IN R: CASE STUDIES

SLIDE 30

You made it!

SUPERVISED LEARNING IN R: CASE STUDIES

Julia Silge

Data Scientist at Stack Overflow

SLIDE 31

Predicting age

> metrics(model_results, truth = age, estimate = CART)
# A tibble: 1 x 2
   rmse   rsq
  <dbl> <dbl>
1  14.8 0.170

> metrics(model_results, truth = age, estimate = XBG)
# A tibble: 1 x 2
   rmse   rsq
  <dbl> <dbl>
1  13.3 0.338

> metrics(model_results, truth = age, estimate = GBM)
# A tibble: 1 x 2
   rmse   rsq
  <dbl> <dbl>
1  12.8 0.382
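
For reference, the two metrics can be computed by hand. A minimal base R sketch on made-up numbers, assuming `rsq` is the squared correlation between truth and estimate (which is how the yardstick package documents it); `rmse_manual` and `rsq_manual` are hypothetical helper names, not yardstick functions.

```r
# Root mean squared error: typical size of a prediction error, in years of age
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# R squared, taken here as the squared correlation of truth and estimate
rsq_manual <- function(truth, estimate) {
  cor(truth, estimate)^2
}

# Made-up ages and predictions, only to exercise the functions
truth    <- c(20, 40, 60, 80)
estimate <- c(30, 40, 50, 70)

rmse_manual(truth, estimate)  # sqrt(mean(c(100, 0, 100, 100))) = sqrt(75)
rsq_manual(truth, estimate)
```

A lower RMSE and a higher R squared both point the same way in the output above: GBM does best of the three on these data.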

SLIDE 32

Predicting age

Build your model with your training data
Choose your model with your validation data
Evaluate your model with your testing data

SLIDE 33

Diverse data, powerful tools

Fuel efficiency of cars
Developers working remotely in the Stack Overflow survey
Voter turnout in 2016
Catholic nuns' ages based on beliefs and attitudes

SLIDE 35

Practical machine learning

Dealing with class imbalance
Improving performance with resampling (bootstrap, cross-validation)
Hyperparameter tuning?
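
As a reminder of what cross-validation does, here is a hand-rolled sketch in base R. It is not the course's code: the simulated data, the linear model, and the choice of 5 folds are all assumptions for illustration.

```r
set.seed(42)

# Simulated data: y depends linearly on x plus noise
n <- 200
toy <- data.frame(x = runif(n))
toy$y <- 3 * toy$x + rnorm(n, sd = 0.5)

# Assign every row to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, measure RMSE on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  sqrt(mean((toy$y[fold == i] - pred)^2))
})

# Average the held-out errors for a more stable performance estimate
mean(fold_rmse)
```

Each row is held out exactly once, so the averaged error uses all of the data for evaluation without ever scoring a model on rows it was fit to.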

SLIDE 36

Practical machine learning

Try out multiple modeling approaches for each new problem
Overall, gradient tree boosting and random forest perform well

SLIDE 37

Never skip exploratory data analysis

SLIDE 38

Go train some models!

SUPERVISED LEARNING IN R: CASE STUDIES