DataCamp Supervised Learning in R: Case Studies
Surveying Catholic sisters in 1967
Julia Silge, Data Scientist at Stack Overflow
Conference of Major Superiors of Women Sisters' Survey
- Fielded in 1967 with over 600 questions
- Responses from over 130,000 sisters in almost 400 congregations
- Data is freely available
Opinions and attitudes in the 1960s
Response                      Code
Disagree very much            1
Disagree somewhat             2
Neither agree nor disagree    3
Agree somewhat                4
Agree very much               5
Check out the survey's codebook.
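The coding above maps naturally onto an ordered factor. The following is a minimal sketch in base R, not something shown in the slides: `likert_levels`, `label_response`, and the example codes are hypothetical names and values introduced only for illustration.

```r
# Labels for the 1-5 response codes, copied from the table above.
likert_levels <- c("Disagree very much", "Disagree somewhat",
                   "Neither agree nor disagree", "Agree somewhat",
                   "Agree very much")

# Hypothetical helper: turn integer codes into an ordered factor.
label_response <- function(code) {
  factor(code, levels = 1:5, labels = likert_levels, ordered = TRUE)
}

responses <- c(1L, 5L, 3L, 4L)  # made-up codes, not survey data
labelled <- label_response(responses)

print(labelled)
print(table(labelled))
```

Keeping the factor ordered means comparisons such as `labelled[1] < labelled[2]` respect the direction of the scale, while the integer codes remain convenient for averaging.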
Opinions and attitudes in the 1960s
> sisters67
# A tibble: 77,112 x 67
     age sister  v116  v117  v118  v119  v120  v121  v122  v123  v124  v125
   <dbl>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1  60.0      1     1     1     3     5     1     1     3     5     3     1
 2  70.0      2     2     2     4     4     1     3     1     5     4     1
 3  60.0      3     1     1     3     2     2     3     1     1     3     1
 4  60.0      4     5     1     2     4     1     3     4     3     3     4
 5  50.0      5     2     3     3     3     2     2     1     5     2     5
 6  40.0      7     4     3     2     5     4     3     1     5     2     5
 7  50.0      9     5     4     5     4     4     5     3     5     4     2
 8  40.0     10     5     4     3     5     1     3     5     5     5     4
 9  30.0     11     2     2     3     5     1     3     3     5     1     3
10  30.0     12     4     1     5     5     1     4     3     5     1     5
# ... with 77,102 more rows, and 55 more variables: v126 <int>, v127 <int>,
#   v128 <int>, v129 <int>, v130 <int>, v131 <int>, v132 <int>, v133 <int>,
#   v134 <int>, v135 <int>, v136 <int>, v137 <int>, v138 <int>, v139 <int>,
#   v140 <int>, v141 <int>, v142 <int>, v143 <int>, v144 <int>, v145 <int>,
#   v146 <int>, v147 <int>, v148 <int>, v149 <int>, v150 <int>, v151 <int>,
#   v152 <int>, v153 <int>, v154 <int>, v155 <int>, v156 <int>, v157 <int>,
#   v158 <int>, v159 <int>, v160 <int>, v161 <int>, v162 <int>, v163 <int>,
#   v164 <int>, v165 <int>, v166 <int>, v167 <int>, v168 <int>, v169 <int>,
#   v170 <int>, v171 <int>, v172 <int>, v173 <int>, v174 <int>, v175 <int>,
#   v176 <int>, v177 <int>, v178 <int>, v179 <int>, v180 <int>
Opinions and attitudes in the 1960s
- "Catholics should boycott indecent movies."
- "In the past 25 years, this country has moved dangerously close to socialism."
- "I would rather be called an idealist than a practical person."
Tidy your data
> sisters67 %>%
+     select(-sister) %>%
+     gather(key, value, -age)
# A tibble: 5,012,280 x 3
     age key   value
   <dbl> <chr> <int>
 1  60.0 v116      1
 2  70.0 v116      2
 3  60.0 v116      1
 4  60.0 v116      5
 5  50.0 v116      2
 6  40.0 v116      4
 7  50.0 v116      5
 8  40.0 v116      5
 9  30.0 v116      2
10  30.0 v116      4
# ... with 5,012,270 more rows
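The reshaping step can be tried on a small table before running it on the full survey. This is a sketch under stated assumptions: `toy_sisters` and its values are made up, and `pivot_longer()` is swapped in for `gather()` because tidyr documents `gather()` as superseded. The slide itself uses `gather()`.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for sisters67 (made-up values, only two question columns).
toy_sisters <- tibble(
  age    = c(60, 70, 30),
  sister = 1:3,
  v116   = c(1L, 2L, 4L),
  v117   = c(1L, 2L, 1L)
)

# Same reshaping as gather(key, value, -age), written with pivot_longer().
tidy_toy <- toy_sisters %>%
  select(-sister) %>%
  pivot_longer(-age, names_to = "key", values_to = "value")

print(tidy_toy)
```

Three respondents times two question columns gives six rows, just as 77,112 respondents times 65 question columns gives the 5,012,280 rows in the slide output. One difference worth knowing: `pivot_longer()` emits rows respondent by respondent, whereas `gather()` stacks one question column after another, so the row order differs even though the content is the same.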
Let's practice!
Exploratory data analysis with tidy data
Counting agreement
> tidy_sisters %>%
+     count(value)
# A tibble: 5 x 2
  value       n
  <int>   <int>
1     1 1303555
2     2  844311
3     3  645401
4     4 1108859
5     5 1110154
Overall agreement with age
> tidy_sisters %>%
+     group_by(age) %>%
+     summarise(value = mean(value))
# A tibble: 9 x 2
    age value
  <dbl> <dbl>
1  20.0  2.86
2  30.0  2.81
3  40.0  2.83
4  50.0  2.94
5  60.0  3.10
6  70.0  3.26
7  80.0  3.42
8  90.0  3.51
9 100    3.60
Agreement on questions by age
tidy_sisters %>%
    filter(key %in% paste0("v", 153:170)) %>%
    group_by(key, value) %>%
    summarise(age = mean(age)) %>%
    ggplot(aes(value, age, color = key)) +
    geom_line(alpha = 0.5, size = 1.5) +
    geom_point(size = 2) +
    facet_wrap(~key)
Freedom of speech
"People who don't believe in God have as much right to freedom of speech as anyone else."
Conservatism and heritage
"I like conservatism because it represents a stand to preserve our glorious heritage."
Vietnam War
"Catholics as a group should consider active opposition to US participation in Vietnam."
Let's explore!
Predicting age with supervised machine learning
Build models
- "rpart"
- "xgbLinear"
- "gbm"
Choosing between multiple models
## CART
sisters_cart <- train(age ~ ., method = "rpart", data = training)
## xgboost
sisters_xg <- train(age ~ ., method = "xgbLinear", data = training)
## gbm
sisters_gbm <- train(age ~ ., method = "gbm", data = training)
Why three data partitions?
Don't overestimate how well your model is performing!
Why three data partitions?
> validation %>%
+     mutate(prediction = predict(sisters_xg, validation)) %>%
+     rmse(truth = age, estimate = prediction)
[1] 13.27101
> testing %>%
+     mutate(prediction = predict(sisters_xg, testing)) %>%
+     rmse(truth = age, estimate = prediction)
[1] 13.36945
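The slides use `training`, `validation`, and `testing` without showing how they were created. One possible way to get three non-overlapping partitions is sketched below in base R. The 60/20/20 proportions, the simulated `toy` data frame, and the seed are all assumptions for illustration; the split used in the course may differ.

```r
set.seed(1234)

# Toy stand-in for the survey data (simulated, not real responses).
n <- 1000
toy <- data.frame(
  age  = sample(seq(20, 100, by = 10), n, replace = TRUE),
  v116 = sample(1:5, n, replace = TRUE)
)

# Assumed 60/20/20 proportions; the slides do not state the ones used.
shuffled <- sample(n)
n_train  <- floor(0.6 * n)
n_valid  <- floor(0.2 * n)

training   <- toy[shuffled[1:n_train], ]
validation <- toy[shuffled[(n_train + 1):(n_train + n_valid)], ]
testing    <- toy[shuffled[(n_train + n_valid + 1):n], ]

print(c(training = nrow(training),
        validation = nrow(validation),
        testing = nrow(testing)))
```

Because every row index appears in exactly one partition, an error estimate from `testing` is not influenced by any decision made while looking at `validation`.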
Let's practice!
You made it!
Predicting age
> metrics(model_results, truth = age, estimate = CART)
# A tibble: 1 x 2
   rmse   rsq
  <dbl> <dbl>
1  14.8 0.170
> metrics(model_results, truth = age, estimate = XGB)
# A tibble: 1 x 2
   rmse   rsq
  <dbl> <dbl>
1  13.3 0.338
> metrics(model_results, truth = age, estimate = GBM)
# A tibble: 1 x 2
   rmse   rsq
  <dbl> <dbl>
1  12.8 0.382
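To make the two columns in the output above concrete, here is a small base R sketch of how such numbers can be computed by hand. `rmse_manual` and `rsq_manual` are hypothetical helper names and the ages are made up. The R-squared here is the squared correlation between truth and estimate, which to my understanding matches how yardstick defines `rsq()`; treat that as an assumption rather than a statement about the course code.

```r
# Made-up ages and predictions, purely for illustration.
truth    <- c(30, 40, 50, 60, 70)
estimate <- c(35, 45, 50, 55, 60)

# Root mean squared error: typical size of a prediction error, in years.
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# R-squared as the squared correlation between truth and estimate.
rsq_manual <- function(truth, estimate) {
  cor(truth, estimate)^2
}

print(rmse_manual(truth, estimate))  # equals sqrt(35)
print(rsq_manual(truth, estimate))
```

Read this way, an RMSE of 12.8 means the GBM predictions are typically off by roughly 13 years, which is a useful sanity check given that age is recorded in ten-year bins.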
Predicting age
- Build your model with your training data
- Choose your model with your validation data
- Evaluate your model with your testing data
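The three steps above can be sketched end to end in base R. Everything here is an assumption made for illustration: the data are simulated, the two candidate models are plain `lm()` fits rather than the CART, xgboost, and gbm models from the course, and the 360/120/120 split is arbitrary.

```r
set.seed(42)

# Simulated data: one predictor with a real relationship to age.
n <- 600
x <- runif(n, 1, 5)
toy <- data.frame(x = x, age = 20 + 10 * x + rnorm(n, sd = 8))

# Randomly assign each row to one of three partitions.
idx <- sample(rep(c("training", "validation", "testing"),
                  times = c(360, 120, 120)))
parts <- split(toy, idx)

rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# 1. Build candidate models with the training data.
models <- list(
  intercept_only = lm(age ~ 1, data = parts$training),
  linear         = lm(age ~ x, data = parts$training)
)

# 2. Choose the model with the lowest RMSE on the validation data.
validation_rmse <- sapply(models, function(m) {
  rmse(parts$validation$age, predict(m, parts$validation))
})
best <- names(which.min(validation_rmse))

# 3. Evaluate only the chosen model, once, on the testing data.
testing_rmse <- rmse(parts$testing$age,
                     predict(models[[best]], parts$testing))

print(validation_rmse)
print(best)
print(testing_rmse)
```

The testing partition is touched exactly once, after the choice is made, so `testing_rmse` is not biased by the selection step.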
Diverse data, powerful tools
- Fuel efficiency of cars
- Developers working remotely in the Stack Overflow survey
- Voter turnout in 2016
- Catholic nuns' ages based on beliefs and attitudes
Practical machine learning
- Dealing with class imbalance
- Improving performance with resampling (bootstrap, cross-validation)
- Hyperparameter tuning?
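As a reminder of what the resampling options look like in caret, here is a minimal sketch. It assumes the caret package is installed and uses the built-in `mtcars` data with a plain linear model as a stand-in; the dataset, the formula, and the resample counts are illustrative choices, not the ones used in the course.

```r
library(caret)

set.seed(1234)

# 5-fold cross-validation (assumed number of folds).
cv_control <- trainControl(method = "cv", number = 5)
cv_fit <- train(mpg ~ wt + hp, data = mtcars,
                method = "lm", trControl = cv_control)

# Bootstrap resampling with 25 resamples (assumed count).
boot_control <- trainControl(method = "boot", number = 25)
boot_fit <- train(mpg ~ wt + hp, data = mtcars,
                  method = "lm", trControl = boot_control)

# Resampled performance estimates for each approach.
print(cv_fit$results[, c("RMSE", "Rsquared")])
print(boot_fit$results[, c("RMSE", "Rsquared")])
```

Both fits report performance averaged over held-out resamples rather than over the data the model was fitted on, which is why resampling gives a more honest estimate than a single in-sample fit.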
Practical machine learning
- Try out multiple modeling approaches for each new problem
- Overall, gradient tree boosting and random forest perform well
Never skip exploratory data analysis
Go train some models!