The (Random) Forest for the (Decision) Trees



  1. The (Random) Forest for the (Decision) Trees • William Warfel, Office of Institutional Research, Washington State University • RMAIR 2016 • Admission and enrollment predictive tool

  2. Overview • Goal: To assist administrative planning – Decision Tree – Random Forest – Data Specification and Needs – R-Studio – Random Forest Model – Prediction Results – Cautions – Next Steps • This model is used to predict new first-time freshmen enrollment. Other uses can include projection of graduation, retention, yield, etc.

  3. Research Proposal: The Goal. • To accurately predict Freshmen enrollment within 2.0%. • Then, efficiently utilize Admissions efforts to capture the most successful students and retain them. [Chart: First Time Freshmen by year, 2013-14 through 2017-18; series: Applied, Admitted, Confirmed, Enrolled; vertical axis 0 to 25,000.]

  4. Admission Analysis: The 6 Steps. • Stage 1: WSU Prospect • Stage 2: Inquiry • Stage 3: Application • Stage 4: Admission Offer • Stage 5: Confirmation • Stage 6: Enroll

  5. What is a Decision Tree? • A tree-like graphical model that maps possible decisions to their possible outcomes. – Outcomes include chance, utility, cost, enrollment, etc. • Most commonly used in decision analysis to identify the strategy most likely to reach an end goal. – Think elections! • Difficulty: – Imperfect information. – Changes to underlying data, e.g., cost of attendance or changes to admission criteria. • Solution: – Conditional probability. – Work backwards!
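
The conditional-probability idea can be illustrated with the Fall 2015 counts quoted on slides 21, 23 and 25 of this deck (18,428 applied, 14,935 admitted, 5,062 confirmed, 4,220 enrolled). This is a minimal sketch in Python rather than R. It assumes each stage is a subset of the one before it, which the slides do not state, and `stage_rates` is a hypothetical helper name.

```python
# Fall 2015 funnel counts as quoted on slides 21, 23 and 25 of this deck.
funnel = [
    ("Applied", 18428),
    ("Admitted", 14935),
    ("Confirmed", 5062),
    ("Enrolled", 4220),
]

def stage_rates(stages):
    """Return P(next stage | current stage) for each consecutive pair.

    Assumes every stage is a subset of the previous one (an assumption,
    not something the slides confirm).
    """
    return [
        (f"{nxt} | {cur}", n_next / n_cur)
        for (cur, n_cur), (nxt, n_next) in zip(stages, stages[1:])
    ]

rates = stage_rates(funnel)

# Chaining the conditional probabilities gives P(Enrolled | Applied).
p_enroll_given_applied = 1.0
for _, p in rates:
    p_enroll_given_applied *= p

for label, p in rates:
    print(f"P({label}) = {p:.3f}")
print(f"P(Enrolled | Applied) = {p_enroll_given_applied:.3f}")
```

Working backwards means dividing instead of multiplying: an enrollment target divided by the chained rate gives the size of the applicant pool it would take to reach that target.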

  6. Decision Tree Examples!

  7. Decision Tree Examples!

  8. The Decision Tree [Diagram: a tree whose node labels, as extracted, are WSU, Apply, Admission Offer, Alive, Accept Offer, Reject Offer, Housing, and Nothing, with leaves Enroll and Not Enroll.]

  9. What is a Random Forest? • A predictive method for classification and regression that constructs multiple decision trees. • In essence, combining multiple decision trees, each grown on a bootstrap sample of the data. • Random forests correct for a decision tree's habit of overfitting to the training set. • Creates more trees – Tree Bagging (true technical term). – Bagging: to average noisy but approximately unbiased models to create a model with low variance • Think Amazon.com!
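
The bagging idea above can be sketched in a few lines. This is an illustrative Python example, not the presenter's R model. The toy records and the one-split "stump" learner are assumptions made for brevity, and a real random forest also samples a random subset of predictors at each split, which this sketch omits.

```python
import random
from collections import Counter

# Hypothetical toy records: ((housing, confirmed), enrolled), all coded 0/1.
DATA = [
    ((1, 1), 1), ((1, 1), 1), ((0, 1), 1), ((1, 1), 1),
    ((1, 0), 0), ((0, 0), 0), ((0, 0), 0), ((1, 0), 0),
    ((0, 1), 0), ((1, 0), 1),
]

def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Fit a one-split tree: pick the feature (optionally flipped)
    that classifies the most records in `rows` correctly."""
    best = None
    for j in range(len(rows[0][0])):
        for flip in (0, 1):
            correct = sum(1 for x, y in rows if (x[j] ^ flip) == y)
            if best is None or correct > best[0]:
                best = (correct, j, flip)
    _, j, flip = best
    return lambda x: x[j] ^ flip

def bagged_trees(rows, n_trees, rng):
    """Grow one stump per bootstrap sample."""
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]

def predict(trees, x):
    """Majority vote across all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

rng = random.Random(0)          # fixed seed so the run is repeatable
forest = bagged_trees(DATA, n_trees=25, rng=rng)
print(predict(forest, (1, 1)))  # vote for a student with housing and a confirmation
```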

  10. Random Forest: Amazon! Amazon Screen Shots (Me) Amazon Screen Shots (Becky)

  11. The Random Forest. [Diagram: many overlapping copies of the decision tree from slide 8, forming a forest.]

  12. Model: Needed Data The ENTIRE KITCHEN SINK! • Bring in all the data believed to predict enrollment. – Ethnicity – Sex – Housing – Freshmen / Transfer Orientation • Completed and Future Attendees – Financial Aid (FAFSA & Fin Aid Interest) • Scholarships awarded – Confirmations – Admission communication

  13. Model: Data Cautions Look out for: • Institutional actions that create inconsistencies within the data • Administrative or Legislative changes! – An admission waitlist is imposed – Changes to the admission criteria – Tuition decreases! • And at greater rates at other in-state universities. • Students that cancel housing contracts but remain active applicants.

  14. Why R-Studio? • Free-ware • Add-on packages allow for specialized statistical techniques – Random Forest – Bootstrapping – LOTS and LOTS more! • University-wide cross-collaboration • Forums-on-Forums-on-Forums for help!

  15. Working with Data in R Studio

  16. Working with Data in R Studio • Import file using CSV format • Must specify headers • When using variables, everything is case sensitive! • Partition a data set • Warning messages will drive you mad!
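
The import points above (CSV input, a header row, case-sensitive names) apply in most data tools. Here is a minimal Python sketch using only the standard library. The deck itself works in R, and the column names and values below are hypothetical.

```python
import csv
import io

# Hypothetical extract; a real file would be opened with open(path, newline="").
raw = io.StringIO(
    "StudentID,Housing,Confirmed,Enrolled\n"
    "1001,Y,Y,Y\n"
    "1002,N,Y,N\n"
    "1003,Y,N,N\n"
)

# DictReader treats the first line as the header row.
rows = list(csv.DictReader(raw))

print(rows[0]["Housing"])    # column names are case sensitive
print("housing" in rows[0])  # a lower-case spelling is a different name
```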

  17. The Random Forest Model • Three Different Data Sets – Train: Fall 2014 Admits – Test: Fall 2015 Admits – Project: Fall 2016 Admits • All data is after the 6th orientation session (late June/early July) • Different Random Forest Models – All Applicants – All Admitted – All Confirmed
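
Splitting by cohort, rather than at random, might look like the sketch below. It is written in Python for illustration. The `AdmitTerm` column name and the sample records are assumptions, not fields taken from the WSU data.

```python
# Hypothetical records; "AdmitTerm" is an assumed column name.
records = [
    {"StudentID": "1", "AdmitTerm": "Fall 2014", "Enrolled": "Y"},
    {"StudentID": "2", "AdmitTerm": "Fall 2014", "Enrolled": "N"},
    {"StudentID": "3", "AdmitTerm": "Fall 2015", "Enrolled": "Y"},
    {"StudentID": "4", "AdmitTerm": "Fall 2016", "Enrolled": ""},
    {"StudentID": "5", "AdmitTerm": "Fall 2013", "Enrolled": "Y"},
]

# Cohort roles as listed on the slide.
ROLE_BY_TERM = {
    "Fall 2014": "train",
    "Fall 2015": "test",
    "Fall 2016": "project",
}

def partition_by_term(rows, role_by_term):
    """Group rows into train/test/project lists by admit term."""
    parts = {role: [] for role in role_by_term.values()}
    for row in rows:
        role = role_by_term.get(row["AdmitTerm"])
        if role is not None:  # ignore terms outside the three cohorts
            parts[role].append(row)
    return parts

parts = partition_by_term(records, ROLE_BY_TERM)
print({role: len(rows) for role, rows in parts.items()})
```

Training on one cohort and testing on the next mirrors how the model is actually used: past years predict a future year.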

  18. Random Forest: Output • Example: Enrollment as a function of housing, confirmations, admitted, Pell eligibility, etc. • Number of trees: 500 is the default in R's randomForest package • No. of variables tried at each split: the size of the random subset of predictors considered at each split; the algorithm sets a default, and it can be tuned against the training data set.
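
For reference, the default "variables tried at each split" can be computed by hand. The sketch below assumes the deck used R's randomForest package, whose printed output matches the items on this slide. In that package the default is the floor of the square root of the number of predictors for classification, and one third of the predictors (at least 1) for regression. `default_mtry` is a hypothetical Python helper that reproduces the rule.

```python
import math

DEFAULT_NTREE = 500  # default number of trees in R's randomForest

def default_mtry(n_predictors, classification=True):
    """Default number of predictors sampled at each split in randomForest."""
    if classification:
        return math.floor(math.sqrt(n_predictors))
    return max(n_predictors // 3, 1)

# Example: a classification model (Enroll / Not Enroll) with 9 predictors.
print(default_mtry(9))
```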

  19. Random Forest: The out-of-bag (OOB) error rate & Confusion Matrix • The OOB error (or estimate) is the error rate of the random forest predictor, measured for each record using only the trees whose bootstrap sample did not include that record • A method to measure the Random Forest error of prediction without a separate test set • The OOB confusion matrix is obtained from the RF predictor's OOB predictions • Consists of true positive, true negative, false positive, and false negative counts
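
Two pieces of this slide are easy to reproduce by hand: how a confusion matrix turns into an error rate, and why out-of-bag records exist at all. The Python sketch below uses made-up labels. The function names are hypothetical and it is not the randomForest implementation.

```python
import math

def confusion_matrix(actual, predicted, positive="Enroll"):
    """Count true/false positives and negatives for a two-class outcome."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

def error_rate(cm):
    """Share of records that were misclassified."""
    return (cm["FP"] + cm["FN"]) / sum(cm.values())

def oob_fraction(n):
    """Chance that a given record is left out of one bootstrap sample of size n."""
    return (1 - 1 / n) ** n

# Hypothetical outcomes for eight students.
actual    = ["Enroll", "Enroll", "Enroll", "Not", "Not", "Not", "Not", "Enroll"]
predicted = ["Enroll", "Enroll", "Not", "Not", "Not", "Enroll", "Not", "Enroll"]

cm = confusion_matrix(actual, predicted)
print(cm, error_rate(cm))
print(round(oob_fraction(1000), 3), round(math.exp(-1), 3))
```

Because roughly 37% of records are left out of each tree's bootstrap sample, every record has some trees that never saw it, and those trees supply its OOB prediction.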

  20. Random Forest: Variable Importance Plot • Provides the mean decrease in accuracy for each variable, i.e., how much prediction accuracy drops when that variable's values are randomly permuted. • In the example provided, Attendance Code at orientation, along with housing and confirmation status, are the most important variables in prediction.
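
"Mean decrease in accuracy" is a permutation measure: shuffle one variable and see how much accuracy is lost. The Python sketch below shows the idea on a single hypothetical model and data set. The randomForest package computes it per tree on out-of-bag records and averages, which this simplified version does not do.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, rng):
    """Drop in accuracy after shuffling one feature's values across rows."""
    baseline = accuracy(model, rows, labels)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return baseline - accuracy(model, permuted, labels)

# Hypothetical data: enrollment follows "confirmed"; "noise" is irrelevant.
rows = [{"confirmed": i % 2, "noise": (i // 2) % 2} for i in range(20)]
labels = [r["confirmed"] for r in rows]

def model(row):
    """Stand-in for a fitted model: predicts enrollment from confirmation."""
    return row["confirmed"]

rng = random.Random(0)
imp_confirmed = permutation_importance(model, rows, labels, "confirmed", rng)
imp_noise = permutation_importance(model, rows, labels, "noise", rng)
print(imp_confirmed, imp_noise)
```

A variable the model never uses, like `noise` here, scores exactly zero, because shuffling it cannot change any prediction.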

  21. Moment of Truth: The Projection Prediction – All Applicants • This Random Forest model predicts that 28.0% of all applicants will enroll Fall 2016: 4225/15090 = 28.0% • Compared to Fall 2015: 4201/(13247+4201) = 24.08% • How accurate: – Fall 2015: 18428*24.08% = 4437 vs Actual 4220 – Fall 2016: 21302*28.0% = 5965 vs Actual 3991

  22. Moment of Truth: The Projection Prediction – All Applicants

  23. Moment of Truth: The Projection Prediction – Admitted • This Random Forest model predicts that 25.5% of admitted students will enroll Fall 2016: 4211/(12327+4211) = 25.5% • Compared to Fall 2015: 4167/(11440+4167) = 26.7% • How accurate: – Fall 2015: 14935*26.7% = 3988 vs Actual 4220 – Fall 2016: 15572*25.5% = 3971 vs Actual 3991

  24. Moment of Truth: The Projection Prediction – Admitted

  25. Moment of Truth: The Projection Prediction – Confirmed • This Random Forest model predicts that 81.2% of confirmed students will enroll Fall 2016: 4200/(4200+972) = 81.2% • Compared to Fall 2015: 4142/(4142+1251) = 76.8% • How accurate: – Fall 2015: 5062*76.8% = 3888 vs Actual 4220 – Fall 2016: 4643*81.2% = 3770 vs Actual 3991
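
The arithmetic behind the Admitted and Confirmed projections can be replayed and compared with the 2.0% goal from slide 3. The Python sketch below uses only counts quoted on slides 23 and 25. The function names are hypothetical, and rounding the rate to one decimal place of a percent before multiplying is an assumption inferred from how the slides present the figures.

```python
def enroll_rate(pred_enroll, pred_not_enroll):
    """Share of a pool that the model classifies as enrolling."""
    return pred_enroll / (pred_enroll + pred_not_enroll)

def projected_enrollment(pool_size, rate):
    """Apply a rate, rounded to 0.1% as on the slides, to a pool."""
    return round(pool_size * round(rate, 3))

def pct_error(projected, actual):
    """Signed percentage error of a projection against the actual count."""
    return 100.0 * (projected - actual) / actual

# (model, term): (predicted enroll, predicted not enroll, pool size, actual enrolled)
CASES = {
    ("Admitted", "Fall 2015"): (4167, 11440, 14935, 4220),
    ("Admitted", "Fall 2016"): (4211, 12327, 15572, 3991),
    ("Confirmed", "Fall 2015"): (4142, 1251, 5062, 4220),
    ("Confirmed", "Fall 2016"): (4200, 972, 4643, 3991),
}

results = {}
for key, (yes, no, pool, actual) in CASES.items():
    projected = projected_enrollment(pool, enroll_rate(yes, no))
    results[key] = (projected, pct_error(projected, actual))
    print(key, projected, f"{results[key][1]:+.1f}%")
```

Of these four projections, only the Admitted model for Fall 2016 lands within 2.0% of the actual enrollment (about 0.5% low). The others miss by roughly 5% to 8%.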

  26. Moment of Truth: The Projection Prediction – Confirmed

  27. Cautions • Null values in R will cause problems. – Financial aid variables become problematic: 0 is not the same as null • Dates vs. Events – Snap dates across admission cycles must be consistent for prediction – Or, as I prefer, use data after the same New Student Orientation: I used the 6th orientation session year-over-year in this analysis • Other models can assist in calibrating and ensuring model accuracy – Logistic regression, Markov-Chain models, etc. • R does have a learning curve
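
The "0 is not the same as null" caution is easy to demonstrate. The Python sketch below uses hypothetical aid amounts and a hypothetical `parse_amount` helper. It shows how treating a blank cell as zero changes a simple average.

```python
def parse_amount(text):
    """Convert a CSV cell to a float, keeping blanks as None (missing)."""
    text = text.strip()
    return None if text == "" else float(text)

# Hypothetical aid amounts; "" means not reported, "0" means a true zero.
cells = ["2500", "", "0", "1200"]
amounts = [parse_amount(c) for c in cells]

# Average over reported values only.
known = [a for a in amounts if a is not None]
mean_known = sum(known) / len(known)

# Average if missing values are (wrongly) counted as zero.
mean_zero_filled = sum(a or 0.0 for a in amounts) / len(amounts)

print(amounts)
print(round(mean_known, 2), mean_zero_filled)
```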

  28. Next Steps • Adjust model to focus on sub-populations of the admission pool – Honors students, STEM Majors, URM, etc. • Apply across the institution – WSU Tri-Cities – WSU Vancouver – WSU Spokane – WSU North Puget Sound at Everett – WSU Global • Apply to other areas of student prediction models – degree-time completion.

