The (Random) Forest for the (Decision) Trees
William Warfel
Office of Institutional Research, Washington State University
RMAIR 2016
An admission and enrollment predictive tool
Overview
– Goal: To assist administrative planning
– Decision Tree
– Random Forest
– Data Specification and Needs
– RStudio
– Random Forest Model
– Prediction Results
– Cautions
– Next Steps
This model is used to predict new first-time freshman enrollment. Other uses can include projection of graduation, retention, yield, etc.
Research Proposal: The Goal.
- To accurately predict freshman enrollment to within 2.0%.
- Then, efficiently target Admissions efforts to capture the most successful students and retain them.
First Time Freshmen (reconstructed from chart):

             2013-14  2014-15  2015-16  2016-17
Applied       14026    17411    18428    21302
Admitted      11622    14280    14935    15572
Confirmed      4505     4873     5062     4643
Enrolled       3801     3974     4220     3991
Admission Analysis: The 6 Steps.
STAGE 1: Prospect
STAGE 2: Inquiry
STAGE 3: Application
STAGE 4: WSU Admission Offer
STAGE 5: Confirmation
STAGE 6: Enroll
What is a Decision Tree?
- A tree-like graphical modeling tool that maps possible decisions to their possible outcomes.
– Outcomes include chance, utility, cost, enrollment, etc.
- Most commonly used in decision analysis to identify the most likely strategy or end goal.
– Think elections!
- Difficulty:
– Imperfect information.
– Changes to the underlying data.
- e.g., cost of attendance or changes to admission criteria.
- Solution:
– Conditional probability.
- Work backwards!
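The "work backwards" idea can be sketched in a few lines: represent each chance node as probability-weighted branches and fold the tree from the leaves up. The probabilities below are made up for illustration, not WSU figures.

```python
# Backward induction on a tiny decision tree (hypothetical probabilities):
# a leaf is 1.0 (enroll) or 0.0 (not enroll); a chance node is a list of
# (probability, subtree) branches, evaluated bottom-up.

def expected_enrollment(node):
    """Expected enrollment probability via backward induction."""
    if isinstance(node, (int, float)):
        return float(node)
    return sum(p * expected_enrollment(child) for p, child in node)

# Hypothetical: the student accepts the admission offer with p=0.30;
# of those, 85% sign a housing contract; housed students enroll 90% of
# the time, unhoused 50%. Rejecting the offer means no enrollment.
tree = [
    (0.30, [                      # accepts admission offer
        (0.85, [                  # signs housing contract
            (0.90, 1), (0.10, 0)  # enroll / not enroll
        ]),
        (0.15, [                  # no housing contract
            (0.50, 1), (0.50, 0)
        ]),
    ]),
    (0.70, 0),                    # rejects offer -> not enrolled
]

print(round(expected_enrollment(tree), 4))  # 0.252
```

Working backwards, the housing subtree evaluates to 0.84, and the whole tree to 0.30 × 0.84 = 0.252.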
Decision Tree Examples!
The Decision Tree
[Diagram: decision tree — Apply → WSU Admission Offer → Accept Offer / Reject Offer; accepted students branch on Housing vs. Alive (no housing contract), each ending in Enroll / Not Enroll; Reject Offer → Nothing]
What is a Random Forest?
- An ensemble method for classification and regression that constructs many decision trees and combines their predictions.
- In essence, bootstrapping the data and aggregating (combining) multiple decision trees.
- Random forests correct for decision trees' habit of overfitting to the training set.
- Creates more trees – Tree Bagging (the actual technical term).
– Bagging – averaging many noisy but roughly unbiased models to create a single model with low variance.
- Think Amazon.com!
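Tree bagging can be sketched in a few lines: train each "tree" (here a one-split stump, for brevity) on a bootstrap sample of the data, then predict by majority vote across trees. The data and stump learner below are toy assumptions, not the internals of R's randomForest.

```python
# Toy tree bagging: bootstrap samples + majority vote (made-up data).
import random
from collections import Counter

random.seed(7)

# (housing, confirmed) -> enrolled; housing is a perfect predictor here,
# so the example behaves deterministically in practice.
data = [((1, 1), 1), ((1, 0), 1), ((1, 1), 1), ((1, 0), 1),
        ((0, 1), 0), ((0, 0), 0), ((0, 1), 0), ((0, 0), 0)]

def train_stump(sample):
    """One-split 'tree': pick the feature whose per-value majority label
    best classifies this bootstrap sample."""
    best = None
    for f in (0, 1):
        rule = {v: Counter(y for x, y in sample if x[f] == v).most_common(1)[0][0]
                for v in (0, 1) if any(x[f] == v for x, _ in sample)}
        acc = sum(rule.get(x[f], 0) == y for x, y in sample) / len(sample)
        if best is None or acc > best[0]:
            best = (acc, f, rule)
    _, f, rule = best
    return lambda x: rule.get(x[f], 0)

def bagged_forest(data, n_trees=25):
    """Tree bagging: each stump sees a sample-with-replacement copy of the
    data; the forest predicts by majority vote across stumps."""
    stumps = [train_stump([random.choice(data) for _ in data])
              for _ in range(n_trees)]
    def predict(x):
        return Counter(s(x) for s in stumps).most_common(1)[0][0]
    return predict

forest = bagged_forest(data)
print(forest((1, 1)), forest((0, 0)))
```

Each stump is noisy because it sees a different bootstrap sample; the vote averages that noise away, which is the low-variance payoff of bagging.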
Random Forest: Amazon!
[Screenshots: Amazon recommendation pages for two different users (Me / Becky)]
The Random Forest.
[Diagram: the decision tree from the previous slide repeated nine times — a "forest" of trees]
Model: Needed Data
The ENTIRE KITCHEN SINK!
- Bring in all the data believed to predict enrollment.
– Ethnicity
– Sex
– Housing
– Freshmen / Transfer Orientation
- Completed and future attendees
– Financial Aid (FAFSA & financial aid interest)
- Scholarships awarded
– Confirmations
– Admission communication
Model: Data Cautions
Look out for:
- Institutional actions that create inconsistencies within the data
- Administrative or legislative changes!
– An admission waitlist is imposed
– Changes to the admission criteria
– Tuition decreases!
- And at greater rates at other in-state universities.
- Students who cancel housing contracts but remain active applicants.
Why RStudio?
- Free and open source
- Pre-loaded packages allow for specialized statistical techniques
– Random Forest
– Bootstrapping
– LOTS and LOTS more!
- University-wide cross-collaboration
- Forums-on-forums-on-forums for help!
Working with Data in RStudio
- Import files using CSV format
- Must specify headers
- When using variables, everything is case sensitive!
- Partition a data set
- Warning messages will drive you mad!
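The same steps translate to any language. A minimal stdlib-Python sketch of reading a CSV with headers and randomly partitioning it — the file contents and the 2/3 split here are made-up illustrations, not WSU data:

```python
# Read a CSV with headers, then partition rows into train/test sets.
import csv, io, random

# Made-up file contents for illustration.
raw = io.StringIO(
    "id,housing,confirmed,enrolled\n"
    "1,1,1,1\n2,0,1,0\n3,1,0,1\n4,0,0,0\n5,1,1,1\n6,0,1,1\n"
)
rows = list(csv.DictReader(raw))   # header names become dict keys (case sensitive!)

random.seed(42)                    # reproducible partition
random.shuffle(rows)
cut = int(0.67 * len(rows))        # roughly 2/3 train, 1/3 test
train, test = rows[:cut], rows[cut:]
print(len(train), len(test))       # 4 2
```

In R the equivalent steps are `read.csv(..., header = TRUE)` and indexing with a random `sample()` of row numbers.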
The Random Forest Model
- Three different data sets:
– Train – Fall 2014 admits
– Test – Fall 2015 admits
– Project – Fall 2016 admits
- All data is as of the 6th orientation session (late June/early July)
- Different Random Forest models:
– All Applicants
– All Admitted
– All Confirmed
Random Forest: Output
- Example: enrollment as a function of housing, confirmations, admission status, Pell eligibility, etc.
- Number of trees: 500 is the default in R's randomForest package.
- The number of variables tried at each split is set by the algorithm to find the best fit within the training data set.
Random Forest: The out-of-bag (OOB) error rate & confusion matrix
- The OOB error (or estimate) is the error rate of the random forest predictor on the observations left out of each tree's bootstrap sample
- A method to measure the Random Forest's prediction error
- The OOB confusion matrix is obtained from the RF predictor
- Consists of true positives, true negatives, false positives, and false negatives
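The confusion-matrix bookkeeping is simple arithmetic. A sketch with made-up labels — R's randomForest prints this automatically for the OOB predictions:

```python
# Confusion matrix counts and misclassification rate from predicted vs.
# actual enrollment labels (1 = enrolled, 0 = not enrolled; toy labels).
def confusion(actual, predicted):
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    error = (fp + fn) / len(actual)     # misclassification (error) rate
    return tp, tn, fp, fn, error

actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion(actual, predicted))     # (3, 3, 1, 1, 0.25)
```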
Random Forest: Variable Importance Plot
- Provides the mean decrease in accuracy for each variable.
- In the example provided, attendance code at orientation, along with housing and confirmation status, are the strongest predictors.
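Mean decrease in accuracy is a permutation idea: shuffle one variable's column, re-score the model, and record how far accuracy drops. A toy sketch with a hypothetical one-feature model and made-up data — randomForest computes this out-of-bag for the real model:

```python
# Permutation importance: accuracy drop when one feature column is shuffled.
import random
random.seed(1)

# Toy model: predict enrollment from the housing flag (feature 0) only.
predict = lambda x: x[0]
X = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 1), (0, 0)]
y = [1, 1, 0, 0, 1, 0]

def accuracy(X, y):
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def decrease_in_accuracy(f, n_shuffles=50):
    """Mean drop in accuracy after randomly permuting column f."""
    base = accuracy(X, y)
    drops = []
    for _ in range(n_shuffles):
        col = [x[f] for x in X]
        random.shuffle(col)
        Xp = [x[:f] + (v,) + x[f + 1:] for x, v in zip(X, col)]
        drops.append(base - accuracy(Xp, y))
    return sum(drops) / n_shuffles

# Shuffling housing hurts; shuffling the ignored feature changes nothing.
print(decrease_in_accuracy(0), decrease_in_accuracy(1))
```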
Moment of Truth: The Projection Prediction – All Applicants
This Random Forest model predicts that 28.0% of all applicants will enroll Fall 2016: 4225/15090 = 28.0%.
Compared to Fall 2015:
- 4201/(13247+4201) = 24.08%
How accurate:
- Fall 2015: 18428*24.08% = 4437 vs actual 4220
- Fall 2016: 21302*28.0% = 5965 vs actual 3991
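The projection step itself is just "pool size times predicted enroll rate"; a quick check of the Fall 2015 figures from the slide:

```python
# Projected headcount = applicant pool * model's predicted enroll rate.
def projected_enrollment(pool, rate):
    return round(pool * rate)

rate_2015 = 4201 / (13247 + 4201)              # 24.08% predicted rate
print(projected_enrollment(18428, rate_2015))  # 4437 (actual: 4220)
```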
Moment of Truth: The Projection Prediction – Admitted
This Random Forest model predicts that 25.5% of admitted students will enroll Fall 2016: 4211/(12327+4211) = 25.5%.
Compared to Fall 2015:
- 4167/(11440+4167) = 26.7%
How accurate:
- Fall 2015: 14935*26.7% = 3988 vs actual 4220
- Fall 2016: 15572*25.5% = 3971 vs actual 3991
Moment of Truth: The Projection Prediction – Confirmed
This Random Forest model predicts that 81.2% of confirmed students will enroll Fall 2016: 4200/(4200+972) = 81.2%.
Compared to Fall 2015:
- 4142/(4142+1251) = 76.8%
How accurate:
- Fall 2015: 5062*76.8% = 3888 vs actual 4220
- Fall 2016: 4643*81.2% = 3770 vs actual 3991
Cautions
- Null values in R will cause problems.
– Financial aid variables become problematic.
- 0 is not the same as null
- Dates vs. Events
– Snap dates across admission cycles must be consistent for prediction
– Or, as I prefer, use data after the same New Student Orientation
- I used the 6th orientation session year-over-year in this analysis
- Other models can assist in calibrating and ensuring model accuracy
– Logistic regression, Markov-Chain models, etc.
- R does have a learning curve
Next Steps
- Adjust model to focus on sub-populations of the admission pool
– Honors students, STEM Majors, URM, etc.
- Apply across the institution
– WSU Tri-Cities
– WSU Vancouver
– WSU Spokane
– WSU North Puget Sound at Everett
– WSU Global
- Apply to other areas of student prediction models – e.g., time-to-degree completion.
References
- Headstrom, Ward. "Using a Random Forest Model to Predict Enrollment." Humboldt State University. CAIR 2013.
- Herzog, Serge. "Estimating Student Retention and Degree-Completion Time: Decision Trees and Neural Networks vis-à-vis Regression." New Directions for Institutional Research, no. 131.