The (Random) Forest for the (Decision) Trees - William Warfel - PowerPoint PPT Presentation


SLIDE 1

The (Random) Forest for the (Decision) Trees

Admission and enrollment predictive tool

William Warfel Office of Institutional Research Washington State University RMAIR 2016

SLIDE 2

Overview

Goal: To assist administrative planning

Decision Tree

Random Forest

Data Specification and Needs

R-Studio

Random Forest Model

Prediction Results

Cautions

Next Steps

This model is used to predict new first-time freshmen enrollment. Other uses can include projection of graduation, retention, yield, etc.
SLIDE 3

Research Proposal: The Goal.

  • To accurately predict freshmen enrollment within 2.0%.
  • Then, efficiently direct Admissions efforts to capture the most successful students and retain them.

First Time Freshmen

              2013-14   2014-15   2015-16   2016-17
  Applied      14026     17411     18428     21302
  Admitted     11622     14280     14935     15572
  Confirmed     4505      4873      5062      4643
  Enrolled      3801      3974      4220      3991

SLIDE 4

Admission Analysis: The 6 Steps.

Stage 1: Prospect → Stage 2: Inquiry → Stage 3: Application → Stage 4: WSU Admission Offer → Stage 5: Confirmation → Stage 6: Enroll

SLIDE 5

What is a Decision Tree?

  • A tree-like tool used for graphical modeling that maps possible decisions to their possible outcomes.

– Outcomes include chance, utility, cost, enrollment, etc.

  • Most commonly used in decision analysis to identify the most likely strategy or end goal.

– Think elections!

  • Difficulty:

– Imperfect information.
– Changes to underlying data.

  • E.g., cost of attendance or changes to admission criteria.
  • Solution:

– Conditional probability.

  • Work backwards!
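The "work backwards" idea can be illustrated with the funnel counts from the First Time Freshmen chart (2015-16 shown here): chaining the stage-to-stage conditional probabilities telescopes back to the overall enrollment rate. A minimal Python sketch:

```python
# Funnel counts for 2015-16, taken from the First Time Freshmen chart.
applied, admitted, confirmed, enrolled = 18428, 14935, 5062, 4220

# Work backwards: chain the stage-to-stage conditional probabilities.
p_admit   = admitted / applied     # P(admitted | applied)
p_confirm = confirmed / admitted   # P(confirmed | admitted)
p_enroll  = enrolled / confirmed   # P(enrolled | confirmed)

# The chain telescopes: the product is exactly enrolled / applied.
p_enroll_given_applied = p_admit * p_confirm * p_enroll
print(round(p_enroll_given_applied, 4))
```

Because the product telescopes, updating any one stage's conditional (say, a new confirmation rate) immediately updates the end-to-end enrollment probability.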
SLIDE 6

Decision Tree Examples!

SLIDE 7

Decision Tree Examples!

SLIDE 8

The Decision Tree

[Diagram: decision tree with node labels: Apply, WSU Admission Offer, Accept Offer, Housing (Enroll / Not Enroll), Alive (Enroll / Not Enroll), Reject Offer (Nothing)]

SLIDE 9

What is a Random Forest?

  • An ensemble method that improves predictive performance for classification and regression by constructing multiple decision trees.
  • In essence, bootstrapping (combining) multiple decision trees.
  • Random forests correct for a decision tree's habit of overfitting to the training set.
  • Creates more trees: tree bagging (short for bootstrap aggregating).

– Bagging: averaging noisy but unbiased models to create a model with low variance.

  • Think Amazon.com!
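The bagging idea can be sketched in plain Python without any modeling library: train many one-feature threshold "stumps" on bootstrap resamples of a toy dataset (hypothetical data, not WSU's), then majority-vote their predictions. A minimal sketch, not the deck's R workflow:

```python
import random

random.seed(42)

# Toy data: one feature (e.g., a contact count); label 1 = enrolled.
data = [(x, 1 if x + random.gauss(0, 1.5) > 5 else 0)
        for x in range(11) for _ in range(20)]

def train_stump(sample):
    """Pick the threshold that best separates the bootstrap sample."""
    best_t, best_acc = 0, 0.0
    for t in range(11):
        acc = sum((x > t) == bool(y) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bagging: train each stump on a bootstrap resample, then majority-vote.
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(101)]

def forest_predict(x):
    votes = sum(x > t for t in stumps)
    return 1 if votes > len(stumps) / 2 else 0

print(forest_predict(9), forest_predict(1))
```

Each stump is weak and noisy on its own; the majority vote across bootstrap resamples is the low-variance "forest".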
SLIDE 10

Random Forest: Amazon!

[Screenshots: Amazon recommendation pages for two different users, "Me" and "Becky"]

SLIDE 11

The Random Forest.

[Diagram: the decision tree from Slide 8 repeated nine times, illustrating a forest of trees]

SLIDE 12

Model: Needed Data

The ENTIRE KITCHEN SINK!

  • Bring in all the data believed to predict enrollment:

– Ethnicity
– Sex
– Housing
– Freshmen / Transfer Orientation (completed and future attendees)
– Financial Aid (FAFSA & Fin Aid interest), including scholarships awarded
– Confirmations
– Admission communication

SLIDE 13

Model: Data Cautions

Look out for:

  • Institutional actions that create inconsistencies within the data
  • Administrative or Legislative changes!

– An admission waitlist is imposed
– Changes to the admission criteria
– Tuition decreases! (and at greater rates at other in-state universities)

  • Students that cancel housing contracts but remain active applicants.

SLIDE 14

Why R-Studio?

  • Freeware
  • Pre-loaded packages allow for specialized statistical techniques

– Random Forest
– Bootstrapping
– LOTS and LOTS more!

  • University-wide cross-collaboration
  • Forums-on-Forums-on-Forums for help!
SLIDE 15

Working with Data in R Studio

SLIDE 16

Working with Data in R Studio

  • Import file using CSV format
  • Must specify headers
  • When using variables, everything is case sensitive!
  • Partition a data set
  • Warning messages will drive you mad!
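The same steps (read a headered CSV, then partition it) can be sketched with Python's standard library; the column names below are illustrative placeholders, not WSU's actual fields:

```python
import csv
import io
import random

# Hypothetical extract; field names are illustrative only.
raw = """student_id,confirmed,housing,enrolled
1001,Y,Y,1
1002,N,N,0
1003,Y,N,1
1004,N,Y,0
1005,Y,Y,1
"""

# csv.DictReader uses the first row as headers; names are case sensitive.
rows = list(csv.DictReader(io.StringIO(raw)))

# Partition the data set (an 80/20 train/test split here).
random.seed(1)
random.shuffle(rows)
cut = int(len(rows) * 0.8)
train, test = rows[:cut], rows[cut:]
print(len(train), len(test))
```

Fixing the random seed makes the partition reproducible run-to-run, which matters when comparing models.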

SLIDE 17

The Random Forest Model

  • Three different data sets:
  • Train – Fall 2014 Admits
  • Test – Fall 2015 Admits
  • Project – Fall 2016 Admits
  • All data is after the 6th orientation session (late June / early July)
  • Different Random Forest models:
  • All Applicants
  • All Admitted
  • All Confirmed
SLIDE 18

Random Forest: Output

  • Example: Enrollment as a function of housing, confirmations, admitted, Pell eligibility, etc.
  • Number of trees: 500 is the default in R.
  • No. of variables tried at each split is set by the algorithm to find the best match within the training data set.
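For context on the defaults: R's randomForest package uses ntree = 500, and for classification the number of variables tried at each split (mtry) defaults to the floor of the square root of the number of predictors. A one-line check of that default:

```python
import math

# randomForest's classification default: mtry = floor(sqrt(p)),
# where p is the number of predictor variables.
def default_mtry(n_predictors: int) -> int:
    return math.floor(math.sqrt(n_predictors))

# E.g., a model with 9 predictors (housing, confirmation, Pell, ...)
print(default_mtry(9))
```

Tuning mtry up or down from this default is one of the few knobs worth touching in a random forest.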

SLIDE 19

Random Forest: The out-of-bag (OOB) error rate & Confusion Matrix

  • The OOB error (or estimate) is the error rate of the random forest predictor.
  • A method to measure the Random Forest's error of prediction, using only the trees that did not see each observation in their bootstrap samples.
  • The OOB confusion matrix is obtained from the RF predictor.
  • Consists of true positives, true negatives, false positives, and false negatives.
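The confusion-matrix arithmetic itself is only a few lines. This sketch uses made-up labels, not the deck's data, and skips the OOB bookkeeping (the real OOB matrix scores each observation only with trees that never trained on it):

```python
from collections import Counter

# Hypothetical actual vs. predicted enrollment labels (1 = enrolled).
actual    = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]

# Tally the four confusion-matrix cells.
cells = Counter(zip(actual, predicted))
tp, tn = cells[(1, 1)], cells[(0, 0)]
fn, fp = cells[(1, 0)], cells[(0, 1)]

# Error rate = share of misclassified observations.
error_rate = (fp + fn) / len(actual)
print(tp, tn, fp, fn, error_rate)
```

For enrollment prediction, false negatives (students predicted not to enroll who do) and false positives carry different planning costs, so the matrix is more informative than the error rate alone.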

SLIDE 20

Random Forest: Variable Importance Plot

  • Provides the mean decrease in accuracy for each variable.
  • In the example provided, Attendance Code at Orientation, along with housing and confirmation status, are the most important variables in prediction.

SLIDE 21

Moment of Truth: The Projection Prediction – All Applicants

This Random Forest model predicts that 28.0% of all applicants will enroll Fall 2016: 4225/15090 = 28.0%.
Compared to Fall 2015: 4201/(13247+4201) = 24.08%.

How accurate:

  • Fall 2015: 18428 × 24.08% = 4437 vs. actual 4220
  • Fall 2016: 21302 × 28.0% = 5965 vs. actual 3991
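The accuracy check is just rate × applicants. A minimal Python recomputation using the exact (unrounded) rates from the slide's own fractions; note the exact Fall 2016 rate gives a projection of about 5964, far above the actual 3991, which is the over-prediction this slide illustrates:

```python
# Rates and counts from the All Applicants slide; projection = rate * applicants.
def projected(applicants, rate):
    return round(applicants * rate)

rate_2015 = 4201 / (13247 + 4201)  # Fall 2015 enrollment rate (~24.08%)
rate_2016 = 4225 / 15090           # model's Fall 2016 rate (~28.0%)

print(projected(18428, rate_2015), "vs actual 4220")  # Fall 2015 back-test
print(projected(21302, rate_2016), "vs actual 3991")  # Fall 2016 projection
```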


SLIDE 22

Moment of Truth: The Projection Prediction – All Applicants

SLIDE 23

Moment of Truth: The Projection Prediction – Admitted

This Random Forest model predicts that 25.5% of admitted students will enroll Fall 2016: 4211/(12327+4211) = 25.5%.
Compared to Fall 2015: 4167/(11440+4167) = 26.7%.

How accurate:

  • Fall 2015: 14935 × 26.7% = 3988 vs. actual 4220
  • Fall 2016: 15572 × 25.5% = 3971 vs. actual 3991
SLIDE 24

Moment of Truth: The Projection Prediction – Admitted

SLIDE 25

Moment of Truth: The Projection Prediction – Confirmed

This Random Forest model predicts that 81.2% of confirmed students will enroll Fall 2016: 4200/(4200+972) = 81.2%.
Compared to Fall 2015: 4142/(4142+1251) = 76.8%.

How accurate:

  • Fall 2015: 5062 × 76.8% = 3888 vs. actual 4220
  • Fall 2016: 4643 × 81.2% = 3770 vs. actual 3991
SLIDE 26

Moment of Truth: The Projection Prediction – Confirmed

SLIDE 27

Cautions

  • Null values in R will cause problems.

– Financial aid variables become problematic.
– 0 is not the same as null.

  • Dates vs. Events

– Snap dates across admission cycles must be consistent for prediction
– Or, as I prefer, use data after the same New Student Orientation
– I used the 6th orientation session year-over-year in this analysis

  • Other models can assist in calibrating and ensuring model accuracy

– Logistic regression, Markov chain models, etc.

  • R does have a learning curve
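The "0 is not the same as null" caution can be shown in one snippet, with hypothetical financial-aid amounts and Python's None standing in for R's NA:

```python
# A missing FAFSA amount must not be treated as an award of $0.
# Hypothetical values, not WSU data.
awards = [1500, None, 0, None, 2000]

# Wrong: nulls silently counted as zero awards.
mean_wrong = sum(a or 0 for a in awards) / len(awards)

# Right: drop the nulls, average only the known amounts.
known = [a for a in awards if a is not None]
mean_right = sum(known) / len(known)

print(mean_wrong, mean_right)
```

The two means differ substantially, which is exactly how silently coerced nulls distort the financial-aid predictors.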
SLIDE 28

Next Steps

  • Adjust model to focus on sub-populations of the admission pool

– Honors students, STEM majors, URM, etc.

  • Apply across the institution

– WSU Tri-Cities
– WSU Vancouver
– WSU Spokane
– WSU North Puget Sound at Everett
– WSU Global

  • Apply to other areas of student prediction models, e.g., degree-completion time.

SLIDE 29

References

Headstrom, Ward. Using a Random Forest Model to Predict Enrollment. Humboldt State University. CAIR 2013.

Herzog, Serge. Estimating Student Retention and Degree-Completion Time: Decision Trees and Neural Networks vis-à-vis Regression. New Directions for Institutional Research, No. 131, Fall 2006.

Sampath, V., Flagel, A., Figueroa, C. A Logistic Regression Model to Predict Freshmen Enrollments.