Liquidity Modeling in Real Estate using Survival Analysis



slide-1
SLIDE 1

Liquidity Modeling in Real Estate using Survival Analysis

QCon San Francisco, April 2018 David Lundgren & Xinlu Huang

slide-2
SLIDE 2

Who has modeled time-to-event data before?

slide-3
SLIDE 3

Who has modeled time-to-event data before?

What’s the half-life of a startup in Silicon Valley?

slide-4
SLIDE 4

Who has modeled time-to-event data before?

What’s the half-life of a startup in Silicon Valley?

When’s my team going to score another goal?

slide-5
SLIDE 5

Did you use survival analysis?

slide-6
SLIDE 6

Introduction

Xinlu Huang David Lundgren

slide-7
SLIDE 7

Talk Structure

  • Real Estate 100 and Opendoor 101
      ○ Modeling Liquidity via Days-on-market
      ○ Home Sale Case Studies
  • Pay Attention to the Negative Space (Model 1)
  • Solve a Simpler Problem (Model 2)
  • A General Recipe for Survival Analysis (Model 3)
  • Q & A
slide-8
SLIDE 8
slide-9
SLIDE 9
Real Estate 100 and Opendoor 101

How a home’s duration on the market impacts Opendoor

  • Opendoor bears the risk in reselling the home
  • Time-on-market varies substantially by home
  • Our unit costs are driven by how long it takes us to find a buyer for a home

slide-10
SLIDE 10

The Problem

How long will it take us to find a buyer for a home?

slide-11
SLIDE 11

Home Sale Case Studies

Home 1

Listed ~$800k

slide-12
SLIDE 12

Home Sale Case Studies

Home 1

Listed ~$800k

6+ months on the market

slide-13
SLIDE 13

Home Sale Case Studies

Home 2

Listed ~$300k

slide-14
SLIDE 14

Home Sale Case Studies

Home 2

Listed ~$300k

1 month on the market

slide-15
SLIDE 15

Framing the Problem

slide-16
SLIDE 16

Framing the Problem

Home                List Price   Square Feet   Other Features   Days-on-market (y)
423 Main Street     $200k        2000          ...              30
111 Side Road       $200k        2200          ...              100
...
52 Downtown Ave     $400k        1945          ...              n/a
90 Outskirts Lane   $300k        2100          ...              n/a

slide-17
SLIDE 17

Model #1: Linear Regression

Home                List Price   Square Feet   Other Features   Days-on-market (y)
423 Main Street     $200k        2000          ...              30
111 Side Road       $200k        2200          ...              100
...

slide-18
SLIDE 18

Does it work?

slide-19
SLIDE 19

Results

slide-20
SLIDE 20

Results

slide-21
SLIDE 21

Results

slide-22
SLIDE 22

Results

slide-23
SLIDE 23

Results

slide-24
SLIDE 24

Censoring

slide-25
SLIDE 25

Model #1: Linear Regression

Home                List Price   Square Feet   ...   Days-on-market (y)   Explanation
423 Main Street     $200k        2000          ...   30
111 Side Road       $200k        2200          ...   100
...
52 Downtown Ave     $400k        1945          ...   n/a                  Still on market after 200 days
90 Outskirts Lane   $300k        2100          ...   n/a                  Delisted after 300 days

slide-26
SLIDE 26

Pay attention to the negative space

Model #1: Takeaway
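
To make the takeaway concrete, here is a minimal simulation sketch on synthetic data (not Opendoor's). The exponential sale times, the 90-day mean, and the 100-day observation window are all illustrative assumptions. It compares the true mean days-on-market with the mean computed only from listings that closed inside the window, which is all that a regression trained on rows with a known y would see.

```python
import random

def simulate_censoring_bias(n_listings=10_000, mean_days=90.0, window_days=100.0, seed=0):
    """Compare the true mean sale time with the mean over listings that closed in the window."""
    rng = random.Random(seed)
    # Hypothetical ground truth: exponentially distributed days-on-market.
    true_days = [rng.expovariate(1.0 / mean_days) for _ in range(n_listings)]
    # Listings that have not sold by the end of the window are censored (the "n/a" rows).
    closed = [d for d in true_days if d <= window_days]
    true_mean = sum(true_days) / len(true_days)
    closed_only_mean = sum(closed) / len(closed)
    censored_share = 1.0 - len(closed) / n_listings
    return true_mean, closed_only_mean, censored_share

true_mean, closed_only_mean, censored_share = simulate_censoring_bias()
print(f"true mean days-on-market:      {true_mean:.1f}")
print(f"mean over closed listings only: {closed_only_mean:.1f}")
print(f"share of listings censored:     {censored_share:.0%}")
```

Under these assumptions the closed-only mean comes out well below the true mean, because the slowest homes are exactly the ones that drop out of the training data.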

slide-27
SLIDE 27

Reframing the Problem

slide-28
SLIDE 28

Model #2: Classify “closed before 100 days-on-market”

days-on-market

100 days

?

slide-29
SLIDE 29

Model #2: Classify “closed before 100 days-on-market”

Home                List Price   ...   Days-on-market                          Closed Within 100 Days (y)
423 Main Street     $200k        ...   30                                      1
111 Side Road       $200k        ...   100                                     0
...
52 Downtown Ave     $400k        ...   n/a (still on market after 200 days)    0
90 Outskirts Lane   $300k        ...   n/a (delisted after 300 days)           0

slide-30
SLIDE 30

Does it Work?

slide-31
SLIDE 31

Pros

slide-32
SLIDE 32

Pro: Easy to Implement

days-on-market

?

slide-33
SLIDE 33

Pro: Easy to Implement - Just Set a Threshold

days-on-market

100 days

?

slide-34
SLIDE 34

Pro: Easy-to-interpret Output

[Chart: predicted probability for the 0-100 days and 100+ days buckets]

slide-35
SLIDE 35

Pro: Uses Censored Data

days-on-market

100 days

?

slide-36
SLIDE 36

Cons

slide-37
SLIDE 37

Easy to Implement - Just Set a Threshold

days-on-market

100 days

?

slide-38
SLIDE 38

Easy to Implement - Just Set a Threshold - But Which One?

days-on-market

100 days

?

10 days 120 days 45 days

slide-39
SLIDE 39

Easy-to-interpret Output

Predicted Probability 0-100 days 100+ days

slide-40
SLIDE 40

Easy-to-interpret Output

Wrong API

[Chart: predicted probability for the 0-100 days and 100+ days buckets]

P(0-100 days) × 50 + P(100+ days) × 150 = ??

slide-41
SLIDE 41

Easy-to-interpret Output

Ideal API

[Charts: predicted probability over days-on-market, with a marker at 60 days; predicted probability for the 0-100 days and 100+ days buckets]

slide-42
SLIDE 42

Uses Censored Data

days-on-market

100 days

?

slide-43
SLIDE 43

Uses Censored Data (Partially)

days-on-market

100 days

?

But Discards Recent Observations

days-on-market

100 days

?
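
The labeling rule behind Model #2, including which observations it has to discard, can be sketched as follows. This is an illustrative reading of the slides, not the speakers' code: the function name, its arguments, and the "hypothetical new listing" row are assumptions, and the strict "before 100 days" comparison follows the slide title.

```python
from typing import Optional

def label_closed_before(days_observed: int, closed: bool, threshold: int = 100) -> Optional[int]:
    """Label for the "closed before `threshold` days-on-market" classifier.

    days_observed: days-on-market at close, or days observed so far if not closed.
    Returns 1 or 0 when the outcome at the threshold is known, None when it is not.
    """
    if closed and days_observed < threshold:
        return 1      # sold before the threshold
    if days_observed >= threshold:
        return 0      # still unsold at the threshold (closed later, active, or delisted)
    return None       # censored before the threshold: outcome unknown, row is discarded

examples = {
    "423 Main Street": (30, True),
    "111 Side Road": (100, True),
    "52 Downtown Ave": (200, False),            # still on market after 200 days
    "90 Outskirts Lane": (300, False),          # delisted after 300 days
    "hypothetical new listing": (20, False),    # listed 20 days ago, not sold yet
}
labels = {home: label_closed_before(*obs) for home, obs in examples.items()}
for home, label in labels.items():
    print(f"{home}: {label}")
```

The two long-running censored listings get a usable label of 0, but the recent listing gets `None`: the classifier cannot learn from it until it either sells or passes the threshold.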

slide-44
SLIDE 44

Solve a Simpler Problem

Model #2: Takeaway

slide-45
SLIDE 45

Attempt #3

Survival Analysis

slide-46
SLIDE 46

When stuck, see if someone has already solved the problem...

Actuaries & medical professionals are interested in

  • What is the life expectancy of the population of city A?
  • What is the probability of person B surviving the next decade?
  • Given person C is 70 years old, what is his/her life expectancy?

Censored data is always an issue.

slide-47
SLIDE 47

Actuaries & medical professionals are interested in

  • What is the life expectancy of the population of city A?
  • What is the probability of person B surviving the next decade?
  • Given person C is 70 years old, what is his/her life expectancy?

In this analogy, “death” is the happy event of finding a buyer.

Opendoor is interested in

  • What is the expected days on market for all listings in city A?
  • What is the probability of listing B taking 10 more days to sell?
  • Given listing C was on market for 70 days, how much longer until we expect to find a buyer?
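
Questions like these are usually answered from a survival curve S(t) = P(days-on-market > t). As a rough illustration, the sketch below implements the textbook Kaplan-Meier estimator, which the talk names later but does not walk through, on a hypothetical toy dataset. The function names and the data are assumptions for this example only.

```python
def kaplan_meier(durations, sold):
    """Return [(t, S(t))] at each distinct sale time t, where S(t) = P(days-on-market > t).

    durations: days each listing was observed; sold: True if it sold, False if censored.
    """
    curve, surv = [], 1.0
    for t in sorted({d for d, s in zip(durations, sold) if s}):
        at_risk = sum(1 for d in durations if d >= t)   # still listed just before day t
        events = sum(1 for d, s in zip(durations, sold) if s and d == t)
        surv *= 1.0 - events / at_risk
        curve.append((t, surv))
    return curve

def survival_at(curve, t):
    """Step-function lookup: S(t) is the value at the last sale time <= t (1.0 before any sale)."""
    value = 1.0
    for time, surv in curve:
        if time > t:
            break
        value = surv
    return value

# Hypothetical toy data: six sold listings and two censored ones (still unsold when last seen).
durations = [30, 45, 60, 80, 100, 120, 200, 300]
sold = [True, True, True, True, True, True, False, False]
curve = kaplan_meier(durations, sold)

# P(listing is still unsold at day 80 | it was still unsold at day 70)
p_10_more = survival_at(curve, 80) / survival_at(curve, 70)
print(f"P(more than 10 additional days | 70 days on market) = {p_10_more:.2f}")
```

The censored listings never count as sales, but they do count in the at-risk denominator for every day they were observed, which is how the estimator uses the "negative space" instead of dropping it.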

slide-48
SLIDE 48

Previously…

[Charts: a single point prediction (Predicted Days-on-market = 45); predicted probability for the 0-100 days and 100+ days buckets]

With survival analysis...

[Chart: predicted probability over time (days-on-market), with a marker at 60 days]

slide-49
SLIDE 49

Look for Existing Solutions to Similar Problems

Model #3: Takeaway 1

slide-50
SLIDE 50

We found the right approach, but...

slide-51
SLIDE 51

Hurdle #1 It’s not easy to explain

The fundamental concepts require calculus to explain well

Limited intuition and tie-ins to tangible concepts for decision makers

slide-52
SLIDE 52

Hurdle #2 Scaling is hard with existing tools

  • Lots of R packages
  • Limited options for production-ready languages
  • Work well on small datasets, but break down on larger ones
slide-53
SLIDE 53

Hurdle #3 Modeling flexibility is hard with existing tools

  • Off-the-shelf packages: model choices are limited (proportional or additive hazard models)
      ○ Non-flexible feature specification
      ○ Hard to implement time-varying features
      ○ …
  • Markov Chain Monte Carlo (Stan): complete freedom of model specification, but
      ○ Took hours to train on a tiny dataset
      ○ Hard to maintain

slide-54
SLIDE 54

Let’s try to reformulate the problem

slide-55
SLIDE 55

Survival analysis made easy

Instead of telling you about...

S(t), λ(t), Cox proportional hazards models, Kaplan-Meier, ...

We will show you a reformulation that

  • Easily scalable to large datasets
  • More concretely tied to real life numbers
  • Equivalent*
  • Allows flexible modeling extension

* with some hand-waving. Rigorous proof left to mathematicians in the audience as an exercise.

slide-56
SLIDE 56

Changing target again

Home              Ini. List Price   ...   Days-on-market
423 Main Street   $200k             ...   30

slide-57
SLIDE 57

Changing target again

Home              Ini. List Price   ...   Days-on-market   “Current” days on market   Sold in the next day (y)
423 Main Street   $200k             ...   30               0                          0
423 Main Street   $200k             ...   30               1                          0
423 Main Street   $200k             ...   30               2                          0
...
423 Main Street   $200k             ...   30               28                         0
423 Main Street   $200k             ...   30               29                         1

30 new data rows

slide-58
SLIDE 58

Changing target again

Home              Ini. List Price   ...   Days-on-market                   “Current” days on market   Sold in the next day (y)
423 Main Street   $200k             ...   30                               0                          0
423 Main Street   $200k             ...   30                               1                          0
423 Main Street   $200k             ...   30                               2                          0
...
423 Main Street   $200k             ...   30                               28                         0
423 Main Street   $200k             ...   30                               29                         1
52 Downtown Ave   $400k             ...   Still on market after 200 days

30 rows

slide-59
SLIDE 59

Changing target again

Home              Ini. List Price   ...   Days-on-market   “Current” days on market   Sold in the next day (y)
423 Main Street   $200k             ...   30               0                          0
423 Main Street   $200k             ...   30               1                          0
423 Main Street   $200k             ...   30               2                          0
...
423 Main Street   $200k             ...   30               28                         0
423 Main Street   $200k             ...   30               29                         1
52 Downtown Ave   $400k             ...   n/a              0                          0
...
52 Downtown Ave   $400k             ...   n/a              199                        0

30 rows (423 Main Street), 200 rows (52 Downtown Ave)

slide-60
SLIDE 60

Change the fundamental unit of data: listings ⇒ listing-days

All listing data are used: closed, active, delisted...
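
The listings ⇒ listing-days transformation can be sketched in a few lines. This is a minimal illustration, not the speakers' pipeline: the function name and the dictionary keys are assumptions, and the two example homes are taken from the tables in the slides.

```python
def to_listing_days(listing):
    """Expand one listing into one row per day it was on the market.

    listing: dict with 'home', 'list_price', 'days_observed', and 'sold'
    (True if it closed after days_observed days, False if still active or delisted).
    """
    rows = []
    for day in range(listing["days_observed"]):
        is_last_day = day == listing["days_observed"] - 1
        rows.append({
            "home": listing["home"],
            "list_price": listing["list_price"],
            "current_days_on_market": day,
            # The label is 1 only on the final day of a listing that actually sold.
            "sold_next_day": int(listing["sold"] and is_last_day),
        })
    return rows

listings = [
    {"home": "423 Main Street", "list_price": 200_000, "days_observed": 30, "sold": True},
    {"home": "52 Downtown Ave", "list_price": 400_000, "days_observed": 200, "sold": False},
]
listing_days = [row for listing in listings for row in to_listing_days(listing)]
print(len(listing_days))  # 230 rows: 30 for the sold home, 200 for the censored one
```

A censored listing simply contributes rows whose label is always 0, so nothing has to be thrown away.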

slide-61
SLIDE 61

Binary classification to the rescue, again

We transformed the problem into vanilla binary classification

  • Pick your favorite binary classifier, as long as it
      ○ Minimizes log loss
      ○ Produces calibrated probabilities
  • Scalability ✔ (even though we made the dataset larger!)
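
As one example of a log-loss-minimizing classifier, here is a small logistic regression fitted by plain gradient descent on listing-day rows. It is a sketch only: the single binary feature, the toy counts, and the learning-rate settings are hypothetical, and in practice any library classifier that optimizes log loss could take its place.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Predicted probability of closing in the next day (intercept is weights[0])."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def fit_logistic(rows, labels, lr=1.0, epochs=1000):
    """Fit weights by batch gradient descent on the mean log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y   # derivative of log loss w.r.t. the linear score
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

# Hypothetical toy listing-days with one binary feature (1 = priced above comparable homes).
# 100 fairly priced listing-days with 20 closings, 100 overpriced listing-days with 10 closings.
rows = [[0]] * 100 + [[1]] * 100
labels = [1] * 20 + [0] * 80 + [1] * 10 + [0] * 90

weights = fit_logistic(rows, labels)
hazard_fair = predict(weights, [0])
hazard_high = predict(weights, [1])
print(f"fairly priced: {hazard_fair:.3f}, overpriced: {hazard_high:.3f}")
```

With an intercept and a single binary feature, the fitted probabilities converge to the observed close rates in each group (0.20 and 0.10 here), which is the calibration property the slide asks for.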
slide-62
SLIDE 62

How to interpret?

Prediction = probability of listing closing in the next day (hazard rate in survival analysis parlance)

Prediction = housing clearance rate, a.k.a. inventory turnover rate

If we start with 100 homes on market today, how many will close before the end of the day/week/month/year?

✔ Model output ties directly to real world numbers, no calculus needed!
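
The "100 homes" question can be answered with one line of arithmetic once a daily hazard is known. The sketch below assumes, purely for illustration, a constant hazard of 2% per day for every home; a real model would predict a different hazard per listing and per day.

```python
def expected_closings(n_homes, daily_hazard, horizon_days):
    """Expected number of homes that close within the horizon, assuming a constant daily hazard."""
    # A home is still unsold after the horizon only if it fails to close on every single day.
    p_close_within = 1.0 - (1.0 - daily_hazard) ** horizon_days
    return n_homes * p_close_within

for label, days in [("day", 1), ("week", 7), ("month", 30), ("year", 365)]:
    print(f"within a {label}: {expected_closings(100, 0.02, days):.1f} of 100 homes")
```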

slide-63
SLIDE 63

How to interpret? (cont’d)

Example: expected days on market

For each listing, we have a series of predictions (h1, h2, h3, h4, ...), one for each day

E[y] = ∑ y × P(y)
     = 1 × h1 + 2 × (1 - h1) h2 + 3 × (1 - h1) (1 - h2) h3 + 4 × … + ...

where h1 = P(closing on day 1), and (1 - h1) h2 = P(days-on-market = 2) = P(not closing on day 1) × P(closing on day 2)

Prediction, a.k.a. the hazard rate, is the building block

hazard rate + laws of probability = everything we want to know
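
The sum above translates directly into code. This is a minimal sketch with hypothetical function names and made-up hazards; note that a finite series of hazards can leave some probability that the home is still unsold at the end of the horizon, which the sketch returns separately rather than hiding.

```python
def days_on_market_distribution(hazards):
    """Turn daily hazards (h1, h2, ...) into P(days-on-market = t) for t = 1, 2, ..."""
    probs, still_listed = [], 1.0
    for h in hazards:
        probs.append(still_listed * h)   # survived every earlier day, then closed on this one
        still_listed *= 1.0 - h
    return probs, still_listed           # still_listed = P(not sold within the horizon)

def expected_days_on_market(hazards):
    """E[y] = sum of t * P(y = t), plus the leftover probability of not selling in the horizon."""
    probs, unsold = days_on_market_distribution(hazards)
    expected = sum(t * p for t, p in enumerate(probs, start=1))
    return expected, unsold

# Made-up hazards for a home that is certain to sell by day 3.
expected, unsold = expected_days_on_market([0.5, 0.5, 1.0])
print(expected, unsold)  # 1.75 0.0
```

When `unsold` is not close to zero, the returned expectation only covers the homes that sell within the horizon and should be read as partial.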

slide-64
SLIDE 64

Complex modeling technique doesn’t always need complex implementation

Model #3: Takeaway 2

slide-65
SLIDE 65

How does it work?

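Maximum likelihood estimation:

P(data | model) = P(listing 1 on market for D_1 days | model) × P(listing 2 on market for D_2 days | model) × …
                = (1 - h_{1,1}) (1 - h_{1,2}) … h_{1,D_1} × (1 - h_{2,1}) (1 - h_{2,2}) … h_{2,D_2} × …

log P(data | model) = log(1 - h_{1,1}) + log(1 - h_{1,2}) + … + log(h_{1,D_1}) + …

Spoiler alert: look at the log loss function ✔

Minimizing log loss in binary classification (equation alert!):

log loss = -(1/N) ∑_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]

Here h_{i,t} is the predicted hazard for listing i on day t, and p_i is the predicted probability for listing-day row i. Only one term matters for each row, depending on its label (y_i ∈ {0, 1}).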
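
The equivalence is easy to check numerically. The sketch below uses made-up hazards for one sold and one censored listing (all names and numbers are hypothetical) and confirms that the survival log-likelihood equals the negative of the log loss summed, not averaged, over the corresponding listing-day rows.

```python
import math

def survival_log_likelihood(listings):
    """listings: list of (daily_hazards, sold). A sold listing closes on its last day."""
    total = 0.0
    for hazards, sold in listings:
        survived = hazards[:-1] if sold else hazards
        total += sum(math.log(1.0 - h) for h in survived)   # days the home did not sell
        if sold:
            total += math.log(hazards[-1])                   # the day it did sell
    return total

def summed_log_loss(predictions, labels):
    """Binary log loss summed over listing-day rows."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                for p, y in zip(predictions, labels))

listings = [
    ([0.1, 0.2, 0.3], True),    # sold on day 3
    ([0.05, 0.05], False),      # censored after 2 days
]
# Flatten to listing-day rows: one prediction and one label per row.
predictions = [h for hazards, _ in listings for h in hazards]
labels = [int(sold and i == len(hazards) - 1)
          for hazards, sold in listings for i in range(len(hazards))]

print(survival_log_likelihood(listings), -summed_log_loss(predictions, labels))
```

Dividing the log loss by N only rescales the objective, so the model that minimizes one maximizes the other.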

slide-66
SLIDE 66

We will show you a reformulation that is

✔ Easily scalable to large datasets
✔ More concretely tied to real life numbers
✔ Equivalent

  • Allows flexible modeling extension
slide-67
SLIDE 67

Time varying features

e.g. how does pricing change liquidity?

Not straightforward to implement in off-the-shelf survival analysis models
slide-68
SLIDE 68

Time varying features

e.g. how does pricing change liquidity?

Home              Ini. List Price   “Current” list price   ...   Days-on-market   “Current” days on market   Sold in the next day (y)
423 Main Street   $200k             $200k                  ...   30               0                          0
423 Main Street   $200k             $200k                  ...   30               1                          0
423 Main Street   $200k             $190k                  ...   30               2                          0
...
423 Main Street   $200k             $170k                  ...   30               28                         0
423 Main Street   $200k             $170k                  ...   30               29                         1
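
In the listing-days layout, a time-varying feature is just another column whose value can differ from row to row. The sketch below is illustrative only: the function and key names are assumptions, and while the cut to $190k on day 2 comes from the table, the day of the second cut (day 20 here) is hypothetical because the table elides those rows.

```python
def to_listing_days_with_price(home, initial_price, price_changes, days_observed, sold):
    """One row per listing-day, carrying the list price in effect on that day.

    price_changes: {day: new_price}, where day is the "current" days on market
    at which the new price takes effect.
    """
    rows, current_price = [], initial_price
    for day in range(days_observed):
        current_price = price_changes.get(day, current_price)
        rows.append({
            "home": home,
            "initial_list_price": initial_price,
            "current_list_price": current_price,
            "current_days_on_market": day,
            "sold_next_day": int(sold and day == days_observed - 1),
        })
    return rows

rows = to_listing_days_with_price(
    "423 Main Street", 200_000,
    {2: 190_000, 20: 170_000},   # day 20 is a hypothetical date for the second price cut
    days_observed=30, sold=True,
)
print(rows[2]["current_list_price"], rows[29]["current_list_price"])  # 190000 170000
```

Because each row is a separate training example, the classifier can learn how the hazard responds to the price in effect on that day, with no special support for time-varying covariates.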

slide-69
SLIDE 69

Time series analysis

Real life housing data is not stationary

slide-70
SLIDE 70

Time series analysis

  • Tricky to implement in traditional survival analysis
  • Listing-centric view doesn’t work well

Instead, train a series of models, each on a snapshot of listings at time t, then interpolate predictions using time series techniques
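
One possible reading of "a snapshot of listings at time t" is the set of homes that are on the market on a given date, each labeled by whether it closes the next day. The sketch below builds such a snapshot; the field names and the calendar dates are hypothetical, and the step that trains one model per snapshot and interpolates between them is not shown.

```python
from datetime import date, timedelta

def snapshot(listings, as_of):
    """Rows for listings active on `as_of`, labeled by whether they closed the next day."""
    rows = []
    for listing in listings:
        end = listing["end_date"]   # close or delist date; None if still active
        if listing["list_date"] > as_of or (end is not None and end <= as_of):
            continue                # not on the market on the snapshot date
        rows.append({
            "home": listing["home"],
            "current_days_on_market": (as_of - listing["list_date"]).days,
            "sold_next_day": int(listing["sold"] and end == as_of + timedelta(days=1)),
        })
    return rows

# Hypothetical dates: the first home sells after 30 days, the second is still active.
listings = [
    {"home": "423 Main Street", "list_date": date(2018, 1, 1),
     "end_date": date(2018, 1, 31), "sold": True},
    {"home": "52 Downtown Ave", "list_date": date(2018, 1, 10),
     "end_date": None, "sold": False},
]
rows_jan30 = snapshot(listings, date(2018, 1, 30))
for row in rows_jan30:
    print(row)
```

Each snapshot date yields its own small training set, so market conditions on that date are shared by every row in it.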

slide-71
SLIDE 71

Divide and conquer

Break problem down to interpretable intermediate steps

Model #3: Takeaway 3

slide-72
SLIDE 72

When You Have a Hammer, Everything Looks Like a Nail Survival Analysis

slide-73
SLIDE 73

Survival analysis is broadly useful

Churn prediction / user lifetime analysis

  • Not just if, but when and with what probability, a user leaves
  • Full probability distribution to compute lifetime value of customers

Credit / Loan default

  • Default early or default later in the loan?

System reliability

  • What is the distribution of hard drive lifetimes?

Any time-to-event prediction!

slide-74
SLIDE 74

Ask your doctor if survival analysis is right for you …

  • You want to model time-to-event, or even just binary classification
  • You work with censored data
  • You value a full probability distribution instead of point estimate
  • Time is a confounding factor (cohorts, mix shift, …)
slide-75
SLIDE 75

If survival analysis is right for you, it can be easy to use!

We’ve shown you a reformulation that

✔ Easily scalable to large datasets
✔ More concretely tied to real life numbers
✔ Allows flexible modeling extension

1. Transform
2. Binary Classify

slide-76
SLIDE 76

Thanks!

slide-77
SLIDE 77

Join us at Opendoor as we change real estate!

  • Founded in 2014
  • $100M+ transactions per month
  • In rapid expansion mode - we’re currently in 8 cities (more coming!)

  • We are hiring engineers and data scientists.

Please contact us: dave@opendoor.com & xinlu@opendoor.com

slide-78
SLIDE 78

Q & A