Taxi Travel Time Prediction: Assignment 2 - Outcome Lecture

slide-1
SLIDE 1

Assignment 2 - Outcome Lecture

Sebastian Caldas and Nicholay Topin

Taxi Travel Time Prediction

slide-3
SLIDE 3

Before we start: a survey!

  • Who has done applied machine learning before?
  • How much time did you spend on the implementation part of the assignment?

slide-4
SLIDE 4

This lecture has 3 objectives:

  • Summarize the students’ solutions to the assignment
  • Understand how the assignment relates to the course’s goals
  • Provide the appropriate context for the next assignment

slide-6
SLIDE 6

Ksenia Korovina, Zachary Wojtowicz

slide-7
SLIDE 7

Global summary

slide-9
SLIDE 9

“By 5pm on March 13, 2019, make a submission to Kaggle that beats the baseline.”

  • Baseline was a simple “lookup table” approach
    ○ Calculate “hour block” for each data point: int(pickup_hour/5)
    ○ Features: hour block, pick-up (PU) location ID, drop-off (DO) location ID
    ○ At test time, for a (block, PU ID, DO ID) tuple, predict the average over matching training tuples
  • Boosting and random forests with standard parameters outperform the baseline
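
The lookup-table baseline described above can be sketched roughly as follows. This is a minimal illustration, not the course's actual baseline code: the row layout, the function names, and the toy data are assumptions, and the fallback to the global mean for tuples never seen in training is also an assumption, since the slide does not say how such tuples were handled.

```python
from collections import defaultdict


def hour_block(pickup_hour):
    """Hour block as defined on the slide: int(pickup_hour / 5)."""
    return int(pickup_hour / 5)


def fit_lookup(rows):
    """Average travel time per (hour block, PU ID, DO ID) tuple.

    rows: iterable of (pickup_hour, pu_id, do_id, travel_time).
    This row layout is hypothetical, chosen only for the sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total, n = 0.0, 0
    for pickup_hour, pu_id, do_id, travel_time in rows:
        key = (hour_block(pickup_hour), pu_id, do_id)
        sums[key] += travel_time
        counts[key] += 1
        total += travel_time
        n += 1
    table = {key: sums[key] / counts[key] for key in sums}
    # Assumed fallback for unseen tuples; not stated on the slide.
    global_mean = total / n
    return table, global_mean


def predict(table, global_mean, pickup_hour, pu_id, do_id):
    """Look up the tuple's training average, else use the global mean."""
    key = (hour_block(pickup_hour), pu_id, do_id)
    return table.get(key, global_mean)


# Toy rows, invented for illustration only.
train = [
    (8, 1, 2, 600.0),
    (9, 1, 2, 800.0),
    (17, 1, 2, 1200.0),
    (8, 3, 4, 300.0),
]
table, global_mean = fit_lookup(train)
```

With these toy rows, hours 8 and 9 fall into the same block, so a trip from location 1 to location 2 at 7:00 is predicted from the average of the first two rows.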
slide-10
SLIDE 10

Any comments?

slide-12
SLIDE 12

“Describe the pipeline used for your submission and present your results.”

1. Preprocessing
    ○ Mostly done for you (Thanks, Nicholay!)
    ○ Convert time t to ln(t + 1) to easily optimize RMSLE
    ○ Subsample the data (to account for limited resources)

2. Feature engineering
    ○ Remove “vendor id”, “payment type” and “passenger count” (?)
    ○ Day of week and hour of day (categorical)
    ○ Month (?)
    ○ Minute/Hour of the week
    ○ Weekday vs. weekend
    ○ Distance between locations
    ○ Average time for pick-up/drop-off pair
    ○ Traffic estimates (count for pick-up/drop-off pair, sometimes hour)
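
Several of the time-based features listed above can be derived from a single pickup timestamp. The sketch below is one possible way to do it, assuming the pickup time is available as a Python `datetime`; the function name, the dictionary keys, and the example timestamp are illustrative choices, not taken from any student's solution.

```python
from datetime import datetime


def time_features(pickup):
    """Calendar features from a pickup timestamp (a datetime)."""
    day_of_week = pickup.weekday()  # Monday = 0 ... Sunday = 6
    hour_of_day = pickup.hour
    hour_of_week = day_of_week * 24 + hour_of_day
    return {
        "day_of_week": day_of_week,
        "hour_of_day": hour_of_day,
        "hour_of_week": hour_of_week,
        "minute_of_week": hour_of_week * 60 + pickup.minute,
        "is_weekend": day_of_week >= 5,
        "month": pickup.month,
    }


# Arbitrary example timestamp: Wednesday, March 13, 2019, 17:00.
features = time_features(datetime(2019, 3, 13, 17, 0))
```

Day of week and hour of day come out as integers here; whether to treat them as categorical or ordinal is a separate modelling decision.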

slide-14
SLIDE 14

How can we handle categorical features?

slide-15
SLIDE 15

Why did the average time work?

slide-18
SLIDE 18

“Describe the pipeline used for your submission and present your results.”

3. Split into train/val sets
    ○ Test set was given
    ○ Best estimates if the training data precedes the validation data in time

4. Method selection
    ○ Random forests (most popular)
    ○ Boosted trees
    ○ Nearest neighbors
    ○ Shallow feed-forward neural network (quite unpopular?)
    ○ Classifier per pick-up/drop-off pair (sometimes band of day)
        ■ Requires handling sparsity
    ○ Few students had their own baselines.
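
A time-ordered train/val split, where every training row precedes every validation row, might look like the sketch below. The row layout (pickup time first), the function name, and the default 20% validation fraction are assumptions made for illustration; the slides do not specify how students chose their split.

```python
def temporal_split(rows, val_fraction=0.2, time_key=lambda row: row[0]):
    """Sort by pickup time and hold out the most recent rows for validation."""
    ordered = sorted(rows, key=time_key)
    n_val = max(1, int(len(ordered) * val_fraction))
    return ordered[:-n_val], ordered[-n_val:]


# Toy (pickup_time, travel_time) rows, invented for illustration only.
rows = [
    (5, 500.0), (1, 300.0), (9, 900.0), (3, 350.0), (7, 700.0),
    (0, 200.0), (8, 650.0), (2, 400.0), (6, 610.0), (4, 480.0),
]
train, val = temporal_split(rows)
```

Compared with a random split, this mimics the situation at test time, where the model predicts trips that happen after the ones it was trained on.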

slide-20
SLIDE 20

“Describe the pipeline used for your submission and present your results.”

5. Tuning
    ○ Tune on a development set (different from train/val)
    ○ Cross-validation (?)
    ○ Different hyperparameters per pick-up/drop-off pair (MTL)
    ○ Pick an extreme value of the grid search (?)

6. Evaluate
    ○ Convert back from log-space
    ○ Evaluate on val set (before submitting to Kaggle)

7. Iterate
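
The link between the ln(t + 1) transform in preprocessing and the "convert back from log-space" step can be checked numerically: RMSE computed on ln(t + 1) targets is the same quantity as RMSLE on the raw times, and exp(z) - 1 undoes the transform. The sketch below uses invented toy values and hypothetical helper names.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw travel times."""
    squared = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


def rmse(a, b):
    """Plain root mean squared error."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))


# Toy travel times in seconds, invented for illustration only.
y_true = [300.0, 600.0, 1200.0]
y_pred = [330.0, 540.0, 1500.0]

# Training targets and predictions in log-space: ln(t + 1).
log_true = [math.log1p(t) for t in y_true]
log_pred = [math.log1p(p) for p in y_pred]

# Converting a log-space prediction back to seconds: t = exp(z) - 1.
recovered = [math.expm1(z) for z in log_pred]
```

This is why a model trained with an ordinary squared-error loss on the transformed target is effectively optimizing RMSLE, provided predictions are mapped back before submission.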
slide-21
SLIDE 21

Any comments?

slide-22
SLIDE 22

“Propose concrete and meaningful modifications or extensions to your solution.”

  • The first step is to understand / diagnose your current approach
slide-23
SLIDE 23
“Propose concrete and meaningful modifications or extensions to your solution.”

  • The first step is to understand / diagnose your current approach

Figures by Jie Xie

slide-24
SLIDE 24
“Propose concrete and meaningful modifications or extensions to your solution.”

  • The first step is to understand / diagnose your current approach

Figure by Zachary Wojtowicz

slide-25
SLIDE 25
“Propose concrete and meaningful modifications or extensions to your solution.”

  • The first step is to understand / diagnose your current approach

Figures by Vignesh Kannan

slide-26
SLIDE 26

“Propose concrete and meaningful modifications or extensions to your solution.”

  • The first step is to understand / diagnose your current approach

Figures by Aditya Galada
slide-27
SLIDE 27

“Propose concrete and meaningful modifications or extensions to your solution.”

  • The first step is to understand / diagnose your current approach

Figure by Neel Guha
slide-28
SLIDE 28

Now, how can we do better?

slide-31
SLIDE 31

“Propose concrete and meaningful modifications or extensions to your solution.”

  • Better features
    ○ Make sure to include spatio-temporal features
    ○ Distance and average travel time seem powerful but could be redundant

  • Better models
    ○ Properly tuning your current models

  • More data
    ○ Subsample more data
    ○ Random forests seem to plateau after a while
    ○ External data sources
        ■ Weather data
        ■ Traffic data
        ■ Holidays
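
One quick, rough way to probe whether distance and average travel time carry overlapping information is to look at their linear correlation across pick-up/drop-off pairs. The sketch below is only a first check under stated assumptions: the values are invented, the variable names are hypothetical, and a high correlation does not by itself prove that dropping one feature will not hurt a tree-based model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy values, one entry per pick-up/drop-off pair.
distance_km = [1.0, 2.5, 4.0, 7.5, 12.0]
avg_pair_time_s = [310.0, 560.0, 820.0, 1400.0, 2100.0]

r = pearson(distance_km, avg_pair_time_s)
```

A value of r close to 1 would suggest the two features are largely redundant; a more direct test is to retrain with each feature removed and compare validation error.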

slide-32
SLIDE 32

Any comments?

slide-34
SLIDE 34

Typical Steps of Applied Data Analysis

Steps

  • Overview of research
  • Some research questions the data might answer
  • Description of data
  • Data checks / transfer
  • Return to questions and translating them
  • Present to collaborators
  • Simple methods to give preliminary answers
  • Present to collaborators
  • Do better / Iterate
  • Present to collaborators

slide-39
SLIDE 39

Assignment 3 will focus on iterating upon your preliminary pipeline

  • We will provide you with a new preprocessed version of the data.
  • We will not impose any restrictions on which pipeline you decide to implement, and you can use external sources of data.
  • We will provide a set of baselines which you should beat.
slide-42
SLIDE 42

Just like Assignment 2, Assignment 3 will have two deadlines:

  • By the first deadline, you should have a Kaggle submission that beats our proposed baselines
    ○ Failing to do so will impact your grade
  • By the second deadline, you should improve your model and write your report
    ○ This second deadline is the one previously specified in the course’s calendar
  • The first deadline will be one week before the second
  • The Kaggle competition is meant to incentivize you
    ○ Your grade will not be negatively affected based on your ranking
    ○ The only exception is failing to beat the given baselines

slide-43
SLIDE 43

Any questions?