Symbolic network analysis of bike sharing data

Citi Bike

Vladimir Batagelj, Anuška Ferligoj

IMFM Ljubljana, IAM UP Koper and University of Ljubljana

CMStatistics 2016, Sevilla, 9-11 December 2016

Outline

1 Kaggle
2 Data sets
3 Analyses
4 Conclusions
5 References

Vladimir Batagelj: vladimir.batagelj@fmf.uni-lj.si
Anuška Ferligoj: anuska.ferligoj@fdv.uni-lj.si

Last version of slides (12 December 2016, 14:39): bikes.pdf

Kaggle

Some time ago I found on Kaggle (https://www.kaggle.com/benhamner/sf-bay-area-bike-share) a contest dealing with the analysis of data on the bike sharing system in the San Francisco Bay Area. After some searching it turned out that similar data sets are available for several cities around the world (mainly in the US).

Some Open data sets on Bike Sharing Systems

on my disk

Bike sharing   City              Data available    # of trips
Capital        Washington, D.C.  2010/10-2016/09     14691090
Hubway         Boston            2011/07-2016/06      3930659
Divvy          Chicago           2013/01-2016/06      7867601
Citi Bike      New York          2013/07-2016/09     33319019
BABS           San Francisco     2013/08-2016/08       983648
Healthy Ride   Pittsburgh        2015/07-2016/09       118422
Indego         Philadelphia      2015/04-2016/09       673703
NiceRide       Minnesota         2010/06-2015/12      1808452
Santander C.   London            2015/01-2016/11     19212558

Data about stations

The Stations file is a snapshot of station locations and capacities during the reporting time interval:

  • Station ID
  • Station name
  • Lat/Long coordinates
  • Number of individual docking points at each station

In some cases data about station elevations are also available.

North American Bike Share Association’s open data standard: GBFS (General Bikeshare Feed Specification); systems using GBFS.

Most of the systems provide a feed service returning a JSON file with the current status of stations. Divvy, Indego, Citi Bike stations: info, status.

Reading station status in R

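# Save five snapshots of the Indego station status feed, one minute apart
wdir <- "C:/Users/batagelj/data/bikes/philly"
setwd(wdir)
stat <- "https://gbfs.bcycle.com/bcycle_indego/station_status.json"
num <- 0
# setInternet2(use = TRUE)  # needed only on older Windows versions of R; recent versions do not need it
p1 <- proc.time()
while (num < 5) {
  num <- num + 1
  fsave <- paste('status_', as.character(num), '.json', sep = '')
  # a failed download does not stop the loop; the error object is kept in test
  test <- tryCatch(download.file(stat, fsave, method = "auto"),
                   error = function(e) e)
  Sys.sleep(60)
  p2 <- proc.time()
  cat(p2 - p1, '\n'); flush.console()  # time used since the previous snapshot
  p1 <- p2
}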

Data about trips

Each trip is anonymized and includes:

  • Bike number
  • Trip start day and time
  • Trip end day and time
  • Trip start station
  • Trip end station
  • Rider type

In some cases additional data are available: Gender, Year of birth.

Additional data sources

Weather: for cities in the US we can get the weather data at NOAA (Quality Controlled Local Climatological Data): precipitation, wind, temperature, humidity, pressure.

Maps: the ESRI shapefile descriptions of maps can be found using Google: Boston, Bay Area Cities, New York, Pittsburgh.

Large temporal and spatial network data.

There have been some contests on the analysis of bike sharing data, and some interesting observations were presented. Some blogs and papers have also been written on this topic. In December 2016 the query "bike sharing system*" returned 100 hits in WoS.

Analyses

  • Different overall distributions: Pitts; Bay; Boston; NYC BSS
  • Impact of weather: temperature (day/night, winter), precipitation.
  • Cycles: year (temperature), week (working days/weekend), day (hours, parts of the day): week; days in a week
  • Other factors: subscriber/customer, trip duration, gender, rider’s age, speed, elevation: age
  • The moves of bikes among stations made by the system can be recognized as those cases where a bike’s next trip started at a different station from the one where its previous trip ended.
  • Arrivals/departures; Boston; Changes
  • Prediction: SF Bay Area: count prediction
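The rule for recognizing bike moves made by the system can be sketched in R. This is only a minimal illustration on a toy trips table: the column names (bike, start_time, from, to) and all values are assumptions, not the names used in any of the listed data sets.

```r
# Toy trips table; column names and values are hypothetical.
trips <- data.frame(
  bike       = c(1, 1, 1, 2, 2),
  start_time = c(10, 20, 30, 15, 25),   # any sortable time stamp
  from       = c("A", "B", "D", "A", "C"),
  to         = c("B", "C", "A", "C", "B"),
  stringsAsFactors = FALSE
)

# Order by bike and start time, so that neighbouring rows of one bike
# are consecutive trips of that bike.
trips <- trips[order(trips$bike, trips$start_time), ]
n <- nrow(trips)

# Compare the end station of each trip with the start station of the
# next trip of the same bike.
same_bike <- trips$bike[-n] == trips$bike[-1]
moved     <- same_bike & (trips$to[-n] != trips$from[-1])

# A detected move goes from the previous drop-off station
# to the next pick-up station.
moves <- data.frame(
  bike = trips$bike[-n][moved],
  from = trips$to[-n][moved],
  to   = trips$from[-1][moved],
  stringsAsFactors = FALSE
)
print(moves)
```

In this toy table only bike 1 is moved, from station C to station D.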

We find two blog posts especially interesting: Todd W. Schneider, A Tale of Twenty-Two Million Citi Bike Rides: Analyzing the NYC Bike Share System, and Jackson Whitmore, What’s happening with Healthy Ride?, April 2016. In the following slides we present some results from them.

Year / Winter

by Todd W. Schneider

Working days / Weekend

by Todd W. Schneider

Subscribers / Customers

by Jackson Whitmore

Bike sharing data and networks

The bike sharing data can be viewed as a spatial and temporal network:

  • Nodes – stations: name, location, capacity, (state)
  • Links – trips: from, to, start time, finish time, bike’s id, rider type, gender, age

From this basic network we can construct several derived networks.

In most systems the data about nodes are static, that is, fixed for a longer period of time. It would be possible to collect these data using feeds.

Selecting an appropriate granulation (5 min, 15 min, 1 hour, part of a day, day, week, month, quarter, year) and some restrictions (rider type, gender, age, . . . ) we get the corresponding frequency distributions in nodes and on links.
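To illustrate granulation and restrictions, the following sketch counts the departures from one station per half hour of the day. It is a minimal example under stated assumptions: the time stamps, the station labels and the rider types are made up, and the column names are hypothetical.

```r
# Hypothetical trips: start time, start station and rider type.
trips <- data.frame(
  start = as.POSIXct(c("2016-05-02 08:05:00", "2016-05-02 08:20:00",
                       "2016-05-02 08:40:00", "2016-05-03 17:10:00",
                       "2016-05-03 08:10:00"), tz = "UTC"),
  from  = c("A", "A", "A", "A", "B"),
  rider = c("Subscriber", "Subscriber", "Customer", "Subscriber", "Subscriber"),
  stringsAsFactors = FALSE
)

granulation <- 30   # minutes per interval, i.e. 48 intervals per day

# Minutes since midnight, then the index of the interval (1..48).
lt      <- as.POSIXlt(trips$start, tz = "UTC")
minutes <- lt$hour * 60 + lt$min
slot    <- minutes %/% granulation + 1

# Restriction: departures of subscribers from station A.
sel <- trips$from == "A" & trips$rider == "Subscriber"

# Frequency distribution over all intervals of the day,
# including the empty ones.
dist <- tabulate(slot[sel], nbins = 24 * 60 / granulation)
print(which(dist > 0))
```

Changing granulation to 60 gives the hourly distribution; changing sel gives the distribution for another restriction.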

Symbolic networks

Symbolic data analysis (SDA) is an extension of standard data analysis where symbolic data tables are used as input and symbolic objects are produced as a result. The data units are called symbolic since they are more complex than standard ones: they not only contain values or categories, but also include internal variation and structure. SDA was proposed by Edwin Diday in the 1980s (see book).

Assigning distributions to nodes and links we get a symbolic network.

There are different distributions on links:

  • departures: # of trips starting in the selected time interval,
  • activity: # of trips active in the selected time interval,
  • duration: # of trips with duration in the selected time interval,
  • etc.

and in nodes, for example:

  • departures: the sum of the link distributions for incident links,
  • imbalance,
  • etc.
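The difference between the departures, activity and duration distributions on a link can be sketched as follows. The three trips are invented, times are given in minutes since midnight, and the 10-minute duration classes are an arbitrary choice for this example; a trip is treated as active in every half hour from the one in which it starts to the one in which it finishes.

```r
# Hypothetical trips on one link, in minutes since midnight.
start  <- c(8 * 60 + 5, 8 * 60 + 25, 9 * 60 + 10)
finish <- c(8 * 60 + 20, 8 * 60 + 50, 9 * 60 + 25)

slots <- 48                    # half-hour intervals in a day
first <- start  %/% 30 + 1     # interval in which the trip starts
last  <- finish %/% 30 + 1     # interval in which the trip finishes

# departures: a trip is counted once, in the interval where it starts
departures <- tabulate(first, nbins = slots)

# activity: a trip is counted in every interval from its start to its finish
activity <- integer(slots)
for (i in seq_along(first)) {
  k <- first[i]:last[i]
  activity[k] <- activity[k] + 1L
}

# duration: trips counted by their length, here in 10-minute classes
duration <- table(cut(finish - start, breaks = c(0, 10, 20, 30)))
print(duration)
```

The second trip starts in one half hour and finishes in the next, so it contributes once to departures but twice to activity.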

Our analysis

NY Citi Bike: one year of data, from October 2015 to September 2016; 13266296 trips, 678 stations. The Citi Bike system had an expansion in August 2015. We constructed a departures network with daily distributions at half-hour granulation. First we looked for extreme elements (links or nodes).

In a selected time interval:

flow(u, v) = # of trips starting in a node u and finishing in a node v
out(v) = # of trips starting in a node v
in(v) = # of trips finishing in a node v
flow(u, v; k) = # of trips starting in a node u in the k-th half hour and finishing in a node v
. . .
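These counts can be computed with base R tables. The sketch below is a toy illustration, not the code used for the analysis: the five trips and the station labels are made up, and since in is a reserved word in R the vectors are named out_v and in_v.

```r
# Hypothetical trips: start station, end station,
# start time in minutes since midnight.
from  <- c("A", "A", "B", "A", "C")
to    <- c("B", "B", "A", "C", "A")
start <- c(485, 500, 1030, 1040, 490)

stations <- sort(unique(c(from, to)))
u <- factor(from, levels = stations)
v <- factor(to,   levels = stations)

# flow(u, v): number of trips from u to v, as a station-by-station table
flow <- table(u, v)

# out(v) and in(v): row and column sums of the flow table
out_v <- rowSums(flow)
in_v  <- colSums(flow)

# flow(u, v; k): the same counts split by the half hour k = 1..48
# in which the trip starts
k <- factor(start %/% 30 + 1, levels = 1:48)
flow_k <- table(u, v, k)

print(flow["A", "B"])       # trips from A to B
print(flow_k["A", "B", ])   # their distribution over the 48 half hours
```

Each slice flow_k[u, v, ] is the departures distribution assigned to the link (u, v).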

The most active stations / Top 3

activity(v) = out(v) + in(v)

 n  station                     trips
 1  W 41 St & 8 Ave            281996
 2  Nassau Ave & Russell St    203855
 3  W 20 St & 8 Ave            200629
 4  W 16 St & The High Line    196414
 5  W 22 St & 8 Ave            188394
 6  W 45 St & 8 Ave            170593
 7  W 38 St & 8 Ave            164378
 8  E 14 St & Avenue B         163962
 9  E 53 St & Madison Ave      162828
10  W 53 St & 10 Ave           161931

Imbalance

In a selected time interval:

diff(v) = out(v) - in(v)
fDist(v) = \sum_{k=1}^{48} |out(v; k) - in(v; k)|

where k runs over the 48 half hours of the day.

 n  station                      out     in    diff  station                   fDist
 1  5 Ave & E 73 St            60524  34559   25965  5 Ave & E 73 St           84703
 2  Van Vorst Park             29962  14920   15042  Fulton St & William St    66453
 3  8 Ave & W 33 St            57127  67592  -10465  E 75 St & 3 Ave           51297
 4  W Broadway & Spring St     15217  23544   -8327  W 22 St & 8 Ave           50530
 5  E 51 St & 1 Ave            72651  80783   -8132  E 33 St & 2 Ave           47893
 6  E 75 St & 3 Ave            56302  48891    7411  Water - Whitehall Plaza   45554
 7  Catherine St & Monroe St   36858  29455    7403  E 51 St & 1 Ave           34086
 8  E 45 St & 3 Ave            48116  41601    6515  W 37 St & 10 Ave          33865
 9  Water - Whitehall Plaza    71364  65638    5726  Cambridge Pl & Gates Ave  32562
10  6 Ave & Canal St           23473  28451   -4978  E 16 St & Irving Pl       30293
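The two imbalance measures differ in what they detect, which a small computation makes visible. The sketch below uses invented half-hourly counts for a single station; the vector names out_k and in_k are hypothetical.

```r
# Hypothetical half-hourly counts for one station v (48 values each).
out_k <- integer(48)
in_k  <- integer(48)
out_k[c(17, 18)] <- c(30L, 20L)   # morning departures
in_k[c(35, 36)]  <- c(25L, 20L)   # evening arrivals
in_k[17]         <- 5L            # a few morning arrivals

# diff(v) = out(v) - in(v): net surplus of departures over the whole interval
diff_v <- sum(out_k) - sum(in_k)

# fDist(v) = sum over k of |out(v; k) - in(v; k)|:
# imbalance accumulated half hour by half hour
fdist_v <- sum(abs(out_k - in_k))

cat("diff =", diff_v, " fDist =", fdist_v, "\n")
```

Here diff is 0 while fDist is 90: the station is balanced over the whole day, but bikes leave in the morning and return in the evening, which only fDist registers.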

Imbalance / diff

Top 4

Imbalance / fDist

Top 4

The largest flows / Top 6

Clamix – Clustering modal valued symbolic data

Two clustering methods for symbolic objects are implemented: the adapted leaders method and the adapted agglomerative hierarchical clustering (Ward’s) method.

Clamix: R-forge, doc

Paper: V. Batagelj, N. Kejžar, and S. Korenjak-Černe. Clustering of Modal Valued Symbolic Data. ArXiv e-prints, 1507.06683, July 2015.

We clustered the set of 589 links with flow at least 1250. This gives us typical flow distribution shapes.
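As a rough illustration of clustering links by the shape of their flow distributions, the sketch below uses base R hclust with Ward's criterion on Euclidean distances between normalized distributions. This is a stand-in technique, not the adapted methods implemented in Clamix; the four links, their counts and the threshold of 100 are invented for the example.

```r
# Hypothetical daily flow distributions of four links,
# shortened to 6 intervals.
counts <- rbind(
  l1 = c(40, 30,  5,  5, 10, 10),   # morning peak
  l2 = c(80, 65, 10, 10, 20, 15),   # morning peak, larger flow
  l3 = c( 5, 10,  5, 10, 35, 35),   # evening peak
  l4 = c(10, 25, 10, 15, 70, 70)    # evening peak, larger flow
)

# Keep only links with a large enough flow
# (1250 in the slides; 100 for this toy data).
counts <- counts[rowSums(counts) >= 100, , drop = FALSE]

# Normalize each link to relative frequencies, so that only the shape
# of the distribution matters.
shapes <- counts / rowSums(counts)

# Standard Ward clustering on Euclidean distances
# (an assumption; not Clamix's adapted method).
hc <- hclust(dist(shapes), method = "ward.D2")
clusters <- cutree(hc, k = 2)
print(clusters)
```

The two morning-peak links fall into one cluster and the two evening-peak links into the other, regardless of the size of their flows.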

Clustering of flows

[Figure: Cluster Dendrogram of the 589 links; vertical axis: Height, from 0e+00 to 6e-04]

Clustering of flows / 7 clusters

Conclusions

  • bike sharing data are an interesting type of data,
  • prepare some extended data sets; get or collect the dynamic stations data,
  • additional analyses:
      • other symbolic objects: nodes (in and out distribution), links (subscriber, customer distribution), . . .
      • stability of distribution shape through time
      • . . .
  • compare bike sharing systems
  • Taxi (Yellow and Green) and Uber data are available for New York.

References

1 V. Batagelj, N. Kejžar, and S. Korenjak-Černe. Clustering of Modal Valued Symbolic Data. ArXiv e-prints, 1507.06683, July 2015.
2 Lynne Billard; Edwin Diday (14 May 2012). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons.
3 Bay Area Bike Share: San Francisco Bay Area - Kaggle challenge, Open data, challenge
4 Todd W. Schneider: A Tale of Twenty-Two Million Citi Bike Rides: Analyzing the NYC Bike Share System.
5 Jackson Whitmore: What’s happening with Healthy Ride?, April 2016.

Acknowledgments

This work was supported in part by the Slovenian Research Agency (research program P1-0294 and research projects J5-5537 and J1-5433). The first author’s attendance at the conference was partially supported by the COST Action IC1408 (CRoNoS).
