DS504/CS586: Big Data Analytics (PowerPoint presentation)

SLIDE 1


Get a handout

SLIDE 2

Welcome to DS504/CS586: Big Data Analytics - Review

Prof. Yanhua Li

Time: 6:00pm-8:50pm R
Location: AK233
Spring 2018

SLIDE 3


Next Session: Final Project Presentation

12/24 T: Submission day

– Project reports to discussion board
– Self-&-cross evaluation form to Assignment

12/26 R: Presentation Day

– Quiz 2 (I will send you sample questions soon)
– 20 min each team (including Q&A)
– Team 1, Team 2
– 10 min break
– Team 3, Team 4, Team 5

Snacks and soft drinks will be provided.

SLIDE 4

Today

1. CityLines
2. Review

– Key topics and techniques discussed in the semester
– Future opportunities

Big data analytics
Urban Computing

– 10 min break 7:20-7:30PM

2. Team 5 presentation and discussion: 7:30-8:30PM
3. Course evaluation 8:30PM-8:45PM
4. Finish at 8:45PM

– (last week we finished 5 minutes late.)

SLIDE 5

CityLines: Hybrid Hub-and-Spoke System for Urban Transportation Services

Yanhua Li
Assistant Professor
Computer Science Department
Worcester Polytechnic Institute

SLIDE 6

Global Urbanization and Transportation

SLIDE 7

Today’s Urban Transit Services

Private Transits and Public Transits

Affordable ride-sharing services reduce personal vehicle usage

SLIDE 8

Limitations of Today’s Public Transits

Fixed Routes and Time Tables

– Transit supply does not match dynamic demand

Large number of stops and transfers

– Long travel time

SLIDE 9

Limitations of Today’s Private Transits

Expensive

– High operation cost, due to the exclusive service

Service delay

– On-demand services
– Delay after the service request

Transit modes run independently

– Lack of inter-transit coordination

SLIDE 10

Future Urban Transit Services

Today’s Transits

Private Transits: point-to-point mode

– High cost
– Service delay

Public Transits: fixed route mode

– Fixed routes
– Fixed timetable
– Long travel time

No Inter-Transit Coordination

Future Smart Transit

Dynamic services

– Real time trip demands

Short travel time

– as private transits

Low cost

– as public transits

SLIDE 11

Hub-and-Spoke Transit Mode

Examples: airline routes, package delivery systems

Traffic moves along spokes connected via a few hubs

– Lower operation cost (than private), thus lower cost
– Fewer stops/stations (than public), thus shorter transit time

A promising transit mode. How do we design it in urban areas?

SLIDE 12

CityLines Transit System

CityLines: a Hybrid Hub-and-Spoke Transit Mode

– point-to-point mode: high demand source-destination pairs
– hub-and-spoke mode: low demand source-destination pairs

[Figure: sources S1 to S3 and destinations D1 to D4 under the private transit point-to-point model vs. the CityLines hybrid hub-and-spoke mode]

Reduce routes, thus operation cost

SLIDE 13

CityLines Transit System

CityLines: a Hybrid Hub-and-Spoke Transit Mode

– point-to-point mode: high demand source-destination pairs
– hub-and-spoke mode: low demand source-destination pairs

[Figure: sources S1 to S3 and destinations D1 to D4 under the public transit fixed-route model vs. the CityLines hybrid hub-and-spoke mode]

Reduce stops/stations, thus travel time

SLIDE 14

CityLines Transit System Design

SLIDE 15

Input Data Description

Trip Demand Data (in Shenzhen):
Source: Taxi GPS, Bus, Subway Transactions
Duration: March 1st–30th, 2014.
Size: 19,428,453 trips in all transit modes
Format: Taxi ID, time, latitude, longitude, load
Road Map, Subway Lines, and Bus routes:

SLIDE 16

Stage 1: Road Map Gridding

Given a side length s = 0.01°
1,508 grids are obtained
1,018 grids are strongly connected by road network
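
The gridding step can be sketched as follows. This is a minimal illustration, assuming a plain latitude/longitude grid anchored at the south-west corner of a bounding box; the function name `grid_cell` and the corner coordinates are hypothetical and not taken from the slides.

```python
import math

def grid_cell(lat, lon, lat0, lon0, s=0.01):
    """Return the (row, col) of the grid cell that contains (lat, lon).

    lat0, lon0: south-west corner of the gridded area (an assumption).
    s: side length of a grid cell in degrees (0.01 on the slide).
    """
    row = math.floor((lat - lat0) / s)
    col = math.floor((lon - lon0) / s)
    return row, col

# Hypothetical bounding-box corner, chosen only for illustration.
LAT0, LON0 = 22.45, 113.75
cell = grid_cell(22.543, 114.057, LAT0, LON0)
origin_cell = grid_cell(LAT0, LON0, LAT0, LON0)
```

Keeping only the cells that are strongly connected by the road network (1,018 of 1,508 on the slide) would need the road graph, which is outside this sketch.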

SLIDE 17

Stage 2: Trip demand aggregation

Trip demand: <src, dst, t>
Aggregated trip demand <src_grid, dst_grid, t>
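
A minimal sketch of the aggregation, assuming each raw trip is a tuple of source and destination coordinates plus an hour of day, and that t is an hourly time slot. The tuple layout and the helper names `to_grid` and `aggregate` are illustrative assumptions.

```python
from collections import Counter

def to_grid(lat, lon, s=0.01):
    # Snap a coordinate to a grid index using side length s (degrees).
    return (int(lat // s), int(lon // s))

def aggregate(trips, s=0.01):
    """Count trips per <src_grid, dst_grid, t>.

    trips: iterable of (src_lat, src_lon, dst_lat, dst_lon, hour).
    """
    demand = Counter()
    for slat, slon, dlat, dlon, hour in trips:
        demand[(to_grid(slat, slon, s), to_grid(dlat, dlon, s), hour)] += 1
    return demand

# Made-up trips: the first two fall in the same source and destination grids.
trips = [
    (22.5431, 114.0572, 22.6012, 114.1234, 7),
    (22.5438, 114.0579, 22.6019, 114.1239, 7),
    (22.5431, 114.0572, 22.6012, 114.1234, 8),
]
demand = aggregate(trips)
```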

[Figure: the spatial distribution of trip demand sources, 6am to 9am. Legend: no demand, low demand, medium demand, high demand]

SLIDE 18

Stage 3: Optimal Hybrid Hub-and-Spoke Planning

Problem definition:
Given: n spokes, a set of K trip demands, a budget of M point-to-point paths, and L hub stations
How to plan the hybrid hub-and-spoke network?
Goal: Minimize the average travel time
Constraints: Up to one stop (at a hub) per trip

SLIDE 19

Stage 3: Optimal Hybrid Hub-and-Spoke Planning

Challenges:
A large number of hub candidates: all spokes
n=1,018 spokes; L=10 hubs;
Joint modeling of point-to-point and hub-and-spoke
Two Components:
Optimal Hub Selection (OHS): Find L+M hub candidates
Goal: “Cover” the most shortest paths of trip demands
Optimal Trip Assignment (OTA): Hub-spoke net with L hubs
Goal: Minimize the average travel time
(introducing a virtual hub to model the joint optimization)

SLIDE 20

Stage 3-I: Optimal Hub Selection (OHS)

Problem Definition:
Find M+L hub candidates
Goal: “Cover” the most trip demands
A hub h covers a trip demand <src, dst, t> if h is on the shortest path from src to dst.

[Figure: example network with sources S1 to S3 and destinations D1 to D4; L=2, M=1, L+M=3]

SLIDE 21

Stage 3-I: Optimal Hub Selection (OHS)

Maximum Coverage Problem
NP-Hard Problem
Approximation algorithm with ratio 1-1/e [1]

[1] D. S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In Approximation algorithms for NP-hard problems. PWS Publishing Co., 1996.
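
For reference, the standard greedy algorithm for maximum coverage achieves the 1-1/e ratio: repeatedly pick the candidate that covers the most still-uncovered demands. The sketch below assumes each trip demand comes with one precomputed shortest path given as a list of grid IDs; whether CityLines uses exactly this greedy variant is an assumption, and `greedy_hub_selection` is a hypothetical name.

```python
def greedy_hub_selection(paths, budget):
    """Pick up to `budget` hubs greedily.

    paths: list of node sequences, one shortest path per trip demand.
    Returns (hubs, number of demands covered).
    """
    uncovered = set(range(len(paths)))
    # Map each candidate node to the demands whose shortest path contains it.
    covers = {}
    for i, path in enumerate(paths):
        for node in path:
            covers.setdefault(node, set()).add(i)
    hubs = []
    for _ in range(budget):
        best = max(covers, key=lambda h: len(covers[h] & uncovered), default=None)
        if best is None or not (covers[best] & uncovered):
            break  # nothing left to gain
        hubs.append(best)
        uncovered -= covers[best]
    return hubs, len(paths) - len(uncovered)

# Toy shortest paths (made-up node names).
paths = [["a", "h", "b"], ["c", "h", "d"], ["e", "g", "f"], ["e", "g", "b"]]
hubs, covered = greedy_hub_selection(paths, budget=2)
```

In the CityLines setting the budget would be L+M.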

SLIDE 22

Stage 3-II: Optimal Trip Assignment

p-Hub problem for hub-and-spoke model
p-Hub problem with L hubs and 1 virtual hub

LP relaxation based approximation solution [2]

[2] A. T. Ernst and M. Krishnamoorthy. Exact and heuristic algorithms for the uncapacitated multiple allocation p-hub median problem. European Journal of Operational Research, 1998.
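
The objective being minimized can be sketched as below. This only evaluates the average travel time for a fixed set of hubs and direct paths; it is not the LP relaxation from [2]. Treating a direct point-to-point path as routing through the virtual hub is an interpretation of the slide, and all names and travel times are illustrative assumptions.

```python
def average_travel_time(demands, time, hubs, direct_pairs):
    """Average travel time under a hybrid hub-and-spoke plan.

    demands: dict {(src, dst): number of trips}
    time: dict {(a, b): travel time in minutes}
    hubs: list of hub nodes
    direct_pairs: set of (src, dst) pairs served point-to-point
    """
    total_time = 0.0
    total_trips = 0
    for (src, dst), count in demands.items():
        if (src, dst) in direct_pairs:
            t = time[(src, dst)]  # direct path (modeled via the virtual hub)
        else:
            # At most one stop, at the hub that gives the shortest trip.
            t = min(time[(src, h)] + time[(h, dst)] for h in hubs)
        total_time += t * count
        total_trips += count
    return total_time / total_trips

# Made-up example: one high-demand pair served directly, one via hub "h".
demands = {("s1", "d1"): 3, ("s2", "d2"): 1}
time = {("s1", "d1"): 10, ("s2", "h"): 5, ("h", "d2"): 7}
avg = average_travel_time(demands, time, hubs=["h"], direct_pairs={("s1", "d1")})
```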

SLIDE 23

Comparison with Public and Private Transits

Average travel time (min) over all trip demands:

– ~42 mins reduction over public transits
– Slightly higher (4 mins) than private transits

Aggregation level (average # passengers per trip segment):

– Slightly less (8) than public transits
– ~23 more than private transits

SLIDE 24

Case Studies: Point-to-point Model

SLIDE 25

Case Studies: Hub-and-spoke

SLIDE 26

Case Studies: Hybrid Hub-and-Spoke

SLIDE 27

Questions?

SLIDE 28

Introduction

What is “Big Data”?


SLIDE 29

Big Data Analytics: techniques and tools for managing, analyzing, and extracting knowledge from “big data”

SLIDE 30

CS586/DS504-2018 Spring

1. Data Acquisition & Measurement: representative data collection, sampling techniques
2. Data Preprocessing/Cleaning: error correction, map-matching
3. Data Management: indexing, query processing
4. Big Data Mining: graph mining, data clustering, recommender systems
5. Applications: urban computing, social network analysis, networking

Sampling and index, Clustering

4. K-means, DBSCAN
4. BFR, DENCLUE
4. Trajectory Clustering
5. Bike Lane Planning
1. Graph Mining
3. Index, Query
4. Data Collection
2. Map-Matching
4. Recommender Systems

More techniques

SLIDE 31

Big Data Mining Topics

1. Graph Mining: Graph Sampling, Node Importance Ranking, Facebook/Social graph estimation, Social influence, Topic sensitive PageRank
2. Clustering: Hierarchical, K-means, BFR, DBScan, DENCLUE, Trajectory clustering
3. Recommender Systems: Content-Based, Collaborative Filtering (User-User Based, Item-Item Based), Location-based recommender systems, Personalized Geo-Social Recommendation
4. Crowdfunding and Crowdturfing (Guest Lecture)
5. Applications (CityLines, bike lane planning, etc.)

SLIDE 32

Roadmap

1. Sampling & Indexing

– Random prefix/region/zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

2. Clustering

– Hierarchical
– K-means, BFR
– DBScan, DENCLUE

3. Recommender System, Map-Matching, etc
4. Applications

– Social networks
– Location based services
– Urban computing
– and more

SLIDE 33

Sampling Techniques to Count Population

German Tank Problem

– Panther tanks, 1943, World War II
– Estimate the number of German tanks (N)
– The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement
– m: the maximum serial number observed
– k: total number of tanks observed
– Estimator: the sample maximum plus the average gap between observations in the sample

$\hat{N} = m(1 + k^{-1}) - 1$
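
A short sketch of the estimator above, assuming serial numbers start at 1 and the observed serials are distinct. The function name and the sample values are made up for illustration.

```python
def german_tank_estimate(serials):
    """Estimate the population size N from observed serial numbers."""
    m = max(serials)   # largest serial number observed
    k = len(serials)   # number of tanks observed
    return m * (1 + 1 / k) - 1

# Made-up sample of four observed serial numbers.
estimate = german_tank_estimate([19, 40, 42, 60])
```

Here m = 60 and k = 4, so the estimate is 60 × 1.25 − 1 = 74.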

SLIDE 34

Sampling Techniques to Count Population

Mark and recapture: a method commonly used in ecology to estimate an animal population’s size N.

Step 1: A portion of the population, K, is captured, marked, and released.

Step 2: Later, another portion, n, is captured, and the number of marked individuals within the sample, k, is counted.

Estimation: $\hat{N} = \frac{Kn}{k}$
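
The estimate above (often called the Lincoln-Petersen estimator) can be computed directly. The sketch assumes at least one marked individual is recaptured; the function name and the numbers are illustrative.

```python
def mark_recapture_estimate(K, n, k):
    """Estimate population size N = K * n / k.

    K: individuals marked in the first capture
    n: individuals caught in the second capture
    k: marked individuals among the second capture
    """
    if k == 0:
        raise ValueError("no marked individuals recaptured; estimate undefined")
    return K * n / k

# Made-up example: 100 marked, 50 recaptured, 10 of them marked.
estimate = mark_recapture_estimate(K=100, n=50, k=10)
```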

SLIDE 35

Sampling Big Data

1.1 Random sampling (uniform & independent)

– vertex sampling
– edge sampling

1.2 Crawling

– BFS sampling
– random walk sampling

SLIDE 36

1.1 Random Vertex Sampling & Index

One-dimension Data

– YouTube: Random Prefix Sampling
– Index structure: B-Tree, List Index

Two Dimension Data (Spatial Data)

– Google map/Foursquare: Random Region Sampling/Random Region Zoom-in
– Index structure: Grid-based / Quad Tree / R-Tree

Three Dimension Data (spatio-temporal data)

– Trajectory sampling: Random index sampling
– Index structure (combinations): B-Tree+Quad-tree, 3-D R-tree

SLIDE 37

Full B-Tree Structure

SLIDE 38

Grid-based Spatial Indexing

[Figure: grid g1 contains points p1 and p3; grid g2 contains point p4. The index maps g1 to (p1, p3) and g2 to (p4).]

Indexing

– Partition the space into disjoint and uniform grids
– Build an index between each grid and the points in the grid
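
The two indexing steps can be sketched with a dictionary keyed by grid cell. The point names follow the slide's p1, p3, p4, but their coordinates, the cell size, and the function names are made up for illustration.

```python
from collections import defaultdict

def build_grid_index(points, cell_size):
    """Map each grid cell (col, row) to the IDs of the points inside it."""
    index = defaultdict(list)
    for pid, (x, y) in points.items():
        index[(int(x // cell_size), int(y // cell_size))].append(pid)
    return index

def query_cell(index, x, y, cell_size):
    """Return the IDs of the points in the cell that contains (x, y)."""
    return index.get((int(x // cell_size), int(y // cell_size)), [])

# Hypothetical coordinates: p1 and p3 share a cell, p4 is in the next one.
points = {"p1": (0.2, 0.3), "p3": (0.7, 0.9), "p4": (1.5, 0.4)}
index = build_grid_index(points, cell_size=1.0)
same_cell = query_cell(index, 0.5, 0.5, cell_size=1.0)
```

A lookup touches only one cell, which is the appeal of a uniform grid; skewed data can still overload a few cells, which is one motivation for the quad-tree.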

SLIDE 39


Quad-Tree

Indexing

– Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
– Each non-leaf node divides its region into four equal-sized quadrants.
– Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).

[Figure: example quad-tree with quadrants labeled 0 to 3 and sub-quadrants such as 00, 02, 03, 30, 31, 32, 33]
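
A minimal point quad-tree sketch following the three properties above, with a leaf capacity of 1 as in the example. It assumes square regions and distinct points (duplicate points would split forever); the class and method names are illustrative.

```python
class QuadTree:
    """Point quad-tree over the square [x, x+size) x [y, y+size)."""

    def __init__(self, x, y, size, capacity=1):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.points = []      # points held while this node is a leaf
        self.children = None  # four sub-quadrants once the node splits

    def insert(self, px, py):
        if self.children is None:
            self.points.append((px, py))
            if len(self.points) > self.capacity:
                self._split()
            return
        self._child_for(px, py).insert(px, py)

    def _split(self):
        # Divide the region into four equal-sized quadrants.
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pts, self.points = self.points, []
        for px, py in pts:
            self._child_for(px, py).insert(px, py)

    def _child_for(self, px, py):
        half = self.size / 2
        col = 1 if px >= self.x + half else 0
        row = 1 if py >= self.y + half else 0
        return self.children[row * 2 + col]

    def depth(self):
        if self.children is None:
            return 1
        return 1 + max(child.depth() for child in self.children)

tree = QuadTree(0, 0, 8)
for point in [(1, 1), (6, 6), (1, 3)]:
    tree.insert(*point)
```

The two nearby points (1, 1) and (1, 3) force a second split, so the tree is deeper where the data is denser.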

SLIDE 40

2. Clustering
1. Hierarchical
2. K-means -> BFR
3. DBScan -> DENCLUE

SLIDE 41

Roadmap

1. Sampling & Indexing

– Random prefix/region/zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

2. Clustering

– Hierarchical
– K-means, DBScan

3. Recommender System, Map-Matching, etc
4. Applications

SLIDE 42

Content based Recommendation

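[Figure: item profiles (e.g., red, circles, triangles) are built for the items a user likes; a user profile is built from them and matched against item profiles to recommend new items]

Source: J. Leskovec, A. Rajaraman, J.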

SLIDE 43

Collaborative Filtering

– Consider user x
– Find set N of other users whose ratings are “similar” to x’s ratings
– Estimate x’s ratings based on ratings of users in N

Source: J. Leskovec, A. Rajaraman, J.
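
The three steps can be sketched as user-user collaborative filtering. This sketch assumes cosine similarity over sparse rating dictionaries and a similarity-weighted average over the nearest neighbours; the slide does not fix either choice, and the ratings are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict(ratings, x, item, n_size=2):
    """Estimate user x's rating of item from the n_size most similar raters."""
    candidates = [
        (cosine(ratings[x], r), u)
        for u, r in ratings.items()
        if u != x and item in r
    ]
    neighbours = sorted(candidates, reverse=True)[:n_size]  # the set N
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no similar user has rated the item
    return sum(sim * ratings[u][item] for sim, u in neighbours) / total

# Made-up ratings: u1 rates like x, u2 rates the opposite way.
ratings = {
    "x": {"a": 5, "b": 1},
    "u1": {"a": 5, "b": 1, "c": 4},
    "u2": {"a": 1, "b": 5, "c": 2},
}
nearest_only = predict(ratings, "x", "c", n_size=1)
both = predict(ratings, "x", "c", n_size=2)
```

With a single neighbour the prediction equals u1's rating of c; with both neighbours it is pulled toward u2's lower rating in proportion to similarity.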

SLIDE 44

Roadmap

1. Sampling & Indexing

– Random prefix/region/zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

2. Clustering

– Hierarchical
– K-means, DBScan

3. Recommender System, Anomaly detection, Map-Matching, etc

4. Applications

SLIDE 45


Class Outcomes

SLIDE 46


What is DS504/CS586 about?

We’ll learn about

– Advanced Techniques for Big Data Analytics

Large scale data sampling and estimation,
Data Cleaning,
Graph Data Mining,
Data management, clustering, etc.

– Applications with Big Data Analytics

Urban Computing
Social network analysis
Recommender system, etc.

Learning outcomes

– Understand and explain challenges and advances in the state of the art in big data analytics.
– Design, develop and fully execute a big data analytics project.
– Communicate the ideas effectively in the form of a presentation and written documents to a technical audience.

SLIDE 47

CS586/DS504-2018 Spring

1. Data Acquisition & Measurement: representative data collection, sampling techniques
2. Data Preprocessing/Cleaning: error correction, map-matching
3. Data Management: indexing, query processing
4. Big Data Mining: graph mining, data clustering, recommender systems
5. Applications: urban computing, social network analysis, networking

Sampling and index, Clustering

4. K-means, DBSCAN
4. BFR, DENCLUE
4. Trajectory Clustering
5. Bike Lane Planning
1. Graph Mining
3. Index, Query
4. Data Collection
2. Map-Matching
4. Recommender Systems

More techniques

SLIDE 48

Logistics

Workload

– Focus more on critical thinking, problem solving, “heads-on/hands-on” experiences!
– Understand, formulate and solve problems
– Read and critique research papers
– Two course projects
– Oral presentations
– Team work
– Coding

SLIDE 49

Grading

– Projects (40%)

Project 1 (10%)
Project 2 (30%)

– Final reports in the discussion forum (by 11:59PM 4/24 Tue)
– Self-and-peer evaluation form for project 2 (by 11:59PM 4/24 Tue)

– Written work (30%):

Critiques + Project reports (20%)
Quiz (10%, with 5% each)

– Oral work (30%):

(Project and paper) presentations

Workload and Grading

SLIDE 50

Problems

[Figure: Problems; Data; Models and Algorithms; Data Scientist]

SLIDE 51


Want to learn more? Future Opportunities.

SLIDE 52

Urban Computing Research Group at WPI

DiDi
Mobike
JD
Yunyan
TianLai online Karaoke

SLIDE 53

Urban Computing Research Group at WPI

Human-in-Loop Urban Computing

SLIDE 54


Research opportunities are available in my group.

1. Funding support for PhD students

2. Independent Study for MS