Welcome to DS504/CS586: Big Data Analytics
-- Review
Prof. Yanhua Li
Time: 6:00pm–8:50pm R
Location: AK233
Spring 2018
Next Session: Final Project Presentation
v 4/24 T: Submission day
– Project reports to discussion board
– Self-&-cross evaluation form to Assignment
v 4/26 R: Presentation Day
– Quiz 2 (I will send you sample questions soon)
– 20 min each team (including Q&A)
– Team 1
– Team 2
– 10 min break
– Team 3
– Team 4
– Team 5
v Snacks and soft drinks will be provided.
Today
- 1. CityLines
- 2. Review
– Key topics and techniques discussed in the semester
– Future opportunities
- Big data analytics
- Urban Computing
– 10 min break 7:20-7:30PM
- 2. Team 5 presentation and discussion: 7:30-8:30PM
- 3. Course evaluation 8:30PM-8:45PM
- 4. Finish at 8:45PM
– (last week we finished 5 minutes late.)
CityLines: Hybrid Hub-and-Spoke System for Urban Transportation Services
Yanhua Li Assistant Professor Computer Science Department Worcester Polytechnic Institute
Global Urbanization and Transportation
Today’s Urban Transit Services
Private Transit Public Transits
affordable ride-sharing services reduce the personal vehicle usage
Limitations of Today’s Public Transits
- Fixed Routes and Time Tables
– Transit supply does not match dynamic demand
- Large number of stops and transfers
– Long travel time
Limitations of Today’s Private Transits
- Expensive
– High operation cost
– Due to the exclusive service
- Service delay
– On-demand services
– Delay after the service request
- Transit modes run independently
– Lack of inter-transit coordination
Today’s Transits
- Private Transits
– High cost
– Service delay
- Public Transits
– Fixed routes
– Fixed timetable
– Long travel time
- No inter-transit coordination
Future Smart Transit
- Dynamic services
– Real-time trip demands
- Short travel time
– as private transits
- Low cost
– as public transits
Future Urban Transit Services
Private Transits: point-to-point mode
Public Transits: fixed-route mode
Hub-and-Spoke Transit Mode
Airlines routes
- Traffic moves along spokes connected via a few hubs
– Lower operation cost (than private), thus lower cost
– Fewer stops/stations (than public), thus lower transit time
- A promising transit mode, but how to design it in urban areas?
Package delivery system
CityLines Transit System
- CityLines: a Hybrid Hub-and-Spoke Transit Mode
– point-to-point mode: high demand source-destination pairs
– hub-and-spoke mode: low demand source-destination pairs
[Figure: private transit (point-to-point mode) vs. CityLines (hybrid hub-and-spoke mode), with sources S1–S3 and destinations D1–D4. CityLines reduces routes, thus operation cost.]
CityLines Transit System
- CityLines: a Hybrid Hub-and-Spoke Transit Mode
– point-to-point mode: high demand source-destination pairs
– hub-and-spoke mode: low demand source-destination pairs
[Figure: public transit (fixed-route mode) vs. CityLines (hybrid hub-and-spoke mode), with sources S1–S3 and destinations D1–D4. CityLines reduces stops/stations, thus travel time.]
CityLines Transit System Design
Input Data Description
- Trip Demand Data (in Shenzhen):
- Source: Taxi GPS, Bus, Subway Transactions
- Duration: March 1st–30th, 2014.
- Size: 19,428,453 trips in all transit modes
- Format: Taxi ID, time, latitude, longitude, load
- Road Map, Subway Lines, and Bus routes:
Stage 1: Road Map Gridding
- Given a side length s = 0.01°
- 1,508 grids are obtained
- 1,018 grids are strongly connected by road network
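The gridding step can be sketched as below. This is only an illustration: the slide gives the side length s = 0.01° but not the map origin, so `lat0` and `lon0` (a south-west corner) and the function name `grid_cell` are assumptions.

```python
import math

def grid_cell(lat, lon, lat0, lon0, s=0.01):
    """Return the (row, col) of the square grid cell containing (lat, lon).

    lat0/lon0 is an assumed south-west corner of the map; s is the side
    length in degrees (0.01 on the slide).
    """
    row = math.floor((lat - lat0) / s)
    col = math.floor((lon - lon0) / s)
    return row, col

# Hypothetical origin and GPS point, chosen only for illustration.
cell = grid_cell(22.545, 114.055, lat0=22.44, lon0=113.75)
```

Keeping only the cells that are strongly connected by the road network (1,018 of 1,508 on the slide) would need the road map itself and is not shown here.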
Stage 2: Trip demand aggregation
- Trip demand: <src, dst, t>
- Aggregated trip demand <src_grid, dst_grid, t>
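A minimal sketch of the aggregation, assuming each trip is given as (src_lat, src_lon, dst_lat, dst_lon, hour) and that time is bucketed into 3-hour slots. The map origin, the trip tuple layout, and the slot length are all assumptions made for this example; the slides only state the <src_grid, dst_grid, t> output form.

```python
import math
from collections import Counter

LAT0, LON0 = 22.44, 113.75   # assumed south-west corner of the map
S = 0.01                     # grid side length in degrees (Stage 1)

def to_grid(lat, lon):
    """Map a GPS point to its (row, col) grid cell."""
    return (math.floor((lat - LAT0) / S), math.floor((lon - LON0) / S))

def aggregate(trips, slot_hours=3):
    """Count trips per <src_grid, dst_grid, time slot>."""
    demand = Counter()
    for src_lat, src_lon, dst_lat, dst_lon, hour in trips:
        key = (to_grid(src_lat, src_lon),
               to_grid(dst_lat, dst_lon),
               hour // slot_hours)
        demand[key] += 1
    return demand

# Three made-up trips: two fall in the same cells and the same slot.
trips = [
    (22.545, 114.055, 22.605, 114.125, 6),
    (22.546, 114.056, 22.606, 114.126, 8),
    (22.545, 114.055, 22.605, 114.125, 9),
]
demand = aggregate(trips)
```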
[Figure: the spatial distribution of trip demand sources, 6am to 9am. Legend: no demand, low demand, medium demand, high demand.]
Stage 3: Optimal Hybrid Hub-and-Spoke Planning
- Problem definition:
- Given: n spokes, a set of K trip demands, a budget of M point-to-point paths, L hub stations
- How to plan the hybrid hub-and-spoke network?
- Goal: Minimize the average travel time
- Constraints: Up to one-stop (at a hub) per trip
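To make the objective concrete, the sketch below evaluates the average travel time of a given plan under the one-stop constraint. It only scores a plan and does not search for the optimal one. The function name, the dictionary-based inputs, and the omission of any transfer delay at the hub are assumptions for this example.

```python
def average_travel_time(demands, travel_time, p2p_pairs, hubs):
    """Demand-weighted average travel time of a hybrid plan.

    demands:     {(src, dst): number of trips}
    travel_time: {(a, b): minutes from a to b}   (assumed to be given)
    p2p_pairs:   set of (src, dst) pairs served point-to-point
    hubs:        list of hub spokes; other trips make one stop at a hub
    """
    total_time = 0.0
    total_trips = 0
    for (src, dst), count in demands.items():
        if (src, dst) in p2p_pairs:
            t = travel_time[(src, dst)]
        else:
            # Up to one stop: route through the hub that is fastest.
            t = min(travel_time[(src, h)] + travel_time[(h, dst)] for h in hubs)
        total_time += count * t
        total_trips += count
    return total_time / total_trips

# Made-up numbers: one high-demand pair served directly, one pair via hub H.
demands = {("S1", "D1"): 10, ("S2", "D2"): 5}
travel_time = {("S1", "D1"): 20, ("S2", "H"): 10, ("H", "D2"): 15}
avg = average_travel_time(demands, travel_time, {("S1", "D1")}, ["H"])
```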
Stage 3: Optimal Hybrid Hub-and-Spoke Planning
- Challenges:
- A large number of hub candidates: all spokes
- n=1,018 spokes; L=10 hubs;
- Joint modeling of point-to-point and hub-and-spoke
- Two Components:
- Optimal Hub Selection (OHS): Find L+M hub candidates
- Goal: “Cover” the most shortest paths of trip demands
- Optimal Trip Assignment (OTA): Hub-spoke net with L hubs
- Goal: Minimize the average travel time
- (introducing a virtual hub to model the joint optimization)
Stage 3-I: Optimal Hub Selection (OHS)
- Problem Definition:
- Find M+L hub candidates
- Goal: “Cover” the most trip demands
- A hub h covers a trip demand <src, dst, t>,
- If h is on the shortest path from src to dst.
[Figure: example with sources S1–S3 and destinations D1–D4, where L=2, M=1, L+M=3.]
Stage 3-I: Optimal Hub Selection (OHS)
- Maximum Coverage Problem
- NP-Hard Problem
- Approximation algorithm with ratio 1 − 1/e [1]
[1] D. S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In Approximation algorithms for NP-hard problems. PWS Publishing Co., 1996.
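As an illustration, the standard greedy heuristic for maximum coverage is sketched below; greedy selection is known to reach the 1 − 1/e ratio for this problem, but the slides do not say which algorithm CityLines actually uses. The input `covers` (candidate hub to the set of trip demands whose shortest path passes through it) is assumed to be precomputed, and all names are hypothetical.

```python
def greedy_max_coverage(covers, budget):
    """Pick up to `budget` hubs, each time taking the hub that covers
    the most trip demands not covered so far."""
    remaining = dict(covers)
    chosen, covered = [], set()
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=lambda h: len(remaining[h] - covered))
        if not remaining[best] - covered:
            break  # no candidate adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

# Toy instance: trip demands are numbered 1..7, grids g1..g4 are candidates.
covers = {"g1": {1, 2, 3}, "g2": {3, 4}, "g3": {4, 5, 6, 7}, "g4": {1}}
chosen, covered = greedy_max_coverage(covers, budget=2)
```

Here each trip demand counts once; weighting demands by trip volume would only change the gain computed inside `max`.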
Stage 3-II: Optimal Trip Assignment
- p-Hub problem for hub-and-spoke model
- p-Hub problem with L hubs and 1 virtual hub
- LP relaxation based approximation solution [2]
[2] A. T. Ernst and M. Krishnamoorthy. Exact and heuristic algorithms for the uncapacitated multiple allocation p-hub median problem. European Journal of Operational Research, 1998.
Comparison with Public and Private Transits
- Average travel time (min) over all trip demands:
– ~42 mins reduction over public transits
– Slightly higher (4 mins) than private transits
- Aggregation level (average # passengers per trip segment):
– Slightly less (8) than public transits
– ~23 more than private transits
Case Studies: Point-to-point Model
Case Studies: Hub-and-spoke
Case Studies: Hybrid Hub-and-Spoke
Questions?
Introduction
What is “Big Data”?
Big Data Analytics: techniques and tools for managing, analyzing and extracting knowledge from “big data”
CS586/DS504-2018 Spring
- 1. Data Acquisition & Measurement
– Representative data collection: Sampling Techniques
- 2. Data Preprocessing/Cleaning
– Error Correction, Map-Matching
- 3. Data Management
– Indexing, Query Processing
- 4. Big Data Mining
– Graph Mining, Data Clustering, Recommender systems
- 5. Applications
– Urban Computing, Social Network Analysis, Networking
Sampling and index
Clustering
- 4. K-means, DBSCAN
- 4. BFR, DENCLUE
- 4. Trajectory Clustering
- 5. Bike Lane Planning
- 1. Graph Mining
- 3. Index, Query
- 4. Data Collection
- 2. Map-Matching
- 4. Recommender Systems
More techniques
Big Data Mining Topics
Topics in Big Data Mining
- 1. Graph Mining
– Graph Sampling
– Node Importance Ranking
– Facebook/Social graph estimation
– Social influence
– Topic sensitive PageRank
- 2. Clustering
– Hierarchical
– K-means, BFR
– DBScan, DENCLUE
– Trajectory clustering
- 3. Recommender Systems
– Content-Based
– Collaborative Filtering: User-User Based, Item-Item Based
– Location-based recommender sys
– Personalized Geo-Social Recom.
- 4. Crowdfunding and Crowdturfing (Guest Lec.)
- 5. Applications (CityLines, bike lane planning, etc)
Roadmap
- 1. Sampling & Indexing
– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc
- 2. Clustering
– Hierarchical
– K-means, BFR
– DBScan, DENCLUE
- 3. Recommender System, Map-Matching, etc
- 4. Applications
– Social networks
– Location based services
– Urban computing
– and more
Sampling Techniques to Count Population
v German Tank Problem
– Panther tanks, 1943, World War II
– Estimate # German tanks (N)
v The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement
– m: the maximum serial number observed
– k: total number of tanks observed
v Estimator: the sample maximum plus the average gap between observations in the sample
$\hat{N} = m\left(1 + k^{-1}\right) - 1$
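A small sketch of the estimator above, with m the largest observed serial number and k the number of observations. The function name and the sample serial numbers are made up for illustration.

```python
def german_tank_estimate(serials):
    """Estimate the population size N from observed serial numbers,
    using N_hat = m * (1 + 1/k) - 1."""
    k = len(serials)   # number of tanks observed
    m = max(serials)   # largest serial number observed
    return m * (1 + 1 / k) - 1

# Four hypothetical serial numbers: m = 60, k = 4.
estimate = german_tank_estimate([19, 40, 42, 60])
```

With m = 60 and k = 4 the estimate is 60 × 1.25 − 1 = 74, i.e. the sample maximum plus an average gap of 14.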
Sampling Techniques to Count Population
- Mark and recapture
– A method commonly used in ecology to estimate an animal population’s size N.
- Step 1: A portion of the population, K, is captured, marked, and released.
- Step 2: Later, another portion, n, is captured, and the number of marked individuals within that sample, k, is counted.
- Estimation: $\hat{N} = \frac{K n}{k}$
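The estimate can be computed directly, as in the sketch below. The function name and the example counts are assumptions; the formula is the one on the slide (often called the Lincoln-Petersen estimator).

```python
def mark_recapture_estimate(K, n, k):
    """Estimate population size N as K * n / k.

    K: individuals marked in the first capture
    n: individuals caught in the second capture
    k: marked individuals among those n
    """
    if k == 0:
        raise ValueError("no marked individuals recaptured; estimate undefined")
    return K * n / k

# Hypothetical counts: 100 marked, 60 recaptured, 12 of them marked.
estimate = mark_recapture_estimate(K=100, n=60, k=12)
```

The reasoning is that the marked fraction in the second sample, k/n, should roughly equal the marked fraction in the whole population, K/N.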
Sampling Big Data
1.1 Random sampling (uniform & independent)
1.2 Crawling
} Vertex sampling
} BFS sampling
} Random walk sampling
} Edge sampling
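Of the techniques listed, random walk sampling is the simplest to sketch. The example assumes the graph is available as an adjacency dictionary, which a real crawler would instead discover node by node; the function name, the toy graph, and the fixed seed are all illustrative.

```python
import random

def random_walk_sample(adj, start, steps, seed=0):
    """Walk `steps` hops from `start`, moving to a uniformly random
    neighbor each time, and return the sequence of visited nodes."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    node = start
    visited = [node]
    for _ in range(steps):
        neighbors = adj[node]
        if not neighbors:
            break               # dead end: nothing left to crawl
        node = rng.choice(neighbors)
        visited.append(node)
    return visited

# Toy undirected graph given as adjacency lists.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walk = random_walk_sample(adj, "a", steps=10)
```

On a connected undirected graph a long walk visits nodes roughly in proportion to their degree, so estimates built from the samples usually need re-weighting; that step is not shown.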
1.1 Random Vertex Sampling & Index
- One-Dimensional Data
– YouTube: Random Prefix Sampling
– Index structure: B-Tree, List Index
- Two-Dimensional Data (Spatial Data)
– Google map/Foursquare: Random Region Sampling / Random Region Zoom-in
– Index structure: Grid-based / Quad Tree / R-Tree
- Three-Dimensional Data (Spatio-temporal Data)
– Trajectory sampling: Random index sampling
– Index structure (combinations): B-Tree + Quad-tree, 3-D R-tree
Full B-Tree Structure
Grid-based Spatial Indexing
[Figure: example grid index, with grid g1 holding points p1 and p3 and grid g2 holding point p4.]
- Indexing
– Partition the space into disjoint and uniform grids
– Build an index between each grid and the points in the grid
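The two indexing steps can be sketched with a dictionary from grid cell to point ids. The coordinates, the cell size, and the function names are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def cell_of(x, y, cell):
    """Grid cell (column, row) containing point (x, y)."""
    return (math.floor(x / cell), math.floor(y / cell))

def build_grid_index(points, cell):
    """Index point ids by the uniform grid cell they fall in."""
    index = defaultdict(list)
    for pid, (x, y) in points.items():
        index[cell_of(x, y, cell)].append(pid)
    return index

def points_in_same_cell(index, x, y, cell):
    """Look up the ids stored in the cell containing (x, y)."""
    return index.get(cell_of(x, y, cell), [])

# Hypothetical coordinates: p1 and p3 share a cell, p4 is in the next one.
points = {"p1": (0.2, 0.3), "p3": (0.8, 0.9), "p4": (1.5, 0.4)}
index = build_grid_index(points, cell=1.0)
```

A lookup touches a single cell, so its cost depends on how many points that cell holds rather than on the size of the whole data set.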
Quad-Tree
- Indexing
– Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
– Each non-leaf node divides its region into four equal-sized quadrants.
– Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).
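A compact point quad-tree following these rules is sketched below. The half-open cell boundaries and the `max_depth` guard (which stops endless splitting when identical points are inserted) are assumptions added to keep the example well defined.

```python
class QuadTree:
    """Point quad-tree over the half-open box [x0, x1) x [y0, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=1, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity    # max points per leaf (1 as in the slide)
        self.depth = depth
        self.max_depth = max_depth  # assumed guard against duplicate points
        self.points = []
        self.children = None        # None while this node is a leaf

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False            # point lies outside this node's region
        if self.children is not None:
            return any(child.insert(x, y) for child in self.children)
        self.points.append((x, y))
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()
        return True

    def _split(self):
        """Divide the region into four equal quadrants and push points down."""
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [
            QuadTree(x0, y0, mx, my, *args), QuadTree(mx, y0, x1, my, *args),
            QuadTree(x0, my, mx, y1, *args), QuadTree(mx, my, x1, y1, *args),
        ]
        pending, self.points = self.points, []
        for px, py in pending:
            self.insert(px, py)     # now routed to the matching child

    def query(self, qx0, qy0, qx1, qy1):
        """Return all points inside the half-open query box."""
        x0, y0, x1, y1 = self.bounds
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return []               # query box misses this region entirely
        if self.children is None:
            return [(x, y) for x, y in self.points
                    if qx0 <= x < qx1 and qy0 <= y < qy1]
        found = []
        for child in self.children:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found

tree = QuadTree(0, 0, 4, 4)
for p in [(1, 1), (3, 1), (3, 3), (0.5, 0.5)]:
    tree.insert(*p)
in_lower_left = sorted(tree.query(0, 0, 2, 2))
```

Unlike the uniform grid, the tree only subdivides where points are dense, so sparse regions stay as large leaves.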
- 2. Clustering
- 1. Hierarchical
- 2. K-means -> BFR
- 3. DBScan -> DENCLUE
Roadmap
- 1. Sampling & Indexing
– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc
- 2. Clustering
– Hierarchical
– K-means, DBScan
- 3. Recommender System, Map-Matching, etc
- 4. Applications
Content based Recommendation
[Figure: content-based recommendation pipeline. Labels: likes, item profiles (red, circles, triangles), build, user profile, match, recommend.]
- J. Leskovec, A. Rajaraman, J.
Collaborative Filtering
v Consider user x
v Find set N of other users whose ratings are “similar” to x’s ratings
v Estimate x’s ratings based on ratings of users in N
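The three steps can be sketched as user-user collaborative filtering. Cosine similarity and the similarity-weighted average are common choices used here as assumptions, since the slide does not fix a similarity measure; the ratings and all names are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity of two rating dicts (missing ratings count as 0)."""
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def predict(ratings, x, item, n=2):
    """Estimate x's rating of `item` from the n most similar users
    who have rated it, weighting each rating by similarity."""
    sims = [(cosine(ratings[x], r), u)
            for u, r in ratings.items() if u != x and item in r]
    top = sorted(sims, reverse=True)[:n]   # the neighborhood N
    weight = sum(s for s, _ in top)
    if weight <= 0:
        return None                        # no usable neighbors
    return sum(s * ratings[u][item] for s, u in top) / weight

# Toy ratings: u1 agrees with x, u2 disagrees, u3 has not rated item "c".
ratings = {
    "x":  {"a": 5, "b": 1},
    "u1": {"a": 5, "b": 1, "c": 4},
    "u2": {"a": 1, "b": 5, "c": 2},
    "u3": {"a": 5, "b": 1},
}
estimate = predict(ratings, "x", "c", n=2)
```

The estimate falls between the two neighbors' ratings of "c" and closer to u1's, because u1's past ratings match x's more closely.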
Roadmap
- 1. Sampling & Indexing
– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc
- 2. Clustering
– Hierarchical
– K-means, DBScan
- 3. Recommender System, Anomaly detection, Map-Matching, etc
- 4. Applications
Class Outcomes
What is DS504/CS586 about?
v We’ll learn about
– Advanced Techniques for Big Data Analytics
- Large scale data sampling and estimation,
- Data Cleaning,
- Graph Data Mining,
- Data management, clustering, etc.
– Applications with Big Data Analytics
- Urban Computing
- Social network analysis
- Recommender system, etc.
v Learning outcomes
– Understand and explain challenges and advances in the state of the art in big data analytics.
– Design, develop and fully execute a big data analytics project.
– Communicate the ideas effectively in the form of a presentation and written documents to a technical audience.
Logistics
Workload
v Focus more on critical thinking, problem solving, “heads-on/hands-on” experiences!
v Understand, formulate and solve problems
v Read and critique research papers
v Two Course Projects
v Oral presentations
v Team Work
v Coding
- Grading
– Projects (40%)
- Project 1 (10%)
- Project 2 (30%)
– Final reports in the discussion forum (by 11:59pm 4/24 Tue)
– Self-and-peer evaluation form for project 2 (by 11:59pm 4/24 Tue)
– Written work (30%):
- Critiques + Project reports (20%)
- Quiz (10%, with 5% each)
– Oral work (30%):
- (Project and paper) presentations
Workload and Grading
Problems
[Figure: diagram relating Problems, Data, Models and Algorithms, and the Data Scientist.]
Want to learn more? Future Opportunities.
Urban Computing Research Group at WPI
- DiDi
- Mobike
- JD
- Yunyan
- TianLai online Karaoke
Urban Computing Research Group at WPI
- Human-in-Loop Urban Computing
Research opportunities are available in my group.
- 1. Funding support for PhD students
- 2. Independent Study for MS