Get a handout 1 Welcome to DS504/CS586: Big Data Analytics - - PowerPoint PPT Presentation


Welcome to DS504/CS586: Big Data Analytics, Review. Prof. Yanhua Li. Time: 6:00pm–8:50pm R. Location: AK233. Spring 2018. Next Session: Final Project Presentation.


slide-1
SLIDE 1


Get a handout

slide-2
SLIDE 2

Welcome to DS504/CS586: Big Data Analytics

  • Review
  • Prof. Yanhua Li

Time: 6:00pm–8:50pm R
Location: AK233
Spring 2018

slide-3
SLIDE 3

Next Session: Final Project Presentation

  • 4/24 T: Submission day

– Project reports to discussion board
– Self-and-cross evaluation form to Assignment

  • 4/26 R: Presentation Day

– Quiz 2 (I will send you sample questions soon)
– 20 min each team (including Q&A)
– Team 1, Team 2
– 10 min break
– Team 3, Team 4, Team 5
– Snacks and soft drinks will be provided.

slide-4
SLIDE 4

Today

  • 1. CityLines
  • 2. Review

– Key topics and techniques discussed in the semester
– Future opportunities: big data analytics, urban computing
– 10 min break 7:20-7:30PM

  • 3. Team 5 presentation and discussion: 7:30-8:30PM
  • 4. Course evaluation 8:30PM-8:45PM
  • 5. Finish at 8:45PM

– (Last week we finished 5 minutes late.)

slide-5
SLIDE 5

CityLines: Hybrid Hub-and-Spoke System for Urban Transportation Services

Yanhua Li, Assistant Professor, Computer Science Department, Worcester Polytechnic Institute

slide-6
SLIDE 6

Global Urbanization and Transportation

slide-7
SLIDE 7

Today’s Urban Transit Services

Private Transits and Public Transits

Affordable ride-sharing services reduce personal vehicle usage.

slide-8
SLIDE 8

Limitations of Today’s Public Transits

  • Fixed Routes and Time Tables

– Transit supply does not match dynamic demand

  • Large number of stops and transfers

– Long travel time

slide-9
SLIDE 9

Limitations of Today’s Private Transits

  • Expensive

– High operation cost
– Due to the exclusive service

  • Service delay

– On-demand services
– Delay after the service request

  • Transit modes run independently

– Lack of inter-transit coordination

slide-10
SLIDE 10

Today’s Transits

  • Private Transits (point-to-point mode)

– High cost
– Service delay

  • Public Transits (fixed-route mode)

– Fixed routes
– Fixed timetable
– Long travel time

  • No inter-transit coordination

Future Smart Transit: Future Urban Transit Services

  • Dynamic services

– Real-time trip demands

  • Short travel time

– as private transits

  • Low cost

– as public transits

slide-11
SLIDE 11

Hub-and-Spoke Transit Mode

  • Traffic moves along spokes connected via a few hubs

– Lower operation cost (than private transits), thus lower cost
– Fewer stops/stations (than public transits), thus shorter transit time

  • A promising transit mode, but how should it be designed in urban areas?

Examples: airline routes, package delivery systems

slide-12
SLIDE 12

CityLines Transit System

  • CityLines: a Hybrid Hub-and-Spoke Transit Mode

– point-to-point mode: high-demand source-destination pairs
– hub-and-spoke mode: low-demand source-destination pairs

[Figure: private transit (point-to-point model) vs. CityLines (hybrid hub-and-spoke mode), with sources S1-S3 and destinations D1-D4. CityLines reduces routes, thus operation cost.]

slide-13
SLIDE 13

CityLines Transit System

  • CityLines: a Hybrid Hub-and-Spoke Transit Mode

– point-to-point mode: high-demand source-destination pairs
– hub-and-spoke mode: low-demand source-destination pairs

[Figure: public transit (fixed-route model) vs. CityLines (hybrid hub-and-spoke mode), with sources S1-S3 and destinations D1-D4. CityLines reduces stops/stations, thus travel time.]

slide-14
SLIDE 14

CityLines Transit System Design

slide-15
SLIDE 15

Input Data Description

  • Trip Demand Data (in Shenzhen):
  • Source: Taxi GPS, Bus, Subway Transactions
  • Duration: March 1st–30th, 2014.
  • Size: 19,428,453 trips in all transit modes
  • Format: Taxi ID, time, latitude, longitude, load
  • Road Map, Subway Lines, and Bus routes:
slide-16
SLIDE 16

Stage 1: Road Map Gridding

  • Given a side length s = 0.01°
  • 1,508 grids are obtained
  • 1,018 grids are strongly connected by road network
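
The gridding step can be illustrated with a minimal sketch. It assumes a flat latitude/longitude grid anchored at a south-west corner; the function name `grid_id` and the corner coordinates `lat0`, `lon0` are hypothetical and not taken from the slides.

```python
import math


def grid_id(lat, lon, lat0, lon0, s=0.01):
    """Return the (row, col) of the grid cell of side s degrees containing a point.

    lat0 and lon0 are the south-west corner of the study area
    (hypothetical inputs chosen for illustration).
    """
    row = math.floor((lat - lat0) / s)  # cells counted northwards from lat0
    col = math.floor((lon - lon0) / s)  # cells counted eastwards from lon0
    return row, col


if __name__ == "__main__":
    # A point in the study area, with an assumed corner at (22.45, 113.75).
    print(grid_id(22.543, 114.057, 22.45, 113.75))
```

Checking which grids are strongly connected by the road network would be a separate step on top of this mapping.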
slide-17
SLIDE 17

Stage 2: Trip demand aggregation

  • Trip demand: <src, dst, t>
  • Aggregated trip demand <src_grid, dst_grid, t>
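
As a rough sketch of the aggregation, each trip's source and destination can be mapped to a grid cell and the trips counted per (src_grid, dst_grid, t). The helper names, the grid origin, and the use of an hour index for t are assumptions made for illustration.

```python
import math
from collections import Counter


def to_grid(point, origin, s=0.01):
    """Map a (lat, lon) point to the (row, col) of its s-degree grid cell."""
    lat, lon = point
    lat0, lon0 = origin
    return (math.floor((lat - lat0) / s), math.floor((lon - lon0) / s))


def aggregate(trips, origin, s=0.01):
    """Count trips per (src_grid, dst_grid, t).

    trips is an iterable of (src, dst, t) with src and dst as (lat, lon);
    t is a time-slot label, here assumed to be an hour of day.
    """
    demand = Counter()
    for src, dst, t in trips:
        key = (to_grid(src, origin, s), to_grid(dst, origin, s), t)
        demand[key] += 1
    return demand


if __name__ == "__main__":
    sample = [
        ((22.543, 114.057), (22.533, 114.123), 7),
        ((22.5435, 114.0575), (22.5338, 114.1236), 7),
        ((22.543, 114.057), (22.611, 114.031), 8),
    ]
    for key, count in aggregate(sample, origin=(22.45, 113.75)).items():
        print(key, count)
```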

[Figure: The spatial distribution of trip demand sources, 6am to 9am. Legend: no demand, low demand, medium demand, high demand.]

slide-18
SLIDE 18

Stage 3: Optimal Hybrid Hub-and-Spoke Planning

  • Problem definition:
  • Given: n spokes, a set of K trip demands, a budget of M point-to-point paths, and L hub stations
  • How to plan the hybrid hub-and-spoke network?
  • Goal: Minimize the average travel time
  • Constraint: Up to one stop (at a hub) per trip

slide-19
SLIDE 19

Stage 3: Optimal Hybrid Hub-and-Spoke Planning

  • Challenges:
  • A large number of hub candidates: all spokes
  • n=1,018 spokes; L=10 hubs
  • Joint modeling of point-to-point and hub-and-spoke
  • Two Components:
  • Optimal Hub Selection (OHS): Find L+M hub candidates
  • Goal: “Cover” the largest number of shortest paths of trip demands
  • Optimal Trip Assignment (OTA): Hub-and-spoke network with L hubs
  • Goal: Minimize the average travel time
  • (introducing a virtual hub to model the joint optimization)
slide-20
SLIDE 20

Stage 3-I: Optimal Hub Selection (OHS)

  • Problem Definition:
  • Find M+L hub candidates
  • Goal: “Cover” the most trip demands
  • A hub h covers a trip demand <src, dst, t> if h is on the shortest path from src to dst.

[Figure: example with sources S1-S3 and destinations D1-D4; L=2, M=1, L+M=3]

slide-21
SLIDE 21

Stage 3-I: Optimal Hub Selection (OHS)

  • Maximum Coverage Problem
  • NP-Hard Problem
  • Approximation algorithm with ratio 1-1/e [1]

[1] D. S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In Approximation algorithms for NP-hard problems. PWS Publishing Co., 1996.
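
The standard greedy algorithm for maximum coverage achieves the 1-1/e ratio, so a minimal sketch of hub selection could look like the following. It assumes the coverage sets (which demands each candidate hub covers) have already been computed from shortest paths; `cover_sets` and `budget` are hypothetical names, and demands are treated as unweighted for simplicity.

```python
def greedy_max_coverage(cover_sets, budget):
    """Greedily pick up to `budget` hubs to cover as many demands as possible.

    cover_sets maps a candidate hub to the set of trip-demand ids it covers
    (a hub covers a demand if it lies on that demand's shortest path).
    Returns the chosen hubs in pick order and the set of covered demands.
    """
    chosen = []
    covered = set()
    for _ in range(budget):
        # Pick the unchosen hub that adds the most not-yet-covered demands.
        best = max(
            (h for h in cover_sets if h not in chosen),
            key=lambda h: len(cover_sets[h] - covered),
            default=None,
        )
        if best is None or not (cover_sets[best] - covered):
            break  # no candidate left, or nothing new to cover
        chosen.append(best)
        covered |= cover_sets[best]
    return chosen, covered


if __name__ == "__main__":
    candidates = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1}}
    print(greedy_max_coverage(candidates, budget=2))
```

In the OHS setting the budget would be L+M; weighting each demand by its trip count is a natural extension that this sketch leaves out.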

slide-22
SLIDE 22

Stage 3-II: Optimal Trip Assignment

  • p-Hub problem for hub-and-spoke model
  • p-Hub problem with L hubs and 1 virtual hub

  • LP relaxation based approximation solution [2]

[Figure: hub-and-spoke network with sources S1-S3 and destinations D1-D4]

[2] A. T. Ernst and M. Krishnamoorthy. Exact and heuristic algorithms for the uncapacitated multiple allocation p-hub median problem. European Journal of Operational Research, 1998.
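
The objective can be made concrete for a fixed plan. The sketch below is not the LP-relaxation method of [2]; it only evaluates the average travel time when each trip either goes direct (standing in for the virtual hub, which is an assumption of this sketch) or makes one stop at its best hub. All names (`demands`, `travel_time`, `hubs`, `direct_pairs`) are hypothetical.

```python
def average_travel_time(demands, travel_time, hubs, direct_pairs):
    """Average travel time over all trips for a fixed hybrid plan.

    demands:      dict (src, dst) -> number of trips
    travel_time:  dict (a, b) -> travel time in minutes between two grids
    hubs:         iterable of selected hub grids
    direct_pairs: set of (src, dst) pairs served point-to-point
    """
    total_time = 0.0
    total_trips = 0
    for (src, dst), trips in demands.items():
        if (src, dst) in direct_pairs:
            t = travel_time[(src, dst)]  # point-to-point, no stop
        else:
            # One stop at whichever hub gives the shortest total time.
            t = min(travel_time[(src, h)] + travel_time[(h, dst)] for h in hubs)
        total_time += trips * t
        total_trips += trips
    return total_time / total_trips


if __name__ == "__main__":
    times = {
        ("S1", "D1"): 10, ("S1", "H"): 4, ("H", "D1"): 8,
        ("S2", "D2"): 7, ("S2", "H"): 3, ("H", "D2"): 5,
    }
    demand = {("S1", "D1"): 30, ("S2", "D2"): 10}
    print(average_travel_time(demand, times, ["H"], {("S1", "D1")}))
```

Serving the high-demand pair directly lowers the average, which is the intuition behind spending the budget of M point-to-point paths on high-demand pairs.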

slide-23
SLIDE 23

Comparison with Public and Private Transits

  • Average travel time (min) over all trip demands

– ~42 min reduction over public transits
– Slightly higher (4 min) than private transits

  • Aggregation level: average # passengers per trip segment

– Slightly less (8) than public transits
– ~23 more than private transits

slide-24
SLIDE 24

Case Studies: Point-to-point Model

slide-25
SLIDE 25

Case Studies: Hub-and-spoke

slide-26
SLIDE 26

Case Studies: Hybrid Hub-and-Spoke

slide-27
SLIDE 27

Questions?

slide-28
SLIDE 28

Introduction

What is “Big Data”?


slide-29
SLIDE 29

Big Data Analytics: techniques and tools for managing, analyzing, and extracting knowledge from “big data”

slide-30
SLIDE 30

CS586/DS504-2018 Spring

  • 1. Data Acquisition & Measurement

– Representative data collection: Sampling Techniques

  • 2. Data Preprocessing/Cleaning

– Error Correction, Map-Matching

  • 3. Data Management

– Indexing, Query Processing

  • 4. Big Data Mining

– Graph Mining, Data Clustering, Recommender systems

  • 5. Applications

– Urban Computing, Social Network Analysis, Networking

Sampling and index, Clustering

  • 4. K-means, DBSCAN
  • 4. BFR, DENCLUE
  • 4. Trajectory Clustering
  • 5. Bike Lane Planning
  • 1. Graph Mining
  • 3. Index, Query
  • 4. Data Collection
  • 2. Map-Matching
  • 4. Recommender Systems

More techniques

slide-31
SLIDE 31

Big Data Mining Topics

Topics in Big Data Mining

  • 1. Graph Mining

– Graph Sampling
– Node Importance Ranking
– Facebook/Social graph estimation
– Social influence
– Topic sensitive PageRank

  • 2. Clustering

– Hierarchical
– K-means, BFR
– DBScan, DENCLUE
– Trajectory clustering

  • 3. Recommender Systems

– Content-Based
– Collaborative Filtering: User-User Based, Item-Item Based
– Location-based recommender systems
– Personalized Geo-Social Recommendation

  • 4. Crowdfunding and Crowdturfing (Guest Lecture)
  • 5. Applications (CityLines, bike lane planning, etc.)

slide-32
SLIDE 32

Roadmap

  • 1. Sampling & Indexing

– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

  • 2. Clustering

– Hierarchical
– K-means, BFR
– DBScan, DENCLUE

  • 3. Recommender System, Map-Matching, etc.
  • 4. Applications

– Social networks
– Location-based services
– Urban computing
– and more

slide-33
SLIDE 33

Sampling Techniques to Count Population

  • German Tank Problem

– Panther tanks, 1943, World War II
– Estimate the number of German tanks (N)
– The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement
– m: the maximum serial number observed
– k: the total number of tanks observed

  • Estimator: the sample maximum plus the average gap between observations in the sample

$\hat{N} = m(1 + k^{-1}) - 1$
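
A small sketch of the estimator, assuming the observed serial numbers are given as a list of positive integers (the function name and sample values are made up for illustration):

```python
def german_tank_estimate(serials):
    """Estimate the population size N from observed serial numbers.

    Implements N_hat = m * (1 + 1/k) - 1, where m is the largest observed
    serial number and k is the number of observations.
    """
    k = len(serials)
    if k == 0:
        raise ValueError("at least one observation is required")
    m = max(serials)
    return m * (1 + 1 / k) - 1


if __name__ == "__main__":
    # Four observed serial numbers (illustrative values only).
    print(german_tank_estimate([19, 40, 42, 60]))
```

With m = 60 and k = 4, the estimate is 60 plus the average gap 60/4 - 1 = 14.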

slide-34
SLIDE 34

Sampling Techniques to Count Population

  • Mark and recapture

– A method commonly used in ecology to estimate an animal population’s size N.

  • Step 1: A portion of the population, K, is captured, marked, and released.
  • Step 2: Later, another portion, n, is captured, and the number of marked individuals within the sample, k, is counted.
  • Estimation: $\hat{N} = \frac{Kn}{k}$
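
The estimate follows from assuming that the marked fraction in the second sample matches the marked fraction in the whole population, k/n ≈ K/N. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def mark_recapture_estimate(K, n, k):
    """Estimate population size N as K * n / k.

    K: individuals marked in the first capture
    n: individuals caught in the second capture
    k: marked individuals among those n
    """
    if k <= 0:
        raise ValueError("no marked individuals recaptured; estimate undefined")
    return K * n / k


if __name__ == "__main__":
    # 100 marked, 60 caught later, 12 of them marked (illustrative values).
    print(mark_recapture_estimate(K=100, n=60, k=12))
```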

slide-35
SLIDE 35

Sampling Big Data

1.1 Random sampling (uniform & independent)

  • vertex sampling
  • edge sampling

1.2 Crawling

  • BFS sampling
  • random walk sampling

slide-36
SLIDE 36

1.1 Random Vertex Sampling & Index

  • One-Dimension Data

– YouTube: Random Prefix Sampling
– Index structure: B-Tree, List Index

  • Two-Dimension Data (Spatial Data)

– Google Maps/Foursquare: Random Region Sampling / Random Region Zoom-in
– Index structure: Grid-based / Quad-Tree / R-Tree

  • Three-Dimension Data (Spatio-temporal Data)

– Trajectory sampling: Random index sampling
– Index structure (combinations): B-Tree + Quad-tree, 3-D R-tree

slide-37
SLIDE 37

Full B-Tree Structure

slide-38
SLIDE 38

Grid-based Spatial Indexing

[Figure: grid g1 contains points p1 and p3; grid g2 contains point p4. The index maps g1 to p1, p3 and g2 to p4.]

  • Indexing

– Partition the space into disjoint and uniform grids
– Build an index between each grid and the points in the grid
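
The two indexing steps can be sketched as a dictionary from grid cell to the points inside it. The class name `GridIndex`, its methods, and the range query are illustrative assumptions, not an API from the course.

```python
import math
from collections import defaultdict


class GridIndex:
    """Uniform grid index: maps each cell to the points that fall in it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cx, cy) -> [(point_id, x, y), ...]

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, point_id, x, y):
        self.cells[self._cell(x, y)].append((point_id, x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        """Return ids of points inside the rectangle, visiting only overlapping cells."""
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for pid, x, y in self.cells.get((cx, cy), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(pid)
        return hits


if __name__ == "__main__":
    index = GridIndex(cell_size=1.0)
    index.insert("p1", 0.2, 0.3)
    index.insert("p3", 0.8, 0.6)
    index.insert("p4", 1.5, 0.4)
    print(index.range_query(0.0, 0.0, 0.9, 0.9))
```

A query only touches the cells that overlap the rectangle, which is the benefit of the grid over scanning every point.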

slide-39
SLIDE 39

Quad-Tree

  • Indexing

– Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
– Each non-leaf node divides its region into four equal-sized quadrants.
– Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).

[Figure: example quad-tree with labeled quadrants]
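
The three properties above can be sketched as a small point quad-tree. This is an illustrative implementation under stated assumptions: regions are half-open rectangles, inserted points are distinct (identical points would split forever at capacity 1), and the class and method names are hypothetical.

```python
class QuadTree:
    """Point quad-tree: a leaf holds at most `capacity` points, then splits."""

    def __init__(self, x0, y0, x1, y1, capacity=1):
        self.bounds = (x0, y0, x1, y1)  # region [x0, x1) x [y0, y1)
        self.capacity = capacity
        self.points = []
        self.children = None  # None while this node is a leaf

    def insert(self, x, y):
        """Insert a point; return False if it lies outside this node's region."""
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._split()
        # Exactly one child contains the point; any() stops at the first success.
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        """Divide the region into four equal quadrants and push points down."""
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            QuadTree(x0, y0, mx, my, self.capacity),
            QuadTree(mx, y0, x1, my, self.capacity),
            QuadTree(x0, my, mx, y1, self.capacity),
            QuadTree(mx, my, x1, y1, self.capacity),
        ]
        old, self.points = self.points, []
        for px, py in old:
            any(child.insert(px, py) for child in self.children)

    def depth(self):
        if self.children is None:
            return 1
        return 1 + max(child.depth() for child in self.children)

    def count(self):
        if self.children is None:
            return len(self.points)
        return sum(child.count() for child in self.children)


if __name__ == "__main__":
    tree = QuadTree(0, 0, 8, 8)
    for point in [(1, 1), (6, 6), (5, 7)]:
        tree.insert(*point)
    print(tree.depth(), tree.count())
```

Dense areas end up with deeper subtrees while sparse areas stay shallow, which is the main difference from the uniform grid index.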

slide-40
SLIDE 40
  • 2. Clustering
  • 1. Hierarchical
  • 2. K-means -> BFR
  • 3. DBScan -> DENCLUE
slide-41
SLIDE 41

Roadmap

  • 1. Sampling & Indexing

– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

  • 2. Clustering

– Hierarchical
– K-means, DBScan

  • 3. Recommender System, Map-Matching, etc
  • 4. Applications
slide-42
SLIDE 42

Content based Recommendation

[Figure: from the items a user likes, build item profiles, build a user profile (red, circles, triangles), match it against item profiles, and recommend.]

  • J. Leskovec, A. Rajaraman, J.

slide-43
SLIDE 43

Collaborative Filtering

  • Consider user x
  • Find a set N of other users whose ratings are “similar” to x’s ratings
  • Estimate x’s ratings based on the ratings of users in N

  • J. Leskovec, A. Rajaraman, J.
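
The three steps can be sketched as user-user collaborative filtering. The choices below are assumptions for illustration only: cosine similarity as the notion of “similar”, a similarity-weighted average as the estimate, positive ratings, and hypothetical names throughout.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, x, item, n_neighbors=2):
    """Estimate user x's rating of `item` from the most similar users who rated it.

    ratings maps user -> {item: rating}. Returns None when no similar
    user has rated the item.
    """
    candidates = [
        (cosine_similarity(ratings[x], ratings[u]), u)
        for u in ratings
        if u != x and item in ratings[u]
    ]
    neighbors = sorted(candidates, reverse=True)[:n_neighbors]  # the set N
    total = sum(sim for sim, _ in neighbors)
    if total <= 0:
        return None
    # Similarity-weighted average of the neighbors' ratings.
    return sum(sim * ratings[u][item] for sim, u in neighbors) / total


if __name__ == "__main__":
    data = {
        "x": {"a": 5, "b": 4},
        "u1": {"a": 5, "b": 4, "c": 4},
        "u2": {"a": 1, "b": 5, "c": 2},
    }
    print(predict_rating(data, "x", "c"))
```

Because u1's ratings are closer to x's than u2's, the estimate for item c leans toward u1's rating.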

slide-44
SLIDE 44

Roadmap

  • 1. Sampling & Indexing

– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

  • 2. Clustering

– Hierarchical
– K-means, DBScan

  • 3. Recommender System, Anomaly Detection, Map-Matching, etc.

  • 4. Applications
slide-45
SLIDE 45


Class Outcomes

slide-46
SLIDE 46


What is DS504/CS586 about?

  • We’ll learn about

– Advanced Techniques for Big Data Analytics

  • Large-scale data sampling and estimation
  • Data cleaning
  • Graph data mining
  • Data management, clustering, etc.

– Applications with Big Data Analytics

  • Urban computing
  • Social network analysis
  • Recommender systems, etc.

  • Learning outcomes

– Understand and explain challenges and advances in the state of the art in big data analytics.
– Design, develop, and fully execute a big data analytics project.
– Communicate ideas effectively in the form of a presentation and written documents to a technical audience.

slide-47
SLIDE 47

CS586/DS504-2018 Spring

  • 1. Data Acquisition & Measurement

– Representative data collection: Sampling Techniques

  • 2. Data Preprocessing/Cleaning

– Error Correction, Map-Matching

  • 3. Data Management

– Indexing, Query Processing

  • 4. Big Data Mining

– Graph Mining, Data Clustering, Recommender systems

  • 5. Applications

– Urban Computing, Social Network Analysis, Networking

Sampling and index, Clustering

  • 4. K-means, DBSCAN
  • 4. BFR, DENCLUE
  • 4. Trajectory Clustering
  • 5. Bike Lane Planning
  • 1. Graph Mining
  • 3. Index, Query
  • 4. Data Collection
  • 2. Map-Matching
  • 4. Recommender Systems

More techniques

slide-48
SLIDE 48

Logistics

Workload

  • Focus more on critical thinking, problem solving, and “heads-on/hands-on” experiences!
  • Understand, formulate, and solve problems
  • Read and critique research papers
  • Two course projects
  • Oral presentations
  • Teamwork
  • Coding

slide-49
SLIDE 49
Workload and Grading

  • Grading

– Projects (40%)

  • Project 1 (10%)
  • Project 2 (30%)

– Final reports in the discussion forum (by 11:59PM 4/24 Tue)
– Self-and-peer evaluation form for Project 2 (by 11:59PM 4/24 Tue)

– Written work (30%)

  • Critiques + project reports (20%)
  • Quizzes (10%, 5% each)

– Oral work (30%)

  • (Project and paper) presentations

slide-50
SLIDE 50

[Figure labels: Problems; Data; Models and Algorithms; Data Scientist]

slide-51
SLIDE 51


Want to learn more? Future Opportunities.

slide-52
SLIDE 52

Urban Computing Research Group at WPI

  • DiDi
  • Mobike
  • JD
  • Yunyan
  • TianLai online Karaoke
slide-53
SLIDE 53

Urban Computing Research Group at WPI

  • Human-in-Loop Urban Computing
slide-54
SLIDE 54

Research opportunities are available in my group.

  • 1. Funding support for PhD students
  • 2. Independent Study for MS students

Contact: yli15@wpi.edu
Website: http://wpi.edu/~yli15/index.html