Pick up a handout on the front table 1 Welcome to DS504/CS586: - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Pick up a handout on the front table
slide-2
SLIDE 2

Welcome to DS504/CS586: Big Data Analytics -- Review

  • Prof. Yanhua Li

Time: 6:00pm – 8:50pm R
Location: KH116
Fall 2017

slide-3
SLIDE 3

Next Session: Final Project Presentation

  • 12/14 R
  • 20 min each team (including Q&A)
  • Team 1
  • Team 2
  • Team 3
  • Team 4
  • Team 5
  • Team 6
  • Team 7
  • Snacks and soft drinks will be provided.

slide-4
SLIDE 4

Today

  • 1. Review

– Key topics and techniques discussed in the semester
– Future opportunities

  • Big data analytics
  • Urban Computing

– 10 min break 7:20-7:30PM

  • 3. Team 1 presentation and discussion: 7:30PM
  • 4. Course evaluation 8:15PM-8:30PM
  • 5. Finish at 8:30PM

– (last week we finished 18 minutes late.)

slide-5
SLIDE 5

Introduction

What is “Big Data”?


slide-6
SLIDE 6

Big Data Analytics: techniques and tools for managing, analyzing, and extracting knowledge from “big data”

slide-7
SLIDE 7

CS586/DS504-2017Fall

  • 1. Data Acquisition & Measurement: representative data collection, sampling techniques
  • 2. Data Preprocessing/Cleaning: error correction, map-matching
  • 3. Data Management: indexing, query processing
  • 4. Big Data Mining: graph mining, data clustering, recommender systems, outlier detection
  • 5. Applications: urban computing, social network analysis, networking

Sampling and index

Clustering

  • 4. K-means, DBSCAN
  • 4. BFR, DENCLUE
  • 4. Trajectory Clustering
  • 5. Urban: Bike sharing
  • 1. Graph Mining
  • 3. Index, Query
  • 4. Data Collection
  • 2. Map-Matching
  • 4. Recommender Systems
  • 4. Outlier Detection

More techniques

slide-8
SLIDE 8

Big Data Mining Topics

Topics in Big Data Mining

  • 1. Graph Mining: Graph Sampling, Node Importance Ranking, Facebook/Social graph estimation, Social influence, Topic sensitive PageRank
  • 2. Clustering: Hierarchical; K-means, BFR; DBScan, DENCLUE; Trajectory clustering
  • 3. Recommender Systems: Content-Based; Collaborative Filtering (User-User Based, Item-Item Based); Location-based recommender sys; Personalized Geo-Social Recom.
  • 4. Outlier Detection
  • 5. Big Data Integration (Guest Lec.)
slide-9
SLIDE 9

Roadmap

  • 1. Sampling & Indexing

– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

  • 2. Clustering

– Hierarchical
– K-means, BFR
– DBScan, DENCLUE

  • 3. Recommender System, Map-Matching, etc.
  • 4. Applications

– Social networks
– Location based services
– Urban computing
– and more

slide-10
SLIDE 10

Sampling Techniques to Count Population

  • German Tank Problem: Panther tanks, 1943, World War II
  • Estimate the number of German tanks (N)
  • The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement
  • m: the maximum serial number observed
  • k: total number of tanks observed
  • Estimator: the sample maximum plus the average gap between observations in the sample

$\hat{N} = m(1 + k^{-1}) - 1$
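As a minimal sketch, the estimator can be written in a few lines of Python. The function name and the serial numbers below are hypothetical, chosen only for illustration.

```python
def german_tank_estimate(serials):
    """Estimate the population size N from serial numbers sampled without replacement."""
    k = len(serials)   # k: total number of tanks observed
    m = max(serials)   # m: maximum serial number observed
    # Sample maximum plus the average gap between observations: m(1 + 1/k) - 1
    return m * (1 + 1 / k) - 1


# Hypothetical observations (not real data).
observed = [19, 40, 42, 60]
print(german_tank_estimate(observed))  # prints 74.0
```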

slide-11
SLIDE 11

Sampling Techniques to Count Population

  • Mark and recapture: a method commonly used in ecology to estimate an animal population’s size N.
  • Step 1: A portion of the population, K, is captured, marked, and released.
  • Step 2: Later, another portion, n, is captured, and the number of marked individuals within the sample, k, is counted.
  • Estimation: $\hat{N} = \frac{Kn}{k}$
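A minimal sketch of the mark-and-recapture estimate follows. The function name and the numbers are hypothetical; the guard for k = 0 is an added assumption, since the ratio is undefined when no marked individual is recaptured.

```python
def mark_recapture_estimate(K, n, k):
    """Estimate population size N = K * n / k.

    K: number captured, marked, and released in step 1
    n: number captured in step 2
    k: number of marked individuals found in the step-2 sample
    """
    if k == 0:
        raise ValueError("no marked individuals recaptured; estimate undefined")
    return K * n / k


# Hypothetical field counts: 100 marked, 60 recaptured, 12 of them marked.
print(mark_recapture_estimate(100, 60, 12))  # prints 500.0
```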

slide-12
SLIDE 12

Sampling Big Data

1.1 Random sampling (uniform & independent)

  • vertex sampling
  • edge sampling

1.2 Crawling

  • BFS sampling
  • random walk sampling

slide-13
SLIDE 13

1.1 Random Vertex Sampling & Index

  • One-Dimensional Data

– YouTube: Random Prefix Sampling
– Index structure: B-Tree, List Index

  • Two-Dimensional Data (Spatial Data)

– Google map/Foursquare: Random Region Sampling / Random Region Zoom-in
– Index structure: Grid-based / Quad Tree / R-Tree

  • Three-Dimensional Data (spatio-temporal data)

– Trajectory sampling: Random index sampling
– Index structure (combinations): B-Tree + Quad-tree, 3-D R-tree

slide-14
SLIDE 14

Full B-Tree Structure

slide-15
SLIDE 15

Grid-based Spatial Indexing

[Figure: example grid index over grids g1, g2 and points p1, p3, p4]

  • Indexing

– Partition the space into disjoint and uniform grids
– Build an index between each grid and the points in the grid
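The two indexing steps can be sketched as below. This is an illustration only: the point coordinates, the cell size, and the function names `build_grid_index` and `range_query` are assumptions, not taken from the slides.

```python
from collections import defaultdict


def build_grid_index(points, cell_size):
    """Map each uniform grid cell (col, row) to the names of the points inside it."""
    index = defaultdict(list)
    for name, (x, y) in points.items():
        cell = (int(x // cell_size), int(y // cell_size))
        index[cell].append(name)
    return index


def range_query(index, points, cell_size, xmin, ymin, xmax, ymax):
    """Return points inside the rectangle, visiting only the overlapping cells."""
    hits = []
    for cx in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
        for cy in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
            for name in index.get((cx, cy), []):
                x, y = points[name]
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append(name)
    return sorted(hits)


# Hypothetical coordinates for three points.
points = {"p1": (0.5, 0.5), "p3": (0.8, 0.2), "p4": (1.5, 0.5)}
index = build_grid_index(points, cell_size=1.0)
print(dict(index))
print(range_query(index, points, 1.0, 0.6, 0.0, 1.6, 1.0))
```

The query only touches cells that overlap the rectangle, which is the point of the index: the candidate set is bounded by the cells visited rather than by the whole data set.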

slide-16
SLIDE 16

Quad-Tree

  • Indexing

– Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
– Each non-leaf node divides its region into four equal-sized quadrants.
– Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).

[Figure: example quad-tree; quadrant labels include 1, 2, 3 and sub-quadrant labels 00, 02, 03, 30, 31, 32, 33]
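A minimal quad-tree sketch consistent with the three properties above. The class layout, the square root region, and the sample points are assumptions for illustration; the sketch also assumes all inserted points are distinct, since duplicates beyond the leaf capacity would split forever.

```python
class QuadTree:
    """Point quad-tree over a square region (lower-left corner x, y; side length size)."""

    def __init__(self, x, y, size, capacity=1):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity   # maximum number of points in a leaf
        self.points = []           # points stored here while this node is a leaf
        self.children = None       # None for a leaf, else four child nodes

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        # Divide the region into four equal-sized quadrants and push points down.
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half, self.capacity)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pending, self.points = self.points, []
        for p in pending:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[row * 2 + col]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


# Hypothetical points in a 4 x 4 target space, leaf capacity 1 as in the example.
tree = QuadTree(0, 0, 4, capacity=1)
for p in [(1, 1), (3, 3), (0.5, 0.5)]:
    tree.insert(p)
print(len(tree.leaves()))  # prints 7
```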

slide-17
SLIDE 17

Random Vertex Sampling: YouTube

Comments from other YouTube users

slide-18
SLIDE 18

YouTube Video ID Space

slide-19
SLIDE 19

Random Prefix Sampling

  • Let $p_L$ denote the probability that a randomly generated id matches a given L-length prefix:

$p_L = 1/|S|^L = 1/64^L$, if $L = 1, \dots, 10$
$p_L = 1/(|S|^{10}|T|) = 1/(64^{10} \cdot 16)$, if $L = 11$

  • Generate m prefixes of length L.
  • Let $X_i^L$ be the total number of videos with a prefix i of length L, and N the total number of videos; then $X_i^L \sim \mathrm{Binomial}(N, p_L)$.

slide-20
SLIDE 20

Unbiased Estimator for the Total Number of Videos

  • Given m samples $X_i^L$ obtained by querying randomly generated prefixes of the same length L in [1, 11], we have the unbiased estimator of the total number of videos (see the paper for the confidence interval and variance):

$\hat{N} = \frac{1}{m p_L} \sum_{i=1}^{m} X_i^L$
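A minimal sketch of the prefix probability and the estimator, assuming |S| = 64 and |T| = 16 as on the previous slide. The function names and the per-prefix counts are hypothetical; real counts would come from querying each random prefix.

```python
def prefix_match_probability(L, alphabet=64, last_alphabet=16):
    """p_L: probability that a random id matches a given prefix of length L."""
    if 1 <= L <= 10:
        return 1 / alphabet ** L
    if L == 11:
        return 1 / (alphabet ** 10 * last_alphabet)
    raise ValueError("prefix length must be in [1, 11]")


def estimate_total_videos(counts, L):
    """N_hat = (1 / (m * p_L)) * sum of X_i^L over the m sampled prefixes."""
    m = len(counts)
    return sum(counts) / (m * prefix_match_probability(L))


# Hypothetical numbers of videos returned for m = 4 random prefixes of length 2.
counts = [3, 5, 4, 4]
print(estimate_total_videos(counts, L=2))  # prints 16384.0
```

Since each $X_i^L$ has mean $N p_L$, dividing the sample mean by $p_L$ gives an estimate whose expectation is N.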

slide-21
SLIDE 21

Practical Issues

slide-22
SLIDE 22

Simple random region sampling

Tabulating Stage: Estimation

Unbiased estimator of the total number of venues N: [equation not preserved; it is defined in terms of the number of venues of $X_t$]

Please refer to the paper for the proof of unbiasedness, the confidence interval, and the estimator design of other statistics.

slide-23
SLIDE 23

Random Region Zoom-in on Maps

  • RRZI(A): At each step, while the current queried region contains more than k PoIs (k=5), RRZI divides it into two sub-regions and randomly selects a non-empty sub-region to zoom in on.

[Figure: Steps 1 to 4 of the zoom-in, with the probability of sampling the sub-region]

slide-24
SLIDE 24

Random Region Zoom-in on Maps

  • RRZI and RRZIC can be viewed as weighted sampling methods.

Estimators of sum and distribution aggregates use the sampled sub-regions $r_1, \dots, r_m$ and the probability of sampling each sub-region $r_i$. [equations not preserved]
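The zoom-in procedure and its sampling probability can be sketched as follows. The slide's estimator formulas are not preserved, so this sketch assumes a standard inverse-probability-weighted average, (1/m) Σ n(r_i)/p(r_i), for the total number of PoIs. Splitting the longer side in half, choosing uniformly among non-empty halves, and all function names are likewise assumptions.

```python
import random


def count_in(points, region):
    """Number of points in the half-open region [xmin, xmax) x [ymin, ymax)."""
    xmin, ymin, xmax, ymax = region
    return sum(1 for (x, y) in points if xmin <= x < xmax and ymin <= y < ymax)


def rrzi_sample(points, region, k=5, rng=random, max_depth=64):
    """Zoom in until the region holds at most k points; return (region, probability)."""
    prob = 1.0
    for _ in range(max_depth):
        if count_in(points, region) <= k:
            break
        xmin, ymin, xmax, ymax = region
        if xmax - xmin >= ymax - ymin:   # split the longer side in half
            mid = (xmin + xmax) / 2
            halves = [(xmin, ymin, mid, ymax), (mid, ymin, xmax, ymax)]
        else:
            mid = (ymin + ymax) / 2
            halves = [(xmin, ymin, xmax, mid), (xmin, mid, xmax, ymax)]
        non_empty = [h for h in halves if count_in(points, h) > 0]
        region = rng.choice(non_empty)   # uniform over the non-empty sub-regions
        prob /= len(non_empty)
    return region, prob


def estimate_total(points, region, m, k=5, seed=0):
    """Average of n(r_i) / p(r_i) over m independently sampled sub-regions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        r, p = rrzi_sample(points, region, k, rng)
        total += count_in(points, r) / p
    return total / m


# Hypothetical PoIs: three per quadrant of a 4 x 4 map.
pois = [(0.5, 0.5), (1.0, 1.0), (1.5, 0.5), (2.5, 0.5), (3.0, 1.0), (3.5, 0.5),
        (0.5, 2.5), (1.0, 3.0), (1.5, 2.5), (2.5, 2.5), (3.0, 3.0), (3.5, 2.5)]
print(estimate_total(pois, (0, 0, 4, 4), m=10))  # prints 12.0
```

In this symmetric example every terminal sub-region holds 3 PoIs and is reached with probability 1/4, so each sample contributes exactly 12. On uneven data the individual samples vary and only their average approaches the true count.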

slide-25
SLIDE 25

Motivation & Problem Definition

How to sample B index leaf nodes to estimate # of trajectories in q with a guaranteed error bound?

q covers n index leaf nodes

slide-26
SLIDE 26

Random Index Sampling

[Figure: three panels (Data, Indexing Structure, Sampling and Estimation). A query q over ST-indexed data (Lng, Lat, Time) covers a list of index leaf nodes; B sampled index leaf nodes each carry a trajectory list (r1, r2; r3, r5; r6, r7; r9, r10; …); an inverted index stores the occurrence time $k^q$ of each trajectory.]

slide-27
SLIDE 27

Random Index Sampling

  • Stage 1: Sampling Stage:
  • Uniformly at random sample B index leaf nodes with replacement
  • Stage 2: Estimation Stage: (Unbiased Estimator) [equation not preserved]
  • Convergence analysis: [bound not preserved; it is stated in terms of the maximum number of trajectories in an index leaf node]
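The two stages can be sketched as below. The slide's estimator is not preserved, so the weighting used here is an assumption: each trajectory r in a sampled leaf contributes 1/k_r, where k_r is the number of leaf nodes in q that contain r (the occurrence count from the inverted index), and the sum is scaled by n/B. This weighting is unbiased for the number of distinct trajectories, but it may differ from the paper's exact form. The function name and data are hypothetical.

```python
import random


def estimate_trajectories(leaf_nodes, B, seed=0):
    """Estimate the number of distinct trajectories covered by n index leaf nodes.

    leaf_nodes: list of sets of trajectory ids, one set per index leaf node in q.
    B: number of leaf nodes to sample (uniformly, with replacement).
    """
    rng = random.Random(seed)
    n = len(leaf_nodes)
    # Occurrence count per trajectory. It is built by a full scan here only to
    # keep the example self-contained; an inverted index would supply it directly.
    occurrences = {}
    for leaf in leaf_nodes:
        for r in leaf:
            occurrences[r] = occurrences.get(r, 0) + 1
    total = 0.0
    for _ in range(B):
        leaf = rng.choice(leaf_nodes)                     # Stage 1: sampling
        total += sum(1 / occurrences[r] for r in leaf)    # weight 1/k_r per trajectory
    return n * total / B                                  # Stage 2: estimation


# Hypothetical query covering two leaf nodes; trajectory r2 appears in both.
leaves = [{"r1", "r2"}, {"r2", "r3"}]
print(estimate_trajectories(leaves, B=4))  # prints 3.0
```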

slide-28
SLIDE 28

1.2 Crawling based Sampling

[Figure: example graph with nodes 1-6]

Undirected!

slide-29
SLIDE 29

(1) Breadth-First-Search (BFS)

[Figure: example graph with nodes A-H, marked as Unexplored, Explored, or Visited]

  • Starting from a seed, BFS explores all neighbor nodes. The process continues iteratively without replacement.
  • BFS leads to bias towards high degree nodes (Lee et al., “Statistical properties of Sampled Networks”, Phys Review E, 2006).
  • Early measurement studies of OSNs use BFS as the primary sampling technique, e.g. [Mislove et al.], [Ahn et al.], [Wilson et al.]

Minas Gjoka, UC Irvine, Walking in Facebook
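A minimal sketch of BFS sampling with a node budget. The graph, the budget parameter, and the function name are hypothetical.

```python
from collections import deque


def bfs_sample(graph, seed_node, budget):
    """Collect up to `budget` nodes in breadth-first order, never revisiting a node."""
    visited = [seed_node]
    seen = {seed_node}
    queue = deque([seed_node])
    while queue and len(visited) < budget:
        u = queue.popleft()
        for v in graph[u]:          # explore all neighbors of u
            if v not in seen:
                seen.add(v)
                visited.append(v)
                queue.append(v)
                if len(visited) == budget:
                    break
    return visited


# Hypothetical undirected graph as adjacency lists.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(bfs_sample(graph, "A", budget=4))  # prints ['A', 'B', 'C', 'D']
```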

slide-30
SLIDE 30

Random Walk

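  • Adjacency matrix (undirected graph on nodes 1-4, so A is symmetric)

$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$, $D = \mathrm{diag}(3, 2, 3, 2)$

  • Transition Probability Matrix: $P_{ij} = 1/k_i$ for each neighbor j of node i, where $k_i = d_i$ is the degree of node i

$P = D^{-1}A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}$

  • |E|: number of links
  • Stationary Distribution: $\pi_i = \frac{d_i}{2|E|}$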
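As a small check, the sketch below builds the simple random walk's transition probabilities for a 4-node graph with degrees 3, 2, 3, 2 and verifies that the degree-proportional distribution is unchanged by one step of the walk. The adjacency lists and function names are assumptions for illustration; exact fractions are used so the equality check is exact.

```python
from fractions import Fraction

# Undirected 4-node graph with degrees 3, 2, 3, 2 (5 links).
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}


def transition_matrix(adj):
    """P[i][j] = 1 / k_i for every neighbor j of i."""
    return {u: {v: Fraction(1, len(nbrs)) for v in nbrs} for u, nbrs in adj.items()}


def stationary(adj):
    """pi_i = d_i / (2|E|); the degree sum equals 2|E| in an undirected graph."""
    two_e = sum(len(nbrs) for nbrs in adj.values())
    return {u: Fraction(len(nbrs), two_e) for u, nbrs in adj.items()}


def step(dist, P):
    """One step of the walk: new_dist[v] = sum over u of dist[u] * P[u][v]."""
    out = {u: Fraction(0) for u in dist}
    for u, row in P.items():
        for v, p in row.items():
            out[v] += dist[u] * p
    return out


P = transition_matrix(adj)
pi = stationary(adj)
print(pi)                 # probabilities 3/10, 1/5, 3/10, 1/5
print(step(pi, P) == pi)  # prints True
```

High-degree nodes are visited more often, which is why random-walk samples need re-weighting before they are used to estimate uniform node statistics.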

slide-31
SLIDE 31

Metropolis-Hastings Random Walk

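  • Adjacency matrix (undirected graph on nodes 1-4, so A is symmetric)

$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$, $D = \mathrm{diag}(3, 2, 3, 2)$

  • Transition Probability Matrix

$P^{MH}_{u,w} = \begin{cases} \frac{1}{k_u} \min\left(1, \frac{k_u}{k_w}\right) & \text{if } w \text{ is a neighbor of } u \\ 1 - \sum_{y \neq u} P^{MH}_{u,y} & \text{if } w = u \end{cases}$

$P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}$

  • |E|: number of links
  • Stationary Distribution: $\pi_u = \frac{1}{|V|}$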
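A minimal sketch of the Metropolis-Hastings transition probabilities for a 4-node graph with degrees 3, 2, 3, 2. The adjacency lists and the function name are assumptions, and the sketch assumes the graph has no self-loops. The final check confirms that the uniform distribution is stationary.

```python
from fractions import Fraction

# Undirected 4-node graph with degrees 3, 2, 3, 2.
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}


def mh_transition_matrix(adj):
    """P[u][w] = (1/k_u) * min(1, k_u/k_w) for neighbors; the remainder stays at u."""
    P = {}
    for u, nbrs in adj.items():
        k_u = len(nbrs)
        row = {}
        for w in nbrs:
            k_w = len(adj[w])
            row[w] = Fraction(1, k_u) * min(Fraction(1), Fraction(k_u, k_w))
        row[u] = 1 - sum(row.values())   # self-loop probability
        P[u] = row
    return P


P_mh = mh_transition_matrix(adj)
print(P_mh[2])  # node 2 moves to 1 or 3 with probability 1/3 each, stays with 1/3

# Uniform distribution 1/|V| is stationary: every column of P_mh sums to 1.
uniform = Fraction(1, len(adj))
after_one_step = {
    w: sum(uniform * P_mh[u].get(w, 0) for u in adj) for w in adj
}
print(all(p == uniform for p in after_one_step.values()))  # prints True
```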

slide-32
SLIDE 32
  • 2. Clustering
  • 1. Hierarchical
  • 2. K-means -> BFR
  • 3. DBScan -> DENCLUE
slide-33
SLIDE 33

Example: Hierarchical clustering

Data:

  • … data point: (0,0), (1,2), (2,1), (4,1), (5,0), (5,3)

x … centroid: (1.5,1.5), (4.5,0.5), (1,1), (4.7,1.3)

[Figure: Dendrogram]
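The centroids in this example can be reproduced with a short agglomerative sketch that repeatedly merges the two clusters whose centroids are closest. Merging by centroid distance is an assumption about how the example was built, and the function names are illustrative; `math.dist` requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of a list of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def agglomerate(points):
    """Merge the two closest clusters until one remains; return each merge's centroid."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        merges.append(centroid(merged))
    return merges


data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
for c in agglomerate(data):
    print(tuple(round(v, 1) for v in c))
```

The first four merges give (1.5, 1.5), (4.5, 0.5), (1.0, 1.0) and (4.7, 1.3), matching the centroids listed on the slide. The last merge joins everything into one cluster, which corresponds to the root of the dendrogram.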

slide-34
SLIDE 34

Example: K-means

[Figure: data points and centroids (x); clusters after round 1]

J. Leskovec, A. Rajaraman, J.

slide-35
SLIDE 35

Example: K-means

[Figure: data points and centroids (x); clusters after round 2]

slide-36
SLIDE 36

Example: K-means

[Figure: data points and centroids (x); clusters at the end]
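The rounds shown in these figures follow the usual K-means loop: assign each point to its nearest centroid, recompute centroids, and stop when they no longer move. A minimal sketch follows; the points, initial centroids, and function name are hypothetical, and keeping an empty cluster's old centroid is an added assumption.

```python
from math import dist


def kmeans(points, centroids, max_rounds=100):
    """Plain K-means; returns the final centroids and the clusters of the last round."""
    clusters = []
    for _ in range(max_rounds):
        # Assignment step: each point goes to the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: nothing moved this round
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups, initial centroids picked from the data.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
final_centroids, final_clusters = kmeans(points, [(0, 0), (10, 0)])
print(final_centroids)  # prints [(0.0, 1.0), (10.0, 1.0)]
```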

slide-37
SLIDE 37

BFR: “Galaxies” Picture

[Figure: a cluster with its centroid, whose points are in the DS; compressed sets, whose points are in the CS; points in the RS]

  • Discard set (DS): Close enough to a centroid to be summarized
  • Compression set (CS): Summarized, but not assigned to a cluster
  • Retained set (RS): Isolated points

slide-38
SLIDE 38

Summarizing Sets of Points

For each cluster, the discard set (DS) is summarized by:

  • The number of points, N
  • The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension
  • The vector SUMSQ, whose ith component is the sum of squares of the coordinates in the ith dimension
  • 2d + 1 values represent a cluster of any size

– d = number of dimensions
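A minimal sketch of the (N, SUM, SUMSQ) summary. The class and method names are illustrative; the centroid SUM/N and the per-dimension variance SUMSQ/N − (SUM/N)² follow directly from the three statistics.

```python
class ClusterSummary:
    """Summarize a set of d-dimensional points with 2d + 1 numbers."""

    def __init__(self, d):
        self.n = 0                 # N: number of points
        self.sum = [0.0] * d       # SUM: per-dimension sum of coordinates
        self.sumsq = [0.0] * d     # SUMSQ: per-dimension sum of squared coordinates

    def add(self, point):
        """Fold one point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]


# Hypothetical 2-D points.
summary = ClusterSummary(d=2)
for p in [(1, 2), (3, 4)]:
    summary.add(p)
print(summary.centroid())  # prints [2.0, 3.0]
print(summary.variance())  # prints [1.0, 1.0]
```

Because the statistics are plain sums, two summaries can be merged by adding them component-wise, which is what makes this representation convenient for data that does not fit in memory.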

[Figure: a cluster, all of whose points are in the DS, and its centroid]

slide-39
SLIDE 39

DBSCAN Algorithm: Example

  • Parameters
  • ε = 2 cm
  • MinPts = 3
slide-40
SLIDE 40

DBSCAN Algorithm: Example

  • Parameters
  • ε = 2 cm
  • MinPts = 3
slide-41
SLIDE 41

DBSCAN Algorithm: Example

  • Parameters
  • ε = 2 cm
  • MinPts = 3
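A minimal DBSCAN sketch using the same parameter values (ε = 2, MinPts = 3). The points and the function name are hypothetical, neighborhoods are found by brute force, and a point counts itself as one of its own neighbors, which is a common convention assumed here.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)   # None means not yet visited

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # provisional noise; may later become a border point
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        frontier = [j for j in nbrs if j != i]
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster         # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:      # j is also a core point: keep expanding
                frontier.extend(j_nbrs)
    return labels


# Hypothetical points (units of cm): three close together and one far away.
pts = [(0, 0), (1, 0), (2, 0), (10, 10)]
print(dbscan(pts, eps=2, min_pts=3))  # prints [0, 0, 0, -1]
```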
slide-42
SLIDE 42

Roadmap

  • 1. Sampling & Indexing

– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

  • 2. Clustering

– Hierarchical
– K-means, DBScan

  • 3. Recommender System, Anomaly detection, Map-Matching, etc.
  • 4. Applications
slide-43
SLIDE 43

Content based Recommendation

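[Figure: the user likes some items; item profiles (e.g. red, circles, triangles) are built from them; a user profile is built from those item profiles; the user profile is matched against item profiles to recommend new items]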

slide-44
SLIDE 44

Collaborative Filtering

  • Consider user x
  • Find set N of other users whose ratings are “similar” to x’s ratings
  • Estimate x’s ratings based on ratings of users in N

[Figure: user x and the set N of similar users]
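The three steps can be sketched as a user-user collaborative filter. The slide does not fix a similarity measure or a way to combine ratings, so cosine similarity and a similarity-weighted average are assumptions here, as are the function names and the ratings.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries (item -> rating)."""
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def predict(ratings, x, item, n_neighbors=2):
    """Estimate user x's rating of item from the most similar users who rated it."""
    candidates = [
        (cosine(ratings[x], r), u)
        for u, r in ratings.items()
        if u != x and item in r
    ]
    top = sorted(candidates, reverse=True)[:n_neighbors]   # the set N
    weight = sum(s for s, _ in top)
    if weight == 0:
        return None                                         # no usable neighbors
    return sum(s * ratings[u][item] for s, u in top) / weight


# Hypothetical ratings: u1 agrees with x on items a and b, u2 disagrees.
ratings = {
    "x": {"a": 5, "b": 1},
    "u1": {"a": 5, "b": 1, "c": 4},
    "u2": {"a": 1, "b": 5, "c": 2},
}
print(predict(ratings, "x", "c"))
```

The prediction for item c lands between the two neighbors' ratings but closer to u1's, because u1's ratings are more similar to x's.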

slide-45
SLIDE 45

Roadmap

  • 1. Sampling & Indexing

– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.

  • 2. Clustering

– Hierarchical
– K-means, DBScan

  • 3. Recommender System, Anomaly detection, Map-Matching, etc.
  • 4. Applications
slide-46
SLIDE 46

Air Pollution: A Global Concern !

Air quality monitor stations: PM2.5, PM10, NO2, SO2, CO, O3

[Figure: map of monitor stations S1 to S22 over a 50 km × 40 km area]

slide-47
SLIDE 47

Inferring Real-Time and Fine-Grained air quality throughout a city using Big Data

Data sources: Meteorology, Traffic, POIs, Road networks, Human Mobility, Historical air quality data, Real-time air quality reports

Zheng, Y., et al. U-Air: when urban air quality inference meets big data. KDD 2013

[Figure: map of monitor stations S1 to S22]

slide-48
SLIDE 48


Class Outcomes

slide-49
SLIDE 49


What is DS504/CS586 about?

  • We’ll learn about

– Advanced Techniques for Big Data Analytics

  • Large scale data sampling and estimation,
  • Data Cleaning,
  • Graph Data Mining,
  • Data management, clustering, etc.

– Applications with Big Data Analytics

  • Urban Computing
  • Social network analysis
  • Recommender system, etc.

  • Learning outcomes

– Understand and explain challenges and advances in the state of the art in big data analytics.
– Design, develop and fully execute a big data analytics project.
– Communicate the ideas effectively in the form of a presentation and written documents to a technical audience.

slide-50
SLIDE 50

CS586/DS504-2017Fall

  • 1. Data Acquisition & Measurement: representative data collection, sampling techniques
  • 2. Data Preprocessing/Cleaning: error correction, map-matching
  • 3. Data Management: indexing, query processing
  • 4. Big Data Mining: graph mining, data clustering, recommender systems, outlier detection
  • 5. Applications: urban computing, social network analysis, networking

Sampling and index

Clustering

  • 4. K-means, DBSCAN
  • 4. BFR, DENCLUE
  • 4. Trajectory Clustering
  • 5. Urban: Bike sharing
  • 1. Graph Mining
  • 3. Index, Query
  • 4. Data Collection
  • 2. Map-Matching
  • 4. Recommender Systems
  • 4. Outlier Detection

More techniques

slide-51
SLIDE 51

Logistics

Workload

  • Focus more on critical thinking, problem solving, and “heads-on/hands-on” experiences!
  • Understand, formulate and solve problems
  • Read and critique research papers
  • Two Course Projects
  • Oral presentations
  • Team Work
  • Coding

slide-52
SLIDE 52
Workload and Grading

  • Grading

– Projects (40%)

  • Project 1 (10%)
  • Project 2 (30%)

– Final reports in the discussion forum (by 11:59PM 12/12 Tue)
– Self-and-peer evaluation form for project 2 (by 11:59PM 12/12 Tue)

– Written work (30%):

  • Critiques + Project reports (20%)
  • Quizzes (10%, 5% each)

– Oral work (30%):

  • (Project and paper) presentations

slide-53
SLIDE 53

Problems

[Figure: example model diagrams, including an ANN; labels: Data, Models and Algorithms, Data Scientist]

slide-54
SLIDE 54


Want to learn more? Future Opportunities.

slide-55
SLIDE 55

Urban Computing Research Group at WPI

  • DiDi
  • Mobike
  • JD
  • Yunyan
  • TianLai online Karaoke
slide-56
SLIDE 56

Urban Computing Research Group at WPI

  • Human-in-Loop Urban Computing
slide-57
SLIDE 57

Research opportunities are available in my group.

  • 1. Funding support for PhD students
  • 2. Independent Study for MS students

Contact: yli15@wpi.edu
Website: http://wpi.edu/~yli15/index.html