Pick up a handout on the front table
Welcome to DS504/CS586: Big Data Analytics
Review
Prof. Yanhua Li
Time: 6:00pm–8:50pm R
Location: KH116
Fall 2017
Next Session: Final Project Presentation
• 12/14 R
• 20 min each team (including Q&A)
• Snacks and soft drinks will be provided.
– Key topics and techniques discussed in the semester
– Future opportunities
– 10 min break 7:20-7:30PM
– (last week we finished 18 minutes late.)
Graph Mining, Data Clustering, Recommender systems, Outlier Detection
Urban Computing, Social Network Analysis, Networking
Indexing, Query Processing, Error Correction, Map-Matching
Representative data collection: Sampling Techniques
Sampling and index, Clustering
More techniques
Topics in Big Data Mining
1 Graph Mining: Graph Sampling, Node Importance Ranking
2 Clustering: Hierarchical; K-means, BFR; DBScan, DENCLUE
3 Recommender Systems: Content-Based; Collaborative Filtering (User-User Based, Item-Item Based)
Related topics: Facebook/Social graph estimation, Social influence, Topic sensitive PageRank, Trajectory clustering, Location-based recommender sys, Personalized Geo-Social Recom.
– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.
– Hierarchical
– K-means, BFR
– DBScan, DENCLUE
– Social networks
– Location based services
– Urban computing
– and more
• German Tank Problem
– Panther tanks, 1943, World War II
– Estimate # German tanks (N)
– The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement
– m: the maximum serial number observed
– k: total number of tanks observed
– Estimator: the sample maximum plus the average gap between observations in the sample
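The estimator in words can be written as N ≈ m + m/k − 1, the standard frequentist form for this problem. A minimal sketch, assuming serial numbers start at 1 and are sampled without replacement; the function name and the example serials are hypothetical:

```python
def german_tank_estimate(serials):
    """Estimate the population maximum N from observed serial numbers.

    Uses N ~ m + m/k - 1: the sample maximum m plus the average gap
    between observations, where k is the number of observations.
    """
    k = len(serials)   # k: total number of tanks observed
    m = max(serials)   # m: largest serial number observed
    return m + m / k - 1


# Hypothetical observation of four serial numbers.
print(german_tank_estimate([19, 40, 42, 60]))
```

With four observations and a maximum of 60, the average gap is 60/4 − 1 = 14, so the estimate sits 14 above the sample maximum.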
• Mark and recapture: estimate an animal population's size N.
– A portion of the population is captured, marked, and released.
– Later, another portion is captured and the number of marked individuals within the sample, k, is counted.
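One common way to turn these counts into an estimate of N is the Lincoln-Petersen estimator, assumed here since the slide does not name one. The parameter names `n_marked` and `n_second` are hypothetical; only N and k appear on the slide.

```python
def lincoln_petersen(n_marked, n_second, k_recaptured):
    """Estimate population size N from a mark-and-recapture experiment.

    n_marked:     individuals captured, marked, and released in round one
    n_second:     individuals captured in round two
    k_recaptured: marked individuals found in the round-two sample (k)
    Assumes the marked fraction of the second sample matches the marked
    fraction of the whole population: k / n_second ~ n_marked / N.
    """
    if k_recaptured == 0:
        raise ValueError("no marked individuals recaptured; N cannot be estimated")
    return n_marked * n_second / k_recaptured


# Hypothetical numbers: 10 marked, 15 caught later, 5 of them marked.
print(lincoln_petersen(10, 15, 5))
```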
• vertex sampling
• BFS sampling
• random walk sampling
• edge sampling
– YouTube: Random Prefix Sampling
– Index structure: B-Tree, List Index
– Google map/Foursquare: Random Region Sampling / Random Region Zoom-in
– Index structure: Grid-based / Quad Tree / R-Tree
– Trajectory sampling: Random index sampling
– Index structure (combinations): B-Tree+Quad-tree, 3-D R-tree
– Partition the space into disjoint and uniform grids
– Build an index between each grid and the points in the grid
[Figure: grid index example; grid g1 indexes points p1 and p3, grid g2 indexes point p4]
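The two steps above can be sketched as a dictionary from grid cell to point ids. This is a minimal illustration, not a production index; the coordinates, `cell_size`, and function names are assumptions chosen so that two points fall in one cell and one in the next.

```python
from collections import defaultdict


def build_grid_index(points, cell_size):
    """Map each uniform grid cell (col, row) to the ids of the points inside it."""
    index = defaultdict(list)
    for pid, (x, y) in points.items():
        cell = (int(x // cell_size), int(y // cell_size))
        index[cell].append(pid)
    return index


def range_query(index, points, cell_size, x0, y0, x1, y1):
    """Return ids of points inside the rectangle, scanning only overlapping cells."""
    hits = []
    for cx in range(int(x0 // cell_size), int(x1 // cell_size) + 1):
        for cy in range(int(y0 // cell_size), int(y1 // cell_size) + 1):
            for pid in index.get((cx, cy), []):
                x, y = points[pid]
                if x0 <= x <= x1 and y0 <= y <= y1:
                    hits.append(pid)
    return sorted(hits)


# Hypothetical coordinates: p1 and p3 share a cell, p4 is in the neighbor cell.
points = {"p1": (0.2, 0.7), "p3": (0.8, 0.3), "p4": (1.5, 0.5)}
grid = build_grid_index(points, cell_size=1.0)
print(dict(grid))
print(range_query(grid, points, 1.0, 0.5, 0.0, 2.0, 1.0))
```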
– Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
– Each non-leaf node divides its region into four equal-sized quadrants.
– Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).
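A minimal point quad-tree following the three rules above, with the leaf capacity set to 1 as in the example. The class and method names are hypothetical, regions are treated as half-open so quadrants do not overlap, and points are assumed distinct (duplicates would split forever).

```python
class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=1):
        self.bounds = (x0, y0, x1, y1)  # region covered by this node
        self.capacity = capacity        # max points a leaf may hold
        self.points = []                # points stored here while a leaf
        self.children = None            # four sub-trees once split

    def insert(self, p):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= p[0] < x1 and y0 <= p[1] < y1):
            return False                # point lies outside this region
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append(p)
                return True
            self._split()               # leaf is full: become a non-leaf node
        return any(child.insert(p) for child in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        c = self.capacity
        # Four equal-sized quadrants.
        self.children = [
            QuadTree(x0, y0, xm, ym, c), QuadTree(xm, y0, x1, ym, c),
            QuadTree(x0, ym, xm, y1, c), QuadTree(xm, ym, x1, y1, c),
        ]
        old, self.points = self.points, []
        for q in old:                   # push stored points down one level
            any(child.insert(q) for child in self.children)

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


tree = QuadTree(0, 0, 4, 4)
for point in [(1, 1), (3, 3), (0.5, 0.5)]:
    tree.insert(point)
print(len(tree.leaves()))
```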
[Figure: quad-tree example; levels 1, 2, 3 with quadrant codes 00, 02, 03, 30, 31, 32, 33]
[Figure: YouTube video page, including comments from other YouTube users]
[Equation: estimator with a sum over i = 1, …, m and the symbol L]
Please refer to the paper for proof of the unbiasedness, confidence interval, and estimator design of other statistics.
– Random Region Zoom-in divides the queried region into two sub-regions and randomly selects a non-empty sub-region to zoom in on when the region contains more than k PoIs (k=5)
[Figure: random region zoom-in example, Steps 1 to 4, with the probability of sampling the sub-region]
methods.
Estimators of sum and distribution aggregates:
– sampled sub-regions r1,…,rm
– probability of sampling the sub-region ri
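A rough sketch of region zoom-in sampling with a sum estimator. Several things here are assumptions rather than the paper's design: points are held in memory instead of behind a query API, regions are split at the midpoint along alternating axes, and the count is estimated with a Hansen-Hurwitz style average of n(ri)/p(ri) over the sampled sub-regions. All function and parameter names are hypothetical.

```python
import random


def zoom_in_sample(points, region, k=5, rng=random, max_depth=64):
    """Zoom into a random non-empty sub-region until it holds at most k points.

    Returns (points_in_sub_region, probability_of_sampling_that_sub_region).
    """
    x0, y0, x1, y1 = region
    inside = [p for p in points if x0 <= p[0] < x1 and y0 <= p[1] < y1]
    prob, depth = 1.0, 0
    while len(inside) > k and depth < max_depth:
        if depth % 2 == 0:   # split left/right
            xm = (x0 + x1) / 2
            halves = [((x0, y0, xm, y1), [p for p in inside if p[0] < xm]),
                      ((xm, y0, x1, y1), [p for p in inside if p[0] >= xm])]
        else:                # split bottom/top
            ym = (y0 + y1) / 2
            halves = [((x0, y0, x1, ym), [p for p in inside if p[1] < ym]),
                      ((x0, ym, x1, y1), [p for p in inside if p[1] >= ym])]
        non_empty = [h for h in halves if h[1]]
        (x0, y0, x1, y1), inside = rng.choice(non_empty)
        prob /= len(non_empty)   # chosen uniformly among non-empty halves
        depth += 1
    return inside, prob


def estimate_count(points, region, m, k=5, rng=random):
    """Average n(r_i) / p(r_i) over m sampled sub-regions r_1, ..., r_m."""
    total = 0.0
    for _ in range(m):
        inside, prob = zoom_in_sample(points, region, k, rng)
        total += len(inside) / prob
    return total / m
```

Each sampled sub-region is weighted by the inverse of its sampling probability, so sub-regions that are hard to reach count for more; that weighting is what keeps the average centered on the true total.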
[Figure: random index sampling over trajectories. Panels: Data, Indexing Structure, Sampling and Estimation. A query q runs over ST-indexed data (Lng, Lat, Time); index leaf nodes (r1, r2, r3, …) are sampled from the index leaf node list, and an inverted index links the sampled index leaf nodes to the trajectory list and occurrence time.]
when , . is the maximum number of trajectories in an index leaf node.
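• Breadth-First Search (BFS) sampling (undirected graph!)
Minas Gjoka, UC Irvine, Walking in Facebook
[Figure: example graph with nodes A to H; legend: Unexplored, Explored, Visited]
– Starting from a seed node, BFS explores all neighbor nodes. The process continues iteratively without replacement.
– BFS is biased towards high-degree nodes
Lee et al, “Statistical properties of Sampled Networks”, Phys Review E, 2006
– Measurement studies of OSNs use BFS as a primary sampling technique, e.g. [Mislove et al], [Ahn et al], [Wilson et al.]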
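A minimal BFS sampler over an adjacency-list graph, stopping once a node budget is reached. The graph below is a small hypothetical example, not the one in the figure, and `budget` is an assumed parameter.

```python
from collections import deque


def bfs_sample(graph, seed, budget):
    """Return up to `budget` nodes in BFS order starting from `seed`.

    Each node is visited once (sampling without replacement).
    """
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < budget:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) == budget:
                    break
    return visited


# Hypothetical undirected graph as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(bfs_sample(graph, "A", 4))
```

Because a node with many neighbors is reachable from many frontier nodes, it tends to enter the sample early, which is the source of the degree bias noted above.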
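Random walk on an undirected graph (A is symmetric):

\[
D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix},
\qquad
A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}
\]

Simple random walk, with P_{ij} = 1/k_i for each neighbor j of i:

\[
P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}
\]

Metropolis-Hastings random walk on the same graph:

\[
P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}
\]

\[
P^{MH}_{\upsilon,w} =
\begin{cases}
\dfrac{1}{k_\upsilon} \min\!\left(1, \dfrac{k_\upsilon}{k_w}\right) & \text{if } w \text{ is a neighbor of } \upsilon \\[2ex]
1 - \sum_{y \neq \upsilon} P^{MH}_{\upsilon,y} & \text{if } w = \upsilon \\[1ex]
0 & \text{otherwise}
\end{cases}
\]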
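Both transition matrices can be checked numerically. The sketch below assumes the 4-node graph with degrees 3, 2, 3, 2 and no self-loops in A, and uses exact fractions so the 1/3 and 1/2 entries compare exactly; function names are hypothetical.

```python
from fractions import Fraction

# Adjacency matrix of the assumed 4-node undirected graph (degrees 3, 2, 3, 2).
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]


def simple_random_walk(adj):
    """P[i][j] = 1/k_i for every neighbor j of i."""
    k = [sum(row) for row in adj]
    return [[Fraction(a, k[i]) for a in row] for i, row in enumerate(adj)]


def metropolis_hastings_walk(adj):
    """P[v][w] = (1/k_v) * min(1, k_v/k_w) for neighbors; the rest stays at v."""
    k = [sum(row) for row in adj]
    n = len(adj)
    P = [[Fraction(0)] * n for _ in range(n)]
    for v in range(n):
        for w in range(n):
            if v != w and adj[v][w]:
                P[v][w] = Fraction(1, k[v]) * min(1, Fraction(k[v], k[w]))
        P[v][v] = 1 - sum(P[v])   # self-loop absorbs the rejected moves
    return P


P_rw = simple_random_walk(A)
P_mh = metropolis_hastings_walk(A)
print(P_rw[1])
print(P_mh[1])
```

The Metropolis-Hastings matrix comes out symmetric, so its stationary distribution is uniform over nodes, while the simple random walk visits nodes in proportion to their degree.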
[Figure: hierarchical clustering example with dendrogram. Data point shown: (5,3). Centroids (x): (1.5,1.5), (4.5,0.5), (1,1), (4.7,1.3).]
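A small sketch of agglomerative (hierarchical) clustering that repeatedly merges the two clusters with the closest centroids. Only (5,3) survives from the figure's data; the other five points are an assumption, chosen because they reproduce the centroids listed above. Function names are hypothetical.

```python
def centroid(cluster):
    """Mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def hierarchical_merges(points):
    """Merge the two closest clusters until one remains.

    Returns the centroid of each merged cluster, in merge order.
    """
    clusters = [[p] for p in points]
    merged_centroids = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist2(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        merged_centroids.append(centroid(merged))
    return merged_centroids


# Assumed data set; only (5, 3) is legible in the figure.
data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
for c in hierarchical_merges(data):
    print(tuple(round(v, 1) for v in c))
```

The order of merges is exactly what a dendrogram records, one level per merge.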
Rajaraman, J.
[Figure: k-means example showing data points and centroids (x). Panels: Clusters after round 1; Clusters after round 2; Clusters at the end.]
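The rounds in the figure correspond to repeating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. A minimal sketch, with hypothetical data and starting centroids:

```python
def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, centroids, max_rounds=20):
    """Run k-means rounds until the centroids stop moving."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_rounds):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean(cl) if cl else centroids[i]
                   for i, cl in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical data: two groups, with both initial centroids in the first group.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (2, 1)])
print(centroids)
print(clusters)
```

In this run the point (2, 1) starts in the wrong cluster after round 1 and moves to the other one in round 2, mirroring how assignments shift between rounds.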
• BFR keeps three sets of points:
– Discard set (DS): Close enough to a centroid to be summarized
– Compression set (CS): Summarized, but not assigned to a cluster
– Retained set (RS): Isolated points
[Figure: a cluster and its centroid, with all its points in the DS; compressed sets, whose points are in the CS; points in the RS]
– d = number of dimensions
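In the standard description of BFR, a summarized set is kept as 2d + 1 numbers: the point count N, the per-dimension sum, and the per-dimension sum of squares. The slide text does not spell this out, so treat the sketch below as an assumption; the class name is hypothetical.

```python
class ClusterSummary:
    """Summary of a discard-set or compression-set cluster in d dimensions."""

    def __init__(self, d):
        self.n = 0                # N: number of points summarized
        self.sum = [0.0] * d      # SUM: per-dimension sum of coordinates
        self.sumsq = [0.0] * d    # SUMSQ: per-dimension sum of squares

    def add(self, point):
        """Fold a point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Var = E[x^2] - (E[x])^2, computed per dimension.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]


summary = ClusterSummary(d=2)
for p in [(1, 2), (3, 6)]:
    summary.add(p)
print(summary.centroid(), summary.variance())
```

Keeping only these totals means two summaries can be merged by adding their fields, without revisiting the original points.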
– Random prefix / region / region zoom-in sampling
– Index structure: B-Tree, Quad-tree, R-tree, etc.
– Hierarchical
– K-means, DBScan
Matching, etc
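[Figure: content-based recommendation. Build item profiles (e.g. red, circles, triangles) from the items a user likes, build a user profile, match it against item profiles, and recommend.]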
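The build, match, and recommend loop in the figure can be sketched with binary item profiles and cosine similarity. Cosine similarity is one common matching choice and is an assumption here; the item names and the feature order [red, circle, triangle] are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_user_profile(item_profiles, liked):
    """Average the profiles of the items the user likes."""
    vectors = [item_profiles[name] for name in liked]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def recommend(item_profiles, liked, top_n=1):
    """Rank items the user has not liked yet by similarity to the user profile."""
    profile = build_user_profile(item_profiles, liked)
    candidates = [name for name in item_profiles if name not in liked]
    ranked = sorted(candidates,
                    key=lambda name: cosine(profile, item_profiles[name]),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical item profiles over features [red, circle, triangle].
items = {
    "red_circle": [1, 1, 0],
    "red_triangle": [1, 0, 1],
    "red_square": [1, 0, 0],
    "blue_circle": [0, 1, 0],
}
print(recommend(items, liked=["red_circle", "red_triangle"]))
```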
• Urban air quality inference (U-Air)
– Air quality monitor stations measure PM2.5, PM10, NO2, SO2, CO, O3
– Data sources: Meteorology, Traffic, POIs, Road networks, Human Mobility, Historical air quality data, Real-time air quality reports
[Figure: map of a 50km x 40km area with air quality monitor stations S1 to S22]
Zheng, Y., et al. U-Air: when urban air quality inference meets big data. KDD 2013
• We'll learn about
– Advanced Techniques for Big Data Analytics
– Applications with Big Data Analytics
• Learning outcomes
– Understand & explain challenges and advances in the state of the art in big data analytics.
– Design, develop and fully execute a big data analytics project.
– Communicate the ideas effectively in the form of a presentation and written documents to a technical audience.
Logistics
• Focus more on critical thinking, problem solving
• Understand, formulate and solve problems
• Read and critique research papers
• Two Course Projects
• Oral presentations
• Team Work
• Coding
– Projects (40%)
– Final reports in the discussion forum (by 11:59pm 12/12 Tue)
– Self-and-peer evaluation form for project 2 (by 11:59pm 12/12 Tue)
– Written work (30%):
– Oral work (30%):
[Figures: Models and Algorithms; labels include Categories, Regions, Features, Time slots, time steps t-1, t, t+1, and ANN]