Pick up a handout on the front table

Welcome to DS504/CS586: Big Data Analytics (Review)
Prof. Yanhua Li
Time: 6:00pm – 8:50pm R
Location: KH116
Fall 2017

Next Session: Final Project Presentation
• 12/14 R
• 20 min each team (including Q&A)
  – Team 1, Team 2, Team 3, Team 4, Team 5, Team 6, Team 7
• Snacks and soft drinks will be provided.

Today
• 1. Review
  – Key topics and techniques discussed in the semester
  – Future opportunities
    • Big data analytics
    • Urban Computing
  – 10 min break 7:20-7:30PM
• 3. Team 1 presentation and discussion: 7:30PM
• 4. Course evaluation 8:15PM-8:30PM
• 5. Finish at 8:30PM
  – (Last week we finished 18 minutes late.)

Introduction: What is “Big Data”?

Big Data Analytics: techniques and tools for managing, analyzing, and extracting knowledge from “big data”.

CS586/DS504-2017Fall: course overview
[Figure: layered course map; numbers next to techniques refer to the layer they belong to]
• 5. Applications: Urban Computing, Social Network Analysis, Networking
• 4. Big Data Mining: Graph Mining, Data Clustering, Recommender Systems, Outlier Detection
• 3. Data Management: Indexing, Query Processing
• 2. Data Preprocessing/Cleaning: Error Correction, Map-Matching
• 1. Data Acquisition & Measurement: Representative data collection (Sampling)
Techniques
• Sampling and index: 1. Graph Mining; 3. Index, Query; 4. Data Collection
• Clustering: 4. K-means, DBSCAN; 4. BFR, DENCLUE; 4. Trajectory Clustering; 5. Urban: Bike sharing
• More techniques: 2. Map-Matching; 4. Recommender Systems; 4. Outlier Detection

Big Data Mining Topics
Topics in Big Data Mining
• 1. Graph Mining
  – Graph Sampling
  – Node Importance Ranking
  – Facebook/Social graph estimation
  – Social influence
  – Topic-sensitive PageRank
• 2. Clustering
  – Hierarchical
  – K-means, BFR
  – DBSCAN, DENCLUE
  – Trajectory clustering
• 3. Recommender Systems
  – Content-Based
  – Collaborative Filtering: User-User Based, Item-Item Based
  – Location-based recommender systems
  – Personalized Geo-Social Recommendation
• 4. Outlier Detection
• 5. Big Data Integration (Guest Lecture)

Roadmap
• 1. Sampling & Indexing
  – Random prefix / region / region zoom-in sampling
  – Index structure: B-Tree, Quad-tree, R-tree, etc.
• 2. Clustering
  – Hierarchical
  – K-means, BFR
  – DBSCAN, DENCLUE
• 3. Recommender System, Map-Matching, etc.
• 4. Applications
  – Social networks
  – Location-based services
  – Urban computing
  – and more

Sampling Techniques to Count Population
• German Tank Problem
  – Panther tanks, 1943, World War II
  – Estimate the number of German tanks (N)
  – The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement
• m: the maximum serial number observed
• k: the total number of tanks observed
• Estimator: $\hat{N} = m\,(1 + k^{-1}) - 1$
  – The sample maximum plus the average gap between observations in the sample.
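The estimator on this slide is short enough to check by hand. The following is a minimal sketch, assuming the observed serial numbers are distinct positive integers; the function name and the sample serials are hypothetical and not from the lecture.

```python
def german_tank_estimate(serials):
    """Estimate the population size N from serial numbers seen without replacement."""
    k = len(serials)              # k: number of tanks observed
    if k == 0:
        raise ValueError("need at least one observed serial number")
    m = max(serials)              # m: largest serial number observed
    return m * (1 + 1 / k) - 1    # N_hat = m (1 + 1/k) - 1


# Hypothetical observations, for illustration only.
observed = [19, 40, 42, 60]
print(german_tank_estimate(observed))  # 60 * 1.25 - 1 = 74.0
```

With four observations and a maximum of 60, the average gap is roughly 60/4 - 1 = 14, so the estimate lands 14 above the sample maximum.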

Sampling Techniques to Count Population
• Mark and recapture
  – A method commonly used in ecology to estimate an animal population’s size N.
  – Step 1: A portion of the population, K, is captured, marked, and released.
  – Step 2: Later, another portion, n, is captured, and the number of marked individuals within the sample, k, is counted.
• Estimation: $\hat{N} = \dfrac{K\,n}{k}$
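As a quick illustration of the mark-and-recapture estimate, here is a small sketch. The function name and the numbers in the example are hypothetical; the formula is the one on the slide (commonly known as the Lincoln-Petersen estimator).

```python
def mark_recapture_estimate(K, n, k):
    """Estimate population size N from K marked, n recaptured, k marked among recaptured."""
    if k == 0:
        raise ValueError("no marked individuals recaptured; the estimate is undefined")
    return K * n / k  # N_hat = K * n / k


# Hypothetical field data: 100 animals marked, 60 caught later, 12 of them marked.
print(mark_recapture_estimate(K=100, n=60, k=12))  # 500.0
```

The reasoning is a proportion argument: the marked fraction in the second sample, k/n, should be close to the marked fraction in the whole population, K/N.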

Sampling Big Data
• 1.1 Random sampling (uniform & independent)
  – vertex sampling
  – edge sampling
• 1.2 Crawling
  – BFS sampling
  – random walk sampling

1.1 Random Vertex Sampling & Index
• One-dimensional data
  – YouTube: Random Prefix Sampling
  – Index structure: B-Tree, List Index
• Two-dimensional data (spatial data)
  – Google Maps/Foursquare: Random Region Sampling / Random Region Zoom-in
  – Index structure: Grid-based / Quad-tree / R-tree
• Three-dimensional data (spatio-temporal data)
  – Trajectory sampling: Random index sampling
  – Index structure (combinations): B-Tree + Quad-tree, 3-D R-tree

Full B-Tree Structure

Grid-based Spatial Indexing
• Indexing
  – Partition the space into disjoint and uniform grids
  – Build an index between each grid and the points in the grid
[Figure: grid cells g1 and g2 with points p1, p3, p4, and the corresponding cell-to-point index]
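The two indexing steps can be sketched in a few lines. This is only an illustration under stated assumptions: square cells of side `cell_size`, points given as a dict of id to (x, y), and made-up coordinates for p1, p3, and p4; none of these details come from the slide.

```python
from collections import defaultdict


def build_grid_index(points, cell_size):
    """Map each grid cell (col, row) to the ids of the points that fall inside it."""
    index = defaultdict(list)
    for pid, (x, y) in points.items():
        cell = (int(x // cell_size), int(y // cell_size))
        index[cell].append(pid)
    return index


def range_query(index, points, cell_size, xmin, ymin, xmax, ymax):
    """Return ids of points inside the rectangle, visiting only overlapping cells."""
    hits = []
    for cx in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
        for cy in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
            for pid in index.get((cx, cy), []):
                x, y = points[pid]
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append(pid)
    return sorted(hits)


# Hypothetical coordinates: p1 and p3 share one cell, p4 sits in the next one.
points = {"p1": (0.5, 0.5), "p3": (0.8, 0.2), "p4": (1.5, 0.5)}
index = build_grid_index(points, cell_size=1.0)
print(dict(index))
print(range_query(index, points, 1.0, 0.0, 0.0, 1.0, 1.0))
```

A query only touches the cells that overlap its rectangle, which is the point of the index; the final coordinate check filters out points in partially covered cells.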

Quad-Tree
• Indexing
  – Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
  – Each non-leaf node divides its region into four equal-sized quadrants.
  – Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).
[Figure: example space split into quadrants 0 to 3, with sub-quadrants such as 00, 02, 03, 30, 31, 32, 33, and the matching tree]
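A point quad-tree following the three rules above might look like the sketch below. It assumes a square root region, a leaf capacity of 1 as in the slide's example, and a depth cap (`max_depth`, a hypothetical safeguard so that duplicate points cannot split forever). The class and method names are illustrative.

```python
class QuadTree:
    """Point quad-tree over the square [x, x + size) x [y, y + size)."""

    def __init__(self, x, y, size, capacity=1, depth=0, max_depth=16):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.depth, self.max_depth = depth, max_depth
        self.points = []       # points stored here while this node is a leaf
        self.children = None   # four child nodes once the node has been split

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        half = self.size / 2
        # Children are ordered row by row: index = row * 2 + col.
        self.children = [
            QuadTree(self.x + col * half, self.y + row * half, half,
                     self.capacity, self.depth + 1, self.max_depth)
            for row in (0, 1) for col in (0, 1)
        ]
        pending, self.points = self.points, []
        for p in pending:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[row * 2 + col]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


tree = QuadTree(0, 0, 4)
for point in [(0.5, 0.5), (3, 3), (3.5, 3.5), (1, 3)]:
    tree.insert(point)
print([(leaf.x, leaf.y, leaf.size, leaf.points) for leaf in tree.leaves()])
```

Dense areas end up with deeper, smaller leaves while sparse areas stay coarse, which is the main difference from the uniform grid index.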

Random Vertex Sampling: YouTube Comments from other YouTube users

YouTube Video ID Space

Random Prefix Sampling
• Let $p_L$ denote the probability that a randomly generated id matches a given L-length prefix:
  $$p_L = \frac{1}{|S|^{L}} = \frac{1}{64^{L}}, \quad \text{if } L = 1, \dots, 10$$
  $$p_L = \frac{1}{|S|^{10}\,|T|} = \frac{1}{64^{10} \cdot 16}, \quad \text{if } L = 11$$
• Generate m prefixes of length L.
• Let $X_i^L$ be the total number of videos with prefix i of length L, and N the total number of videos; then $X_i^L \sim \mathrm{Binomial}(N, p_L)$.

Unbiased Estimator for the Total Number of Videos
• Given m samples $X_i^L$ obtained by querying randomly generated prefixes of the same length L in [1, 11], we have the unbiased estimator of the total number of videos:
  $$\hat{N} = \frac{1}{m\,p_L} \sum_{i=1}^{m} X_i^L$$
  (See paper for the confidence interval and variance.)
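To make the estimator concrete, the sketch below computes $p_L$ and $\hat{N}$ from a list of per-prefix result counts. The alphabet sizes 64 and 16 come from the slides; the function names and the example counts are hypothetical, and no real YouTube query is performed.

```python
def prefix_probability(L):
    """Probability that a random video id matches a given prefix of length L."""
    if 1 <= L <= 10:
        return 1 / 64 ** L            # p_L = 1 / |S|^L with |S| = 64
    if L == 11:
        return 1 / (64 ** 10 * 16)    # last character is drawn from |T| = 16 symbols
    raise ValueError("prefix length must be between 1 and 11")


def estimate_total_videos(counts, L):
    """N_hat = (1 / (m * p_L)) * sum of X_i over the m sampled prefixes."""
    m = len(counts)
    if m == 0:
        raise ValueError("need at least one sampled prefix")
    return sum(counts) / (m * prefix_probability(L))


# Hypothetical counts returned for m = 4 random prefixes of length L = 2.
print(estimate_total_videos([3, 0, 5, 4], L=2))  # 12 / (4 / 4096) = 12288.0
```

Since each $X_i^L$ has expectation $N p_L$, dividing the sample mean by $p_L$ gives an estimate whose expectation is N.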

Practical Issues

Simple random region sampling
Tabulating Stage: Estimation
• Unbiased estimator of the total number of venues N: [equation not preserved in this transcript]
  – [symbol not preserved]: number of venues of $X_t$
• Please refer to the paper for proof of the unbiasedness, confidence interval, and estimator design of other statistics.

Random Region Zoom-in on Maps
• RRZI(A): At each step, RRZI divides the current queried region into two sub-regions and randomly selects a non-empty sub-region to zoom in on, as long as the region contains more than k PoIs (k = 5).
[Figure: Steps 1 to 4 of a zoom-in, annotated with the probability of sampling each sub-region]
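The zoom-in loop can be sketched as follows. Several details are assumptions made for illustration and are not stated on the slide: the region is split at the midpoint of its longer side, the choice among non-empty sub-regions is uniform, the full point set is available locally (a real crawler would query a map API instead), and `max_depth` guards against duplicate points.

```python
import random


def rrzi(points, region, k=5, rng=random, max_depth=40):
    """Zoom in until the region holds at most k points.

    Returns (sub_region, points_inside, probability_of_reaching_it).
    """
    xmin, ymin, xmax, ymax = region
    inside = [p for p in points if xmin <= p[0] < xmax and ymin <= p[1] < ymax]
    prob, depth = 1.0, 0
    while len(inside) > k and depth < max_depth:
        # Assumption: halve the longer side of the current region.
        if xmax - xmin >= ymax - ymin:
            mid = (xmin + xmax) / 2
            halves = [(xmin, ymin, mid, ymax), (mid, ymin, xmax, ymax)]
        else:
            mid = (ymin + ymax) / 2
            halves = [(xmin, ymin, xmax, mid), (xmin, mid, xmax, ymax)]
        candidates = []
        for h in halves:
            pts = [p for p in inside if h[0] <= p[0] < h[2] and h[1] <= p[1] < h[3]]
            if pts:  # only non-empty sub-regions can be selected
                candidates.append((h, pts))
        (xmin, ymin, xmax, ymax), inside = rng.choice(candidates)
        prob /= len(candidates)  # uniform choice among the non-empty halves
        depth += 1
    return (xmin, ymin, xmax, ymax), inside, prob


# Hypothetical PoIs on a line inside an 8 x 8 region.
pois = [(i + 0.5, 0.5) for i in range(8)]
sub_region, found, p = rrzi(pois, (0, 0, 8, 8), k=2, rng=random.Random(0))
print(sub_region, found, p)
```

Tracking the selection probability matters because sub-regions are not sampled uniformly; a weighted estimator needs that probability for each sampled sub-region.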

Random Region Zoom-in on Maps
• RRZI and RRZIC can be viewed as weighted sampling methods.
• Estimators of sum and distribution aggregates are built from the sampled sub-regions $r_1, \dots, r_m$ and the probability of sampling each sub-region $r_i$. [equations not preserved in this transcript]

Motivation & Problem Definition
• Query q covers n index leaf nodes.
• How to sample B index leaf nodes to estimate the number of trajectories in q with a guaranteed error bound?

Random Index Sampling: Sampling and Estimation
[Figure: data indexing structure. Spatio-temporally indexed data (Lat, Lng, Time) with query q; an inverted index mapping each trajectory $r_1, r_2, r_3, \dots$ to its index leaf node list; and B sampled index leaf nodes, each with its trajectory list and occurrence times.]

Random Index Sampling
• Stage 1: Sampling Stage
  – Uniformly at random sample B index leaf nodes with replacement
• Stage 2: Estimation Stage (Unbiased Estimator)
• Convergence analysis: [condition and bound not preserved in this transcript], where the bound depends on the maximum number of trajectories in an index leaf node.

1.2 Crawling-based Sampling
[Figure: undirected graph with nodes 1 to 6]

(1) Breadth-First-Search (BFS)
• Starting from a seed, BFS explores all neighbor nodes. The process continues iteratively without replacement.
• BFS leads to bias towards high-degree nodes (Lee et al., “Statistical properties of Sampled Networks”, Phys Review E, 2006).
• Early measurement studies of OSNs use BFS as the primary sampling technique, e.g. [Mislove et al], [Ahn et al], [Wilson et al.]
[Figure: graph with nodes A to H, marked as Unexplored, Explored, or Visited]
Minas Gjoka, UC Irvine, Walking in Facebook
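A BFS crawl with a sampling budget can be sketched as below. The adjacency-list graph and the `budget` parameter are hypothetical stand-ins for an online social network and a crawl limit; they are not taken from the slide.

```python
from collections import deque


def bfs_sample(graph, seed, budget):
    """Collect up to `budget` nodes in breadth-first order, never revisiting a node."""
    sampled = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(sampled) < budget:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:           # "without replacement"
                seen.add(v)
                sampled.append(v)
                queue.append(v)
                if len(sampled) == budget:
                    break
    return sampled


# Hypothetical undirected graph given as adjacency lists.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(bfs_sample(graph, "A", budget=4))  # ['A', 'B', 'C', 'D']
```

A node with many neighbors has many chances to be discovered early, which is one way to see why an incomplete BFS over-represents high-degree nodes.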

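Random Walk
[Figure: undirected graph on nodes 1 to 4]
• Adjacency matrix A (symmetric, because the graph is undirected) and degree matrix D:
  $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}$$
• Transition probability matrix, with $P_{ij} = 1/k_i$ when j is a neighbor of i, where $k_i$ is the degree of node i:
  $$P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}$$
• |E|: number of links
• Stationary distribution: $\pi_i = \dfrac{k_i}{2|E|}$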
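The stationary distribution can be checked numerically on the 4-node example. The sketch below builds the row-stochastic transition matrix from the adjacency matrix on the slide and verifies that one step of the walk leaves the degree-proportional distribution unchanged. Exact fractions are used to avoid rounding; the helper names are illustrative.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node example graph on the slide.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]


def transition_matrix(adj):
    """Row i holds the probabilities of moving from node i: 1/k_i per neighbor."""
    return [[Fraction(a, sum(row)) for a in row] for row in adj]


def step(dist, P):
    """One step of the walk: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


P = transition_matrix(A)
degrees = [sum(row) for row in A]                     # [3, 2, 3, 2]
num_edges = sum(degrees) // 2                         # |E| = 5
pi = [Fraction(k, 2 * num_edges) for k in degrees]    # k_i / (2|E|)

print(pi)
print(step(pi, P) == pi)  # True: pi is stationary
```

Because the walk visits node i with probability proportional to its degree, samples from a simple random walk must be re-weighted before they are used to estimate node-level averages.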

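Metropolis-Hastings Random Walk
[Figure: undirected graph on nodes 1 to 4]
• Adjacency matrix A (symmetric, because the graph is undirected) and degree matrix D:
  $$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}$$
• Transition probability matrix, where $k_u$ is the degree of node u:
  $$P^{MH}_{u,w} = \begin{cases} \dfrac{1}{k_u} \min\!\left(1, \dfrac{k_u}{k_w}\right) & \text{if } w \text{ is a neighbor of } u \\[2ex] 1 - \sum_{y \neq u} P^{MH}_{u,y} & \text{if } w = u \end{cases}$$
  $$P^{MH} = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/3 & 0 & 1/3 & 1/3 \end{pmatrix}$$
• |E|: number of links; |V|: number of nodes
• Stationary distribution: $\pi_u = \dfrac{1}{|V|}$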
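The Metropolis-Hastings transition rule on this slide can be applied mechanically to the same 4-node graph. The sketch below is an illustration with hypothetical helper names; it uses exact fractions and checks that the uniform distribution is stationary.

```python
from fractions import Fraction

# Adjacency matrix of the 4-node example graph on the slide.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]


def mh_transition_matrix(adj):
    """Metropolis-Hastings random walk targeting the uniform distribution."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    P = [[Fraction(0)] * n for _ in range(n)]
    for u in range(n):
        for w in range(n):
            if u != w and adj[u][w]:
                # Propose neighbor w with prob. 1/k_u, accept with min(1, k_u/k_w).
                P[u][w] = Fraction(1, deg[u]) * min(Fraction(1), Fraction(deg[u], deg[w]))
        P[u][u] = 1 - sum(P[u])  # rejected proposals keep the walk at u
    return P


P_mh = mh_transition_matrix(A)
for row in P_mh:
    print([str(x) for x in row])

n = len(A)
uniform = [Fraction(1, n)] * n
after_one_step = [sum(uniform[i] * P_mh[i][j] for i in range(n)) for j in range(n)]
print(after_one_step == uniform)  # True: 1/|V| is stationary
```

The self-loops on the low-degree nodes are what remove the degree bias: the walk lingers there instead of drifting toward the high-degree nodes.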

2. Clustering
• 1. Hierarchical
• 2. K-means -> BFR
• 3. DBSCAN -> DENCLUE

Example: Hierarchical clustering
• Data: o … data point, x … centroid
• Data points (o): (0,0), (1,2), (2,1), (4,1), (5,0), (5,3)
• Centroids (x): (1.5,1.5), (1,1), (4.5,0.5), (4.7,1.3)
[Figure: scatter plot of the points and centroids, with the resulting dendrogram]
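The centroids in this example can be reproduced with a small agglomerative routine. The sketch below assumes that clusters are merged by smallest Euclidean distance between centroids, which matches the centroid values shown on the slide; the function names are illustrative.

```python
from itertools import combinations
from math import dist


def centroid(cluster):
    """Mean of the points in a cluster."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)


def agglomerate(points):
    """Repeatedly merge the two clusters with the closest centroids.

    Returns the centroid created by each merge, in merge order.
    """
    clusters = [[p] for p in points]
    merge_centroids = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        merge_centroids.append(centroid(merged))
    return merge_centroids


data = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
for c in agglomerate(data):
    print(tuple(round(v, 1) for v in c))
```

The order of the merges is what a dendrogram records; cutting the merge sequence before the last step leaves the two clusters with centroids near (1,1) and (4.7,1.3).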

Example: K-means
• x … data point, … centroid
[Figure: clusters after round 1]
J. Leskovec, A. Rajaraman, J.

Example: K-means
• x … data point, … centroid
[Figure: clusters after round 2]
J. Leskovec, A. Rajaraman, J.
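The rounds shown in the K-means figures alternate between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The following is a minimal sketch of that loop, assuming Euclidean distance and caller-supplied initial centroids; the data and names are hypothetical.

```python
from math import dist


def kmeans(points, centroids, rounds=10):
    """Plain K-means with given starting centroids; returns (centroids, clusters)."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(rounds):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # no centroid moved, so assignments are stable
            break
        centroids = updated
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, final_clusters = kmeans(pts, [(0, 0), (10, 10)])
print(final_centroids)
print(final_clusters)
```

An empty cluster keeps its previous centroid here, which is one simple convention; other variants re-seed it instead.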
