DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li - PowerPoint PPT Presentation

Welcome to DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li Time: 6:00pm –8:50pm R Location: AK233 Spring 2018

Urban Computing
Tackle the Big challenges in Big cities using Big data!
Framework layers (top to bottom):
• Service Providing: Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...
• Urban Data Analytics: Data Mining, Machine Learning, Visualization
• Urban Data Management: Spatio-temporal index, streaming, trajectory, and graph data management, ...
• Urban Sensing & Data Acquisition: Participatory Sensing, Crowd Sensing, Mobile Sensing
Data sources: Human mobility, Meteorology, Road Networks, Air Quality, Social Media, Energy, POIs, Traffic
Win-Win-Win: People, Cities OS, The Environment
Reference: Zheng, Y., et al. Urban Computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology.

2D-Spatial Queries
• K Nearest Neighbour (KNN) Queries: Given a point or an object, find the nearest object that satisfies given conditions.
• Region (Range) Query: Ask for objects that lie partially or fully inside a specified region.

Trajectory Data Management
• Range queries (Figure A: Range Query, trajectories Tr1, Tr2, Tr3 and region R)
  – E.g. Retrieve the trajectories of vehicles passing a given rectangular region R between 2pm-4pm in the past month.
• KNN queries
  – KNN Point Query (Figure B, query points q1, q2): E.g. Retrieve the trajectories of people with the minimum aggregated distance to a set of query points. Publications: [1][2] for a single point query, [3] for multiple query points.
  – KNN Trajectory Query (Figure C, query trajectory qt): E.g. Retrieve the trajectories of people with the minimum aggregated distance to a query trajectory. Publications: Chen et al., SIGMOD05; Vlachos et al., ICDE02; Yi et al., ICDE98.
[1] E. Frentzos, et al. Algorithms for nearest neighbor search on moving object trajectories. Geoinformatica, 2007.
[2] D. Pfoser, et al. Novel approaches in query processing for moving object trajectories. VLDB, 2000.
[3] Zaiben Chen, et al. Searching Trajectories by Locations: An Efficiency Study. SIGMOD, 2010.
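
The range query and the KNN point query above can be sketched in a few lines of Python. This is only an illustrative brute-force version, not the method of the cited papers: the function names, the `(lat, lng, t)` point layout, and the sample data are assumptions, and a trajectory is treated as "passing" R only if one of its sampled points falls inside R (segments between points are ignored).

```python
import math
from typing import Dict, List, Tuple

# Assumed layout: a trajectory is a time-ordered list of (lat, lng, t) points.
Point = Tuple[float, float, float]

def range_query(trajectories: Dict[str, List[Point]],
                lat_range, lng_range, time_range) -> List[str]:
    """IDs of trajectories with at least one point inside region R during the time window."""
    (lat_lo, lat_hi), (lng_lo, lng_hi), (t_lo, t_hi) = lat_range, lng_range, time_range
    hits = []
    for tid, points in trajectories.items():
        if any(lat_lo <= lat <= lat_hi and lng_lo <= lng <= lng_hi and t_lo <= t <= t_hi
               for lat, lng, t in points):
            hits.append(tid)
    return hits

def knn_point_query(trajectories: Dict[str, List[Point]], query_points, k: int) -> List[str]:
    """k trajectories with the smallest aggregated distance to the query points.

    The distance from a trajectory to one query point is taken as the distance
    to its closest sampled point; distances are summed over all query points.
    """
    def aggregated(points):
        return sum(min(math.hypot(lat - qlat, lng - qlng) for lat, lng, _ in points)
                   for qlat, qlng in query_points)
    return sorted(trajectories, key=lambda tid: aggregated(trajectories[tid]))[:k]

# Hypothetical toy data: time is in hours of the day.
trajs = {
    "Tr1": [(0.0, 0.0, 13.0), (1.0, 1.0, 14.5)],
    "Tr2": [(5.0, 5.0, 15.0), (6.0, 6.0, 15.5)],
    "Tr3": [(1.5, 0.5, 17.0)],
}
in_region = range_query(trajs, (0.5, 2.0), (0.5, 2.0), (14.0, 16.0))
nearest_two = knn_point_query(trajs, [(1.0, 1.0)], k=2)
print(in_region, nearest_two)
```

Both functions scan every trajectory, which is exactly the cost that the indexing structures in the following slides try to avoid.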

Spatial/Temporal Indexing Structures
• Temporal Indexing (1-D data)
  – List index
  – B-tree
• Space Partition-Based Indexing Structures (2-D data)
  – Grid-based
  – Quad-tree

List Index Structure
• Example
  – From YouTube prefixes
  – To YouTube video IDs

Full B-Tree Structure

B-Tree Index
• The B-tree is the most commonly used data structure for indexing.
• It is fully dynamic; that is, it can grow and shrink.

Three Types of B-Tree Nodes
• Root node: contains node pointers to branch nodes.
• Branch node: contains pointers to leaf nodes or other branch nodes.
• Leaf node: contains index items.
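
To make the three node roles concrete, here is a small Python sketch of a lookup that walks from the root through a branch node down to a leaf. It assumes the layout described on the slide (index items only in leaves) and a separator convention where keys greater than or equal to a separator go to the right child. The `Node` class, `search` function, and the hand-built tree are illustrative assumptions; insertion, splitting, and merging are omitted.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    keys: List[int]                                        # separators (root/branch) or item keys (leaf)
    children: List["Node"] = field(default_factory=list)   # empty for leaf nodes
    values: List[str] = field(default_factory=list)        # index items, leaf nodes only

    @property
    def is_leaf(self) -> bool:
        return not self.children

def search(node: Node, key: int) -> Optional[str]:
    """Follow node pointers from the root to a leaf, then look up the index item."""
    while not node.is_leaf:
        # Keys >= a separator are stored to its right (assumed convention).
        node = node.children[bisect_right(node.keys, key)]
    for k, v in zip(node.keys, node.values):
        if k == key:
            return v
    return None

# Hand-built three-level tree: root -> branch nodes -> leaf nodes.
leaf_a = Node(keys=[1, 5], values=["rec1", "rec5"])
leaf_b = Node(keys=[10, 15], values=["rec10", "rec15"])
leaf_c = Node(keys=[20, 25], values=["rec20", "rec25"])
leaf_d = Node(keys=[30, 35], values=["rec30", "rec35"])
branch_left = Node(keys=[10], children=[leaf_a, leaf_b])
branch_right = Node(keys=[30], children=[leaf_c, leaf_d])
root = Node(keys=[20], children=[branch_left, branch_right])

print(search(root, 10), search(root, 7))
```

Each lookup touches one node per level, so the cost grows with the height of the tree rather than with the number of indexed items.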

Spatial/Temporal Indexing Structures
• Temporal Indexing (1-D data)
  – List index
  – B-tree
• Space Partition-Based Indexing Structures (2-D data)
  – Grid-based
  – Quad-tree

Grid-based Spatial Indexing
• Indexing
  – Partition the space into disjoint and uniform grids
  – Build an index between each grid and the points in the grid
[Figure: grids g1 and g2 with points p1, p3, p4, and the grid-to-points index]

Grid-based Spatial Indexing
• Range Query
  – Find the grids intersecting the range query
  – Retrieve the points from the grids and identify the points in the range
[Figure: grids g1 to g4 with points p1 to p4 and a range query]
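
The indexing and range-query steps above can be sketched as follows. This is a minimal Python illustration, assuming square cells of a fixed size keyed by integer cell coordinates; the class name `GridIndex`, the cell size, and the sample points are assumptions, not part of the slides.

```python
import math
from collections import defaultdict

class GridIndex:
    """Uniform grid: each cell maps to the list of points that fall inside it."""

    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.cells = defaultdict(list)   # (cx, cy) -> [(pid, x, y), ...]

    def _cell(self, x: float, y: float):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, pid: str, x: float, y: float) -> None:
        self.cells[self._cell(x, y)].append((pid, x, y))

    def range_query(self, x_lo, y_lo, x_hi, y_hi):
        # Step 1: find the cells intersecting the query rectangle.
        cx_lo, cy_lo = self._cell(x_lo, y_lo)
        cx_hi, cy_hi = self._cell(x_hi, y_hi)
        hits = []
        for cx in range(cx_lo, cx_hi + 1):
            for cy in range(cy_lo, cy_hi + 1):
                # Step 2: fetch the points of each cell and keep those inside the range.
                for pid, x, y in self.cells.get((cx, cy), []):
                    if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                        hits.append(pid)
        return sorted(hits)

index = GridIndex(cell_size=10.0)
for pid, x, y in [("p1", 2, 3), ("p2", 12, 4), ("p3", 8, 9), ("p4", 15, 15)]:
    index.insert(pid, x, y)

in_range = index.range_query(5, 0, 14, 10)
print(in_range)
```

Only the cells overlapping the query rectangle are visited; points in those cells still need the final filter because a cell can stick out of the query range.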

Grid-based Spatial Indexing
• Nearest neighbor query
  – Euclidean distance
  – Road network distance is quite different
[Figure: three cases: the nearest object is within the grid; the nearest object is outside the grid; fast approximation]
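
The two cases in the figure (nearest object inside or outside the query's grid) can be illustrated with a short Python sketch using Euclidean distance. It assumes a dictionary-based grid with square cells; `nearest_in_cell` is the fast approximation that only looks at the query's own cell, while `nearest` expands ring by ring around that cell and stops once no unvisited cell can hold a closer point. All names and data are illustrative assumptions, and road network distance is not handled.

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Map each (cx, cy) cell to the points inside it."""
    grid = defaultdict(list)
    for pid, (x, y) in points.items():
        grid[(math.floor(x / cell), math.floor(y / cell))].append((pid, x, y))
    return grid

def nearest_in_cell(grid, cell, qx, qy):
    """Fast approximation: only inspect the cell containing the query point."""
    pts = grid.get((math.floor(qx / cell), math.floor(qy / cell)), [])
    if not pts:
        return None
    return min(pts, key=lambda p: math.hypot(p[1] - qx, p[2] - qy))[0]

def nearest(grid, cell, qx, qy):
    """Exact nearest neighbour by visiting rings of cells around the query cell."""
    if not grid:
        return None
    qcx, qcy = math.floor(qx / cell), math.floor(qy / cell)
    max_ring = max(max(abs(cx - qcx), abs(cy - qcy)) for cx, cy in grid)
    best_id, best_d = None, math.inf
    for ring in range(max_ring + 1):
        for cx in range(qcx - ring, qcx + ring + 1):
            for cy in range(qcy - ring, qcy + ring + 1):
                if max(abs(cx - qcx), abs(cy - qcy)) != ring:
                    continue  # cell belongs to an inner ring that was already visited
                for pid, x, y in grid.get((cx, cy), []):
                    d = math.hypot(x - qx, y - qy)
                    if d < best_d:
                        best_id, best_d = pid, d
        # Every unvisited point lies outside the visited block of cells, so it is
        # at least `margin` away from the query point.
        margin = min(qx - (qcx - ring) * cell, (qcx + ring + 1) * cell - qx,
                     qy - (qcy - ring) * cell, (qcy + ring + 1) * cell - qy)
        if best_d <= margin:
            break
    return best_id

CELL = 10.0
grid = build_grid({"p1": (9.0, 9.0), "p2": (11.0, 5.0)}, CELL)
approx = nearest_in_cell(grid, CELL, 9.5, 5.0)   # p1 shares the query's cell
exact = nearest(grid, CELL, 9.5, 5.0)            # p2 is closer but sits in the next cell
print(approx, exact)
```

In this toy example the approximation returns `p1` while the true nearest neighbour is `p2`, which is the "nearest object is outside the grid" case.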

Grid-based Spatial Indexing
• Advantages
  – Easy to implement and understand
  – Very efficient for processing range and nearest neighbor queries
• Disadvantages
  – Index size could be big
  – Difficult to deal with unbalanced data
  – Think about what we discussed last time on the POI sampling and estimation.

Quad-Tree Indexing
– Each node of a quad-tree is associated with a rectangular region of space; the top node is associated with the entire target space.
– Each non-leaf node divides its region into four equal-sized quadrants.
– Leaf nodes have between zero and some fixed maximum number of points (set to 1 in the example).
[Figure: example quad-tree with quadrants 0 to 3 and sub-quadrants such as 00, 02, 03, 30, 31, 32, 33]

Quad-Tree
• Range query (ok)
[Figure: example quad-tree with quadrants 0 to 3 and sub-quadrants such as 00, 02, 03, 20, 23, 30, 31, 32, 33]
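
A minimal Python sketch of the quad-tree described above, with a leaf capacity of 1 as in the slide example, plus a range query that prunes quadrants not intersecting the query rectangle. The class name, the half-open region convention, and the `max_depth` guard (which stops endless splitting on duplicate points) are assumptions made for this illustration.

```python
class QuadTree:
    """Point quad-tree over the half-open region [x0, x1) x [y0, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=1, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth
        self.points = []       # only used while this node is a leaf
        self.children = None   # four sub-quadrants once the node is split

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append((x, y))
                return True
            self._split()
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        """Divide the region into four equal-sized quadrants and push points down."""
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [QuadTree(x0, y0, mx, my, *args), QuadTree(mx, y0, x1, my, *args),
                         QuadTree(x0, my, mx, y1, *args), QuadTree(mx, my, x1, y1, *args)]
        for px, py in self.points:
            any(child.insert(px, py) for child in self.children)
        self.points = []

    def range_query(self, qx0, qy0, qx1, qy1):
        """Points inside the closed query rectangle [qx0, qx1] x [qy0, qy1]."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return []   # this quadrant cannot contain any result
        if self.children is None:
            return [(x, y) for x, y in self.points
                    if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        return [p for child in self.children
                for p in child.range_query(qx0, qy0, qx1, qy1)]

tree = QuadTree(0, 0, 8, 8)
for x, y in [(1, 1), (6, 2), (5, 6), (7, 7), (2, 5)]:
    tree.insert(x, y)

found = sorted(tree.range_query(4, 0, 8, 8))
print(found)
```

Unlike the uniform grid, the tree only subdivides where points are dense, which is why it copes better with unbalanced data.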

Quad-Tree
• Nearest Neighbour Query (hard)
[Figure: example quad-tree with quadrants 0 to 3 and sub-quadrants such as 00, 02, 03, 20, 23, 30, 31, 32, 33]

Spatial/Temporal (3D) Indexing Structures
• Temporal Indexing (1-D data)
  – List index
  – B-tree
• Space Partition-Based Indexing Structures (2-D data)
  – Grid-based
  – Quad-tree

Sampling Big Trajectory Data

Big Trajectory Data in Urban Networks
[Figures: Taxi GPS Trajectory; Mobile User Trajectory]
• Urban roving sensors deliver big trajectory data.
• Reveal moving patterns and urban issues.
Challenge: How to manage the big trajectory data to enable efficient query processing.

Trajectory Aggregate Query
• A trajectory aggregate query retrieves statistics of distinct trajectories passing a user-specified spatio-temporal region.
• Examples:
  – # of taxi trips with average speed of more than 5 miles per hour in New York City in 2014;
  – # of mobile users with iPhone in Hong Kong in 2013.

Exhaustive Search
• r_i: a sequence of GPS points in (TID, Lat, Lng, Time)
• q: a trajectory aggregate query with N_q trajectories
• Spatio-temporal indexing: B-tree, Quad-tree, etc.
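
As a rough illustration of exhaustive search, the Python sketch below scans every GPS record in the (TID, Lat, Lng, Time) layout and counts the distinct trajectory IDs that fall inside the query's spatio-temporal box, which gives N_q. The function name, the box-shaped query, and the toy records are assumptions; no index is used here.

```python
def exhaustive_count(records, lat_range, lng_range, time_range):
    """Count distinct trajectories (by TID) with a point inside the query box."""
    (lat_lo, lat_hi), (lng_lo, lng_hi), (t_lo, t_hi) = lat_range, lng_range, time_range
    tids = set()
    for tid, lat, lng, t in records:
        if lat_lo <= lat <= lat_hi and lng_lo <= lng <= lng_hi and t_lo <= t <= t_hi:
            tids.add(tid)   # a set keeps each trajectory once, however many points match
    return len(tids)

# Hypothetical records in (TID, Lat, Lng, Time) form.
records = [
    ("r1", 1.0, 1.0, 10),
    ("r1", 1.2, 1.1, 11),
    ("r2", 1.5, 1.5, 12),
    ("r3", 9.0, 9.0, 10),
]
n_q = exhaustive_count(records, (0, 2), (0, 2), (9, 13))
print(n_q)
```

The answer is exact, but every record in the region has to be touched, which is what makes this slow on terabyte-scale data.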

Challenges with Big Trajectory Data
• Long response time for large trajectory datasets
• In 2013, Shenzhen, China:
  Dataset         Size        Scale
  Mobility Data   788.6 TB    6 million users
  Taxi GPS        1.58 TB     22,083 taxis
  Bus GPS         1.34 TB     8,427 buses
• Query: # of iPhone users and taxi/bus trips
  Dataset         Answer
  iPhone Users    0.8 million
  Taxi GPS        302 million trips
  Bus GPS         1.38 billion trips
• 12 minutes to get the exact answers! (System: a cluster of 3 machines with 24 Intel X5670 2.93GHz processors, 94GB memory.)

Key Challenges on Exact Answer
• A trajectory r_i may traverse multiple index leaf nodes
• Cannot pre-compute and store the results on indices
• Summing up two answers leads to over-counting
[Figure: trajectories r1 and r2 crossing index leaf nodes 1 and 2]
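
A tiny Python example of the over-counting problem, using hypothetical leaf contents: trajectory r2 passes through both leaf nodes, so adding the per-leaf counts counts it twice.

```python
# Hypothetical contents of two index leaf nodes covered by a query.
leaf_nodes = {
    "leaf1": {"r1", "r2"},
    "leaf2": {"r2"},        # r2 traverses both leaves
}

# Summing pre-computed per-leaf counts counts r2 twice.
summed = sum(len(tids) for tids in leaf_nodes.values())

# The exact answer needs the union of trajectory IDs across leaves.
distinct = len(set().union(*leaf_nodes.values()))

print(summed, distinct)
```

Getting the distinct count requires the trajectory IDs themselves, not just a stored number per leaf, which is why the per-leaf results cannot simply be cached and added.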

Motivation & Problem Definition
• q covers n index leaf nodes
• How to sample B index leaf nodes to estimate the # of trajectories in q with a guaranteed error bound?

Random Index Sampling
[Figure: data indexing structure, sampling, and estimation. ST-indexed data over (Lat, Lng, Time) with query q; an inverted index mapping each trajectory (r1, r2, r3, ...) to its index leaf node list; B sampled index leaf nodes, each with its trajectory list (e.g. r1, r2; r3, r5; r6, r7; r9, r10) and the occurrence time k^q of each listed trajectory.]

Random Index Sampling
• Stage 1: Sampling Stage
  – Uniformly at random sample B index leaf nodes with replacement
• Stage 2: Estimation Stage (Unbiased Estimator)
• Convergence analysis: [condition and bound not recoverable from the extracted text]; the bound involves the maximum number of trajectories in an index leaf node.
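
The estimator formula did not survive in the text above, so the Python sketch below is an assumption rather than the slide's exact method. It shows one natural unbiased estimator that fits the two stages and the occurrence times from the inverted index: sample B of the n leaf nodes uniformly with replacement, give each trajectory r in a sampled leaf the weight 1/k_r (where k_r is the number of leaf nodes in q that r occurs in), and scale the average per-leaf weight by n. Summing 1/k_r over all n leaves counts each trajectory exactly once, which is why the expectation equals N_q. The function name and data are hypothetical.

```python
import random
from collections import Counter

def ris_estimate(leaves, B, rng):
    """Estimate the number of distinct trajectories covered by `leaves`.

    leaves: list of sets of trajectory IDs, one set per index leaf node in q.
    B:      number of leaf nodes sampled uniformly at random with replacement.
    rng:    a random.Random instance.
    """
    n = len(leaves)
    # Occurrence count k_r: in how many leaf nodes of q trajectory r appears
    # (what the inverted index provides).
    k = Counter(tid for leaf in leaves for tid in leaf)
    total = 0.0
    for _ in range(B):
        leaf = leaves[rng.randrange(n)]               # Stage 1: sampling
        total += sum(1.0 / k[tid] for tid in leaf)    # each trajectory weighted by 1/k_r
    return n * total / B                              # Stage 2: estimation

# Hypothetical query covering 4 leaf nodes and 6 distinct trajectories.
leaves = [{"r1", "r2"}, {"r2", "r3"}, {"r3"}, {"r4", "r5", "r6"}]
estimate = ris_estimate(leaves, B=2000, rng=random.Random(0))
print(estimate)
```

The estimate fluctuates around the true value of 6, and the spread shrinks as B grows. A larger maximum number of trajectories per leaf makes the per-sample weights more variable, which matches the role that quantity plays in the convergence analysis.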

Evaluation
• Dataset: 3TB real human mobility data in a large city in eastern China
  Statistics                     Value
  City Size                      400 square miles
  City Population Size           three million people
  Duration                       eight days at the end of 2010
  Number of trajectories         109,914 3G users
  # of spatio-temporal points    400 million (407,040,083)
• Baseline Algorithm
  – Exhaustive search
• Evaluation metric
  – Relative error & response time

Evaluation Results
[Figure, two panels with x-axis B/n (%) from 0.1 to 1.6: relative error ratio against the ground truth (2% band), and query processing time (s). Series: n=7k (ES: 112s), n=13k (ES: 115s), n=23k (ES: 117s).]
• Relative error: up to 2% relative error
• Processing time: 5 times reduction

Concurrent Random Index Sampling
• Practical Issue:
  – A large number of concurrent aggregate queries
• Idea of Concurrent Random Index Sampling (CRIS):
  – Sampling Reuse
  – Stratified Sampling Technique

Concurrent Random Index Sampling Unbiased Estimators:

Summary
• Approximate query processing
  – Single trajectory aggregate query: via Random Index Sampling (RIS)
  – Concurrent trajectory aggregate queries: via Concurrent Random Index Sampling (CRIS)

Any Comments & Critiques?

Weka
• 6 weeks
• https://weka.waikato.ac.nz/dataminingwithweka/preview
• https://www.futurelearn.com/courses/data-mining-with-weka
