DS504/CS586: Big Data Analytics: Data Pre-processing and Cleaning


SLIDE 1

Welcome to DS504/CS586: Big Data Analytics
Data Pre-processing and Cleaning

Prof. Yanhua Li

Time: 6:00pm – 8:50pm R
Location: AK 232
Fall 2016

SLIDE 2

Ocean Biodiversity Informatics, Hamburg

The Data Equation Oceans of Data

Praia de Forte, Brazil

SLIDE 3

Data Quality Dimensions

  • Accuracy
    – Errors in data
    – Example: “Jhn” vs. “John”
  • Currency
    – Lack of updated data
    – Example: Residence (Permanent) Address: outdated vs. up to date
  • Consistency
    – Discrepancies in the data
    – Example: ZIP Code and City must be consistent
  • Completeness
    – Lack of data
    – Partial knowledge of the records in a table

SLIDE 4

Ocean Biodiversity Informatics, Hamburg

Geographic outliers - GIS

Country, State, named district, etc.

Gazetteer of Brazilian localities

SLIDE 5

What do we mean by ‘Data Quality’?

“An essential or distinguishing characteristic necessary for data to be fit for use.” (SDTS 02/92)

“The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data.” (Chrisman, 1991)

SLIDE 6

Loss of data quality

Loss of data quality can occur at many stages:

  • At the time of collection
  • During digitisation
  • During documentation
  • During storage and archiving
  • During analysis and manipulation
  • At time of presentation
  • And through the use to which they are put

“Don’t underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor.” (Redman 2001)

SLIDE 7

Data Cleaning

  • Data cleaning tasks
    – Accuracy: Smooth out noisy data
    – Currency: Update the records
    – Consistency: Correct inconsistent data
    – Completeness: Fill in missing values
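Three of the cleaning tasks above can be illustrated with a minimal Python sketch. The function names, the median fill, the moving-average window, and the sample ZIP table are all assumptions chosen for illustration; the slides do not prescribe a particular method.

```python
from statistics import median


def fill_missing(values):
    """Completeness: replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [m if v is None else v for v in values]


def smooth(values, window=3):
    """Accuracy: smooth noisy data with a centered moving average."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def inconsistent_rows(rows, zip_to_city):
    """Consistency: return rows whose city disagrees with the ZIP lookup."""
    return [r for r in rows if zip_to_city.get(r["zip"]) != r["city"]]


# Hypothetical sample data.
readings = [10.0, None, 12.0, 50.0, 11.0]
filled = fill_missing(readings)    # the None becomes the observed median
smoothed = smooth(filled)          # the spike at 50.0 is damped

rows = [
    {"zip": "01609", "city": "Worcester"},
    {"zip": "01609", "city": "Boston"},
]
bad = inconsistent_rows(rows, {"01609": "Worcester"})
```

Currency (updating stale records) is omitted because it needs an external, up-to-date source to compare against.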

SLIDE 8

Map matching

SLIDE 9

Map-matching

  • Problem: (Sampled data)
    – GPS trajectory = a sequence of GPS locations with time stamps
    – Map a GPS trajectory onto a road network
    – A sequence of GPS points → a sequence of road segments

SLIDE 10

Spatial Data

[Figure: road network with road segments e1, e2, e3, e4; segment e3 has endpoints e3.start and e3.end]

SLIDE 11

Map-Matching

  • Why it is important
    – A fundamental step in many transportation applications
      · Navigation and driving
      · Traffic analysis
      · Taxi dispatching and recommendations
    – Examples:
      · Find the vehicles passing Institute Road
      · Calculate the average travel time from WPI campus to MIT campus
      · When will Bus 3 arrive at the stop at Highland St & North Ashland St?

SLIDE 12

Map-Matching

  • Simple solution for high-sampling-rate data
    – Weighted distance

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009
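For densely sampled data, a purely geometric matcher is often enough. The sketch below assigns each GPS point to its nearest road segment using point-to-segment distance. It is an illustration under stated assumptions, not the method from the slide: the slide does not define the weights, so none are applied, and the coordinates are assumed to be planar (for example, projected meters). The segment ids and sample points are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_segment(p, segments):
    """Return the id of the segment closest to p."""
    return min(segments, key=lambda sid: point_segment_distance(p, *segments[sid]))


# Hypothetical two-segment network forming an L-shaped road.
segments = {
    "e1": ((0.0, 0.0), (100.0, 0.0)),
    "e2": ((100.0, 0.0), (100.0, 100.0)),
}
trajectory = [(10.0, 3.0), (60.0, -2.0), (97.0, 40.0)]
matched = [nearest_segment(p, segments) for p in trajectory]
```

Each point is matched independently, which is why this approach breaks down for low-sampling-rate data, where parallel roads and overpasses make the nearest segment ambiguous.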

SLIDE 13

Map-Matching (for low sampling rate)

  • Why is it difficult?

[Figure: ambiguous matching cases: (a) parallel roads, (b) overpass, (c) spur]

SLIDE 14

Map-Matching

  • According to the additional information used
    – Geometric
    – Topological
    – Probabilistic
    – Advanced techniques
  • According to the range of sampling points
    – Local/incremental
    – Global

Yu Zheng. Trajectory Data Mining: An Overview. ACM Transactions on Intelligent Systems and Technology, 6, 3, 2015.

SLIDE 15

Map-matching

  • Insights
    – Consider both local and global information
    – Incorporate both spatial and temporal features

[Figure: GPS points pa, pb, pc]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

SLIDE 16

Map-matching framework

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

SLIDE 17

Map-matching

  • Solution (incorporating spatial information)
    – (Observation Probability) Model local possibility
    – (Transmission Probability) Consider context (global)
    – Spatial analysis function

[Figure: GPS points p_{i-1}, p_i, p_{i+1} with candidate points c_i^1, c_i^2, c_i^3 on road segments e_i^1, e_i^2, e_i^3]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

$$N(c_i^j) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i^j - \mu)^2}{2\sigma^2}}$$

$$V(c_{i-1}^t \to c_i^s) = \frac{d_{i-1 \to i}}{w_{(i-1,t) \to (i,s)}}$$

$$F_s(c_{i-1}^t \to c_i^s) = V(c_{i-1}^t \to c_i^s) * N(c_i^s)$$
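The spatial analysis function can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It reads x as the distance from a GPS point to its candidate, d as the straight-line distance between two consecutive GPS points, and w as the shortest-path length between their candidates. The defaults mu = 0 and sigma = 20 (meters) and the handling of a zero-length path are assumptions of this sketch.

```python
import math


def observation_probability(x, mu=0.0, sigma=20.0):
    """N(c): Gaussian density of the point-to-candidate distance x.

    mu and sigma are assumed defaults, in meters.
    """
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def transmission_probability(d, w):
    """V: straight-line distance d divided by shortest-path length w."""
    if w == 0.0:
        # Assumption: both candidates coincide, so treat the move as certain.
        return 1.0
    return d / w


def spatial_score(x, d, w):
    """F_s = V * N for one candidate-to-candidate transition."""
    return transmission_probability(d, w) * observation_probability(x)


# Hypothetical values: candidate 5 m from the GPS point, points 100 m apart,
# shortest road path between the two candidates 125 m long.
score = spatial_score(x=5.0, d=100.0, w=125.0)
```

A detour-heavy path (w much larger than d) lowers V, so candidates that require an implausible route are penalized even when they lie close to the GPS point.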

SLIDE 18

Map-matching

  • Solution (Cosine Similarity)
  • Temporal analysis function (Considering temporal information)
  • Shortest path is used.

[Figure: consecutive GPS points P_{i-1} and P_i near a highway and a service road]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

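$$F_t(c_{i-1}^t \to c_i^s) = \frac{\sum_{u=1}^{k} \left( e'_u.v \times \bar{v}_{(i-1,t) \to (i,s)} \right)}{\sqrt{\sum_{u=1}^{k} (e'_u.v)^2} \times \sqrt{\sum_{u=1}^{k} \bar{v}_{(i-1,t) \to (i,s)}^2}}$$

$$\bar{v}_{(i-1,t) \to (i,s)} = \frac{\sum_{u=1}^{k} l_u}{\Delta t_{i-1 \to i}}$$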
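The temporal analysis function can be sketched as a cosine similarity between two vectors of length k: the speed values of the segments on the shortest path, and the average travel speed repeated k times. The sketch reads e'_u.v as the speed limit of segment u and l_u as its length, following the cited paper. The function and parameter names are hypothetical.

```python
import math


def temporal_score(speed_limits, segment_lengths, delta_t):
    """F_t: cosine similarity between segment speed limits and average speed.

    speed_limits[u] and segment_lengths[u] describe the u-th segment on the
    shortest path; delta_t is the time between the two GPS points.
    """
    v_bar = sum(segment_lengths) / delta_t      # average speed on the path
    avg = [v_bar] * len(speed_limits)           # constant vector of length k
    dot = sum(v * a for v, a in zip(speed_limits, avg))
    norm_limits = math.sqrt(sum(v * v for v in speed_limits))
    norm_avg = math.sqrt(sum(a * a for a in avg))
    return dot / (norm_limits * norm_avg)


# Hypothetical path over two segments (lengths in m, speeds in m/s, time in s).
score = temporal_score([10.0, 30.0], [100.0, 300.0], 20.0)
```

Because the second vector is constant, the cosine form as written depends only on how uniform the speed limits are along the path and not on the magnitude of the average speed; a single-segment path always scores 1.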

SLIDE 19

Map-matching

  • Aggregating
    – Spatial and temporal information
    – Local and global information
  • Dynamic programming
  • Spatio-temporal function

[Figure: candidate graph. P1's candidates: c_1^1, c_1^2, c_1^3; P2's candidates: c_2^1, c_2^2; Pn's candidates: c_n^1, c_n^2; example edges c_1^1 → c_2^1 and c_1^3 → c_2^2]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

$$F(c_{i-1}^t \to c_i^s) = F_s(c_{i-1}^t \to c_i^s) * F_t(c_{i-1}^t \to c_i^s), \quad 2 \le i \le n$$

SLIDE 20

Map-matching

  • Path Selection
  • Dynamic programming

[Figure: candidate graph. P1's candidates: c_1^1, c_1^2, c_1^3; P2's candidates: c_2^1, c_2^2; Pn's candidates: c_n^1, c_n^2; example edges c_1^1 → c_2^1 and c_1^3 → c_2^2]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

$$P = \arg\max_{P_c} F(P_c)$$

$$F(P_c) = N(c_1^{s_1}) + \sum_{i=2}^{n} F(c_{i-1}^{s_{i-1}} \to c_i^{s_i})$$
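Path selection over the candidate graph can be sketched as a Viterbi-style dynamic program that keeps, for every candidate, the best cumulative score and a back-pointer. This is a minimal sketch assuming the scores are already computed and passed in as callables; the function name, the data layout, and the toy scores are hypothetical.

```python
def best_path(candidates, first_score, edge_score):
    """Return (path, score) maximizing N(c_1) + sum of F over consecutive pairs.

    candidates: one list of candidate ids per GPS point.
    first_score(c): N(c) for a candidate of the first point.
    edge_score(i, prev_c, c): F(prev_c -> c) between points i-1 and i (0-based).
    """
    best = {c: first_score(c) for c in candidates[0]}
    back = []  # back[i-1][c] is the best predecessor of candidate c at point i
    for i in range(1, len(candidates)):
        new_best, pointers = {}, {}
        for c in candidates[i]:
            prev = max(candidates[i - 1], key=lambda t: best[t] + edge_score(i, t, c))
            new_best[c] = best[prev] + edge_score(i, prev, c)
            pointers[c] = prev
        best = new_best
        back.append(pointers)
    # Backtrack from the best final candidate.
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy candidate graph with made-up scores.
candidates = [["a1", "a2"], ["b1", "b2"], ["c1"]]
first = {"a1": 0.02, "a2": 0.01}
edges = {
    ("a1", "b1"): 0.5, ("a1", "b2"): 0.1,
    ("a2", "b1"): 0.2, ("a2", "b2"): 0.9,
    ("b1", "c1"): 0.3, ("b2", "c1"): 0.8,
}
path, total = best_path(candidates, first.get, lambda i, t, c: edges[(t, c)])
```

In the toy graph the locally best first candidate (a1) is not on the winning path, which illustrates why the global score is used instead of matching each point independently.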

SLIDE 21

Map-matching Example

  • Path Selection
SLIDE 22

Map-matching framework

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

SLIDE 23

Localized ST-Matching Strategy

  • Path Selection

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

SLIDE 24

Evaluations

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

$$A_N = \frac{\#\ \text{correctly matched road segments}}{\#\ \text{all road segments}}$$

$$A_L = \frac{\sum \text{length of matched road segments}}{\text{length of the trajectory}}$$
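Both accuracy measures can be computed from a matched route and a ground-truth route. The sketch below makes several assumptions that the slide leaves open: routes are treated as sets of segment ids, only correctly matched segments count toward the numerator of A_L, and the length of the trajectory is taken as the total length of the ground-truth segments. All names and sample values are hypothetical.

```python
def accuracy_by_number(matched, truth):
    """A_N: correctly matched segments over all ground-truth segments."""
    truth_set = set(truth)
    return len(set(matched) & truth_set) / len(truth_set)


def accuracy_by_length(matched, truth, length):
    """A_L: length of correctly matched segments over ground-truth length."""
    truth_set = set(truth)
    correct = set(matched) & truth_set
    return sum(length[s] for s in correct) / sum(length[s] for s in truth_set)


# Hypothetical routes and segment lengths in meters.
truth = ["e1", "e2", "e3", "e4"]
matched = ["e1", "e2", "e5", "e4"]
length = {"e1": 100, "e2": 200, "e3": 300, "e4": 400, "e5": 50}

a_n = accuracy_by_number(matched, truth)
a_l = accuracy_by_length(matched, truth, length)
```

The two measures diverge when the missed segments are unusually long or short: here one of four segments is wrong, but it is a long one, so A_L is lower than A_N.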

SLIDE 25

Course Project

SLIDE 26

Project 1 directions

What is your project goal?

  • What new story do you want to tell?
  • New contents to sample?
  • New sampling methods via API?
  • New statistics of YouTube: view count distribution, dynamics, or # uploaders/active users?
  • Analysis of other websites with API interfaces: Twitter, Facebook, Foursquare, Yelp

Broad impacts? (Keep in mind)

  • How is YouTube evolving?
    – More business or personal videos? How to distinguish the two?
    – How do special events, e.g., an NBA game or breaking news, affect uploading and viewing behaviors?
  • Online marketing, advertising?

SLIDE 27

A Few Words on Course Project I

Project I: Collecting and Measuring Online Data

  • Team work; 3-4 students per team.
  • Starting date: Week 3 (9/8 R)
  • Proposal due: Week 4 (9/15 R), roughly 2 pages
  • Due date/time: before class on Week 8 (10/13 R)
  • Presentation date/time: class on Week 8 (10/13 R)
    – Selected teams only
  • Requires programming in C/C++, Java, Python, etc.
  • Choose one online site/service with APIs to download data, or use existing datasets.
  • Examples:
    – (1) estimate site statistics, or
    – (2) apply machine learning methods to predict future trends, or
    – (3) perform time-series analysis to capture dynamic patterns,
    – or something else, as long as your work can potentially bring research value to the community.

SLIDE 28

A Few Words on Course Project I

  • Group meeting with Prof. Li (by appointment)
    – Week 3 (9/8 R): Starting date
    – Week 4 (9/15 R): Proposal due, roughly 2 pages (upload it to the discussion board)
    – Week 5 (9/22 R): Methodology due (upload it to the discussion board)
    – Week 6 (9/29 R): Results due (upload it to the discussion board)
    – Week 7 (10/6 R): Conclusion due (upload it to the class discussion board)
    – Week 8 (10/13 R): Final report due at 11:59pm EST; self- and cross-evaluation due at 11:59pm EST
    – Week 8 (10/13 R): In-class presentation (10 min; selected teams only)

SLIDE 29

Course Project II

  • Projects will be in groups!
  • 3-5 students per group, depending on enrollment
  • Topics of your choice (related to big data analytics)
  • Application-driven
  • Fundamental data analytics research
  • Data sources on course website: http://wpi.edu/~yli15/courses/DS504Fall16/Resources.html

Talk to me once you have an idea.

SLIDE 30

Next Class: Data Management

  • Do assigned readings before class
  • Be prepared: read and review the required readings on your own in advance!
  • Do a literature survey: find and read related papers, if any
  • Bring your questions to class and look for answers during class.
  • Submit reviews/critiques
    – In myWPI before class
    – Bring 2 hard copies to class
    – Hand in one copy, and keep one copy with you.

Review Writing: http://users.wpi.edu/~yli15/courses/DS504Fall16/Critiques.html

  • Attend in-class discussions
  • Please ask and answer questions in (and out of) class!
  • Let’s try to make the class interactive and fun!