DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning

Welcome to DS504/CS586: Big Data Analytics, Data Pre-processing and Cleaning. Prof. Yanhua Li. Time: 6:00pm – 8:50pm R. Location: AK 233. Spring 2018. Merged CS586 and DS504. Graded one review. Examples of reviews/critiques. Random selection.


slide-1
SLIDE 1

Welcome to

DS504/CS586: Big Data Analytics
Data Pre-processing and Cleaning

  • Prof. Yanhua Li

Time: 6:00pm – 8:50pm R
Location: AK 233
Spring 2018

slide-2
SLIDE 2

  • Merged CS586 and DS504
  • Graded one review
  • Examples of reviews/critiques
  • Random selection

slide-3
SLIDE 3

Ocean Biodiversity Informatics, Hamburg

The Data Equation Oceans of Data

Praia de Forte, Brazil

slide-4
SLIDE 4

Data Quality Dimensions

  • Accuracy
    – Errors in data
    – Example: “Jhn” vs. “John”
  • Currency
    – Lack of updated data
    – Example: residence (permanent) address, outdated vs. up to date
  • Consistency
    – Discrepancies in the data
    – Example: ZIP code and city must be consistent
  • Completeness
    – Lack of data
    – Partial knowledge of the records in a table

slide-5
SLIDE 5

Ocean Biodiversity Informatics, Hamburg

Geographic outliers - GIS

Country, State, named district, etc.

Gazetteer of Brazilian localities

slide-6
SLIDE 6

What do we mean by ‘Data Quality’?

“An essential or distinguishing characteristic necessary for data to be fit for use.” (SDTS 02/92)

“The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data.” (Chrisman, 1991)

slide-7
SLIDE 7

Loss of data quality

Loss of data quality can occur at many stages:

  • At the time of collection
  • During digitisation
  • During documentation
  • During storage and archiving
  • During analysis and manipulation
  • At time of presentation
  • And through the use to which they are put

“Don’t underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor.” (Redman 2001)

slide-8
SLIDE 8

Data Cleaning

  • Data cleaning tasks
    – Accuracy: Smooth out noisy data
    – Currency: Update the records
    – Consistency: Correct inconsistent data
    – Completeness: Fill in missing values
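Three of the cleaning tasks listed on this slide can be illustrated with a minimal Python sketch. The function names, the moving-median smoother, the mean imputation, and the ZIP lookup table are all assumptions chosen for illustration; the slides do not prescribe a particular method. Currency (updating records) needs an external up-to-date source and is not shown.

```python
from statistics import mean, median


def fill_missing(values):
    """Completeness: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth(values, window=3):
    """Accuracy: damp noise with a centered moving median (window shrinks at the edges)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(median(values[lo:hi]))
    return out


def inconsistent_rows(rows, zip_to_city):
    """Consistency: return rows whose city disagrees with the ZIP lookup table."""
    return [r for r in rows if zip_to_city.get(r["zip"]) != r["city"]]


# Hypothetical example data, not taken from the slides.
rows = [
    {"zip": "01609", "city": "Worcester"},
    {"zip": "01609", "city": "Boston"},
]
flagged = inconsistent_rows(rows, {"01609": "Worcester"})
```

A moving median is used here instead of a moving mean because a single outlier does not leak into its neighbors; other smoothers would be equally valid choices.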

slide-9
SLIDE 9

Map matching

slide-10
SLIDE 10

Map-matching

  • Problem: (Sampled data)
    – GPS trajectory = a sequence of GPS locations with time stamps
    – Map a GPS trajectory onto a road network
    – A sequence of GPS points → a sequence of road segments

slide-11
SLIDE 11

Spatial Data

[Figure: road network with segments e1, e2, e3, and e4; each segment has a start and an end point, e.g. e3.start and e3.end]

slide-12
SLIDE 12

Map-Matching

  • Why it is important
    – A fundamental step in many transportation applications
      • Navigation and driving
      • Traffic analysis
      • Taxi dispatching and recommendations
    – Examples:
      • Find the vehicles passing Institute Road
      • Calculate the average travel time from WPI campus to MIT campus
      • When will Bus 3 arrive at the stop at Highland St & North Ashland St

slide-13
SLIDE 13

Map-Matching

  • Simple solution for high-sampling-rate data
    – Weighted distance
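The slide does not define the weighting, so the following Python sketch is only one plausible reading: each GPS point is matched independently to the segment with the lowest weighted sum of point-to-segment distance and heading difference. The function names, the weights `w_dist` and `w_head`, planar coordinates, and directed segments (start to end) are all assumptions made for illustration.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment: a == b
    # Projection parameter of p onto the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def heading_difference(heading, a, b):
    """Smallest angle (radians) between the vehicle heading and the direction a -> b."""
    seg_heading = math.atan2(b[1] - a[1], b[0] - a[0])
    diff = abs(heading - seg_heading) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)


def match_point(p, heading, segments, w_dist=1.0, w_head=10.0):
    """Return the id of the segment with the lowest weighted score (weights are hypothetical)."""
    def score(item):
        _, (a, b) = item
        return (w_dist * point_segment_distance(p, a, b)
                + w_head * heading_difference(heading, a, b))
    return min(segments.items(), key=score)[0]


# Hypothetical road network: e1 runs east, e2 runs north.
segments = {"e1": ((0, 0), (10, 0)), "e2": ((0, 1), (0, 11))}
```

With these weights, a point at (0.4, 0.5) heading north is matched to `e2` even though `e1` is slightly closer, which is the kind of correction a heading term provides near intersections.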

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

slide-14
SLIDE 14

Map-Matching (for low sampling rate)

  • Why difficult

[Figure: ambiguous cases, (a) parallel roads, (b) overpass, (c) spur]

slide-15
SLIDE 15

Map-Matching

  • According to the additional information used
    – Geometric
    – Topological
    – Probabilistic
    – Advanced techniques
  • According to the range of sampling points
    – Local/incremental
    – Global

Yu Zheng. Trajectory Data Mining: An Overview. ACM Transactions on Intelligent Systems and Technology, 6, 3, 2015.

slide-16
SLIDE 16

Map-matching

  • Insights
    – Consider both local and global information
    – Incorporate both spatial and temporal features

[Figure: GPS points pa, pb, pc]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

slide-17
SLIDE 17

Map-matching framework

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

slide-18
SLIDE 18

Map-matching

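  • Solution (incorporating spatial information)
    – Observation probability: models the local possibility of each candidate
    – Transmission probability: considers the context (global)
    – Spatial analysis function: combines the two

[Figure: GPS points p_{i-1}, p_i, p_{i+1}; candidate points c_i^1, c_i^2, c_i^3 of p_i on road segments e_i^1, e_i^2, e_i^3]

Observation probability, where x_i^j is the distance between p_i and its candidate c_i^j:

$$N(c_i^j) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i^j - \mu)^2}{2\sigma^2}}$$

Transmission probability, where d_{i-1→i} is the Euclidean distance between p_{i-1} and p_i, and w_{(i-1,t)→(i,s)} is the length of the shortest path from c_{i-1}^t to c_i^s:

$$V(c_{i-1}^t \to c_i^s) = \frac{d_{i-1 \to i}}{w_{(i-1,t) \to (i,s)}}$$

Spatial analysis function:

$$F_s(c_{i-1}^t \to c_i^s) = V(c_{i-1}^t \to c_i^s) \cdot N(c_i^s)$$

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009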
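The three spatial quantities on this slide can be sketched in Python as follows. This is an illustration, not the authors' implementation: the function names are hypothetical, the default `sigma=20.0` and `mu=0.0` are illustrative values not stated on the slide, and the shortest-path length is assumed to be computed elsewhere (for example by a routing library).

```python
import math


def observation_probability(dist, sigma=20.0, mu=0.0):
    """N(c_i^j): normal density of the distance between p_i and its candidate c_i^j."""
    return math.exp(-((dist - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def transmission_probability(euclid_dist, shortest_path_len):
    """V: straight-line distance between p_{i-1} and p_i divided by the
    shortest-path length between the two candidates."""
    if shortest_path_len == 0:
        return 1.0  # assumption: both candidates coincide, so no detour is possible
    return euclid_dist / shortest_path_len


def spatial_score(dist_to_candidate, euclid_dist, shortest_path_len, sigma=20.0):
    """F_s = V * N for the transition into the candidate at distance dist_to_candidate."""
    return (transmission_probability(euclid_dist, shortest_path_len)
            * observation_probability(dist_to_candidate, sigma))
```

For two candidates equally close to the GPS point, the one reachable by a path whose length is close to the straight-line distance scores higher than one that requires a long detour.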

slide-19
SLIDE 19

Map-matching

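  • Solution (cosine similarity)
  • Temporal analysis function (considering temporal information)
  • Shortest path is used.

[Figure: GPS points P_{i-1} and P_i near a highway and a service road]

Temporal analysis function, where e'_1, ..., e'_k are the road segments on the shortest path from c_{i-1}^t to c_i^s and e'_u.v is the speed limit of segment e'_u:

$$F_t(c_{i-1}^t \to c_i^s) = \frac{\sum_{u=1}^{k} \left(e'_u.v \times \bar{v}_{(i-1,t) \to (i,s)}\right)}{\sqrt{\sum_{u=1}^{k} (e'_u.v)^2} \times \sqrt{\sum_{u=1}^{k} \bar{v}_{(i-1,t) \to (i,s)}^2}}$$

Average speed along that path, where l_u is the length of e'_u and Δt_{i-1→i} is the time interval between p_{i-1} and p_i:

$$\bar{v}_{(i-1,t) \to (i,s)} = \frac{\sum_{u=1}^{k} l_u}{\Delta t_{i-1 \to i}}$$

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009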
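A minimal Python sketch of the temporal analysis function follows, assuming the shortest path between the two candidates has already been found and is described by the speed limit and length of each of its segments. The function and parameter names are hypothetical, and returning 0.0 for a zero denominator is a choice made here, not something the slide specifies.

```python
import math


def average_speed(lengths, delta_t):
    """Average speed on the shortest path: total segment length over elapsed time."""
    return sum(lengths) / delta_t


def temporal_score(speed_limits, lengths, delta_t):
    """F_t: cosine similarity between the vector of segment speed limits and
    the constant vector (v_bar, ..., v_bar) of the same length k."""
    v_bar = average_speed(lengths, delta_t)
    numerator = sum(v * v_bar for v in speed_limits)
    denominator = (math.sqrt(sum(v * v for v in speed_limits))
                   * math.sqrt(len(speed_limits) * v_bar ** 2))
    return numerator / denominator if denominator else 0.0
```

Usage, with made-up numbers: `temporal_score([15.0, 15.0], [150.0, 150.0], 20.0)` evaluates the path over two segments with a 15 m/s limit, 300 m total, covered in 20 s.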

slide-20
SLIDE 20

Map-matching

  • Aggregating
    – Spatial and temporal information
    – Local and global information
  • Dynamic programming
  • Spatio-temporal function

[Figure: candidate graph; columns are P1's candidates (c_1^1, c_1^2, c_1^3), P2's candidates (c_2^1, c_2^2), ..., Pn's candidates (c_n^1, c_n^2); edges such as c_1^1 → c_2^1 and c_1^3 → c_2^2 connect candidates of consecutive points]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

$$F(c_{i-1}^t \to c_i^s) = F_s(c_{i-1}^t \to c_i^s) \cdot F_t(c_{i-1}^t \to c_i^s), \quad 2 \le i \le n$$

slide-21
SLIDE 21

Map-matching

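  • Path selection
  • Dynamic programming

[Figure: candidate graph; columns are P1's candidates (c_1^1, c_1^2, c_1^3), P2's candidates (c_2^1, c_2^2), ..., Pn's candidates (c_n^1, c_n^2); edges such as c_1^1 → c_2^1 and c_1^3 → c_2^2 connect candidates of consecutive points]

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

For a candidate path P_c: c_1^{s_1} → c_2^{s_2} → ... → c_n^{s_n}:

$$P = \arg\max_{P_c} F(P_c)$$

$$F(P_c) = N(c_1^{s_1}) + \sum_{i=2}^{n} F(c_{i-1}^{s_{i-1}} \to c_i^{s_i})$$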
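The path selection over the candidate graph can be sketched as a dynamic program that keeps, for every candidate, the best score of any path ending there plus a back-pointer. This Python sketch assumes the scores N and F have already been computed; the function name and the list-of-matrices input layout are hypothetical choices for illustration.

```python
def best_path(first_scores, transition_scores):
    """Select the candidate sequence with the highest total score.

    first_scores[s] is N(c_1^s) for candidate s of the first GPS point.
    transition_scores[k][t][s] is F(c^t -> c^s) from candidate t of point k+1
    to candidate s of point k+2 (k is a 0-based list index).
    Returns (total score, list with one chosen candidate index per GPS point).
    """
    score = list(first_scores)  # best score of a path ending at each candidate
    back = []                   # back[k][s] = best predecessor of candidate s
    for matrix in transition_scores:
        new_score, pointers = [], []
        for s in range(len(matrix[0])):
            t_best = max(range(len(score)), key=lambda t: score[t] + matrix[t][s])
            new_score.append(score[t_best] + matrix[t_best][s])
            pointers.append(t_best)
        score = new_score
        back.append(pointers)
    # Pick the best final candidate, then follow the back-pointers.
    s = max(range(len(score)), key=score.__getitem__)
    total = score[s]
    path = [s]
    for pointers in reversed(back):
        s = pointers[s]
        path.append(s)
    path.reverse()
    return total, path


# Made-up scores: 2 candidates for p1, 2 for p2, 1 for p3.
total, path = best_path([0.5, 0.4], [[[0.1, 0.2], [0.9, 0.1]], [[0.3], [0.8]]])
```

In this made-up example the second candidate of p1 is chosen although its own score is lower, because the transition out of it is much stronger; a purely local (greedy) choice would miss that.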

slide-22
SLIDE 22

Map-matching Example

  • Path Selection
slide-23
SLIDE 23

Map-matching framework

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

slide-24
SLIDE 24

Localized ST-Matching Strategy

  • Path Selection

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

slide-25
SLIDE 25

Evaluations

Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

$$A_N = \frac{\#\,\text{correctly matched road segments}}{\#\,\text{all road segments}}$$

$$A_L = \frac{\sum \text{length of matched road segments}}{\text{length of the trajectory}}$$
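The two accuracy measures can be computed as in the following Python sketch. It rests on assumptions that the slide leaves open: segments are compared as sets of ids (order and repeats are ignored), a segment counts as correctly matched when it appears in both the output and the ground truth, and the trajectory length is taken as the total length of the ground-truth segments. All names are hypothetical.

```python
def accuracy_by_number(matched, truth):
    """A_N: share of ground-truth road segments that were correctly matched."""
    truth_set = set(truth)
    return len(set(matched) & truth_set) / len(truth_set)


def accuracy_by_length(matched, truth, lengths):
    """A_L: length of correctly matched segments over the ground-truth length."""
    truth_set = set(truth)
    correct = set(matched) & truth_set
    return sum(lengths[e] for e in correct) / sum(lengths[e] for e in truth_set)


# Made-up example: e5 is a wrong match, e3 and e4 were missed.
truth = ["e1", "e2", "e3", "e4"]
matched = ["e1", "e2", "e5"]
lengths = {"e1": 100.0, "e2": 300.0, "e3": 50.0, "e4": 50.0, "e5": 80.0}
```

Here A_N is lower than A_L because the two missed segments are short, which shows why the slide reports both measures.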

slide-26
SLIDE 26

Homework assignment

slide-27
SLIDE 27

Project 1 Example

USPS https://ribbs.usps.gov/intelligentmail_package/documents/tech_guides/PUB199IMPBImpGuide.pdf

Project 1 Proposals

slide-28
SLIDE 28

Next Class: Data Management

  • Do assigned readings before class
    – Be prepared: read and review the required readings on your own in advance!
    – Do a literature survey: find and read related papers, if any
    – Bring your questions to the class and look for answers during the class.
  • Submit reviews/critiques
    – In Canvas before class
    – Bring 2 hard copies to the class
    – Hand in one copy, and keep one copy with you.
    – Review Writing: http://users.wpi.edu/~yli15/courses/DS504Fall16/Critiques.html
  • Attend in-class discussions
    – Please ask and answer questions in (and out of) class!
    – Let’s try to make the class interactive and fun!