Welcome to DS504/CS586: Big Data Analytics Application I




slide-1
SLIDE 1

Welcome to

DS504/CS586: Big Data Analytics Application I

  • Prof. Yanhua Li

Time: 6:00pm–8:50pm R
Location: KH 116
Fall 2017

slide-2
SLIDE 2
  • 16 critiques; next Thursday we have the last critique.

– Already graded 4 of them.
– Plan to grade 1-2 more.

slide-3
SLIDE 3
  • Grading

– Projects (40%)

  • Project 1 (10%)
  • Project 2 (30%)

– Final reports in the discussion forum (by 11:59pm 12/12 Tue);
– Self-and-peer evaluation form for Project 2 (by 11:59pm 12/12 Tue);

– Written work (30%):

  • Critiques + project reports (20%)
  • Quiz (10%, with 5% each)

– Oral work (30%):

  • Presentation (project presentation + reading assignment presentation)

slide-4
SLIDE 4
  • Final Project Presentation

– 20 minutes each group (including Q&A and transition)

– Schedule:

  • 12/14 Thu

– Last week presentation data for all 7 teams
– We will have snacks and soda.

slide-5
SLIDE 5

Next Class: Summary and Discussion

  • Review of the semester
  • Plus the last critique/review

slide-6
SLIDE 6

Urban Sensing & Data Acquisition

Participatory Sensing, Crowd Sensing, Mobile Sensing

Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy

Urban Data Management

Spatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data Analytics

Data Mining, Machine Learning, Visualization

Service Providing

Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Urban Computing: concepts, methodologies, and applications. Zheng, Y., et al. ACM Transactions on Intelligent Systems and Technology.

  • Urban Computing, Social Network Analysis, Networking
  • Graph Mining, Data Clustering, Recommender Systems
  • Indexing, Query Processing
  • Error Correction, Map-Matching
  • Representative data collection: Sampling

slide-7
SLIDE 7

Real-world problems are always messy

  • Multiple models
  • Key features
  • Data Sparsity
slide-8
SLIDE 8
  • What do we do to solve a classification/inference/prediction problem?

– Data cleaning
– Feature selection
– Inference model
– Evaluation

  • An example of how to solve a real-world application problem

slide-9
SLIDE 9

U-Air: When Urban Air Quality Meets Big Data

Authors: Yu Zheng, Microsoft Research Asia

slide-10
SLIDE 10

Background

  • Air quality

– NO2, SO2 – Aerosols: PM2.5, PM10

  • Why it matters

– Healthcare
– Pollution control and dispersal

  • Reality

– Building a measurement station is not easy
– A limited number of stations (poor coverage)

Beijing has only 22 air quality monitoring stations in its urban area (50 km × 40 km).

[Figure legend: air quality monitor station.]

slide-11
SLIDE 11

2PM, June 17, 2013

slide-12
SLIDE 12

Challenges

  • Air quality varies by location non-linearly
  • Affected by many factors

– Weather, traffic, land use…
– Too subtle to model with a clear formula

[Figure: A) Beijing (8/24/2012 - 3/8/2013). Deviation of PM2.5 between S12 and S13, plotted against Proportion. Annotation: >35%.]

slide-13
SLIDE 13

We do not really know the air quality of a location without a monitoring station!

slide-14
SLIDE 14

Challenges

  • Existing methods do not work well

– Linear interpolation
– Classical dispersion models

    • Gaussian Plume models and Operational Street Canyon models
    • Many parameters are difficult to obtain: vehicle emission rates, street geometry, the roughness coefficient of the urban surface…

– Satellite remote sensing

    • Suffers from clouds
    • Does not reflect ground-level air quality
    • Varies with humidity, temperature, location, and season

– Outsourced crowd sensing using portable devices

    • Limited to a few gases: CO2 and CO
    • Sensors for detecting aerosols (PM10, PM2.5) are not portable
    • A long sensing process: 1-2 hours

[Figure: sensing device; 30,000+ USD, 10 µg/m3, 202×85×168 (mm).]

slide-15
SLIDE 15

Inferring Real-Time and Fine-Grained air quality throughout a city using Big Data

Meteorology, Traffic, POIs, Road networks, Human mobility, Historical air quality data, Real-time air quality reports

slide-16
SLIDE 16
slide-17
SLIDE 17

Applications

  • Location-based air quality awareness

– Fine-grained pollution alert
– Routing based on air quality

  • Deploying new monitoring stations
  • A step towards identifying the root cause of air pollution


slide-18
SLIDE 18

Difficulties

  • 1. How to identify features from each kind of data source
  • 2. Incorporate multiple heterogeneous data sources into a learning model

– Spatially-related data: POIs, road networks
– Temporally-related data: traffic, meteorology, human mobility

  • 3. Data sparseness (little training data)

– Limited number of stations
– Many places to infer

slide-19
SLIDE 19

Methodology Overview

  • Partition a city into disjoint grids
  • Extract features for each grid from its affecting region

– Meteorological features
– Traffic features
– Human mobility features
– POI features
– Road network features

  • Co-training-based semi-supervised learning model for each pollutant

– Predict the AQI labels
– Data sparsity
– Two classifiers
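
The first step, partitioning a city into disjoint grids, could be sketched roughly as follows. The function name `grid_index`, the 1 km cell size, the south-west corner coordinates, and the flat-earth degree-to-kilometre conversion are all illustrative assumptions; the slides do not say how the grid is computed.

```python
import math

def grid_index(lat, lon, lat0, lon0, cell_km=1.0):
    """Map a (lat, lon) point to the (row, col) cell of a uniform grid.

    (lat0, lon0) is the assumed south-west corner of the city's bounding box.
    An equirectangular approximation is used, which is adequate at city scale.
    """
    km_per_deg_lat = 111.32  # approximate length of one degree of latitude
    km_per_deg_lon = 111.32 * math.cos(math.radians(lat0))
    row = int((lat - lat0) * km_per_deg_lat // cell_km)
    col = int((lon - lon0) * km_per_deg_lon // cell_km)
    return row, col

# A point about 2.5 km north of the corner falls in row 2, column 0.
cell = grid_index(39.8 + 2.5 / 111.32, 116.2, 39.8, 116.2)
```

Every feature in the following slides would then be aggregated per cell (plus its affecting region).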

slide-20
SLIDE 20

Meteorological Features: Fm

  • Rainy, Sunny, Cloudy, Foggy
  • Wind speed
  • Temperature
  • Humidity
  • Barometric pressure

[Figure: AQI of PM10, August to Dec. 2012 in Beijing. Legend: Good, Moderate, Unhealthy, Unhealthy-S, Very Unhealthy.]

slide-21
SLIDE 21

Traffic Features: Ft

  • Distribution of speed by time: F(v)
  • Expectation of speed: E(v)
  • Standard deviation of speed: D(v)

[Figure: panels for the speed bins 0≤v<20, 20≤v<40, v≥40 and for E(v) and D(v). Legend: Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy.]

GPS trajectories generated by over 30,000 taxis from August to Dec. 2012 in Beijing
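
As a rough illustration of the three traffic features, the sketch below computes F(v), E(v), and D(v) from the speed samples observed in one grid during one time slot. The bin edges come from the slide; the km/h unit, the use of the population standard deviation, and the function name `traffic_features` are assumptions.

```python
from statistics import mean, pstdev

def traffic_features(speeds):
    """Return (F(v), E(v), D(v)) for the speed samples of one grid and time slot.

    F(v) is the share of samples in the bins 0<=v<20, 20<=v<40, v>=40.
    """
    counts = [0, 0, 0]
    for v in speeds:
        if v < 20:
            counts[0] += 1
        elif v < 40:
            counts[1] += 1
        else:
            counts[2] += 1
    distribution = [c / len(speeds) for c in counts]
    return distribution, mean(speeds), pstdev(speeds)

f_v, e_v, d_v = traffic_features([10, 30, 30, 50])
```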

slide-22
SLIDE 22

Human Mobility Features: Fh

  • Human mobility implies

– Traffic flow
– Land use of a location
– Function of a region (like residential or business areas)

  • Features:

– Number of arrivals fa and leavings (departures) fl

[Figure: A) AQI of PM10, B) AQI of NO2, plotted over the number of arrivals fa and leavings fl; parks vs. factories. Legend: Good, Moderate, Unhealthy, Unhealthy-S, Very Unhealthy.]

slide-23
SLIDE 23

Extracting Traffic/Human Mobility Features

  • Offline spatio-temporal indexing
  • ta: arrival time
  • Traj: traj ID
  • Ii: the index of the first GPS point (in the trajectory) entering a grid
  • Io: the index of the last GPS point (in the trajectory) entering the grid
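
One possible reading of this index is sketched below: each visit of a trajectory to a grid becomes an entry (ta, Traj, Ii, Io). The input layout (a time-ordered list of `(timestamp, grid_id)` pairs, already map-matched to grids) and the function name `index_trajectory` are assumptions made for illustration.

```python
from collections import defaultdict

def index_trajectory(traj_id, points):
    """Build per-grid index entries (ta, traj_id, Ii, Io) for one trajectory.

    points: list of (timestamp, grid_id) tuples in time order.
    Each maximal run of consecutive points inside one grid yields one entry.
    """
    index = defaultdict(list)
    start = 0
    for i in range(1, len(points) + 1):
        # Close the current run when the grid changes or the trajectory ends.
        if i == len(points) or points[i][1] != points[start][1]:
            ta, grid = points[start]
            index[grid].append((ta, traj_id, start, i - 1))
            start = i
    return dict(index)

entries = index_trajectory("t1", [(0, "A"), (1, "A"), (2, "B"), (3, "A")])
```

Counting the entries of a grid whose arrival time falls in a time slot would give the number of arrivals for that slot; leavings could be counted the same way from the timestamp of point Io.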
slide-24
SLIDE 24

POI Features: Fp

  • Why POI

– Indicate the land use and the function of the region
– Indicate the traffic patterns in the region

  • Features

– Number of POIs in each category
– Portion of vacant places
– The changes in the number of POIs

  • Factories, shopping malls
  • Hotels and real estate
  • Parks, decoration and furniture markets
slide-25
SLIDE 25

Road Network Features: Fr

  • Why road networks

– Have a strong correlation with traffic flows
– A good complement to traffic modeling

  • Features:

– Total length of highways fh
– Total length of other (low-level) road segments fr
– The number of intersections fs in the grid's affecting region

[Figure: panels labeled fh, fr, fs.]

slide-26
SLIDE 26

Semi-Supervised Learning Model

  • Philosophy of the model

– States of air quality

    • Temporal dependency in a location
    • Geo-correlation between locations

– Generation of air pollutants

    • Emission from a location
    • Propagation among locations

– Two sets of features

    • Spatially-related
    • Temporally-related

[Figure: stations s1-s4 and a location l shown at times t1, t2, ..., ti; axes: Time and Geospace. Legend: a location with AQI labels; a location to be inferred; temporal dependency; spatial correlation.]

[Figure: Co-Training of two classifiers. Spatial Classifier: POIs (Fp), Road Networks (Fr). Temporal Classifier: Meteorologic (Fm), Traffic (Ft), Human mobility (Fh).]

slide-27
SLIDE 27

Co-Training-Based Learning Model

  • Spatial classifier

– Model the spatial correlation between AQI of different locations
– Using spatially-related features
– Based on a neural network

  • Input generation

– Select n stations to pair with
– Perform m rounds

[Figure: input generation followed by an ANN. Symbols shown: features Fp, Fr and locations l of stations 1..k and of the target x; D1, D2; pairwise inputs ∆Pkx, ∆Rkx, dkx; labels c; ANN weights w, w' and biases b, b', b''; output cx.]

slide-28
SLIDE 28

Artificial Neural Networks (ANN)

[Figure: a black box with inputs X1, X2, X3 and output Y.]

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

Output Y is 1 if at least two of the three inputs are equal to 1.

slide-29
SLIDE 29

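Artificial Neural Networks (ANN)

[Figure: the same truth table and black box as on the previous slide, now opened up: input nodes X1, X2, X3, each linked with weight 0.3 to an output node Y with threshold t = 0.4.]

$$Y = I(0.3X_1 + 0.3X_2 + 0.3X_3 - 0.4 > 0), \quad \text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$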
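
A quick check, as a sketch, that the weights 0.3 and threshold 0.4 from this slide reproduce the rule "Y is 1 if at least two of the three inputs are 1". The function names are hypothetical.

```python
from itertools import product

def perceptron(x, weights=(0.3, 0.3, 0.3), t=0.4):
    """Y = I(sum_i w_i * X_i - t > 0) with the weights and threshold from the slide."""
    z = sum(w * xi for w, xi in zip(weights, x)) - t
    return 1 if z > 0 else 0

def truth_table():
    """Enumerate all 8 binary inputs together with the perceptron output."""
    return [(x, perceptron(x)) for x in product((0, 1), repeat=3)]

# The perceptron agrees with the "at least two inputs are 1" rule on every row.
matches_rule = all(y == (1 if sum(x) >= 2 else 0) for x, y in truth_table())
```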

slide-30
SLIDE 30

Artificial Neural Networks (ANN)

  • Model is an assembly of inter-connected nodes and weighted links
  • Output node sums up each of its input values according to the weights of its links
  • Compare output node against some threshold t

[Figure: perceptron with input nodes X1, X2, X3, weights w1, w2, w3, and an output node Y with threshold t.]

Perceptron Model

$$Y = I\left(\sum_i w_i X_i - t > 0\right) \quad \text{or} \quad Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)$$
slide-31
SLIDE 31

General Structure of ANN

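[Figure: neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, and threshold t; the weighted sum Si passes through the activation function g(Si) to give the output Oi.]

[Figure: network with an input layer (x1 to x5), a hidden layer, and an output layer (y).]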

Training ANN means learning the weights of the neurons

slide-32
SLIDE 32

Co-Training-Based Learning Model

  • Spatial classifier

– Model the spatial correlation between AQI of different locations
– Using spatially-related features
– Based on a neural network

  • Input generation

– Select n stations to pair with
– Perform m rounds

[Figure: input generation followed by an ANN, repeated from the earlier spatial classifier slide.]

slide-33
SLIDE 33

Co-Training-Based Learning Model

  • Temporal classifier

– Model the temporal dependency of the air quality in a location
– Using temporally-related features
– Based on a Linear-Chain Conditional Random Field (CRF)

[Figure: linear-chain CRF over time steps t-1, t, t+1, with label nodes Y and, at each step, the feature nodes Fm, Ft, Fh.]

slide-34
SLIDE 34

Learning Process

[Figure: co-training learning process. The linear-chain CRF is trained on temporally-related features and the ANN (with input generation) on spatially-related features. Boxes: Labeled data, Training, Inference, Unlabeled data.]
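
The generic co-training loop behind this slide can be sketched as below. This is not the U-Air implementation: a toy nearest-centroid classifier stands in for both the ANN and the CRF, and the confidence score, the `per_round` parameter, and all function names are assumptions. The point is only the control flow: train one model per feature view on the labeled data, let each label the unlabeled examples it is most confident about, and repeat.

```python
import math

def fit(X, y):
    """Nearest-centroid stand-in classifier: one mean vector per label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_proba(centroids, x):
    """Turn distances to the centroids into normalized confidence scores."""
    scores = {lab: math.exp(-math.dist(c, x)) for lab, c in centroids.items()}
    total = sum(scores.values())
    return {lab: s / total for lab, s in scores.items()}

def co_train(view_a, view_b, labels, rounds=10, per_round=1):
    """labels[i] is a class label, or None for an unlabeled example."""
    labels = list(labels)
    for _ in range(rounds):
        unlabeled = [i for i, lab in enumerate(labels) if lab is None]
        if not unlabeled:
            break
        labeled = [i for i, lab in enumerate(labels) if lab is not None]
        y = [labels[i] for i in labeled]
        newly_labeled = {}
        for view in (view_a, view_b):
            model = fit([view[i] for i in labeled], y)
            scored = []
            for i in unlabeled:
                proba = predict_proba(model, view[i])
                best = max(proba, key=proba.get)
                scored.append((proba[best], i, best))
            scored.sort(reverse=True)
            # Each classifier contributes its most confident predictions.
            for _, i, best in scored[:per_round]:
                newly_labeled.setdefault(i, best)
        for i, lab in newly_labeled.items():
            labels[i] = lab
    return labels

view_a = [(0.0,), (0.1,), (1.0,), (0.9,), (0.2,), (0.8,)]  # e.g. spatial view
view_b = [(5.0,), (5.2,), (9.0,), (8.7,), (5.5,), (8.5,)]  # e.g. temporal view
result = co_train(view_a, view_b, ["G", None, "U", None, None, None])
```

Starting from only two labeled examples, the loop ends with every example labeled, which is the sense in which co-training addresses data sparsity.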

slide-35
SLIDE 35

Inference Process

[Figure: inference process. The ANN (with input generation) takes the spatially-related features and the linear-chain CRF takes the temporally-related features; each outputs a probability for every class, and the two outputs are multiplied.]

$$\langle p_{c_1}, \dots, p_{c_n} \rangle \times \langle p'_{c_1}, \dots, p'_{c_n} \rangle$$

$$c = \arg\max_{c_i \in \mathcal{C}} \left( P^{c_i}_{SC} \times P^{c_i}_{TC} \right)$$
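
The fusion rule on this slide, taking the class that maximizes the product of the spatial classifier (SC) and temporal classifier (TC) probabilities, can be sketched in a few lines. The dictionary layout and the example probabilities are made up for illustration.

```python
def fuse(p_sc, p_tc):
    """Return the class c_i maximizing P_SC(c_i) * P_TC(c_i).

    p_sc, p_tc: dicts mapping each class label to a probability.
    """
    return max(p_sc, key=lambda c: p_sc[c] * p_tc[c])

# Hypothetical outputs of the two classifiers for one grid.
p_sc = {"G": 0.5, "M": 0.3, "U": 0.2}
p_tc = {"G": 0.2, "M": 0.6, "U": 0.2}
label = fuse(p_sc, p_tc)  # products: G 0.10, M 0.18, U 0.04
```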

slide-36
SLIDE 36

Evaluation

  • Datasets

Data sources              Beijing             Shanghai            Shenzhen           Wuhan
POI    2012 Q1            271,634             321,529             107,061            102,467
       2012 Q3            272,109             317,829             107,171            104,634
Road   #. Segments        162,246             171,191             45,231             38,477
       Highways           1,497km             1,963km             256km              1,193km
       Roads              18,525km            25,530km            6,100km            9,691km
       #. Intersec.       49,981              70,293              32,112             25,359
AQI    #. Stations        22                  10                  9                  10
       Hours              23,300              8,588               6,489              6,741
       Time spans         8/24/2012-3/8/2013  1/19/2013-3/8/2013  2/4/2013-3/8/2013  2/4/2013-3/8/2013
Urban size (grids)        50×50km (2500)      50×50km (2500)      57×45km (2565)     45×25km (1165)

[Figure: station locations (S1, S2, ...) in A) Beijing, B) Shanghai, C) Shenzhen, D) Wuhan.]

slide-37
SLIDE 37

Evaluation

  • Ground Truth

– Remove a station
– Cross cities

  • Baselines

– Linear and Gaussian Interpolations
– Classical Dispersion Model
– Decision Tree (DT)
– CRF-ALL
– ANN-ALL

slide-38
SLIDE 38

Evaluation

  • Does every kind of feature count?

Features            PM10                  NO2
                    Precision   Recall    Precision   Recall
Fm                  0.572       0.514     0.477       0.454
Ft                  0.341       0.36      0.371       0.35
Fh                  0.327       0.364     0.411       0.483
Fp+Fr               0.441       0.443     0.307       0.354
Fm+Ft               0.664       0.675     0.634       0.635
Fm+Ft+Fp+Fr         0.731       0.734     0.701       0.691
Fm+Ft+Fp+Fr+Fh      0.773       0.754     0.723       0.704
slide-39
SLIDE 39

Evaluation

  • Overall performance of the co-training

[Figure: accuracy vs. number of iterations for SC, TC, and Co-Training.]

[Figure: accuracy on NO2 and PM10 for U-Air, Linear, Gaussian, Classical, DT, CRF-ALL, and ANN-ALL.]

slide-40
SLIDE 40

Evaluation

  • Confusion matrix of Co-Training on PM10

Ground truth    Predictions                     Recall
                G       M       S       U
G               3789    402     102     0       0.883
M               602     3614    204     0       0.818
S               41      200     532     50      0.646
U               0       22      70      219     0.704
Precision       0.855   0.853   0.586   0.814   0.828
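
The precision, recall, and overall accuracy figures can be recomputed from the counts, as in the sketch below. Rows are taken as ground truth and columns as predictions, and the three cells that appear blank on the slide are assumed to be 0; under those assumptions the computed values match the reported ones.

```python
LABELS = ["G", "M", "S", "U"]
# Rows: ground truth; columns: predictions. Zeros are assumed for blank cells.
CM = [
    [3789, 402, 102, 0],
    [602, 3614, 204, 0],
    [41, 200, 532, 50],
    [0, 22, 70, 219],
]

def recall(cm, k):
    """Correct predictions of class k divided by all true instances of k."""
    return cm[k][k] / sum(cm[k])

def precision(cm, k):
    """Correct predictions of class k divided by all predictions of k."""
    return cm[k][k] / sum(row[k] for row in cm)

def accuracy(cm):
    """Diagonal total divided by the total number of instances."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

recalls = [round(recall(CM, k), 3) for k in range(len(LABELS))]
precisions = [round(precision(CM, k), 3) for k in range(len(LABELS))]
overall = round(accuracy(CM), 3)
```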
slide-41
SLIDE 41

Evaluation

  • Performance of Spatial classifier

Cities      PM2.5             PM10              NO2
            Prec.    Rec.     Prec.    Rec.     Prec.    Rec.
Beijing     0.764    0.763    0.762    0.745    0.730    0.749
Shanghai    0.705    0.725    0.702    0.718    0.715    0.706
Shenzhen    0.740    0.737    0.710    0.742    0.732    0.722
Wuhan       0.727    0.723    0.731    0.739    0.744    0.719
slide-42
SLIDE 42

Evaluation

  • Efficiency study

Procedures                        Time (ms)
Feature extraction (per grid)
    Ft & Fh                       53.2
    Fp                            28.8
    Fr                            14.4
Inference (per grid)
    SC                            21.5
    TC                            13.1
Total                             131

  • Single grid: 131 ms
  • Inferring the AQIs for entire Beijing in 5 minutes
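
A back-of-the-envelope check of these numbers, assuming the grids are processed one after another and using the 2500 grids listed for Beijing on the datasets slide:

```python
# Per-grid times in milliseconds, as listed in the efficiency table.
per_grid_ms = {"Ft&Fh": 53.2, "Fp": 28.8, "Fr": 14.4, "SC": 21.5, "TC": 13.1}

total_ms = sum(per_grid_ms.values())  # about 131 ms per grid
beijing_grids = 2500                  # grid count from the datasets slide
city_minutes = total_ms * beijing_grids / 1000 / 60
```

Under these assumptions the whole city takes roughly five and a half minutes, in line with the 5-minute figure on the slide.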
slide-43
SLIDE 43

Conclusion

  • Infer fine-grained air quality with

– Real-time and historical air quality readings from existing stations
– Other data sources: meteorology, POIs, road network, human mobility, and traffic conditions

  • Co-Training-based semi-supervised learning approach

– Deal with data sparsity by learning from unlabeled data
– Model the spatial correlation among the air quality of different locations
– Model the temporal dependency of the air quality in a location

  • Results

– 0.82 with traffic data (co-training)
– 0.76 if only using spatial classifier

slide-44
SLIDE 44

Questions?