
Welcome to DS504/CS586: Big Data Analytics Application I

Welcome to DS504/CS586: Big Data Analytics Application I. Prof. Yanhua Li. Time: 6:00pm–8:50pm R. Location: KH 116. Fall 2017. 16 critiques, and next Thursday we have the last critique. Already graded 4 of them. Plan to grade 1-2 more.


  1. Welcome to DS504/CS586: Big Data Analytics Application I. Prof. Yanhua Li. Time: 6:00pm–8:50pm R. Location: KH 116. Fall 2017.

  2. • 16 critiques, and next Thursday we have the last critique. – Already graded 4 of them. – Plan to grade 1-2 more.

  3. • Grading – Projects (40%) • Project 1 (10%) • Project 2 (30%) – Final reports in the discussion forum (by 11:59pm 12/12 Tue); – Self-and-peer evaluation form for Project 2 (by 11:59pm 12/12 Tue); – Written work (30%): • Critiques + project reports (20%) • Quiz (10%, with 5% each) – Oral work (30%): • Presentation (project presentation + reading assignment presentation)

  4. • Final Project Presentation – 20 minutes for each group (including Q&A and transition) – Schedule: • 12/14 Thu – Last-week presentation date for all 7 teams – We will have snacks and soda.

  5. Next Class: Summary and Discussion • Review of the semester • Plus the last critique/review

  6. Urban computing framework, by layer (with related topics in parentheses): • Service Providing: improve urban planning, ease traffic congestion, save energy, reduce air pollution, ... (Urban Computing, Social Network Analysis, Networking) • Urban Data Analytics: data mining, machine learning, visualization (Graph Mining, Data Clustering, Recommender Systems) • Urban Data Management: spatio-temporal index, streaming, trajectory, and graph data management, ... (Indexing, Query Processing) • Urban Sensing & Data Acquisition: participatory sensing, crowd sensing, mobile sensing (Error Correction, Map-Matching; Representative data collection: Sampling) • Data sources: human mobility, meteorology, road networks, air quality, social media, energy, POIs, traffic. Source: Zheng, Y., et al. Urban Computing: Concepts, Methodologies, and Applications. ACM Transactions on Intelligent Systems and Technology.

  7. Real-world problems are always messy • Multiple models • Key features • Data sparsity

  8. • What do we do to solve a classification/inference/prediction problem? – Data cleaning – Feature selection – Inference model – Evaluation • An example of how to solve a real-world application problem

  9. U-Air: When Urban Air Quality Meets Big Data. Authors: Yu Zheng, Microsoft Research Asia

  10. Background • Air quality – NO2, SO2 – Aerosols: PM2.5, PM10 • Why it matters – Healthcare – Pollution control and dispersal • Reality – Building a measurement station is not easy – A limited number of stations (poor coverage): Beijing has only 22 air quality monitor stations in its urban areas (50 km × 40 km) [Figure: air quality monitor station]

  11. 2PM, June 17, 2013

  12. Challenges • Air quality varies by location non-linearly • Affected by many factors – Weather, traffic, land use… – Difficult to model with a clear formula [Figure: proportion vs. deviation of PM2.5 between stations S12 and S13; A) Beijing (8/24/2012 - 3/8/2013); annotation: >35%]

  13. We do not really know the air quality of a location without a monitoring station!

  14. Challenges • Existing methods do not work well – Linear interpolation – Classical dispersion models • Gaussian Plume models and Operational Street Canyon models • Many parameters are difficult to obtain: vehicle emission rates, street geometry, the roughness coefficient of the urban surface… – Satellite remote sensing • Suffers from clouds • Does not reflect ground air quality • Varies with humidity, temperature, location, and season – Outsourced crowd sensing using portable devices • Limited to a few gases: CO2 and CO • Sensors for detecting aerosols (PM10, PM2.5) are not portable • A long sensing process: 1-2 hours [Figure: aerosol sensing device; 30,000+ USD, 10 µg/m³, 202×85×168 mm]

  15. Inferring real-time and fine-grained air quality throughout a city using big data • Meteorology • Traffic • Human mobility • POIs • Road networks • Historical air quality data • Real-time air quality reports

  16. Applications • Location-based air quality awareness – Fine-grained pollution alert – Routing based on air quality • Deploying new monitoring stations • A step towards identifying the root cause of air pollution [Figure: map with station labels S1 to S10]

  17. Difficulties • 1. How to identify features from each kind of data source • 2. How to incorporate multiple heterogeneous data sources into a learning model – Spatially-related data: POIs, road networks – Temporally-related data: traffic, meteorology, human mobility • 3. Data sparseness (little training data) – Limited number of stations – Many places to infer

  18. Methodology Overview • Partition a city into disjoint grids • Extract features for each grid from its affecting region – Meteorological features – Traffic features – Human mobility features – POI features – Road network features • Co-training-based semi-supervised learning model for each pollutant – Predict the AQI labels – Data sparsity – Two classifiers
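
The grid partition step on this slide can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names `grid_cell` and `affecting_region`, the use of a fixed cell size in degrees, and the reading of "affecting region" as a square neighborhood of cells are all assumptions made for the example.

```python
import math

def grid_cell(lat, lon, lat_min, lon_min, cell_deg):
    """Map a coordinate to the (row, col) of the disjoint grid cell that contains it."""
    row = math.floor((lat - lat_min) / cell_deg)
    col = math.floor((lon - lon_min) / cell_deg)
    return row, col

def affecting_region(row, col, radius=1):
    """ASSUMPTION: treat the affecting region as the cell plus its neighbors within `radius`."""
    return [(r, c)
            for r in range(row - radius, row + radius + 1)
            for c in range(col - radius, col + radius + 1)]

# Hypothetical bounding-box origin and cell size, chosen only for the demo.
cell = grid_cell(1.25, 2.75, lat_min=0.0, lon_min=0.0, cell_deg=0.5)
print(cell, len(affecting_region(*cell)))
```

Because every coordinate maps to exactly one `(row, col)` pair, the cells are disjoint by construction; features would then be aggregated over the cells returned by `affecting_region`.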

  19. Meteorological Features: F_m • Weather: rainy, sunny, cloudy, foggy • Wind speed • Temperature • Humidity • Barometric pressure [Figure: AQI of PM10, August to Dec. 2012 in Beijing; legend: Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy]

  20. Traffic Features: F_t • Distribution of speed by time: F(v), over the bins 0 ≤ v < 20, 20 ≤ v < 40, and v ≥ 40 km/h • Expectation of speed: E(v) • Standard deviation of speed: D(v) • Data: GPS trajectories generated by over 30,000 taxis from August to Dec. 2012 in Beijing [Figure: E(v) vs. D(v); legend: Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy]
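
The three traffic features on this slide can be computed from the speeds observed in one grid during one time slot, roughly as follows. This is a sketch under stated assumptions: the function name `traffic_features` is hypothetical, speeds are taken to be in km/h, the input is assumed non-empty, and the population standard deviation is used for D(v) because the slide does not say which estimator applies.

```python
from statistics import mean, pstdev

def traffic_features(speeds_kmh):
    """Return (F(v), E(v), D(v)) for the speeds seen in one grid and time slot."""
    counts = [0, 0, 0]  # bins: 0 <= v < 20, 20 <= v < 40, v >= 40
    for v in speeds_kmh:
        if v < 20:
            counts[0] += 1
        elif v < 40:
            counts[1] += 1
        else:
            counts[2] += 1
    n = len(speeds_kmh)
    distribution = [c / n for c in counts]  # proportion of samples per bin
    # ASSUMPTION: population standard deviation; a sample estimator may be intended.
    return distribution, mean(speeds_kmh), pstdev(speeds_kmh)

print(traffic_features([10, 30, 50, 30]))
```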

  21. Human Mobility Features: F_h • Human mobility implies – Traffic flow – Land use of a location – Function of a region (like residential or business areas) • Features: number of arrivals f_a and leavings (departures) f_l • Example: parks vs. factories [Figure: f_a vs. f_l; A) AQI of PM10, B) AQI of NO2; legend: Good, Moderate, Unhealthy-S, Unhealthy, Very Unhealthy]
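
One way to obtain the arrival and leaving counts f_a and f_l is to walk each trajectory as a sequence of grid cells and count cell changes. The sketch below assumes trajectories have already been mapped to cell labels in time order; the function name `arrivals_and_leavings` and this representation are illustrative assumptions, not the slide's method.

```python
from collections import Counter

def arrivals_and_leavings(trajectories):
    """Count arrivals (f_a) and leavings (f_l) per grid cell.

    trajectories: list of cell-label sequences, each in time order.
    """
    arrivals, leavings = Counter(), Counter()
    for cells in trajectories:
        for prev, cur in zip(cells, cells[1:]):
            if prev != cur:          # the trajectory crossed a cell boundary
                leavings[prev] += 1
                arrivals[cur] += 1
    return arrivals, leavings

f_a, f_l = arrivals_and_leavings([["A", "A", "B", "C"], ["C", "B"]])
print(dict(f_a), dict(f_l))
```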

  22. Extracting Traffic/Human Mobility Features • Offline spatio-temporal indexing • t_a: arrival time • Traj: trajectory ID • I_i: the index of the first GPS point (in the trajectory) entering a grid • I_o: the index of the last GPS point (in the trajectory) entering the grid
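
The index record on this slide (t_a, Traj, I_i, I_o) suggests a per-grid list of visits sorted by arrival time. The following is a rough sketch under that reading; the function name `build_grid_index`, the input layout, and the interpretation of I_o as the last consecutive point of the visit inside the grid are assumptions.

```python
from collections import defaultdict

def build_grid_index(trajectories):
    """Build {cell: [(t_a, traj_id, i_in, i_out), ...]} sorted by arrival time.

    trajectories: {traj_id: [(timestamp, cell), ...]} with points in time order.
    """
    index = defaultdict(list)
    for traj_id, points in trajectories.items():
        start = 0  # index of the first point of the current visit
        for i in range(1, len(points) + 1):
            # Close the visit when the cell changes or the trajectory ends.
            if i == len(points) or points[i][1] != points[start][1]:
                t_a, cell = points[start]
                index[cell].append((t_a, traj_id, start, i - 1))
                start = i
    for visits in index.values():
        visits.sort()  # order each cell's visits by arrival time
    return dict(index)

demo = {"T1": [(0, "A"), (1, "A"), (2, "B")], "T2": [(1, "B"), (5, "A")]}
print(build_grid_index(demo))
```

With such an index, the points of a trajectory inside a grid can be fetched by slicing from I_i to I_o, which is what makes per-grid speed and arrival statistics cheap to compute offline.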

  23. POI Features: F_p • Why POIs – Indicate the land use and the function of the region – Indicate the traffic patterns in the region • Features – Number of POIs in each category – Portion of vacant places – Changes in the number of POIs • Example categories: factories; shopping malls; hotels and real estate; parks; decoration and furniture markets
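
Two of the POI features above, the per-category counts and the change in counts between two snapshots, can be sketched as follows. The function names `poi_counts` and `poi_change` and the category labels are hypothetical; the vacant-place feature is omitted because the slide does not define how vacancy is measured.

```python
from collections import Counter

def poi_counts(categories):
    """Count POIs per category for one grid's affecting region."""
    return Counter(categories)

def poi_change(before, after):
    """Per-category change in POI count between two snapshots."""
    return {c: after.get(c, 0) - before.get(c, 0)
            for c in set(before) | set(after)}

old = poi_counts(["factory", "park"])
new = poi_counts(["factory", "factory", "mall"])
print(poi_change(old, new))
```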

  24. Road Network Features: F_r • Why road networks – Have a strong correlation with traffic flows – A good complement to traffic modeling • Features – Total length of highways f_h – Total length of other (low-level) road segments f_r – The number of intersections f_s in the grid's affecting region

  25. Semi-Supervised Learning Model • Philosophy of the model – States of air quality • Temporal dependency in a location • Geo-correlation between locations – Generation of air pollutants • Emission from a location • Propagation among locations – Two sets of features • Spatially-related • Temporally-related [Figure: locations s1 to s4 and l in geospace at times t1, t2, ..., ti; legend: a location with AQI labels, a location to be inferred, temporal dependency, spatial correlation] [Figure: co-training; the spatial classifier uses the spatial features (road networks F_r, POIs F_p); the temporal classifier uses the temporal features (traffic F_t, meteorologic F_m, human mobility F_h)]
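
The co-training idea on this slide, two classifiers over two feature views that label data for a shared pool, can be illustrated with the toy sketch below. It is not the model from the slides: a nearest-centroid rule stands in for both the spatial and the temporal classifier, and the names `fit_centroids`, `predict`, `co_train`, and the parameters `rounds` and `per_round` are assumptions made for the example.

```python
def fit_centroids(X, y):
    """Nearest-centroid 'training': average the feature vectors of each label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(centroids, x):
    """Return (label, margin); a larger margin means a more confident prediction."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(x, c)), label)
                   for label, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

def co_train(spatial, temporal, labels, rounds=3, per_round=1):
    """Alternate between the two views; each adds its most confident labels to the pool.

    labels: list of known labels, with None for samples still to be inferred.
    """
    labels = list(labels)  # work on a copy
    for _ in range(rounds):
        for view in (spatial, temporal):
            known = [i for i, lab in enumerate(labels) if lab is not None]
            unknown = [i for i, lab in enumerate(labels) if lab is None]
            if not unknown:
                return labels
            model = fit_centroids([view[i] for i in known],
                                  [labels[i] for i in known])
            ranked = sorted(unknown, key=lambda i: predict(model, view[i])[1],
                            reverse=True)
            for i in ranked[:per_round]:
                labels[i] = predict(model, view[i])[0]
    return labels

spatial = [[0.0], [1.0], [0.1], [0.9]]
temporal = [[0.0], [1.0], [0.2], [0.8]]
print(co_train(spatial, temporal, ["good", "bad", None, None]))
```

The point of the pattern is that a label inferred confidently from one view becomes training data for the classifier on the other view, which helps when labeled stations are scarce.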

  26. Co-Training-Based Learning Model • Spatial classifier – Model the spatial correlation between the AQI of different locations – Using spatially-related features – Based on a neural network • Input generation – Select n stations to pair with – Perform m rounds [Figure: input generation feeding an ANN; labels: F_p, F_r, ΔP, ΔR, D_1, D_2, d, c, l, weights w and w', biases b, b', b'']

  27. Artificial Neural Networks (ANN): a black box with inputs X1, X2, X3 and output Y. Output Y is 1 if at least two of the three inputs are equal to 1.

      X1  X2  X3 | Y
      1   0   0  | 0
      1   0   1  | 1
      1   1   0  | 1
      1   1   1  | 1
      0   0   1  | 0
      0   1   0  | 0
      0   1   1  | 1
      0   0   0  | 0

  28. Artificial Neural Networks (ANN): input nodes X1, X2, X3 each link to the output node Y with weight 0.3, and the output node has threshold t = 0.4 (truth table as on slide 27). $Y = I(0.3X_1 + 0.3X_2 + 0.3X_3 - 0.4 > 0)$, where $I(z) = 1$ if $z$ is true and $0$ otherwise.
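
The claim that weights of 0.3 and a threshold of 0.4 reproduce the "at least two inputs are 1" rule can be checked by enumerating all eight input combinations. The sketch below does that; the function name `perceptron` is an illustrative choice.

```python
from itertools import product

def perceptron(x1, x2, x3, w=0.3, t=0.4):
    """Y = I(w*x1 + w*x2 + w*x3 - t > 0), with the slide's weights and threshold."""
    return 1 if w * x1 + w * x2 + w * x3 - t > 0 else 0

# Compare the perceptron output with the rule for every input combination.
for x in product([0, 1], repeat=3):
    rule = 1 if sum(x) >= 2 else 0
    print(x, perceptron(*x), rule)
```

One active input gives 0.3 − 0.4 < 0, while two give 0.6 − 0.4 > 0, so the threshold separates exactly the cases the rule distinguishes.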

  29. Artificial Neural Networks (ANN) • The model is an assembly of inter-connected nodes and weighted links • The output node sums up its input values according to the weights of its links • The sum is compared against some threshold t • Perceptron model: $Y = I\left(\sum_i w_i X_i - t > 0\right)$ or $Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)$ [Figure: input nodes X1, X2, X3 linked with weights w1, w2, w3 to output node Y with threshold t]

  30. General Structure of ANN • Input layer (x1 to x5), hidden layer, output layer (y) • Neuron i: inputs I1, I2, I3 with weights w_i1, w_i2, w_i3 are combined into S_i; the activation function g(S_i), with threshold t, produces the output O_i • Training an ANN means learning the weights of the neurons
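
The neuron and layer structure on this slide can be sketched as a forward pass. This is an illustration only: the slide does not name the activation function, so a sigmoid is assumed for g, the threshold is assumed to be subtracted from the weighted sum before g is applied, and the names `neuron`, `layer`, and `forward` are hypothetical. Training, that is, learning the weights, is not shown.

```python
import math

def sigmoid(s):
    """ASSUMPTION: logistic activation; the slide only says 'activation function g'."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron(inputs, weights, t, g=sigmoid):
    """O_i = g(S_i), with S_i taken as the weighted input sum minus the threshold t."""
    s = sum(w * x for w, x in zip(weights, inputs)) - t
    return g(s)

def layer(inputs, weight_rows, thresholds, g=sigmoid):
    """Apply one neuron per row of weights to the same inputs."""
    return [neuron(inputs, w, t, g) for w, t in zip(weight_rows, thresholds)]

def forward(x, hidden_w, hidden_t, out_w, out_t):
    """Input layer -> one hidden layer -> single output y."""
    hidden = layer(x, hidden_w, hidden_t)
    return neuron(hidden, out_w, out_t)

# Hypothetical weights for a 3-input network with 2 hidden neurons.
y = forward([1, 0, 1],
            hidden_w=[[0.3, 0.3, 0.3], [-0.2, 0.5, 0.1]], hidden_t=[0.4, 0.0],
            out_w=[1.0, -1.0], out_t=0.0)
print(y)
```

Passing a step function as `g` turns `neuron` into the threshold perceptron, which shows that the perceptron is the single-neuron special case of this structure.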
