DS504/CS586: Big Data Analytics - Introduction & Logistics



slide-1
SLIDE 1

Welcome to DS504/CS586: Big Data Analytics

  • Introduction & Logistics
  • Prof. Yanhua Li

Time: 6:00pm–8:50pm, Thursday
Location: KH 116
Fall 2017

slide-2
SLIDE 2

Who am I?

Yanhua Li, PhD
Assistant Professor, Computer Science & Data Science
PhD, Computer Science, U of Minnesota, 2013
PhD, Electrical Engineering, BUPT, 2009
Research Interests: Big data analytics, Smart Cities, Measurement, Spatio-temporal Data Mining
Industrial Experience: Bell-Labs, Microsoft Research, HUAWEI research Labs

slide-3
SLIDE 3


What is DS504/CS586 about?

  • A second-level DS/CS course (primarily) for graduate students:
  • CS/DS Ph.D. students in big data analytics and related areas;
  • then other Ph.D. or MS students with experience in databases and/or data mining, or equivalent knowledge.
  • Sufficient programming experience is expected, so that you are comfortable undertaking a course project.

slide-4
SLIDE 4

Introduction

What is “Big Data”?


slide-5
SLIDE 5

Big Data – What is it?

  • A “big” buzzword …
  • No single standard definition…
  • Talk to 1000 people, there will be 1000 “definitions” …

“Big Data” is data whose scale, diversity, complexity, and/or quality require new architectures, techniques, algorithms, analytics, and interfaces to manage it and extract value and hidden knowledge from it…

slide-6
SLIDE 6

Why Now?

Big Data and Big Challenges

slide-7
SLIDE 7

Big Data

  • Volume
  • Variety
  • Velocity
  • Veracity
slide-8
SLIDE 8

Big Data

  • Volume
  • Variety
  • Velocity

Thanks: http://www-01.ibm.com/software/data/bigdata/images/4-Vs-of-big-data.jpg

slide-9
SLIDE 9

Big Data

  • Volume
  • Variety
  • Velocity

Thanks: http://www-01.ibm.com/software/data/bigdata/images/4-Vs-of-big-data.jpg

slide-10
SLIDE 10

Big Data

  • Volume
  • Variety
  • Velocity

Thanks: http://www-01.ibm.com/software/data/bigdata/images/4-Vs-of-big-data.jpg

slide-11
SLIDE 11

Big Data

  • Volume
  • Variety
  • Velocity

Thanks: http://www-01.ibm.com/software/data/bigdata/images/4-Vs-of-big-data.jpg

slide-12
SLIDE 12

4Vs


slide-13
SLIDE 13

The Model Has Changed…

Old Model of Generating/Consuming Data has Changed

Old Model: A few privileged companies are generating and “owning” data; all others are consuming data (in controlled packages)

slide-14
SLIDE 14

The Model Has Changed…

  • New Model of Generating/Consuming Data

Producers:

  • Everyone (man, woman, and child) and devices

Consumers:

  • Professionals
  • Businesses
  • Scientists
  • And us
  • Everyone wants a piece of this pie …

slide-15
SLIDE 15

What Sectors Can Benefit?

  • Businesses
  • Transportation
  • Science & Engineering
  • Governments
  • Energy
  • Healthcare
  • Education
  • Entertainment

Use data to improve people’s quality of life

slide-16
SLIDE 16

Big Data Analytics: techniques and tools for managing, analyzing, and extracting knowledge from “big data”

slide-17
SLIDE 17

Roadmap

  • 1. Intro of Big Data Analytics
  •    5-minute break
  • 2. Logistics
  •    10-minute break, talk to other students
  • 3. Application stories

Self-intro (and group forming)
Hand in your survey
Email you for permission or not
You will need to find your team and let me know

slide-18
SLIDE 18

Begin with application stories
Done with the high-level introduction

slide-19
SLIDE 19

Big Challenges in Big Cities

slide-20
SLIDE 20

Big Data in Cities

slide-21
SLIDE 21

Urban Computing: Win-Win-Win among People, Cities, and The Environment (Cities OS)

Tackle the Big challenges in Big cities using Big data!

Urban Sensing & Data Acquisition

Participatory Sensing, Crowd Sensing, Mobile Sensing
Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy

Urban Data Management

Spatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data Analytics

Data Mining, Machine Learning, Visualization

Service Providing

Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Zheng, Y., et al. Urban Computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology.

slide-22
SLIDE 22

Urban Computing: Win-Win-Win among People, Cities, and The Environment (Cities OS)

Urban Sensing & Data Acquisition

Participatory Sensing, Crowd Sensing, Mobile Sensing
Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy

Urban Data Management

Spatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data Analytics

Data Mining, Machine Learning, Visualization

Service Providing

Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Zheng, Y., et al. Urban Computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology.

  • Data sparsity and missing data
  • Skewed sample distribution
  • Limited resources
slide-23
SLIDE 23

Urban Sensing

  • Biased distribution
  • Data sparsity and missing data

A sample of data → An entire dataset

Taxi flow → Entire traffic flow

[Figure: air quality monitoring stations S1 to S22]

Inferring Gas Consumption and Pollution Emission of Vehicles throughout a City. KDD 2014.
Zheng, Y., et al. U-Air: when urban air quality inference meets big data. KDD 2013.

slide-24
SLIDE 24

Urban Sensing

  • Static sensing: Where to deploy sensors to maximize the gain?
  • Crowdsensing: How to arrange the incentives dynamically?

A limited resource (budget, labor, land…)

[Figure: monitoring stations S1 to S22]

Suggesting locations for monitoring stations, KDD 2015
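
The static sensing question on this slide (where to deploy a limited number of sensors to maximize the gain) can be illustrated as a budgeted coverage problem. The following is a minimal sketch under stated assumptions: the "gain" is taken to be the number of grid cells within a fixed radius of at least one station, and a generic greedy heuristic is used. It is not the method of the cited KDD 2015 paper, and the names `greedy_placement`, `covered`, `candidates`, `cells`, and `radius` are hypothetical.

```python
from math import hypot


def covered(site, cells, radius):
    """Return the set of cells within `radius` of a candidate site."""
    return {c for c in cells if hypot(c[0] - site[0], c[1] - site[1]) <= radius}


def greedy_placement(candidates, cells, radius, budget):
    """Pick up to `budget` sites; each round adds the site that covers
    the most not-yet-covered cells (greedy maximum coverage)."""
    chosen, covered_cells = [], set()
    remaining = list(candidates)
    for _ in range(budget):
        best = max(
            remaining,
            key=lambda s: len(covered(s, cells, radius) - covered_cells),
            default=None,
        )
        if best is None or not (covered(best, cells, radius) - covered_cells):
            break  # no remaining site adds any coverage
        chosen.append(best)
        covered_cells |= covered(best, cells, radius)
        remaining.remove(best)
    return chosen, covered_cells


# Hypothetical 5x5 grid of cells and three candidate sites (assumed data).
cells = [(x, y) for x in range(5) for y in range(5)]
candidates = [(0, 0), (2, 2), (4, 4)]
sites, reach = greedy_placement(candidates, cells, radius=1.5, budget=2)
```

With a budget of two, the sketch first picks the central site because it covers the most cells, then the corner site that adds the most new coverage. A real deployment would likely replace the cell count with a data-driven gain, such as the reduction in inference error.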

slide-25
SLIDE 25

Improving Medical Emergency Services using Big Data

Location Selection for Ambulance Stations: A Data-Driven Approach, ACM SIGSPATIAL 2015

[Figure: Dispatching Center, Ambulance stations, Patients, Hospital]

  • Select locations for Ambulance Stations
  • Dynamic ambulance allocation

Save 30+% time!

Yilun Wang, Yu Zheng, et al. Travel Time Estimation of a Path using Sparse Trajectories. KDD 2014

slide-26
SLIDE 26

Urban Computing: Win-Win-Win among People, Cities, and The Environment (Cities OS)

Urban Sensing & Data Acquisition

Participatory Sensing, Crowd Sensing, Mobile Sensing
Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy

Urban Data Management

Spatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data Analytics

Data Mining, Machine Learning, Visualization

Service Providing

Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Zheng, Y., et al. Urban Computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology.

  • Management in spatio-temporal spaces
  • Multi-modality data
  • Dynamic, high velocity and volume
slide-27
SLIDE 27

Urban Data Management

  • Managing multi-modality data

    – Categorical and numeric data
    – Different scales, densities, updating frequency, and ST properties

  • Dynamic and big volume

    – Group query strategy
    – Computing in parallel

Spatio-temporal Static Data; Spatial Static, Temporal Dynamic Data; Spatio-Temporal Dynamic Data

Road/Transportation Networks, POI Distributions, Trajectory Data, Spatial-temporal Crowd Sourcing Data, Weather/AQI Station Data, Road Traffic Data

Point-Based, Network-Based

Yu Zheng. Trajectory Data Mining: An Overview. ACM Transactions on Intelligent Systems and Technology (ACM TIST). 2015
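
To make the idea of a spatio-temporal index concrete, here is a minimal sketch of a uniform grid index that buckets point records by cell and time slot, so that a range query scans only the overlapping buckets. This is an illustrative toy, not a structure described in the slides; the class name `GridIndex`, its parameters, and the sample records are assumptions.

```python
from collections import defaultdict


class GridIndex:
    """Toy spatio-temporal index: buckets points by (cell_x, cell_y, time_slot)."""

    def __init__(self, cell_size, slot_seconds):
        self.cell_size = cell_size
        self.slot_seconds = slot_seconds
        self.buckets = defaultdict(list)

    def _key(self, x, y, t):
        return (
            int(x // self.cell_size),
            int(y // self.cell_size),
            int(t // self.slot_seconds),
        )

    def insert(self, x, y, t, record):
        self.buckets[self._key(x, y, t)].append((x, y, t, record))

    def query(self, x_min, x_max, y_min, y_max, t_min, t_max):
        """Return records inside the box; only overlapping buckets are scanned."""
        kx0, ky0, kt0 = self._key(x_min, y_min, t_min)
        kx1, ky1, kt1 = self._key(x_max, y_max, t_max)
        hits = []
        for kx in range(kx0, kx1 + 1):
            for ky in range(ky0, ky1 + 1):
                for kt in range(kt0, kt1 + 1):
                    for x, y, t, rec in self.buckets.get((kx, ky, kt), []):
                        inside_space = x_min <= x <= x_max and y_min <= y <= y_max
                        if inside_space and t_min <= t <= t_max:
                            hits.append(rec)
        return hits


# Hypothetical taxi GPS points: (x, y, seconds since start, id).
index = GridIndex(cell_size=1.0, slot_seconds=3600)
index.insert(0.5, 0.5, 100, "taxi-A")
index.insert(0.7, 0.2, 4000, "taxi-B")
index.insert(5.0, 5.0, 100, "taxi-C")
nearby_first_hour = index.query(0, 1, 0, 1, 0, 3599)
```

Because buckets are independent, the per-bucket scans could also be distributed across workers, which is one simple reading of "computing in parallel" on this slide.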

slide-28
SLIDE 28

Urban Computing: Win-Win-Win among People, Cities, and The Environment (Cities OS)

Urban Sensing & Data Acquisition

Participatory Sensing, Crowd Sensing, Mobile Sensing
Traffic, Road Networks, POIs, Air Quality, Human Mobility, Meteorology, Social Media, Energy

Urban Data Management

Spatio-temporal index, streaming, trajectory, and graph data management,...

Urban Data Analytics

Data Mining, Machine Learning, Visualization

Service Providing

Improve urban planning, Ease Traffic Congestion, Save Energy, Reduce Air Pollution, ...

Zheng, Y., et al. Urban Computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology.

  • Texts and images → spatial and spatio-temporal data
  • A single data source → data across different domains
  • Separate data mining algorithms → machine learning + data management

slide-29
SLIDE 29

Data Integration vs Knowledge Fusion

[Figure A) Paradigm of the conventional data fusion: Dataset A, Dataset B, Dataset C → Schema Mapping (one per dataset) → Duplicate Detection → Data Merge; labels: Domain S, Object]

[Figure, second panel: Dataset A, Dataset B, Dataset C (Domain A, Domain B, Domain C) → Knowledge Extraction (one per dataset) → Knowledge → Knowledge Fusion; label: Latent Object]

Yu Zheng. Methodologies for Cross-Domain Data Fusion: An Overview. IEEE Transactions on Big Data, 1, 1, 2015.

Cross-Domain Data Fusion
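
The conventional data fusion paradigm on this slide (schema mapping, then duplicate detection, then data merge) can be sketched in a few lines. This is a toy illustration under stated assumptions: records are plain dictionaries, duplicates are detected by a case-insensitive match on one key field, and the function names, field names, and sample records are all hypothetical.

```python
def map_schema(record, mapping):
    """Rename source-specific fields to a shared schema."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def fuse(datasets, mappings, key):
    """Schema mapping -> duplicate detection on `key` -> data merge."""
    merged = {}
    for records, mapping in zip(datasets, mappings):
        for record in records:
            row = map_schema(record, mapping)
            ident = row[key].strip().lower()  # naive duplicate detection
            # Merge: keep earlier values, fill missing fields from later sources.
            target = merged.setdefault(ident, {})
            for field, value in row.items():
                if target.get(field) is None:
                    target[field] = value
    return list(merged.values())


# Two hypothetical POI datasets that describe overlapping objects.
dataset_a = [{"name": "Central Park", "lat": 40.78}]
dataset_b = [
    {"title": "central park ", "category": "park"},
    {"title": "City Hall", "category": "government"},
]
mappings = [
    {"name": "name", "lat": "lat"},
    {"title": "name", "category": "category"},
]
fused = fuse([dataset_a, dataset_b], mappings, key="name")
```

This kind of merge works when all datasets describe the same objects in one domain. The knowledge fusion panel addresses the case where datasets from different domains describe a latent object and cannot be aligned by schema alone.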

slide-30
SLIDE 30

Multi-View-Based Learning

slide-31
SLIDE 31

Urban Computing for Urban Planning

Best Paper Nominee Award at UbiComp 2011
The Most Cited Paper

slide-32
SLIDE 32

City-Wide Traffic Modeling

Partition a city into regions with major roads
Regions are root causes of the problem

Yu Zheng, et al. Urban Computing with Taxicabs. In Proc. of UbiComp 2011

slide-33
SLIDE 33
slide-34
SLIDE 34

Shanghai Big Data Hotpot Restaurant

slide-35
SLIDE 35

KDD 2013

http://urbanair.msra.cn/

When Urban Air Meets Big Data

slide-36
SLIDE 36

Air Pollution: A Global Concern!

Air quality monitoring station: PM2.5, PM10, NO2, SO2, CO, O3

[Figure: map of monitoring stations S1 to S22, 50 km × 40 km]

slide-37
SLIDE 37

We do not really know the air quality of a location without a monitoring station!
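
As a point of reference for this problem, a naive baseline is to interpolate from nearby stations only. The sketch below uses inverse distance weighting (IDW), a standard spatial interpolation technique. It is not the U-Air method, which draws on additional data sources, and the function name `idw_estimate`, the `power` parameter default, and the sample readings are assumptions.

```python
from math import hypot


def idw_estimate(target, stations, power=2.0):
    """Estimate a value at `target` from (x, y, value) station readings,
    weighting each station by 1 / distance**power."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        d = hypot(x - target[0], y - target[1])
        if d == 0:
            return float(value)  # target coincides with a station
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    if weight_total == 0:
        raise ValueError("at least one station reading is required")
    return weighted_sum / weight_total


# Hypothetical readings: (x_km, y_km, AQI).
stations = [(0.0, 0.0, 50.0), (10.0, 0.0, 150.0)]
midpoint_aqi = idw_estimate((5.0, 0.0), stations)
```

A baseline like this depends only on distance, so it cannot reflect local factors such as traffic or land use between stations.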

slide-38
SLIDE 38

Inferring Real-Time and Fine-Grained air quality throughout a city using Big Data

Meteorology, Traffic, POIs, Road networks, Human Mobility, Historical air quality data, Real-time air quality reports

Zheng, Y., et al. U-Air: when urban air quality inference meets big data. KDD 2013

[Figure: monitoring stations S1 to S22]

slide-39
SLIDE 39

Zheng, Y., et al. U-Air: When Urban Air Quality Inference Meets Big Data. KDD 2013

Urban Air System

http://urbanair.msra.cn

slide-40
SLIDE 40

Multi-View Learning Framework

[Figure: framework diagram.
Components: Temporal Predictor, Spatial Predictor, Inflection Predictor, Prediction Aggregator.
Inputs: Local Data, Spatial Neighbor Data, Recent Meteorology, Weather Forecast, Recent AQI, Shape features, Selected factors.
Outputs: ∆AQI, Threshold, Final AQI.
Labels: Spatial View, Temporal View, Data-Driven Kernel Learning]

  • Features: Non-overlapping features providing different views
  • Models: Model extrapolation and trend regression respectively
  • Training: Combination of small models vs. a big model
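
The idea of combining small per-view models can be sketched as follows. This is a simplified illustration, not the model in the slides: it assumes each view outputs a predicted change in AQI (∆AQI), that the aggregator is a weighted average with fixed weights, and that the final AQI is the recent AQI plus the aggregated change. The slide does not specify how the aggregator is built, and all function and parameter names here are hypothetical.

```python
def aggregate_delta(temporal_delta, spatial_delta, w_temporal=0.5, w_spatial=0.5):
    """Combine the two views' predicted AQI changes with a weighted average."""
    total = w_temporal + w_spatial
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_temporal * temporal_delta + w_spatial * spatial_delta) / total


def final_aqi(recent_aqi, temporal_delta, spatial_delta, **weights):
    """Assumed rule: final AQI = recent AQI + aggregated change, floored at 0."""
    change = aggregate_delta(temporal_delta, spatial_delta, **weights)
    return max(0.0, recent_aqi + change)


# Hypothetical outputs of a temporal-view model and a spatial-view model.
forecast = final_aqi(recent_aqi=80.0, temporal_delta=10.0, spatial_delta=20.0)
```

In practice the weights would more plausibly be learned from data than fixed by hand.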
slide-41
SLIDE 41

Meteorological data (3,514 stations)

District-level (or even finer) granularity; hourly updates

Weather forecasts (2,612 districts)

The next three-day forecast (3-hour or 6-hour segments); updated every 3h, 6h, or 12h

384 Cities

slide-42
SLIDE 42

Air Quality Data

http://urbcomp02/#

NO2, SO2, O3, CO, PM2.5 and PM10
About 2,000 stations in 330 Chinese cities, hourly updates

slide-43
SLIDE 43

Revisit Big Data

  • NOT a single data source which is very big
  • Does NOT mean full data
  • Does NOT mean very dense data
  • May need less domain knowledge
  • Tools are ready

  • Data across different domains
  • Sample of (label) data
  • Data sparsity always exists
  • More understanding of data itself and data science
  • Many unsolved problems

Big Data needs comprehensive capabilities to deliver end-to-end services!

Big Data ≠ Machine learning ≠ Deep Learning
Big Data ≠ Mining Single Dataset ≠ Simple Statistics
Big Data ≠ Cloud Computing ≠ Hadoop

slide-44
SLIDE 44

Problems

[Figure: model diagrams, including the equations X = R×U and Y = T×RT over Regions, Categories, Features, and Time slots; features Fm, Ft, Fh at times t-1, t, t+1; and an ANN]

Data, Models and Algorithms, Data Scientist

slide-45
SLIDE 45

Take Away Messages

  • 3B: Big city, Big challenges, Big data
  • 3M: Data Management, Mining and Machine learning
  • 3W: Win-Win-Win: people, city, and the environment

3·BMW

Zheng, Y., et al. Urban Computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology.

Yu Zheng. Methodologies for Cross-Domain Data Fusion: An Overview. IEEE Transactions on Big Data, 1, 1, 2015.

Yu Zheng. Trajectory Data Mining: An Overview. ACM Transactions on Intelligent Systems and Technology. 2015

slide-46
SLIDE 46

What data is available in our course?

slide-47
SLIDE 47

Road Map in Shenzhen

20,656 Road Segments

slide-48
SLIDE 48

Subway Lines

5 subway lines and 118 Subway Stations

slide-49
SLIDE 49

Bus Routes

8,875 Buses serve 814 Bus Routes

slide-50
SLIDE 50

Bus Stop Distribution

5,359 Bus Stops

slide-51
SLIDE 51

22,803 Taxis

slide-52
SLIDE 52

Transportation Billing Data

slide-53
SLIDE 53

Urban Issues

  • Regional Weather-Traffic Sensitivity
  • Smart Taxi Drivers
  • Traffic Estimation & Prediction
  • Smart shuttle service

slide-54
SLIDE 54

Urban Issues (cont.)

  • Population Census
  • Low Sample Rate Map Matching
  • Logistic Planning

slide-55
SLIDE 55

Questions?