Data science and engineering for local weather forecasts
Nikhil R Podduturi, Data {Scientist, Engineer}
November 2016
Agenda
- About MeteoGroup
- Introduction to weather data
- Problem description
- Data science and weather forecasting
- Engineering
- Verification
- Results
- Questions
How many of you check weather forecasts frequently?
Weather data
1.5 TB/day
Types of data
Observations:
- WMO weather stations (e.g. surface, upper-air, ships, drifting buoys, aircraft)
- MeteoGroup measurement network
Satellite data
Radar data
User data
Numerical weather prediction (NWP) model data
Numerical weather prediction models
- Complex and multidimensional data
- 5 NWP models from different providers
- Data size per day: 0.5 TB
Data science and weather forecasting
Outcome
- Took 24 hours to compute a 24-hour forecast
- Grid interval: 736 km
- Poor results
MeteoGroup Forecasting system
Diagram components: 3 years of NWP data, 3 years of observation data, machine learning model, trained model, daily NWP data, forecasts
- Written in Pascal
- Runs on an in-house high-performance computing cluster
Limitations
- Hard to maintain
- Not very transparent
- Scalability
Problem description
Next generation forecasting system
- Cloud based solution
- Transparent
- Scalable
- Improve forecasting accuracy
Baseline model
Diagram components: NWP data, downscale to location, linear model, interpolate missing values
Outcome:
- Very fast
- Poor accuracy
- Multicollinearity
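The baseline steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the system from the talk: the data values, the single-predictor setup, and the helper names `interpolate_missing` and `fit_line` are all assumptions, and the downscaling step is left out.

```python
from typing import List, Optional, Tuple


def interpolate_missing(values: List[Optional[float]]) -> List[float]:
    """Fill None gaps by linear interpolation; gaps at the edges copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one observed value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: NWP temperature at one location (with one gap) and observed temperature.
nwp_temp = [10.0, None, 14.0, 16.0, 18.0]
observed = [11.0, 13.0, 15.0, 17.0, 19.0]

nwp_filled = interpolate_missing(nwp_temp)
slope, intercept = fit_line(nwp_filled, observed)
forecast = [slope * t + intercept for t in nwp_filled]
```

With several correlated NWP predictors instead of one, the fitted coefficients become unstable, which is the multicollinearity issue listed in the outcome.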
Iteration 1
- Address multicollinearity using feature selection
- Scale the features
Diagram components: NWP data, downscale to location, linear model, interpolate missing values, feature selection, scale features
Outcome:
- Improved accuracy
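One possible reading of the two Iteration 1 steps is sketched below: standardize each feature, and drop any feature that is almost perfectly correlated with one already kept. The talk does not say which selection method was used, so the greedy correlation filter, the 0.95 threshold, and the feature names are assumptions.

```python
from math import sqrt
from typing import Dict, List


def standardize(column: List[float]) -> List[float]:
    """Scale a feature to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


def correlation(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equally long columns."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)


def drop_collinear(features: Dict[str, List[float]], threshold: float = 0.95) -> List[str]:
    """Greedily keep a feature only if it is not highly correlated with one already kept."""
    kept: List[str] = []
    for name, column in features.items():
        if all(abs(correlation(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical NWP predictors; temp_f is temp_c in other units, so the two are collinear.
features = {
    "temp_c": [10.0, 12.0, 15.0, 11.0, 9.0],
    "temp_f": [50.0, 53.6, 59.0, 51.8, 48.2],
    "wind_ms": [3.0, 2.0, 6.0, 5.0, 1.0],
}
selected = drop_collinear(features)
scaled = {name: standardize(features[name]) for name in selected}
```

Here `temp_f` is dropped because it carries the same information as `temp_c`, while `wind_ms` is kept.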
Iteration 2
- Model selection between linear and non-linear models
- Advanced feature selection
Diagram components: NWP data, downscale to location, model selection (linear and non-linear models), interpolate missing values, advanced feature selection, scale features
Outcome:
- On par with existing forecasting system
- Slow training
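Model selection between a linear and a non-linear candidate might look like the sketch below, which picks the candidate with the lowest mean absolute error on held-out data. The talk does not name the non-linear models, so the k-nearest-neighbour regressor is only a stand-in, and the data and function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[float], float]


def fit_linear(x: List[float], y: List[float]) -> Model:
    """Least-squares line through the training points."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return lambda q: mean_y + slope * (q - mean_x)


def fit_knn(x: List[float], y: List[float], k: int = 2) -> Model:
    """Non-linear stand-in: average the targets of the k nearest training points."""
    pairs = list(zip(x, y))

    def predict(q: float) -> float:
        nearest = sorted(pairs, key=lambda p: abs(p[0] - q))[:k]
        return sum(target for _, target in nearest) / k

    return predict


def mean_absolute_error(model: Model, x: List[float], y: List[float]) -> float:
    return sum(abs(model(a) - b) for a, b in zip(x, y)) / len(x)


def select_model(candidates: Dict[str, Callable[[List[float], List[float]], Model]],
                 x_train: List[float], y_train: List[float],
                 x_val: List[float], y_val: List[float]) -> Tuple[str, Dict[str, float]]:
    """Fit every candidate on the training split and score it on the validation split."""
    scores = {name: mean_absolute_error(fit(x_train, y_train), x_val, y_val)
              for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical data with a clearly non-linear relationship (y = x squared).
x_train, y_train = [0.0, 2.0, 4.0, 6.0, 8.0], [0.0, 4.0, 16.0, 36.0, 64.0]
x_val, y_val = [1.0, 3.0, 5.0, 7.0], [1.0, 9.0, 25.0, 49.0]

best, scores = select_model({"linear": fit_linear, "knn": fit_knn},
                            x_train, y_train, x_val, y_val)
```

Fitting several candidates for every location and attribute multiplies the training cost, which is consistent with the slow training noted in the outcome.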
Engineering to scale the product
Baseline model engineering
(scikit-learn, NumPy, Keras with TensorFlow)
Model engineering
(scikit-learn, NumPy, Keras with TensorFlow)
Good:
- Python ML ecosystem
- Familiarity among the team
- Test-driven and agile development
- Fail fast
Bad:
- Not scalable
Model runs: 47,000 locations × 15 weather attributes (e.g. temperature, wind) × 360 hours
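To make the scale concrete, the product can be worked out directly. The mapping of each number to its label (locations, weather attributes, hours) is assumed from the order in which they appear on the slide.

```python
# Figures from the slide; which number belongs to which label is assumed from slide order.
locations = 47_000
weather_attributes = 15
hours = 360

model_runs = locations * weather_attributes * hours
print(f"{model_runs:,} model runs")  # 253,800,000 model runs
```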
Scaling with Apache Airflow
Apache Airflow
- By Airbnb
- Apache Incubator project since early 2016
Directed Acyclic Graph (DAG)
Components
- UI
- Scheduler
- Executor(s)
Apache Airflow DAG
- Hooks (connections)
- Operators (tasks)
- Schedule
- Dependencies
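The dependency part of a DAG can be illustrated without Airflow itself. The sketch below uses Python's standard-library `graphlib` rather than Airflow's API; the task names are hypothetical and only loosely follow the pipeline stages from the earlier slides. In Airflow, each node would be an operator, and the schedule and connections (hooks) would be configured on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names. Each key maps to the set of tasks that must finish first.
dependencies = {
    "downscale_to_location": {"fetch_nwp_data"},
    "interpolate_missing": {"downscale_to_location"},
    "scale_features": {"interpolate_missing"},
    "train_temperature_model": {"scale_features"},
    "train_wind_model": {"scale_features"},
    "persist_models": {"train_temperature_model", "train_wind_model"},
}

# A valid execution order: every task appears after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
for task in order:
    print(task)
```

The two training tasks do not depend on each other, so a scheduler is free to run them in parallel on different workers.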
Airflow and Mesos
Diagram components: continuous integration, Airflow scheduler, Mesos cluster, AWS S3; arrows labeled "deploy" and "persist"
Verification
Model improvement cycle
Diagram: deploy DAG, verify model, improve DAG
Forecast verification
Diagram components: AWS S3 with models, forecast engine, JSON-LD
Verification metrics
- Mean absolute error
- Root mean squared error
- Mean error
- Heidke skill score
- Equitable threat score
- Probability density functions
- Error percentiles
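Several of the metrics above have short standard definitions, sketched here in plain Python. The sample forecasts and contingency counts are hypothetical. The two skill scores use the usual 2x2 contingency table for a yes/no event (hits, false alarms, misses, correct negatives), and how the talk's verification service computes them is not stated.

```python
from math import sqrt
from typing import Sequence


def mean_error(forecast: Sequence[float], observed: Sequence[float]) -> float:
    """Bias: positive means the forecast is too high on average."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)


def mean_absolute_error(forecast: Sequence[float], observed: Sequence[float]) -> float:
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(forecast)


def root_mean_squared_error(forecast: Sequence[float], observed: Sequence[float]) -> float:
    return sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast))


def heidke_skill_score(hits: int, false_alarms: int, misses: int, correct_negatives: int) -> float:
    """Accuracy relative to random chance; 1 is perfect, 0 is no skill."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))


def equitable_threat_score(hits: int, false_alarms: int, misses: int, correct_negatives: int) -> float:
    """Threat score corrected for the hits expected by chance."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    hits_random = (a + b) * (a + c) / (a + b + c + d)
    return (a - hits_random) / (a + b + c - hits_random)


# Hypothetical temperature forecasts and observations.
forecast = [2.0, 4.0, 6.0, 8.0]
observed = [1.0, 5.0, 6.0, 10.0]
me = mean_error(forecast, observed)
mae = mean_absolute_error(forecast, observed)
rmse = root_mean_squared_error(forecast, observed)

# Hypothetical contingency counts for a yes/no event such as "rain in the next hour".
hss = heidke_skill_score(30, 10, 20, 40)
ets = equitable_threat_score(30, 10, 20, 40)
```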
Figure: Mean absolute error for different models (temperature)
Figure: Probability distribution function for multiple models (temperature)
Figure: Percentile graphs for each model (temperature)
For a demo, please stop by the MeteoGroup booth
Results
Cloud based solution
- AWS S3, EC2, ElastiCache
Transparent
- Verification microservice
Scalable
- Mesos cluster
- Training time reduced from a month to approximately 5 hours
Improve forecasting accuracy
- On par or better
Improvements
- Hyperlocal
- AWS Lambda integration
- Iterate for more accuracy
Questions?
We are hiring!