Data science and engineering for local weather forecasts

SLIDE 2

Data science and engineering for local weather forecasts

Nikhil R Podduturi
Data {Scientist, Engineer}
November 2016

SLIDE 3

Agenda

  • About MeteoGroup
  • Introduction to weather data
  • Problem description
  • Data science and weather forecasting
  • Engineering
  • Verification
  • Results
  • Questions

SLIDE 4

How many of you check weather forecasts frequently?

SLIDE 6

Weather data

SLIDE 7

Data volume: 1.5 TB/day

SLIDE 12

Types of data

Observations:

  • WMO weather stations (e.g. surface, upper-air, ships, drifting buoys, aircraft)
  • MeteoGroup measurement network

Satellite data

Radar data

User data

Numerical weather prediction model data

SLIDE 15

Numerical weather prediction models

  • Complex and multidimensional data
  • 5 NWP models from different providers
  • Data size per day: 0.5 TB

SLIDE 16

Data science and weather forecasting

SLIDE 18

Outcome

  • Computing a 24-hour forecast took 24 hours
  • Grid interval: 736 km
  • Poor results

SLIDE 19

MeteoGroup forecasting system

SLIDE 20

MeteoGroup forecasting system

Diagram:

  • Training: 3 years of NWP data and 3 years of observation data → machine learning model → trained model
  • Forecasting: daily NWP data → trained model → forecasts


MeteoGroup forecasting system

Written in Pascal

Runs on an in-house high-performance computing cluster

Limitations:

  • Hard to maintain
  • Not very transparent
  • Limited scalability

SLIDE 24

Problem description

SLIDE 28

Next generation forecasting system

  • Cloud-based solution
  • Transparent
  • Scalable
  • Improve forecasting accuracy

SLIDE 30

Baseline model

Pipeline: NWP data → Downscale to location → Interpolate missing values → Linear model

Outcome:

  • Very fast
  • Poor accuracy
  • Multicollinearity among features
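
The baseline pipeline can be illustrated with a minimal sketch. The slides do not say how each step was implemented, so everything here is an assumption: nearest-grid-point downscaling, linear interpolation of gaps, and a single-predictor least-squares fit. The function names and the toy data are hypothetical, and the sketch uses only the standard library rather than the scikit-learn and NumPy stack named in the talk.

```python
def nearest_grid_point(grid_points, lat, lon):
    """Index of the grid point closest to (lat, lon), by squared distance in degrees."""
    return min(
        range(len(grid_points)),
        key=lambda i: (grid_points[i][0] - lat) ** 2 + (grid_points[i][1] - lon) ** 2,
    )


def interpolate_missing(series):
    """Fill None gaps linearly between known neighbours; edge gaps copy the nearest value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def fit_linear(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy data (hypothetical): three grid points, one station, four time steps.
grid_points = [(52.0, 4.0), (52.0, 5.0), (53.0, 4.0)]
nwp_temperature = [
    [9.0, 9.5, 10.0, 10.5],
    [10.0, None, 14.0, 16.0],
    [8.0, 8.0, 8.5, 9.0],
]
observed_temperature = [11.0, 13.0, 15.0, 17.0]

idx = nearest_grid_point(grid_points, lat=52.1, lon=4.9)
nwp_at_station = interpolate_missing(nwp_temperature[idx])
intercept, slope = fit_linear(nwp_at_station, observed_temperature)
forecast = [intercept + slope * v for v in nwp_at_station]
```

A single predictor keeps the fit readable; with several correlated NWP predictors the multicollinearity listed in the outcome would appear.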
SLIDE 32

Iteration 1

  • Address multicollinearity using feature selection
  • Scale the features

Pipeline: NWP data → Downscale to location → Interpolate missing values → Feature selection → Scale features → Linear model

Outcome:

  • Improved accuracy
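
One way to picture the two steps of this iteration is sketched below. The talk does not state which feature-selection method or scaler was used, so the greedy correlation filter, the 0.95 threshold, and the feature names are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_collinear(features, threshold=0.95):
    """Greedy filter: keep a feature only if |r| with every kept feature is <= threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# Hypothetical features: the same temperature in two units is perfectly collinear.
features = {
    "t2m_kelvin": [280.0, 282.0, 284.0, 286.0, 288.0],
    "t2m_celsius": [6.85, 8.85, 10.85, 12.85, 14.85],
    "wind_speed": [5.0, 3.0, 6.0, 2.0, 4.0],
}

kept = drop_collinear(features)
scaled = {name: standardize(features[name]) for name in kept}
```

Here `kept` retains `t2m_kelvin` and `wind_speed` and drops the redundant Celsius column, after which each remaining feature is standardized.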
SLIDE 34

Iteration 2

  • Model selection between linear and non-linear models
  • Advanced feature selection

Pipeline: NWP data → Downscale to location → Interpolate missing values → Advanced feature selection → Scale features → Model selection (linear and non-linear models)

Outcome:

  • On par with existing forecasting system
  • Slow training
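
Model selection between a linear and a non-linear candidate can be sketched as a hold-out comparison. The slides do not name the candidate models or the selection criterion, so the k-nearest-neighbour regressor, the validation MAE criterion, and the toy data below are assumptions chosen only to show the mechanism.

```python
def fit_linear(x, y):
    """Least-squares line; returns a predict function."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value


def fit_knn(x, y, k=2):
    """k-nearest-neighbour regressor (non-linear); returns a predict function."""
    pairs = list(zip(x, y))

    def predict(value):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - value))[:k]
        return sum(target for _, target in nearest) / len(nearest)

    return predict


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def select_model(candidates, x_train, y_train, x_val, y_val):
    """Fit every candidate on the training split; pick the lowest validation MAE."""
    scores = {}
    for name, fit in candidates.items():
        predict = fit(x_train, y_train)
        scores[name] = mean_absolute_error(y_val, [predict(v) for v in x_val])
    return min(scores, key=scores.get), scores


# Toy data (hypothetical): a quadratic relationship that a straight line cannot capture.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [v ** 2 for v in x_train]
x_val = [1.5, 3.5, 5.5]
y_val = [v ** 2 for v in x_val]

best, scores = select_model({"linear": fit_linear, "knn": fit_knn},
                            x_train, y_train, x_val, y_val)
```

Repeating such a comparison for every location, attribute, and hour is one plausible reason for the slow training noted in the outcome.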
SLIDE 35

Engineering to scale the product

SLIDE 36

Baseline model engineering

(scikit-learn, NumPy, Keras with TensorFlow)

SLIDE 38

Model engineering

(scikit-learn, NumPy, Keras with TensorFlow)

Good:

  • Python ML ecosystem
  • Familiarity among the team
  • Test-driven and agile development
  • Fail fast

Bad:

  • Not scalable

SLIDE 39

47,000 × 15 × 360 model runs

  • 47,000 locations
  • 15 weather attributes (e.g. temperature, wind)
  • 360 hours
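
The size of the problem follows directly from the three factors. The short calculation below multiplies them out. The `wall_clock_hours` helper and the per-run cost passed to it are hypothetical, included only to show how the run count translates into time when runs are independent.

```python
LOCATIONS = 47_000
WEATHER_ATTRIBUTES = 15
FORECAST_HOURS = 360

model_runs = LOCATIONS * WEATHER_ATTRIBUTES * FORECAST_HOURS


def wall_clock_hours(runs, seconds_per_run, workers):
    """Idealised wall-clock time, assuming independent runs split evenly across workers."""
    return runs * seconds_per_run / workers / 3600


print(f"{model_runs:,} model runs")  # 253,800,000 model runs

# Hypothetical cost of 0.01 s per run, purely for illustration.
single_worker = wall_clock_hours(model_runs, seconds_per_run=0.01, workers=1)
hundred_workers = wall_clock_hours(model_runs, seconds_per_run=0.01, workers=100)
```

Because each run is independent of the others, the work divides cleanly across workers, which is what makes a cluster scheduler a natural fit.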

SLIDE 40

Scaling with Apache Airflow

Apache Airflow

  • Created at Airbnb
  • Apache Incubator project since early 2016
  • Workflows are defined as directed acyclic graphs (DAGs)

Components:

  • UI
  • Scheduler
  • Executor(s)

SLIDE 41

Apache Airflow DAG

  • Hooks (connections)
  • Operators (tasks)
  • Schedule
  • Dependencies
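
The role of dependencies in a DAG can be shown without Airflow itself. The sketch below is not Airflow's API: it uses Python's standard-library `graphlib` (Python 3.9+) to group tasks into batches that a scheduler could hand to executors in parallel. The task names are hypothetical and only loosely mirror the pipeline steps in the talk.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
dependencies = {
    "downscale_to_location": {"fetch_nwp_data"},
    "interpolate_missing": {"downscale_to_location"},
    "train_temperature": {"interpolate_missing"},
    "train_wind": {"interpolate_missing"},
    "persist_models": {"train_temperature", "train_wind"},
}


def execution_batches(deps):
    """Group tasks into batches; tasks within a batch have no pending dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(dependencies)
for step, tasks in enumerate(batches, start=1):
    print(step, tasks)
```

In this toy graph the two training tasks land in the same batch, so they could run concurrently. In Airflow terms, the tasks would be operators and the dictionary entries would be the declared dependencies.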
SLIDE 43

Airflow and Mesos

Diagram components: continuous integration, Airflow scheduler, Mesos cluster, AWS S3. Arrows are labeled "deploy" and "persist".

SLIDE 44

Verification

SLIDE 45

Model improvement cycle

Deploy DAG → Verify model → Improve DAG, then repeat

SLIDE 46

Forecast verification

Diagram: AWS S3 with models → Forecast engine → JSON-LD

SLIDE 47

Verification metrics

  • Mean absolute error
  • Root mean squared error
  • Mean error
  • Heidke skill score
  • Equitable threat score
  • Probability density functions
  • Error percentiles
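
Several of the metrics listed above have standard textbook definitions, sketched here in plain Python. This is not the code of the verification service described in the talk. The sign convention (forecast minus observation), the 2×2 contingency-table layout, and the toy numbers are assumptions made for illustration.

```python
import math


def mean_error(forecast, observed):
    """Bias: mean of (forecast - observation)."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(forecast)


def mean_absolute_error(forecast, observed):
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(forecast)


def root_mean_squared_error(forecast, observed):
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast))


def contingency_table(forecast_events, observed_events):
    """Counts of (hits, false alarms, misses, correct negatives) for yes/no events."""
    pairs = list(zip(forecast_events, observed_events))
    hits = sum(1 for f, o in pairs if f and o)
    false_alarms = sum(1 for f, o in pairs if f and not o)
    misses = sum(1 for f, o in pairs if not f and o)
    correct_negatives = sum(1 for f, o in pairs if not f and not o)
    return hits, false_alarms, misses, correct_negatives


def heidke_skill_score(a, b, c, d):
    """HSS = 2(ad - bc) / ((a + c)(c + d) + (a + b)(b + d))."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))


def equitable_threat_score(a, b, c, d):
    """ETS = (a - a_random) / (a + b + c - a_random)."""
    a_random = (a + b) * (a + c) / (a + b + c + d)
    return (a - a_random) / (a + b + c - a_random)


# Toy temperature forecasts and observations (hypothetical values).
forecast = [2.0, 4.0, 6.0, 8.0]
observed = [1.0, 5.0, 6.0, 10.0]
me = mean_error(forecast, observed)
mae = mean_absolute_error(forecast, observed)
rmse = root_mean_squared_error(forecast, observed)

# Toy yes/no event forecasts, e.g. "rain exceeds a threshold" (hypothetical).
forecast_events = [True, True, True, True, False, False, False, False, False, False]
observed_events = [True, True, True, False, True, False, False, False, False, False]
table = contingency_table(forecast_events, observed_events)
hss = heidke_skill_score(*table)
ets = equitable_threat_score(*table)
```

The continuous scores suit variables such as temperature, while the Heidke and equitable threat scores apply to categorical events.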
SLIDE 48

Figure: Mean absolute error for different models (temperature)

SLIDE 49

Figure: Probability distribution function for multiple models (temperature)

SLIDE 50

Figure: Percentile graphs for each model (temperature)
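
A percentile graph of this kind can be built from the absolute errors of each model. The sketch below shows one common way to compute error percentiles, using linear interpolation between ranks. Whether the talk used this exact definition is not stated, and the model names and error values are hypothetical.

```python
def percentile(values, p):
    """p-th percentile (0-100) with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = p * (len(ordered) - 1) / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical absolute temperature errors (degrees) for two models.
absolute_errors = {
    "model_a": [0.2, 0.5, 0.1, 1.5, 0.9, 0.4, 2.0, 0.3, 0.7, 1.1, 0.6],
    "model_b": [0.3, 0.6, 0.2, 2.5, 1.2, 0.5, 3.0, 0.4, 0.9, 1.6, 0.8],
}

levels = [50, 90, 95]
percentile_table = {
    name: {p: percentile(errors, p) for p in levels}
    for name, errors in absolute_errors.items()
}

for name, row in percentile_table.items():
    print(name, {p: round(v, 2) for p, v in row.items()})
```

Comparing the upper percentiles shows how the models differ in their largest errors, which a single mean absolute error hides.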

SLIDE 51

For a demo, please stop by the MeteoGroup booth.

SLIDE 55

Results

Cloud-based solution

  • AWS S3, EC2, ElastiCache

Transparent

  • Verification microservice

Scalable

  • Mesos cluster
  • Training time reduced from about a month to about 5 hours

Improve forecasting accuracy

  • On par with or better than the existing forecasting system

SLIDE 56

Improvements

  • Hyperlocal
  • AWS Lambda integration
  • Iterate for more accuracy

SLIDE 57

Questions?

SLIDE 58

We are hiring!
