slide-1
SLIDE 1

Probabilistic Models for Understanding Ecological Data: Case studies in Seeds, Fish and Coral

Allan Tucker Brunel University London

slide-2
SLIDE 2

The Talk

  • The Data Explosion and Ecology
  • Case Studies:
  • 1. Data Driven Models for prediction: Seeds
  • 2. Integrating Knowledge and Data: Coral
  • 3. Dynamic Models and Latent Variables: Fish
  • Conclusions
slide-3
SLIDE 3

Data historically...

  • Preserve of a handful of scientists:
  • Newton, 1600s
  • Darwin, 1800s
  • Galton, 1800s
  • Pearson, 1900s

slide-4
SLIDE 4

Database Technology Timeline

– 1960s:

  • Data collection, database creation

– 1970s:

  • Relational data model
  • Relational DBMS implementation

– 1980s:

  • Advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)

– 1990s–2000s:

  • Data Warehousing
  • Multimedia and Web databases
  • Distributed DW: The Cloud
slide-5
SLIDE 5

Data Generation examples

  • Data collected from:
  • Online forms, Sensors, GIS, Mobile devices ...

CASOS Tech Report Kew Gardens, Harapen Project

slide-6
SLIDE 6

Data Analysis

  • Increasing ability to record & store
  • So need to analyse:
  • Data Mining,
  • Machine Learning,
  • Intelligent Data Analysis,
  • Knowledge Discovery in Databases
  • Bioinformatics
  • Ecoinformatics
  • Predictive Ecology
  • ...
  • Large overlap with statistics (and all the same caveats)

slide-7
SLIDE 7

Bayesian Networks for Data Mining

  • Can be used to combine existing knowledge with data using informative priors
  • Essentially use independence assumptions to model the joint distribution of a domain
  • Independence represented by a graph: easily interpreted
  • Inference algorithms to ask “What if?” questions

slide-8
SLIDE 8

Example Bayesian Network

Structure: Species A → Species C ← Species B; Species C → Species D; Species C → Species E

P(A) = .001    P(B) = .002

A  B  P(C)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

C  P(D)
T  .90
F  .05

C  P(E)
T  .70
F  .01
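As a rough illustration of the kind of “What if?” query such a network supports, the sketch below computes P(A | D, E) by summing the joint distribution over the unobserved nodes. The probabilities are those on the slide. The graph structure (A and B are parents of C, which is the parent of D and E) and the assignment of the last two tables to D and E are read off the flattened slide text, so treat them as assumptions. The function names are illustrative only.

```python
from itertools import product

# Probability of "True" for each node, as read from the slide.
P_A = 0.001
P_B = 0.002
P_C = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
# Assumption: which of these two tables belongs to D and which to E.
P_D = {True: 0.90, False: 0.05}
P_E = {True: 0.70, False: 0.01}


def bernoulli(p_true, value):
    """Probability of `value` for a binary node whose P(True) is `p_true`."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, e):
    """Joint probability from the factorisation P(A) P(B) P(C|A,B) P(D|C) P(E|C)."""
    return (bernoulli(P_A, a) * bernoulli(P_B, b) * bernoulli(P_C[(a, b)], c)
            * bernoulli(P_D[c], d) * bernoulli(P_E[c], e))


def posterior_a(d=True, e=True):
    """P(A = True | D = d, E = e), summing the joint over B and C."""
    weights = {True: 0.0, False: 0.0}
    for a, b, c in product([True, False], repeat=3):
        weights[a] += joint(a, b, c, d, e)
    return weights[True] / (weights[True] + weights[False])


print(round(posterior_a(), 3))  # prints 0.284
```

Observing both D and E raises the probability of A from 0.001 to roughly 0.28. Enumeration like this only scales to small networks; larger ones need dedicated inference algorithms.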

slide-9
SLIDE 9

Bayesian Networks for Classification & Feature Selection & Forecasting

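  • Nodes can represent class labels or variables at “points in time”
  • Also latent variables via EM
  • Feature Selection

[Figures: a Bayesian network over X1 ... X5 with factorisation P(X1) P(X2) P(X3 | X1, X2) P(X4 | X3) P(X5 | X3); a classifier network with class node C and features X1 ... XN; a dynamic network over X1 ... XN at time slices t-1 and t; a dynamic network with latent node H and X2 ... XN at time slices t-1 and t]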

slide-10
SLIDE 10

Predictive Ecology 1: Data Driven Models

  • The Millennium SeedBank
  • RBG, Kew banking seeds for 35 years
  • MSB established for 12 years
  • 152 partner institutions in 54 countries worldwide
slide-11
SLIDE 11

The Millennium SeedBank

  • Collected and stored >47,000 collections representing >24,000 species
  • The Seedbank Database (SBD) - UK and worldwide
  • GIS data (Detailed Climate)
  • Use this data to build predictive models for successful germination

slide-12
SLIDE 12

Results: Seedbank Data

  • Lots of similarity to filter method, implying independence of features, but some interaction (e.g. scarification and latitude)

  • Generally high predictive scores
  • But explanation important
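The slide does not say which criterion the filter method used. As one common choice, the sketch below scores each feature independently by its mutual information with the class label. The feature names and the eight toy records are hypothetical and are not Seedbank data.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Hypothetical toy records, one entry per seed collection.
germinated = [1, 1, 0, 0, 1, 1, 0, 0]
features = {
    "scarified": [1, 1, 0, 0, 1, 1, 0, 0],      # mirrors the label exactly
    "high_latitude": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the label
}

# A filter scores every feature on its own, ignoring interactions.
scores = {name: mutual_information(values, germinated)
          for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Because each feature is scored in isolation, a filter like this cannot detect interactions such as the scarification and latitude effect noted above. That is one reason to compare it with a network-based approach.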
slide-13
SLIDE 13

Results: Seedbank Data

slide-14
SLIDE 14

Results: Seedbank Data

slide-15
SLIDE 15

Results: Seedbank Data

  • Markov Blanket includes all variables: all offer some improvement in prediction of germination success
  • Exploit “what if” queries by entering observations into model and applying inference:
– Recognisable pattern emerging from Kew analysis that agrees with network:
– Where pre-treatment is necessary, and it is applied, there is still relatively high probability of failure
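For reference, the Markov blanket of a node is its parents, its children, and the other parents of its children. The sketch below computes it from a parent list. The graph and variable names are hypothetical and are not the network learned from the Seedbank data.

```python
def markov_blanket(parents, target):
    """Parents, children and the children's other parents of `target`."""
    children = {node for node, pas in parents.items() if target in pas}
    blanket = set(parents.get(target, ())) | children
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(target)
    return blanket


# Hypothetical DAG: each key maps a node to the list of its parents.
parents = {
    "pre_treatment": [],
    "latitude": [],
    "temperature": ["latitude"],
    "germination": ["pre_treatment", "temperature"],
    "seedling_survival": ["germination", "latitude"],
}

print(sorted(markov_blanket(parents, "germination")))
```

In this toy graph the blanket of `germination` contains every other variable, the same situation the slide reports for the Seedbank model.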

slide-16
SLIDE 16

Summary

  • Use of data mining / machine learning to
– Utilise large scale data to predict and explain ecological phenomena
– Explore data using “what if” models
  • Expanding this work to build models for predicting plant traits of ecosystems in different regions
– Text mining of monographs
– Large flora datasets
– GIS, MSB, ...
  • Predict what species are likely to grow with others and what the likely traits will be

slide-17
SLIDE 17

Predictive Ecology 2: Data and Knowledge Integration

  • Modelling Coral Carbonate Budgets
slide-18
SLIDE 18

Coral Reefs

  • Among the most complex and productive tropical marine ecosystems
  • Made from calcium carbonate (CaCO3) secreted by corals and other calcifying organisms
  • Structure holds great variety of organisms and serves as breeding, spawning, nursery and foraging habitat

slide-19
SLIDE 19

Carbonate budget assessment

  • Increasing climate variability and anthropogenic pressures driving reefs to deterioration and destruction
  • Carbonate budget assessment
– Management tool used to determine spatial and temporal variations of reef framework accretion (CaCO3 deposition) and erosion (CaCO3 removal)
– BUT low reliability of this methodology for long term management actions due to limited temporal and spatial scales at which method can be used
  • Can we exploit a combination of data sources in one framework to better manage reefs?

slide-20
SLIDE 20

Building the Model

  • Initial structure constructed based on systematic review of published literature on carbonate budget (n = 11)
  • Integrate with climatic and human disturbance nodes based on international guidelines for reef management and expert knowledge (parameters and structure)
  • Indonesia data collected at three sites
– Located across a gradient of sedimentation and turbidity
– Continuous data discretised to two or three bins (severe/high, moderate/medium, low)

  • Data used to update priors
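The slides do not describe the update rule. One standard way to combine an expert prior with discretised field data is a Dirichlet-multinomial update, where the prior is expressed as pseudo-counts and observed counts are added to them. The sketch below assumes that approach. The cut points, readings and prior strength are all made-up numbers.

```python
from bisect import bisect_right

LEVELS = ["low", "moderate", "high"]


def discretise(value, cut_points):
    """Map a continuous value to a level using ascending cut points."""
    return LEVELS[bisect_right(cut_points, value)]


def update_prior(prior_counts, observations):
    """Add observed counts to prior pseudo-counts and renormalise."""
    counts = dict(prior_counts)
    for level in observations:
        counts[level] += 1
    total = sum(counts.values())
    return {level: counts[level] / total for level in LEVELS}


# Hypothetical expert prior worth 10 pseudo-observations.
prior = {"low": 5.0, "moderate": 3.0, "high": 2.0}
# Hypothetical turbidity readings and cut points (arbitrary units).
readings = [0.8, 2.5, 3.1, 6.0, 7.2, 9.4]
levels = [discretise(r, cut_points=[2.0, 5.0]) for r in readings]

posterior = update_prior(prior, levels)
print(levels)
print(posterior)
```

The size of the pseudo-counts controls how quickly field data overrides expert opinion. With only three sites, a strong prior will dominate the result.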
slide-21
SLIDE 21

Bayesian Network for Carbonate Budget

slide-22
SLIDE 22

Bayesian Network for Carbonate Budget

  • Three subsets of nodes can be distinguished:

– Nodes of the climatic and anthropogenic disturbances affecting coral reef framework accretive and erosive processes (grey, rectangular)
– Nodes representing the direct effects of these disturbances on the framework processes (violet, rectangular)
– Nodes closely related to CaCO3 accretive and erosive processes (blue, oval)

slide-23
SLIDE 23

Results: Carbonate budget assessment

  • Distinctive differences in the quantity of carbonate removed (CAR) at three sites
  • Model was effective in detecting the quantitative differences in bioerosion (CAR) across environmental gradients BUT explanation was not clear-cut
  • Initial results showed the ability of the model to inform which variables needed further investigation to assist future data collection (filtering out independent)
slide-24
SLIDE 24

Summary

  • Can provide coral reef managers with a tool that quantitatively assesses the rate of change of reef structure and informs which variables have driven changes the most
  • Can provide managers with information on which reef components the data collection should focus on in order to better understand reef ecosystem status
  • Plan to extend this as a freely available tool to address questions for conservation by providing potential scenarios of reef status
  • Plan to use data from different coral reef regions to provide reliable analysis of prediction (generalise between different regions; more on this later)
slide-25
SLIDE 25

Predictive Ecology 3: Dynamic Models with Latent Variables

slide-26
SLIDE 26

Fisheries Data

  • Georges Bank, East Scotian Shelf and North Sea
  • Biomass data collected at different locations
  • 100s of different species
  • From 1960s until present day
  • Massively complex foodwebs:
  • Predator / prey, cannibalism, competition …
  • Foodwebs and catch data also available
  • Lots of unmeasured variables
slide-27
SLIDE 27

Functional Collapse in G Bank, N Sea & ESS

[Figure: biomass and catch time series, 1970 to 2005, one panel per region]

  • Georges Bank: functional collapse in late ’80s / early ’90s
  • North Sea: no functional collapse
  • East Scotian Shelf: functional collapse in late ’80s / early ’90s

(Jaio, 2009)

slide-28
SLIDE 28

Questions

  • Why do populations irrevocably collapse?
  • What underlying “states” dictate biomass?
  • Can we generalise between regions?
slide-29
SLIDE 29

Results: Feature Selection to identify “cod collapse” in Georges Bank

[Figure: rankings of species and catch variables (plankton taxa, fish and invertebrate species, Haddock Catch, Cod Catch); labels: Log Likelihood, Filter, BN Wrapper, Confidence (axis 0.1 to 1)]

  • Cod Catch jumps from bottom (clearly influential)

slide-30
SLIDE 30

Results: Fitting Dynamic Models & Identifying Functional Change

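  • Selecting species based on Georges Bank foodweb, feature selection (FS) and cross correlation
  • Learn dynamic Bayesian networks (DBNs) with a latent state variable

[Figure: DBN with latent node H and observed nodes X2 ... XN at time slices t-1 and t]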
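To make the role of the latent variable concrete, the sketch below runs forward filtering in a two-state hidden Markov model, the simplest dynamic model with a hidden state. It tracks the probability of each hidden state given the observations so far. All parameters are fixed by hand and are hypothetical. The models in the talk are learned from data and involve several observed species, which this sketch does not attempt.

```python
def forward_filter(initial, transition, emission, observations):
    """Return P(H_t | x_1..x_t) for each t in a discrete hidden Markov model."""
    n_states = len(initial)
    beliefs = []
    belief = list(initial)
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: push the previous belief through the transition model.
            belief = [sum(belief[i] * transition[i][j] for i in range(n_states))
                      for j in range(n_states)]
        # Update: weight by the observation likelihood, then normalise.
        weighted = [belief[j] * emission[j][obs] for j in range(n_states)]
        total = sum(weighted)
        belief = [w / total for w in weighted]
        beliefs.append(belief)
    return beliefs


# Hypothetical parameters. Hidden state 0 = "pre-collapse", 1 = "collapsed".
initial = [0.9, 0.1]
transition = [[0.90, 0.10],   # transition[i][j] = P(H_t = j | H_t-1 = i)
              [0.05, 0.95]]
emission = [[0.2, 0.8],       # emission[state][obs]; obs 0 = low biomass,
            [0.9, 0.1]]       # obs 1 = high biomass
observations = [1, 1, 0, 0, 0]

beliefs = forward_filter(initial, transition, emission, observations)
for year, belief in enumerate(beliefs):
    print(year, [round(p, 3) for p in belief])
```

After three consecutive low-biomass observations the belief shifts firmly to the "collapsed" state. A shift of this kind in the hidden state is how a latent variable can flag a functional change.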

slide-31
SLIDE 31

Results: Dynamic Functional Models

  • Predicting ESS event & cod biomass from G Bank

[Figure: G Bank and ESS panels; node labels: Th Skate, Cod, Cusk, Cod Catch]

slide-32
SLIDE 32

Dynamic Functional Models

  • Predicting N Sea event & cod biomass from G Bank

[Figure: G Bank and N Sea panels; node labels: Atlantic herring, Cod, Rockling, Cod Catch, Th Skate, Cusk]

slide-33
SLIDE 33

Summary

  • Using Fisheries Data from several locations:

– Identified functionally equivalent species in other locations
– Used species in one location to build time-series models for prediction on species in other locations
– Used latent variables to identify similar functional collapses (or not)

slide-34
SLIDE 34

Incorporating Variance and Autocorrelation metrics

[Figure: time series, 1970 to 2006; labels: Standardised Biomass, Observed Cod, HMMPred, VarAutoPred]

[Figure: time series, 1980 to 2005; labels: Expected Hidden Value, Mean Var, Mean Autocorr]

  • Prediction is improved when regime shift metrics are included (rather than relying on hidden states)
  • A particular improvement in ESS: drop in large peak in 1982
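Variance and lag-1 autocorrelation over a sliding window are commonly used as early-warning metrics for regime shifts. The slides do not say how the metrics were computed or how they were fed into the model, so the sketch below only shows one plausible way to derive them from a biomass series. The window width and the series values are hypothetical.

```python
from statistics import mean, pvariance


def lag1_autocorrelation(window):
    """Lag-1 autocorrelation of a window; 0.0 if the window is constant."""
    m = mean(window)
    denom = sum((x - m) ** 2 for x in window)
    if denom == 0:
        return 0.0
    num = sum((window[i] - m) * (window[i + 1] - m)
              for i in range(len(window) - 1))
    return num / denom


def rolling_metrics(series, width):
    """(variance, lag-1 autocorrelation) for each sliding window of `width`."""
    out = []
    for end in range(width, len(series) + 1):
        window = series[end - width:end]
        out.append((pvariance(window), lag1_autocorrelation(window)))
    return out


# Hypothetical standardised biomass values, not the survey data.
biomass = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.4, 0.5, 0.3, 0.2]
metrics = rolling_metrics(biomass, width=5)
for variance, autocorr in metrics:
    print(round(variance, 4), round(autocorr, 4))
```

Each (variance, autocorrelation) pair could then be added as extra observed variables alongside biomass in a time-series model.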

slide-35
SLIDE 35

Conclusions

  • Looked at 3 case studies in ecology

– Data is noisy, complex, heterogeneous

  • Bayesian network approaches to

– Incorporate diverse data and expertise
– Model latent variables and time
– Perform prediction, classification, forecasting & generalisation
– Transparent: Can perform explanation

  • Structure and parameters are not black box
  • “What if” inference experiments
slide-36
SLIDE 36

Conclusions

  • Prediction is “a property that sets the genuine sciences apart from those that arrogate to themselves the title without really earning it”, Peter Medawar, Nobel laureate, immunologist and philosopher of science
  • Predictive Ecology is an important way to deal with modelling ecological phenomena:
– Confidence in models
– Deal with overfitting

  • Systems approach also important
slide-37
SLIDE 37

Caveats

  • Data Quality

– Models only as good as the data that goes in
– Exploitation of expert knowledge is key
– Including the appropriate variables:

  • Human / Sociological factors
  • External factors to the system (latent variable analysis but expertise is better! E.g. regime shift metrics)
  • Ecological events are often “novel situations”

– Must be able to predict events outside of “normality”
– If we have previous examples then must generalise to other regions
– If not, must go beyond supervised learning (anomaly detection)

  • Issues with Data Sharing and reproducibility
slide-38
SLIDE 38

Thanks to...

  • Chiara Franco & Liz Hepburn, Essex University, UK
  • John Dickie & Don Kirkup, Royal Botanic Gardens, Kew, UK
– Kenwin Liu, RBG Kew
– Robert Turner, RBG Kew
  • Daniel Duplisea, Mont Joli Institute, Canada
– Jerry Black, DFO-BIO Halifax, for assistance with the ESS survey data
– Alida Bundy, DFO-BIO Halifax, for the ESS food web
– ICES DATRAS database for the North Sea IBTS data
– Bill Kramer, NOAA–NMFS Woods Hole, for the Georges Bank survey
– Jason Link, NOAA–NMFS, for the Georges Bank food web
– Jon Hare, NOAA–NMFS, for NE USA plankton data
– SAHFOS for North Sea plankton data
– Mike Hammill for ESS grey seal data