[PPT] - SCAN STATISTICS USING FOR ECONOMICAL RESEARCH V. Jansons, V. PowerPoint Presentation

SLIDE 1

SCAN STATISTICS USING FOR ECONOMICAL RESEARCH

V. Jansons, V. Jurenoks, K. Didenko

Riga Technical University, Latvia – Bulgaria, Yundola - 2008

SLIDE 2

Traditional Statistics Methods – more Appropriate for Local Investigations Taking Into Account Local Factors

SLIDE 3

Socio-economic and technological practice today requires determining quickly and accurately whether extra events (hot spots, i.e. clusters) are occurring!

SLIDE 4

Socio-Economic Cluster Detection With Spatial Scan Statistics

Changes in modern urban planning;
Urban planning concepts are changing;
Need for sustainable cities;
Importance of integrating the multi-scale nature of the city.

New concepts in urban planning: fusion between
–Urban nuclei theory;
–Hierarchical centre/periphery models.

SLIDE 5

Using Scan Statistics to Determine Marketing Hot Spots – High Demand Density

[Figure: areas labelled A, B and C]

SLIDE 6

Urban Cluster Detection? Why?

Characterize urban space;
Highlight specialized areas in an urban area;
Highlight deficiencies in a certain neighbourhood;

Avoid excessive mobility;
Plan public transport networks.

SLIDE 7

The Spatial Scan Statistics Have Been Used to Detect and Extract Spatiotemporal Clusters of Service Within the City of Riga

SLIDE 8

Using Scan Statistics for Latvian Forest (Health) Analysis and Control

SLIDE 9

Latvian State Institutions Involved in Forest Fire Control

Defence Troops
Local Government
Communication Service
Medical Service
Civil Defence
Latvian Railway
Fire Fighting Department
National Guard
Latvian Air Transport
State Forest Service

SLIDE 10

Forest and Forest Fire Prevention System in Latvia – Fire Lookout Towers and Fire Stations

Forests in Latvia cover 45% of the surface of the country.

The state is the largest forest owner in Latvia, with control of approximately 50% of the forests.

All activities in Latvian forests must be conducted according to the Latvian Forest Law.

SLIDE 11

Average Number of Forest Fires in Latvia

SLIDE 12

Space-Time Scan Statistic

Cylinders are used rather than 2D circular zones.
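
A minimal Python sketch of what a cylindrical window means in practice, assuming the usual convention that the base of the cylinder is a circle in space and its height is a time interval. The function name `in_cylinder` and the toy events are hypothetical and are not taken from the slides.

```python
import math

def in_cylinder(event, centre, radius, t_start, t_end):
    """Return True if an (x, y, t) event lies inside the space-time cylinder."""
    x, y, t = event
    inside_circle = math.hypot(x - centre[0], y - centre[1]) <= radius
    inside_interval = t_start <= t <= t_end
    return inside_circle and inside_interval

# Hypothetical events: (x, y, day)
events = [(0.5, 0.5, 3), (0.5, 0.5, 9), (4.0, 4.0, 3)]
selected = [e for e in events
            if in_cylinder(e, centre=(0.0, 0.0), radius=1.0, t_start=1, t_end=5)]
```

Only the first event is selected: the second falls outside the time interval and the third outside the circle.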

SLIDE 13

Total Number of Fires in Latvia in Time

Evolution of Average Number of Forest Fires in Latvia in Time

[Figure: time axis with the points T1 and T2 marked]

SLIDE 14

Example of Satellite-Based Forest Fire Monitoring in the Baltic Region

SLIDE 15

Monitoring of Aquatic Ecosystems and Groundwater in the River Basin Areas

Network Analysis of Biological Integrity in Freshwater Streams

SLIDE 16

Water Quality Sampling Stations

Each sampling station monitors water parameters:
Bacteria;
Chlorine levels;
pH;
Inorganic and organic pollutants;
Colour, turbidity, odour;
Many others.

SLIDE 17

Example of Wireless Sensor Networks for Habitat Monitoring

SLIDE 18

Scalable Wireless Geo-Telemetry with Miniature Smart Sensors

SLIDE 19

Data Fusion Hierarchy for Smart Sensor Network with Wireless Geo-Telemetry Capability

Decisions

Information Retrieval
Information Analysis

Data Analysis

Data Integration: Sensors, Time, Location
Data Processing: Refinement and Filtering
Signal Acquisition From Sensors

SLIDE 20

Forest Health Decision Support System Using Benchmarking and Scan Statistics

[Diagram labels: Key Forest Areas; Forest; Outside Factors; Threat Locations; Forest (Infected, Non-infected); Sample; Ground Sensors; Air/Space Platforms; Data from Sensors; Benchmarking; Data Processing - Compare; Hot Spot Identification Module; Benchmarking Module; Verification; Decision]

SLIDE 21

The previous slides showed that enough data (information) is available for global statistical control with scan statistics. Modern computing capabilities make it possible to solve real scan statistics problems.

SLIDE 22

Objectives of the Scan Statistics:

Perform socio-economic surveillance of phenomena, to detect areas of significantly high or low rates. Indicates whether there is clustering;
Test whether a phenomenon is randomly distributed over space, over time, or over space and time. Shows us where it is;
Evaluate reported spatial or space-time clusters of phenomena, to see if they are statistically significant. Produces a relative risk for the cluster;
Perform repeated, time-periodic surveillance of phenomena for the early detection of outbreaks.

SLIDE 23

Spatial Scan Statistic

The purpose of scan statistics is the early detection of clusters;

Two phases:
–Identification of the most probable clusters, for which the occurrences of a phenomenon within a region are higher than outside it;
–Distinguishing clusters that are significant from those which occurred by chance.

SLIDE 24

Scanning Window Principle

A scanning window considers every unit and its neighbours in search of overdensities;

The size of the window is increased;

SLIDE 25

Scanning Window Principle

A circular scanning window is placed at different coordinates, with a radius that varies from 0 to some set upper limit. For each location and size of the window, the statistical criterion (likelihood ratio) is computed, and the window with the maximum value is considered the most likely cluster.

[Figure legend: grid points; circles around the red point; circles around the blue point]

(Kulldorff, 1997; Neill and Moore, 2005)
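
The window enumeration described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `circular_windows`, the toy coordinates and the list of radii are hypothetical. For each grid centre and each radius up to the upper limit, it yields the indices of the data points that fall inside the circle.

```python
import math

def circular_windows(centres, points, radii):
    """Yield (centre, radius, indices of the points inside the circle)."""
    for cx, cy in centres:
        for r in radii:
            inside = [i for i, (x, y) in enumerate(points)
                      if math.hypot(x - cx, y - cy) <= r]
            yield (cx, cy), r, inside

# Hypothetical toy data: three observation points and two grid centres.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
centres = [(0.0, 0.0), (3.0, 4.0)]
windows = list(circular_windows(centres, points, radii=[0.0, 1.0, 5.0]))
```

Each of the resulting windows would then be scored with the likelihood ratio, and the highest-scoring window kept as the most likely cluster.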

SLIDE 26

Scanning Window Principle

To detect and localize outbreaks, we can search for spatial regions where the counts are significantly higher than expected. Imagine moving a space-time window around the scan area, allowing the window size, shape, and duration to vary.

SLIDE 30

Scan Statistics

In either case, we find the regions with highest values of a likelihood ratio statistic, and compute the statistical significance of each region by randomization testing. Parametric scan statistic approaches assume some parametric model for the distribution of counts, and learn the parameters from historical data.

Null hypothesis H0: no outbreak
Alternative hypothesis H1(S): outbreak in region S

\[
L(S) = \frac{\Pr(\text{Data} \mid H_1(S))}{\Pr(\text{Data} \mid H_0)}
\]

[Figure annotations: maximum region score = 9.5; significant (p = 0.01); not significant (p = 0.18)]

SLIDE 31

Kulldorff’s Population-based Frequency Model of Cluster - Critical Region

[Figure: region S with qin = 0.02 inside and qout = 0.01 outside]

The figure illustrates a suspected cluster: a region S with a high intensity level, qin = 0.02, of the phenomenon. The scan statistic must answer the question: is this cluster real, or is it a “visual illusion”?

SLIDE 32

Simplest Model For This Situation Can be Written as:

Null hypothesis H0 (no clusters in region S)

qi = qall everywhere (use maximum likelihood estimate of qall in S);

Alternative hypothesis H1 (cluster in region S)

qi = qin inside region S, qi = qout elsewhere (use maximum likelihood estimates of qin and qout, subject to qin > qout).
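
A short Python sketch of the estimates named in the two hypotheses, assuming the maximum likelihood estimate of each rate is simply the observed count divided by the population. The function name `estimate_rates` and the numbers are hypothetical. They were chosen so that the result matches qin = 0.02 and qout = 0.01 from the previous slide.

```python
def estimate_rates(c_in, pop_in, c_total, pop_total):
    """Maximum likelihood rate estimates under H0 and H1 for one region S."""
    q_all = c_total / pop_total                      # H0: one rate everywhere
    q_in = c_in / pop_in                             # H1: rate inside S
    q_out = (c_total - c_in) / (pop_total - pop_in)  # H1: rate outside S
    if q_in <= q_out:
        # The constraint q_in > q_out is not met, so the constrained
        # estimates collapse to the common null rate.
        q_in = q_out = q_all
    return q_all, q_in, q_out

# Hypothetical counts: 20 events among 1000 units inside S, 110 among 10000 overall.
q_all, q_in, q_out = estimate_rates(c_in=20, pop_in=1000,
                                    c_total=110, pop_total=10000)
```

With these inputs the estimates are approximately q_all = 0.011, q_in = 0.02 and q_out = 0.01.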

SLIDE 33

Likelihood Function and p-Value

The likelihood function is constructed according to the model selected.
The likelihood function is maximized over all window locations and sizes.
The window with the maximum likelihood is the most likely cluster (the one least likely to have occurred by chance).
The likelihood ratio for this window becomes the maximum likelihood ratio test statistic.
A p-value is obtained for the cluster by Monte Carlo hypothesis testing.

SLIDE 34

The Poisson Model Is the Most Popular Model in Scan Statistics

In each area, we assume the data X to be Poisson distributed under the null hypothesis H0, i.e.,

\[
X \sim \mathrm{Poi}(\lambda)
\]

This yields the likelihood function L0:

\[
L_0 = f(x_1, \ldots, x_n \mid w) = \prod_{i=1}^{n} f(x_i \mid w)
\]

SLIDE 35

Likelihood Ratio

We then compute the maximum of the function L0;
We also compute the maximum of L1, which is the same function with the parameters unrestricted.

Each zone Z has different parameters, given the heterogeneous population distribution. We want to find the zone which maximizes the likelihood ratio (LR) between the likelihoods L1 and L0:

\[
LR(Z) = \left( \frac{L_1}{L_0} \right)_Z
\]

SLIDE 36

Scan Statistic LRst From Likelihood Ratio

The scan statistic LRst is defined as

\[
LR_{st} = \max_{Z} LR(Z)
\]

SLIDE 37

Poisson Likelihood Ratio

In the case of a Poisson-distributed process, the likelihood ratio takes the following form:

\[
LR = \frac{\left( \dfrac{c_{in}}{n_{in}} \right)^{c_{in}} \left( \dfrac{c_{out}}{n_{out}} \right)^{c_{out}}}{\left( \dfrac{c_{tot}}{n_{tot}} \right)^{c_{tot}}} \cdot I
\]

\[
I =
\begin{cases}
1 & \text{if } \dfrac{c_{in}}{n_{in}} > \dfrac{c_{out}}{n_{out}} \\
0 & \text{otherwise}
\end{cases}
\]

where
c: counts;
n: expected number of cases;
in and out: within or outside the scanning window;
tot: total over the whole study area;
I: indicator function.
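
The Poisson likelihood ratio on this slide can be evaluated numerically as in the sketch below. It is a minimal illustration: the function names `poisson_lr` and `_xlogy` are hypothetical, the computation is done on the log scale for numerical stability, and a term with a zero count is treated as contributing a factor of 1.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention that the term is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(y)

def poisson_lr(c_in, n_in, c_tot, n_tot):
    """Poisson likelihood ratio for one window, including the indicator I.

    c_* are observed counts and n_* are expected numbers of cases.
    """
    c_out, n_out = c_tot - c_in, n_tot - n_in
    if c_in / n_in <= c_out / n_out:
        return 0.0  # indicator I = 0: no excess inside the window
    log_lr = (_xlogy(c_in, c_in / n_in)
              + _xlogy(c_out, c_out / n_out)
              - _xlogy(c_tot, c_tot / n_tot))
    return math.exp(log_lr)

# Hypothetical window: 20 observed against 10 expected, 100 against 100 in total.
lr_excess = poisson_lr(c_in=20, n_in=10, c_tot=100, n_tot=100)
lr_flat = poisson_lr(c_in=10, n_in=10, c_tot=100, n_tot=100)
```

Here `lr_excess` is well above 1, while `lr_flat` is 0 because the rate inside the window does not exceed the rate outside.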

SLIDE 38

Null Hypothesis Testing Using Monte-Carlo Method

[Flowchart]
Scanning process with a fixed window w;
Determine the location of the factor with a high intensity level;
Model the p-value with the Monte Carlo method for testing the null hypothesis;
Test the null hypothesis (no cluster), with two outcomes: no clusters, or a cluster is present!
Repeat this procedure for another window w.

SLIDE 39

Null Hypothesis Significance Testing

For each potential cluster, we generate N datasets (at least 1000) using the parameters λ0 estimated for that zone, and we obtain a distribution for LR:

\[
H_0 : \lambda = \lambda_0, \qquad H_1 : H_0 \text{ is false}
\]
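
A minimal Python sketch of this Monte Carlo test, under several assumptions: the null datasets are drawn as independent Poisson counts with the estimated means, the test statistic is passed in as a function, and the p-value is the rank of the observed statistic among the replicates. The names `poisson_sample` and `monte_carlo_p_value` are hypothetical. In the example the maximum count stands in for the likelihood ratio purely to keep the code short.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def monte_carlo_p_value(observed_stat, statistic, lam0, n_sim=1000, seed=0):
    """Rank-based p-value of the observed statistic among null replicates."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        simulated = [poisson_sample(lam, rng) for lam in lam0]
        if statistic(simulated) >= observed_stat:
            exceed += 1
    # The observed dataset counts as one of the n_sim + 1 replicates.
    return (exceed + 1) / (n_sim + 1)

# Hypothetical example: five areas with an expected count of 2 each.
lam0 = [2.0] * 5
p_value = monte_carlo_p_value(observed_stat=9, statistic=max, lam0=lam0)
```

A small p-value means that few simulated null datasets produced a statistic as large as the observed one, which is the evidence used to reject H0.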

SLIDE 40

The Bernoulli Model, Like the Poisson Model, Is One of the Most Popular Models in Scan Statistics

A Bernoulli process is a discrete-time stochastic process based on Bernoulli trials;
A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, “success” and “failure”;
Values are expressed as 0 or 1 (non-cases or cases).

SLIDE 41

Likelihood Function for the Bernoulli Model

The scan is similar to the Poisson case, visiting each event.

Likelihood function:

\[
\left( \frac{c}{n} \right)^{c} \left( \frac{n-c}{n} \right)^{n-c} \left( \frac{C-c}{N-n} \right)^{C-c} \left( \frac{(N-n)-(C-c)}{N-n} \right)^{(N-n)-(C-c)} \cdot I()
\]

where
C: total number of cases in the dataset;
c: observed number of cases in the window;
n: total number of cases and controls in the window;
N: total number of cases and controls in the dataset.
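
The Bernoulli likelihood can be evaluated on the log scale as in the following sketch. The function names are hypothetical, and the indicator factor I() is left out because the slide does not spell out its argument, so the code returns only the product of the four power terms.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention that the term is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_log_likelihood(c, n, C, N):
    """Log of the four power terms of the Bernoulli likelihood.

    c, n: cases and cases-plus-controls inside the window;
    C, N: cases and cases-plus-controls in the whole dataset.
    """
    out_cases = C - c              # cases outside the window
    out_total = N - n              # cases and controls outside the window
    return (_xlogy(c, c / n)
            + _xlogy(n - c, (n - c) / n)
            + _xlogy(out_cases, out_cases / out_total)
            + _xlogy(out_total - out_cases,
                     (out_total - out_cases) / out_total))

# Hypothetical dataset: 20 cases among 100 points, window of 10 points.
ll_cluster = bernoulli_log_likelihood(c=8, n=10, C=20, N=100)  # 80% cases inside
ll_uniform = bernoulli_log_likelihood(c=2, n=10, C=20, N=100)  # 20% cases inside
```

The window with a high share of cases gives the larger log-likelihood, while the window whose case share equals the overall share reproduces the likelihood of a single common rate.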

SLIDE 42

The Structure of “Isolated” Points Remains the Same After Summation of Point Intensities

Summation of point intensities and filtering at some level L

SLIDE 43

Point “Merging” Process After Summation of Point Intensities

Summation of point intensities:

SLIDE 44

Upper Level Set of Intensity Surface

Hotspot zones Z1, Z2 and Z3 at level L (connected components of the upper level set)

[Figure: intensity surface over region S, cut at level L]
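
A minimal Python sketch of how hotspot zones could be extracted as connected components of an upper level set. It assumes the intensity surface is given on a rectangular grid, that the upper level set consists of the cells with intensity at least L, and that cells are connected through their four edge neighbours. The function name `hotspot_zones` and the toy surface are hypothetical.

```python
from collections import deque

def hotspot_zones(intensity, level):
    """Connected components (4-neighbour) of the cells with intensity >= level."""
    rows, cols = len(intensity), len(intensity[0])
    seen = set()
    zones = []
    for r in range(rows):
        for c in range(cols):
            if intensity[r][c] < level or (r, c) in seen:
                continue
            # Breadth-first search collects one connected zone.
            zone, queue = [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                zone.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and intensity[ny][nx] >= level):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            zones.append(sorted(zone))
    return zones

# Hypothetical intensity surface on a 4 x 5 grid.
surface = [
    [5, 6, 1, 0, 0],
    [4, 1, 0, 0, 7],
    [0, 0, 3, 0, 8],
    [0, 0, 0, 0, 0],
]
zones = hotspot_zones(surface, level=3)
```

At level 3 this toy surface splits into three zones, and raising the level to 7 leaves a single zone.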

SLIDE 45

Spatially distributed response variables Hotspot analysis Prioritization Decision support systems

Geoinformatic spatio-temporal data from a variety of data products and data sources with agencies, academia, and industry

Masks, filters Indicators, weights Masks, filters

Geoinformatic Surveillance System

Decision Support Systems Using Scan Statistics

SLIDE 46

Agency Databases; Thematic Databases; Other Databases
Homeland Security; Disaster Management; Public Health; Ecosystem Health; Other Case Studies

Statistical Processing: Hotspot Detection, Prioritization, etc.

Data Sharing, Interoperable Middleware
Standard or De Facto Data Model, Data Format, Data Access
Arbitrary Data Model, Data Format, Data Access
Application Specific De Facto Data/Information Standard

Information Structure and Data Flows

SLIDE 47

CONCLUSIONS

The spatiotemporal scan statistic can be used to describe urban space in terms of the density of service types;
Application to marketing showed the relevance of the method;
High- and low-rate regions are not isolated in terms of spatial connectivity;
The use of scan statistics makes it possible to detect more consistent clusters for the investigated regions;
The use of scan statistics makes it possible to detect the effect of tourism in every region.

SLIDE 48

THANK YOU FOR YOUR ATTENTION!

SLIDE 49

Yundola view from Google-Earth

SLIDE 50

Bulgaria

(Yundola)

SLIDE 51

Latvia

SLIDE 52

Scan Statistics and Time Series

We perform time series analysis to find the expected counts for each recent day; then compare actual to expected counts. For the standard scan statistic approach, we assume that each count is drawn from a Poisson distribution with unknown mean. For each of these regions, we compare the current counts for each location to the time series of historical counts for that location.

[Figure legend: expected counts; historical counts; current counts (3-day duration)]
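
The comparison of current with expected counts can be sketched in Python as below. The slide does not say which time series method is used, so a simple moving average over the historical counts is assumed here. The function name `expected_count`, the locations "A" and "B" and all the numbers are hypothetical.

```python
def expected_count(history, window=7):
    """Moving-average baseline: mean of the last `window` historical counts."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Hypothetical daily counts per location, and the counts for the last 3 days.
history = {"A": [4, 5, 6, 5, 4, 6, 5], "B": [2, 3, 2, 2, 3, 2, 3]}
current = {"A": [5, 6, 4], "B": [6, 7, 8]}

ratios = {}
for location in history:
    # Expected total over the current period = daily baseline x number of days.
    expected = expected_count(history[location]) * len(current[location])
    ratios[location] = sum(current[location]) / expected
```

Location "A" has a ratio of 1, meaning the counts are as expected, while location "B" shows counts almost three times the expected level and would be a candidate for the scan.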