Macro-selection and micro-editing: a prototype
Wim Hacking, Roger Lemmens, Statistics Netherlands (PDF document)


SLIDE 1

Macro-selection and micro-editing: a prototype

Wim Hacking, Roger Lemmens Statistics Netherlands

SLIDE 2

Problems with data-editing (read: challenges)

  • Less funds supplied
  • More electronic input
  • More output demanded

SLIDE 3

More efficient data editing

1. Electronic questionnaires with checks / feedback
2. Automatic corrections / imputations
3. Meso analysis for
   – manual edits only when they have an effect at the aggregate level
   – verifying the results of step 1

→ MacroView

SLIDE 4

[Diagram: mIcrodata is aggregated into mAcrodata; combined aggregate + plausibility functions]

SLIDE 5

[Diagram: mIcrodata → selection of the microdata → visualize → select subset → edit]

SLIDE 6

MacroView POC

  • Text-based (for now)
  • Must be flexible
  • For prototyping
  • Final version must be usable for different statistics
  • Input from & application to:
    – Transport
    – Production
SLIDE 7

[Diagram: mIcrodata is aggregated into mAcrodata; combined aggregate + plausibility functions]

SLIDE 8

DATAMODEL MyData
  primary ID
  weight TheWeight
FIELDS
  ID                : integer
  InterviewId       : integer
  TransportedWeight : real, ifmissing(0.0)
  TransportType     : integer, ifmissing(0)
  Distance          : real
  TheWeight         : real
  Quarter           : integer
  NstrCode          : integer
RULES
  IF Type = 'X' THEN
    Distance < 2000 "Distance for type X should not exceed 2000"
  ENDIF
ENDMODEL

Specification of microdata

SLIDE 9

DATAMODEL MyAggregateData
AGGREGATEBY
  AggDef = Quarter * NstrCode
FILTERBY
  Filter1 = 'Profit > 1000'
FIELDS
  Quarter          : integer
  NSTRCode         : integer
  Average_Distance : real
  Sum_Weights      : real
  …
ENDMODEL

Specification of macrodata

Quarter  NSTR  Average_Distance  Sum_Weights
   1      1          …                …
   1      2          …                …
   …      …          …                …

SLIDE 10

DATAMODEL MyAggregateData
AGGREGATEBY
  AggDef  = Quarter * NstrCode
  AggDef2 = Quarter
FILTERBY
  Filter1 = 'Profit > 1000'
  Filter2 = 'Profit <= 1000'
FIELDS
  Quarter          : integer
  NSTRCode         : integer
  Average_Distance : real
  Sum_Weights      : real
  …
ENDMODEL

Specification of macrodata

By AggDef2 (Quarter):

Quarter  Average_Distance  Sum_Weights
   1            …               …
   2            …               …
   …            …               …

By AggDef (Quarter * NstrCode):

Quarter  NSTR  Average_Distance  Sum_Weights
   1      1          …                …
   1      2          …                …
   …      …          …                …

SLIDE 11

Specification of aggregate calculation: how to detect anomalies?

1. Distribution properties of the microdata, e.g. its variance.
2. Processing properties
   • % item non-response
   • % previously automatically imputed values in field X
3. Plausibility functions: e.g. the relative change between weighted t-1 and t data:

    Δ = ( Σ_i w_i,t · V_i,t − Σ_i w_i,t−1 · V_i,t−1 ) / ( Σ_i w_i,t−1 · V_i,t−1 )
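The relative-change plausibility function above can be sketched in a few lines. This is a minimal illustration, not part of the prototype; the record field names (w_t, v_t, w_t1, v_t1) are assumptions standing in for the weight and value variables at t and t-1.

```python
# Sketch of the plausibility function: relative change between the
# weighted total at t and the weighted total at t-1.

def relative_change(records):
    """Delta = (sum_i w_it * V_it - sum_i w_it1 * V_it1) / sum_i w_it1 * V_it1"""
    total_t = sum(r["w_t"] * r["v_t"] for r in records)
    total_t1 = sum(r["w_t1"] * r["v_t1"] for r in records)
    return (total_t - total_t1) / total_t1

records = [
    {"w_t": 1.0, "v_t": 120.0, "w_t1": 1.0, "v_t1": 100.0},
    {"w_t": 2.0, "v_t": 90.0, "w_t1": 2.0, "v_t1": 100.0},
]
# weighted totals: t = 120 + 180 = 300, t-1 = 100 + 200 = 300
print(relative_change(records))  # → 0.0
```

A large |Δ| for a cell then flags that cell as implausible at the aggregate level.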

SLIDE 12

Comparing aggregate values

[Table per quarter: Average_Distance(t-1), Average_Distance(t), Delta, CV_Distance]

SLIDE 13

Aggregate MyAgg
INPUT
  CY     = MyData
  LY_Agg = MyAggregateData
OUTPUT
  Outputagg = MyAggregatedDataCompared
CELLS
  Difference := abs(LY_Agg.average_distance - AVG(CY.Distance)) / AVG(LY.Distance);
  Med_Dist   := Median(CY.Distance);
CELLCOMPARE AggDef2:
  ratio := average_distance(1) / average_distance(2) > 2.0
    "There should be more transport in the first quarter!";
EndAggregate

Specification of an aggregate

    ( D_t − D_{t−1} ) / D_{t−1}

Quarter  Average_Distance  Sum_Weights
   1           1.91             …
   2           0.89             …
   …            …               …
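The Difference cell above can be sketched as follows. This is an illustrative sketch: the slide's formula divides by AVG(LY.Distance), and since only last year's stored average is available here, the sketch uses it as the denominator; the data values are invented.

```python
# Sketch of the Difference cell: compare last year's stored aggregate
# with the average distance recomputed from this year's microdata.

def difference(ly_average_distance, cy_distances):
    # Difference := abs(LY_Agg.average_distance - AVG(CY.Distance)) / AVG(LY.Distance)
    cy_avg = sum(cy_distances) / len(cy_distances)
    return abs(ly_average_distance - cy_avg) / ly_average_distance

# CY average = 2000.0, LY aggregate = 1500.0 → |1500 - 2000| / 1500
print(difference(1500.0, [1000.0, 2000.0, 3000.0]))  # → 0.3333...
```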

SLIDE 14

Grid Kerncel_Grid
  input = MyAggregateData
  rules
    color(average_distance, difference, 0, 30.0, Green, Red)
EndGrid

Specification of a grid

NstrCode * Quarter

SLIDE 15

Plot MicroPlot
  input = MyData
  type = scatter
  xcolumn = Distance
  YColumn = TransportedWeight
  ZColumn = TheWeight
  mincolorvalue = 1.0
  maxcolorvalue = 500.0
  color = (Red, Blue, Green)
  TooltipText = 'Weight=$THEWEIGHT$'
endplot

Specification of a plot

SLIDE 16

Proces
  Agg_LY(Data_LY, DataMacro_LY)
  Agg_LY.Done -> MyAgg(Data, DataMacro_LY, DataMacro)
  MyAgg.Done -> comparegrid1(av.outputdata)
    AT Demo POSITION Top Label 'DataCompared'
  comparegrid1.RecordsSelected -> MicroPlot(comparegrid1.SelectedRecords)
    AT Demo POSITION Tab Label 'SelectedRecords'
  comparegrid1.RecordsSelected -> Details2(comparegrid1.SelectedRecords)
    AT Demo POSITION Left Label 'MicroData'
  MicroPlot.RecordsSelected -> MyEdit(MicroPlot.SelectedRecords)
  MyEdit.RecordEdited -> MyAgg(Data, DataMacro_LY, DataMacro)
EndProces

Specification of the process

SLIDE 17

Specification of the process (2)

SLIDE 18

Micro-editing

SLIDE 19

MacroView: currently

  • Testing with real data
    – Road transport data (~demo)
    – Production statistics
  • Generating new ideas for the POC
  • Investigating the possible integration with Blaise
    – Data layer
    – Syntax
  • Making the POC more robust
    – Better checking of illegal syntax
SLIDE 20

A demo…

SLIDE 21

Questions ?

SLIDE 22
SLIDE 23
SLIDE 24
SLIDE 25

Grid with aggregate + color coding

SLIDE 26

Plots with color coding / zoom / selection

SLIDE 27

Idea:

SLIDE 28

To do / questions:

Get to work with real data at V&V and Mesogaafmaken. Follow-up: where should the definitive version be built? Within a project or at Blaise?

SLIDE 29
SLIDE 30
SLIDE 31
SLIDE 32
SLIDE 33
SLIDE 34

Select data source(s)

Selection of data sources occurs through so-called .boi file(s). Presently only one data source can be selected.

Version for 2001: multiple data sources

– record set data
– aggregated data: table meta (Cristal)

SLIDE 35

Define groups

Grouping variable types are:

– enumeration
– integer / float
– classification

Group the data, e.g.

– sex (male, female)
– turnover (0..100, 101..200, 201..300)

leads to 6 groups
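The group formation above can be sketched as follows: crossing an enumeration (sex) with a banded numeric variable (turnover) yields 2 × 3 = 6 possible groups. The field names and band boundaries are the slide's example; the records are invented.

```python
# Sketch of group formation: cross sex with a banded turnover variable.
from collections import defaultdict

def turnover_band(t):
    if t <= 100:
        return "0..100"
    if t <= 200:
        return "101..200"
    return "201..300"

def group(records):
    groups = defaultdict(list)
    for r in records:
        # the group key is the cross of the two grouping variables
        groups[(r["sex"], turnover_band(r["turnover"]))].append(r)
    return groups

data = [
    {"sex": "male", "turnover": 50},
    {"sex": "female", "turnover": 150},
    {"sex": "male", "turnover": 250},
]
print(sorted(group(data).keys()))
```

Only the occupied groups appear as keys; the full cross of the two variables defines the 6 possible cells.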

SLIDE 36

Select variables, e.g.

– TotalSalaries , NoPersons

Define derived variables, if needed, e.g.

– AvgSalary := TotalSalaries / NoPersons

  • Generate table grid

Choose indicators to analyse:

– Mean, median, MAD, min, max, variance, standard deviation, alpha-trimmed mean, cell-filling, number of records.

Define/select variables to observe
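Two of the robust indicators listed above can be sketched in pure Python: the median absolute deviation (MAD) and the alpha-trimmed mean. This is an illustrative sketch, not the prototype's implementation; the data values are invented.

```python
# Sketch of two robust indicators: MAD and the alpha-trimmed mean.
from statistics import median

def mad(xs):
    """Median absolute deviation: median of |x - median(x)|."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

def trimmed_mean(xs, alpha=0.1):
    """Drop the alpha fraction of smallest and of largest values, then average."""
    xs = sorted(xs)
    k = int(alpha * len(xs))
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

values = [1.0, 2.0, 2.0, 3.0, 100.0]  # one gross outlier
print(mad(values))                     # → 1.0 (unaffected by the outlier)
print(trimmed_mean(values, alpha=0.2))
```

Unlike the plain mean and variance, both indicators stay stable in the presence of the gross outlier, which is why they are useful for anomaly detection.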

SLIDE 37

Show aggregates in grid:

SLIDE 38

Define/select variables to observe

           Mean(a)  Std(a)  CV(a)
Group 1     100.2    3.4    0.034
Group 2      90.4    6.2    0.069

CV(a) = std(a) / mean(a)

Mean       profit97  profit98  Δ profit
Group 1       234       250       2.3%
Group 2       456       503      14.7%

  • abs(profit97- profit98)*2/(profit97+ profit98) < 0.1 --> percentage
  • Filling(NoPersons) < 0.3 (non-response) --> signalling flag

Alternative aggregate functions

Filling < 0.3

Define checks on data (later: using checks from Blaise)
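The two checks above can be sketched as follows: the coefficient of variation CV(a) = std(a)/mean(a), and the symmetric relative profit change with its 10% threshold. An illustrative sketch; the input values are invented, and population standard deviation is assumed.

```python
# Sketch of per-group checks: coefficient of variation and a threshold
# check on the symmetric relative change in profit.
from statistics import mean, pstdev

def cv(xs):
    """CV(a) = std(a) / mean(a), using the population standard deviation."""
    return pstdev(xs) / mean(xs)

def profit_change_ok(p97, p98, threshold=0.1):
    """abs(profit97 - profit98) * 2 / (profit97 + profit98) < threshold."""
    return abs(p97 - p98) * 2 / (p97 + p98) < threshold

print(round(cv([90.0, 100.0, 110.0]), 3))  # → 0.082
print(profit_change_ok(234, 250))          # small change → True
print(profit_change_ok(100, 150))          # 40% change → False
```

Records (or groups) failing such a check get a signalling flag, as on the slide.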

SLIDE 39

Macro level: Detection of suspicious cells

Identify suspicious data in cells (outliers), which have a potential impact on survey estimates, using:

– estimates of the distribution (mean, std, …)
– manual outlier detection using boolean expression variables
– automatic outlier detection using the Kosinski algorithm

SLIDE 40

Outlier detection: Kosinski

1-D: use the |x − median(x)| / MAD(x) distance to robustly detect outliers.

N-D: no median defined; use the Mahalanobis distance instead:

    M² = (y_i − μ)ᵀ C⁻¹ (y_i − μ)      ( ((y_i − μ)/σ)² in 1-D )
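The distance above can be sketched in 2-D with an explicit 2×2 matrix inverse, so no linear-algebra library is needed. An illustrative sketch; C is assumed to be a 2×2 covariance matrix with nonzero determinant.

```python
# Sketch of the squared Mahalanobis distance M^2 = (y - mu)^T C^-1 (y - mu)
# in 2-D, inverting the 2x2 covariance matrix by hand.

def mahalanobis_sq_2d(y, mu, C):
    a, b = y[0] - mu[0], y[1] - mu[1]
    (c11, c12), (c21, c22) = C
    det = c11 * c22 - c12 * c21
    # inverse of the 2x2 covariance matrix
    i11, i12, i21, i22 = c22 / det, -c12 / det, -c21 / det, c11 / det
    return a * (i11 * a + i12 * b) + b * (i21 * a + i22 * b)

# with the identity covariance, M^2 is just the squared Euclidean distance
print(mahalanobis_sq_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))))
# → 25.0
```

With a non-identity C the distance stretches and rotates with the data cloud, which is what makes it suitable for N-D outlier detection.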

SLIDE 41

Kosinski-algorithm

  1. Start with n₀ = 0.1·n 'good' points
  2. Compute μ and C of the good points
  3. Obtain all Mahalanobis distances
  4. Take the (1+f)·nᵢ points with the smallest distances, if distance < cutoff
  5. nᵢ₊₁ = (1+f)·nᵢ
  6. Repeat until no more points are added.
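The steps above can be sketched in 1-D, where the Mahalanobis distance reduces to |x − μ|/σ. This is only an illustration of the forward-search idea, not the real Kosinski algorithm, which works in N dimensions with a full covariance matrix and chi-square-based cutoffs; the seeding rule (points closest to the median), f = 0.5, and cutoff = 3.0 are illustrative choices.

```python
# 1-D sketch of the Kosinski-style forward search: grow a 'good' subset
# by a factor (1+f) per round; whatever never joins it is an outlier.
from statistics import mean, median, pstdev

def kosinski_1d(xs, f=0.5, cutoff=3.0):
    # step 1: seed with a small 'good' subset, here the points nearest the median
    m = median(xs)
    order = sorted(xs, key=lambda x: abs(x - m))
    good = order[:max(3, int(0.1 * len(xs)))]
    while True:
        mu, sigma = mean(good), pstdev(good)      # step 2: location and scale
        sigma = sigma or 1e-12                    # guard against zero spread
        by_dist = sorted(xs, key=lambda x: abs(x - mu) / sigma)   # step 3
        n_next = min(len(xs), int((1 + f) * len(good)))           # steps 4-5
        new_good = [x for x in by_dist[:n_next]
                    if abs(x - mu) / sigma < cutoff]
        if len(new_good) <= len(good):            # step 6: no growth -> stop
            return [x for x in xs if x not in new_good]   # the outliers
        good = new_good

data = [9.8, 10.0, 10.1, 10.2, 9.9, 10.3, 9.7, 10.0, 55.0]
print(kosinski_1d(data))  # → [55.0]
```

Growing the subset gradually keeps the gross outlier from ever inflating μ and σ, which is the point of the forward search.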
SLIDE 42

Parameters for outlier detection: f and cutoff

SLIDE 43

Grid after outlier detection

SLIDE 44

Meso level: Detection of points in plots

Identify suspicious data in XY plots (outliers), which have a potential impact on survey estimates, using:

– manual outlier detection using boolean expression variables
– automatic outlier detection using Kosinski

SLIDE 45

Scatter plot with outliers marked blue:

SLIDE 46


SLIDE 47

Scatter plot with condition

(Points not complying with the conditions are green)

SLIDE 48

Micro level: record editing

Identify suspicious records, either automatically or manually.

Sort suspicious records according to some outlier index, e.g. Mahalanobis distance.

Edit records manually; worst cases first.

SLIDE 49

Micro editing

Directly: click on a suspicious point in the xy-plot and edit it.

– now:
    • copy OLEDB --> Blaise
    • call RunDEP(...) in BlTools.dll
    • copy Blaise --> OLEDB
– later:
    • use DEC with existing IBlaiseDatabase reference
SLIDE 50

MacroView and Slice:

The building blocks of MacroView

  • Under the hood the MacroView application consists of several modules; these modules can and will be used as building blocks in SLICE.

SLIDE 51

Slice: a decomposed MacroView as an example

SLIDE 52

Slice

  • Overall framework for editing/imputation and display modules, based on COM technology
  • Data source independency: Slice record sets
  • Small building blocks with just 1 function: (COM) idea, LEGO bricks: flexibility
  • Data exchange between blocks based on SLICE records; data description uses Blaise meta
  • Usable in many environments because of COM: VB, VC++, Delphi, Manipula, ...
  • Data from / to: Blaise, OLEDB databases, ...
  • Currently available: CherryPi and Imputation
SLIDE 53

Future of macroview (1):

  • new aggregate functions for demography changes:
    – import/export(data source 1, data source 2)

Mean        Year98  Year99
Group 1     100.2   140.2
Group 2      90.4    88.3
Import(..)      4

SLIDE 54

Future of macroview (2):

  • One more intuitive dialog for the definition of integer/float (time/date) variable types
  • Manipulation of defined ranges
SLIDE 55

Future of macroview (3):

  • More data sources:
    – all data sources (Di) must contain the grouping fields (G)
    – other data sources with detail records: join fields (J) must be supplied
    – in case of t-1, t-2, ...: often just aggregates; table meta: Cristal

[Diagram: data sources D1 and D2, each with grouping fields G, joined via join fields J; variables V1, V2 per group G1...Gn]

SLIDE 56

Future of macroview (4):

  • plots of data at aggregated level, e.g. histogram of average profit for some or all groups
  • more than one aggregate function active in the grid

Mean        Year98  Year99
Group 1     100.2   140.2
Group n      90.4    88.3
Import(..)      4

            Std     Mean
            Year98  Year98  Year99
Group 1     10.2    100.2   140.2
Group n      9.4     90.4    88.3

SLIDE 57

Future of macroview (5):

  • allow classification variable as a grouping variable (SBI, ...)
  • box-whisker, ..., plots
  • Integration with Blaise:
    – MacroView as a whole part of the Blaise shell
    – Aggregation and robust outlier detection (Kosinski) as Slice modules; available via Manipula

SLIDE 58

Slice in more detail

SLIDE 59

MacroView and Slice: the building blocks of MacroView

  – Data Source
  – Grid View
  – Scatter plot
  – Aggregate
  – Record editing

SLIDE 60

Slice modules:

  • Receives and sends so-called Slice records
  • Has no knowledge of the origin or destination of the records
  • Has (in principle) one function
  • Most modules will be non-interactive

[Diagram: a module with input recordset(s), settings or parameters, and output recordset(s)]

SLIDE 61
  • Derived variables can also be conditions, e.g.
    – Filling(NoPersons) > 0.3 (non-response) --> signalling flag
    – abs(profit97 - profit98)*2/(profit97 + profit98) < 0.1 --> percentage
    – alternative aggregate function: std(a)/mean(a) or just mean(a)

Define/select variables to observe

SLIDE 62

Storage

Storage is done using the IRegister component in BlRegA.dll.

Undo/Redo ??

SLIDE 63

Blaise OLEDB Information (.boi) file

  • Reference to a Blaise meta file (.bmi)
  • Reference to an OLEDB provider and table, e.g.
    – provider = C:\MyDatabase.mdb (Access), table = Customer
  • Mapping between OLEDB fields and Blaise fields:
    – Customer.Name (String) <--> Name (STRING[20])
    – Customer.Type (Integer) <--> Type (AgeClass : (Rich (1), Medium (2), Poor (3)))

SLIDE 64

SiAggregate

Datasource (.boi): Income, Age, ...

Grouping:
  • GetQuery
  • Integer grouping
  • Enum grouping
  • Classification grouping

SLIDE 65

SiAggregate

[Diagram of SiAggregate internals: Data, Stratum, StratumDef, Cells, VarRanges, Range, Var, Aggregates]

SLIDE 66

Slice records:

  • Database-independent and simple recordset
  • Records can come from / go to any OLEDB or Blaise source
  • Will handle the distribution to multiple modules
  • Will take care of buffering of records, if necessary

[Diagram of the internal structure of a Slice record: Recordset, RecordDef, Fields, FieldDefs]
SLIDE 67

Using the slice modules:

Both the modules and the Slice recordset can be used from any programming language that supports the use of COM.

Modules under construction:

– CherryPi: automatic error detection
– SiImputeXXX: several imputation modules

SLIDE 68

Near future:

  • COM object for the Slice records
  • Modules: CherryPi and SiRegressionImputation

Future:

  • More modules (aggregation, more imputation methods, ...)
  • Visual prototyping tool

SLIDE 69

Visual inspection methods:

  • Show Mahalanobis distances for a selected recordset
  • Show outliers (based on the Kosinski algorithm) in a datagrid or in a scatterplot
  • Mark outliers by hand in a scatterplot
  • Clicking outlier point(s) allows editing of the record (using the Blaise DEP)
  • Visual presentation:
    – flagged aggregates
    – list of sorted records
    – scatter plot
    – histogram

SLIDE 70

An example: the CherryPi module

  • Input: a recordset and a set of edit rules
  • Output: a copy of the input recordset with some fields marked 'erroneous'
  • Based on edit rules, e.g.

      profit + costs = turnover
      costs - 0.5 * turnover > 0

  • Automatic localization of erroneous records
  • Localize the faulty item(s) in an erroneous record: e.g. the algorithm as implemented by CherryPi (based on the Fellegi-Holt paradigm)
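The input/output contract above can be sketched as follows. This only flags records that violate a rule; the real Fellegi-Holt error localization (choosing the minimal set of fields to change so that all edits can be satisfied) is considerably more involved. Field names follow the slide's example rules; the data records are invented.

```python
# Sketch of a CherryPi-style check: run each record through the edit
# rules and attach the names of the violated rules.

RULES = [
    ("profit + costs = turnover",
     lambda r: r["profit"] + r["costs"] == r["turnover"]),
    ("costs - 0.5 * turnover > 0",
     lambda r: r["costs"] - 0.5 * r["turnover"] > 0),
]

def check(records):
    # return a copy of each record with the violated rule names attached
    out = []
    for r in records:
        failed = [name for name, rule in RULES if not rule(r)]
        out.append({**r, "errors": failed})
    return out

data = [
    {"profit": 40, "costs": 60, "turnover": 100},  # satisfies both rules
    {"profit": 70, "costs": 30, "turnover": 100},  # violates the second rule
]
for rec in check(data):
    print(rec["errors"])
```

Marking fields rather than whole records, as the slide describes, would additionally require deciding which of the fields involved in a failed rule is the faulty one.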

SLIDE 71

Multiple datasources (e.g.: t, t-1)

             Situation 1   Situation 2   Situation 3
Coupled      fully         partially     not
Grouping     yes           yes           yes
Aggregated   no            no            yes
Remark                                   Cristal

SLIDE 72

Cristal:

  • Dimensions = Variable
  • Observation item = Variable + aggregation method
  • Hierarchy = Variable + range:
    – classification item = range
  • Data point = 1 cell in grid = classification item x observation item