Macro-selection and micro-editing: a Statistics Netherlands prototype
Wim Hacking, Roger Lemmens
Statistics Netherlands
Problems with data editing (read: challenges)
- Less funds supplied
- More electronic input
- More output demanded
--> More efficient data editing
1. Electronic questionnaires with checks / feedback
2. Automatic corrections / imputations
3. Meso analysis for:
   - manual edits only when they have an effect at the aggregate level
   - verifying the results of step 1

MacroView
[Diagram: microdata are aggregated into macrodata; combined aggregate + plausibility functions flag suspicious aggregates; the corresponding microdata are then selected as a subset, visualized, and edited]
MacroView POC
- Text-based (for now)
- Must be flexible
- For prototyping
- Final version must be usable for different statistics
- Input from & application to:
- Transport
- Production
DATAMODEL MyData
  PRIMARY ID
  WEIGHT TheWeight
  FIELDS
    ID                : integer
    InterviewId       : integer
    TransportedWeight : real, ifmissing(0.0)
    TransportType     : integer, ifmissing(0)
    Distance          : real
    TheWeight         : real
    Quarter           : integer
    NstrCode          : integer
  RULES
    IF Type = 'X' THEN
      Distance < 2000 "Distance for type X should not exceed 2000"
    ENDIF
ENDMODEL
Specification of microdata
DATAMODEL MyAggregateData
  AGGREGATEBY
    AggDef = Quarter * NstrCode
  FILTERBY
    Filter1 = 'Profit > 1000'
  FIELDS
    Quarter          : integer
    NSTRCode         : integer
    Average_Distance : real
    Sum_Weights      : real
    ...
ENDMODEL
Specification of macrodata
Quarter | NSTR | Average_Distance | Sum_Weights
   1    |  1   |       ...        |     ...
   1    |  2   |       ...        |     ...
DATAMODEL MyAggregateData
  AGGREGATEBY
    AggDef  = Quarter * NstrCode
    AggDef2 = Quarter
  FILTERBY
    Filter1 = 'Profit > 1000'
    Filter2 = 'Profit <= 1000'
  FIELDS
    Quarter          : integer
    NSTRCode         : integer
    Average_Distance : real
    Sum_Weights      : real
    ...
ENDMODEL
Specification of macrodata
AggDef2 (Quarter only):

Quarter | Average_Distance | Sum_Weights
   1    |       ...        |     ...
   2    |       ...        |     ...

AggDef (Quarter * NSTR):

Quarter | NSTR | Average_Distance | Sum_Weights
   1    |  1   |       ...        |     ...
   1    |  2   |       ...        |     ...
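Outside the DSL, the same aggregation can be sketched roughly as follows (pandas, with illustrative toy values; column names follow the specification):

```python
import pandas as pd

# Toy microdata with the fields used in the specifications (illustrative values)
micro = pd.DataFrame({
    "Quarter":   [1, 1, 2],
    "NstrCode":  [1, 2, 1],
    "Distance":  [100.0, 200.0, 150.0],
    "TheWeight": [1.0, 2.0, 1.0],
    "Profit":    [1500, 900, 2000],
})

filtered = micro.query("Profit > 1000")          # FILTERBY Filter1

# AggDef = Quarter * NstrCode
agg = (filtered.groupby(["Quarter", "NstrCode"])
               .agg(Average_Distance=("Distance", "mean"),
                    Sum_Weights=("TheWeight", "sum"))
               .reset_index())

# AggDef2 = Quarter
agg2 = (filtered.groupby("Quarter")
                .agg(Average_Distance=("Distance", "mean"))
                .reset_index())
```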
Specification of aggregate calculation: how to detect anomalies?
1. Distribution properties of the micro data, e.g. its variance.
2. Processing properties:
   - % item non-response
   - % previously automatically imputed values in field X
3. Plausibility functions, e.g. the relative change between weighted t-1 and t data:
    Δ = ( Σ_i w_i,t V_i,t − Σ_i w_i,t-1 V_i,t-1 ) / Σ_i w_i,t-1 V_i,t-1
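A minimal sketch of this plausibility function (NumPy; the argument names are hypothetical):

```python
import numpy as np

def weighted_relative_change(w_t, v_t, w_prev, v_prev):
    """Relative change of the weighted total between t-1 and t:
    (sum(w_t * v_t) - sum(w_prev * v_prev)) / sum(w_prev * v_prev)."""
    total_t = np.sum(np.asarray(w_t) * np.asarray(v_t))
    total_prev = np.sum(np.asarray(w_prev) * np.asarray(v_prev))
    return (total_t - total_prev) / total_prev

# Weighted total grows from 100 to 150 --> delta = 0.5
print(weighted_relative_change([1, 1], [70, 80], [1, 1], [40, 60]))  # prints 0.5
```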
Comparing aggregate values
Quarter | Average_Distance(t-1) | Average_Distance(t) | Delta | CV_Distance
   1    |         2000          |        3500         | 0.75  |    4.5
   2    |         1500          |         ...         |  ...  |    2.4
   3    |         2300          |        2100         | 0.09  |    1.6
   4    |         1700          |         ...         | 0.14  |    ...
Aggregate MyAgg
  INPUT
    CY     = MyData
    LY_Agg = MyAggregateData
  OUTPUT
    Outputagg = MyAggregatedDataCompared
  cells
    Difference := abs(LY_Agg.average_distance - AVG(CY.Distance)) / LY_Agg.average_distance;
    Med_Dist   := Median(CY.Distance);
  Cellcompare Aggdef2:
    ratio := average_distance(1) / average_distance(2) > 2.0
      "There should be more transport in the first quarter!";
EndAggregate
Specification of an aggregate
    Δ = ( D_t − D_{t-1} ) / D_{t-1}

Quarter | Average_Distance | Sum_Weights
   1    |      1.91        |     ...
   2    |      0.89        |     ...
  ...   |       ...        |     ...
Grid Kerncel_Grid
  input = MyAggregateData
  rules
    color(average_distance, difference, 0, 30.0, Green, Red)
EndGrid
Specification of a grid
NstrCode * Quarter
Plot MicroPlot
  input   = MyData
  type    = scatter
  XColumn = Distance
  YColumn = TransportedWeight
  ZColumn = TheWeight
  mincolorvalue = 1.0
  maxcolorvalue = 500.0
  color   = (Red, Blue, Green)
  TooltipText = 'Weight=$THEWEIGHT$'
endplot
Specification of a plot
Proces Agg_LY(Data_LY, DataMacro_LY)
  Agg_LY.Done -> MyAgg(Data, DataMacro_LY, DataMacro)
  MyAgg.Done  -> comparegrid1(av.outputdata) AT Demo POSITION Top Label 'DataCompared'
  comparegrid1.RecordsSelected -> MicroPlot(comparegrid1.SelectedRecords) AT Demo POSITION Tab Label 'SelectedRecords'
  comparegrid1.RecordsSelected -> Details2(comparegrid1.SelectedRecords) AT Demo POSITION Left Label 'MicroData'
  MicroPlot.RecordsSelected -> MyEdit(MicroPlot.SelectedRecords)
  MyEdit.RecordEdited -> MyAgg(Data, DataMacro_LY, DataMacro)
Endproces
Specification of the process
Specification of the process (2)
Micro-editing
MacroView: currently
- Testing with real data
- Road transport data (~demo)
- Production statistics
- Generating new ideas for the POC
- Investigating the possible integration with Blaise
- Data layer
- Syntax
- Making the POC more robust
- Better checking of illegal syntax
A demo…
Questions ?
Grid with aggregate + coloring
Plots with coloring / zoom / selection
Idea:
To do / questions:
Get started with real data at V&V and meso-level data cleaning ("Mesogaafmaken"). Next: where should the definitive version be built? Within a project or with Blaise?
Select data source(s)
- Selection of data sources occurs through so-called .boi file(s)
- Presently only one data source can be selected
- Version for 2001: multiple data sources
  - record set data
  - aggregated data: table meta (Cristal)
Define groups
Grouping variable types are:
  - enumeration
  - integer / float
  - classification
Group the data, e.g.
  - sex (male, female)
  - turnover (0..100, 101..200, 201..300)
  --> leads to 6 groups
Select variables, e.g.
  - TotalSalaries, NoPersons
Define derived variables, if needed, e.g.
  - AvgSalary := TotalSalaries / NoPersons
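The grouping and derived-variable steps above can be sketched as follows (pandas, with hypothetical data; the sex and turnover classes give 2 x 3 = up to 6 groups):

```python
import pandas as pd

# Hypothetical microdata
data = pd.DataFrame({
    "Sex":           ["male", "female", "male", "female"],
    "Turnover":      [50, 150, 250, 90],
    "TotalSalaries": [1000.0, 2400.0, 3000.0, 800.0],
    "NoPersons":     [10, 20, 30, 8],
})

# 2 sex groups x 3 turnover classes --> up to 6 groups
data["TurnoverClass"] = pd.cut(data["Turnover"], bins=[0, 100, 200, 300])

# Derived variable: AvgSalary := TotalSalaries / NoPersons
data["AvgSalary"] = data["TotalSalaries"] / data["NoPersons"]

# Aggregate the derived variable per group
groups = data.groupby(["Sex", "TurnoverClass"], observed=True)["AvgSalary"].mean()
```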
Generate table grid
- Choose indicators to analyse:
  - mean, median, MAD, min, max, variance, standard deviation, alpha-trimmed mean, cell-filling, number of records
Define/select variables to observe
Show aggregates in grid:
        | Mean(a) | Std(a) | CV(a)
Group 1 |  100.2  |   3.4  | 0.034
Group 2 |   90.4  |   6.2  | 0.069
CV(a) = std(a) /mean(a)
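The CV indicator can be sketched as follows (Python's statistics module; the sample standard deviation is assumed here):

```python
import statistics

def cv(values):
    """Coefficient of variation: std(a) / mean(a)."""
    return statistics.stdev(values) / statistics.mean(values)

# A group whose values vary little has a small CV
print(cv([98.0, 100.0, 102.0]))  # prints 0.02
```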
Mean    | profit97 | profit98 | Δ profit
Group 1 |   234    |   250    |   2.3%
Group 2 |   456    |   503    |  14.7%
- abs(profit97 - profit98) * 2 / (profit97 + profit98) < 0.1 --> percentage
- Filling(NoPersons) < 0.3 (non-response) --> signalling flag
Alternative aggregate functions
Filling < 0.3
Define checks on data (later: using checks from Blaise)
Macro level: Detection of suspicious cells
Identify suspicious data in cells (outliers) which have a potential impact on survey estimates, using:
- estimates of the distribution (mean, std, ...)
- manual outlier detection using boolean expression variables
- automatic outlier detection using the Kosinski algorithm
Outlier detection: Kosinski
- 1-D: use |x - median(x)| / MAD(x) as a robust distance to detect outliers
- N-D: no median defined; use the Mahalanobis distance instead:

    M_i^2 = (y_i - μ)^T C^-1 (y_i - μ)     (reduces to ((y_i - μ) / σ)^2 in 1-D)
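These two distances can be sketched as follows (NumPy; function names are our own):

```python
import numpy as np

def mad_distance(x):
    """1-D robust distance: |x - median(x)| / MAD(x)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) / mad

def mahalanobis_sq(y, mu, cov):
    """Squared Mahalanobis distance (y_i - mu)^T C^-1 (y_i - mu) per row of y."""
    diff = np.asarray(y, dtype=float) - mu
    inv_cov = np.linalg.inv(cov)
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# In 1-D with unit variance this reduces to ((y - mu) / sigma)^2
d2 = mahalanobis_sq([[3.0]], mu=np.array([1.0]), cov=np.array([[1.0]]))
print(d2)  # prints [4.]
```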
Kosinski algorithm
1. Start with n_0 = 0.1·n 'good' points
2. Compute μ and C from the good points
3. Obtain the Mahalanobis distances of all points
4. Take the (1+f)·n_i points with the smallest distances, if distance < cutoff
5. n_{i+1} = (1+f)·n_i
6. Repeat until no more points are added.

Parameters for outlier detection: f and cutoff
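The steps above can be sketched as a forward search (NumPy; the parameter values f and cutoff, the starting subset, and the test data are illustrative, not the published algorithm's exact choices):

```python
import numpy as np

def kosinski_outliers(y, f=0.2, cutoff=30.0):
    """Grow a 'good' subset of points; points never absorbed are outliers."""
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    # 1. start with the n0 = 0.1*n points closest to the coordinate-wise median
    order = np.argsort(np.abs(y - np.median(y, axis=0)).sum(axis=1))
    good = order[:max(d + 1, int(0.1 * n))]
    while True:
        # 2.-3. mean/covariance of the good points, Mahalanobis distances of all
        mu = y[good].mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(y[good], rowvar=False))
        diff = y - mu
        d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        # 4.-5. keep the (1+f)*ni smallest distances that are below the cutoff
        target = min(n, int(np.ceil((1 + f) * len(good))))
        candidates = np.argsort(d2)[:target]
        new_good = candidates[d2[candidates] < cutoff]
        # 6. repeat until no more points are added
        if len(new_good) <= len(good):
            return np.setdiff1d(np.arange(n), new_good)
        good = new_good

# 200 inliers plus three gross outliers; indices 200-202 should be flagged
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                 [[8.0, 8.0], [9.0, -9.0], [-8.0, 9.0]]])
out = kosinski_outliers(pts)
```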
Grid after outlier detection
Meso level: Detection of points in plots
Identify suspicious data in XY plots (outliers) which have a potential impact on survey estimates, using:
- manual outlier detection using boolean expression variables
- automatic outlier detection using Kosinski

Scatter plot with outliers marked blue:
Scatter plot with condition
(Points not complying with the conditions are shown in green)
Micro level: record editing
- Identify suspicious records, either automatically or manually
- Sort suspicious records according to some outlier index, e.g. the Mahalanobis distance
- Edit records manually; worst cases first
Micro editing
Directly: click on a suspicious point in the xy-plot and edit it
- now:
  - copy OLEDB --> Blaise
  - call RunDEP(...) in BlTools.dll
  - copy Blaise --> OLEDB
- later:
  - use DEC with existing IBlaiseDatabase reference
MacroView and Slice:
The building blocks of MacroView
- Under the hood the MacroView application consists of several modules; these modules can and will be used as building blocks in SLICE.
Slice: a decomposed MacroView as an example

Slice
- Overall framework for editing/imputation and display modules, based on COM technology
- Data source independency: Slice record sets
- Small building blocks with just one function: (COM) idea, LEGO bricks: flexibility
- Data exchange between blocks based on Slice records; data description uses Blaise meta
- Usable in many environments because of COM: VB, VC++, Delphi, Manipula, ...
- Data from / to: Blaise, OLEDB databases, ...
- Currently available: CherryPi and Imputation
Future of MacroView (1):
- New aggregate functions for demography changes:
  - import/export(data source 1, data source 2)

Mean    | Year98 | Year99
Group 1 | 100.2  | 140.2
Group 2 |  90.4  |  88.3

Import(..) 4
Future of MacroView (2):
- A more intuitive dialog for the definition of integer/float (time/date) variable types
- Manipulation of defined ranges
Future of MacroView (3):
- More data sources:
  - all data sources (Di) must contain the grouping fields (G)
  - other data sources with detail records: join fields (J) must be supplied
  - in case of t-1, t-2, ...: often just aggregates; table meta: Cristal

[Diagram: data sources D1 and D2, linked via grouping fields G and join fields J, with variables V1, V2 and groups G1 ... Gn]
Future of MacroView (4):
- Plots of data at the aggregated level, e.g. a histogram of average profit for some or all groups
- More than one aggregate function active in the grid

Mean    | Year98 | Year99
Group 1 | 100.2  | 140.2
Group n |  90.4  |  88.3

        | Std(Year98) | Mean(Year98) | Mean(Year99)
Group 1 |    10.2     |    100.2     |    140.2
Group n |     9.4     |     90.4     |     88.3
Future of MacroView (5):
- Allow classification variables as grouping variables (SBI, ...)
- Box-whisker, ..., plots
- Integration with Blaise:
  - MacroView as a whole part of the Blaise shell
  - Aggregation and robust outlier detection (Kosinski) as Slice modules, available via Manipula
Slice in more detail
Modules: Data Source, Grid View, Scatter plot, Aggregate, Record editing
Slice modules:
- Receive and send so-called Slice records
- Have no knowledge of the origin or destination of the records
- Have (in principle) one function
- Most modules will be non-interactive

[Diagram: a module with input recordset(s) and output recordset(s), controlled by settings or parameters]
- Derived variables can also be conditions, e.g.
  - Filling(NoPersons) > 0.3 (non-response) --> signalling flag
  - abs(profit97 - profit98) * 2 / (profit97 + profit98) < 0.1 --> percentage
  - alternative aggregate function: std(a)/mean(a) or just mean(a)
Storage
- Storage is done using the IRegister component in BlRegA.dll
- Undo/Redo ??
Blaise OLEDB Information (.boi) file
- Reference to a Blaise meta (.bmi) file
- Reference to an OLEDB provider and table, e.g.
  - provider = C:\MyDatabase.mdb (Access), table = Customer
- Mapping between OLEDB fields and Blaise fields:
  - Customer.Name (String) <--> Name (STRING[20])
  - Customer.Type (Integer) <--> Type (AgeClass : (Rich (1), Medium (2), Poor (3)))
SiAggregate
[Diagram: data source (.boi) with fields Income, Age, ...; grouping via GetQuery with Integer Grouping, Enum Grouping, and Classification Grouping]

SiAggregate
[Diagram: internal object model: Data, Stratum, StratumDef, Cells, VarRanges, Range, Var, Aggregates]
Slice records:
- Database-independent and simple recordset
- Records can come from / go to any OLEDB or Blaise source
- Will handle the distribution to multiple modules
- Will take care of buffering of records, if necessary

[Diagram: the internal structure of a Slice record: Recordset, RecordDef, Fields, FieldDefs]
Using the Slice modules:
- Both the modules and the Slice recordset can be used from any programming language that supports the use of COM
- Modules under construction:
  - CherryPi: automatic error detection
  - SiImputeXXX: several imputation modules

Near future:
- COM object for the Slice records
- Modules: CherryPi and SiRegressionImputation

Future:
- More modules (aggregation, more imputation methods, ...)
- Visual prototyping tool
Visual inspection methods:
- Show Mahalanobis distances for a selected recordset
- Show outliers (based on the Kosinski algorithm) in a datagrid or in a scatterplot
- Mark outliers by hand in a scatterplot
- Clicking outlier point(s) allows editing of the record (using the Blaise DEP)
- Visual presentation:
  - flagged aggregates
  - list of sorted records
  - scatter plot
  - histogram
An example: the CherryPi module
- Input: a recordset and a set of edit rules
- Output: a copy of the input recordset with some fields marked 'erroneous'
- Based on edit rules, e.g.
    profit + costs = turnover
    costs - 0.5 * turnover > 0
- Automatic localization of erroneous records
- Localize the faulty item(s) in an erroneous record, e.g. with the algorithm as implemented by CherryPi (based on the Fellegi-Holt paradigm)
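The edit-rule check itself can be sketched as follows (hypothetical record layout; the rules are the two examples above, while the actual Fellegi-Holt error localization in CherryPi is considerably more involved):

```python
# Each edit rule returns True when the record passes it
edits = {
    "profit + costs = turnover":
        lambda r: abs(r["profit"] + r["costs"] - r["turnover"]) < 1e-9,
    "costs - 0.5 * turnover > 0":
        lambda r: r["costs"] - 0.5 * r["turnover"] > 0,
}

def failed_edits(record):
    """Return the names of the edit rules this record violates."""
    return [name for name, check in edits.items() if not check(record)]

# 100 + 400 != 600, so only the balance edit fails
record = {"profit": 100.0, "costs": 400.0, "turnover": 600.0}
print(failed_edits(record))  # prints ['profit + costs = turnover']
```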
Multiple datasources (e.g.: t, t-1)
           | Situation 1 | Situation 2 | Situation 3
Coupled    | fully       | partially   | not
Grouping   | yes         | yes         | yes
Aggregated | no          | no          | yes
Remark     |             |             | Cristal
Cristal:
- Dimension = variable
- Observation item = variable + aggregation method
- Hierarchy = variable + range:
  - classification item = range
- Data point = 1 cell in the grid