macro selection and micro editing a
play

Macro-selection and micro-editing: a Wim Hacking, Roger Lemmens - PDF document

Macro-selection and micro-editing: a Wim Hacking, Roger Lemmens Statistics Netherlands prototype Problems with data-editing (read: challenges) Less funds supplied More electronic input More output demanded More efficient data


  1. Macro-selection and micro-editing: a Wim Hacking, Roger Lemmens Statistics Netherlands prototype

  2. Problems with data-editing (read: challenges) � Less funds supplied � More electronic input � More output demanded

  3. More efficient data editing Electronic questionnaires with checks / 1. feedback Automatic corrections / imputations 2. Meso analysis for 3. – manual edits only when having effect at the aggregate level – Verifying results step 1 � MacroView

  4. mAcrodata mAcrodata (aggregated) Microdata mIcrodata (aggregated) aggregate Combined aggregate + plausibility functions

  5. Microdata mIcrodata Selection of the microdata Visualize Edit Select subset

  6. � MacroView POC - Text-based (for now) - Must be flexible - For prototyping - Final version must be usable for different statistics - Input from & application to: - Transport - Production

  7. mAcrodata mAcrodata (aggregated) Microdata mIcrodata (aggregated) Combined aggregate + plausibility functions

  8. Specification of microdata DATAMODEL MyData primary ID weight TheWeight FIELDS ID :integer InterviewId :integer TransportedWeight :real , ifmissing(0.0) TransportType :integer, ifmissing(0) Distance :real TheWeight :real Quarter :integer NstrCode :integer RULES IF Type = ‘X’ THEN Distance < 2000 “Distance for type X should not exceed 20 ENDIF ENDMODEL

  9. Specification of macrodata DATAMODEL MyAggregateData AGGREGATEBY AggDef = Quarter * NstrCode FILTERBY Filter1 = ‘Profit > 1000’ FIELDS Quarter: integer NSTRCode: integer Average_Distance :real Sum_Weights :real … ENDMODEL Quar NSTR Average_ Sum_ ter Distance Weights 1 1 1 2 … …

  10. Specification of macrodata DATAMODEL MyAggregateData AGGREGATEBY AggDef = Quarter * NstrCode AggDef2 = Quarter FILTERBY Filter1 = ‘Profit > 1000’ Filter2 = ‘Profit <= 1000’ FIELDS Quarter: integer NSTRCode: integer Average_Distance :real Sum_Weights :real … ENDMODEL Quarter Average_ Sum_ Quar NSTR Average_ Sum_ ter Distance Weights Distance Weights 1 1 1 1 2 2 … … …

  11. Specification of aggregate calculation: how to detect anomalies? Distribution properties of the micro data, e.g. its variance. 1. Processing properties 2. � % item non-response � % previously automatically imputed values in field X Plausibility functions: e.g. the relative change between 3. weighted t-1 and t data: ∑ ∑ − w V w V − − i , t 1 i , t 1 i , t i , t Δ = w V − − i , t 1 i , t 1

  12. Comparing aggregate values Quarter Average_ Average_ Delta CV_Distance Distance(t-1) Distance(t) 1 2000 3500 0.75 4.5 2 1500 1700 0.14 2.4 3 2300 2100 0.09 (2,3) 1.6 (1) 4

  13. Specification of an aggregate Aggregate MyAgg INPUT CY = MyData LY_Agg = MyAggregateData output outputagg = MyAggregatedDataCompared cells − − Difference := abs(LY_Agg.average_distance - D D t 1 t AVG(CY.Distance))/AVG(LY.Distance) ; D Med_Dist := Median( CY.Distance ); − t 1 Cellcompare Aggdef2: ratio := average_distance(1) / average_distance(2) > 2.0 “There should be more transport in the first quarter!"; EndAggregate Quarter Average_ Sum_ Distance Weights 1 1.91 … 2 0.89 … … … …

  14. Specification of a grid Grid Kerncel_Grid input = MyAggregateData rules color(average_distance, difference, 0, 30.0, Green, Red) EndGrid NstrCode * Quarter

  15. Specification of a plot Plot MicroPlot input = MyData type = scatter xcolumn = Distance YColumn = TransportedWeight ZColumn = TheWeight mincolorvalue = 1.0 maxcolorvalue = 500.0 color = (Red, Blue, Green) TooltipText=‘Weight=$THEWEIGHT$' endplot

  16. Specification of the proces Proces Agg_LY(Data_LY, DataMacro_LY) Agg_LY.Done -> MyAgg(Data, DataMacro_LY, DataMacro) MyAgg.done -> comparegrid1(av.outputdata) AT Demo Position Top Label 'DataCompared' comparegrid1.RecordsSelected -> MicroPlot(comparegrid1.SelectedRecords) AT Demo POSITION Tab Label 'SelectedRecords' comparegrid1.RecordsSelected -> Details2(comparegrid1.SelectedRecords) AT Demo POSITION Left Label 'MicroData' MicroPlot.RecordsSelected -> MyEdit(MicroPlot.SelectedRecords) MyEdit.RecordEdited -> MyAgg(Data, DataMacro_LY, DataMacro) Endproces

  17. Specification of the proces (2)

  18. Micro-editing

  19. MacroView: currently - Testing with real data - Road transport data (~demo) - Production statistics - Generating new ideas for the POC - Investigating the possible integration with Blaise - Data layer - Syntax - Making the POC more robust - Better checking of illegal syntax

  20. A demo…

  21. Questions ?

  22. Grid met aggregaat + aankleuring

  23. Plots met aankleuring/zoom/selectie

  24. Idee:

  25. Todo/vragen: Met echte data aan de slag bij V&V en Mesogaafmaken Vervolg: waar moet de definitieve versie gemaakt worden? Binnen een project of bij Blaise ?

  26. Select data source(s) � Select of data sources occurs through so called .boi file(s) � Presently only one data source can be selected � Version for 2001: multiple data sources – record set data – aggregated data: table meta (Cristal)

  27. Define groups � Grouping variable types are: – enumeration – integer/ float – classification � Group the data, e.g. – sex (male, female) – turnover (0..100, 101..200, 201..300) leads to 6 groups

  28. Define/select variables to observe � select variables, e.g. – TotalSalaries , NoPersons � Define derived variables, if needed, e.g. – AvgSalary := TotalSalaries / NoPersons Generate table grid � � Choose indicators to analyse: – Mean, median, MAD, min, max, variance, standard deviation, alpha-trimmed mean, cell-filling, number of records .

  29. Show aggregates in grid:

  30. Define/select variables to observe Alternative aggregate functions CV(a) = std(a) /mean(a) Std Mean(a) a CV(a) Group 1 100.2 3.4 0.034 Group 2 90.4 6.2 0.069 Define checks on data (later: using checks from Blaise) - abs(profit97- profit98)*2/(profit97+ profit98) < 0.1 --> percentage - Filling(NoPersons) < 0.3 (non-response) --> signalling flag Δ profit Mean profit97 profit98 Filling < 0.3 Group 1 234 250 2.3% Group 2 456 503 14.7%

  31. Macro level: Detection of suspicious cells Identify suspicious data in cells (outliers), which have a potential impact on survey estimates using: � Estimates of the distribution (mean,std,…) � Manual outlier detection using boolean expression variables � Automatic outlier detection using Kosinsky algorithm

  32. Outlier detection: Kosinsky � 1-D: use |x-median(x)|/MAD(x) distance to robustly detect outliers � N-D: No median defined, use Mahalanobis distance instead: � M 2 = Σ (y i - μ ) T C -1 (y i - μ ) ( (y i - μ )/ σ in 1D)

  33. Kosinski-algorithm � 1. Start with n 0 =0.1*n ‘good’ points � 2. Good points � μ � 3. Obtain all Mahalanobis distances � 4. Take the (1+ f )*n i points with the smallest distances if distance < cutoff � 5. n i+1 = (1+f)*n i � 6. Repeat until no more points added.

  34. Parameters for outlier detection: f and cutoff

  35. Grid after outlier detection

  36. Meso level: Detection of points in plots Identify suspicious data in XY plots (outliers), which have a potential impact on survey estimates using: � Manual outlier detection using boolean expression variables � Automatic outlier detection using Kosinsky

  37. Scatter plot with outliers marked blue:

  38. Define/select variables to observe Alternative aggregate functions CV(a) = std(a) /mean(a) Std Mean(a) a CV(a) Group 1 100.2 3.4 0.034 Group 2 90.4 6.2 0.069 Define checks on data (later: using checks from Blaise) - abs(profit97- profit98)*2/(profit97+ profit98) < 0.1 --> percentage - Filling(NoPersons) < 0.3 (non-response) --> signalling flag Δ profit Mean profit97 profit98 Filling < 0.3 Group 1 234 250 2.3% Group 2 456 503 14.7%

  39. Scatter plot with condition (Points not complying to the conditions are green)

  40. Micro level: record editing � Identify suspicious records either automatically or manually � Sort suspicious records according to some outlier index, e.g. mahalanobis distance � Edit records manually; worst cases first

  41. Micro editing � Directly: click on suspicious point in xy-plot and edit it – now: • copy OLEDB --> Blaise • call RunDEP(...) in BlTools.dll • copy Blaise --> OLEDB – later: • use DEC with existing IBlaiseDatabase reference

  42. MacroView and Slice: The building blocks of macroview Under the hood the macroview application consists of several � modules; these modules can and will be used as building blocks in SLICE.

  43. Slice: a decomposed macroview as an example

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend