Macro-selection and micro-editing: a Statistics Netherlands prototype
Wim Hacking, Roger Lemmens
Statistics Netherlands
Problems with data editing (read: challenges)
- Less funds supplied
- More electronic input
- More output demanded
--> More efficient data editing
1. Electronic questionnaires with checks / feedback
2. Automatic corrections / imputations
3. Meso analysis for:
   - manual edits only when they have an effect at the aggregate level
   - verifying the results of step 1

MacroView
[Diagram: microdata are aggregated into macrodata; combined aggregate + plausibility functions flag suspicious aggregates; the corresponding microdata are then selected as a subset, visualized, and edited]
MacroView POC
- Text-based (for now)
- Must be flexible
- For prototyping
- Final version must be usable for different statistics
- Input from & application to:
- Transport
- Production
DATAMODEL MyData
  PRIMARY ID
  WEIGHT TheWeight
  FIELDS
    ID                : integer
    InterviewId       : integer
    TransportedWeight : real, ifmissing(0.0)
    TransportType     : integer, ifmissing(0)
    Distance          : real
    TheWeight         : real
    Quarter           : integer
    NstrCode          : integer
  RULES
    IF Type = 'X' THEN
      Distance < 2000 "Distance for type X should not exceed 2000"
    ENDIF
ENDMODEL
Specification of microdata
DATAMODEL MyAggregateData
  AGGREGATEBY
    AggDef = Quarter * NstrCode
  FILTERBY
    Filter1 = 'Profit > 1000'
  FIELDS
    Quarter          : integer
    NSTRCode         : integer
    Average_Distance : real
    Sum_Weights      : real
    ...
ENDMODEL
Specification of macrodata
Quarter | NSTR | Average_Distance | Sum_Weights
   1    |  1   |       ...        |     ...
   1    |  2   |       ...        |     ...
DATAMODEL MyAggregateData
  AGGREGATEBY
    AggDef  = Quarter * NstrCode
    AggDef2 = Quarter
  FILTERBY
    Filter1 = 'Profit > 1000'
    Filter2 = 'Profit <= 1000'
  FIELDS
    Quarter          : integer
    NSTRCode         : integer
    Average_Distance : real
    Sum_Weights      : real
    ...
ENDMODEL
Specification of macrodata
AggDef2 (Quarter only):

Quarter | Average_Distance | Sum_Weights
   1    |       ...        |     ...
   2    |       ...        |     ...

AggDef (Quarter * NSTR):

Quarter | NSTR | Average_Distance | Sum_Weights
   1    |  1   |       ...        |     ...
   1    |  2   |       ...        |     ...
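Outside the DSL, the same aggregation can be sketched roughly as follows (pandas, with illustrative toy values; column names follow the specification):

```python
import pandas as pd

# Toy microdata with the fields used in the specifications (illustrative values)
micro = pd.DataFrame({
    "Quarter":   [1, 1, 2],
    "NstrCode":  [1, 2, 1],
    "Distance":  [100.0, 200.0, 150.0],
    "TheWeight": [1.0, 2.0, 1.0],
    "Profit":    [1500, 900, 2000],
})

filtered = micro.query("Profit > 1000")          # FILTERBY Filter1

# AggDef = Quarter * NstrCode
agg = (filtered.groupby(["Quarter", "NstrCode"])
               .agg(Average_Distance=("Distance", "mean"),
                    Sum_Weights=("TheWeight", "sum"))
               .reset_index())

# AggDef2 = Quarter
agg2 = (filtered.groupby("Quarter")
                .agg(Average_Distance=("Distance", "mean"))
                .reset_index())
```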
Specification of aggregate calculation: how to detect anomalies?
1. Distribution properties of the micro data, e.g. its variance.
2. Processing properties:
   - % item non-response
   - % previously automatically imputed values in field X
3. Plausibility functions, e.g. the relative change between weighted t-1 and t data:
    Δ = ( Σ_i w_i,t V_i,t − Σ_i w_i,t-1 V_i,t-1 ) / Σ_i w_i,t-1 V_i,t-1
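A minimal sketch of this plausibility function (NumPy; the argument names are hypothetical):

```python
import numpy as np

def weighted_relative_change(w_t, v_t, w_prev, v_prev):
    """Relative change of the weighted total between t-1 and t:
    (sum(w_t * v_t) - sum(w_prev * v_prev)) / sum(w_prev * v_prev)."""
    total_t = np.sum(np.asarray(w_t) * np.asarray(v_t))
    total_prev = np.sum(np.asarray(w_prev) * np.asarray(v_prev))
    return (total_t - total_prev) / total_prev

# Weighted total grows from 100 to 150 --> delta = 0.5
print(weighted_relative_change([1, 1], [70, 80], [1, 1], [40, 60]))  # prints 0.5
```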
Comparing aggregate values
Quarter | Average_Distance(t-1) | Average_Distance(t) | Delta | CV_Distance
   1    |         2000          |        3500         | 0.75  |    4.5
   2    |         1500          |         ...         |  ...  |    2.4
   3    |         2300          |        2100         | 0.09  |    1.6
   4    |         1700          |         ...         | 0.14  |    ...
Aggregate MyAgg
  INPUT
    CY     = MyData
    LY_Agg = MyAggregateData
  OUTPUT
    Outputagg = MyAggregatedDataCompared
  cells
    Difference := abs(LY_Agg.average_distance - AVG(CY.Distance)) / LY_Agg.average_distance;
    Med_Dist   := Median(CY.Distance);
  Cellcompare Aggdef2:
    ratio := average_distance(1) / average_distance(2) > 2.0
      "There should be more transport in the first quarter!";
EndAggregate
Specification of an aggregate
    Δ = ( D_t − D_{t-1} ) / D_{t-1}

Quarter | Average_Distance | Sum_Weights
   1    |      1.91        |     ...
   2    |      0.89        |     ...
  ...   |       ...        |     ...
Grid Kerncel_Grid
  input = MyAggregateData
  rules
    color(average_distance, difference, 0, 30.0, Green, Red)
EndGrid
Specification of a grid
NstrCode * Quarter
Plot MicroPlot
  input   = MyData
  type    = scatter
  XColumn = Distance
  YColumn = TransportedWeight
  ZColumn = TheWeight
  mincolorvalue = 1.0
  maxcolorvalue = 500.0
  color   = (Red, Blue, Green)
  TooltipText = 'Weight=$THEWEIGHT$'
endplot
Specification of a plot
Proces Agg_LY(Data_LY, DataMacro_LY)
  Agg_LY.Done -> MyAgg(Data, DataMacro_LY, DataMacro)
  MyAgg.Done  -> comparegrid1(av.outputdata) AT Demo POSITION Top Label 'DataCompared'
  comparegrid1.RecordsSelected -> MicroPlot(comparegrid1.SelectedRecords) AT Demo POSITION Tab Label 'SelectedRecords'
  comparegrid1.RecordsSelected -> Details2(comparegrid1.SelectedRecords) AT Demo POSITION Left Label 'MicroData'
  MicroPlot.RecordsSelected -> MyEdit(MicroPlot.SelectedRecords)
  MyEdit.RecordEdited -> MyAgg(Data, DataMacro_LY, DataMacro)
Endproces
Specification of the process
Specification of the process (2)
Micro-editing
MacroView: currently
- Testing with real data
- Road transport data (~demo)
- Production statistics
- Generating new ideas for the POC
- Investigating the possible integration with Blaise
- Data layer
- Syntax
- Making the POC more robust
- Better checking of illegal syntax
A demo…
Questions ?
Grid with aggregate + coloring
Plots with coloring / zoom / selection
Idea:
To do / questions:
Get started with real data at V&V and meso-level data cleaning ("Mesogaafmaken"). Next: where should the definitive version be built? Within a project or with Blaise?
Select data source(s)
- Selection of data sources occurs through so-called .boi file(s)
- Presently only one data source can be selected
- Version for 2001: multiple data sources
  - record set data
  - aggregated data: table meta (Cristal)
Define groups
Grouping variable types are:
  - enumeration
  - integer / float
  - classification
Group the data, e.g.
  - sex (male, female)
  - turnover (0..100, 101..200, 201..300)
  --> leads to 6 groups
Select variables, e.g.
  - TotalSalaries, NoPersons
Define derived variables, if needed, e.g.
  - AvgSalary := TotalSalaries / NoPersons
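The grouping and derived-variable steps above can be sketched as follows (pandas, with hypothetical data; the sex and turnover classes give 2 x 3 = up to 6 groups):

```python
import pandas as pd

# Hypothetical microdata
data = pd.DataFrame({
    "Sex":           ["male", "female", "male", "female"],
    "Turnover":      [50, 150, 250, 90],
    "TotalSalaries": [1000.0, 2400.0, 3000.0, 800.0],
    "NoPersons":     [10, 20, 30, 8],
})

# 2 sex groups x 3 turnover classes --> up to 6 groups
data["TurnoverClass"] = pd.cut(data["Turnover"], bins=[0, 100, 200, 300])

# Derived variable: AvgSalary := TotalSalaries / NoPersons
data["AvgSalary"] = data["TotalSalaries"] / data["NoPersons"]

# Aggregate the derived variable per group
groups = data.groupby(["Sex", "TurnoverClass"], observed=True)["AvgSalary"].mean()
```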
Generate table grid
- Choose indicators to analyse:
  - mean, median, MAD, min, max, variance, standard deviation, alpha-trimmed mean, cell-filling, number of records
Define/select variables to observe
Show aggregates in grid:
        | Mean(a) | Std(a) | CV(a)
Group 1 |  100.2  |   3.4  | 0.034
Group 2 |   90.4  |   6.2  | 0.069
CV(a) = std(a) /mean(a)
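The CV indicator can be sketched as follows (Python's statistics module; the sample standard deviation is assumed here):

```python
import statistics

def cv(values):
    """Coefficient of variation: std(a) / mean(a)."""
    return statistics.stdev(values) / statistics.mean(values)

# A group whose values vary little has a small CV
print(cv([98.0, 100.0, 102.0]))  # prints 0.02
```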
Mean    | profit97 | profit98 | Δ profit
Group 1 |   234    |   250    |   2.3%
Group 2 |   456    |   503    |  14.7%
- abs(profit97 - profit98) * 2 / (profit97 + profit98) < 0.1 --> percentage
- Filling(NoPersons) < 0.3 (non-response) --> signalling flag
Alternative aggregate functions
Filling < 0.3
Define checks on data (later: using checks from Blaise)
Macro level: Detection of suspicious cells
Identify suspicious data in cells (outliers) which have a potential impact on survey estimates, using:
- estimates of the distribution (mean, std, ...)
- manual outlier detection using boolean expression variables
- automatic outlier detection using the Kosinski algorithm
Outlier detection: Kosinski
- 1-D: use |x - median(x)| / MAD(x) as a robust distance to detect outliers
- N-D: no median defined; use the Mahalanobis distance instead:

    M_i^2 = (y_i - μ)^T C^-1 (y_i - μ)     (reduces to ((y_i - μ) / σ)^2 in 1-D)
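These two distances can be sketched as follows (NumPy; function names are our own):

```python
import numpy as np

def mad_distance(x):
    """1-D robust distance: |x - median(x)| / MAD(x)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) / mad

def mahalanobis_sq(y, mu, cov):
    """Squared Mahalanobis distance (y_i - mu)^T C^-1 (y_i - mu) per row of y."""
    diff = np.asarray(y, dtype=float) - mu
    inv_cov = np.linalg.inv(cov)
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# In 1-D with unit variance this reduces to ((y - mu) / sigma)^2
d2 = mahalanobis_sq([[3.0]], mu=np.array([1.0]), cov=np.array([[1.0]]))
print(d2)  # prints [4.]
```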
Kosinski algorithm
1. Start with n_0 = 0.1·n 'good' points
2. Compute μ and C from the good points
3. Obtain the Mahalanobis distances of all points
4. Take the (1+f)·n_i points with the smallest distances, if distance < cutoff
5. n_{i+1} = (1+f)·n_i
6. Repeat until no more points are added.

Parameters for outlier detection: f and cutoff
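The steps above can be sketched as a forward search (NumPy; the parameter values f and cutoff, the starting subset, and the test data are illustrative, not the published algorithm's exact choices):

```python
import numpy as np

def kosinski_outliers(y, f=0.2, cutoff=30.0):
    """Grow a 'good' subset of points; points never absorbed are outliers."""
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    # 1. start with the n0 = 0.1*n points closest to the coordinate-wise median
    order = np.argsort(np.abs(y - np.median(y, axis=0)).sum(axis=1))
    good = order[:max(d + 1, int(0.1 * n))]
    while True:
        # 2.-3. mean/covariance of the good points, Mahalanobis distances of all
        mu = y[good].mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(y[good], rowvar=False))
        diff = y - mu
        d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        # 4.-5. keep the (1+f)*ni smallest distances that are below the cutoff
        target = min(n, int(np.ceil((1 + f) * len(good))))
        candidates = np.argsort(d2)[:target]
        new_good = candidates[d2[candidates] < cutoff]
        # 6. repeat until no more points are added
        if len(new_good) <= len(good):
            return np.setdiff1d(np.arange(n), new_good)
        good = new_good

# 200 inliers plus three gross outliers; indices 200-202 should be flagged
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                 [[8.0, 8.0], [9.0, -9.0], [-8.0, 9.0]]])
out = kosinski_outliers(pts)
```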
Grid after outlier detection
Meso level: Detection of points in plots
Identify suspicious data in XY plots (outliers) which have a potential impact on survey estimates, using:
- manual outlier detection using boolean expression variables
- automatic outlier detection using Kosinski

Scatter plot with outliers marked blue:
Scatter plot with condition
(Points not complying with the conditions are shown in green)
Micro level: record editing
- Identify suspicious records, either automatically or manually
- Sort suspicious records according to some outlier index, e.g. the Mahalanobis distance
- Edit records manually; worst cases first
Micro editing
Directly: click on a suspicious point in the xy-plot and edit it
- now:
  - copy OLEDB --> Blaise
  - call RunDEP(...) in BlTools.dll
  - copy Blaise --> OLEDB
- later:
  - use DEC with existing IBlaiseDatabase reference
MacroView and Slice:
The building blocks of MacroView
- Under the hood the MacroView application consists of several modules; these modules can and will be used as building blocks in SLICE.
Slice: a decomposed MacroView as an example

Slice
- Overall framework for editing/imputation and display modules, based on COM technology
- Data source independency: Slice record sets
- Small building blocks with just one function: (COM) idea, LEGO bricks: flexibility
- Data exchange between blocks based on Slice records; data description uses Blaise meta
- Usable in many environments because of COM: VB, VC++, Delphi, Manipula, ...
- Data from / to: Blaise, OLEDB databases, ...
- Currently available: CherryPi and Imputation
Future of MacroView (1):
- New aggregate functions for demography changes:
  - import/export(data source 1, data source 2)

Mean    | Year98 | Year99
Group 1 | 100.2  | 140.2
Group 2 |  90.4  |  88.3

Import(..) 4
Future of MacroView (2):
- A more intuitive dialog for the definition of integer/float (time/date) variable types
- Manipulation of defined ranges
Future of MacroView (3):
- More data sources:
  - all data sources (Di) must contain the grouping fields (G)
  - other data sources with detail records: join fields (J) must be supplied
  - in case of t-1, t-2, ...: often just aggregates; table meta: Cristal

[Diagram: data sources D1 and D2, linked via grouping fields G and join fields J, with variables V1, V2 and groups G1 ... Gn]
Future of MacroView (4):
- Plots of data at the aggregated level, e.g. a histogram of average profit for some or all groups
- More than one aggregate function active in the grid

Mean    | Year98 | Year99
Group 1 | 100.2  | 140.2
Group n |  90.4  |  88.3

        | Std(Year98) | Mean(Year98) | Mean(Year99)
Group 1 |    10.2     |    100.2     |    140.2
Group n |     9.4     |     90.4     |     88.3
Future of MacroView (5):
- Allow classification variables as grouping variables (SBI, ...)
- Box-whisker, ..., plots
- Integration with Blaise:
  - MacroView as a whole part of the Blaise shell
  - Aggregation and robust outlier detection (Kosinski) as Slice modules, available via Manipula
Slice in more detail
Modules: Data Source, Grid View, Scatter plot, Aggregate, Record editing
Slice modules:
- Receive and send so-called Slice records
- Have no knowledge of the origin or destination of the records
- Have (in principle) one function
- Most modules will be non-interactive

[Diagram: a module with input recordset(s) and output recordset(s), controlled by settings or parameters]
- Derived variables can also be conditions, e.g.
  - Filling(NoPersons) > 0.3 (non-response) --> signalling flag
  - abs(profit97 - profit98) * 2 / (profit97 + profit98) < 0.1 --> percentage
  - alternative aggregate function: std(a)/mean(a) or just mean(a)
Storage
- Storage is done using the IRegister component in BlRegA.dll
- Undo/Redo ??
Blaise OLEDB Information (.boi) file
- Reference to a Blaise meta (.bmi) file
- Reference to an OLEDB provider and table, e.g.
  - provider = C:\MyDatabase.mdb (Access), table = Customer
- Mapping between OLEDB fields and Blaise fields:
  - Customer.Name (String) <--> Name (STRING[20])
  - Customer.Type (Integer) <--> Type (AgeClass : (Rich (1), Medium (2), Poor (3)))
SiAggregate
[Diagram: data source (.boi) with fields Income, Age, ...; grouping via GetQuery with Integer Grouping, Enum Grouping, and Classification Grouping]

SiAggregate
[Diagram: internal object model: Data, Stratum, StratumDef, Cells, VarRanges, Range, Var, Aggregates]
Slice records:
- Database-independent and simple recordset
- Records can come from / go to any OLEDB or Blaise source
- Will handle the distribution to multiple modules
- Will take care of buffering of records, if necessary

[Diagram: the internal structure of a Slice record: Recordset, RecordDef, Fields, FieldDefs]
Using the Slice modules:
- Both the modules and the Slice recordset can be used from any programming language that supports the use of COM
- Modules under construction:
  - CherryPi: automatic error detection
  - SiImputeXXX: several imputation modules

Near future:
- COM object for the Slice records
- Modules: CherryPi and SiRegressionImputation

Future:
- More modules (aggregation, more imputation methods, ...)
- Visual prototyping tool
Visual inspection methods:
- Show Mahalanobis distances for a selected recordset
- Show outliers (based on the Kosinski algorithm) in a datagrid or in a scatterplot
- Mark outliers by hand in a scatterplot
- Clicking outlier point(s) allows editing of the record (using the Blaise DEP)
- Visual presentation:
  - flagged aggregates
  - list of sorted records
  - scatter plot
  - histogram
An example: the CherryPi module
- Input: a recordset and a set of edit rules
- Output: a copy of the input recordset with some fields marked 'erroneous'
- Based on edit rules, e.g.
    profit + costs = turnover
    costs - 0.5 * turnover > 0
- Automatic localization of erroneous records
- Localize the faulty item(s) in an erroneous record, e.g. with the algorithm as implemented by CherryPi (based on the Fellegi-Holt paradigm)
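The edit-rule check itself can be sketched as follows (hypothetical record layout; the rules are the two examples above, while the actual Fellegi-Holt error localization in CherryPi is considerably more involved):

```python
# Each edit rule returns True when the record passes it
edits = {
    "profit + costs = turnover":
        lambda r: abs(r["profit"] + r["costs"] - r["turnover"]) < 1e-9,
    "costs - 0.5 * turnover > 0":
        lambda r: r["costs"] - 0.5 * r["turnover"] > 0,
}

def failed_edits(record):
    """Return the names of the edit rules this record violates."""
    return [name for name, check in edits.items() if not check(record)]

# 100 + 400 != 600, so only the balance edit fails
record = {"profit": 100.0, "costs": 400.0, "turnover": 600.0}
print(failed_edits(record))  # prints ['profit + costs = turnover']
```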
Multiple datasources (e.g.: t, t-1)
           | Situation 1 | Situation 2 | Situation 3
Coupled    | fully       | partially   | not
Grouping   | yes         | yes         | yes
Aggregated | no          | no          | yes
Remark     |             |             | Cristal
Cristal:
- Dimension = variable
- Observation item = variable + aggregation method
- Hierarchy = variable + range:
  - classification item = range
- Data point = 1 cell in the grid