

SLIDE 1

www.statistik.at Wir bewegen Informationen

From price collection to price data analytics

Josef Auer, Ingolf Boettcher

2017 Ottawa Group

SLIDE 2

www.statistik.at Slide 2 | 09.05.2017

Official Statistics production: Where we come from

From price collection to price data analytics - Josef Auer and Ingolf Boettcher – Statistics Austria

Diagram labels:

  • The universe (entire statistical population)
  • The statistical data (sample)
  • The statistical model (to approximate the universe)
  • Official Statistics
  • Amongst others: quality control of data input
  • =30% and =70% (percentage labels in the diagram)

SLIDE 3

Integration of large new data sources: no need for statistical models? No need for theory?

Diagram labels:

  • The universe (entire statistical population)
  • The statistical data ("big data")
  • The statistical model (if necessary...?!)
  • Official Statistics
  • Amongst others: quality control of data input
  • =30% and =70% (percentage labels in the diagram)

SLIDE 4

Integration of large new data sources: no need for statistical models? No need for theory?

Diagram labels:

  • The universe (entire statistical population)
  • The statistical data ("big data")
  • Official Statistics
  • Amongst others: quality control of data input
  • =30% and =70% (percentage labels in the diagram)

SLIDE 5

Integration of large new data sources

Quality control of scanner data and web-scraped data: new measurement methods are necessary.

  • Is it relevant?
  • Is it accurate?
  • Is it complete?

SLIDE 6

Relevance of scanner data

Quality problem (data relevance): Transaction data may contain transactions that are out of scope, e.g. expenditures for business purposes (out of scope for consumer price indices).

Measurement method: Information from data providers; otherwise unresolved.

SLIDE 7

Integration of large new data sources: Relevance

Is it relevant?

  • Large data sources do not replace basic methodological work and checks concerning:
      • Coverage bias
      • Measurement error
      • Self-selection bias

The statistical data (e.g. supermarket data for food and non-food articles)

Large data sources do not make sound statistical models obsolete.

SLIDE 8

Relevance of web-scraped data

Quality problem (data relevance): Are products offered online really sold, and by whom?

Measurement method: Information from data providers; otherwise unresolved.

SLIDE 9

Accuracy of scanner data

Quality problem (data accuracy): Volume and variety of data sets are too large to identify and clean erroneous, untrustworthy or inconsistent data sets with conventional methods.

Measurement method: The share (in %) of erroneous or inconsistent data is monitored, and the affected data are excluded.
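The monitoring described above could look roughly like the following sketch. The field names (`quantity`, `sales_eur`) and the plausibility rule are illustrative assumptions, not the checks actually used by Statistics Austria.

```python
def is_plausible(record: dict) -> bool:
    """Hypothetical consistency rule: quantity and sales must be positive numbers."""
    qty = record.get("quantity")
    sales = record.get("sales_eur")
    return (
        isinstance(qty, (int, float)) and qty > 0
        and isinstance(sales, (int, float)) and sales > 0
    )


def filter_and_report(records: list[dict]) -> tuple[list[dict], float]:
    """Return the plausible records and the percentage of records excluded."""
    kept = [r for r in records if is_plausible(r)]
    if not records:
        return kept, 0.0
    excluded_pct = 100.0 * (len(records) - len(kept)) / len(records)
    return kept, excluded_pct


# Made-up example records, loosely modelled on a scanner data delivery.
records = [
    {"quantity": 123, "sales_eur": 129.0},
    {"quantity": 255, "sales_eur": 126.0},
    {"quantity": -5, "sales_eur": 10.0},   # negative quantity: inconsistent
    {"quantity": 50, "sales_eur": None},   # missing sales value
]
kept, excluded_pct = filter_and_report(records)
print(f"kept {len(kept)} records, excluded {excluded_pct:.1f}%")
```

Tracking `excluded_pct` per delivery gives a single number that can be watched over time; a sudden jump would suggest a change in the provider's data rather than a change in prices.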

SLIDE 10

Accuracy of web-scraped data

Quality problem (data accuracy): Website content may be IP-specific: a user who frequently checks a website, or a web scraper, might be shown different prices than a first-time user.

Measurement method: Comparison of automatically and manually collected data.

SLIDE 11

Completeness of scanner data

Quality problem (data completeness): Volume and variety of data sets are too large to identify missing values with conventional methods. (Scanner data: natural attrition of unique identifiers is extremely high.)

Measurement method: Number and level of target values are measured against historical values from previous deliveries.
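A comparison against previous deliveries might be sketched as follows. The 20% tolerance, the function name and the example values are assumptions chosen for illustration only.

```python
from statistics import mean


def delivery_check(current: list[float], history: list[list[float]],
                   tolerance: float = 0.2) -> dict:
    """Compare number and mean level of values with the average of past deliveries."""
    hist_count = mean(len(d) for d in history)
    hist_level = mean(mean(d) for d in history)
    if not current:
        # An empty delivery is always suspicious; the level cannot be computed.
        return {"count_deviation": -1.0, "level_deviation": None, "suspicious": True}
    count_dev = (len(current) - hist_count) / hist_count
    level_dev = (mean(current) - hist_level) / hist_level
    return {
        "count_deviation": count_dev,
        "level_deviation": level_dev,
        "suspicious": abs(count_dev) > tolerance or abs(level_dev) > tolerance,
    }


# Two earlier deliveries with four target values each (made-up numbers).
history = [[1.0, 2.0, 3.0, 2.0], [1.0, 2.0, 3.0, 2.0]]
current = [1.0, 2.0]  # only half of the usual values arrived
result = delivery_check(current, history)
print(result)
```

The same check applies to scanner data and to web-scraped data, since both are judged against what earlier deliveries looked like.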

SLIDE 12

Completeness of web-scraped data

Quality problem (data completeness): Websites change frequently, so relevant variables and URLs might not be identified and scraped.

Measurement method: Number and level of target values are measured against historical values from previous deliveries.

SLIDE 13

Implementation of large new data sources: accuracy/completeness

Is it accurate?

The statistical data (estimate for Austrian retail market), e.g. supermarket scanner data for food and non-food:

# | Shop ID | Art. code | Art. retailer classification | Product description | Quantity sold | Sales in EUR
1 | 212 | 1234 | Soft drinks - cola | Cola, BrandX, 333ML | 123 | €129
2 | 212 | 1214 | Soft drinks - cola | Cola, light, BrandY, L | 255 | €126
… | … | … | … | … | … | …
60,000,000 | 1234 | 9965 | Bakery products | Brezel, brandZ, 500g | 50 | €126

60,000,000 data sets every month = 5,000 articles × 4 weeks × 1,000 shops × 3 retailers

Before (with manual price collection): 10,000 data sets = 100 articles × 1 (monthly collection) × 20 cities × 5 supermarkets
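The two volume figures above can be reproduced with a few lines of arithmetic. The variable names are illustrative; the factors are the ones given on the slide.

```python
# Scanner data: articles x weeks x shops x retailers
scanner_records = 5_000 * 4 * 1_000 * 3
# Manual collection: articles x monthly collection x cities x supermarkets
manual_records = 100 * 1 * 20 * 5

growth_factor = scanner_records // manual_records
print(scanner_records, manual_records, growth_factor)  # 60000000 10000 6000
```

The monthly volume therefore grows by a factor of 6,000 compared with manual price collection.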

SLIDE 14

Implementation of large new data sources: accuracy/completeness

Is it accurate?

The statistical data (e.g. supermarket data for food and non-food articles):

# | Shop ID | Art. code | Art. retailer classification | Product description | Quantity sold | Sales in EUR | Accurate & complete?
1 | 212 | 1234 | Soft drinks - cola | Cola, BrandX, 333ML | 123 | €129 | YES
2 | 212 | 1214 | Soft drinks - cola | Cola, light, BrandY, L | 255 | €126 | NO

Record 2: missing value for "Volume in litres".

Large new data sources require automation of data cleaning and quality assessment processes.
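An automated check for the missing volume could be sketched as below. The regular expression and the unit list are assumptions made for this example; real product descriptions would need a broader set of rules.

```python
import re
from typing import Optional

# Hypothetical pattern: a number directly followed by a volume unit (ML, CL or L).
VOLUME_PATTERN = re.compile(r"(\d+(?:[.,]\d+)?)\s*(ML|CL|L)\b", re.IGNORECASE)
UNITS_PER_LITRE = {"ML": 1000, "CL": 100, "L": 1}


def volume_in_litres(description: str) -> Optional[float]:
    """Extract the volume in litres, or return None if no quantity is stated."""
    match = VOLUME_PATTERN.search(description)
    if match is None:
        return None
    number = float(match.group(1).replace(",", "."))
    return number / UNITS_PER_LITRE[match.group(2).upper()]


for text in ["Cola, BrandX, 333ML", "Cola, light, BrandY, L"]:
    volume = volume_in_litres(text)
    status = "YES" if volume is not None else "NO"
    print(f"{text!r}: volume={volume}, accurate & complete? {status}")
```

The first description yields a volume, while the second carries a unit without a number and is flagged, which mirrors the YES/NO column in the table.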

SLIDE 15

Implementation of large new data sources: accuracy/completeness

Analytical approach to quality control:

1. Define measurable quality dimensions and elements of the data

2. Automate as many consistency and quality checks as possible. Examples:

  • Share (in %) of erroneous / inconsistent data is monitored and excluded
  • Average number of missing values per data set
  • Unreasonable changes in summary statistics
  • Number and level of target values measured against historical values
  • Month-to-month attrition rates (in %) in product groups

3. Ability to adapt automated processes to ever-changing data structures and sources
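As an illustration of one automated check from the list above, the month-to-month attrition rate can be computed from the article identifiers of two consecutive deliveries. The function name and the identifiers are made up for this sketch.

```python
def attrition_rate(previous_ids: set[str], current_ids: set[str]) -> float:
    """Percentage of last month's identifiers that no longer appear this month."""
    if not previous_ids:
        return 0.0
    disappeared = previous_ids - current_ids
    return 100.0 * len(disappeared) / len(previous_ids)


previous = {"1234", "1214", "9965", "7777"}
current = {"1234", "9965", "8888"}  # two articles dropped out, one is new
rate = attrition_rate(previous, current)
print(f"attrition: {rate:.1f}%")
```

Computed per product group, this rate can be compared with its own history to spot unusual breaks in a delivery.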

SLIDE 16

Implementation of large new data sources: accuracy/completeness

3. Adapt automated processes to changing data structures and sources

Diagram labels:

  • Roles: IT, CPI experts
  • Tasks: imputes, deletes, cleans, interprets, analyzes, executes, develops/writes programs, maintains, updates, integrates

SLIDE 17

Implementation of large new data sources: accuracy/completeness

3. Adapt automated processes to changing data structures and sources = Data science

Diagram labels:

  • Roles: IT, CPI experts
  • Tasks: imputes, deletes, cleans, interprets, analyzes, executes, develops/writes programs, maintains, updates, integrates

"Data science" (in price statistics): integrate, clean, analyze and process continuously changing (non-standardized) large price data sources and turn them into compliant price statistics.

SLIDE 18

Implementation of large new data sources

3. Adapt automated price index compilation processes to changing data structures and sources = Data science

Examples

Scanner data

  • Retailers continuously update database structures to suit their own data-warehouse needs
  • High attrition rate of single articles, shops and product classes

Web-scraping

  • Frequently changing website architecture and product presentation
  • High attrition rate of single articles and categories

SLIDE 19

Price index compilation with scanner data: new working steps

1. Article identification and matching

  • Automated matching
  • Manual matching

2. Plausibility check / filter / imputation

  • Deletion of implausible data sets
  • Sampling / imputation

3. Index compilation

  • Geomean of sampled price relatives
  • Retailer weighted aggregation indices
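The index compilation step can be illustrated with a short sketch: an unweighted geometric mean of price relatives per retailer (a Jevons-type elementary index), followed by a weighted aggregation across retailers. The article codes, prices and retailer weights are invented, and the arithmetic weighting is an assumption of this sketch; the slides do not specify the aggregation formula.

```python
import math


def jevons_index(base_prices: dict[str, float], current_prices: dict[str, float]) -> float:
    """Geometric mean of price relatives over matched articles (base period = 100)."""
    relatives = [
        current_prices[article] / base_prices[article]
        for article in base_prices
        if article in current_prices  # only articles present in both periods
    ]
    if not relatives:
        raise ValueError("no matched articles")
    log_mean = sum(math.log(r) for r in relatives) / len(relatives)
    return 100.0 * math.exp(log_mean)


def weighted_aggregate(indices: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted arithmetic mean of retailer indices (assumed aggregation rule)."""
    total_weight = sum(weights[name] for name in indices)
    return sum(indices[name] * weights[name] for name in indices) / total_weight


# Hypothetical sampled prices for two retailers.
retailer_indices = {
    "retailer_a": jevons_index({"1234": 1.00, "1214": 2.00}, {"1234": 1.10, "1214": 2.20}),
    "retailer_b": jevons_index({"9965": 4.00}, {"9965": 4.00}),
}
retailer_weights = {"retailer_a": 0.6, "retailer_b": 0.4}  # e.g. turnover shares
overall = weighted_aggregate(retailer_indices, retailer_weights)
print(round(retailer_indices["retailer_a"], 2), round(overall, 2))
```

Working with logarithms avoids overflow when many price relatives are multiplied, which matters at scanner data volumes.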

SLIDE 20

Price index compilation with scanner data: new strata

SLIDE 21

Price index compilation with scanner data

Timeline (calendar figure): calendar weeks (CW) 1 to 5 cover days 1 to 31 of the first month, which starts on a Wednesday and ends on a Friday; calendar weeks 6 and 7 cover days 1 to 17 of the following month, Saturday to Monday.

  • Scanner data (SD) deliveries: CW 1, CW 2, CW 3, CW 4
  • 1. Article identification, matching and mapping
  • 2. Plausibility checks etc.
  • 3. (1) HICP flash estimate + plausibility checks
  • 3. (2) HICP/CPI compilation + plausibility checks
  • HICP/CPI publication

SLIDE 22

www.statistik.at Wir bewegen Informationen

From price collection to price data analytics

Contact:

Josef Auer, josef.auer@statistik.gv.at

Ingolf Boettcher, ingolf.boettcher@statistik.gv.at