Official Statistics in the New Data Ecosystem David J. Hand Imperial - PowerPoint PPT Presentation

Official Statistics in the New Data Ecosystem
David J. Hand
Imperial College, London
March 2015

Pascal’s wager on the existence of God: “You must wager. It is not optional. You are embarked.”
The same is true for official statistics and the “new data ecosystem”. It’s out there, and one either ignores it, believing it’s inconsequential, or engages with it.
Official statisticians have to bet one way or the other.

- Ignore it and you risk becoming irrelevant
- Engage and you are leaping aboard a treadmill which is getting faster and faster
  - tomorrow’s IT will be different from today’s (think Twitter, YouTube, etc.)
  - “The future you have tomorrow won't be the same future you had yesterday.” Chuck Palahniuk
  - “In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists.” Eric Hoffer

What’s new in the world of data?
- source of data: automatic acquisition of data
- speed of acquisition of data: “streaming data”
- size of data sets: “big data”
- diversity of data
- complexity of data

Source: Modern data capture technologies
Automatic data collection:
- electronic measurements: point of sale credit card terminals, petrol pumps, contactless travel cards, phone records, emails, GPS, CCTV cameras, ...
- Internet of things
- Administrative data
- Social media data – data directly from the web
“Properties” of automatic data collection:
- immediate
- complete ???
- untouched by human hands ???

Speed: Realtime data collection – and analysis
This has several major implications, threatening the position of NSIs?
1: timeliness
The balance of timeliness against accuracy
Example: UK GDP
- 1st estimate: 44% of the data available by 25 days
- 2nd estimate: 88% by 55 days
- 3rd estimate: 85 days

What about inflation rate?
Elaborate procedure to collect sample data
Contrast with direct recording from transactions:
“Google has created a new inflation measure – the Google Price Index – based on the cost of goods sold online which could prove more accurate and up-to-date than official statistics. Google's mountain of web shopping data could also be used by the online group for economic forecasting ahead of the publication of official statistics, the Financial Times reported. Google's chief economist Hal Varian said that he is working on ‘predicting the present’ by using real-time search data to forecast official figures which often are published at least a month after the period they cover.”
http://www.theguardian.com/business/2010/oct/12/google-create-new-inflation-measure

2: new kind of analytic tools needed
“Streaming data”: the data keep on coming, like water from a hose
Permanently executing analytic tools, processing the data as it arrives
- anomalies
- changes
- summaries (trends, averages, variability, maxima, ...)
Realtime → automatic analysis
Contrast the more familiar: a fixed database
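A permanently executing tool of this kind can be sketched in a few lines. The example below is illustrative only: it uses Welford's online algorithm to maintain the mean, variability and maximum of a stream in constant memory, and flags a crude anomaly when a value is far from the running mean. The class name `RunningSummary` and the three-standard-deviation threshold are assumptions made for the sketch, not anything prescribed in the talk.

```python
import math


class RunningSummary:
    """Constant-memory summary of a numeric stream (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0              # running sum of squared deviations from the mean
        self.maximum = -math.inf

    def update(self, x):
        """Fold one new observation into the summary; the raw value is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        # Sample variance; undefined until two observations have arrived.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

    def is_anomaly(self, x, threshold=3.0):
        """Illustrative rule: x is more than `threshold` standard deviations from the mean."""
        if self.n < 2:
            return False
        sd = math.sqrt(self.variance)
        return sd > 0 and abs(x - self.mean) > threshold * sd


summary = RunningSummary()
for value in [10.0, 11.0, 9.0, 10.5, 9.5]:
    summary.update(value)
```

Each observation is touched once and then discarded, which is what distinguishes this from analysing a fixed database.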

Note: perhaps we cannot store the data once it has been processed
In that case we need to know what questions we will ask
We cannot later ask arbitrary questions, but only those that can be answered from our summary statistics
- Summarising a stream
- Subsetting a stream: sampling, but requires different approaches from classical survey sampling
- Filtering a stream: accept only those cases which meet some criterion
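One standard way to subset a stream whose length is not known in advance is reservoir sampling. The sketch below is offered as an illustration of why stream sampling differs from classical survey sampling (there is no sampling frame to draw from); it is not a method named in the talk. The function name and the even-number filter are assumptions for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # inclusive bounds: 0 <= j <= i
            if j < k:
                reservoir[j] = item       # replace with probability k / (i + 1)
    return reservoir


# Filtering and subsetting combined: sample only from cases meeting a criterion.
even_cases = (x for x in range(1000) if x % 2 == 0)
sample = reservoir_sample(even_cases, 10, random.Random(0))
```

Only the k retained items are ever held in memory, so the sample can be maintained even when the stream itself cannot be stored.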

Size: Large data sets
Census: the largest set of data collected by NSIs until relatively recently?
But now
- administrative data, register-based data: some countries (e.g. Scandinavian) ahead of the field
- transaction data
Even administrative data sets can be tiny compared with those arising from automatic data collection (e.g. social media, Google searches, Twitter messages, email transaction logs, phone logs, transport logs, ...)

Diversity of data
Survey, census, administrative, transaction, experimental, ...
Numerical tables, image, text, signal, networks, ...
Different kinds of data have different properties
- e.g. survey data: answers to the questions you choose, but slow and expensive to collect, response bias?
- e.g. transaction data: fine granularity, both spatial and temporal, immediate, but may not address the question you want; complete population coverage in principle, but rarely in practice
→ an opportunity: perhaps data of different kinds can be combined synergistically, to overcome the problems of each individual kind

Stitching different kinds of data together
- Linking
- Matching
- Merging
Technical challenges have begun to be addressed in different fields, e.g. medical: combining information from scans with traditional numeric and text files
Great potential benefit from cross-disciplinary collaboration and awareness

“survey and census data is what they say: administrative and transaction data is what they do”
New forms of data are closer to social reality

Complexity of data
Especially, networks and linked data
- Social networks
- Cybersecurity
- Fraud detection

Ping-Pong Fraud Ring Model (HBOS Data Mining Team, 2007)
Within the Mortgage Industry, in addition to individual fraudulent applications, fraud can also occur in groups often referred to as Fraud Rings. These groups can consist of applicants, brokers and/or other professionals that get together to cheat the system.
The purpose of the Ping-Pong Fraud Ring Model is to assist the Mortgage Fraud team in identifying these Fraud Rings. This is a 3 step approach:
1. Ping-Pong clustering: aggregating applications that have common names, addresses, telephone numbers, employers, etc.
2. Ranking: inconsistency rules are used to rank the clusters in order of severity.
3. Linkage analysis: identifying links between customers, brokers, employers, and/or solicitors.
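The slide does not say how the clustering step is implemented. One generic way to aggregate records that share attribute values is to treat them as a graph and find connected components, sketched here with a union-find structure. This is an assumption made for illustration, not the HBOS team's method; the field names (`name`, `address`, `phone`, `employer`) are hypothetical, and the sketch covers step 1 only (clusters are returned largest first for convenience, not ranked by inconsistency rules).

```python
from collections import defaultdict


def cluster_applications(applications, keys=("name", "address", "phone", "employer")):
    """Group applications connected, directly or indirectly, by a shared attribute value."""
    parent = list(range(len(applications)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}   # (field, value) -> index of the first application holding it
    for idx, app in enumerate(applications):
        for key in keys:
            value = app.get(key)
            if value is None:
                continue                      # missing values never link records
            if (key, value) in first_seen:
                union(first_seen[(key, value)], idx)
            else:
                first_seen[(key, value)] = idx

    clusters = defaultdict(list)
    for idx in range(len(applications)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values(), key=len, reverse=True)


applications = [
    {"name": "A", "phone": "1"},
    {"name": "B", "phone": "1", "address": "X"},
    {"name": "C", "address": "X"},
    {"name": "D", "phone": "9"},
]
rings = cluster_applications(applications)
```

Applications 0 and 2 share nothing directly but end up in the same cluster through application 1, which is the chaining behaviour a ring search needs. In practice, exact matching would have to be replaced by fuzzy matching of names and addresses.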

OTHER CRITICAL ISSUES
Data quality
“most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis” Lazer et al 2014
Caution: No data set is without potential quality issues
Role and reputation of NSIs, as standard bearers of quality

Example: Mistaken idea that admin data/transaction data has no quality issues
Quality Assurance of Administrative Data
Administrative Data Quality Assurance Toolkit
UK Statistics Authority, January 2015

Top Tips
UKSA Administrative Data Quality Assurance Toolkit, 2015
These five tips summarise the main pointers for statistical producers to develop a good understanding of the quality issues of administrative data:
- Don’t trust the safeguards. Check if safeguards are functioning effectively
- Get involved: work and share with suppliers, such as through secondments and webinars, to develop a common understanding
- Raise a red flag: identify potential data quality concerns using input and output quality indicators and investigate anomalies
- See the big picture: identify what investigations and audits have been conducted and what they found
- Corroborate the evidence: confirm the levels and the trends shown by the stats derived from the admin data

Example: Even transaction data implies a complex sociological selection process
e.g. reject inference in credit scoring

Example: Instrumental failure
Figure: wind speed in metres per second, against time

Example: Proportion of homeowner missing values vs time
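A quality indicator of this kind is straightforward to compute. The sketch below assumes, purely for illustration, that each record is a dictionary with a `month` key and a `homeowner` field that is `None` or absent when missing; neither name comes from the original data.

```python
from collections import defaultdict


def missing_rate_by_period(records, field="homeowner", period_key="month"):
    """Return {period: proportion of records whose `field` is missing}, sorted by period."""
    counts = defaultdict(lambda: [0, 0])      # period -> [missing, total]
    for record in records:
        tally = counts[record[period_key]]
        tally[1] += 1
        if record.get(field) is None:         # absent key and explicit None both count
            tally[0] += 1
    return {period: missing / total
            for period, (missing, total) in sorted(counts.items())}


records = [
    {"month": "2014-01", "homeowner": "Y"},
    {"month": "2014-01", "homeowner": None},
    {"month": "2014-02"},
    {"month": "2014-02", "homeowner": None},
]
rates = missing_rate_by_period(records)
```

A sudden step in such a series usually points to a change in the collection process rather than in the population.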

The rate of change of technology
Implications:
- for series based on a particular technology, what happens when it changes, or disappears altogether (Windows XP)
- surveys have been around forever, but Google Trends?

Statistical subtleties
The danger of “the data speak for themselves”
- selection bias
- regression to the mean
- individual behaviour vs population behaviour
Are wages improving or remaining constant?
A1: calculate the median wage at time 1 and time 2
A2: calculate the change of individuals in the population
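The two answers can disagree when the population changes between the two time points. The toy figures below are invented for illustration: one high earner leaves and one low earner joins, so the median is unchanged even though everyone who stayed received a rise.

```python
from statistics import median

# Invented wages for illustration; person "c" leaves and person "d" joins.
wages_t1 = {"a": 20, "b": 30, "c": 40}
wages_t2 = {"a": 30, "b": 35, "d": 18}

# A1: compare the population medians at the two time points.
median_change = median(wages_t2.values()) - median(wages_t1.values())

# A2: follow the individuals present at both time points.
stayers = wages_t1.keys() & wages_t2.keys()
individual_changes = {p: wages_t2[p] - wages_t1[p] for p in sorted(stayers)}
median_individual_change = median(individual_changes.values())
```

Here A1 reports no change in the median, while A2 reports rises of 10 and 5 for the two people who stayed. Both are correct answers to different questions.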

Google Flu Trends
“In February 2013, ... Nature reported that [Google Flu Trends] was predicting more than double the proportion of doctor visits for influenza-like illness than the Centers for Disease Control and Prevention, which bases its estimates on surveillance reports from laboratories across the United States ... despite the fact that GFT was built to predict CDC reports.” Lazer et al 2014

Initial version: find best matches from 50 million search terms for 1152 data points
Overfitting! → part flu detector, part winter detector
Updated, 2009: still consistently overestimated flu prevalence (100 out of 108 consecutive weeks); autocorrelated errors, etc.
Some understanding of classical time series modelling could help
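The overfitting risk of searching very many candidate predictors against few data points can be seen in a toy simulation. The sketch below is not the Google Flu Trends procedure; it uses pure noise, and far smaller numbers than 50 million terms and 1152 points (the defaults of 2000 candidates and 20 training points are arbitrary assumptions). It picks the candidate series that correlates best with the target on a training window, then checks the same series on a held-out window.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def best_spurious_predictor(n_train=20, n_test=20, n_candidates=2000, seed=0):
    """Select the best of many pure-noise predictors; return (train r, test r)."""
    rng = random.Random(seed)
    total = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(total)]
    best_r, best_series = 0.0, None
    for _ in range(n_candidates):
        series = [rng.gauss(0, 1) for _ in range(total)]   # unrelated to the target
        r = pearson(series[:n_train], target[:n_train])
        if abs(r) > abs(best_r):
            best_r, best_series = r, series
    r_test = pearson(best_series[n_train:], target[n_train:])
    return best_r, r_test


r_train, r_test = best_spurious_predictor()
```

Because the winner is chosen from thousands of candidates, its training correlation is large by construction even though no candidate carries any signal; the held-out correlation is simply whatever chance delivers.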

Correlation = 0.992
