SLIDE 1

Big Data (and official statistics)

Piet Daas and Mark van der Loo*

Statistics Netherlands

MSIS 2013, April 25, Paris
* With contributions of: Edwin de Jonge and Paul van den Hurk

Overview

  • What’s Big Data?
  • Definition and the 3 V’s
  • Can Big Data be used for official statistics?
  • Examples from Statistics Netherlands
  • Future challenges
  • What has to change?

SLIDE 2

Data, data everywhere!

What is Big Data?

  • According to a group of experts

Big data are data sources that can generally be described as: “high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making.”

  • According to a user

“Data so big that it becomes awkward to work with”

SLIDE 3

The 3 most important characteristics of Big Data

  • Amount (volume)
  • Rapid availability (velocity)
  • Complexity (variety): unstructured data, text

3 Big Data case studies

Can Big Data be used for official statistics?

Examples from Statistics Netherlands

  • 1. Traffic loop detection data (100 million records/day)
  • Traffic & transport statistics
  • 2. Mobile phone data (35 million records/day)
  • Daytime population, tourism
  • 3. Dutch social media messages (1~2 million messages/day)
  • Topics and sentiment

SLIDE 4

  • 1. Traffic loop detection data
  • Traffic ‘loops’
  • Every minute (24/7) the number of passing vehicles is counted by >10,000 road sensors & cameras in the Netherlands
  • Total vehicles and in different length classes
  • Interesting source to produce traffic and transport statistics (and more)
  • Huge amounts of data, about 100 million records a day

Locations
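
To make the scale of the per-minute counts concrete, here is a minimal sketch of summing minute records into a daily total per sensor. The record layout `(sensor_id, minute_of_day, count)` and the name `daily_totals` are assumptions made for illustration; the slides do not describe the actual data format.

```python
from collections import defaultdict


def daily_totals(records):
    """Sum per-minute vehicle counts into one daily total per sensor.

    `records` is an iterable of (sensor_id, minute_of_day, count) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    totals = defaultdict(int)
    for sensor_id, _minute, count in records:
        totals[sensor_id] += count
    return dict(totals)


# Toy data: two sensors, a few minutes each.
records = [("A", 0, 12), ("A", 1, 9), ("B", 0, 4), ("A", 2, 15), ("B", 2, 6)]
totals = daily_totals(records)

# About 100 million records a day, spread over 24 * 60 minutes,
# is roughly 69,000 records arriving every minute.
records_per_minute = 100_000_000 / (24 * 60)
```

Aggregating early like this keeps only one number per sensor per day, which is far easier to handle than the raw minute records.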

Number of detected vehicles on a single day

Total = ~295 million (by all loops)

SLIDE 5

Traffic loop detection activity (only first 10 min.)

Correct for missing data

  • ‘Corrected’ data (for blocks of 5 min)

Before: Total = ~295 million
After: Total = ~330 million (+12%)
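
The slides do not say which correction was applied within the 5-minute blocks. As a sketch only, the example below assumes the simplest option: a missing minute count is replaced by the mean of the observed minutes in the same block. The function names `fill_block` and `correct_series` are hypothetical.

```python
def fill_block(counts):
    """Fill missing minutes (None) in one block with the block mean.

    Assumption: mean imputation within the block. The slides do not
    state which correction method was actually used.
    """
    observed = [c for c in counts if c is not None]
    if not observed:
        return list(counts)  # nothing to base an estimate on
    mean = sum(observed) / len(observed)
    return [mean if c is None else c for c in counts]


def correct_series(minute_counts, block=5):
    """Apply fill_block to consecutive blocks of `block` minutes."""
    corrected = []
    for start in range(0, len(minute_counts), block):
        corrected.extend(fill_block(minute_counts[start:start + block]))
    return corrected


# Toy series of ten minutes with four missing values.
series = [10, None, 14, 12, None, 8, 8, None, None, 8]
corrected = correct_series(series)

# Relative increase reported on the slide: from ~295 to ~330 million.
increase = (330 - 295) / 295
```

The reported totals are consistent with each other: (330 - 295) / 295 is about 0.119, which rounds to the +12% shown.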

SLIDE 6

For different vehicle lengths

  • 1 category: Total
  • 3 categories: Total; <= 5.6 m; > 5.6 & <= 12.2 m; > 12.2 m
  • 5 categories: Total; > 1.85 & <= 2.4 m; > 2.4 & <= 5.6 m; > 5.6 & <= 11.5 m; > 11.5 & <= 12.2 m; > 12.2 m

Small vehicles: <= 5.6 m
Medium-sized vehicles: > 5.6 m & <= 12.2 m
Large vehicles: > 12.2 m
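
The three length classes can be written down as a small classification function. This is a sketch using the thresholds from the slide (5.6 m and 12.2 m); the function name `length_class` and the sample lengths are made up for illustration.

```python
from collections import Counter


def length_class(length_m):
    """Assign a vehicle to a length class using the slide's thresholds."""
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "large"


# Invented sample of vehicle lengths in metres.
lengths = [4.2, 4.5, 3.9, 7.5, 16.5, 4.8, 12.2, 5.6]
shares = Counter(length_class(x) for x in lengths)
```

Both boundaries are inclusive on the lower class, so a vehicle of exactly 5.6 m counts as small and one of exactly 12.2 m as medium.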

Small vehicles

~75% of total

SLIDE 7

Small & medium vehicles

Small, medium & large vehicles

SLIDE 8

  • 2. Mobile phone data
  • Nearly every person in the Netherlands has a mobile phone
  • Carried on them and almost always switched on!
  • An increasing number of people have a smartphone
  • Ideal source of information
  • Use mobile phone data of mobile phone companies for:
  • Travel behaviour (‘daytime’ population)
  • Tourism (new phones that register to the network)
  • Crowd info (for example during events)

Travel behaviour of mobile phones

Mobility of very active mobile phone users

  • during a 14-day period
  • data of a single mobile phone company

Based on:

  • Call and text activity multiple times a day
  • Location based on phone masts

Clearly selective:

  • Includes major cities
  • But the North and South-east of the country much less
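
One way to read "very active" is: a phone with call or text activity several times on every day of the period. The sketch below selects such users and lists the masts they were seen at. The event layout `(user_id, day_index, mast_id)`, the threshold `min_per_day`, and both function names are assumptions for illustration, not the selection rule used in the study.

```python
from collections import defaultdict


def very_active_users(events, n_days=14, min_per_day=2):
    """Return users with at least `min_per_day` events on every day.

    `events` is an iterable of (user_id, day_index, mast_id) tuples with
    day_index running from 0 to n_days - 1. Layout and threshold are
    hypothetical.
    """
    per_day = defaultdict(lambda: defaultdict(int))
    for user, day, _mast in events:
        per_day[user][day] += 1
    return {
        user
        for user, days in per_day.items()
        if all(days.get(d, 0) >= min_per_day for d in range(n_days))
    }


def masts_visited(events, users):
    """Collect the set of masts each selected user was observed at."""
    visited = defaultdict(set)
    for user, _day, mast in events:
        if user in users:
            visited[user].add(mast)
    return dict(visited)


# Toy data over a 2-day period to keep the example short.
events = [
    ("u1", 0, "m1"), ("u1", 0, "m2"), ("u1", 1, "m1"), ("u1", 1, "m3"),
    ("u2", 0, "m1"),
]
active = very_active_users(events, n_days=2, min_per_day=2)
visited = masts_visited(events, active)
```

A filter like this is itself a source of selectivity: users who call or text rarely drop out, which fits the warning above that the data are clearly selective.
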

SLIDE 9

  • 3. Social media messages
  • Dutch are very active on social media platforms
  • Almost always carried and nearly always switched on
  • More and more people have a smartphone!
  • Possible source of information on:
  • Which topics are current:
  • Number of messages and the sentiment about them
  • Can be used as a measuring instrument for:

Map by Eric Fischer (via Fast Company)

  • 3. Social media messages
  • Dutch are very active on social media platforms
  • Potential information source for:
  • 3a. Content:
  • Collected Dutch Twitter messages for study: ‘selection’ of 12 million
  • Topics discussed and sentiment on these topics (quickly available!) and probably more?
  • Investigate it to assess its potential use
  • 3b. Sentiment
  • Sentiment in Dutch social media messages: ‘all’ ~2 billion

SLIDE 10

Social media: Dutch Twitter topics

[Chart: topic shares of 12 million messages: 46%, 10%, 7%, 7%, 5%, 3%, 3%, 3%, 3%; topic labels not recoverable]

Sentiment in Social media

  • Access to Coosto database

  • ~2 billion publicly available messages
  • Twitter, Facebook, Hyves, web forums, blogs etc.
  • Sentiment of each message
  • Positive, negative or neutral
  • Interesting finding
  • Looked at so-called ‘Mood of the nation’ compared to Consumer confidence of Statistics Netherlands

SLIDE 11

Consumer confidence, survey data

Sentiment towards the economic climate, (pos – neg) as % of total

~1000 respondents/month

Sentiment in social media messages

Sentiment towards the economic climate & social media message sentiment, (pos – neg) as % of total

Corr: 0.88

~25 million messages/month
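
Both series above use the same indicator, (pos – neg) as a percentage of the total, and are compared with a correlation coefficient. The sketch below shows that calculation with a plain Pearson correlation. The function names and the monthly figures are invented for illustration and do not reproduce the reported 0.88.

```python
from math import sqrt


def net_sentiment(pos, neg, total):
    """(pos - neg) as a percentage of all messages or respondents."""
    return 100.0 * (pos - neg) / total


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented monthly (positive, negative, total) counts for four months.
survey = [(300, 500, 1000), (320, 480, 1000), (280, 540, 1000), (350, 450, 1000)]
social = [(9.0e6, 7.5e6, 25e6), (9.2e6, 7.4e6, 25e6),
          (8.8e6, 7.9e6, 25e6), (9.6e6, 7.1e6, 25e6)]

survey_index = [net_sentiment(*month) for month in survey]
social_index = [net_sentiment(*month) for month in social]
corr = pearson(survey_index, social_index)
```

Because neutral messages count towards the total but not the numerator, a month with many neutral messages pulls the indicator towards zero.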

SLIDE 12

Challenges: Big Data and statistics

  • Legal
  • Is access routinely allowed (not only for research)?

  • Privacy
  • With more and more data, privacy demands increase
  • We have to be careful here!
  • Costs
  • In the Netherlands we don’t pay for admin data.
  • Should we pay for Big Data?
  • Manage
  • Who owns the data? Stability of delivery/source
  • Because of its volume, run queries in the database of the data source holder

Challenges: Big Data and statistics (2)

  • Methodological
  • Big data sources register events, not units, and they are selective!
  • Methods & models specific for large datasets (fast and ‘robust’)
  • Try to ‘make big data small’ ASAP (noise reduction)
  • Technological
  • Learn from ‘computational statistical’ research areas
  • High Performance Computing needs, parallel processing
  • People
  • Need ‘data scientists’ (statistically minded people with programming skills who are curious)
  • Who are able to think outside the traditional sample-survey-based paradigm!
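
The point that Big Data sources register events, not units, can be shown with a small count. The sketch below contrasts the number of messages (events) with the number of distinct accounts (units). The `(account, message)` layout and the function name `events_per_unit` are assumptions for illustration.

```python
from collections import Counter


def events_per_unit(events):
    """Count how many events each unit (here: an account) produced.

    `events` is an iterable of (unit_id, payload) pairs; the layout is
    hypothetical.
    """
    return Counter(unit for unit, _payload in events)


# Toy data: six messages written by three accounts.
events = [("u1", "a"), ("u1", "b"), ("u1", "c"),
          ("u2", "d"), ("u3", "e"), ("u1", "f")]
counts = events_per_unit(events)

n_events = sum(counts.values())   # number of messages
n_units = len(counts)             # number of distinct accounts
top_share = counts.most_common(1)[0][1] / n_events
```

In this toy example one account produces four of the six messages, so a statistic computed over messages mostly reflects that single account rather than the three units behind the data.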

SLIDE 13

The future of Stat Neth?