It’s not the documents; it’s the DATA!


SLIDE 1

It’s not the documents; it’s the DATA!

Tom Johnson
Managing Director
Inst. for Analytic Journalism
Santa Fe, New Mexico USA
tom@jtjohnson.com

SLIDE 2

It’s not the documents, it’s the DATA!

Presentation at “2011 Open Government Academy,” March 26, 2011. Presented by the New Mexico Foundation for Open Government, New Mexico Press Association and New Mexico Broadcasters Association.

This PowerPoint deck and Tipsheet posted at:

http://johnson-fog.notlong.com

Licensed under a Creative Commons Attribution‐NonCommercial‐NoDerivs 3.0 Unported License.

SLIDE 3

Important point

Nothing is as important, and valuable, as a good theory!

SLIDE 4

Theory of Journalistic Process

Data In → Analysis → Info Out

  • Data = that which, upon Analysis, yields Information. “Data” has many forms.
  • Analysis = Examination of data and facts to uncover and understand cause-effect and contextual relationships and patterns, thus providing basis for problem solving and decision making.
  • Information = that which aids in making decisions

SLIDE 5

Important point

The document is not the data.

SLIDE 6

Bertillon system: Public Records DB

Early public records

  • Intricate data collection
  • Potential for error in data entry
  • Potential for error in filing
  • No machine retrieval or analysis
  • Even today, OCR would be impossible

SLIDE 7

Bertillon system: Public Records DB

By 1910…

  • Indexing system has improved
  • Typewriters instead of pen
  • Better haircuts

But still…

  • Null fields
  • Subject to data entry errors; lost or misfiled cards/data
  • Limited large-scale analysis resources

SLIDE 8

Bertillon system: Public Records DB

Early public records

  • Intricate data collection
  • Data entry potential for error
  • Filing potential for error
  • No machine retrieval or analysis
  • Even today, no OCR

By 1910…

  • Indexing system has improved
  • Typewriters instead of pen
  • Better haircuts

But still…

  • Null fields
  • Subject to data entry errors; lost or misfiled cards/data
  • Limited large-scale analysis resources

Early “hard drives,” data retrieval and data analysis of public records

SLIDE 9

Bertillon system: Public Records DB

Early public records

  • Intricate data collection
  • Data entry potential for error
  • Filing potential for error
  • No machine retrieval or analysis
  • Even today, no OCR

By 1910…

  • Indexing system has improved
  • Typewriters instead of pen
  • Better haircuts

But still…

  • Null fields
  • Subject to data entry errors; lost or misfiled cards/data
  • Limited large-scale analysis resources

Early “hard drives,” data retrieval and data analysis of public records

  • A public record, but one of limited usage
  • A DOCUMENT, but no efficient, productive, insightful way to FIND the data
  • A DOCUMENT, but no efficient, productive, insightful way to EXTRACT the data
  • Sorta like a PDF

SLIDE 10

Traditional: Data In → Analysis → Info Out

  • Notes
  • Text
  • Numeric
  • Images
  • Maps
  • How? Who?

SLIDE 11

Digital Age: Data In → Analysis → Info Out

  • Notes
  • Text
  • Numeric
  • Images
  • Charts/Graphs
  • Maps
  • Audio
  • Video
  • Atoms Bits
  • How? Who?
  • New data is ubiquitous, shareable, scalable.
  • Retrieval, copying and storage costs trivial
  • Can be validated and explored by individuals and applications

SLIDE 12

Digital Age: Data In → Analysis → Info Out

  • Notes
  • Text
  • Numeric
  • Images
  • Charts/Graphs
  • Maps
  • Audio
  • Video
  • Atoms Bits
  • How? Who?
  • All data today requires NEW tools for ANALYSIS and STORY-TELLING
  • Statutes are usually adequate; the CULTURES are the challenge.

SLIDE 13

Important point

The document is not the data.

Without analysis, the data are not the story.

SLIDE 14

Four stories

  • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail
  • Craig Harris: “Arizona pension systems a soaring burden”
  • Waite: water, developers, land use = disappearing wetlands
  • UK: Investigate Your MP’s Expenses

“We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go”

MPs’ expense claims on Google spreadsheet

SLIDE 15

Journalism and GIS

  • Steve Doig [Miami Herald] 1992

Hurricane Andrew + damage reports + building inspection = jail terms

SLIDE 16

Doig: Hurricane Andrew

SLIDE 17

Four stories

  • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail
  • Craig Harris: “Arizona pension systems a soaring burden”

SLIDE 18

Analysis with real data

Search, Sort, DB info

SLIDE 19

Four stories

  • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail
  • Craig Harris: “Arizona pension systems a soaring burden”
  • Waite: water, developers, land use = “Vanishing Wetlands”

SLIDE 20

Vanishing Wetlands

SLIDE 21

Four stories

  • Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail
  • Craig Harris: “Arizona pension systems a soaring burden”
  • Waite: water, developers, land use = disappearing wetlands
  • UK: Investigate Your MP’s Expenses

“We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go”

MPs’ expense claims on Google spreadsheet

  • EFF Seeks Cooperating FOIA Reviewers

SLIDE 22

UK MPs’ expenses

Solid search tools

These are PDFs, POST-search
SLIDE 23

Major questions?

As participants in a liberal democracy…

  • How do we get the necessary

data?

  • And from where?
  • And in appropriate forms?

23

SLIDE 24

Files, Transparency, Ease of Analysis

Easier Challenging

SLIDE 25

Files, Transparency, Ease of Analysis

SLIDE 26

Data In: Objectives/Requirements

  • Move data from “out there” to analytic site/tools
  • Looking for connections; patterns

SLIDE 27

Data In: Objectives/Requirements

  • Seeking fine-grained data, NOT aggregations
  • Seek data in original form (i.e., NO PDFs)
  • Get data in lowest common denominator format:
    • Comma-delimited files in ASCII or text
  • Who collected the data? Why? How?
  • Who proofed/edited the data? Why? How?
  • If from database, first ask for “record layout” or “code sheet” or “schema”
  • Definitions of variables or fields. Constant or ???
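
A minimal sketch of why the comma-delimited file and the record layout belong together, in Python using only the standard library. The field names, the `RECORD_LAYOUT` mapping, the `load_records` helper and the sample rows are all hypothetical, made up for illustration; a real layout would come from the agency's own code sheet.

```python
import csv
import io

# Hypothetical record layout ("code sheet"): field name -> converter.
RECORD_LAYOUT = {
    "permit_id": str,
    "issued": str,    # kept as text until the date format is confirmed
    "amount": float,
}


def load_records(text_file, layout):
    """Read comma-delimited rows and check them against the record layout."""
    reader = csv.DictReader(text_file)
    missing = set(layout) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"fields missing from file: {sorted(missing)}")
    records = []
    for row in reader:
        record = {}
        for field, convert in layout.items():
            raw = (row[field] or "").strip()
            # Keep null fields as None rather than guessing a value.
            record[field] = convert(raw) if raw else None
        records.append(record)
    return records


# Illustrative sample only; not real public-records data.
SAMPLE = "permit_id,issued,amount\nA-1,2011-03-01,125.50\nA-2,2011-03-02,\n"
records = load_records(io.StringIO(SAMPLE), RECORD_LAYOUT)

# With fine-grained rows, any aggregation can be rebuilt on demand.
total = sum(r["amount"] for r in records if r["amount"] is not None)
```

The point of the sketch: with row-level data and a known layout, totals and other aggregations can be recomputed and checked, and null fields stay visible instead of silently becoming zeros. A PDF or a pre-aggregated table allows neither.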

SLIDE 28

Data In: “Typical” problems with gov sites

Barriers to data = barriers to analysis

  • NO site search capability; no site map
  • Failure to use open-standard HTML; using closed-standard Adobe Flash/Shockwave environment.
  • Page formats/layouts not consistent; too many drill-downs instead of search-driven generators
  • Jiggly roll-overs; too much effort spent on bling
  • Impossible to download or scrape data for analysis
  • Information available only in Adobe PDF files; notoriously unfriendly to data analysis.

SLIDE 29

Good NM sites

Search! Español Feedback!

SLIDE 30

NM Legis. Bill Finder

Could be better: no way to find what bills were introduced by X legislator

Download bill in TWO formats

SLIDE 31

Data In: Challenges

  • New site in New Mexico: www.sunshineportalnm.com
  • “Beta,” but facade for taxpayers; a secondary tax because of minimal utility; torture for journos

SLIDE 32

Data In: Challenges in SunshinePort

  • Comprehensive Annual Financial Reports
    • Possible to machine download, but laborious to format for analysis
  • Investment Holdings reports are far worse
    • They are poor-quality static image files, not machine-readable.
    • Tabular data roughly formatted; makes conversion for analysis an arduous, if not impossible, task.

SLIDE 33

Bottom line on SunshinePortalNM.com

“If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.”

“This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.”

SLIDE 34

Bottom line on SunshinePortalNM.com

“If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.”

“This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.”

“A perfect example of creating the appearance of transparency without actually being transparent.”

SLIDE 35

Good data sites – Gov and NGO

  • Data.gov [A beta site] www.data.gov/
  • Metrics www.data.gov/metric
  • DataSF - http://datasf.org/
    A clearinghouse of datasets available from the City & County of San Francisco
  • San Francisco Enterprise GIS Program - http://gispub02.sfgov.org/data.asp
  • Maplight.com - an example of how citizens can use data
    Nonprofit, nonpartisan research organization, provides citizens and journalists the transparency tools to shine a light on the influence of money on politics.
  • Prize-winning gov’t agency web sites: http://www.centerdigitalgov.com/survey/88/2010

SLIDE 36

Common aspects?

  • All have up-front search capabilities
  • All are written in “data-accessible” code
  • All data can be downloaded with “relative” ease
  • Some have various languages available
  • ALL are run by GOVERNMENT; no commercial sites

SLIDE 37

Challenge for Watchdogs?

Failure on the part of planners/bureaucrats to simply…

  • Give The People THEIR Data…
  • In The Most Basic, Original, Straightforward Form…
  • And Let Them Figure Out What Should Be Done With It!
  • The governor agrees

SLIDE 38

Tomorrow?

Public Access to Original Data Impact

Why not?

SLIDE 39

It’s not the documents, it’s the DATA!

Tom Johnson
Managing Director
Inst. for Analytic Journalism
Santa Fe, New Mexico USA
tom@jtjohnson.com

Gracias a todos (Thank you, all)

SLIDE 40

It’s not the documents, it’s the DATA!

This PowerPoint deck and Tipsheet posted at:

http://johnson-fog.notlong.com

Presentation at “2011 Open Government Academy,” March 26, 2011. Presented by the New Mexico Foundation for Open Government, New Mexico Press Association and New Mexico Broadcasters Association.