Social Data Science - David Dreyer Lassen, UCPH ECON



SLIDE 1

Social Data Science

David Dreyer Lassen, UCPH ECON
September 24, 2015

SLIDE 2

“In God we trust, all others must bring data”

W. Edwards Deming

Different types of data

SLIDE 3

Today:
Empirical design
data generating process
modes of collection
strategic data provision

SLIDE 4

roadmap

  • Different data for different questions
  • Theory and empirics, forecasting and hypothesis testing
  • Effects of causes vs. Causes of effects
  • Data generating process
  • Modes of data collection – pros and cons
  • Strategic data management and data production

SLIDE 5

Different data for different questions
or
Different questions for different data

Sometimes possible to separate the data collection process from the underlying data generating process, and sometimes not

Fundamental difference between what people do and what they say they do:
‘cheap talk’ / ‘put your money where your mouth is’ / honest/costly signaling

SLIDE 6

roadmap

  • Different data for different questions
  • Theory and empirics, forecasting and hypothesis testing
  • Effects of causes vs. Causes of effects
  • Data generating process
  • Modes of data collection – pros and cons
  • Strategic data management and data production

SLIDE 7

What is your question, again?

  • 1. Research question from theory
  • 2. Ideal empirical design
  • 3. Feasible empirical design / collection
  • 4. Results
  • 5. Adjustment of theory/question/design
  • 6. New results
  • 7. …

  • A. What data do we have
  • B. What question can they answer
  • C. Research question
  • D. Results

SLIDE 8

All models are wrong – but some are useful

George Box

Two key goals

  • 1. Forecasting: individual behavior, policy consequences, voting, Champions League, …
    Data science / machine learning (but also macroeconomics)
  • 2. Hypothesis testing, derived from theory
    ‘Traditional’ social science

SLIDE 9

1. Forecasting

  • Example: Bank wants to forecast non-payment on loans (P_d: probability of default)
  • Couldn’t care less about theory
  • Rough “Data Science”: try to predict from all available data
  • Suppose we find that birth weight predicts default
    – Bank is happy, better fit (defer ethics etc.)
    – Policy: does investing in pre-natal care reduce defaults?
  • In practice: set of predictors taken from (some) theory, even if casual
  • Complications: if customers know that P_d depends on birth weight, would/should they disclose it? What if loans only to disclosers? Would they tell the truth?

SLIDE 10

2. Hypothesis testing

  • Theory (rational choice, sociology, biology, common sense, …) posits effect of X on Y
  • A. Selection/type theory: People who are impatient cannot defer immediate pleasures -> smoke and drink while pregnant -> give birth sooner. If impatient parents -> impatient children (whether by nature or nurture), we have an explanation.
  • B. Biological theory: low birth weight affects brain development and neurological wiring for patience.
  • If (A), little role for policy; also, both can be true at same time
  • How to distinguish: exogenous shock to birth weight, but ethically tricky ...

SLIDE 11

Goodhart’s law

  • Most popular: “When a measure becomes a target, it ceases to be a good measure.”
  • What he wrote: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

SLIDE 12

Case of Google Flu

  • Google Flu: web searches for flu symptoms predicted actual flu cases
  • By-product of Google’s main service
  • But from 2010, not so well: overestimated actual flu cases, partly as result of autosuggest feature, partly because model was overfitted (we’ll return to that)
  • Best predictor: number of cases past week
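The point that last week's case count is hard to beat can be illustrated with a persistence baseline. The sketch below is a minimal illustration on synthetic data, not the Google Flu model: the AR(1) process, its parameters, and all function names are assumptions made for the example.

```python
import random
import statistics

def simulate_weekly_cases(n_weeks, rng):
    # Hypothetical persistent series: an AR(1) process around 100 cases per week.
    cases = [100.0]
    for _ in range(n_weeks - 1):
        cases.append(100.0 + 0.9 * (cases[-1] - 100.0) + rng.gauss(0.0, 5.0))
    return cases

def mean_abs_error(forecasts, actuals):
    return statistics.fmean([abs(f - a) for f, a in zip(forecasts, actuals)])

cases = simulate_weekly_cases(5_000, random.Random(3))
actual = cases[1:]

# Forecast 1: this week's cases equal last week's cases (persistence).
persistence = cases[:-1]
# Forecast 2: always predict the overall average (uses full-sample information).
overall_mean = [statistics.fmean(cases)] * len(actual)

mae_persistence = mean_abs_error(persistence, actual)
mae_mean = mean_abs_error(overall_mean, actual)
print("MAE persistence:", round(mae_persistence, 2))
print("MAE overall mean:", round(mae_mean, 2))
```

When a series is as persistent as this one, any more elaborate predictor has to beat the persistence forecast to be worth using, which is why it is a natural benchmark.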

SLIDE 13

roadmap

  • Different data for different questions
  • Theory and empirics, forecasting and hypothesis testing
  • Effects of causes vs. Causes of effects
  • Data generating process
  • Modes of data collection – pros and cons
  • Strategic data management and data production

SLIDE 14

Effects of causes vs. Causes of effects

Different questions

  • Effects of causes: intervention, what is effect of policy X on outcome Y
  • Causes of effects: Why does Z occur?

SLIDE 15

Effects of causes (forward causal questions)

  • Narrow questions, sometimes (but not always) policy interventions
    – Effect of tax change on behavior
    – Effect of regulation on risk taking
    – Effect of schooling on earnings
    – Effect of smoking on lung cancer propensity
    – Effect of public health on schooling in Africa
    – …
  • Often, but not always, amenable to treatments/randomization/experimentation

SLIDE 16

Causes of effects (reverse causal inference)

  • Much harder, but often more interesting
    – Why do some people smoke?
    – What are the causes of democratization?
    – Why do some people pursue a PhD while others drop out after primary school?
    – Why did Greece (almost) go bankrupt?
  • Tensions with “effects of causes” – search for causes sometimes derided as ‘party chatter’

SLIDE 17

roadmap

  • Different data for different questions
  • Theory and empirics, forecasting and hypothesis testing
  • Effects of causes vs. Causes of effects
  • Data generating process
  • Modes of data collection – pros and cons
  • Strategic data management and data production

SLIDE 18

Data generating process

What is the data generating process?

Observational: endogenous decisions, researcher passive collector of data
Randomization: treatment-control
(Some) exogeneity: policy interventions, sometimes with comparisons, researchers sometimes involved

Important: more data does not give a better result if the estimator is biased (only a more precise estimate of the wrong number)
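The warning about biased estimators and sample size can be made concrete with a small simulation. This is a minimal sketch under assumed numbers: the population, the response rule (above-average units respond twice as often), and all names are hypothetical.

```python
import random
import statistics

TRUE_MEAN = 100.0  # hypothetical population mean (arbitrary units)

def draw_population(n, rng):
    return [rng.gauss(TRUE_MEAN, 20.0) for _ in range(n)]

def selective_sample(population, rng):
    # Assumed response rule: units above the mean respond with probability 0.8,
    # everyone else with probability 0.4, so the sample over-represents high values.
    return [x for x in population
            if rng.random() < (0.8 if x > TRUE_MEAN else 0.4)]

rng = random.Random(0)
results = {}
for n in (1_000, 200_000):
    population = draw_population(n, rng)
    full_mean = statistics.fmean(population)
    selected_mean = statistics.fmean(selective_sample(population, rng))
    results[n] = (full_mean, selected_mean)
    print(n, round(full_mean, 2), round(selected_mean, 2))
```

In this setup the mean of the full draw settles near the true value as n grows, while the mean of the selective sample settles just as firmly on a value that is too high. More observations shrink the noise, not the bias.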

SLIDE 19

Randomized experiments

  • Distinguish
    – Lab experiments: traditionally computer-based in econ, but also eye tracking/brain images (fMRI)/physiological
    – Survey experiments: assign survey respondents to different frames/treatments/primings, e.g. have SocDems and Liberals say same thing and look at support
    – Field experiments: experimental control in the real world (e.g. banks charging different rates to learn about mobility of customers; interventions against teacher absenteeism in India; …)

SLIDE 20

Randomized experiments

  • Distinguish
    – Natural experiments (weather induced: effects of poverty on violence, randomization of names on election ballots, …)
    – Quasi-experiments (effects of change in policy; effect of tax reform on tax planning; effect of immigrant allocation on crime)
  • Throughout: exogenous (outside of the individual) change

SLIDE 21

Randomized experiments

  • Large, important current debate in (development) economics
  • EofC: what are effects of penalties on teachers’ absence in Indian village schools – evidence from randomized experiments
  • Randomly selected teachers get harsh penalty for no-shows -> difference in absenteeism is the causal effect of penalty
  • (Broader CofE Q: why is education sector in rural India so inefficient?)
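The logic of the teacher experiment, random assignment followed by a comparison of means, can be sketched in a few lines. The numbers are invented for illustration (a penalty that cuts absence by two days, a coin-flip assignment); they are not results from the Indian field experiments.

```python
import random
import statistics

TRUE_EFFECT = -2.0  # hypothetical: the penalty reduces absence by 2 days per term

def simulate_experiment(n_teachers, rng):
    treated, control = [], []
    for _ in range(n_teachers):
        baseline = rng.gauss(10.0, 3.0)      # absence days without a penalty
        gets_penalty = rng.random() < 0.5    # coin-flip assignment, unrelated to baseline
        absence = baseline + (TRUE_EFFECT if gets_penalty else 0.0)
        (treated if gets_penalty else control).append(absence)
    # Difference in mean absence between penalized and non-penalized teachers.
    return statistics.fmean(treated) - statistics.fmean(control)

estimate = simulate_experiment(20_000, random.Random(1))
print("estimated effect of penalty:", round(estimate, 2))
```

Because assignment is independent of each teacher's baseline absence, the two groups differ on average only in the penalty, so the difference in means recovers the assumed effect up to sampling noise.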

SLIDE 22

Randomized experiments

  • Strong on internal validity: from randomization any effect on absenteeism is from harsher penalties; good for testing theory
  • Weak(er) on external validity – would effect be similar in Africa? Would effect from lab work outside lab? Why, why not?
  • (compare: medicine works in similar ways across locations)

SLIDE 23

Randomized experiments

  • Challenges
    – Limits to what can be studied by experimentation (ethics; law; feasibility)
    – Funding (field experiments expensive, survey exp less so)
    – Often participation constraint – voluntary participants’ gain >= 0 or no incentive
    – Subjects leave for various (systematic) reasons
    – Large-scale randomization can be hard in field experiments

SLIDE 24

Observational data

  • Generated without experimental or exogenous intervention
  • Typically reveals correlations or descriptive patterns that can be interesting in themselves

SLIDE 25

Example: Inequality

Source: Piketty and Saez, Science 2014, tax return data

SLIDE 26

Observational data

  • Generated without experimental or exogenous intervention
  • Typically reveals correlations or descriptive patterns that can be interesting in themselves
    – Are in themselves silent about causality
    – Theory may provide structure to learn about causal mechanism under strong assumptions
    – May conflate correlation and causality

SLIDE 27

Observational data

  • Example: Does being in private schools affect grades?
    – Classic: Catholic schools and grades in US
    – Collect attendance and grades -> run regression
  • But: suppose some parents are more focused on schooling than others
    – Send kids to private school more
    – More involved in school + homework
  • What do higher grades measure?
    – Effect of private school OR effect of involved parents?
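The private-school question can be simulated to show how involved parents alone can produce a grade gap. This is a minimal sketch with invented numbers: the true school effect is set to zero by assumption, and parental involvement is recorded in the simulation even though it is typically unobserved in real data.

```python
import random
import statistics

TRUE_SCHOOL_EFFECT = 0.0  # assumption: private school itself does nothing for grades

def simulate_students(n, rng):
    rows = []
    for _ in range(n):
        involved = rng.random() < 0.5                        # involved parents
        private = rng.random() < (0.6 if involved else 0.2)  # they choose private more often
        grade = (6.0 + 1.5 * involved + TRUE_SCHOOL_EFFECT * private
                 + rng.gauss(0.0, 1.0))
        rows.append((involved, private, grade))
    return rows

def mean_gap(rows):
    # Mean grade in private schools minus mean grade in public schools.
    private = [g for _, p, g in rows if p]
    public = [g for _, p, g in rows if not p]
    return statistics.fmean(private) - statistics.fmean(public)

rows = simulate_students(50_000, random.Random(2))
naive_gap = mean_gap(rows)
# Compare like with like: compute the gap within each involvement group, then average.
adjusted_gap = statistics.fmean(
    [mean_gap([r for r in rows if r[0] == level]) for level in (True, False)]
)
print("naive gap:", round(naive_gap, 2))
print("gap holding involvement fixed:", round(adjusted_gap, 2))
```

The naive comparison shows a clear private-school advantage even though none was built in. Holding involvement fixed removes it, but only because the simulation lets us see the confounder.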

SLIDE 28

Observational data

  • What to do?
    – Assign kids/parents randomly to private schools?
  • More complicated
    – Waiting-list experiment design: people who sign up reveal themselves as school interested, compare grades between those in program and on waiting list -> much narrower design
    – Modeling (US case): use fact that Catholics are much more likely to choose Catholic schools
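One way to read the modeling idea is as an instrumental-variable design, with being Catholic as the instrument for attending a Catholic school. That label, and everything in the sketch below, is an assumption for illustration. In particular the simulation imposes that religion affects grades only through school choice, which is a strong assumption in real data.

```python
import random
import statistics

TRUE_EFFECT = 0.5  # hypothetical grade gain from attending a Catholic school

def simulate(n, rng):
    rows = []
    for _ in range(n):
        catholic = rng.random() < 0.3   # instrument
        involved = rng.random() < 0.5   # unobserved confounder, independent of religion
        p_attend = 0.1 + 0.5 * catholic + 0.2 * involved
        attends = rng.random() < p_attend
        grade = 6.0 + TRUE_EFFECT * attends + 1.5 * involved + rng.gauss(0.0, 1.0)
        rows.append((catholic, attends, grade))
    return rows

def gap(rows, group_index, value_index):
    # Mean of one column where the grouping column is True, minus where it is False.
    yes = [float(r[value_index]) for r in rows if r[group_index]]
    no = [float(r[value_index]) for r in rows if not r[group_index]]
    return statistics.fmean(yes) - statistics.fmean(no)

rows = simulate(100_000, random.Random(4))
naive = gap(rows, 1, 2)                    # attenders vs non-attenders
wald = gap(rows, 0, 2) / gap(rows, 0, 1)   # grade gap by religion / attendance gap by religion
print("naive comparison:", round(naive, 2))
print("Wald (IV) estimate:", round(wald, 2))
```

The naive comparison overstates the effect because involved parents both choose the school more often and raise grades directly. The Wald ratio scales the grade difference between Catholics and non-Catholics by the difference in attendance, and it lands near the assumed effect only because the simulated instrument is valid by construction.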

SLIDE 29

roadmap

  • Different data for different questions
  • Theory and empirics, forecasting and hypothesis testing
  • Effects of causes vs. Causes of effects
  • Data generating process
  • Modes of data collection – pros and cons
  • Strategic data management and data production

SLIDE 30

Modes of data collection

  • (Ethnographic / participant observer)
  • Survey
    – Interview survey (in person), phone survey, internet survey, …
  • Administrative data
    – Used for administrative purposes
    – Some countries: census, tax return
    – DK: CPR-registry based
  • (Primary collection: texts, counting)
  • “Big data”: in social sciences typically a by-product of digital information

SLIDE 31

Modes of data collection

  • Note: survey, admin data, big data can all have randomized / exogenous elements or be purely observational
  • Often in lab/field experiments: ask about income, education etc. – but may be biased
  • Sometimes: combine experimental data with admin or big data (but rare)

SLIDE 32

Ethnographic

  • Pros
    – Attempt to understand situations from participants’ perspective
    – Very detailed observations (e.g. dynamics at a meeting: who speaks when, who listens, who nods off and flirts etc.)
  • Cons
    – Very difficult to generalize (if even the goal)
    – Typically very small n, not for stats
    – Hard to reproduce / replicate

SLIDE 33

Surveys

  • Pros
    – Can be cheap
    – Elicit info on attitudes, beliefs, expectations
    – Necessary when no other means exist
    – Combine with open-ended info
    – Easily anonymized (firms; China)
  • Cons
    – Can be expensive
    – Non-random samples, sometimes very much so (paid surveys)
    – Cheap talk
    – Diverse interpretations (e.g. 1-10 scales, Maasai example)
    – Very different quality: interview vs. internet
    – Not full researcher control: interviewer completions

SLIDE 34

Administrative data

  • Pros
    – Often full population
    – In DK: third party reported -> no reporting bias, no survey bias
    – Very detailed, no survey fatigue
    – Often very precise, since used for admin purposes
  • Cons
    – No soft data (attitudes, expectations)
    – Privacy concerns
    – Restricted to what is collected for admin reasons, both type and frequency

SLIDE 35

Big data

  • Pros
    – Often based on real decisions (as admin data), but more detail, e.g. auctions
    – High frequency (e.g. wifi), high granularity -> almost large N ethnographic data
    – Cheap/free
  • Cons
    – No established protocol for collection
    – Start-up costs
    – Even more privacy concerns
    – Corporate gatekeepers -> bias in access

SLIDE 36

Example: Social Fabric

  • Large-scale (N=1000) big data project
  • Handed out smart phones to DTU freshmen
  • Collected phone, SMS/text, GPS, wifi, bluetooth data
  • -> Where, when, with whom
  • -> social networks

SLIDE 37

Example: Social Fabric

Phone locations 0500h Monday morning

SLIDE 38

Example: Social Fabric

Figure panels: 10 min GPS; wifi

SLIDE 39

Example: CSS

Heatmap of people with mobile devices on CSS (anonymous)

SLIDE 40

Example: why phone data

  • Phones as sociometers
  • Many/most people carry phone with them all the time
  • Would be IMPOSSIBLE to have people report in detail for every 10 min every day for a year
  • For this project: tailored software, but realized that many apps collect detailed wifi data without telling

SLIDE 41

roadmap

  • Different data for different questions
  • Theory and empirics, forecasting and hypothesis testing
  • Effects of causes vs. Causes of effects
  • Data generating process
  • Modes of data collection – pros and cons
  • Strategic data management and data production

SLIDE 42

Strategic data management and production

  • People / firms / governments do not always provide truthful and/or complete data
  • Example: No penalty for lying in surveys – but no reason to either
  • Political reasons for obscuring or inventing data: Greece in EU, Chinese economy
  • Firms: Proprietary info, competition reasons, fooling customers and regulators (VW)

SLIDE 43

Social desirability bias

  • Key concern in surveys, but more general problem: What if people answer so as to conform with general notions of what’s desirable?
    – Examples: Won’t admit to not voting or to having sexually transmitted diseases, exaggerate income
    – Important for asking/assessing sensitive questions

SLIDE 44

Social desirability bias

  • Why?
  • Distinguish
    a) self-deception
    b) impression management
  • Example: Scrape data from dating websites and link (hypothetically) to income data
    – Is there a correlation between beauty and income? (Yes, but not from such data)
    – Bias could be both (a) and (b)