Big Data Max Kemman University of Luxembourg October 11, 2015 - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Big Data

Max Kemman

University of Luxembourg October 11, 2015

Doing Digital History: Introduction to Tools and Technology

slide-2
SLIDE 2

Recap from last time

What were aspects of an archive?
What are the three steps of digitisation?
What is the difference between data & metadata?
What data and metadata do we have about letters?

slide-3
SLIDE 3

Today

  • Are digital libraries big data?
  • N=ALL
  • Messy data
  • From causality to correlation
  • Radical contextualisation
  • Next time
slide-4
SLIDE 4

Are digital libraries big data?

Last week we discussed digital libraries/archives.
Europeana contains about 53M digital objects.
Is this big data?

slide-5
SLIDE 5

What is "data" anyway?

The term has a rhetorical function: "that which is given prior to argument" (Gitelman, 2014).
Common description: "raw data".
But creating data requires a vast amount of work (as we saw last week).
Interpretive work goes into creating data.

slide-6
SLIDE 6

What is "big data"?

Metaphors used to describe big data give different interpretations (Awati & Shum, 2014):

  • Food: raw or cooked
  • Resource: oil, gold
  • Liquid: ocean, tsunami
slide-7
SLIDE 7

What is "big data" anyway?

'Classic' definition by V's:

  • Volume: size
  • Velocity: accumulation
  • Variety: heterogeneous

Another definition: too much data to handle
slide-8
SLIDE 8

Is this new?

Andrew Prescott (2015):

  • Domesday book
  • US Census 1890
slide-9
SLIDE 9

What is "big data" anyway?

What is the difference between "lots of data" and "big data"? (Lagoze, 2014)

  • "Large" is historical: computers change
  • Big data makes us rethink what science is
slide-10
SLIDE 10

Are digital libraries big data?

Or, does History have big data?
From the definitions so far:

  • Size: not so much (compared to CERN)
  • Velocity: not so much
  • Variety: yes!
  • Too much data to handle: probably
  • Makes us rethink what science/scholarship is: maybe

Is our collection of Hillary Clinton emails 'big data'?

Some say History/Humanities do not have big data

slide-11
SLIDE 11

Why is big data interesting

But why are we concerned with big data, and not with particle physics? (Wallach, 2014)
What are the 2 reasons she gives?

  • Social: big data are about people
  • Granularity: individual people and their activities

Here, maybe History/Humanities do have an interest in big data
slide-12
SLIDE 12

Big data is a big topic

Another definition of big data (Mayer-Schönberger & Cukier, 2014)
Let's discuss these features:

  • N=ALL
  • Messy
  • From causality to correlation
slide-13
SLIDE 13

N=ALL

"N" refers to the number of observations in a sample.
Sample: a smaller group that represents the entire population.
So N=ALL refers to measuring everything, rather than a smaller representative group.
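
To make the difference between a sample and N=ALL concrete, here is a minimal Python sketch. The "population" of letter lengths is invented for illustration, and the function name compare_sample_to_population is hypothetical; the point is only that a random sample gives an estimate of what measuring everything gives exactly.

```python
import random
import statistics


def compare_sample_to_population(population, sample_size, seed=0):
    """Return (mean over the whole population, mean over a random sample)."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return statistics.mean(population), statistics.mean(sample)


# Hypothetical population: the length in words of 10,000 letters.
rng = random.Random(42)
letter_lengths = [rng.randint(50, 500) for _ in range(10_000)]

mean_all, mean_sample = compare_sample_to_population(letter_lengths, 200)
print(f"N=ALL mean: {mean_all:.1f} words")
print(f"N=200 mean: {mean_sample:.1f} words")
```

The two means will usually be close but not identical: the sample is an approximation, while N=ALL leaves nothing to estimate.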

slide-14
SLIDE 14

All historical sources?

A difference between "a lot of data" and "all data"
Remember Rosenzweig from week 1:

The injunction of traditional historians to look at “everything” cannot survive in a digital era in which “everything” has survived

Rosenzweig (2003)

slide-15
SLIDE 15

Is size that interesting?

If big data is merely a quantitative difference, what's the interest?
But a quantitative difference can lead to a qualitative difference (Mayer-Schönberger, 2014)

slide-16
SLIDE 16

Quantitative to qualitative

slide-17
SLIDE 17

Longue durée

Rather than focusing on a very short timespan, see development over ages

(Manning, 2013)

slide-18
SLIDE 18

Messy data

Big data has Variety
A heterogeneous dataset:

  • Different data-types
  • Different variables

Too much data to manually check
slide-19
SLIDE 19

Can we use messy data?

Mayer-Schönberger & Cukier: size makes up for messiness
Exactness is from the age of sparse information
The noise can be smoothed out
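
The claim that noise can be smoothed out can be illustrated with a small Python sketch. It assumes the messiness is random, unbiased error around a true value; the numbers and the function name noisy_measurements are made up for the example.

```python
import random
import statistics


def noisy_measurements(true_value, n, noise=10.0, seed=0):
    """Simulate n measurements of true_value with random (unbiased) error."""
    rng = random.Random(seed)
    return [true_value + rng.gauss(0, noise) for _ in range(n)]


TRUE_VALUE = 100.0
small = noisy_measurements(TRUE_VALUE, 10)
large = noisy_measurements(TRUE_VALUE, 100_000)

# Distance between the average measurement and the true value.
error_small = abs(statistics.fmean(small) - TRUE_VALUE)
error_large = abs(statistics.fmean(large) - TRUE_VALUE)
print(f"error with N=10:      {error_small:.3f}")
print(f"error with N=100,000: {error_large:.3f}")
```

Note the assumption: averaging only cancels random error. If the data are systematically skewed, collecting more of them does not remove the skew.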

slide-20
SLIDE 20

Crowdsourcing

One way of trying to get someone to look at the data
Need to trust anonymous people

slide-21
SLIDE 21

Does big data reflect the world?

With N=ALL, big data = reality, right?
But (big) data incorporates choices of what to measure
Twitter/Facebook are biased reflections of the world

slide-22
SLIDE 22

Biases in language

Big data word-pairs (MIT Technology Review)

  • Man - Woman
  • King - Queen
  • Brother - Sister
  • Computer programmer - Homemaker
  • Doctor - Midwife
  • Coward - Whore
  • etc
slide-23
SLIDE 23

How big data is 'unfair'

The average person is a fiction
Hitchcock: it is the exceptions we are interested in!

slide-24
SLIDE 24

Looking at the exceptions

Wallach agrees: use the granularity of big data to study minorities & exceptions
How do we discover the minorities & exceptions of interest?
To repeat: we cannot look at all cases individually
Some statistical analysis is required

slide-25
SLIDE 25

From causality to correlation

Correlation: two variables show a statistical relation

  • Positive: when A increases, B increases
  • Negative: when A increases, B decreases

Causation: one variable explains the second

Example: when it rains, more people take umbrellas with them
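
Positive and negative correlation can be measured with the Pearson correlation coefficient, which runs from -1 (perfectly negative) to +1 (perfectly positive). The sketch below implements it directly in Python; the rainfall, umbrella and sunglasses counts are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical daily observations.
rainfall_mm = [0, 2, 5, 8, 12, 20]
umbrellas_seen = [3, 10, 22, 35, 50, 80]
sunglasses_seen = [60, 52, 40, 30, 18, 5]

print("rain vs umbrellas: ", round(pearson(rainfall_mm, umbrellas_seen), 2))
print("rain vs sunglasses:", round(pearson(rainfall_mm, sunglasses_seen), 2))
```

The coefficient says nothing about causation: it is the same number whether rain makes people carry umbrellas or umbrellas were to summon rain.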

slide-26
SLIDE 26

Correlation found

A nice example is Google Flu Trends:

  • Took flu data from the national health center for a number of years
  • Investigated which keyword searches occurred shortly before or during flu outbreaks
  • Used keyword searches to predict outbreaks of flu
slide-27
SLIDE 27

Correlation and causation

Important to remember: correlation does not equal causation
The keyword searches do not cause the flu!
Sometimes you don't know which variable comes first
Maybe a third variable explains the two measured ones

slide-28
SLIDE 28

Meaningful correlation

Does the correlation mean anything?
Google Flu Trends was later found not to produce accurate results
Spurious correlations

slide-29
SLIDE 29

Spurious correlations

slide-30
SLIDE 30

Spurious correlations

http://www.tylervigen.com/spurious-correlations
Find a correlation yourself: http://tylervigen.com/discover
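
Why spurious correlations turn up so easily can be shown with a short Python sketch: generate many unrelated random series, compare every pair, and keep the strongest correlation. All series here are simulated random walks, not real data, and the helper names are hypothetical.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def random_walk(length, rng):
    """A series where each value is the previous one plus random noise."""
    position = 0.0
    walk = []
    for _ in range(length):
        position += rng.gauss(0, 1)
        walk.append(position)
    return walk


def strongest_pair(series):
    """Return ((i, j), r) for the pair of series with the largest |r|."""
    pairs = itertools.combinations(range(len(series)), 2)
    i, j = max(pairs, key=lambda p: abs(pearson(series[p[0]], series[p[1]])))
    return (i, j), pearson(series[i], series[j])


# 200 unrelated "yearly statistics", each covering 10 years.
rng = random.Random(1)
series = [random_walk(10, rng) for _ in range(200)]
(i, j), r = strongest_pair(series)
print(f"series {i} and {j} correlate with r = {r:.2f}")
```

With 200 series there are 19,900 pairs to compare, so some pair correlating strongly by pure chance is to be expected. This is why a correlation found by searching through many variables needs interpretation before it is trusted.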

slide-31
SLIDE 31

Meaningful correlation

We cannot rely on the statistics alone; we need to interpret them
But we still do not want to manually check all the possible correlations

slide-32
SLIDE 32

Machine learning

Wallach describes herself as a machine learning researcher
A simple introduction to machine learning (Geitgey, 2014)
Rather than being told what to do, the computer learns what to do

  • Supervised
  • Unsupervised
slide-33
SLIDE 33

Supervised learning

Provide enough example answers for the computer to learn to give a new answer
The computer figures out how to go from the data to the answer
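
One of the simplest forms of supervised learning is the nearest-neighbour method: a new item gets the answer of the most similar example that was already labelled. The Python sketch below is only an illustration of the idea; the documents, their two features (word count and sentence count) and the labels are invented.

```python
import math


def nearest_neighbour(labelled_examples, features):
    """Return the label of the labelled example closest to `features`."""
    closest = min(labelled_examples,
                  key=lambda example: math.dist(example[0], features))
    return closest[1]


# Hypothetical training data: (word count, sentence count) -> document type.
labelled_examples = [
    ((12, 1), "telegram"),
    ((20, 2), "telegram"),
    ((350, 14), "letter"),
    ((500, 25), "letter"),
]

print(nearest_neighbour(labelled_examples, (18, 2)))    # a short new document
print(nearest_neighbour(labelled_examples, (420, 20)))  # a long new document
```

Nobody wrote a rule such as "fewer than 50 words means telegram"; the answer follows from the labelled examples, so the quality of those examples decides the quality of the result.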

slide-34
SLIDE 34

Supervised learning

https://www.youtube.com/watch?v=SZ88F82KLX4
Or beat masters at chess or Go

slide-35
SLIDE 35

Unsupervised learning

No given answer
Are there patterns?
Outliers?
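
A very small example of looking for outliers without any given answers: flag the values that lie unusually far from the average. The Python sketch below uses a simple rule (more than two standard deviations from the mean) as an assumption; the yearly letter counts are invented, and real projects often use more robust methods.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical number of letters sent per year by one correspondent.
letters_per_year = [98, 102, 101, 97, 100, 103, 99, 250]
print(find_outliers(letters_per_year))
```

The method only points at the exceptional year; finding out what happened in that year is still the historian's work.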

slide-36
SLIDE 36

Train without knowing the rules

What do pregnant women buy?
How are sentences translated to different languages? (MIT Technology Review)

slide-37
SLIDE 37

Biased algorithms?

Issues of biased algorithms:

  • Diversity in job applications
  • School drop outs
  • Predictive profiling of criminality

"We have no idea how these predictions are made"

Often the criticism is aimed at the algorithm, but where does the bias come from?

slide-38
SLIDE 38

Rethinking science/scholarship

How does this require a rethinking of scholarship?
Ways of reasoning (Dixon, 2012):

  • Induction: from the specific to the general
  • Deduction: from the general to the specific
  • Abduction: patterns
slide-39
SLIDE 39

Patterns

Rens Bod: discovery of patterns with tools is Humanities 2.0
Hermeneutic interpretation of these patterns is Humanities 3.0
Fickers: context more interesting than the data

slide-40
SLIDE 40

Radical contextualisation

What is the context of each datapoint?
Hitchcock: contextualize using the big data

slide-41
SLIDE 41

Context

If content is king, context is its crown
Your search keywords make sense in your context

slide-42
SLIDE 42

Radical context

Remember from week 1: what does this tweet mean as part of 31M?
Or actually: what does this tweet mean outside of Twitter?
slide-43
SLIDE 43

Zooming

Hitchcock describes the macroscope, quoting Katy Börner:

Macroscopes provide a "vision of the whole," helping us "synthesize" the related elements and detect patterns, trends, and outliers while granting access to myriad details. Rather than make things larger or smaller, macroscopes let us observe what is at once too great, slow, or complex for the human eye and mind to notice and comprehend.
slide-44
SLIDE 44

Zooming in on people

If today we have a public dialogue that gives voice to the traditionally excluded and silenced – women, and minorities of ethnicity, belief and dis/ability – it is in no small part because we now have beautiful histories of small things. In other words, it has been the close and narrow reading of human experience that has done most to give voice to people excluded from ‘power’ by class, gender and race.

Hitchcock

slide-45
SLIDE 45

Close reading

Hitchcock argues for an interchange of close and distant reading
Distant reading? That's the next lecture

slide-46
SLIDE 46

For next time

18 October

Distant Reading

  • Aiden, E. L., & Michel, J.-B. (2013). The sound of silence. In Uncharted (pp. 69–83). Penguin.
  • Moretti, F. (2009). Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850). Critical Inquiry, 36(1), 134–158.