[PPT] - Big Data Max Kemman University of Luxembourg October 19, 2015 PowerPoint Presentation

SLIDE 1

Big Data

Max Kemman

University of Luxembourg, October 19, 2015

Doing Digital History: Introduction to Tools and Technology

SLIDE 2

Recap from last time

What is a digital library or archive?
How are sources digitised?
How can we search the digital archive?
Can we research the digital library or archive as a whole?

SLIDE 3

Today

Are digital libraries big data?

N=ALL
Messy data
From causality to correlation
Radical contextualisation
Next time

SLIDE 4

Are digital libraries big data?

Last week we discussed digital libraries/archives
Europeana contains about 32M digital objects
Is this big data?

SLIDE 5

What is "data" anyway?

Term has rhetorical function: "that which is given prior to argument" (Gitelman, 2014)
Common description: "raw data"
But creating data requires a vast amount of work (as we saw last week)
Interpretive work goes into creating data

SLIDE 6

What is "big data"?

Metaphors used to describe big data give different interpretations (Awati & Shum, 2014)

Food: raw or cooked

Resource: oil, gold
Liquid: ocean, tsunami

SLIDE 7

What is "big data" anyway?

'Classic' definition by V's:

Volume: size
Velocity: accumulation
Variety: heterogeneous

Another definition: too much data to handle

SLIDE 8

Is this new?

Andrew Prescott (2015):

Domesday book

US Census 1890

SLIDE 9

What is "big data" anyway?

What is the difference between "lots of data" and "big data"? (Lagoze, 2014)

"Large" is historical: computers change

Big data makes us rethink what science is

SLIDE 10

Are digital libraries big data?

Or, does History have big data?
From the definitions so far:

Size: not so much (compared to CERN)
Velocity: not so much
Variety: yes!
Too much data to handle: probably
Makes us think what science is: maybe

Some say History/Humanities do not have big data

SLIDE 11

Why is big data interesting

BUT, why are we concerned with big data, but not with particle physics? (Wallach, 2014)
Two reasons:

Social: big data are about people
Granularity: individual people and their activities

Here maybe History/Humanities do have interest in big data

SLIDE 12

Big data is a big topic

Another definition of big data (Mayer-Schönberger & Cukier, 2014)

N=ALL
Messy
From causality to correlation

Let's discuss these features

SLIDE 13

N=ALL

"N" refers to the number of observations in a sample
Sample: a group that represents the entire population
So N=ALL refers to measuring everything, rather than a representative smaller group

SLIDE 14

All historical sources?

A difference between "a lot of data" and "all data"
Remember Rosenzweig from week 1:

The injunction of traditional historians to look at “everything” cannot survive in a digital era in which “everything” has survived

Rosenzweig (2003)

SLIDE 15

Is size that interesting?

If big data is merely a quantitative difference, what's the interest?
But a quantitative difference can lead to a qualitative difference (Mayer-Schönberger, 2014)

SLIDE 16

Longue durée

Rather than focusing on a very short timespan, see development over ages

SLIDE 17

Messy data

Big data has Variety
A heterogeneous dataset

Different data-types
Different variables

Too much data to manually check

SLIDE 18

Can we use messy data?

Mayer-Schönberger & Cukier: size makes up for messiness
Exactness is from the age of sparse information
The noise can be smoothed out

SLIDE 19

Crowdsourcing

One way of trying to get someone to look at the data
Need to trust anonymous people

SLIDE 20

Does big data reflect the world?

With N=ALL, big data = reality, right?
But (big) data incorporates choices of what to measure
Twitter/Facebook are biased reflections of the world
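
The gap between "all the data a platform has" and "everyone" can be illustrated with a small sketch. All numbers below are invented for illustration: a hypothetical population of ten people, of whom mostly the younger ones use a platform. Having every platform user (N=ALL for the platform) still gives a skewed picture of the population.

```python
from statistics import mean

# Hypothetical population: (age, uses_platform). All values are invented.
population = [
    (19, True), (23, True), (27, True), (31, True), (35, False),
    (42, False), (51, True), (58, False), (66, False), (78, False),
]

def mean_age(people):
    """Average age of a list of (age, uses_platform) records."""
    return mean(age for age, _ in people)

# "N=ALL" for the platform means all of its users, not all people.
platform_users = [person for person in population if person[1]]

population_mean = mean_age(population)
platform_mean = mean_age(platform_users)

print(f"Whole population: {population_mean:.1f}")
print(f"Platform users:   {platform_mean:.1f}")
```

The platform average is lower than the population average, even though no platform user was left out.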

SLIDE 21

How big data is 'unfair'

The average person is a fiction
Hitchcock: it is the exceptions we are interested in!

SLIDE 22

Looking at the exceptions

Wallach agrees: use the granularity of big data to study minorities & exceptions
How do we discover the minorities & exceptions of interest?
To repeat: we cannot look at all cases individually
Some statistical analysis is required

SLIDE 23

From causality to correlation

Correlation: two variables show a statistical relation

Positive: when A increases, B increases
Negative: when A increases, B decreases

Causation: one variable explains the second

Example: when it rains, more people take umbrellas with them
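
Positive and negative correlation can be made concrete with the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below computes it by hand; the rain, umbrella, and sunglasses numbers are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return covariance / spread

# Invented observations for five days.
rain_mm = [0, 2, 5, 8, 12]
umbrellas = [3, 10, 22, 35, 50]     # rises with rain
sunglasses = [40, 31, 20, 12, 2]    # falls with rain

print(pearson_r(rain_mm, umbrellas))   # close to +1: positive correlation
print(pearson_r(rain_mm, sunglasses))  # close to -1: negative correlation
```

The coefficient only describes how the two series move together; it says nothing about which one explains the other.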

SLIDE 24

Correlation found

A nice example is Google Flu Trends:

Took flu data from national health center for a number of years

Investigated which keyword searches occurred shortly before or during flu outbreaks
Use keyword searches to predict outbreak of flu
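
The idea behind these steps can be sketched as ranking search keywords by how closely their volume follows the recorded flu cases. This is only an illustration of the principle, not Google's actual model; the weekly numbers and keywords are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Invented weekly numbers, for illustration only.
flu_cases = [1, 2, 3, 4, 5]
searches = {
    "fever": [2, 4, 6, 8, 10],
    "weather": [3, 1, 4, 1, 5],
    "football": [5, 4, 3, 2, 1],
}

# Rank keywords by how strongly their search volume follows the flu counts.
ranked = sorted(
    ((keyword, pearson_r(volume, flu_cases)) for keyword, volume in searches.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for keyword, r in ranked:
    print(f"{keyword:10s} r = {r:+.2f}")
```

The best-correlated keywords would then serve as predictors, which is exactly where spurious correlations can slip in.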

SLIDE 25

Correlation and causation

Important to remember: correlation does not equal causation
The keyword searches do not cause the flu!
Sometimes you don't know which variable comes first
Maybe a third variable explains the two measured ones

SLIDE 26

Meaningful correlation

Does the correlation mean anything?
Google Flu Trends was later found not to produce accurate results
Spurious correlations

SLIDE 27

Spurious correlations

SLIDE 28

Spurious correlations

http://www.tylervigen.com/spurious-correlations
Find a correlation yourself: http://tylervigen.com/discover

SLIDE 29

Meaningful correlation

We cannot rely on the statistics alone, we need to interpret them
But still we do not want to manually check all the possible correlations

SLIDE 30

Machine learning

Wallach describes herself as a machine learning researcher
A simple introduction to machine learning (Geitgey, 2014)
Rather than being told what to do, the computer learns what to do

Supervised
Unsupervised

SLIDE 31

Supervised learning

Provide enough examples with answers for the computer to learn to give a new answer
Computer figures out how to go from data to the answer
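
A minimal sketch of supervised learning, assuming a 1-nearest-neighbour classifier as the learning method (one simple option among many). The labelled examples are invented: each animal is described by height and weight, and the answer to learn is its species.

```python
import math

# Training data: (height_cm, weight_kg) -> label. All values are invented.
training = [
    ((25, 4), "cat"),
    ((23, 5), "cat"),
    ((30, 6), "cat"),
    ((60, 25), "dog"),
    ((55, 30), "dog"),
    ((70, 35), "dog"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training, key=lambda example: math.dist(example[0], features))
    return nearest[1]

print(predict((28, 5)))    # a new, unlabelled animal
print(predict((65, 28)))
```

No rule such as "dogs are heavier" was written down anywhere; the answer for a new case comes from the labelled examples.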

SLIDE 32

Supervised learning

Or beat masters at chess

SLIDE 33

Unsupervised learning

No given answer
Are there patterns?
Outliers?
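
A minimal sketch of looking for outliers without any given answers, assuming a simple rule: flag values that lie more than two standard deviations from the mean. The threshold and the data are assumptions made for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]

# Invented data: letters sent per year by seven correspondents.
letters_per_year = [10, 11, 9, 10, 12, 10, 50]
print(find_outliers(letters_per_year))
```

The computer was never told what an exceptional correspondent looks like; the exception emerges from the data, and interpreting it remains the historian's job.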

SLIDE 34

Train without knowing the rules

What do pregnant women buy?
How are sentences translated to different languages?

SLIDE 35

Patterns

Rens Bod: discovery of patterns with tools is Humanities 2.0
Hermeneutic interpretation of these patterns is Humanities 3.0
Fickers: context more interesting than the data

SLIDE 36

Radical contextualisation

What is the context of each datapoint?
Hitchcock: contextualize using the big data

SLIDE 37

Context

If content is king, context is its crown
Your search keywords make sense in your context

SLIDE 38

Radical context

Remember from week 1: what does this tweet mean as part of 31M?
Or actually: what does this tweet mean outside of Twitter?

SLIDE 39

Zooming

Hitchcock describes the macroscope, quoting Katy Börner:

Macroscopes provide a "vision of the whole," helping us "synthesize" the related elements and detect patterns, trends, and outliers while granting access to myriad details. Rather than make things larger or smaller, macroscopes let us observe what is at once too great, slow, or complex for the human eye and mind to notice and comprehend.

SLIDE 40

Zooming in on people

If today we have a public dialogue that gives voice to the traditionally excluded and silenced – women, and minorities of ethnicity, belief and dis/ability – it is in no small part because we now have beautiful histories of small things. In other words, it has been the close and narrow reading of human experience that has done most to give voice to people excluded from ‘power’ by class, gender and race.

Hitchcock

SLIDE 41

Close reading

Hitchcock argues for an interchange of close and distant reading
Distant reading? That's the next lecture

SLIDE 42

For next time

19 October (double lecture)

Distant Reading

SLIDE 43

Distant Reading

Max Kemman

University of Luxembourg October 19, 2015

Doing Digital History: Introduction to Tools and Technology

SLIDE 44

Recap from last time

What is big data?
Do digital libraries and historians have big data?
How can big data be analyzed?

SLIDE 45

Today

What is distant reading?

Reading the distance
Biases in the chart
Hands-on
Next time
Assignment

SLIDE 46

What is distant reading?

“distant reading”: understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.

(Schulz, 2011)

SLIDE 47

Aggregating

Rather than analyzing a book page by page, analyze a corpus book by book
Corpus (here): an aggregated set of sources

SLIDE 48

Viewing the aggregate (Moretti)

SLIDE 49

Viewing the aggregate (Aiden & Michel)

SLIDE 50

The charts

The charts aim to show how one variable relates to another
Vertical: y-axis
Horizontal: x-axis
Y-axis is often frequency per X words
X-axis is often time
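
A minimal sketch of how such a y-axis value could be computed, assuming "frequency per X words" means the number of occurrences divided by the total number of words in that year, scaled to one million. The function name and the tiny corpus are invented for illustration; real tools tokenise text far more carefully.

```python
from collections import Counter

def per_million(corpus_by_year, word):
    """Occurrences of `word` per million words, for each year in the corpus."""
    result = {}
    for year, text in sorted(corpus_by_year.items()):
        tokens = text.lower().split()
        if not tokens:
            continue  # skip years without any text
        count = Counter(tokens)[word.lower()]
        result[year] = count / len(tokens) * 1_000_000
    return result

# Tiny invented corpus: one short text per year.
corpus = {
    1900: "the war ended and the peace began",
    1910: "peace and trade and peace again",
}
print(per_million(corpus, "peace"))
```

Dividing by the total number of words per year matters: without it, a year with more surviving text would appear to use every word more often.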

SLIDE 51

X-Axis

Not always the case, e.g. Gendered Language in Teacher Reviews
X-axis: frequency per million words
Y-axis: discipline
Colour: gender

SLIDE 52

Reading the distance

SLIDE 53

Reading the distance

SLIDE 54

Looking closer (Aiden & Michel)

SLIDE 55

Looking closer (Moretti)

SLIDE 56

Looking closer (Moretti)

SLIDE 57

Playing around with the view

SLIDE 58

Playing around with the view

SLIDE 59

Finding a correlation

SLIDE 60

Finding a correlation

Note: Moretti does not actually calculate the statistics

SLIDE 61

Biases in the charts

What makes this chart difficult to interpret?

SLIDE 62

OCR

[Chart: frequency of "fuck", 1500 to 2000; x-axis: year, y-axis: frequency as a percentage]

SLIDE 63

OCR

[Chart: frequency of "ftrolling", 1500 to 2000; x-axis: year, y-axis: frequency as a percentage]

SLIDE 64

OCR

[Chart: frequency of "ftrolling" and "strolling", 1500 to 2000; x-axis: year, y-axis: frequency as a percentage]

SLIDE 65

The bumps

Important to note is smoothing
Makes the chart easier to read, but abstracts away information
Example: ftrolling, strolling in the Google ngram viewer
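
What smoothing does to a chart can be sketched with a centred moving average: each year's value is replaced by the average of itself and its neighbours. This is an assumption about the general technique; the exact smoothing a given ngram viewer applies may differ. The yearly values are invented.

```python
def smooth(values, window=1):
    """Average each value with up to `window` neighbours on either side."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        neighbours = values[lo:hi]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed

# Invented yearly frequencies with a single spike in the middle year.
raw = [0, 0, 9, 0, 0]
print(smooth(raw, window=1))  # the spike is spread over three years
print(smooth(raw, window=0))  # no smoothing: the raw values
```

The smoothed line is easier to read, but the single-year spike is no longer visible as such, which is the information that gets abstracted away.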

SLIDE 66

What is in the corpus?

SLIDE 67

What is in the corpus?

SLIDE 68

What is in the corpus?

Recent research on Google Books: (Source)

Raw numbers do not reflect popularity
Lots of scientific literature
Language
Spurious correlations?

One study that used Google Books to make broad claims about the changing nature of childhood in the mid-20th century, a study that failed to acknowledge that parenting manuals emerged as a genre during that era. (Source)

SLIDE 69

Hands-on

Go to http://bookworm.culturomics.org/ and choose a bookworm or other ngram viewer
Try the same keywords in different tools
Try different keywords in the same tool

SLIDE 70

For next time

26 October (double lecture)

What? Investigating what a corpus is about

Reading: (see Moodle)

Braake, S. ter, & Fokkens, A. (2015). How to Make it in History. Working Towards a Methodology of Canon Research with Digital Methods. In Biographical Data in a Digital World 2015 (pp. 85–93).

SLIDE 71

Assignment

Work in pairs

Use one of the tools discussed today to try and find something you find interesting. Document your steps and choices and discuss why a finding is of interest, and whether you can be certain of this finding.

Hand in the assignment in HTML, include your name and a decent profile photo

SLIDE 72

Assignment

Email to max.kemman@uni.lu before the start of the next lecture

Grading

1pt for free
3pts for HTML
3pts for documentation of your process
3pts for critical reflection on your finding

Do note: the finding itself is not the most important aspect of the assignment