SLIDE 1

Literary Text Mining and Stylometry

DH Crash Course Andreas van Cranenburgh

Huygens ING Institute for Logic, Language and Computation Royal Netherlands Academy of Arts and Sciences University of Amsterdam

March 23, 2014

Amsterdam, 2014

SLIDE 2

Today’s menu

  • 1. “The Riddle of Literary Quality” project
  • 2. Machine Learning
  • 3. Your Mission
SLIDE 3

The project

The Riddle of Literary Quality*

*http://literaryquality.huygens.knaw.nl

SLIDE 5

Literary Quality: “low” versus “high” brow

Perceptions of literary quality due to:

◮ Social factors?
◮ Contextual factors?
◮ Individual factors?
◮ Textual characteristics?

SLIDE 8

Main research question

Survey: Two independent axes of quality:

  • 1. good vs. bad
  • 2. literary vs. non-literary

Texts: Two kinds of text features:

  • 1. low-level: directly extracted from text

(e.g., sentence length)

  • 2. high-level: analyze text with some model

(e.g., deep syntactic structures)

Question

Can we find correlations between quality judgments and text features?

SLIDE 9

Corpus

◮ 401 modern Dutch novels
◮ Published 2007–2012
◮ Selected by popularity

SLIDE 10

Survey

◮ Large reader survey
◮ Subjects select books they have read from the corpus and rate whether each book is good and whether it is literary
◮ About 14,000 readers completed the survey

SLIDE 11

Today’s menu

  • 1. “The Riddle of Literary Quality” project
  • 2. Machine Learning
  • 3. Your Mission
SLIDE 13

The Workflow

Definition

Text classification: Text ⇒ Features ⇒ Model ⇒ Predictions

◮ Goal: generalization
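As a minimal sketch of this workflow (assuming scikit-learn, which ships with the Anaconda distribution used later in the mission; the sample texts and genre labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Text (invented toy sentences standing in for novels)
train_texts = ["the detective found the body",
               "the inspector questioned the suspect",
               "the ship sailed to the stars",
               "aliens landed on the red planet"]
train_labels = ["detective", "detective", "scifi", "scifi"]

# Text => Features: turn each text into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Features => Model: fit a classifier on the feature vectors
model = LinearSVC()
model.fit(X, train_labels)

# Model => Predictions: generalize to a text the model has not seen
new = vectorizer.transform(["the detective questioned a suspect"])
print(model.predict(new))
```

The point of the pipeline is generalization: the model is only useful insofar as its predictions hold for unseen texts.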

SLIDE 14

Today’s menu

  • 1. “The Riddle of Literary Quality” project
  • 2. Machine Learning

      ◮ Features
      ◮ Model
      ◮ Predictions
      ◮ Background

  • 3. Your Mission
SLIDE 16

Feature vectors

Definition

Vector: a sequence of numbers

Each text will be represented by a vector of numbers. E.g.:

Author       Shall  I  compare  thee  ...
Shakespeare    1    1     1      1    ...
Me                  9                 ...

SLIDE 18

The Vector Space Model

Definition

Space: place in which distances are defined

◮ texts are more or less distant (dissimilar) in this space
◮ each vector element is a dimension
◮ the vector specifies a co-ordinate in the vector space
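A distance in this space can be as simple as the Euclidean distance between two count vectors (the vectors below are hypothetical word counts, invented for illustration):

```python
import math

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

text_a = [10, 4, 0, 1]   # hypothetical word counts
text_b = [9, 5, 0, 1]    # similar counts: a nearby point
text_c = [0, 0, 8, 6]    # very different counts: a distant point

print(euclidean(text_a, text_b))  # small distance
print(euclidean(text_a, text_c))  # large distance
```

Texts with similar word counts end up close together; dissimilar texts end up far apart.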

SLIDE 19

Bag-of-Words model

Definition

Bag-of-Words (BOW) model: use word counts as vectors. E.g.:

Author       Shall  I  compare  thee  ...
Shakespeare    1    1     1      1    ...
Me                  9                 ...
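Building such count vectors needs nothing beyond the standard library; a sketch (the second "text" is invented to mirror the joke in the table above):

```python
from collections import Counter

# Two toy "texts"; the vocabulary is shared across all of them
texts = ["shall i compare thee to a summer's day",
         "i i i i i i i i i"]
vocab = sorted(set(word for text in texts for word in text.split()))

# One count vector per text, with one dimension per vocabulary word
vectors = [[Counter(text.split())[word] for word in vocab]
           for text in texts]
print(vocab)
print(vectors)
```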

SLIDE 20

Function words vs. Content words: I

[Bar chart: relative frequencies of common function words (e.g., the, they, his, she, at, by, them, could, were); y-axis: relative frequency, 0.0–0.5]

Function words:

◮ Small words, highly frequent
◮ Unconsciously chosen
◮ Articles, pronouns, conjunctions

E.g.: the, I, and, of, in

Content words:

◮ Low- to mid-frequency
◮ Chosen to match topic
◮ Nouns, verbs, adjectives

E.g.: walk, talk, ship, sun

SLIDE 21

Function words vs. Content words: II

For text classification:

Function words:

◮ Useful for authorship attribution, gender detection
◮ Small set of words is sufficient
◮ Pennebaker (2011), The Secret Life of Pronouns

Content words:

◮ Good at detecting topics, related work
◮ Large vocabulary required

http://secretlifeofpronouns.com/
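Extracting function-word features is straightforward: count only words from a small fixed list and normalize by text length. A sketch (the word list and sample sentence are illustrative, not the project's actual feature set):

```python
# A tiny illustrative function-word list; real stylometric lists
# typically contain a few hundred entries.
FUNCTION_WORDS = {"the", "i", "and", "of", "in", "a", "to"}

def function_word_profile(text):
    # relative frequency of each function word in the text
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in FUNCTION_WORDS}

profile = function_word_profile("the sun and the ship sailed out of the bay")
print(profile["the"])  # 3 of 10 words
```

Because these frequencies are largely independent of topic, they make good stylistic fingerprints.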

SLIDE 23

Model: making predictions

◮ Similar texts will have similar word counts
◮ Simplest model: for a new text, find its nearest neighbor and use that to make a prediction

This works, but ...

◮ Not all words are equally important
◮ Not all texts are equally representative
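The nearest-neighbor idea fits in a few lines; a sketch with invented toy vectors and labels:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (feature vector, label) pairs -- invented for illustration
train = [([12, 1, 0], "detective"),
         ([10, 2, 1], "detective"),
         ([0, 1, 9], "scifi")]

def nearest_neighbor(vec):
    # predict the label of the closest training point
    return min(train, key=lambda item: euclidean(vec, item[0]))[1]

print(nearest_neighbor([11, 0, 1]))
```

Every training point counts equally here, and every feature too, which is exactly the weakness noted above.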

SLIDE 24

Model: Support Vector Machines (SVM)

◮ Support vectors are the data points closest to the boundary between the classes to be learned;
◮ After training, each feature receives a weight that determines how much it will affect predictions;
◮ The support vectors and weights define a line (hyperplane) that maximally separates the classes.
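Fitting a linear SVM takes two lines with scikit-learn (assumed available; the feature vectors and labels below are invented toy data):

```python
from sklearn.svm import LinearSVC

# Toy feature vectors, e.g., counts of three hypothetical words
X = [[12, 1, 0], [10, 2, 1],   # texts of class "detective"
     [0, 1, 9], [1, 0, 8]]     # texts of class "scifi"
y = ["detective", "detective", "scifi", "scifi"]

clf = LinearSVC()
clf.fit(X, y)
print(clf.coef_)               # one learned weight per feature
print(clf.predict([[11, 0, 1]]))
```

Inspecting `coef_` shows which features the model relies on, which helps when interpreting results.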

SLIDE 25

Predictions

◮ Authorship
◮ Topic
◮ Readability
◮ Prose genre (detective, thriller, sci-fi, &c.)
◮ &c.

SLIDE 26

Two fundamental problems: I

Problems in Machine Learning:

Definition

The Curse of Dimensionality: Too many features. Not enough data to learn interactions of features.

◮ Limit the number of features.
◮ SVM handles a large number of features well.
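One simple way to limit the number of features is to keep only the k most frequent words as the vocabulary; a sketch with invented texts:

```python
from collections import Counter

# Invented toy texts; in practice the corpus novels would be used
texts = ["the detective found the body",
         "the ship sailed to the stars"]

# Count all words, then cap the feature set at the top k
counts = Counter(word for text in texts for word in text.split())
vocab = [word for word, _ in counts.most_common(3)]
print(vocab)
```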

SLIDE 27

Two fundamental problems: II

Problems in Machine Learning:

Definition

Overfitting: The training data has been learned so ‘well’ that nothing else can be predicted. ⇒ undergeneralization

◮ Validate predictions on a separate data set (train vs. test set)
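The train/test split can be sketched with the toy nearest-neighbor model from before (vectors and labels invented): fit on one part of the data, measure accuracy only on the held-out part.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(train, vec):
    # nearest-neighbor prediction over the training set only
    return min(train, key=lambda item: euclidean(vec, item[0]))[1]

data = [([12, 1, 0], "detective"), ([0, 1, 9], "scifi"),
        ([10, 2, 1], "detective"), ([1, 0, 8], "scifi"),
        ([11, 1, 1], "detective"), ([2, 1, 9], "scifi")]

train, test = data[:4], data[4:]   # never evaluate on training data
accuracy = sum(predict(train, vec) == label
               for vec, label in test) / len(test)
print(accuracy)
```

A score on the training set itself would reward overfitting; only the held-out score estimates generalization.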

SLIDE 29

Dimensionality Reduction

Issues with BOW model:

◮ Large vocabulary, high number of dimensions
◮ Would like to merge counts for similar words (e.g., color/colour, problem/issue)

Definition

Latent Semantic Analysis is a form of dimensionality reduction that attempts to summarize word counts as topics/concepts.
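At its core, LSA is a truncated singular value decomposition of the document-by-word count matrix; a sketch using numpy (assumed available; the counts are invented, with two "topics" baked in):

```python
import numpy as np

# rows: documents, columns: word counts (invented);
# docs 0-1 use words 0-1, docs 2-3 use words 2-3
counts = np.array([[3, 2, 0, 0],
                   [4, 3, 0, 1],
                   [0, 1, 4, 3],
                   [0, 0, 3, 4]], dtype=float)

U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2                            # keep two latent "topics"
reduced = U[:, :k] * s[:k]       # documents in the topic space
print(reduced.shape)
```

Documents that share vocabulary end up close together in the reduced space, even when they use no identical words.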

SLIDE 32

Limitations of Bag-of-Words models

Drawbacks:

◮ Word order information is lost
◮ Fixed granularity of individual words

Alternatives:

◮ More complex features, e.g., grammatical. But more complex features ...
  ◮ are more often wrong
  ◮ may have low counts, so statistics will be less reliable/powerful
◮ Incremental model; include context. But: difficult to model the influence of preceding text.

SLIDE 33

Aside: More advanced models

Topic Modeling: identify a number of topics (word distributions)

Deep Learning: automatically learn good representations of data (features) using neural networks
SLIDE 34

Today’s menu

  • 1. “The Riddle of Literary Quality” project
  • 2. Machine Learning
  • 3. Your Mission
SLIDE 35

Today: Prose Genres

◮ Detective
◮ Thriller
◮ ...
◮ Literary fiction

Who, what defines genres?

◮ Publishers, critics
◮ Topics, style of texts

SLIDE 36

The Data

◮ 300+ novels from Project Gutenberg;
◮ Mostly 19th century;
◮ From the following categories (“genres”):
  ◮ Adventure
  ◮ Detective
  ◮ Fiction
  ◮ Sci-Fi
  ◮ Short
  ◮ Historical
  ◮ Poetry

Ashok et al. (EMNLP, 2013). Success with style.

SLIDE 37

Your Mission

...should you choose to accept it:

  • 1. Install Python: http://continuum.io/downloads
  • 2. Download corpus & code:

http://tinyurl.com/n9aaoht

◮ Unzip, open the folder
◮ Click on start-windows.bat or start-osx.command
◮ A browser opens; open the notebook

DH-crash-course-riddle.ipynb

  • 3. Tweak parameters until score is acceptable
  • 4. Interpret the results
SLIDE 38

THE END