Computational semantics for the humanities: Diarmuid Ó Séaghdha



SLIDE 1

Computational semantics for the humanities

Diarmuid Ó Séaghdha

Natural Language and Information Processing Group, Computer Laboratory, University of Cambridge
do242@cam.ac.uk

Translation and the Digital, 25 April 2014

SLIDE 2

Introduction

◮ “Big Data” revolution:
  ◮ We have access to more textual data than any human could ever read.
  ◮ We can perform some kinds of automated analysis over large datasets.
◮ For humanities researchers:
  ◮ Data mining is a tool that facilitates asking questions about language use.
  ◮ Data mining is not a question or an answer.
◮ Natural Language Processing (NLP) research gives us computational methods for analysing and interpreting text.

SLIDE 3

Corpus frequency

[Figure: Proportional frequencies of "computer" and "mouse" in the Google Books corpus, 1900 to 2000. Vertical axis: proportional frequency, 0.5 to 2.5 ×10⁻⁴.]
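
A proportional frequency is a word's count in a given year divided by the total number of tokens in the corpus for that year. The sketch below is a minimal illustration under that reading; the function name and the toy counts are hypothetical and are not taken from the Google Books data.

```python
def proportional_frequencies(word_counts, total_counts):
    """Return {year: word count / total tokens} for years with a non-zero total."""
    return {
        year: word_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }


# Hypothetical toy counts, NOT real Google Books figures.
total_tokens = {1900: 1_000_000, 1950: 2_000_000, 2000: 4_000_000}
computer_counts = {1900: 2, 1950: 40, 2000: 800}

freqs = proportional_frequencies(computer_counts, total_tokens)
for year, freq in freqs.items():
    print(year, f"{freq:.1e}")
```

Normalising by the yearly total matters because the corpus itself grows over time; raw counts would rise for almost every word.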

SLIDE 4

Semantics: The distributional hypothesis

◮ Imagine that tezgüino is a rare English word, and you saw the word used in the following sentences:
  1. A bottle of tezgüino is on the table.
  2. Everyone likes tezgüino.
  3. Tezgüino makes you drunk.
  4. We make tezgüino out of corn.
  (Lin, 1998)
◮ Can you guess what tezgüino means?
◮ What kind of things do you expect will be similar to tezgüino?
◮ The Distributional Hypothesis: Two words are expected to be semantically similar if they have similar patterns of co-occurrence in observed text.
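
To make "patterns of co-occurrence" concrete, the following sketch counts the words that appear near tezgüino in the four example sentences. The window size of three tokens, the simple regex tokeniser, and the function name `context_profile` are assumptions made for illustration; the slides do not specify how contexts were extracted.

```python
import re
from collections import Counter


def context_profile(target, sentences, window=3):
    """Count words occurring within `window` tokens of each occurrence of `target`."""
    profile = Counter()
    for sentence in sentences:
        # Lowercase and split on word characters (Unicode-aware in Python 3).
        tokens = re.findall(r"\w+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                context = tokens[start:i] + tokens[i + 1:i + 1 + window]
                profile.update(context)
    return profile


sentences = [
    "A bottle of tezgüino is on the table.",
    "Everyone likes tezgüino.",
    "Tezgüino makes you drunk.",
    "We make tezgüino out of corn.",
]

profile = context_profile("tezgüino", sentences)
print(profile.most_common())
```

Contexts such as "bottle", "drunk" and "corn" are the kind of evidence a reader uses to guess the meaning; a word with a similar set of contexts would be a candidate for a similar meaning.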

SLIDE 5

Co-occurrences and similarity

◮ We can produce a distributional “profile” of a word from a corpus:

  farmer: part-time, sheep, peasant, tenant, wife, crop, ...
  doctor: nurse, junior, prescribe, consult, patient, surgery, ...
  hospital: psychiatric, memorial, discharge, admission, clinic, ...

◮ We can compute similarity between words by comparing their profiles.
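
One common way to compare two profiles is cosine similarity over their co-occurrence counts. The slides do not say which measure was used, so this is only a sketch of one option; the counts below are hypothetical and chosen for illustration.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse profiles ({context word: weight})."""
    dot = sum(weight * q.get(feature, 0.0) for feature, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_p * norm_q)


# Hypothetical co-occurrence counts, not taken from any real corpus.
profiles = {
    "doctor": {"patient": 8, "prescribe": 5, "surgery": 4, "wife": 1},
    "nurse": {"patient": 7, "surgery": 3, "junior": 2},
    "farmer": {"sheep": 9, "crop": 6, "tenant": 3, "wife": 2},
}

print("doctor vs nurse: ", round(cosine(profiles["doctor"], profiles["nurse"]), 3))
print("doctor vs farmer:", round(cosine(profiles["doctor"], profiles["farmer"]), 3))
```

With these toy counts, "doctor" comes out much closer to "nurse" than to "farmer", because the first pair shares heavily weighted contexts while the second shares only one rare context.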

SLIDE 6

Semantic space visualisation

[Figure: Semantic space of words from the British National Corpus (top 5000 dependencies). Words plotted: cat, dog, man, woman, kangaroo, salad, fish, pizza, doctor, pet, vet, nurse, food, cinema, surgery, surgeon, wine, beer, factory, worker, tool, hammer, shark, apple, hospital, computer, chicken.]

SLIDE 7

Discovering semantic classes

BNC nouns, method related to Latent Dirichlet Allocation (topic modelling)

Class 1      Class 2       Class 3    Class 4
attack       test          line       university
raid         examination   axis       college
assault      check         section    school
campaign     testing       circle     polytechnic
operation    exam          path       institute
incident     scan          track      institution
bombing      assessment    arrow      library
offensive    sample        curve      hospital

SLIDE 8

Tracking meaning over time

◮ Ongoing project (with Meng Zhang)
◮ We know that language changes over time.
◮ Words change their meaning by adding and losing senses and associations.
◮ Can we study this behaviour in a large corpus?
◮ Goal: “word biographies”.
◮ A historian of ideas might be interested in what a word meant to people at different points in time.

SLIDE 9

Tracking meaning over time

[Figure: Meaning consistency of "computer" and "mouse" in the Google Books corpus, 1900 to 2000. Vertical axis: consistency, 0.2 to 1.]
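
The slides do not define how "meaning consistency" is computed. One plausible reading, offered here purely as an assumption, is the similarity between a word's co-occurrence profile in each period and its profile in a reference period. The sketch below uses cosine similarity against the most recent period; the function names and the toy profiles for "mouse" are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse profiles ({context word: count})."""
    dot = sum(w * q.get(f, 0) for f, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(
        sum(w * w for w in q.values())
    )
    return dot / norm if norm else 0.0


def consistency_series(profiles_by_period, reference_period):
    """Similarity of each period's profile to the reference period's profile."""
    reference = profiles_by_period[reference_period]
    return {
        period: cosine(profile, reference)
        for period, profile in sorted(profiles_by_period.items())
    }


# Hypothetical profiles for "mouse" (ASSUMED data, not from Google Books).
mouse = {
    1900: {"cat": 6, "trap": 4, "cheese": 3},
    1950: {"cat": 5, "trap": 3, "laboratory": 4},
    2000: {"click": 7, "keyboard": 5, "cat": 2, "laboratory": 1},
}

series = consistency_series(mouse, reference_period=2000)
for period, score in series.items():
    print(period, round(score, 3))
```

Under this assumed definition, a low score for an early period signals that the word was used in quite different contexts then, which is the kind of shift a "word biography" would aim to trace.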

SLIDE 10

Conclusion

◮ We have methods for extracting meaning from document collections:
  ◮ Comparing words and texts
  ◮ Clustering words/concepts
  ◮ Identifying themes in a corpus
  ◮ Identifying associations between words/concepts
◮ We need users in other fields to provide interesting questions.
◮ If you have ideas, say hi! Or send me an email at do242@cam.ac.uk.