SLIDE 1

Text is fun: Statistical exploration of large corpora

Siva Reddy

Lexical Computing Ltd, UK

http://sketchengine.co.uk

IIIT-Hyderabad Advanced School on Natural Language Processing July 14 2012

Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 1 / 30

SLIDE 2

Acknowledgments

Adam Kilgarriff
Michael Rundell

SLIDE 3

What is “meaning”?

Semantics: the study of meaning in language.
Lexical semantics: the study of the meaning of words.

SLIDE 7

How were dictionaries built in the pre-computer era?

James Murray and colleagues: Oxford English Dictionary

SLIDE 8

How were dictionaries built in the pre-computer era?

Storage of evidence

SLIDE 9

How were dictionaries built in the pre-computer era?

Indexing

SLIDE 10

Revolution: Internet Era

SLIDE 11

Dictionary building: Requirements

Corpus (text) collection
Wordlist
Evidence collection: words in action
Word profiles

SLIDE 13

Web as Corpus: Challenges

Crawling
Text extraction
Spamming
Duplication

Exercise 1: WebBootCaT. Collect a corpus from the web on a topic of interest (Baroni et al., 2006; Kilgarriff et al., 2010).

SLIDE 15

Wordlist

Generalized dictionary
Domain-specific dictionary

Exercise 2: Keyword extraction. Collect keywords from the corpus you collected in Exercise 1.
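The slide does not say how keywords are scored, so the following is only a minimal sketch under an assumed scheme: compare each word's frequency per million in the collected (focus) corpus with its frequency per million in a general reference corpus, adding a smoothing constant to both. The function names `keywords` and `per_million`, the `smoothing` parameter, and the two toy token lists are all hypothetical.

```python
from collections import Counter


def keywords(focus_tokens, reference_tokens, top=3, smoothing=1.0):
    """Rank focus-corpus words by smoothed relative frequency ratio (assumed scheme)."""
    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)

    def per_million(count, size):
        # Normalise a raw count to a frequency per million tokens.
        return count * 1_000_000 / size

    scores = {
        word: (per_million(count, len(focus_tokens)) + smoothing)
        / (per_million(reference[word], len(reference_tokens)) + smoothing)
        for word, count in focus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy data, purely illustrative.
focus = "the corpus query returns the corpus lines".split()
reference = "the cat sat on the mat and the dog sat too".split()
print(keywords(focus, reference, top=1))
```

Words that are frequent everywhere, such as "the", get a ratio close to 1 and drop down the ranking, while words typical of the focus corpus rise to the top.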

SLIDE 17

Evidence collection

Words in action
Google-like searching isn't enough:
Get all the word forms of test?
Words which are at a distance of three from test?
Corpus Query Language: regular expressions
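The two questions on this slide can be approximated on raw text with ordinary regular expressions. The sketch below is an illustration in Python rather than in Corpus Query Language; the sample sentence, the list of inflectional endings, and the helper `within_distance` are assumptions made for the example.

```python
import re

text = "We test the tests daily. Testing was tested, and the tester retested it."

# Regex over raw text: "test" followed by a few common inflectional endings.
WORD_FORMS = re.compile(r"\btest(?:s|ed|ing)?\b", re.IGNORECASE)
forms = sorted({m.group(0).lower() for m in WORD_FORMS.finditer(text)})


def within_distance(tokens, target, distance=3):
    """Return the tokens found at most `distance` positions from each `target`."""
    neighbours = []
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - distance)
            neighbours.extend(tokens[lo:i] + tokens[i + 1:i + 1 + distance])
    return neighbours


tokens = re.findall(r"\w+", text.lower())
print(forms)
print(within_distance(tokens, "tested"))
```

Note that the pattern deliberately skips "tester" and "retested"; deciding which strings count as forms of a word is exactly where plain string matching starts to fall short.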

SLIDE 18

Regular expressions

Regular Expression Table:

http://bit.ly/KZT7Kj

Exercise 3: Write regular expressions for . . .

http://sketchengine.co.uk/exercises/regex/

SLIDE 19

CQL: Corpus Query Language

A query is a pattern that matches a sequence of tokens.
Tokens have attributes (word, lemma, tag, lempos, lc).

[attribute="value"] for each token pattern

The value is a regular expression.

Additional pointers:
http://bit.ly/LPRuju
http://trac.sketchengine.co.uk/wiki/SkE/CorpusQuerying

SLIDE 20

Corpus Processing: Challenges

What are the noun forms of the word test?
Will "test.*" work?
Word tokenization
Morphological analysis
Part-of-speech tagging
CQL: [lemma="treat" & tag="N.*"]
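To see why a surface pattern is not enough, the CQL query on this slide can be imitated in Python over tokens that already carry lemma and tag annotations. This is a rough sketch, not Sketch Engine's implementation: the annotated token list and the helper `match` are hypothetical, and it assumes the regular expression must match the whole attribute value.

```python
import re

# Hypothetical pre-annotated tokens: (word, lemma, tag). A real corpus would
# get these from a tokenizer, a morphological analyser and a POS tagger.
tokens = [
    ("treats", "treat", "NNS"),
    ("treated", "treat", "VBD"),
    ("treaty", "treaty", "NN"),
    ("treatment", "treatment", "NN"),
    ("treat", "treat", "NN"),
]


def match(token, **conditions):
    """True if every named attribute fully matches its regular expression."""
    attrs = dict(zip(("word", "lemma", "tag"), token))
    return all(re.fullmatch(rx, attrs[name]) for name, rx in conditions.items())


# Surface pattern only: also catches verbs and unrelated words.
by_surface = [t[0] for t in tokens if match(t, word="treat.*")]

# Emulates [lemma="treat" & tag="N.*"]: noun forms of the lemma only.
by_cql = [t[0] for t in tokens if match(t, lemma="treat", tag="N.*")]

print(by_surface)
print(by_cql)
```

The surface pattern returns all five words, while the lemma and tag conditions keep only the two noun forms, which is why tokenization, morphological analysis and tagging come before querying.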

SLIDE 22

Collocations (word associations)

When do you say a word A is important to word B?
mouse: laser
mouse: food

Exercise 4: Collocations of the words girl and boy? Download data from http://sivareddy.in/textisfun.tgz
Rank context words using mutual information (log removed for simplicity):

$$\frac{P(x,y)}{P(x)\,P(y)}$$
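A minimal sketch of the ranking in Exercise 4, using the ratio P(x,y) / (P(x) P(y)) with probabilities estimated from counts. The co-occurrence pairs below are made up for illustration and are not the course data; `association` and `collocations` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical (word, context word) co-occurrence pairs; not the course data.
pairs = [
    ("girl", "pretty"), ("girl", "pretty"), ("girl", "young"),
    ("boy", "young"), ("boy", "naughty"), ("boy", "naughty"),
    ("dog", "young"), ("dog", "bark"),
]

pair_counts = Counter(pairs)
word_counts = Counter(w for w, _ in pairs)
context_counts = Counter(c for _, c in pairs)
total = len(pairs)


def association(word, context):
    """P(x,y) / (P(x) P(y)), i.e. mutual information without the log."""
    p_xy = pair_counts[(word, context)] / total
    p_x = word_counts[word] / total
    p_y = context_counts[context] / total
    return p_xy / (p_x * p_y)


def collocations(word):
    """Context words seen with `word`, strongest association first."""
    seen = {c for (w, c) in pair_counts if w == word}
    return sorted(seen, key=lambda c: association(word, c), reverse=True)


print(collocations("girl"))
print(collocations("boy"))
```

In this toy data "young" occurs with every word, so it scores below 1 for girl, while "pretty" occurs only with girl and scores well above 1.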

SLIDE 24

Word Sketch - a profile describing collocations

Word Sketch of write-v: http://bit.ly/KUCBFj
The voice of the majority
Sketch Grammar: describes the frequent constructions of words in a language

Exercise 5: Objects of eat-v? Write the Sketch Grammar capturing the object relation.

SLIDE 25

My near-dream for Indian languages?

Writing a Sketch Grammar does not take much time.
Exploit the Sketch Grammar to build a syntactic parser
A parser for every language
Cash in on the similarities between different languages

SLIDE 26

When do you say two words are similar?

Distributional Hypothesis (Harris, 1954)
Words that occur in similar contexts tend to have similar meanings, e.g. laptop, computer.
Backbone of the Vector Space Model of semantics.

Firth (1957)
You shall know a person from his friends (Chinese proverb)
You shall know a word from its context (Firth’s principle)

Bag-of-words hypothesis
Two documents tend to be similar if they have similar distributions of similar words.

SLIDE 28

Vector Space Models (VSMs) of Semantics

Interpret semantics using VSM

Backbone: Distributional Hypothesis

The text entity we are interested in is a vector (point) in a multi-dimensional space.
The contexts of the entity are the dimensions.
Existing methods represent knowledge in VSMs mainly in three types (Turney and Pantel, 2010):

term-document
term-context
pair-pattern

SLIDE 31

Term-Document: (Salton et al., 1975)

d1: Human machine interface for Lab ABC computer applications

Image courtesy: Landauer et al., 1998

SLIDE 32

Term-Document: (Salton et al., 1975)

Document similarity can be found using cosine similarity:

$$\mathrm{sim}(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$$

Image courtesy: Salton et al., 1975
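As a rough illustration of the term-document model, each document can be turned into a vector of term counts and compared with cosine similarity. The sketch assumes raw counts and whitespace tokenization; `d1` is the example document from the earlier slide, while `d2` and `d3` are hypothetical documents added for comparison.

```python
import math
from collections import Counter


def term_vector(document):
    """Bag-of-words vector: term -> raw count."""
    return Counter(document.lower().split())


def cosine(v1, v2):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


d1 = term_vector("Human machine interface for Lab ABC computer applications")
d2 = term_vector("computer interface design")  # hypothetical document
d3 = term_vector("dictionary of word senses")  # hypothetical document

print(cosine(d1, d2))
print(cosine(d1, d3))
```

Documents that share terms with d1 get a similarity above zero, and documents with no terms in common get exactly zero.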

SLIDE 34

Term-Context: Word Space Model

Meaning of a word as a vector (Schütze, 1998)
The meaning of a word is represented as a co-occurrence vector built from a corpus.

Context words: police-n photon-n speed-n car-n soul-n
Traffic: 142 293 347 1
Light: 41 29 222 198 50
TrafficLight: 5 13 48

Exercise 6: Compute the similarity between girl, boy and dog.
Hint: Represent words as vectors using mutual information scores of context words, and compute cosine similarity.
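One possible shape for Exercise 6, following the hint: build a co-occurrence vector per target word, weight each context word by the log-free mutual information ratio, and compare the vectors with cosine similarity. The toy sentences and the choice of "every other word in the same sentence" as the context are assumptions for this sketch, not the course data or its settings.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, purely illustrative.
sentences = [
    "girl reads book", "boy reads book", "girl sings song",
    "boy kicks ball", "dog chases ball", "dog barks loudly",
]
targets = {"girl", "boy", "dog"}

# Count each target with every other word of the same sentence.
cooc = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for w in words:
        if w in targets:
            cooc[w].update(c for c in words if c != w)

total = sum(sum(ctx.values()) for ctx in cooc.values())
context_totals = Counter()
for ctx in cooc.values():
    context_totals.update(ctx)


def vector(word):
    """Context word -> P(x,y) / (P(x) P(y)) for the given target word."""
    word_total = sum(cooc[word].values())
    return {
        c: (n / total) / ((word_total / total) * (context_totals[c] / total))
        for c, n in cooc[word].items()
    }


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity(a, b):
    return cosine(vector(a), vector(b))


print(similarity("girl", "boy"), similarity("boy", "dog"), similarity("girl", "dog"))
```

With this toy corpus girl and boy share the contexts "reads" and "book", boy and dog share only "ball", and girl and dog share nothing, so the similarities come out in that order.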

SLIDE 36

Word Senses

So far we have represented a word with a single word sketch
mouse vs mouse?
Word Sense Disambiguation: collocations are the clue
WordNet has been used extensively
Can we guess the number of senses of a word?

SLIDE 37

Word Sense Induction

Figure: Word Sense Induction in a Graph based setting

SLIDE 38

Semantic Word Sketches

Semantic Frames
Demo: http://corpdev.sketchengine.co.uk/run.cgi/first_form?corpname=5dcaa5fe

Exercise 7: Abstract entities which modify boy and girl. Use the word senses of context words as a clue.

SLIDE 39

Beyond Words: Compositional Semantics

Given the meanings of
couch
roast
potato
can we interpret the meanings of
couch potato
roast potato

SLIDE 41

Couch Potato

SLIDE 42

Roast Potato

SLIDE 43

Bibliography I

Baroni, M., Kilgarriff, A., Pomikalek, J., and Rychly, P. (2006). Webbootcat: Instant domain-specific corpora to support human translators. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT), Norway.

Firth, J. R. (1957). A Synopsis of Linguistic Theory, 1930-1955. Studies in Linguistic Analysis, pages 1–32.

Harris, Z. S. (1954). Distributional structure. Word, 10:146–162.

Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory for many languages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta.

Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25:259–284.

Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Commun. ACM, 18:613–620.

SLIDE 44

Bibliography II

Schütze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24(1):97–123.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: vector space models of semantics. J. Artif. Int. Res., 37:141–188.
