Text is fun: Statistical exploration of large corpora

SLIDE 1

Text is fun: Statistical exploration of large corpora

Siva Reddy

Lexical Computing Ltd, UK

http://sketchengine.co.uk

IIIT-Hyderabad Advanced School on Natural Language Processing July 14 2012

Siva Reddy (Lexical Computing Ltd, UK) Text is fun: Statistical exploration of large corpora IIIT Hyderabad, India 1 / 30

SLIDE 2

Acknowledgments

Adam Kilgarriff
Michael Rundell

SLIDE 3

What is “meaning”?

Semantics: Study of meaning in language.
Lexical semantics: Study of meaning of words.

SLIDE 7

How were dictionaries built in the pre-computer era?

James Murray and colleagues: Oxford English Dictionary

SLIDE 8

How were dictionaries built in the pre-computer era?

Storage of evidence

SLIDE 9

How were dictionaries built in the pre-computer era?

Indexing

SLIDE 10

Revolution: Internet Era

SLIDE 11

Dictionary building: Requirements

Corpus (text) collection
Wordlist
Evidence collection: words in action
Word profiles

SLIDE 13

Web as Corpus: Challenges

Crawling
Text extraction
Spamming
Duplication

Exercise 1: WebBootCaT
Collect a corpus from the web on a topic of interest (Baroni et al., 2006; Kilgarriff et al., 2010).

SLIDE 15

Wordlist

Generalized dictionary
Domain-specific dictionary

Exercise 2: Keyword Extraction
Collect keywords from the corpus you collected above.
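The exercise does not say how keywords are scored. One common way to approach it, sketched here as an assumption rather than as the tool's actual method, is to compare a word's relative frequency in the collected (focus) corpus with its relative frequency in a general reference corpus. The function name `keyword_scores`, the `smoothing` constant, and the two toy corpora are hypothetical.

```python
from collections import Counter

def keyword_scores(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank focus-corpus words by a smoothed relative-frequency ratio (assumed scoring)."""
    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)
    n_focus = len(focus_tokens)
    n_ref = len(reference_tokens)
    scores = {}
    for word, count in focus.items():
        # Frequencies per million tokens, so corpora of different sizes are comparable.
        fpm_focus = count * 1_000_000 / n_focus
        fpm_ref = reference[word] * 1_000_000 / n_ref
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpora, invented for illustration only.
focus = "the striker scored a goal and the keeper saved a goal".split()
reference = "the cat sat on the mat and a dog sat on a log".split()
ranked = keyword_scores(focus, reference)
print(ranked[:3])
```

Words frequent in the focus corpus but rare in the reference corpus rise to the top, while function words such as "the" score close to 1.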

SLIDE 17

Evidence collection

Words in action
Google-like searching isn’t enough:
Get all the word forms of test?
Words which are at a distance of three from test?
Corpus Query Language: regular expressions
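The two questions above can be tried on plain text with ordinary regular expressions before moving to a corpus query language. This is a minimal sketch: the sample sentence and the helper `within_distance` are made up for illustration, and "distance of three" is read here as "at most three token positions away", which is only one possible interpretation.

```python
import re

text = "We test the device, then testing continues until all tests are tested again."
tokens = re.findall(r"\w+", text.lower())

# Word forms that start with "test". On real text this also admits
# unrelated words such as "testament".
forms = sorted({t for t in tokens if re.fullmatch(r"test\w*", t)})

def within_distance(tokens, target, distance=3):
    """Collect tokens at most `distance` positions away from each hit of `target`."""
    neighbours = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - distance):i]
            right = tokens[i + 1:i + 1 + distance]
            neighbours.extend(left + right)
    return neighbours

print(forms)
print(within_distance(tokens, "test"))
```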

SLIDE 18

Regular expressions

Regular expression table: http://bit.ly/KZT7Kj

Exercise 3: Write regular expressions for . . .
http://sketchengine.co.uk/exercises/regex/

SLIDE 19

CQL: Corpus Query Language

Query: a pattern matching a set of tokens
Tokens have attributes (word, lemma, tag, lempos, lc)
[attribute="value"] for each token pattern; value is a regular expression

Additional pointers:
http://bit.ly/LPRuju
http://trac.sketchengine.co.uk/wiki/SkE/CorpusQuerying
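To make the token-and-attribute idea concrete, here is a small Python sketch that imitates a CQL-style query over an annotated token list. It is not the Sketch Engine implementation: the toy corpus, the function names `token_matches` and `find_matches`, and the choice to require a full regular-expression match on each attribute value are assumptions made for illustration.

```python
import re

# Each token is a dict of attributes, using the attribute names from the slide.
corpus = [
    {"word": "The", "lemma": "the", "tag": "DT"},
    {"word": "tests", "lemma": "test", "tag": "NNS"},
    {"word": "were", "lemma": "be", "tag": "VBD"},
    {"word": "tested", "lemma": "test", "tag": "VBN"},
    {"word": "twice", "lemma": "twice", "tag": "RB"},
]

def token_matches(token, constraints):
    """True if every attribute regex fully matches the token's value."""
    return all(re.fullmatch(rx, token.get(attr, "")) for attr, rx in constraints.items())

def find_matches(corpus, pattern):
    """Return start indices where consecutive tokens satisfy the pattern."""
    hits = []
    for start in range(len(corpus) - len(pattern) + 1):
        window = corpus[start:start + len(pattern)]
        if all(token_matches(t, c) for t, c in zip(window, pattern)):
            hits.append(start)
    return hits

# Rough analogue of the CQL query [tag="DT"] [lemma="test" & tag="N.*"]
pattern = [{"tag": "DT"}, {"lemma": "test", "tag": "N.*"}]
print(find_matches(corpus, pattern))
```

Each dict in `pattern` plays the role of one bracketed token pattern, and several keys in one dict behave like constraints joined with `&`.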

SLIDE 20

Corpus Processing: Challenges

What are the noun forms of the word test? Will "test.*" work?
Word Tokenization
Morphological analysis
Part-of-Speech Tagging
CQL: [lemma="treat" & tag="N.*"]
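A short sketch of why a surface regular expression is not enough and why lemma and part-of-speech annotation help. The (word, lemma, tag) triples below are hand-written for illustration, with the word test from the question above as the lemma; in practice they would come from a tokenizer, a morphological analyser, and a tagger.

```python
import re

# (word, lemma, tag) triples; annotations are hypothetical, written by hand.
tokens = [
    ("tests", "test", "NNS"),
    ("testing", "test", "VBG"),
    ("testament", "testament", "NN"),
    ("test", "test", "NN"),
    ("tested", "test", "VBD"),
]

# Surface regex: also catches verb forms and unrelated words.
by_regex = [w for w, _, _ in tokens if re.fullmatch(r"test.*", w)]

# Lemma plus tag constraint, analogous to [lemma="test" & tag="N.*"].
noun_forms = [w for w, lemma, tag in tokens
              if lemma == "test" and re.fullmatch(r"N.*", tag)]

print(by_regex)
print(noun_forms)
```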

SLIDE 22

Collocations (word associations)

When do you say a word A is important to word B?
mouse: laser
mouse: food

Exercise 4: Collocations of the words girl and boy?
Download data from http://sivareddy.in/textisfun.tgz
Rank context words using mutual information (a):

$$\mathrm{MI}(x, y) = \frac{P(x, y)}{P(x)\,P(y)}$$

(a) Log removed for simplicity.
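The ranking can be sketched in a few lines of Python. The format of the downloadable data is not described on the slide, so the input is assumed here to be a list of (word, context word) pairs; the pairs below and the function name `association_scores` are invented for illustration.

```python
from collections import Counter

def association_scores(pairs):
    """Score each (word, context) pair by P(x, y) / (P(x) P(y)), without the log."""
    pair_counts = Counter(pairs)
    word_counts = Counter(x for x, _ in pairs)
    context_counts = Counter(y for _, y in pairs)
    total = len(pairs)
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total
        p_x = word_counts[x] / total
        p_y = context_counts[y] / total
        scores[(x, y)] = p_xy / (p_x * p_y)
    return scores

# Hypothetical co-occurrence pairs standing in for the exercise data.
pairs = [("girl", "pretty"), ("girl", "young"), ("boy", "young"),
         ("boy", "naughty"), ("girl", "pretty"), ("boy", "young")]
scores = association_scores(pairs)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```

A score above 1 means the pair co-occurs more often than it would if the two words were independent. Very rare context words can get inflated scores, so a minimum frequency cut-off is often applied in practice.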

SLIDE 24

Word Sketch - a profile describing collocations

Word Sketch of write-v: http://bit.ly/KUCBFj
The voice of the majority
Sketch Grammar: describes the frequent constructions of words in a language

Exercise 5: Objects of eat-v?
Write the Sketch Grammar capturing the object relation.

SLIDE 25

My near-dream for Indian languages?

Writing a Sketch Grammar is not very time-consuming.
Exploit the Sketch Grammar to build a syntactic parser
A parser for every language
Cash in on the similarities between different languages

SLIDE 26

When do you say two words are similar?

Distributional Hypothesis (Harris, 1954)
Words that occur in similar contexts tend to have similar meanings, e.g. laptop, computer.
Backbone of the Vector Space Model of semantics.

Firth (Firth, 1957)
You shall know a person from his friends - Chinese Proverb
You shall know a word from its context - Firth’s Principle

Bag of words hypothesis
Two documents tend to be similar if they have similar distribution of similar words

SLIDE 28

Vector Space Models (VSMs) of Semantics

Interpret semantics using VSM
Backbone: Distributional Hypothesis

The text entity we are interested in is a vector (point) in a multi-dimensional space.
The contexts of the entity are the dimensions.
Existing methods represent knowledge in VSMs mainly in three types of matrix (Turney and Pantel, 2010):
term-document
term-context
pair-pattern

SLIDE 31

Term-Document: (Salton et al., 1975)

d1: Human machine interface for Lab ABC computer applications

Image courtesy: (Landauer et al., 1998)

SLIDE 32

Term-Document: (Salton et al., 1975)

Document similarity can be found using cosine similarity:

$$\mathrm{sim}(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$$

Image courtesy: (Salton et al., 1975)
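As a worked illustration of cosine similarity on term-document vectors, the sketch below represents each document as a bag of word counts. The first document is d1 from the earlier slide; the other two sentences and the helper name `cosine` are chosen here only for illustration, and no stop-word removal or weighting is applied.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Term-frequency vectors: one dimension per distinct word.
d1 = Counter("human machine interface for lab abc computer applications".split())
d2 = Counter("a survey of user opinion of computer system response time".split())
d3 = Counter("the generation of random binary unordered trees".split())

print(round(cosine(d1, d2), 3))
print(round(cosine(d1, d3), 3))
```

d1 and d2 share only the term "computer", so their similarity is small but non-zero; d1 and d3 share no terms, so their similarity is 0.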

SLIDE 34

Term-Context: Word Space Model

Meaning of a word as a vector (Schütze, 1998)
The meaning of a word is represented as a co-occurrence vector built from a corpus.

Context words: police-n photon-n speed-n car-n soul-n
Traffic: 142 293 347 1
Light: 41 29 222 198 50
TrafficLight: 5 13 48

Exercise 6: Compute similarity between girl, boy, dog
Hint: Represent words as vectors using mutual information scores of context words, and compute cosine similarity.
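One way the hint could be followed is sketched below: each word becomes a sparse vector whose dimensions are context words, weighted by P(x,y) / (P(x) P(y)), and words are then compared with cosine similarity. The (word, context) pairs are invented toy data, not the exercise data, and `mi_vectors` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

def mi_vectors(pairs):
    """Build one vector per word, weighting each context by P(x,y) / (P(x) P(y))."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    vectors = defaultdict(dict)
    for (w, c), n in pair_counts.items():
        # (n / total) / ((word / total) * (context / total)) simplifies to this ratio.
        vectors[w][c] = (n * total) / (word_counts[w] * context_counts[c])
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical (word, context word) observations.
pairs = [
    ("girl", "young"), ("girl", "school"), ("girl", "smile"),
    ("boy", "young"), ("boy", "school"), ("boy", "run"),
    ("dog", "bark"), ("dog", "run"), ("dog", "tail"),
]
vectors = mi_vectors(pairs)
for a, b in [("girl", "boy"), ("boy", "dog"), ("girl", "dog")]:
    print(a, b, round(cosine(vectors[a], vectors[b]), 3))
```

On this toy data girl and boy share two contexts and come out as the most similar pair, boy and dog share one, and girl and dog share none.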

SLIDE 36

Word Senses

So far we have represented a word with a single word sketch
mouse vs mouse?
Word Sense Disambiguation: collocations are the clue
WordNet has been used extensively
Can we guess the number of senses of a word?

SLIDE 37

Word Sense Induction

Figure: Word Sense Induction in a Graph based setting

SLIDE 38

Semantic Word Sketches

Semantic Frames
Demo: http://corpdev.sketchengine.co.uk/run.cgi/first_form?corpname=5dcaa5fe

Exercise 7: Abstract entities which modify boy and girl
Use the word sense of context words as a clue.

SLIDE 39

Beyond Words: Compositional Semantics

Given the meanings of
couch
roast
potato

Can we interpret the meanings of
couch potato
roast potato

SLIDE 41

Couch Potato

SLIDE 42

Roast Potato

SLIDE 43

Bibliography I

Baroni, M., Kilgarriff, A., Pomikálek, J., and Rychly, P. (2006). WebBootCaT: Instant domain-specific corpora to support human translators. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT), Norway.

Firth, J. R. (1957). A Synopsis of Linguistic Theory, 1930-1955. Studies in Linguistic Analysis, pages 1–32.

Harris, Z. S. (1954). Distributional structure. Word, 10:146–162.

Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory for many languages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta.

Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25:259–284.

Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Commun. ACM, 18:613–620.

SLIDE 44

Bibliography II

Schütze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24(1):97–123.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: vector space models of semantics. J. Artif. Int. Res., 37:141–188.
