
SLIDE 1

Statistical Analysis of Corpus Data with R

Distributional properties of Italian NN compounds: An Exploration with R

Designed by Marco Baroni¹ and Stefan Evert²

¹Center for Mind/Brain Sciences (CIMeC), University of Trento

²Institute of Cognitive Science (IKW), University of Osnabrück

SLIDE 2

Outline

Introduction
Data
Clustering
k-means
Dimensionality reduction with PCA

SLIDE 3

NN Compounds

◮ Part of work carried out by Marco Baroni with Emiliano Guevara (U Bologna) and Vito Pirrelli (CNR/ILC, Pisa)
◮ Three-way classification inspired by theoretical (Bisetto and Scalise, 2005) and psychological work (e.g., Costello and Keane, 2001)
◮ Relational (computer center, angolo bambini)
◮ Attributive (swordfish, esperimento pilota)
◮ Coordinative (singer-songwriter, bar pasticceria)

SLIDE 4

Relational compounds

◮ Express relation between two entities
◮ Heads are typically information containers, organizations, places, aggregators, pointers, etc.
◮ M "grounds" generic meaning of, or fills slot of, H
◮ E.g., stanza server ("server room"), fondo pensioni ("pension fund"), centro città ("city center")

SLIDE 5

Attributive compounds

◮ Interpretation of M is reduced to a "salient" property of its full semantic content, and this property is attributed to H:
◮ presidente fantoccio ("puppet president"), progetto pilota ("pilot project")

SLIDE 6

Coordinative compounds

◮ Head and modifier denote similar/compatible entities, compound has coordinative reading
◮ HM is both H and M
◮ viaggio spedizione ("expedition travel"), cantante attore ("singer actor")
◮ Ignored here

SLIDES 7-8

Ongoing exploration

◮ Data-set of frequent compounds: 24 ATT / 100 REL
◮ All ATT and REL compounds with freq ≥ 1,000 in itWaC (2 billion token Italian Web-based corpus)
◮ Will the distinction between ATT and REL emerge from combination of distributional cues (also extracted from itWaC)?
◮ Cues:
  ◮ Semantic similarity between head and modifier
  ◮ Explicit syntactic link
  ◮ Relational properties of head and modifier
  ◮ "Specialization" of head and modifier

SLIDE 9

Outline

Introduction
Data
Clustering
k-means
Dimensionality reduction with PCA

SLIDE 10

The data

H        Compound head (Italian compounds are left-headed!)
M        Modifier
TYPE     attributive or relational
COS      Cosine similarity between H and M
DELLL    Log-likelihood ratio score for comparison between observed frequency of H del M ("H of the M") and expected frequency under independence
HDELPROP Proportion of times H occurs in context H del NOUN over total occurrences of H
DELMPROP Proportion of times M occurs in context NOUN del M over total occurrences of M
HNPROP   Proportion of times H occurs in context H NOUN over total occurrences of H
NMPROP   Proportion of times M occurs in context NOUN M over total occurrences of M
SLIDE 11

Cue statistics

◮ Read the file comp.stats.txt into a data-frame named d and "attach" the data-frame
☞ load file with read.delim() function as recommended
☞ use option encoding="UTF-8" on Windows
◮ Compute basic statistics
◮ Look at the distribution of each cue among compounds of type attributive (at) vs. relational (re)
◮ Find out for which cues the distinction between attributive and relational is significant (using a t-test or Mann-Whitney ranks test)
◮ Also, which cues are correlated? (use cor() on the subset of the data-frame that contains the cues; a sketch of these steps follows below)
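
A minimal sketch of one way to work through these steps (assuming comp.stats.txt is tab-delimited with the columns listed on the previous slide; COS stands in for any of the six cues):

> d <- read.delim("comp.stats.txt", encoding="UTF-8")
> attach(d)

> summary(d)                       # basic statistics for all columns

> tapply(COS, TYPE, summary)       # distribution of one cue by type
> boxplot(COS ~ TYPE, data=d)

> t.test(COS ~ TYPE, data=d)       # t-test for the at/re difference
> wilcox.test(COS ~ TYPE, data=d)  # Mann-Whitney ranks test

> cor(d[, 4:9])                    # correlations among the six cues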

SLIDES 12-13

Outline

Introduction
Data
Clustering
k-means
Dimensionality reduction with PCA

SLIDE 14

Clustering

◮ k-means: one of the simplest and most widely used hard flat clustering algorithms
◮ For more sophisticated options, see the cluster and e1071 packages (a usage sketch follows below)
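
For instance, a quick sketch with pam() (partitioning around medoids) from the cluster package; the call mirrors the kmeans() usage shown later:

> library(cluster)                     # install.packages("cluster") if missing
> pam.res <- pam(scale(d[,4:9]), k=2)  # k=2 medoid-based clusters
> table(pam.res$clustering, d$TYPE)    # confusion matrix against TYPE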

SLIDE 15

k-means

◮ The basic algorithm (a bare-bones sketch follows below):
  1. Start from k random points as cluster centers
  2. Assign points in data-set to cluster of closest center
  3. Re-compute centers (means) from points in each cluster
  4. Iterate cluster assignment and center update steps until configuration converges
◮ Given random nature of initialization, it pays off to repeat procedure multiple times (or to start from "reasonable" initialization)

SLIDES 16-26

Illustration of the k-means algorithm

See help(iris) for more information about the data set used

[Figure: a sequence of animation frames showing successive k-means iterations on the iris data; axes: petal length (z-score) vs. petal width (z-score), both roughly from -2 to 2]
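
A rough way to reproduce this kind of plot yourself (a sketch; the choice of three clusters matches the three iris species, but any k works):

> x <- scale(iris[, c("Petal.Length", "Petal.Width")])  # z-scores
> km <- kmeans(x, centers=3, nstart=10)
> plot(x, col=km$cluster, pch=19,
+      xlab="petal length (z-score)", ylab="petal width (z-score)")
> points(km$centers, pch=8, cex=2)   # mark the cluster centers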

SLIDE 27

k-means, first try

# cues are in columns 4 to 9
> km <- kmeans(d[,4:9], 2, nstart=10)
> km

# problem: extreme DELLL values dominate the clustering
# (relevant small cluster might be cluster 2 in your solution)
> DELLL[km$cluster==1]
> head(sort(DELLL, decreasing=TRUE))

SLIDE 28

Scaling and trying again

> scaled <- scale(d[,4:9])
> summary(d[4:9])   # distribution of original data
> summary(scaled)   # after scaling

> km <- kmeans(scaled, 2, nstart=10)
> km
> table(km$cluster, d$TYPE)  # confusion matrix
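
As a sanity check on what scale() does (a sketch: it subtracts each column's mean and divides by its standard deviation):

> all.equal(scaled[,"COS"],
+           (d$COS - mean(d$COS)) / sd(d$COS),
+           check.attributes=FALSE)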

SLIDE 29

Outline

Introduction
Data
Clustering
k-means
Dimensionality reduction with PCA

SLIDE 30

Dimensionality reduction

◮ To find "latent" variables
◮ To reduce random noise
◮ For easier visualization

SLIDE 31

Principal component analysis (PCA)

◮ Find a set of orthogonal dimensions such that the first dimension "accounts" for the most variance in the original data-set, the second dimension accounts for as much as possible of the remaining variance, etc.
◮ The top k dimensions (principal components) are the best sub-set of k dimensions to approximate the spread in the original data-set
◮ Principal components represent correlations of original variables ➪ might reveal interesting underlying patterns

SLIDES 32-38

Preserving variance: examples

[Figure: animation frames projecting a two-dimensional point cloud (dimension 1 vs. dimension 2) onto different candidate directions; variance of the projected data in the successive frames: 1.26, 0.36, 0.72, 0.9]
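
The variance figures above are what you get by projecting the points onto a unit-length direction vector; a sketch of that computation, assuming xy holds the two-column data and u is an arbitrary example direction:

> u <- c(1, 1) / sqrt(2)        # a unit-length direction (example)
> proj <- as.matrix(xy) %*% u   # project each point onto u
> var(proj)                     # variance preserved by this direction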
SLIDE 39

Adding an orthogonal dimension

[Figure: the same point cloud with a second dimension, orthogonal to the first, added; axes: dimension 1 vs. dimension 2]

SLIDE 40

PCA in R

> temp <- subset(d, select=c(HNPROP, NMPROP, DELLL,
+                            HDELPROP, DELMPROP, COS))
> pr <- prcomp(temp, scale=TRUE)
> pr
> plot(pr)
> biplot(pr)
> biplot(pr, xlabs=TYPE, xlim=c(-.25,.25), ylim=c(-.25,.25))
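
To see how much variance each component accounts for, the standard prcomp accessors can be used:

> summary(pr)   # proportion of variance per principal component
> pr$rotation   # loadings of the six cues on each component
> head(pr$x)    # coordinates of the compounds in PC space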

SLIDE 41

More refined plotting

> plot(pr$x[,1:2], type="n",
+      xlim=c(min(pr$x[,1]),4), ylim=c(min(pr$x[,2]),4))
# only sets up plot region

> points(subset(pr$x, TYPE=="re"), col="blue", pch=19, lwd=2)
# blue points for type "re"
> points(subset(pr$x, TYPE=="at"), col="red", pch=19, lwd=2)
# red points for type "at"

> legend("topright", inset=.05, fill=c("red","blue"),
+        cex=1.5, legend=c("ATT","REL"))
# legend explains colors

SLIDE 42

Adding the cues

> text(pr$rotation[1,1]*4, pr$rotation[1,2]*4, label="H N", cex=1.7)
> text(pr$rotation[2,1]*4, pr$rotation[2,2]*4, label="N M", cex=1.7)
> text(pr$rotation[3,1]*4, pr$rotation[3,2]*4, label="H DEL M", cex=1.7)
> text(pr$rotation[4,1]*4, pr$rotation[4,2]*4, label="H DEL", cex=1.7)
> text(pr$rotation[5,1]*4, pr$rotation[5,2]*4, label="DEL M", cex=1.7)
> text(pr$rotation[6,1]*4, pr$rotation[6,2]*4, label="COS", cex=1.7)

SLIDE 43

Trying k-means again

> km <- kmeans(pr$x[,1:4], 2, nstart=10)
> table(km$cluster, d$TYPE)
# what happens with more/fewer dimensions?

> plot(pr$x[,1:2], type="n",
+      xlim=c(min(pr$x[,1]),4), ylim=c(min(pr$x[,2]),4))
> text(pr$x[,1], pr$x[,2], col=km$cluster, labels=TYPE)
# now refine this plot as on previous slides
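
One way to boil the confusion matrix down to a single number (a sketch; cluster labels are arbitrary, so take the better of the two possible label assignments):

> tab <- table(km$cluster, d$TYPE)
> max(sum(diag(tab)), tab[1,2] + tab[2,1]) / sum(tab)
# proportion of compounds in the majority assignment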