

slide-1
SLIDE 1

Carme 2011

Mónica Bécue Bertaut Universitat Politècnica de Catalunya

Textual and Lexical Statistics

slide-2
SLIDE 2

Outline

1. Introduction
2. Illustrative example
3. Statistical analysis. Methods and results
   3.1 Text encoding and data structures
   3.2 Glossaries
   3.3 Lexical statistics
   3.4 Textual statistics
  • Correspondence analysis
  • Constrained clustering
   Other applications, other methods, in brief:
  • canonical correspondence analysis
  • multiple factor analysis for contingency tables
4. Concluding remarks

Mónica Bécue-Bertaut Lexical and textual statistics


slide-3
SLIDE 3

1. Introduction. Some dates about statistical analysis of texts

In the late 1950s, several initiatives set out to archive the classics on electronic media (Trésor général des langues et parlers français in France, the Oxford Concordances in the UK) and to develop methods for handling these newly available, very large collections of texts.

Lexical statistics (P. Guiraud, Ch. Muller, G.K. Zipf, G.U. Yule, …). Muller Ch. (1977). Principes et méthodes de statistique lexicale. Hachette. Its 35th anniversary falls next year.

Correspondence analysis (CA), proposed by Benzécri (1961-1964), burst in on these first works and offered a new and invigorating approach. CA and clustering methods constitute the core of textual statistics, whose applications are extremely wide-ranging. Benzécri (1981). Pratique de l'analyse des données. Tome 3. Linguistique et lexicologie. Its 30th anniversary is this year.

slide-4
SLIDE 4

1. Introduction. Types of corpus (collection of texts/documents)

  • The classics
  • The Bible
  • Political speeches
  • Newspapers
  • Bibliography on a given theme, for technological watch and scientific policy
  • Free-text answers to open-ended questions in surveys
  • Non-directive interviews
  • Complaint letters
  • Information retrieval
  • Automatic search in textual bases such as legal bases
  • Film or TV series scripts. What is a good script?
  • Organisation of textual bases
  • Closing prosecution speech in a trial. Is the speech a good rhetorical text?
  • Free comments in hall test sessions on wine, perfume, cheese… tasting

slide-5
SLIDE 5

1. Introduction. Challenge and approaches

Challenge: turning texts into (textual) data.
Starting point: counting the word frequencies.

Lexical statistics start from the sequence of words. They mainly look for the lexical structure of the corpus (Muller, 1977; Labbé, 1990).

Textual statistics start from the documents×words matrix. They focus on the distribution of the words across the documents, applying correspondence analysis and clustering methods (Benzécri, 1981; Lebart & Salem, 1994; Murtagh, 2005).

slide-6
SLIDE 6

2. Illustrative example: uncover the discursive strategy in a prosecution closing speech

Speech delivered at the end of a murder trial at the Barcelona Audience Court (in Spanish). It lasts one hour and 15 minutes. The researchers segmented it into 59 discursive spaces, taking the prosecutor's changes of tone and silences as breakpoints.

Beginning of the speech:

---esp0001
con la venia por este ministerio fiscal se ha formulado un escrito de conclusiones elevado a la definitiva por varios hechos un delito relativo a la prostitución un delito de asesinato así como un delito de estafa en concurso ideal con el anterior con respecto al primero de ellos es decir al delito relativo a la prostitución no nos queda ningún género de dudas (…)

(Approximate translation: with the leave of the Court, this public prosecutor's department has submitted a final written statement of conclusions concerning several incriminating facts: a prostitution-related offence, a murder offence, as well as a premeditated fraud offence. Concerning the first offence, we have no doubts…)

slide-7
SLIDE 7

Objectives

For the prosecutor: to demonstrate his thesis and convince the audience (judges and lawyers; in this case, no jury) while observing the law, by adopting a suitable discursive strategy: selection of words, order of the arguments, rhythm of the speech.

For the analyst: to uncover, through statistical tools, how the prosecutor uses the information at his disposal to structure his speech and argumentation. Is the speech good?

slide-8
SLIDE 8

Some words about the context

  • Murder of a prostitute (MJAM) by her pimp in order to cash in a substantial life insurance policy.
  • No evidence.
  • Only one witness (MF) had heard the pimp talking to an accomplice about the murder. However, she withdrew her statement to the police when facing the judge. She was most probably afraid…
  • The prosecutor has to rely on persuasive clues and circumstantial evidence to show the implausibility of the defence thesis.
  • It is the prosecutor's first important case, while the defendant has the best criminal counsel in Barcelona. The prosecutor's colleagues consider it a "lost case"…

slide-9
SLIDE 9

Some words about the context

Oral speech that follows a classical scheme:

1. Listing the offences and indicating the legal framework
2. Describing the facts, the data and their connections
3. Qualifying these facts in legal terms, as the concluding part of the speech

The speech is largely improvised: the prosecutor has to deliver it right at the end of the trial, taking into account everything that has occurred during the events and statements.

slide-10
SLIDE 10
3. Statistical analysis: methods and results

1. Text encoding and data structures. Defining textual units and segmenting the corpus into documents
2. Glossaries
3. Lexical statistics
  • Distribution of the vocabulary
  • Growth of the vocabulary
4. Textual statistics
  • Correspondence analysis
  • Constrained clustering
  • Characteristic lexical features
  • Other applications, other methods

slide-11
SLIDE 11

3.1 Text encoding and data structures

Textual units

Words: │ con │ la │ venia │ por │ este │ ministerio │ fiscal │ se │ ha │ formulado │
Lemmas: │ ConPr │ leArt │ veniaSubf │ porPr │ esteDet │ ministerioSubm │ fiscalAdj │ sePro │ formularVb │
Repeated segments: │ ministerio fiscal │ este ministerio fiscal │

Text encoding: in this case, both tool words and full words are kept.

Segmentation of the corpus into documents

In the example, two nested segmentations:

  • 59 discursive spaces (roughly, paragraphs)
  • a coarser and more regular segmentation into 20 "blocks" of about 500 occurrences (whose limits coincide with discursive-space limits)
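The idea of repeated segments can be illustrated with a minimal sketch. It assumes that a repeated segment is any sequence of two or more consecutive words occurring at least twice. The function name `repeated_segments`, its parameters and the toy token sequence are hypothetical and do not come from the slides.

```python
from collections import Counter

def repeated_segments(tokens, max_len=3, min_count=2):
    """Count word sequences of length 2..max_len occurring at least min_count times."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {seg: c for seg, c in counts.items() if c >= min_count}

# Toy sequence (an assumption for illustration, not the real speech).
tokens = "por este ministerio fiscal se ha dicho que este ministerio fiscal entiende".split()
segments = repeated_segments(tokens)
```

On this toy sequence the function returns the two segments shown on the slide plus the shorter segment "este ministerio", each with a count of 2.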

slide-12
SLIDE 12

Data structures

Short example: a corpus composed of 3 documents; N=14 occurrences, V=9 distinct words.

Corpus encoded into a sequence of labelled occurrences:

doc1: W1 W8 W1 W3
doc2: W8 W6 W7 W8 W5
doc3: W9 W7 W8 W5 W6

Corpus encoded into a documents×words table (frequency table; a dot marks an empty cell):

doc  w1 w2 w3 w4 w5 w6 w7 w8 w9
1     2  .  1  .  .  .  .  1  .
2     .  .  .  .  1  1  1  2  .
3     .  .  .  .  1  1  1  1  1

Corpus and metadata encoded into a multiple table:

doc  w1 w2 w3 w4 w5 w6 w7 w8 w9  date topic author
1     2  .  1  .  .  .  .  1  .   y1   a     1
2     .  .  .  .  1  1  1  2  .   y2   a     2
3     .  .  .  .  1  1  1  1  1   y3   b     1
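The passage from the sequence of labelled occurrences to the documents×words table can be sketched as follows. The split of the 14 occurrences into three documents is inferred from the row totals of the slide's table, and the names `documents`, `vocabulary` and `lexical_table` are hypothetical.

```python
from collections import Counter

# Sequence of labelled occurrences from the short example; the document
# boundaries are an inference from the row totals of the frequency table.
documents = {
    "doc1": ["w1", "w8", "w1", "w3"],
    "doc2": ["w8", "w6", "w7", "w8", "w5"],
    "doc3": ["w9", "w7", "w8", "w5", "w6"],
}
vocabulary = [f"w{i}" for i in range(1, 10)]  # columns w1..w9, as in the table header

def lexical_table(docs, vocab):
    """Return one row of word counts per document, in vocabulary order."""
    table = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)  # missing words count as 0
        table[name] = [counts[w] for w in vocab]
    return table

table = lexical_table(documents, vocabulary)
n_occurrences = sum(sum(row) for row in table.values())
```

Metadata such as date, topic and author would then be appended as extra columns to each row, giving the multiple table.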

slide-13
SLIDE 13

3.2 Glossary of words

Vocabulary

  • N=10400 occurrences; V=1799 distinct words
  • 302 words are repeated 5 times or more (8031 occurrences are kept)
  • Mean length of the discursive spaces: 176.3 occurrences
  • Length of the discursive spaces: from 54 to 463 occurrences
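As a minimal sketch of how such a glossary and its frequency threshold might be computed, the function below returns N, V and the words kept at a given threshold. The name `glossary`, its signature and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def glossary(tokens, min_freq=5):
    """Return (N, V, kept_words, kept_occurrences) for a frequency threshold."""
    counts = Counter(tokens)
    kept = {w: c for w, c in counts.items() if c >= min_freq}
    return len(tokens), len(counts), kept, sum(kept.values())

# Toy corpus (hypothetical): 'a' occurs 6 times, 'b' 5 times, 'c' twice.
tokens = ["a"] * 6 + ["b"] * 5 + ["c"] * 2
n, v, kept, kept_occ = glossary(tokens, min_freq=5)
```

With the figures reported for the speech, the threshold keeps 8031 of 10400 occurrences, that is about 77% of the corpus, and 10400 occurrences over 59 spaces gives the stated mean of about 176.3.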

slide-14
SLIDE 14

Glossary. Frequent words

Vocabulary

  • Life insurance: seguro (insurance, 51), persona (person, 40), seguro de vida (life insurance, 25), relación (relationship, 26), millones (millions, 13), beneficiarios (beneficiaries, 11).
  • Actors: MJAM (victim, 38), FPM (33) and JCM (22) (defendants), hijos (children, 21), SRT (wife of the defendant, 17), MF (witness for the prosecution, 15).
  • Statement of the witness for the prosecution: declaración(es) (statement(s), 43), policía (police, 27), delito (offence, 21), caso (case, 17), defensa (defence, 17), manifestaciones (statements, 16).
  • Facts, data, clues: because there is no evidence, the prosecutor mentions hecho(s) (fact(s), 50), otro(s) dato(s) (other data, 32), indicios (clues, 11), "un cúmulo de índices" (an accumulation of clues/circumstantial evidence), prueba (evidence, 15).
  • Words indicating conviction: realmente (really, 44), consta (it is established, 31), perfectamente (perfectly, 18), es evidente (it is evident, 13), tenemos (we have, 14), sabemos (we know, 11).

slide-15
SLIDE 15

3.3 Lexical statistics

Lexical statistics: the corpus/text as a sequence of occurrences

W1 W8 W1 W3 … W9 W7 W8 W5 W6
(from Block 1 to Block 20)

slide-16
SLIDE 16

Distribution of the vocabulary

Stable words: words with a uniform distribution over the corpus, identified with the vocabulary distribution index (Hubert & Labbé, 1990).

Stable words in the speech: perfectamente, tenido, puede, declaraciones, puesto, hechos, manifiesta, manifestó, decir, ejemplo, decimos, mantenía, sabemos, exactamente, totalmente, fundamental, tengo, siempre, tema, plenamente, evidentemente, testigos, lógico, contar, tenemos, testimonio.

These words, not always important in terms of frequency, give a "tone" to the speech and impose an idea or a general impression through their regular appearance. In our case, many stable words pass on a message of conviction: the prosecutor has no doubt and wishes to transmit his conviction.

slide-17
SLIDE 17

Growth and specialization of the vocabulary

Labbé and Hubert's model (1990)

[Figure: vocabulary growth, vocabulary size V against number of occurrences N. Curves: observed vocabulary growth; vocabulary growth under the model; general vocabulary.]

  • a. Urn with the general vocabulary
  • b. Urns with the specialized vocabularies

Model: the speaker randomly draws occurrences either from the specialized urns, with probability P, or from the general vocabulary urn, with probability Q = 1-P.

Results from estimating the model: an estimate of P is obtained. The expected and observed vocabulary sizes can then be compared at different stages N' of the corpus.
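The comparison between observed and expected vocabulary size can be sketched as below. The slides do not give the model's formula, so the expression used here is an assumption: with u = N'/N, the specialized part is taken to grow linearly (u·V) and the general part to follow the binomial urn approximation, where a word of frequency f has appeared with probability 1-(1-u)^f. The function names and the way P enters are hypothetical and should be checked against Hubert & Labbé before any real use.

```python
from collections import Counter

def observed_growth(tokens, checkpoints):
    """Number of distinct words among the first n tokens, for each n."""
    return [len(set(tokens[:n])) for n in checkpoints]

def expected_growth(tokens, checkpoints, p):
    """Expected vocabulary size under the assumed two-urn mixture."""
    freqs = Counter(tokens)
    n_total, v_total = len(tokens), len(freqs)
    expected = []
    for n in checkpoints:
        u = n / n_total
        # General urn: chance that a word of frequency f has already appeared.
        general = sum(1.0 - (1.0 - u) ** f for f in freqs.values())
        # Specialized urns: assumed linear growth of the vocabulary.
        specialized = u * v_total
        expected.append(p * specialized + (1.0 - p) * general)
    return expected

# Toy sequence taken from the short example of the data-structures slide.
tokens = "w1 w8 w1 w3 w8 w6 w7 w8 w5 w9 w7 w8 w5 w6".split()
checkpoints = [4, 9, 14]
observed = observed_growth(tokens, checkpoints)
expected = expected_growth(tokens, checkpoints, p=0.19)
```

Under this formulation both curves meet at the end of the corpus, so the informative quantity is the difference between observed and expected sizes at intermediate stages.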

slide-18
SLIDE 18

Growth and specialization of the vocabulary

In the case of the closing speech, P=0.19, which indicates a high specialization for a corpus composed of a single speech about a single topic.

[Figure: vocabulary growth, observed minus expected vocabulary, for blocks 1 to 20.]

slide-19
SLIDE 19

3.4 Textual statistics

Corpus encoded into a frequency table (lexical table), that is, a documents×words table. At first, the spaces×words table.

Frequency threshold: 5. 302 different words and 8031 occurrences are kept.

spaces  w1  …  w302
1        2  …   …
…        …  …   …
59       …  …   1

slide-20
SLIDE 20

Correspondence analysis

Correspondence analysis of the lexical table.

[Figure: the 59 discursive spaces (1esp to 59esp) on the first CA plane. Factor 1: λ1=0.1650, 4.82% of inertia. Factor 2: λ2=0.1562, 4.57% of inertia.]

Poor conclusions: local phenomena are favoured. Disappointment.

[Figure: words on the first CA plane (Factor 1, Factor 2).]
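The eigenvalues and percentages reported for the two factors are linked through the total inertia of the table, which in correspondence analysis equals the chi-square statistic divided by the grand total and also the sum of all eigenvalues. The sketch below computes that quantity for a small table. The function names are hypothetical, and the figure of about 3.42 is only what the slide's own numbers imply, not a value stated in the slides.

```python
def total_inertia(table):
    """Total inertia of a contingency table: chi-square statistic / grand total."""
    grand = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    inertia = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if row_tot[i] == 0 or col_tot[j] == 0:
                continue  # empty rows or columns carry no mass
            expected = row_tot[i] * col_tot[j] / grand
            inertia += (n_ij - expected) ** 2 / expected
    return inertia / grand

def inertia_share(eigenvalue, inertia):
    """Percentage of the total inertia carried by one axis."""
    return 100.0 * eigenvalue / inertia

# Total inertia implied by the slide: 0.1650 is 4.82% of the total.
implied_total = 0.1650 / 0.0482  # about 3.42
```

Such low percentages on the first axes are common for sparse lexical tables with many short rows, which is consistent with the disappointing first plane described here.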

slide-21
SLIDE 21

Correspondence analysis

Blocks as supplementary columns on the first CA plane

A structure appears: from Bl 1 to Bl 17, a pattern close to the typical horseshoe is revealed, with meaningful irregularities. The horseshoe indicates an evolution of the vocabulary over time, an evolution disrupted at block 18.

[Figure: blocks Bl 1 to Bl 20 on the first CA plane. Factor 1: 4.82%. Factor 2: 4.57%.]

slide-22
SLIDE 22

Correspondence analysis

To better study the evolution of the vocabulary along the 20 blocks, we perform correspondence analysis on the aggregated lexical table. The 59 spaces are grouped into the 20 blocks: the row-spaces belonging to the same block are collapsed into a single row. We look for more interpretable axes, at the cost of losing nuances.

blocks  w1  …  w302
1        2  …   …
…        …  …   …
20       …  …   1
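Collapsing the rows of the spaces×words table into blocks amounts to summing the rows that share a block label. A minimal sketch, with a hypothetical function name and a toy table that is not taken from the speech:

```python
def aggregate_rows(table, block_of_row):
    """Sum the rows of a spaces×words table that belong to the same block."""
    blocks = {}
    for row, block in zip(table, block_of_row):
        if block not in blocks:
            blocks[block] = [0] * len(row)
        blocks[block] = [a + b for a, b in zip(blocks[block], row)]
    return blocks

# Toy table: 4 spaces × 3 words, grouped into 2 blocks (illustrative values).
spaces_table = [[2, 0, 1], [0, 1, 1], [1, 1, 0], [0, 2, 0]]
block_of_row = [1, 1, 2, 2]
blocks_table = aggregate_rows(spaces_table, block_of_row)
```

The column totals are unchanged by the aggregation, so the word masses used by correspondence analysis stay the same while the number of rows drops.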

slide-23
SLIDE 23
Correspondence analysis

CA applied to the blocks×words table: aggregated analysis

To study the evolution of the vocabulary, it is convenient to perform CA on the table that collapses the spaces belonging to the same block.

[Figure: blocks Bl 1 to Bl 20 on the first CA plane of the aggregated analysis. Factor 1: λ1=0.133, 10.30%. Factor 2: λ2=0.127, 9.88%.]

The rhetorical trajectory of the corpus is clearly visualized.

slide-24
SLIDE 24

Correspondence analysis

[Figure: words on the first CA plane of the aggregated analysis. Factor 1: 10.30%. Factor 2: 9.88%.]

[Figure: repeated segments considered as supplementary elements on the same plane (Factor 1, Factor 2), for example: a la policía, a los hijos, como consta en, seguro de vida.]

slide-25
SLIDE 25

Correspondence analysis

[Figure: words on the first CA plane of the aggregated analysis (Factor 1: 10.30%, Factor 2: 9.88%), annotated with the following themes of the speech.]

  • Listing the offences
  • Offence related to prostitution clearly established
  • Implausibility of a couple relationship between the victim and the defendant. Data, facts, etc. linked to this thesis
  • "She was a person"
  • The "strange" insurance. Why are the beneficiaries the children of the murderer?
  • "Pasamos a las declaraciones de MF". "Había oído" (Let us turn to the statements of MF. She had heard)
  • Implausibility of the reasons for the witness's retraction
  • Circumstantial evidence and reasonable inferences that arise from that evidence can constitute sufficient proof of the elements of a crime

slide-26
SLIDE 26

Correspondence analysis validation

Stability of the blocks configuration through partial bootstrap

Bootstrap: confidence ellipses

slide-27
SLIDE 27

Correspondence analysis validation

Stability of the words configuration

Some interesting points:

  • acusado and acusados play different roles
  • indicios (clues) and prueba (evidence) can be considered as synonyms, which is just what the prosecutor was looking for

slide-28
SLIDE 28

Constrained clustering

Constrained clustering of the 20 blocks, computed from their coordinates on the first six axes.

[Figure: hierarchical tree over blocks 1 to 20, shown together with the observed minus expected vocabulary plot.]
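A contiguity-constrained hierarchical clustering can be sketched as follows: at each step only clusters that are adjacent in the text sequence may be merged. The slides do not state the aggregation criterion, so Ward's criterion is assumed here; the function name and the one-dimensional toy coordinates are hypothetical, whereas in the analysis each block would be described by its coordinates on the first six axes.

```python
def constrained_clustering(points, n_clusters):
    """Agglomerative clustering in which only adjacent clusters may merge."""
    clusters = [[i] for i in range(len(points))]

    def centroid(cluster):
        dims = len(points[0])
        return [sum(points[i][d] for i in cluster) / len(cluster) for d in range(dims)]

    def ward_cost(a, b):
        # Increase in within-cluster inertia caused by merging a and b.
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > n_clusters:
        # Only neighbouring clusters (k, k+1) are candidates for merging.
        costs = [ward_cost(clusters[k], clusters[k + 1]) for k in range(len(clusters) - 1)]
        k = costs.index(min(costs))
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return clusters

# Toy coordinates for five consecutive blocks (illustrative values).
points = [(0.0,), (0.1,), (5.0,), (5.1,), (10.0,)]
clusters = constrained_clustering(points, n_clusters=3)
```

Because merges are restricted to neighbours, every cluster is an interval of consecutive blocks, which is what makes the result usable as a segmentation of the speech.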

slide-29
SLIDE 29
Constrained clustering

The clusters on the factorial map

[Figure: blocks Bl 1 to Bl 20 on the first CA plane of the aggregated analysis, grouped into clusters. Factor 1: 10.30%. Factor 2: 9.88%.]

Cutting the tree yields 6 clusters, two of which consist of a single block. The clustering takes the text sequence into account.

slide-30
SLIDE 30
Lexical characterization of the clusters

Characteristic words, originality index O and diversity level D of each cluster:

1. delito, prostitución, actos, relativo, investigación, tercería. O=62; D=44
2. prueba, tribunal. O=69; D=47
3. SRT, vida, hijos, jubilación, capital, beneficiarios, millones, seguro. O=54.6; D=40
4. policía, declaración, MF, conocí, si, declaraciones, JCM, comisaría, oído, testigo, perfectamente, sabía, retracta, protección. O=49.2; D=41
5. persona. O=82; D=42
6. cuatro, desde, meses, sin, resulta. O=56.5; D=46

[Figure: the clusters on the first CA plane of the aggregated analysis. Factor 1: 10.30%. Factor 2: 9.88%.]
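The slides do not say how the characteristic words are selected. One approach commonly used in textual statistics, assumed here, compares the count of a word inside a cluster with what random sampling without replacement from the whole corpus would give (hypergeometric model). The function name and the small numbers are hypothetical.

```python
from math import comb

def overuse_p_value(k, n_cluster, k_word, n_total):
    """P(X >= k) when n_cluster occurrences are drawn without replacement
    from n_total occurrences, k_word of which are the word of interest."""
    upper = min(n_cluster, k_word)
    denom = comb(n_total, n_cluster)
    favourable = sum(
        comb(k_word, x) * comb(n_total - k_word, n_cluster - x)
        for x in range(k, upper + 1)
    )
    return favourable / denom

# Toy case: a word occurring 3 times in a 10-occurrence corpus appears
# 3 times in a cluster of 4 occurrences.
p = overuse_p_value(3, 4, 3, 10)
```

A small probability signals a word that is over-represented in the cluster; the symmetric lower tail would flag under-represented words.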

slide-31
SLIDE 31
Summary

CA provides the shape and the rhythm of the closing speech, which is segmented into homogeneous intervals by means of constrained clustering, taking more axes into account. It underlines the interrelationships between words, between documents (blocks), and between words and documents, leading to uncover the meaning of the corpus. Taken together, this gives access to the meaning of the texts.

[Figures: blocks Bl 1 to Bl 20 and words on the first CA plane of the aggregated analysis.]

If you want to know how the trial ended: the defendants were convicted of murder by the Audience Court of Barcelona, and then by the Supreme Court on appeal.

slide-32
SLIDE 32

Other applications, other methods

A brief look at a very important application: open-ended questions in surveys, that is, a large collection of (very) short texts, each enriched by a large amount of data about the respondent-author.

slide-33
SLIDE 33

Starting point: to introduce the characteristics of the respondent as active elements of the analysis, which reflects the fact that the meaning of any text can only be disclosed when the speaker is identified through his or her characteristics.

In this, we follow Borges (Ficciones, 1944): the same text will be read and understood in a very different way depending on the characteristics of the author. In "Pierre Menard, French author of Don Quijote", he wrote:

"Comparing the Don Quijote by Menard with that by Cervantes is highly revealing. For example, the latter wrote:

... la verdad, cuya madre es la historia, émula del tiempo, depósito de las acciones, testigo de lo pasado, ejemplo y aviso de lo presente, advertencia de lo por venir.

Menard, on the contrary, wrote:

... la verdad, cuya madre es la historia, émula del tiempo, depósito de las acciones, testigo de lo pasado, ejemplo y aviso de lo presente, advertencia de lo por venir.

(…) the contrast of styles is clear (...)"

The meaning of a text depends on metadata!

slide-34
SLIDE 34

Other applications, other methods

Survey data. Example: survey of Spanish judges, "What is a good judge?"

[Table: respondents (rows 1 to 220) × words (w1, …, w202) and variables (var1, …, var20).]

The lexical table has a weak structure: the CA axes mainly reflect local phenomena.

slide-35
SLIDE 35

Other applications, other methods

Proposals: to resort to

  • Canonical correspondence analysis (CCA). Convenient when some answers to closed questions can be considered as explanatory of the answers to the open-ended questions.
  • Multiple factor analysis for a mixture of frequency data and categorical/quantitative data (MFA+MFACT). Convenient when some answers to closed questions and the open-ended answers can be considered, all together, as a battery of questions approaching the same topic.

slide-36
SLIDE 36

Other applications, other methods

Survey of judges. Some results with CCA.

CCA extends the usual practice of aggregated lexical tables: the dimensions of variability brought to the fore are related to differences in the answers to the closed questions.

3 types of objects:

  • Individuals
  • Words
  • Variables/categories

slide-37
SLIDE 37

Other applications, other methods

Example: survey of 895 fifth-grade children, who had to complete a closed questionnaire and the two following sentences:

  • 1. For me, to read means…
  • 2. I believe that reading is important because…

CA:

  • Weak structure of the lexical table.
  • CA axes mainly reflect local phenomena.

[Figure: CA of the lexical table, first plane (Factor 1, Factor 2). λ1=0.51 (3.3%); λ2=0.49 (3.1%). A few words (aburrimiento, rollo, entretenimiento, bastante, poco, aburrido, entretenido, interesante) stand apart from all the other words.]

MFA+MFACT: some results. 3 types of objects:

  • Individuals
  • Words
  • Variables/categories

[Figure: MFA+MFACT, first plane. Axis 1: λ1=1.4 (2%). Axis 2: λ2=1.2 (1.7%). Words (for example aburrido, aprender, aventura, divertido, imaginación, interesante) are displayed together with categories of the closed questions: reading frequency (I read little, fairly often, a lot), reading ease (with great difficulty, with some difficulty, easily), overall mark (insufficient, excellent), father's education (no studies, university studies) and upper social class.]

slide-38
SLIDE 38

4. Concluding remarks

  • Lexical statistics provide global indicators (specialisation of the vocabulary, diversity, originality).
  • Textual statistics, through correspondence analysis, give access to the rhetorical shape of the corpus.
  • Constrained clustering locates the disruptions and provides a segmentation of the corpus into homogeneous intervals.
  • Automatic selection of characteristic words uncovers the meaning of the shape and of the disruptions.
  • Applications in a great variety of fields. Increasing importance in bibliography and technological watch: evolution, trends, oppositions, disruptions.
  • Correspondence analysis should (and will) be more widely used to extract information from large textual databases.
  • Simultaneous analysis of texts and metadata opens interesting perspectives.

slide-39
SLIDE 39

Some references

Benzécri (1981). Pratique de l'analyse des données. Tome 3. Linguistique et lexicologie. Dunod.
Labbé D. (1990). Le vocabulaire de Mitterrand. Presses de la FNSP.
Lebart L., Salem A. (1998). Exploring textual data. Kluwer.
Muller Ch. (1977). Principes et méthodes de statistique lexicale. Hachette.
Murtagh F. (2005). Correspondence analysis with R and Java. Chap. 5. Content Analysis of Text. Chapman & Hall.
JADT. Proceedings.

MFA+MFACT

Bécue-Bertaut M., Pagès J. (2008). Multiple Factor Analysis and Clustering of a Mixture of Quantitative, Categorical and Frequency Data. Computational Statistics and Data Analysis, 52, 3255-3268.

Software

  • DTM, Data and Text Mining (by Lebart, free); SPAD (commercial)
  • CCA: R function by Cadoret & Kostov
  • MFACT and constrained clustering: R function by Kostov