SLIDE 1

Introduction Clustering Evaluation Clustering in Swedish

Clustering in Swedish

The Impact of some Properties of the Swedish Language on Document Clustering and an Evaluation Method
Magnus Rosell
2006-03-15

Magnus Rosell Clustering in Swedish

SLIDE 2

Content

Introduction
Clustering
Evaluation
Clustering in Swedish

SLIDE 3

The Swedish Twin Registry

KI (The Swedish Medical University)
Largest twin registry in the world, about 140 000 twins.
Smoking is not harmful.
The impact of heredity and environment.
A lot of questionnaires (open and closed questions).

SLIDE 4

One free text question about occupation

About 42 000 twins have answered.
Two hierarchical classification systems (number of categories per level):

System   L1   L2   L3   L4   L5
AMSYK    11   28  114  361  969
YK80     12   59  288

Manual categorization of the 42 000 texts according to both systems took one summer.

SLIDE 5

Clustering

Clustering – to partition a set of objects into clusters (groups or parts) so that the objects within a cluster are more similar to each other than to objects from other clusters.

We are interested in similarity with respect to the contents of the documents (our objects).
Clustering vs. Categorization

SLIDE 6

Motivation

Postprocessing of search results (http://vivisimo.com, http://www.iboogie.com)
Tool for exploration.
Questionnaires.
Cheaper and faster than manual clustering/categorization.
Easy to obtain several different clusterings of the same set.

SLIDE 7

Clustering

Many clustering algorithms use:
a representation of the objects (often a vector of values)
a similarity measure

SLIDE 8

Document Clustering

For documents – Information Retrieval:
The content of a text is represented by the words in it.
No regard to word order.
Very common words that do not have any “meaning” are removed (stoplist).
Texts are considered similar if they share many words.
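A minimal sketch of this word-based representation, assuming whitespace tokenization and a tiny English stoplist made up for illustration (a real stoplist would be much longer and language specific). The function names are hypothetical.

```python
from collections import Counter

# Hypothetical stoplist for illustration only.
STOPLIST = {"the", "a", "of", "in", "and", "is"}

def bag_of_words(text):
    """Return word counts, ignoring word order and dropping stoplist words."""
    words = text.lower().split()
    return Counter(w for w in words if w not in STOPLIST)

def shared_words(text_a, text_b):
    """Return the set of non-stoplist words that occur in both texts."""
    return set(bag_of_words(text_a)) & set(bag_of_words(text_b))
```

Two texts that share many words under this representation would be considered similar.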

SLIDE 9

Document Clustering (cont.)

The term-by-document matrix:

\[
\begin{array}{c|cccc}
       & d_1     & \cdots & d_j     & \cdots \\ \hline
t_1    & w_{1,1} & \cdots & w_{1,j} & \cdots \\
\vdots & \vdots  & \ddots & \vdots  &        \\
t_i    & w_{i,1} & \cdots & w_{i,j} & \cdots \\
\vdots & \vdots  &        & \vdots  & \ddots
\end{array}
\]

w_{i,j} – based on the frequency of the word in the document and its frequency in the entire document collection (tf*idf-weighting for instance).
Similarity between documents – cosine measure for instance.
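A sketch of one possible weighting and similarity, under assumptions: the weight is taken as tf * log(N / df), which is only one of several common tf*idf variants and not necessarily the one used in the experiments, and documents are stored as sparse dictionaries rather than as matrix columns. The function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: term frequency times log(N / document frequency).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this weighting, a term that occurs in every document gets weight 0 and does not contribute to the similarity.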

SLIDE 10

Algorithms

Two Types of Clustering Algorithms
Partitioning algorithms: flat partition
Hierarchical algorithms: hierarchy of clusters

SLIDE 11

A Partitioning Algorithm: K-Means

1. Initial partition, for example: pick k documents at random as first cluster centroids.
2. Put each document in the most similar cluster.
3. Calculate new cluster centroids.
4. Repeat 2 and 3 until some condition is fulfilled.
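The four steps can be sketched as follows. This is an illustration under assumptions, not the implementation used in the experiments: objects are dense numeric vectors, "most similar" is taken to mean smallest squared Euclidean distance (for documents a cosine measure would be more typical), and the stopping condition is that no object changes cluster. The names `kmeans` and `sq_dist` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k points at random as the first centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: put each point in the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return assignment, centroids
```

A different `seed` gives a different initial partition and may give a different result.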

SLIDE 12

A Clustering Example

[Figure: example plot; both axes run from −6 to 6.]

SLIDE 13

Hierarchical Clustering

Agglomerative algorithm:

1. Make one cluster for each document.
2. Join the most similar pair into one cluster.
3. Repeat 2 until some condition is fulfilled.

Examples: single link, complete link, group average link, Ward’s method.
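A sketch of the agglomerative steps, assuming dense numeric vectors, Euclidean distance, and a stopping condition of a fixed number of clusters (at least 1). The `link` parameter is a hypothetical way to switch between single link (`min`) and complete link (`max`); group average link and Ward's method are not covered.

```python
import math

def agglomerative(points, num_clusters, link=min):
    # Step 1: one cluster per point (clusters hold indices into points).
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # link=min gives single link, link=max gives complete link.
        return link(math.dist(points[i], points[j]) for i in a for j in b)

    # Step 3: repeat until the requested number of clusters remains.
    while len(clusters) > num_clusters:
        # Step 2: find and join the closest pair of clusters.
        pairs = [
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        x, y = min(
            pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]])
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

This naive version recomputes all pairwise cluster distances in every round, so it is only suitable for small sets.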

SLIDE 14

Hierarchical Clustering (cont.)

Divisive algorithm:

1. Put all documents in one cluster.
2. Split the worst cluster.
3. Repeat 2 until some condition is fulfilled.

Example: Bisecting K-Means splits the biggest cluster into two using K-Means.
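A sketch of Bisecting K-Means under assumptions: "worst" is taken to be the biggest cluster, as in the example above; objects are dense numeric vectors compared by squared Euclidean distance; splitting stops at a requested number of clusters. All function names are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return [sum(x) / len(points) for x in zip(*points)]

def two_means(points, rng, max_iter=50):
    """Split points into two groups with a basic K-Means loop for k = 2."""
    centroids = rng.sample(points, 2)
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            nearest = 0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. identical points
        new_centroids = [mean(groups[0]), mean(groups[1])]
        if new_centroids == centroids:
            break  # centroids no longer move
        centroids = new_centroids
    return groups

def bisecting_kmeans(points, num_clusters, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # Step 1: everything in one cluster.
    while len(clusters) < num_clusters:
        # Step 2: choose the biggest cluster and split it in two.
        biggest = max(clusters, key=len)
        if len(biggest) < 2:
            break
        left, right = two_means(biggest, rng)
        if not left or not right:
            break  # could not split further
        clusters.remove(biggest)
        clusters.extend([left, right])
    return clusters
```

Keeping the intermediate splits instead of only the final list would give the hierarchy of clusters.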

SLIDE 15

Comparison of Algorithms

K-Means:
  decide the number of clusters in advance
  different results depending on initial partition
  “global”
Agglomerative:
  may stop at “optimal” number of clusters
  same result every time
  “local”

SLIDE 16

A Document Clustering Example

Clustering result (number of articles per category)

Cluster    Words                                      Economy  Entertainment  Sports  Sweden  World  Total
Cluster 1  procent, index, börs, ök, ränt                 167              4       1      37     23    232
Cluster 2  film, aftonbl, skriv, tv, sver                  18            421      22     176     40    677
Cluster 3  spel, match, svensk, vann, klubb                 0             19     452      10     14    495
Cluster 4  reut, pressmeddel, bolag, stockholm, akti      312              8       6      36     10    372
Cluster 5  polis, död, skad, person, tt                     3             48      19     241    413    724
Total                                                     500            500     500     500    500   2500

Words for the whole set: tt, svensk, procent, skriv, stockholm, reut, spel, tid, dag, akti
The words are Swedish stems, for example börs (stock exchange), ränt (interest rate), spel (game), vann (won), bolag (company), polis (police), död (dead).

SLIDE 17

Evaluation

What is a good clustering?
Internal quality measures depend on the representation.
External quality measures are based on a known categorization.

SLIDE 18

Internal Quality Measures

Cluster self similarity (the average similarity of the documents in a cluster).
Not good when evaluating the representation.
Also uses the assumption that our representation is valid.
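A sketch of cluster self similarity, assuming dense document vectors, the cosine measure, and an average over all pairs of distinct documents (whether a document's similarity to itself should be included is a choice left open here). The function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def self_similarity(cluster):
    """Average cosine similarity over all pairs of distinct documents."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # convention chosen here for clusters with fewer than two documents
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Since the value is computed from the document vectors themselves, changing the representation changes the measure, which is why it is not suited for comparing representations.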

SLIDE 19

External Quality Measures

Compare the clustering to another partition (a categorization for instance).
Precision, Recall
Entropy, Mutual Information
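A sketch of two external measures computed from a cluster-by-category count table. The counts are those of the five-cluster news article example in these slides, with the one empty cell read as 0 (the row and column totals are consistent with that). Purity is the fraction of a cluster that belongs to its most common category; entropy is taken with base-2 logarithms. Weighting each cluster by its size for the overall value is an assumption, and the function names are hypothetical.

```python
import math

# Rows: clusters 1-5. Columns: Economy, Entertainment, Sports, Sweden, World.
TABLE = [
    [167, 4, 1, 37, 23],
    [18, 421, 22, 176, 40],
    [0, 19, 452, 10, 14],
    [312, 8, 6, 36, 10],
    [3, 48, 19, 241, 413],
]

def purity(row):
    """Fraction of the cluster that belongs to its largest category."""
    return max(row) / sum(row)

def entropy(row):
    """Entropy (in bits) of the category distribution within one cluster."""
    total = sum(row)
    return -sum((n / total) * math.log2(n / total) for n in row if n > 0)

def overall(table, measure):
    """Average of a per-cluster measure, weighted by cluster size."""
    total = sum(sum(row) for row in table)
    return sum(sum(row) / total * measure(row) for row in table)
```

High purity and low entropy both indicate clusters dominated by a single category; `overall(TABLE, purity)` and `overall(TABLE, entropy)` give one number each for the whole clustering.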

SLIDE 20

External Quality Measures (cont.)

External measures – comparisons:
Clustering → categorization
Max and min, but what is good?

SLIDE 21

Comparing Comparisons

[Diagram: comparisons between Clustering, YK80 and AMSYK.]

YK80 → AMSYK: 3.00
Clustering → AMSYK: 2.29 (77 %)
AMSYK → YK80: 3.02
Clustering → YK80: 2.17 (72 %)
Average values over all levels.

SLIDE 22

Text Sets

KTH News Corpus
  DN and Aftonbladet
  Five categories each

Medical papers from Läkartidningen
  MeSH-terms
  Four automatically extracted categorizations.

SLIDE 23

Stemming

Morphology: cykel (bicycle), cyklarnas (of the bicycles), cykling (cycling)
Stemming – remove suffixes.
Stemming improved results on newspaper articles by 4 %.
Reduces the size of the representation.
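A toy illustration of suffix removal. The suffix list and the minimum stem length are invented for this example; this is not the stemmer used in the experiments, and a real Swedish stemmer needs many more rules.

```python
# Hypothetical suffix list, longest first; for illustration only.
SUFFIXES = ["arnas", "arna", "ing", "ar", "en"]

def stem(word, min_stem=3):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

Here `stem("cyklarnas")` and `stem("cykling")` both give `cykl`, so the two forms are counted as the same term. `stem("cykel")` stays `cykel`, since mapping it to the same stem would require a rule beyond plain suffix stripping.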

SLIDE 24

Compound Splitting

Swedish compounds: textklustring (clustering of texts)
The spell checking program Stava.
Stop some “compounds” from being split: godkänd, Svante, Lindström, etc.
Improved results by 10 %. Combined with stemming: 13 %. (Newspaper articles)
Keep only the components!
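A toy sketch of dictionary-based compound splitting with a list of words that must not be split. This is not how Stava works internally; the lexicon, the stop list handling, and the function names are assumptions made for illustration, and linking elements between components (such as an inserted "s") are not handled.

```python
# Hypothetical lexicon of known components.
LEXICON = {"text", "klustring", "god", "känd"}
# Words that look like compounds but should be kept whole.
DO_NOT_SPLIT = {"godkänd", "svante", "lindström"}

def _split(word, min_part):
    """Return a list of lexicon components covering word, or None."""
    if word in LEXICON:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head = word[:i]
        if head in LEXICON:
            rest = _split(word[i:], min_part)
            if rest:
                return [head] + rest
    return None

def split_compound(word, min_part=3):
    """Split a compound into its components; return [word] if no split applies."""
    word = word.lower()
    if word in DO_NOT_SPLIT:
        return [word]
    parts = _split(word, min_part)
    return parts if parts else [word]
```

With this lexicon, `split_compound("textklustring")` gives `["text", "klustring"]`, while `godkänd` is kept whole even though both `god` and `känd` are in the lexicon.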

SLIDE 25

Phrases

The grammar checking program Granska – Noun Phrases.
Phrase representations:
  Phrase-match OR phrase-overlap-match.
  Use split compounds as phrases OR not.
  Split compounds within phrases OR not.
2^3 = 8 phrase representations.
Tried on their own and in combination with ordinary word representation.

SLIDE 26

Phrases (cont.)

Tested on newspaper articles and medical journal papers:
Phrases < phrases + words < words.
Phrases in medical papers > phrases in newspaper articles.

SLIDE 27

Demonstration

Demonstration of clustering results in an HTML browser.

SLIDE 28

Some example clusters

Some example clusters from a clustering into 59 clusters, compared with level 3 of AMSYK (114 categories):

#  Most Important Words                           Docs  Purity
1  auxiliary, care, home, service, old age         983    0.82
2  preschool, child, learn, educationalist, play   810    0.63
3  chauffeur, lorry, drive, taxi, driven          1125    0.80
4  country, agriculture, farm, animal, berry       901    0.47
5  building, carpenter, build, house, concrete    1083    0.65

#  Description                                             Docs  Fraction
1  Personal care and related workers                       4916      0.11
2  Pre-primary education teaching associate professionals   661      0.02
3  Motor vehicle drivers                                   1128      0.03
4  Crop and animal producers                               1040      0.02
5  Building and construction workers                       1283      0.03
