Clustering in Swedish The Impact of some Properties of the Swedish - PowerPoint PPT Presentation

Clustering in Swedish
The Impact of some Properties of the Swedish Language on Document Clustering and an Evaluation Method
Magnus Rosell
2006-03-15

Content
- Introduction
- Clustering
- Evaluation
- Clustering in Swedish

The Swedish Twin Registry
- KI (The Swedish Medical University)
- Largest twin registry in the world, about 140 000 twins.
- Smoking is not harmful.
- The impact of heredity and environment.
- A lot of questionnaires (open and closed questions).

One free text question about occupation
- About 42 000 twins have answered.
- Two hierarchical classification systems (number of categories per level):

  System   L1   L2   L3   L4   L5
  AMSYK    11   28  114  361  969
  YK80     12   59  288

- Manual categorization of the 42 000 texts according to both systems took one summer.

Clustering
- Clustering: to partition a set of objects into clusters (groups or parts) so that the objects within a cluster are more similar to each other than to objects from other clusters.
- We are interested in similarity with respect to the contents of the documents (our objects).
- Clustering vs. Categorization

Motivation
- Postprocessing of search results (http://vivisimo.com, http://www.iboogie.com)
- Tool for exploration. Questionnaires.
- Cheaper and faster than manual clustering/categorization.
- Easy to obtain several different clusterings of the same set.

Clustering
Many clustering algorithms use
- a representation of the objects (often a vector of values)
- a similarity measure

Document Clustering
For documents (Information Retrieval):
- The content of a text is represented by the words in it.
- No regard to word order.
- Very common words that do not have any "meaning" are removed (stoplist).
- Texts are considered similar if they share many words.
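The word-based representation described on this slide can be sketched in a few lines. This is only an illustration: the stoplist, the example sentences, and the function names `bag_of_words` and `shared_words` are assumptions made here, not part of the talk.

```python
from collections import Counter

# Hypothetical, very short stoplist; a real one would be much longer.
STOPLIST = {"the", "a", "of", "in", "and", "is"}

def bag_of_words(text):
    """Count the words of a text, ignoring word order and stoplist words."""
    words = text.lower().split()
    return Counter(w for w in words if w not in STOPLIST)

def shared_words(bag_a, bag_b):
    """Return the words that occur in both bags."""
    return set(bag_a) & set(bag_b)

doc1 = bag_of_words("the clustering of documents in Swedish")
doc2 = bag_of_words("Swedish documents and the clustering algorithm")
common = shared_words(doc1, doc2)
```

The two toy texts use different word order, yet they end up sharing three content words, which is what makes them "similar" under this representation.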

Document Clustering (cont.)
The term-by-document matrix:

\[
\begin{array}{c|cccc}
       & d_1     & \cdots & d_j     & \cdots \\ \hline
t_1    & w_{1,1} & \cdots & w_{1,j} & \cdots \\
\vdots & \vdots  & \ddots & \vdots  &        \\
t_i    & w_{i,1} & \cdots & w_{i,j} & \cdots \\
\vdots & \vdots  &        & \vdots  & \ddots
\end{array}
\]

- w_{i,j} is based on the frequency of the word in the document and its frequency in the entire document collection (tf*idf-weighting for instance).
- Similarity between documents: cosine measure for instance.
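As a rough sketch of the weighting and similarity mentioned on this slide, the following computes one common tf*idf variant (raw term frequency times log(N/df)) and the cosine measure. The talk does not say which variant was used, so the formula, the toy documents, and the function names are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents (already tokenized); invented for illustration.
docs = [
    ["stock", "interest", "share"],
    ["stock", "share", "company"],
    ["match", "club", "game"],
]
vecs = tfidf_vectors(docs)
```

Each dict in `vecs` corresponds to one column of the term-by-document matrix, storing only the non-zero weights.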

Algorithms
Two Types of Clustering Algorithms
- Partitioning algorithms: flat partition
- Hierarchical algorithms: hierarchy of clusters

A Partitioning Algorithm: K-Means
1. Initial partition, for example: pick k documents at random as first cluster centroids.
2. Put each document in the most similar cluster.
3. Calculate new cluster centroids.
4. Repeat 2 and 3 until some condition is fulfilled.
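The four steps can be sketched as follows. This is a minimal illustration, assuming 2-D points and squared Euclidean distance instead of document vectors and cosine similarity; the stop condition ("assignments no longer change"), the function names, and the data are choices made here, not taken from the talk.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k objects at random as the first centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: put each object in the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute the centroid of every non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

Changing `seed` changes the initial centroids, which on less clearly separated data can change the final partition.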

A Clustering Example
[Figure: scatter plot; both axes range from −6 to 6.]

Hierarchical Clustering
Agglomerative algorithm:
1. Make one cluster for each document.
2. Join the most similar pair into one cluster.
3. Repeat 2 until some condition is fulfilled.
Examples: single link, complete link, group average link, Ward's method.
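A small sketch of the agglomerative steps, using group average link on 2-D points. The distance measure, the stop condition (a target number of clusters `k`), and all names are assumptions for illustration; the quadratic search for the closest pair is kept naive for readability.

```python
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_link(c1, c2):
    """Group average link: mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def agglomerative(points, k):
    # Step 1: one cluster per object.
    clusters = [[p] for p in points]
    # Step 3: repeat until only k clusters remain.
    while len(clusters) > k:
        # Step 2: find and join the most similar (closest) pair.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = agglomerative(points, k=2)
```

Swapping `average_link` for the minimum or maximum cross-cluster distance would give single link or complete link, respectively.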

Hierarchical Clustering (cont.)
Divisive algorithm:
1. Put all documents in one cluster.
2. Split the worst cluster.
3. Repeat 2 until some condition is fulfilled.
Example: Bisecting K-Means splits the biggest cluster into two using K-Means.
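Bisecting K-Means, as described on this slide, might look roughly like the sketch below. It assumes distinct 2-D points, squared Euclidean distance, and "stop when k clusters exist" as the condition; the helper names are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def two_means(points, rng, iterations=50):
    """Split a list of points into two lists with K-Means (k = 2)."""
    centroids = rng.sample(points, 2)
    halves = [[], []]
    for _ in range(iterations):
        new_halves = [[], []]
        for p in points:
            nearer = 0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            new_halves[nearer].append(p)
        if new_halves == halves:
            break
        halves = new_halves
        # Keep the old centroid if a half happens to be empty.
        centroids = [mean(h) if h else c for h, c in zip(halves, centroids)]
    return halves

def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]              # step 1: everything in one cluster
    while len(clusters) < k:               # step 3: until k clusters exist
        biggest = max(clusters, key=len)   # "worst" cluster = biggest cluster
        if len(biggest) < 2:
            break
        parts = [h for h in two_means(biggest, rng) if h]
        if len(parts) < 2:                 # split failed; give up
            break
        clusters.remove(biggest)
        clusters.extend(parts)             # step 2: replace it by its two halves
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two = bisecting_kmeans(points, k=2)
three = bisecting_kmeans(points, k=3)
```

Recording which cluster each pair of halves came from would give the hierarchy; this sketch only returns the final flat partition.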

Comparison of Algorithms
K-Means
- decide the number of clusters in advance
- different results depending on initial partition
- "global"
Agglomerative
- may stop at "optimal" number of clusters
- same result every time
- "local"

A Document Clustering Example
Clustering result: number of articles per category, with the most descriptive words of each cluster (stemmed Swedish terms).

  Cluster     Economy  Entertainment  Sports  Sweden  World  Total  Words
  Cluster 1       167              4       1      37     23    232  procent, index, börs, ök, ränt
  Cluster 2        18            421      22     176     40    677  film, aftonbl, skriv, tv, sver
  Cluster 3         0             19     452      10     14    495  spel, match, svensk, vann, klubb
  Cluster 4       312              8       6      36     10    372  reut, pressmeddel, bolag, stockholm, akti
  Cluster 5         3             48      19     241    413    724  polis, död, skad, person, tt
  Total           500            500     500     500    500   2500  tt, svensk, procent, skriv, stockholm, reut, spel, tid, dag, akti

Evaluation
What is a good clustering?
- Internal quality measures depend on the representation.
- External quality measures are based on a known categorization.

Internal Quality Measures
- Cluster self similarity (the average similarity of the documents in a cluster).
- Not good when evaluating the representation.
- Also uses the assumption that our representation is valid.

External Quality Measures
- Compare the clustering to another partition (a categorization for instance).
- Precision, Recall
- Entropy, Mutual Information
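To make these measures concrete, the sketch below applies common textbook formulations to the cluster-by-category counts from the news-article example earlier in the talk. The exact definitions used in the talk are not given on the slides, so the formulas (base-2 logarithms, size-weighted entropy) and the function names are assumptions.

```python
import math

# Rows = clusters 1..5, columns = Economy, Entertainment, Sports, Sweden, World.
table = [
    [167, 4, 1, 37, 23],
    [18, 421, 22, 176, 40],
    [0, 19, 452, 10, 14],
    [312, 8, 6, 36, 10],
    [3, 48, 19, 241, 413],
]

def precision(table, i, j):
    """Share of cluster i that belongs to category j."""
    return table[i][j] / sum(table[i])

def recall(table, i, j):
    """Share of category j that ended up in cluster i."""
    return table[i][j] / sum(row[j] for row in table)

def cluster_entropy(row):
    """Entropy of the category distribution inside one cluster."""
    total = sum(row)
    return -sum((n / total) * math.log2(n / total) for n in row if n > 0)

def weighted_entropy(table):
    """Cluster entropies averaged with cluster sizes as weights (lower is better)."""
    n = sum(map(sum, table))
    return sum(sum(row) / n * cluster_entropy(row) for row in table)

def mutual_information(table):
    """Mutual information between cluster and category labels (higher is better)."""
    n = sum(map(sum, table))
    col = [sum(row[j] for row in table) for j in range(len(table[0]))]
    mi = 0.0
    for row in table:
        r = sum(row)
        for j, nij in enumerate(row):
            if nij > 0:
                mi += nij / n * math.log2(nij * n / (r * col[j]))
    return mi

sports_precision = precision(table, 2, 2)  # cluster 3 vs. Sports
sports_recall = recall(table, 2, 2)
```

A clustering that reproduces the categorization exactly has weighted entropy 0, and its mutual information equals the entropy of the categorization.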

External Quality Measures (cont.)
External measures are comparisons:
- Clustering → categorization
- Max and min, but what is good?

Comparing Comparisons
[Diagram relating Clustering, AMSYK and YK80.]
- YK80 → AMSYK: 3.00; Clustering → AMSYK: 2.29 (77 %)
- AMSYK → YK80: 3.02; Clustering → YK80: 2.17 (72 %)
Average values over all levels.

Text Sets
- KTH News Corpus: DN and Aftonbladet, five categories each.
- Medical papers from Läkartidningen: MeSH terms, four automatically extracted categorizations.

Stemming
- Morphology: cykel, cyklarnas, cykling
- Stemming: remove suffixes.
- Stemming improved results on newspaper articles by 4 %.
- Reduces the size of the representation.
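Suffix removal can be illustrated with a toy stemmer. The suffix list and the minimum stem length below are invented for this example and cover only a sliver of Swedish morphology; the stemmer actually used in the talk is not specified.

```python
# Hypothetical, deliberately tiny suffix list, ordered longest first.
SUFFIXES = ["arnas", "arna", "ing", "ar", "en"]

def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["cykel", "cyklarnas", "cykling"]
stems = [stem(w) for w in words]
vocabulary = set(stems)
```

Here "cyklarnas" and "cykling" collapse to the same stem, so the vocabulary shrinks from three entries to two, while "cykel" is left untouched because the toy rules do not handle the e-deletion in "cykel"/"cykl-".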

Compound Splitting
- Swedish compounds: textklustring (clustering of texts)
- The spell checking program Stava.
- Stop some "compounds" from being split: godkänd, Svante, Lindström, etc.
- Improved results by 10 %. Combined with stemming: 13 %. (Newspaper articles)
- Keep only the components!
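The idea of splitting compounds while protecting certain words can be sketched with a simple lexicon lookup. The talk uses the spell checker Stava for this; the code below is a stand-in with a hypothetical four-word lexicon, a two-part-only split, and invented function names.

```python
# Hypothetical lexicon of known word forms.
LEXICON = {"text", "klustring", "god", "känd"}
# Words that must not be split even though they look like compounds.
NO_SPLIT = {"godkänd", "svante", "lindström"}

def split_compound(word, min_part=3):
    """Return the two components of a compound, or the word itself."""
    w = word.lower()
    if w in NO_SPLIT or w in LEXICON:
        return [w]
    # Try every split point that leaves at least min_part characters per side.
    for i in range(min_part, len(w) - min_part + 1):
        head, tail = w[:i], w[i:]
        if head in LEXICON and tail in LEXICON:
            return [head, tail]  # keep only the components
    return [w]

parts = split_compound("textklustring")
protected = split_compound("godkänd")
```

Without the `NO_SPLIT` list, "godkänd" (approved) would be split into "god" (good) and "känd" (known), which shows why some apparent compounds need to be stopped from splitting.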
