SLIDE 1

Generalized similarity measures for text data.

Hubert Wagner (IST Austria) Joint work with Herbert Edelsbrunner

GETCO 2015, Aalborg

April 9, 2015

SLIDE 2

Plan

◮ Shape of data.
◮ Text as a point-cloud.
◮ Log-transform and similarity measure.
◮ Bregman divergence and topology.

SLIDE 3

Shape of data.

SLIDE 25

Main tools.

Rips and Čech simplicial complexes:

◮ Capture the shape of the union of balls.
◮ Combinatorial representation.

Persistence captures geometric-topological information of the data:

◮ Key property: stability!

SLIDE 44

Interpretation of filtration values.

For a simplex S = {v0, …, vk}, f(S) = t means that at filtration threshold t, the objects v0, …, vk are considered close.

SLIDE 45

Text as a point-cloud.

SLIDE 46

Basic concepts

Corpus:

◮ (Large) collection of text documents.

Term-vector:

◮ Weighted vector of key-words or terms.
◮ Summarizes the topic of a single document.
◮ Higher weight means higher importance.

SLIDE 47

Concept: Vector Space Model

◮ The Vector Space Model maps a corpus K to R^d.
◮ Each distinct term in K becomes a direction, so d can be high (tens of thousands).

◮ Each document is represented by its term-vector.
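The mapping above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the raw-count weighting, function names, and toy corpus are my own assumptions.

```python
# Sketch of the Vector Space Model: each distinct term in the corpus K
# becomes one direction of R^d, each document becomes its term-vector.

def build_vocabulary(corpus):
    """Map each distinct term to a coordinate index of R^d."""
    vocab = {}
    for doc in corpus:
        for term in doc.split():
            vocab.setdefault(term, len(vocab))
    return vocab

def term_vector(doc, vocab):
    """Represent a document as a weighted vector in R^d (raw counts)."""
    vec = [0.0] * len(vocab)
    for term in doc.split():
        vec[vocab[term]] += 1.0
    return vec

corpus = ["dog donkey donkey", "cat donkey"]
vocab = build_vocabulary(corpus)   # d = 3 directions: dog, donkey, cat
vectors = [term_vector(doc, vocab) for doc in corpus]
```

In practice the weights would come from a scheme such as tf-idf rather than raw counts; the geometry is the same.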

[Figure: two term-vectors plotted in the space spanned by Cat, Dog, Donkey: ⟨(Cat,0), (Dog,0.2), (Donkey,0.9)⟩ and ⟨(Cat,0.5), (Donkey,0.5)⟩.]

SLIDE 48

Concept: Similarity measures

◮ Cosine similarity compares two documents.
◮ Distance (dissimilarity): d(a, b) := 1 − sim(a, b).
◮ This d is not a metric.
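A concrete check of the last point, as a minimal sketch (function names are my own): the triangle inequality already fails for three simple vectors in the plane.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two term-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def d(a, b):
    """Dissimilarity d(a, b) := 1 - sim(a, b)."""
    return 1.0 - cosine_sim(a, b)

# d is not a metric: with b "between" a and c, the triangle
# inequality fails, d(a, c) > d(a, b) + d(b, c).
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
assert d(a, c) > d(a, b) + d(b, c)
```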


SLIDE 49

Geometric-topological tools.

SLIDE 50

Interpreting Rips

A simplex is added immediately after its boundary:

◮ d(a, b): the dissimilarity.
◮ For a triangle: d(a, b, c) = max(d(a, b), d(a, c), d(b, c)).

◮ Is this the filtering function we want?
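The rule above (a simplex enters at the maximum dissimilarity among its vertex pairs) can be sketched directly. A brute-force illustration under my own naming, fine for small point sets:

```python
import math
from itertools import combinations

def rips_filtration(points, dist, max_dim=2):
    """Filtration value of each Rips simplex: the maximum pairwise
    dissimilarity of its vertices, so a simplex is added immediately
    after the last of its edges."""
    f = {}
    for k in range(1, max_dim + 1):
        for simplex in combinations(range(len(points)), k + 1):
            f[simplex] = max(dist(points[i], points[j])
                             for i, j in combinations(simplex, 2))
    return f

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
f = rips_filtration(pts, math.dist)
# The triangle (0, 1, 2) enters at its longest edge length, sqrt(2).
```

Any symmetric dissimilarity can be plugged in for `dist`, including 1 − sim from the previous slide.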

SLIDE 51

Generalized similarity

Goal:

◮ Extend similarity from pairs to larger subsets of documents.
◮ Its persistence should be stable.
◮ As a bonus, the resulting complex will be smaller.

[Figure: two term-vectors over the terms A, G, T: [(A,0), (G,0.2), (T,0.9)] and [(A,0.5), (G,0), (T,0.5)].]

SLIDE 52

Simple example.

For simplicity, let us work with binary term-vectors (or sets of terms).

◮ sim_J(X_1, …, X_d) = card(∩_i X_i) / card(∪_i X_i).

◮ Generalizes the Jaccard index.
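The generalized Jaccard index above is straightforward to sketch; the function name and toy documents are illustrative.

```python
def sim_jaccard(*term_sets):
    """Generalized Jaccard index: |intersection| / |union| over any
    number of binary term-vectors, viewed as sets of terms."""
    common = set.intersection(*term_sets)
    union = set.union(*term_sets)
    return len(common) / len(union)

doc1 = {"cat", "dog", "donkey"}
doc2 = {"cat", "donkey"}
doc3 = {"donkey"}
# For two sets this is the classical Jaccard index; it extends
# unchanged to triples and larger subsets of documents.
```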

[Table: binary term-vectors over the terms cat, dog, donkey.]

SLIDE 53

New direction.

Flawed generalized cosine measure (p^i_j denotes the j-th coordinate of p_i):

R_cos(p_0, p_1, …, p_k) = Σ_{j=1}^{n} Π_{i=0}^{k} p^i_j.   (1)

Another option: the length of the geometric mean:

R_gm(p_0, p_1, …, p_k) = ( Σ_{j=1}^{n} ( Π_{i=0}^{k} p^i_j )^{2/(k+1)} )^{1/2}.   (2)
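A small sketch of both measures, assuming nonnegative term weights (as term-vectors have); the function names are my own.

```python
def r_cos(points):
    """Eq. (1): sum over coordinates j of the product of all points'
    j-th entries p^i_j. The flawed generalized cosine measure."""
    total = 0.0
    for j in range(len(points[0])):
        prod = 1.0
        for p in points:
            prod *= p[j]
        total += prod
    return total

def r_gm(points):
    """Eq. (2): the length of the coordinate-wise geometric mean of
    the k+1 points (assumes nonnegative entries)."""
    k = len(points) - 1
    s = 0.0
    for j in range(len(points[0])):
        prod = 1.0
        for p in points:
            prod *= p[j]
        s += prod ** (2.0 / (k + 1))
    return s ** 0.5

# Sanity check: for a single point (k = 0), r_gm is its Euclidean length.
```

For two unit vectors, r_cos reduces to the ordinary cosine similarity, which is the motivation for calling it a generalization.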

SLIDE 54

Log-transform

We study the N-dimensional log-transform and related distances.

SLIDE 55

Log-transform

SLIDE 56

Log-transform in 3D

SLIDE 57

Log-distance

SLIDE 58

Log-distance: formula

Let x, y ∈ R^{n−1}, s = (x, F_1(x)) and t = (y, F_1(y)). Then the log-distance from x to y is

D(x, y) = Σ_{j=1}^{n} (t_j − s_j) e^{2t_j}.

SLIDE 59

Log-distance: conjugate

[Figure: points x, y and their conjugates x*, y*.]

SLIDE 60

Log-distance: conjugate in 3D

SLIDE 61

Log Ball

SLIDE 62

Log Cech complex

Cech_r(X) = { ξ ⊆ X | ∩_{x∈ξ} B_r(x) ≠ ∅ }.   (3)
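The membership test behind this definition — do the balls B_r(x), x ∈ ξ, have a common point? — reduces, for Euclidean balls, to asking whether the minimum enclosing ball of ξ has radius at most r. A brute-force planar sketch (my own naming; adequate only for small ξ):

```python
import math
from itertools import combinations

def circumcenter(a, b, c):
    """Circumcenter of three points in the plane, or None if collinear."""
    d = 2.0 * (a[0]*(b[1]-c[1]) + b[0]*(c[1]-a[1]) + c[0]*(a[1]-b[1]))
    if abs(d) < 1e-12:
        return None
    ux = ((a[0]**2+a[1]**2)*(b[1]-c[1]) + (b[0]**2+b[1]**2)*(c[1]-a[1])
          + (c[0]**2+c[1]**2)*(a[1]-b[1])) / d
    uy = ((a[0]**2+a[1]**2)*(c[0]-b[0]) + (b[0]**2+b[1]**2)*(a[0]-c[0])
          + (c[0]**2+c[1]**2)*(b[0]-a[0])) / d
    return (ux, uy)

def cech_radius(points):
    """Smallest r such that the balls B_r(x) share a point: the radius
    of the minimum enclosing ball. The optimal center is a point, a
    pair midpoint, or a triple circumcenter, so brute force suffices."""
    candidates = list(points)
    for p, q in combinations(points, 2):
        candidates.append(((p[0]+q[0]) / 2, (p[1]+q[1]) / 2))
    for p, q, s in combinations(points, 3):
        c = circumcenter(p, q, s)
        if c is not None:
            candidates.append(c)
    return min(max(math.dist(c, p) for p in points) for c in candidates)

def in_cech(xi, r):
    """xi belongs to Cech_r(X) iff the balls B_r(x), x in xi, intersect."""
    return cech_radius(xi) <= r
```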

SLIDE 63

Generalized measure.

For each simplex ξ ∈ ∆(X), there is a smallest radius for which ξ belongs to the Čech complex:

r_C(ξ) = min{ r | ξ ∈ Cech_r(X) }.   (4)

We call r_C : ∆(X) → R the Čech radius function of X. In the original coordinate space, we get the desired similarity measure:

R_C(ξ) = e^{−r_C(ξ)/√n}.   (5)

SLIDE 64

Bregman divergences

SLIDE 65

Bregman divergences

Bregman distance from x to y: D_F(x, y) = F(x) − [F(y) + ⟨∇F(y), x − y⟩].   (6)
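Equation (6) is generic in F, which a short sketch makes concrete (names are illustrative): plugging in the squared Euclidean norm recovers the squared Euclidean distance, and the negative entropy gives the generalized Kullback-Leibler divergence.

```python
import math

def bregman(F, gradF, x, y):
    """D_F(x, y) = F(x) - [F(y) + <gradF(y), x - y>], eq. (6)."""
    inner = sum(g * (xj - yj) for g, xj, yj in zip(gradF(y), x, y))
    return F(x) - F(y) - inner

# F(x) = ||x||^2 recovers the squared Euclidean distance.
sq = lambda x: sum(v * v for v in x)
grad_sq = lambda x: [2.0 * v for v in x]

# F(x) = sum_j x_j log x_j gives the generalized Kullback-Leibler
# divergence, defined for vectors with positive entries.
ent = lambda x: sum(v * math.log(v) for v in x)
grad_ent = lambda x: [math.log(v) + 1.0 for v in x]
```

Note the asymmetry: D_F(x, y) and D_F(y, x) differ in general, which is why these are divergences rather than metrics.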

SLIDE 66

Bregman divergences

F can be any strictly convex, differentiable function!

◮ It covers the squared Euclidean distance, squared Mahalanobis distance, Kullback-Leibler divergence, and Itakura-Saito distance.
◮ Extensive use in machine learning.
◮ Links to statistics via the [regular] exponential family (of distributions).

SLIDE 67

Further connections

◮ Bregman-based Voronoi diagrams [Nielsen et al.].
◮ Information Geometry.
◮ Collapsibility Čech → Delaunay [Bauer, Edelsbrunner].
◮ Persistence stability for geometric complexes [Chazal, de Silva, Oudot].

SLIDE 68

Summary

◮ New, stable and relevant distance (dissimilarity measure) for texts.
◮ It serves as an interpretation of text data.
◮ Link between TDA and Bregman divergences.

SLIDE 69

Thank you!

Research partially supported by the TOPOSYS project