Generalized similarity measures for text data.
Hubert Wagner (IST Austria) Joint work with Herbert Edelsbrunner
GETCO 2015, Aalborg
April 9, 2015
Plan
◮ Shape of data.
◮ Text as a point-cloud.
◮ Log-transform and similarity measure.
◮ Bregman divergence and topology.
◮ Capture the shape of the union of balls.
◮ Combinatorial representation.
◮ Key property: stability!
◮ (Large) collection of text documents.
◮ Weighted vector of keywords or terms.
◮ Summarizes the topic of a single document.
◮ Higher weight means higher importance.
◮ Vector Space Model maps a corpus K to $\mathbb{R}^d$.
◮ Each distinct term in K becomes a direction, so d is the number of distinct terms in K.
◮ Each document is represented by its term-vector.
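The mapping from a corpus to term-vectors can be sketched as follows. This is a minimal illustration, not the construction used in the talk: the function name `term_vectors`, the whitespace tokenizer, and the choice of relative term frequency as the weight are all assumptions, since the slides do not specify a weighting scheme.

```python
from collections import Counter


def term_vectors(corpus):
    """Map each document (a string) to a vector of term weights.

    Assumption: the weight of a term is its relative frequency in the
    document. The slides do not say which weighting is used.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    # One coordinate (direction) per distinct term in the corpus.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            [counts[term] / total if total else 0.0 for term in vocabulary]
        )
    return vocabulary, vectors


# Hypothetical two-document corpus.
corpus = ["donkey donkey dog", "cat donkey"]
vocabulary, vectors = term_vectors(corpus)
```

Here `vocabulary` fixes the order of the coordinate axes, and each entry of `vectors` is one point of the point-cloud.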
[Figure: two term-vectors plotted on the axes Cat, Dog and Donkey.]
<(Cat,0), (Dog,0.2), (Donkey,0.9)>
<(Cat,0.5), (Donkey,0.5)>
◮ Cosine similarity compares two documents.
◮ Distance (dissimilarity) d(a, b) := 1 − sim(a, b).
◮ This d is not a metric.
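A small sketch of the cosine dissimilarity, assuming nonzero vectors given as plain Python lists. The first pair of vectors is the Cat/Dog/Donkey example from the slides; the three 2-dimensional vectors `a`, `b`, `c` are hypothetical and chosen only to show one way d fails to be a metric, namely by violating the triangle inequality.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Dissimilarity d(a, b) := 1 - sim(a, b)."""
    return 1.0 - cosine_similarity(a, b)


# Term-vectors from the slides, in the order (Cat, Dog, Donkey).
u = [0.0, 0.2, 0.9]
v = [0.5, 0.0, 0.5]
d_uv = cosine_distance(u, v)

# Hypothetical vectors: b lies halfway (in angle) between a and c.
a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
violates_triangle = (
    cosine_distance(a, c) > cosine_distance(a, b) + cosine_distance(b, c)
)
```

For `a`, `b`, `c` the direct distance d(a, c) equals 1, while the detour through `b` costs 2(1 − 1/√2), which is smaller, so `violates_triangle` is true.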
◮ d(a, b) – the dissimilarity. ◮ For triangle d(a, b, c) =
◮ Is this the filtering function we want?
◮ Extend similarity from pairs to larger subsets of documents.
◮ Its persistence should be stable.
◮ As a bonus, the resulting complex will be
[Figure: term-vectors plotted on the axes A, G and T.]
[(A,0), (G,0.2), (T,0.9)] [(A,0.5), (G,0), (T,0.5)]
◮ $\mathrm{sim}_J(X_1, \dots, X_d) = \dfrac{\operatorname{card} \bigcap_i X_i}{\operatorname{card} \bigcup_i X_i}$.
◮ Generalizes the Jaccard index.
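The set-based formula can be tried out with a few lines of Python. This is a sketch under stated assumptions: the function name `generalized_jaccard` is hypothetical, documents are modeled as sets of terms, and returning 1.0 when every set is empty is a convention chosen here, not something the slides state.

```python
def generalized_jaccard(*sets):
    """card of the common intersection divided by card of the union."""
    if not sets:
        raise ValueError("at least one set is required")
    as_sets = [set(s) for s in sets]
    intersection = set.intersection(*as_sets)
    union = set.union(*as_sets)
    if not union:
        # Assumed convention: all sets empty means identical documents.
        return 1.0
    return len(intersection) / len(union)


# Hypothetical documents, each given as its set of terms.
x1 = {"cat", "dog", "donkey"}
x2 = {"cat", "donkey"}
x3 = {"donkey", "horse"}

pair_similarity = generalized_jaccard(x1, x2)
triple_similarity = generalized_jaccard(x1, x2, x3)
```

With two arguments this is the usual Jaccard index (here 2/3); with three, only "donkey" is shared among four distinct terms, giving 1/4.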
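[Formulas lost in extraction. Surviving fragments: the indices $n$, $k$, $k+1$ and $j$, a factor $\tfrac{1}{2}$, and the term $\sum_{j=1} (t_j - s_j)\, e^{2 t_j}$.]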
[Figure: points labeled x, x*, y* and y.]
◮ It covers the squared Euclidean distance, squared
◮ Extensive use in machine learning.
◮ Links to statistics via [regular] exponential families.
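As a hedged illustration of how a Bregman divergence covers the squared Euclidean distance, the sketch below uses the standard definition D_F(x‖y) = F(x) − F(y) − ⟨∇F(y), x − y⟩ for a convex function F. The function names are hypothetical. The Kullback-Leibler case is a standard textbook example added here for comparison; it does not appear in the surviving slide text.

```python
import math


def bregman_divergence(F, grad_F, x, y):
    """D_F(x || y) = F(x) - F(y) - <grad F(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_F(y), x, y))
    return F(x) - F(y) - inner


# F(x) = ||x||^2 generates the squared Euclidean distance.
def sq_norm(x):
    return sum(xi * xi for xi in x)


def grad_sq_norm(x):
    return [2.0 * xi for xi in x]


# F(x) = sum x_i log x_i (negative entropy, positive coordinates only)
# generates the Kullback-Leibler divergence on probability vectors.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)


def grad_neg_entropy(x):
    return [math.log(xi) + 1.0 for xi in x]


# Hypothetical probability vectors.
x = [0.2, 0.8]
y = [0.5, 0.5]

sq_euclid = bregman_divergence(sq_norm, grad_sq_norm, x, y)
kl = bregman_divergence(neg_entropy, grad_neg_entropy, x, y)
kl_yx = bregman_divergence(neg_entropy, grad_neg_entropy, y, x)
```

Comparing `kl` with `kl_yx` shows that a Bregman divergence is in general not symmetric, so it is a dissimilarity rather than a metric.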
◮ Bregman-based Voronoi [Nielsen et al.].
◮ Information Geometry.
◮ Collapsibility Čech→Delaunay [Bauer,
◮ Persistence stability for geometric complexes
◮ New, stable and relevant distance (dissimilarity).
◮ It serves as an interpretation of text data.
◮ Link between TDA and Bregman divergences.
Research partially supported by the TOPOSYS project