SLIDE 1

Generalized similarity measures for text data.

Hubert Wagner (IST Austria) Joint work with Herbert Edelsbrunner

GETCO 2015, Aalborg

April 9, 2015

SLIDE 2

Plan

◮ Shape of data.
◮ Text as a point-cloud.
◮ Log-transform and similarity measure.
◮ Bregman divergence and topology.

SLIDE 3

Shape of data.

SLIDE 25

Main tools.

Rips and Čech simplicial complexes:

◮ Capture the shape of the union of balls.
◮ Combinatorial representation.

Persistence captures geometric-topological information of the data:

◮ Key property: stability!

SLIDE 44

Interpretation of filtration values.

For a simplex S = {v0, …, vk}, f(S) = t means that at filtration threshold t, the objects v0, …, vk are considered close.

SLIDE 45

Text as a point-cloud.

SLIDE 46

Basic concepts

Corpus:

◮ (Large) collection of text documents.

Term-vector:

◮ Weighted vector of key-words or terms.
◮ Summarizes the topic of a single document.
◮ Higher weight means higher importance.

SLIDE 47

Concept: Vector Space Model

◮ The Vector Space Model maps a corpus K to R^d.
◮ Each distinct term in K becomes a direction, so d can be high (tens of thousands).

◮ Each document is represented by its term-vector.
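The mapping above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the raw-count weighting, function names, and toy corpus are my own assumptions.

```python
# Sketch of the Vector Space Model: each distinct term in the corpus K
# becomes one direction of R^d, each document becomes its term-vector.

def build_vocabulary(corpus):
    """Map each distinct term to a coordinate index of R^d."""
    vocab = {}
    for doc in corpus:
        for term in doc.split():
            vocab.setdefault(term, len(vocab))
    return vocab

def term_vector(doc, vocab):
    """Represent a document as a weighted vector in R^d (raw counts)."""
    vec = [0.0] * len(vocab)
    for term in doc.split():
        vec[vocab[term]] += 1.0
    return vec

corpus = ["dog donkey donkey", "cat donkey"]
vocab = build_vocabulary(corpus)   # d = 3 directions: dog, donkey, cat
vectors = [term_vector(doc, vocab) for doc in corpus]
```

In practice the weights would come from a scheme such as tf-idf rather than raw counts; the geometry is the same.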

[Figure: two term-vectors plotted in the space spanned by Cat, Dog, Donkey: ⟨(Cat,0), (Dog,0.2), (Donkey,0.9)⟩ and ⟨(Cat,0.5), (Donkey,0.5)⟩.]

SLIDE 48

Concept: Similarity measures

◮ Cosine similarity compares two documents.
◮ Distance (dissimilarity): d(a, b) := 1 − sim(a, b).
◮ This d is not a metric.
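A concrete check of the last point, as a minimal sketch (function names are my own): the triangle inequality already fails for three simple vectors in the plane.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two term-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def d(a, b):
    """Dissimilarity d(a, b) := 1 - sim(a, b)."""
    return 1.0 - cosine_sim(a, b)

# d is not a metric: with b "between" a and c, the triangle
# inequality fails, d(a, c) > d(a, b) + d(b, c).
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
assert d(a, c) > d(a, b) + d(b, c)
```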


SLIDE 49

Geometric-topological tools.

SLIDE 50

Interpreting Rips

A simplex is added immediately after its boundary:

◮ d(a, b): the dissimilarity.
◮ For a triangle: d(a, b, c) = max(d(a, b), d(a, c), d(b, c)).

◮ Is this the filtering function we want?
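The rule above (a simplex enters at the maximum dissimilarity among its vertex pairs) can be sketched directly. A brute-force illustration under my own naming, fine for small point sets:

```python
import math
from itertools import combinations

def rips_filtration(points, dist, max_dim=2):
    """Filtration value of each Rips simplex: the maximum pairwise
    dissimilarity of its vertices, so a simplex is added immediately
    after the last of its edges."""
    f = {}
    for k in range(1, max_dim + 1):
        for simplex in combinations(range(len(points)), k + 1):
            f[simplex] = max(dist(points[i], points[j])
                             for i, j in combinations(simplex, 2))
    return f

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
f = rips_filtration(pts, math.dist)
# The triangle (0, 1, 2) enters at its longest edge length, sqrt(2).
```

Any symmetric dissimilarity can be plugged in for `dist`, including 1 − sim from the previous slide.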

SLIDE 51

Generalized similarity

Goal:

◮ Extend similarity from pairs to larger subsets of documents.
◮ Its persistence should be stable.
◮ As a bonus, the resulting complex will be smaller.

[Figure: two term-vectors over the terms A, G, T: [(A,0), (G,0.2), (T,0.9)] and [(A,0.5), (G,0), (T,0.5)].]

SLIDE 52

Simple example.

For simplicity, let us work with binary term-vectors (or sets of terms).

◮ sim_J(X_1, …, X_d) = card(∩_i X_i) / card(∪_i X_i).

◮ Generalizes the Jaccard index.
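The generalized Jaccard index above is straightforward to sketch; the function name and toy documents are illustrative.

```python
def sim_jaccard(*term_sets):
    """Generalized Jaccard index: |intersection| / |union| over any
    number of binary term-vectors, viewed as sets of terms."""
    common = set.intersection(*term_sets)
    union = set.union(*term_sets)
    return len(common) / len(union)

doc1 = {"cat", "dog", "donkey"}
doc2 = {"cat", "donkey"}
doc3 = {"donkey"}
# For two sets this is the classical Jaccard index; it extends
# unchanged to triples and larger subsets of documents.
```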

[Table: binary term-vectors over the terms cat, dog, donkey.]

SLIDE 53

New direction.

Flawed generalized cosine measure (p^i_j denotes the j-th coordinate of p_i):

R_cos(p_0, p_1, …, p_k) = Σ_{j=1}^{n} Π_{i=0}^{k} p^i_j.   (1)

Another option: the length of the geometric mean:

R_gm(p_0, p_1, …, p_k) = ( Σ_{j=1}^{n} ( Π_{i=0}^{k} p^i_j )^{2/(k+1)} )^{1/2}.   (2)
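A small sketch of both measures, assuming nonnegative term weights (as term-vectors have); the function names are my own.

```python
def r_cos(points):
    """Eq. (1): sum over coordinates j of the product of all points'
    j-th entries p^i_j. The flawed generalized cosine measure."""
    total = 0.0
    for j in range(len(points[0])):
        prod = 1.0
        for p in points:
            prod *= p[j]
        total += prod
    return total

def r_gm(points):
    """Eq. (2): the length of the coordinate-wise geometric mean of
    the k+1 points (assumes nonnegative entries)."""
    k = len(points) - 1
    s = 0.0
    for j in range(len(points[0])):
        prod = 1.0
        for p in points:
            prod *= p[j]
        s += prod ** (2.0 / (k + 1))
    return s ** 0.5

# Sanity check: for a single point (k = 0), r_gm is its Euclidean length.
```

For two unit vectors, r_cos reduces to the ordinary cosine similarity, which is the motivation for calling it a generalization.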

SLIDE 54

Log-transform

We study the N-dimensional log-transform and related distances.

SLIDE 55

Log-transform

SLIDE 56

Log-transform in 3D

SLIDE 57

Log-distance

SLIDE 58

Log-distance: formula

Let x, y ∈ R^{n−1}, s = (x, F_1(x)) and t = (y, F_1(y)). Then the log-distance from x to y is

D(x, y) = Σ_{j=1}^{n} (t_j − s_j) e^{2t_j}.

SLIDE 59

Log-distance: conjugate

[Figure: points x, y and their conjugates x*, y*.]

SLIDE 60

Log-distance: conjugate in 3D

SLIDE 61

Log Ball

SLIDE 62

Log Cech complex

Cech_r(X) = { ξ ⊆ X | ∩_{x∈ξ} B_r(x) ≠ ∅ }.   (3)
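The membership test behind this definition — do the balls B_r(x), x ∈ ξ, have a common point? — reduces, for Euclidean balls, to asking whether the minimum enclosing ball of ξ has radius at most r. A brute-force planar sketch (my own naming; adequate only for small ξ):

```python
import math
from itertools import combinations

def circumcenter(a, b, c):
    """Circumcenter of three points in the plane, or None if collinear."""
    d = 2.0 * (a[0]*(b[1]-c[1]) + b[0]*(c[1]-a[1]) + c[0]*(a[1]-b[1]))
    if abs(d) < 1e-12:
        return None
    ux = ((a[0]**2+a[1]**2)*(b[1]-c[1]) + (b[0]**2+b[1]**2)*(c[1]-a[1])
          + (c[0]**2+c[1]**2)*(a[1]-b[1])) / d
    uy = ((a[0]**2+a[1]**2)*(c[0]-b[0]) + (b[0]**2+b[1]**2)*(a[0]-c[0])
          + (c[0]**2+c[1]**2)*(b[0]-a[0])) / d
    return (ux, uy)

def cech_radius(points):
    """Smallest r such that the balls B_r(x) share a point: the radius
    of the minimum enclosing ball. The optimal center is a point, a
    pair midpoint, or a triple circumcenter, so brute force suffices."""
    candidates = list(points)
    for p, q in combinations(points, 2):
        candidates.append(((p[0]+q[0]) / 2, (p[1]+q[1]) / 2))
    for p, q, s in combinations(points, 3):
        c = circumcenter(p, q, s)
        if c is not None:
            candidates.append(c)
    return min(max(math.dist(c, p) for p in points) for c in candidates)

def in_cech(xi, r):
    """xi belongs to Cech_r(X) iff the balls B_r(x), x in xi, intersect."""
    return cech_radius(xi) <= r
```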

SLIDE 63

Generalized measure.

For each simplex ξ ∈ ∆(X), there is a smallest radius for which ξ belongs to the Čech complex:

r_C(ξ) = min{ r | ξ ∈ Cech_r(X) }.   (4)

We call r_C : ∆(X) → R the Čech radius function of X. In the original coordinate space, we get the desired similarity measure:

R_C(ξ) = e^{−r_C(ξ)/√n}.   (5)

SLIDE 64

Bregman divergences

SLIDE 65

Bregman divergences

Bregman distance from x to y: D_F(x, y) = F(x) − [F(y) + ⟨∇F(y), x − y⟩].   (6)
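Equation (6) is generic in F, which a short sketch makes concrete (names are illustrative): plugging in the squared Euclidean norm recovers the squared Euclidean distance, and the negative entropy gives the generalized Kullback-Leibler divergence.

```python
import math

def bregman(F, gradF, x, y):
    """D_F(x, y) = F(x) - [F(y) + <gradF(y), x - y>], eq. (6)."""
    inner = sum(g * (xj - yj) for g, xj, yj in zip(gradF(y), x, y))
    return F(x) - F(y) - inner

# F(x) = ||x||^2 recovers the squared Euclidean distance.
sq = lambda x: sum(v * v for v in x)
grad_sq = lambda x: [2.0 * v for v in x]

# F(x) = sum_j x_j log x_j gives the generalized Kullback-Leibler
# divergence, defined for vectors with positive entries.
ent = lambda x: sum(v * math.log(v) for v in x)
grad_ent = lambda x: [math.log(v) + 1.0 for v in x]
```

Note the asymmetry: D_F(x, y) and D_F(y, x) differ in general, which is why these are divergences rather than metrics.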

SLIDE 66

Bregman divergences

F can be any strictly convex, differentiable function!

◮ It covers the squared Euclidean distance, squared Mahalanobis distance, Kullback-Leibler divergence, and Itakura-Saito distance.
◮ Extensive use in machine learning.
◮ Links to statistics via the [regular] exponential family (of distributions).

SLIDE 67

Further connections

◮ Bregman-based Voronoi diagrams [Nielsen et al.].
◮ Information Geometry.
◮ Collapsibility Čech → Delaunay [Bauer, Edelsbrunner].
◮ Persistence stability for geometric complexes [Chazal, de Silva, Oudot].

SLIDE 68

Summary

◮ New, stable and relevant distance (dissimilarity measure) for texts.
◮ It serves as an interpretation of text data.
◮ Link between TDA and Bregman divergences.

SLIDE 69

Thank you!

Research partially supported by the TOPOSYS project