Generalized similarity measures for text data


  1. Generalized similarity measures for text data. Hubert Wagner (IST Austria) Joint work with Herbert Edelsbrunner GETCO 2015, Aalborg April 9, 2015

  2. Plan ◮ Shape of data. ◮ Text as a point-cloud. ◮ Log-transform and similarity measure. ◮ Bregman divergence and topology.

  3. Shape of data.

  4. Main tools. Rips and Čech simplicial complexes: ◮ Capture the shape of the union of balls. ◮ Combinatorial representation. Persistence captures geometric-topological information of the data: ◮ Key property: stability!
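
A minimal sketch of the pipeline this slide refers to, using the GUDHI library as one possible implementation (the slides do not prescribe a particular tool; the dissimilarity matrix below is hypothetical):

```python
import gudhi

# Toy pairwise dissimilarities between three documents (hypothetical values).
dm = [[0.0, 0.3, 0.7],
      [0.3, 0.0, 0.5],
      [0.7, 0.5, 0.0]]

# Rips complex on the dissimilarity matrix; simplices up to dimension 2.
rips = gudhi.RipsComplex(distance_matrix=dm, max_edge_length=1.0)
st = rips.create_simplex_tree(max_dimension=2)   # combinatorial representation
print(st.persistence())                          # persistence intervals per dimension
```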

  5. Interpretation of filtration values. For a simplex $S = \{v_0, \dots, v_k\}$, $f(S) = t$ means that at filtration threshold $t$, the objects $v_0, \dots, v_k$ are considered close.

  6. Text as a point-cloud.

  7. Basic concepts Corpus: ◮ A (large) collection of text documents. Term-vector: ◮ A weighted vector of keywords, or terms. ◮ Summarizes the topic of a single document. ◮ A higher weight means higher importance.

  8. Concept: Vector Space Model ◮ The Vector Space Model maps a corpus K to $\mathbb{R}^d$. ◮ Each distinct term in K becomes a coordinate direction, so d can be high (tens of thousands). ◮ Each document is represented by its term-vector. [Figure: term-vectors ⟨(Cat, 0.5), (Donkey, 0.5)⟩ and ⟨(Cat, 0), (Dog, 0.2), (Donkey, 0.9)⟩ plotted along the Cat, Dog, and Donkey axes.]
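
A minimal sketch of the Vector Space Model in Python; the toy corpus and the relative-frequency weighting are assumptions for illustration, not taken from the slides:

```python
from collections import Counter

# Toy corpus (hypothetical documents).
corpus = {
    "doc1": "cat donkey cat donkey",
    "doc2": "dog donkey donkey donkey",
}

# Every distinct term becomes one coordinate direction of R^d.
terms = sorted({w for text in corpus.values() for w in text.split()})

def term_vector(text):
    """Relative-frequency term-vector; tf-idf weighting is a common alternative."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return [counts[t] / total for t in terms]

vectors = {name: term_vector(text) for name, text in corpus.items()}
print(terms)             # ['cat', 'dog', 'donkey']
print(vectors["doc1"])   # [0.5, 0.0, 0.5]
```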

  9. Concept: Similarity measures ◮ Cosine similarity compares two documents. ◮ Distance (dissimilarity): d(a, b) := 1 − sim(a, b). ◮ This d is not a metric. [Figure: the same two term-vectors in the Cat–Dog–Donkey space.]
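
A small sketch of cosine similarity and the derived dissimilarity d(a, b) = 1 − sim(a, b); the example vectors mirror the two term-vectors in the figure:

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between term-vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dissimilarity(a, b):
    """d(a, b) := 1 - sim(a, b); note this is not a metric."""
    return 1.0 - cosine_sim(a, b)

a = [0.5, 0.0, 0.5]   # <(Cat,0.5), (Donkey,0.5)>
b = [0.0, 0.2, 0.9]   # <(Dog,0.2), (Donkey,0.9)>
print(dissimilarity(a, b))
```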

  10. Geometry-topological tools.

  11. Interpreting Rips A simplex is added immediately after its boundary: ◮ d(a, b) – the pairwise dissimilarity. ◮ For a triangle, d(a, b, c) = max(d(a, b), d(a, c), d(b, c)). ◮ Is this the filtering function we want?
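
A sketch, under the interpretation above, of the Rips filtration value of a simplex as the maximum pairwise dissimilarity of its vertices (the toy documents and values are hypothetical):

```python
from itertools import combinations

def rips_value(simplex, diss):
    """Filtration value of a simplex: the largest pairwise dissimilarity of its vertices."""
    verts = list(simplex)
    if len(verts) < 2:
        return 0.0
    return max(diss(u, v) for u, v in combinations(verts, 2))

# Toy pairwise dissimilarities between documents a, b, c (hypothetical values).
d = {("a", "b"): 0.3, ("a", "c"): 0.7, ("b", "c"): 0.5}
diss = lambda u, v: d[tuple(sorted((u, v)))]
print(rips_value(["a", "b", "c"], diss))   # 0.7, the maximum of the three pairwise values
```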

  12. Generalized similarity Goal: ◮ Extend similarity from pairs to larger subsets of documents. ◮ Its persistence should be stable. ◮ As a bonus, the resulting complex will be smaller. [Figure: vectors [(A, 0.5), (G, 0), (T, 0.5)] and [(A, 0), (G, 0.2), (T, 0.9)] plotted along the A, G, and T axes.]

  13. Simple example. For simplicity, let us work with binary term-vectors (or sets of terms). ◮ $\mathrm{sim}_J(X_1, \dots, X_d) = \dfrac{\mathrm{card}\,\bigcap_i X_i}{\mathrm{card}\,\bigcup_i X_i}$. ◮ This generalizes the Jaccard index. Example binary term-vectors over (cat, dog, donkey): (1, 1, 0), (0, 1, 1), (1, 0, 1).
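
A minimal sketch of this generalized Jaccard similarity on term sets; the three sets correspond to the binary term-vectors of the example above:

```python
def jaccard_sim(*term_sets):
    """Generalized Jaccard similarity: |intersection| / |union| over all sets."""
    inter = set.intersection(*map(set, term_sets))
    union = set.union(*map(set, term_sets))
    return len(inter) / len(union)

# The three binary term-vectors from the example, written as term sets.
x1 = {"cat", "dog"}
x2 = {"dog", "donkey"}
x3 = {"cat", "donkey"}
print(jaccard_sim(x1, x2))       # pairwise: 1/3
print(jaccard_sim(x1, x2, x3))   # all three: 0/3 = 0.0
```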

  14. New direction. Flawed generalized cosine measure:
  $R_{\cos}(p^0, p^1, \dots, p^k) = \sum_{j=1}^{n} \prod_{i=0}^{k} p^i_j. \quad (1)$
  Another option: the length of the geometric mean:
  $R_{\mathrm{gm}}(p^0, p^1, \dots, p^k) = \Big( \sum_{j=1}^{n} \Big( \prod_{i=0}^{k} p^i_j \Big)^{2/(k+1)} \Big)^{1/2}. \quad (2)$
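
A sketch of the two candidate measures (1) and (2) in Python (an illustrative implementation, not the speaker's code):

```python
from math import prod

def r_cos(*vectors):
    """(Flawed) generalized cosine: sum over coordinates of the entry-wise product."""
    return sum(prod(col) for col in zip(*vectors))

def r_gm(*vectors):
    """Length of the coordinate-wise geometric mean of the k+1 vectors."""
    k_plus_1 = len(vectors)
    return sum(prod(col) ** (2.0 / k_plus_1) for col in zip(*vectors)) ** 0.5

p0, p1 = [0.5, 0.0, 0.5], [0.0, 0.2, 0.9]
print(r_cos(p0, p1))   # 0.45
print(r_gm(p0, p1))
```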

  15. Log-transform We study the N-dimensional log-transform and related distances.

  16. Log-transform

  17. Log-transform in 3D

  18. Log-distance

  19. Log-distance: formula Let $x, y \in \mathbb{R}^{n-1}$, $s = (x, F_1(x))$ and $t = (y, F_1(y))$. Then the log-distance from $x$ to $y$ is $D(x, y) = \sum_{j=1}^{n} (t_j - s_j)\, e^{2 t_j}$.

  20. Log-distance: conjugate [Figure: points x, y and their conjugates x*, y*.]

  21. Log-distance: conjugate in 3D

  22. Log Ball

  23. Log Čech complex $\check{\mathrm{C}}\mathrm{ech}_r(X) = \{\, \xi \subseteq X \mid \bigcap_{x \in \xi} B_r(x) \neq \emptyset \,\}. \quad (3)$

  24. Generalized measure. For each simplex $\xi \in \Delta(X)$, there is a smallest radius for which $\xi$ belongs to the Čech complex: $r_C(\xi) = \min \{\, r \mid \xi \in \check{\mathrm{C}}\mathrm{ech}_r(X) \,\}. \quad (4)$ We call $r_C \colon \Delta(X) \to \mathbb{R}$ the Čech radius function of $X$. In the original coordinate space, we get the desired similarity measure: $R_C(\xi) = e^{-r_C(\xi)/\sqrt{n}}. \quad (5)$
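
A minimal sketch of formula (5): converting a Čech radius r_C(ξ), obtained from a smallest-enclosing-ball computation in the log-transformed space (not shown here), back into a similarity value; the example radius and dimension are hypothetical:

```python
import math

def similarity_from_cech_radius(r_c, n):
    """Formula (5): R_C(xi) = exp(-r_C(xi) / sqrt(n)) for ambient dimension n."""
    return math.exp(-r_c / math.sqrt(n))

# Hypothetical Cech radius of a simplex in a 3-dimensional log-transformed space.
print(similarity_from_cech_radius(0.7, n=3))
```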

  25. Bregman divergences

  26. Bregman divergences Bregman distance from $x$ to $y$: $D_F(x, y) = F(x) - \big[ F(y) + \langle \nabla F(y),\, x - y \rangle \big]. \quad (6)$

  27. Bregman divergences F can be any strictly convex function! ◮ It covers the squared Euclidean distance, the squared Mahalanobis distance, the Kullback–Leibler divergence, and the Itakura–Saito distance. ◮ Extensively used in machine learning. ◮ Links to statistics via the [regular] exponential family (of distributions).
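
A minimal sketch of formula (6) together with two of the special cases mentioned above; the function names and example inputs are illustrative, not from the slides:

```python
import numpy as np

def bregman(F, gradF, x, y):
    """Formula (6): D_F(x, y) = F(x) - [F(y) + <gradF(y), x - y>]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return F(x) - (F(y) + np.dot(gradF(y), x - y))

# F(x) = ||x||^2 yields the squared Euclidean distance.
F_sq, gF_sq = lambda x: float(np.dot(x, x)), lambda x: 2.0 * x

# F(x) = sum_i x_i log x_i yields the (generalized) Kullback-Leibler divergence.
F_kl, gF_kl = lambda x: float(np.sum(x * np.log(x))), lambda x: np.log(x) + 1.0

x, y = [0.2, 0.8], [0.5, 0.5]
print(bregman(F_sq, gF_sq, x, y))   # equals ||x - y||^2 = 0.18
print(bregman(F_kl, gF_kl, x, y))   # KL divergence of x from y (both sum to 1)
```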

  28. Further connections ◮ Bregman-based Voronoi diagrams [Nielsen et al.]. ◮ Information geometry. ◮ Collapsibility Čech → Delaunay [Bauer, Edelsbrunner]. ◮ Persistence stability for geometric complexes [Chazal, de Silva, Oudot].

  29. Summary ◮ New, stable and relevant distance (dissimilarity measure) for texts. ◮ It serves as an interpretation of text data. ◮ Link between TDA and Bregman divergences.

  30. Thank you! Research partially supported by the TOPOSYS project
