

slide-1
SLIDE 1

Persistent Homology in Text Mining

ACAT Meeting, Bremen Hubert Wagner (Jagiellonian University)

Joint work with Pawel Dlotko (UPenn) and Marian Mrozek (Jagiellonian University)

July 18, 2013

Hubert Wagner (Jagiellonian University) Persistent Homology in Text Mining 1 / 25

slide-2
SLIDE 2

Big picture.

Topology can be useful in analyzing text data. We 'sold' this idea to Google. The input is local: documents and their pairwise similarities. Persistence extracts global geometric-topological information describing the entire corpus (set of documents).

[Figure: documents as points in a term space with one axis per word (Cat, Dog, Donkey); each point summarizes one document's content.]


slide-3
SLIDE 3

Plan

Text data and its representation. Concepts from text mining, similarity measure. Extended similarity measure. Practical usage.


slide-4
SLIDE 4

Practical example of text mining application

Google Alerts (www.google.com/alerts): 'Monitor the Web for interesting new content'. You specify the query (topic, keywords). It 'googles' the given topic every day for you and sends an email notification when something new appears. Problem: lots of spam; most results people got pointed to very similar webpages/documents.


slide-5
SLIDE 5

Interesting properties of text data.

Zipf's law, intuitively: the relative frequency of the k-th most popular word is roughly 1/k. For a reasonable corpus:

k = 1 gives 6% (in English: ’the’) k = 2 gives 3% (’of’) k = 3 gives 2% (’to’)

It works for all natural languages...
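The rough 1/k behaviour above is easy to check on any corpus; a minimal rank-frequency sketch (the word counts in the example are illustrative, not from the talk):

```python
from collections import Counter

def rank_frequencies(words):
    """Return (rank, relative frequency) pairs, most frequent word first.

    Under Zipf's law, the frequency at rank k is roughly proportional
    to 1/k, so frequency(1) / frequency(k) should be roughly k.
    """
    counts = Counter(words)
    total = sum(counts.values())
    return [(rank + 1, c / total)
            for rank, (_, c) in enumerate(counts.most_common())]
```

Running this over a large corpus and plotting rank against frequency on log-log axes should give the familiar roughly straight Zipf line.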


slide-6
SLIDE 6

Representation of text data.

Each point represents a single document; ideally its position summarizes the content/topic.

[Figure: documents as points in a term space with one axis per word (Cat, Dog, Donkey).]


slide-7
SLIDE 7

Representation of text data.

It's natural to think about similarity between text documents. 'Balls' with respect to similarity describe a context (for some radius).

[Figure: balls around document-points in the term space with axes Cat, Dog, Donkey.]


slide-8
SLIDE 8

Shape of such data.

With persistence we can view it at different scales, namely: similarity thresholds. Each simplex means that its vertices/documents are similar (at this similarity scale/threshold).


slide-9
SLIDE 9

Existing tools.

There are text-mining techniques to represent the data (Vector-Space-Model). A standard similarity measure which works well in practice (cosine similarity). Of course we can just build a Rips complex... Is this the right method?
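A brute-force sketch of the Rips construction at a fixed dissimilarity threshold (illustrative only; `dsim` stands for any pairwise dissimilarity function, and real datasets need the optimized constructions discussed later):

```python
from itertools import combinations

def rips(points, dsim, eps, max_dim=2):
    """Vietoris-Rips complex at scale eps: a subset of points spans a
    simplex iff every pair in it has dissimilarity <= eps.

    Brute force over small subsets; returns simplices as tuples of
    vertex indices, up to simplices of dimension max_dim.
    """
    n = len(points)
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):          # k = number of vertices
        for sub in combinations(range(n), k):
            if all(dsim(points[i], points[j]) <= eps
                   for i, j in combinations(sub, 2)):
                simplices.append(sub)
    return simplices
```

Sweeping `eps` from 0 to 1 yields the filtration whose persistence is analyzed in the following slides.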


slide-10
SLIDE 10

Concept: Term-vectors

Used to extract characteristic words (or terms) from a document. Each term is weighted according to its relative 'importance'.

Words which appear often in a document are weighted higher, but this is offset by their global frequency ('tf-idf').

Usually at most 50 coefficients are non-zero, so each document is described by a vector of pairs (term_i, weight_i); only non-zero weights matter.
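A minimal tf-idf sketch consistent with the description above (the exact weighting variant used in the talk is unspecified; this uses the common log-idf form):

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: list of token lists. Returns one sparse term-vector per
    document, as a dict term -> tf-idf weight.

    tf is the within-document relative frequency; idf = log(n / df),
    where df counts the documents containing the term. Zero weights
    are dropped, since only non-zero coefficients matter.
    """
    n = len(corpus)
    df = Counter()                 # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t])
               for t, c in tf.items()}
        vectors.append({t: w for t, w in vec.items() if w > 0})
    return vectors
```

Note that a term appearing in every document gets idf = 0 and vanishes from all vectors, which matches the intuition that globally frequent words carry little information.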


slide-11
SLIDE 11

Concept: Vector Space Model

The Vector Space Model maps a corpus to R^d. Each document is represented by its term-vector. Each unique term becomes an orthogonal direction, so the (embedding) dimension d can be very high. Term-vectors give the coordinates of documents in this space. It was a huge breakthrough in the 80s for information retrieval, text mining etc. Still used!

[Figure: two documents as term-vectors in a space with axes Cat, Dog, Donkey: <(Cat,0), (Dog,0.2), (Donkey,0.9)> and <(Cat,0.5), (Donkey,0.5)>.]


slide-12
SLIDE 12

Concept: Vector Space Model and similarity

We use cosine similarity to 'compare' two documents. 'Dissimilarity': dsim(a, b) := 1 − sim(a, b). Dissimilarity is not a metric (no triangle inequality).

[Figure: the same two term-vectors over axes Cat, Dog, Donkey: <(Cat,0), (Dog,0.2), (Donkey,0.9)> and <(Cat,0.5), (Donkey,0.5)>.]
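A minimal sketch of `sim` and `dsim` on sparse term-vectors (dicts term -> weight), matching the definitions above:

```python
import math

def sim(a, b):
    """Cosine similarity of two sparse term-vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dsim(a, b):
    """Dissimilarity 1 - sim(a, b); in [0, 1] for non-negative
    weights, but not a metric (no triangle inequality)."""
    return 1.0 - sim(a, b)
```

On the slide's example vectors, only the shared Donkey coordinate contributes to the dot product, so the two documents are similar but far from identical.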


slide-13
SLIDE 13

Another interesting property.

Even 0-homology is interesting. The neighborhood graph has a special structure: it's a scale-free graph. Random graph model: preferential attachment (Barabási-Albert). Hyperlinks, social networks, and citation networks also follow this.
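The preferential-attachment model can be sketched as follows (a common minimal version of the Barabási-Albert process; the seeding and tie handling are illustrative, not from the talk):

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow a graph on n nodes by preferential attachment.

    Start from m seed nodes; each new node attaches (up to) m edges,
    choosing targets with probability proportional to their current
    degree (duplicate sampled targets collapse to one edge).
    Returns the edge list.
    """
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))   # first new node connects to all seeds
    repeated = []              # each node appears once per unit of degree
    for v in range(m, n):
        for t in set(targets):
            edges.append((v, t))
            repeated.extend([v, t])
        targets = [rng.choice(repeated) for _ in range(m)]
    return edges
```

High-degree nodes accumulate even more neighbors as the graph grows, which is what produces the heavy-tailed (scale-free) degree distribution.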


slide-14
SLIDE 14

New concept: Extended Vector Space Model and multidimensional similarity

We extend the notion of similarity from pairs to larger subsets of documents (up to size, say, 5). This way we capture 'higher-dimensional' relationships in the input. The resulting simplicial complex is a Cech complex.

[Figure: documents as term-vectors over axes A, G, T: [(A,0), (G,0.2), (T,0.9)] and [(A,0.5), (G,0), (T,0.5)].]


slide-15
SLIDE 15

Concept: Extended Vector Space Model and multidimensional similarity

Sim(X_1, ..., X_d) = (Σ_j Π_{i=1..d} X_{ij}) / (Π_{i=1..d} ||X_i||), where X_{ij} is the weight of term j in document i. For d = 2 it is the cosine similarity.

This extends similarity from pairs to larger subsets of documents. Each q-simplex gets as its value the similarity of the (q + 1)-subset of documents it spans.

[Figure: a triangle on documents cat, dog, donkey with example similarity values 0.8, 0.5, 0.8, 0.4, 0.3, 0.7.]


slide-16
SLIDE 16

Concept: Extended Vector Space Model and multidimensional similarity

Sim(X_1, ..., X_d) = (Σ_j Π_{i=1..d} X_{ij}) / (Π_{i=1..d} ||X_i||).

Intuition: for binary weights, Sim is the size of the set-theoretical intersection (up to normalization). For d = 2 it's (almost) the Jaccard measure.

[Figure: a triangle on documents cat, dog, donkey with all binary weights equal to 1.]
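Reading the extended similarity as the sum over shared terms of the product of the d weights, normalized by the product of the d Euclidean norms (which reduces to the cosine for d = 2, matching the slide), a minimal sketch:

```python
import math

def multi_sim(vectors):
    """Extended cosine similarity of d sparse term-vectors (dicts).

    Numerator: sum over terms shared by all vectors of the product of
    the d weights. Denominator: product of the d Euclidean norms.
    For d = 2 this is exactly the cosine similarity.
    """
    shared = set.intersection(*(set(v) for v in vectors))
    num = sum(math.prod(v[t] for v in vectors) for t in shared)
    den = math.prod(math.sqrt(sum(w * w for w in v.values()))
                    for v in vectors)
    return num / den if den else 0.0
```

With binary weights the numerator is exactly the size of the set intersection, as the slide's intuition says.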


slide-17
SLIDE 17

Our experimental setting.

We use documents from the English Wikipedia. Input: point cloud P ⊂ R^d. Build the Cech complex, filtered by dissimilarity. Remember: each simplex gets filtration value = dissimilarity of its documents.

Compute and analyze persistence diagrams.


slide-18
SLIDE 18

Persistence of these data.

We increase our dissimilarity threshold, allowing less and less related documents to be considered 'similar'. We check how holes are created and how they merge (the younger one 'dies') during this process, as the dissimilarity threshold goes from 0 to 1.
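In dimension 0, the sweep described above is a union-find computation; a minimal sketch (not the talk's implementation; the edge-list input and elder-rule tie-breaking are illustrative):

```python
def zero_dim_pairs(n, edges):
    """0-dimensional persistence via a union-find sweep.

    edges: list of (dissimilarity, i, j). All n documents are born at
    threshold 0; when two components merge, the younger one dies
    (elder rule). Returns (birth, death) pairs of dying components.
    """
    parent = list(range(n))
    birth = [0.0] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    pairs = []
    for d, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                        # edge closes a 1-cycle instead
        young, old = (ri, rj) if birth[ri] >= birth[rj] else (rj, ri)
        pairs.append((birth[young], d))     # younger component dies at d
        parent[young] = old
    return pairs
```

Edges that join two documents already in the same component create 1-cycles, which is where the 1-dimensional diagram shown later comes from.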


slide-24
SLIDE 24

Big picture again.

We are interested in the 'topology' of textual data in this representation. More precisely: in the global structure of similarities among documents. We can capture high-dimensional relationships (extended similarity). Overall it gives (some) global information about the entire corpus.


slide-25
SLIDE 25

Applications

Dimensionality estimation. Interactive text data exploration, attention routing, missing data. Inter-language comparison of corpora, stability. Simplification (overview) of data.


slide-26
SLIDE 26

Persistence in dim 1.

We see some phase transition around dissimilarity value = 0.8.

[Figure: persistence diagram in dimension 1; birth vs. death, both axes from 0 to 1.]

slide-27
SLIDE 27

Computational results: "Can you do this for 10^14 points?"

In practice the number of simplices is at least 10^9. Standard method to compute persistence: reduce the ordered boundary matrix. Efficiency is a problem for such datasets (quadratic scaling in practice; the worst case is cubic). We want a general tool to do experiments and research with...

[Figure: reduction time (s) vs. number of cells in the complex, for dim = 4.]
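The boundary-matrix reduction mentioned above can be sketched over Z/2 (a minimal version of the standard algorithm, not the PHAT implementation):

```python
def reduce_boundary(columns):
    """Standard persistence reduction over Z/2.

    columns[j] is the set of row indices of column j in the
    filtration-ordered boundary matrix. While a column shares its
    lowest nonzero row with an earlier reduced column, add that
    column to it (symmetric difference = mod-2 column addition).
    Returns the reduced columns and the pairing {low(j): j}; each
    (low(j), j) is a persistence pair of simplices.
    """
    cols = [set(c) for c in columns]
    low_of = {}                        # lowest row index -> owning column
    for j, col in enumerate(cols):
        while col and max(col) in low_of:
            col ^= cols[low_of[max(col)]]
        if col:
            low_of[max(col)] = j
    return cols, low_of
```

The inner while loop is what makes the worst case cubic: each column can be added into many later ones, which is exactly the bottleneck on complexes with 10^9 simplices.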


slide-28
SLIDE 28

Practical efficiency

For text data, using Rips complexes and standard computational methods, we experienced quadratic running time (in the size of the complex), and the complexes were large. We now see linear running times and significantly smaller complexes, by: switching to persistent cohomology (duality due to Vin de Silva et al.); using the Cech complex, which is significantly smaller; skipping preprocessing, since all pairs have non-zero persistence; and using the new, efficient PHAT library (Ulrich Bauer, Michael Kerber, Jan Reininghaus). Additionally, new types of simplicial complexes look promising (graph-induced, zig-zag...).


slide-29
SLIDE 29

Conclusion

New setup, capturing higher-dimensional relationships. We can construct a Cech complex, which is smaller than Rips. With new tools, we can handle reasonably-sized text data. We hope to use it to answer questions for some real-world data.


slide-30
SLIDE 30

Thanks to Ulrich Bauer, Herbert Edelsbrunner, and Jan Reininghaus for helpful comments. Some of this is published: H. Wagner, P. Dlotko, M. Mrozek, "Computational Topology for Text Mining", CTIC 2012. Also: you can check out the persistence library: code.google.com/p/phat/ Thank you! (Research supported by the EU Programme "Geometry and Topology in Physical Models" and the Google Research Award programme.)
