Finding Structure in Texts with Topological Data Analysis



SLIDE 1

Finding Structure in Texts with Topological Data Analysis

Calli Clay and Ella Graham

St. Catherine University

February 1, 2020

Calli Clay and Ella Graham (St. Kate’s) Finding Structure in Texts with TDA February 1, 2020 1 / 17

SLIDE 2

Introduction

Recently, analyzing data has become more complex because data sets are larger in size and higher in dimension.

To address this complexity, we looked at determining the shape of a data set using an approach called topological data analysis (TDA).

SLIDE 3

The Shape of a Data Set

Figure: Visualizing data sets (Dr. Pelatt). Two three-dimensional scatter plots, titled "A Three Dimensional Data Set" and "Yet Another Three Dimensional Data Set", each with axes Variable One, Variable Two, and Variable Three.

SLIDE 4

Research Goals

  • Determine the efficiency of topological data analysis as a text analytics tool
  • Analyze poetry forms including the villanelle and sestina
  • Analyze music genres including rock music and pop music

SLIDE 5

Background

Topology is the study of the properties of shapes that are preserved under continuous deformation, such as stretching and bending

Figure: Transforming a coffee cup into a donut (Hood)

SLIDE 6

Background

Persistent homology is a common TDA method

  • A technique for approximating the topological features of a space in different dimensions
  • Has not been widely used for analyzing texts

SLIDE 7

Simplicial Complexes

Geometric representations of the shape of a data set

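Simplices are the building blocks for simplicial complexes: a 0-simplex is a point, a 1-simplex is an edge, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron

Figure: Simplices (Huang)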

SLIDE 8

Simplicial Complexes

We can think of point clouds as being sampled from a topological space

  • Simplices are used to turn point clouds into simplicial complexes
  • Accomplished with a Vietoris-Rips complex: for a chosen distance threshold, a set of points forms a simplex whenever every pair of them lies within that threshold

Figure: Illustration of building a simplicial complex from a point cloud (Huang)
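The construction can be sketched in a few lines of code. The following is a minimal illustration in Python (the talk itself used R), not the authors' implementation; the function name `vietoris_rips`, the threshold parameter `eps`, and the square of four sample points are all assumptions made for this example.

```python
import math
from itertools import combinations

def vietoris_rips(points, eps, max_dim=2):
    """Return the simplices of the Vietoris-Rips complex at threshold eps.

    A set of points spans a simplex when every pair of its points is
    within distance eps. Simplices are tuples of point indices, grouped
    by dimension. (Hypothetical helper written for illustration only.)
    """
    n = len(points)

    def close(i, j):
        return math.dist(points[i], points[j]) <= eps

    simplices = {0: [(i,) for i in range(n)]}
    for dim in range(1, max_dim + 1):
        simplices[dim] = [
            s for s in combinations(range(n), dim + 1)
            if all(close(i, j) for i, j in combinations(s, 2))
        ]
    return simplices

# Four corners of a unit square (made-up sample data).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]

small = vietoris_rips(square, eps=1.0)  # sides connect, diagonals do not
large = vietoris_rips(square, eps=1.5)  # diagonals connect too
print(small[1], small[2])
print(len(large[1]), len(large[2]))
```

At the smaller threshold only the four sides of the square appear, leaving a loop; at the larger threshold the diagonals and all four triangles appear and the loop is filled in.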

SLIDE 9

Persistent Homology

We use persistent homology to analyze the space that is represented by simplicial complexes

We calculate homology groups in each dimension

  • Dimension 0 represents connected components
  • Dimension 1 represents holes or loops
  • Dimension 2 represents voids (enclosed cavities); higher dimensions represent their higher-dimensional analogues
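To make the dimension 0 and dimension 1 counts concrete, here is a small Python sketch (the talk used R packages instead) that counts components and loops of a Vietoris-Rips complex at one fixed threshold. It uses the standard rank formula over the two-element field: b0 = V - rank(d1) and b1 = E - rank(d1) - rank(d2). The names `rank_gf2`, `betti_0_1`, and `eps`, and the sample square, are assumptions for this example.

```python
import math
from itertools import combinations

def rank_gf2(vectors):
    """Rank of a list of bitmask vectors over GF(2), by elimination."""
    basis = {}  # highest set bit -> reduced vector with that pivot
    for v in vectors:
        while v:
            pivot = v.bit_length() - 1
            if pivot in basis:
                v ^= basis[pivot]
            else:
                basis[pivot] = v
                break
    return len(basis)

def betti_0_1(points, eps):
    """Return (b0, b1): components and loops of the complex at eps."""
    n = len(points)

    def close(i, j):
        return math.dist(points[i], points[j]) <= eps

    edges = [e for e in combinations(range(n), 2) if close(*e)]
    triangles = [
        t for t in combinations(range(n), 3)
        if all(close(i, j) for i, j in combinations(t, 2))
    ]
    edge_index = {e: k for k, e in enumerate(edges)}

    # Boundary of an edge: its two endpoints (bits over vertices).
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    # Boundary of a triangle: its three edges (bits over edges).
    d2 = [sum(1 << edge_index[f] for f in combinations(t, 2))
          for t in triangles]

    r1, r2 = rank_gf2(d1), rank_gf2(d2)
    return n - r1, len(edges) - r1 - r2

square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # made-up sample data
for eps in (0.5, 1.0, 1.5):
    print(eps, betti_0_1(square, eps))
```

For the square, the four separate points merge into one component with one loop once the sides connect, and the loop disappears once the diagonals fill it in. Persistent homology records how long each such feature survives as the threshold grows.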

SLIDE 10

Barcodes

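Visual representation of the persistent homology of a given text

  • Each bar starts at the scale where a topological feature appears and ends at the scale where it disappears, so long bars mark features that persist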
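As an illustration of what a barcode records, the sketch below computes the dimension 0 barcode of a point cloud in Python (not the R packages used in the talk). It relies on a standard fact: in a Vietoris-Rips filtration every component is born at scale 0, and a component dies at the length of the edge that merges it into another, so a union-find pass over the sorted edges suffices. The names `h0_barcode` and `render` and the sample points are assumptions for this example.

```python
import math

def h0_barcode(points):
    """Dimension 0 bars (birth, death) of a Vietoris-Rips filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # this edge merges two components
            parent[ri] = rj
            bars.append((0.0, length))
    if n:
        bars.append((0.0, math.inf))  # one component never dies
    return bars

def render(bars, width=30):
    """Draw each bar as a row of '=' characters (text-only barcode)."""
    finite = [death for _, death in bars if death != math.inf]
    longest = max(finite, default=0.0) or 1.0
    rows = []
    for _, death in bars:
        if death == math.inf:
            rows.append("=" * width + ">")
        else:
            rows.append("=" * max(1, round(width * death / longest)))
    return "\n".join(rows)

cloud = [(0, 0), (1, 0), (5, 0)]  # made-up sample data
bars = h0_barcode(cloud)
print(bars)
print(render(bars))
```

The two nearby points merge early (a short bar), while the distant point stays separate for longer (a long bar); the final bar, drawn with an arrow, is the single component that remains at every scale.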

SLIDE 11

Barcode Example with Poetry

“Do not go gentle into that good night” by Dylan Thomas

SLIDE 12

Bottleneck Distance

Once each text file is visually represented by a barcode, we can compare their barcodes to find the bottleneck distance

Measures distance between the persistent homologies of two text files

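$$W_\infty(X, Y) = \inf_{\eta : X \to Y} \; \sup_{x \in X} \lVert x - \eta(x) \rVert_\infty$$

Here X and Y are the two persistence diagrams (each bar plotted as a birth/death point), and η ranges over bijections between them, with points on the diagonal available as partners for unmatched bars.

Wasserstein distance is another approach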

Figure: Barcode 1 in Dimension 0

Figure: Barcode 2 in Dimension 0
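The bottleneck formula can be evaluated directly for very small barcodes. The Python sketch below is a brute-force illustration, not the routine in the R packages used for the talk: it tries every matching, so it is only practical for a handful of bars, and it assumes every bar has a finite death value. The function name `bottleneck` and the sample barcodes are assumptions for this example.

```python
from itertools import permutations
import math

def bottleneck(X, Y):
    """Brute-force bottleneck distance between two small barcodes.

    X and Y are lists of (birth, death) pairs with finite deaths.
    Each diagram is padded with diagonal slots so that any bar may
    instead be matched to the diagonal, at cost (death - birth) / 2.
    """
    n, m = len(X), len(Y)
    size = n + m  # X plus m diagonal slots; Y plus n diagonal slots

    def cost(i, j):
        if i < n and j < m:      # bar matched to bar: L-infinity norm
            return max(abs(X[i][0] - Y[j][0]), abs(X[i][1] - Y[j][1]))
        if i < n:                # bar of X matched to the diagonal
            return (X[i][1] - X[i][0]) / 2
        if j < m:                # bar of Y matched to the diagonal
            return (Y[j][1] - Y[j][0]) / 2
        return 0.0               # diagonal matched to diagonal

    best = math.inf
    for eta in permutations(range(size)):
        worst = max((cost(i, eta[i]) for i in range(size)), default=0.0)
        best = min(best, worst)  # infimum over matchings of the supremum
    return best

barcode_1 = [(0.0, 1.0), (0.0, 3.0)]  # made-up sample barcodes
barcode_2 = [(0.0, 1.0), (0.0, 2.0)]
print(bottleneck(barcode_1, barcode_2))
```

In this example the best matching pairs the identical short bars at cost 0 and the two long bars at cost 1, so the worst matched pair, and hence the distance, is 1.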

SLIDE 13

Process

Using R in RStudio, we:

  • Clean each text file
  • Represent each line of text with a word count vector
      • Stacking these vectors row by row gives a word count matrix
  • Calculate a distance matrix composed of the pairwise distances between the points (rows) of the word count matrix
  • Use R packages to calculate the persistent homology, create barcodes, and find pairwise bottleneck distances between barcodes
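The first three steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' R code: cleaning is taken to mean lowercasing and dropping punctuation, distance is taken to be Euclidean, and the helper names (`clean`, `word_count_matrix`, `distance_matrix`) and the three sample lines are made up for this example.

```python
import math
import re

def clean(line):
    """Lowercase a line and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", line.lower())

def word_count_matrix(lines):
    """One row per line, one column per vocabulary word."""
    tokens = [clean(line) for line in lines]
    vocab = sorted({word for row in tokens for word in row})
    column = {word: k for k, word in enumerate(vocab)}
    matrix = []
    for row in tokens:
        counts = [0] * len(vocab)
        for word in row:
            counts[column[word]] += 1
        matrix.append(counts)
    return vocab, matrix

def distance_matrix(matrix):
    """Pairwise Euclidean distances between the rows."""
    return [[math.dist(a, b) for b in matrix] for a in matrix]

# Made-up lines with a repeated refrain, as in a villanelle.
lines = [
    "The rain falls down",
    "Down, down the river runs",
    "The rain falls down",
]
vocab, counts = word_count_matrix(lines)
dist = distance_matrix(counts)
print(vocab)
print(counts)
print(dist[0][1], dist[0][2])
```

Repeated lines map to the same point, so their distance is zero. This is one way that the fixed refrains of forms such as the villanelle could show up in the resulting point cloud.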

SLIDE 14

Word Count Vectors with Song Lyrics

“raindrops (an angel cried)” by Ariana Grande

“When Raindrops fell down from the sky the day you left me, an angel cried
Oh, she cried, an angel cried
she cried”

SLIDE 15

Issues and Questions

Stop Words: Do they change word count vectors significantly?

Address with standard tf-idf technique (Wagner)

Defining Distance: Euclidean or Angular?

Algorithms: SIF (similarity filtration) or SIFTS (similarity filtration with time skeleton)?
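The first two questions can be made concrete with a short Python sketch (again a stand-in for the R workflow, not the authors' code). It assumes one common tf-idf variant, count × log(N / document frequency), which gives zero weight to words that occur in every line, and it takes angular distance to be the angle between two vectors. The function names and sample matrices are assumptions for this example.

```python
import math

def tfidf(matrix):
    """Reweight a word count matrix: count * log(N / document frequency).

    Words that occur in every row (typical stop words) get weight 0.
    """
    n = len(matrix)
    width = len(matrix[0]) if matrix else 0
    df = [sum(1 for row in matrix if row[k] > 0) for k in range(width)]
    return [
        [row[k] * math.log(n / df[k]) if df[k] else 0.0
         for k in range(width)]
        for row in matrix
    ]

def angular_distance(a, b):
    """Angle in radians between two vectors (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return math.pi / 2  # assumption: treat an empty vector as orthogonal
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

# Column 0 plays the role of a stop word: it appears in both rows.
counts = [[1, 1, 0],
          [1, 0, 2]]
weighted = tfidf(counts)
print(weighted)

# Same direction, different lengths: Euclidean distance 2, angle 0.
short, long_ = [1, 0], [3, 0]
print(math.dist(short, long_), angular_distance(short, long_))
```

The second example shows why the choice of distance matters: two lines that use the same words in the same proportions, but differ in length, are far apart under Euclidean distance and identical under angular distance.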

SLIDE 16

Results

  • Analyzing poetry using persistent homology is more interesting than analyzing song lyrics
  • Further investigation is needed before we can conclude that TDA is effective for the analysis of poetry

SLIDE 17

References

  • H. Edelsbrunner and J. Harer, Computational topology: an introduction. American Mathematical Soc., 2010.

  • X. Zhu, “Persistent homology: An introduction and a new text representation for natural language processing,” in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

  • H. Wagner, P. Dłotko, and M. Mrozek, “Computational topology in text mining,” in CT, pp. 68–78, Springer, 2012.

  • H.-L. Huang, X.-L. Wang, P. P. Rohde, Y.-H. Luo, Y.-W. Zhao, C. Liu, L. Li, N.-L. Liu, C.-Y. Lu, and J.-W. Pan, “Demonstration of topological data analysis on a quantum processor,” Optica, vol. 5, no. 2, pp. 193–198, 2018.

  • S. Gholizadeh, A. Seyeditabari, and W. Zadrozny, “Topological signature of 19th century novelists: Persistent homology in text mining,” Big Data and Cognitive Computing, vol. 2, no. 4, p. 33, 2018.

  • M. Hood, “When is a coffee mug a donut? Topology explains it,” 2016.

  • Ripser, https://live.ripser.org/.
