
Finding Structure in Texts with Topological Data Analysis

  1. Finding Structure in Texts with Topological Data Analysis
     Calli Clay and Ella Graham, St. Catherine University (St. Kate’s), February 1, 2020

  2. Introduction
     - Recently, analyzing data has become more complex because data sets are larger in size and higher in dimension.
     - To address this complexity, we looked at determining the shape of a data set using an approach called topological data analysis (TDA).

  3. The Shape of a Data Set
     Figure: Visualizing data sets (Dr. Pelatt). Two 3D scatter plots, "A Three Dimensional Data Set" and "Yet Another Three Dimensional Data Set", with axes Variable One, Variable Two, and Variable Three.

  4. Research Goals
     - Determine the efficiency of topological data analysis as a text analytics tool.
     - Analyze poetry forms, including the villanelle and the sestina.
     - Analyze music genres, including rock music and pop music.

  5. Background
     - Topology is the study of shapes.
     Figure: Transforming a coffee cup into a donut (Hood).

  6. Background
     - Persistent homology is a common TDA method.
     - It is a technique for approximating the topological features of a space in different dimensions.
     - It has not been widely used for analyzing texts.

  7. Simplicial Complexes
     - Simplicial complexes are geometric representations of the shape of a data set.
     - Simplices are the building blocks of simplicial complexes.
     Figure: Simplices (Huang).

  8. Simplicial Complexes
     - We can think of a point cloud as being sampled from a topological space.
     - Simplices are used to turn point clouds into simplicial complexes.
     - This is accomplished with a Vietoris-Rips complex.
     Figure: Illustration of building a simplicial complex from a point cloud (Huang).
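
The Vietoris-Rips construction can be sketched in a few lines. The following is a minimal illustration in Python (the slides do not show code, so the language, the function name `vietoris_rips`, the `epsilon` scale parameter, and the square-shaped point cloud are all assumptions made for this example). A set of points forms a simplex whenever every pair of them is within distance `epsilon`.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, epsilon, max_dim=2):
    """Return the simplices (as index tuples) of the Vietoris-Rips complex
    at scale epsilon, up to dimension max_dim."""
    n = len(points)
    # Two points are "close" when their distance is at most epsilon.
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= epsilon
    }
    simplices = [(i,) for i in range(n)]  # every point is a 0-simplex
    for k in range(2, max_dim + 2):       # k vertices give a (k-1)-simplex
        for verts in combinations(range(n), k):
            # Include the simplex only if all of its vertex pairs are close.
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


# Hypothetical point cloud: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
complex_small = vietoris_rips(square, epsilon=1.0)  # sides only: a loop
complex_large = vietoris_rips(square, epsilon=1.5)  # diagonals fill the loop
```

At `epsilon=1.0` only the four sides of the square become edges, so the complex is a loop. At `epsilon=1.5` the diagonals appear as well and triangles fill the loop in. This brute-force enumeration is only practical for small point clouds.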

  9. Persistent Homology
     - We use persistent homology to analyze the space that is represented by simplicial complexes.
     - We calculate homology groups in each dimension:
       - Dimension 0 represents components.
       - Dimension 1 represents holes or loops.
       - Dimension 2 and higher represent voids.
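
For dimension 0, persistence can be computed without any homology library: every point starts as its own component at scale 0, and a component "dies" at the edge length where it merges into another. The sketch below is an assumption-laden illustration in Python, not the authors' method; the function name `h0_barcode` and the sample point cloud are hypothetical. It uses a union-find structure over edges sorted by length.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Dimension-0 barcode of the Vietoris-Rips filtration:
    one (birth, death) bar per point."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Process edges from shortest to longest, as the scale grows.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j      # two components merge
            bars.append((0.0, length))   # one of them dies here
    if n:
        bars.append((0.0, inf))          # the last component never dies
    return bars


# Hypothetical point cloud: two pairs of nearby points, far apart.
cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 2.0)]
bars = h0_barcode(cloud)
```

Long bars indicate components that stay separate over a wide range of scales. Higher-dimensional bars (loops and voids) require a full persistent homology computation, which this sketch does not attempt.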

  10. Barcodes
     - A barcode is a visual representation of the persistent homology of a given text.

  11. Barcode Example with Poetry
     - "Do not go gentle into that good night" by Dylan Thomas

  12. Bottleneck Distance
     - Once each text file is visually represented by a barcode, we can compare their barcodes to find the bottleneck distance.
     - It measures the distance between the persistent homologies of two text files:
       $$W_\infty(X, Y) = \inf_{\eta : X \to Y} \, \sup_{x \in X} \lVert x - \eta(x) \rVert_\infty$$
     - Wasserstein distance is another approach.
     Figures: Barcode 1 in Dimension 0; Barcode 2 in Dimension 0.
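
In the bottleneck distance, η is usually taken over matchings between the two persistence diagrams in which any bar may also be matched to the diagonal (birth equal to death). The brute-force sketch below, in Python, tries every such matching and is only meant to make the definition concrete for very small diagrams. The function name `bottleneck_distance`, the restriction to finite bars, and the sample barcodes are assumptions for this example.

```python
from itertools import permutations


def bottleneck_distance(diagram_x, diagram_y):
    """Brute-force bottleneck distance between two small persistence
    diagrams, each a list of finite (birth, death) pairs."""

    def to_diagonal(p):
        # L-infinity distance from a point to the diagonal birth == death.
        return (p[1] - p[0]) / 2

    def cost(a, b):
        if a is None and b is None:   # diagonal matched to diagonal
            return 0.0
        if a is None:
            return to_diagonal(b)
        if b is None:
            return to_diagonal(a)
        return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

    # Pad each side with diagonal slots (None) so every bar may be matched
    # to the diagonal instead of to a bar in the other diagram.
    left = list(diagram_x) + [None] * len(diagram_y)
    right = list(diagram_y) + [None] * len(diagram_x)
    if not left:
        return 0.0
    # inf over matchings of the sup (max) over matched pairs.
    return min(
        max(cost(a, b) for a, b in zip(left, perm))
        for perm in permutations(right)
    )


# Hypothetical finite bars from two dimension-0 barcodes.
bars_a = [(0.0, 4.0), (0.0, 1.0)]
bars_b = [(0.0, 3.0)]
d = bottleneck_distance(bars_a, bars_b)
```

Here the best matching pairs (0, 4) with (0, 3) and sends the short bar (0, 1) to the diagonal. The factorial cost of enumerating matchings makes this unusable beyond a handful of bars; dedicated TDA libraries use much faster matching algorithms.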

  13. Process
     Using R in RStudio, we:
     - Clean each text file.
     - Represent each line of text with a word count vector; together, these vectors form a word count matrix.
     - Calculate a distance matrix composed of the pairwise distances between the rows of the word count matrix.
     - Use R packages to calculate the persistent homology, create barcodes, and find pairwise bottleneck distances between barcodes.

  14. Word Count Vectors with Song Lyrics
     - "raindrops (an angel cried)" by Ariana Grande
     - “When Raindrops fell down from the sky the day you left me, an angel cried oh, she cried, an angel cried she cried”
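
The first steps of the process (cleaning, word count vectors, distance matrix) can be illustrated on this lyric. The slides describe an R workflow but show no code, so the Python below is a stand-in sketch. The helper names, the cleaning rule (lowercase, letters only), and the way the lyric is split into four lines are all assumptions, since the line breaks are not preserved in the transcript.

```python
import re
from math import dist


def clean(line):
    """Lowercase a line and keep only runs of letters as word tokens."""
    return re.findall(r"[a-z]+", line.lower())


def word_count_matrix(lines):
    """Return the sorted vocabulary and one word count vector per line."""
    tokens = [clean(line) for line in lines]
    vocab = sorted({word for row in tokens for word in row})
    matrix = [[row.count(word) for word in vocab] for row in tokens]
    return vocab, matrix


def distance_matrix(matrix):
    """Pairwise Euclidean distances between the rows of the matrix."""
    return [[dist(u, v) for v in matrix] for u in matrix]


# Assumed line breaks for the quoted lyric.
lyric = [
    "When Raindrops fell down from the sky",
    "the day you left me, an angel cried",
    "oh, she cried, an angel cried",
    "she cried",
]
vocab, matrix = word_count_matrix(lyric)
distances = distance_matrix(matrix)
```

Each row of `matrix` is a point in a space with one dimension per vocabulary word, and `distances` is the input that a persistent homology routine would consume.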

  15. Issues and Questions
     - Stop words: Do they change word count vectors significantly? Address with the standard tf-idf technique (Wagner).
     - Defining distance: Euclidean or angular?
     - Algorithms: SIF (similarity filtration) or SIFTS (similarity filtration with time skeleton)?
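
The first two questions can be made concrete with a small sketch. The Python below uses one common tf-idf variant (raw count times log(N / document frequency)); the slides do not say which variant was used, so that choice, the function names, and the toy count matrix are assumptions. It also contrasts Euclidean distance with angular distance on the same pair of vectors.

```python
from math import acos, dist, log, sqrt


def tfidf(matrix):
    """Reweight a word count matrix: count * log(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[t] > 0) for t in range(n_terms)]
    # A term that never occurs gets weight 0 to avoid dividing by zero.
    idf = [log(n_docs / d) if d else 0.0 for d in df]
    return [[row[t] * idf[t] for t in range(n_terms)] for row in matrix]


def angular_distance(u, v):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift.
    return acos(max(-1.0, min(1.0, dot / norm)))


# Hypothetical counts: columns are ("the", "night", "light"), rows are lines.
counts = [
    [2, 1, 0],
    [4, 2, 0],
    [1, 0, 1],
]
weighted = tfidf(counts)
euclidean = dist(counts[0], counts[1])
angular = angular_distance(counts[0], counts[1])
```

Because "the" occurs in every line, its idf is log(1) = 0 and its column vanishes after weighting, which is how tf-idf dampens stop words. The first two rows point in the same direction, so their angular distance is zero even though their Euclidean distance is not; the angular choice ignores line length, while the Euclidean choice does not.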

  16. Results
     - Analyzing poetry using persistent homology is more interesting than analyzing song lyrics.
     - With further investigation, we may be able to conclude that TDA is effective for the analysis of poetry.

  17. References
     - H. Edelsbrunner and J. Harer, Computational Topology: An Introduction. American Mathematical Soc., 2010.
     - X. Zhu, “Persistent homology: An introduction and a new text representation for natural language processing,” in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
     - H. Wagner, P. Dłotko, and M. Mrozek, “Computational topology in text mining,” in CT, pp. 68–78, Springer, 2012.
     - H.-L. Huang, X.-L. Wang, P. P. Rohde, Y.-H. Luo, Y.-W. Zhao, C. Liu, L. Li, N.-L. Liu, C.-Y. Lu, and J.-W. Pan, “Demonstration of topological data analysis on a quantum processor,” Optica, vol. 5, no. 2, pp. 193–198, 2018.
     - S. Gholizadeh, A. Seyeditabari, and W. Zadrozny, “Topological signature of 19th century novelists: Persistent homology in text mining,” Big Data and Cognitive Computing, vol. 2, no. 4, p. 33, 2018.
     - M. Hood, “When is a coffee mug a donut? Topology explains it,” 2016.
     - Ripser, https://live.ripser.org/.
