Content visualization of scientific corpora using an extensible - PowerPoint PPT Presentation

Content visualization of scientific corpora using an extensible relational database implementation
Eleftherios Stamatogiannakis, Ioannis Foufoulas, Theodoros Giannakopoulos, Harry Dimitropoulos, Natalia Manola and Yannis Ioannidis
National and Kapodistrian University of Athens
Management of Data, Information, and Knowledge Group

Introduction
- Goal: content-based visualization of scientific documents
  - Scientific documents: rich and diverse
  - Applied on three datasets
  - Application on publications that share a common funding scheme (e.g. EU FP7-ICT): funding mining submodule
  - Implemented in madIS: data analysis via an extended relational DB
- OpenAIRE+: EU project "2nd-Generation Open Access Infrastructure for Research in Europe" (283595)
  - infrastructure of publication and data repositories
  - implements the EC's open access policies
  - connects publications to research data and funding
  - automatic content clustering and classification

Content-based classification (1)
- Document $d$ representation:
  1. tokenization
  2. stop word removal
  3. stemming
  4. term frequency $df_d(t)$ calculation
- repeat for each class $c$
- estimate $P(t)$ and $P(t \mid c)$
- build for each class $c$:
  - a dictionary $D_c$
  - an array of respective weights $W_c$

Content-based classification (2)
- add term $t$ to the dictionary if $\frac{P(t \mid c)}{P(t)} > T$
- $W_c(t) = \frac{P(t \mid c)}{P(t)}$
- classification based on the sum of logs
- actually equivalent to the Naive Bayes classifier
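The dictionary construction and sum-of-logs scoring described on these two slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes documents are already tokenized, stemmed and stop-word filtered, that $P(t)$ and $P(t \mid c)$ are estimated as relative term frequencies, and that no class priors or smoothing are applied. The function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, threshold=2.0):
    """Build one dictionary {term: weight} per class.

    A term enters the class dictionary when P(t|c) / P(t) > threshold,
    and its weight is that same ratio. Probabilities are estimated as
    relative term frequencies (an assumption of this sketch).
    """
    total = Counter()
    per_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        total.update(tokens)
        per_class[label].update(tokens)

    n_total = sum(total.values())
    dictionaries = {}
    for label, counts in per_class.items():
        n_class = sum(counts.values())
        weights = {}
        for term, count in counts.items():
            ratio = (count / n_class) / (total[term] / n_total)
            if ratio > threshold:
                weights[term] = ratio
        dictionaries[label] = weights
    return dictionaries


def classify(tokens, dictionaries):
    """Score each class by the sum of log-weights of matching terms."""
    scores = {
        label: sum(math.log(weights[t]) for t in tokens if t in weights)
        for label, weights in dictionaries.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Raising `threshold` shrinks every dictionary, which trades some coverage for fewer lookups per document.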

Visualization of text corpora: Class representation - dimensionality reduction (1)
- Goal: class representation in 2D space
- Classes of similar content are close
- A dimensionality reduction task (classes instead of samples)
- Previous step: class represented by a set of terms and weights
- Compute similarity matrix:
  $$S(c_1, c_2) = \frac{\sum_{i=1}^{N_1} W_{c_2}\big(k : D_{c_1}(i) = D_{c_2}(k)\big)}{\sum_{i=1}^{N_2} W_{c_2}(i)} + \frac{\sum_{i=1}^{N_2} W_{c_1}\big(k : D_{c_2}(i) = D_{c_1}(k)\big)}{\sum_{i=1}^{N_1} W_{c_1}(i)} \qquad (1)$$
- classes $c_1$ and $c_2$, dictionaries $D_{c_1}$ and $D_{c_2}$, weights $W_{c_1}$ and $W_{c_2}$, number of terms per dictionary $N_1$ and $N_2$
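Equation (1) can be read as "the share of each class's total weight that falls on terms shared with the other class, summed over both directions". A minimal sketch under that reading, assuming each class dictionary is stored as a Python dict mapping term to weight (the names `similarity` and `similarity_matrix` are hypothetical):

```python
def similarity(w1, w2):
    """S(c1, c2) from Eq. (1) for two {term: weight} dictionaries."""
    if not w1 or not w2:
        return 0.0  # assumption: an empty dictionary shares nothing
    shared = w1.keys() & w2.keys()
    share_of_c2 = sum(w2[t] for t in shared) / sum(w2.values())
    share_of_c1 = sum(w1[t] for t in shared) / sum(w1.values())
    return share_of_c2 + share_of_c1


def similarity_matrix(dictionaries):
    """Return class labels (sorted) and the M x M similarity matrix S."""
    labels = sorted(dictionaries)
    matrix = [
        [similarity(dictionaries[a], dictionaries[b]) for b in labels]
        for a in labels
    ]
    return labels, matrix
```

Under this reading S is symmetric, ranges from 0 (no shared terms) to 2 (identical dictionaries), and each row of the matrix describes one class by its similarity to all M classes.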

Visualization of text corpora: Class representation - dimensionality reduction (2)
- $S$: class distributions in the $\mathbb{R}^M$ space ($M$: number of classes)
- Reduce dimension to 2 using discretized class representations
- Step 1: reduce the feature space via clustering ($k$ clusters):
  - use rows of $S$ as samples
  - compute distance between rows $i$ of $S$ and clusters $j$: $d(i, j)$
  - each class is now represented by its distances from the clusters: a $k$-dimensional feature space
- Step 2: use a SOM to map the $k$-dimensional space to 2D
  - avoids numerical and computational issues in SOM training
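The two steps can be sketched in plain Python as below. The slides do not name the clustering algorithm or the SOM settings, so plain k-means, a Euclidean distance, a 6 x 6 grid, and linearly decaying learning rate and neighbourhood width are all assumptions of this sketch, and the function names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(rows, k, iters=50, seed=0):
    """Plain k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda j: dist(r, centroids[j]))
            groups[nearest].append(r)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_features(S, k, seed=0):
    """Step 1: describe each row of S by its distances d(i, j) to k clusters."""
    centroids = kmeans(S, k, seed=seed)
    return [[dist(row, c) for c in centroids] for row in S]


def som_coordinates(features, rows=6, cols=6, epochs=100, seed=0):
    """Step 2: train a small SOM and return one (x, y) grid cell per class."""
    rng = random.Random(seed)
    dim = len(features[0])
    scale = max(max(f) for f in features) or 1.0
    grid = {(y, x): [rng.random() * scale for _ in range(dim)]
            for y in range(rows) for x in range(cols)}

    def best_matching_unit(f):
        return min(grid, key=lambda node: dist(f, grid[node]))

    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                           # learning rate decay
        sigma = 0.5 + (max(rows, cols) / 2.0) * (1.0 - frac)  # neighbourhood decay
        order = features[:]
        rng.shuffle(order)
        for f in order:
            by, bx = best_matching_unit(f)
            for (y, x), w in grid.items():
                h = math.exp(-((y - by) ** 2 + (x - bx) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += rate * h * (f[d] - w[d])
    return [best_matching_unit(f)[::-1] for f in features]
```

Feeding the SOM k distances per class instead of the full M-dimensional rows keeps the training vectors short, which is the motivation the slide gives for step 1.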

Visualization of text corpora: Content visualization of unlabelled corpora
- "Batch" classification + visualization for a corpus
- For every document $i$, $i = 1, \ldots, N_d$:
  - apply the classifier
  - soft outputs: $P_i(c)$ for each class $c = 1, \ldots, M$
- The collection of documents is represented for each class $c$, using:
  - the estimated 2-D class coordinates
  - the accumulated estimated content class probability
  $$\left[ X_c,\; Y_c,\; \frac{\sum_{i=1}^{N_d} P_i(c)}{N_d} \right]$$
  where $X_c$ and $Y_c$ are the estimated 2-D class coordinates
- Adopt a balloon representation
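A minimal sketch of how the per-class triplet could be assembled from classifier soft outputs. The input layout (one dict of class probabilities per document, one coordinate pair per class) and the name `balloons` are assumptions made for illustration.

```python
def balloons(probabilities, coordinates):
    """Return {class: (X_c, Y_c, mean probability)} for a corpus.

    probabilities: list with one {class: P_i(c)} dict per document.
    coordinates:   {class: (x, y)} estimated 2-D class positions.
    """
    n_docs = len(probabilities)
    if n_docs == 0:
        raise ValueError("the corpus is empty")
    result = {}
    for label, (x, y) in coordinates.items():
        accumulated = sum(p.get(label, 0.0) for p in probabilities)
        result[label] = (x, y, accumulated / n_docs)
    return result
```

The third component can then drive the balloon size, so classes that dominate the corpus appear as larger balloons at their 2-D positions.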

Fund Mining Module
- Goal: detect particular funding schemes
  - Started with EU FP7-funded projects
  - Recently extended to Wellcome Trust projects
  - Currently: handles an arbitrary number of funding schemes
- Funding information is important for funding statistics and visual analytics. Here, funding information is used to specify the types of documents being visualized
- The module is used either on individual documents or in batch mode
- Preprocessing (stop word removal, tokenization, etc.)
- Find matches against known lists of project grant agreement numbers and acronyms
- Use contextual information to filter out false matches
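The match-then-filter idea can be sketched as below. Everything specific here is an assumption for illustration: the six-digit pattern, the context vocabulary, the window size and the name `find_grants` are hypothetical, and the known-grants list holds only the OpenAIRE+ number quoted in the introduction.

```python
import re

# Hypothetical lookup list; 283595 is the OpenAIRE+ project number from the slides.
KNOWN_GRANTS = {"283595": "OpenAIRE+"}

# Hypothetical context vocabulary used to reject accidental number matches.
CONTEXT_WORDS = ("grant", "agreement", "fp7", "funded", "funding", "project")


def find_grants(text, known=KNOWN_GRANTS, window=60):
    """Return (number, acronym) pairs whose surrounding text mentions funding."""
    lowered = text.lower()
    hits = []
    for match in re.finditer(r"\b\d{6}\b", text):
        number = match.group()
        if number not in known:
            continue
        start = max(0, match.start() - window)
        context = lowered[start:match.end() + window]
        if any(word in context for word in CONTEXT_WORDS):
            hits.append((number, known[number]))
    return hits
```

A bare number such as "283595 cells" is rejected because no funding-related word appears within the window, which is the kind of false match the contextual filter is meant to remove.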

Datasets
- arXiv (arxiv.org)
  - covers 7 general categories
  - 2-level hierarchy (130 classes at the 2nd level, 7 general categories at the 1st)
  - only the 2nd-level labels are used
  - arXiv.org API used to retrieve the documents (450K abstracts used)
- BASE (www.base-search.net)
  - open access archive of scientific documents
  - operated by Bielefeld University Library
  - DDC categorization (Dewey Decimal Classification)
  - 35K annotated documents in the English language
  - 2 DDC levels (i.e. 100 classes)
- WoS (Web of Science), http://thomsonreuters.com/web-of-science
  - 180 class labels (non-hierarchical)
  - 18K labelled abstracts

Implementation Issues (1)
- Training and testing (both classification and visualization modules):
  - implemented in Python
  - external libraries, e.g. NumPy, NLTK, ...
- But: a release version is needed for the testing case:
  - implementation in the context of a data processing workflow
  - easily transferred to a distributed environment
- Achievable? Yes: the adopted scheme only involves text segmentation, dictionary term retrieval and a simple weights computation

Implementation Issues (2)
- Implemented in madIS (https://code.google.com/p/madis/)
- Data analysis via an extended relational database
- Built on top of the SQLite database with Python extensions
- Feels like Hadoop SQL without the overhead, but also without the distributed processing capabilities
- madSQL: an SQL-based language extended with UDFs (User Defined Functions)
- Eliminates the effort of using UDFs (UDFs are first-class citizens in the query language itself)
- UDF categories:
  - row: analogous to the Map operator of Map/Reduce systems
  - aggregate: captures arbitrary aggregation functionality beyond what is predefined in SQL (SUM(), AVG(), ...); analogous to the Reduce operator
  - virtual table: (table functions in PostgreSQL and Oracle) used to create virtual tables

Implementation Issues (3)
- UDF functionality + traditional relational DB facilities (UDFs closely tied to the relational DB engine): eliminates the communication cost between the two execution layers (functional/relational)
- Naive Bayes classification example:
  - use a UDF to split documents into words
  - use the relational facilities to calculate word frequencies
  - use aggregate UDFs to compute the sum of logs
  - ALL done in one madSQL query, completely within madIS
- every process (classification, visualization) is implemented in terms of an (extended) SQLite query:

    create temp table if not exists resultsTable as
    select ontop(5, p, title, class, matches, p)
    from (select title, class, jgroup(term, p) as matches, sum(p) as p
          from (select *
                from (select title, textwindow((summary), 0, 0, 2) from abstractTable), arxiv
                where middle = term
                   or regexpr('(\S+)(\s)(\S+)', middle, '1') = term)
          group by title, class)
    group by title;
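The same division of labour (row UDF, relational join and grouping, aggregate UDF) can be imitated with Python's standard `sqlite3` module. This is only an analogy to madIS, not madSQL itself: the table layout, the `normalize` and `logsum` functions and the name `classify_in_sqlite` are hypothetical, and because the standard module offers no virtual-table UDFs, documents are split into words in Python before loading.

```python
import math
import sqlite3


class LogSum:
    """Aggregate UDF (Reduce-like): sum of natural logs of positive values."""

    def __init__(self):
        self.total = 0.0

    def step(self, value):
        if value is not None and value > 0:
            self.total += math.log(value)

    def finalize(self):
        return self.total


def normalize(token):
    """Row UDF (Map-like): lowercase a token and trim surrounding punctuation."""
    return token.lower().strip(".,;:()")


def classify_in_sqlite(documents, weights):
    """documents: {title: text}; weights: iterable of (class, term, weight)."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("normalize", 1, normalize)
    conn.create_aggregate("logsum", 1, LogSum)
    conn.execute("CREATE TABLE tokens (title TEXT, term TEXT)")
    conn.execute("CREATE TABLE weights (class TEXT, term TEXT, weight REAL)")
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?)",
        [(title, tok) for title, text in documents.items() for tok in text.split()],
    )
    conn.executemany("INSERT INTO weights VALUES (?, ?, ?)", list(weights))
    rows = conn.execute(
        """
        SELECT t.title, w.class, logsum(w.weight) AS score
        FROM tokens AS t
        JOIN weights AS w ON normalize(t.term) = w.term
        GROUP BY t.title, w.class
        ORDER BY t.title, score DESC
        """
    ).fetchall()
    conn.close()
    return rows
```

Each returned row is `(title, class, score)`, with the best-scoring class listed first for every title. The join and grouping stay inside the database engine, and the Python callbacks are only invoked per row or per group.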

Results - Classification Evaluation on the arXiv dataset
[Figure: classification evaluation for different probability ratio thresholds]

Results - Classification Module Execution Times
Table: Average execution times per abstract, in msecs. Higher dictionary thresholds (fewer dictionary terms) lead to faster classification.

    # abstracts   T = 2   T = 10   T = 20   T = 30   T = 50
    10            63      27       21       20       17
    15000         57      23       15       13       10

Results - Visualization Example: FP7 - ICT calls - arXiv
[Figure: balloon visualization; legend of arXiv classes shown]
- astro-ph.GA: Astrophysics - Galaxy Astrophysics
- astro-ph.SR: Astrophysics - Solar and Stellar
- cs.DB: CS - Databases
- cs.DL: CS - Digital Libraries
- cs.IT: CS - Information Theory
- cs.LG: CS - Machine Learning
- cs.PF: CS - Performance
- cond-mat.other: Condensed Matter - Other
- hep-lat: High Energy Physics - Lattice
- hep-th: High Energy Physics - Theory
- physics.geo-ph: Physics - Geophysics
- physics.optics: Physics - Optics
- physics.space-ph: Physics - Space Physics
- quant-ph: Quantum Physics
- q-bio.CB: Quantitative Biology - Cell Behavior
- stat.ML: Statistics - Machine Learning

Results - Visualization Example: FP7 - ICT calls - WoS
[Figure: balloon visualization; legend of WoS classes shown]
- BU: Astronomy - Astrophysics
- GC: Geochemistry & Geophysics
- GU: Ecology
- IQ: Engineering, Electrical & Electronic
- EX: Computer Science, Theory & Methods
- HL: Health Care Sciences
- LE: Geosciences, Multidisc.
- PU: Mechanics
- RU: Neurosciences
- UB: Physics, Applied
- UK: Physics, Condensed Matter
- UI: Physics, Multidisc.
- UP: Physics, Particles & Fields
- YE: Telecom.

OpenAIRE mining service BETA
- http://hatter.madgik.di.uoa.gr:8080/openaireplus/classifier
- returns a JSON-like result (for several taxonomies, not only arXiv)

Ongoing work
- Detailed evaluation (visualization functionality)
- Visualization enhancement (e.g., a tag cloud for each estimated class probability)
- Semi-supervised techniques (e.g., probabilistic topic modeling)
