SLIDE 1

Terminology-based Text Embedding for Computing Document Similarities on Technical Content

July 2019

Hamid Mirisaee, Eric Gaussier, Cedric Lagnier, Agnes Guerraz

SLIDE 2

Outline

  • Introduction
  • Related work
  • Proposed method
  • Experiments
  • Conclusion

SLIDE 3
  • Concours I-LAB (the I-LAB competition).
  • Concours de l’innovation (the innovation competition).
  • Partnership with LIG.

Businesses need to work with startups, so they need “good” information: crawl the web for startup sites, process the pages, and provide structured information.

SLIDE 4

What does it look like?

SLIDE 5

Introduction

SLIDE 6

Introduction

  • Finding relevant documents: an everyday task.
  • The principle is (almost) always the same:

  1. Define the space and the similarity measure.
  2. Map everything into that space.
  3. Find the closest documents in the space.

  • Query ~ document:
  • e.g. the closest papers to “Building Representative Composite Items” on arXiv?

How to define the space? How to represent documents?
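One concrete instantiation of these three steps is the classic tf-idf space with cosine similarity. A minimal, standard-library-only sketch; the toy documents and helper names are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Steps 1-2: map tokenized documents into the tf-idf space (sparse dicts)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "insulin pump clinical test".split(),
    "artificial pancreas insulin pump".split(),
    "graph based keyword extraction".split(),
]
vecs = tfidf_vectors(docs)
# Step 3: the document closest to docs[0], excluding itself.
closest = max((i for i in range(len(docs)) if i != 0),
              key=lambda i: cosine(vecs[0], vecs[i]))
```

Swapping the tf-idf map for any other embedding changes the space but leaves the pipeline untouched.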

SLIDE 7

Related work (motivation?)

  • Classic: tf-idf + cosine.
  • KNN with tf-idf —> text classification [2014].
  • Text representation: tf-idf, LSI and multiword —> text classification [2011].
  • tf-idf is nice & helps in many tasks such as topic modelling, but…
  • Let’s be more contextual:
  • Go to word level (word2vec).
  • Represent words such that they carry semantic features (based on co-occurrence).
  • vec(King) - vec(Man) + vec(Woman) ~ vec(Queen)
  • most_similar(car) = [cars, vehicle, automobile]
  • Many variations: doc2vec, sent2vec, combine with tf-idf, …

SLIDE 8

Motivation

Given a document, find similar documents in a “selective” fashion. Do you seriously read the introduction and/or related work section of papers all the time?


Focus on the important parts of the document.

SLIDE 9

Proposed method: the big picture!

  • 1. Extract keywords and/or keyphrases (composite keywords) of the document.
  • 2. Score the sentences of the document based on the (composite) keywords they contain.
  • 3. Pick a way to embed the sentences.
  • 4. Embed the document as weighted average of the embeddings of its sentences.

SLIDE 10

Extracting (composite) keywords: use graphs!

Graph = Nodes + Edges

  • Many problems can be formulated and/or interpreted via graph structure.
  • In NLP:
  • Nodes —> entities (words, sentences, paragraphs, etc.).
  • Edges —> relations between them (semantic, co-occurrence, etc.).
  • [Rousseau et al.] graph-of-words:
  • Nodes —> terms of the documents.
  • Edges —> if two terms co-occur in a fixed-size window.

SLIDE 11

Graph-of-words


Example sentence: “The proposed method can be used to find similar documents, particularly when the technical content is concerned for finding relevant documents.”

Term sequence after preprocessing: proposed method similar documents technical content relevant documents.

Sliding windows of size 3 and the edges they generate:

  • proposed method similar —> {proposed, method}, {proposed, similar}, {method, similar}
  • method similar documents —> {method, similar}, {method, documents}, {similar, documents}
  • similar documents technical —> …
  • documents technical content —> …
  • technical content relevant —> …
  • content relevant documents —> …

(un)weighted (un)directed
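The window construction above can be sketched directly (a minimal unweighted, undirected variant with window size 3, standard library only; the function name is mine):

```python
from itertools import combinations

def graph_of_words(terms, window=3):
    """Add an edge between every pair of distinct terms that co-occur
    inside a sliding window of the given size."""
    edges = set()
    # max(..., 1) keeps one (short) window when the text has fewer terms.
    for i in range(max(len(terms) - window + 1, 1)):
        for u, v in combinations(terms[i:i + window], 2):
            if u != v:
                edges.add(frozenset((u, v)))
    return edges

terms = "proposed method similar documents technical content relevant documents".split()
edges = graph_of_words(terms)
# {proposed, method} is an edge; {proposed, documents} is not
# (those two terms never share a window).
```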

SLIDE 12

K-core


image from http://frncsrss.github.io/papers/rousseau-dissertation.pdf

  • K-core: the maximal subgraph in which every node has degree ≥ K.
  • Core with max K —> main core.
  • Idea: it’s important to be central, but your neighbours are also important!
  • [Rousseau & Vazirgiannis]:
  • Main core —> keywords & keyphrases.
  • Better than HITS and PageRank.
  • No hyperparameter.
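For intuition, the peeling procedure that yields the cores can be sketched in a few lines (the toy graph and function names are mine, not the authors' implementation):

```python
def k_core(adj, k):
    """Nodes of the k-core: iteratively prune every node of degree < k
    until all remaining nodes have degree >= k."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if len(adj[u]) < k:
                for v in adj.pop(u):              # remove u and its edges
                    adj[v].discard(u)
                changed = True
    return set(adj)

def main_core(adj):
    """Main core = the non-empty k-core with the largest k."""
    k = 1
    while k_core(adj, k + 1):
        k += 1
    return k, k_core(adj, k)

# Toy graph: triangle a-b-c plus a pendant node d hanging off a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
k, core = main_core(adj)
# k == 2, core == {"a", "b", "c"}: the pendant node d drops out.
```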
SLIDE 13

TDE: Terminology-based Document Embedding (informally)


Pipeline: preprocessing —> graph-of-words —> extract keyphrases (of all cores).

Keyphrases, scored based on their core & their edge weight: artificial pancreas (23.2), insulin pump (19.4), clinical test (14.6).

Score the sentences based on the keyphrases they contain:

  • “We develop artificial pancreas which acts like an insulin pump.” (score = 23.2 + 19.4 = 42.6)
  • “Via a clinical test, we evaluated our insulin pump.” (score = 19.4 + 14.6 = 34)

Do the math:

d = 42.6/(42.6 + 34) × s1 + 34/(42.6 + 34) × s2 ≈ 0.56 × s1 + 0.44 × s2
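The final step is just a normalized score-weighted sum of sentence vectors. A minimal sketch; the 2-dimensional sentence vectors are placeholders for real sentence embeddings:

```python
def tde_embed(sentence_vecs, scores):
    """Document vector = keyphrase-score-weighted average of sentence vectors."""
    total = sum(scores)
    dim = len(sentence_vecs[0])
    doc = [0.0] * dim
    for vec, score in zip(sentence_vecs, scores):
        w = score / total                 # 42.6/76.6 and 34/76.6 here
        for j in range(dim):
            doc[j] += w * vec[j]
    return doc

s1, s2 = [1.0, 0.0], [0.0, 1.0]           # placeholder embeddings
d = tde_embed([s1, s2], [42.6, 34.0])
# d ≈ [0.556, 0.444]
```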

SLIDE 14

TDE: Terminology-based Document Embedding (formally)

[Formal definition given as an equation on the slide, with annotated variants: all cores / only keyphrases / only 2-word keyphrases.]

SLIDE 15

Experiments

  • Baselines:
  • doc2vec: directly embed a document.
  • TWA: tf-idf weighted average of the words of the document.
  • TDE: how to represent a sentence?
  • sent2vec: learn a model to embed sentences —> TDEs2v.
  • (tf-)idf weighted average of its words —> TDEiw.

SLIDE 16

Experiments: dataset

  • Crawling websites of 68K startups (3.4M pages).
  • Filter out non-English pages, keep pages with text —> 43K startups with 2.8M pages.
  • Document = combination of some pages of the startup.

SLIDE 17

Experiments: training, evaluation & results

  • 100 documents, four domains: {medical, agriculture, energy, biology}, scores {1, …, 5}.
  • Metric: Normalized Discounted Cumulative Gain, NDCG@1, NDCG@5.
  • Train the doc2vec on the dataset.
  • Train the sent2vec on all the sentences (omitting the stopwords) of the dataset.


DCG_p = Σ_{i=1..p} rel_i / log2(i + 1)

NDCG_p = DCG_p / IDCG_p
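The metric can be implemented in a few lines (a sketch; the relevance lists below are invented, with grades in {1, …, 5} as in the annotation protocol):

```python
import math

def dcg(rels):
    """DCG_p = sum_{i=1..p} rel_i / log2(i + 1), with ranks starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(ranked_rels, p):
    """NDCG@p: DCG of the top-p ranking divided by the ideal DCG."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:p])
    return dcg(ranked_rels[:p]) / idcg if idcg else 0.0

perfect = ndcg([5, 4, 3, 2, 1], 5)    # ranking already ideal -> 1.0
worse = ndcg([3, 5, 4, 2, 1], 5)      # misordered -> strictly below 1.0
```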

SLIDE 18

Experiments: more details

In the paper: sentences —> only English, no stopwords.

In practice:

  • Multilingual parallel word2vec models (EN, FR, DE, ES).
  • Only nouns & adjectives.
  • Check that a candidate keyphrase is actually a keyphrase: language-specific regular-expression checking of POS patterns (NN-NN, NN-ADJ, …), e.g. imprimante 3D, 3D printer, satellite de communication, …
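The POS-pattern check might look like the following sketch; the whitelist of patterns and the tag names are illustrative assumptions, not the authors' actual rules:

```python
import re

# Hypothetical whitelist of POS patterns for two-word keyphrase candidates.
ALLOWED = re.compile(r"^(NN-NN|ADJ-NN|NN-ADJ|CD-NN|NN-CD)$")

def is_valid_keyphrase(tagged):
    """tagged: list of (word, pos) pairs; accept the candidate only if
    its POS sequence matches one of the allowed patterns."""
    return bool(ALLOWED.match("-".join(pos for _, pos in tagged)))

is_valid_keyphrase([("3D", "CD"), ("printer", "NN")])        # True (CD-NN)
is_valid_keyphrase([("imprimante", "NN"), ("3D", "CD")])     # True (NN-CD)
is_valid_keyphrase([("runs", "VB"), ("fast", "ADV")])        # False
```

A per-language pattern table would replace the single whitelist when handling EN, FR, DE and ES together.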

Example sentence: “The proposed method can be used to find similar documents, particularly when the technical content is concerned for finding relevant documents.”

TDEiw windows:

  • proposed method similar
  • method similar documents
  • similar documents technical
  • documents technical content
  • technical content relevant
  • content relevant documents

SLIDE 19

Experiments: examples

SKOPAI (top-3 sentences):

  • “We have a strong experience of the innovation ecosystem and how it works: research and technology transfer in academia, R&D in tech or industry corporations, venture capital or government innovation policy.”
  • “Nous construisons une plate-forme de référence pour la technologie, fournissant en temps réel une connaissance complète sur toute startup dans le monde entier.” (We are building a reference platform for technology, providing in real time complete knowledge about any startup worldwide.)
  • “Startup assessment depends on the quality and context of the person performing it – for example chief innovation officer, product managers, R&D engineers, investors, buyers, legals, etc.”

SKOPAI (top-3 similar):

  • “We build on a longstanding experience in corporate, startup, not-for-profit and public service organizations. We stand for collaboration that allows businesses to thrive. This is why we focus on enabling alliances that foster innovation and redesign business models.”
  • “We are building the only all-in-one innovation & startup ecosystem platform for the cloud power future.”
  • “Startup assessment depends on the quality and context of the person performing it – for example chief innovation officer, product managers, R&D engineers, investors, buyers, legals, etc.”

SLIDE 20

Conclusions

  • Use graph-based methods to extract similar documents.
  • The general framework is valid for any sort of sentence embedding.
  • Focus on the technical content (via keyphrases).
  • Outperform the state-of-the-art in terms of NDCG.
  • Could also be used to rank sentences.

SLIDE 21


We are hiring…

https://www.skopai.com/join-us/

Full Stack Engineer · Data Scientist

Thank you!