Discovering Similar Passages Within Large Text Documents

Demetrios Glinos
glinos@eecs.ucf.edu

The Problem Domain
§ The task is to find one or more passages in one document that are the same as or closely similar to passages in another document.
§ There can be more than one matching set of passages in a given document pair.
§ Corresponding passages may not be in the same order in each document.
§ Corresponding passages need not be identical, only similar:
  § Additions or deletions of words and phrases
  § Use of synonyms
  § Alternate grammatical constructions
§ Each passage pair, however, presents a text alignment problem.

Application Areas
§ Document deduplication
  § Example: Recognizing that two documents represent the same content when building a database of medical journal articles or abstracts retrieved from different online sources.
§ Textual entailment determination
  § Example: Recognizing that two sentences mean the same thing despite different grammatical constructions that can spoof deep parsers.
§ Plagiarism detection
  § Example: Recognizing that one document contains substantial passages that have been copied, perhaps with modifications, from another.

A Simple Example of Cut-and-Paste
§ Here, the task is simply to find the corresponding passage(s), if any.

How Difficult Can This Be?
§ Consider two 5,000-word documents that contain an identical common passage (i.e., no differences), where we know nothing about the passage, not even its length.
§ An exhaustive search must test:
  § Every valid length from 1 to 5,000
  § Every shingle of each length in each document
  § The average number of shingles per length is about 2,500
§ The result is approx. (5000)(2500)(2500) = over 30 billion passage comparisons.
§ This is O(n^3) complexity. If differences are allowed, the search is O(n^4).
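The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a shingle is a contiguous run of words, so that a document of n words has n - length + 1 shingles of a given length. The function name is illustrative and does not come from the presentation.

```python
def exhaustive_comparisons(n: int) -> int:
    """Count shingle-pair comparisons when every length from 1 to n is tried.

    Assumption: both documents have n words, and a document of n words
    has n - length + 1 contiguous shingles of a given length.
    """
    return sum((n - length + 1) ** 2 for length in range(1, n + 1))


n = 5000
# The slide's approximation: 5,000 lengths times 2,500 shingles per document.
slide_estimate = n * (n // 2) * (n // 2)
# Exact count under the assumption stated above.
exact_count = exhaustive_comparisons(n)

print(f"slide estimate: {slide_estimate:,}")  # 31,250,000,000
print(f"exact count:    {exact_count:,}")     # 41,679,167,500
```

Both figures are over 30 billion and both grow as n^3. The slide's product of averages is a rough estimate that comes out somewhat below the exact sum of squares.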

Our Approach
§ Take advantage of the fact that, despite differences, similar passages tend to have aligned concepts.
§ We borrow the Smith-Waterman dynamic programming algorithm from the bioinformatics community.
§ We extend it for large document text similarity applications by specifying:
  § Recursive descent: to support discovery of multiple passage pairs
  § Matrix splicing: for handling large documents
  § Chaining: for connecting passage components
  § Relaxed similarity measure: for identifying token matches

A simple (but actual) example

Passage A: This essay discusses Hamlet's famous soliloquy in relation to the major themes of the play.

(ROOT
  (S
    (NP (DT This) (NN essay))
    (VP (VBZ discusses)
      (NP
        (NP
          (NP (NNP Hamlet) (POS 's))
          (JJ famous) (NN soliloquy))
        (PP (IN in)
          (NP (NN relation))))
      (PP (TO to)
        (NP
          (NP (DT the) (JJ major) (NNS themes))
          (PP (IN of)
            (NP (DT the) (NN play))))))
    (. .)))

Passage B: This article discusses the famous Hamlet monologue of the main themes of the game.

(ROOT
  (S
    (NP (DT This) (NN article))
    (VP (VBZ discusses)
      (NP
        (NP (DT the) (JJ famous) (NNP Hamlet) (NN monologue))
        (PP (IN of)
          (NP
            (NP (DT the) (JJ main) (NNS themes))
            (PP (IN of)
              (NP (DT the) (NN game)))))))
    (. .)))

Concept Alignment

Passage A: This essay discusses Hamlet's famous soliloquy in relation to the major themes of the play.
Passage B: This article discusses the famous Hamlet monologue of the main themes of the game.

Passage A                    | Passage B
This essay                   | This article
discusses                    | discusses
Hamlet's famous soliloquy    | the famous Hamlet monologue
in relation to               | of
the major themes             | the main themes
of the play                  | of the game

The Smith-Waterman Algorithm
§ Uses dynamic programming to build a match matrix for the two input documents
§ Finds the maximal length alignment
§ The algorithm: [listing not preserved in this transcript]
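For readers without the original listing, the textbook Smith-Waterman recurrence can be sketched over word tokens as follows. This is a minimal sketch, not the author's implementation: the scoring values (match +2, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, and token comparison here is plain equality.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of token lists a and b.

    Returns (score, a_start, a_end, b_start, b_end) with half-open spans.
    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # the match matrix
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell holds the best score of an alignment ending there,
    # floored at zero so that an alignment can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a zero cell is reached.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, end_i, j, end_j


doc_a = "x y the quick brown fox z".split()
doc_b = "q the quick brown fox w w".split()
print(smith_waterman(doc_a, doc_b))  # (8, 2, 6, 1, 5)
```

The returned spans index the shared passage in each document: doc_a[2:6] and doc_b[1:5] are both "the quick brown fox".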

The Match Matrix
[Figure: match matrix]

Recursive Descent
§ Apply algorithm recursively to unused regions of document space
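One way to picture the recursion is sketched below. The slide does not say how the unused regions are defined, so this sketch assumes they are the four quadrants left over once the rows and columns of the best match are removed, which also allows passages to appear in a different order in each document. To stay short, it uses a simple exact-match stand-in (longest common token run) in place of Smith-Waterman. All names and the min_len threshold are hypothetical.

```python
def longest_common_run(a, b):
    """Return (length, a_start, b_start) of the longest identical token run.

    A simplified stand-in for the alignment step, used only for illustration.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def find_passages(a, b, a_off=0, b_off=0, min_len=3):
    """Collect (a_start, b_start, length) for every run of at least min_len."""
    length, i, j = longest_common_run(a, b)
    if length < min_len:
        return []
    found = [(a_off + i, b_off + j, length)]
    # Assumed definition of "unused regions": text before and after the match
    # in each document, paired in all four combinations.
    a_parts = [(a[:i], a_off), (a[i + length:], a_off + i + length)]
    b_parts = [(b[:j], b_off), (b[j + length:], b_off + j + length)]
    for a_part, a_start in a_parts:
        for b_part, b_start in b_parts:
            found += find_passages(a_part, b_part, a_start, b_start, min_len)
    return sorted(found)


doc_a = "alpha beta gamma delta n1 one two three n2".split()
doc_b = "m1 one two three m2 alpha beta gamma delta m3".split()
print(find_passages(doc_a, doc_b))  # [(0, 5, 4), (5, 1, 3)]
```

In this example the two shared passages occur in opposite order in the two documents, and the second is found only because the recursion visits the off-diagonal quadrants.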

Matrix Splicing
§ Slice to fit segment within available memory
§ Column to left of slice preserves state, allowing chains to cross boundaries

Chaining
§ Bridge gaps along diagonals if matches continue on both sides
§ Limit of 2 bridged gaps per chain

Relaxed Similarity Measure
§ Different authors and speakers often use different articles and prepositions when expressing the same concept.
§ When testing for matches while building up the match matrix:
  § Equate determiners: a, an, the
  § Also equate common prepositions: of, in, to, for, with, on, at, from, by, about, as, into, like, through, after, over, between, out, against, during, without, before, under, around, among
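The token test described above can be sketched as a small predicate. The word lists are taken from the slide. Two details are assumptions made for this sketch: tokens are compared case-insensitively, and determiners and prepositions are treated as two separate equivalence classes, so a determiner does not match a preposition. The function name is illustrative.

```python
DETERMINERS = frozenset({"a", "an", "the"})
PREPOSITIONS = frozenset({
    "of", "in", "to", "for", "with", "on", "at", "from", "by", "about",
    "as", "into", "like", "through", "after", "over", "between", "out",
    "against", "during", "without", "before", "under", "around", "among",
})


def tokens_match(x: str, y: str) -> bool:
    """Relaxed equality for two word tokens.

    Assumptions: case-insensitive comparison, and determiners and
    prepositions form two separate equivalence classes.
    """
    x, y = x.lower(), y.lower()
    if x == y:
        return True
    if x in DETERMINERS and y in DETERMINERS:
        return True
    return x in PREPOSITIONS and y in PREPOSITIONS


print(tokens_match("a", "the"))      # True
print(tokens_match("of", "in"))      # True
print(tokens_match("the", "of"))     # False
print(tokens_match("play", "game"))  # False
```

A predicate like this would replace the plain equality test used when filling the match matrix, so that "in relation to" and "of" can still contribute matching tokens.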

Test Data
§ Although not a perfect match for this algorithm, we chose the 2013 PAN text alignment test corpus, comprising:
  § 5,185 document pairs from 3,169 source and 1,826 suspect documents
  § 1,000 pairs each involving no plagiarism, no obfuscation, random obfuscation, and cyclic translation plagiarism
  § 1,185 pairs involving summary plagiarism
§ Source documents:
  § min/mean/max: 104 / 914 / 12,277 words
§ Suspect documents:
  § min/mean/max: 131 / 2,930 / 20,297 words

Aggregate Performance
§ Precision uniformly high
§ Recall for summary near nil
  § Understandable, since summaries inherently do not preserve order of concepts

Detection Counts
§ Low false alarm rate overall
§ Manual examination of a number of the detected summary cases indicates that they were largely cut-and-paste excerpts (for which concepts are aligned)
