Text Representation
http://www.cse.iitb.ac.in/~soumen/mining-the-web/
Ahmed Rafea
Text Representation
- Document Preprocessing
- Vector Space Model for Document Storage
- Measure of Similarity
Document preprocessing (1/3)
- Tokenization
- Filtering away tags
- Tokens regarded as nonempty sequences of characters, excluding spaces and punctuation
- Each token represented by a suitable integer, tid, typically 32 bits
- Optional: stemming/conflation of words
- Result: document (did) transformed into a sequence of (tid, pos) pairs
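The tokenization step above can be sketched as follows. This is a minimal illustration, not the book's implementation: the function name, the regular expression, and the growing `vocab` dictionary (token → tid) are my own choices.

```python
import re

def tokenize(text, vocab):
    """Split text into tokens (lowercased runs of letters/digits, so
    spaces and punctuation are excluded), map each token to an integer
    tid via vocab, and record its position within the document."""
    out = []
    for pos, tok in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        tid = vocab.setdefault(tok, len(vocab))  # assign next free tid
        out.append((tid, pos))
    return out

vocab = {}
seq = tokenize("Mining the Web: mining text.", vocab)
# seq = [(0, 0), (1, 1), (2, 2), (0, 3), (3, 4)]
```

Note how the repeated token "mining" reuses tid 0 but gets a new position, which is exactly the (tid, pos) sequence the slide describes.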
Document preprocessing (2/3)
- Stopwords
- Function words and connectives
- Appear in a large number of documents and are of little use in pinpointing documents
- Issues
- Queries containing only stopwords are ruled out
- Polysemous words that are stopwords in one sense but not in others
– E.g., "can" as a verb vs. "can" as a noun
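Stopword removal is a simple set-membership filter. A minimal sketch; the stopword list here is a tiny illustrative sample, not a standard list:

```python
# Tiny illustrative stopword list (real systems use lists of hundreds of words).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

def remove_stopwords(tokens):
    """Drop function words and connectives before indexing."""
    return [t for t in tokens if t not in STOPWORDS]

filtered = remove_stopwords(["the", "vector", "space", "model", "of", "text"])
# filtered = ["vector", "space", "model", "text"]
```

The polysemy issue on this slide is exactly what such a filter cannot handle: it removes every occurrence of "can", whether it is the auxiliary verb or the noun.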
Document preprocessing (3/3)
- Stemming
- Remove inflections that convey parts of speech, tense, and number
- E.g.: university and universal both stem to universe
- Techniques
- Morphological analysis (e.g., Porter's algorithm)
- Dictionary lookup (e.g., WordNet)
- Stemming may increase the number of documents in the response to a query, but at the price of precision
- It is not a good idea to stem abbreviations and names coined in the technical and commercial sectors
- E.g.: stemming "ides" to "IDE" (the hard disk standard), or "SOCKS" (the firewall protocol) to "sock" (worn on the foot), may be bad!
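A crude suffix-stripping sketch of the idea, with a protected list illustrating the caveat about abbreviations. This is emphatically not Porter's algorithm; the suffix list, the minimum-stem-length guard, and the `PROTECTED` set are my own illustrative choices.

```python
# Tiny illustrative suffix list, nothing like Porter's full rule set.
SUFFIXES = ["ation", "ing", "ly", "es", "s"]
# Abbreviations / coined names that must not be stemmed (per the slide's caveat).
PROTECTED = {"ides", "socks"}

def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of >= 3 letters;
    leave protected technical/commercial names untouched."""
    w = word.lower()
    if w in PROTECTED:
        return w
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[: -len(suf)]
    return w

crude_stem("connections")  # -> "connection"
crude_stem("socks")        # -> "socks" (protected, not conflated with "sock")
```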
The vector space model (1/4)
- Documents represented as vectors in a multi-dimensional Euclidean space
- Each axis = a term (token)
- Coordinate of document d in the direction of term t determined by:
- Term frequency TF(d,t): number of times term t occurs in document d, scaled in a variety of ways to normalize document length
- Inverse document frequency IDF(t): scales down the coordinates of terms that occur in many documents
The vector space model (2/4)
- Term frequency
- Raw counts normalized by document length:
  $TF(d,t) = \frac{n(d,t)}{\sum_{\tau} n(d,\tau)}$ or $TF(d,t) = \frac{n(d,t)}{\max_{\tau} n(d,\tau)}$
- Cornell SMART system uses a smoothed version:
  $TF(d,t) = 0$ if $n(d,t) = 0$
  $TF(d,t) = 1 + \log(1 + \log n(d,t))$ otherwise
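The SMART smoothed term frequency above translates directly into code (natural logarithm assumed, as is conventional):

```python
import math

def smart_tf(n_dt):
    """Cornell SMART smoothed term frequency:
    0 if the term is absent, else 1 + log(1 + log n(d,t))."""
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(n_dt))

smart_tf(0)   # -> 0.0
smart_tf(1)   # -> 1.0, since log(1 + log 1) = log 1 = 0
```

The double logarithm grows very slowly, so a term occurring 100 times contributes only modestly more than one occurring twice; this is the "smoothing" the slide refers to.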
The vector space model (3/4)
- Inverse document frequency
- Given: $D$ is the document collection and $D_t$ is the set of documents containing $t$
- Formulae: mostly dampened functions of $\frac{|D|}{|D_t|}$
- SMART uses $IDF(t) = \log\left(1 + \frac{|D|}{|D_t|}\right)$
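The IDF formula above, sketched with documents represented as sets of tokens (a simplifying assumption for illustration):

```python
import math

def idf(term, docs):
    """IDF(t) = log(1 + |D| / |D_t|), where D_t is the set of
    documents (here: token sets) containing the term."""
    d_t = sum(1 for d in docs if term in d)
    if d_t == 0:
        return 0.0  # convention for unseen terms (my assumption, not the slide's)
    return math.log(1.0 + len(docs) / d_t)

docs = [{"web", "mining"}, {"web", "graph"}, {"web"}]
idf("web", docs)     # appears everywhere -> log(1 + 3/3) = log 2
idf("mining", docs)  # appears once -> log(1 + 3/1) = log 4
```

As intended, the ubiquitous term "web" gets a smaller IDF than the rarer "mining".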
Vector space model (4/4)
- Coordinate of document d in axis t: $d_t = TF(d,t) \cdot IDF(t)$
- Document d transformed to vector $\vec{d}$ in the TFIDF-space
- Query q
- Interpreted as a document
- Transformed to vector $\vec{q}$ in the same TFIDF-space as $\vec{d}$
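Putting TF and IDF together, a document (or a query, treated as a document) can be mapped to a sparse TFIDF vector. A minimal sketch using raw counts for TF and the $\log(1 + |D|/|D_t|)$ IDF; the dict-of-weights representation is my own choice:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, docs):
    """Map a token list to a sparse vector {term: TF * IDF}.
    docs is the collection, each document given as a set of tokens."""
    vec = {}
    for t, n in Counter(doc_tokens).items():
        d_t = sum(1 for d in docs if t in d)  # |D_t|; >= 1 if doc is in docs
        vec[t] = n * math.log(1.0 + len(docs) / d_t)
    return vec

docs = [{"web", "mining"}, {"web"}]
vec = tfidf_vector(["web", "mining"], docs)
# vec["web"] = log 2 (common term), vec["mining"] = log 3 (rarer term)
```

A query vector is built the same way, by passing the query's token list to the same function against the same collection.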
Measures of Similarity (1/2)
- Distance measure
- Magnitude of the vector difference: $|\vec{d} - \vec{q}|$
- Document vectors must be normalized to unit ($L_1$ or $L_2$) length
- Else shorter documents dominate (since queries are short)
- Cosine similarity
- Cosine of the angle between $\vec{d}$ and $\vec{q}$
- Shorter documents are penalized
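Cosine similarity over the sparse vectors from the previous slide can be sketched as follows (the dict-based sparse representation is my assumption):

```python
import math

def cosine(d, q):
    """Cosine of the angle between sparse vectors d and q,
    each given as {term: weight}."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # convention for empty vectors (my assumption)
    return dot / (norm_d * norm_q)

cosine({"web": 2.0}, {"web": 1.0})  # parallel vectors -> 1.0
cosine({"web": 1.0}, {"sock": 1.0})  # no shared terms -> 0.0
```

Because both norms appear in the denominator, the measure depends only on the angle between the vectors, not their lengths; this is why explicit length normalization matters less here than for the distance measure.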
Measures of Similarity (2/2)
- Jaccard coefficient of similarity between documents $d_1$ and $d_2$
- $T(d)$ = set of tokens in document $d$
- $r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
- Symmetric, reflexive
- Forgives any number of occurrences and any permutations of the terms
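The Jaccard coefficient above has a direct set-based sketch:

```python
def jaccard(d1_tokens, d2_tokens):
    """r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|,
    where T(d) is the set of tokens in document d."""
    t1, t2 = set(d1_tokens), set(d2_tokens)
    if not t1 and not t2:
        return 1.0  # two empty documents are identical (convention, my assumption)
    return len(t1 & t2) / len(t1 | t2)

jaccard(["a", "b", "a"], ["b", "a"])  # -> 1.0: repeats and order are forgiven
jaccard(["a", "b"], ["b", "c"])       # -> 1/3
```

The first example shows the "forgiving" property from the slide: converting to sets discards both term counts and term order, so the two documents are judged identical.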