Text Processing: CISC489/689-010, Lecture #3, Monday, Feb. 16

3/17/09

Text Processing
CISC489/689-010, Lecture #3
Monday, Feb. 16
Ben Carterette

Indexing
• An index is a list of things (keys) with pointers to other things (items).
  – Keywords → catalog numbers (→ shelves).
  – Concepts → page numbers.
  – Terms → documents.
• Need for indexes:
  – Ease of use.
  – Speed.
  – Scalability.
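The "terms → documents" case from the Indexing slide can be sketched as a small inverted index. This is only an illustration: the function name `build_index`, the whitespace split, and the sample documents are assumptions, not part of the lecture.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term (key) to the sorted list of document ids (items) containing it.

    `docs` is a dict of {doc_id: text}. Splitting on whitespace is a
    placeholder for the real text processing steps covered later.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort the postings so that lookups return a stable order.
    return {term: sorted(ids) for term, ids in index.items()}


if __name__ == "__main__":
    sample = {1: "tropical fish", 2: "fish tank"}
    print(build_index(sample))
```

Looking up a term is then a single dictionary access, which is where the speed and ease-of-use benefits come from.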

Manual vs. Automatic Indexing
• Manual:
  – An “expert” assigns keys to each item.
  – Example: card catalog.
• Automatic:
  – Keys automatically identified and assigned.
  – Example: Google.
• Automatic indexing is as good as manual for most purposes.

Text Processing
• First step in automatic indexing.
• Converting documents into index terms.
• Terms are not just words.
  – Not all words are of equal value in a search.
  – Sometimes it is not clear where words begin and end.
    • Especially when not space-separated, e.g. Chinese, Korean.
  – Matching the exact words typed by the user doesn’t work very well in terms of effectiveness.

Text Processing Steps
• For each document:
  – Parse it to locate the parts that are important.
  – Segment and tokenize the text in the important parts to get words.
  – Remove stop words.
  – Stem words to common roots.
• Advanced processing may include phrases, entity tagging, link-graph features, and more.

Parsing
• Some parts of a document are more important than others.
• Document parser recognizes structure using markup such as HTML tags.
  – Headers, anchor text, bolded text are likely to be important.
  – JavaScript, style information, navigation links less likely to be important.
  – Metadata can also be important.
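The tokenize, stop, and stem steps listed under Text Processing Steps can be chained as in the sketch below. It assumes parsing has already produced a block of text. The stop word list is a tiny illustrative set, and `crude_stem` is a hypothetical stand-in that only strips a final "s"; it is not a real stemmer.

```python
import re

# Illustrative stop word list only; real lists are much longer.
STOP_WORDS = {"the", "of", "to", "in", "and", "a", "as", "with"}


def tokenize(text):
    """Lower-case the text and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Toy stemmer (assumption): strip one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def index_terms(text):
    """Run tokenizing, stopping, and stemming in that order."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(index_terms("Tropical fish include fish found in tropical environments"))
```

Keeping each step as a separate function makes it easy to apply exactly the same steps to queries as to documents.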

Example Wikipedia Page

Wikipedia Markup
<title>Tropical fish</title>
<text>{{Unreferenced|date=July 2008}}
{{Original research|date=July 2008}}
'''Tropical fish''' include [[fish]] found in [[Tropics|tropical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ''tropical fish'' to refer only those requiring fresh water, with saltwater tropical fish referred to as ''[[list of marine aquarium fish species|marine fish]]''. …

Wikipedia HTML

Document Parsing
• HTML pages are organized into trees.
• Nodes contain blocks of text.

<HTML>
  <HEAD>
    <TITLE> Tropical fish
    <META>
  <BODY>
    <H1> Tropical fish
    <P>
      <B> Tropical fish
      include
      <A> fish
      found in
      <A> tropical
      environments around the world
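A minimal sketch of walking such a tree with Python's standard `html.parser` module follows. The class name `BlockExtractor`, the `SKIP` and `VOID` tag sets, and the sample page are assumptions for illustration; a production parser would need to handle far messier HTML.

```python
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect (enclosing tag, text) pairs, ignoring script and style content."""

    SKIP = {"script", "style"}                 # content unlikely to be important
    VOID = {"meta", "br", "img", "link", "hr"}  # tags that never get a close tag

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.blocks = []  # extracted (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text or not self.stack:
            return
        if any(t in self.SKIP for t in self.stack):
            return
        self.blocks.append((self.stack[-1], text))


def extract_blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


PAGE = (
    "<html><head><title>Tropical fish</title><meta name='x' content='y'></head>"
    "<body><h1>Tropical fish</h1><script>var a = 1;</script>"
    "<p><b>Tropical fish</b> include <a href='/fish'>fish</a> found in "
    "<a href='/tropics'>tropical</a> environments around the world</p>"
    "</body></html>"
)

if __name__ == "__main__":
    for tag, text in extract_blocks(PAGE):
        print(tag, "|", text)
```

Recording the enclosing tag with each block lets a later stage weight title, header, bold, and anchor text differently from ordinary paragraph text.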

End Result of Parsing
• Blocks of text from important parts of page.
  – Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species. Fishkeepers often use the term “tropical fish” to refer only those requiring fresh water, with saltwater tropical fish referred to as “marine fish”.
• Next step: segmenting and tokenizing.

Tokenizing
• Forming words from sequence of characters in blocks of text.
• Surprisingly complex in English, can be harder in other languages.
• Early IR systems:
  – Any sequence of alphanumeric characters of length 3 or more.
  – Terminated by a space or other special character.
  – Upper-case changed to lower-case.

Tokenizing
• Example:
  – “Bigcorp's 2007 bi-annual report showed profits rose 10%.” becomes
  – “bigcorp 2007 annual report showed profits rose”
• Too simple for search applications or even large-scale experiments
• Why? Too much information lost
  – Small decisions in tokenizing can have major impact on effectiveness of some queries

Tokenizing Problems
• Small words can be important in some queries, usually in combinations
    • xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II
• Both hyphenated and non-hyphenated forms of many words are common
  – Sometimes hyphen is not needed
    • e-bay, wal-mart, active-x, cd-rom, t-shirts
  – At other times, hyphens should be considered either as part of the word or a word separator
    • winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
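The early IR rule (alphanumeric runs of length 3 or more, lower-cased) is easy to reproduce, which makes the information loss visible. The sketch below is an assumption-laden illustration: the name `early_ir_tokenize` is hypothetical, and the pattern covers ASCII letters and digits only.

```python
import re


def early_ir_tokenize(text):
    """Keep lower-cased runs of ASCII letters/digits that are at least 3 long."""
    return re.findall(r"[a-z0-9]{3,}", text.lower())


if __name__ == "__main__":
    print(early_ir_tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
    # Short but meaningful query words disappear entirely under this rule.
    print(early_ir_tokenize("master p, gm, j lo, xp"))
```

On the Bigcorp sentence this drops "s", "bi", and "10", and on the second input only "master" survives, which is why the rule is too lossy for queries built from short words.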

Tokenizing Problems
• Special characters are an important part of tags, URLs, code in documents
• Capitalized words can have different meaning from lower case words
  – Bush, Apple
• Apostrophes can be a part of a word, a part of a possessive, or just a mistake
  – rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

Tokenizing Problems
• Numbers can be important, including decimals
  – nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
• Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
  – I.B.M., Ph.D., cis.udel.edu
• Note: tokenizing steps for queries must be identical to steps for documents

Tokenizing Process
• Assume we have used the parser to find blocks of important text.
• A word may be any sequence of alphanumeric characters terminated by a space or special character.
  – everything converted to lower case.
  – everything indexed.
• Defer complex decisions to other components
  – example: 92.3 → 92 3 but search finds documents with 92 and 3 adjacent
  – incorporate some rules to reduce dependence on query transformation components

End Result of Tokenization
• List of words in blocks of text.
  – tropical fish include fish found in tropical environments around the world including both freshwater and salt water species fishkeepers often use the term tropical fish to refer only those requiring fresh water with saltwater tropical fish referred to as marine fish
• Next step: stopping.
• But first: text statistics.
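The "92.3 → 92 3" example from the Tokenizing Process slide can be illustrated by keeping every token in order and letting the search side test adjacency. This is a sketch under assumptions: the names `simple_tokenize` and `adjacent` are hypothetical, and the pattern handles ASCII only.

```python
import re


def simple_tokenize(text):
    """Index everything: lower-cased runs of ASCII letters/digits of any length."""
    return re.findall(r"[a-z0-9]+", text.lower())


def adjacent(tokens, first, second):
    """Return True if `first` is immediately followed by `second` in the token list."""
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


if __name__ == "__main__":
    doc_tokens = simple_tokenize("Listen to 92.3 the beat")
    query_tokens = simple_tokenize("92.3")  # queries use the same tokenizer
    print(doc_tokens)
    print(adjacent(doc_tokens, *query_tokens))
```

Because the query goes through the same function as the document, "92.3" becomes the pair ("92", "3") on both sides, and the adjacency test recovers the match that the split would otherwise lose.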

Text Statistics
• Huge variety of words used in text, but
• Many statistical characteristics of word occurrences are predictable
  – e.g., distribution of word counts
• Retrieval models and ranking algorithms depend heavily on statistical properties of words
  – e.g., important words occur often in documents but are not high frequency in collection

Zipf’s Law
• Distribution of word frequencies is very skewed
  – a few words occur very often, many words hardly ever occur
  – e.g., two most common words (“the”, “of”) make up about 10% of all word occurrences in text documents
• Zipf’s “law”:
  – observation that rank (r) of a word times its frequency (f) is approximately a constant (k)
    • assuming words are ranked in order of decreasing frequency
  – i.e., r · f ≈ k or r · Pr ≈ c, where Pr is probability of word occurrence and c ≈ 0.1 for English
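The quantities in Zipf's law can be computed directly from a token list, as in the sketch below. The function name `zipf_table` is hypothetical, and ties in frequency are ranked in first-seen order, which is an arbitrary choice.

```python
from collections import Counter


def zipf_table(tokens):
    """Return rows of (word, f, r, Pr, r * Pr), ranked by decreasing frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())  # total word occurrences
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        prob = freq / total       # Pr: probability of word occurrence
        rows.append((word, freq, rank, prob, rank * prob))
    return rows


if __name__ == "__main__":
    sample = "the fish and the tank and the reef".split()
    for row in zipf_table(sample):
        print(row)
```

On a large English collection the last column would be expected to stay roughly constant for the more frequent words; on a toy input like this one it will not.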

Zipf’s Law

Wikipedia Statistics (wiki000 subset)

Total documents                 5,001
Total word occurrences          22,545,922
Vocabulary size                 348,436
Words occurring > 1000 times    2,751
Words occurring once            163,404

Word         Freq    r          Pr (%)       r.Pr
politician   5096    510        0.023        0.116
contractor   100     14,852     4.4·10^-4    0.066
kickboxer    10      56,125     4.4·10^-5    0.025
comdedian    1       185,035    4.4·10^-6    0.008
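The r.Pr column can be rechecked from the Freq and r columns and the total number of word occurrences. The sketch below copies the four rows from the wiki000 table; the variable and function names are hypothetical.

```python
TOTAL_OCCURRENCES = 22_545_922  # total word occurrences in the wiki000 subset

# (word, Freq, r, r.Pr as printed in the table)
ROWS = [
    ("politician", 5096, 510, 0.116),
    ("contractor", 100, 14_852, 0.066),
    ("kickboxer", 10, 56_125, 0.025),
    ("comdedian", 1, 185_035, 0.008),
]


def rank_times_probability(freq, rank, total=TOTAL_OCCURRENCES):
    """Compute r * Pr, with Pr = freq / total as a fraction (not a percentage)."""
    return rank * freq / total


if __name__ == "__main__":
    for word, freq, rank, printed in ROWS:
        value = rank_times_probability(freq, rank)
        print(f"{word:<12} computed={value:.4f} printed={printed}")
```

The recomputed values agree with the printed column to within about 0.001, and they show r · Pr falling well below 0.1 for the rarest words.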

Top 50 Words from wiki000 Subset

Zipf’s Law for wiki000 Subset
[Figure; axes: Probability, Rank]
