
Structural Text Features
CISC489/689-010, Lecture #13
Monday, April 6th
Ben Carterette


Structural Features

• So far we have mainly focused on “vanilla” features of terms in documents
  – Term frequency, document frequency
  – “Bag of words” models
• Some documents have structure that we could leverage for improved retrieval
  – Natural language has structure as well
• We can derive features from this structure, especially from the placement of terms within structure or the placement of terms with respect to each other




Example: HTML

• “HyperText Markup Language”
• Provides document structure using tags enclosing text
  – <title>: enclosed text displayed at top of browser
  – <body>: enclosed text displayed in browser
  – <h1>: enclosed text displayed in large font
  – <b>: enclosed text displayed in bold
  – <a>: enclosed text can be clicked to go to another page
• The text enclosed in fields is often unstructured, or structured with more HTML




Example: HTML

• HTML pages organize into trees:

  <HTML>
    <HEAD>
      <TITLE>: Tropical fish
      <META>
    <BODY>
      <H1>: Tropical fish
      <P>: Tropical fish include fish found in tropical environments around the world
        (<B> around “Tropical fish”; <A> links on “fish” and “tropical”)

• Nodes contain blocks of text.
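A tree like this can be recovered with an off-the-shelf parser. As a minimal sketch, the following uses Python's built-in `html.parser` to list each text block together with the path of tags enclosing it; the sample page and the `BlockCollector` class are illustrative, not part of the slides.

```python
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Records (enclosing-tag-path, text) pairs while walking a page."""
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, root first
        self.blocks = []   # (path, text block) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:   # pop back to the matching open tag
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:   # keep only non-empty text blocks
            self.blocks.append(("/".join(self.stack), text))

page = ("<html><head><title>Tropical fish</title></head><body>"
        "<h1>Tropical fish</h1><p><b>Tropical fish</b> include "
        "<a href='#'>fish</a> found in <a href='#'>tropical</a> "
        "environments around the world.</p></body></html>")

collector = BlockCollector()
collector.feed(page)
for path, text in collector.blocks:
    print(path, "->", text)
```

Each text block comes out labeled with its node, e.g. `html/head/title -> Tropical fish`, which is exactly the placement information the later slides index.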


Example: Email

• Header fields provide some structure



Structure in Natural Language

• One example: parse trees

(from http://www.lancs.ac.uk/fss/courses/ling/corpus/Corpus2/2PARSE.HTM)


Hyper-Structure

• The documents themselves may occur within some structure
  – The web: documents link to each other, creating a graph structure
  – Email: threaded conversations
  – Sentences form paragraphs, paragraphs form sections, sections form chapters, chapters form books, …
• This structure may provide useful features



Using Structural Features in Retrieval

• Steps:
  – Derive features – document processing
  – Index features – using inverted lists
  – Retrieval using features – retrieval models, scoring functions, query languages


Specific Features

• Phrases:
  – Sequences of words in order
  – Users want to query phrases, e.g. “tropical fish”
• Fields and tags:
  – Markup enclosing parts of documents
  – We want to emphasize some parts and de-emphasize others, e.g. titles important, sidebars not
• Web hyper-structure:
  – Links between pages
  – We want pages that are frequently linked using the same text to score higher for queries that contain that text
• What are the features, how do we derive them, how do we store them, and how do we model them in retrieval?




Deriving and Indexing Features

• Derivation considerations:
  – Computational time and space requirements
  – Errors in processing
  – Use in queries
• Indexing considerations:
  – Fast query processing
  – Flexibility (index once with all info for calculating anything you can imagine vs. re-index every time you come up with a new idea)
  – Storage


Phrases

• Many queries are 2–3 word phrases
• Phrases are
  – More precise than single words
    • e.g., documents containing “black sea” vs. the two words “black” and “sea”
  – Less ambiguous
    • e.g., “big apple” vs. “apple”
• Can be difficult for ranking
  – e.g., given the query “fishing supplies”, how do we score documents with the exact phrase many times, the exact phrase just once, the individual words in the same sentence, the same paragraph, the whole document, variations on the words?



Phrases

• Text processing issue – how are phrases recognized?
• Three possible approaches:
  – Identify syntactic phrases using a part-of-speech (POS) tagger
  – Use word n-grams
  – Store word positions in indexes and use proximity operators in queries


POS Tagging

• POS taggers use statistical models of text to predict syntactic tags of words
  – Example tags:
    • NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”, “will”)
• Phrases can then be defined as simple noun groups, for example
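As a sketch of the "simple noun group" idea: the function below takes already-tagged (word, tag) pairs and returns maximal runs of noun-tagged words. This is the simplest possible definition; real systems often also fold in preceding adjectives. The hand-tagged sentence is illustrative rather than tagger output.

```python
def noun_groups(tagged):
    """Return maximal runs of consecutive noun-tagged words (NN, NNS, ...)."""
    groups, current = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
        elif current:              # a non-noun tag ends the current group
            groups.append(" ".join(current))
            current = []
    if current:                    # flush a group that runs to the end
        groups.append(" ".join(current))
    return groups

# Hand-tagged sample sentence (tags follow the abbreviations above)
tagged = [("tropical", "JJ"), ("fish", "NN"), ("are", "VBP"), ("found", "VBN"),
          ("in", "IN"), ("aquarium", "NN"), ("environments", "NNS"),
          ("around", "IN"), ("the", "DT"), ("world", "NN")]
print(noun_groups(tagged))  # ['fish', 'aquarium environments', 'world']
```

Note that with this noun-only definition "tropical fish" loses its adjective; including JJ tags that directly precede a noun is a common refinement.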




POS Tagging Example
(figure on slide)

Example Noun Phrases
(figure on slide)




Noun Phrase Inverted Lists

• Q = “united states”: retrieve the inverted list for the phrase “united states” and process it
• Q = united states: retrieve the inverted lists for the terms “united” and “states” and process them
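The two query types can be sketched against a toy index; the postings below are made up for illustration. Document 3 is set up to contain both words but not the phrase, so the two interpretations return different results.

```python
# Toy document-id postings (illustrative). Doc 3 contains both words
# but not adjacent, so it has no posting in the phrase list.
index = {
    "united":        {1, 2, 3, 5},
    "states":        {2, 3, 4},
    "united states": {2},          # noun-phrase inverted list
}

def retrieve(query):
    """Quoted query -> one phrase list; unquoted -> intersect term lists."""
    if query.startswith('"') and query.endswith('"'):
        return index.get(query.strip('"'), set())
    postings = [index.get(t, set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()

print(sorted(retrieve('"united states"')))  # [2]
print(sorted(retrieve('united states')))    # [2, 3]
```

Intersecting term lists here stands in for "process them"; a real system would score rather than just intersect.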


Word N-Grams

• POS tagging is too slow for large collections
• Simpler definition – a phrase is any sequence of n words – known as n-grams
  – bigram: 2-word sequence; trigram: 3-word sequence; unigram: single words
  – N-grams are also used at the character level for applications such as OCR
• N-grams are typically formed from overlapping sequences of words
  – i.e. move an n-word “window” one word at a time through the document




Word Bigrams

Tropical fish
fish include
include fish
fish found
found in
in tropical
tropical environments
environments around
around the
the world
…
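The list above comes from sliding a two-word window one word at a time; a minimal helper, generalized to any n:

```python
def ngrams(tokens, n):
    """All overlapping n-word windows, moved one word at a time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ("tropical fish include fish found in tropical "
          "environments around the world").split()
print(ngrams(tokens, 2)[:3])  # ['tropical fish', 'fish include', 'include fish']
```

An 11-token sentence yields 10 overlapping bigrams, matching the list above.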


Bigram Inverted Lists

• Though many unusual phrases are included, term statistics help ensure that they do not hurt retrieval




N-Grams

• Frequent n-grams are more likely to be meaningful phrases
• N-grams form a Zipf distribution
  – Better fit than words alone
• Could index all n-grams up to a specified length
  – Much faster than POS tagging
  – Uses a lot of storage
    • e.g., a document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
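The 3,990 figure is just the window count summed over lengths: an L-word document has L − n + 1 overlapping n-grams of each length n.

```python
# Checking the storage figure for a 1,000-word document and 2 <= n <= 5
L = 1000
total = sum(L - n + 1 for n in range(2, 6))  # n = 2, 3, 4, 5
print(total)  # 3990  (999 + 998 + 997 + 996)
```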


Google N-Grams

• Web search engines index n-grams
• Google sample: (statistics shown as a figure on the slide)
• The most frequent trigram in English is “all rights reserved”
  – In Chinese, “limited liability corporation”




Use Term Positions

• Rather than store phrases in the index directly, store term positions and locate phrases at query time
• Match phrases or words within a window
  – e.g., "tropical fish", or “find tropical within 5 words of fish”
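Query-time matching from stored positions can be sketched as below; the per-document position lists are made-up numbers for illustration.

```python
# Positions of each term within one document (illustrative numbers).
positions = {"tropical": [1, 4, 34], "fish": [2, 13, 35, 70]}

def within(t1, t2, k):
    """True if some occurrence of t2 falls within k words after t1."""
    return any(0 < p2 - p1 <= k
               for p1 in positions.get(t1, [])
               for p2 in positions.get(t2, []))

print(within("tropical", "fish", 1))   # exact phrase "tropical fish": True
print(within("fish", "tropical", 5))   # "tropical" within 5 words of "fish": True
```

With k = 1 this is an exact-phrase test; larger k gives the proximity-window operator. A real implementation would walk the two sorted lists in step rather than compare all pairs.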


Phrase Method Tradeoffs

• POS tagging:
  – Very long index time, possible errors, medium storage requirement, not very flexible
  – Fast phrase-query processing
• N-grams:
  – High storage requirement
  – More flexible, fast phrase-query processing
• Term positions:
  – Medium-low storage requirement, very flexible
  – Possibly slower query processing due to needing to calculate collection statistics




Parsing

• Basic parsing: identify which parts of documents to index, which to ignore
• Full parsing: identify and label parts of documents, maintain structure, decide which parts are relatively more important


HTML Parsing

• An HTML parser produces a DOM tree
• We want to store basic term information (tf, idf) as well as information about the nodes the term appears in

  <HTML>
    <HEAD>
      <TITLE>: Tropical fish
      <META>
    <BODY>
      <H1>: Tropical fish
      <P>: Tropical fish include fish found in tropical environments around the world
        (<B> around “Tropical fish”; <A> links on “fish” and “tropical”)




Indexing Fields

• After parsing we have:
  – <title>: tropical fish
  – <body>: tropical fish tropical fish include fish found in tropical environments around the world …
  – <h1>: tropical fish
  – <b>: tropical fish
  – <a>: fish
  – <a>: tropical
• Ideas for indexing:
  – Store field information in the inverted list.
  – Add new inverted lists for fields.
  – Use extents to keep track of fields in documents.


Field Information in Inverted Lists

• Creating the term inverted list:
  – For each document the term appears in,
    • For each field the term appears in in that document,
      – Store the term frequency within the field
• Also store the “field frequency”
  – i.e. the total number of times the term appears in each field throughout the collection
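The nested loops above can be sketched directly; the documents and field names are made up for illustration.

```python
from collections import defaultdict

# Illustrative documents: field name -> field text.
docs = {
    1: {"title": "tropical fish", "body": "tropical fish include fish"},
    2: {"body": "fish tank care"},
}

# inverted[term][doc][field] = term frequency within that field of that doc
inverted = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
# field_freq[term][field] = occurrences of term in that field, collection-wide
field_freq = defaultdict(lambda: defaultdict(int))

for doc_id, fields in docs.items():          # for each document ...
    for field, text in fields.items():       # ... each field in it ...
        for term in text.split():            # ... count the term per field
            inverted[term][doc_id][field] += 1
            field_freq[term][field] += 1

print(dict(inverted["fish"][1]))   # {'title': 1, 'body': 2}
print(field_freq["fish"]["body"])  # 3
```

The nested-dict layout stands in for the packed posting format a real index would use.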




Field Information in Inverted List: Example

(figure on slide, annotated with:)
  Document freq · <title> freq · <body> freq · <h1> freq
  tf in doc 1 · tf in <title> in doc 1 · tf in <body> in doc 1 · tf in <h1> in doc 1




Add New Inverted Lists

• Instead of storing all field information in one list, create a new list for each field the term appears in
• Adds K new inverted lists, where K = the total number of fields the term appears in

Example
(figure on slide)




Extents

• An extent is a contiguous region in a document
• Defined by a starting term position and an ending term position
  – (figure on slide: an extent from position 8 through position 36)


Using Extents to Store Fields

• Store term positions in term inverted lists
• Define an extent inverted list for each field
• Include the document number and the range of positions the extent includes
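The extent approach can be sketched as follows: term lists hold positions, each field has its own list of (doc, start, end) extents, and a term occurrence belongs to a field when its position falls inside an extent. All numbers here are made up for illustration.

```python
term_positions = {"fish": {1: [2, 8, 40]}}   # term -> doc -> positions
title_extents = [(1, 1, 3)]                  # <title> spans positions 1-3 in doc 1

def hits_in_field(term, extents):
    """(doc, position) pairs where the term falls inside a field extent."""
    hits = []
    for doc, start, end in extents:
        for pos in term_positions.get(term, {}).get(doc, []):
            if start <= pos <= end:
                hits.append((doc, pos))
    return hits

print(hits_in_field("fish", title_extents))  # [(1, 2)]
```

Because the field membership test happens at query time, the same position index supports any new field without re-indexing the terms, which is why extents are the most flexible option on the tradeoff slide.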




Field Storage Tradeoffs

• Include field info in inverted lists:
  – Storage efficient, fairly inflexible, fairly slow processing
• New lists for terms in fields:
  – Storage inefficient, more flexible, faster processing
• Field extents:
  – Storage efficient, very flexible, fairly fast processing


Anchor Text

• Anchor text is text on another page used to link to a document
• Can indicate what other people think the document is about
• Can be taken as a short summary of the document’s contents




Anchor Text Example
(figure on slide)

Indexing Anchor Text

• Simple solution:
  – Include anchor text as part of the document text
  – “Tropical” term frequency = # of times it appears in the document + # of times it appears in anchor text in documents linking to it
• Slightly more complex solution:
  – Include anchor text in fields, e.g. <anchor>
  – One field for each link to the document
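The simple solution can be sketched as below: a page's term counts are its own text plus the anchor text of links pointing at it. The pages and link tuples are invented for illustration.

```python
from collections import Counter

# Made-up pages and links: (source page, anchor text, target page).
pages = {
    "fish.html":  "tropical fish live in warm water",
    "links.html": "see this page about tropical fish",
}
links = [("links.html", "tropical fish", "fish.html")]

def term_counts(target):
    """Term frequencies = page text plus anchor text of incoming links."""
    counts = Counter(pages[target].split())
    for _source, anchor, to in links:
        if to == target:
            counts.update(anchor.split())
    return counts

print(term_counts("fish.html")["tropical"])  # 2 (once in the page, once in anchor text)
```

The slightly more complex solution would instead append each incoming anchor as its own <anchor> field, keeping the counts separable at scoring time.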




Inverted Lists at Google

• As of 1998, Google stored the following:
  – Whether a term occurrence is “plain” or “fancy”
    • “Fancy” = occurs in URL, title, anchor text, or meta tag
    • “Plain” = everything else
  – If plain, store:
    • Whether capitalized, font size information, and position information (in 1 bit, 3 bits, and 12 bits respectively)
  – If fancy, store:
    • Whether capitalized, maximum font size, type of hit, and position information (in 1 bit, 3 bits, 4 bits, and 8 bits respectively)
    • And if type = anchor, split the 8 position bits into 4 docID bits and 4 position bits
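The plain-hit encoding can be sketched as a 16-bit pack. The field widths (1 + 3 + 12) follow the slide; the exact bit ordering is an assumption for illustration.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 cap bit, 3 font bits, 12 position bits.
    Bit layout (high to low) is assumed, not documented on the slide."""
    assert 0 <= font_size < 8 and 0 <= position < 4096  # field ranges
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(True, 3, 108)   # e.g. the lower-case/body example below
print(unpack_plain_hit(hit))         # (True, 3, 108)
```

The 12-bit position field caps out at 4,095, which is why such schemes need a convention for occurrences beyond that point (e.g. clamping to the maximum).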


Inverted Lists at Google

• Example: “tropical” occurs 3 times in a document
  – Once capitalized in the title at position 1
  – Once capitalized in a header at position 4
  – Once in lower-case in body text at position 108
• Also occurs in 2 other linking documents
• The Google inverted list might look like this:

  Fancy hit 1 (title) · Fancy hit 2 (header) · Plain hit · Anchor hit 1 · Anchor hit 2