Stemming: VSM, session 10. CS6200: Information Retrieval

Dec 29, 2023 •306 likes •407 views


Stemming
VSM, session 10
CS6200: Information Retrieval
Slides by: Jesse Anderton
Stemming
A stemming algorithm converts words to their root forms (“stems”) in order to focus on their underlying meaning.
‣ It works well in English or in Arabic, with many nouns and verbs deriving from a common root.
‣ It works worse in German or Turkish, which compose very long words with complex meanings.
It’s common to add the root to the query’s term vector (while leaving the unstemmed form present).

Arabic words that stem to ktb:
kitab: a book
kitabi: my book
alkitab: the book
kitabuki: your book (f)
kitabuka: your book (m)
kitabuhu: his book
kataba: to write
maktaba: library, bookstore
maktab: office
çekoslovakyalılaştıramadıklarımızdanmışsınız
“(it is speculated that) you had been one of those whom we could not convert to a Czechoslovakian.”
(A common example of a Turkish word, demonstrating agglutinative languages.)
Stemming Algorithms
Two major families of stemming algorithms exist:
‣ Dictionary-based stemmers use lists of related words.
‣ Algorithmic stemmers use some algorithm to derive related words.
A simple algorithmic stemmer for English may remove the suffix -s:
‣ cats → cat, lakes → lake, plays → play
‣ But many false negatives: supplies → supplie
‣ And some false positives: ups → up
Producing high quality rules is very challenging.
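The suffix -s rule above can be tried out directly. The following is a minimal sketch, not a production stemmer; the function name `strip_s` is a hypothetical helper introduced only for illustration.

```python
def strip_s(word: str) -> str:
    """Naive algorithmic stemmer: drop one trailing 's' (illustrative only)."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word


if __name__ == "__main__":
    # The words are the examples from the slide.
    for w in ["cats", "lakes", "plays", "supplies", "ups"]:
        print(w, "->", strip_s(w))
```

Running it reproduces both failure modes: "supplies" becomes "supplie" and so no longer matches "supply", while "ups" is conflated with "up".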
Porter Stemmer
The Porter Stemmer was developed in the 70’s, and consists of a large series of rules that are applied repeatedly until only the stem is left. It is fairly effective, though it makes many categorical errors. Its complexity makes it hard to modify, though the porter2 stemmer fixes some of its problems. It outputs stems, not recognizable words.
[Figure: Porter Stemmer, step 1 of 5]
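To give a flavour of what such rules look like, here is a sketch of only the plural-handling sub-step (step 1a) as given in Porter's published algorithm: the longest matching suffix among SSES, IES, SS and S is rewritten. The function name is hypothetical, and the remaining steps (which depend on a measure of the stem's consonant-vowel structure) are deliberately left out.

```python
def porter_step_1a(word: str) -> str:
    """Sketch of Porter step 1a only; rules are checked longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (cats -> cat)
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", porter_step_1a(w))
```

The output "poni" illustrates why Porter stems are often not recognizable words.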
Krovetz Stemmer
The Krovetz Stemmer is a hybrid of dictionary and algorithmic methods. It first checks the dictionary. If the word is not found, it tries to remove suffixes and then checks the dictionary again. It produces recognizable words, unlike the Porter stemmer. Its effectiveness is comparable to the Porter stemmer. It has a lower false positive rate, but a somewhat higher false negative rate.
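The check-dictionary, strip-suffix, check-again loop can be sketched as follows. This is not the actual Krovetz rule set: the dictionary `DICTIONARY`, the rule list `SUFFIX_RULES`, and the function `hybrid_stem` are all toy assumptions, and returning the word unchanged when no rule yields a dictionary entry is likewise an assumption of this sketch.

```python
# Toy dictionary and suffix rules (both hypothetical, for illustration only).
DICTIONARY = {"supply", "cat", "play", "up", "ups", "walk", "lake"}
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]


def hybrid_stem(word: str) -> str:
    """Dictionary check first, then suffix removal followed by a second check."""
    if word in DICTIONARY:
        return word  # already a known word: leave it alone
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in DICTIONARY:
                return candidate  # accept only recognizable words
    return word  # nothing matched: give up and keep the original form


if __name__ == "__main__":
    for w in ["supplies", "ups", "lakes", "walking", "xyzs"]:
        print(w, "->", hybrid_stem(w))
```

Because a candidate is accepted only if it is in the dictionary, the output is always either a known word or the untouched input, which is consistent with the lower false positive rate noted above.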
Stemmer Comparison
Stem Classes
A given stemming algorithm creates stem classes of words which are stemmed to the same root. These classes are generally too large and varied in meaning to use for query expansion, but they can be narrowed down using term co-occurrence statistics. The assumption is that terms which tend to appear in the same document are more likely to be related (or interchangeable).
[Figure: Stem classes, before and after term co-occurrence thinning is applied]
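A small sketch of the idea, under several assumptions not taken from the slides: a crude hypothetical stemmer (`toy_stem`, which keeps the first five letters), a four-document toy collection, Dice's coefficient as the co-occurrence statistic, and a threshold of 0.5. Words in a stem class are linked when their Dice score reaches the threshold, and each connected group becomes a thinned class.

```python
from collections import defaultdict
from itertools import combinations


def toy_stem(word):
    """Hypothetical, deliberately crude stemmer: keep the first five letters."""
    return word[:5]


def stem_classes(vocabulary):
    """Group words by the stem they map to."""
    classes = defaultdict(set)
    for w in vocabulary:
        classes[toy_stem(w)].add(w)
    return dict(classes)


def dice(a, b, docs):
    """Dice coefficient over document-level co-occurrence counts."""
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    return 2 * n_ab / (n_a + n_b) if (n_a + n_b) else 0.0


def thin_class(words, docs, threshold=0.5):
    """Split a stem class into connected groups of strongly co-occurring words."""
    neighbours = {w: {w} for w in words}
    for a, b in combinations(sorted(words), 2):
        if dice(a, b, docs) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, result = set(), []
    for w in sorted(words):
        if w in seen:
            continue
        component, stack = set(), [w]
        while stack:  # depth-first walk over the co-occurrence links
            x = stack.pop()
            if x not in component:
                component.add(x)
                stack.extend(neighbours[x] - component)
        seen |= component
        result.append(component)
    return result


# Toy collection: each document is a set of terms.
DOCS = [
    {"police", "policing", "crime"},
    {"policy", "policies", "government"},
    {"police", "policing", "officer"},
    {"policy", "policies", "tax"},
]

if __name__ == "__main__":
    vocab = set().union(*DOCS)
    big_class = stem_classes(vocab)["polic"]
    print(sorted(big_class))
    print([sorted(c) for c in thin_class(big_class, DOCS)])
```

In this toy collection the single class {police, policies, policing, policy} splits into {police, policing} and {policies, policy}, since the two groups never share a document.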
Wrapping Up
Adding words from the query terms’ stem classes is an effective way to improve document matching. Many stemming algorithms exist; the Porter and Krovetz stemmers are commonly used, but there are many other popular stemmers (e.g. the Snowball stemmer, with variants for many languages). Next, we’ll discuss term co-occurrence statistics, which can be used to fix stem classes and identify other related words to add to the query vector.
