SLIDE 1
Text Processing
Information Retrieval Inf 141 / CS 121
Tokenization
Break the input into words
Character stream -> token stream
Called a tokenizer / lexer / scanner
Compiler front-end
Lexer hooks up to parser
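A minimal sketch of the character stream -> token stream step, in Python. The token definition used here (a maximal run of ASCII letters or digits, lowercased) and the names TOKEN_PATTERN and tokenize are assumptions made for illustration; the slides do not fix a particular definition.

```python
import re

# Assumption: a token is a maximal run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Turn a character stream (a string) into a list of lowercase tokens."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


print(tokenize("Lexer hooks up to parser"))
# ['lexer', 'hooks', 'up', 'to', 'parser']
```

Punctuation and whitespace act only as separators here, so "input," and "input" produce the same token.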
character stream into tokens
specific grammars
[Figure: state diagram of a tokenizer, with Start and End states and transitions labeled space, a-z, and \n]
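The diagram labels above (Start, space, a-z, \n, End) suggest a small finite-state scanner. The sketch below is one possible reading, not a reconstruction of the original figure: the state names START and IN_WORD, the function name fsm_tokenize, and the choice to stop at the first newline are all assumptions.

```python
def fsm_tokenize(chars):
    """Scan characters one at a time with a two-state machine."""
    START, IN_WORD = "start", "in_word"  # hypothetical state names
    state = START
    current = []  # letters of the token being built
    tokens = []
    for ch in chars:
        if ch == "\n":
            # Newline leads to the End state: stop reading input.
            break
        if "a" <= ch <= "z":
            # A letter starts or extends a token.
            current.append(ch)
            state = IN_WORD
        elif ch == " ":
            # A space closes the open token, if any.
            if state == IN_WORD:
                tokens.append("".join(current))
                current = []
            state = START
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if current:
        # Emit the token that was still open when input ended.
        tokens.append("".join(current))
    return tokens


print(fsm_tokenize("break the input\n"))
# ['break', 'the', 'input']
```

Unlike a regular-expression tokenizer, this version makes each state transition explicit, which is closer to how a hand-written lexer is structured.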
languages
systems
suffix
dictionary form of the word, called a lemma
caresses -> caress
ponies -> poni
caress -> caress
cats -> cat
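The four examples above are consistent with the plural-stripping rules in Step 1a of the Porter stemming algorithm (sses -> ss, ies -> i, ss -> ss, s -> nothing). The extracted slide text does not name the algorithm, so treat that attribution as an assumption; the function name strip_plural_suffix is also made up for this sketch.

```python
def strip_plural_suffix(word):
    """Apply the first matching suffix rule; rules are tried longest first."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss
    if word.endswith("ies"):
        return word[:-2]  # ies -> i
    if word.endswith("ss"):
        return word       # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]  # s -> (removed)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural_suffix(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The order of the rules matters: checking "ss" before "s" is what keeps "caress" from being cut down to "cares". The output "poni" is not a dictionary word, which is the usual difference between a stem and a lemma.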
morphological analysis