SLIDE 1

Sections: Basics of Lemmatization, Lemmatization Algorithm, Experimental Setup and Results

IndiLem@FIRE-MET-2014 : An Unsupervised Lemmatizer for Indian Languages

Abhisek Chakrabarty

Indian Statistical Institute, Kolkata

December 6, 2014

SLIDE 2

Outline: the task of a lemmatizer and the need for it; the proposed lemmatization approach; results and error analysis.

SLIDE 5

What is lemmatization and why is it needed?

Lemmatization returns the base (dictionary) form of a word in its context; this form is called the lemma of the word in that context. The lemma is needed to look up the meaning of the word in that context. For example, in ‘I retrieved that document.’ the lemma of retrieved is retrieve. If retrieved cannot be mapped to retrieve, its meaning cannot be looked up.

SLIDE 10

Need for lemmatization in Indian languages

Major Indian languages (Hindi, Bengali, Gujarati, etc.) are morphologically very rich and predominantly suffixing. Knowledge resources for these languages (dictionaries, WordNet) usually store root words together with their morphological and semantic descriptions. Raw text such as stories, newspapers and poems, however, contains many inflected word forms. To obtain their meaning and morphological properties, the appropriate root word has to be determined, which is the task of a lemmatizer. Lemmatization is therefore necessary for building many NLP tools for Indian languages, such as word sense disambiguation (WSD) systems and translation systems.

SLIDE 15

Difference from a stemmer

A stemmer operates on a single word without knowledge of its context. It usually returns the portion common to the variant word forms, and this stem may not be a valid word. The lemma of a word, by contrast, may differ from one context to another, and it must be a valid word of the language. For the words retrieved, retrieval and retrieving, a stemmer may return retriev as the stem, since it is the common portion of all three forms, but a lemmatizer should return retrieve.
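
As a rough illustration of the contrast, the sketch below treats the stem as the longest common prefix of the three word forms from the slide. Real stemmers use more elaborate rules; the one-entry `lexicon` is a hypothetical stand-in for a dictionary and is not part of the original deck.

```python
import os.path

# Word forms from the slide; the stem is approximated by their
# longest common prefix (a simplifying assumption).
forms = ["retrieved", "retrieval", "retrieving"]
stem = os.path.commonprefix(forms)

# Hypothetical one-entry lexicon standing in for a real dictionary.
lexicon = {"retrieve"}

print(stem)             # retriev
print(stem in lexicon)  # False: the stem is not a valid word
```

The lemma retrieve is in the lexicon but is not a prefix of retrieval, which is why prefix truncation alone cannot produce it.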

SLIDE 19

Proposed Lemmatization Algorithm

Our lemmatization algorithm requires a dictionary or WordNet from which the root words of a language are collected. At first, the root words are stored in a trie. Each node in the trie corresponds to a Unicode character of the language. A node at which the final character of a root word is reached is marked as a ‘final’ node.
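
A minimal sketch of such a trie follows. The names `TrieNode` and `build_trie` are illustrative assumptions, not taken from the authors' implementation, and romanized sample roots are used in place of Bengali strings.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps one Unicode character to the child node
        self.final = False   # True if a root word ends at this node


def build_trie(roots):
    """Insert every root word, one character per trie level."""
    root = TrieNode()
    for word in roots:
        node = root
        for ch in word:      # iterating a str yields Unicode code points
            node = node.children.setdefault(ch, TrieNode())
        node.final = True    # mark the node of the last character
    return root


# Romanized sample roots (assumed for illustration).
trie = build_trie(["angsh", "angshu", "angshuk"])

node = trie
for ch in "angsh":
    node = node.children[ch]
print(node.final)            # True: "angsh" is a stored root
```

Because the final flag lives on the node, a root that is a prefix of a longer root (here angsh and angshu) needs no extra storage.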

SLIDE 25

Examples

‘অংশ’/‘angsh’ = ‘অ’/‘a’ + ‘ং’/‘ng’ + ‘শ’/‘sh’
‘অংশু’/‘angshu’ = ‘অংশ’/‘angsh’ + ‘ু’/‘u’
‘অংশুক’/‘angshuk’ = ‘অংশু’/‘angshu’ + ‘ক’/‘k’
‘অংশুধর’/‘angshudhar’ = ‘অংশু’/‘angshu’ + ‘ধ’/‘dha’ + ‘র’/‘r’
‘অংশগত’/‘angshgata’ = ‘অংশ’/‘angsh’ + ‘গ’/‘ga’ + ‘ত’/‘ta’
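
The decomposition above is at the level of Unicode code points, so vowel signs and the anusvara get trie nodes of their own. A small sketch, assuming the word ‘angshu’ is encoded with the four code points written as escapes below:

```python
import unicodedata

# 'angshu', written with escapes so the code points are unambiguous
# (this particular encoding of the word is an assumption).
word = "\u0985\u0982\u09b6\u09c1"

for ch in word:
    # One line per code point, i.e. one trie level per line.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Iterating over a Python `str` yields code points, which matches the one-node-per-Unicode-character layout described for the trie.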

SLIDE 30

Searching strategy in the trie

To find the lemma of a surface word, the trie is navigated starting from the initial node. Navigation ends either when the whole word has been found in the trie or when, after some portion of the word, there is no further path to follow. Several situations can arise during navigation, and the lemma is decided according to which one occurs.

SLIDE 34

Searching strategy in the trie (continued)

If the surface word is itself a root word, navigation reaches a final node. If the surface word is not a root word, the trie is navigated up to the node where the surface word ends completely or where there is no further path to follow. We call this node the end node.
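
The navigation step can be sketched as follows. The function name `navigate` and its return value (the list of visited nodes) are assumptions made for illustration; the romanized words stand in for Bengali strings.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.final = False


def build_trie(roots):
    root = TrieNode()
    for word in roots:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.final = True
    return root


def navigate(trie, word):
    """Follow `word` from the initial node and return the nodes visited.

    path[0] is the initial node and path[-1] is the end node: the walk
    stops when the word is used up or no matching child exists.
    """
    path = [trie]
    for ch in word:
        nxt = path[-1].children.get(ch)
        if nxt is None:
            break
        path.append(nxt)
    return path


trie = build_trie(["angsh", "angshidaar"])

path = navigate(trie, "angsh")       # the word is itself a root
print(len(path) - 1, path[-1].final)  # 5 True

path = navigate(trie, "angsher")     # no path after "angsh"
print(len(path) - 1)                  # 5
```

Returning the whole path, rather than only the end node, keeps the final nodes met along the way available for the decision that follows.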

SLIDE 38

Searching strategy in the trie (continued)

Two different cases may now occur.

1. If one or more final nodes are found on the path from the initial node to the end node, pick the final node that is closest to the end node. The word represented by the path from the initial node to the picked final node is taken as the lemma.

SLIDE 42

Examples

Consider two inflected words, ‘অংশের’/‘angsher’ and ‘অংশীদারের’/‘angshidaarer’. ‘অংশের’/‘angsher’ comes from ‘অংশ’/‘angsh’. ‘অংশীদারের’/‘angshidaarer’ comes from ‘অংশীদার’/‘angshidaar’.
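
Both words fall under the case where a final node lies on the navigated path. The sketch below traces them using the romanized forms from the slide in place of the Bengali strings; `lemma_on_path` is a hypothetical helper name, not the authors' code.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.final = False


def build_trie(roots):
    root = TrieNode()
    for word in roots:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.final = True
    return root


def lemma_on_path(trie, word):
    """Return the root ending at the final node closest to the end node,
    or None if no final node lies on the navigated path."""
    node, best = trie, None
    for i, ch in enumerate(word):
        node = node.children.get(ch)
        if node is None:
            break
        if node.final:
            best = word[:i + 1]   # a later final node is closer to the end node
    return best


trie = build_trie(["angsh", "angshidaar"])
print(lemma_on_path(trie, "angsher"))       # angsh
print(lemma_on_path(trie, "angshidaarer"))  # angshidaar
```

For the second word two final nodes lie on the path (angsh and angshidaar), and the one nearer the end node wins.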

SLIDE 47

Searching strategy in the trie (continued)

2. If no root word is found on the path from the initial node to the end node, find the final node in the trie that is closest to the end node. When a single final node is closest, the word represented by the path from the initial node to it is taken as the lemma. If more than one final node is found at the closest distance, pick all of them, and generate the root words represented by the paths from the initial node to the picked final nodes.
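
One way to read ‘closest’ is as distance in trie edges, moving to children or back to the parent. The sketch below implements that reading with a level-by-level breadth-first search; the distance measure, the `parent` and `prefix` fields, and the sample roots are all assumptions made for illustration.

```python
class TrieNode:
    def __init__(self, parent=None, prefix=""):
        self.parent = parent
        self.prefix = prefix     # string spelled from the initial node
        self.children = {}
        self.final = False


def build_trie(roots):
    root = TrieNode()
    for word in roots:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode(node, node.prefix + ch)
            node = node.children[ch]
        node.final = True
    return root


def nearest_roots(end_node):
    """Return every root word whose final node is at minimum edge
    distance from end_node (search over child and parent links)."""
    seen = {id(end_node)}
    frontier = [end_node]
    while frontier:
        found = [n.prefix for n in frontier if n.final]
        if found:
            return found         # all final nodes at the closest distance
        nxt = []
        for n in frontier:
            neighbours = list(n.children.values())
            if n.parent is not None:
                neighbours.append(n.parent)
            for m in neighbours:
                if id(m) not in seen:
                    seen.add(id(m))
                    nxt.append(m)
        frontier = nxt
    return []


# Hypothetical roots chosen so that two final nodes tie.
trie = build_trie(["shuna", "shuni"])
end = trie
for ch in "shun":
    end = end.children[ch]
print(nearest_roots(end))        # ['shuna', 'shuni']
```

When several roots tie, as here, all of them are returned so that a later selection step can choose among them.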

SLIDE 52

Searching strategy in the trie (continued)

Finally, among the generated root words, pick the one(s) with the maximum overlapping prefix length with the surface word. By the ‘overlapping prefix length’ of two words we mean the length of their longest common prefix. If more than one root is still selected at this stage, choose any one of them arbitrarily as the lemma: it is very rare to have more than one root word at this stage, and when it happens, all of them are viable candidates.
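
The selection step can be sketched as below. The helper names and the candidate list are hypothetical, and taking the first of several tied roots is just one way of choosing arbitrarily.

```python
import os.path


def overlap(a, b):
    """Overlapping prefix length: length of the longest common prefix."""
    return len(os.path.commonprefix([a, b]))


def pick_lemma(surface, candidates):
    best = max(overlap(surface, c) for c in candidates)
    tied = [c for c in candidates if overlap(surface, c) == best]
    return tied[0]   # any tied root is acceptable; the first is taken here


print(overlap("shune", "shuna"))                # 4
print(pick_lemma("shune", ["shona", "shuna"]))  # shuna
```

Lengths here are counted in characters of the romanized strings; on Bengali text the same function would count Unicode code points.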

SLIDE 57

Examples

Consider the dictionary root words ‘শূনা’/‘shuna’, ‘শূনানি’/‘shunani’ and ‘শূনানো’/‘shunano’. Now take the inflected word ‘শূনে’/‘shune’, which actually comes from ‘শূনা’/‘shuna’.

SLIDE 61

Results

The results obtained by our lemmatization system on the Bengali data, as evaluated in the Morpheme Extraction Task at FIRE 2014, are given in the following table.

TOTAL Precision: 56.19%
TOTAL Recall: 65.08%
TOTAL F1-measure: 60.31%
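As a quick consistency check on these figures, the short Python sketch below recomputes the F1-measure from the reported precision and recall. It assumes the task uses the standard F1 definition (the harmonic mean of precision and recall); the function name `f1_score` is purely illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values, in percent.
f1 = f1_score(56.19, 65.08)
print(f"{f1:.2f}%")
```

Under that assumption the recomputed value agrees with the reported 60.31%.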

SLIDE 64

Source of Lemmatization Errors

Compound words and out-of-vocabulary words are not handled by our algorithm. Root words are taken from a dictionary, so poor coverage of the dictionary used will cause errors. However, as no good language-independent lemmatizer exists for Indian languages, we hope our effort is a positive contribution.

SLIDE 67

Questions?

Searching strategy in the trie (continued)

If the surface word is itself a root word, then we reach a final node. If the surface word is not a root word, then the trie is navigated up to the node where either the surface word is completely consumed or there is no further path to follow. We call this node the end node.
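The walk to the end node can be sketched as below. This is a minimal Python illustration, not the authors' code: the class `TrieNode` and the functions `build_trie` and `find_end_node` are hypothetical names, and the words are the transliterated forms from the example slides, stored character by character.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        # True if the path from the initial node spells a dictionary root word.
        self.is_final = False


def build_trie(roots):
    """Insert every dictionary root word into a character trie."""
    top = TrieNode()
    for word in roots:
        node = top
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_final = True
    return top


def find_end_node(top, surface):
    """Follow `surface` through the trie; return (end_node, matched_prefix)."""
    node, depth = top, 0
    for ch in surface:
        nxt = node.children.get(ch)
        if nxt is None:  # no path to navigate: stop here
            break
        node, depth = nxt, depth + 1
    return node, surface[:depth]


trie = build_trie(["shuna", "shunani", "shunano"])
end, matched = find_end_node(trie, "shune")
print(matched, end.is_final)
```

For the surface word ‘shune’ the walk stops after ‘shun’, because no edge labelled ‘e’ leaves that node, and the end node is not a final node. For ‘shuna’ the whole word is consumed and the end node is final.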

Searching strategy in the trie (continued)

Now two different cases may occur. In the first case, one or more final nodes are found on the path from the initial node to the end node; we then pick the final node that is closest to the end node. The word represented by the path from the initial node to the picked final node is taken as the lemma.
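This case can be illustrated without building the trie explicitly. Every final node on the walked path corresponds to a prefix of the surface word that is itself a root word, so the final node closest to the end node is the longest such prefix. The Python sketch below relies on that equivalence; the function name `closest_root_on_path` and the toy English strings are assumptions made for illustration only.

```python
from typing import Optional


def closest_root_on_path(surface: str, roots: set) -> Optional[str]:
    """Longest prefix of `surface` that is a dictionary root word, if any.

    This equals the final node closest to the end node on the trie path.
    Returns None when no final node lies on the path.
    """
    for end in range(len(surface), 0, -1):
        if surface[:end] in roots:
            return surface[:end]
    return None


toy_roots = {"go", "good"}  # toy data, not taken from the slides
print(closest_root_on_path("goods", toy_roots))
print(closest_root_on_path("going", toy_roots))
```

With these toy roots, ‘goods’ passes through the final nodes ‘go’ and ‘good’ and the closer one, ‘good’, is chosen, while ‘going’ yields ‘go’. A return value of None means no final node lies on the path, so this case does not apply.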
