Natural Language Processing
Lecture 2: Words and Morphology
The shape of Words to Come
are simpler, morphologically speaking, than most languages.
meaning
automaton → automata
he, him, his, …
speech)
depending on the base it attaches to
word appear to be different words
essential) in NLP tasks
Level                                 hugging         panicked          foxes
Lexical form                          hug +V +Prog    panic +V +Past    fox +N +Pl / fox +V +Sg
Morphemic form (intermediate form)    hug^ing#        panic^ed#         fox^s#
Orthographic form (surface form)      hugging         panicked          foxes
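The three levels in the table can be sketched in code. This is a minimal illustration, not the lecture's implementation: the function names, the tag-to-suffix table, and the three regular-expression spelling rules (e-insertion, k-insertion, consonant doubling) are assumptions chosen to cover only the examples shown here.

```python
import re

# Hypothetical tag-to-suffix table covering only the examples in the table above.
SUFFIXES = {
    ("+V", "+Prog"): "ing",
    ("+V", "+Past"): "ed",
    ("+N", "+Pl"): "s",
    ("+V", "+Sg"): "s",
}


def to_morphemic(lexical: str) -> str:
    """Lexical form -> morphemic form, e.g. 'fox +N +Pl' -> 'fox^s#'."""
    stem, *tags = lexical.split()
    return f"{stem}^{SUFFIXES[tuple(tags)]}#"


def to_surface(morphemic: str) -> str:
    """Morphemic form -> orthographic form, e.g. 'fox^s#' -> 'foxes'."""
    s = morphemic
    # e-insertion: a sibilant-final stem takes 'e' before the suffix -s.
    s = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1^es#", s)
    # k-insertion: a stem ending in vowel + 'c' takes 'k' before -ed / -ing.
    s = re.sub(r"([aeiou]c)\^(ed|ing)#", r"\1k^\2#", s)
    # Consonant doubling: a one-vowel stem ending in a single consonant
    # doubles that consonant before -ed / -ing (simplified rule).
    s = re.sub(r"^([^aeiou]*[aeiou])([bdgmnprt])\^(ed|ing)#", r"\1\2\2^\3#", s)
    # Erase the morpheme boundary (^) and word boundary (#) symbols.
    return s.replace("^", "").replace("#", "")


for lexical in ["hug +V +Prog", "panic +V +Past", "fox +N +Pl"]:
    morphemic = to_morphemic(lexical)
    print(lexical, "->", morphemic, "->", to_surface(morphemic))
```

Each `re.sub` call plays the role of one spelling rule applied to the intermediate form; in a finite-state system each rule would instead be its own transducer.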
morphemic form as intermediate representation)
the morphemic form as intermediate representation)
but also in speech applications, surface syntactic parsing, etc.
the FSA or FSM
cannot do analysis or generation
Accept them!
s ∈ Σ* ... ...
state to state, consuming symbols from the input tape, until we reach the final state
string? Is there a path through the network that…
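Acceptance by a finite-state automaton can be sketched as follows. This assumes a deterministic machine stored as a dictionary from (state, symbol) pairs to next states; the state names and the toy lexicon (hug, hugs) are made up for illustration.

```python
def accepts(transitions, start, finals, string):
    """Return True if the deterministic FSA accepts the whole string."""
    state = start
    for symbol in string:
        key = (state, symbol)
        if key not in transitions:
            # No arc for this symbol: there is no path through the network.
            return False
        state = transitions[key]
    # The input tape is consumed; accept only if we stopped in a final state.
    return state in finals


# Toy machine (an assumption, not from the slides) accepting "hug" and "hugs".
TRANSITIONS = {
    ("q0", "h"): "q1",
    ("q1", "u"): "q2",
    ("q2", "g"): "q3",
    ("q3", "s"): "q4",
}
FINALS = {"q3", "q4"}

print(accepts(TRANSITIONS, "q0", FINALS, "hugs"))  # True
print(accepts(TRANSITIONS, "q0", FINALS, "hu"))    # False
```

Note that this only answers yes or no for a string; it produces no analysis, which is what transducers add.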
I am no longer accepting the things I cannot change.
1. Table
2. Trie
3. Finite-state transducer
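As an illustration of the second option, here is a minimal trie sketch, assuming the three options are ways of storing a word list for lookup. The class and function names are hypothetical.

```python
class TrieNode:
    """One node per prefix; children are keyed by the next character."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def insert(root, word):
    """Add a word to the trie, creating nodes along its path as needed."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def contains(root, word):
    """Return True only if the exact word was inserted (not just a prefix)."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


root = TrieNode()
for w in ["fox", "foxes", "hug"]:
    insert(root, w)

print(contains(root, "foxes"))  # True
print(contains(root, "fo"))     # False
```

Words that share a prefix, such as fox and foxes, share a path, so lookup time depends on word length rather than lexicon size.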
[Figure: finite-state transducer diagram; arc label s : t, states q0, q2, q3, q4]
uygarlaştıramadıklarımızdanmışsınızcasına
“(behaving) as if you are among those whom we were not able to civilize”
uygar “civilized”
+laş “become”
+tır “cause to”
+ama “not able”
+dık past participle
+lar plural
+ımız first person plural possessive (“our”)
+dan ablative case (“from/among”)
+mış past
+sınız second person plural (“y’all”)
+casına finite verb → adverb (“as if”)
automatically—they must be inserted.
sections, each implemented with a separate FST:
representation) and surface representation
[Figure: finite-state transducer diagram; states q0, q2, q3, q4, q5, with ε arcs]
allow you to compile rewrite rules into FSTs
alternations, you would be more likely to write rules in a notation similar to the rule on the preceding slide.
composed with other FSTs
code to make the process tractable.
parse generate
x[T ∘ S]z iff x[T]y and y[S]z; effectively equivalent to feeding an input to T, collecting the output from T, feeding this output to S and collecting the output from S.
T · S such that x₁x₂[T · S]y₁y₂ iff x₁[T]y₁ and x₂[S]y₂.
ϵ[T*]ϵ, and if w[T*]y and x[T]z then wx[T*]yz; x[T*]y only holds if one of these two conditions holds.
x[T ∪ S]y iff x[T]y or x[S]y.
x[T ∩ S]y iff x[T]y and x[S]y. FSTs are not closed under intersection.
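The definitions above can be tried out on small examples. The sketch below is an assumption-laden simplification: it models a transducer by the finite relation it computes (a Python set of (input, output) string pairs) rather than by states and arcs, and the function names are made up. The closure is cut off at a fixed number of repetitions because the full T* is infinite.

```python
def compose(T, S):
    """x[T ∘ S]z iff x[T]y and y[S]z for some intermediate string y."""
    return {(x, z) for (x, y1) in T for (y2, z) in S if y1 == y2}


def concat(T, S):
    """x1x2[T · S]y1y2 iff x1[T]y1 and x2[S]y2."""
    return {(x1 + x2, y1 + y2) for (x1, y1) in T for (x2, y2) in S}


def closure(T, max_repeats):
    """Finite approximation of T*: zero up to max_repeats repetitions of T."""
    result = {("", "")}          # ϵ[T*]ϵ
    layer = {("", "")}
    for _ in range(max_repeats):
        layer = concat(layer, T)  # if w[T*]y and x[T]z then wx[T*]yz
        result |= layer
    return result


def union(T, S):
    """x[T ∪ S]y iff x[T]y or x[S]y."""
    return T | S


def intersect(T, S):
    """x[T ∩ S]y iff x[T]y and x[S]y.

    Always computable for finite sets like these; the non-closure result
    concerns regular relations in general, which may be infinite.
    """
    return T & S


# Two toy relations mirroring the lexical -> morphemic -> surface levels.
T = {("fox +N +Pl", "fox^s#")}
S = {("fox^s#", "foxes")}
print(compose(T, S))  # {('fox +N +Pl', 'foxes')}
```

Composition here does exactly what the definition describes: feed the input to T, collect its output, and feed that to S.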
an exam
them using FST operations
B,” stop yourself and instead say “Compose FST A with FST B” or simply “A ∘ B”
Input: a word
Output: the word’s stem (approximately)
Examples from the Porter stemmer:

Word         Stem
no           no
noah         noah
nob          nob
nobility     nobil
nobis        nobi
noble        nobl
nobleman     nobleman
noblemen     noblemen
nobleness    nobl
nobler       nobler
nobles       nobl
noblesse     nobless
noblest      noblest
nobly        nobli
nobody       nobodi
noces        noce
nod          nod
nodded       nod
nodding      nod
noddle       noddl
noddles      noddl
noddy        noddi
nods         nod
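To give a flavor of how such stems arise, here is a sketch of only the first step of the Porter algorithm (step 1a, which handles plural-like endings). The function name is an assumption, and the remaining steps are omitted, so its output matches the stems listed above only for words that the later steps leave alone.

```python
def porter_step_1a(word):
    """Apply Porter step 1a: the longest matching suffix rule wins."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (nods -> nod)
    return word


for w in ["nods", "noces", "nobis", "nobles"]:
    print(w, "->", porter_step_1a(w))
```

For nods, noces, and nobis this step alone already yields nod, noce, and nobi. For nobles it yields noble, and a later step of the full algorithm removes the final e to give nobl.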
Input: raw text
Output: sequence of tokens normalized for easier processing.
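A minimal tokenizer along these lines might look as follows. The regular expression and the choice of lowercasing as the normalization step are assumptions for illustration; real tokenizers handle many more cases (URLs, numbers, hyphenation, non-ASCII apostrophes).

```python
import re

# A token is either a run of word characters (optionally with internal
# apostrophes, as in "didn't") or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split raw text into tokens and lowercase them."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


print(tokenize("The foxes didn't panic!"))
# ['the', 'foxes', "didn't", 'panic', '!']
```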