SLIDE 5 The Smart Reply Flow: Preprocess Email
- Language detection - The language of the message is
identified and non-English messages are discarded.
- Tokenization - Subject and message body are broken into
words and punctuation marks.
- Sentence segmentation - Sentence boundaries are
identified in the message body.
- Normalization - Infrequent words and entities such as personal
names, URLs, email addresses, and phone numbers are replaced by special tokens.
- Quotation removal - Quoted original messages and
forwarded messages are removed.
- Salutation/close removal - Salutations like "Hi John" and
closes such as "Best regards, Mary" are removed.
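
The steps above can be sketched as a small Python pipeline. This is a minimal illustration, not the Smart Reply implementation: every function name, regular expression, and special token (`<URL>`, `<EMAIL>`, `<PHONE>`, `<UNK>`) is an assumption, the order in which the steps run is a guess, `looks_english` is a toy stand-in for a real language detector, and personal-name replacement is left out because it would need a name recognizer.

```python
import re

# All patterns below are illustrative assumptions, not taken from the slide.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
QUOTE_START_RE = re.compile(r"^(>|On .* wrote:$|-+\s*Forwarded message\s*-+$)")
SALUTATION_RE = re.compile(r"^(hi|hello|dear)\b[\w\s]*[,!]?$", re.IGNORECASE)
CLOSE_RE = re.compile(r"^(best regards|regards|thanks|cheers)[,!.]?$", re.IGNORECASE)
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")
TOKEN_RE = re.compile(r"<[A-Z]+>|\w+|[^\w\s]")
ENGLISH_HINTS = {"the", "and", "you", "is", "to", "a", "of", "for", "can", "we"}


def looks_english(text):
    """Toy language check: share of common English words (placeholder)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    return sum(w in ENGLISH_HINTS for w in words) / len(words) >= 0.1


def strip_quotes_and_greetings(body):
    """Drop the salutation line, the close with its signature, and quoted text."""
    kept = []
    for line in (raw.strip() for raw in body.splitlines()):
        if QUOTE_START_RE.match(line) or CLOSE_RE.match(line):
            break  # everything from here on is close, signature, or quoted mail
        if not line or SALUTATION_RE.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


def normalize(text):
    """Replace URLs, email addresses, and phone numbers with special tokens."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def tokenize(text, vocab=None):
    """Split into words and punctuation; map out-of-vocabulary words to <UNK>."""
    tokens = TOKEN_RE.findall(text)
    if vocab is None:
        return tokens
    return [t if not t.isalpha() or t.lower() in vocab else "<UNK>" for t in tokens]


def preprocess(subject, body, vocab=None):
    """Return tokenized subject and sentences, or None for non-English mail."""
    if not looks_english(body):
        return None
    text = normalize(strip_quotes_and_greetings(body))
    sentences = [s for s in SENTENCE_RE.split(text) if s]
    return {
        "subject": tokenize(normalize(subject), vocab),
        "sentences": [tokenize(s, vocab) for s in sentences],
    }
```

Calling `preprocess(subject, body)` on an English message returns a dictionary with the tokenized subject and one token list per sentence; a message that fails the language check returns `None`. Passing a `vocab` set is one simple way to model the replacement of infrequent words.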