Speech Act Based Classification of Email Messages in Croatian

University of Zagreb, Faculty of Electrical Engineering and Computing
Text Analysis and Knowledge Engineering Lab
Speech Act Based Classification of Email Messages in Croatian Language
Tin Franović, Jan Šnajder
{tin.franovic,jan.snajder}@fer.hr
Eighth Language Technologies Conference (LTC IS-2012), Ljubljana, October 8th, 2012
UNIZG FER TakeLab

Background & motivation
- Increase in popularity of email as a means of communication
  - Recent surveys: up to 2 hours a day spent on emails
  - Automated email classification can reduce the amount of time users spend reading and sorting emails
- Speech acts (Searle, 1965)
  - Speech acts are illocutionary acts that attempt to convey meaning from the speaker (or writer) to the listener (or reader)
  - Speech acts are an effective way of summarizing the intended purpose of the message

Goal & methodology
- Our goal: develop and evaluate speech act classification of email messages in Croatian using supervised machine learning
- Task framed as a multilabel text classification problem
- Thorough evaluation using six machine learning algorithms
- Evaluated using message-level, paragraph-level, and sentence-level features

Coming up next...
- Message classification
  - Dataset
  - Message preprocessing
  - Training classifiers
  - Evaluation
- Conclusion and future work

Dataset annotation
- Several email datasets are publicly available, but none in Croatian
- We compiled a dataset using 1337 messages from five sources
- Annotated using 13 different speech acts [Searle, 1965]:
  - Assertives (AMEND, PREDICT, CONCLUDE)
  - Directives (REQUEST, REMIND, SUGGEST)
  - Expressives (APOLOGIZE, GREET, THANK)
  - Commissives (COMMIT, REFUSE, WARN)
  - Declarations (DELIVER)

Dataset annotation
- Two annotators, 15% of dataset double-annotated

| Speech act | κ     |
|------------|-------|
| AMEND      | 0.714 |
| APOLOGIZE  | 0.856 |
| COMMIT     | 0.851 |
| CONCLUDE   | 0.005 |
| DELIVER    | 0.792 |
| GREET      | 0.779 |
| PREDICT    | 0.267 |
| REFUSE     | 0.000 |
| REMIND     | 0.747 |
| REQUEST    | 0.589 |
| SUGGEST    | 0.544 |
| THANK      | 0.949 |
| WARN       | 0.174 |
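The κ values above can be illustrated with a small agreement computation. This is a minimal sketch that assumes κ denotes Cohen's kappa over binary per-segment decisions, which the slides do not state explicitly; the function name and the two label lists are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical decisions for one speech act (1 = present, 0 = absent).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Because κ corrects for chance agreement, a rarely used label can score near zero even when the annotators disagree on only a handful of segments.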

Dataset annotation
- Infrequent and low-IAA (inter-annotator agreement) speech acts removed: APOLOGIZE, CONCLUDE, GREET, PREDICT, REFUSE, THANK, WARN
- Speech acts used: DELIVER, AMEND, COMMIT, REMIND, SUGGEST, REQUEST

Message preprocessing
- Goal: reduce the dimensionality and morphological variation
- Stemming
  - Suffix of each word after the last vowel removed
  - Number of terms reduced from 15,100 to 11,856
- Stop-word removal
  - Filtered out words with little semantic information
  - List of 2,024 Croatian stop-words
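The two preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it treats only a, e, i, o, u as vowels, keeps the last vowel itself, lowercases tokens, and uses a hypothetical three-word stop list in place of the 2,024-word list.

```python
VOWELS = set("aeiou")  # assumption: the five Croatian vowel letters only

def stem(word):
    """Drop every character after the last vowel; vowel-less words stay whole."""
    lowered = word.lower()
    for i in range(len(lowered) - 1, -1, -1):
        if lowered[i] in VOWELS:
            return lowered[: i + 1]
    return lowered

# Hypothetical stop list; the slides mention 2,024 Croatian stop-words.
STOP_WORDS = {"i", "je", "se"}

def preprocess(tokens, remove_stop_words=True):
    """Optionally filter stop-words, then stem the remaining tokens."""
    kept = [t for t in tokens
            if not (remove_stop_words and t.lower() in STOP_WORDS)]
    return [stem(t) for t in kept]

tokens = ["Sastanak", "je", "pomaknut", "i", "potvrđen"]
stems = preprocess(tokens)
```

Truncating at the last vowel is crude, but it merges many inflected forms of a Croatian word into one term without needing a lexicon.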

Message preprocessing (2)
- Separate training set created for each speech act using annotated data
- Text segments extracted at corresponding discourse levels
  - Sentence and paragraph levels: segments that enclose the start and end point of the annotation
  - Message level: complete message
- Negative examples sampled from the set of segments not annotated with the corresponding speech act
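Building one binary training set per speech act might look like the sketch below. The function name, the example segments, and the one-negative-per-positive sampling ratio are all assumptions; the slides do not say how many negatives were drawn.

```python
import random

def build_training_set(segments, speech_act, negatives_per_positive=1, seed=0):
    """segments: list of (text, set_of_speech_acts) at one discourse level.

    Returns (text, label) pairs: every annotated segment as a positive,
    plus negatives sampled from segments lacking the annotation.
    """
    positives = [text for text, acts in segments if speech_act in acts]
    candidates = [text for text, acts in segments if speech_act not in acts]
    rng = random.Random(seed)
    n_negatives = min(len(candidates), negatives_per_positive * len(positives))
    negatives = rng.sample(candidates, n_negatives)
    return [(t, 1) for t in positives] + [(t, 0) for t in negatives]

# Hypothetical sentence-level segments with their annotated speech acts.
segments = [
    ("Šaljem izvještaj u privitku.", {"DELIVER"}),
    ("Možeš li mi poslati zapisnik?", {"REQUEST"}),
    ("Javit ću se sutra.", {"COMMIT"}),
    ("Podsjećam na rok u petak.", {"REMIND"}),
    ("Molim te potvrdi dolazak, šaljem raspored.", {"REQUEST", "DELIVER"}),
]
request_train = build_training_set(segments, "REQUEST")
```

Since a segment can carry several speech acts, the same text may be a positive example for one classifier and a negative candidate for another, which is how the multilabel task reduces to independent binary problems.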

Training classifiers
- RapidMiner implementation
- Six different models: SVM (Support Vector Machine), naive Bayes (NB), k-NN (k-Nearest Neighbors), Decision Stump (DS), AdaBoost (AB, with Decision Stump as the weak learner), and RDR (Ripple Down Rules)
- Three term weighting schemes:
  - TF (Term Frequency) and TF-IDF (Term Frequency, Inverse Document Frequency): all models except RDR
  - Binary weights: only RDR
- Separate classifier trained for every speech act, term weighting scheme, and discourse level (198 models)
- Re-trained using stop-word removal
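The three weighting schemes can be sketched in plain Python. This is not RapidMiner's implementation: length-normalised TF and an unsmoothed logarithmic IDF are assumed variants, and the tokenised documents are hypothetical. The final lines show one reading of the setup that reproduces the count of 198 models.

```python
import math
from collections import Counter

def binary_weights(tokens):
    """1.0 for every term that occurs in the segment."""
    return {term: 1.0 for term in set(tokens)}

def tf_weights(tokens):
    """Term count divided by segment length (one common TF variant)."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def idf_table(documents):
    """log(N / document frequency) for every term in the collection."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_weights(tokens, idf):
    """TF scaled by IDF; terms unseen in the collection get weight 0."""
    return {term: tf * idf.get(term, 0.0)
            for term, tf in tf_weights(tokens).items()}

# Hypothetical preprocessed segments.
docs = [
    ["molim", "pošalji", "izvještaj"],
    ["šaljem", "izvještaj"],
    ["molim", "potvrdi", "termin"],
]
idf = idf_table(docs)
weights = tfidf_weights(docs[0], idf)

# Assumed breakdown of the 198 models: 6 speech acts x 3 discourse levels
# x (5 models with TF or TF-IDF, plus RDR with binary weights).
speech_acts, discourse_levels = 6, 3
configs_per_cell = 5 * 2 + 1 * 1
n_models = speech_acts * discourse_levels * configs_per_cell
```

Terms that appear in most segments get an IDF near zero, so TF-IDF already discounts many stop-words before any explicit stop-word removal.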

Training classifiers (2)
- Parameter optimization
  - Grid search
  - 10-fold cross-validation for every parameter combination
  - Optimal parameters chosen based on averaged F1 score
- Optimal model re-trained using the whole training set and tested on the held-out set
  - 70% for training/validation, 30% held-out test set
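The optimization procedure can be sketched end to end with the standard library alone. The toy `MidpointClassifier`, its `offset` parameter, the synthetic one-feature data, and the grid values are all hypothetical stand-ins for the RapidMiner models and their real parameters; only the protocol (70/30 split, 10-fold cross-validation per grid point, selection by mean F1, retraining on the full training portion) follows the slide.

```python
import random
from statistics import mean

def f1_score(y_true, y_pred):
    """F1 for the positive class; 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx); fold i validates on items i, i+k, i+2k, ..."""
    for i in range(k):
        val = list(range(i, n, k))
        held = set(val)
        yield [j for j in range(n) if j not in held], val

class MidpointClassifier:
    """Toy model: threshold at the class-mean midpoint plus a tunable offset."""
    def __init__(self, offset):
        self.offset = offset

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.threshold = (mean(pos) + mean(neg)) / 2 + self.offset
        return self

    def predict(self, xs):
        return [1 if x >= self.threshold else 0 for x in xs]

def grid_search(xs, ys, offsets, k=10):
    """Return the offset with the highest mean F1 over k folds."""
    best_offset, best_f1 = None, -1.0
    for offset in offsets:
        scores = []
        for train_idx, val_idx in k_fold_indices(len(xs), k):
            model = MidpointClassifier(offset).fit(
                [xs[j] for j in train_idx], [ys[j] for j in train_idx])
            predictions = model.predict([xs[j] for j in val_idx])
            scores.append(f1_score([ys[j] for j in val_idx], predictions))
        if mean(scores) > best_f1:
            best_offset, best_f1 = offset, mean(scores)
    return best_offset, best_f1

# Synthetic data: 100 shuffled labels, one noisy feature per example.
rng = random.Random(0)
labels = [i % 2 for i in range(100)]
rng.shuffle(labels)
features = [y + rng.gauss(0.0, 0.6) for y in labels]

split = 70  # 70% training/validation, 30% held-out test
train_x, test_x = features[:split], features[split:]
train_y, test_y = labels[:split], labels[split:]

best_offset, cv_f1 = grid_search(train_x, train_y, offsets=[-0.3, 0.0, 0.3])
final_model = MidpointClassifier(best_offset).fit(train_x, train_y)
test_f1 = f1_score(test_y, final_model.predict(test_x))
```

Keeping the 30% test portion out of the grid search means the reported F1 is not inflated by parameter selection.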

Classifier performance
F1 performance for best feature/discourse level combinations:

| Speech act | NB    | k-NN  | SVM   | DS    | AB    | RDR   |
|------------|-------|-------|-------|-------|-------|-------|
| DELIVER    | 69.70 | 83.72 | 88.16 | 85.71 | 87.50 | 88.51 |
| AMEND      | 71.43 | 77.97 | 72.29 | 74.63 | 77.27 | 79.31 |
| COMMIT     | 62.45 | 67.44 | 78.61 | 79.37 | 81.97 | 83.75 |
| REMIND     | 60.87 | 63.64 | 75.00 | 76.92 | 76.92 | 94.74 |
| SUGGEST    | 67.06 | 70.27 | 76.27 | 75.12 | 71.50 | 76.84 |
| REQUEST    | 69.69 | 75.44 | 70.57 | 75.23 | 74.46 | 78.76 |

Discourse level
F1 performance for best classifier/feature combinations:

| Speech act | Message | Paragraph | Sentence |
|------------|---------|-----------|----------|
| DELIVER    | 86.59   | 83.64     | 88.51    |
| AMEND      | 77.27   | 72.38     | 79.31    |
| COMMIT     | 81.97   | 78.93     | 83.75    |
| REMIND     | 76.92   | 69.57     | 94.74    |
| SUGGEST    | 71.88   | 69.74     | 76.84    |
| REQUEST    | 70.09   | 72.19     | 78.76    |

Overall: 94.74 83.64 78.93

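Feature types
F1 performance for best classifier/discourse level combinations, with and without stop-words:

| Speech act | Binary (with) | TF (with) | TF-IDF (with) | Binary (without) | TF (without) | TF-IDF (without) |
|------------|---------------|-----------|---------------|------------------|--------------|------------------|
| DELIVER    | 87.50         | 88.00     | 88.16         | 87.96            | 88.51        | 88.51            |
| AMEND      | 70.07         | 77.19     | 77.27         | 75.86            | 79.31        | 77.19            |
| COMMIT     | 79.37         | 81.63     | 78.82         | 79.76            | 83.75        | 81.97            |
| REMIND     | 76.92         | 76.92     | 75.00         | 77.78            | 77.78        | 94.74            |
| SUGGEST    | 71.50         | 76.27     | 68.40         | 73.08            | 76.84        | 73.68            |
| REQUEST    | 61.90         | 78.10     | 74.46         | 77.53            | 78.76        | 78.08            |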

Overall performance
F1 performance with optimal feature sets for each classifier, averaged over speech acts:

| Classifier | Message | Paragraph | Sentence |
|------------|---------|-----------|----------|
| NB         | 69.70   | 72.38     | 79.31    |
| k-NN       | 72.73   | 75.44     | 83.72    |
| SVM        | 83.87   | 81.55     | 88.16    |
| DS         | 78.65   | 79.37     | 85.71    |
| AB         | 83.54   | 87.50     | 94.74    |
| RDR        | 86.59   | 83.64     | 88.51    |

Conclusion
- Addressed multilabel speech act classification for Croatian
- Thorough evaluation using six machine learning algorithms and three feature types
- Discourse level and feature type do not significantly influence classification performance
- Certain speech acts are classified more accurately at particular levels
- Obtained F1 scores notably higher than reported in previous work [Cohen, 2004; Carvalho, 2006]

Future work
- Explore the relationship between discourse level and speech acts
- Employ information extraction methods to augment speech acts
- Impact of speech acts on importance-based classification

Thank you for your attention
Let's keep in touch...
www.takelab.hr
info@takelab.hr
