SLIDE 1

RICT at the NTCIR-14 QALab- PoliInfo Task

Jiawei Yong, Shintaro Kawamura, Katsumi Kanasaki, Shoichi Naitoh, and Kiyohiko Shinomiya Ricoh Company, Ltd.

SLIDE 2

Index

Segmentation subtask
➢ Overall thought for segmentation
➢ Cue-phrase-based idea
　◎ Semi-supervised segmentation
➢ Results and conclusion

Classification subtask
➢ Research challenges
➢ Research methods
➢ Results and conclusion

SLIDE 3

Segmentation subtask


SLIDE 4

Segmentation subtask in 2 steps

[Figure: the two steps. Input: Date, Speaker, and a Summary. Step 1 (segmentation) divides the minutes (the speaker's utterances) into segments, each opened by a cue phrase, e.g. 初めに、xxx…見解を求めます。("First, … I ask for your views."), 次に、xxx…見解を求めます。("Next, … I ask for your views."), 最後に、xxx…質問を終わります。("Finally, … this concludes my questions."). Step 2 (search) finds the contiguous segments that correspond to the input.]

SLIDE 5

Data sets for the segmentation subtask

Annotated by ourselves:
  • training data: 4,804 utterances, 995 segments
  • development data: 3,438 utterances, 683 segments

Data sets provided by the task organizer:
  • training data: used as development data
  • test data

SLIDE 6

Cue-phrase-based idea (segmentation step)

▪ Hints for topical segmentation
  □ Cue phrases: used in the formal run; effective for speech in the assembly
  □ Lexical cohesion: TextTiling was tried in the dry run; not reliable
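As an illustration of the cue-phrase idea, a minimal rule-based segmenter can open a new segment whenever an utterance starts with a cue phrase. The phrases below are taken from the example slide; the actual string patterns of Run 1 are not shown in the deck, so this is only a sketch.

```python
import re

# Illustrative cue phrases from the example slide; the real rule set
# used in Run 1 may differ.
CUE_PHRASES = ["初めに、", "次に、", "最後に、"]
CUE_RE = re.compile("|".join(map(re.escape, CUE_PHRASES)))

def segment_by_cues(utterances):
    """Open a new segment whenever an utterance begins with a cue phrase."""
    segments = []
    for utt in utterances:
        if not segments or CUE_RE.match(utt):
            segments.append([utt])    # cue phrase found: start a new segment
        else:
            segments[-1].append(utt)  # otherwise, extend the current segment
    return segments

speech = ["初めに、教育について見解を求めます。",
          "具体的には以下の通りです。",
          "次に、財政について見解を求めます。",
          "最後に、質問を終わります。"]
print(len(segment_by_cues(speech)))  # -> 3
```

Because the cue phrases are sentence-initial discourse markers, a simple anchored match suffices; no lexical-cohesion signal is needed.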

SLIDE 7

Models for segmentation step (formal run)

▪ Rule-based model (string pattern matching) … Run 1
▪ Supervised models
  – BoW ⇒ SVM … Run 2
  – pre-trained word2vec ⇒ LSTM … Run 5
  – word embeddings ⇒ HAN (unsubmitted)
▪ Semi-supervised model (original method) … Run 3
▪ No-segmentation model (each utterance is a segment) … Run 4

Five runs were submitted in total.

SLIDE 8

Semi-supervised model (Segmentation step)

▪ Segment boundaries are learned through bootstrapping.

[Figure: bootstrapping over 84,905 utterances. The boundary classifier is a logistic regression whose BoW features (the 10 words at the head and the tail of the first line and the last line of a segment) are compressed with LSI. Speaker boundaries supply the initial labels, and the estimated segment boundaries are re-learned at each iteration.]
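The bootstrapping loop can be sketched as follows. The feature extraction (LSI-compressed BoW of the head/tail words) is replaced here by toy 2-D features, and the learning rate, iteration count, and label-update rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, lr=0.5, epochs=300):
    """Gradient-descent logistic regression, standing in for the
    boundary classifier (the slide uses LSI-compressed BoW features)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def bootstrap_boundaries(X, seed_labels, iterations=5):
    """Start from speaker boundaries as seed labels, then repeatedly
    retrain the classifier on its own predictions (bootstrapping)."""
    y = seed_labels.astype(float)
    for _ in range(iterations):
        w = train_logreg(X, y)
        p = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))
        y = (p > 0.5).astype(float)   # re-estimated segment boundaries
    return y

# Toy features: boundary utterances cluster around (1, 1), the rest
# around (-1, -1); 5 of the 20 true boundaries start out mislabeled.
X = np.vstack([rng.normal(1.0, 0.3, (20, 2)), rng.normal(-1.0, 0.3, (20, 2))])
seed = np.array([1] * 15 + [0] * 25)
est = bootstrap_boundaries(X, seed)
```

The point of the iteration is visible in the toy data: the classifier generalizes past the noisy speaker-boundary seed and recovers the mislabeled boundary utterances.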

SLIDE 9

Search step

▪ Maximize Σ_{j=1}^{l} g(u_j) − μ·l·log(o)
  • Σ g(u_j): coverage of the weighted words u_j (j = 1, …, l) of the summary
  • μ·l·log(o): penalty for the length of the output (o utterances)
  • The hyperparameter μ is tuned on the development data (0.4 for questions, 0.7 for answers).
  • Among the candidate runs of contiguous segments, the optimal one is selected and output.
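A minimal sketch of the search step, assuming g() is a simple 0/1 coverage check (the slide's word weighting is not reproduced) and that each segment is a list of utterance strings:

```python
import math

def best_span(segments, summary_words, mu=0.4):
    """Exhaustively score runs of contiguous segments with the
    objective sum_j g(u_j) - mu * l * log(o).  Here g() is a 0/1
    coverage check; the actual weighting scheme is not shown."""
    l = len(summary_words)
    best, best_score = None, float("-inf")
    for i in range(len(segments)):
        for j in range(i, len(segments)):
            span = [u for seg in segments[i:j + 1] for u in seg]
            o = len(span)                       # span length in utterances
            covered = sum(1 for w in summary_words
                          if any(w in u for u in span))  # sum_j g(u_j)
            score = covered - mu * l * math.log(o)
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Three segments of utterances; the summary mentions words from segment 1.
segments = [["a b"], ["c d", "e f"], ["g h"]]
print(best_span(segments, ["c", "e"]))  # -> (1, 1)
```

The log-length penalty lets the search prefer the shortest contiguous run that still covers the summary's words, which is why the single matching segment wins over longer spans with the same coverage.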

SLIDE 10

Evaluation results

The performance of the methods when applied to the test data set (mean values of 5 runs):

                      Question                     Answer
Segmentation method   Recall  Precision  F1        Recall  Precision  F1
rule-based            0.851   0.913      0.881     0.949   0.903      0.925
SVM                   0.819   0.851      0.834     0.913   0.939      0.925
LSTM                  0.916   0.690      0.780     0.909   0.925      0.914
HAN                   0.871   0.874      0.873     0.949   0.921      0.934
semi-supervised       0.836   0.760      0.796     0.907   0.814      0.858
no segmentation       0.828   0.715      0.767     0.680   0.839      0.751

▪ The rule-based segmentation was the best during the formal run (top 1 in F1). The method using a hierarchical attention network (the unsubmitted one) also shows good performance.

SLIDE 11

Conclusions on segmentation subtask

▪ Assembly speeches can be effectively segmented by cue phrases.
▪ A rule-based segmentation and a neural-network segmentation, combined with a simple search model, give good results. They can serve as baselines for more advanced methods that take syntactic or semantic features into account.
▪ A semi-supervised segmentation that does not require training data is also feasible.


SLIDE 12

Classification subtask


SLIDE 13

Research challenges in classification

◆ Training data challenges:

・Quality: the kappa statistics among annotators labelling the same sentences are quite low. (Challenge 1: low kappa statistic)
・Quantity: the quantity of labelled utterances for each topic is insufficient. (Challenge 2: underfitting)
・Imbalance: the numbers of the different labels vary greatly across topics. (Challenge 3: imbalanced learning)

SLIDE 14

Research methods in classification

Challenge 1: Low kappa statistic (fact-checkability subtask)

① Unanimous training data (4,710 utterances) ⇒ LSTM, F1 score: 0.91
② Majority training data (10,291 utterances) ⇒ LSTM, F1 score: 0.81 (×)

Related work: Suspicious News Detection Using Micro Blog Text (2018); News Detection Support for Fact Check (NLP2018)
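The "unanimous training data" filter can be sketched as below; the label names and the `annotations` data structure are hypothetical, chosen only to illustrate the idea of keeping utterances all annotators agree on.

```python
from collections import Counter

def unanimous_examples(annotations):
    """Keep only utterances whose annotators all agree on one label.
    'annotations' maps utterance id -> list of annotator labels."""
    selected = {}
    for utt, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count == len(labels):      # every annotator chose this label
            selected[utt] = label
    return selected

ann = {"u1": ["checkable", "checkable", "checkable"],
       "u2": ["checkable", "not", "checkable"],   # majority, not unanimous
       "u3": ["not", "not", "not"]}
print(sorted(unanimous_examples(ann)))  # -> ['u1', 'u3']
```

A majority-vote variant would also keep "u2"; the slide's numbers (4,710 unanimous vs. 10,291 majority utterances) show the trade-off: less data, but cleaner labels and a higher F1.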

SLIDE 15

Research methods in classification

Challenge 2: Underfitting (stance classification subtask)

• Per-topic approach: a separate classifier per topic (Topic 1 classifier, …, Topic 12 classifier), each trained only on its own topic's training data.
• Integrated model: a single cross-topic classifier trained on the utterances of all topics (6,684 utterances).

[Figures: the variation of the loss rate and of the accuracy rate during training]

SLIDE 16

Research methods in classification

Challenge 3: Imbalanced learning (relevance & stance classification subtasks)

• Outlier detection: we regard the majority class as normal data and the minority class as outliers.
• Methods: Isolation Forest, one-class SVM
• Relevance ("1") : irrelevance ("0") = 9390 : 901 ≒ 10 : 1

[Figure: the F1 score of the minority class]
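A minimal outlier-detection sketch with scikit-learn's IsolationForest, using synthetic features in place of the real utterance representations; the contamination value roughly mirrors the slide's 10:1 class ratio.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Toy stand-in for the ~10:1 relevant/irrelevant split: the majority
# ("relevant") class is treated as normal, the minority as outliers.
relevant = rng.normal(0.0, 1.0, (500, 4))
irrelevant = rng.normal(6.0, 1.0, (50, 4))   # far-away minority class

clf = IsolationForest(contamination=0.1, random_state=0)
clf.fit(np.vstack([relevant, irrelevant]))

# predict() returns 1 for inliers and -1 for outliers (minority class).
print((clf.predict(irrelevant) == -1).mean())
```

The attraction of this framing is that no labels for the minority class are needed at training time; the forest flags whatever is easiest to isolate, which for heavily imbalanced data tends to be the minority class.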

SLIDE 17

Evaluation results

Classification Subtasks

Top values of RICT runs for each criterion (performance on the test data set for classification):

Subtask                              Accuracy        1-Recall  1-Precision  1-F1            0-Recall  0-Precision  0-F1
1. Relevance (imbalanced learning)   0.857 (rank 7)  0.99      0.865        0.923 (rank 7)  0.524     0.332        0.406 (rank 2)
2. Fact-checkability (low kappa)     0.729 (rank 3)  0.693     0.476        0.564 (rank 3)  0.899     0.738        0.811 (rank 3)
3. Stance (underfitting)             0.808 (rank 1)  0.295     0.63         0.40 (rank 3)   0.962     0.827        0.889 (rank 2)

For the three-class stance subtask, the class-2 scores were: 2-Recall 0.194, 2-Precision 0.579, 2-F1 0.290 (rank 4).

SLIDE 18

Conclusions on classification subtask

▪ The selection of training data plays an important role in supervised learning; training data should be selected with quality, quantity, and balance in mind.

  ① Low kappa statistic challenge ⇒ unanimous training data
  ② Underfitting challenge ⇒ integrated model
  ③ Imbalanced learning challenge ⇒ Isolation Forest

▪ We have shown that assembly utterances can be classified by supervised learning methods with high accuracy.
