Data-driven Methods for SMS-based FAQ Retrieval, FIRE 2011

SLIDE 1

Data-driven Methods for SMS-based FAQ Retrieval

Sanmitra Bhattacharya, Hung Tran, Padmini Srinivasan
Computer Science

FIRE 2011

SLIDE 2

INTRODUCTION

  • Why SMS-based FAQ Retrieval?
    • Exponential growth in telecom market
      • India among the top contributors
    • Widespread use of text messages
      • Personal communication
      • Advertisement
      • Enquiry
SLIDE 3

FIRE 2011 SMS-BASED FAQ RETRIEVAL

SMS-based FAQ retrieval

  • Corpus: FAQs in agriculture, career, general knowledge, etc.
  • Queries: SMS text messages
  • Task: Find FAQ entries that answer/match SMS queries

SLIDE 4

FIRE 2011 SMS-BASED FAQ RETRIEVAL

TREC QA track (1999-2007)

  • Corpus: Newswire, AQUAINT, and Blogs
  • Question Series − FACT & LIST
  • Question Topic: “House of Chanel”
  • FACT: In what year was the company founded?
  • LIST: What museums have displayed Chanel clothing?
  • Task: Define a target by answering questions

SMS-based FAQ retrieval

  • Corpus: FAQs in agriculture, career, general knowledge, etc.
  • Queries: SMS text messages
  • Task: Find FAQ entries that answer/match SMS queries

SLIDE 5

CHALLENGES

  • Sample SMS query
    • “wht is career counclng”
  • Non-standard abbreviations (what -> wht, wt, vat, etc.)
  • Misspellings
  • Omission of words
  • Inappropriate transliterations
  • Grammatical errors
  • Match this SMS query to “What is career counseling?”
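To make the matching problem concrete, the following Python sketch maps each noisy SMS token to its nearest word in a small vocabulary using Levenshtein edit distance. It is an illustrative baseline only, not the method used in this deck; the vocabulary and the function names `levenshtein` and `normalize` are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalize(sms: str, vocabulary: list[str]) -> list[str]:
    """Replace each SMS token with its closest vocabulary word."""
    return [min(vocabulary, key=lambda word: levenshtein(token, word))
            for token in sms.lower().split()]


# Hypothetical vocabulary taken from the FAQ question on this slide.
vocabulary = ["what", "is", "career", "counseling"]
normalized = normalize("wht is career counclng", vocabulary)
print(normalized)
```

With a realistic vocabulary, edit distance alone can mislead: "counclng" is at distance 2 from "counting" but distance 3 from "counseling", which is one reason noisy SMS text needs more than plain string similarity.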

SLIDE 6

SUB-TASKS

  • Mono-lingual FAQ Retrieval: ENGLISH Query -> ENGLISH FAQs
  • Cross-lingual FAQ Retrieval: ENGLISH Query -> HINDI FAQs
  • Multi-lingual FAQ Retrieval: ENGLISH Query -> ENGLISH, HINDI, MALAYALAM FAQs

SLIDE 7

DATA

  • FAQs:

<FAQ>
  <FAQID>ENG_CAREER_1</FAQID>
  <DOMAIN>CAREER</DOMAIN>
  <QUESTION>What is career counseling?</QUESTION>
  <ANSWER> Career counseling is a process ... </ANSWER>
</FAQ>

  • SMS queries:

<SMS>
  <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID>
  <SMS_TEXT>wht is career counclng</SMS_TEXT>
  <MATCHES>
    <ENGLISH>ENG_CAREER_1</ENGLISH>
    <MALAYALAM>NONE</MALAYALAM>
    <HINDI>NONE</HINDI>
  </MATCHES>
</SMS>
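The two record types above can be read with Python's standard `xml.etree.ElementTree`. The sketch below parses the single FAQ and SMS records shown on this slide. The helper names `parse_faq` and `parse_sms` are hypothetical, and real corpus files presumably wrap many such records in a root element that this example does not model.

```python
import xml.etree.ElementTree as ET

FAQ_XML = """<FAQ>
  <FAQID>ENG_CAREER_1</FAQID>
  <DOMAIN>CAREER</DOMAIN>
  <QUESTION>What is career counseling?</QUESTION>
  <ANSWER> Career counseling is a process ... </ANSWER>
</FAQ>"""

SMS_XML = """<SMS>
  <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID>
  <SMS_TEXT>wht is career counclng</SMS_TEXT>
  <MATCHES>
    <ENGLISH>ENG_CAREER_1</ENGLISH>
    <MALAYALAM>NONE</MALAYALAM>
    <HINDI>NONE</HINDI>
  </MATCHES>
</SMS>"""


def parse_faq(xml_text):
    """Turn one <FAQ> record into a plain dictionary."""
    node = ET.fromstring(xml_text)
    return {
        "id": node.findtext("FAQID").strip(),
        "domain": node.findtext("DOMAIN").strip(),
        "question": node.findtext("QUESTION").strip(),
        "answer": node.findtext("ANSWER").strip(),
    }


def parse_sms(xml_text):
    """Turn one <SMS> record into a dictionary, dropping NONE matches."""
    node = ET.fromstring(xml_text)
    matches = {child.tag: child.text.strip() for child in node.find("MATCHES")}
    return {
        "id": node.findtext("SMS_QUERY_ID").strip(),
        "text": node.findtext("SMS_TEXT").strip(),
        "matches": {lang: faq_id for lang, faq_id in matches.items()
                    if faq_id != "NONE"},
    }


faq = parse_faq(FAQ_XML)
sms = parse_sms(SMS_XML)
print(faq["id"], sms["matches"])
```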

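SLIDE 8

DATASET

  • FAQ Corpus
    • ENGLISH: 7251
    • HINDI: 1994
    • MALAYALAM: 681
  • SMS Queries

| Sub-task | Training: English | Training: Hindi | Training: Malayalam | Testing: English | Testing: Hindi | Testing: Malayalam |
|----------|-------------------|-----------------|---------------------|------------------|----------------|--------------------|
| Mono     | 1071              | 230             | 140                 | 3405             | 324            | 50                 |
| Cross    | 472               | -               | -                   | 3405             | -              | -                  |
| Multi    | 460               | 230             | 80                  | 3405             | 324            | 50                 |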

SLIDE 9

FLOWCHART OF METHODS

SLIDE 10

BASIC STEPS

  • Indexing
    • INDRI IR system
    • 2 types − UTF-8 and Translated
  • Translation mechanism for Hindi
    • Google Translate
    • Microsoft Bing Translator
  • Sample Output
    • Hindi FAQ: धान भण्डारण करते समय क्या-क्या सावधानियां बरतनी हैं?
    • Google Translate Output: When grain storage - what are the precautions?
    • Microsoft Bing Translator Output: What-if, when Paddy cold storage savdhaniyan be?

SLIDE 11

BASIC STEPS

  • Translation mechanism for Malayalam
    • No standard API
    • Crowdsourcing − oDesk
    • 681 FAQs + 50 SMS queries
    • # of translators: 2
    • Time: 2 days
    • Cost: 40 USD
  • Example:
    • Malayalam: ?
    • English Translation 1: Which is the longest river in the world?
    • English Translation 2: Which is world’s longest river?
SLIDE 12

BASIC STEPS

  • Straight Borda Count
    • Used for merging several results
    • Consensus-based voting of retrieval results
  • All retrieval methods use INDRI’s belief operator #combine
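A straight Borda count can be sketched in a few lines of Python. In this version each ranked list awards c points to its top item, c - 1 to the next, and so on, where c is the number of distinct candidates across all lists; an item that a list did not retrieve gets 0 points from that list. That scoring detail is a simplifying assumption, the slides do not specify it, and the FAQ IDs and run names are hypothetical.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge ranked lists of FAQ IDs by summing Borda points.

    Assumes each ranking lists an ID at most once.
    """
    candidates = {faq_id for ranking in rankings for faq_id in ranking}
    c = len(candidates)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, faq_id in enumerate(ranking):
            scores[faq_id] += c - position  # top item earns c points
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda faq_id: (-scores[faq_id], faq_id))


# Hypothetical ranked lists from three retrieval runs for one SMS query.
utf8_run = ["FAQ_A", "FAQ_B", "FAQ_C"]
translated_run = ["FAQ_B", "FAQ_A", "FAQ_D"]
expanded_run = ["FAQ_B", "FAQ_C", "FAQ_A"]

merged = borda_merge([utf8_run, translated_run, expanded_run])
print(merged)
```

Here FAQ_B collects 11 points, FAQ_A 9, FAQ_C 5 and FAQ_D 2, so the consensus ranking puts FAQ_B first even though one run ranked it second.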

SLIDE 13

MONO-LINGUAL RETRIEVAL (ENGLISH)

SLIDE 14

MONO-LINGUAL RETRIEVAL (ENGLISH)

  • ENGLISH
    • Google Spelling Suggestions
      • Input:

<SMS> <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID> <SMS_TEXT>wht is career counclng</SMS_TEXT> ... </SMS>

      • Output:

<SMS> <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID> <SMS_TEXT>what is career counselling</SMS_TEXT> ... </SMS>

      • No standard API
SLIDE 15

MONO-LINGUAL RETRIEVAL (ENGLISH)

  • ENGLISH (cont.)
    • Term Expansion
      • 1-4 character words
      • Commonly used abbreviations: ‘c’ for ‘see’
      • Manually created lookup table
      • 766 abbreviations and expansions
    • Aspell spell-checker
      • Problem with common acronyms and proper nouns (Ghaziabad -> Gasbag)
    • Term Frequency
      • ≤ 6 least frequent terms/SMS query
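The term expansion and term frequency steps could be sketched as follows. This is a hedged illustration, not the authors' code. The lookup table holds only three entries (the slides mention 766, of which only 'c' for 'see' is quoted), the frequency counts are invented, and reading "term frequency" as collection frequency in the FAQ corpus is an assumption. The output is wrapped in Indri's `#combine` operator.

```python
# Hypothetical excerpt of the abbreviation table; only 'c' -> 'see' is from the slides.
ABBREVIATIONS = {"c": "see", "wht": "what", "u": "you"}


def expand_terms(tokens, table=ABBREVIATIONS, max_len=4):
    """Expand short tokens (1 to max_len characters) found in the lookup table."""
    return [table.get(t, t) if len(t) <= max_len else t for t in tokens]


def rarest_terms(tokens, collection_freq, k=6):
    """Keep at most k least frequent terms, preserving query order."""
    unique = list(dict.fromkeys(tokens))
    ranked = sorted(unique, key=lambda t: collection_freq.get(t, 0))
    keep = set(ranked[:k])
    return [t for t in unique if t in keep]


def indri_combine(tokens):
    """Wrap the terms in Indri's #combine belief operator."""
    return "#combine( " + " ".join(tokens) + " )"


# Invented collection frequencies, for illustration only.
collection_freq = {"what": 900, "is": 1500, "career": 120, "counseling": 15}

expanded = expand_terms("wht is career counseling".split())
# k=2 is used here only to make the filtering visible; the slides use up to 6.
query = indri_combine(rarest_terms(expanded, collection_freq, k=2))
print(query)
```

Dropping the most frequent terms ("what", "is") leaves the content-bearing words, which is the usual motivation for filtering a query by term rarity.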

SLIDE 16

MONO-LINGUAL RETRIEVAL (HINDI)

  • HINDI
    • UTF-8 retrieval
    • English-translated retrieval − Similar to English
    • Straight Borda Count
SLIDE 17

MONO-LINGUAL RETRIEVAL (MALAYALAM)

  • MALAYALAM
    • UTF-8 retrieval
    • English-translated retrieval − oDesk
    • Straight Borda Count
SLIDE 18

CROSS-LINGUAL RETRIEVAL

SLIDE 19

CROSS-LINGUAL RETRIEVAL

ENGLISH Query -> HINDI FAQs

  • Same methods as in English mono-lingual retrieval
  • ONLY index is different
  • Hindi FAQs translated into English
SLIDE 20

MULTI-LINGUAL RETRIEVAL

SLIDE 21

MULTI-LINGUAL RETRIEVAL

English SMS -> English FAQ, Hindi TR FAQ, Malayalam TR FAQ (TR = translated into English)

  • ENGLISH SMS
    • Run 1: Google Spelling Suggestions + Term Expansion
    • Run 2: Google Spelling Suggestions + Term Expansion + Spell check
    • Run 3: Google Spelling Suggestions + Term Expansion + Term Frequency

SLIDE 22

MULTI-LINGUAL RETRIEVAL

Hindi SMS (TR + native) -> English FAQ, Hindi FAQ (TR + UTF-8), Malayalam TR FAQ

  • HINDI SMS
    • Run 1: Translated SMS queries + Google Spelling Suggestions (all English indexes)
    • Run 2:
      • Hindi: UTF-8 query on UTF-8 index
      • English & Malayalam (translated): English query on English index

SLIDE 23

MULTI-LINGUAL RETRIEVAL

Malayalam SMS (TR + native) -> English FAQ, Hindi TR FAQ, Malayalam FAQ (TR + UTF-8)

  • MALAYALAM SMS
    • Run 1: oDesk Translated SMS queries (all English indexes)
    • Run 2:
      • Malayalam: UTF-8 query on UTF-8 index
      • English & Hindi (translated): English query on English index

SLIDE 24

RESULTS

  • Mono-lingual FAQ Retrieval
  • Mean Reciprocal Rank (MRR)

| Run   | English | Hindi | Malayalam |
|-------|---------|-------|-----------|
| Run 1 | 0.736   | 0.746 | 0.838     |
| Run 2 | 0.687   | 0.860 | 0.893     |
| Run 3 | 0.711   | 0.819 | 0.881     |

English Run 1: Google Spelling Suggestion + Term Expansion

  • English: Aspell spell-checker doesn’t work well
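For reference, Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the first correct FAQ appears, with 0 when none is retrieved. A minimal Python sketch follows; the query and FAQ IDs are hypothetical, and the official evaluation may differ in details such as the rank cutoff.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, faq_id in enumerate(ranked_ids, start=1):
        if faq_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, judgments):
    """Average the reciprocal rank over every judged query."""
    total = sum(reciprocal_rank(runs.get(query_id, []), relevant)
                for query_id, relevant in judgments.items())
    return total / len(judgments)


# Hypothetical retrieval output and relevance judgments for three SMS queries.
runs = {
    "Q1": ["FAQ_1", "FAQ_7"],  # correct FAQ at rank 1
    "Q2": ["FAQ_9", "FAQ_2"],  # correct FAQ at rank 2
    "Q3": ["FAQ_4", "FAQ_5"],  # correct FAQ not retrieved
}
judgments = {"Q1": {"FAQ_1"}, "Q2": {"FAQ_2"}, "Q3": {"FAQ_3"}}

mrr = mean_reciprocal_rank(runs, judgments)
print(mrr)
```

The three queries contribute 1, 0.5 and 0, giving an MRR of 0.5.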

SLIDE 25

RESULTS

  • Mono-lingual FAQ Retrieval
  • Mean Reciprocal Rank (MRR)

Hindi Run 2: Translated + Google Spelling Suggestion

  • Hindi: Translated queries and corpus work well

SLIDE 26

RESULTS

  • Mono-lingual FAQ Retrieval
  • Mean Reciprocal Rank (MRR)

Malayalam Run 2: oDesk Translated

  • Malayalam: Translated queries and corpus work well

SLIDE 27

RESULTS (CONT.)

  • Cross-lingual FAQ Retrieval

| Run   | MRR   |
|-------|-------|
| Run 1 | 0.108 |
| Run 2 | 0.135 |
| Run 3 | 0.104 |

  • Probable errors in relevance judgments
    • SMS Query: (ID: ENG_SMS_QUERY_I31) WHAT IS WIRELESS ISP?
    • Relevance Judgment: HINDI_TELECOMMUNICATION_88 -> Q. दुनिया का स्थान दूरसंचार के क्षेत्र में क्या है? (English-translated: Q. What is the world's place in the telecom sector?)
    • Run 1 Retrieval: HINDI_TELECOMMUNICATION_87 -> Q. वायरलेस आईएसपी क्या है? (English-translated: Q. What is a Wireless ISP?)

SLIDE 28

RESULTS (CONT.)

  • Multi-lingual FAQ Retrieval

| Run   | English | Hindi | Malayalam |
|-------|---------|-------|-----------|
| Run 1 | 0.711   | 0.727 | 0.889     |
| Run 2 | 0.683   | 0.839 | 0.829     |
| Run 3 | 0.661   | -     | -         |

  • English: Aspell spell-checker doesn’t work well
  • Hindi: Involving the native UTF-8 retrieval gives better score
  • Malayalam: Translated queries and corpus work well
SLIDE 29

RESULTS (CONT.)

  • Comparison of best results from mono- and multi-lingual tasks

| Task  | English | Hindi | Malayalam |
|-------|---------|-------|-----------|
| Mono  | 0.736   | 0.860 | 0.893     |
| Multi | 0.711   | 0.839 | 0.889     |

SLIDE 30

RESULTS (CONT.)

  • Comparison of best results from mono- and multi-lingual tasks
  • Identical errors in relevance judgment as in cross-lingual retrieval
  • But minimal effect

| # of SMS queries | Count        |
|------------------|--------------|
| Total            | 3405         |
| Relevant         | 704 (20.6%)  |
| Non-relevant     | 2701 (79.4%) |

| Language  | % in Relevant |
|-----------|---------------|
| English   | 704 (100%)    |
| Hindi     | 37 (5.2%)     |
| Malayalam | 84 (11.9%)    |

SLIDE 31

CONCLUSIONS

  • Google spelling suggestions and term expansion improve retrieval performance
  • For Hindi and Malayalam, translation to English helps
  • Use of crowdsourcing for Malayalam-English translation is effective
  • Multi-lingual translation is more challenging than mono-lingual
  • Less noise (abbreviations, misspellings, etc.) in Hindi and Malayalam SMS
  • Future work: Explore other techniques for handling non-standard abbreviations

SLIDE 32

ACKNOWLEDGMENT

  • Text Mining and Retrieval Group, Computer Science, University of Iowa
  • FIRE 2011 Organizers
SLIDE 33

Thank You!