Data-driven Methods for SMS-based FAQ Retrieval
FIRE 2011
Sanmitra Bhattacharya, Hung Tran, Padmini Srinivasan
Computer Science
INTRODUCTION
- Why SMS-based FAQ Retrieval?
- Exponential growth in telecom market
- India among the top contributors
- Widespread use of text messages
- Personal communication
- Advertisement
- Enquiry
FIRE 2011 SMS-BASED FAQ RETRIEVAL
SMS-based FAQ retrieval
- Corpus: FAQs in agriculture, career, general knowledge, etc.
- Queries: SMS text messages
- Task: Find FAQ entries that answer/match SMS queries
TREC QA track (1999-2007), for comparison
- Corpus: Newswire, AQUAINT, and Blogs
- Question Series: FACT & LIST
- Question Topic: “House of Chanel”
- FACT: In what year was the company founded?
- LIST: What museums have displayed Chanel clothing?
- Task: Define a target by answering questions
CHALLENGES
- Sample SMS query
- “wht is career counclng”
- Non-standard abbreviations (what -> wht, wt, vat, etc.)
- Misspellings
- Omission of words
- Inappropriate Transliterations
- Grammatical Errors
- Match this SMS query to “What is career counseling?”
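To make the matching problem concrete, here is a minimal sketch that maps each noisy SMS token to the closest word in a small vocabulary using Python's standard `difflib`. This is not the method used in the slides (which rely on Google spelling suggestions and a lookup table); the function name `normalize_sms`, the cutoff of 0.6, and the one-question vocabulary are assumptions made for illustration.

```python
import difflib


def normalize_sms(sms_text, vocabulary, cutoff=0.6):
    """Replace each SMS token with its closest vocabulary word, if any.

    Illustrative only: a real system would need a much larger vocabulary
    and would have to guard against false matches.
    """
    normalized = []
    for token in sms_text.lower().split():
        if token in vocabulary:
            normalized.append(token)
            continue
        # get_close_matches returns candidates sorted by similarity.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        normalized.append(matches[0] if matches else token)
    return " ".join(normalized)


# Vocabulary built from the single FAQ question quoted in the slides.
faq_question = "What is career counseling?"
vocabulary = sorted({w.strip("?.,!").lower() for w in faq_question.split()})

print(normalize_sms("wht is career counclng", vocabulary))
```

With a vocabulary this small the sample query maps cleanly onto the FAQ wording; with a realistic vocabulary, short tokens such as "wt" or "vat" become ambiguous, which is why the slides treat abbreviations separately.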
SUB-TASKS
- Mono-lingual FAQ Retrieval (e.g., ENGLISH query, ENGLISH FAQs)
- Cross-lingual FAQ Retrieval (ENGLISH query, HINDI FAQs)
- Multi-lingual FAQ Retrieval (e.g., ENGLISH query; ENGLISH, HINDI, MALAYALAM FAQs)
DATA
- FAQs:
  <FAQ>
    <FAQID>ENG_CAREER_1</FAQID>
    <DOMAIN>CAREER</DOMAIN>
    <QUESTION>What is career counseling?</QUESTION>
    <ANSWER> Career counseling is a process ... </ANSWER>
  </FAQ>
- SMS queries:
  <SMS>
    <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID>
    <SMS_TEXT>wht is career counclng</SMS_TEXT>
    <MATCHES>
      <ENGLISH>ENG_CAREER_1</ENGLISH>
      <MALAYALAM>NONE</MALAYALAM>
      <HINDI>NONE</HINDI>
    </MATCHES>
  </SMS>
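The two record types above can be read with Python's standard `xml.etree.ElementTree`. The sketch below parses exactly the sample records shown; the helper names `parse_faq` and `parse_sms` are hypothetical, and it assumes each record is available as a standalone XML string (the layout of the actual corpus files is not given here).

```python
import xml.etree.ElementTree as ET

FAQ_XML = """<FAQ>
  <FAQID>ENG_CAREER_1</FAQID>
  <DOMAIN>CAREER</DOMAIN>
  <QUESTION>What is career counseling?</QUESTION>
  <ANSWER>Career counseling is a process ...</ANSWER>
</FAQ>"""

SMS_XML = """<SMS>
  <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID>
  <SMS_TEXT>wht is career counclng</SMS_TEXT>
  <MATCHES>
    <ENGLISH>ENG_CAREER_1</ENGLISH>
    <MALAYALAM>NONE</MALAYALAM>
    <HINDI>NONE</HINDI>
  </MATCHES>
</SMS>"""


def parse_faq(xml_text):
    """Return one FAQ record as a plain dictionary."""
    node = ET.fromstring(xml_text)
    return {
        "id": node.findtext("FAQID").strip(),
        "domain": node.findtext("DOMAIN").strip(),
        "question": node.findtext("QUESTION").strip(),
        "answer": node.findtext("ANSWER").strip(),
    }


def parse_sms(xml_text):
    """Return one SMS query; languages marked NONE are dropped from matches."""
    node = ET.fromstring(xml_text)
    matches = {
        child.tag: child.text.strip()
        for child in node.find("MATCHES")
        if child.text.strip() != "NONE"
    }
    return {
        "id": node.findtext("SMS_QUERY_ID").strip(),
        "text": node.findtext("SMS_TEXT").strip(),
        "matches": matches,
    }


faq = parse_faq(FAQ_XML)
sms = parse_sms(SMS_XML)
print(faq["id"], "|", sms["text"], "|", sms["matches"])
```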
DATASET
- FAQ Corpus
- ENGLISH: 7251
- HINDI: 1994
- MALAYALAM: 681
- SMS Queries
Sub-task   Training                       Testing
           English  Hindi  Malayalam      English  Hindi  Malayalam
Mono       1071     230    140            3405     324    50
Cross      472      -      -              3405     -      -
Multi      460      230    80             3405     324    50
FLOWCHART OF METHODS
BASIC STEPS
- Indexing
- INDRI IR system
- 2 index types: UTF-8 and Translated
- Translation mechanism for Hindi
- Google Translate
- Microsoft Bing Translator
- Sample Output
- Hindi FAQ: धान भण्डारण करते समय क्या-क्या सावधानियां बरतनी हैं?
- Google Translate Output: When grain storage - what are the precautions?
- Microsoft Bing Translator Output: What-if, when Paddy cold storage savdhaniyan be?
BASIC STEPS
- Translation mechanism for Malayalam
- No standard API
- Crowdsourcing: oDesk
- 681 FAQs + 50 SMS queries
- # of translators: 2
- Time: 2 days
- Cost: 40 USD
- Example:
- Malayalam: ?
- English Translation 1: Which is the longest river in the world?
- English Translation 2: Which is world’s longest river?
BASIC STEPS
- Straight Borda Count
- Used for merging several results
- Consensus-based voting of retrieval results
- ALL RETRIEVAL METHODS USE INDRI’S BELIEF OPERATOR #combine
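The consensus voting described above can be sketched as follows. The slides do not give the exact point scheme, so this assumes the common convention in which the top item of a list of n results receives n points, the next n - 1, and so on, with items missing from a list receiving nothing from it. The function name `borda_merge` and the FAQ identifiers are made up for illustration.

```python
from collections import defaultdict


def borda_merge(ranked_lists):
    """Merge several ranked result lists by summing Borda points.

    Each list awards len(list) points to its first item, one fewer to the
    next, and so on. Ties are broken alphabetically by identifier so the
    output is deterministic (an arbitrary choice made for this sketch).
    """
    scores = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, doc_id in enumerate(ranking):
            scores[doc_id] += n - position
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from two retrieval runs,
# e.g. a UTF-8 index and an English-translated index.
utf8_run = ["FAQ_1", "FAQ_2", "FAQ_3"]
translated_run = ["FAQ_2", "FAQ_4", "FAQ_1"]

print(borda_merge([utf8_run, translated_run]))
```

Here FAQ_2 collects 2 + 3 = 5 points and overtakes FAQ_1 (3 + 1 = 4), even though FAQ_1 was ranked first by one of the runs.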
MONO-LINGUAL RETRIEVAL (ENGLISH)
- ENGLISH
- Google Spelling Suggestions
- Input:
  <SMS>
    <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID>
    <SMS_TEXT>wht is career counclng</SMS_TEXT>
    ...
  </SMS>
- Output:
  <SMS>
    <SMS_QUERY_ID>ENG_405</SMS_QUERY_ID>
    <SMS_TEXT>what is career counselling</SMS_TEXT>
    ...
  </SMS>
- No standard API
MONO-LINGUAL RETRIEVAL (ENGLISH)
- ENGLISH (cont.)
- Term Expansion
- 1-4 character words
- Commonly used abbreviations: ‘c’ for ‘see’
- Manually created lookup table
- 766 abbreviations and expansions
- Aspell spell-checker
- Problem with common acronyms and proper nouns (Ghaziabad -> Gasbag)
- Term Frequency
- ≤ 6 least frequent terms/SMS query
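The term expansion and term frequency steps can be sketched roughly as below. This is one reading of the bullets, not the authors' code: the lookup table holds only a few illustrative entries (the real table has 766), the corpus frequencies are invented, and the helper names are hypothetical. The final step wraps the surviving terms in Indri's `#combine` operator, which the slides say every retrieval method uses.

```python
# Tiny stand-in for the manually created lookup table of 766 entries.
# 'c' -> 'see' and the variants of 'what' come from the slides.
ABBREVIATIONS = {"c": "see", "wht": "what", "wt": "what", "vat": "what"}


def expand_terms(tokens, table, max_len=4):
    """Expand tokens of 1-4 characters that appear in the lookup table."""
    return [table.get(t, t) if len(t) <= max_len else t for t in tokens]


def rarest_terms(tokens, corpus_freq, limit=6):
    """Keep at most `limit` terms with the lowest corpus frequency.

    Terms absent from corpus_freq count as frequency 0. The original
    query order of the kept terms is preserved.
    """
    unique = list(dict.fromkeys(tokens))
    ranked = sorted(unique, key=lambda t: corpus_freq.get(t, 0))
    keep = set(ranked[:limit])
    return [t for t in unique if t in keep]


def indri_combine(tokens):
    """Build an Indri query string using the #combine belief operator."""
    return "#combine( " + " ".join(tokens) + " )"


# Hypothetical corpus frequencies, for illustration only.
corpus_freq = {"what": 900, "is": 1200, "career": 40, "counselling": 5}

tokens = "wht is career counselling".split()
expanded = expand_terms(tokens, ABBREVIATIONS)
selected = rarest_terms(expanded, corpus_freq, limit=2)
print(indri_combine(selected))
```

A limit of 2 is used here only so that the filtering is visible on a four-word query; the slides use a limit of 6.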
MONO-LINGUAL RETRIEVAL (HINDI)
- HINDI
- UTF-8 retrieval
- English-translated retrieval: similar to English
- Straight Borda Count
MONO-LINGUAL RETRIEVAL (MALAYALAM)
- MALAYALAM
- UTF-8 retrieval
- English-translated retrieval: oDesk
- Straight Borda Count
CROSS-LINGUAL RETRIEVAL
- Same methods as in English mono-lingual retrieval
- ONLY index is different
- ENGLISH query, HINDI FAQs
- Hindi FAQs translated into English
MULTI-LINGUAL RETRIEVAL
- ENGLISH SMS
- Indexes (TR = translated into English): English FAQ, Hindi TR FAQ, Malayalam TR FAQ
- Run 1: Google Spelling Suggestions + Term Expansion
- Run 2: Google Spelling Suggestions + Term Expansion + Spell check
- Run 3: Google Spelling Suggestions + Term Expansion + Term Frequency
MULTI-LINGUAL RETRIEVAL
- HINDI SMS
- Queries: Hindi SMS (TR + native)
- Indexes: English FAQ, Hindi FAQ (TR + UTF-8), Malayalam TR FAQ
- Run 1: Translated SMS queries + Google Spelling Suggestions (all English indexes)
- Run 2:
- Hindi: UTF-8 query on UTF-8 index
- English & Malayalam (translated): English query on English index
MULTI-LINGUAL RETRIEVAL
- MALAYALAM SMS
- Queries: Malayalam SMS (TR + native)
- Indexes: English FAQ, Hindi TR FAQ, Malayalam FAQ (TR + UTF-8)
- Run 1: oDesk Translated SMS queries (all English indexes)
- Run 2:
- Malayalam: UTF-8 query on UTF-8 index
- English & Hindi (translated): English query on English index
RESULTS
- Mono-lingual FAQ Retrieval
- Mean Reciprocal Rank (MRR)
         English  Hindi  Malayalam
Run 1    0.736    0.746  0.838
Run 2    0.687    0.860  0.893
Run 3    0.711    0.819  0.881
- English Run 1: Google Spelling Suggestion + Term Expansion
- English: Aspell spell-checker doesn’t work well
- Hindi Run 2: Translated + Google Spelling Suggestion
- Hindi: Translated queries and corpus work well
- Malayalam Run 2: oDesk Translated
- Malayalam: Translated queries and corpus work well
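Mean Reciprocal Rank averages, over all queries, the reciprocal of the rank at which the first relevant FAQ appears. A minimal sketch follows, assuming a query with no relevant FAQ in its result list contributes 0; how the official evaluation treated such queries is not stated in the slides. The query and FAQ identifiers are invented.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant item, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, judgments):
    """Average the reciprocal rank over every query in `runs`.

    runs:      query id -> ranked list of retrieved FAQ ids
    judgments: query id -> set of relevant FAQ ids
    """
    if not runs:
        return 0.0
    total = sum(
        reciprocal_rank(ranking, judgments.get(query_id, set()))
        for query_id, ranking in runs.items()
    )
    return total / len(runs)


# Hypothetical retrieval output and relevance judgments for three queries.
runs = {
    "q1": ["FAQ_A", "FAQ_B"],           # relevant item at rank 1
    "q2": ["FAQ_C", "FAQ_D", "FAQ_E"],  # relevant item at rank 3
    "q3": ["FAQ_F"],                    # no relevant item retrieved
}
judgments = {"q1": {"FAQ_A"}, "q2": {"FAQ_E"}, "q3": {"FAQ_Z"}}

print(round(mean_reciprocal_rank(runs, judgments), 3))
```

The three queries score 1, 1/3 and 0, so the mean is 4/9. Because the score depends only on the first relevant hit, a single wrong relevance judgment can turn a rank-1 answer into a zero for that query.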
RESULTS (CONT.)
- Cross-lingual FAQ Retrieval
         MRR
Run 1    0.108
Run 2    0.135
Run 3    0.104
- Probable errors in relevance judgments
- SMS Query (ID: ENG_SMS_QUERY_I31): WHAT IS WIRELESS ISP?
- Relevance Judgment: HINDI_TELECOMMUNICATION_88 -> Q. दुनिया का स्थान दूरसंचार के क्षेत्र में क्या है? (English-translated: Q. What is the world's place in the telecom sector?)
- Run 1 Retrieval: HINDI_TELECOMMUNICATION_87 -> Q. वायरलेस आईएसपी क्या है? (English-translated: Q. What is a Wireless ISP?)
RESULTS (CONT.)
- Multi-lingual FAQ Retrieval
- Mean Reciprocal Rank (MRR)
         English  Hindi  Malayalam
Run 1    0.711    0.727  0.889
Run 2    0.683    0.839  0.829
Run 3    0.661    -      -
- English: Aspell spell-checker doesn’t work well
- Hindi: Involving the native UTF-8 retrieval gives better score
- Malayalam: Translated queries and corpus work well
RESULTS (CONT.)
- Comparison of best results from mono- and multi-lingual tasks
         English  Hindi  Malayalam
Mono     0.736    0.860  0.893
Multi    0.711    0.839  0.889
- Identical errors in relevance judgment as in cross-lingual retrieval
- But minimal effect
# of SMS queries
Total          3405
Relevant       704 (20.6%)
Non-relevant   2701 (79.4%)
% in Relevant
English        704 (100%)
Hindi          37 (5.2%)
Malayalam      84 (11.9%)
CONCLUSIONS
- Google spelling suggestions and term expansion improve retrieval performance
- For Hindi and Malayalam, translation to English helps
- Use of crowdsourcing for Malayalam-English translation is effective
- Multi-lingual retrieval is more challenging than mono-lingual retrieval
- Less noise (abbreviations, misspellings, etc.) in Hindi and Malayalam SMS
- Future work: Explore other techniques for handling non-standard abbreviations
ACKNOWLEDGMENT
- Text Mining and Retrieval Group, Computer Science, University of Iowa
- FIRE 2011 Organizers