Multi Language Support for Virtual Assistants



  1. Multi Language Support for Virtual Assistants
     Sierra Kaplan-Nelson, Max Farr
     Mentor: Mehrad Moradshahi

  2. Broad Topic (everything we do now in many other languages)
     أعطني معلومات عن الانتخابات (English: Give me information about the elections)
     ● Speech recognition, speech -> text
     ● Machine translation
     ● Data collection
     ● Question answering
     ● Semantic parsing
     ● Guided learning
     ● Chatbots
     ● Etc., etc., ...

  3. Overview of Machine Language Translation
     أعطني معلومات عن الانتخابات (English: Give me information about the elections)
     ● Previously all done via rule-based methods
     ● For a while, hybrid machine translation was the norm, where sentences were pre-processed using a rules engine before being fed through an ML model
     ● Now almost all done by deep neural networks
     ● VAs are in some ways using hybrid machine translation, since they can use templates

  4. State of the Art VAs in Other Languages
     ● Google VA supports the most languages
       ○ Issues detecting accents
       ○ Started to employ AI on sound wave visualizations to improve language detection, and spelling correction techniques to reduce errors by 29%
       ○ Supporting a new language also involves localization, which can take a month
     ● Question answering in other languages is an active research topic and currently performs much worse than in English
     ● VAs that perform specific tasks, like helping children learn, are almost exclusively in English

  5. Arabic VA for Autistic Children (2019)
     ● Teaches both social behavior and academic skills, mostly using hardcoded flow diagrams and quizzes
     Autistic Innovative Assistant (AIA): an Android application for Arabic autism children (Sweidan, Salameh, Zakarneh & Darabkh)

  6. Multi Language Question Answering

  7. Supervised Learning to Improve Arabic Question Similarity Detection
     ● Arabic is poorly informatized (not many knowledge graphs, etc.)
     ● Uses rules to separate questions by broad type
     ● Created a dataset of question pairs from ejaaba.com (answer.com in Arabic) and hand-labeled each pair as similar: “Yes” or “No”
     ● Used paraphrasing to generate more “Yes” pairs
     ● Hybrid learning approach combining string and semantic similarity
     Novel Approach towards Arabic Question Similarity Detection (Daoud)
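The hybrid of string and semantic similarity on this slide can be illustrated with a minimal Python sketch. It is not Daoud's implementation: the weight `alpha`, the threshold, the helper names, and the tiny `TOY_VECTORS` table standing in for real word embeddings are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from math import sqrt

# Hypothetical 2-d "embeddings"; a real system would use trained word vectors.
TOY_VECTORS = {
    "price": (1.0, 0.0),
    "cost": (0.9, 0.1),
    "weather": (0.0, 1.0),
}

def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def embed(text: str) -> tuple:
    """Average the toy vectors of the known words; zero vector if none are known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

def cosine(u, v) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

def hybrid_similarity(q1: str, q2: str, alpha: float = 0.5) -> float:
    """Weighted mix of string similarity and semantic (embedding) similarity."""
    return alpha * string_similarity(q1, q2) + (1 - alpha) * cosine(embed(q1), embed(q2))

def is_similar(q1: str, q2: str, threshold: float = 0.75) -> bool:
    """Label a question pair "Yes" (True) or "No" (False)."""
    return hybrid_similarity(q1, q2) >= threshold
```

With these toy vectors, "what is the price" and "what is the cost" score higher than "what is the price" and "what is the weather", because the semantic term separates pairs that look alike at the character level.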

  8. Multilingual Extractive Reading Comprehension (2018)
     ● Most high-quality large datasets are annotated in English
     ● Seeks to improve RC in other languages without the costly process of creating new large training datasets
     ● Translates question AND document context from language L into English with an attentive NMT model and gets the answer in English
     Multilingual Extractive Reading Comprehension by Runtime Machine Translation (Asai, Eriguchi, Hashimoto, and Tsuruoka)

  9. Multilingual Extractive Reading Comprehension
     Multilingual Extractive Reading Comprehension by Runtime Machine Translation (Asai, Eriguchi, Hashimoto, and Tsuruoka)

  10. Multilingual Extractive Reading Comprehension
      ● Recover answer in context in L using soft alignments from NMT
        ○ Alignment in this context is the start and end of the span in the text containing the answer
      ● Found that how well questions are translated significantly affects performance
        ○ Using paraphrased questions decreased accuracy
        ○ Oversampling high-quality translations in training improves performance
      ● Found that this method improved performance over just back-translating English results with Google Translate
      Multilingual Extractive Reading Comprehension by Runtime Machine Translation (Asai, Eriguchi, Hashimoto, and Tsuruoka)
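One simple way to read "recover the answer using soft alignments" is sketched below in Python. This is an assumption about the general idea, not the paper's exact procedure: the function name, the `attention[t][s]` layout (English token `t` attending to source token `s`), and the toy weights are all hypothetical.

```python
def recover_source_span(attention, answer_start, answer_end):
    """Map an English answer span back to a span in the source language L.

    attention[t][s] is the (soft) alignment weight between English token t
    and source token s. For every English token in the answer we pick the
    source token with the highest weight, then return the smallest span
    that covers all picked source tokens (inclusive indices).
    """
    aligned = []
    for t in range(answer_start, answer_end + 1):
        row = attention[t]
        aligned.append(max(range(len(row)), key=row.__getitem__))
    return min(aligned), max(aligned)

# Toy example with word reordering between the two languages.
english_tokens = ["the", "red", "house"]
source_tokens = ["la", "casa", "roja"]
toy_attention = [
    [0.8, 0.1, 0.1],  # "the"   -> "la"
    [0.1, 0.2, 0.7],  # "red"   -> "roja"
    [0.1, 0.8, 0.1],  # "house" -> "casa"
]

# The English answer "red house" covers English tokens 1..2.
start, end = recover_source_span(toy_attention, 1, 2)
recovered = source_tokens[start:end + 1]
```

Here the recovered source span is "casa roja" even though the word order differs, which is the reason alignments are used instead of copying token positions.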

  11. MLQA: Evaluating Cross-lingual Extractive Question Answering (2020)
      ● Benchmark dataset, comparable to SQuAD, to help speed up QA improvements in other languages
      ● Contains QA instances in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese
      ● MLQA has over 12K instances in English and 5K in each other language, with each instance parallel between 4 languages on average
      ● Pulled text from Wikipedia articles that exist in many languages, then employed crowdsourced annotators
      MLQA: Evaluating Cross-lingual Extractive Question Answering (Lewis, Oğuz, Rinott, Riedel, and Schwenk)

  12. MLQA: Evaluating Cross-lingual Extractive Question Answering (2020)
      MLQA: Evaluating Cross-lingual Extractive Question Answering (Lewis, Oğuz, Rinott, Riedel, and Schwenk)

  13. Quiz 1
      In what respect do you think multilingual semantic parsing differs from multilingual question answering?

  14. Multi Language Semantic Parsing

  15. Template-based data generation
      ● Genie methodology: Developers write templates to synthesize data
      ● Generate more natural data using crowdsourced paraphrases and data augmentation
      ● Combine paraphrases with the synthesized data to train a semantic parser
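The template step can be pictured with a minimal Python sketch that fills slot values into paired utterance and program templates. This is not Genie's actual template language: the `{cuisine}` placeholder syntax, the two templates, and the value list are assumptions for illustration, and the program string only mirrors the restaurant query format used in the error analysis later in these slides.

```python
from itertools import product

# Hypothetical program template in the style of the restaurant queries in these slides.
CUISINE_PROGRAM = (
    'now => ( @org.schema.Restaurant.Restaurant ) '
    'filter param:servesCuisine =~ " {cuisine} " => notify'
)

# Each template pairs a natural-language pattern with its target program.
TEMPLATES = [
    ("find a {cuisine} restaurant", CUISINE_PROGRAM),
    ("search for restaurants that serve {cuisine} food", CUISINE_PROGRAM),
]

# Hypothetical slot values; a real system would draw them from a database.
CUISINES = ["dim sum", "asian fusion"]

def synthesize(templates, cuisines):
    """Return (utterance, program) training pairs for every template and value."""
    return [
        (utterance.format(cuisine=value), program.format(cuisine=value))
        for (utterance, program), value in product(templates, cuisines)
    ]

pairs = synthesize(TEMPLATES, CUISINES)
```

Two templates and two values already give four aligned training pairs; crowdsourced paraphrases would then be added on top of this synthesized set.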

  16. Finding Data in Other Languages
      ● Structured: Any website using Schema.org metadata can be scraped to find relevant properties in each domain
      ● General: Wikipedia and other open websites allow scraping, but some knowledge is required to properly extract the values
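As a rough illustration of the structured case, the Python sketch below pulls Schema.org properties out of a page that embeds them as JSON-LD. It assumes the simplest layout (one flat object per `<script type="application/ld+json">` tag); the sample page, the restaurant name, and the function names are made up, and real pages may instead use microdata, RDFa, lists, or `@graph` wrappers.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script tag on a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed metadata instead of failing the crawl

def extract_restaurants(html):
    """Return the JSON-LD objects whose @type is Restaurant."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return [
        item for item in parser.items
        if isinstance(item, dict) and item.get("@type") == "Restaurant"
    ]

# Made-up page used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Restaurant",
 "name": "Example Bistro", "servesCuisine": "Dim Sum",
 "telephone": "+1-555-0100"}
</script>
</head><body></body></html>
"""

restaurants = extract_restaurants(SAMPLE_HTML)
```

Because the property names (`servesCuisine`, `telephone`) are the same on every site that follows the vocabulary, the same extractor works regardless of the language of the values.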

  17. Prior work
      Datasets:
      ● ATIS: Airline Travel Information System
      ● GeoQuery: The functional query language used in the Geoquery domain
      ● Overnight: In seven domains covering various linguistic phenomena
      ● NLMaps: A Natural Language Interface to Query OpenStreetMap
      Methods:
      ● Polyglot decoder for source-code generation from API documentation
      ● Ensemble monolingual hybrid tree parsers to generate a single parse tree
      ● Find multilingual representations based on dependencies or embeddings of logical forms
      ● Bootstrapping from English to another language without parallel data
      Bootstrapping a Crosslingual Semantic Parser

  18. Bootstrapping a Crosslingual Semantic Parser
      ● Training data is translated using multiple public machine translation APIs
      ● Dev and test sets are human-translated

  19. Bootstrapping a Crosslingual Semantic Parser
      ● Train with three different training sets

  20. Paraphrasing in Other Languages
      ● The English dataset is synthesized and does not perfectly match how humans write queries.
      ● Paraphrasing is used to generate more natural examples to cover a bigger space of all possible utterances.
      ● Translation models can act as paraphrasers, although we won’t have much control over the generated response.
      ● More sophisticated paraphrasing for other languages has become possible with the recent introduction of mBART (already has 5 citations!) and MarianMT models.
      Marian: Fast Neural Machine Translation in C++
      Multilingual Denoising Pre-training for Neural Machine Translation
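The idea that translation models can act as paraphrasers is often realized as round-trip (pivot) translation, sketched below in Python. The `translate(text, src, tgt)` callable is a hypothetical interface, and the lookup table is a stand-in for a real MT model such as MarianMT, so that the sketch runs without downloading anything.

```python
from typing import Callable, Iterable, List

# Hypothetical translator interface: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]

def pivot_paraphrases(sentence: str, lang: str, pivots: Iterable[str],
                      translate: Translator) -> List[str]:
    """Translate to each pivot language and back, keeping only new wordings."""
    seen = {sentence}
    paraphrases = []
    for pivot in pivots:
        candidate = translate(translate(sentence, lang, pivot), pivot, lang)
        if candidate not in seen:
            seen.add(candidate)
            paraphrases.append(candidate)
    return paraphrases

# Toy lookup table standing in for a real translation model.
TOY_TABLE = {
    ("buscar un restaurante dim sum", "es", "en"): "find a dim sum restaurant",
    ("find a dim sum restaurant", "en", "es"): "encuentra un restaurante de dim sum",
    ("buscar un restaurante dim sum", "es", "fr"): "chercher un restaurant dim sum",
    ("chercher un restaurant dim sum", "fr", "es"): "buscar un restaurante dim sum",
}

def toy_translate(text: str, src: str, tgt: str) -> str:
    return TOY_TABLE[(text, src, tgt)]

result = pivot_paraphrases("buscar un restaurante dim sum", "es", ["en", "fr"], toy_translate)
```

In this toy run the English pivot yields a new Spanish wording, while the French pivot returns the original sentence and is dropped. That also shows the lack of control noted on the slide: the caller cannot steer which rewording comes back.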

  21. Quiz 2
      Why is it better to train a single encoder on multiple languages compared to training one encoder for each language?

  22. Preliminary Error Analysis on Spanish

  23. Error Analysis of Current Results - Spanish
      Translating synthesized English sentences to Spanish can result in nonsense:
      ● ¿cuál es el número de teléfono de la oficina más banh mi nha trang subs
        English: What is the office phone number more banh mi nha trang subs
      ● ¿el blended bistro & boba en local pond tiene una opinión todavía ?
        English: Does the blended bistro & boba at local pond still have an opinion?
      ● lo que hace el restaurante nimi v. reseña de ?
        English: what does the restaurant nimi v. review of?

  24. Error Analysis of Current Results - Spanish
      Often filters on location instead of cuisine type
      Example Question: buscar un restaurante dim sum .
      English: search for a dim sum restaurant.
      Correct Response: now => ( @org.schema.Restaurant.Restaurant ) filter param:servesCuisine =~ " dim sum " => notify
      Gives Response: now => ( @org.schema.Restaurant.Restaurant ) filter param:geo == location: " dim sum " => notify

  25. Error Analysis of Current Results - Spanish
      Has difficulty with cuisines made up of two words (Asian fusion): it treats one of the words as a description or restaurant name. This could be a problem with other params that can be one to many words long.
      Example Question: ¿hay restaurantes fusión asiática cercanos con opiniones 10 estrellas ?
      English: Are there Asian fusion restaurants nearby with 10-star reviews?
      Gives Response: now => ( @org.schema.Restaurant.Restaurant ) filter @org.schema.Restaurant.Review { and param:description =~ " fusión " and param:reviewRating.ratingValue == 10 and param:servesCuisine =~ " asiática " => notify

  26. Error Analysis of Current Results - Spanish
      Sometimes generates random syntax:
      ¿cuáles son los últimos comentarios y puntuaciones de este restaurante ?
      English: What are some of the most recent reviews of this restaurant?
      Gives: now => [ param:aggregateRating.ratingValue , param:reviewRating.ratingValue ] of ( ( @org.schema.Restaurant.Restaurant ) filter param:geo == location:current_location ) => notify
      What does this even mean?

  27. Room for Improvement
      ● Templates to make sure that common grammar patterns create correct parameters (cuisine vs. location)
      ● AND hook up the model with a database to understand whether a word is a cuisine or something else
      ● Better ML to create paraphrased sentences in other languages to avoid nonsense
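The database hookup suggested above could start as a plain value lookup, as in the Python sketch below. Everything here is hypothetical: the two value sets stand in for real database tables, and `choose_filter` is an illustrative helper that emits filter fragments in the format shown in the error-analysis slides.

```python
# Hypothetical value tables; a real system would query a database.
KNOWN_CUISINES = {"dim sum", "fusión asiática"}
KNOWN_LOCATIONS = {"palo alto"}

def choose_filter(value, cuisines=KNOWN_CUISINES, locations=KNOWN_LOCATIONS):
    """Pick the filter parameter for a value by checking which table contains it.

    Returns a filter fragment, or None when the value is unknown and the
    decision should be left to the parser.
    """
    normalized = " ".join(value.lower().split())
    if normalized in cuisines:
        return f'param:servesCuisine =~ " {normalized} "'
    if normalized in locations:
        return f'param:geo == location: " {normalized} "'
    return None
```

Matching the whole multi-word string before splitting it keeps "fusión asiática" together as one cuisine, which addresses the two-word failure from the earlier Spanish example as well as the cuisine vs. location confusion.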

  28. Quiz 3
      Why is the translation-based data synthesis method a practical alternative to template-based sentence generation?
