 
Workshop on the Role of Speech in Developing Robust Speech Processing Applications
May 7-8, 2015
NSF, Arlington, VA

Overview: The workshop focused on the critical role that speech science should play in developing robust future speech technology applications. The result is a set of well-motivated interdisciplinary recommendations on how best to promote and foster the role of speech science in developing speech technology applications for the long-term benefit of society.

To arrive at these recommendations, the workshop invitees were asked to give a brief overview of specifically chosen research areas, to explain where they thought more progress is needed, and to give opinions on the barriers to progress. Of special interest were areas in which industry is unlikely to invest because the topic is not a priority, the time course is too long, or the population served is not of interest. Invitees included scientists and technologists chosen, for their expertise in the relevant fields, on the recommendation of NSF Program Managers and the workshop organizers.

Four theme areas were discussed:
1) Elderly speech, mental illness, and assistive technologies
2) Speaker state including affective speech, personality, and speech in multi-person contexts
3) Children’s speech, accented speech, and limited-data adaptation
4) Biologically-inspired and cognitive models of speech communication

These themes are of interest to a variety of programs at NSF. For example, in the SBE/BCS Division: Linguistics; Perception, Action, and Cognition; Cognitive Neuroscience; and Developmental & Learning Sciences. In the CISE/IIS Division: Core programs in Robust Intelligence and Cyber-Human Systems; Smart and Connected Health; Cyberlearning and Future Learning Technologies; and the National Robotics Initiative.

Participants from these different disciplines and topic areas typically do not meet together to set future priorities.
The list of invitees was developed with exactly this goal in mind, and the workshop allowed ample time for cross-disciplinary discussion.

Recommendations: All participants highlighted the importance of multidisciplinary research with applications especially in health and education. They recommended that techniques should not only focus on automatic speech recognition (ASR) but also other real-world speech applications such as synthesis (especially expressive synthesis), speech enhancement, voice morphing, and computational paralinguistics (e.g., emotion and speaker state).

In health-related areas, it was recommended that more research needs to be done to develop assistive technologies for both the speech and hearing impaired (e.g., neurological and physiological impairments), and also the physically impaired (e.g., motor impairments and central cord syndrome) for which focused speech technology can be of great assistance. Assessments that are correlated with daily functions and/or brain activities, and assessments that can be used for training professionals, are also needed. Diagnostic systems based on speech and other biomarkers need to be explored, especially for depression, post-traumatic stress disorder (PTSD), traumatic brain injury (TBI), mild cognitive impairment (MCI), Alzheimer’s disease, Parkinson’s disease, and autism. Other recommendations in this area include biofeedback to improve articulation, speech therapy tools, and tools for modifying one’s own speech output.

In general, there is a need to study sources of variability such as acoustic and channel distortions, accent, lab setting versus real-world situations, and other situations which could result in significant mismatch between the testing and training data. A common way of dealing with the unwanted variability is to train the system over all sources of the ‘harmful’ variability. While such an approach represents mainstream industrial approaches and can be effective, it may not be the most efficient way of dealing with the problem. We recommend that academic research efforts take guidance from human speech processing and study, understand, and model all sources of variability in speech, aiming at alleviating the current needs for ever-increasing amounts of expensive training of speech information processing technology.
Participants were asked to organize and prioritize suggestions to address these challenges, and come up with a list of specific thematic recommendations. The recommendations are listed below:

1) Suggest the creation of interdisciplinary NSF panels between the programs mentioned earlier: Linguistics; Perception, Action, and Cognition; Cognitive Neuroscience; Developmental & Learning Sciences; Core programs in Robust Intelligence and Cyber-Human Systems; Smart and Connected Health; Cyberlearning and Future Learning Technologies; and the National Robotics Initiative. These panels would review proposals addressing fundamental research in speech science and speech processing in the context of future applications.

2) Suggest creating a rich speech repository that would include children’s and elderly speech, disordered speech, speech from hearing and/or cognitively impaired individuals, cross-linguistic and cross-dialectal speech, and low-resource languages. Possible funding mechanism: CRI

3) Suggestions for the Research Community:
• Refine the definition of success in speech technology beyond word error rate (WER), which measures the system as a whole but does not enable diagnosis of parts of the system. Research could aim at techniques for measuring distinct functional components in end-to-end ASR architectures, so that one could, for example, measure the success of an interactive system. Advances in this area could be made from close collaborations between the speech engineering and human language research communities. Ultimately, metrics that enable direct comparison of human and machine recognition could be developed.
• Discover new problems in linguistic, clinical, and educational settings through collaborations between speech engineers and speech scientists.
• Develop research approaches that do not rely on deep learning because: a) dealing with small datasets leads to inaccurate conclusions; b) features need to capture richer information about speaking behaviors; c) approaches need to generalize to new data; and d) long-term goals require researchers to be able to interpret findings in a scientific manner.
• Develop standardized archiving/data sharing. Protocols for data collection and annotations can be beneficial for speech science and technology.
• Expand richness of data using parallel data streams like video, articulatory phonetic, brain imaging, and semantic data.
• Expand the type of speech events studied: from speech audio recordings to communicative events (i.e., person-to-person communication) across a broader range of environmental and social contexts.

4) Expand the intellectual capital devoted to speech technology research
• Promote interdisciplinary training in linguistics and in engineering (for example, online modules, summer schools, and workshops).
• Promote interdisciplinary collaboration: Build on speech databases and tools such as those funded by NSF's Computer & Information Science & Engineering Research Infrastructure (CRI) program as the basis of collaborations, especially in modeling speech production and perception.

Conclusion: The workshop generated many thought-provoking discussions on the future of speech science and technology research. The multidisciplinary team worked very well together, and we hope that this workshop has laid the groundwork for mapping new directions for the speech community. All participants expressed willingness to serve as contacts should questions about the recommendations arise in the future.

Invited Participants:
Abeer Alwan, UCLA
Anton Batliner, Universität Erlangen-Nürnberg, Germany
Jeffrey A. Bilmes, University of Washington, Seattle
Tim Bunnell, University of Delaware
Hugo van Hamme, KU Leuven, Belgium
Hynek Hermansky, Johns Hopkins University
Julia Hirschberg, Columbia University
Keith Johnson, UC Berkeley
Tyler Kendall, University of Oregon
Josh McDermott, Massachusetts Institute of Technology
Florian Metze, Carnegie Mellon University
Jonathan Peelle, Washington University
Tom Quatieri, MIT Lincoln Laboratory
Mitch Sommers, Washington University
Sandra Gordon-Salant, University of Maryland
Elizabeth Shriberg, STAR Laboratory, SRI