
Speech Processing 15-492/18-492: Speech Synthesis, Talking Heads



  1. Speech Processing 15-492/18-492: Speech Synthesis
     - Talking heads
     - Singing synthesis

  2. More Information is Better
     - Voice + text is easier to understand
     - Voice + face is easier too

  3. Talking Heads
     - Adds novelty/character/personification
     - Experiments show better understanding
       - Lip synching
       - Facial movements
     - Listeners swear it's better synthesis

  4. Talking heads

  5. Talking Heads
     - Synthesize text
       - Output phone position in audio stream
     - Map phones to lip/tongue positions
     - Build visual stream
       - Choose appropriate frames
       - Aligned with audio
     - How many facial positions?
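
The steps on this slide (phone timings from the synthesizer, a phone-to-mouth-shape map, then one frame choice per video frame) can be sketched in a few lines. This is only an illustration under stated assumptions: the `PHONE_TO_VISEME` table, the `Phone` record, and the 25 fps frame rate are hypothetical and are not taken from any system named in the lecture.

```python
from dataclasses import dataclass

# Hypothetical phone-to-viseme table (an assumption for illustration only).
PHONE_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",
    "aa": "open", "ae": "open", "eh": "open",
    "uw": "rounded", "ow": "rounded", "w": "rounded",
}


@dataclass
class Phone:
    name: str
    start: float  # seconds into the audio stream
    end: float    # seconds into the audio stream


def viseme_frames(phones, fps=25, default="closed"):
    """Return one viseme label per video frame, aligned with the audio.

    `phones` must be sorted by time and contiguous. Each frame takes the
    viseme of the phone that is active at the frame's midpoint.
    """
    if not phones:
        return []
    n_frames = int(round(phones[-1].end * fps))
    frames = []
    i = 0
    for f in range(n_frames):
        t = (f + 0.5) / fps  # midpoint of this video frame
        # Advance to the phone that covers time t.
        while i < len(phones) - 1 and t >= phones[i].end:
            i += 1
        frames.append(PHONE_TO_VISEME.get(phones[i].name, default))
    return frames
```

Picking a frame per instant like this is the simplest possible alignment; it switches mouth shape abruptly at phone boundaries.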

  6. Visemes
     - Baphy: three positions
       - Closed, open and rounded
     - Rho
       - 10 lip positions
       - Eyelid 4
       - Eyes 2
     - When should they align?
       - Follow trajectories, not just at time instant
       - Shape for syllables, not just phones
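
One way to read "follow trajectories, not just at time instant" is to treat each mouth shape as a keyframe and interpolate between keyframes for every video frame. The sketch below assumes a single hypothetical parameter, a lip opening between 0.0 and 1.0, and plain linear interpolation; real systems may use many parameters and smoother curves.

```python
def lip_trajectory(targets, fps=25):
    """Interpolate a lip-opening value for every video frame.

    `targets` is a list of (time_sec, opening) keyframes with strictly
    increasing times. The opening scale (0.0 closed, 1.0 fully open) is a
    hypothetical parameter chosen for this sketch.
    """
    if len(targets) < 2:
        raise ValueError("need at least two keyframes")
    n_frames = int(round(targets[-1][0] * fps)) + 1
    values = []
    k = 0
    for f in range(n_frames):
        t = f / fps
        # Move to the keyframe segment that contains time t.
        while k < len(targets) - 2 and t > targets[k + 1][0]:
            k += 1
        (t0, v0), (t1, v1) = targets[k], targets[k + 1]
        alpha = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
        values.append(v0 + alpha * (v1 - v0))
    return values
```

With keyframes at the closed, open, and closed positions, the mouth opens and closes gradually instead of jumping between shapes.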

  7. Synthesis Analogies
     - Articulatory synthesis
       - Modeling the vocal tract
       - Baldi: movement of muscles
     - Formant
       - Modeling of signal synthetically
       - Cartoon-based faces (Baphy)
     - Concatenative
       - Joining natural segments
       - JPL example
       - Interval’s Video Rewrite
     - Unit size
       - Baphy == uniphone
       - JPL == diphone
       - Video Rewrite == unit selection

  8. Talking Heads
     - Personalization
       - Can look like a mask put on a dummy
     - Uncanny valley
       - The more human-like, the more critical we are
     - 3-D movement (in real time)
       - Second Life-type characters
       - Gesture generation too
     - Off-line
       - (Gollum, Jabba the Hutt)
       - Usually actors do the voices

  9. Singing Synthesis
     - Simple pitch and duration control
       - But singing is more than that
     - Proper singing synthesis
       - Recording a singing database
       - Phonetic, prosodic, and singing style coverage
       - Sung rather than spoken voice

  10. Flinger (Festival Singer) (Macon)
      - Sinusoidal modeling
        - More pitch control than just PSOLA
      - MIDI interface
        - Allow mixing with music
        - Standard MIDI authoring techniques
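
A synthesizer driven through a MIDI interface has to turn MIDI note numbers into pitch targets. The sketch below uses the standard equal-temperament convention (note 69 is A4 at 440 Hz); whether Flinger uses exactly this mapping is an assumption, and the function name is hypothetical.

```python
def midi_to_hz(note, a4=440.0):
    """Convert a MIDI note number to a frequency in Hz.

    Assumes twelve-tone equal temperament with MIDI note 69 tuned to `a4`.
    """
    return a4 * 2.0 ** ((note - 69) / 12.0)
```

Each step of 12 in the note number doubles the frequency, so note 57 is 220 Hz and note 81 is 880 Hz.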

  11. Festival Singing Mode
      - Dominic Mazzoni (11-752 project 2001)
      - XML based song description
        - <DURATION BEATS="1.0"> <PITCH NOTE="C4">Oh</PITCH> </DURATION>
      - But not just setting pitch at duration point
        - When do you move it (based on syllable and voicing)?
        - How quickly do you move pitch?
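
The two questions on this slide (when to move the pitch and how quickly) can be made concrete with a toy F0 contour. The sketch below holds each note and then glides to the next note over a fixed time at the end of the note, linearly in log-frequency. It is not Festival's algorithm: the fixed `glide` time, the 10 ms frame step, and the function name are assumptions, and it ignores the syllable and voicing information the slide mentions.

```python
import math


def f0_contour(notes, glide=0.05, frame=0.01):
    """Build a per-frame F0 track from (frequency_hz, duration_sec) notes.

    Each note is held at its own pitch, then glides toward the next note
    during its final `glide` seconds. The last note has no glide.
    """
    contour = []
    for i, (hz, dur) in enumerate(notes):
        next_hz = notes[i + 1][0] if i + 1 < len(notes) else hz
        n_frames = int(round(dur / frame))
        for k in range(n_frames):
            remaining = dur - k * frame
            if glide > 0 and remaining < glide:
                # Fraction of the glide completed so far, in [0, 1).
                alpha = 1.0 - remaining / glide
                log_f = (1.0 - alpha) * math.log(hz) + alpha * math.log(next_hz)
                contour.append(math.exp(log_f))
            else:
                contour.append(hz)
    return contour
```

Shortening `glide` makes the transitions sound more abrupt; lengthening it produces a more audible slide between notes.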

  12. Singing Example

          <?xml version="1.0"?>
          <!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" "Singing.v0_1.dtd" []>
          <SINGING BPM="30">
          <PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
          <PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
          <PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
          <PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
          <PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
          <PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
          <PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
          <PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
          </SINGING>
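
To see what a song description like this encodes, the sketch below parses a shortened copy of it into (lyric, MIDI note number, seconds) events. Several things here are assumptions rather than documented Festival behavior: that `BEATS` is measured in beats at the `BPM` tempo (so 0.3 beats at 30 BPM lasts 0.6 s), that `C4` means middle C (MIDI note 60), and that `DURATION` is nested inside `PITCH` as in this example. The function names are hypothetical, and the XML declaration and DOCTYPE are left out of the embedded string for brevity.

```python
import xml.etree.ElementTree as ET

SONG = """<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
</SINGING>"""

# Semitone offset of each note letter within an octave starting at C.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_midi(name):
    """Convert a note name such as 'C4' or 'F#4' to a MIDI note number."""
    letter, rest = name[0].upper(), name[1:]
    accidental = 0
    if rest.startswith("#"):
        accidental, rest = 1, rest[1:]
    elif rest.startswith("b"):
        accidental, rest = -1, rest[1:]
    return 12 * (int(rest) + 1) + SEMITONES[letter] + accidental


def parse_song(xml_text):
    """Return a list of (lyric, midi_note, seconds) events."""
    root = ET.fromstring(xml_text)
    seconds_per_beat = 60.0 / float(root.get("BPM"))
    events = []
    for pitch in root.iter("PITCH"):
        midi_note = note_to_midi(pitch.get("NOTE"))
        for dur in pitch.iter("DURATION"):
            seconds = float(dur.get("BEATS")) * seconds_per_beat
            events.append((dur.text.strip(), midi_note, seconds))
    return events
```

Calling `parse_song(SONG)` yields one event per sung syllable, which is the information a singing voice needs before it can decide how to shape the pitch within each note.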

  13. Future in TTS
      - More natural voices
        - Sound human
        - Interact in a human way (not just words)
      - More personalization
        - Sound like a particular person
        - Cross-lingual synthesis
      - More flexible
        - Say it with more feeling
        - Real-time voice transformation
        - Have an American accent while you speak

  14. Text to speech process
      - Text analysis
        - From characters to words
      - Linguistic analysis
        - From words to pronunciations
      - Waveform analysis
        - From pronunciations to noises

  15. HW2: TTS
      - Due 3:30pm Monday October 20th
      - Install Festival and Festvox
      - Find 10 errors in each of two different synthesizers
      - Build a voice
        - A Talking Clock
        - A general voice
        - (or both)
