Advanced Animatronics: Voice and Jaws v1.0 – Flüüfff 22/11/2019 – Floere – PowerPoint PPT Presentation


SLIDE 1

Advanced Animatronics Voice and Jaws v1.0

Flüüfff – 22/11/2019

Floere T. Pillowbeaver, Devourer of Nuclear Submarines fmoere@robocow.be

SLIDE 2

2 / 60

What is this Talk About?

  • An overview of the State of the Art of moving jaws and voice projection
  • Why I think their performance is ‘meh’
  • My research into a self-contained, real-time, speech-expression-mimicking character with a clear voice
  • All the good ideas that weren’t...
SLIDE 3

3 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 4

4 / 60

Goal: Puppet Without Strings

  • Your character driven by your acting
  • Clear voice projection
  • Live audience interaction
  • Everything self-contained in the costume
  • Comfortable
  • Affordable

Lip-syncing with puppet mask (manually actuated): Radula Castion – Zuzu’s White Rabbit https://www.youtube.com/watch?v=b2pDuWh3ik8

SLIDE 5

5 / 60

Low Integration Complexity

  • Easy enough to implement by hobbyists
  • Not a movie-grade animatronic with 30+ servos and a head full of gears
  • Simple mechanisms must suffice
    – Off-the-shelf parts
    – 3D printable

Gustav Hoegen

SLIDE 6

6 / 60

The Big Challenge

  • Motion must be psychologically correct, not necessarily physiologically correct!
  • A big, flappy mouth on a fuzzy critter is not exactly real…
  • Uncanny valley helps → stay non-human!

Wikipedia – Uncanny Valley Conjecture (Mori 1970)

SLIDE 7

7 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 8

8 / 60

Let’s Watch Some Videos...

  • All of these are live performances by the costume actors themselves (no lip-syncing or over-dubbing)
  • Professional

Katey McGregor – Talking Mickey Mouse https://www.youtube.com/watch?v=762-tHwnAHg
Mascot – Animatronic Mascots https://www.youtube.com/watch?v=Ve3vuxII6Dc
Lunaspuppets – Human-Size Animatronic Robotic Talking Donkey Puppet https://www.youtube.com/watch?v=Cv5yAfHWEY4

  • Furry Fandom

Bake Me Up Buttercup – How to Measure Flour Correctly https://www.youtube.com/watch?v=YBkT5woqmAY
BeautyoftheBass – Speaker Costume Talks Live! V3 https://www.youtube.com/watch?v=UWOWqe1kP7U
DRAGON =^‿^= – Howwwwwwdy folks and welcome to Monday (Twitter: @GRNdragon0)

SLIDE 9

9 / 60

It’s a Bit of a Mess, Isn’t It?

  • Professional work
    – Limited, static articulation (blinks + simple mouth)
    – Good voice quality... is not actually the case!
      • Often a remote voice actor is involved
      • Often pre-recorded phrases (semi-scripted)
    – Most costumes are actually puppets, controlled by the actor’s hand/chin/tongue, or a remote operator
    – Let’s have a look at this…

The Character Academy – How Disney Characters Blink https://www.youtube.com/watch?v=YRDBFc-TrtM

SLIDE 10

10 / 60

It’s a Bit of a Mess, Isn’t It?

  • Amateur work is actually better in some ways
    – Articulated jaws can work (but often don’t)
      • But it does not look like real speech!
      • Good fit = uncomfortable to use for long
    – Voice is dull in real life
      • YouTube videos use internal microphones
      • BeautyoftheBass is about the best one for live voice projection
      • There are cosplayers who use the “TC Helicon Perform V” for voice projection, which works well (but a bulky system)

SLIDE 11

11 / 60

Why is the Tech So Basic?

  • There are many practicalities for the big boys that limit scope (getting the character voices right, consistency with many actors per costume, training requirements, etc.)
  • The main reason, I think, is that it is actually a hard problem to solve in practice
  • It would take a lot of money, or a motivated idiot with a PhD...

SLIDE 12

12 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 13

13 / 60

Why Moving Jaws Fail for Speech

  • Fundamentally: moving jaws do not work well while speaking, because normal speech does not use much jaw motion
  • Any slop in the mechanism dulls jaw motion
  • Some performers can make their jaw work
    – Speaking with exaggerated jaw motion
    – E.g.: Buttercup and NIIC do this well
  • Still does not feel right… (hint: visemes)
SLIDE 14

14 / 60

What the Science Says...

  • There are two sets of muscles in the jaw:
    – Big and very powerful ones for chewing and large jaw motions. These are slow!
    – Little, fast ones for speech
    – The big ones disengage when speaking
  • Jaw motion during speech is usually small:
    – Under ~0.3 cm pronouncing /ta/ and /te/ (Ostry and Flanagan, 1989)
  • Some sounds (e.g. vowels) can have large motion:
    – Under ~2.5 cm pronouncing /a/ (Vatikiotis-Bateson and Ostry, 1995)

SLIDE 15

15 / 60

What the Science Says...

“Human Jaw Movement in Mastication and Speech,” D.J. Ostry and J.R. Flanagan, Arch. Oral Biol., Vol. 34, No. 9, pp. 685–693, 1989

Sensor attached to the chin, just posterior to the mental notch.

SLIDE 16

16 / 60

What the Science Says...

Marker 4 cm from lower incisors, ~on the midsagittal plane. “An Analysis of the Dimensionality of Jaw Motion in Speech,” E. Vatikiotis-Bateson and D.J. Ostry, Journal of Phonetics, Vol. 23, pp. 101–117, 1995

SLIDE 17

17 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 18

18 / 60

First, a Little...

SLIDE 19

19 / 60

How Speech is Produced

Haskins Laboratories
K. Duh, M. Lloyd, M. Smiley
gosh.nhs.uk

SLIDE 20

20 / 60

How Speech is Produced

Jörgen Ahlberg – Source-Filter Model of Speech Production

SLIDE 21

21 / 60

Phonemes vs Visemes

  • Animators learn that much of visible speech is lip motion
  • They use only a few visemes
    – Many speech sounds (phonemes) look alike
    – E.g., to a lip reader, “elephant juice” = “I love you”
  • Thus: we can simplify a lot
  • Can we get phonemes from speech?
    – A very hard problem
    – Key to speech recognition

SLIDE 22

22 / 60

Mouth Shape from Sound?

  • Look at the visemes and try the utterances
    – Voiced or louder → mouth more open
    – Nasal or unvoiced → mouth more closed
  • Try: “mama” “is” “na”
  • Not perfect, but should be good enough for a simple jaw

Wolf Paulus – Viseme Model with 12 Mouth Shapes
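The louder/voiced-opens, nasal/unvoiced-closes rule above can be sketched as a tiny per-frame mapping. This is only an illustration: the feature names (frame RMS, a voiced flag, a nasalance score) are assumed to come from the speech-analysis stage, and all scale factors here are made up.

```python
import math

def mouth_openness(rms, voiced, nasalance,
                   rms_floor=0.01, rms_max=0.3):
    """Map per-frame speech features to a 0..1 mouth-openness value.

    rms       -- frame RMS amplitude
    voiced    -- True if the frame is voiced
    nasalance -- 0..1, how nasal the voiced speech is

    Louder/voiced frames open the mouth; unvoiced or nasal frames
    pull it closed.  Thresholds and weights are illustrative.
    """
    if rms < rms_floor:                 # silence -> mouth closed
        return 0.0
    # log-compress loudness into the 0..1 range
    span = math.log(rms_max / rms_floor)
    loud = min(1.0, max(0.0, math.log(rms / rms_floor) / span))
    if not voiced:
        loud *= 0.3                     # unvoiced -> mostly closed
    return loud * (1.0 - 0.7 * nasalance)  # nasal -> more closed
```

Driving a servo would then just be scaling this 0..1 value to the jaw's travel range.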

SLIDE 23

23 / 60

How We’re Going to Do It

  • Key idea: rough visemes
    – Estimate mouth state from jaw + lips
    – No actual phoneme detection
    – Don’t need perfection
  • Jaw sensor
    – Chin motion (slow)
    – Measured from jaw
    – Includes static poses
  • Lip “sensor” (via mic)
    – Lip motion (fast)
    – Estimated from speech
    – No action when silent

[Diagram: jaw sensor + lip “sensor” (speech analysis) → mouth estimate → jaw servos]

SLIDE 24

24 / 60

Voicedness + Nasalance

  • Voicing detection
    – Voiced, unvoiced, or silence?
    – How much energy?
  • Nasalance
    – How nasal is the voiced speech?
  • Have done original research on sensors

Donald Derrick – nasalance of /na/
“A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition,” Bishnu S. Atal, Lawrence R. Rabiner, 1976.
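A bare-bones voiced/unvoiced/silence decision in the spirit of the energy and zero-crossing features used by Atal & Rabiner might look like the sketch below. The thresholds are invented and would need tuning per microphone; the real classifier uses more features and a proper pattern-recognition stage.

```python
def classify_frame(frame, energy_sil=1e-4, zcr_voiced=0.15):
    """Crude voiced/unvoiced/silence decision for one audio frame
    (a list of floats in -1..1).  Voiced speech tends to have high
    energy and few zero crossings; unvoiced (fricative) speech has
    many zero crossings.  Thresholds are illustrative only.
    """
    n = len(frame)
    energy = sum(x * x for x in frame) / n          # mean power
    zcr = sum(1 for a, b in zip(frame, frame[1:])   # zero-crossing rate
              if (a < 0) != (b < 0)) / n
    if energy < energy_sil:
        return "silence"
    return "voiced" if zcr < zcr_voiced else "unvoiced"
```

Per the previous slides, "voiced" frames would open the mouth and "unvoiced"/"silence" frames would keep it closed.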

SLIDE 25

25 / 60

Bringing it All Together

  • Jaw activity gets us the “wide open” visemes, as well as silent + static mouth motions
  • Speech activity opens the lips
  • Unvoiced speech and high nasalance counteract the lip opening
  • Thus: the voice signal adds the lost small (fast) lip motion to the large (slow) jaw motion
    – Lips can be separate or added to jaw motion
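A minimal sketch of that fusion, assuming both channels already arrive as 0..1 values per frame. The smoothing constant and the clamping are illustrative choices, not the project's actual code.

```python
def combine_jaw_and_lips(jaw, lip, alpha=0.2):
    """Fuse slow jaw-sensor samples with fast mic-derived lip
    openness into one servo command per frame (all values 0..1).

    The jaw channel is low-pass filtered (it is slow anyway and the
    sensor is noisy); the lip estimate is added on top, as the
    slide describes for the single-servo case.
    """
    out, jaw_smooth = [], jaw[0]
    for j, l in zip(jaw, lip):
        jaw_smooth += alpha * (j - jaw_smooth)   # one-pole low-pass
        out.append(min(1.0, jaw_smooth + l))     # lips ride on the jaw
    return out
```

With separate lip servos, the two channels would simply be routed to their own actuators instead of summed.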

SLIDE 26

26 / 60

Bringing it All Together

  • Mechanism
    – Jaw → 1 servo, on the jaw hinge
    – Lips → 1–2 servos (optional), on lip actuation wires
  • Sensors
    – Two microphones (mouth + nose)
    – Jaw strap

Eva Taylor – Animatronic Alien https://makezine.com/2014/10/27/the-making-of-an-animatronic-alien/

SLIDE 27

27 / 60

Mechanisms

  • Tioh http://www.tioh.de/
  • Radula Castion https://radulacastion.wixsite.com/radulacastion
  • “Animatronic Character Creation – Organic Mechanics I & II,” Rick Lazzarini, Stan Winston School of Character Arts

skud duncan – Animatronic Jaw Test https://www.youtube.com/watch?v=15IVl1VYdSk
Winter Snowmew – “Couple of my followers have been curious about the weird snout. Here is the snarl and mouth mechanics.”

SLIDE 28

28 / 60

How Good is “Simple”?

  • We gain a lot with only a jaw, or jaw + simple lips (1–3 servos)
  • The full expression of a movie-grade animatronic mouth would require many more servos and a much more complex motion-capture system
    – This is not the point of this project
    – Affordability and “bang for the buck” is key

SLIDE 29

29 / 60

Does Simple Lose Much?

  • Let’s compare high-end animatronics to a well-done lip sync
  • I think small errors in animation are working against it → uncanny valley
  • Clearly: diminishing returns

TheCharacterShop – TCSpolarbearWaldo.mov https://www.youtube.com/watch?v=bFW2azvVEdI
Shanetheactor – MetroPCS Commercial https://www.youtube.com/watch?v=udlQ7SH_RtM
Radula Castion – Zuzu’s White Rabbit https://www.youtube.com/watch?v=b2pDuWh3ik8

VS

SLIDE 30

30 / 60

So, Is It Really That Simple?

  • Unfortunately, NO
  • This is one of those things that seems easy enough in principle, doable in the lab…
  • ...but is much harder in the field:
    – Conventions are LOUD!
    – Voice acting gives bizarre speech patterns
    – Sensors don’t stay put
      • Not practical to glue sensors to the face or require piercings/implants
    – Computer vision systems not practical (yet)

SLIDE 31

31 / 60

Fundamental Limitations

  • Errors in animation will happen (expect 10–20%)
  • Some patterns of speech and acting will fail
    – Mouth held open for a long time
    – Mouth unmoving while speaking
    – Mouth held shut while mumbling
  • Sudden, loud changes in the environment may result in jaw motion (surprise?)
  • No provisions for smiling, snarling, etc… yet
    – Smile = mouth a little open for now...

SLIDE 32

32 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 33

33 / 60

Speech in LOTS of Noise

  • Noise causes the jaw to move
  • Conventions, outdoors:
    – Loud, even during calm moments
    – Noise is non-stationary
  • Adaptive filters required!
  • I designed a 3-layer system
  • The purpose of L1 and L2 is to lift as much of our voice out of the noise as possible, so L3 can really go to town on the noise. (Which can also be a voice! This is how it can tell the difference.)

[Diagram: L1 cardioid mic + ambient mic → L2 GCCPF → L3 MMSE]
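The layering amounts to a simple processing chain per audio chunk. In this sketch, `l2` and `l3` are hypothetical stand-ins for the real GCCPF and MMSE stages; L1 is the physical cardioid microphone, already applied by the hardware.

```python
def denoise(chunk_voice, chunk_ambient, l2, l3):
    """Run one audio chunk through the three-layer structure:
    the L2 two-channel stage uses the ambient reference mic to
    subtract correlated noise, then the L3 single-channel stage
    cleans up what remains.  l2 and l3 are placeholder callables
    for the actual GCCPF and MMSE implementations.
    """
    lifted = l2(chunk_voice, chunk_ambient)  # lift voice out of the noise
    return l3(lifted)                        # spectral cleanup
```

The point of the ordering is that L3 only sees a signal whose noise is already mostly uncorrelated residue, so its non-stationary noise estimate is far less likely to eat the voice itself.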

SLIDE 34

34 / 60

Speech in LOTS of Noise

SLIDE 35

35 / 60

Three-Layer Noise Reduction – L1: Close-Talking Cardioid Microphone

  • Start with as high an SNR as we can!
  • The test recording was done facing a speaker set so loud I could hardly hear myself talk*
  • The costume head will also add some noise reduction

* This test recording was actually done using an omni-directional microphone, thus worst-case

SLIDE 36

36 / 60

Three-Layer Noise Reduction – L2: Two-Channel Cancellation

  • GCCPF – Generalized Cross-Coupled Paired Filter
  • Models the paths between the noise-reference and speech microphones, then subtracts the noise reference from the signal and vice versa
  • I modified the Sugiyama algorithm to take better advantage of the close-talking mic and to self-adjust better to the stupid acoustic environment

“Low Distortion Noise Cancellers – Revival of a Classical Technique,” Akihiko Sugiyama
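The modified GCCPF itself isn't spelled out here, but the classic two-channel scheme it builds on — an adaptive filter learning how the ambient reference leaks into the speech mic, then subtracting that prediction — can be sketched with plain NLMS. This is the textbook ancestor, not the author's algorithm; tap count and step size are illustrative.

```python
def nlms_cancel(primary, reference, taps=8, mu=0.5, eps=1e-8):
    """Two-channel adaptive noise canceller (NLMS).

    primary   -- speech mic samples (voice + leaked noise)
    reference -- ambient mic samples (noise only, ideally)

    The filter w models the acoustic path from the reference into
    the primary; the prediction error e is the cleaned signal.
    """
    w = [0.0] * taps
    buf = [0.0] * taps
    out = []
    for p, r in zip(primary, reference):
        buf = [r] + buf[:-1]                      # reference history
        y = sum(wi * xi for wi, xi in zip(w, buf))  # predicted leakage
        e = p - y                                 # error = cleaned output
        norm = sum(x * x for x in buf) + eps      # power normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        out.append(e)
    return out
```

The cross-coupled "paired" structure adds a mirror filter in the opposite direction so voice leaking into the ambient mic does not get cancelled along with the noise.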

SLIDE 37

37 / 60

Three-Layer Noise Reduction – L3: One-Channel Cancellation

  • Based on MMSE-STSA noise estimation (Minimum Mean-Square Error Short-Time Spectral Amplitude)
  • Related to the Audacity noise canceller, but able to handle non-stationary noise conditions. Like your phone does!
  • The example is set overly aggressive

“Development of Speech Technologies to Support Hearing through Mobile Terminal Users,” T. Togawa, T. Otani, K. Suzuki, T. Taniguchi, 2015.
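Real MMSE-STSA computes a statistically optimal gain per frequency bin; as a much-reduced illustration of the same single-channel family, here is a recursive per-bin noise tracker with floor-and-subtract attenuation. All constants are illustrative and this is not the MMSE gain rule itself.

```python
def spectral_gate(frames_mag, noise0, alpha=0.9, floor=0.1):
    """Very reduced single-channel spectral attenuator.

    frames_mag -- list of per-frame magnitude spectra (lists of floats)
    noise0     -- initial noise-magnitude estimate per bin

    Bins close to the tracked noise level update the estimate and get
    attenuated (keeping a small floor, which avoids 'musical noise');
    louder bins are assumed to be speech and get the noise estimate
    subtracted.  Tracking per frame is what lets this follow
    non-stationary noise, unlike a fixed noise profile.
    """
    noise = list(noise0)
    out = []
    for mag in frames_mag:
        cleaned = []
        for i, m in enumerate(mag):
            if m < 2.0 * noise[i]:              # looks like noise: track it
                noise[i] = alpha * noise[i] + (1 - alpha) * m
                cleaned.append(floor * m)       # attenuate, keep a floor
            else:
                cleaned.append(m - noise[i])    # subtract noise estimate
        out.append(cleaned)
    return out
```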

SLIDE 38

38 / 60

Speech in Frequency Domain

SLIDE 39

39 / 60

Handling the Algorithms

  • The good news: they are adaptive
    – They will work in most environments
    – They will work with most speakers and languages
    – They will work with squeakers
  • The bad news: they are adaptive
    – They can get it wrong at times
    – Many, many parameters to configure
  • The good news: they are robust and forgiving
    – These are some of the most robust algorithms out there
    – Most of the parameters are fixed for the application
      • Your cellphone doesn’t need manual intervention either!
    – The remainder tunes easily to a specific costume
      • There will be assistance software
SLIDE 40

40 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 41

41 / 60

Capturing Jaw Motion

  • Well, I’ll just use a chin strap with a stretch sensor and…
  • Oh. Bugger
  • Never mind, it’s not comfortable anyway
  • Tried a paddle, a bar, elastic, etc…
    – Shifts around too much
    – Interferes with speech

SLIDE 42

42 / 60

Capturing Jaw Motion

  • Fibre-optic chin loop
    – Very comfy
    – Quite robust
    – Cheap
    – Easy to manufacture
    – Looks boss!
  • Based on exceeding the critical bend angle and causing light to leak out
  • Still in development
    – Needs an adaptive algorithm!

Sensor output while saying “mama, papa”
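One plausible shape for the "adaptive algorithm" the loop needs — a guess, not the author's actual design — is a drifting min/max baseline that renormalizes the raw light reading, since the loop's resting level shifts as the costume moves.

```python
def normalize_bend(raw, rate=0.001):
    """Turn raw fibre-loop light readings into 0..1 jaw opening.

    The floor estimate creeps upward and snaps down to new minima;
    the ceiling does the opposite.  This tracks slow drift (costume
    shifting, fibre ageing) while preserving fast jaw motion.
    The creep rate is illustrative.
    """
    lo = hi = raw[0]
    out = []
    for x in raw:
        lo = min(x, lo + rate)        # floor creeps up, snaps down
        hi = max(x, hi - rate)        # ceiling creeps down, snaps up
        span = hi - lo
        out.append((x - lo) / span if span > 1e-6 else 0.0)
    return out
```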

SLIDE 43

43 / 60

An Aside: Cameras

  • Why not just use computer vision?
    – Aside from the latency? (need >50 fps)
    – Contrast with beards, balaclavas; lighting (IR)
    – Powerful computer needed (getting better, e.g. Jetson board)
    – Readily-available algorithms for facial landmarking (Dlib) are rather noisy
      • Kalman filtering and such removed the fast lip motion, or I had issues with overshoot, or noise again. Maybe LMS with access to the voice signal could work?

SLIDE 44

44 / 60

An Aside: Cameras

  • Used in the industry
    – Works very well
    – Good accuracy
  • Not suitable for use in a costume
    – Need a clear view of the face from a distance
    – Complex algorithms need powerful computers

Cara Motion Capture (www.vicon.com)
DisneyResearchHub – Synthetic prior design for real time facial capture https://www.youtube.com/watch?v=w71vxi60SzM

SLIDE 45

45 / 60

An Aside: Cameras

  • Dlib-based real-time facial landmark annotation
  • Requires aggressive smoothing (Kalman)
    – Filters out all the little motions
    – Some overshoot
  • Camera positioning requirements and lighting not practical

RoboCow Industries

SLIDE 46

46 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 47

47 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 48

48 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 49

49 / 60

Awooooooooooo!!!!!!!!!!!

  • It’ll howl all right...
    – Larson effect
    – Why there are few costume voice systems out there
  • Needs:
    – Microphone design
    – Speaker design
    – Feed-back control
  • Speech effects help! (e.g. pitch shifting)

SLIDE 50

50 / 60

Adaptive Feed-Back Canceller

  • Models the path between the microphone and speaker
  • Not magic: about 10 dB or so of extra gain
  • Cardioid mic + decent speaker design: ~20 dB
  • Total: ~30 dB system gain!
  • Good enough, as the goal is to replicate your voice at about the same volume (or “big creature” volume), not “punk band in a suit”!
  • BUT: it’s about gain, not volume
    – If you can speak loud, the suit can also be LOUD

“Robust and Efficient Implementation of the PEM-AFROW Algorithm for Acoustic Feedback Cancellation,” G. Rombouts, T. Van Waterschoot, M. Moonen, 2007.
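The gain budget above follows from the howl condition: the loop becomes unstable when amplifier gain times speaker-to-mic path gain reaches unity, so usable gain (in dB) is bounded by the acoustic path loss, plus whatever the canceller effectively adds to that loss. A worked sketch, using the slide's rough numbers:

```python
def usable_gain_db(path_loss_db, canceller_db):
    """Maximum stable system gain: howling (the Larson effect)
    starts once gain around the loop hits unity, i.e. once the
    amplifier gain exceeds the speaker-to-mic path loss.  The
    adaptive canceller effectively deepens that loss.
    """
    return path_loss_db + canceller_db

def howls(system_gain_db, path_loss_db, canceller_db):
    """True if this much gain would push the loop past unity."""
    return system_gain_db > usable_gain_db(path_loss_db, canceller_db)
```

With ~20 dB from the cardioid mic plus speaker placement and ~10 dB from the canceller, a 30 dB system gain sits right at the edge; in practice you would leave a few dB of margin.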
SLIDE 51

51 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 52

52 / 60

Sound Effects – An Example

  • People love pitch shifters
    – But it often sounds bad (kinda incomprehensible)
  • Reason 1: simple (W)OLA algorithms (such as the one commonly used on an Arduino) are NOT formant-preserving
    – This ruins the formant relationships in speech
    – A time-domain pitch shifter has to lock to F0 for that
      • Such algorithms are far more numerically complex
      • PSOLA is one such algorithm
  • Reason 2: artefacts increase with increasing shift
    – Help the algorithm and actually voice act!
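The formant problem is easiest to see in the crudest pitch shifter of all: plain resampling. Every spectral feature moves together, formants included, and the duration changes too — (W)OLA then repairs the duration, but not the formants. A minimal linear-interpolation resampler for illustration:

```python
def resample_pitch(samples, factor):
    """Naive pitch shift by resampling with linear interpolation.

    factor > 1 raises pitch (and shortens the signal); the ENTIRE
    spectrum scales by the same factor, which is exactly the
    non-formant-preserving behaviour the slide warns about.
    """
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        # linear interpolation between neighbouring samples
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += factor
    return out
```

A formant-preserving shifter (e.g. PSOLA) instead repositions pitch periods, which requires tracking F0 reliably — hence the extra complexity the slide mentions.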

SLIDE 53

53 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 54

54 / 60

Parametric Equalizer

  • This corrects for the muffled voice
  • Compensates for the filter effect of the costume head, speaker response, microphone, etc...
  • EQ tuning is complex
    – REW to the rescue: https://www.roomeqwizard.com/
    – With help from my own method for transfer-function estimation
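A parametric EQ is typically a cascade of peaking biquads, one per band that a tool like REW suggests correcting. The standard coefficient recipe from the widely used RBJ Audio EQ Cookbook (cascading and band selection left out):

```python
import math

def peaking_biquad(f0, gain_db, q, fs):
    """Peaking-EQ biquad coefficients (RBJ Audio EQ Cookbook).

    f0      -- center frequency in Hz
    gain_db -- boost (+) or cut (-) at f0
    q       -- bandwidth control
    fs      -- sample rate in Hz

    Returns (b, a) feed-forward/feed-back coefficient lists,
    normalized so a[0] == 1.
    """
    a_lin = 10 ** (gain_db / 40.0)          # sqrt of linear gain
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b = [1 + alpha * a_lin, -2 * math.cos(w0), 1 - alpha * a_lin]
    a0 = 1 + alpha / a_lin
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha / a_lin) / a0]
    return [bi / a0 for bi in b], a
```

At 0 dB the filter degenerates to a pass-through (b equals a), which makes a handy sanity check when wiring up the cascade.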

SLIDE 55

55 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 56

56 / 60

Why a Bi-Amped System?

  • The voice MUST come from the mouth for realism
  • It’s hard to fit a full-range speaker in the mouth
  • We can cheat a bit:
    – High frequencies do most for sound localization
    – Tweeter/mid in the nose
    – Tweeters are small!
  • Mid-low range speaker can be some place else (e.g. cheeks, forehead, chin, chest, shoulders)
  • Woofer can be almost anywhere (no directionality)

3-D Audio & Applied Acoustics Lab, Princeton
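The crossover that feeds those drivers just splits the band. As a sketch, a first-order complementary split (a real build would use steeper filters, e.g. Linkwitz-Riley biquads; the corner constant here is illustrative):

```python
def crossover(samples, alpha=0.1):
    """First-order complementary crossover sketch.

    The low-pass branch goes to the remotely placed woofer; the
    residual (input minus low-pass) goes to the tweeter/mid in the
    nose.  By construction the two branches sum back to the input,
    so no energy is lost at the split.
    """
    low, lows, highs = samples[0], [], []
    for x in samples:
        low += alpha * (x - low)      # one-pole low-pass -> woofer
        lows.append(low)
        highs.append(x - low)         # complement -> tweeter/mid
    return lows, highs
```

Because the high branch carries the localization cues, keeping it (and only it) in the nose is what makes the voice appear to come from the mouth.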

SLIDE 57

57 / 60

An Aside: Speaker Enclosures

  • Enclosure or sound board required for proper sound
    – Avoid comb filtering due to acoustic short-circuit
  • Must be big enough
  • Helps with speaker ↔ microphone isolation
  • Best is right in front of the microphone (if cardioid)

Elliott Sound Products

SLIDE 58

58 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 59

59 / 60

Eh, Yeah… About That...

I ran out of time for Flüüfff…

Research = 99% falling on my face in high spirits, 0.9% crying under a blanket, 0.1% success

It’ll be working by NFC 2020! (I hope... It’s a nice blanket)

SLIDE 60