Advanced Animatronics: Voice and Jaws v1.0 – Flüüfff 22/11/2019 – Floere – PowerPoint PPT Presentation


SLIDE 1

Advanced Animatronics Voice and Jaws v1.0

Flüüfff – 22/11/2019

Floere T. Pillowbeaver, Devourer of Nuclear Submarines fmoere@robocow.be

SLIDE 2

2 / 60

What is this Talk About?

  • An overview of the State of the Art of moving jaws and voice projection
  • Why I think their performance is ‘meh’
  • My research into a self-contained, real-time, speech-expression-mimicking character with a clear voice
  • All the good ideas that weren’t...
SLIDE 3

3 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 4

4 / 60

Goal: Puppet Without Strings

  • Your character driven by your acting
  • Clear voice projection
  • Live audience interaction
  • Everything self-contained in the costume
  • Comfortable
  • Affordable

Lip-syncing with puppet mask (manually actuated): Radula Castion – Zuzu’s White Rabbit https://www.youtube.com/watch?v=b2pDuWh3ik8

SLIDE 5

5 / 60

Low Integration Complexity

  • Easy enough to implement by hobbyists
  • Not a movie-grade animatronic with 30+ servos and a head full of gears
  • Simple mechanisms must suffice
    – Off-the-shelf parts
    – 3D printable

Gustav Hoegen

SLIDE 6

6 / 60

The Big Challenge

  • Motion must be psychologically correct, not necessarily physiologically correct!
  • A big, flappy mouth on a fuzzy critter is not exactly real…
  • Uncanny valley helps → stay non-human!

Wikipedia – Uncanny Valley Conjecture (Mori 1970)

SLIDE 7

7 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 8

8 / 60

Let’s Watch Some Videos...

  • All of these are live performances by the costume actors themselves (no lip-syncing or over-dubbing)
  • Professional

Katey McGregor – Talking Mickey Mouse https://www.youtube.com/watch?v=762-tHwnAHg
Mascot – Animatronic Mascots https://www.youtube.com/watch?v=Ve3vuxII6Dc
Lunaspuppets – Human-Size Animatronic Robotic Talking Donkey Puppet https://www.youtube.com/watch?v=Cv5yAfHWEY4

  • Furry Fandom

Bake Me Up Buttercup – How to Measure Flour Correctly https://www.youtube.com/watch?v=YBkT5woqmAY
BeautyoftheBass – Speaker Costume Talks Live! V3 https://www.youtube.com/watch?v=UWOWqe1kP7U
DRAGON =^‿^= – Howwwwwwdy folks and welcome to Monday (Twitter: @GRNdragon0)

SLIDE 9

9 / 60

It’s a Bit of a Mess, Isn’t It?

  • Professional work
    – Limited, static articulation (blinks + simple mouth)
    – Good voice quality... is not actually the case!
      • Often a remote voice actor is involved
      • Often pre-recorded phrases (semi-scripted)
    – Most costumes are actually puppets, controlled by the actor’s hand/chin/tongue, or a remote operator
    – Let’s have a look at this…

The Character Academy – How Disney Characters Blink https://www.youtube.com/watch?v=YRDBFc-TrtM

SLIDE 10

10 / 60

It’s a Bit of a Mess, Isn’t It?

  • Amateur work is actually better in some ways
    – Articulated jaws can work (but often don’t)
      • But it does not look like real speech!
      • Good fit = uncomfortable to use for long
    – Voice is dull in real life
      • YouTube videos use internal microphones
      • BeautyoftheBass is about the best one for live voice projection
      • There are cosplayers who use the “TC Helicon Perform V” for voice projection, which works well (but a bulky system)

SLIDE 11

11 / 60

Why is the Tech So Basic?

  • There are many practicalities for the big boys that limit scope (getting the character voices right, consistency with many actors per costume, training requirements, etc.)
  • The main reason, I think, is that it is actually a hard problem to solve in practice
  • It would take a lot of money, or a motivated idiot with a PhD...

SLIDE 12

12 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 13

13 / 60

Why Moving Jaws Fail for Speech

  • Fundamentally: moving jaws do not work well while speaking, because normal speech does not use much jaw motion
  • Any slop in the mechanism dulls jaw motion
  • Some performers can make their jaw work
    – Speaking with exaggerated jaw motion
    – E.g.: Buttercup and NIIC do this well
  • Still does not feel right… (hint: visemes)
SLIDE 14

14 / 60

What the Science Says...

  • There are two sets of muscles in the jaw:
    – Big and very powerful ones for chewing and large jaw motions. These are slow!
    – Little, fast ones for speech
    – The big ones disengage when speaking
  • Jaw motion during speech is usually small:
    – Under ~0.3 cm pronouncing /ta/ and /te/ (Ostry and Flanagan, 1989)
  • Some sounds (e.g. vowels) can have large motion:
    – Under ~2.5 cm pronouncing /a/ (Vatikiotis-Bateson and Ostry, 1995)

SLIDE 15

15 / 60

What the Science Says...

“Human Jaw Movement in Mastication and Speech,” D.J. Ostry and J.R. Flanagan, Arch. Oral Biol., Vol. 34, No. 9, pp. 685–693, 1989

Sensor attached to the chin, just posterior to the mental notch.

SLIDE 16

16 / 60

What the Science Says...

Marker 4 cm from lower incisors, ~on the midsagittal plane. “An Analysis of the Dimensionality of Jaw Motion in Speech,” E. Vatikiotis-Bateson and D.J. Ostry, Journal of Phonetics, Vol. 23, pp. 101–117, 1995

SLIDE 17

17 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 18

18 / 60

First, a Little...

SLIDE 19

19 / 60

How Speech is Produced

Haskins Laboratories
K. Duh, M. Lloyd, M. Smiley
gosh.nhs.uk

SLIDE 20

20 / 60

How Speech is Produced

Jörgen Ahlberg – Source-Filter Model of Speech Production

SLIDE 21

21 / 60

Phonemes vs Visemes

  • Animators learn that much of visible speech is lip motion
  • They use only a few visemes
    – Many speech sounds (phonemes) look alike
    – E.g., to a lip reader, “elephant juice” = “I love you”
  • Thus: we can simplify a lot
  • Can we get phonemes from speech?
    – A very hard problem
    – Key to speech recognition

SLIDE 22

22 / 60

Mouth Shape from Sound?

  • Look at the visemes and try the utterances
    – Voiced or louder → mouth more open
    – Nasal or unvoiced → mouth more closed
  • Try: “mama” “is” “na”
  • Not perfect, but should be good enough for a simple jaw

Wolf Paulus – Viseme Model with 12 Mouth Shapes
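The louder/voiced-opens, nasal/unvoiced-closes rule above can be sketched as a tiny per-frame mapping. This is only an illustration: the feature names (frame RMS, a voiced flag, a nasalance score) are assumed to come from the speech-analysis stage, and all scale factors here are made up.

```python
import math

def mouth_openness(rms, voiced, nasalance,
                   rms_floor=0.01, rms_max=0.3):
    """Map per-frame speech features to a 0..1 mouth-openness value.

    rms       -- frame RMS amplitude
    voiced    -- True if the frame is voiced
    nasalance -- 0..1, how nasal the voiced speech is

    Louder/voiced frames open the mouth; unvoiced or nasal frames
    pull it closed.  Thresholds and weights are illustrative.
    """
    if rms < rms_floor:                 # silence -> mouth closed
        return 0.0
    # log-compress loudness into the 0..1 range
    span = math.log(rms_max / rms_floor)
    loud = min(1.0, max(0.0, math.log(rms / rms_floor) / span))
    if not voiced:
        loud *= 0.3                     # unvoiced -> mostly closed
    return loud * (1.0 - 0.7 * nasalance)  # nasal -> more closed
```

Driving a servo would then just be scaling this 0..1 value to the jaw's travel range.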

SLIDE 23

23 / 60

How We’re Going to Do It

  • Key idea: rough visemes
    – Estimate mouth state from jaw + lips
    – No actual phoneme detection
    – Don’t need perfection
  • Jaw sensor
    – Chin motion (slow)
    – Measured from jaw
    – Includes static poses
  • Lip “sensor” (via mic)
    – Lip motion (fast)
    – Estimated from speech
    – No action when silent

[Diagram: jaw sensor + lip “sensor” (speech analysis) → mouth estimate → jaw servos]

SLIDE 24

24 / 60

Voicedness + Nasalance

  • Voicing detection
    – Voiced, unvoiced, or silence?
    – How much energy?
  • Nasalance
    – How nasal is the voiced speech?
  • Have done original research on sensors

Donald Derrick – nasalance of /na/
“A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition,” Bishnu S. Atal, Lawrence R. Rabiner, 1976.
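A bare-bones voiced/unvoiced/silence decision in the spirit of the energy and zero-crossing features used by Atal & Rabiner might look like the sketch below. The thresholds are invented and would need tuning per microphone; the real classifier uses more features and a proper pattern-recognition stage.

```python
def classify_frame(frame, energy_sil=1e-4, zcr_voiced=0.15):
    """Crude voiced/unvoiced/silence decision for one audio frame
    (a list of floats in -1..1).  Voiced speech tends to have high
    energy and few zero crossings; unvoiced (fricative) speech has
    many zero crossings.  Thresholds are illustrative only.
    """
    n = len(frame)
    energy = sum(x * x for x in frame) / n          # mean power
    zcr = sum(1 for a, b in zip(frame, frame[1:])   # zero-crossing rate
              if (a < 0) != (b < 0)) / n
    if energy < energy_sil:
        return "silence"
    return "voiced" if zcr < zcr_voiced else "unvoiced"
```

Per the previous slides, "voiced" frames would open the mouth and "unvoiced"/"silence" frames would keep it closed.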

SLIDE 25

25 / 60

Bringing it All Together

  • Jaw activity gets us the “wide open” visemes, as well as silent + static mouth motions
  • Speech activity opens the lips
  • Unvoiced speech and high nasalance counteract the lip opening
  • Thus: the voice signal adds the lost small (fast) lip motion to the large (slow) jaw motion
    – Lips can be separate or added to jaw motion
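A minimal sketch of that fusion, assuming both channels already arrive as 0..1 values per frame. The smoothing constant and the clamping are illustrative choices, not the project's actual code.

```python
def combine_jaw_and_lips(jaw, lip, alpha=0.2):
    """Fuse slow jaw-sensor samples with fast mic-derived lip
    openness into one servo command per frame (all values 0..1).

    The jaw channel is low-pass filtered (it is slow anyway and the
    sensor is noisy); the lip estimate is added on top, as the
    slide describes for the single-servo case.
    """
    out, jaw_smooth = [], jaw[0]
    for j, l in zip(jaw, lip):
        jaw_smooth += alpha * (j - jaw_smooth)   # one-pole low-pass
        out.append(min(1.0, jaw_smooth + l))     # lips ride on the jaw
    return out
```

With separate lip servos, the two channels would simply be routed to their own actuators instead of summed.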

SLIDE 26

26 / 60

Bringing it All Together

  • Mechanism
    – Jaw → 1 servo, on the jaw hinge
    – Lips → 1–2 servos (optional), on lip actuation wires
  • Sensors
    – Two microphones (mouth + nose)
    – Jaw strap

Eva Taylor – Animatronic Alien https://makezine.com/2014/10/27/the-making-of-an-animatronic-alien/

SLIDE 27

27 / 60

Mechanisms

  • Tioh http://www.tioh.de/
  • Radula Castion https://radulacastion.wixsite.com/radulacastion
  • “Animatronic Character Creation – Organic Mechanics I & II,” Rick Lazzarini, Stan Winston School of Character Arts

skud duncan – Animatronic Jaw Test https://www.youtube.com/watch?v=15IVl1VYdSk
Winter Snowmew – “Couple of my followers have been curious about the weird snout. Here is the snarl and mouth mechanics.”

SLIDE 28

28 / 60

How Good is “Simple”?

  • We gain a lot with only a jaw, or jaw + simple lips (1–3 servos)
  • The full expression of a movie-grade animatronic mouth would require many more servos and a much more complex motion-capture system
    – This is not the point of this project
    – Affordability and “bang for the buck” is key

SLIDE 29

29 / 60

Does Simple Lose Much?

  • Let’s compare high-end animatronics to a well-done lip sync
  • I think small errors in animation are working against it → uncanny valley
  • Clearly: diminishing returns

TheCharacterShop – TCSpolarbearWaldo.mov https://www.youtube.com/watch?v=bFW2azvVEdI
Shanetheactor – MetroPCS Commercial https://www.youtube.com/watch?v=udlQ7SH_RtM
Radula Castion – Zuzu’s White Rabbit https://www.youtube.com/watch?v=b2pDuWh3ik8

VS

SLIDE 30

30 / 60

So, Is It Really That Simple?

  • Unfortunately, NO
  • This is one of those things that seems easy enough in principle, doable in the lab…
  • ...but is much harder in the field:
    – Conventions are LOUD!
    – Voice acting gives bizarre speech patterns
    – Sensors don’t stay put
      • Not practical to glue sensors to the face or require piercings/implants
    – Computer vision systems not practical (yet)

SLIDE 31

31 / 60

Fundamental Limitations

  • Errors in animation will happen (expect 10–20%)
  • Some patterns of speech and acting will fail
    – Mouth held open for a long time
    – Mouth unmoving while speaking
    – Mouth held shut while mumbling
  • Sudden, loud changes in the environment may result in jaw motion (surprise?)
  • No provisions for smiling, snarling, etc… yet
    – Smile = mouth a little open for now...

SLIDE 32

32 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 33

33 / 60

Speech in LOTS of Noise

  • Noise causes the jaw to move
  • Conventions, outdoors:
    – Loud, even during calm moments
    – Noise is non-stationary
  • Adaptive filters required!
  • I designed a 3-layer system
  • The purpose of L1 and L2 is to lift as much of our voice out of the noise as possible, so L3 can really go to town on the noise. (Which can also be a voice! This is how it can tell the difference.)

[Diagram: L1 cardioid mic + ambient mic → L2 GCCPF → L3 MMSE]
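The layering amounts to a simple processing chain per audio chunk. In this sketch, `l2` and `l3` are hypothetical stand-ins for the real GCCPF and MMSE stages; L1 is the physical cardioid microphone, already applied by the hardware.

```python
def denoise(chunk_voice, chunk_ambient, l2, l3):
    """Run one audio chunk through the three-layer structure:
    the L2 two-channel stage uses the ambient reference mic to
    subtract correlated noise, then the L3 single-channel stage
    cleans up what remains.  l2 and l3 are placeholder callables
    for the actual GCCPF and MMSE implementations.
    """
    lifted = l2(chunk_voice, chunk_ambient)  # lift voice out of the noise
    return l3(lifted)                        # spectral cleanup
```

The point of the ordering is that L3 only sees a signal whose noise is already mostly uncorrelated residue, so its non-stationary noise estimate is far less likely to eat the voice itself.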

SLIDE 34

34 / 60

Speech in LOTS of Noise

SLIDE 35

35 / 60

Three-Layer Noise Reduction – L1: Close-Talking Cardioid Microphone

  • Start with as high an SNR as we can!
  • The test recording was done facing a speaker set so loud I could hardly hear myself talk*
  • The costume head will also add some noise reduction

* This test recording was actually done using an omni-directional microphone, thus worst-case

SLIDE 36

36 / 60

Three-Layer Noise Reduction – L2: Two-Channel Cancellation

  • GCCPF – Generalized Cross-Coupled Paired Filter
  • Models the paths between the noise-reference and speech microphones, then subtracts the noise reference from the signal and vice versa
  • I modified the Sugiyama algorithm to take better advantage of the close-talking mic and to self-adjust better to the stupid acoustic environment

“Low Distortion Noise Cancellers – Revival of a Classical Technique,” Akihiko Sugiyama
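The modified GCCPF itself isn't spelled out here, but the classic two-channel scheme it builds on — an adaptive filter learning how the ambient reference leaks into the speech mic, then subtracting that prediction — can be sketched with plain NLMS. This is the textbook ancestor, not the author's algorithm; tap count and step size are illustrative.

```python
def nlms_cancel(primary, reference, taps=8, mu=0.5, eps=1e-8):
    """Two-channel adaptive noise canceller (NLMS).

    primary   -- speech mic samples (voice + leaked noise)
    reference -- ambient mic samples (noise only, ideally)

    The filter w models the acoustic path from the reference into
    the primary; the prediction error e is the cleaned signal.
    """
    w = [0.0] * taps
    buf = [0.0] * taps
    out = []
    for p, r in zip(primary, reference):
        buf = [r] + buf[:-1]                      # reference history
        y = sum(wi * xi for wi, xi in zip(w, buf))  # predicted leakage
        e = p - y                                 # error = cleaned output
        norm = sum(x * x for x in buf) + eps      # power normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        out.append(e)
    return out
```

The cross-coupled "paired" structure adds a mirror filter in the opposite direction so voice leaking into the ambient mic does not get cancelled along with the noise.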

SLIDE 37

37 / 60

Three-Layer Noise Reduction – L3: One-Channel Cancellation

  • Based on MMSE-STSA noise estimation (Minimum Mean-Square Error Short-Time Spectral Amplitude)
  • Related to the Audacity noise canceller, but able to handle non-stationary noise conditions. Like your phone does!
  • The example is set overly aggressive

“Development of Speech Technologies to Support Hearing through Mobile Terminal Users,” T. Togawa, T. Otani, K. Suzuki, T. Taniguchi, 2015.
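Real MMSE-STSA computes a statistically optimal gain per frequency bin; as a much-reduced illustration of the same single-channel family, here is a recursive per-bin noise tracker with floor-and-subtract attenuation. All constants are illustrative and this is not the MMSE gain rule itself.

```python
def spectral_gate(frames_mag, noise0, alpha=0.9, floor=0.1):
    """Very reduced single-channel spectral attenuator.

    frames_mag -- list of per-frame magnitude spectra (lists of floats)
    noise0     -- initial noise-magnitude estimate per bin

    Bins close to the tracked noise level update the estimate and get
    attenuated (keeping a small floor, which avoids 'musical noise');
    louder bins are assumed to be speech and get the noise estimate
    subtracted.  Tracking per frame is what lets this follow
    non-stationary noise, unlike a fixed noise profile.
    """
    noise = list(noise0)
    out = []
    for mag in frames_mag:
        cleaned = []
        for i, m in enumerate(mag):
            if m < 2.0 * noise[i]:              # looks like noise: track it
                noise[i] = alpha * noise[i] + (1 - alpha) * m
                cleaned.append(floor * m)       # attenuate, keep a floor
            else:
                cleaned.append(m - noise[i])    # subtract noise estimate
        out.append(cleaned)
    return out
```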

SLIDE 38

38 / 60

Speech in Frequency Domain

SLIDE 39

39 / 60

Handling the Algorithms

  • The good news: they are adaptive
    – They will work in most environments
    – They will work with most speakers and languages
    – They will work with squeakers
  • The bad news: they are adaptive
    – They can get it wrong at times
    – Many, many parameters to configure
  • The good news: they are robust and forgiving
    – These are some of the most robust algorithms out there
    – Most of the parameters are fixed for the application
      • Your cellphone doesn’t need manual intervention either!
    – The remainder tunes easily to a specific costume
      • There will be assistance software
SLIDE 40

40 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 41

41 / 60

Capturing Jaw Motion

  • Well, I’ll just use a chin strap with a stretch sensor and…
  • Oh. Bugger
  • Never mind, it’s not comfortable anyway
  • Tried a paddle, a bar, elastic, etc…
    – Shifts around too much
    – Interferes with speech

SLIDE 42

42 / 60

Capturing Jaw Motion

  • Fibre-optic chin loop
    – Very comfy
    – Quite robust
    – Cheap
    – Easy to manufacture
    – Looks boss!
  • Based on exceeding the critical bend angle and causing light to leak out
  • Still in development
    – Needs an adaptive algorithm!

Sensor output while saying “mama, papa”
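One plausible shape for the "adaptive algorithm" the loop needs — a guess, not the author's actual design — is a drifting min/max baseline that renormalizes the raw light reading, since the loop's resting level shifts as the costume moves.

```python
def normalize_bend(raw, rate=0.001):
    """Turn raw fibre-loop light readings into 0..1 jaw opening.

    The floor estimate creeps upward and snaps down to new minima;
    the ceiling does the opposite.  This tracks slow drift (costume
    shifting, fibre ageing) while preserving fast jaw motion.
    The creep rate is illustrative.
    """
    lo = hi = raw[0]
    out = []
    for x in raw:
        lo = min(x, lo + rate)        # floor creeps up, snaps down
        hi = max(x, hi - rate)        # ceiling creeps down, snaps up
        span = hi - lo
        out.append((x - lo) / span if span > 1e-6 else 0.0)
    return out
```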

SLIDE 43

43 / 60

An Aside: Cameras

  • Why not just use computer vision?
    – Aside from the latency? (need >50 fps)
    – Contrast with beards, balaclavas; lighting (IR)
    – Powerful computer needed (getting better, e.g. Jetson board)
    – Readily-available algorithms for facial landmarking (Dlib) are rather noisy
      • Kalman filtering and such removed the fast lip motion, or I had issues with overshoot, or noise again. Maybe LMS with access to the voice signal could work?

SLIDE 44

44 / 60

An Aside: Cameras

  • Used in the industry
    – Works very well
    – Good accuracy
  • Not suitable for use in a costume
    – Need a clear view of the face from a distance
    – Complex algorithms need powerful computers

Cara Motion Capture (www.vicon.com)
DisneyResearchHub – Synthetic prior design for real time facial capture https://www.youtube.com/watch?v=w71vxi60SzM

SLIDE 45

45 / 60

An Aside: Cameras

  • Dlib-based real-time facial landmark annotation
  • Requires aggressive smoothing (Kalman)
    – Filters out all the little motions
    – Some overshoot
  • Camera positioning requirements and lighting not practical

RoboCow Industries

SLIDE 46

46 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 47

47 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 48

48 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 49

49 / 60

Awooooooooooo!!!!!!!!!!!

  • It’ll howl all right...
    – Larson effect
    – Why there are few costume voice systems out there
  • Needs:
    – Microphone design
    – Speaker design
    – Feed-back control
  • Speech effects help! (e.g. pitch shifting)

SLIDE 50

50 / 60

Adaptive Feed-Back Canceller

  • Models the path between the microphone and speaker
  • Not magic: about 10 dB or so of extra gain
  • Cardioid mic + decent speaker design: ~20 dB
  • Total: ~30 dB system gain!
  • Good enough, as the goal is to replicate your voice at about the same volume (or “big creature” volume), not “punk band in a suit”!
  • BUT: it’s about gain, not volume
    – If you can speak loud, the suit can also be LOUD

“Robust and Efficient Implementation of the PEM-AFROW Algorithm for Acoustic Feedback Cancellation,” G. Rombouts, T. Van Waterschoot, M. Moonen, 2007.
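The gain budget above follows from the howl condition: the loop becomes unstable when amplifier gain times speaker-to-mic path gain reaches unity, so usable gain (in dB) is bounded by the acoustic path loss, plus whatever the canceller effectively adds to that loss. A worked sketch, using the slide's rough numbers:

```python
def usable_gain_db(path_loss_db, canceller_db):
    """Maximum stable system gain: howling (the Larson effect)
    starts once gain around the loop hits unity, i.e. once the
    amplifier gain exceeds the speaker-to-mic path loss.  The
    adaptive canceller effectively deepens that loss.
    """
    return path_loss_db + canceller_db

def howls(system_gain_db, path_loss_db, canceller_db):
    """True if this much gain would push the loop past unity."""
    return system_gain_db > usable_gain_db(path_loss_db, canceller_db)
```

With ~20 dB from the cardioid mic plus speaker placement and ~10 dB from the canceller, a 30 dB system gain sits right at the edge; in practice you would leave a few dB of margin.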
SLIDE 51

51 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 52

52 / 60

Sound Effects – An Example

  • People love pitch shifters
    – But it often sounds bad (kinda incomprehensible)
  • Reason 1: simple (W)OLA algorithms (such as the one commonly used on an Arduino) are NOT formant-preserving
    – This ruins the formant relationships in speech
    – A time-domain pitch shifter has to lock to F0 for that
      • Such algorithms are far more numerically complex
      • PSOLA is one such algorithm
  • Reason 2: artefacts increase with increasing shift
    – Help the algorithm and actually voice act!
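The formant problem is easiest to see in the crudest pitch shifter of all: plain resampling. Every spectral feature moves together, formants included, and the duration changes too — (W)OLA then repairs the duration, but not the formants. A minimal linear-interpolation resampler for illustration:

```python
def resample_pitch(samples, factor):
    """Naive pitch shift by resampling with linear interpolation.

    factor > 1 raises pitch (and shortens the signal); the ENTIRE
    spectrum scales by the same factor, which is exactly the
    non-formant-preserving behaviour the slide warns about.
    """
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        # linear interpolation between neighbouring samples
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += factor
    return out
```

A formant-preserving shifter (e.g. PSOLA) instead repositions pitch periods, which requires tracking F0 reliably — hence the extra complexity the slide mentions.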

SLIDE 53

53 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 54

54 / 60

Parametric Equalizer

  • This corrects for the muffled voice
  • Compensates for the filter effect of the costume head, speaker response, microphone, etc...
  • EQ tuning is complex
    – REW to the rescue: https://www.roomeqwizard.com/
    – With help from my own method for transfer-function estimation
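A parametric EQ is typically a cascade of peaking biquads, one per band that a tool like REW suggests correcting. The standard coefficient recipe from the widely used RBJ Audio EQ Cookbook (cascading and band selection left out):

```python
import math

def peaking_biquad(f0, gain_db, q, fs):
    """Peaking-EQ biquad coefficients (RBJ Audio EQ Cookbook).

    f0      -- center frequency in Hz
    gain_db -- boost (+) or cut (-) at f0
    q       -- bandwidth control
    fs      -- sample rate in Hz

    Returns (b, a) feed-forward/feed-back coefficient lists,
    normalized so a[0] == 1.
    """
    a_lin = 10 ** (gain_db / 40.0)          # sqrt of linear gain
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b = [1 + alpha * a_lin, -2 * math.cos(w0), 1 - alpha * a_lin]
    a0 = 1 + alpha / a_lin
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha / a_lin) / a0]
    return [bi / a0 for bi in b], a
```

At 0 dB the filter degenerates to a pass-through (b equals a), which makes a handy sanity check when wiring up the cascade.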

SLIDE 55

55 / 60

Voice System Overview

[Block diagram: Close-Talking Cardioid Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter / Mid-Range / Woofer]

SLIDE 56

56 / 60

Why a Bi-Amped System?

  • The voice MUST come from the mouth for realism
  • It’s hard to fit a full-range speaker in the mouth
  • We can cheat a bit:
    – High frequencies do most for sound localization
    – Tweeter/mid in the nose
    – Tweeters are small!
  • Mid-low range speaker can be some place else (e.g. cheeks, forehead, chin, chest, shoulders)
  • Woofer can be almost anywhere (no directionality)

3-D Audio & Applied Acoustics Lab, Princeton
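The crossover that feeds those drivers just splits the band. As a sketch, a first-order complementary split (a real build would use steeper filters, e.g. Linkwitz-Riley biquads; the corner constant here is illustrative):

```python
def crossover(samples, alpha=0.1):
    """First-order complementary crossover sketch.

    The low-pass branch goes to the remotely placed woofer; the
    residual (input minus low-pass) goes to the tweeter/mid in the
    nose.  By construction the two branches sum back to the input,
    so no energy is lost at the split.
    """
    low, lows, highs = samples[0], [], []
    for x in samples:
        low += alpha * (x - low)      # one-pole low-pass -> woofer
        lows.append(low)
        highs.append(x - low)         # complement -> tweeter/mid
    return lows, highs
```

Because the high branch carries the localization cues, keeping it (and only it) in the nose is what makes the voice appear to come from the mouth.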

SLIDE 57

57 / 60

An Aside: Speaker Enclosures

  • Enclosure or sound board required for proper sound
    – Avoid comb filtering due to acoustic short-circuit
  • Must be big enough
  • Helps with speaker ↔ microphone isolation
  • Best is right in front of the microphone (if cardioid)

Elliott Sound Products

SLIDE 58

58 / 60

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 59

59 / 60

Eh, Yeah… About That...

I ran out of time for Flüüfff…

Research = 99% falling on my face in high spirits, 0.9% crying under a blanket, 0.1% success

It’ll be working by NFC 2020! (I hope... It’s a nice blanket)

SLIDE 60