Advanced Animatronics Voice and Jaws v1.1 – NordicFuzzCon 20/02/2020



SLIDE 1

Advanced Animatronics Voice and Jaws v1.1

NordicFuzzCon – 20/02/2020

Floere T. Pillowcase, Devourer of Automobiles floere@robocow.be

SLIDE 2

2 / 64

What is this Talk About?

  • An overview of the state of the art of moving jaws and voice projection
  • Why I think their performance is ‘meh’
  • My research into a self-contained, real-time, speech-expression-mimicking character with a clear voice
  • All the good ideas that weren’t...
SLIDE 3

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 4

Goal: Puppet Without Strings

  • Your character driven by your acting
  • Clear voice projection
  • Live audience interaction
  • Everything self-contained in the costume
  • Comfortable
  • Affordable

Lip-syncing with puppet mask (manually actuated)
Radula Castion – Zuzu’s White Rabbit https://www.youtube.com/watch?v=b2pDuWh3ik8

SLIDE 5

Low Integration Complexity

  • Easy enough for hobbyists to implement
  • Not a movie-grade animatronic with 30+ servos and a head full of gears
  • Simple mechanisms must suffice
    – Off-the-shelf parts
    – 3D printable

Gustav Hoegen

SLIDE 6

The Big Challenge

  • Motion must be psychologically correct, not necessarily physiologically correct!
  • A big, flappy mouth on a fuzzy critter is not exactly real…
  • Uncanny valley helps → stay non-human!
  • Not conveying the motion, but conveying the emotion!

Wikipedia – Uncanny Valley Conjecture (Mori 1970)

SLIDE 7

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 8

Let’s Watch Some Videos...

  • All of these are live performances by the costume actors themselves (no lip-syncing or over-dubbing)
  • Professional
    Katey McGregor – Talking Mickey Mouse https://www.youtube.com/watch?v=762-tHwnAHg
    Mascot – Animatronic Mascots https://www.youtube.com/watch?v=Ve3vuxII6Dc
    Lunaspuppets – Human-Size Animatronic Robotic Talking Donkey Puppet https://www.youtube.com/watch?v=Cv5yAfHWEY4
  • Furry Fandom
    Bake Me Up Buttercup – How to Measure Flour Correctly https://www.youtube.com/watch?v=YBkT5woqmAY
    Beautyofthe Bass – Speaker Costume Talks Live! V3 https://www.youtube.com/watch?v=UWOWqe1kP7U
    DRAGON =^ ^= – Howwwwwwdy folks and welcome to Monday ‿ (Twitter: @GRNdragon0)

SLIDE 9

It’s a Bit of a Mess, isn’t It?

  • Professional work
    – Limited, static articulation (blinks + simple mouth)
    – Good voice quality
  • ...which is not actually the case!
    – Often a remote voice actor involved
    – Often pre-recorded phrases (semi-scripted)
    – Most costumes are actually puppets, controlled by the actor’s hand/chin/tongue, or a remote operator
  • Let’s have a look at this…

The Character Academy – How Disney Characters Blink https://www.youtube.com/watch?v=YRDBFc-TrtM

SLIDE 10

It’s a Bit of a Mess, isn’t It?

  • Amateur work is actually better in some ways
    – Articulated jaws can work (but often don’t)
      • But it does not look like real speech!
      • Good fit = uncomfortable to use for long
    – Voice is dull in real life
      • YouTube videos use internal microphones
      • Beautyofthe Bass is about the best one for live voice projection
      • There are cosplayers who use the “TC Helicon Perform V” for voice projection, which works well (but it is a bulky system)

SLIDE 11

Why is the Tech So Basic?

  • There are many practicalities for the big boys that limit scope (getting the character voices right, consistency with many actors per costume, training requirements, etc.)
  • The main reason, I think, is that it is actually a hard problem to solve in practice
  • It would take a lot of money, or a motivated idiot with a PhD...

SLIDE 12

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 13

Why Moving Jaws Fail for Speech

  • Fundamentally: moving jaws do not work well while speaking because normal speech does not use much jaw motion
  • Any slop in the mechanism dulls jaw motion
  • Some performers can make their jaw work
    – Speaking with exaggerated jaw motion
    – E.g.: Buttercup and NIIC do this well
  • Still does not feel right… (hint: visemes)
SLIDE 14

What the Science Says...

  • There are two sets of muscles in the jaw:
    – Big, very powerful ones for chewing and large jaw motions. These are slow!
    – Little, fast ones for speech
    – The big ones disengage when speaking
  • Jaw motion during speech is usually small:
    – Under ~0.3 cm pronouncing /ta/ and /te/ (Ostry and Flanagan, 1989)
  • Some sounds (e.g. vowels) can have large motion:
    – Under ~2.5 cm pronouncing /a/ (Vatikiotis-Bateson and Ostry, 1995)

SLIDE 15

What the Science Says...

“Human Jaw Movement in Mastication and Speech,” D.J. Ostry and J.R. Flanagan, Archs. Oral Biol., Vol. 34, No. 9, pp. 685–693, 1989.

Sensor attached to the chin, just posterior to the mental notch.

SLIDE 16

What the Science Says...

Marker 4 cm from the lower incisors, approximately on the midsagittal plane. “An Analysis of the Dimensionality of Jaw Motion in Speech,” E. Vatikiotis-Bateson and D.J. Ostry, Journal of Phonetics, Vol. 23, pp. 101–117, 1995.

SLIDE 17

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 18

First, a Little...

SLIDE 19

How Speech is Produced

Haskins Laboratories; K. Duh, M. Lloyd, M. Smiley; gosh.nhs.uk

SLIDE 20

How Speech is Produced

Jörgen Ahlberg – Source-Filter Model of Speech Production

SLIDE 21

How Speech is Produced

  • You sound like a fat bee inside!
    – Voiced speech starts from glottal impulses – Bzz! Bzzzzzz!
    – Recorded using a contact microphone
    – This is also why throat microphones sound iffy…

SLIDE 22

Phonemes vs Visemes

  • Animators learn that much of visible speech is lip motion
  • They use only a few visemes
    – Many speech sounds (phonemes) look alike
    – E.g., to a lip reader, “elephant juice” = “I love you”
  • Thus: we can simplify a lot
  • Can we get phonemes from speech?
    – A very hard problem
    – Key to speech recognition

SLIDE 23

Mouth Shape from Sound?

  • Look at the visemes and try the utterances
    – Voiced or louder → mouth more open
    – Nasal or unvoiced → mouth more closed
  • Try: “mama” “is” “na”
  • Not perfect, but should be good enough for a simple jaw

Wolf Paulus – Viseme Model with 12 Mouth Shapes
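As a concrete (and heavily simplified) sketch of the “voiced or louder → open, unvoiced → closed” rule above: the two features below are standard short-time measures, but the threshold values are made up for illustration, not taken from the talk.

```python
import math

def frame_features(frame):
    """RMS energy and zero-crossing rate of one audio frame (floats in [-1, 1])."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)
    return rms, zcr

def mouth_openness(frame, rms_full_open=0.05, zcr_unvoiced=0.25):
    """Map one audio frame to a [0, 1] jaw opening.
    Louder/voiced -> more open; unvoiced (high ZCR, fricative-like)
    -> mostly closed.  Thresholds are illustrative values only."""
    rms, zcr = frame_features(frame)
    opening = min(1.0, rms / rms_full_open)
    if zcr > zcr_unvoiced:   # hissy sound: close the mouth most of the way
        opening *= 0.2
    return opening
```

A loud low-frequency “mama”-like frame maps near fully open, a hiss maps nearly closed, and silence maps to zero.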

SLIDE 24

How We’re Going to Do It

  • Key idea: rough visemes
    – Estimate mouth state from jaw + lips
    – No actual phoneme detection
    – Don’t need perfection
  • Jaw sensor
    – Chin motion (slow)
    – Measured from the jaw
    – Includes static poses
  • Lip “sensor” (oral/nasal mic)
    – Lip motion (fast)
    – Estimated from speech
    – No action when silent

Signal flow: jaw sensor + lip “sensor” → speech analysis → mouth estimate → jaw servos
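The slow-jaw-plus-fast-lips split above can be sketched as a toy estimator. The class name, smoothing constant, and lip weighting below are my illustrative assumptions, not the talk's actual implementation:

```python
class MouthEstimator:
    """Toy combination of the two paths on this slide: a low-passed jaw
    sensor keeps the slow motion and static poses, and a speech-derived
    lip term adds the fast component.  Constants are illustrative."""

    def __init__(self, jaw_smoothing=0.9, lip_weight=0.5):
        self.jaw_lp = 0.0
        self.a = jaw_smoothing
        self.lip_weight = lip_weight

    def update(self, jaw_raw, lip_activity):
        # Low-pass the jaw sensor: static poses survive, jitter does not.
        self.jaw_lp = self.a * self.jaw_lp + (1 - self.a) * jaw_raw
        # Speech adds fast lip motion; silence (lip_activity == 0) adds nothing.
        return max(0.0, min(1.0, self.jaw_lp + self.lip_weight * lip_activity))
```

Held silent poses settle to the jaw-sensor value; a speech burst opens the mouth further on top of that pose.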

SLIDE 25

Voicedness + Nasalance

  • Voicing detection
    – Voiced, unvoiced, or silence?
    – How much energy?
    – Can we do this implicitly?

“A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition,” Bishnu S. Atal, Lawrence R. Rabiner, 1976.
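A heavily reduced stand-in for the cited classifier, using only frame energy and zero-crossing rate. The 1976 paper uses five features and a trained decision rule; the two thresholds here are illustrative guesses, not its values:

```python
import math

def vus_classify(frame, rms_silence=0.01, zcr_voiced_max=0.2):
    """Toy voiced/unvoiced/silence decision for one audio frame:
    frame energy separates silence from speech, and zero-crossing
    rate separates voiced (low ZCR) from unvoiced (high ZCR)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms < rms_silence:
        return "silence"
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)
    return "voiced" if zcr < zcr_voiced_max else "unvoiced"
```

A low-pitch tone classifies as voiced, a sign-alternating hiss as unvoiced, and a near-zero frame as silence.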

SLIDE 26

Voicedness + Nasalance

  • Nasalance
    – How nasal is voiced speech?
    – I have done original research on sensors…
    – But I can’t use any of that work here!
      • Too bulky due to the underlying principle
      • Had to measure airflow + speech
      • Not required here!

Donald Derrick – nasalance of /na/

SLIDE 27

Bringing it All Together

  • Jaw activity gets us the “wide open” visemes, as well as silent + static mouth motions
  • Speech activity opens the lips
  • Unvoiced speech and high nasalance counteract the lip opening
  • Thus: the voice signal adds the lost small (fast) lip motion to the large (slow) jaw motion
    – Lips can be separate or added to jaw motion

SLIDE 28

Bringing it All Together

  • Mechanism
    – Jaw → 1 servo (on the jaw hinge)
    – Lips → 1–2 servos, optional (on lip actuation wires)
  • Sensors
    – Two microphones (mouth + nose)
    – Jaw motion sensor

Eva Taylor – Animatronic Alien https://makezine.com/2014/10/27/the-making-of-an-animatronic-alien/

SLIDE 29

Mechanisms

  • Tioh – http://www.tioh.de/
  • Radula Castion – https://radulacastion.wixsite.com/radulacastion
  • “Animatronic Character Creation – Organic Mechanics I & II,” Rick Lazzarini, Stan Winston School of Character Arts

skud duncan – Animatronic Jaw Test https://www.youtube.com/watch?v=15IVl1VYdSk
Winter Snowmew – “Couple of my followers have been curious about the weird snout. Here is the snarl and mouth mechanics.”

SLIDE 30

How Good is “Simple”?

  • We gain a lot with only a jaw, or jaw + simple lips (1–3 servos)
  • Full expression of a movie-grade animatronic mouth would require many more servos and a much more complex motion-capture system
    – This is not the point of this project
    – Affordability and “bang for the buck” are key

SLIDE 31

Does Simple Lose Much?

  • Let’s compare high-end animatronics to a well-done lip sync
  • I think small errors in animation are working against it → uncanny valley
  • Clearly: diminishing returns

TheCharacterShop – TCSpolarbearWaldo.mov https://www.youtube.com/watch?v=bFW2azvVEdI
VS
Shanetheactor – MetroPCS Commercial https://www.youtube.com/watch?v=udlQ7SH_RtM
Radula Castion – Zuzu’s White Rabbit https://www.youtube.com/watch?v=b2pDuWh3ik8

SLIDE 32

So, Is It Really That Simple?

  • Unfortunately, NO
  • This is one of those things that seems easy enough in principle, doable in the lab…
  • ...but is much harder in the field:
    – Conventions are LOUD!
    – Voice acting gives bizarre speech patterns
    – Sensors don’t stay put
      • Not practical to glue sensors to the face or require piercings/implants
    – Computer vision systems are not practical (yet)

SLIDE 33

Fundamental Limitations

  • Errors in animation will happen (expect 10–20%)
  • Some patterns of speech and acting will fail
    – Mouth held open for a long time
    – Mouth unmoving while speaking
    – Mouth held shut while mumbling
  • Sudden, loud changes in the environment may result in jaw motion (surprise?)
  • No provisions for smiling, snarling, etc… yet
    – Smile = mouth a little open, for now...

SLIDE 34

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 35

Speech in LOTS of Noise

  • Noise causes the jaw to move
  • Conventions, outdoors:
    – Loud, even during calm moments
    – Noise is non-stationary
  • Adaptive filters required!
  • I designed a 3-layer system
  • The purpose of L1 and L2 is to lift as much of our voice out of the noise as possible, so L3 can really go to town on the noise. (Which can also be a voice! This is how it can tell the difference.)

L1: Cardioid Mic + Ambient Mic → L2: GCCPF → L3: MMSE

SLIDE 36

Speech in LOTS of Noise

SLIDE 37

Three-Layer Noise Reduction – L1: Close-Talking Cardioid-ish Microphone

  • Start with as high an SNR as we can!
  • The test recording was done facing a speaker set so loud I could hardly hear myself talk*
  • The costume head will also add some noise reduction

* This test recording was actually done using an omni-directional microphone, thus worst-case

Figure Eight – Knowles Acoustics; Cardioid – SoundGuys

SLIDE 38

Three-Layer Noise Reduction – L2: Two-Channel Cancellation

  • GCCPF: Generalized Cross-Coupled Paired Filter
  • Models the paths between the noise-reference and speech microphones, then subtracts the noise reference from the signal and vice versa
  • I modified the Sugiyama algorithm to take better advantage of the close-talking mic and to self-adjust better to the stupid acoustic environment

“Low Distortion Noise Cancellers – Revival of a Classical Technique,” Akihiko Sugiyama
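For the flavor of two-channel cancellation, here is the classic one-direction adaptive noise canceller with an NLMS update. This is the textbook Widrow-style structure, shown as a stand-in; the talk's GCCPF is a cross-coupled, modified variant of this idea, not this code:

```python
def nlms_noise_canceller(primary, reference, taps=8, mu=0.5, eps=1e-6):
    """Two-microphone noise canceller: adapt an FIR model of the path
    from the ambient (reference) mic to the speech (primary) mic, then
    subtract the modelled noise from the primary signal."""
    w = [0.0] * taps          # FIR estimate of the noise path
    buf = [0.0] * taps        # reference delay line, newest sample first
    cleaned = []
    for d, x in zip(primary, reference):
        buf = [x] + buf[:-1]
        noise_est = sum(wi * xi for wi, xi in zip(w, buf))
        e = d - noise_est     # what is left after subtracting the noise
        norm = sum(xi * xi for xi in buf) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]  # NLMS update
        cleaned.append(e)
    return cleaned
```

On a noise-only test (the primary mic hears a delayed, attenuated copy of the ambient mic), the residual drops toward zero once the filter has converged.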

SLIDE 39

Three-Layer Noise Reduction – L3: One-Channel Cancellation

  • Based on MMSE-STSA noise estimation (Minimum Mean-Square Error Short-Time Spectral Amplitude)
  • Related to the Audacity noise canceller, but able to handle non-stationary noise conditions. Like your phone does!
  • The example is set overly aggressive!

“Development of speech technologies to support hearing through mobile terminal users,” T. Togawa, T. Otani, K. Suzuki, T. Taniguchi, 2015. (Not the exact algorithm used in my code – used here for the nice figure)
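The core of an MMSE-STSA-family suppressor is a per-frequency-bin gain driven by the "decision-directed" a-priori SNR estimate. The sketch below uses the simpler Wiener gain rule in place of the Bessel-function MMSE-STSA gain, and is not the Togawa et al. algorithm:

```python
def spectral_gain(noisy_power, noise_power, prev_clean_power, alpha=0.98):
    """Noise-suppression gain for one STFT bin, using the
    decision-directed a-priori SNR estimate and a Wiener gain.
    Returns a value in [0, 1): 0 mutes the bin, ~1 keeps it."""
    post_snr = max(noisy_power / noise_power - 1.0, 0.0)   # a-posteriori SNR - 1
    prio_snr = alpha * (prev_clean_power / noise_power) + (1 - alpha) * post_snr
    return prio_snr / (1.0 + prio_snr)                     # Wiener gain rule
```

Each STFT bin's magnitude is multiplied by this gain; the running `noise_power` estimate is the part that tracks non-stationary noise.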

SLIDE 40

Speech in Frequency Domain

SLIDE 41

Noise-Robust Nasalance

  • Based on statistical speech energy estimation
  • Computes realistic speech motions!
    – Latency low enough
    – Very noise robust
    – Works over a wide range with the same settings

SLIDE 42

Handling the Algorithms

  • The good news: they are adaptive
    – They will work in most environments
    – They will work with most speakers and languages
    – They will work with squeakers
  • The bad news: they are adaptive
    – They can get it wrong at times
    – Many, many parameters to configure
  • The good news: they are robust and forgiving
    – These are some of the most robust algorithms out there
    – Most of the parameters are fixed for the application
    – The remainder tunes easily to a specific costume
    – Your cellphone doesn’t need manual intervention either!
  • There will be assistance software
SLIDE 43

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 44

Capturing Jaw Motion

  • Well, I’ll just use a chin strap with a stretch sensor and…
  • Oh. Bugger
  • Never mind, it’s not comfortable anyway
  • Tried a paddle, a bar, elastic, etc…
    – Shifts around too much
    – Interferes with speech

SLIDE 45

Capturing Jaw Motion

  • Fibre-optic chin loop
    – Very comfy
    – Quite robust
    – Cheap
    – Easy to manufacture
    – Looks boss!
  • Based on exceeding the critical bend angle and causing light to leak out
  • Still in development
    – Needs an adaptive algorithm!

Sensor output while saying “mama, papa”
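One plausible shape for that adaptive algorithm (my assumption, not the author's design): a leaky min/max tracker that keeps re-normalizing the raw light reading to [0, 1] as the chin loop shifts and drifts during wear:

```python
class BendNormalizer:
    """Adaptive range tracker for a drifting bend sensor: both bounds
    slowly leak toward the middle of the tracked range, and each new
    sample pushes them back out, so the output re-spans [0, 1] after
    the sensor shifts.  The leak constant is illustrative."""

    def __init__(self, leak=0.999):
        self.lo = self.hi = None
        self.leak = leak

    def update(self, raw):
        if self.lo is None:
            self.lo = self.hi = float(raw)
        mid = 0.5 * (self.lo + self.hi)
        # Leak both bounds toward the middle, then let the sample push them out.
        self.lo = min(raw, mid + self.leak * (self.lo - mid))
        self.hi = max(raw, mid + self.leak * (self.hi - mid))
        span = self.hi - self.lo
        return 0.0 if span < 1e-9 else (raw - self.lo) / span
```

After the raw signal drifts for a while, the old extremes are forgotten and the normalized jaw opening re-anchors to the new range.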

SLIDE 46

An Aside: Cameras

  • Why not just use computer vision?
    – Aside from the latency? (need >50 fps)
    – Contrast issues with beards, balaclavas; lighting (IR)
    – Powerful computer needed (getting better, e.g. a Jetson board)
    – Readily-available algorithms for facial landmarking (Dlib) are rather noisy
      • Kalman filtering and such removed the fast lip motion, or I had issues with overshoot, or noise again
      • Maybe LMS with access to the voice signal could work?

SLIDE 47

An Aside: Cameras

  • Used in the industry
    – Works very well
    – Good accuracy
  • Not suitable for use in a costume
    – Needs a clear view of the face from a distance
    – Complex algorithms need powerful computers

Cara Motion Capture (www.vicon.com)
DisneyResearchHub – Synthetic prior design for real time facial capture https://www.youtube.com/watch?v=w71vxi60SzM

SLIDE 48

An Aside: Cameras

  • Dlib-based real-time facial landmark annotation
  • Requires aggressive smoothing (Kalman)
    – Filters out all the little motions
    – Some overshoot
  • Camera positioning requirements and lighting not practical
RoboCow Industries

SLIDE 49

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 50

Voice System Overview

System blocks: Close-Talking Cardioid-ish Microphone, 3L Noise Reduction, Feed-Back Canceller, Parametric Equalizer, Cross-Over, Sound Effects, Amplifier, Tweeter, Mid-Range, Woofer

SLIDE 51

Voice System Overview

(voice system block diagram, repeated)

SLIDE 52

Awooooooooooo!!!!!!!!!!!

  • It’ll howl all right...
    – Larson effect (acoustic feedback)
    – Why there are few costume voice systems out there
  • Needs:
    – Microphone design
    – Speaker design
    – Feed-back control
  • Speech effects help! (e.g. pitch shifting)

SLIDE 53

Adaptive Feed-Back Canceller

  • Models the path between the microphone and the speaker
  • Not magic: about 10 dB or so of extra gain
    – Cardioid mic + decent speaker design: ~20 dB
    – Total: ~30 dB of system gain!
  • Good enough, as the goal is to replicate your voice at about the same volume (or “big creature” volume)
    – Not “punk band in a suit”!
  • BUT: it’s about gain, not volume
    – If you can speak loud, the suit can also be LOUD

“Robust and Efficient Implementation of the PEM–AFROW Algorithm for Acoustic Feedback Cancellation,” G. Rombouts, T. Van Waterschoot, M. Moonen, 2007.
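To sanity-check the gain arithmetic from the slide: gains in dB add, while the corresponding amplitude ratios multiply. A quick check of the 30 dB figure:

```python
def db_to_amplitude(db):
    """Convert a gain in dB to a linear amplitude ratio."""
    return 10 ** (db / 20)

# Figures from the slide: the adaptive canceller buys ~10 dB on top of
# ~20 dB from the cardioid mic + decent speaker design.
canceller_db, acoustic_db = 10, 20
total_db = canceller_db + acoustic_db   # 30 dB of usable system gain
ratio = db_to_amplitude(total_db)       # roughly 31.6x amplitude
```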
SLIDE 54

Voice System Overview

(voice system block diagram, repeated)

SLIDE 55

Sound Effects – An Example

  • People love pitch shifters
    – But they often sound bad (kinda incomprehensible)
  • Reason 1: simple (W)OLA algorithms (such as the one commonly used on an Arduino) are NOT formant preserving
    – This ruins the formant relationships in speech
    – A time-domain pitch shifter has to lock to F0 for that
      • Such algorithms are far more numerically complex
      • PSOLA is one such algorithm
  • Reason 2: artefacts increase with increasing shift
    – Help the algorithm and actually voice act!
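A minimal illustration of Reason 1, assuming nothing from the talk's code: the crudest possible pitch shifter is plain resampling, and it moves every spectral feature, formants included, by the same ratio. Formant-preserving methods such as PSOLA instead move F0 while leaving the spectral envelope alone.

```python
def naive_resample(x, ratio):
    """Naive pitch shift by linear-interpolation resampling: play the
    signal back `ratio` times faster.  Pitch goes up by `ratio`, but so
    do the formants (and the duration shrinks), which is why this kind
    of shifter makes speech sound wrong."""
    out = []
    n = int(len(x) / ratio)
    for i in range(n):
        t = i * ratio
        j = int(t)
        frac = t - j
        nxt = x[j + 1] if j + 1 < len(x) else x[j]
        out.append(x[j] + frac * (nxt - x[j]))
    return out
```

Resampling a sine by 2x halves its duration while keeping the same number of cycles, i.e. the frequency doubles.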

SLIDE 56

Voice System Overview

(voice system block diagram, repeated)

SLIDE 57

Parametric Equalizer

  • Corrects for the muffled voice
  • Compensates for the filtering effect of the costume head, speaker response, microphone, etc...
  • EQ tuning is complex
    – REW to the rescue
    – With help from my own method for transfer-function estimation

https://www.roomeqwizard.com/
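Parametric-EQ bands of the kind REW reports (center frequency, gain, Q) are typically realized as peaking biquads. The standard coefficients from Robert Bristow-Johnson's Audio EQ Cookbook are (this shows the generic filter, not the talk's specific EQ settings):

```python
import math

def peaking_biquad(f0, fs, gain_db, q):
    """Peaking-EQ biquad from the RBJ Audio-EQ Cookbook, one per band.
    Returns (b0, b1, b2, a1, a2), normalized so that a0 == 1."""
    A = 10 ** (gain_db / 40)            # sqrt of the linear peak gain
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = 1 + alpha * A, -2 * math.cos(w0), 1 - alpha * A
    a0, a1, a2 = 1 + alpha / A, -2 * math.cos(w0), 1 - alpha / A
    return b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0
```

Chaining one such biquad per measured band gives the correction filter; the boost at f0 is exactly `gain_db`, and the response returns to unity away from the band.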

SLIDE 58

Voice System Overview

(voice system block diagram, repeated)

SLIDE 59

Why a Bi-Amped System?

  • The voice MUST come from the mouth for realism
  • It’s hard to fit a full-range speaker in the mouth
  • We can cheat a bit:
    – High frequencies do most for sound localization
    – Tweeter in the nose
    – Tweeters are small!
  • The mid-range speaker can be some place else (e.g. muzzle, cheeks, forehead, chin, chest, shoulders)
  • The woofer can be almost anywhere (no directionality)

3-D Audio & Applied Acoustics Lab, Princeton
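The band split behind a bi-amped (or here tri-amped) design can be sketched with the simplest possible crossover. This is a first-order illustration of the idea only; a real build would likely use steeper slopes (e.g. Linkwitz-Riley), which the talk does not specify:

```python
import math

class OnePoleCrossover:
    """First-order crossover sketch: a one-pole low-pass feeds the
    woofer path and the residual (input minus low-pass) feeds the
    tweeter path, so the two bands sum back to the input exactly."""

    def __init__(self, fc, fs):
        self.a = math.exp(-2 * math.pi * fc / fs)   # low-pass pole for cutoff fc
        self.lp = 0.0

    def split(self, x):
        self.lp = self.a * self.lp + (1 - self.a) * x
        return self.lp, x - self.lp   # (low band, high band)
```

A low-frequency tone lands almost entirely in the woofer band, while the two outputs always reconstruct the input, so nothing is lost at the split.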

SLIDE 60

An Aside: Speaker Enclosures

  • An enclosure or sound board is required for proper sound
    – Avoids comb filtering due to acoustic short-circuit
  • Must be big enough
  • Helps with speaker-microphone isolation
  • Best placed right in front of the microphone (if cardioid)

Elliot Sound Products

SLIDE 61

Content

  • The Goal
  • State of the Art
  • Why Moving Jaws Fail
  • Mapping Human Speech to a Character
  • Dealing with Speech in the Real World
  • Jaw Motion Capture
  • Voice Projection
  • Putting it all together
SLIDE 62

Not Done, Not Yet!

Still missing: jaw sensor, real-time speech

Research = 99% falling on my face in high spirits, 0.9% crying under a blanket, 0.1% success

Also, I have to eat... It will all be integrated by NFC 2021! (I hope... It’s a nice blanket)

SLIDE 63

But: Finally Getting Somewhere!

Now Featuring: Li’l Bitey and Floere in Duet

You might want to cover your ears. Or run. Yes, you should have brought duct tape. What do you mean, you don’t have any? Well, sucks to be you… Maybe you can just sing along loudly to drown it out?

SLIDE 64