What is Lip Sync?
Chances are if you've ever animated a character, you've needed or wanted
to make it talk, sing, or otherwise communicate via dialogue of some sort.
Lip sync is the art of taking a pre-recorded track of dialogue and making
a character appear to speak it. This involves figuring out the timings of
the speech (the breakdown) as well as the actual animating of the lips and mouth.
In addition, creating the setup and the mouth phonemes themselves can also be
considered part of the entire lip sync process for 3D.
This article will give an overview of the process of animating dialogue.
While I may focus on specific software in some cases, I'll try to stay
general enough that you can use this information with any software.
You should have a basic grasp of standard animation techniques and terminology.
If you don't, you might want to check out the Frequently Asked Questions (FAQ)
document on the CG-Char web pages.
Part I - Phoneme Setup
As with any 3D animation, you can't actually animate until you have your
object, and it has to be set up for animation. For
characters' bodies, this usually consists of segmented joints parented to each
other, or for smooth characters, a setup with bones.
Lip sync in 3D has similar requirements. In this case though we're only focusing on the head of the character. (Or at least the mouth if you have some freaky creature that doesn't have a 'head' per se.)
When thinking about 3D facial setup, it helps to look at how 2D cartoons achieve lipsync. The appearance of speech in cartoons is created by drawing different mouth shapes for a character on different frames. By animating the mouth shapes timed to dialogue, the character appears to speak. Each of these mouth shapes is usually referred to as a phoneme.
A phoneme is the unique shape a mouth takes to make a specific sound. For example, when you make the sound "ooo" as in "groovy" your mouth tends to pucker into a small circle. When you make the sound "Aaa" as in "Apple" your mouth tends to open up wider and fuller. The word "Animation" could be described phonetically as "a-nee-may-shun" or, more like a breakdown, as "a-n-ee-m-ay-sh-u-n".
For basic breakdown purposes, the lowest number of phonemes most people use is 9.
These are usually broken down as:
The phoneme set I now use, which is derived from the larger set the Ventriloquist
tool requires (see below), is as follows:
There is one other item you need to be aware of for setup: emotion, or
expression. Unless you want your character to remain perfectly flat, you'll need to
make it look happy, sad or a wide variety of other expressions. Typically there are
6 base emotions. These are: happiness, sadness, anger, fear, surprise and disgust.
Getting back to our cartoons...the animator would draw one of these mouth pictures at the right point to create the lipsync. Naturally this carries over to 3D animation as well. You must keyframe your character's mouth to match the right phoneme for the sound at that frame. This means you must set up your character's head to deform into each of these shapes.
The method you plan to use for animating will directly influence how you want to
approach setup. For example, if you are going to use bones to modify your head and
mouth, you don't need to model different phonemes; you'd just set up the bones. On
the other hand, for a weighted morphing system like Morpher, Blend Shapes, Morph Magic, Smirk or the Morph
Gizmo, you'd need to create different models. The following is a list of some ways
to set up and animate a head:
A bones or FFD setup consists of modeling and texturing one head. Then you set up bones or FFD controls around the mouth, such that the lips can bend into the required shapes and the lower jaw can drop. Once the skinning is done, you can manually position the bones or FFD controls on each frame to animate them. Another alternative is to create an "archive" of keyframes for the phonemes. For example, 3D Studio MAX allows you to set keyframes for objects before frame 0. So you might set up your head with bones and make the A phoneme on frame -1, the E phoneme on -2 and so on. Later you could very quickly copy the bone keys from a negative frame over to a real part of the animation. This saves a lot of time, and in fact there are plugins to automate this. If you use 3D Studio MAX you can download the 47k ZIP file of the Magpie Importer by Andrew Reid. A sketch of the archive idea appears below.
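To make the archive idea concrete, here is a minimal sketch in Python. The pose data and key structures are hypothetical stand-ins, not MAX's actual API; the point is just that each phoneme pose is stored once and then stamped onto real frames as the breakdown dictates.

```python
# A minimal sketch of the phoneme "archive" idea: store one pose per
# phoneme (as if keyed on negative frames), then copy it onto the real
# timeline. Pose values and structures are hypothetical, not a real API.

# Each archived pose maps a control name to a rotation/position value.
POSE_ARCHIVE = {
    "A":   {"jaw": 25.0, "lip_upper": 5.0,  "lip_lower": -8.0},
    "E":   {"jaw": 12.0, "lip_upper": 3.0,  "lip_lower": -4.0},
    "MBP": {"jaw": 0.0,  "lip_upper": -2.0, "lip_lower": 2.0},
}

def copy_pose_to_frame(keys, phoneme, frame):
    """Copy every control value of an archived pose onto a frame."""
    for control, value in POSE_ARCHIVE[phoneme].items():
        keys.setdefault(control, {})[frame] = value

# Stamp a tiny piece of breakdown: "pay" -> MBP closure, then open A.
keys = {}
copy_pose_to_frame(keys, "MBP", 10)
copy_pose_to_frame(keys, "A", 14)
print(keys["jaw"])  # {10: 0.0, 14: 25.0}
```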
One problem with bones and FFDs is that it can be difficult to get the original model to deform nicely for each pose. Getting the proper creases, folds and lip movement correct can be a pain to set up. I've personally done both bones and FFDs for facial setup. Compared to actually modeling in a change, it can be a lot of work.
Another downside of bones and FFDs is that they can be slow to animate, and difficult to add emotion or variation to. For example, if you're animating bones by hand without using an archive, then you have to manually pose each bone for each keyframe. If you are using the archive, your animation may look robotic, since every instance of the same phoneme will look identical; i.e., all the A phonemes will look alike unless you manually change them. Plus, unless you create a large archive of phonemes with different expressions (such as happy, sad, ...), you'll need to go back and tweak the keys by hand.
Enter weighted morphing. Weighted morphing is just what it sounds like, different morph targets mixed together at different strengths. What you do is model a different head for each phoneme. Typically these heads must have the same point/vertex count and ordering as the original base. For that reason the other target models are usually created by simply pulling, stretching, using bones or FFDs on the original model and saving out the copies. However, as it's just a model you can use any non-destructive modeling tool. I usually work with a spline model, or low poly (smoothed after morphing), and simply move the points around to make the new targets.
When animating, instead of always hitting an A pose perfectly you could use the A phoneme target at, say, 35% and get something between an A phoneme and a closed mouth. Or you could mix the A phoneme with others to get a new variation. But it gets even better. Remember those expressions? Now you could mix, say, an A phoneme at 80% with the Anger expression at 90% to get a very mad looking face still saying ahhh. Now let's go one step further. Rather than model phonemes, model targets for each muscle of the face. By modeling facial muscles, or rather the effect they have on the face, any phoneme or expression can be made from only a few targets.
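Numerically, weighted morphing is just a weighted sum of each target's offset from the base head. Here is a minimal sketch of that idea (the tiny 3-vertex mesh and weight values are illustrative, not any particular package's morph system):

```python
import numpy as np

# Weighted morphing sketch: final = base + sum(weight * (target - base)).
# Targets must share the base's vertex count and ordering, as noted above.

base = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.5, 1.0, 0.0]])

targets = {
    "A":     base + np.array([[0.0, -0.3, 0.0], [0.0, -0.3, 0.0], [0.0, 0.1, 0.0]]),
    "anger": base + np.array([[-0.1, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, -0.2, 0.0]]),
}

def blend(base, targets, weights):
    """Mix morph targets at different strengths (0.0 to 1.0 each)."""
    result = base.copy()
    for name, weight in weights.items():
        result += weight * (targets[name] - base)
    return result

# An A phoneme at 80% mixed with the anger expression at 90%.
print(blend(base, targets, {"A": 0.8, "anger": 0.9}))
```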
As an example of muscle modeling, by making a target where the jaw drops and then targets for the zygomatic major, risorius and platysma muscles, you can make a smile and an E phoneme. Simply put, using muscle targets mimics the way a real head works. You simply animate the muscle groups to shape the head into the desired phoneme and expression at each keyframe.
One catch with the muscle approach, however, is that sometimes you will have to mix more than three shapes, and this can mess up how the targets look. For this reason you can still model regular phoneme shapes, and then also model additional muscle or expression shapes for tweaking.
Weighted morphing can be used to create very realistic and fluid lipsync. You may be able to obtain or write a plugin or expression to import a breakdown and directly alter the morph keyframes for the lipsync. Then you could tweak it and add in keyframes for expressions. Or you can animate by hand; while slower, it is still quicker to adjust a slider than to manually move or rotate tons of bones or FFD controls. For example, with bones, you might have 16 controls for the lips and one for the jaw. To open the mouth, you would have to manually position and rotate each of these bones. With weighted morphing, the same task could be accomplished by dragging one slider.
The images to the right show the basic expression targets used for weighted morphing that work in conjunction with the phoneme targets listed earlier. Many books or papers on creating facial targets show a diagram of the face with the muscles used. Rather than do that here, I've decided to actually show sample expression targets. This encompasses much of the same information when mixed with the phonemes, includes a bit more detail and should be more useful. However, it is still helpful to look at and understand the facial muscles. I'd highly recommend taking a look at Gary Faigin's "Facial Expression" book listed in the bibliography at the bottom of this article. It is a fantastic reference for anyone doing facial animation, whether you will use the muscle targets or not. I'd also highly recommend using a mirror to study your own face.
The following is a quick description of each of these targets and how they are used:
For some of these targets I mention creating the left and right sides as separate poses. The reason for this is that you will have more control. You can make your character grin to one side, or blink one eye and so on. If you had both sides as one target this wouldn't be possible. Another big reason is to easily remove the "twins" syndrome of too much symmetry in the pose. If you need to save space and create some of these as one target, you should make the target offset left and right a bit, such as the smile on one side slightly higher than the other. This will help make the face asymmetrical and look more natural. For more information on basic animation principles and "twins" see the CG-Char Web Pages and the FAQ document there.
One thing to note with weighted morphing is that mixing the various targets won't always result in exactly what you want. This is usually a result of three or more targets having some odd interaction when mixed together at high percentages. Therefore it is typical to start with phonemes or targets for each muscle, but then to add new targets with specific snapshots of the head as needed. This is especially true for hyper-exaggerated poses.
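One common way to wire in such a snapshot, sketched below under my own assumptions (fading the fix in by the product of the two driving weights is just one plausible choice, not a standard), is to store the difference between the sculpted snapshot and the bad blend, then add it back on top as both weights rise:

```python
import numpy as np

# Corrective "snapshot" sketch: when two targets mix badly at high
# percentages, sculpt the desired result, store the difference from the
# plain blend, and fade that correction in as both weights rise.
# The fade rule (product of weights) is an assumption, not a standard.

def blend(base, targets, weights):
    result = base.copy()
    for name, weight in weights.items():
        result += weight * (targets[name] - base)
    return result

base = np.zeros((3, 3))
targets = {"A": base + 0.5, "anger": base - 0.4}

# Hand-sculpted snapshot of how A + anger at 100% each *should* look:
desired = base + 0.2
correction = desired - blend(base, targets, {"A": 1.0, "anger": 1.0})

w_a, w_anger = 0.9, 0.8
pose = blend(base, targets, {"A": w_a, "anger": w_anger})
pose += (w_a * w_anger) * correction
print(pose[0])
```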
Finally I'd like to mention a few points about the actual mouth model itself. This applies to any of the techniques mentioned above. Since the mouth of your character will be open, it is generally a good idea to have some amount of detail in there: teeth, gums, tongue and some inside mouth cavity. You may not think you need gums, but for poses like the sneer you'll see them, so they should be modeled. For a tongue I tend to use a simple deformed sphere. In all, it really depends on your character. I imagine some creatures might not have teeth or could be more cartoony, but in many cases you'll want this detail. Once your character is set up, you are ready for the next step in lipsync: Breakdown.
Part II - Breakdown / Track Analysis
Track analysis, or breakdown, is the art of listening
to pre-recorded dialogue and sound, and figuring out the timing. Traditionally the audio
would be played on a device with a counter. The person analyzing it would listen over and over
and write down the best estimate of when each phoneme occurred.
For cartoons this information was (and is) recorded on what is called an Exposure sheet, or X-Sheet for short. An X-Sheet is really nothing more than a glorified table. It has numbers representing frames down one column, and then other areas where you can pencil in dialogue notes, camera instructions and so on.
[Images: A section from a traditional paper X-Sheet. The one on the right has been filled out as a sample.]
Of course, computer animation once again mimics its traditional history. The breakdown is still recorded, many times on a digital X-Sheet. However, the actual work of figuring out the timing has gotten a bit easier. The first method one can use is to load the audio file onto the computer with software that shows the waveform and timecode, then repeatedly play the whole or parts of the audio, figuring out the timing. This is essentially the digital counterpart of the traditional method.
Yet, track analysis can get even easier than this. There is software available that not only shows the waveform and timecode, but actually has a digital exposure sheet and allows bitmaps of phonemes to be played in real time. So you can see and check how your breakdown is working as you create it. Another variation of this is to simply work right in the 3D animation program and if it allows, scrub the time slider back and forth to hear the audio and then key the pose at the right place. Assuming fast enough playback, you may be able to view results in near realtime.
At the top of the line is automatic voice recognition. These packages automatically look at a WAV file and figure out the timing and phonemes for lip sync. All you need to do is go back and tweak to correct any errors. My personal favorite is Ventriloquist (a plugin for 3D Studio MAX; the generic standalone is called "Echo") by Lips Inc. This utility can take an audio file and a text line of what is in the file, and automatically generate nice weighted morph fcurves for phoneme based weighted morphing. It can also create a digital X-Sheet for you if you still want to enter the data by hand, or want to write your own importer. It still requires tweaking, but is pretty good as a first step, or as a final step in cases where time doesn't permit. Since the point of this article is to teach how to manually do analysis, I won't go further into this topic.
The shareware software and my personal choice for manually breaking down audio is Magpie. Magpie is available at: http://thirdwish.simplenet.com/magpie.html and runs on Windows 95 and NT machines. It allows you to drag and drop custom phonemes into a digital exposure sheet as well as view 2d images for each phoneme in realtime for playback.
Sample screenshot from Magpie. Note the phonemes on the left, the waveform, sample
image and main window housing a digital exposure sheet.
Analyzing voice is simply a matter of listening to small chunks of the audio and marking
the proper phoneme for the proper frames. As an example, I have used
Treason.wav. This is a woman saying "You will pay for your treason pilot!".
In figuring out what phoneme comes where you can almost "see" the breakdown visually. Notice the pattern of the waveform from frames 2-10. These are the words "You will". 3-5 is the "You" and 6-10 is the "Will". You can tell this because vowels tend to make large balloon shapes in the waveform, while the M/B/P, F/V and to a lesser extent W, Y and S tend to flatten out the waveform. So frames 6 and 7 are likely the "W" phoneme of "will".
Look at frames 10-13. This is a very flat area. The next word "pay" has the M/B/P phoneme which is what causes this. Frames 14-15 tend to show an increase in the waveform as the mouth opens a bit and some of the air escapes. 16-20 is that "ay" sound of the word "pay".
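If you want to "see" those balloon and flat shapes numerically, a rough per-frame loudness printout works. The sketch below assumes a mono, 16-bit WAV at 30 animation frames per second; the file name is just the sample used in this article:

```python
import wave
import numpy as np

# Rough amplitude readout per animation frame: loud balloon shapes
# suggest vowels, near-silent flat spots suggest M/B/P closures.
# Assumes a mono, 16-bit WAV and 30 fps; adjust for your footage.

FPS = 30

with wave.open("Treason.wav", "rb") as w:
    rate = w.getframerate()
    samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

samples_per_frame = rate // FPS
for f in range(len(samples) // samples_per_frame):
    chunk = samples[f * samples_per_frame:(f + 1) * samples_per_frame]
    rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
    print(f"frame {f:3d}: {'#' * int(rms / 500)}")  # crude text waveform
```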
You may be noticing that my breakdown doesn't follow this; that is, I have actually placed the phoneme F on frames 17-19 and the A and E actually occur much earlier. There are two reasons for this. First, it is very common to have the mouth shapes lead the sound. That is, you'll want to shift your mouth poses to appear 2-4 frames before the sound actually occurs in the WAV file. This simply tends to make the lipsync look more correct. Tony White in "The Animator's Workbook" mentions that at Disney, some actions were even anticipated 12-16 frames ahead of when they occurred.
In addition, it is very common to need to lead the M/B/P and sometimes F/V sounds even more. For example, the word "Pilot". Frames 49-51 are the flat section of the P phoneme. At 52 there is a breath of air as the mouth starts to open. 55 starts the vowels. However, I have placed the breakdown for this word about 3-4 frames ahead simply because it looks more correct when played back. Even so, I still use the visual cues to help me when I'm breaking down the track.
There are two ways to handle this leading of the poses. First, you can break down your track 100% accurately to the waveform: look at the pattern, listen to the sound and place the right phoneme there. Then later on, simply slip your audio to start a little later. The other option is to compensate in the breakdown itself, as I did. This is the method I prefer since I feel I may want to adjust the keys more or less for different phrases. It allows me to tweak the timing of the poses within Magpie and then the final version in MAX without any extra mucking about later.
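Either way, the lead boils down to subtracting a few frames from certain keys. Here is a tiny sketch of the compensating approach (the 2-frame general lead and the extra lead for closures are example values, not rules):

```python
# Sketch of leading mouth poses ahead of the audio. A breakdown is a
# list of (frame, phoneme) pairs; lead amounts below are just examples.

GENERAL_LEAD = 2                    # shift every pose 2 frames early
EXTRA_LEAD = {"MBP": 2, "FV": 1}    # closures lead even more

def apply_lead(breakdown):
    led = []
    for frame, phoneme in breakdown:
        lead = GENERAL_LEAD + EXTRA_LEAD.get(phoneme, 0)
        led.append((max(0, frame - lead), phoneme))
    return led

breakdown = [(3, "OO"), (6, "W"), (12, "MBP"), (16, "AY")]
print(apply_lead(breakdown))
# [(1, 'OO'), (4, 'W'), (8, 'MBP'), (14, 'AY')]
```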
The other thing to keep in mind when analyzing the voice is the sound of each phoneme. Remember that a phoneme or sound of the voice doesn't need to match the spelling of the word. One famous example is that "ghoti" spells the word "fish". Granted, that looks like it would sound like "goat-tea", but phonetically it can just as easily sound like the aquatic animal. Here's how: take the "gh" from the word "enough", the "o" from the word "women", and finally the "ti" from the word "nation". As you can see, all of these words have sections that are spelled one way but sound another. Keep this in mind and really listen to the sound of each phrase as you do the breakdown.
Finally, it's OK to drop phonemes. For example, look at the final word "Pilot". The "ihhh" sound of "lot" isn't there. It simply is dropped. One of the key tricks in getting lipsync to work right...especially fast paced speech, is to learn what to drop and what to keep. Think of the mouth as flowing and figure out what the best shape is at that point. Do this by listening and looking at what phonemes are around the current frame. In general, if you hit the M/B/P phoneme and the large vowels that follow, your animation will usually look correct. Everything else is really just an inbetween of the extreme mouth closed, mouth open and mouth tighten poses. Think about puppets. Typically they have a mouth open, and a mouth closed pose. Yet in many cases, people accept a puppet as actually talking. If your lip sync looks off, check the timing of these poses first. Chances are correcting them will correct your animation.
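As a toy illustration of that keep-the-anchors idea (the phoneme labels and the keep-list are my own example, not a fixed rule):

```python
# Sketch of simplifying a fast passage: keep the M/B/P closures and the
# big vowels, and let everything else become an inbetween. The keep-list
# is an example choice, not a rule.

KEEP = {"MBP", "FV", "A", "AY", "E", "O", "OO"}

def simplify(breakdown):
    return [(f, p) for f, p in breakdown if p in KEEP]

# "Pilot" with every sound marked, then simplified:
word = [(49, "MBP"), (52, "AY"), (55, "L"), (57, "O"), (59, "T")]
print(simplify(word))  # [(49, 'MBP'), (52, 'AY'), (57, 'O')]
```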
When I first start to do breakdowns, I tend to put a keyframe on every frame. For example, take a look at the breakdown I have for this WAV file. If you can't see the image, you can also use the text version saved by Magpie. Note how there are X's on certain frames; this represents where the keyframe for that phoneme would go. Frame 4, for instance, has a W/OO sound. As shown in the text version of the breakdown, I start by placing a keyframe on all frames. This does NOT mean I have a different phoneme on each individual frame. Rather, there are no inbetweens.
What this does is make the animation very snappy. Each phoneme literally pops in. This makes it easier to see mistakes in the timing. You can view an AVI of this initial rough breakdown (301K). If this animation looks a bit harsh or wrong, you're quite right.
Essentially this version suffers from a lack of tweening, making it look too rough. Also the fact that I'm using only 14 basic phonemes...none of which really look like a yelling type expression makes it seem out of place. Plus the phonemes don't really smoothly correlate to each other. For example the "f" in "for" should probably be a bit wider because it comes right after "pay" where the mouth gets a bit wider. However even with some inbetweens this starts to look a little better. This is shown in the tweened default AVI (334k).
This second AVI is identical to the first, except there are key frames only where marked as on this text copy of the breakdown. This makes the phrases flow a little better...though it starts to look a bit too floaty.
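The difference between the snappy first pass and this floatier second pass is purely where keys sit. A small sketch of the two passes (the per-frame phoneme list is made up for illustration):

```python
# The two passes differ only in key placement: a key on every frame
# makes phonemes pop with no inbetweens (snappy, good for checking
# timing); keys only at changes let the software tween (smoother, but
# it can drift toward floaty). Phoneme-per-frame data is illustrative.

frames = ["OO", "OO", "OO", "W", "W", "IH", "IH", "IH", "L", "L"]

# Pass 1: a key on every single frame (stepped look).
stepped = list(enumerate(frames))

# Pass 2: keys only where the phoneme changes (tweened look).
tweened = [(f, p) for f, (prev, p)
           in enumerate(zip([None] + frames[:-1], frames)) if p != prev]

print(len(stepped))  # 10 keys
print(tweened)       # [(0, 'OO'), (3, 'W'), (5, 'IH'), (8, 'L')]
```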
Once you have finished the track analysis you should have your finished X-Sheet. With that information you are ready for the final phase: actually animating your character.
Part III - Animation
We have finally arrived at the animation stage! This is where some of the real fun happens.
As you saw, doing breakdowns is really pretty simple and in some ways tedious work. At this point,
you can start to actually make your character have emotion and come alive.
Probably the simplest way to get from a breakdown to an actual animated character is to import the breakdown directly. Magpie can export various formats, including some methods specific to Animation Master and Lightwave. If you use 3D Studio MAX you can download the 47k ZIP file of the Magpie Importer by Andrew Reid. Also, newer versions of Magpie may allow direct importing into MAX. All you really need is to get a keyframe of the pose on the right frame as listed on your X-Sheet; a sketch of such an importer appears below.
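For illustration, here is a minimal importer sketch. The one-pair-per-line text format and the set_morph_key call are assumptions standing in for Magpie's real export layout and your package's keyframing API:

```python
# Hypothetical breakdown importer: read "frame phoneme" lines and set a
# 100% morph key for that phoneme on that frame. The file format and
# set_morph_key are stand-ins, not Magpie's or MAX's actual interfaces.

def set_morph_key(phoneme, frame, weight):
    print(f"key {phoneme} @ frame {frame} -> {weight:.0%}")

def import_breakdown(path):
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            frame, phoneme = int(parts[0]), parts[1]
            set_morph_key(phoneme, frame, 1.0)

# import_breakdown("treason_breakdown.txt")
```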
There are two problems with this method though. First of all, it's probable you'll still need to animate the upper face by hand. Second, the animation will probably look "canned" or generic. The result of simply taking a breakdown and importing it is the same as the animation of the simple tweened AVI mentioned before. What happens is every "A" phoneme looks the same, every "T" phoneme looks the same, and so on. Even if you have more phonemes, chances are things will just look kind of blah. (Note: some automated packages, like Ventriloquist/Echo mentioned above, are smart enough to scale the phoneme percentage based on the volume and speed of the phrase, so the generic look isn't quite as prominent and needs less tweaking.)
The best way to animate the face is to start with the object, your X-sheet, and then manually pose each keyframe yourself. If there is one thing I'd highly recommend when doing facial animation it's buying a mirror. You can pick up a little bathroom mirror at a drugstore for relatively cheap, and it's invaluable to have it sitting on your desk as you animate.
All you do is look at your X-Sheet and copy the pose. Take the above "Treason" breakdown: the first key after frame 0 is the Y pose on frame 2. Go to that frame, then say the word and look at yourself in the mirror. Then match what you see to your character in your 3D package. Then go to the next keyframe and repeat this process.
The benefits of this method are that you are not limited to any preset phonemes. The mouth will look very natural since you will be posing every frame as it would look. You can go extreme when needed, smaller when not. Pay attention to how letters change especially consonants when they're around vowels. Compare my custom hand animated AVI (360k) to the generic tweened (334k) version from before.
Hand animating allows you to pay close attention to snap, offsetting the left and right sides of the face to make things look more natural, and generally produces a better looking result. One thing to keep an eye on with any method, though, is the interpolation.
By default, most spline interpolation overshoots when two keyframes are close together, or near
other keys that are radically different. For example, look at the image on the right.
The top version has a default smooth spline type. Note how the area the red arrow is pointing
to dips downward even though the next key is higher. Actually, almost every
section of this curve is incorrectly interpolated.
Now examine the lower image. This spline has been adjusted with custom bezier tangents. This allowed me to direct the path of the curve to more closely follow the keyed points. Imagine if the point at the red arrow was equal to 0 percent. The top version would have tweened that target or bone to a negative value. If your morph program didn't cap at a 0 value you might end up with something very odd looking indeed. However, in some cases the default interpolation is fine and results in smoother motion or even adding a bit of 'anticipation' to a movement. The catch is to pay attention to the curves and adjust them where necessary. One quick way to fix fcurves is to take all the lower values, such as those at 0 and make them "linear", then take the tops of the curves and set them to be "ease-in" and "ease-out" types. This generally fixes any overshoot of the curves and looks pretty nice.
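Here is a numeric sketch of that overshoot, using cubic Hermite segments with automatic Catmull-Rom style tangents versus tangents flattened at the keys. This is a simplification of what real fcurve editors do, not any package's exact spline:

```python
# Spline overshoot sketch: with automatic (Catmull-Rom style) tangents,
# the segment between two zero-valued keys dips negative because the
# preceding key was high. Flattening tangents at the keys prevents it.

def hermite(p0, p1, m0, m1, t):
    """Cubic Hermite interpolation between two keys, t in [0, 1]."""
    return ((2*t**3 - 3*t**2 + 1) * p0 + (t**3 - 2*t**2 + t) * m0 +
            (-2*t**3 + 3*t**2) * p1 + (t**3 - t**2) * m1)

values = [0.0, 1.0, 0.0, 0.0]   # morph weights at 4 evenly spaced keys

def tangents(values, flat):
    if flat:
        return [0.0] * len(values)
    # Catmull-Rom style: average slope of the neighboring keys.
    inner = [(values[i+1] - values[i-1]) / 2 for i in range(1, len(values)-1)]
    return [0.0] + inner + [0.0]

for flat in (False, True):
    m = tangents(values, flat)
    seg = [hermite(values[2], values[3], m[2], m[3], t / 10) for t in range(11)]
    print("flat" if flat else "auto", [round(v, 3) for v in seg])
# "auto" dips below 0 between the two zero keys; "flat" stays at 0.
```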
Back to our lipsync animation. At this point, you should have a pretty well adjusted version
of the character speaking. Up until now, chances are you haven't animated the upper face.
Simply adding in some eyebrow and eye movement can really add a lot to the character and
expression. In fact, the eyes are what most people focus on when talking so in that sense it's
even more critical than the mouth. What I tend to do is write down ideas on the X-Sheet as I
animate. I just make notes at different frames with things like, expand eyes, raise
eyebrows, etc... Most importantly, again: look in the mirror! I repeatedly act out the
dialogue looking in a mirror, experimenting with different versions. Then I just match that
to the animation.
This can actually be kind of easy, since you already have the breakdown and know when each word occurs. You might notice that when you say a certain word, or part of a word, your eyebrows go up. All you need to do is look at your X-Sheet, find what frame the word starts on and set your eyebrow keys. Actually, it's a good idea to offset the upper face and even parts of the lower face so things don't always hit keys on the same frame. This is shown in the previous custom AVI (360k). When the whole face works together the results are even better; a small sketch of the offsetting idea follows.
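A tiny sketch of that kind of offsetting (region names and offset amounts are arbitrary examples):

```python
# Sketch of offsetting facial regions so keys don't all land on the
# same frame. The regions and offsets below are example values only.

OFFSETS = {"mouth": 0, "brows": -3, "eyes": 2, "head": 4}

def offset_keys(region, key_frames):
    """Shift a region's key frames by its offset (clamped at frame 0)."""
    return [max(0, f + OFFSETS[region]) for f in key_frames]

word_starts = [2, 11, 20, 33, 42]         # word start frames off the X-Sheet
print(offset_keys("brows", word_starts))  # brows lead the words slightly
# [0, 8, 17, 30, 39]
```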
At this point, all that is left is to animate the actual body and head motion. Once again, I use a mirror, act it out and animate. You can view the final version of the AVI (434k) with custom mouth animation that has been tweaked more, eyes and head motion. I'd recommend studying how people move when they speak. There can be some very subtle head motion...and I think there is a propensity for new animators to either overdo or underdo the head motion. Think about what words you want to accent and which parts you want to play down. It also helps to get inside your character's head to figure out what it's thinking. That helps a lot especially with animating the eyes. As in all cases, study from life, and one more time, use that mirror!
Tips, Tricks and Suggestions
Hopefully, this article provided you with enough information to try out your own facial animation.
Doing lipsync breakdown is a great way to work on building your timing skills. It really forces
you to pay attention to every frame. There's nothing like moving a phoneme 1 frame back and having everything suddenly work. Plus, it can be darn fun to do. Below is a list of key points to remember:
Another idea is to also make phonemes for consonants paired with vowels. Compare how your mouth is shaped for the "L" in "Lease" vs. "Loop". In the first case it's wider; in the second, much more puckered and closed. So for each consonant, you might want to make extra phonemes where it's paired with a vowel, such as "la", "li", "lo", "lu", "le" and "loo".
About the Author
Michael Comet is currently a Rigger/T.D. at Blue Sky Studios in New York.
Previously he was Video Team Lead Rigger and CG Supervisor and a 3D Animator/Artist at
Big Idea in
Lombard, IL, where he worked on 3-2-1 Penguins and Veggie Tales.
Prior to that he was lead animator at the video game company
Volition, Inc., where he animated most of the cinematic
sequences for Descent: Freespace, and headed up much of the realtime character animation and
cinematics for their RPG title, "Summoner". He can be reached via email
at comet@comet-cartoons.com, and has a
personal homepage at: http://www.comet-cartoons.com/
which has more information and samples of his work.
About the Sample Images
The head used was modeled with the Surface Tools plugin and 3D Studio MAX r1.2. The character is
part of a personal project I am working on.
This article, all images and character designs are Copyright 1998 Michael B. Comet All Rights Reserved.
This article may be reprinted for personal use only. It may not be packaged or sold in part or in whole, either alone or as part of another package. Unauthorized duplication is strictly prohibited. This article and related artwork, samples or text, are not to be copied onto other sites without prior written consent from the author. When in doubt, ask.
Bibliography
Faigin, Gary. The Artist's Complete Guide to Facial Expression. Watson-Guptill Publications.
White, Tony. The Animator's Workbook. Watson-Guptill Publications.