Microsoft VASA-1 makes faces talk and sing realistically

02:21 PM April 22, 2024

Have you ever watched the noontime show “Eat Bulaga!”? You’ll notice that it uses human portraits made to talk like family members.

The gag is that they speak in awkward robotic voices with exaggerated American accents.


Recent artificial intelligence developments have transcended these caricatures. Microsoft VASA-1 is the latest example. 

The major tech firm announced it created an artificial intelligence model that can make faces articulate and speak clearly.

What is Microsoft VASA-1?

The company co-founded by Bill Gates announced VASA, an AI framework for generating “talking faces of virtual characters” from a single picture and a speech audio clip.

Microsoft calls the first model VASA-1. It can produce lip movements that synchronize closely with sound clips. 

Moreover, the AI model captures numerous facial nuances and natural head motions that make the generated faces convincingly lifelike.

The company said VASA-1 came from core innovations like a holistic facial dynamics and head movement generation model that works in a face latent space. 

It also involved developing an “expressive and disentangled face latent space” from numerous video samples. As a result, VASA clips exhibit the following characteristics:

  • Realism and liveliness: The AI model can make face portraits move naturally, without being stuck to the background.
  • Controllability of generation: Users can make the faces look in specific directions, zoom them in or out, and convey different emotions.
  • Out-of-distribution generalization: Microsoft VASA-1 can handle artistic images, singing audio, and non-English speech, without training for these features.
  • Power of disentanglement: The AI program allows users to change a face’s appearance, 3D head pose, and facial dynamics individually.
  • Real-time efficiency: The Microsoft AI generates video frames with 512×512 pixel size at 45fps in the offline batch processing mode. Also, it can support up to 40fps in the online streaming mode.
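
To put the frame rates above in perspective, here is a minimal sketch that converts them into a per-frame time budget and a raw pixel throughput. The function names are hypothetical and chosen only for illustration; the sketch uses nothing beyond the 512×512 resolution and the 45fps and 40fps figures reported above, and it says nothing about how VASA-1 works internally.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Return how many pixels are produced each second at the given frame rate."""
    return width * height * fps


# Figures reported for VASA-1: 512x512 frames, 45fps offline, up to 40fps streaming.
for mode, fps in (("offline batch", 45), ("online streaming", 40)):
    budget = frame_budget_ms(fps)
    throughput = pixels_per_second(512, 512, fps)
    print(f"{mode}: {budget:.1f} ms per frame, {throughput:,.0f} pixels per second")
```

In other words, the model has roughly 22 milliseconds to produce each frame in batch mode and 25 milliseconds in streaming mode.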

Microsoft reminds the public it used virtual, non-existing identities made by AI programs DALL-E-3 and StyleGAN2, except for the Mona Lisa sample. 

These portraits do not impersonate any people in the real world. Microsoft intended these limitations because it understands the possibility of misuse.

The company stated, “We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.” 

