Microsoft VASA-1 makes faces talk and sing realistically

By: Dale Arasa - 9 months ago

Have you ever watched the noontime show “Eat Bulaga”? If so, you may have noticed that it uses human portraits made to talk like family members.

The gag is that they speak in awkward, robotic voices with exaggerated American accents.

Recent artificial intelligence developments have transcended these caricatures. Microsoft VASA-1 is the latest example.

The major tech firm announced it created an artificial intelligence model that can make faces articulate and speak clearly.

What is Microsoft VASA-1?

Microsoft just dropped VASA-1.

This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba

10 wild examples:

1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD
— Min Choi (@minchoi) April 18, 2024

The company co-founded by Bill Gates announced VASA, an AI framework for generating “talking faces of virtual characters” from a single picture and a speech audio clip.

Microsoft calls the first model VASA-1. It can produce lip movements that synchronize closely with sound clips.

Moreover, the AI model captures numerous facial nuances and natural head motions that make the generated faces convincingly lifelike.

The company said VASA-1 came from core innovations like a holistic facial dynamics and head movement generation model that works in a face latent space.

It also involved using numerous video samples to build an “expressive and disentangled face latent space.” As a result, VASA clips exhibit the following characteristics:

Realism and liveliness: The AI model can make face portraits move naturally, without being stuck to the background.
Controllability of generation: Users can make the faces look in specific directions, zoom them in or out, and convey different emotions.
Out-of-distribution generalization: Microsoft VASA-1 can handle artistic images, singing audio, and non-English speech, without training for these features.
Power of disentanglement: The AI program allows users to change a face’s appearance, 3D head pose, and facial dynamics individually.
Real-time efficiency: The Microsoft AI generates 512×512-pixel video frames at 45 fps in the offline batch processing mode. It also supports up to 40 fps in the online streaming mode.
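
To put the real-time figures in perspective, the short Python sketch below converts the quoted frame rates into the time available to produce each frame. It is plain arithmetic on the numbers above, not Microsoft's code; the function names are made up for illustration.

```python
# Back-of-the-envelope arithmetic on the frame rates quoted above.
# The function names are hypothetical; nothing here comes from VASA-1 itself.

def frame_budget_ms(fps: float) -> float:
    """Return the time available to generate one frame, in milliseconds."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Return how many pixels must be produced each second at a given rate."""
    return width * height * fps


if __name__ == "__main__":
    for mode, fps in [("offline batch", 45), ("online streaming", 40)]:
        budget = frame_budget_ms(fps)
        throughput = pixels_per_second(512, 512, fps)
        print(f"{mode}: {budget:.1f} ms per frame, {throughput:,.0f} pixels per second")
```

In other words, the model has roughly 22 milliseconds to produce each frame in batch mode and 25 milliseconds in streaming mode.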

Microsoft reminds the public that it used virtual, non-existent identities generated by the AI programs DALL-E 3 and StyleGAN2, except for the Mona Lisa sample.

These portraits do not impersonate any people in the real world. Microsoft intended these limitations because it understands the possibility of misuse.

The company stated, “We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.”