May 19, 2024
Realistic Talking Faces

NTU Singapore Develops Program to Create Realistic Talking Faces from Audio and Photos

A team of researchers from Nanyang Technological University, Singapore (NTU Singapore) has developed a groundbreaking computer program that can generate realistic videos of a person talking from just an audio clip and a face photo. The program, called DIverse yet Realistic Facial Animations (DIRFA), uses artificial intelligence (AI) to produce 3D videos in which facial expressions and head movements are accurately synchronized with the spoken audio.
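
DIRFA itself has not been released as a runnable library, so the snippet below is only an illustration of one basic requirement implied by the description above: the generated animation must cover the full length of the input audio clip. The function name, file format, and default frame rate are assumptions made for the example, not details of DIRFA.

```python
# Illustrative sketch only (not DIRFA's code): given an input audio clip, work
# out how many video frames a talking-face generator must produce so that the
# animation stays synchronized with the speech at a chosen frame rate.
import math
import wave

def frames_needed(audio_path: str, fps: int = 25) -> int:
    """Return the number of video frames needed to cover the whole audio clip."""
    with wave.open(audio_path, "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()  # clip length in seconds
    # Round up so the last fraction of a second of speech is still animated.
    return math.ceil(duration_s * fps)
```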

Unlike previous approaches that struggled with pose variations and emotional control, the NTU-developed program has been trained on over one million audiovisual clips from more than 6,000 individuals sourced from The VoxCeleb2 Dataset, an open-source database. By predicting cues from speech and associating them with facial expressions and head movements, DIRFA is able to generate highly realistic and consistent facial animations.
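
The paper's actual network design is not reproduced here, but the sketch below illustrates the general idea of associating speech with motion: a sequence of per-frame audio features goes in, and a sequence of facial-expression and head-pose parameters comes out. All dimensions, layer choices, and names are assumptions for the example, not DIRFA's architecture.

```python
# Illustrative sketch only: maps a sequence of audio features to a sequence of
# facial-expression + head-pose parameters. Sizes and layers are assumptions.
import torch
import torch.nn as nn

class AudioToAnimation(nn.Module):
    def __init__(self, audio_dim: int = 80, anim_dim: int = 70, hidden: int = 256):
        super().__init__()
        # audio_dim: e.g. 80 mel bands per frame (assumed)
        # anim_dim: e.g. 64 expression coefficients + 6 head-pose values (assumed)
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, anim_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) -> (batch, time, anim_dim)
        hidden_states, _ = self.encoder(audio_feats)
        return self.head(hidden_states)

# Example: one clip, 100 audio frames of 80-dim features -> 100 animation frames.
model = AudioToAnimation()
dummy = torch.randn(1, 100, 80)
print(model(dummy).shape)  # torch.Size([1, 100, 70])
```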

The potential applications of DIRFA span various industries and domains. In the healthcare sector, the program could facilitate the development of more sophisticated and realistic virtual assistants and chatbots, enhancing user experiences. Additionally, it could serve as a valuable tool for individuals with speech or facial disabilities, allowing them to convey their thoughts and emotions through expressive avatars or digital representations, thereby improving their ability to communicate effectively.

Associate Professor Lu Shijian, the corresponding author of the study and leader of the research team from NTU’s School of Computer Science and Engineering, emphasized the profound and far-reaching impact of the study. By combining AI and machine learning techniques, the program revolutionizes multimedia communication by enabling the creation of highly realistic talking videos based solely on audio recordings and static images. The videos produced by DIRFA feature accurate lip movements, vivid facial expressions, and natural head poses.

Dr. Wu Rongliang, the first author of the study and a PhD graduate from NTU’s School of Computer Science and Engineering, noted the complexity of creating lifelike facial expressions driven by audio. Speech exhibits numerous variations, encompassing factors such as tone, amplitude, duration, and emotional state. To tackle this challenge, the team designed DIRFA to capture the intricate relationships between audio signals and facial animations. Training the AI model on a large dataset of audio and video clips enabled the program to generate diverse yet lifelike sequences of facial animations that correspond to the provided audio.
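
As a rough illustration of the speech cues mentioned above, the snippet below computes one of the simplest, per-frame amplitude (short-time energy), aligned to a 25 fps video timeline. Real systems, DIRFA included, use far richer audio representations; the feature choice, sample rate, and frame rate here are assumptions for the example.

```python
# Illustrative sketch (not the paper's method): summarize the speech happening
# within each video frame as its average signal energy (amplitude).
import numpy as np

def per_frame_energy(signal: np.ndarray, sample_rate: int, fps: int = 25) -> np.ndarray:
    """Average signal energy within each video-frame window."""
    samples_per_frame = sample_rate // fps
    n_frames = len(signal) // samples_per_frame
    frames = signal[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
    return (frames.astype(np.float64) ** 2).mean(axis=1)

# Example: 2 seconds of synthetic speech-like audio at 16 kHz -> 50 energy values.
sr = 16_000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
signal = np.sin(2 * np.pi * 220 * t) * np.abs(np.sin(2 * np.pi * 1.5 * t))
print(per_frame_energy(signal, sr).shape)  # (50,)
```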

While DIRFA already produces accurate lip movements, vivid facial expressions, and natural head poses, the researchers aim to improve the program’s interface so that users have more control over the output, for example adjusting specific expressions such as changing a frown to a smile. The NTU researchers also plan to refine DIRFA’s facial expressions by training on a wider range of datasets containing more varied facial expressions and voice audio clips.
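
The kind of control the researchers describe could, hypothetically, be exposed as an adjustment of per-frame expression coefficients. The sketch below shows one simple way such a control might work, blending predicted coefficients toward a target expression; it is not based on DIRFA's interface, and all names and dimensions are invented for the example.

```python
# Illustrative sketch of a user-facing expression control, assuming the
# generator exposes per-frame expression coefficients (a hypothetical setup).
import numpy as np

def blend_expression(predicted: np.ndarray, target: np.ndarray, strength: float) -> np.ndarray:
    """Linearly blend per-frame expression coefficients toward a target expression.

    predicted: (n_frames, n_coeffs) coefficients produced from the audio
    target:    (n_coeffs,) template for the desired expression (e.g. a smile)
    strength:  0.0 keeps the prediction unchanged, 1.0 fully imposes the target
    """
    strength = float(np.clip(strength, 0.0, 1.0))
    return (1.0 - strength) * predicted + strength * target[None, :]

# Example: nudge 50 frames of 64-dim coefficients halfway toward a "smile" template.
frames = np.zeros((50, 64))
smile = np.ones(64) * 0.3
print(blend_expression(frames, smile, 0.5)[0, :3])  # [0.15 0.15 0.15]
```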

Overall, the development of DIRFA represents a significant advancement in the field of multimedia communication. By leveraging AI technology, the program has the potential to revolutionize virtual assistant technology, improve communication for individuals with speech or facial disabilities, and open up new opportunities across various industries.

*Note:
1. Source: Coherent Market Insights, Public sources, Desk research
2. We have leveraged AI tools to mine information and compile it