arXiv:2309.07509

DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

Published on Sep 14, 2023

Abstract

Generating realistic talking faces is a complex and widely studied task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenge of applying diffusion models, which are traditionally trained on text-image pairs, directly to audio-driven control. DiffTalker consists of two agent networks: a transformer-based landmark-completion network for geometric accuracy and a diffusion-based face-generation network for texture details. Landmarks play a pivotal role in bridging the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This approach efficiently produces articulate talking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, without requiring additional alignment between audio and image features.
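
The abstract describes a two-stage architecture: a transformer completes facial landmarks from audio, and a diffusion model then generates the face conditioned on those landmarks. Below is a minimal PyTorch sketch of that pipeline. It is not the authors' released code; every module name, tensor shape, and hyperparameter (`LandmarkCompletionNet`, `audio_dim=128`, 68 landmarks, the toy denoiser) is an illustrative assumption.

```python
import torch
import torch.nn as nn


class LandmarkCompletionNet(nn.Module):
    """Transformer-based landmark completion: predicts the full set of
    facial landmarks from audio features plus known (e.g. upper-face)
    landmark coordinates, targeting geometric accuracy."""

    def __init__(self, audio_dim=128, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.lm_proj = nn.Linear(2, d_model)  # (x, y) per landmark
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 2)  # regress completed (x, y)

    def forward(self, audio_feats, partial_landmarks):
        # audio_feats: (B, T, audio_dim); partial_landmarks: (B, N, 2)
        tokens = torch.cat([self.audio_proj(audio_feats),
                            self.lm_proj(partial_landmarks)], dim=1)
        h = self.encoder(tokens)
        # Read out only the landmark token positions.
        return self.head(h[:, audio_feats.size(1):, :])


class LandmarkConditionedDenoiser(nn.Module):
    """Toy stand-in for the diffusion-based face generator: the completed
    landmarks are rasterized to a heatmap and concatenated as an extra
    input channel, so no explicit audio-image feature alignment is
    required. (Timestep conditioning is elided for brevity.)"""

    def __init__(self, img_channels=3, cond_channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + cond_channels, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, img_channels, 3, padding=1),
        )

    def forward(self, noisy_img, landmark_heatmap):
        return self.net(torch.cat([noisy_img, landmark_heatmap], dim=1))


if __name__ == "__main__":
    completer = LandmarkCompletionNet()
    audio = torch.randn(1, 20, 128)      # 20 frames of audio features
    known = torch.rand(1, 68, 2)         # known landmark coordinates
    full = completer(audio, known)       # -> (1, 68, 2)

    denoiser = LandmarkConditionedDenoiser()
    noisy = torch.randn(1, 3, 64, 64)
    heatmap = torch.zeros(1, 1, 64, 64)  # would be rasterized from `full`
    eps = denoiser(noisy, heatmap)       # predicted noise, (1, 3, 64, 64)
    print(full.shape, eps.shape)
```

The heatmap rasterization and the denoising schedule are omitted here; in the paper, the generation stage builds on a pre-trained diffusion backbone rather than a from-scratch denoiser like this one.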
