Talking face generation creates realistic video of a target person whose facial movements match the given speech. Because it provides the audience with visuals as well as audio, it holds a lot of promise in applications such as virtual avatars, online conferencing, and animated movies. The most widely used approaches to audio-driven talking face generation adopt a two-stage framework: first, an intermediate representation is predicted from the audio input; then, a renderer is used to synthesize the video frames from the predicted representation (e.g., 2D landmarks, blendshape parameters of 3D face models). Along this path, significant progress has been made toward improving the overall realism of the generated video, for instance through more natural head movements, better lip-syncing, and emotional expressiveness.
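To make the two-stage layout concrete, here is a minimal PyTorch sketch that wires an audio encoder, an expression predictor, and a toy renderer together. The module names, shapes, and layer choices are illustrative assumptions, not the architecture used by any particular paper.

```python
# A minimal sketch of the generic two-stage pipeline: audio -> intermediate
# representation -> rendered frames. All module names and sizes are assumptions.
import torch
import torch.nn as nn

class TwoStageTalkingFace(nn.Module):
    def __init__(self, audio_dim=80, feat_dim=256, expr_dim=64, img_size=64):
        super().__init__()
        # Stage 1: audio -> intermediate representation (e.g., expression coefficients)
        self.audio_encoder = nn.GRU(audio_dim, feat_dim, batch_first=True)
        self.to_expression = nn.Linear(feat_dim, expr_dim)
        # Stage 2: intermediate representation -> video frames (toy decoder here)
        self.renderer = nn.Sequential(
            nn.Linear(expr_dim, 3 * img_size * img_size),
            nn.Sigmoid(),
        )
        self.img_size = img_size

    def forward(self, audio):
        # audio: (batch, time, audio_dim), e.g., mel-spectrogram frames
        feats, _ = self.audio_encoder(audio)   # (batch, time, feat_dim)
        expr = self.to_expression(feats)       # (batch, time, expr_dim)
        frames = self.renderer(expr)           # (batch, time, 3*H*W)
        b, t, _ = frames.shape
        return frames.view(b, t, 3, self.img_size, self.img_size)

model = TwoStageTalkingFace()
mel = torch.randn(2, 100, 80)   # 2 clips, 100 audio frames each
video = model(mel)              # (2, 100, 3, 64, 64)
```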
However, it should be noted that talking face generation is in essence a one-to-many mapping problem, whereas the aforementioned methods learn a deterministic mapping from the given audio to video. That is, for a single input audio clip there are many plausible visual appearances of the target person, owing to variations in phonetic context, mood, lighting conditions, and other factors. Learning a deterministic mapping therefore introduces ambiguity during training and makes it difficult to produce realistic visual results. The two-stage framework helps alleviate the one-to-many mapping problem by dividing it into two sub-problems (i.e., the audio-to-expression problem and the neural-rendering problem). Although effective, each stage is still required to predict information that is missing from its input, which keeps prediction difficult. For example, the audio-to-expression model learns to produce an expression that is semantically consistent with the input audio, but it ignores higher-level semantics such as habits and attitudes. Similarly, the neural rendering model synthesizes the visual appearance from the predicted expression and thus lacks pixel-level details such as wrinkles and shadows. This study proposes MemFace, which equips the two stages with an implicit memory and an explicit memory, respectively, following the sense of each stage, so that the missing information is complemented with memories and the one-to-many mapping problem is further alleviated.
More precisely, the non-parametric explicit memory is constructed and customized for each target person to complement the visual details, while the implicit memory is jointly optimized with the audio-to-expression model to complement the semantically aligned information. Specifically, instead of predicting the expression directly from the input audio, the audio-to-expression model uses the extracted audio feature as a query to attend to the implicit memory. The attention result, which serves as the semantically aligned information, is then combined with the audio feature to produce the expression output. Because the implicit memory is optimized jointly with the model, it is encouraged to associate high-level semantics in the shared space of audio and expression, which reduces the semantic gap between the input audio and the output expression.
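The attention mechanism described above can be sketched roughly as follows. The number of memory slots, the projection layers, and the concatenation-based fusion are assumptions for illustration, not the exact MemFace design.

```python
# Sketch of an implicit memory: learnable slots attended to with the audio
# feature as query; the retrieved result is fused with the audio feature
# before predicting the expression. Sizes and fusion choice are assumptions.
import torch
import torch.nn as nn

class ImplicitMemoryAttention(nn.Module):
    def __init__(self, feat_dim=256, num_slots=1000, expr_dim=64):
        super().__init__()
        # Learnable memory slots, jointly optimized with the audio-to-expression model.
        self.memory = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.02)
        self.query_proj = nn.Linear(feat_dim, feat_dim)
        self.key_proj = nn.Linear(feat_dim, feat_dim)
        self.value_proj = nn.Linear(feat_dim, feat_dim)
        self.to_expression = nn.Linear(feat_dim * 2, expr_dim)

    def forward(self, audio_feat):
        # audio_feat: (batch, time, feat_dim) extracted by the audio encoder
        q = self.query_proj(audio_feat)                               # (B, T, D)
        k = self.key_proj(self.memory)                                # (S, D)
        v = self.value_proj(self.memory)                              # (S, D)
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # (B, T, S)
        retrieved = attn @ v                                          # (B, T, D)
        # Fuse the retrieved, semantically aligned information with the audio feature.
        fused = torch.cat([audio_feat, retrieved], dim=-1)            # (B, T, 2D)
        return self.to_expression(fused)                              # (B, T, expr_dim)

mem_attn = ImplicitMemoryAttention()
expr = mem_attn(torch.randn(2, 100, 256))  # (2, 100, 64)
```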
After the expression is obtained, the neural rendering model synthesizes the visual appearance based on the mouth shapes derived from the estimated expression. To complement the pixel-level information, an explicit memory is first constructed for each target person, using the vertices of the 3D face model as keys and their corresponding image patches as values. For each inferred expression, the corresponding vertices are used as a query to retrieve similar keys in the explicit memory, and the associated image patches are returned as pixel-level information to the neural rendering model.
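A rough sketch of this retrieval step is shown below. The flattened-vertex keys, cosine similarity, and top-k lookup are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch of non-parametric explicit-memory retrieval: keys are stored vertex
# coordinates of the target person, values are the associated image patches,
# and the vertices of an inferred expression query the memory for patches.
import torch

def build_explicit_memory(vertex_sets, image_patches):
    # vertex_sets: (N, V, 3) mouth-region vertices for N stored frames of the person
    # image_patches: (N, 3, P, P) image patches corresponding to those frames
    keys = vertex_sets.reshape(vertex_sets.shape[0], -1)  # (N, V*3)
    return keys, image_patches

def retrieve_patches(keys, values, query_vertices, top_k=1):
    # query_vertices: (V, 3) vertices derived from the predicted expression
    query = query_vertices.reshape(1, -1)                      # (1, V*3)
    # Cosine similarity between the query and every stored key.
    sim = torch.nn.functional.cosine_similarity(query, keys)   # (N,)
    idx = sim.topk(top_k).indices                               # (top_k,)
    # The retrieved patches are passed to the renderer as pixel-level guidance.
    return values[idx]                                          # (top_k, 3, P, P)

keys, values = build_explicit_memory(torch.randn(500, 40, 3),
                                     torch.randn(500, 3, 32, 32))
patches = retrieve_patches(keys, values, torch.randn(40, 3), top_k=4)  # (4, 3, 32, 32)
```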
Intuitively, the explicit memory eases generation by allowing the model to selectively retrieve the information required for an expression instead of generating it. Extensive experiments on several commonly used datasets (such as Obama and HDTF) show that the proposed MemFace delivers state-of-the-art lip-sync and rendering quality, consistently and significantly outperforming all baseline methods in various settings. For example, MemFace improves the subjective score on the Obama dataset by 37.52% over the baseline. Sample results can be found on the project website.
Check out the paper and GitHub. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more.
Anish Teeku is a Consultant Trainee at MarktechPost. He is currently pursuing his undergraduate studies in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is in image processing, and he is passionate about building solutions around it. He enjoys connecting with people and collaborating on interesting projects.