Portrait Video Editing Empowered by
Multimodal Generative Priors

Traditional portrait video editing methods often struggle with 3D consistency and temporal coherence, and also fall short in rendering quality and efficiency. To address these issues, PortraitGen lifts each frame of a portrait video into a unified dynamic 3D Gaussian field, ensuring structural and temporal consistency from frame to frame. PortraitGen is a powerful portrait video editing method that enables consistent and expressive stylisation from multimodal cues.
In addition, PortraitGen devises a novel neural Gaussian texture mechanism that not only supports complex stylistic edits but also achieves rendering speeds in excess of 100 frames per second. PortraitGen handles a wide range of inputs, empowered by knowledge distilled from large-scale 2D generative models. It also introduces expression similarity guidance and a face-aware portrait editing module, effectively mitigating the degradation that can occur when iteratively updating the dataset. (Links at the bottom of the article.)

01 Overview

PortraitGen lifts a 2D portrait video into a 4D Gaussian field for multimodal portrait editing in just 30 minutes, and the edited 3D portrait can be rendered at 100 frames per second. The SMPL-X coefficients of the monocular video are tracked first, and a 3D Gaussian feature field is then built with the neural Gaussian texture mechanism.
These neural Gaussian features are further processed to render the portrait image. PortraitGen also employs an iterative dataset update strategy for portrait editing and proposes a face-aware editing module to enhance expression quality and preserve the personalised facial structure.
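To make the flow concrete, here is a minimal sketch of the pipeline described above. All helper names (track_smplx, NeuralGaussianTexture, NeuralRenderer2D, rasterize_features) are hypothetical placeholders standing in for the corresponding components, not the authors' implementation.

```python
import torch

def portraitgen_pipeline(video_frames, edit_fn, num_rounds=10):
    # 1. Track per-frame SMPL-X coefficients from the monocular video (hypothetical tracker).
    smplx_params = [track_smplx(frame) for frame in video_frames]

    # 2. Build a 3D Gaussian feature field rigged to the SMPL-X model,
    #    with learnable per-Gaussian features (the neural Gaussian texture).
    gaussians = NeuralGaussianTexture(smplx_params)
    renderer = NeuralRenderer2D()  # small CNN mapping a 2D feature map to RGB

    optimizer = torch.optim.Adam(
        list(gaussians.parameters()) + list(renderer.parameters()), lr=1e-3
    )

    # 3. Iterative dataset update: render each frame, edit it with a 2D model,
    #    and fit the Gaussian field to the edited targets.
    for _ in range(num_rounds):
        for i, frame in enumerate(video_frames):
            feature_map = rasterize_features(gaussians, smplx_params[i])
            rendered = renderer(feature_map)
            # edit_fn is any 2D editor (InstructPix2Pix, style transfer, IC-Light, ...)
            target = edit_fn(rendered, frame).detach()
            loss = torch.nn.functional.l1_loss(rendered, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return gaussians, renderer
```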

02 Practical Uses

The PortraitGen solution is a unified framework for portrait video editing: any structure-preserving image editing model can be plugged in to produce 3D-consistent and temporally coherent portrait videos.
Text-driven editing: InstructPix2Pix is used as the 2D editing model. Its UNet takes three inputs: an input RGB image, a text instruction, and a noise latent. Some noise is added to the rendered image, which is then edited conditioned on the input source image and the instruction.
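For reference, this is one way to call the public InstructPix2Pix checkpoint through the diffusers library; PortraitGen integrates such a call inside its iterative dataset-update loop rather than editing frames standalone as shown here. The prompt and file names are illustrative.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Load the public InstructPix2Pix checkpoint.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("rendered_frame.png").convert("RGB")  # image rendered from the Gaussian field
edited = pipe(
    "turn the person into a bronze statue",  # text instruction (example)
    image=frame,                             # conditioning RGB image
    num_inference_steps=20,
    image_guidance_scale=1.5,                # how closely to follow the input frame
    guidance_scale=7.5,                      # how strongly to follow the instruction
).images[0]
edited.save("edited_frame.png")
```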

Image-driven editing: two types of image-prompted editing are considered. One extracts the global style of a reference image; the other customises the image by placing objects at specific locations. Experimentally, these are used for style transfer and virtual try-on: the style of the reference image is transferred to the dataset frames with a neural style transfer algorithm, and the subject's clothes are changed using AnyDoor.
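The style-transfer side can be approximated with the classic Gram-matrix style loss (Gatys et al.). The snippet below is an illustrative sketch using torchvision's VGG-19 features, not the paper's exact pipeline; input normalisation is omitted, and the AnyDoor-based object insertion is not covered here.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Frozen VGG-19 feature extractor for the style loss.
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.eval().requires_grad_(False)
STYLE_LAYERS = [1, 6, 11, 20, 29]  # relu1_1 .. relu5_1 in torchvision's VGG-19

def gram(feat):
    # Gram matrix of a (B, C, H, W) feature map, normalised by its size.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(frame, style_image):
    # Accumulate Gram-matrix differences between the frame and the style reference.
    loss, x, y = 0.0, frame, style_image
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in STYLE_LAYERS:
            loss = loss + F.mse_loss(gram(x), gram(y))
        if i == max(STYLE_LAYERS):
            break
    return loss
```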

Relighting: IC-Light is used to manipulate the lighting of video frames. Given a text description as the lighting condition, PortraitGen harmoniously adjusts the lighting of the portrait video.
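Conceptually, relighting plugs into the same per-frame edit step as the other editors. The snippet below only shows that wiring; relight_model is a hypothetical callable standing in for a relighting model such as IC-Light, not its real API.

```python
def relight_edit_fn(rendered_frame, source_frame,
                    lighting_prompt="warm sunset light from the left"):
    # The relighting model takes the rendered frame plus a text lighting condition
    # and returns a relit frame used as the new fitting target.
    return relight_model(image=rendered_frame, prompt=lighting_prompt)
```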

03 Contrast and Ablation Experiments

The PortraitGen method is compared with state-of-the-art video editing methods including TokenFlow, Rerender-A-Video, CoDeF, and AnyV2V. PortraitGen significantly outperforms the other methods in prompt preservation, identity preservation, and temporal consistency.
Inspired by the neural texture proposed in "Deferred Neural Rendering", PortraitGen proposes a neural Gaussian texture. This approach stores learnable features for each Gaussian instead of storing spherical harmonic coefficients. A 2D neural renderer then converts the rasterised feature maps into RGB signals. This representation carries richer information than spherical harmonic coefficients and allows better fusion of the processed features, making it easier to edit complex styles such as Lego and pixel art.
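A minimal sketch of this idea is shown below, assuming a fixed feature dimension and a small convolutional decoder; the names, sizes, and layer choices are illustrative, not the paper's exact architecture, and the Gaussian rasteriser that splats the features to 2D is omitted.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 32  # illustrative; each Gaussian stores a learnable feature vector

class GaussianFeatures(nn.Module):
    """Per-Gaussian learnable features replacing spherical harmonic coefficients."""
    def __init__(self, num_gaussians, dim=FEATURE_DIM):
        super().__init__()
        self.features = nn.Parameter(torch.randn(num_gaussians, dim) * 0.01)

class NeuralRenderer2D(nn.Module):
    """Small CNN that converts a rasterised feature map into an RGB image."""
    def __init__(self, dim=FEATURE_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, feature_map):   # (B, FEATURE_DIM, H, W)
        return self.net(feature_map)  # (B, 3, H, W) RGB
```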

When editing an upper-body image, if the face occupies only a small area, the edited result may not adapt well to the head pose and facial structure. Face-aware portrait editing (FA) enhances the result by performing two edits, one on the full frame and one focused on the face region, to increase attention on the facial structure.
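A rough sketch of that idea, assuming BCHW image tensors, a hypothetical face bounding box from any face detector, and a generic 2D editor; the cropping and paste-back here are assumptions for illustration, not the paper's exact procedure.

```python
def face_aware_edit(frame, edit_fn, face_box):
    """Edit the full frame and, separately, a face crop, then paste the face edit back.

    frame:     (B, 3, H, W) image tensor
    edit_fn:   2D editing model (placeholder)
    face_box:  (x0, y0, x1, y1) from any face detector (placeholder)
    """
    full_edit = edit_fn(frame)                     # global edit pass
    x0, y0, x1, y1 = face_box
    face_crop = frame[:, :, y0:y1, x0:x1]
    face_edit = edit_fn(face_crop)                 # second pass focused on the face
    # Replace the face region of the global edit with the dedicated face edit.
    full_edit[:, :, y0:y1, x0:x1] = face_edit
    return full_edit
```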

By mapping the rendered image and the input source image into EMOCA's latent expression space and optimising the similarity of their expressions, PortraitGen ensures that expressions remain natural and consistent with the original video frames.
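This guidance can be written as a simple feature-space loss. In the sketch below, emoca_expression stands in for EMOCA's expression encoder and is a hypothetical callable, not a real API.

```python
import torch.nn.functional as F

def expression_similarity_loss(rendered, source, emoca_expression):
    """Penalise expression drift between the rendered frame and the source frame.

    emoca_expression: maps an image to EMOCA's latent expression code (placeholder).
    """
    expr_rendered = emoca_expression(rendered)
    expr_source = emoca_expression(source).detach()  # source expressions are fixed targets
    return F.mse_loss(expr_rendered, expr_source)
```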

Tech behind PortraitGen

References

You can find more about PortraitGen here: https://ustc3dv.github.io/PortraitGen/

https://arxiv.org/pdf/2409.13591

Code is available on GitHub.
