VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
Abstract
Video inpainting, crucial for the media industry, aims to restore corrupted content. However, current methods, relying on either limited pixel propagation or single-branch image inpainting architectures, face challenges in generating fully masked objects, balancing background preservation with foreground generation, and maintaining ID consistency over long videos. To address these issues, we propose VideoPainter, an efficient dual-branch framework featuring a lightweight context encoder. This plug-and-play encoder processes masked videos and injects background guidance into any pre-trained video diffusion transformer, generalizing across arbitrary mask types, enhancing background integration and foreground generation, and enabling user-customized control. We further introduce a strategy to resample inpainting regions, maintaining ID consistency in any-length video inpainting. Additionally, we develop a scalable dataset pipeline using advanced vision models and construct VPData and VPBench—the largest video inpainting dataset and benchmark with segmentation masks and dense captions (over 390K clips)—to support large-scale training and evaluation. We also show VideoPainter’s promising potential in downstream applications such as video editing. Extensive experiments demonstrate VideoPainter’s state-of-the-art performance in any-length video inpainting and editing across key metrics, including video quality, masked region preservation, and textual coherence.
1 Introduction
Video inpainting (Quan et al., 2024), which aims to restore the corrupted video while maintaining coherence, facilitates numerous applications, including try-on (Fang et al., 2024), film production (Polyak et al., 2024), and video editing (Sun et al., 2024). Recently, Diffusion Transformers (DiT) (Peebles & Xie, 2023; OpenAI, 2024) have shown promise in video generation, leading to the exploration of generative video inpainting (Zhang et al., 2024b; Zi et al., 2024).
Existing approaches, as illustrated in Fig. 2, can be broadly categorized into two types: (1) Non-generative methods (Zhou et al., 2023; Li et al., 2022; Lee et al., 2019) depend on limited pixel feature propagation (physical constraints or model architectural priors); they only take masked videos as inputs and cannot generate fully segmentation-masked objects. (2) Generative methods (Zhang et al., 2024b; Zi et al., 2024; Wang et al., 2024) extend single-branch image inpainting architectures (Rombach & Esser, 2022) to video by incorporating temporal attention; they struggle to balance background preservation with foreground generation in one model and achieve inferior temporal coherence compared to native video DiTs. Moreover, both paradigms neglect long video inpainting and struggle to maintain consistent object ID in long videos.
This motivates us to decompose video inpainting into background preservation and foreground generation and to adopt a dual-branch architecture in DiTs, where a dedicated context encoder extracts masked video features while the pre-trained DiT generates semantically coherent video content conditioned on both the preserved background and text prompts. Similar observations have been made in image inpainting research, notably in BrushNet (Ju et al., 2024) and ControlNet (Zhang et al., 2023). However, directly applying their architectures to video DiTs presents several challenges: (1) Given the video DiT’s robust generative foundation and heavy model size, replicating the full or half of the giant video DiT backbone as the context encoder would be unnecessary and computationally prohibitive. (2) Unlike BrushNet’s purely convolutional control branch, DiT tokens in masked regions inherently contain background information due to global attention, complicating the distinction between masked and unmasked regions in DiT backbones. (3) ControlNet lacks dense feature injection across all backbone layers, hindering the dense background control required for inpainting tasks.
To address these challenges, we introduce VideoPainter, which enhances a pre-trained DiT with a lightweight context encoder comprising only 6% of the backbone parameters, forming the first efficient dual-branch video inpainting architecture. VideoPainter features three main components: (1) A streamlined context encoder with just two layers, which integrates context features into the pre-trained DiT in a group-wise manner, ensuring efficient and dense background guidance. (2) Mask-selective feature integration to clearly distinguish the tokens of the masked and unmasked regions. (3) A novel inpainting region ID resampling technique to efficiently process videos of any length while maintaining ID coherence. By freezing the pre-trained context encoder and DiT backbone and adding an ID-Adapter, we enhance the backbone’s attention sampling by concatenating the original key-value vectors with the inpainting region tokens. During inference, inpainting region tokens from previous clips are appended to the current KV vectors, ensuring long-term preservation of target IDs. Notably, VideoPainter supports plug-and-play and user-customized control.
For large-scale training, we develop a scalable dataset pipeline using advanced vision models (OpenAI, 2024; Ravi et al., 2024; Zhang et al., 2024a), constructing the largest video inpainting dataset, VPData, and benchmark, VPBench, with over 390K clips featuring precise segmentation masks and dense text captions. We further demonstrate VideoPainter’s potential by establishing an inpainting-based video editing pipeline that delivers promising results.
To validate our approach, we compare VideoPainter against previous state-of-the-art (SOTA) baselines and a single-branch fine-tuning setup that concatenates the noisy latent, masked video latent, and mask along the input channels. VideoPainter demonstrates superior performance in both training efficiency and final results.
In summary, our contributions are as follows:
•
We propose VideoPainter, the first dual-branch video inpainting diffusion transformer framework that supports plug-and-play and user-customized control.
•
We design a lightweight context encoder for efficient and dense background control, and an inpainting region ID resampling technique that boosts target-object ID consistency in any-length video inpainting and editing.
•
We introduce VPData, the largest video inpainting dataset, comprising over 390K clips (866.7 hours), and VPBench, both featuring precise segmentation masks and detailed video captions.
•
Experiments show VideoPainter achieves state-of-the-art performance across key metrics, including video quality, masked region preservation, and text alignment, in video inpainting and video editing.
2 Related Work
2.1 Video Inpainting
Video inpainting approaches can be broadly classified into two categories based on whether they are generative:
Non-generative methods.
These methods (Zhou et al., 2023; Li et al., 2022; Hu et al., 2020; Zhang et al., 2022c, b) leverage architecture priors to facilitate pixel propagation. This includes utilizing the local perception of 3D CNNs (Chang et al., 2019a; Wang et al., 2019; Hu et al., 2020; Chang et al., 2019b) and exploiting the global perception of attention to retrieve and aggregate tokens with similar texture for filling masked videos (Lee et al., 2019; Zeng et al., 2020; Liu et al., 2021; Zhang et al., 2022c). They also introduce various physical quantities, especially optical flow, as auxiliary conditions, since completing the less complex flow field simplifies RGB pixel inpainting (Kim et al., 2019; Li et al., 2020; Zou et al., 2021; Xu et al., 2019; Gao et al., 2020; Zhang et al., 2022a, b). However, these methods are only effective for partial object occlusions with random masks and face significant limitations when inpainting fully masked regions due to insufficient context.
Generative methods.
Recent advances in generative foundation models (Rombach et al., 2022; Guo et al., 2023) have sparked numerous approaches that leverage additional modules or training strategies to extend backbones’ capabilities for video inpainting (Zhang et al., 2024b; Zi et al., 2024; Wang et al., 2024). AVID (Zhang et al., 2024b) and COCOCO (Zi et al., 2024) are the most closely related recent works. Both adopt a similar implementation, augmenting Stable Diffusion Inpainting (Rombach & Esser, 2022) with trainable temporal attention layers. This architecture performs per-frame region filling with the image inpainting backbone and temporal smoothing with temporal attention. Despite showing promising results for both random and segmentation masks thanks to their generative abilities, they struggle to balance background preservation and caption-guided foreground generation (Ju et al., 2024; Li et al., 2024) within a single backbone. AVID also explores any-length video inpainting by smoothing latents at segment boundaries and using the middle frame as the ID reference. In contrast, VideoPainter is a dual-branch framework that decouples video inpainting into foreground generation and background-guided preservation. It employs an efficient context encoder to guide any pre-trained DiT, facilitating plug-and-play control. Furthermore, VideoPainter introduces a novel inpainting region ID resampling technique that enables ID consistency in any-length video inpainting.
Dataset | #Clips | Duration | Video Caption | Masked Region Desc. |
---|---|---|---|---|
DAVIS (Perazzi et al., 2016) | 0.4K | 0.1h | ✗ | ✗ |
YouTube-VOS (Xu et al., 2018) | 4.5K | 5.6h | ✗ | ✗ |
VOST (Tokmakov et al., 2023) | 1.5K | 4.2h | ✗ | ✗ |
MOSE (Ding et al., 2023) | 5.2K | 7.4h | ✗ | ✗ |
LVOS (Hong et al., 2023) | 1.0K | 18.9h | ✗ | ✗ |
SA-V (Ravi et al., 2024) | 642.6K | 196.0h | ✗ | ✗ |
Ours | 390.3K | 866.7h | ✓ | ✓ |
2.2 Video Inpainting Datasets
Recent advances in segmentation (Ravi et al., 2024) have produced many video segmentation datasets (Perazzi et al., 2016; Xu et al., 2018; Hong et al., 2023; Ding et al., 2023; Tokmakov et al., 2023; Darkhalil et al., 2022). Among these, DAVIS (Perazzi et al., 2016) and YouTube-VOS (Xu et al., 2018) have become prominent benchmarks for video inpainting due to their high-quality masks and diverse object categories. However, existing datasets face two primary limitations: (1) insufficient scale to meet the data requirements of generative models, and (2) the absence of crucial control conditions, such as video captions, that are necessary for generating masked objects. In contrast, as shown in Tab. 1, we developed a scalable dataset pipeline based on state-of-the-art vision understanding models (Zhang et al., 2024a; Ravi et al., 2024; OpenAI, 2024) and constructed the largest video inpainting dataset to date, with over 390K clips, each annotated with precise segmentation mask sequences and high-quality dense video captions.
3 Method
Sec. 3.1 and Fig. 3 illustrate our pipeline for building VPData and VPBench. Sec. 3.2 and Fig. 4 show our dual-branch VideoPainter. Sec. 3.3 and Sec. 3.4 introduce our inpainting region ID resampling approach for any-length video inpainting and plug-and-play control.
3.1 VPData and VPBench Construction Pipeline
To address the challenges of limited size and lack of text annotations, we present a scalable dataset pipeline leveraging advanced vision models (Ravi et al., 2024; OpenAI, 2024; Zhang et al., 2024a). This leads to VPData and VPBench, the largest video inpainting dataset and benchmark with precise masks and video/masked-region captions. As shown in Fig. 3, the pipeline involves five steps: collection, annotation, splitting, selection, and captioning.
Collection. We chose Videvo and Pexels (Videvo: https://www.videvo.net/, Pexels: https://www.pexels.com/) as our data sources, from which we collected publicly available internet videos.
Annotation. For each collected video, we implement a cascaded workflow for automated annotation (a code sketch follows this list):
➠ We employ the Recognize Anything Model (Zhang et al., 2024a) for open-set video tagging to identify the primary objects in the given videos.
➠ Based on the detected primary object tags, we utilize Grounding DINO (Liu et al., 2023) to detect the corresponding bounding boxes for objects at fixed intervals.
➠ These bounding boxes serve as prompts for SAM2 (Ravi et al., 2024), which generates precise mask segmentations.
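To make this cascade concrete, the following is a minimal sketch; `tag_video`, `detect_boxes`, and `segment_with_boxes` are hypothetical wrappers around the Recognize Anything Model, Grounding DINO, and SAM2 (the real model APIs differ), and the interval value is illustrative.

```python
# Hypothetical sketch of the RAM -> Grounding DINO -> SAM2 annotation cascade.
from typing import Any, Callable, Dict, List


def annotate_video(
    frames: List[Any],
    tag_video: Callable[[List[Any]], List[str]],            # RAM wrapper (assumed)
    detect_boxes: Callable[[Any, List[str]], List[dict]],   # Grounding DINO wrapper (assumed)
    segment_with_boxes: Callable[[List[Any], Dict[int, List[dict]]], List[Any]],  # SAM2 wrapper (assumed)
    box_interval: int = 30,  # detect boxes every N frames; value is illustrative
) -> Dict[str, Any]:
    # 1) Open-set tagging to identify the primary objects in the clip.
    tags = tag_video(frames)
    # 2) Grounded detection of those objects at fixed frame intervals.
    boxes = {i: detect_boxes(frames[i], tags) for i in range(0, len(frames), box_interval)}
    # 3) The boxes prompt the video segmenter, which produces precise mask sequences.
    masks = segment_with_boxes(frames, boxes)
    return {"tags": tags, "boxes": boxes, "masks": masks}
```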
Splitting. Scene transitions may occur while tracking the same object from different angles, causing disruptive view changes. We utilize PySceneDetect (Castellano, 2024) to identify scene transitions and partition the mask sequences accordingly. We then segment the sequences into 10-second intervals and discard overly short clips.
Selection. We employ three key criteria: (1) Aesthetic Quality, evaluated with the LAION-Aesthetic Score Predictor (Schuhmann et al., 2022); (2) Motion Strength, estimated from optical flow computed with RAFT (Teed & Deng, 2020); and (3) Content Safety, assessed via the Stable Diffusion Safety Checker (Rombach et al., 2022).
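As an illustration, a clip-level filter might combine these scores as follows; the threshold values are placeholders, not those used to build VPData.

```python
# Illustrative clip-selection filter; thresholds are placeholders.
from dataclasses import dataclass


@dataclass
class ClipScores:
    aesthetic: float   # LAION-Aesthetic Score Predictor output
    motion: float      # mean optical-flow magnitude from RAFT
    is_safe: bool      # Stable Diffusion Safety Checker verdict


def keep_clip(s: ClipScores, min_aesthetic: float = 4.5, min_motion: float = 0.5) -> bool:
    # A clip is kept only if it passes all three criteria.
    return s.is_safe and s.aesthetic >= min_aesthetic and s.motion >= min_motion
```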
Captioning. As Tab. 1 shows, existing video segmentation datasets lack textual annotations, which are primary conditions for generation (Chen et al., 2023; Betker et al., 2023), creating a data bottleneck for applying generative models to video inpainting. We therefore leverage SOTA vision-language models, specifically CogVLM2 (Wang et al., 2023) and GPT-4o (OpenAI, 2024), to uniformly sample keyframes and generate dense video captions as well as detailed descriptions of the masked objects.
3.2 Dual-branch Inpainting Control
We incorporate masked video features into the pre-trained diffusion transformer (DiT) via an efficient context encoder, decoupling background context extraction from foreground generation. This encoder processes a concatenated input of the noisy latent, the masked video latent, and downsampled masks. Specifically, the noisy latent provides information about the current generation; the masked video latent, extracted via the VAE, aligns with the pre-trained DiT’s latent distribution; and masks are down-sampled with cubic interpolation to ensure dimensional compatibility.
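A rough sketch of assembling this concatenated input is shown below; `vae_encode` is a hypothetical stand-in for the backbone's video VAE, and the channel layout is our assumption based on the description above.

```python
# Rough sketch of building the context encoder input.
import torch
import torch.nn.functional as F


def build_context_input(noisy_latent, video, mask, vae_encode):
    """noisy_latent: [B, C, T', h, w]; video: [B, 3, T, H, W]; mask: [B, 1, T, H, W]."""
    masked_video = video * (1.0 - mask)        # zero out the region to be inpainted
    masked_latent = vae_encode(masked_video)   # align with the DiT latent distribution
    # The paper down-samples masks with cubic interpolation; for a 5D tensor,
    # trilinear interpolation is used here as a stand-in.
    small_mask = F.interpolate(mask, size=noisy_latent.shape[-3:], mode="trilinear")
    return torch.cat([noisy_latent, masked_latent, small_mask], dim=1)
```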
Based on DiT’s inherent generative abilities (OpenAI, 2024), the control branch only needs to extract contextual cues that guide the backbone in preserving the background and generating the foreground. Therefore, instead of previous heavy approaches that duplicate half or all of the backbone (Ju et al., 2024; Zhang et al., 2023), VideoPainter employs a lightweight design, cloning only the first two layers of the pre-trained DiT and accounting for merely 6% of the backbone parameters. The pre-trained DiT weights provide a robust prior for extracting masked video features. The context encoder features are integrated into the frozen DiT in a group-wise, token-selective manner. The group-wise integration works as follows: the first layer’s features are added back to the initial half of the backbone, while the second layer’s features are integrated into the latter half, achieving lightweight and efficient context control. The token-selective mechanism is a pre-filtering step in which only tokens representing pure background are added back, while the rest are excluded from integration, as shown in the upper right of Fig. 4. This ensures that only background context is fused into the backbone, preventing potential ambiguity during generation.
The feature integration is given in Eq. 1. Here, $f_b^i$ denotes the feature of the $i$-th layer of the DiT backbone with $1 \le i \le l$, where $l$ is the number of backbone layers; the same notation applies to the context encoder features $f_c^j$ ($j \in \{1, 2\}$), which take the concatenation of the noisy latent $z_t$, the masked video latent $z_m$, and the downsampled mask $m$ as input. $[\cdot\,,\cdot]$ denotes the concatenation operation and $\mathcal{Z}$ the zero linear operation:

$$
f_b^i \;\leftarrow\; f_b^i + \mathcal{Z}\!\left(f_c^{\,j}\!\left([z_t, z_m, m]\right)\right) \odot (1 - m),
\qquad
j = \begin{cases} 1, & 1 \le i \le \lfloor l/2 \rfloor,\\ 2, & \lfloor l/2 \rfloor < i \le l, \end{cases}
\tag{1}
$$

where multiplying by $(1 - m)$ implements the token-selective filtering so that only background tokens are added back.
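As intuition for Eq. 1, a minimal PyTorch sketch of this group-wise, token-selective injection is given below; module names, tensor shapes, and the zero-initialized linear layers are our assumptions based on the description above, not the official implementation.

```python
# Minimal sketch of group-wise, token-selective feature injection (Eq. 1).
import torch
import torch.nn as nn


class ContextInjector(nn.Module):
    def __init__(self, dim: int, num_backbone_layers: int):
        super().__init__()
        self.l = num_backbone_layers
        # One zero-initialized linear projection per context-encoder layer,
        # so the context branch has no influence at the start of training.
        self.zero_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        for proj in self.zero_proj:
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def inject(self, backbone_feat: torch.Tensor, ctx_feats: list,
               bg_token_mask: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # Group-wise: the first context layer feeds the first half of the
        # backbone, the second context layer feeds the latter half.
        group = 0 if layer_idx < self.l // 2 else 1
        ctx = self.zero_proj[group](ctx_feats[group])
        # Token-selective: only pure-background tokens are added back.
        return backbone_feat + ctx * bg_token_mask.unsqueeze(-1)


# Usage sketch: bg_token_mask is 1 for background tokens, 0 otherwise.
inj = ContextInjector(dim=64, num_backbone_layers=8)
x = torch.randn(1, 16, 64)                                  # [batch, tokens, dim]
ctx = [torch.randn(1, 16, 64), torch.randn(1, 16, 64)]      # two context-encoder layers
mask = (torch.rand(1, 16) > 0.3).float()
y = inj.inject(x, ctx, mask, layer_idx=5)
```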
3.3 Target Region ID Resampling
While current video DiTs have shown great promise in handling temporal dynamics (Kuaishou, 2024; Bian et al., 2024), they often struggle to maintain smooth transitions and long-term identity consistency.
Smooth Transition. Following AVID (Zhang et al., 2024b), we employ overlapping generation and weighted averaging to maintain consistent transitions. In addition, we use the last frame of the previously generated clip (before the overlap) as the first frame of the current clip’s overlapping region to ensure continuity of visual appearance.
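A simple sketch of the weighted-average blend over the overlapping frames is shown below; the linear weight ramp is an illustrative choice rather than the exact schedule used in the paper.

```python
# Sketch of blending the overlap between two consecutive clips.
import torch


def blend_overlap(prev_tail: torch.Tensor, curr_head: torch.Tensor) -> torch.Tensor:
    """prev_tail, curr_head: [overlap, C, H, W] latents or frames of the overlap."""
    n = prev_tail.shape[0]
    # Weights ramp from favoring the previous clip to favoring the current one.
    w = torch.linspace(0.0, 1.0, n).view(n, 1, 1, 1)
    return (1 - w) * prev_tail + w * curr_head
```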
Identity Consistency. To maintain identity consistency in long videos, we introduce an inpainting region ID resampling method, as shown in the lower part of Fig. 4. During training, we freeze both the DiT and the context encoder and add trainable ID-Resample Adapters (LoRA layers) into the frozen DiT, enabling ID resampling. Specifically, tokens from the current masked region, which contain the desired ID, are concatenated with the KV vectors, enhancing ID preservation in the inpainting region through additional KV resampling: given the current query, key, and value vectors $Q$, $K$, and $V$, we filter the masked-region tokens in the current $K$ and $V$ and concatenate them to $K$ and $V$, forcing the model to resample the tokens that carry the needed ID. During inference, we prioritize ID consistency with the inpainting region tokens from the previous clip, as it is the most temporally proximate generated result. We therefore concatenate the masked-region tokens from the previous clip with the current key-value vectors, effectively resampling and maintaining identity information during long video processing.
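Conceptually, the resampling extends the attention's key-value set with masked-region tokens, roughly as sketched below; the shapes and the use of PyTorch's scaled_dot_product_attention are our assumptions.

```python
# Sketch of inpainting-region ID resampling: masked-region KV tokens
# (from the current clip at training time, or the previous clip at inference)
# are appended to the attention keys and values.
import torch
import torch.nn.functional as F


def id_resample_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          region_k: torch.Tensor, region_v: torch.Tensor) -> torch.Tensor:
    """q: [B, H, Nq, D]; k, v: [B, H, Nk, D];
    region_k, region_v: [B, H, Nr, D] tokens filtered by the inpainting mask."""
    k_cat = torch.cat([k, region_k], dim=2)   # extend keys with ID tokens
    v_cat = torch.cat([v, region_v], dim=2)   # extend values with ID tokens
    # Standard scaled dot-product attention over the extended KV set.
    return F.scaled_dot_product_attention(q, k_cat, v_cat)
```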
3.4 Plug-and-Play Control
Our plug-and-play framework demonstrates versatility across two aspects: it supports various stylization backbones or LoRAs and is compatible with both text-to-video (T2V) (Yang et al., 2024; NVIDIA, 2025) and image-to-video (I2V) (Guo et al., 2024; Shi et al., 2024) DiT architectures. The I2V compatibility particularly enables seamless integration with existing image inpainting capabilities. When utilizing an I2V DiT backbone, VideoPainter requires only one additional step: generating the initial frame using any image inpainting model guided by the masked region’s text caption. This inpainted frame then serves as both the image condition and the first masked video frame. These capabilities further demonstrate the exceptional transferability and versatility of VideoPainter.
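A rough usage sketch of this extra step follows, with `image_inpaint` and `videopainter_i2v` as hypothetical stand-ins for an arbitrary image inpainting model and the I2V-based VideoPainter; frames and masks are assumed to be arrays with masks equal to 1 inside the region to inpaint.

```python
# Sketch of the extra first-frame step when using an I2V backbone.
def inpaint_with_i2v_backbone(video, masks, region_caption,
                              image_inpaint, videopainter_i2v):
    """video: list of frame arrays; masks: list of binary mask arrays."""
    # 1) Inpaint the first frame with any image inpainting model,
    #    guided by the masked region's text caption.
    first_frame = image_inpaint(video[0], masks[0], prompt=region_caption)
    # 2) The inpainted frame serves as both the image condition and the
    #    first frame of the masked video fed to the context encoder.
    masked_video = [first_frame] + [f * (1 - m) for f, m in zip(video[1:], masks[1:])]
    return videopainter_i2v(image=first_frame, masked_video=masked_video,
                            masks=masks, prompt=region_caption)
```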
4 Experiments
4.1 Implementation details
VideoPainter is built upon the pre-trained image-to-video diffusion transformer CogVideoX-5B-I2V (Yang et al., 2024) by default, as well as its text-to-video version. In training, we use VPData at a resolution, learning rate , batch size for both the context encoder ( steps) and the ID-Resample Adapter ( steps) in two stages, with AdamW on 64 NVIDIA V100 GPUs.
Benchmarks.
In video inpainting, we employ DAVIS (Perazzi et al., 2016) as the benchmark for random masks and VPBench for segmentation-based masks. VPBench consists of 6-second videos for standard video inpainting and videos with an average duration of more than seconds for long video inpainting. VPBench covers diverse content, including objects, humans, animals, landscapes, and multi-range masks. For video editing evaluation, we also utilize VPBench, which includes four fundamental editing operations (adding objects, removing objects, swapping objects, and changing attributes) and comprises 6-second videos and videos with an average duration of seconds.
Metrics.
We consider key metrics from the following three aspects: masked region preservation, text alignment, and video generation quality.
- Masked Region Preservation. Following previous works, we compute standard PSNR (Wikipedia contributors, 2024c), LPIPS (Zhang et al., 2018), SSIM (Wang et al., 2004), MSE (Wikipedia contributors, 2024b), and MAE (Wikipedia contributors, 2024a) between the generated and original videos in the unmasked region (a minimal computation sketch follows this list).
- Text Alignment. We employ CLIP Similarity (CLIP Sim) (Wu et al., 2021) to assess the semantic consistency between the generated video and its corresponding text caption. We also measure CLIP Similarity within the masked regions (CLIP Sim (M)).
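As referenced above, a minimal sketch of the unmasked-region computation for PSNR, MSE, and MAE is given below; videos are assumed to be normalized to [0, 1], and SSIM and LPIPS require their respective libraries.

```python
# Sketch of masked-region-preservation metrics restricted to the background.
import numpy as np


def background_metrics(gen: np.ndarray, ref: np.ndarray, mask: np.ndarray) -> dict:
    """gen, ref: [T, H, W, C]; mask: [T, H, W] with 1 = inpainted region."""
    bg = (1.0 - mask)[..., None]                  # keep background pixels only
    diff = (gen - ref) * bg
    n = bg.sum() * gen.shape[-1] + 1e-8           # number of background values
    mse = float((diff ** 2).sum() / n)
    mae = float(np.abs(diff).sum() / n)
    psnr = float(10.0 * np.log10(1.0 / max(mse, 1e-12)))   # peak value is 1.0
    return {"PSNR": psnr, "MSE": mse, "MAE": mae}
```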
4.2 Video Inpainting
Quantitative comparisons
Tab. 2 shows the quantitative comparison on VPBench and DAVIS (Perazzi et al., 2016). We compare against the non-generative ProPainter (Zhou et al., 2023), the generative COCOCO (Zi et al., 2024), and Cog-Inp (Yang et al., 2024), a strong baseline proposed by us that inpaints the first frame using image inpainting models and propagates the result with the I2V backbone via a latent blending operation (Avrahami et al., 2023). On the segmentation-based VPBench, ProPainter and COCOCO exhibit the worst performance across most metrics, primarily due to the inability to inpaint fully masked objects and the single-backbone architecture’s difficulty in balancing the competing goals of background preservation and foreground generation, respectively. On the random-mask benchmark DAVIS, ProPainter improves by leveraging partial background information. However, VideoPainter achieves the best performance across segmentation masks (standard and long length) and random masks through its dual-branch architecture, which effectively decouples background preservation from foreground generation.
Qualitative comparisons
The qualitative comparison with previous video inpainting methods is shown in Fig. 5. VideoPainter consistently shows exceptional results in video coherence, quality, and alignment with the text caption. Notably, ProPainter fails to generate fully masked objects because it relies solely on background pixel propagation rather than generation. While COCOCO demonstrates basic functionality, it fails to maintain a consistent ID in inpainted regions (e.g., inconsistent vessel appearances and abrupt terrain changes), since its single-backbone architecture attempts to balance background preservation and foreground generation. Cog-Inp achieves basic inpainting results; however, its blending operation’s inability to detect mask boundaries leads to significant artifacts. Moreover, VideoPainter can generate coherent videos exceeding one minute while maintaining ID consistency through our ID resampling.
Metric groups: PSNR, SSIM, LPIPS, MSE, and MAE measure masked region preservation; CLIP Sim and CLIP Sim (M) measure text alignment; FVID measures video quality.

| Benchmark | Model | PSNR | SSIM | LPIPS | MSE | MAE | CLIP Sim | CLIP Sim (M) | FVID |
|---|---|---|---|---|---|---|---|---|---|
| VPBench-S | ProPainter | 20.97 | 0.87 | 9.89 | 1.24 | 3.56 | 7.31 | 17.18 | 0.44 |
| VPBench-S | COCOCO | 19.27 | 0.67 | 14.80 | 1.62 | 6.38 | 7.95 | 20.03 | 0.69 |
| VPBench-S | Cog-Inp | 22.15 | 0.82 | 9.56 | 0.88 | 3.92 | 8.41 | 21.27 | 0.18 |
| VPBench-S | Ours | 23.32 | 0.89 | 6.85 | 0.82 | 2.62 | 8.66 | 21.49 | 0.15 |
| VPBench-L | ProPainter | 20.11 | 0.84 | 11.18 | 1.17 | 3.71 | 9.44 | 17.68 | 0.48 |
| VPBench-L | COCOCO | 19.51 | 0.66 | 16.17 | 1.29 | 6.02 | 11.00 | 20.42 | 0.62 |
| VPBench-L | Cog-Inp | 19.78 | 0.73 | 12.53 | 1.33 | 5.13 | 11.47 | 21.22 | 0.21 |
| VPBench-L | Ours | 22.19 | 0.85 | 9.14 | 0.71 | 2.92 | 11.52 | 21.54 | 0.17 |
| DAVIS | ProPainter | 23.99 | 0.92 | 5.86 | 0.98 | 2.48 | 7.54 | 16.69 | 0.12 |
| DAVIS | COCOCO | 21.34 | 0.66 | 10.51 | 0.92 | 4.99 | 6.73 | 17.50 | 0.33 |
| DAVIS | Cog-Inp | 23.92 | 0.79 | 10.78 | 0.47 | 3.23 | 7.03 | 17.53 | 0.17 |
| DAVIS | Ours | 25.27 | 0.94 | 4.29 | 0.45 | 1.41 | 7.21 | 18.46 | 0.09 |
4.3 Video Editing
VideoPainter can be applied to video editing by employing vision-language models (OpenAI, 2024; Team et al., 2024) to generate a modified caption from the user’s editing instruction and the source caption, and then using VideoPainter to inpaint based on the modified caption. Tab. 3 shows the quantitative comparison on VPBench. We compare against the inversion-based UniEdit (Bai et al., 2024), the DiT-based DiTCtrl (Cai et al., 2024), and the end-to-end ReVideo (Mou et al., 2024). For both standard and long videos in VPBench, VideoPainter achieves superior performance, even surpassing the end-to-end ReVideo. This success can be attributed to its dual-branch architecture, which ensures excellent background preservation and foreground generation, maintaining high fidelity in non-edited regions while keeping edited regions closely aligned with the editing instructions, complemented by inpainting region ID resampling that maintains ID consistency in long videos. The qualitative comparison with previous methods is shown in Fig. 5. VideoPainter demonstrates superior performance in preserving visual fidelity and text-prompt consistency: it generates a seamless animation of a futuristic spaceship traversing the sky, maintaining smooth temporal transitions and precise background boundaries throughout the removal process, without introducing the artifacts observed in ReVideo.
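A minimal sketch of this editing-by-inpainting flow is shown below, with `vlm_rewrite` and `videopainter` as hypothetical callables standing in for the caption-rewriting VLM and the inpainting model.

```python
# Sketch of editing via inpainting: rewrite the caption, then inpaint.
def edit_video(video, masks, source_caption, instruction, vlm_rewrite, videopainter):
    # The VLM turns (source caption, user instruction) into a modified caption.
    edited_caption = vlm_rewrite(source_caption, instruction)
    # VideoPainter then inpaints the masked region conditioned on that caption.
    return videopainter(video, masks, prompt=edited_caption)
```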
Metric groups: PSNR, SSIM, LPIPS, MSE, and MAE measure masked region preservation; CLIP Sim and CLIP Sim (M) measure text alignment; FVID measures video quality.

| Benchmark | Model | PSNR | SSIM | LPIPS | MSE | MAE | CLIP Sim | CLIP Sim (M) | FVID |
|---|---|---|---|---|---|---|---|---|---|
| Standard | UniEdit | 9.96 | 0.36 | 56.68 | 11.08 | 25.78 | 8.46 | 14.23 | 1.36 |
| Standard | DiTCtrl | 9.30 | 0.33 | 57.42 | 12.73 | 27.45 | 8.52 | 15.59 | 0.57 |
| Standard | ReVideo | 15.52 | 0.49 | 27.68 | 3.49 | 11.14 | 9.34 | 20.01 | 0.42 |
| Standard | Ours | 22.63 | 0.91 | 7.65 | 1.02 | 2.90 | 8.67 | 20.20 | 0.18 |
| Long | UniEdit | 10.37 | 0.30 | 54.61 | 10.25 | 24.89 | 10.85 | 15.42 | 1.00 |
| Long | DiTCtrl | 9.76 | 0.28 | 62.49 | 11.50 | 26.64 | 11.78 | 16.52 | 0.56 |
| Long | ReVideo | 15.50 | 0.46 | 28.57 | 3.92 | 12.24 | 11.22 | 20.50 | 0.35 |
| Long | Ours | 22.60 | 0.90 | 7.53 | 0.86 | 2.76 | 11.85 | 19.38 | 0.11 |
| Task | Background Preservation | Text Alignment | Video Quality |
|---|---|---|---|
| Video Inpainting (Ours) | | | |
| Video Editing (Ours) | | | |
4.4 Human Evaluation
We conducted a user study on video inpainting and editing tasks using standard-length video samples from the VPBench inpainting and editing subsets. Thirty participants evaluated 50 randomly selected cases based on background preservation, text alignment, and video quality. As shown in Tab. 4, VideoPainter significantly outperformed existing baselines, achieving higher preference rates across all evaluation criteria in both tasks. Detailed experiment settings and results are provided in the Appendix.
4.5 Ablation Analysis
We ablate VideoPainter in Tab. 5, covering framework architecture, context encoder size, group-wise feature integration strategy, and inpainting region ID resampling.
Metric groups: PSNR, SSIM, LPIPS, MSE, and MAE measure masked region preservation; CLIP Sim and CLIP Sim (M) measure text alignment; FVID measures video quality.

| Model | PSNR | SSIM | LPIPS | MSE | MAE | CLIP Sim | CLIP Sim (M) | FVID |
|---|---|---|---|---|---|---|---|---|
| Single-Branch | 20.54 | 0.79 | 10.48 | 0.94 | 4.16 | 8.19 | 19.31 | 0.22 |
| VideoPainter (1) | 21.92 | 0.81 | 8.78 | 0.89 | 3.26 | 8.44 | 20.79 | 0.17 |
| VideoPainter (4) | 22.86 | 0.85 | 6.51 | 0.83 | 2.86 | 9.12 | 20.49 | 0.16 |
| w/o Select | 20.94 | 0.74 | 7.90 | 0.95 | 3.87 | 8.26 | 17.84 | 0.25 |
| VideoPainter (T2V) | 23.01 | 0.87 | 6.94 | 0.89 | 2.65 | 9.41 | 20.66 | 0.16 |
| VideoPainter | 23.32 | 0.89 | 6.85 | 0.82 | 2.62 | 8.66 | 21.49 | 0.15 |
| w/o Resample (L) | 21.79 | 0.84 | 8.65 | 0.81 | 3.10 | 11.35 | 20.68 | 0.19 |
| VideoPainter (L) | 22.19 | 0.85 | 9.14 | 0.71 | 2.92 | 11.52 | 21.54 | 0.17 |
Based on rows 1 and 5, the dual-branch VideoPainter significantly outperforms its single-branch counterpart by explicitly decoupling background preservation from foreground generation, thereby reducing model complexity and avoiding the trade-off between competing objectives in a single branch. Rows 2 to 6 of Tab. 5 support our key design choices: a two-layer context encoder as the optimal balance between performance and efficiency, token-selective feature fusion based on segmentation mask information to prevent confusion from indistinguishable foreground and background tokens in the backbone, and plug-and-play control that adapts to different backbones with comparable performance. Furthermore, rows 7 and 8 verify the importance of inpainting region ID resampling for long videos, which preserves ID by explicitly resampling inpainted-region tokens from previous clips.
4.6 Plug-and-Play Control Ability
Fig. 7 demonstrates the flexible plug-and-play control capabilities of VideoPainter in base diffusion transformer selection. We showcase how VideoPainter can be seamlessly integrated with a community-developed Gromit-style LoRA (Cseti, 2024). Despite the significant domain gap between anime-style data and our training dataset, VideoPainter’s dual-branch architecture preserves its plug-and-play inpainting ability, enabling users to select the most appropriate base model for their specific inpainting requirements and expected results.
5 Discussion
In this paper, we introduce VideoPainter, the first dual-branch video inpainting framework with plug-and-play control capabilities. Our approach features three key innovations: (1) a lightweight plug-and-play context encoder compatible with any pre-trained video DiTs, (2) an inpainting region ID resampling technique for maintaining long video ID consistency, and (3) a scalable dataset pipeline that produced VPData and VPBench, containing over 390K video clips with precise masks and dense captions. VideoPainter also shows promise in video editing applications. Extensive experiments demonstrate that VideoPainter achieves state-of-the-art performance across 8 key metrics in video inpainting and video editing, particularly in video quality, mask region preservation, and text coherence.
However, VideoPainter still has several limitations: (1) Generation quality is limited by the base video DiT models, which may struggle with complex physical and motion modeling scenarios, and (2) performance is suboptimal with low-quality masks or misaligned video captions.
References
- Avrahami et al. (2023) Avrahami, O., Fried, O., and Lischinski, D. Blended latent diffusion. ACM transactions on graphics (TOG), 42(4):1–11, 2023.
- Bai et al. (2024) Bai, J., He, T., Wang, Y., Guo, J., Hu, H., Liu, Z., and Bian, J. Uniedit: A unified tuning-free framework for video motion and appearance editing. arXiv preprint arXiv:2402.13185, 2024.
- Betker et al. (2023) Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al. Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
- Bian et al. (2024) Bian, Y., Ju, X., Li, J., Xu, Z., Cheng, D., and Xu, Q. Multi-patch prediction: Adapting llms for time series representation learning. arXiv preprint arXiv:2402.04852, 2024.
- Cai et al. (2024) Cai, M., Cun, X., Li, X., Liu, W., Zhang, Z., Zhang, Y., Shan, Y., and Yue, X. Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video generation. arXiv preprint arXiv:2412.18597, 2024.
- Castellano (2024) Castellano, B. Pyscenedetect: Intelligent scene cut detection and video analysis tool, 2024. URL https://github.com/Breakthrough/PySceneDetect.
- Chang et al. (2019a) Chang, Y.-L., Liu, Z. Y., Lee, K.-Y., and Hsu, W. Free-form video inpainting with 3d gated convolution and temporal patchgan. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9066–9075, 2019a.
- Chang et al. (2019b) Chang, Y.-L., Liu, Z. Y., Lee, K.-Y., and Hsu, W. Learnable gated temporal shift module for deep video inpainting. arXiv preprint arXiv:1907.01131, 2019b.
- Chen et al. (2023) Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., and Li, Z. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023.
- Cseti (2024) Cseti. Cogvideox-lora-wallace_and_gromit, 2024. URL https://huggingface.co/Cseti/CogVideoX-LoRA-Wallace_and_Gromit.
- Darkhalil et al. (2022) Darkhalil, A., Shan, D., Zhu, B., Ma, J., Kar, A., Higgins, R., Fidler, S., Fouhey, D., and Damen, D. Epic-kitchens visor benchmark: Video segmentations and object relations. In Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2022.
- Ding et al. (2023) Ding, H., Liu, C., He, S., Jiang, X., Torr, P. H., and Bai, S. Mose: A new dataset for video object segmentation in complex scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20224–20234, 2023.
- Fang et al. (2024) Fang, Z., Zhai, W., Su, A., Song, H., Zhu, K., Wang, M., Chen, Y., Liu, Z., Cao, Y., and Zha, Z.-J. Vivid: Video virtual try-on using diffusion models. arXiv preprint arXiv:2405.11794, 2024.
- Gao et al. (2020) Gao, C., Saraf, A., Huang, J.-B., and Kopf, J. Flow-edge guided video completion. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pp. 713–729. Springer, 2020.
- Guo et al. (2024) Guo, X., Zheng, M., Hou, L., Gao, Y., Deng, Y., Wan, P., Zhang, D., Liu, Y., Hu, W., Zha, Z., et al. I2v-adapter: A general image-to-video adapter for diffusion models. In ACM SIGGRAPH 2024 Conference Papers, pp. 1–12, 2024.
- Guo et al. (2023) Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., and Dai, B. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
- Hong et al. (2023) Hong, L., Chen, W., Liu, Z., Zhang, W., Guo, P., Chen, Z., and Zhang, W. Lvos: A benchmark for long-term video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13480–13492, 2023.
- Hu et al. (2020) Hu, Y.-T., Wang, H., Ballas, N., Grauman, K., and Schwing, A. G. Proposal-based video completion. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, pp. 38–54. Springer, 2020.
- Ju et al. (2024) Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., and Xu, Q. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. arXiv preprint arXiv:2403.06976, 2024.
- Kim et al. (2019) Kim, D., Woo, S., Lee, J.-Y., and Kweon, I. S. Deep video inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5792–5801, 2019.
- Kuaishou (2024) Kuaishou. Kling spark your imagination. https://kling.kuaishou.com/, 2024.
- Lee et al. (2019) Lee, S., Oh, S. W., Won, D., and Kim, S. J. Copy-and-paste networks for deep video inpainting. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4413–4421, 2019.
- Li et al. (2020) Li, A., Zhao, S., Ma, X., Gong, M., Qi, J., Zhang, R., Tao, D., and Kotagiri, R. Short-term and long-term context aggregation network for video inpainting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pp. 728–743. Springer, 2020.
- Li et al. (2024) Li, Y., Bian, Y., Ju, X., Zhang, Z., Shan, Y., and Xu, Q. Brushedit: All-in-one image inpainting and editing. arXiv preprint arXiv:2412.10316, 2024.
- Li et al. (2022) Li, Z., Lu, C.-Z., Qin, J., Guo, C.-L., and Cheng, M.-M. Towards an end-to-end framework for flow-guided video inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 17562–17571, 2022.
- Liu et al. (2021) Liu, R., Deng, H., Huang, Y., Shi, X., Lu, L., Sun, W., Wang, X., Dai, J., and Li, H. Decoupled spatial-temporal transformer for video inpainting. arXiv preprint arXiv:2104.06637, 2021.
- Liu et al. (2023) Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- Mou et al. (2024) Mou, C., Cao, M., Wang, X., Zhang, Z., Shan, Y., and Zhang, J. Revideo: Remake a video with motion and content control. arXiv preprint arXiv:2405.13865, 2024.
- NVIDIA (2025) NVIDIA. NVIDIA Cosmos: Accelerate physical AI development with world foundation models, 2025. URL https://www.nvidia.com/en-us/ai/cosmos/.
- OpenAI (2024) OpenAI. Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/.
- OpenAI (2024) OpenAI. Video generation models as world simulators. https://openai.com/sora/, 2024.
- Peebles & Xie (2023) Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
- Perazzi et al. (2016) Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., and Sorkine-Hornung, A. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 724–732, 2016.
- Polyak et al. (2024) Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.-Y., Chuang, C.-Y., et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
- Quan et al. (2024) Quan, W., Chen, J., Liu, Y., Yan, D.-M., and Wonka, P. Deep learning-based image and video inpainting: A survey. International Journal of Computer Vision, 132(7):2367–2400, 2024.
- Ravi et al. (2024) Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- Rombach & Esser (2022) Rombach, R. and Esser, P. Stable diffusion 2 inpainting. https://huggingface.co/stabilityai/stable-diffusion-2-inpainting, 2022.
- Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
- Schuhmann et al. (2022) Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Shi et al. (2024) Shi, X., Huang, Z., Wang, F.-Y., Bian, W., Li, D., Zhang, Y., Zhang, M., Cheung, K. C., See, S., Qin, H., et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In ACM SIGGRAPH 2024 Conference Papers, pp. 1–11, 2024.
- Sun et al. (2024) Sun, W., Tu, R.-C., Liao, J., and Tao, D. Diffusion model-based video editing: A survey. arXiv preprint arXiv:2407.07111, 2024.
- Team et al. (2024) Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Teed & Deng (2020) Teed, Z. and Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 402–419. Springer, 2020.
- Tokmakov et al. (2023) Tokmakov, P., Li, J., and Gaidon, A. Breaking the “object” in video object segmentation. In CVPR, 2023.
- Wang et al. (2019) Wang, C., Huang, H., Han, X., and Wang, J. Video inpainting by jointly learning temporal structure and spatial details. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 5232–5239, 2019.
- Wang et al. (2018) Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., and Catanzaro, B. Video-to-video synthesis. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/d86ea612dec96096c5e0fcc8dd42ab6d-Paper.pdf.
- Wang et al. (2023) Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M., and Tang, J. Cogvlm: Visual expert for pretrained language models, 2023.
- Wang et al. (2024) Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., and Zhou, J. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024.
- Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- Wikipedia contributors (2024a) Wikipedia contributors. Mean absolute error — Wikipedia, the free encyclopedia, 2024a. URL https://en.wikipedia.org/wiki/Mean_absolute_error.
- Wikipedia contributors (2024b) Wikipedia contributors. Mean squared error — Wikipedia, the free encyclopedia, 2024b. URL https://en.wikipedia.org/wiki/Mean_squared_error.
- Wikipedia contributors (2024c) Wikipedia contributors. Peak signal-to-noise ratio — Wikipedia, the free encyclopedia, 2024c. URL https://en.wikipedia.org/w/index.php?title=Peak_signal-to-noise_ratio&oldid=1210897995. [Online; accessed 4-March-2024].
- Wu et al. (2021) Wu, C., Huang, L., Zhang, Q., Li, B., Ji, L., Yang, F., Sapiro, G., and Duan, N. GODIVA: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021.
- Xu et al. (2018) Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., and Huang, T. Youtube-vos: Sequence-to-sequence video object segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 585–601, 2018.
- Xu et al. (2019) Xu, R., Li, X., Zhou, B., and Loy, C. C. Deep flow-guided video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3723–3732, 2019.
- Yang et al. (2024) Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- Zeng et al. (2020) Zeng, Y., Fu, J., and Chao, H. Learning joint spatial-temporal transformations for video inpainting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pp. 528–543. Springer, 2020.
- Zhang et al. (2022a) Zhang, K., Fu, J., and Liu, D. Inertia-guided flow completion and style fusion for video inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5982–5991, 2022a.
- Zhang et al. (2022b) Zhang, K., Fu, J., and Liu, D. Flow-guided transformer for video inpainting. In European Conference on Computer Vision, pp. 74–90. Springer, 2022b.
- Zhang et al. (2022c) Zhang, K., Fu, J., and Liu, D. Flow-guided transformer for video inpainting. In European Conference on Computer Vision, pp. 74–90. Springer, 2022c.
- Zhang et al. (2023) Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847, 2023.
- Zhang et al. (2018) Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 586–595, 2018.
- Zhang et al. (2024a) Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1724–1732, 2024a.
- Zhang et al. (2024b) Zhang, Z., Wu, B., Wang, X., Luo, Y., Zhang, L., Zhao, Y., Vajda, P., Metaxas, D., and Yu, L. Avid: Any-length video inpainting with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7162–7172, 2024b.
- Zhou et al. (2023) Zhou, S., Li, C., Chan, K. C., and Loy, C. C. Propainter: Improving propagation and transformer for video inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10477–10486, 2023.
- Zi et al. (2024) Zi, B., Zhao, S., Qi, X., Wang, J., Shi, Y., Chen, Q., Liang, B., Wong, K.-F., and Zhang, L. Cococo: Improving text-guided video inpainting for better consistency, controllability and compatibility. arXiv preprint arXiv:2403.12035, 2024.
- Zou et al. (2021) Zou, X., Yang, L., Liu, D., and Lee, Y. J. Progressive temporal feature alignment network for video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16448–16457, 2021.