You've probably seen AI tools that can erase objects from photos and fill in the gap seamlessly. But how does the model know what to put there, and how does it figure out where to edit when you just say "remove the dog"? In this post, I'll break down two papers: BrushNet, a clever architecture that adds inpainting ability to any diffusion model, and BrushEdit, an agent pipeline that wraps BrushNet with language understanding to turn natural instructions into image edits.
Imagine you have a photo of a dog on a beach. You want to replace the dog with a sandcastle. You need a model that can fill the hole with plausible new content, blend it with the surrounding sand and sky, and follow a text description of what goes there.
The simplest approach? Fine-tune the entire diffusion model for inpainting. But this has a big downside: you break the original model. It can't do normal image generation anymore, and you can't swap in a better base model later.
BrushNet's solution: keep the original model frozen, and add a separate trainable branch alongside it.
BrushNet runs two U-Nets in parallel:
                 ┌───────────────────────────┐
Text prompt ────▶│ Base U-Net (FROZEN)       │────▶ Predicted noise
                 │ Has cross-attention       │
                 │ to understand text        │
                 └─────────────▲─────────────┘
                               │
                        + (add features)
                               │
                 ┌─────────────┴─────────────┐
Masked image ───▶│ BrushNet (TRAINABLE)      │
+ mask ─────────▶│ NO cross-attention        │
+ noisy latent ─▶│ Processes spatial info    │
                 └───────────────────────────┘
The Base U-Net does what it always does: denoise an image guided by a text prompt. BrushNet runs alongside it, processing the mask and surrounding context, then injects hints into the Base U-Net at every layer.
BrushNet takes 3 things, concatenated into a 9-channel input:
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│  Noisy latent    │   │  Masked image    │   │  Binary mask     │
│  (4 channels)    │   │  (4 channels)    │   │  (1 channel)     │
│                  │   │                  │   │                  │
│  Current state   │   │  What's around   │   │  Where the       │
│  of denoising    │   │  the hole        │   │  hole is         │
└────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                   Concatenate → 9 channels
                                │
                         ┌──────▼──────┐
                         │  BrushNet   │
                         └─────────────┘
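The concatenation step can be sketched in a few lines. This is a minimal NumPy illustration of the shapes involved, assuming SD 1.5 latents at 512×512 resolution (4×64×64 latents, 1×64×64 mask); the tensor names are mine, not from the BrushNet codebase:

```python
import numpy as np

# Assumed SD 1.5 shapes at 512x512: latents are 4x64x64, mask is 1x64x64.
z_t = np.random.randn(4, 64, 64)       # noisy latent at the current timestep
z_masked = np.random.randn(4, 64, 64)  # VAE-encoded masked image
mask = np.zeros((1, 64, 64))           # binary mask, 1 = inpaint here
mask[0, 20:40, 20:40] = 1.0

# Channel-wise concatenation produces BrushNet's 9-channel input.
brushnet_input = np.concatenate([z_t, z_masked, mask], axis=0)
print(brushnet_input.shape)  # (9, 64, 64)
```

The concatenation happens along the channel axis, so spatial resolution is untouched: every pixel position carries its noise state, its context, and its mask bit together.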
Each input answers a different question:
1. Noisy latent z_t (4 channels) → "What step are we at?"
This is the current state of the image being denoised. At each timestep during the denoising loop, the image goes from pure noise to clean image. BrushNet needs to see this so it knows how much noise is left and can produce appropriate injection features for the current step.
t=T (start):  z_t = pure noise            → BrushNet: "everything is noisy, give strong guidance"
t=T/2 (mid):  z_t = half noise/half image → BrushNet: "refine the details"
t=0 (end):    z_t = nearly clean          → BrushNet: "just fix edges"
2. Masked image latent z_masked (4 channels) → "What's around the hole?"
This is the original image with the masked region zeroed out, then VAE-encoded. It tells BrushNet what the surrounding context looks like: colors, textures, edges near the mask boundary.
Original:      [beach][dog][beach]
Mask applied:  [beach][ 0 ][beach]   ← dog region zeroed out
VAE encode:    [4-channel latent]    ← this goes to BrushNet
Why 4 channels instead of 3 (RGB)? Because the U-Net operates in VAE latent space, not pixel space. Raw pixels would be mismatched, like feeding English text into a Chinese language model. The VAE encoder translates the image into the same "language" the U-Net understands.
Original image (512×512×3)
        ↓
Apply mask (zero out hole region)
        ↓
VAE Encoder
        ↓
Masked image latent (64×64×4) ← this goes to BrushNet
3. Mask (1 channel) → "Where is the hole?"
A simple binary map: 1 = inpaint here, 0 = keep original. You might think BrushNet could figure this out from the masked image alone (just look for the zeros), but zeroed-out pixels are ambiguous:
Without mask channel:
  z_masked has zeros at (2,3) → Are these black pixels or a hole? 🤷
With mask channel:
  z_masked has zeros at (2,3) + mask=1 at (2,3) → Definitely a hole! ✓
| Without… | Problem |
|---|---|
| Noisy latent | BrushNet doesn't know which denoising step → wrong features |
| Masked image | BrushNet can't see surrounding context → can't blend |
| Mask | BrushNet can't distinguish "black pixel" from "hole" |
Each input answers a different question: when (timestep), what's around (context), and where (hole location).
Here's the clever part. BrushNet's features are injected into the Base U-Net through zero convolutions: 1×1 convolutions where all weights start at zero.
At training start:
BrushNet feature ──▶ ZeroConv ──▶ 0.0 ──▶ + Base U-Net feature
                    (all zeros)              (unchanged!)
Why? Because the Base U-Net is a carefully trained model. If you inject random noise into it on day one, you'd destroy its ability to generate images. Starting from zero means:
Training step 0: BrushNet contributes nothing (U-Net works normally)
Training step 100: BrushNet whispers tiny hints (weights: 0.001)
Training step 10K: BrushNet provides real guidance (weights: 0.1)
Say BrushNet produces a feature value of 0.8 at some position. Here's what the zero convolution does with it over training:
Step 0:      weight = 0.0  → 0.0 × 0.8 = 0.0    (silent)
Step 1000:   weight = 0.02 → 0.02 × 0.8 = 0.016 (whispering)
Step 10000:  weight = 0.25 → 0.25 × 0.8 = 0.2   (contributing)
It's like slowly turning up the volume from mute. The Base U-Net is never shocked by sudden changes.
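The identity-at-initialization property is easy to verify numerically. Here is a minimal NumPy sketch (a 1×1 convolution is just a per-pixel linear map across channels); the function name and shapes are illustrative, not from the actual implementation:

```python
import numpy as np

def zero_conv_1x1(features, weight, bias):
    """A 1x1 convolution is a per-pixel linear map across channels.
    features: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    c_in, h, w = features.shape
    out = weight @ features.reshape(c_in, h * w) + bias[:, None]
    return out.reshape(-1, h, w)

c = 8
base_feature = np.random.randn(c, 16, 16)      # frozen U-Net activation
brushnet_feature = np.random.randn(c, 16, 16)  # BrushNet activation

# At step 0 the conv weights and bias are all zeros, so the injected
# term vanishes and the base activation passes through untouched.
w0, b0 = np.zeros((c, c)), np.zeros(c)
injected = base_feature + zero_conv_1x1(brushnet_feature, w0, b0)
assert np.allclose(injected, base_feature)

# Once gradients move the weights off zero, the injection starts to matter.
w_small = np.eye(c) * 0.02
nudged = base_feature + zero_conv_1x1(brushnet_feature, w_small, b0)
assert not np.allclose(nudged, base_feature)
```

Note that a zero-initialized layer still receives nonzero gradients (the loss depends on its inputs), which is why the weights can grow away from zero at all.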
Unlike ControlNet (which only injects into the decoder), BrushNet injects at every single layer (all encoder blocks, the mid block, and all decoder blocks):

The left column (green) is the trainable BrushNet branch with no cross-attention to text. The right column (blue) is the frozen Base U-Net with text cross-attention. The red arrows are zero-conv injection points where BrushNet features are added element-wise to the Base U-Net.
Each arrow is actually multiple injection points (one per sub-layer), about 25 in total. This dense injection gives BrushNet pixel-level control, which is crucial for inpainting: you need precise boundaries where the generated content meets the original image.
The Base U-Net has cross-attention layers that let it understand text prompts:
Base U-Net block:  ResBlock → CrossAttention("a sunflower") → output
BrushNet block:    ResBlock → ────────────────────────────── output
                                          ▲
                                     (removed!)
This is by design. BrushNet's job is purely spatial: "here's a hole, here's what's around it." The text understanding stays in the Base U-Net. This separation keeps each branch simple and lets you swap either one independently.
The training loop is surprisingly simple, using the standard diffusion denoising loss:
For each training step:
1. Take a clean image ("cat on a couch")
2. Generate a RANDOM mask (random shape, random position)
3. Apply mask to image (now it has a hole in it)
4. VAE-encode both: z₀ (clean latent), z_masked (masked latent)
5. Add random noise to the clean latent: z_t = mix(z₀, noise, t)
6. Run through both branches:
     BrushNet(z_t, z_masked, mask) → injection features
     Base_UNet(z_t, text) + features → predicted noise
7. Loss = ‖ predicted_noise − actual_noise ‖²  (MSE)
Note what the target is: the model predicts what noise was added, not what the clean image looks like. We know the actual noise because we added it ourselves in step 5. If the model can perfectly predict the noise, we can subtract it to recover the clean image.
We added noise ε to get z_t.
The model predicts ε_θ.
If ε_θ ≈ ε, then z₀ ≈ (z_t − ε_θ) / scale → clean image recovered!
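The recovery identity above can be checked with a toy DDPM-style forward mix. This is a sketch under stated assumptions: `alpha_bar` is the cumulative signal fraction at some timestep, and 0.5 is an arbitrary illustrative value:

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))    # stand-in for the clean latent
eps = rng.standard_normal((4, 8, 8))   # the noise we add ourselves (step 5)

# DDPM-style forward mix at some timestep t; alpha_bar = cumulative
# signal fraction (0.5 is an arbitrary illustrative value).
alpha_bar = 0.5
z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1 - alpha_bar) * eps

# A perfect noise prediction lets us invert the mix and recover z0 exactly.
eps_pred = eps
z0_rec = (z_t - np.sqrt(1 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)
assert np.allclose(z0_rec, z0)
```

In practice ε_θ is never exact, which is why sampling proceeds in many small steps rather than one jump, but the algebra is the same.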
Is the loss restricted to the masked region? No: it's computed over the entire image, not just the hole. But the model naturally focuses on the mask, because the unmasked surroundings are handed to it directly through z_masked, so the hole is where nearly all of the prediction error comes from.
The mask guides learning implicitly through gradients, not explicitly through loss weighting.
BrushNet doesn't need paired before/after examples. It's self-supervised:
Dataset: clean images + text descriptions (same data as Stable Diffusion)
Masks: generated randomly during training
The model learns to reconstruct whatever was behind a random mask, using the surrounding context and text prompt. At inference, you provide a real mask of what you want to replace.
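The random mask generation in step 2 can be sketched very simply. This toy version draws a single random rectangle; the real training uses richer shapes (brush strokes, segmentation-like blobs), and the function name is mine:

```python
import numpy as np

def random_rect_mask(h=64, w=64, rng=None):
    """Toy stand-in for random mask generation: one random rectangle.
    (The actual training uses richer shapes: strokes, blobs, silhouettes.)"""
    if rng is None:
        rng = np.random.default_rng()
    mask = np.zeros((h, w))
    y0 = int(rng.integers(0, h // 2)); x0 = int(rng.integers(0, w // 2))
    y1 = y0 + int(rng.integers(4, h // 2)); x1 = x0 + int(rng.integers(4, w // 2))
    mask[y0:y1, x0:x1] = 1.0
    return mask

rng = np.random.default_rng(42)
mask = random_rect_mask(rng=rng)
image = np.ones((64, 64, 3))                     # stand-in "clean image"
masked_image = image * (1.0 - mask)[..., None]   # hole zeroed out, rest kept
```

Because the mask is synthetic, every clean image in the dataset yields unlimited (masked image, target) pairs for free; that is the whole self-supervision trick.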
| Feature | SD Inpainting | ControlNet | BrushNet |
|---|---|---|---|
| Base model | Modified (retrained) | Frozen | Frozen |
| Branch coverage | N/A (single model) | Encoder only | Full U-Net |
| Injection points | N/A | ~12 (decoder only) | ~25 (everywhere) |
| Swap base models? | No | Yes | Yes |
| Extra params | 0 | ~360M | ~480M |
| Text handling | Single model | Branch has cross-attn | Branch has NO cross-attn |
| Best for | General inpainting | Structural control | Precise inpainting |
ControlNet copies only the encoder half and injects features into the decoder via the skip connections. This works well for structural guidance (edges, poses) but not for inpainting, where you need fine-grained control at every spatial resolution.
The BrushNet paper showed this clearly:
Full U-Net (BrushNet):  PSNR 19.86  ← best quality
Half U-Net:             PSNR 19.01
ControlNet-style:       PSNR 18.28  ← worst quality
Inpainting needs dense per-pixel control, especially at mask boundaries where generated content must blend seamlessly with the original image.
At inference time, the full pipeline looks like this:
1. User provides: image + mask + text prompt ("a sunflower")
2. Encode:
     masked_image = apply_mask(image, mask)
     z_masked = VAE_encode(masked_image)    [4, 64, 64]
     mask_small = downsample(mask)          [1, 64, 64]
3. Start from pure noise:
     z_T ~ N(0, I)                          [4, 64, 64]
4. Denoise loop (T steps, e.g. 25-50):
     for t in T → 0:
         brushnet_feats = BrushNet(z_t, z_masked, mask_small, t)
         noise_pred = BaseUNet(z_t, t, "a sunflower") + brushnet_feats
         z_{t-1} = scheduler_step(z_t, noise_pred)
5. Decode final latent:
     result = VAE_decode(z_0)               [3, 512, 512]
6. Blend:
     output = blur_blend(result, original_image, mask)
The final blending step uses a Gaussian-blurred mask to smooth the transition between generated and original pixels, avoiding hard edges.
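The blending step is simple enough to sketch end-to-end. This NumPy version uses a box filter as a stand-in for the Gaussian blur (the smoothing effect is the same in spirit); the function name and the toy grayscale values are mine:

```python
import numpy as np

def box_blur(mask, k=5):
    """Cheap soft mask via a k x k box filter (a stand-in for the
    Gaussian blur used in the real blending step)."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="edge")
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

mask = np.zeros((64, 64))
mask[16:48, 16:48] = 1.0            # hard-edged inpainting region
soft = box_blur(mask)               # values now ramp smoothly from 0 to 1

generated = np.full((64, 64), 0.9)  # toy grayscale "inpainted result"
original = np.full((64, 64), 0.1)   # toy grayscale original
output = soft * generated + (1 - soft) * original
```

Deep inside the mask the output is pure generated content; far outside it is pure original; at the boundary the soft mask mixes the two, which is exactly what hides the seam.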
Because the Base U-Net is never modified, you can swap in a different base model and adjust conditioning_scale (0.0 to 1.0) to control how much BrushNet influences the output:
scale = 0.0 → Base U-Net only (no inpainting guidance)
scale = 0.5 → Gentle inpainting hints
scale = 1.0 → Full BrushNet influence (default)
Base U-Net (frozen):    ~520M params
BrushNet (trainable):   ~480M params
  └─ Zero-conv layers:  25 layers, ~20M params
Total at inference:     ~1,000M params (1B)
BrushNet is nearly the same size as the Base U-Net; the only difference is the removed cross-attention layers (~40M params saved). The trade-off is clear: 2x memory for plug-and-play flexibility.
BrushNet gives us a powerful inpainting engine. But using it requires you to provide two things manually: a mask (where to edit) and a text prompt (what to generate). For simple cases that's fine: draw a circle around the dog, type "a sunflower."
But what if you just want to say "remove the dog" and have the system figure out the rest?
That's exactly what BrushEdit does. It wraps BrushNet in an intelligent agent pipeline that automates the mask and prompt generation.
BrushEdit (arXiv 2412.10316) doesn't change BrushNet's architecture at all. Instead, it asks: how do you go from a natural language instruction to a BrushNet-ready mask and prompt?
The answer is an assembly line of 4 AI models:
User: "Remove the dog from the garden"
        │
        ▼
┌─────────────────────────────┐
│ 1. MLLM (Qwen2-VL)          │   "What kind of edit? What object?"
│    Classify + Identify      │   → edit_type = "remove"
│    + Generate caption       │   → target = "dog"
└─────────────┬───────────────┘   → caption = "garden with flowers"
              ▼
┌─────────────────────────────┐
│ 2. GroundingDINO            │   "Where is the dog?"
│    Text → bounding box      │   → bbox around the dog
└─────────────┬───────────────┘
              ▼
┌─────────────────────────────┐
│ 3. SAM                      │   "What's the exact shape?"
│    Bbox → pixel mask        │   → silhouette of the dog
└─────────────┬───────────────┘
              ▼
┌─────────────────────────────┐
│ 4. BrushNet + SD 1.5        │   "Fill the hole"
│    Mask + caption → image   │   → dog replaced with garden
└─────────────────────────────┘
Each model does one thing well. Let's walk through each step.
The MLLM (a vision-language model like Qwen2-VL or GPT-4o) is called three separate times, each with a different question. No fine-tuning: it's used purely through prompt engineering.
System: "Classify this editing instruction into one of:
addition, remove, local, global, background.
Reply with a single word."
User: "Remove the dog from the garden"
→ "remove"
This classification matters because each edit type needs a different mask strategy:
| Edit Type | What Happens to the Mask |
|---|---|
| Remove ("Remove the dog") | Detect dog → segment it → dilate mask edges |
| Addition ("Add a cat on the sofa") | No detection needed → MLLM predicts a bounding box |
| Local ("Make the car blue") | Detect car → segment it → use mask as-is |
| Background ("Change to a beach") | Detect foreground → segment → invert the mask |
| Global ("Make it nighttime") | Mask the entire image |
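The table reads naturally as a dispatch map. A minimal sketch, assuming nothing beyond the rules above (the step labels are illustrative strings, not functions from the BrushEdit codebase):

```python
# Dispatch map mirroring the edit-type table; step names are
# illustrative labels, not BrushEdit API calls.
MASK_STRATEGIES = {
    "remove":     ["detect", "segment", "dilate"],  # catch fur/shadow edges
    "addition":   ["mllm_bbox"],                    # nothing to detect yet
    "local":      ["detect", "segment"],            # use mask as-is
    "background": ["detect", "segment", "invert"],  # keep the foreground
    "global":     ["full_image_mask"],              # mask everything
}

def mask_strategy(edit_type):
    """Return the ordered mask-building steps for a classified edit."""
    return MASK_STRATEGIES[edit_type]

print(mask_strategy("remove"))      # ['detect', 'segment', 'dilate']
print(mask_strategy("background"))  # ['detect', 'segment', 'invert']
```

Structuring it as data rather than branching code makes the strategy easy to inspect and extend, which matches the pipeline's emphasis on transparency.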
System: "Identify the main object being edited.
Reply with no more than 5 words, a single noun phrase."
User: "Remove the dog from the garden"
→ "dog"
This short phrase will be fed to GroundingDINO as a search query. It needs to be concise: just enough to find the right thing in the image.
System: "Describe what the image should look like AFTER the edit.
Do NOT include elements that are removed or changed."
User: [source image] + "Remove the dog from the garden"
→ "A peaceful garden path with green grass and flowers"
This becomes the text prompt for BrushNet's inpainting. Notice: it describes the scene without the dog, because we're removing it. The MLLM has to understand the instruction well enough to describe the result, not just parrot the input.
All three calls use the MLLM off-the-shelf. No fine-tuning. This means you can swap backends freely:
GPT-4o   → Best quality, requires API key, costs money
Qwen2-VL → Best open-source, runs locally, ~16 GB VRAM
LLaVA    → Lighter alternative, ~17 GB VRAM
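The three prompt-engineering calls can be sketched as a thin wrapper. Here `call_mllm` is a hypothetical backend function (GPT-4o, Qwen2-VL, whichever you plug in), and the prompt strings mirror the ones shown in this section; the stub backend exists only so the sketch runs:

```python
# `call_mllm` is a hypothetical wrapper around your VLM backend;
# the prompts mirror the three system prompts described above.
CLASSIFY = ("Classify this editing instruction into one of: "
            "addition, remove, local, global, background. "
            "Reply with a single word.")
IDENTIFY = ("Identify the main object being edited. "
            "Reply with no more than 5 words, a single noun phrase.")
CAPTION  = ("Describe what the image should look like AFTER the edit. "
            "Do NOT include elements that are removed or changed.")

def parse_instruction(call_mllm, image, instruction):
    """Three independent MLLM calls -> (edit_type, target, caption)."""
    edit_type = call_mllm(system=CLASSIFY, user=instruction)
    target = call_mllm(system=IDENTIFY, user=instruction)
    caption = call_mllm(system=CAPTION, user=instruction, image=image)
    return edit_type, target, caption

# Stub backend so the sketch runs without a real VLM:
def fake_mllm(system, user, image=None):
    if system is CLASSIFY:
        return "remove"
    if system is IDENTIFY:
        return "dog"
    return "A peaceful garden path with green grass and flowers"

print(parse_instruction(fake_mllm, None, "Remove the dog from the garden"))
```

Because the backend is just a function argument, swapping GPT-4o for Qwen2-VL is a one-line change, which is the modularity the design is going for.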
The paper doesn't fine-tune any of these models. It just writes good prompts. This is a deliberate design choice: it keeps the system modular and easy to upgrade as better VLMs come out.
Now we know we're looking for "dog." But where in the image is it?
GroundingDINO is an open-vocabulary object detector. Unlike traditional detectors that only recognize a fixed set of classes (like COCO's 80 categories), it takes any text query and finds matching objects:
Input:  image + "dog"
Output: bounding box (128, 128, 384, 384), confidence 0.89

┌────────────────────────┐
│                        │
│    ┌──────────┐        │
│    │          │        │
│    │   dog    │        │
│    │          │        │
│    └──────────┘        │
│          ▲             │
│     bounding box       │
│     from DINO          │
└────────────────────────┘
This works for any object you can describe in words. "Red car," "wooden table," "person in blue shirt": GroundingDINO handles them all.
Exception: addition edits. If the instruction is "add a cat on the sofa," there's no cat to detect yet. In this case, GroundingDINO is skipped entirely. Instead, the MLLM predicts where the new object should go by outputting a bounding box: "given this 512×512 image, the cat should go at [256, 170, 128, 170]."
A bounding box is too rough. The box around the dog also includes chunks of grass, maybe a bit of fence. We need the exact silhouette.
SAM (Segment Anything Model) takes the bounding box and produces a pixel-precise mask:
Before (bounding box):        After (SAM mask):
┌──────────────────────┐      ┌──────────────────────┐
│                      │      │                      │
│   ┌──────────┐       │      │       ████           │
│   │  grass   │       │      │     ████████         │
│   │   dog    │       │      │      ███████         │
│   │  grass   │       │      │       █████          │
│   └──────────┘       │      │        ██            │
│                      │      │                      │
└──────────────────────┘      └──────────────────────┘
Box includes background       Mask follows the dog's
around the dog                exact silhouette
After SAM produces the mask, BrushEdit adjusts it based on the edit type:
Remove (dilated):             Background (inverted):
┌──────────────────────┐      ┌──────────────────────┐
│                      │      │██████████████████████│
│      ████████        │      │█████    █████████████│
│    ████████████      │      │████  ████████████████│
│     ██████████       │      │█████    █████████████│
│      ████████        │      │███████  █████████████│
│        ████          │      │██████████████████████│
│                      │      │██████████████████████│
└──────────────────────┘      └──────────────────────┘
Expanded to catch fur/shadow  Everything EXCEPT the dog
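Both adjustments are one-liners on a binary array. This is a minimal NumPy sketch, with a hand-rolled 4-neighbourhood dilation standing in for proper morphological dilation (real pipelines would typically use an OpenCV or scipy morphology routine):

```python
import numpy as np

def dilate(mask, iterations=3):
    """Grow a binary mask by one pixel per iteration (4-neighbourhood),
    a simple stand-in for morphological dilation."""
    m = mask.copy()
    for _ in range(iterations):
        grown = m.copy()
        grown[1:, :] = np.maximum(grown[1:, :], m[:-1, :])   # shift down
        grown[:-1, :] = np.maximum(grown[:-1, :], m[1:, :])  # shift up
        grown[:, 1:] = np.maximum(grown[:, 1:], m[:, :-1])   # shift right
        grown[:, :-1] = np.maximum(grown[:, :-1], m[:, 1:])  # shift left
        m = grown
    return m

sam_mask = np.zeros((32, 32))
sam_mask[10:20, 10:20] = 1.0       # pixel-precise object silhouette

removal_mask = dilate(sam_mask)    # "remove": expanded to catch edges
background_mask = 1.0 - sam_mask   # "background": everything EXCEPT object
```

Dilation matters for removal because SAM's mask hugs the object tightly; without a few pixels of margin, fur, shadows, and soft edges of the object survive the inpainting and give the trick away.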
Now we have everything BrushNet needs:
| Input | Value |
|---|---|
| Mask | Pixel-precise segmentation from SAM (dilated for removal) |
| Caption | "A peaceful garden path with green grass and flowers" |
| Original image | The source photo |
This is the exact same BrushNet pipeline we covered in Part 1:
1. masked_image = original × (1 − mask)      ← zero out the dog region
2. z_masked = VAE.encode(masked_image)       ← encode to latent space
3. conditioning = concat(z_masked, mask)     ← 5-channel conditioning
4. Denoising loop (50 steps):
     BrushNet features = BrushNet(z_t, conditioning)
     noise_pred = Base_UNet(z_t, "garden with flowers") + BrushNet features
     z_{t-1} = scheduler.step(z_t, noise_pred)
5. result = VAE.decode(z_0)                  ← back to pixel space
6. output = blur(mask) × result + (1 − blur(mask)) × original   ← blend
The blurred mask blending at the end creates a smooth transition at the boundary. Without it, you'd see a hard edge where the generated content meets the original image. This single step accounts for a +10 PSNR improvement in ablation studies.
Let's trace through one more example to make sure it's clear. Instruction: "Change the background to a tropical beach."
Step 1: MLLM classifies → "background"
        MLLM identifies → "person" (the foreground object to keep)
        MLLM captions   → "A person standing on a tropical beach with
                           palm trees and turquoise water"
Step 2: GroundingDINO("person") → bounding box around the person
Step 3: SAM(bbox) → pixel mask of the person
        Mask is INVERTED → now covers everything EXCEPT the person
        Coverage: ~75% of the image
Step 4: BrushNet inpaints the masked region (the background)
        using caption "tropical beach with palm trees"
        Person is preserved in the unmasked region
        Blended at edges for seamless transition
The key insight for background edits: GroundingDINO detects the foreground object (the person), SAM segments it, then the mask is inverted. BrushNet never touches the person; it only regenerates the background.
You might wonder: why not train one big model that takes "remove the dog" and directly outputs an edited image? That's what InstructPix2Pix does. BrushEdit's decomposed approach has three advantages:
1. Transparency. Every intermediate result is visible. You can see the edit classification ("remove"), the detected object ("dog"), the mask, and the caption. If something goes wrong, you know exactly where.
2. User control. You can override any step. Don't like the auto-generated mask? Draw your own. Want a different caption? Type one. The pipeline doesn't force you into a black box.
3. No paired training data. InstructPix2Pix needs millions of (instruction, before, after) triples, which are expensive to create. BrushEdit needs none. The MLLM is used off-the-shelf, GroundingDINO and SAM are pre-trained, and BrushNet trains on standard images with random masks.
The trade-off is complexity. BrushEdit orchestrates 4 separate models totaling ~66 GB of weights. But each model is best-in-class at its job, and you can upgrade any component independently.
Inversion-based editors (DDIM inversion or Null-Text Inversion paired with Prompt-to-Prompt) invert the image to noise, then re-denoise with edits. BrushEdit skips inversion entirely: it generates directly in the masked region.
| Method | PSNR (quality) | Time |
|---|---|---|
| DDIM + P2P | 22.67 | 11s |
| Null-Text + P2P | 26.52 | 148s |
| BrushEdit | 32.16 | 3.6s |
Over 5 PSNR better and 3-40x faster.
BrushEdit uses BrushNet internally, but improves on it:
| | BrushNet | BrushEdit |
|---|---|---|
| Mask generation | Manual | Automatic (MLLM + DINO + SAM) |
| Caption | Manual | Automatic (MLLM) |
| Model checkpoints | 2 separate (seg masks, random masks) | 1 unified model |
| Object removal | Limited | Trained explicitly with removal data |
| Multi-round editing | No | Yes (output becomes next input) |
The unified model comes from training on BrushData-v2, a merged dataset that combines segmentation masks and random masks, plus new removal training pairs where clean-background images are paired with random masks.
No system is perfect. BrushEdit struggles with:
Irregular masks. Very thin, fragmented, or oddly shaped masks can produce artifacts. The model was trained mostly on blob-like masks and object silhouettes.
Text-mask misalignment. If the caption says "a large elephant" but the mask is tiny, the model can't fit an elephant in there. The MLLM doesn't always reason well about spatial constraints.
Base model ceiling. BrushEdit uses Stable Diffusion 1.5 as its backbone. Output quality is bounded by what SD 1.5 can generate. It can't produce FLUX-quality images because the underlying diffusion model isn't that capable.
VLM errors cascade. If the MLLM misclassifies the edit type (calling a "remove" a "local edit"), the entire downstream pipeline produces wrong results. There's no error recovery between steps.
The two papers together tell a clean story: BrushNet (Part 1) solves how to inpaint, contributing the architecture, and BrushEdit (Part 2) solves what to inpaint, contributing the intelligence layer that turns natural language into masks and captions.
This post covers BrushNet (ECCV 2024) and BrushEdit (arXiv 2412.10316). The architecture diagrams come from hands-on experimentation and code analysis of the TencentARC/BrushEdit repository.
System: "Describe what the image should look like AFTER the edit.
Do NOT include elements that are removed or changed."
User: [source image] + "Remove the dog from the garden"
โ "A peaceful garden path with green grass and flowers"
This becomes the text prompt for BrushNetโs inpainting. Notice: it describes the scene without the dog โ because weโre removing it. The MLLM has to understand the instruction well enough to describe the result, not just parrot the input.
All three calls use the MLLM off-the-shelf. No fine-tuning. This means you can swap backends freely:
GPT-4o โ Best quality, requires API key, costs money
Qwen2-VL โ Best open-source, runs locally, ~16 GB VRAM
LLaVA โ Lighter alternative, ~17 GB VRAM
The paper doesnโt fine-tune any of these models. It just writes good prompts. This is a deliberate design choice โ it keeps the system modular and easy to upgrade as better VLMs come out.
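Because all three calls are pure prompting, they can be expressed as fixed templates against any chat-style VLM backend. The sketch below uses the prompts quoted above; `call_mllm` is a hypothetical stand-in for whatever backend you plug in (GPT-4o, Qwen2-VL, LLaVA):

```python
PROMPTS = {
    "classify": ("Classify this editing instruction into one of: "
                 "addition, remove, local, global, background. "
                 "Reply with a single word."),
    "identify": ("Identify the main object being edited. "
                 "Reply with no more than 5 words, a single noun phrase."),
    "caption":  ("Describe what the image should look like AFTER the edit. "
                 "Do NOT include elements that are removed or changed."),
}

def plan_edit(instruction, image, call_mllm):
    """Run the three MLLM calls for one instruction.

    call_mllm(system_prompt, user_text, image) -> str is a stand-in
    for any VLM backend; swapping backends means swapping this one function.
    """
    return {
        "edit_type": call_mllm(PROMPTS["classify"], instruction, image),
        "target":    call_mllm(PROMPTS["identify"], instruction, image),
        "caption":   call_mllm(PROMPTS["caption"], instruction, image),
    }
```

The modularity claim falls out directly: nothing in `plan_edit` depends on which model answers the prompts.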
Now we know weโre looking for โdog.โ But where in the image is it?
GroundingDINO is an open-vocabulary object detector. Unlike traditional detectors that only recognize a fixed set of classes (like COCOโs 80 categories), it takes any text query and finds matching objects:
Input: image + "dog"
Output: bounding box (128, 128, 384, 384), confidence 0.89
โโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ
โ โโโโโโโโโโโโ โ
โ โ โ โ
โ โ dog โ โ
โ โ โ โ
โ โโโโโโโโโโโโ โ
โ โ โ
โ bounding box โ
โ from DINO โ
โโโโโโโโโโโโโโโโโโโโโโโโโโ
This works for any object you can describe in words. โRed car,โ โwooden table,โ โperson in blue shirtโ โ GroundingDINO handles them all.
Exception: addition edits. If the instruction is โadd a cat on the sofa,โ thereโs no cat to detect yet. In this case, GroundingDINO is skipped entirely. Instead, the MLLM predicts where the new object should go by outputting a bounding box: โgiven this 512ร512 image, the cat should go at [256, 170, 128, 170].โ
A bounding box is too rough. The box around the dog also includes chunks of grass, maybe a bit of fence. We need the exact silhouette.
SAM (Segment Anything Model) takes the bounding box and produces a pixel-precise mask:
Before (bounding box): After (SAM mask):
โโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ โ โ
โ โโโโโโโโโโโโ โ โ โโโโโโโโ โ
โ โ grass โ โ โ โโโโโโโโโโโโ โ
โ โ dog โ โ โ โโโโโโโโโโ โ
โ โ grass โ โ โ โโโโโโ โ
โ โโโโโโโโโโโโ โ โ โโ โ
โ โ โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโ
Box includes background Mask follows the dog's
around the dog exact silhouette
After SAM produces the mask, BrushEdit adjusts it based on the edit type:
Remove (dilated): Background (inverted):
โโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โโโโโโโโโโ โ โโโโโ โโโโโโโโโ
โ โโโโโโโโโโโโโโ โ โโโ โโโโโโโ
โ โโโโโโโโโโโโ โ โโโโโ โโโโโโโโโ
โ โโโโโโโโ โ โโโโโโโ โโโโโโโโโโโ
โ โโโโ โ โโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโ
Expanded to catch fur/shadow Everything EXCEPT the dog
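The edit-type-dependent adjustments (see the mask-strategy table earlier) can be sketched as a small dispatch in NumPy. The dilation here is a simple 4-connected neighbor expansion; the actual pipeline presumably uses a proper morphological kernel, and the iteration count is my placeholder:

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation via 4-connected neighbor shifts."""
    m = mask.astype(bool)
    for _ in range(iterations):
        grown = m.copy()
        grown[1:, :]  |= m[:-1, :]
        grown[:-1, :] |= m[1:, :]
        grown[:, 1:]  |= m[:, :-1]
        grown[:, :-1] |= m[:, 1:]
        m = grown
    return m

def adjust_mask(mask, edit_type):
    """Adapt the SAM mask to the edit type."""
    if edit_type == "remove":       # expand to catch fur, shadows, halos
        return dilate(mask, iterations=3)
    if edit_type == "background":   # edit everything EXCEPT the object
        return ~mask.astype(bool)
    if edit_type == "global":       # restyle the whole image
        return np.ones_like(mask, dtype=bool)
    return mask.astype(bool)        # "local" / "addition": use as-is
```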
Now we have everything BrushNet needs:
| Input | Value |
|---|---|
| Mask | Pixel-precise segmentation from SAM (dilated for removal) |
| Caption | โA peaceful garden path with green grass and flowersโ |
| Original image | The source photo |
This is the exact same BrushNet pipeline we covered in Part 1:
1. masked_image = original ร (1 - mask) โ zero out the dog region
2. z_masked = VAE.encode(masked_image) โ encode to latent space
3. conditioning = concat(z_masked, mask) โ 5-channel conditioning
4. Denoising loop (50 steps):
     brushnet_feats = BrushNet(z_t, conditioning, t)
     noise_pred = Base_UNet(z_t, t, "garden with flowers") + brushnet_feats
     z_{t-1} = scheduler.step(z_t, noise_pred)
5. result = VAE.decode(z_0) โ back to pixel space
6. output = blur(mask) ร result + (1-blur(mask)) ร original โ blend
The blurred mask blending at the end creates a smooth transition at the boundary. Without it, youโd see a hard edge where the generated content meets the original image. This single step accounts for a +10 PSNR improvement in ablation studies.
Letโs trace through one more example to make sure itโs clear. Instruction: โChange the background to a tropical beach.โ
Step 1: MLLM classifies โ "background"
MLLM identifies โ "person" (the foreground object to keep)
MLLM captions โ "A person standing on a tropical beach with
palm trees and turquoise water"
Step 2: GroundingDINO("person") โ bounding box around the person
Step 3: SAM(bbox) โ pixel mask of the person
Mask is INVERTED โ now covers everything EXCEPT the person
Coverage: ~75% of the image
Step 4: BrushNet inpaints the masked region (the background)
using caption "tropical beach with palm trees"
Person is preserved in the unmasked region
Blended at edges for seamless transition
The key insight for background edits: GroundingDINO detects the foreground object (the person), SAM segments it, then the mask is inverted. BrushNet never touches the person โ it only regenerates the background.
You might wonder: why not train one big model that takes โremove the dogโ and directly outputs an edited image? Thatโs what InstructPix2Pix does. BrushEditโs decomposed approach has three advantages:
1. Transparency. Every intermediate result is visible. You can see the edit classification (โremoveโ), the detected object (โdogโ), the mask, and the caption. If something goes wrong, you know exactly where.
2. User control. You can override any step. Donโt like the auto-generated mask? Draw your own. Want a different caption? Type one. The pipeline doesnโt force you into a black box.
3. No paired training data. InstructPix2Pix needs millions of (instruction, before, after) triples โ expensive to create. BrushEdit needs none. The MLLM is used off-the-shelf, GroundingDINO and SAM are pre-trained, and BrushNet trains on standard images with random masks.
The trade-off is complexity. BrushEdit orchestrates 4 separate models totaling ~66 GB of weights. But each model is best-in-class at its job, and you can upgrade any component independently.
Inversion-based editing methods, such as DDIM inversion combined with Prompt-to-Prompt (P2P), first invert the image to noise, then re-denoise it with the edit applied. BrushEdit skips inversion entirely: it generates directly in the masked region.
| Method | PSNR (quality) | Time |
|---|---|---|
| DDIM + P2P | 22.67 | 11s |
| Null-Text + P2P | 26.52 | 148s |
| BrushEdit | 32.16 | 3.6s |
That's more than 5 PSNR higher than the best inversion-based method, and 3-40x faster.
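For context on the metric: PSNR measures, on a logarithmic dB scale, how faithfully pixels are reproduced (here, how well the unmasked region is preserved). For images with values in [0, 1]:

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val].

    Higher is better; the scale is logarithmic, so a 5-6 dB gap
    corresponds to a several-fold reduction in mean squared error.
    """
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)
```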
BrushEdit uses BrushNet internally, but improves on it:
| | BrushNet | BrushEdit |
|---|---|---|
| Mask generation | Manual | Automatic (MLLM + DINO + SAM) |
| Caption | Manual | Automatic (MLLM) |
| Model checkpoints | 2 separate (seg masks, random masks) | 1 unified model |
| Object removal | Limited | Trained explicitly with removal data |
| Multi-round editing | No | Yes (output becomes next input) |
The unified model comes from training on BrushData-v2 โ a merged dataset that combines segmentation masks and random masks, plus new removal training pairs where clean-background images are paired with random masks.
No system is perfect. BrushEdit struggles with:
Irregular masks. Very thin, fragmented, or oddly shaped masks can produce artifacts. The model was trained mostly on blob-like masks and object silhouettes.
Text-mask misalignment. If the caption says โa large elephantโ but the mask is tiny, the model canโt fit an elephant in there. The MLLM doesnโt always reason well about spatial constraints.
Base model ceiling. BrushEdit uses Stable Diffusion 1.5 as its backbone. Output quality is bounded by what SD 1.5 can generate. It canโt produce FLUX-quality images because the underlying diffusion model isnโt that capable.
VLM errors cascade. If the MLLM misclassifies the edit type (calling a โremoveโ a โlocal editโ), the entire downstream pipeline produces wrong results. Thereโs no error recovery between steps.
BrushNet (Part 1): a plug-and-play dual-branch architecture that adds inpainting to a frozen diffusion model.
BrushEdit (Part 2): an agent pipeline (MLLM + GroundingDINO + SAM + BrushNet) that turns natural-language instructions into masks, captions, and edits.
The two papers together tell a clean story: BrushNet solves how to inpaint (the architecture), and BrushEdit solves what to inpaint (the intelligence layer that turns natural language into masks and captions).
This post covers BrushNet (ECCV 2024) and BrushEdit (arXiv 2412.10316). The architecture diagrams come from hands-on experimentation and code analysis of the TencentARC/BrushEdit repository.