SD 1.x to SDXL refiner #4
Hi! I did a quick test, and using the refiner with 1.5 works with the ComfyUI node, meaning the issue is somewhere else. Could it be a scaling issue? Just looking at the code, it looks like you're passing the scaled latents directly to the refiner. Try modifying the code like this:
scaled_latents = 1 / 0.18215 * latents
sdxl_scaled_latents = convert(scaled_latents.to(dtype=torch.float32), "v1", "xl", torch.float32, torch_device)
sdxl_latents = 0.18215 * sdxl_scaled_latents |
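The scaling fix suggested above can be illustrated without loading any model: unscale the pipeline latent, run the converter on the unscaled values, then rescale the result. This is only a minimal sketch. `convert_scaled` and `recording_identity` are hypothetical names, plain Python floats stand in for torch tensors, and 0.18215 is the factor quoted in the comment.

```python
SCALE = 0.18215  # scaling factor quoted in the comment above

def convert_scaled(scaled_latents, convert_fn, scale=SCALE):
    """Unscale the latents, run the converter, then rescale the result."""
    unscaled = [x / scale for x in scaled_latents]
    converted = convert_fn(unscaled)
    return [x * scale for x in converted]

seen_by_converter = []

def recording_identity(values):
    # Hypothetical stand-in for the interposer: records its input, returns it unchanged.
    seen_by_converter.extend(values)
    return list(values)

scaled = [0.5, -1.25, 2.0]
round_trip = convert_scaled(scaled, recording_identity)
```

With an identity converter the round trip returns the original values, while the converter itself only ever sees the unscaled latents, which is the point of the suggested change.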
I made a simplified notebook to limit the number of potential issues. Unfortunately, I still have the same problem as in the previous notebook. I'm not very familiar with ComfyUI and I don't know exactly how you connect the two models, so it's hard for me to pinpoint what the issue is. Could you share some details on how you connected SD1.5 and the refiner, and/or have a look at the simplified notebook to see if there are any obvious issues? Could it be that the interposer was trained on a specific VAE and the default VAE for SD1.5 is not compatible? |
Your notebook asks me to log in, which I assume means it's set to private. Could you check the visibility settings? ComfyUI is just a node-based frontend to the LDM code, so internally it uses the same models etc. as diffusers, so that shouldn't matter in this case. Here is a quick and dirty example of the refiner being connected to the output of a 1.5 model. (Officially, this isn't quite correct, since you're supposed to return the noisy latent at around 80% denoise, then pass it to the refiner for the final 20%, but it works as an example here.) I don't think it's a VAE incompatibility issue either; the encoder part is the same for all v1.5 VAEs as far as I know. I can try to write some example code for how to use this with diffusers if you want. I still suspect it's a scaling issue. |
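To make the 80%/20% handoff mentioned above concrete, here is a rough sketch of how a step budget divides between the base model and the refiner. `split_steps` is a hypothetical helper, and the rounding is a simplification: diffusers derives the cutoff from the scheduler's timesteps, so the exact counts it uses may differ.

```python
def split_steps(num_inference_steps, handoff=0.8):
    """Return (base_steps, refiner_steps) for a handoff fraction in [0, 1]."""
    if not 0.0 <= handoff <= 1.0:
        raise ValueError("handoff must be between 0 and 1")
    # The base model denoises the first `handoff` share of the schedule,
    # the refiner finishes the remainder.
    base_steps = round(num_inference_steps * handoff)
    return base_steps, num_inference_steps - base_steps

base_steps, refiner_steps = split_steps(20, 0.8)
```

A handoff of 1.0 leaves no steps for the refiner, which corresponds to passing it a fully denoised latent.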
Not great, but it works. Oddly enough, the v1 pipe doesn't have a denoising_end option. Code below:
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionXLPipeline
# Load pipelines
pipe = StableDiffusionPipeline.from_single_file(
r"D:\Software\AI\sd-models\checkpoints\mix\Silicon29_dark.safetensors",
load_safety_checker=False, # takes forever to download
torch_dtype=torch.float16,
)
pipe.enable_xformers_memory_efficient_attention()
refiner = StableDiffusionXLPipeline.from_single_file(
r"D:\Software\AI\sd-models\checkpoints\sd\sdxl_v1.0_refiner.safetensors",
torch_dtype=torch.float16,
)
refiner.enable_xformers_memory_efficient_attention()
# Generate image on SDv1
pipe.to("cuda")
scaled_latent = pipe(
prompt,
height = 1024,
width = 1024,
output_type = "latent",
# denoising_end = 0.90, # doesn't work on v1
num_inference_steps=20,
).images[0]
del pipe # free VRAM
# Convert latent
latent = scaled_latent * (1/0.18215)
xl_latent = convert_latent(latent, "v1", "xl") # code for the interposer, from your notebook
xl_scaled_latent = xl_latent * 0.18215
# Finish with refiner
refiner.to("cuda")
image = refiner(
prompt = prompt,
image = xl_scaled_latent,
denoising_start = 0.90,
num_inference_steps=20,
).images[0]
del refiner # free VRAM
image.show() |
Awesome! Thanks for a thorough answer. It definitely seems like the issue was the scaling. With your code I got more acceptable output. I made the notebook public, so it should be possible to view it now. In the notebook I made a simple test, and I'm curious to get your opinion on whether this is the expected quality or not.
import requests
import torch
from PIL import Image
from io import BytesIO
import torchvision.transforms as transforms
from diffusers.image_processor import VaeImageProcessor
import gc
from diffusers import AutoencoderKL
generator = torch.manual_seed(0)
response = requests.get("https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg")
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))
# Processing
sd_vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae", variant="fp16", torch_dtype=torch.float16).to("cuda")
vaeImageProcessor = VaeImageProcessor(2 ** (len(sd_vae.config.block_out_channels) - 1))
init_pre_image = vaeImageProcessor.preprocess(init_image).to(dtype=torch.float16, device="cuda")
# Encode
sd_latents = sd_vae.encode(init_pre_image).latent_dist.sample(generator)
#sd_latents = sd_latents * (1/0.18215)
# Convert
sdxl_latents = convert(sd_latents, "v1", "xl", torch.float16, "cuda").to(dtype=torch.float32)
sdxl_latents = sdxl_latents * 0.18215
# Decode
sdxl_vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").to("cuda")
image_tensor = sdxl_vae.decode(sdxl_latents / sdxl_vae.config.scaling_factor, return_dict=False)[0]
# Post-processing
image = vaeImageProcessor.postprocess(image=image_tensor.detach())[0]
image
Input image: (not shown) Output image: (not shown)
As you can see, it has some artifacts. I could've done something wrong there though, as I inferred a lot of the steps from the diffusers library and it has a lot of stuff going on. |
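For reference, the `2 ** (len(block_out_channels) - 1)` expression in the test above is the VAE's spatial downscale factor. The sketch below works out the latent shape for the 768x512 test image. `latent_shape` is a hypothetical helper, and the default channel tuple is an assumption about the SD v1 VAE config rather than something stated in this thread.

```python
def latent_shape(width, height, block_out_channels=(128, 256, 512, 512), latent_channels=4):
    """Return the (channels, height, width) of the latent for a given image size."""
    # Each block after the first halves the resolution, giving a factor of 8 for four blocks.
    vae_scale_factor = 2 ** (len(block_out_channels) - 1)
    if width % vae_scale_factor or height % vae_scale_factor:
        raise ValueError("image size must be a multiple of the VAE scale factor")
    return (latent_channels, height // vae_scale_factor, width // vae_scale_factor)

shape = latent_shape(768, 512)
```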
That quality looks similar to what I get, maybe a bit worse, but that could be from you running it in FP16. It's a tiny model, so I'd recommend keeping the cast you had in the first notebook and using it with FP32, though I'm not sure how much that changes. It could also be a clamping difference on the output, hardware differences, etc. (I also noticed you were using the default XL VAE. I usually use this one since it lets me use FP16, though there's no noticeable difference in terms of visual quality.) Doing v1=>xl is a lot harder than xl=>v1 because the XL latent contains more information than the V1 latent, so I could never get it 100% perfect; I'm pretty sure it has to "make up" fake details to fit the format. For the generation example, I think the image degradation you're seeing might be from the fact that you're passing a fully denoised latent into the refiner. As I noted above, there's no denoising_end option on the v1 pipe. Again, I'm just guessing. You could also do a 3-stage thing where your initial image is v1 512x512, then upscale it and send it to v1 1024x1024 before sending it to the refiner. v1 doesn't like generating at resolutions that high natively. (xl=>v1 is simpler since v1 can handle the 1024x1024 image from xl nicely, as there it's basically img2img at a low denoise, meaning no weird hires repetition problems appear.) |
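As a rough illustration of why FP16 could matter here, the sketch below rounds the scaling factor and its inverse to half precision using only the standard library and measures the relative error. `to_fp16` and `relative_error` are hypothetical helper names. This shows the rounding of individual values only, not the accumulated error inside the interposer.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

def relative_error(value):
    """Relative error introduced by storing `value` in half precision."""
    return abs(to_fp16(value) - value) / abs(value)

scale = 0.18215
err_scale = relative_error(scale)
err_inverse = relative_error(1 / scale)
```

Each rounding is bounded by 2**-11 (about 0.05%) for values in the normal FP16 range, so a single operation is harmless, but a small network applies many such roundings in sequence.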
@city96 thank you for the great work. I hope there will be a new version with fewer artifacts; latent space expansion is a tough problem indeed. Here is a small question: can a LoRA or embedding be transferred the same way? @holwech could you interpolate between 70-30, 80-20, and 90-10 to see if the issue is "too much" or "not enough"? |
Hey! Very cool that you've made this! I tried to combine your converter with SD 1.x and the SDXL refiner, but so far I haven't had much luck. Is this something you've managed to do successfully?
Here is the code I've used to combine SD 1.x and the SDXL refiner:
https://colab.research.google.com/drive/1lUHih8KsSGuKFTfYBz0I-6FkMEU5GkdP?usp=sharing
Here is an example of what I get out from the refiner atm: