Hi, great work! I succeeded in reproducing the VAE adaptation from SD2's VAE to SDXL's, as discussed in Pixart-Sigma. However, the adaptation to SDv3's VAE is not successful. After 10k steps of finetuning on SAM, the sampled images are meaningless and chaotic (attached below), although the training loss looks pretty good.
(The first two images are from the adaptation experiment SD2's to SDXL's VAE, while the latter ones are from SD2's to SDv3's VAE.)
The key change in SDv3's VAE is that the latent channel count expands from 4 to 16, so the compressed latents preserve more detail and avoid unpleasant artifacts (e.g., small faces, text). To accommodate this change, I initialize the net with the official 'Pixart-alpha-256x256.pt' weights, except for the 'x_embed' layer and 'final_layer' (channels 4 -> 16 and 8 -> 32, respectively).
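For concreteness, the partial loading looks roughly like this (a minimal sketch; I assume the checkpoint keys are prefixed 'x_embedder' / 'final_layer' as in the official PixArt-alpha code, so adjust the prefixes if your module names differ):

```python
import torch
import torch.nn as nn

def load_pixart_weights_partial(model: nn.Module, ckpt_path: str,
                                skip_prefixes=('x_embedder', 'final_layer')):
    """Load pretrained PixArt weights into a 16-channel-latent model,
    leaving the reshaped input/output layers randomly initialized."""
    ckpt = torch.load(ckpt_path, map_location='cpu')
    state_dict = ckpt.get('state_dict', ckpt)  # unwrap if wrapped

    model_sd = model.state_dict()
    # Keep a tensor only if it lies outside the reshaped layers and its
    # shape still matches the new model (a safety net for other diffs).
    filtered = {k: v for k, v in state_dict.items()
                if not k.startswith(skip_prefixes)
                and k in model_sd and v.shape == model_sd[k].shape}

    missing, unexpected = model.load_state_dict(filtered, strict=False)
    return missing, unexpected  # 'missing' keys stay randomly initialized
```

After the call, `missing` should list only the re-initialized input/output layers; anything else appearing there would indicate a key-name mismatch rather than the intended partial init.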
Could anyone give me some hints? I'm really confused. Thanks, guys!