Stability AI Advances Image Generation with Stable Cascade

Stability AI, purveyor of the popular Stable Diffusion image generator, has introduced a completely new model called Stable Cascade. Now in preview, Stable Cascade uses a different architecture than Stable Diffusion’s SDXL, one the UK company’s researchers say is more efficient. Cascade builds on a compression architecture called Würstchen (German for “sausage”) that Stability began sharing in research papers early last year. Würstchen is a three-stage process that includes two-step encoding. It uses fewer parameters, meaning less training data, greater speed and reduced costs.

Würstchen first uses a VQ-GAN, or vector quantized generative adversarial network (Stage A), then employs a diffusion autoencoder to compress further (Stage B). Those two stages produce a significantly smaller, yet denser, latent space than traditional models use, which yields faster inference because the diffusion model operates in that smaller space. Stage C is the text-conditional stage, trained to link text prompts with the compressed latents.
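The stage-by-stage compression above can be sketched in terms of tensor shapes alone. A minimal illustration follows, assuming a 1024×1024 output, a 4× per-side Stage A (VQ-GAN) reduction and a 24×24 Stage C latent as described in the Würstchen work; the channel counts (4 and 16) are illustrative assumptions, not the model’s actual dimensions.

```python
# Shape-only sketch of the Würstchen / Stable Cascade pipeline.
# Encoding compresses A -> B; generation runs in reverse:
#   Stage C: text prompt -> tiny 24x24 latent (diffusion happens here)
#   Stage B: 24x24 latent -> Stage A latent
#   Stage A: VQ-GAN decoder -> full-resolution pixels
import numpy as np

H = W = 1024  # target image resolution

image = np.zeros((3, H, W))                      # RGB pixel space
stage_a_latent = np.zeros((4, H // 4, W // 4))   # VQ-GAN latent, 4x smaller per side
stage_c_latent = np.zeros((16, 24, 24))          # highly compressed latent for Stage C

# Overall spatial reduction from pixels to the Stage C latent (~42x per side)
per_side_factor = H / 24
print(f"~{per_side_factor:.0f}x spatial compression per side")
```

The point of the sketch is that the expensive, text-conditioned diffusion only ever touches the tiny Stage C latent; Stages B and A are comparatively cheap decoders.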

“Unlike the single large model used by Stable Diffusion, Stable Cascade employs a modular three-stage architecture,” a setup that “allows for significant improvements in training efficiency and customization,” explains ReadWrite, noting “the process begins with Stage C, which converts text prompts into compact 24×24 pixel latents. These latents are then decoded into full high-resolution images by Stages A and B.”
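A rough back-of-the-envelope calculation, based on the 24×24 figure quoted above, shows why working in that latent is so much cheaper than diffusing at pixel resolution: the number of spatial positions the diffusion model must process drops by three orders of magnitude.

```python
# Rough arithmetic only; actual compute savings also depend on channel
# counts, model width and step counts, which are not given here.
pixel_positions = 1024 * 1024   # positions in a full-resolution image
latent_positions = 24 * 24      # positions in the Stage C latent
ratio = pixel_positions / latent_positions
print(f"~{ratio:.0f}x fewer spatial positions")  # ~1820x
```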

Würstchen achieves performance comparable to other top models while using fewer resources, Stability researchers say in a paper under consideration for ICLR 2024, the 12th International Conference on Learning Representations, May 7-11 in Vienna. It is also the subject of a June 2023 paper on Hugging Face.

“By separating the text-to-image generation from the image decoding, the initial text-conditional model can be trained and fine-tuned much more efficiently,” VentureBeat writes. “According to Stability AI, fine-tuning Stage C alone provides a 16x cost reduction compared to fine-tuning an equivalently sized single Stable Diffusion model.”

“Stable Cascade is exceptionally easy to train and finetune on consumer hardware thanks to its three-stage approach,” Stability says in an announcement for the model, which is being released only for non-commercial use.

In addition to providing checkpoints and inference scripts, Stability AI on its GitHub page is sharing scripts for finetuning, ControlNet, and LoRA training “to enable users further to experiment with this new architecture.”

“Stable Cascade can generate photos and give variations of the exact image it created, or try to increase an existing picture’s resolution,” explains The Verge, adding that “other text-to-image editing features include inpainting and outpainting, where the model will edit only a specific part of the image, as well as canny edge, where users can make a new photo just by using the edges of an existing picture.”
