Steve Kinney

Text-to-Image Generation (Stable Diffusion)

Key Takeaways

  • Stable Diffusion denoises latents conditioned on text; VAE decodes to pixels
  • Classifier-free guidance balances adherence to prompts vs realism (tune CFG)
  • Schedulers trade speed/detail; step count interacts with scheduler choice
  • Use valid resolutions (multiples of 64), craft positive/negative prompts, and set seeds
  • Optimize memory: efficient attention, slicing, channels-last, and mixed precision

Overview

Stable Diffusion generates images from text using a latent diffusion process guided by a text encoder and a denoising UNet. Unlike classification models that analyze existing images, diffusion models create new images from noise.

Think of Stable Diffusion as a super-smart magic coloring book: you whisper what picture you want—say, “a purple dinosaur eating ice cream on the moon”—and the coloring book starts with a page covered in silly scribbles (just random dots). Then, almost like an eraser and crayon working together, it keeps gently wiping away the messy dots and adding the right colors and shapes bit by bit, listening to your words the whole time, until the random mess turns into the exact picture you asked for.

Notebook

View the companion notebook: Image Generation

Setting Up Stable Diffusion

First, install and import the necessary dependencies:

# Install dependencies
!pip install --quiet "diffusers[torch]" transformers accelerate safetensors

import torch
from diffusers import AutoPipelineForText2Image, DPMSolverMultistepScheduler

# Set up device and generator for reproducibility
device = "cuda" if torch.cuda.is_available() else "cpu"
generator = torch.Generator(device).manual_seed(1337)

Generating Your First Image

Load a Stable Diffusion model and generate an image:

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"

pipeline = AutoPipelineForText2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    use_safetensors=True,
    low_cpu_mem_usage=True
).to(device)

# Swap to a faster scheduler
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
    pipeline.scheduler.config
)

# Memory-saving optimizations
pipeline.enable_attention_slicing()    # Compute attention in smaller chunks
pipeline.enable_vae_slicing()          # Same idea for the VAE decoder
pipeline.enable_model_cpu_offload()    # Move idle components to CPU between uses

prompt = "A pygmy hippo in New York City watching Instagram on an iPhone"

image = pipeline(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]

image

Architecture Components

Stable Diffusion consists of four key components working together:

Text Encoder (CLIP) converts your prompt into embeddings that condition the denoiser. It understands the semantic meaning of your words.

UNet Denoiser is the sculptor that repeatedly predicts and removes noise at each diffusion step. It’s a neural network trained to gradually transform random noise into meaningful images based on text guidance.

VAE (Variational Autoencoder) translates between pixel space and latent space. The encoder compresses images to smaller latent representations for efficient processing, while the decoder reconstructs full-resolution images from latents.

Scheduler controls the noise schedule and update rule. Different schedulers (DDIM, Euler, DPM++) offer trade-offs between speed and quality.
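
Once a pipeline is loaded (as in the setup above), you can peek at these pieces directly. This is just a sanity check on a loaded pipeline, not a required step:

# Inspect the components of the loaded pipeline
print(type(pipeline.text_encoder).__name__)   # CLIP text encoder
print(type(pipeline.unet).__name__)           # UNet denoiser
print(type(pipeline.vae).__name__)            # Variational autoencoder
print(type(pipeline.scheduler).__name__)      # Noise scheduler

# How much smaller latents are than pixels (8 for SD 1.5: 512 pixels -> 64 latents)
print(pipeline.vae_scale_factor)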

Understanding Diffusion Models

Diffusion models create images by gradually removing noise from random patterns:

Forward Process: Take a clear image and gradually add noise until it becomes pure random noise.

Reverse Process: Train a model to reverse this—start with noise and gradually remove it to create an image.

The U-Net is called that because of its U-shaped architecture: it has a downsampling path (analyzes the noisy image at multiple scales), a bottleneck (understands global context), and an upsampling path (rebuilds the image, refining it step by step).
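
To make the forward process concrete, here is a small sketch that corrupts a stand-in latent at a few timesteps using a scheduler's add_noise method. The "clean" latent is just random data for illustration, and the sketch assumes the pipeline from the setup above is available:

import torch
from diffusers import DDPMScheduler

# A DDPM scheduler built from the pipeline's config, used here only to add noise
noise_scheduler = DDPMScheduler.from_config(pipeline.scheduler.config)

# Stand-in for a clean 64x64 latent (what the VAE produces for a 512x512 image)
clean_latent = torch.randn(1, 4, 64, 64)
noise = torch.randn_like(clean_latent)

for t in [0, 250, 500, 999]:
    # add_noise mixes the clean latent and pure noise according to the schedule
    noisy = noise_scheduler.add_noise(clean_latent, noise, torch.tensor([t]))
    signal = noise_scheduler.alphas_cumprod[t].sqrt().item()
    print(f"t={t}: share of original signal ~ {signal:.3f}, latent std ~ {noisy.std().item():.2f}")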

Key Parameters

Quality versus Speed

Control the trade-off between generation quality and speed:

# Fewer steps = faster, lower quality
quick_image = pipeline(prompt, num_inference_steps=10).images[0]

# More steps = slower, higher quality
quality_image = pipeline(prompt, num_inference_steps=50).images[0]

# Guidance scale controls prompt adherence
creative_image = pipeline(prompt, guidance_scale=3.0).images[0]   # More creative
strict_image = pipeline(prompt, guidance_scale=15.0).images[0]    # Sticks closer to the prompt

  • num_inference_steps: 10-50 steps; fewer steps is faster, more steps usually adds detail
  • guidance_scale: 1-15; higher means more faithful to the text but potentially stiffer
  • width/height: Stick to multiples of 64; larger sizes need more VRAM

Understanding the Generator

The generator provides reproducible randomness:

# Creates a random number generator on your GPU/CPU
generator = torch.Generator(device).manual_seed(1337)

# Same seed = same initial noise = same final image
# Different seed = different image

Pass this generator to the pipeline for identical results across runs. Drop the seed or change it for variety.

Prompt Engineering

The quality of generated images heavily depends on how you write prompts. A good prompt typically includes:

  1. Subject: What/who is in the image
  2. Style: Artistic style, medium, or technique
  3. Quality modifiers: Words that enhance quality
  4. Lighting: How the scene is lit
  5. Composition: Camera angle, framing
  6. Details: Specific attributes or characteristics

# Basic prompt
basic_prompt = "a cat"

# Detailed prompt
detailed_prompt = """
Ultra-realistic cinematic photo of a fluffy Persian cat sitting on a vintage
velvet armchair, golden hour lighting streaming through Victorian windows,
85mm lens, shallow depth of field, shot on Kodak Portra 400, 8K resolution
"""

# Generate with both
basic_image = pipeline(basic_prompt).images[0]
detailed_image = pipeline(detailed_prompt).images[0]

Negative Prompts

Negative prompts tell the model what to avoid. In classifier-free guidance, the model denoises twice at every step—once with your positive prompt and once with an unconditional prompt. The negative prompt replaces that empty unconditional prompt, actively steering away from unwanted concepts:

model_id = "stabilityai/stable-diffusion-xl-base-1.0"

pipeline = AutoPipelineForText2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    use_safetensors=True,
).to(device)

pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
    pipeline.scheduler.config
)
pipeline.enable_attention_slicing()

# Positive prompt - what you want
prompt = (
    "Ultra-realistic cinematic photo of a pygmy hippo jaywalking through "
    "a neon-lit downtown street at dusk, rain-soaked asphalt reflecting "
    "colored lights, 85mm lens, f/1.4 bokeh, dramatic rim lighting, "
    "shot on Kodak Portra 800, 8K resolution"
)

# Negative prompt - what to avoid
negative_prompt = (
    "blurry, grainy, lowres, overexposed, watermark, text, logo, "
    "extra limbs, cars, people, distorted anatomy, jpeg artifacts"
)

image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=28,
    guidance_scale=7.0,
    width=1024,   # SDXL base is trained around 1024x1024; 512x512 tends to look off
    height=1024,
    generator=generator,
).images[0]

The mathematical effect is that the model actively moves away from concepts in the negative prompt, reducing artifacts and unwanted elements.

Schedulers and Their Trade-offs

Different schedulers offer different benefits:

from diffusers import (
    DDIMScheduler,
    EulerDiscreteScheduler,
    DPMSolverMultistepScheduler,
    HeunDiscreteScheduler
)

# DPMSolverMultistep - Fast and high quality
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
    pipeline.scheduler.config
)

# Euler - Good for artistic styles
pipeline.scheduler = EulerDiscreteScheduler.from_config(
    pipeline.scheduler.config
)

# DDIM - Deterministic, good for reproducibility
pipeline.scheduler = DDIMScheduler.from_config(
    pipeline.scheduler.config
)

DPMSolverMultistep is like having a smart instruction booklet that finds shortcuts, making pictures appear faster without losing quality. Euler/Euler A often gives crisp details. DDIM can be faster with fewer steps. DPM++ variants provide good detail and stability.

Memory Optimizations

Attention Slicing

Attention slicing processes the image in smaller chunks instead of all at once:

# Instead of processing the whole image at once
pipeline.enable_attention_slicing()

# The model looks at one small chunk at a time
# Like coloring one square of a grid at a time

This keeps your GPU from running out of memory while still creating the full picture.
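
If you want to see the effect for yourself, a rough way to compare peak VRAM with and without slicing (assuming a CUDA device and the pipeline loaded above) is:

import torch

def peak_vram_gib(pipe, prompt):
    # Track the peak GPU memory used by a single generation
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    pipe(prompt, num_inference_steps=10).images[0]
    return torch.cuda.max_memory_allocated() / 1024**3

pipeline.disable_attention_slicing()
print(f"no slicing:   {peak_vram_gib(pipeline, prompt):.2f} GiB")

pipeline.enable_attention_slicing()
print(f"with slicing: {peak_vram_gib(pipeline, prompt):.2f} GiB")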

Additional Optimizations

# VAE slicing - process VAE in chunks
pipeline.enable_vae_slicing()

# CPU offloading - move idle components to CPU between uses
pipeline.enable_model_cpu_offload()

# Channels-last memory format for CUDA
pipeline.unet.to(memory_format=torch.channels_last)

# Mixed precision with autocast
with torch.autocast("cuda"):
    image = pipeline(prompt).images[0]

Model Variants

Different models offer different strengths:

  • stable-diffusion-v1-5/stable-diffusion-v1-5: Classic SD 1.5—fast, versatile
  • stabilityai/stable-diffusion-2-1: SD 2.1—better faces, higher resolution
  • stabilityai/stable-diffusion-xl-base-1.0: SDXL—highest quality, slower
  • prompthero/openjourney: Midjourney style
  • hakurei/waifu-diffusion: Anime style
  • nitrosocke/Ghibli-Diffusion: Studio Ghibli style

# Load a specialized model
anime_pipeline = AutoPipelineForText2Image.from_pretrained(
    "hakurei/waifu-diffusion",
    torch_dtype=torch.float16
).to(device)

anime_image = anime_pipeline(
    "anime girl with blue hair in cherry blossom garden",
    num_inference_steps=30
).images[0]

Classifier-Free Guidance (CFG)

CFG amplifies the difference between conditional and unconditional predictions:

predicted_noise = uncond + s * (cond - uncond)

Where s is the guidance scale. Higher values make the model follow your prompt more strictly but can lead to oversaturation or artifacts. Moderate CFG (5-9 for SD 1.5, different ranges for SDXL) balances fidelity and naturalness.
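
Here is a toy illustration of that formula with made-up tensors (no real model involved), just to show how the guidance scale stretches the prediction along the prompt direction:

import torch

uncond = torch.randn(1, 4, 64, 64)   # noise predicted without the prompt
cond = torch.randn(1, 4, 64, 64)     # noise predicted with the prompt

for s in [1.0, 7.5, 15.0]:
    guided = uncond + s * (cond - uncond)
    # Distance from the unconditional prediction grows linearly with s
    print(f"scale {s:>4}: shift = {(guided - uncond).norm().item():.1f}")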

Resolution and Aspect Ratios

Latents operate at a much lower resolution than pixels: for SD 1.5 the VAE downscales by 8×, so a 512×512 image becomes a 64×64 latent. Width and height should be multiples of 64 so the latent divides cleanly through the UNet's downsampling stages:

# Valid resolutions for SD 1.5 (multiples of 64)
resolutions = [
    (512, 512),    # Square
    (768, 512),    # Landscape
    (512, 768),    # Portrait
    (640, 512),    # Wide
]

for width, height in resolutions:
    image = pipeline(
        prompt,
        width=width,
        height=height
    ).images[0]

Start with moderate sizes (512×512), then upscale with dedicated upscalers if needed.
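
If you compute sizes dynamically, a small helper like the hypothetical snap_to_multiple below (not part of Diffusers) keeps dimensions on the 64-pixel grid:

def snap_to_multiple(value, multiple=64):
    # Round down to the nearest valid dimension, never below one multiple
    return max(multiple, (value // multiple) * multiple)

width, height = snap_to_multiple(700), snap_to_multiple(530)
print(width, height)  # 640 512

image = pipeline(prompt, width=width, height=height).images[0]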

Safety and Ethics

Use safety checkers and content filters responsibly:

# SD 1.5-style pipelines include a safety checker by default
# To disable it (use responsibly):
pipeline.safety_checker = None

# Or create custom filters
def custom_safety_check(images, **kwargs):
    # Your safety logic here
    return images, [False] * len(images)

pipeline.safety_checker = custom_safety_check

Respect licensing and usage policies. Avoid copying living artists’ named styles unless allowed; prefer generic descriptors. Be aware of potential biases in generated content.

Hardware Considerations

CUDA Settings

Enable optimizations for better performance:

import torch
torch.backends.cudnn.benchmark = True
# Use autocast for fp16/bf16 when running custom inference loops
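
For example, on GPUs that support bfloat16 you might wrap generation in autocast like this. This is a sketch; the fp16 pipelines above already run in half precision, so autocast mainly matters when you write your own inference loops:

import torch

# Prefer bf16 where the hardware supports it, otherwise fall back to fp16
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

with torch.autocast("cuda", dtype=dtype):
    image = pipeline(prompt, num_inference_steps=30).images[0]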

Batch Generation

Generate multiple images efficiently:

# Generate batch of images
batch_size = 4
prompts = [prompt] * batch_size

images = pipeline(
    prompts,
    num_inference_steps=30,
    generator=[torch.Generator(device).manual_seed(i) for i in range(batch_size)]
).images

Reproducibility and Seeds

Set seeds for reproducible results:

# Fixed seed for reproducibility
seed = 42
generator = torch.Generator(device).manual_seed(seed)

# Generate multiple variations
seeds = [42, 123, 456, 789]
images = []

for seed in seeds:
    gen = torch.Generator(device).manual_seed(seed)
    image = pipeline(prompt, generator=gen).images[0]
    images.append(image)

Troubleshooting Common Issues

Over-saturation

If images look oversaturated or have harsh, burned-in contrast (a common symptom of high guidance), reduce the guidance scale or try a different scheduler.

Prompt Ignored

If the model ignores parts of your prompt, try increasing guidance scale, rephrasing with clearer concepts, or using a different model checkpoint.

Repeating Patterns

For artifacts or repeating patterns, try different seeds, add negative prompts for common artifacts, or adjust the number of inference steps.

Advanced Techniques

Prompt Weighting

Some interfaces support prompt weighting, though base Diffusers doesn’t directly:

# Conceptual - requires custom implementation
weighted_prompt = "(beautiful:1.2) sunset over (calm:0.8) ocean"
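
One way to approximate weighting without extra libraries is to encode two phrasings of a prompt and blend their embeddings before calling the pipeline. This is a rough sketch and assumes an SD 1.5-style pipeline with a single CLIP text encoder (SDXL uses two encoders and needs pooled embeddings as well):

import torch

def encode(pipe, text):
    # Tokenize and run the CLIP text encoder, as the pipeline does internally
    tokens = pipe.tokenizer(
        text,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]

base = encode(pipeline, "sunset over a calm ocean")
emphasis = encode(pipeline, "beautiful dramatic sunset over a calm ocean")

# Lean 70% toward the emphasized phrasing
prompt_embeds = 0.3 * base + 0.7 * emphasis

image = pipeline(prompt_embeds=prompt_embeds, num_inference_steps=30).images[0]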

Checkpoint Merging

Combine strengths of different models:

# Conceptual - requires model merging tools
merged_model = merge_models(
    model_a="stable-diffusion-v1-5",
    model_b="custom-style-model",
    alpha=0.5  # Blend ratio
)
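
Dedicated merging tools handle naming and precision details, but the core idea is plain linear interpolation of matching weights. A minimal sketch, assuming two already-loaded pipelines (pipeline_a and pipeline_b here are placeholders) whose UNets share the same architecture:

def merge_unets(pipe_a, pipe_b, alpha=0.5):
    # Blend every matching UNet parameter: alpha * A + (1 - alpha) * B
    state_a = pipe_a.unet.state_dict()
    state_b = pipe_b.unet.state_dict()
    merged = {
        key: alpha * state_a[key] + (1 - alpha) * state_b[key]
        for key in state_a
    }
    pipe_a.unet.load_state_dict(merged)
    return pipe_a

# merged_pipeline = merge_unets(pipeline_a, pipeline_b, alpha=0.5)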

Conclusion

Stable Diffusion democratizes image generation, letting anyone create stunning visuals from text descriptions. Success depends on understanding the interplay between prompts, parameters, and model capabilities. Start with proven models and default settings, then experiment with prompts, schedulers, and parameters to find what works for your use case. Remember that generation is probabilistic—embrace the variability and use seeds when you need consistency.
