Getting Started with Stable Diffusion: A Beginner's Guide

I recently came across this excellent little tutorial on how to use Stable Diffusion and decided to make my own version of it. But I wanted to offer some starter code as a class that encapsulates all the basics of how to use Stable Diffusion XL.

What is Stable Diffusion?

Stable Diffusion is a text-to-image (and text-to-video) generation model from Stability AI. It is an open-source model that competes with DALL-E and Midjourney. Stability AI has made the model publicly available through the Hugging Face ecosystem. You can try it out for free via this demo hosted on Hugging Face’s website.

The Quickest Way to Get Started with Stable Diffusion

The absolute easiest way to get started with Stable Diffusion is to use the AutoPipelineForText2Image class from Hugging Face. (Documentation for this class can be found here.)

This class is part of Hugging Face’s diffusers package. (See the Diffusers organization here; official documentation, including tutorials, can be found here.)

You’ll first need to install the diffusers package like this:

pip install diffusers

Or, if you are using Colab, you might use this instead to cover your bases:

!pip install diffusers accelerate mediapy

For this demo I’m working in Google Colab (final version found here), which already has PyTorch installed. The code below uses PyTorch datatypes (e.g., torch.float16), so you’ll want PyTorch installed if you’re running it elsewhere. If you need instructions on how to install PyTorch, you can follow the official instructions here.
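For reference, a basic local install is usually just the command below; the right command for GPU support varies by platform, so check the official PyTorch instructions:

pip install torch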

Here is some very simple code that will work out of the box. You can find this code in my GitHub repo here:

from diffusers import AutoPipelineForText2Image
import os
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
prompt = "Superman to the rescue!"

image = pipeline(prompt, num_inference_steps=25).images[0]
# save the image to the output_images folder (create it if it doesn't already exist)
os.makedirs("output_images", exist_ok=True)
image.save("output_images/output.jpg")

I modified that slightly for Colab to use the mediapy library so that I can show the output inside the notebook rather than saving it off to a folder.
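Here is a minimal sketch of that tweak (mediapy was installed in the Colab pip command above; converting the PIL image to a NumPy array keeps the display call simple):

import mediapy as media
import numpy as np

# display the generated image inline in the notebook instead of writing it to disk
media.show_image(np.array(image))

Here is my final result: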

An AI generated image of Superman; he is in the act of flying and has a speech bubble with indecipherable text.

Prompt: Superman to the rescue!

Not bad, eh?

Stable Diffusion XL: An Open-Source Alternative to Professional Text-to-Image Generation

Of course, the result isn’t great because we’re not using the latest Stable Diffusion model: Stable Diffusion XL. So, I’ve written a class for you (find the final code here) that encapsulates all the logic for using both Stable Diffusion XL and the Stable Diffusion XL Refiner. The Refiner is a model that improves the output of Stable Diffusion XL and makes it comparable to professional models, at the expense of taking longer, of course.

Here is my code in full before I explain the most important parts. You can find the full code on my GitHub here. It can also be found in the Google Colab for this post if you want to try it out there with a GPU.

import random
import sys
import torch
import os
import datetime
from diffusers import DiffusionPipeline
from accelerate import Accelerator
from typing import List, Union, Dict, Optional


class StableDiffusionXLPipeline:
    def __init__(self, use_refiner: bool = True, height: int = 768, width: int = 768,
                 guidance_scale: float = 5.0, num_images_per_prompt: int = 1, output_dir: str = 'output_images'):
        self._use_refiner = use_refiner
        self._height = height
        self._width = width
        self._guidance_scale = guidance_scale
        self._num_images_per_prompt = num_images_per_prompt
        self._output_dir = output_dir
        self.torch_dtype = None
        self.accelerator = None
        self.device = None
        self.refiner = None
        self.pipe = None
        self.setup_pipeline()

    @property
    def use_refiner(self):
        return self._use_refiner

    @use_refiner.setter
    def use_refiner(self, value: bool):
        self._use_refiner = value
        self.setup_pipeline()

    @property
    def height(self):
        return self._height

    @height.setter
    def height(self, value: int):
        self._height = value

    @property
    def width(self):
        return self._width

    @width.setter
    def width(self, value: int):
        self._width = value

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @guidance_scale.setter
    def guidance_scale(self, value: float):
        self._guidance_scale = value

    @property
    def num_images_per_prompt(self):
        return self._num_images_per_prompt

    @num_images_per_prompt.setter
    def num_images_per_prompt(self, value: int):
        self._num_images_per_prompt = value

    @property
    def output_dir(self):
        return self._output_dir

    @output_dir.setter
    def output_dir(self, value: str):
        self._output_dir = value

    def setup_pipeline(self):
        self.torch_dtype = torch.float16
        self.accelerator = Accelerator()
        self.device = self.accelerator.device

        self.pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=self.torch_dtype,
            use_safetensors=True,
            variant="fp16"
        )

        if self._use_refiner:
            self.refiner = DiffusionPipeline.from_pretrained(
                "stabilityai/stable-diffusion-xl-refiner-1.0",
                text_encoder_2=self.pipe.text_encoder_2,
                vae=self.pipe.vae,
                torch_dtype=self.torch_dtype,
                use_safetensors=True,
                variant="fp16"
            ).to(self.device)
            self.pipe.enable_model_cpu_offload()
        else:
            self.pipe = self.pipe.to(self.device)
            self.refiner = None

    def process_prompts(self, prompts: List[Union[str, Dict]]):
        for prompt in prompts:
            if isinstance(prompt, str):
                self.generate_image(prompt)
            elif isinstance(prompt, dict):
                self.generate_image(**prompt)

    def generate_image(self, prompt: str, prompt_2: Optional[str] = None,
                       negative_prompt: Optional[str] = None, negative_prompt_2: Optional[str] = None,
                       seed: Optional[int] = None):
        if seed is None:
            seed = random.randint(0, sys.maxsize)

        generator = torch.Generator(self.device).manual_seed(seed)

        print(f"Prompt:\t{prompt}")
        print(f"Seed:\t{seed}")

        output_type = "latent" if self._use_refiner else "pil"
        images = self.pipe(
            prompt=prompt,
            prompt_2=prompt_2,
            negative_prompt=negative_prompt,
            negative_prompt_2=negative_prompt_2,
            output_type=output_type,
            generator=generator,
            height=self._height,
            width=self._width,
            guidance_scale=self._guidance_scale,
            num_images_per_prompt=self._num_images_per_prompt
        ).images

        if self._use_refiner:
            images = self.refiner(prompt=prompt, image=images).images

        self._save_images(images, prompt)
        return images

    def _save_images(self, images, prompt):
        os.makedirs(self._output_dir, exist_ok=True)
        timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')

        for i, image in enumerate(images):
            image_path = os.path.join(self._output_dir, f"output_{timestamp}_{i}.jpg")
            print(f"Saving image to {image_path}")
            image.save(image_path)

        print(f"Image complete. {prompt}")


# Usage example
if __name__ == "__main__":
    pipeline = StableDiffusionXLPipeline(use_refiner=False,
                                         num_images_per_prompt=1,
                                         output_dir="my_images")

    # Generate a single image using a string prompt
    pipeline.generate_image("A fairy princess and her majestic dragon. Photorealistic.")

    # Change some properties
    pipeline.use_refiner = True

    # Generate an image with more parameters
    pipeline.generate_image(
        prompt="A cyberpunk cityscape at night",
        negative_prompt="daytime, bright, sunny",
        seed=42
    )

    # Process multiple prompts
    prompts_to_process = [
        "A serene lake surrounded by mountains",
        {
            "prompt": "An alien landscape with two moons",
            "prompt_2": "Highly detailed, science fiction art",
            "negative_prompt": "Earth-like, familiar"
        }
    ]

    pipeline.process_prompts(prompts_to_process)

Output

Here is the output from this code based on the various prompts I tried out:

An AI generated image of a fairy princess standing next to a dragon.

A fairy princess and her majestic dragon. Photorealistic. (Note: No refiner)

An AI generated image of a cyberpunk cityscape, the overall color scheme is in purples, oranges, and blacks/greys.

A cyberpunk cityscape at night; negative_prompt: daytime, bright, sunny

An AI generated image of a serene lake surrounded by mountains. The lake takes up the bottom half of the image, and appears to continue on through a canyon between the mountains. The mountains and sky are reflected in the lake.

A serene lake surrounded by mountains

An AI generated image of an alien landscape and various alien buildings. The alien buildings are tall towers with many ramparts and saucer-shaped tops. The landscape is dusty orange with a few trees and dust clouds in the background. In the sky is a planet that looks similar to Mars, either one of the moons or the parent planet of the world in the image.

An alien landscape with two moons; prompt_2: Highly detailed, science fiction art; negative_prompt: Earth-like, familiar

What You Need to Know

So how does this all work?

The basic class just stores a number of important properties and values, such as whether you are going to use the refiner, the height and width of the images it will create, and the number of images it will create per prompt.

The class also contains code that lets you pass it a list of prompts to run, where each entry is either a simple string describing what you want or a dictionary of prompts and negative prompts. (A negative prompt describes what you want Stable Diffusion to NOT include in the image.)
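For example, a mixed list like this (the prompts here are just illustrative) works with the pipeline object from the usage example above:

prompts = [
    "A watercolor painting of a lighthouse at dawn",
    {"prompt": "A cozy cabin in a snowy forest",
     "negative_prompt": "people, text, watermark"},
]
pipeline.process_prompts(prompts)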

But let’s get down to business. How do we actually use the Stable Diffusion XL model? The main code you need to pay attention to is how we set up the pipeline:

    def setup_pipeline(self):
        self.torch_dtype = torch.float16
        self.accelerator = Accelerator()
        self.device = self.accelerator.device

        self.pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=self.torch_dtype,
            use_safetensors=True,
            variant="fp16"
        )

        if self._use_refiner:
            self.refiner = DiffusionPipeline.from_pretrained(
                "stabilityai/stable-diffusion-xl-refiner-1.0",
                text_encoder_2=self.pipe.text_encoder_2,
                vae=self.pipe.vae,
                torch_dtype=self.torch_dtype,
                use_safetensors=True,
                variant="fp16"
            ).to(self.device)
            self.pipe.enable_model_cpu_offload()
        else:
            self.pipe = self.pipe.to(self.device)
            self.refiner = None

The pipeline is set up with this call:

    self.pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=self.torch_dtype,
        use_safetensors=True,
        variant="fp16"
    )

I’m specifying that I’m using the stable-diffusion-xl-base-1.0 model. If we’re using the refiner, we also make this call:

            self.refiner = DiffusionPipeline.from_pretrained(
                "stabilityai/stable-diffusion-xl-refiner-1.0",
                text_encoder_2=self.pipe.text_encoder_2,
                vae=self.pipe.vae,
                torch_dtype=self.torch_dtype,
                use_safetensors=True,
                variant="fp16"
            ).to(self.device)
            self.pipe.enable_model_cpu_offload()

This sets up the stable-diffusion-xl-refiner-1.0 pipeline for later. We use both objects when generating an image:

    def generate_image(self, prompt: str, prompt_2: Optional[str] = None,
                       negative_prompt: Optional[str] = None, negative_prompt_2: Optional[str] = None,
                       seed: Optional[int] = None):
        if seed is None:
            seed = random.randint(0, sys.maxsize)

        generator = torch.Generator(self.device).manual_seed(seed)

        print(f"Prompt:\t{prompt}")
        print(f"Seed:\t{seed}")

        output_type = "latent" if self._use_refiner else "pil"
        images = self.pipe(
            prompt=prompt,
            prompt_2=prompt_2,
            negative_prompt=negative_prompt,
            negative_prompt_2=negative_prompt_2,
            output_type=output_type,
            generator=generator,
            height=self._height,
            width=self._width,
            guidance_scale=self._guidance_scale,
            num_images_per_prompt=self._num_images_per_prompt
        ).images

        if self._use_refiner:
            images = self.refiner(prompt=prompt, image=images).images

        self._save_images(images, prompt)
        return images

Essentially, we call the pipeline we previously saved off and pass in the various parameters we want for the image, including the prompts.

One thing that requires some explanation is ‘guidance scale’. This parameter determines how closely the model tries to follow the text prompt versus how much creative freedom it takes as it generates the image: higher values stick closer to the prompt, while lower values give the model more latitude. (See here for further explanation.) Experiment with different values to find the results that work best for you.
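For example, here’s a quick sketch of how you might compare a few values with the class above (the specific values are just illustrative starting points):

# try the same prompt and seed with low, medium, and high guidance
for scale in (3.0, 7.5, 12.0):
    pipeline.guidance_scale = scale
    pipeline.generate_image("A cyberpunk cityscape at night", seed=42)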

Once the pipeline generates images, we feed them to the refiner object, which improves the quality of the images further. All pretty simple, right?

Stable Diffusion is an ocean and we’re just playing in the shallows, but this should be enough to get you started with adding Stable Diffusion text-to-image functionality to your applications.
