Getting Started with Stable Diffusion: A Beginner's Guide
- By Bruce Nielson
- ML & AI Specialist
I recently came across this excellent little tutorial on how to use Stable Diffusion and decided to make my own version of it. But I wanted to offer some starter code in the form of a class that encapsulates all the basics of how to use Stable Diffusion XL.
What is Stable Diffusion?
Stable Diffusion is a text-to-image (and text-to-video) generation model from Stability AI. It is an open-source model that competes with DALL-E and Midjourney. Stability AI has made the model publicly available through the Hugging Face ecosystem. You can try it out for free via this demo hosted on Hugging Face’s website.
- You can find the Stability AI organization on Hugging Face here: https://huggingface.co/stabilityai
- The latest and greatest Stable Diffusion model (XL) is found here: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
- We will also be playing with the Refiner model found here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0
The Quickest Way to Get Started with Stable Diffusion
The absolute easiest way to get started with Stable Diffusion is to use the AutoPipelineForText2Image class from Hugging Face. (Documentation for it can be found here.)
This class is part of the diffusers package from Hugging Face. (See the Diffusers organization here; official documentation, including tutorials, can be found here.)
You’ll first need to install the diffusers package like this:
pip install diffusers
Or, if you are using Colab, you might use this instead to cover your bases:
!pip install diffusers accelerate mediapy
For this demo I’m working in Google Colab (final version found here), which already has PyTorch installed. The code below uses PyTorch’s float16 datatype and CUDA support, so make sure PyTorch is available if you are running it locally. If you need instructions on how to install PyTorch, you can follow the official instructions here.
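If you want to confirm that PyTorch is installed and can actually see a GPU before running anything (in Colab, make sure a GPU runtime is selected), a quick sanity check like the following works in Colab or locally. Nothing here is specific to this tutorial:

import torch

print(torch.__version__)          # Installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA GPU is visible to PyTorch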
Here is some very simple code that will work out of the box. You can find this code on my GitHub here:
from diffusers import AutoPipelineForText2Image
import torch
import os

# Load the Stable Diffusion 1.5 pipeline in half precision and move it to the GPU
pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

prompt = "Superman to the rescue!"
image = pipeline(prompt, num_inference_steps=25).images[0]

# Save the image off to the output_images folder (create it first so save() doesn't fail)
os.makedirs("output_images", exist_ok=True)
image.save("output_images/output.jpg")
I modified that slightly for Colab to use the mediapy library so that I can show the output inside the notebook rather than saving it off to a folder.
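If you want to do something similar, a minimal sketch of that tweak might look like the following (I’m assuming mediapy’s show_image function and converting the PIL image to a NumPy array first; your exact display code may differ):

import mediapy as media
import numpy as np

# Display the generated image inline in the notebook instead of writing it to disk
media.show_image(np.asarray(image))

Here is my final result: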
Prompt: Superman to the rescue!
Not bad, eh?
Stable Diffusion XL: An Open-Source Alternative to Professional Text-to-Image Generation
Of course, the result isn’t great because we’re not utilizing the latest Stable Diffusion model: Stable Diffusion XL. So I’ve written a class for you (find the final code here) that encapsulates all the logic for using both Stable Diffusion XL and the Stable Diffusion XL Refiner. The Refiner is a model that improves the output of Stable Diffusion and makes it comparable to professional models – at the expense of taking longer, of course.
Here is my code in full before I explain the most important parts. You can find the full code on my GitHub here. It can also be found in the Google Colab for this post if you want to try it out there with a GPU.
import random
import sys
import torch
import os
import datetime
from diffusers import DiffusionPipeline
from accelerate import Accelerator
from typing import List, Union, Dict, Optional


class StableDiffusionXLPipeline:
    def __init__(self, use_refiner: bool = True, height: int = 768, width: int = 768,
                 guidance_scale: float = 5.0, num_images_per_prompt: int = 1, output_dir: str = 'output_images'):
        self._use_refiner = use_refiner
        self._height = height
        self._width = width
        self._guidance_scale = guidance_scale
        self._num_images_per_prompt = num_images_per_prompt
        self._output_dir = output_dir
        self.torch_dtype = None
        self.accelerator = None
        self.device = None
        self.refiner = None
        self.pipe = None
        self.setup_pipeline()

    @property
    def use_refiner(self):
        return self._use_refiner

    @use_refiner.setter
    def use_refiner(self, value: bool):
        self._use_refiner = value
        self.setup_pipeline()

    @property
    def height(self):
        return self._height

    @height.setter
    def height(self, value: int):
        self._height = value

    @property
    def width(self):
        return self._width

    @width.setter
    def width(self, value: int):
        self._width = value

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @guidance_scale.setter
    def guidance_scale(self, value: float):
        self._guidance_scale = value

    @property
    def num_images_per_prompt(self):
        return self._num_images_per_prompt

    @num_images_per_prompt.setter
    def num_images_per_prompt(self, value: int):
        self._num_images_per_prompt = value

    @property
    def output_dir(self):
        return self._output_dir

    @output_dir.setter
    def output_dir(self, value: str):
        self._output_dir = value

    def setup_pipeline(self):
        self.torch_dtype = torch.float16
        self.accelerator = Accelerator()
        self.device = self.accelerator.device
        self.pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=self.torch_dtype,
            use_safetensors=True,
            variant="fp16"
        )
        if self._use_refiner:
            self.refiner = DiffusionPipeline.from_pretrained(
                "stabilityai/stable-diffusion-xl-refiner-1.0",
                text_encoder_2=self.pipe.text_encoder_2,
                vae=self.pipe.vae,
                torch_dtype=self.torch_dtype,
                use_safetensors=True,
                variant="fp16"
            ).to(self.device)
            self.pipe.enable_model_cpu_offload()
        else:
            self.pipe = self.pipe.to(self.device)
            self.refiner = None

    def process_prompts(self, prompts: List[Union[str, Dict]]):
        for prompt in prompts:
            if isinstance(prompt, str):
                self.generate_image(prompt)
            elif isinstance(prompt, dict):
                self.generate_image(**prompt)

    def generate_image(self, prompt: str, prompt_2: Optional[str] = None,
                       negative_prompt: Optional[str] = None, negative_prompt_2: Optional[str] = None,
                       seed: Optional[int] = None):
        if seed is None:
            seed = random.randint(0, sys.maxsize)
        generator = torch.Generator(self.device).manual_seed(seed)
        print(f"Prompt:\t{prompt}")
        print(f"Seed:\t{seed}")
        output_type = "latent" if self._use_refiner else "pil"
        images = self.pipe(
            prompt=prompt,
            prompt_2=prompt_2,
            negative_prompt=negative_prompt,
            negative_prompt_2=negative_prompt_2,
            output_type=output_type,
            generator=generator,
            height=self._height,
            width=self._width,
            guidance_scale=self._guidance_scale,
            num_images_per_prompt=self._num_images_per_prompt
        ).images
        if self._use_refiner:
            images = self.refiner(prompt=prompt, image=images).images
        self._save_images(images, prompt)
        return images

    def _save_images(self, images, prompt):
        os.makedirs(self._output_dir, exist_ok=True)
        timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
        for i, image in enumerate(images):
            image_path = os.path.join(self._output_dir, f"output_{timestamp}_{i}.jpg")
            print(f"Saving image to {image_path}")
            image.save(image_path)
        print(f"Image complete. {prompt}")


# Usage example
if __name__ == "__main__":
    pipeline = StableDiffusionXLPipeline(use_refiner=False,
                                         num_images_per_prompt=1,
                                         output_dir="my_images")

    # Generate a single image using a string prompt
    pipeline.generate_image("A fairy princess and her majestic dragon. Photorealistic.")

    # Change some properties
    pipeline.use_refiner = True

    # Generate an image with more parameters
    pipeline.generate_image(
        prompt="A cyberpunk cityscape at night",
        negative_prompt="daytime, bright, sunny",
        seed=42
    )

    # Process multiple prompts
    prompts_to_process = [
        "A serene lake surrounded by mountains",
        {
            "prompt": "An alien landscape with two moons",
            "prompt_2": "Highly detailed, science fiction art",
            "negative_prompt": "Earth-like, familiar"
        }
    ]
    pipeline.process_prompts(prompts_to_process)
Output
Here is the output from this code based on the various prompts I tried out:
A fairy princess and her majestic dragon. Photorealistic. (Note: No refiner)
A cyberpunk cityscape at night; negative_prompt: daytime, bright, sunny
A serene lake surrounded by mountains
An alien landscape with two moons; prompt_2: Highly detailed, science fiction art; negative_prompt: Earth-like, familiar
What You Need to Know
So how does this all work?
The basic class just stores a number of important properties and values, such as whether you are going to use the refiner, the height and width of the images it will create, and the number of images it will create per prompt.
The class also contains code that lets you easily pass in a list of prompts to run, where each entry is either a simple string description of what you want or a dictionary of prompts and negative prompts. (A negative prompt describes what you want Stable Diffusion to NOT include in the image.)
But let’s get down to business. How do we actually use the Stable Diffusion XL model? The main code you need to pay attention to is how to set up the pipeline:
def setup_pipeline(self):
    self.torch_dtype = torch.float16
    self.accelerator = Accelerator()
    self.device = self.accelerator.device
    self.pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=self.torch_dtype,
        use_safetensors=True,
        variant="fp16"
    )
    if self._use_refiner:
        self.refiner = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-refiner-1.0",
            text_encoder_2=self.pipe.text_encoder_2,
            vae=self.pipe.vae,
            torch_dtype=self.torch_dtype,
            use_safetensors=True,
            variant="fp16"
        ).to(self.device)
        self.pipe.enable_model_cpu_offload()
    else:
        self.pipe = self.pipe.to(self.device)
        self.refiner = None
The pipeline is set up with this call:
self.pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=self.torch_dtype,
    use_safetensors=True,
    variant="fp16"
)
I’m specifying that I’m using the stable-diffusion-xl-base-1.0 model. If we’re using the refiner, we also make this call:
self.refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=self.pipe.text_encoder_2,
    vae=self.pipe.vae,
    torch_dtype=self.torch_dtype,
    use_safetensors=True,
    variant="fp16"
).to(self.device)
self.pipe.enable_model_cpu_offload()
This sets up the stable-diffusion-xl-refiner-1.0 pipeline for later use. Both objects come into play when we generate an image:
def generate_image(self, prompt: str, prompt_2: Optional[str] = None,
                   negative_prompt: Optional[str] = None, negative_prompt_2: Optional[str] = None,
                   seed: Optional[int] = None):
    if seed is None:
        seed = random.randint(0, sys.maxsize)
    generator = torch.Generator(self.device).manual_seed(seed)
    print(f"Prompt:\t{prompt}")
    print(f"Seed:\t{seed}")
    output_type = "latent" if self._use_refiner else "pil"
    images = self.pipe(
        prompt=prompt,
        prompt_2=prompt_2,
        negative_prompt=negative_prompt,
        negative_prompt_2=negative_prompt_2,
        output_type=output_type,
        generator=generator,
        height=self._height,
        width=self._width,
        guidance_scale=self._guidance_scale,
        num_images_per_prompt=self._num_images_per_prompt
    ).images
    if self._use_refiner:
        images = self.refiner(prompt=prompt, image=images).images
    self._save_images(images, prompt)
    return images
Essentially, we call the pipeline we previously set up and pass in the various parameters we want for the image, including the prompts.
One thing that requires some explanation is guidance scale. This parameter determines how closely the model tries to follow the text prompt versus how much creative freedom it takes while generating the image. (See here for further explanation.) Experiment with different values to find the results that work best for you.
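For example, here is a minimal sketch that uses the class’s guidance_scale setter to compare a few values against the same prompt and seed (the prompt, seed, and scale values are arbitrary choices of mine for illustration):

# Compare a few guidance_scale values while holding the prompt and seed fixed
pipeline = StableDiffusionXLPipeline(use_refiner=False, output_dir="guidance_tests")
for scale in (3.0, 5.0, 7.5, 10.0):
    pipeline.guidance_scale = scale  # property setter defined on the class above
    pipeline.generate_image("A watercolor painting of a lighthouse at dawn", seed=12345)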
Once the base pipeline generates images (as latents when the refiner is enabled), we feed them to the refiner object, which improves the quality of the image further. All pretty simple, right?
Stable Diffusion is an ocean and we’re just playing in the shallows, but this should be enough to get you started with adding Stable Diffusion text-to-image functionality to your applications.