
Generative AI Playground: Image-To-Image Stable Diffusion with RunwayML and Stability AI on the Latest Intel® GPU

This article explores the use of Stable Diffusion models, focusing on image-to-image generation, using Intel's newly released Intel Data Center GPU Max Series 1100.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.

This article was originally published on Medium*.


Figure 1. A Stable Diffusion image-to-image result: a photo of a waterfall plus the text prompt "Mars waterfall," run on an Intel® Data Center GPU Max Series 1100.

Oct. 15, 2023 — Stable Diffusion* models have become a great way for creators, artists, and designers to quickly prototype visual ideas without hiring outside help. If you have ever used a Stable Diffusion model, you are probably familiar with giving a text prompt to generate an image. There are also models that accept both a text prompt and a starting image to generate a new image. In this article, I show how I ran inference with image-to-image Stable Diffusion models on Intel's just-released Intel® Data Center GPU Max Series 1100.

I ran two different Stable Diffusion models for image-to-image generation, hosted on Hugging Face*. Though both models are used primarily for text-to-image, they both work on image-to-image as well:

Stability AI with Stable Diffusion v2–1 Model

The Stability AI Stable Diffusion v2–1 model was trained on an impressive cluster of 32 x 8 A100 GPUs (256 GPU cards total) and was fine-tuned from the Stable Diffusion v2 model. The training data was a subset of the LAION-5B dataset created by the DeepFloyd team at Stability AI. At the time of writing, LAION-5B is the largest openly available text-image pair dataset, with over 5.85 billion text-image pairs. Figure 2 shows a few samples from the dataset.


Figure 2. Sample images of cats from the LAION-5B dataset. Image Source

The sample images reveal that the original images come in a variety of pixel sizes; in practice, however, training these models usually involves padding or resizing the images to a consistent resolution that matches the model architecture.
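
As a quick illustration of that preprocessing step (my own sketch, not code from the article), resizing and center-cropping an arbitrary image to a fixed 512 x 512 training resolution with Pillow might look like this:

Python
from PIL import Image

def to_training_resolution(path: str, size: int = 512) -> Image.Image:
    """Resize the shorter side to `size`, then center-crop to size x size."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))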

The breakdown of the dataset is as follows:

  • Laion2B-en: 2.32 billion text-image pairs in English
  • Laion2B-multi: 2.26 billion text-image pairs from over 100 other languages
  • Laion1B-nolang: 1.27 billion text-image pairs with an undetectable language

Tracing the training path of these models is a bit convoluted; the full lineage and further training details are documented on the Stability AI Stable Diffusion v2–1 Hugging Face model card. Note that I have repeated this description from a previous article on text-to-image Stable Diffusion, as it is the same model.

Runway ML with Stable Diffusion v1–5 Model

The Runway ML Stable Diffusion v1–5 model was initialized from the Stable Diffusion v1–2 checkpoint and trained for an additional 595 K steps at a resolution of 512 x 512. One of its advantages is that it is relatively lightweight: "With its 860 M UNet and 123 M text encoder, the model is relatively lightweight and runs on a GPU with at least 10 GB VRAM." (source on GitHub). The Intel Data Center GPU Max Series 1100 has 48 GB of VRAM, which is more than enough for this model.
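
To get a feel for that footprint, here is a minimal sketch (mine, not from the article) of loading this checkpoint into an image-to-image pipeline with the Hugging Face diffusers library; the model ID and the float16 choice are assumptions on my part:

Python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

# Assumed Hugging Face model ID for the Runway ML checkpoint.
MODEL_ID = "runwayml/stable-diffusion-v1-5"

# Loading in float16 roughly halves the memory footprint of the
# ~860 M-parameter UNet and ~123 M-parameter text encoder.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
)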

The Intel GPU Hardware

As I just mentioned, the particular GPU that I used for my inference test is the Intel Data Center GPU Max 1100, which has 48 GB of memory, 56 Xe-cores, and a 300 W thermal design power. On the command line, I can first verify that I have the GPUs I expect by running:

Bash
clinfo -l

And I get an output showing that I have access to four Intel GPUs on the current node:
 

Platform #0: Intel(R) OpenCL Graphics
 +-- Device #0: Intel(R) Data Center GPU Max 1100
 +-- Device #1: Intel(R) Data Center GPU Max 1100
 +-- Device #2: Intel(R) Data Center GPU Max 1100
 `-- Device #3: Intel(R) Data Center GPU Max 1100
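
If you prefer to check from Python, the Intel Extension for PyTorch registers an "xpu" device backend. The following is my own sketch, assuming a working installation of the extension; the torch.xpu calls are not taken from the article:

Python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device backend

# List the visible Intel GPUs and their total memory.
for i in range(torch.xpu.device_count()):
    props = torch.xpu.get_device_properties(i)
    print(f"xpu:{i} {torch.xpu.get_device_name(i)} "
          f"{props.total_memory / 1024**3:.1f} GiB")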

Similar to nvidia-smi, you can run xpu-smi on the command line with a few options to get the GPU statistics you want.

Bash
xpu-smi dump -d 0 -m 0,5,18

The result is a printout, once per second, of device 0's GPU utilization (metric 0), GPU memory utilization (metric 5), and GPU memory used (metric 18):

getpwuid error: Success
Timestamp, DeviceId, GPU Utilization (%), GPU Memory Utilization (%), GPU Memory Used (MiB)
13:34:51.000,    0, 0.02, 0.05, 28.75
13:34:52.000,    0, 0.00, 0.05, 28.75
13:34:53.000,    0, 0.00, 0.05, 28.75
13:34:54.000,    0, 0.00, 0.05, 28.75
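
To capture these statistics while a generation job is running, one option (my own sketch, not from the article) is to launch the same xpu-smi command from Python and stream its CSV output to a file:

Python
import subprocess

# Stream the same metrics shown above (GPU utilization, memory utilization,
# and memory used in MiB) for device 0 into a CSV file while inference runs.
log = open("xpu_usage.csv", "w")
monitor = subprocess.Popen(
    ["xpu-smi", "dump", "-d", "0", "-m", "0,5,18"],
    stdout=log,
    stderr=subprocess.STDOUT,
)

# ... run the Stable Diffusion workload here ...

monitor.terminate()  # stop sampling when the run is done
log.close()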

Run the Stable Diffusion Image-To-Image Examples

My colleague, Rahul Nair, wrote the Stable Diffusion image-to-image Jupyter* Notebook that is hosted directly on the Intel® Developer Cloud. It gives you the option of using either model that I outlined earlier. Here are the steps you can take to get started:

  1. Go to Intel Developer Cloud.
  2. Register as a standard user.
  3. Once you are logged in, go to the Training and Workshops section.
  4. Select GenAI Launch Jupyter Notebook. There, you can find the Stable Diffusion image-to-image Jupyter Notebook and run it.

In the Jupyter Notebook, Intel® Extension for PyTorch* (IPEX) is used to speed up inference. One of the key functions is _optimize_pipeline, where ipex.optimize is called to optimize each module of the StableDiffusionImg2ImgPipeline object.

Python
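# Excerpt from the notebook; it assumes imports along the lines of:
#   import torch.nn as nn
#   import intel_extension_for_pytorch as ipex
#   from diffusers import StableDiffusionImg2ImgPipeline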
def _optimize_pipeline(
    self, pipeline: StableDiffusionImg2ImgPipeline
) -> StableDiffusionImg2ImgPipeline:
    """
    Optimize the pipeline of the model.

    Args:
        pipeline (StableDiffusionImg2ImgPipeline): The pipeline to optimize.

    Returns:
        StableDiffusionImg2ImgPipeline: The optimized pipeline.
    """
    for attr in dir(pipeline):
        if isinstance(getattr(pipeline, attr), nn.Module):
            setattr(
                pipeline,
                attr,
                ipex.optimize(
                    getattr(pipeline, attr).eval(),
                    dtype=pipeline.text_encoder.dtype,
                    inplace=True,
                ),
            )
    return pipeline
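
To give a sense of how such an optimized pipeline might be driven end to end, here is a hedged sketch of my own; it is not the notebook's exact code, and the model ID, the "xpu" device string, the placeholder image URL, and the strength and guidance values are all assumptions:

Python
import requests
import torch
import intel_extension_for_pytorch as ipex
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Assumed model ID; the notebook also offers stabilityai/stable-diffusion-2-1.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("xpu")

# Optimize each submodule with IPEX, mirroring _optimize_pipeline above.
for attr in dir(pipe):
    if isinstance(getattr(pipe, attr), torch.nn.Module):
        setattr(
            pipe,
            attr,
            ipex.optimize(
                getattr(pipe, attr).eval(),
                dtype=pipe.text_encoder.dtype,
                inplace=True,
            ),
        )

# Start from a real photo and steer it with a text prompt.
url = "https://example.com/waterfall.jpg"  # placeholder image URL
init_image = (
    Image.open(requests.get(url, stream=True).raw).convert("RGB").resize((768, 512))
)

result = pipe(
    prompt="a waterfall on Mars, red rocky landscape",
    image=init_image,
    strength=0.75,        # how far to move away from the source image
    guidance_scale=7.5,   # how strongly to follow the text prompt
).images[0]
result.save("mars_waterfall.png")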

Figure 3 shows the handy mini user interface for image-to-image generation within the Jupyter Notebook itself. Select one of the models, enter the desired image URL, enter a prompt, select the number of images to generate, and you are on your way to creating your own images.


Figure 3. The mini user interface for image-to-image generation within the Jupyter Notebook.

Figures 1 and 4 show sample results: entirely new images generated from text-plus-image prompts that I ran on this Intel GPU. I thought it was neat to start with a real Earth nature photo of a waterfall, ask for a Mars waterfall, and see the adaptation to a red-colored landscape (Figure 1). In Figure 4, the model transformed an image of Jupiter to show Earth-like continental structure while keeping the red coloring and some of Jupiter's distinctive features.


Figure 4. A Stable Diffusion image-to-image result: an image of the planet Jupiter plus the text prompt "Earth," run on the latest Intel Data Center GPU Max Series 1100.

I was able to generate these images by running through the Jupyter Notebook, and inference completes in a matter of seconds. Feel free to share your images with me over social media by connecting with me through the following links, and please let me know if you have any questions or would like help getting started with Stable Diffusion.

You can reach me on:

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

