
Finally, Stable Diffusion SDXL with ROCm acceleration and benchmarks


Joker tries Stable Diffusion on ROCm

PROMPT: Joker fails to install ROCm to run Stable Diffusion SDXL, cinematic

AI is the future and I'd like not to depend on the NVIDIA monopoly. I also don't need a GPU for gaming, so is AMD the alternative? Maybe; I previously wrote about it (in Italian).

In this post I'd just like to write about how ROCm support from AMD and the Python/PyTorch ecosystem is a mess, but it finally works! At least for my CPU/APU: an AMD Ryzen 7 7700 (gfx1036) with 48 GB RAM @ 5200 MHz.

So will it also work for you? I don't know; there are too many moving parts.

Is your CPU / GPU / APU supported?

The official AMD docs are improving, but only a few Linux distros are supported. The most useful website to check compatibility is the Salsa page from Debian (which isn't even an officially supported distro).

$ rocm_agent_enumerator
gfx000   # the CPU agent
gfx1036  # <-- THIS IS WHAT YOU NEED TO KNOW
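
If rocm_agent_enumerator isn't available, rocminfo (shipped with the same ROCm packages, assuming you have it installed) should report the same gfx target:

$ rocminfo | grep gfx
  Name:                    gfx1036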

Tell me the "secret sauce" to make Stable Diffusion Web UI work with ROCm

Use Docker and search for the errors; expect segfaults, library incompatibilities, pinned dependencies..

Here is my special combo.

# Dockerfile.rocm
FROM rocm/dev-ubuntu-22.04
ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PYTHONIOENCODING=UTF-8
WORKDIR /sdtemp
RUN apt-get update &&\
    apt-get install -y \
    wget \
    git \
    python3 \
    python3-pip \
    python-is-python3
RUN python -m pip install --upgrade pip wheel
RUN git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui /sdtemp

# install the PyTorch nightly built against ROCm 5.6 (the stable ROCm 5.4 wheels do not work here)
ENV TORCH_COMMAND="pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6"

RUN python -m $TORCH_COMMAND

EXPOSE 7860

# pre-install the webui dependencies at build time, then exit
RUN python launch.py --skip-torch-cuda-test --exit
RUN python -m pip install opencv-python-headless
WORKDIR /stablediff-web

# docker-compose.yml

version: '3'
services:
  stablediff-rocm:
    build:
      context: .
      dockerfile: Dockerfile.rocm
    container_name: stablediff-rocm-runner
    environment:
      TZ: "Europe/Rome"
      ROC_ENABLE_PRE_VEGA: 1
      COMMANDLINE_ARGS: "--listen --lowvram --no-half --skip-torch-cuda-test"
      ## IMPORTANT!
      HSA_OVERRIDE_GFX_VERSION: "10.3.0"
      ROCR_VISIBLE_DEVICES: 1
      #PYTORCH_HIP_ALLOC_CONF: "garbage_collection_threshold:0.6,max_split_size_mb:128"

    entrypoint: ["/bin/sh", "-c"]
    command: >
      "rocm-smi; . /stablediff.env; echo launch.py $$COMMANDLINE_ARGS;
      if [ ! -d /stablediff-web/.git ]; then
        cp -a /sdtemp/. /stablediff-web/
      fi;
      python launch.py"
    ports:
      - "7860:7860"
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    group_add:
      - video
    ipc: host
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp:unconfined
    volumes:
      - ./cache:/root/.cache
      - ./stablediff.env:/stablediff.env
      - ./stablediff-web:/stablediff-web
      - ./stablediff-models:/stablediff-web/models/Stable-diffusion

$ docker-compose build stablediff-rocm
$ docker-compose up stablediff-rocm 
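
Once the container is up, a quick sanity check (a hypothetical one-liner, using the container name from the compose file above) tells you whether PyTorch actually sees the GPU through ROCm:

$ docker exec -it stablediff-rocm-runner \
    python -c "import torch; print(torch.cuda.is_available(), torch.version.hip)"
# expect "True" plus the HIP version; "False" means the ROCm setup is broken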

The current stable ROCm 5.4 does not work here: you have to use the ROCm 5.6 preview, or fall back to PyTorch 1 instead of PyTorch 2. Crazy.

Inspired by this discussion and a lot of debugging: the environment variables are very important, so set HSA_OVERRIDE_GFX_VERSION and ROCR_VISIBLE_DEVICES for your situation. The --lowvram flag is optional; it makes generation a little slower but saves you some RAM.
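
Note that the compose file also mounts and sources a stablediff.env file right before launch.py runs; its content is up to you. A minimal sketch (hypothetical example; anything exported here wins over the compose environment):

# ./stablediff.env (hypothetical example, sourced right before launch.py)
export HSA_OVERRIDE_GFX_VERSION="10.3.0"  # gfx1036 pretends to be gfx1030 (RDNA2)
export COMMANDLINE_ARGS="--listen --lowvram --no-half --skip-torch-cuda-test"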

Stable Diffusion CPU vs ROCm benchmarks

Is ROCm useful if you only have a little VRAM?

Here is the test:

Debian 12, Linux 6.1 • SD WebUI version: v1.5.1 • python: 3.10.6 • torch: 2.1.0.dev20230827+rocm5.6 • xformers: N/A • gradio: 3.32.0 • checkpoint: 31e35c80fc

Prompt: "a dog, cinematic"
Steps: 20 (default)
Sampling: Euler A (default)
Width: 1024 (SDXL at 512px is useless)
Height: 1024
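
To reproduce the benchmark without clicking around, you can hit the webui API with curl (a sketch, assuming you add --api to COMMANDLINE_ARGS; /sdapi/v1/txt2img is the standard AUTOMATIC1111 endpoint):

$ time curl -s -X POST http://localhost:7860/sdapi/v1/txt2img \
    -H "Content-Type: application/json" \
    -d '{"prompt": "a dog, cinematic", "steps": 20, "width": 1024, "height": 1024}' \
    -o /dev/null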

# Results:

CPU  SD1.5 1024         = 36.55 s/it - RAM 18 GB
CPU  SDXL  1024         = 22.72 s/it - RAM 36 GB (+ swap?)

ROCm SD1.5 1024         = 36.69 s/it - RAM 15 GB (same as CPU)
ROCm SDXL  1024         = 19.33 s/it - RAM 35 GB (the best)
ROCm SDXL  1024 lowvram = 20.27 s/it - RAM 33 GB

Stable Diffusion SDXL with ROCm is a little faster than CPU mode (6:47 vs 7:30 for the full generation)

Conclusion

Currently ROCm is just a little faster than the CPU on SDXL, but it uses less RAM, especially with the --lowvram flag.
CPU mode is more compatible with the libraries and easier to get working.
Stable Diffusion 1.5 is slower than SDXL at 1024 pixels, and in general it's better to use SDXL.

The Stable Diffusion docs recommend a GPU with 16 GB of VRAM for SDXL, but it's nice to see that cheaper and more general-purpose hardware can run it too. A GPU is faster than a CPU/APU for AI, but you can still run AI with at least 48 GB of RAM; you just have to wait longer.