Optimise a Stable Diffusion Model on a Raspberry Pi

GitHub repo: https://github.com/WideSu/OnnxStream

Project idea: https://www.raspberrypi.com/news/creating-ai-art-with-raspberry-pi-magpimonday/

YouTube Video: https://youtu.be/NvJ4HtWQ_OY

Introduction

Localised GenAI models are the future for data privacy and security, but the cost of computing resources has stopped many people from developing their own. I saw an interesting project on GitHub that enables running a Stable Diffusion model on a Raspberry Pi Zero, with very limited memory and compute, by using attention slicing and quantization.

The project is called OnnxStream. It makes use of XNNPACK to optimise how Stable Diffusion generates AI imagery, and it was tested using Raspberry Pi OS Lite 64-bit on a Raspberry Pi Zero 2 W. Vito's GitHub has more details on setting up OnnxStream on a Raspberry Pi with "every KB of RAM needed to run Stable Diffusion." I ran it on my Raspberry Pi 4B, which has 8GB of RAM, and wrote this step-by-step guide for anyone interested in building their own localised diffusion model.

In just 1.5 hours, OnnxStream running on a Raspberry Pi Zero 2 produced output nearly identical to a PC's.

Lower memory usage than other Stable Diffusion runtimes

This table shows the inference times of the three component models of Stable Diffusion 1.5, together with their memory consumption (i.e. the Peak Working Set Size on Windows or the Maximum Resident Set Size on Linux).

Model / Library               1st run                 2nd run                 3rd run
FP16 UNET / OnnxStream        0.133 GB - 18.2 secs    0.133 GB - 18.7 secs    0.133 GB - 19.8 secs
FP16 UNET / OnnxRuntime       5.085 GB - 12.8 secs    7.353 GB - 7.28 secs    7.353 GB - 7.96 secs
FP32 Text Enc / OnnxStream    0.147 GB - 1.26 secs    0.147 GB - 1.19 secs    0.147 GB - 1.19 secs
FP32 Text Enc / OnnxRuntime   0.641 GB - 1.02 secs    0.641 GB - 0.06 secs    0.641 GB - 0.07 secs
FP32 VAE Dec / OnnxStream     1.004 GB - 20.9 secs    1.004 GB - 20.6 secs    1.004 GB - 21.2 secs
FP32 VAE Dec / OnnxRuntime    1.330 GB - 11.2 secs    2.026 GB - 10.1 secs    2.026 GB - 11.1 secs

In the case of the UNET model (run in FP16 precision, with FP16 arithmetic enabled in OnnxStream), OnnxStream can consume as little as 1/55th of the memory that OnnxRuntime uses, at the cost of a 50% to 200% increase in latency.
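Part of this saving comes from attention slicing, one of the techniques mentioned in the introduction: instead of materialising the full attention matrix in a single allocation, it is computed a few slices at a time. Below is a minimal PyTorch sketch of the idea, for illustration only; it is not OnnxStream's actual C++ implementation.

import torch

def sliced_attention(q, k, v, slice_size=2):
    # q, k, v: (heads, tokens, dim). A naive implementation allocates a full
    # (heads, tokens, tokens) attention matrix at once; slicing over heads
    # caps the peak allocation at (slice_size, tokens, tokens) instead.
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[0], slice_size):
        s = slice(i, i + slice_size)
        attn = torch.softmax((q[s] @ k[s].transpose(-1, -2)) * scale, dim=-1)
        out[s] = attn @ v[s]
    return out

# Sliced and unsliced results agree; only peak memory differs.
q = k = v = torch.randn(8, 77, 64)  # 8 heads, 77 tokens, 64 dims
assert torch.allclose(sliced_attention(q, k, v, 2),
                      sliced_attention(q, k, v, 8), atol=1e-6)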

1. Prerequisites

Summary of the bash commands for setting up the environment to build the OnnxStream diffusion model on a Raspberry Pi:

apt install git
apt install cmake
apt install tmux
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-Linux-aarch64.sh
export PATH="${PATH}:/root/miniforge3/bin"
conda install -c conda-forge jupyterlab
conda install ipykernel
jupyter lab --allow-root --ip="your-ip" --port=8082 --log-level=40 --no-browser

Install the basic build tools:

apt install git
apt install cmake
apt install tmux

To install Miniforge (a minimal conda distribution) on DietPi OS:

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"

bash Miniforge3-Linux-aarch64.sh

Add conda to the system PATH:

export PATH="${PATH}:/root/miniforge3/bin"

conda install -c conda-forge jupyterlab

conda install ipykernel

To skip token/password authentication (only sensible on a trusted local network):

jupyter lab --NotebookApp.token='' --NotebookApp.password=''

Then launch JupyterLab, replacing "your-ip" with your Pi's IP address:

jupyter lab --allow-root --ip="your-ip" --port=8082 --log-level=40 --no-browser
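With the environment ready, it is worth confirming how much RAM is actually free before building, since both the build and the model need headroom. A quick Linux-only check in Python:

import os

# Linux-only: query the page size and the number of free physical pages.
page_size = os.sysconf("SC_PAGE_SIZE")
free_pages = os.sysconf("SC_AVPHYS_PAGES")
print(f"Available RAM: {page_size * free_pages / 2**30:.2f} GiB")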

2. Build the Diffusion Model

First build XNNPACK, checked out at the last commit before 2023-06-27 (the version this guide was tested against):

git clone https://github.com/google/XNNPACK.git
cd XNNPACK
git rev-list -n 1 --before="2023-06-27 00:00" master
git checkout <COMMIT_ID_FROM_THE_PREVIOUS_COMMAND>
mkdir build
cd build
cmake -DXNNPACK_BUILD_TESTS=OFF -DXNNPACK_BUILD_BENCHMARKS=OFF ..
cmake --build . --config Release

Next, build OnnxStream itself, replacing <DIRECTORY_WHERE_XNNPACK_WAS_CLONED> with the path of your XNNPACK clone:

git clone https://github.com/WideSu/OnnxStream.git
cd OnnxStream
cd src
mkdir build
cd build
cmake -DMAX_SPEED=ON -DXNNPACK_DIR=<DIRECTORY_WHERE_XNNPACK_WAS_CLONED> ..
cmake --build . --config Release
With OnnxStream built, export your custom model's UNET to ONNX from Python using diffusers:

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_single_file("https://huggingface.co/YourUsername/YourModel/blob/main/Model.safetensors")

dummy_input = (torch.randn(1, 4, 64, 64), torch.randn(1), torch.randn(1, 77, 768))
input_names = ["sample", "timestep", "encoder_hidden_states"]
output_names = ["out_sample"]

torch.onnx.export(
    pipe.unet,
    dummy_input,
    "/path/to/save/unet_temp.onnx",
    verbose=False,
    input_names=input_names,
    output_names=output_names,
    opset_version=14,
    do_constant_folding=True,
    export_params=True,
)
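Before going further, you can sanity-check the exported file with the onnx package. This is an optional check, assuming the output path used above; check_model also accepts a file path, which matters for models over 2 GB whose weights live in external data files.

import onnx

# Validate the graph and confirm the input names match what was exported.
onnx.checker.check_model("/path/to/save/unet_temp.onnx")
model = onnx.load("/path/to/save/unet_temp.onnx")
print([i.name for i in model.graph.input])
# expected: ['sample', 'timestep', 'encoder_hidden_states']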

Alternatively, if your starting point is an ONNX file with dynamic input shapes, fix them with onnxruntime's helper:

python -m onnxruntime.tools.make_dynamic_shape_fixed --input_name sample --input_shape 1,4,64,64 model.onnx model_fixed1.onnx
python -m onnxruntime.tools.make_dynamic_shape_fixed --input_name timestep --input_shape 1 model_fixed1.onnx model_fixed2.onnx
python -m onnxruntime.tools.make_dynamic_shape_fixed --input_name encoder_hidden_states --input_shape 1,77,768 model_fixed2.onnx model_fixed3.onnx
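To confirm that every input dimension is now a concrete integer rather than a symbolic name, you can inspect the graph; a small sketch using the onnx package:

import onnx

model = onnx.load("model_fixed3.onnx")
for inp in model.graph.input:
    # dim_value is set for fixed dimensions, dim_param for symbolic ones.
    dims = [d.dim_value if d.HasField("dim_value") else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # every entry should be an int, e.g. [1, 4, 64, 64]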

Note by Vito: This can be achieved simply by following the export approach outlined above ("Option A" in the original guide), which remains the recommended approach. Making the input shapes fixed might be useful if your starting point is already an ONNX file.

Then run Onnx Simplifier (the module installed by the onnx-simplifier package is onnxsim; a Python-API variant is sketched after the notes below):

python -m onnxsim model_fixed3.onnx model_simplified.onnx

Note:

  • If you exported your model from Hugging Face, you'll need around 100GB of swap space.

  • If you manually fixed the input shapes, 16GB of RAM should suffice.

  • The process may take some time; please be patient.
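If you prefer to stay in Python (for example inside the Jupyter session set up earlier), the same simplification step can be driven through onnx-simplifier's Python API. A sketch, assuming the file names used above:

import onnx
from onnxsim import simplify

model = onnx.load("model_fixed3.onnx")
model_simplified, ok = simplify(model)
assert ok, "simplified model failed validation"
# Very large models may need onnx.save(..., save_as_external_data=True).
onnx.save(model_simplified, "model_simplified.onnx")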

Once you have the final model from onnx2txt, move it into the unet_fp16 folder of the standard SD 1.5 model, which can be found in the Windows release of OnnxStream.

The command to run the model might look like this:

./sd --models-path ./Converted/ --prompt "space landscape" --steps 28 --rpi

If you see the "Shape" operator in the output of Onnx Simplifier or in onnx2txt.ipynb, it indicates that Onnx Simplifier may not be functioning as expected. This issue is often not caused by Onnx Simplifier itself but rather by Onnx's Shape Inference.

In such cases, you can instead re-export the model with modified torch.onnx.export parameters. Locate this file on your computer:

export_onnx.py from GitHub

And make sure to:

  • Set opset_version to 14

  • Remove dynamic_axes

After making these changes, you can rerun Onnx Simplifier and onnx2txt.
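Concretely, the re-export call might look like the following sketch; the names are the same as in the earlier export snippet, and the key points are opset_version=14 and the absence of a dynamic_axes argument:

torch.onnx.export(
    pipe.unet,
    dummy_input,                      # fixed shapes from the earlier snippet
    "/path/to/save/unet_temp.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states"],
    output_names=["out_sample"],
    opset_version=14,                 # must be 14
    do_constant_folding=True,
    export_params=True,
    # no dynamic_axes=... -> every input keeps dummy_input's exact shape
)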

Note by Vito: This solution, although working, generates ONNX files with Einsum operations. When OnnxStream supports the Einsum operator, this solution will become the recommended one.

This guide is designed to be a comprehensive resource for those looking to run a custom Stable Diffusion 1.5 model with OnnxStream. Additional contributions are welcome!

Related Projects