I still have an AMD RX 560 (4GB VRAM, Polaris architecture) in my machine. It serves its purpose for driving my main monitor and I haven’t yet felt the itch to replace it. I also have an Nvidia RTX 4070 for tasks that need more power. Since I don’t run my display server on the Nvidia card, I can pass it into a virtual machine and run different operating systems without having to reboot my system. This setup has served me well, but are there use cases for the AMD card beyond just plain desktop usage?

llama.cpp has an official backend for AMD GPUs using the HIP language and the ROCm compute stack, but setting it up is unfortunately a little more involved than with Nvidia CUDA. Officially, ROCm only supports a limited number of distributions, and the range of supported consumer GPUs is tiny, usually restricted to the latest generation. Debian has started an effort to package the entire ecosystem, and in the process they decided to enable architectures that have been dropped upstream. The LLVM backends for these architectures are still intact and most of the libraries in the ROCm stack can be built for them just fine.

In stark contrast to the CUDA ecosystem, HIP applications are compiled ahead of time, with no intermediate representation like Nvidia’s PTX. In theory this avoids the delays that JIT-compiling kernels can cause, but it also means that binaries are quite large and distributing them is a hassle. More on that later.

Building

My host system runs the current Debian Bookworm 6.1 kernel. AMD ships its own out-of-tree DKMS version of the amdgpu kernel driver, but they usually upstream all of their changes over time, and on older hardware you should have no issues with the mainline amdgpu driver. Thus, we only have to concern ourselves with user space. The setup used here is just a container with a Debian Trixie userland, which is currently the simplest way to get access to Debian’s ROCm packaging.

docker run -it --device=/dev/kfd --device=/dev/dri --name test debian:trixie

The Docker CLI has a --gpus flag, but it only works with Nvidia. On AMD we have to bind-mount the required device nodes manually. Using the root user from within the container sidesteps any permission issues.
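If you would rather not use root inside the container, one alternative is to add the container user to the host groups that own the device nodes. A rough sketch, where the group IDs are placeholders you would look up on the host first:

# Placeholder group IDs; look up the real ones with: getent group render video
docker run -it --device=/dev/kfd --device=/dev/dri \
  --group-add 110 --group-add 44 \
  --name test debian:trixie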

HIP (ROCm)

After cloning the llama.cpp repo, we need the following packages to build the HIP backend:

sudo apt install librocblas-dev libhipblas-dev hipcc

This will pull in the HIP compiler and all the required dependencies. hipBLAS is a marshalling library that forwards to either rocBLAS (AMD) or cuBLAS (Nvidia). We can then build llama.cpp:

HIPCXX=clang-17 cmake -B build -DGGML_HIP=ON
cmake --build build -j

On Debian you have to set the HIP compiler to “clang-17” manually; this is not necessary if you install the official ROCm packages provided by AMD. CMake will autodetect the architecture of the card currently installed, but you can also specify it explicitly, like so: -DAMDGPU_TARGETS=gfx803. You can build fat binaries for multiple architectures by separating the architecture names with commas. This per-architecture, ahead-of-time model reflects AMD’s HPC focus, where you will probably only ever build for one set of GPUs anyway. Distributing binaries to many different users, however, becomes a chore. That is why so many posts on the web suggest the HSA_OVERRIDE_GFX_VERSION environment variable: it lets you pretend to have a different architecture at load time, and sometimes you can select a generic GFX architecture that is binary compatible with your actual one.
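To illustrate the override (this is only a sketch, not the configuration used for the benchmarks below; the model path and GFX version are placeholders):

# Pretend the card is a gfx1030, reported as version 10.3.0; this only works
# if the compiled code objects are actually binary compatible with the real GPU.
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./build/bin/llama-cli -m /models/model.gguf -p "Hello"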

Vulkan

The Vulkan compute backend is an alternative to the vendor-specific backends and has made great strides in recent months. Building it is a lot simpler in terms of dependencies and the binaries should be quite portable if you statically link libggml and build with an old enough glibc, but the usual Linux static linking caveats apply.

We just need the Vulkan headers and the glslc compiler to generate intermediate SPIR-V code from llama.cpp’s GLSL shaders:

sudo apt install libvulkan-dev glslc
cmake -B build -DGGML_VULKAN=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build -j

These generic SPIR-V shaders are then compiled to native code at runtime by the platform’s shader compiler (the open-source Mesa driver in our case). This means that, in theory, the Vulkan backend should be at a disadvantage compared to the hand-tuned GEMM kernels used in libraries like rocBLAS or cuBLAS.
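Inside the container, Mesa’s Vulkan driver has to be present for llama.cpp to find the card at runtime. A quick sanity check could look like this (package names as found in Debian Trixie; vulkan-tools is only needed for the check itself):

sudo apt install mesa-vulkan-drivers vulkan-tools
# The RX 560 should show up as a physical device using the RADV driver.
vulkaninfo --summary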

Benchmarks

I’ll be using a Q4_K_M GGUF quant of Gemma 3 4B for testing. It is one of the most powerful models available in this size class at the time of writing. I disabled flash attention and KV cache quantization and used the default sampler chain. The weights plus KV cache take up approximately 2.9 GB of memory at a context length of 4096 tokens. This number is inflated because llama.cpp has not yet implemented proper interleaved sliding window attention for Gemma 3 series models (see this issue), and the problem gets much worse at longer context lengths. For example, I can run Mistral Nemo 12B at 32k context on the 12 GB 4070, but I can only fit 12k of context with Gemma 3 12B.
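For reference, a serving run with roughly these settings would look something like the following sketch (the exact invocation I used is not reproduced here):

# Offload all layers to the GPU and use a 4096-token context.
llama-server -m /models/gemma-3-4b-it-Q4_K_M.gguf -c 4096 -ngl 99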

We can use llama.cpp’s built-in benchmarking facilities:

time llama-bench -m /models/gemma-3-4b-it-Q4_K_M.gguf -r 3 --progress
Backend                   Prompt processing (tokens/s)   Token generation (tokens/s)   Power consumption (W)
HIP                       6.11 ± 0.07                    17.44 ± 0.02                  43
Vulkan                    81.40 ± 0.31                   25.53 ± 0.06                  70 (max. TDP)
CPU (AVX2, 16 threads)    177.50 ± 7.28                  16.51 ± 0.11                  ~130

Right away, we can see that the HIP performance is basically unusable. The prompt processing is so slow that there is a noticeable delay even for very short prompts. Polaris was last officially supported in ROCm 4.3 (2021) and while Debian has done a valiant job in building newer versions for this architecture, it is clear that any tuned code paths and optimized kernels have since been deleted or have deteriorated due to lack of testing. Curiously, while the GRBM “Graphics pipe” utilization was near 100% when running the HIP benchmark, the reported power consumption was only around 60% of the maximum TDP. This indicates to me that there are some memory inefficiencies and we are not utilizing the SIMDs to their fullest.
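One way to reproduce these readings on the host, sketched here with the caveat that the exact hwmon path varies from system to system:

# radeontop reports the GRBM "Graphics pipe" utilization.
sudo apt install radeontop
sudo radeontop
# Instantaneous power draw in microwatts via the amdgpu hwmon interface.
cat /sys/class/drm/card0/device/hwmon/hwmon*/power1_average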

Vulkan is in a much better state in the year 2025 and there are probably still some gains to be had, with relevant commits landing in the repo almost daily. The level of performance on offer is usable for interactive tasks and the setup is much simpler than HIP.

For comparison purposes, I also tested the pure CPU backend. The CPU in this system is a Ryzen 5950X with 16 physical cores. The results really demonstrate how dated the Polaris architecture is by now: AVX2 instructions allow the CPU to process the prompt at more than twice the speed. The trade-off is a lower token generation rate, which is limited here by dual-channel DDR4-3200 RAM. That, combined with full system utilization and higher power consumption, still leaves some justification for using the GPU, at least for very small models.

Comparing the theoretical maximum memory bandwidths of the GPU (112 GB/s) and the CPU (50 GB/s) also gives us an upper bound on the token generation rate. Every generated token requires reading all of the weights and the KV cache, so we can make a rough estimate by dividing the bandwidth B by the combined size S of weights and KV cache.

\[\begin{aligned} \text{GPU:}& \quad t_\mathrm{rate} = \frac{B}{S} = \frac{112 \, \mathrm{GB/s}}{2.9 \, \mathrm{GB}} = 38.6 \, \mathrm{/s} \\ \text{CPU:}& \quad t_\mathrm{rate} = \frac{B}{S} = \frac{50 \, \mathrm{GB/s}}{2.9 \, \mathrm{GB}} = 17.2 \, \mathrm{/s} \end{aligned}\]

It looks like we’re compute-bound on the GPU and memory-bound on the CPU.

Conclusion

While 4 GB of VRAM is unsuitable for today’s larger models, you can still run models with fewer than four billion parameters for tasks like RAG or summarization without taxing your CPU. The poor state of ROCm should really be a wake-up call to AMD. Nvidia supports its entire consumer GPU range with CUDA and does not drop support for older hardware beyond introducing different levels of compute capability. Thankfully, Vulkan is here to save the day, with third parties (e.g. Valve) and enthusiasts chipping in to improve the performance and features of the Mesa compute stack even on old hardware.