Introduction
Large Language Models (LLMs) are transforming industries by enabling advanced natural language processing, automation, and decision-making.
Running these models locally on custom hardware setups not only provides better control over performance but also ensures data privacy. If you have an AMD GPU and thought you’d need an NVIDIA GPU to run LLMs efficiently, think again!
With ROCm (Radeon Open Compute), you can harness the power of AMD GPUs for LLM inference at a fraction of the cost. 💪
In this guide, I’ll walk you through setting up ROCm and compiling llama.cpp on an OpenSUSE system. Whether you're a developer, researcher, or AI enthusiast, this tutorial will equip you with the skills to run LLMs efficiently on your AMD GPU.
Section 1: Why ROCm and llama.cpp?
- ROCm Overview: ROCm is AMD’s open software platform for GPU computing, designed to accelerate machine learning, AI, and high-performance computing workloads.
It’s a cost-effective alternative to NVIDIA’s CUDA, enabling AMD GPU users to run LLMs with GPU acceleration.
- llama.cpp Introduction: llama.cpp is a lightweight and efficient implementation of LLMs, optimized for both CPU and GPU execution.
It’s open-source, highly customizable, supports various quantization techniques to maximize performance, and has solid ROCm support for running LLMs on AMD GPUs.
- Why AMD GPUs? AMD GPUs like the RX 7900 XTX offer exceptional value for LLM workloads. With 24GB of VRAM and ~960 GB/s memory bandwidth, the RX 7900 XTX can handle larger, more sophisticated models while delivering impressive token generation speeds.
NVIDIA GPUs like the RTX 4090 (24GB VRAM) cost roughly 2.5 times as much, while the RTX 4080 (16GB VRAM) not only costs more but also limits your ability to run larger models. When it comes to LLMs, VRAM capacity and memory bandwidth are the most critical factors, and AMD GPUs deliver strongly on both fronts.
- Why OpenSUSE? OpenSUSE is known for its stability, flexibility, and robust package management.
Its rolling-release packaging model keeps packages up to date, making it an ideal platform for experimenting with AI technologies whose underlying dependencies and inference engines are constantly being updated.
Section 2: Preparing Your System
Before diving into the installation, ensure your system meets the following requirements:
- Hardware:
  - A recent AMD GPU (e.g., Radeon RX 7900 series for RDNA3 or RX 6800 series for RDNA2).
  - At least 16GB of system RAM (the more, the better).
- Software: OpenSUSE with root access and basic development tools installed.
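A quick way to confirm these requirements before you start (the package and pattern names below are typical for OpenSUSE, so adjust them if your setup differs):
lspci -nn | grep -iE 'vga|display'      # identify the GPU and confirm it is an RDNA2/RDNA3 card
free -h                                 # check available system RAM
sudo zypper in -t pattern devel_basis   # basic development tools (compiler, make, etc.)
sudo zypper in git cmake wget           # tools used later in this guide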
Section 3: Step-by-Step Installation Guide
Step 1: Adding the ROCm Repository
Add the ROCm repository to your OpenSUSE system:
sudo zypper addrepo https://download.opensuse.org/repositories/science:GPU:ROCm/openSUSE_Factory/science:GPU:ROCm.repo
sudo zypper refresh
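To double-check that the repository was added, listing the configured repositories should show it:
zypper lr -u | grep -i rocm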
Step 2: Installing ROCm System Dependencies
Install the necessary ROCm packages:
sudo zypper in hipblas-common-devel rocminfo rocm-hip rocm-hip-devel rocm-cmake rocm-smi libhipblas2-devel librocblas4-devel libcurl-devel libmtmd
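Once the packages are installed, it is worth confirming that ROCm can actually see your GPU (on some systems you may first need to add your user to the video and render groups and log in again):
rocminfo | grep -i gfx      # should list your GPU's gfx target, e.g. gfx1100
rocm-smi                    # shows temperature, clocks, and VRAM usage per GPU
# Only needed if rocminfo reports no GPU agents:
sudo usermod -aG video,render $USER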
Step 3: Cloning the llama.cpp Repository
Download the llama.cpp repository:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
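The benchmark outputs later in this guide were produced with llama.cpp builds dc39a5e7 (5169) and 053b1539 (5558); if you want to reproduce those numbers closely you can check out one of those commits, otherwise just build the latest master:
git checkout dc39a5e7   # optional, only to roughly match the non-rocWMMA results shown below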
Step 4: Setting Up the Build Environment
Specify your AMD GPU target (e.g., gfx1100 for RDNA3 or gfx1030 for RDNA2) and configure the build:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100
Step 5: Compiling llama.cpp
Build llama.cpp using the specified number of threads (e.g., 16):
cmake --build build-rocm --config Release -- -j 16
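If you prefer to match the thread count to your CPU automatically, nproc can supply it:
cmake --build build-rocm --config Release -- -j "$(nproc)"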
Step 6: Testing the Build
Download a GGUF model from Hugging Face and test the setup:
cd models
wget -c -O qwen2.5-0.5b.gguf https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf
cd ..
./build-rocm/bin/llama-bench -m models/qwen2.5-0.5b.gguf
You should get output similar to this:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 1B Q4_K - Medium | 373.71 MiB | 494.03 M | ROCm | 99 | pp512 | 30061.00 ± 201.52 |
| qwen2 1B Q4_K - Medium | 373.71 MiB | 494.03 M | ROCm | 99 | tg128 | 271.27 ± 0.74 |
build: dc39a5e7 (5169)
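llama-bench only measures throughput. To actually talk to the model, the same build directory contains llama-cli for interactive use and llama-server for a local HTTP endpoint with an OpenAI-compatible API. A minimal sketch (exact flag spellings can vary between llama.cpp versions, so check --help on your build):
./build-rocm/bin/llama-cli -m models/qwen2.5-0.5b.gguf -ngl 99 -p "Explain ROCm in one sentence."
./build-rocm/bin/llama-server -m models/qwen2.5-0.5b.gguf -ngl 99 --port 8080
Here -ngl 99 offloads all model layers to the GPU.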
Section 4: Enhanced Flash Attention Using rocWMMA
Flash Attention (FA) is a memory-efficient attention algorithm that enables larger context windows and better performance. To improve FA performance further, we need to build llama.cpp with the rocWMMA library.
Step 1: Cloning and Building rocWMMA
Clone the rocWMMA repository and build it:
git clone https://github.com/ROCm/rocWMMA.git
cd rocWMMA
CC=/usr/lib64/rocm/llvm/bin/clang CXX=/usr/lib64/rocm/llvm/bin/clang++ \
cmake -B build . \
-DROCWMMA_BUILD_TESTS=OFF \
-DROCWMMA_BUILD_SAMPLES=OFF \
-DOpenMP_CXX_FLAGS="-fopenmp" \
-DOpenMP_CXX_LIB_NAMES="omp" \
-DOpenMP_omp_LIBRARY="/usr/lib/libgomp1.so"
cmake --build build -- -j 16
Step 2: Installing rocWMMA
Install rocWMMA to the system directory:
sudo mkdir -p /opt/rocm
sudo chown -R $USER:users /opt/rocm/
cmake --build build --target install -- -j 16
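You can confirm the headers ended up where llama.cpp will look for them:
ls /opt/rocm/include/rocwmma/rocwmma.hpp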
Step 3: Patching llama.cpp (Optional)
If rocWMMA is not detected, apply the provided patch to ggml/src/ggml-hip/CMakeLists.txt
:
diff --git a/ggml/src/ggml-hip/CMakeLists.txt b/ggml/src/ggml-hip/CMakeLists.txt
index 1fe8fe3b..9577203b 100644
--- a/ggml/src/ggml-hip/CMakeLists.txt
+++ b/ggml/src/ggml-hip/CMakeLists.txt
@@ -39,10 +39,12 @@ endif()
find_package(hip REQUIRED)
find_package(hipblas REQUIRED)
find_package(rocblas REQUIRED)
-if (GGML_HIP_ROCWMMA_FATTN)
- CHECK_INCLUDE_FILE_CXX("rocwmma/rocwmma.hpp" FOUND_ROCWMMA)
- if (NOT ${FOUND_ROCWMMA})
- message(FATAL_ERROR "rocwmma has not been found")
+if(GGML_HIP_ROCWMMA_FATTN)
+ if(EXISTS "/opt/rocm/include/rocwmma/rocwmma.hpp")
+ set(FOUND_ROCWMMA TRUE)
+ include_directories(/opt/rocm/include)
+ else()
+ message(FATAL_ERROR "rocwmma.hpp not found at /opt/rocm/include/rocwmma/rocwmma.hpp")
endif()
endif()
Save this as rocwmma.patch and apply it with git:
git apply rocwmma.patch
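If you want to be careful, git can dry-run the patch before applying it and then show what changed afterwards:
git apply --check rocwmma.patch                    # verifies the patch applies cleanly without modifying anything
git diff --stat ggml/src/ggml-hip/CMakeLists.txt   # after applying, confirms the file was modified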
Step 4: Building llama.cpp with rocWMMA
Reconfigure and build llama.cpp with rocWMMA support:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build-wmma -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_CXX_FLAGS="-I/opt/rocm/include/rocwmma"
cmake --build build-wmma --config Release -- -j 16
Step 5: Testing the Build
Run llama-bench with FA enabled:
./build-wmma/bin/llama-bench -m models/qwen2.5-0.5b.gguf -fa 1
You should get output similar to this:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 1B Q4_K - Medium | 373.71 MiB | 494.03 M | ROCm | 99 | 1 | pp512 | 31558.16 ± 1501.69 |
| qwen2 1B Q4_K - Medium | 373.71 MiB | 494.03 M | ROCm | 99 | 1 | tg128 | 269.67 ± 0.29 |
build: 053b1539 (5558)
Section 5: Optimizing Your Setup
- Check Memory Usage: Use the rocm-smi command to monitor VRAM utilization by the LLM model and adjust the context window length for optimal performance.
- Imatrix Quants: Use imatrix quants (e.g., IQ4_XS) for smaller model sizes with minimal quality loss.
- Cache Quantization: Leverage cache quantization to minimize memory usage and support larger context windows within the same GPU memory (see the example after this list).
- Community Resources: Engage with the r/LocalLlama subreddit and participate in discussions about the latest LLM advancements, model releases, and optimizations for llama.cpp.
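As an illustration of the cache quantization and context window points above, here is a sketch of a llama-server invocation with a quantized KV cache and a larger context. The model path is a placeholder, and flag spellings can differ slightly between llama.cpp versions, so check --help on your build; note that quantizing the V cache requires Flash Attention to be enabled:
./build-wmma/bin/llama-server -m models/your-model.gguf -ngl 99 -c 16384 -fa --cache-type-k q8_0 --cache-type-v q8_0
# In a second terminal, watch VRAM usage while it runs:
watch -n 1 rocm-smi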
Section 6: Why This Matters
This llama.cpp + ROCm setup enables you to run cutting-edge Large Language Models (LLMs) efficiently on cost-effective AMD hardware. By leveraging this solution, you can build fully private chatbots and sophisticated Retrieval Augmented Generation (RAG) systems that operate entirely on-premises. This ensures that sensitive data is never exposed to public cloud platforms, addressing critical privacy and security concerns for businesses and organizations.
Additionally, this setup gives you complete control and customization over your AI workflows. You can run fine-tuned models, optimize performance, and tailor the system to meet your specific needs without being constrained by vendor limitations or cloud dependencies.
Conclusion
In this guide, we’ve walked through setting up ROCm, compiling llama.cpp, and enabling Flash Attention on OpenSUSE using an AMD GPU. With AMD’s cost-effective hardware and powerful software tools, you can run state-of-the-art LLMs without breaking the bank. These skills not only enhance your technical repertoire but also open doors to exciting opportunities in AI and machine learning.
If you found this guide helpful, feel free to share it on LinkedIn or connect with me for further discussions. Let’s push the boundaries of what’s possible with LLMs—AMD style!