Introduction
Large Language Models (LLMs) are transforming industries by enabling advanced natural language processing, automation, and decision-making.
Running these models locally on custom hardware setups not only provides better control over performance but also ensures data privacy. If you have an AMD GPU and thought you’d need an NVIDIA GPU to run LLMs efficiently, think again!
With ROCm (Radeon Open Compute), you can harness the power of AMD GPUs for LLM inference at a fraction of the cost. 💪
In this guide, I’ll walk you through setting up ROCm and compiling llama.cpp on an OpenSUSE system. Whether you're a developer, researcher, or AI enthusiast, this tutorial will equip you with the skills to run LLMs efficiently on your AMD GPU.
Section 1: Why ROCm and llama.cpp?
- ROCm Overview: ROCm is AMD’s open software platform for GPU computing, designed to accelerate machine learning, AI, and high-performance computing workloads.
It’s a cost-effective alternative to NVIDIA’s CUDA, enabling AMD GPU users to run LLMs with GPU acceleration.
- llama.cpp Introduction: llama.cpp is a lightweight and efficient implementation of LLMs, optimized for both CPU and GPU execution.
It’s open-source, highly customizable, supports various quantization techniques to maximize performance, and has solid ROCm support for running LLMs on AMD GPUs.
- Why AMD GPUs? AMD GPUs like the RX 7900 XTX offer exceptional value for LLM workloads. With 24GB of VRAM and ~960 GB/s memory bandwidth, the RX 7900 XTX can handle larger, more sophisticated models while delivering impressive token generation speeds.
NVIDIA GPUs like the RTX 4090 (24GB VRAM) cost roughly 2.5 times as much, while the RTX 4080 (16GB VRAM) not only costs more but also limits your ability to run larger models. When it comes to LLMs, VRAM capacity and memory bandwidth are the most critical factors, and AMD GPUs deliver strongly on both fronts.
- Why OpenSUSE? OpenSUSE is known for its stability, flexibility, and robust package management.
Its rolling-release packaging model keeps packages up to date, making it an ideal platform for experimenting with AI technologies whose underlying dependencies and inference engines are constantly being updated.
Section 2: Preparing Your System
Before diving into the installation, ensure your system meets the following requirements:
- Hardware:
  - A recent AMD GPU (e.g., Radeon RX 7900 series for RDNA3 or RX 6800 series for RDNA2).
  - At least 16GB of system RAM (the more, the better).
- Software: OpenSUSE with root access and basic development tools installed.
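A quick way to confirm these requirements before you start (the package and pattern names below are typical for OpenSUSE, so adjust them if your setup differs):
lspci -nn | grep -iE 'vga|display'      # identify the GPU and confirm it is an RDNA2/RDNA3 card
free -h                                 # check available system RAM
sudo zypper in -t pattern devel_basis   # basic development tools (compiler, make, etc.)
sudo zypper in git cmake wget           # tools used later in this guide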
Section 3: Step-by-Step Installation Guide
Step 1: Adding the ROCm Repository
Add the ROCm repository to your OpenSUSE system:
sudo zypper addrepo https://download.opensuse.org/repositories/science:GPU:ROCm/openSUSE_Factory/science:GPU:ROCm.repo
sudo zypper refresh
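To double-check that the repository was added, listing the configured repositories should show it:
zypper lr -u | grep -i rocm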
Step 2: Installing ROCm System Dependencies
Install the necessary ROCm packages:
sudo zypper in hipblas-common-devel rocminfo rocm-hip rocm-hip-devel rocm-cmake rocm-smi libhipblas2-devel librocblas4-devel libcurl-devel libmtmd
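Once the packages are installed, it is worth confirming that ROCm can actually see your GPU (on some systems you may first need to add your user to the video and render groups and log in again):
rocminfo | grep -i gfx      # should list your GPU's gfx target, e.g. gfx1100
rocm-smi                    # shows temperature, clocks, and VRAM usage per GPU
# Only needed if rocminfo reports no GPU agents:
sudo usermod -aG video,render $USER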
Step 3: Cloning the llama.cpp Repository
Download the llama.cpp repository:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
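The benchmark outputs later in this guide were produced with llama.cpp builds dc39a5e7 (5169) and 053b1539 (5558); if you want to reproduce those numbers closely you can check out one of those commits, otherwise just build the latest master:
git checkout dc39a5e7   # optional, only to roughly match the non-rocWMMA results shown below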
Step 4: Setting Up the Build Environment
Specify your AMD GPU target (e.g., gfx1100 for RDNA3 or gfx1030 for RDNA2) and configure the build:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100
Step 5: Compiling llama.cpp
Build llama.cpp using the specified number of threads (e.g., 16):
cmake --build build-rocm --config Release -- -j 16
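If you prefer to match the thread count to your CPU automatically, nproc can supply it:
cmake --build build-rocm --config Release -- -j "$(nproc)"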
Step 6: Testing the Build
Download a GGUF model from Hugging Face and test the setup:
cd models
wget -c -O qwen2.5-0.5b.gguf https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf
cd ..
./build-rocm/bin/llama-bench -m models/qwen2.5-0.5b.gguf
You should get output similar to this:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 1B Q4_K - Medium | 373.71 MiB | 494.03 M | ROCm | 99 | pp512 | 30061.00 ± 201.52 |
| qwen2 1B Q4_K - Medium | 373.71 MiB | 494.03 M | ROCm | 99 | tg128 | 271.27 ± 0.74 |
build: dc39a5e7 (5169)
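llama-bench only measures throughput. To actually talk to the model, the same build directory contains llama-cli for interactive use and llama-server for a local HTTP endpoint with an OpenAI-compatible API. A minimal sketch (exact flag spellings can vary between llama.cpp versions, so check --help on your build):
./build-rocm/bin/llama-cli -m models/qwen2.5-0.5b.gguf -ngl 99 -p "Explain ROCm in one sentence."
./build-rocm/bin/llama-server -m models/qwen2.5-0.5b.gguf -ngl 99 --port 8080
Here -ngl 99 offloads all model layers to the GPU.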
Section 4: Enhanced Flash Attention Using rocWMMA
Flash Attention (FA) is a memory-efficient attention algorithm that enables larger context windows and better performance. To improve FA performance further, we need to build llama.cpp with the rocWMMA library.
Step 1: Cloning and Building rocWMMA
Clone the rocWMMA repository and build it:
git clone https://github.com/ROCm/rocWMMA.git
cd rocWMMA
CC=/usr/lib64/rocm/llvm/bin/clang CXX=/usr/lib64/rocm/llvm/bin/clang++ \
cmake -B build . \
-DROCWMMA_BUILD_TESTS=OFF \
-DROCWMMA_BUILD_SAMPLES=OFF \
-DOpenMP_CXX_FLAGS="-fopenmp" \
-DOpenMP_CXX_LIB_NAMES="omp" \
-DOpenMP_omp_LIBRARY="/usr/lib/libgomp1.so"
cmake --build build -- -j 16
Step 2: Installing rocWMMA
Install rocWMMA to the system directory:
sudo mkdir -p /opt/rocm
sudo chown -R $USER:users /opt/rocm/
cmake --build build --target install -- -j 16
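You can confirm the headers ended up where llama.cpp will look for them:
ls /opt/rocm/include/rocwmma/rocwmma.hpp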
Step 3: Patching llama.cpp (Optional)
If rocWMMA is not detected, apply the provided patch to ggml/src/ggml-hip/CMakeLists.txt
:
diff --git a/ggml/src/ggml-hip/CMakeLists.txt b/ggml/src/ggml-hip/CMakeLists.txt
index 1fe8fe3b..9577203b 100644
--- a/ggml/src/ggml-hip/CMakeLists.txt
+++ b/ggml/src/ggml-hip/CMakeLists.txt
@@ -39,10 +39,12 @@ endif()
find_package(hip REQUIRED)
find_package(hipblas REQUIRED)
find_package(rocblas REQUIRED)
-if (GGML_HIP_ROCWMMA_FATTN)
- CHECK_INCLUDE_FILE_CXX("rocwmma/rocwmma.hpp" FOUND_ROCWMMA)
- if (NOT ${FOUND_ROCWMMA})
- message(FATAL_ERROR "rocwmma has not been found")
+if(GGML_HIP_ROCWMMA_FATTN)
+ if(EXISTS "/opt/rocm/include/rocwmma/rocwmma.hpp")
+ set(FOUND_ROCWMMA TRUE)
+ include_directories(/opt/rocm/include)
+ else()
+ message(FATAL_ERROR "rocwmma.hpp not found at /opt/rocm/include/rocwmma/rocwmma.hpp")
endif()
endif()
Save this as rocwmma.patch and apply it with git:
git apply rocwmma.patch
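If you want to be careful, git can dry-run the patch before applying it and then show what changed afterwards:
git apply --check rocwmma.patch                    # verifies the patch applies cleanly without modifying anything
git diff --stat ggml/src/ggml-hip/CMakeLists.txt   # after applying, confirms the file was modified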
Step 4: Building llama.cpp with rocWMMA
Reconfigure and build llama.cpp with rocWMMA support:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build-wmma -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_CXX_FLAGS="-I/opt/rocm/include/rocwmma"
cmake --build build-wmma --config Release -- -j 16
Step 5: Testing the Build
Run llama-bench with FA enabled:
./build-wmma/bin/llama-bench -m models/qwen2.5-0.5b.gguf -fa 1
You should get output similar to this:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 1B Q4_K - Medium | 373.71 MiB | 494.03 M | ROCm | 99 | 1 | pp512 | 31558.16 ± 1501.69 |
| qwen2 1B Q4_K - Medium | 373.71 MiB | 494.03 M | ROCm | 99 | 1 | tg128 | 269.67 ± 0.29 |
build: 053b1539 (5558)
Section 5: Optimizing Your Setup
- Check Memory Usage: Use the rocm-smi command to monitor VRAM utilization by the LLM model and adjust the context window length for optimal performance.
- Imatrix Quants: Use imatrix quants (e.g., IQ4_XS) for smaller model sizes with minimal quality loss.
- Cache Quantization: Leverage cache quantization to minimize memory usage and support larger context windows within the same GPU memory (see the example after this list).
- Community Resources: Engage with the r/LocalLlama subreddit and participate in discussions about the latest LLM advancements, model releases, and optimizations for llama.cpp.
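As an illustration of the cache quantization and context window points above, here is a sketch of a llama-server invocation with a quantized KV cache and a larger context. The model path is a placeholder, and flag spellings can differ slightly between llama.cpp versions, so check --help on your build; note that quantizing the V cache requires Flash Attention to be enabled:
./build-wmma/bin/llama-server -m models/your-model.gguf -ngl 99 -c 16384 -fa --cache-type-k q8_0 --cache-type-v q8_0
# In a second terminal, watch VRAM usage while it runs:
watch -n 1 rocm-smi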
Section 6: Why This Matters
This llama.cpp + ROCm setup enables you to run cutting-edge Large Language Models (LLMs) efficiently on cost-effective AMD hardware. By leveraging this solution, you can build fully private chatbots and sophisticated Retrieval Augmented Generation (RAG) systems that operate entirely on-premises. This ensures that sensitive data is never exposed to public cloud platforms, addressing critical privacy and security concerns for businesses and organizations.
Additionally, this setup gives you complete control and customization over your AI workflows. You can run fine-tuned models, optimize performance, and tailor the system to meet your specific needs without being constrained by vendor limitations or cloud dependencies.
Conclusion
In this guide, we’ve walked through setting up ROCm, compiling llama.cpp, and enabling Flash Attention on OpenSUSE using an AMD GPU. With AMD’s cost-effective hardware and powerful software tools, you can run state-of-the-art LLMs without breaking the bank. These skills not only enhance your technical repertoire but also open doors to exciting opportunities in AI and machine learning.
If you found this guide helpful, feel free to share it on LinkedIn or connect with me for further discussions. Let’s push the boundaries of what’s possible with LLMs—AMD style!