This is the 2nd part of my investigation of local LLM inference speed. Here are the 1st and 3rd ones.
NVIDIA GPUs offer a Shared GPU Memory feature for Windows users, which allocates up to 50% of system RAM as virtual VRAM. If your GPU runs out of dedicated video memory, the driver can implicitly use system memory without throwing out-of-memory errors, so application execution is not interrupted. Yet there's a performance toll.
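To see the spill-over in action, here's a minimal sketch (assuming PyTorch with CUDA on Windows and a driver whose sysmem fallback policy allows it) that keeps allocating 1 GiB chunks past the card's dedicated VRAM; on Linux the same loop would normally stop with an out-of-memory error:

```python
import torch

# Minimal sketch: keep allocating ~1 GiB fp16 chunks on the GPU.
# On Windows, with the driver's sysmem fallback enabled, allocations past the
# 24 GB of dedicated VRAM may silently land in Shared GPU Memory instead of
# failing; watch "Shared GPU memory usage" climb in Task Manager.
chunks = []
try:
    for i in range(40):
        # 512 Mi elements * 2 bytes (fp16) = 1 GiB per chunk
        chunks.append(torch.empty(512 * 1024 * 1024, dtype=torch.float16, device="cuda"))
        print(f"allocated ~{i + 1} GiB, "
              f"reserved by PyTorch: {torch.cuda.memory_reserved() / 2**30:.1f} GiB")
except torch.cuda.OutOfMemoryError:
    print("the driver refused to spill into system RAM and raised OOM instead")
```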
Memory is the key constraint when dealing with LLMs, and VRAM is way more expensive than ordinary DDR4/DDR5 system memory. For example, an RTX 4090 with 24GB of GDDR6X on board costs around $1700, while an RTX 6000 with 48GB of GDDR6 goes above $5000. Two sticks of G.Skill DDR5 with a total capacity of 96GB will cost you around $300. My workstation has an RTX 4090 and 96GB of RAM, making 72GB of memory available to the video card (24GB dedicated plus 48GB shared). Does it make sense to fill your PC with as much RAM as possible and have your LLM workloads use Shared GPU Memory?
I have already tested how GPU memory overflow into RAM influences LLM training speed. This time I tried inference via LM Studio/llama.cpp using a 4-bit quantized Llama 3.1 70B that takes up 42.5GB.
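As a quick sanity check on that 42.5GB figure: a 4-bit K-quant stores roughly 4.8 bits per weight once scales are included (the exact value depends on the quant type, so treat it as an assumption):

```python
# Rough size estimate for a 4-bit quantized 70B model.
params = 70.6e9          # Llama 3.1 70B parameter count
bits_per_weight = 4.8    # assumed average for a 4-bit K-quant, scales included
print(f"~{params * bits_per_weight / 8 / 1e9:.1f} GB")  # ~42.4 GB, close to the 42.5 GB file
```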
LM Studio (a wrapper around llama.cpp) offers a setting for selecting the number of layers to offload to the GPU, with 100% making the GPU the sole processor. At the same time, you can choose to keep some of the layers in system RAM and have the CPU do part of the computations; the main purpose is to avoid VRAM overflows. Yet, as already mentioned, on Windows (unlike Linux) it is possible to overflow VRAM into system RAM.
I tried 3 offload settings (there's a code sketch reproducing them after the list):
- 100% GPU: ~50% of the model weights stayed in VRAM while the other half was located in Shared GPU Memory; the GPU did all the computations
- 50% GPU / 50% CPU: this setting filled VRAM almost completely without overflowing into Shared GPU Memory; half of the layers were computed by the GPU and the other half by the CPU
- 100% CPU: no GPU involved
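For reference, the same splits can be reproduced outside LM Studio with the llama-cpp-python bindings (the model path below is a placeholder). Llama 3.1 70B has 80 transformer layers, so n_gpu_layers=80 roughly corresponds to "100% GPU", 40 to the 50/50 split, and 0 to pure CPU:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

# Hypothetical model path; n_gpu_layers is the same knob as LM Studio's GPU-offload slider.
llm = Llama(
    model_path="./Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=40,   # 80 = all layers on GPU, 40 = 50/50, 0 = CPU only
    n_ctx=4096,
)
out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])
```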
Here's my hardware setup:
- Intel Core i5 13600KF (OC to 5.5GHz)
- 96GB DDR5 RAM at 4800MT/s (CL 30, RCD 30, RCDW 30, RP 30; ~70GB/s Read/Write/Copy in AIDA64)
- RTX 4090 24GB VRAM (OC, core at 3030MHz, VRAM +1600MHz; ~37,000 GPU score in Time Spy)
And here are the results:
| Offload setting | Tokens/s | Time to first token (s) | RAM used (GB) |
|---|---|---|---|
| 100% GPU | 0.69 | 4.66 | 60 |
| 50/50 GPU/CPU | 2.32 | 0.42 | 42 |
| 100% CPU | 1.42 | 0.71 | 42 |
Please note that for the time to first token I used the "warm" metric, i.e. the time measured on the second generation (I loaded the model, generated a completion, and then clicked "regenerate"). For the cold time to first token I got:
- 100% GPU ~6.9s
- 50/50 GPU/CPU ~2.4s
- 100% CPU ~30s
Besides, when using 100% GPU offload, about 20GB more system RAM was used (no matter whether "use_mlock" was set or not).
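The ranking makes sense if token generation is treated as memory-bandwidth-bound: every generated token has to stream all 42.5GB of weights from wherever they live. Below is a back-of-envelope estimate; the bandwidth figures (~1000GB/s for the 4090's GDDR6X, the ~70GB/s DDR5 number from AIDA above, and ~25GB/s effective over PCIe 4.0 x16 for weights parked in Shared GPU Memory) are assumptions, and compute time plus the KV cache are ignored:

```python
# Memory-bandwidth-only estimate of decode speed for a 42.5 GB model:
# seconds/token ~= sum over memory pools of (weight bytes in that pool / pool bandwidth).
WEIGHTS_GB = 42.5

def estimate_tps(pools):
    """pools: list of (fraction_of_weights, bandwidth_in_GB_per_s) tuples."""
    seconds_per_token = sum(WEIGHTS_GB * frac / bw for frac, bw in pools)
    return 1.0 / seconds_per_token

print(f"100% GPU, half in shared memory: {estimate_tps([(0.5, 1000), (0.5, 25)]):.1f} tok/s")
print(f"50/50 GPU/CPU:                   {estimate_tps([(0.5, 1000), (0.5, 70)]):.1f} tok/s")
print(f"100% CPU:                        {estimate_tps([(1.0, 70)]):.1f} tok/s")
```

These estimates (~1.1, ~3.1 and ~1.6 tokens/s) overshoot the measured numbers but reproduce the ordering: the PCIe link to Shared GPU Memory is the narrowest pipe, so the "100% GPU" run ends up the slowest.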
Apparently, there's not much point in Shared GPU Memory for inference: letting the weights spill over PCIe is slower than splitting the work 50/50 with the CPU, and even slower than running on the CPU alone.
P.S. Screenshots...