This is the 2nd part of my investigation of local LLM inference speed. Here are the 1st and 3rd ones.
NVIDIA GPUs offer a Shared GPU Memory feature for Windows users, which allocates up to 50% of system RAM as virtual VRAM. If your GPU runs out of dedicated video memory, the driver can implicitly use system memory without throwing out-of-memory errors, so application execution is not interrupted. Yet there's a performance toll.
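To see the spill-over in action, here's a minimal sketch (assuming PyTorch with CUDA on Windows and a driver whose sysmem fallback policy allows it) that keeps allocating 1 GiB chunks past the card's dedicated VRAM; on Linux the same loop would normally stop with an out-of-memory error:

```python
import torch

# Minimal sketch: keep allocating ~1 GiB fp16 chunks on the GPU.
# On Windows, with the driver's sysmem fallback enabled, allocations past the
# 24 GB of dedicated VRAM may silently land in Shared GPU Memory instead of
# failing; watch "Shared GPU memory usage" climb in Task Manager.
chunks = []
try:
    for i in range(40):
        # 512 Mi elements * 2 bytes (fp16) = 1 GiB per chunk
        chunks.append(torch.empty(512 * 1024 * 1024, dtype=torch.float16, device="cuda"))
        print(f"allocated ~{i + 1} GiB, "
              f"reserved by PyTorch: {torch.cuda.memory_reserved() / 2**30:.1f} GiB")
except torch.cuda.OutOfMemoryError:
    print("the driver refused to spill into system RAM and raised OOM instead")
```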
Memory is the key constraint when dealing with LLMs, and VRAM is way more expensive than ordinary DDR4/DDR5 system memory. For example, an RTX 4090 with 24GB of GDDR6X on board costs around $1700, while an RTX 6000 with 48GB of GDDR6 goes above $5000. Two sticks of G.Skill DDR5 with a total capacity of 96GB will cost you around $300. My workstation has an RTX 4090 and 96GB of RAM, making 72GB of memory available to the video card (24GB dedicated plus 48GB shared). Does it make sense to fill your PC with as much RAM as possible and have your LLM workloads use Shared GPU Memory?
I have already tested how GPU memory overflow into RAM influences LLM training speed. This time I tried inference via LM Studio/llama.cpp using a 4-bit quantized Llama 3.1 70B that takes up 42.5GB.
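As a quick sanity check on that 42.5GB figure: a 4-bit K-quant stores roughly 4.8 bits per weight once scales are included (the exact value depends on the quant type, so treat it as an assumption):

```python
# Rough size estimate for a 4-bit quantized 70B model.
params = 70.6e9          # Llama 3.1 70B parameter count
bits_per_weight = 4.8    # assumed average for a 4-bit K-quant, scales included
print(f"~{params * bits_per_weight / 8 / 1e9:.1f} GB")  # ~42.4 GB, close to the 42.5 GB file
```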
LM Studio (a wrapper around llama.cpp) offers a setting for selecting the number of layers to offload to the GPU, with 100% making the GPU the sole processor. At the same time, you can choose to keep some of the layers in system RAM and have the CPU do part of the computations; the main purpose is to avoid VRAM overflows. Yet, as already mentioned, on Windows (unlike Linux) it is possible to overflow VRAM into system RAM.
I tried 3 offload settings (there's a code sketch reproducing them after the list):
- 100% GPU: ~50% of the model weights stayed in VRAM while the other half was located in Shared GPU Memory; the GPU did all the computations
- 50% GPU / 50% CPU: this setting filled VRAM almost completely without overflowing into Shared GPU Memory; half of the layers were computed by the GPU and the other half by the CPU
- 100% CPU: no GPU involved
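For reference, the same splits can be reproduced outside LM Studio with the llama-cpp-python bindings (the model path below is a placeholder). Llama 3.1 70B has 80 transformer layers, so n_gpu_layers=80 roughly corresponds to "100% GPU", 40 to the 50/50 split, and 0 to pure CPU:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

# Hypothetical model path; n_gpu_layers is the same knob as LM Studio's GPU-offload slider.
llm = Llama(
    model_path="./Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=40,   # 80 = all layers on GPU, 40 = 50/50, 0 = CPU only
    n_ctx=4096,
)
out = llm("Q: Why is the sky blue? A:", max_tokens=64)
print(out["choices"][0]["text"])
```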
Here's my hardware setup:
- Intel Core i5 13600KF (OC to 5.5GHz)
- 96GB DDR5 RAM at 4800MT/s (CL 30, RCD 30, RCDW 30, RP 30; ~70GB/s Read/Write/Copy in AIDA64)
- RTX 4090 24GB VRAM (OC, core at 3030MHz, VRAM +1600MHz; ~37,000 GPU score in Time Spy)
And here are the results:
| Offload setting | Tokens/s | Time to first token (s) | RAM used (GB) |
|---|---|---|---|
| 100% GPU | 0.69 | 4.66 | 60 |
| 50/50 GPU/CPU | 2.32 | 0.42 | 42 |
| 100% CPU | 1.42 | 0.71 | 42 |
Please note that for the time to first token I used the "warm" metric, i.e. the time measured on the second generation (I loaded the model, generated a completion, and then clicked "regenerate"). For the cold time to first token I got:
- 100% GPU ~6.9s
- 50/50 GPU/CPU ~2.4s
- 100% CPU ~30s
Besides, when using 100% GPU offload, about 20GB more system RAM was used (no matter whether "use_mlock" was set or not).
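The ranking makes sense if token generation is treated as memory-bandwidth-bound: every generated token has to stream all 42.5GB of weights from wherever they live. Below is a back-of-envelope estimate; the bandwidth figures (~1000GB/s for the 4090's GDDR6X, the ~70GB/s DDR5 number from AIDA above, and ~25GB/s effective over PCIe 4.0 x16 for weights parked in Shared GPU Memory) are assumptions, and compute time plus the KV cache are ignored:

```python
# Memory-bandwidth-only estimate of decode speed for a 42.5 GB model:
# seconds/token ~= sum over memory pools of (weight bytes in that pool / pool bandwidth).
WEIGHTS_GB = 42.5

def estimate_tps(pools):
    """pools: list of (fraction_of_weights, bandwidth_in_GB_per_s) tuples."""
    seconds_per_token = sum(WEIGHTS_GB * frac / bw for frac, bw in pools)
    return 1.0 / seconds_per_token

print(f"100% GPU, half in shared memory: {estimate_tps([(0.5, 1000), (0.5, 25)]):.1f} tok/s")
print(f"50/50 GPU/CPU:                   {estimate_tps([(0.5, 1000), (0.5, 70)]):.1f} tok/s")
print(f"100% CPU:                        {estimate_tps([(1.0, 70)]):.1f} tok/s")
```

These estimates (~1.1, ~3.1 and ~1.6 tokens/s) overshoot the measured numbers but reproduce the ordering: the PCIe link to Shared GPU Memory is the narrowest pipe, so the "100% GPU" run ends up the slowest.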
Apparently, there's not much point in Shared GPU Memory for inference: letting the weights spill over PCIe is slower than splitting the work 50/50 with the CPU, and even slower than running on the CPU alone.
P.S. Screenshots...