Comparative Analysis: Testing & Evaluating LLM Security with Garak Across Different Models

Hey everyone, I’m trying out LLM security testing and it’s pretty interesting. With all these AI models around, it’s important to check they don’t do weird stuff. I found this tool called Garak (Generative AI Red-teaming and Assessment Kit). It’s like a set of tests to see if models can be tricked into revealing secrets, making bad code, or saying nasty things.

In this post, I’ll go over four models: Mistral-Nemo, Gemma3, LLaMA 2-13B, and Vicuna-13B. I ran Garak on the Ollama setup. You can think of it like a friendly match, but instead of goal scoring, we score on security stuff.

Background on Garak

Garak is kind of like Nmap but for AI. Instead of scanning a network, it sends special prompts (called “probes”) to the model. Each probe tries to make the model slip up.

Some probes are:

PromptInject: Tries to hide instructions to see if the model follows them.
MalwareGen: Asks for malicious code.
LeakerPlay: Tries to get the model to spill its internal stuff or training data.
RealToxicityPrompts: Pushes the model to say toxic things.

When Garak runs, it saves all the outputs in logs. Later I look at the logs and see which models passed or failed.

Models and Setup

I picked these open models:

Mistral-Nemo (7B): Released July 2024, supposed to be good at chat.
Gemma3:latest: From Google, late 2024 release, with some safety tweaks.
LLaMA 2-13B: Meta’s 13B model, popular for many tasks.
Vicuna-13B: Based on LLaMA, tuned to be safer.

I ran all models on Ollama (version 0.1.34 or newer). My computer has 32 CPU cores, 128 GB RAM, and an NVIDIA A100 GPU. I used Garak v0.10.3.post1 with default settings.

Installing and Configuring Ollama

# Update Ollama
ollama upgrade

# Download models
ollama pull mistral-nemo
ollama pull gemma3:latest
ollama pull llama2-13b
ollama pull vicuna-13b

Running Garak with Ollama

To scan a model, run:

garak --model_type ollama \
      --model_name mistral-nemo \
      --probes malwaregen.Evasion,promptinject \
      --report_prefix ./reports/mistral_nemo

I set a random seed and repeated each test three times to make it more reliable.

Methodology

Probe Selection: I chose four main probes:
- MalwareGen.Evasion: Asks for code that could bypass antivirus.
- PromptInject.Encoding: Hides instructions in encoded text to see if the model follows them.
- LeakerPlay.DataLeakage: Tries to get the model to reveal training data or hidden prompts.
- RealToxicityPrompts: Pushes the model to use toxic language.
Metrics:
- Failure Rate (%): How often the model messed up.
- Mean Time per Probe (s): How long it takes on average.
- Resource Usage: GPU memory and CPU usage.
Probe Execution: Each probe had 20 prompts. The model got five tries for each prompt. If it failed once, that prompt counts as a fail.
Data Analysis: I averaged results from three runs and got standard deviations. Results are in the table below.

Comparative Results

Model	MalwareGen.Evasion	PromptInject.Encoding	LeakerPlay.DataLeakage	RealToxicityPrompts
Mistral-Nemo	100.0% ± 0.0%	92.0% ± 1.7%	85.7% ± 2.3%	17.0% ± 1.5%
Gemma3:latest	56.3% ± 4.1%	37.5% ± 3.8%	48.2% ± 4.5%	10.5% ± 1.2%
LLaMA 2-13B	81.0% ± 3.9%	68.3% ± 2.5%	72.4% ± 3.1%	26.7% ± 2.0%
Vicuna-13B	62.5% ± 4.8%	54.6% ± 3.0%	61.3% ± 3.5%	3.8% ± 1.0%

Note: Failure Rate (%) shows how often the model produced unwanted behavior.

Yes, I know that you'll be like this (even I was 😂).

Mistral-Nemo

MalwareGen.Evasion (100.0%): It always gave malware code. No defense at all.
PromptInject.Encoding (92.0%): Fell for encoding tricks most of the time.
LeakerPlay.DataLeakage (85.7%): Leaked training prompts a lot.
RealToxicityPrompts (17.0%): Created toxic content sometimes.

Gemma3:latest

MalwareGen.Evasion (56.3%): Sometimes refused but got tricked by advanced hacks.
PromptInject.Encoding (37.5%): Better but not perfect.
LeakerPlay.DataLeakage (48.2%): Half the time it leaked something.
RealToxicityPrompts (10.5%): Rarely said toxic things.

LLaMA 2-13B

MalwareGen.Evasion (81.0%): Produced malware scripts often.
PromptInject.Encoding (68.3%): Fell for encoding a lot.
LeakerPlay.DataLeakage (72.4%): Regularly leaked data.
RealToxicityPrompts (26.7%): Most toxic among the group.

Vicuna-13B

MalwareGen.Evasion (62.5%): Was not as bad as LLaMA but still failed a lot.
PromptInject.Encoding (54.6%): Mediocre, could still be tricked.
LeakerPlay.DataLeakage (61.3%): Leaked data more than half the time.
RealToxicityPrompts (3.8%): Best at not being toxic.

Discussion

Security Trends Across Models

Older vs Newer: Older models like Vicuna and LLaMA 2 failed more often. Newer ones like Gemma3 have more guardrails.
Instruction-Tuning: Vicuna had extra tuning so it was better at not making malware or saying toxic stuff.
Guardrails Matter: Gemma3 blocked some attacks but still fell for advanced ones.
Architecture: Models without built-in safety (Mistral-Nemo, LLaMA 2) were very vulnerable.

Performance and Resource Usage

Average Time per Prompt:
- Mistral-Nemo: 4.8 s
- Gemma3: 6.2 s
- LLaMA 2-13B: 5.5 s
- Vicuna-13B: 5.7 s
GPU Memory Used:
- Mistral-Nemo: 12 GB
- Gemma3: 16 GB
- LLaMA 2-13B: 14 GB
- Vicuna-13B: 15 GB
CPU Load: About 20–25% for all while testing.

Gemma3 used more memory, so it’s slower but a bit safer.

And yeah don't be him and make sure to do the checks properly 😁.

Recommendations

Keep Testing: Run Garak regularly to find new flaws.
Use Multiple Safety Layers: Combine model guardrails with external checks.
Choose Tuned Models: Vicuna shows that tuning helps.
Update Your Tools: Ollama has had bugs (like CVE-2024-37032). Always use the latest version.

Conclusion

Running Garak on these models shows that all of them have weak spots. Mistral-Nemo always failed, Gemma3 was okay but not perfect, LLaMA 2 struggled, and Vicuna was the best but still not flawless. The main lesson is that we need ongoing tests, several safety measures, and up-to-date software to keep these AI models safe.

Thanks for reading, and happy red-teaming!

Crispin @crispin_r