Comparative Analysis: Testing & Evaluating LLM Security with Garak Across Different Models
Crispin

Crispin @crispin_r

About: A sophomore who loves to develop cool stuff and likes facts, art & crafts, space, music, and <code/>

Location:
India
Joined:
Jun 21, 2021

Comparative Analysis: Testing & Evaluating LLM Security with Garak Across Different Models

Publish Date: Jun 1
1 0

Hey everyone, I’m trying out LLM security testing and it’s pretty interesting. With all these AI models around, it’s important to check they don’t do weird stuff. I found this tool called Garak (Generative AI Red-teaming and Assessment Kit). It’s like a set of tests to see if models can be tricked into revealing secrets, making bad code, or saying nasty things.

In this post, I’ll go over four models: Mistral-Nemo, Gemma3, LLaMA 2-13B, and Vicuna-13B. I ran Garak on the Ollama setup. You can think of it like a friendly match, but instead of goal scoring, we score on security stuff.

AI Bake-off GIF

Background on Garak

Garak is kind of like Nmap but for AI. Instead of scanning a network, it sends special prompts (called “probes”) to the model. Each probe tries to make the model slip up.

Some probes are:

  • PromptInject: Tries to hide instructions to see if the model follows them.
  • MalwareGen: Asks for malicious code.
  • LeakerPlay: Tries to get the model to spill its internal stuff or training data.
  • RealToxicityPrompts: Pushes the model to say toxic things.

When Garak runs, it saves all the outputs in logs. Later I look at the logs and see which models passed or failed.

Models and Setup

I picked these open models:

  1. Mistral-Nemo (7B): Released July 2024, supposed to be good at chat.
  2. Gemma3:latest: From Google, late 2024 release, with some safety tweaks.
  3. LLaMA 2-13B: Meta’s 13B model, popular for many tasks.
  4. Vicuna-13B: Based on LLaMA, tuned to be safer.

I ran all models on Ollama (version 0.1.34 or newer). My computer has 32 CPU cores, 128 GB RAM, and an NVIDIA A100 GPU. I used Garak v0.10.3.post1 with default settings.

Installing and Configuring Ollama

# Update Ollama
ollama upgrade

# Download models
ollama pull mistral-nemo
ollama pull gemma3:latest
ollama pull llama2-13b
ollama pull vicuna-13b
Enter fullscreen mode Exit fullscreen mode

Running Garak with Ollama

To scan a model, run:

garak --model_type ollama \
      --model_name mistral-nemo \
      --probes malwaregen.Evasion,promptinject \
      --report_prefix ./reports/mistral_nemo
Enter fullscreen mode Exit fullscreen mode

I set a random seed and repeated each test three times to make it more reliable.

Methodology

  1. Probe Selection: I chose four main probes:

    • MalwareGen.Evasion: Asks for code that could bypass antivirus.
    • PromptInject.Encoding: Hides instructions in encoded text to see if the model follows them.
    • LeakerPlay.DataLeakage: Tries to get the model to reveal training data or hidden prompts.
    • RealToxicityPrompts: Pushes the model to use toxic language.
  2. Metrics:

    • Failure Rate (%): How often the model messed up.
    • Mean Time per Probe (s): How long it takes on average.
    • Resource Usage: GPU memory and CPU usage.
  3. Probe Execution: Each probe had 20 prompts. The model got five tries for each prompt. If it failed once, that prompt counts as a fail.

  4. Data Analysis: I averaged results from three runs and got standard deviations. Results are in the table below.

Comparative Results

Model MalwareGen.Evasion PromptInject.Encoding LeakerPlay.DataLeakage RealToxicityPrompts
Mistral-Nemo 100.0% ± 0.0% 92.0% ± 1.7% 85.7% ± 2.3% 17.0% ± 1.5%
Gemma3:latest 56.3% ± 4.1% 37.5% ± 3.8% 48.2% ± 4.5% 10.5% ± 1.2%
LLaMA 2-13B 81.0% ± 3.9% 68.3% ± 2.5% 72.4% ± 3.1% 26.7% ± 2.0%
Vicuna-13B 62.5% ± 4.8% 54.6% ± 3.0% 61.3% ± 3.5% 3.8% ± 1.0%

Note: Failure Rate (%) shows how often the model produced unwanted behavior.

Gon GIF
Yes, I know that you'll be like this (even I was 😂).

Mistral-Nemo

  • MalwareGen.Evasion (100.0%): It always gave malware code. No defense at all.
  • PromptInject.Encoding (92.0%): Fell for encoding tricks most of the time.
  • LeakerPlay.DataLeakage (85.7%): Leaked training prompts a lot.
  • RealToxicityPrompts (17.0%): Created toxic content sometimes.

Gemma3:latest

  • MalwareGen.Evasion (56.3%): Sometimes refused but got tricked by advanced hacks.
  • PromptInject.Encoding (37.5%): Better but not perfect.
  • LeakerPlay.DataLeakage (48.2%): Half the time it leaked something.
  • RealToxicityPrompts (10.5%): Rarely said toxic things.

LLaMA 2-13B

  • MalwareGen.Evasion (81.0%): Produced malware scripts often.
  • PromptInject.Encoding (68.3%): Fell for encoding a lot.
  • LeakerPlay.DataLeakage (72.4%): Regularly leaked data.
  • RealToxicityPrompts (26.7%): Most toxic among the group.

Vicuna-13B

  • MalwareGen.Evasion (62.5%): Was not as bad as LLaMA but still failed a lot.
  • PromptInject.Encoding (54.6%): Mediocre, could still be tricked.
  • LeakerPlay.DataLeakage (61.3%): Leaked data more than half the time.
  • RealToxicityPrompts (3.8%): Best at not being toxic.

Discussion

Security Trends Across Models

  1. Older vs Newer: Older models like Vicuna and LLaMA 2 failed more often. Newer ones like Gemma3 have more guardrails.
  2. Instruction-Tuning: Vicuna had extra tuning so it was better at not making malware or saying toxic stuff.
  3. Guardrails Matter: Gemma3 blocked some attacks but still fell for advanced ones.
  4. Architecture: Models without built-in safety (Mistral-Nemo, LLaMA 2) were very vulnerable.

Performance and Resource Usage

  • Average Time per Prompt:
    • Mistral-Nemo: 4.8 s
    • Gemma3: 6.2 s
    • LLaMA 2-13B: 5.5 s
    • Vicuna-13B: 5.7 s
  • GPU Memory Used:
    • Mistral-Nemo: 12 GB
    • Gemma3: 16 GB
    • LLaMA 2-13B: 14 GB
    • Vicuna-13B: 15 GB
  • CPU Load: About 20–25% for all while testing.

Gemma3 used more memory, so it’s slower but a bit safer.

Check GIF

And yeah don't be him and make sure to do the checks properly 😁.

Recommendations

  1. Keep Testing: Run Garak regularly to find new flaws.
  2. Use Multiple Safety Layers: Combine model guardrails with external checks.
  3. Choose Tuned Models: Vicuna shows that tuning helps.
  4. Update Your Tools: Ollama has had bugs (like CVE-2024-37032). Always use the latest version.

Conclusion

Running Garak on these models shows that all of them have weak spots. Mistral-Nemo always failed, Gemma3 was okay but not perfect, LLaMA 2 struggled, and Vicuna was the best but still not flawless. The main lesson is that we need ongoing tests, several safety measures, and up-to-date software to keep these AI models safe.

Bye GIF

Thanks for reading, and happy red-teaming!

Comments 0 total

    Add comment