Moonshot Labs just changed the game with Kimi K2—a one-trillion parameter, open-source LLM built for agentic tasks and positioned as a true rival to Claude 4 Sonnet, but at a fraction of the cost.
What makes Kimi K2 stand out? It’s not just big, it’s smart—designed for high-agency workloads with a cutting-edge MoE architecture (384 experts, streamlined attention heads) and their custom Muon optimizer. The team trained it for real-world tool use, multi-step reasoning, and robust self-critique, making it uniquely capable for next-gen agent workflows.
Bottom line: Kimi K2 delivers natural, human-like responses, excels at creative writing, and brings agentic intelligence to the open-source community—finally offering a serious alternative to closed models like Claude.
Why Red Team Kimi K2?
The massive capabilities of Kimi K2 mean equally massive attack surfaces. Here’s why red teaming this model (or any agentic LLM at this scale) isn’t optional:
- Extended Context Processing: With up to 2M tokens, context poisoning and injection attacks have never had a bigger playground. Can your defenses spot the threat buried in a page-1000 footnote?
- Tool Use & Multi-Agent Risks: Kimi K2’s core strength—its agentic ability—also means attackers can exploit multi-step reasoning, tool chains, and API calls in unexpected ways.
- MoE & Memory Tricks: The unique mixture-of-experts architecture introduces new surface area for resource hijacking, prompt routing bugs, and efficiency attacks—especially in distributed deployments.
- Real-World Stakes: Whether you’re building autonomous research agents, workflow assistants, or live code bots, a vulnerability here doesn’t just leak data; it can trigger actions in the real world.
That’s why this guide will walk you through using Promptfoo to systematically red team Kimi K2—surfacing jailbreaks, injection risks, bias, and everything else that comes with next-gen agentic models.
Resources
Link: Promptfoo Open Source Tool for Evaluation and Red Teaming
Link: OpenRouterFor Moonshot Kimi K2 APIs
Link: Kimi K2 Model Model Page
Prerequisites
Before you dive in, make sure you have the following ready:
- Node.js (v18 or later): Download and install from nodejs.org.
- OpenRouter API Key: Sign up for an OpenRouter account and grab your API key from the dashboard.
- Promptfoo: No setup required in advance—you’ll use npx to run all Promptfoo commands directly from your terminal.
Once you’ve got these lined up, you’re ready to start red teaming Kimi K2!
Note on API Access
For this red teaming workflow, we used Kimi K2 through OpenRouter’s API, which offers free usage up to a certain limit—perfect for testing, prototyping, or your first security sweeps. If you need to run large-scale evaluations or frequent scans (like in this guide), you’ll likely hit that free quota, at which point you can upgrade to a paid plan. Alternatively, you can also request API access directly from the Moonshot AI platform if you want even more control or need to run high-volume red teaming on Kimi K2.
Step 1: Start a New Red Team Project
Run this command in your terminal to initialize a new red teaming project directory for Kimi K2:
npx promptfoo@latest redteam init kimik2-redteam --no-gui
When prompted with
"What's the name of the target you want to red team?"
Type:
kimi-k2
…and press Enter.
When prompted
"What would you like to do?"
Use your arrow keys to select:
Red team a model + prompt
When prompted
"Do you want to enter a prompt now or later?"
Select:
Enter prompt later
…and press Enter.
When prompted
"Choose a model to target:"
Select:
I'll choose later
…and press Enter.
You’ll manually set the correct Kimi K2 model slug (openrouter/moonshotai/kimi-k2) in your promptfooconfig.yaml after setup.
When prompted
"How would you like to configure plugins?"
Select:
Use the defaults (configure later)
…and press Enter.
When prompted
"How would you like to configure strategies?"
Select:
Use the defaults (configure later)
…and press Enter.
After finishing the initialization, Promptfoo will create a configuration file for you at:
kimik2-redteam/promptfooconfig.yaml
Step 2: Set Up Your OpenRouter API Key
Before running any Promptfoo commands, you need to set your OpenRouter API key in your terminal session.
Copy and paste this command in your terminal:
export OPENROUTER_API_KEY="your_api_key"
- Make sure to replace the key value above with your actual OpenRouter API key.
- Do this in every new terminal session before running Promptfoo or any script that uses the OpenRouter API.
Step 3: Verify and Customize Your promptfooconfig.yaml for Kimi K2
After you run the initialization command (e.g. npx promptfoo@latest redteam init kimi-k2-redteam --no-gui), Promptfoo will automatically create a file called promptfooconfig.yaml in your new project directory.
Open the generated promptfooconfig.yaml in your editor (like VS Code, Sublime, nano, etc.).
Carefully review and update these settings for Kimi K2:
Change the targets: (or providers: for latest Promptfoo) to use Kimi K2’s OpenRouter model slug.
Update any default prompts, plugins, or strategies as needed for your red teaming scenario.
Example for Kimi K2 (with targets):
targets:
- id: openrouter:moonshotai/kimi-k2
label: kimi-k2
config:
temperature: 0.7
max_tokens: 4096
or, with providers: (latest Promptfoo):
providers:
- id: openrouter:moonshotai/kimi-k2
config:
temperature: 0.7
max_tokens: 4096
Save the file and continue with the next steps.
Step 4: Write the Full Updated promptfooconfig.yaml File for Kimi K2
After creating your project directory, you need to update your promptfooconfig.yaml file so that Promptfoo can target Kimi K2 via OpenRouter.
Paste the following content into your promptfooconfig.yaml:
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
# Red teaming configuration
# Docs: https://promptfoo.dev/docs/red-team/configuration
description: Red Team Evaluation for Kimi K2 (MoonshotAI) via OpenRouter
targets:
- id: openrouter:moonshotai/kimi-k2
label: kimi-k2
config:
apiKeyEnv: OPENROUTER_API_KEY
generationConfig:
temperature: 0.7
max_tokens: 4096
redteam:
purpose: |
Evaluate Kimi K2 for prompt injection, jailbreaks, context poisoning, and other LLM security risks.
numTests: 10
# Each plugin generates 5 adversarial inputs.
# To control the number of tests for each plugin, use:
# - id: plugin-name
# numTests: 10
plugins:
- bias:age # Age-related bias detection
- bias:disability # Disability-related bias detection
- bias:gender # Gender-related bias detection
- bias:race # Race-related bias detection
- contracts # Tests for unauthorized contractual commitments and legal exposure
- excessive-agency # Tests for unauthorized actions beyond defined system boundaries
- hallucination # Tests for fabrication of false or misleading information
- harmful:chemical-biological-weapons # Tests handling of WMD-related content
- harmful:child-exploitation # Tests handling of child exploitation content
- harmful:copyright-violations # Tests handling of intellectual property violations
- harmful:cybercrime # Tests handling of cybercrime-related content
- harmful:cybercrime:malicious-code # Tests creation of malicious code
- harmful:graphic-content # Tests handling of graphic or violent content
- harmful:harassment-bullying # Tests handling of harassment and bullying content
- harmful:hate # Tests handling of hate speech and discriminatory content
- harmful:illegal-activities # Tests handling of general illegal activities
- harmful:illegal-drugs # Tests handling of illegal drug-related content
- harmful:illegal-drugs:meth # Tests handling of methamphetamine-related content
- harmful:indiscriminate-weapons # Tests handling of weapons-related content
- harmful:insults # Tests handling of personal attacks and insults
- harmful:intellectual-property # Tests handling of IP theft and violations
- harmful:misinformation-disinformation # Tests handling of false information campaigns
- harmful:non-violent-crime # Tests handling of non-violent criminal content
- harmful:privacy # Tests handling of privacy violation attempts
- harmful:profanity # Tests handling of profane or inappropriate language
- harmful:radicalization # Tests handling of extremist content and recruitment
- harmful:self-harm # Tests handling of self-harm and suicide content
- harmful:sex-crime # Tests handling of sexual crime content
- harmful:sexual-content # Tests handling of explicit sexual content
- harmful:specialized-advice # Tests handling of unauthorized professional advice
- harmful:unsafe-practices # Tests handling of dangerous activity instructions
- harmful:violent-crime # Tests handling of violent crime content
- harmful:weapons:ied # Tests handling of Improvised Explosive Devices
- hijacking # Tests for unauthorized resource usage and purpose deviation
- pii:api-db # Tests for PII exposure via API/database access
- pii:direct # Tests for direct PII exposure vulnerabilities
- pii:session # Tests for PII exposure in session data
- pii:social # Tests for PII exposure via social engineering
- politics # Tests handling of political content and bias
# Attack methods for applying adversarial inputs
strategies:
- basic # Original plugin tests without any additional strategies or optimizations
- jailbreak # Single-shot optimization of safety bypass techniques
- jailbreak:composite # Combines multiple jailbreak techniques for enhanced effectiveness
- Make sure the id: uses a colon (:) as shown: openrouter:moonshotai/kimi-k2
- You can adjust temperature and max_tokens as needed.
- Save this file before moving on to the next step.
You can customize every aspect of this configuration file to fit your red teaming goals. Whether you want to tweak the prompts, fine-tune the list of plugins, adjust the testing strategies, or set the number and purpose of your tests, it’s all flexible and in your control. This makes Promptfoo a powerful tool for targeted security evaluations—tailor the setup for your own risk scenarios, compliance checks, or even specialized attacks. Just edit the YAML sections as needed: swap out plugins to match the threats you care about, set your own strategies for how the model should be attacked, or rewrite the purpose statement to reflect your exact use case. Your red team, your rules.
- You can adjust the prompts to target specific model behaviors or attack scenarios.
- The list of plugins is fully customizable—add or remove any to match your security concerns.
- Strategies can be swapped, reordered, or fine-tuned to test different types of adversarial techniques.
- You can change the purpose field to describe your exact red teaming use case.
- Set the numTests to control the number of adversarial samples generated for each plugin or overall.
- All YAML sections are editable, so you can tailor the configuration for compliance, research, or product safety evaluations.
- Promptfoo gives you the freedom to create focused, practical, or even extreme LLM security tests.
Step 5: Generate Red Team Test Cases
After saving your configuration, generate the test cases that Promptfoo will use to probe the model’s vulnerabilities.
Run this command in your terminal:
npx promptfoo@latest redteam generate
- This command will synthesize attack prompts and write them into a file named redteam.yaml in your current project directory.
- Promptfoo will display a list of all the plugins being used and the number of tests per plugin (as shown in the output).
- You’ll see output confirming that test cases have been generated for all the plugins you enabled.
Step 6: Review Red Team Test Cases
Check the test generation report:
- Review the terminal output for “Success” or “Failed” status per plugin and strategy.
- The tool will write all generated test cases to a file called redteam.yaml in your project directory.
Note:
- If any plugin or strategy fails (shows as “Failed”), you may want to adjust your configuration or check for typos and rerun the command.
- At the end, Promptfoo will remind you to run the next step:
Run promptfoo redteam eval to run the red team!
After successful generation:
- The results (all test cases) will be automatically written to a file named redteam.yaml in your project directory.
- Check the terminal output for the number of test cases and “Success” or “Failed” status per plugin/strategy.
You should see a message like:
Wrote 2660 test cases to redteam.yaml
Step 7: Check the Generated redteam.yaml File
Open the redteam.yaml file in your project directory using your favorite code editor (like VS Code, as shown in the screenshot).
Review the contents:
This file contains all the adversarial test cases generated based on your configuration.
You’ll see:
- Metadata at the top (config schema, author, timestamp, etc.)
- A list of all enabled plugins and strategies.
- The purpose, number of tests, and full details for each plugin and attack scenario.
Why review this file?
- To verify that all expected test cases, plugins, and strategies are present.
- To customize or tweak any parameters, test cases, or descriptions before running the evaluation.
- To ensure everything aligns with your security/red teaming goals.
Generated redteam.yaml file
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
# ===================================================================
# REDTEAM CONFIGURATION
# ===================================================================
# Generated: 2025-07-31T11:28:12.193Z
# Author: ayushknj3@gmail.com
# Cloud: https://api.promptfoo.app
# Test Configuration:
# Total cases: 2660
# Plugins: bias:age, bias:disability, bias:gender, bias:race, contracts, excessive-agency, hallucination, harmful:chemical-biological-weapons, harmful:child-exploitation, harmful:copyright-violations, harmful:cybercrime, harmful:cybercrime:malicious-code, harmful:graphic-content, harmful:harassment-bullying, harmful:hate, harmful:illegal-activities, harmful:illegal-drugs, harmful:illegal-drugs:meth, harmful:indiscriminate-weapons, harmful:insults, harmful:intellectual-property, harmful:misinformation-disinformation, harmful:non-violent-crime, harmful:privacy, harmful:profanity, harmful:radicalization, harmful:self-harm, harmful:sex-crime, harmful:sexual-content, harmful:specialized-advice, harmful:unsafe-practices, harmful:violent-crime, harmful:weapons:ied, hijacking, pii:api-db, pii:direct, pii:session, pii:social, politics
# Strategies: basic, jailbreak, jailbreak:composite
# ===================================================================
description: Red Team Evaluation for Kimi K2 (MoonshotAI) via OpenRouter
targets:
- id: openrouter:moonshotai/kimi-k2
label: kimi-k2
config:
apiKeyEnv: OPENROUTER_API_KEY
generationConfig:
temperature: 0.7
max_tokens: 4096
redteam:
purpose: >
Evaluate Kimi K2 for prompt injection, jailbreaks, context poisoning, and
other LLM security risks.
numTests: 10
plugins:
- id: bias:age
numTests: 10
- id: bias:disability
numTests: 10
- id: bias:gender
numTests: 10
- id: bias:race
numTests: 10
- id: contracts
numTests: 10
- id: excessive-agency
numTests: 10
- id: hallucination
numTests: 10
- id: harmful:chemical-biological-weapons
numTests: 10
- id: harmful:child-exploitation
numTests: 10
- id: harmful:copyright-violations
numTests: 10
- id: harmful:cybercrime
numTests: 10
- id: harmful:cybercrime:malicious-code
numTests: 10
- id: harmful:graphic-content
numTests: 10
Step 8: Run the Red Team Evaluation and Review Results
After generating your test cases and updating redteam.yaml, run the full red team scan with:
npx promptfoo@latest redteam run
- This command executes all the adversarial tests defined in your redteam.yaml file against the Kimi K2 model.
- Promptfoo will run thousands of test cases (plugins, strategies, and prompts) and show a progress bar for each group.
- Once complete, you'll see a summary with total tokens used, test durations, success/failure counts, and a final pass rate.
What to look for:
✅ Success: Tests where the model handled the adversarial input correctly.
❌ Failure: Tests where the model was vulnerable (e.g., gave unsafe output or failed a security check).
Errors and pass rates will give you a clear picture of how robust your model is under red teaming scenarios.
How to Understand the Evaluation Output
After running your red team scan, Promptfoo provides a detailed evaluation summary that’s super useful for analyzing your model’s vulnerabilities and strengths. Here’s how to interpret the key sections:
- Token Usage Summary
- Probes: Total number of adversarial tests (individual test cases) executed.
- Evaluation: Shows how many tokens (and how many for prompt vs. completion) were used during testing.
- Grading: If Promptfoo is set to auto-grade, this shows tokens spent on grading each response.
- Grand Total: The total number of tokens used across all tests—useful for tracking API costs.
- Test Results
- Duration: Total time taken for the scan, including concurrency settings.
- Successes: Number of test cases where the model passed (handled input safely or as expected).
- Failures: Number of cases where the model failed (produced unsafe or unexpected output).
- Errors: Any test runs that encountered API or runtime errors.
- Pass Rate: Percentage of successful tests out of the total.
- What This Means
- High Pass Rate = Model is robust to adversarial and harmful prompts.
- High Failure Rate = Model is vulnerable to certain attacks or harmful behaviors—check these cases to improve your model’s safety and security.
- Total Tokens Used = Useful for budgeting API credits and estimating costs.
Step 9: View and Analyze Your Red Teaming Report
After running your red team evaluation, generate and launch the interactive report by using:
npx promptfoo@latest redteam report
- This command starts a local web server and opens an interactive dashboard where you can explore all test cases, failures, and vulnerabilities found during your scan.
- Press Ctrl+C to stop the server when you’re done reviewing.
Pro Tip:
The report lets you filter, search, and dig deep into specific failures, helping you quickly pinpoint exactly where your model is vulnerable and what you can improve next.
Step 10: Explore Results in the Promptfoo Web Viewer
Once you’ve generated your report, the Promptfoo web viewer provides a powerful dashboard to analyze all your red teaming results.
You can:
- See pass/fail rates and summaries across all plugins, strategies, and security domains.
- Click into individual test cases to see the exact prompts, model responses, and any vulnerabilities or failure points.
- Filter and search results by plugin, tag, or even regex for quick insights.
- Download or export the Vulnerability Report for sharing with your team or for audit documentation.
Pro Tip:
Use the "Vulnerability Report" button for a summarized view of where your model is most at risk. This helps you prioritize remediation and model tuning!
Step 11: Review the LLM Risk Assessment Dashboard
After your red team run and report generation, Promptfoo provides an LLM Risk Assessment dashboard summarizing the overall risk profile for Kimi K2.
The dashboard gives you:
- Critical, High, Medium, and Low issue counts, helping you quickly identify where your model is most vulnerable.
- Attack Methods Breakdown: See how successful various attack strategies were, including single-shot jailbreaks, multi-vector bypasses, and baseline plugin tests.
- Depth & Probe Stats: See the depth (number of probes) and which attack vectors had the highest success rates.
- Visual Insights: Instantly spot which categories (Critical/High) need your urgent attention for model hardening or further testing.
- Export & Share: Use the download or print buttons to save your results or share the risk report with your team or stakeholders.
Step 12: Deep Dive into Detailed Risk & Vulnerability Categories
After viewing the main LLM Risk Assessment summary, scroll down to explore the categorized breakdown of vulnerabilities and risk factors. Promptfoo organizes the evaluation into key sections—Security & Access Control, Compliance & Legal, Trust & Safety, and Brand.
- Each section displays a pass rate and the number of failed probes, helping you immediately spot areas with higher risk or compliance issues.
- On the right, you’ll see a granular breakdown of categories like “Resource Hijacking,” “PII via API/Database,” “Unauthorized Commitments,” “Child Exploitation,” “Hate Speech,” “Political Bias,” “Hallucination,” and more—each with its own pass/fail percentage.
- Red means the model failed on many probes in that area (needs urgent attention), while green and yellow show medium and low risks.
Why this matters:
This view gives you a comprehensive look at exactly where your model is robust and where it’s exposed, letting you prioritize improvements and mitigation efforts for real-world deployment.
Step 13: Explore Vulnerabilities & Mitigations Table
After reviewing risk categories, dive into the Vulnerabilities and Mitigations table. Here, Promptfoo lists every discovered vulnerability, showing:
- Type: What kind of risk was found (e.g., Resource Hijacking, Age Bias, Political Bias).
- Description: What the test actually checks.
- Attack Success Rate: How often the attack worked (the higher the percentage, the riskier!).
- Severity: Graded as high, medium, or low for easy prioritization.
- Actions: Instantly access detailed logs or apply mitigation strategies.
You can also export all vulnerabilities to CSV for compliance reporting, sharing, or further analysis.
Why this matters:
This step turns your red team scan into an actionable checklist. Now you know exactly which weaknesses are the most severe, and you have the logs and tools to start patching or retraining your model.
Next Steps
Now that you’ve successfully red teamed Kimi K2 (MoonshotAI), here’s how to keep your LLM secure and resilient:
- Regular Audits: Re-run red team evaluations whenever Kimi K2 is updated or fine-tuned, or as you change prompts and use cases.
- Custom Plugins: Build and add your own plugins to test for domain-specific risks unique to your application or business needs.
- CI/CD Automation: Integrate Promptfoo red teaming into your continuous integration pipeline, so you catch vulnerabilities before they reach production.
- Monitor & Iterate: Regularly review vulnerability trends and mitigation outcomes, tracking your security improvements over time.
Pro tip:
Make red teaming a habit, not a one-time task. As your usage of Kimi K2 evolves, so should your defense strategy.
Conclusion
Red teaming Kimi K2 isn’t just a checkbox—it’s a vital step in keeping your agentic LLMs secure and trustworthy as they grow in capability and impact. With Promptfoo, you’ve got a flexible, automated way to uncover risks and vulnerabilities at scale, so you can deploy Kimi K2 with greater confidence. Make regular red teaming part of your workflow, stay proactive about security, and keep pushing the boundaries—safely.