Real-Time Threat Detection With MongoDB & PuppyGraph
Weimo Liu

About: Weimo Liu is the CEO and co-founder of PuppyGraph, bringing his expertise in databases and query engines from his time at Google, where he worked on the F1 team.

Publish Date: Jul 11

Security operations teams face an increasingly complex environment. Cloud-native applications, identity sprawl, and continuous infrastructure changes generate a flood of logs and events. From API calls in AWS to lateral movement between virtual machines, the volume of telemetry is enormous, and it's growing.

The challenge isn't just scale. It's structure. Traditional security tooling often looks at events in isolation, relying on static rules or dashboards to highlight anomalies. But real attacks unfold as chains of related actions: A user assumes a role, launches a resource, accesses data, and then pivots again. These relationships are hard to capture with flat queries or disconnected logs.

That's where graph analytics comes in. By modeling your data as a network of users, sessions, identities, and events, you can trace how threats emerge and evolve. And with PuppyGraph, you don't need a separate graph database or batch pipelines to get there.

In this post, we'll show how to combine MongoDB and PuppyGraph to analyze AWS CloudTrail data as a graph, without moving or duplicating data. You'll see how to uncover privilege escalation chains, map user behavior across sessions, and detect suspicious access patterns in real time.

Why MongoDB for cybersecurity data

MongoDB is a popular choice for managing security telemetry. Its document-based model is ideal for ingesting unstructured and semi-structured logs like those generated by AWS CloudTrail, GuardDuty, or Kubernetes audit logs. Events are stored as flexible JSON documents, which evolve naturally as logging formats change.

This flexibility matters in security, where schemas can shift as providers update APIs or teams add new context to events. MongoDB handles these changes without breaking pipelines or requiring schema migrations. It also supports high-throughput ingestion and horizontal scaling, making it well-suited for operational telemetry.

Many security products and SIEM backends already support MongoDB as a destination for real-time event streams. That makes it a natural foundation for graph-based security analytics: The data is already there—rich, semi-structured, and continuously updated.

Why graph analytics for threat detection

Modern security incidents rarely unfold as isolated events. Attackers don’t just trip a single rule—they navigate through systems, identities, and resources, often blending in with legitimate activity. Understanding these behaviors means connecting the dots across multiple entities and actions. That’s precisely what graph analytics excels at. By modeling users, sessions, events, and assets as interconnected nodes and edges, analysts can trace how activity flows through a system. This structure makes it easy to ask questions that involve multiple hops or indirect relationships—something traditional queries often struggle to express.

For example, imagine you’re investigating activity tied to a specific AWS account. You might start by counting how many sessions are associated with that account. Then, you might break those sessions down by whether they were authenticated using MFA. If some weren’t, the next question becomes: What resources were accessed during those unauthenticated sessions?

This kind of multi-step investigation is where graph queries shine. Instead of scanning raw logs or filtering one table at a time, you can traverse the entire path from account to identity to session to event to resource, all in a single query. You can also group results by attributes like resource type to identify which services were most affected.
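
To make the idea of a multi-hop traversal concrete, here is a minimal in-memory sketch in Python. It is not how PuppyGraph executes queries; it only illustrates the account-to-resource path with toy data. The edge labels mirror the ones used in the demo queries later in this post, while every ID, MFA flag, and resource type below is made up for illustration.

```python
from collections import Counter

# Toy adjacency lists keyed by edge label. All IDs and values are invented;
# only the edge labels follow the demo's graph model.
EDGES = {
    "HasIdentity": {"acct-1": ["ident-1"]},
    "HasSession": {"ident-1": ["sess-1", "sess-2"]},
    "RecordsEvent": {"sess-1": ["evt-1", "evt-2"], "sess-2": ["evt-3"]},
    "OperatesOn": {"evt-1": ["res-1"], "evt-2": ["res-2"], "evt-3": ["res-1"]},
}
SESSION_MFA = {"sess-1": False, "sess-2": True}
RESOURCE_TYPE = {"res-1": "s3-bucket", "res-2": "ec2-instance"}


def out(vertices, label):
    """Follow every outgoing edge with the given label (one hop)."""
    return [dst for v in vertices for dst in EDGES[label].get(v, [])]


def resources_without_mfa(account_id):
    """Count resource types reached through sessions that lack MFA."""
    sessions = out(out([account_id], "HasIdentity"), "HasSession")
    unauthenticated = [s for s in sessions if not SESSION_MFA[s]]
    resources = out(out(unauthenticated, "RecordsEvent"), "OperatesOn")
    return Counter(RESOURCE_TYPE[r] for r in resources)


print(resources_without_mfa("acct-1"))
```

Each call to `out` is one hop. A graph query language expresses the same chain declaratively, which is why the four-hop question fits in a single query.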

And when needed, you can go beyond metrics and pivot to visualization, mapping out full access paths to see how a specific user or session interacted with sensitive infrastructure. This helps surface lateral movement, track privilege escalation, and uncover patterns that static alerts might miss.

Graph analytics doesn’t replace your existing detection rules; it complements them by revealing the structure behind security activity. It turns complex event relationships into something you can query directly, explore interactively, and act on with confidence.

Query MongoDB data as a graph without ETL

MongoDB is a popular choice for storing security event data, especially when working with logs that don’t always follow a fixed structure. Services like AWS CloudTrail produce large volumes of JSON-based records with fields that can differ across events. MongoDB’s flexible schema makes it easy to ingest and query that data as it evolves.

PuppyGraph builds on this foundation by introducing graph analytics—without requiring any data movement. Through the MongoDB Atlas SQL Interface, PuppyGraph can connect directly to your collections and treat them as relational tables. From there, you define a graph model by mapping key fields into nodes and relationships.

Figure 1. Architecture of the integration of MongoDB and PuppyGraph.

This makes it possible to explore questions that involve multiple entities and steps, such as tracing how a session relates to an identity or which resources were accessed without MFA. The graph itself is virtual. There’s no ETL process or data duplication. Queries run in real time against the data already stored in MongoDB.

While PuppyGraph works with tabular structures exposed through the SQL interface, many security logs already follow a relatively flat pattern: consistent fields like account IDs, event names, timestamps, and resource types. That makes it straightforward to build graphs that reflect how accounts, sessions, events, and resources are linked. By layering graph capabilities on top of MongoDB, teams can ask more connected questions of their security data, without changing their storage strategy or duplicating infrastructure.
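
As a rough illustration of what "mapping key fields into nodes and relationships" means, the sketch below turns one flat CloudTrail-style record into node and edge tuples. PuppyGraph does this mapping declaratively through its schema rather than in application code, so treat this purely as a mental model. The field names follow the general CloudTrail record layout, but the session key (identity ARN plus session creation time) is a hypothetical choice, and the demo's actual schema may use different columns.

```python
def record_to_graph(record):
    """Return (nodes, edges) for one CloudTrail-style record.

    Nodes are (label, id) tuples; edges are (label, from_id, to_id) tuples.
    """
    identity = record["userIdentity"]
    attrs = identity.get("sessionContext", {}).get("attributes", {})

    account_id = identity["accountId"]
    identity_id = identity["arn"]
    # Hypothetical session key: identity ARN plus session creation time.
    session_id = f"{identity_id}@{attrs.get('creationDate', 'unknown')}"
    event_id = record["eventID"]

    nodes = [
        ("Account", account_id),
        ("Identity", identity_id),
        ("Session", session_id),
        ("Event", event_id),
    ]
    edges = [
        ("HasIdentity", account_id, identity_id),
        ("HasSession", identity_id, session_id),
        ("RecordsEvent", session_id, event_id),
    ]
    # Not every event lists resources, so default to an empty list.
    for resource in record.get("resources", []):
        nodes.append(("Resource", resource["ARN"]))
        edges.append(("OperatesOn", event_id, resource["ARN"]))
    return nodes, edges


# Made-up sample record for illustration only.
sample = {
    "eventID": "evt-0001",
    "eventName": "GetObject",
    "userIdentity": {
        "accountId": "123456789012",
        "arn": "arn:aws:iam::123456789012:user/example",
        "sessionContext": {
            "attributes": {
                "mfaAuthenticated": "false",
                "creationDate": "2019-01-01T00:00:00Z",
            }
        },
    },
    "resources": [{"ARN": "arn:aws:s3:::example-bucket", "type": "AWS::S3::Bucket"}],
}

nodes, edges = record_to_graph(sample)
for edge in edges:
    print(edge)
```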

Investigating CloudTrail activity using graph queries

To demonstrate how graph analytics can enhance security investigations, we’ll explore a real-world dataset of AWS CloudTrail logs. This dataset originates from flaws.cloud, a security training environment developed by Scott Piper.

The dataset comprises anonymized CloudTrail logs collected over 3.5 years, capturing a wide range of simulated attack scenarios within a controlled AWS environment. It includes over 1.9 million events, featuring interactions from thousands of unique IP addresses and user agents. The logs encompass various AWS API calls, providing a comprehensive view of potential security events and misconfigurations.

For our demonstration, we imported a subset of approximately 100,000 events into MongoDB Atlas. Applying PuppyGraph’s graph analytics capabilities to that data lets us model and analyze complex relationships between accounts, identities, sessions, events, and resources.

Demo

Let’s walk through the demo step by step! We have provided all the materials for this demo on GitHub. Please download the materials or clone the repository directly.

If you’re new to integrating MongoDB Atlas with PuppyGraph, we recommend starting with the MongoDB Atlas + PuppyGraph Quickstart Demo to get familiar with the setup and core concepts.

Prerequisites

  • A MongoDB Atlas account (free tier is sufficient)
  • Docker
  • Python 3

Set up MongoDB Atlas

Follow the MongoDB Atlas Getting Started guide to:

  1. Create a new cluster (free tier is fine).
  2. Add a database user.
  3. Configure IP access.
  4. Note your connection string for the MongoDB Python driver (you’ll need it shortly).
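
One common stumbling block with connection strings is a password containing characters such as `@` or `/`, which must be percent-encoded before they go into the URI. The helper below is a small sketch of that step using only the standard library; the function name, user name, password, and host are placeholders, not values from this demo.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host):
    """Assemble a mongodb+srv URI with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# Placeholder credentials and host, for illustration only.
uri = build_atlas_uri("demo_user", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)
```

If you copy the connection string from the Atlas UI and paste your password in by hand, the same encoding rule applies.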

Download and import CloudTrail logs

Run the following commands to fetch and prepare the dataset:

wget https://summitroute.com/downloads/flaws_cloudtrail_logs.tar
mkdir -p ./raw_data
tar -xvf flaws_cloudtrail_logs.tar --strip-components=1 -C ./raw_data
gunzip ./raw_data/*.json.gz

Create a virtual environment and install dependencies:

# On Debian/Ubuntu, install `python3-venv` first if it is missing.
sudo apt-get update
sudo apt-get install python3-venv
# Create a virtual environment, activate it, and install the necessary packages.
python3 -m venv venv
source venv/bin/activate
pip install ijson faker pandas pymongo

Import the first chunk of CloudTrail data (replace the connection string with your Atlas URI):

export MONGODB_CONNECTION_STRING="your_mongodb_connection_string"
python import_data.py raw_data/flaws_cloudtrail00.json --database cloudtrail

This creates a new cloudtrail database and loads the first chunk of data containing 100,000 structured events.
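
The repository's `import_data.py` is not reproduced here. As a rough sketch of the general idea, the code below reads one log file and inserts its records in batches. It assumes each file holds a top-level `Records` array (the usual CloudTrail layout) and uses `json.load` for simplicity, whereas the demo script may stream the file instead. The function names are ours, and `collection` can be any object with an `insert_many` method.

```python
import json
from itertools import islice


def iter_batches(records, batch_size=1000):
    """Yield successive lists of at most `batch_size` records."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch


def load_records(path):
    """Read one log file, assuming a top-level "Records" array."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["Records"]


def import_file(path, collection, batch_size=1000):
    """Insert every record from `path` into `collection`, batch by batch."""
    total = 0
    for batch in iter_batches(load_records(path), batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total
```

With PyMongo, `collection` would be something like `MongoClient(uri)["cloudtrail"]["events"]`, where `events` is a placeholder collection name. Batching keeps each insert request at a manageable size.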

Enable Atlas SQL interface and get JDBC URI

To enable graph access:

  1. Create an Atlas SQL Federated Database instance.
  2. Ensure the schema is available (generate from sample, if needed).
  3. Copy the JDBC URI from the Atlas SQL interface. See PuppyGraph’s guide for setting up MongoDB Atlas SQL.

Start PuppyGraph and upload the graph schema

Start the PuppyGraph container:

docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  -d --name puppy --rm --pull=always puppygraph/puppygraph:stable

Log in to the web UI at http://localhost:8081 with:

  • Username: puppygraph
  • Password: puppygraph123

Upload the schema:

  1. Open schema.json.
  2. Fill in your JDBC URI, username, and password.
  3. Upload via the Upload Graph Schema JSON section or run:
curl -XPOST -H "content-type: application/json" \
  --data-binary @./schema.json \
  --user "puppygraph:puppygraph123" localhost:8081/schema

Wait for the schema to upload and initialize (approximately five minutes).
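
If you prefer to script the upload, the `curl` call above can be approximated in Python with the standard library. This is a sketch, not an official client: the endpoint, port, and credentials are the ones used in this demo, the function names are ours, and building the request is kept separate from sending it so it can be inspected first.

```python
import base64
import urllib.request


def build_schema_request(schema_path, base_url="http://localhost:8081",
                         username="puppygraph", password="puppygraph123"):
    """Build (but do not send) the POST request that uploads schema.json."""
    with open(schema_path, "rb") as fh:
        body = fh.read()
    # HTTP Basic auth, equivalent to curl's --user flag.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/schema",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def upload_schema(schema_path, **kwargs):
    """Send the request and return the HTTP status code."""
    request = build_schema_request(schema_path, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `upload_schema("schema.json")` requires the PuppyGraph container to be running. `urlopen` raises `urllib.error.HTTPError` for error status codes.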

Figure 2. A graph visualization of the schema, which models the graph from relational data.

Run graph queries to investigate security activity

Once the graph is live, open the Query panel in PuppyGraph’s UI.

Let's say we want to investigate the activity of a specific account. First, we count the number of sessions associated with the account.

Cypher:

MATCH (a:Account)-[:HasIdentity]->(i:Identity)
  -[:HasSession]->(s:Session)
WHERE id(a) = "Account[811596193553]"
RETURN count(s)

Gremlin:

g.V("Account[811596193553]")
 .out("HasIdentity").out("HasSession").count()

Figure 3. Graph query in the PuppyGraph UI.
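
The same query can in principle be run from a script instead of the UI. The sketch below assumes that the Bolt port exposed by the container (7687) accepts the Neo4j Python driver and the same credentials as the web UI; we have not verified this against every PuppyGraph release, so check the PuppyGraph documentation before relying on it. The helper names are ours. The account ID is validated before being placed into the query text because it is interpolated as a string.

```python
import re

BOLT_URI = "bolt://localhost:7687"         # assumed Bolt endpoint from the docker run command
AUTH = ("puppygraph", "puppygraph123")     # assumed to match the web UI login


def session_count_query(account_id):
    """Build the session-count Cypher query for a 12-digit AWS account ID."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
    return (
        "MATCH (a:Account)-[:HasIdentity]->(i:Identity)-[:HasSession]->(s:Session) "
        f'WHERE id(a) = "Account[{account_id}]" '
        "RETURN count(s) AS sessions"
    )


def run_cypher(query):
    """Run a query over Bolt and return the rows as dictionaries."""
    # Imported lazily so the query builder works without the driver installed.
    from neo4j import GraphDatabase  # pip install neo4j

    with GraphDatabase.driver(BOLT_URI, auth=AUTH) as driver:
        with driver.session() as session:
            return [record.data() for record in session.run(query)]
```

With the container running, `run_cypher(session_count_query("811596193553"))` would be the scripted equivalent of the Cypher query above.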

Then, we break these sessions down by whether they were MFA-authenticated.

Cypher:

MATCH (a:Account)-[:HasIdentity]->(i:Identity)
  -[:HasSession]->(s:Session)
WHERE id(a) = "Account[811596193553]"
RETURN s.mfa_authenticated AS mfaStatus, count(s) AS count

Gremlin:

g.V("Account[811596193553]")
  .out("HasIdentity").out("HasSession")
  .groupCount().by("mfa_authenticated")

Figure 4. Graph query results in the PuppyGraph UI.

Next, we look at the sessions that were not MFA-authenticated and see which resources they accessed.

Cypher:

MATCH (a:Account)-[:HasIdentity]->
  (i:Identity)-[:HasSession]->
  (s:Session {mfa_authenticated: false})
  -[:RecordsEvent]->(e:Event)
  -[:OperatesOn]->(r:Resource)
WHERE id(a) = "Account[811596193553]"
RETURN r.resource_type AS resourceType, count(r) AS count

Gremlin:

g.V("Account[811596193553]").out("HasIdentity")
  .out("HasSession")
  .has("mfa_authenticated", false)
  .out("RecordsEvent").out("OperatesOn")
  .groupCount().by("resource_type")

Figure 5. PuppyGraph UI showing the resources accessed by sessions that are not MFA-authenticated.

Finally, we visualize those access paths as a graph.

Cypher:

MATCH path = (a:Account)-[:HasIdentity]->
  (i:Identity)-[:HasSession]->
  (s:Session {mfa_authenticated: false})
  -[:RecordsEvent]->(e:Event)
  -[:OperatesOn]->(r:Resource)
WHERE id(a) = "Account[811596193553]"
RETURN path

Gremlin:

g.V("Account[811596193553]").out("HasIdentity")
  .out("HasSession")
  .has("mfa_authenticated", false)
  .out("RecordsEvent").out("OperatesOn")
  .path()

Figure 6. Graph visualization in PuppyGraph UI.

Tear down the environment

When you’re done:

docker stop puppy

Your MongoDB data will persist in Atlas, so you can revisit or expand the graph model at any time.

Conclusion

Security data is rich with relationships between users, sessions, resources, and actions. Modeling these connections explicitly makes it easier to understand what’s happening in your environment, especially when investigating incidents or searching for hidden risks.

By combining MongoDB Atlas and PuppyGraph, teams can analyze those relationships in real time without moving data or maintaining a separate graph database. MongoDB provides the flexibility and scalability to store complex, evolving security logs like AWS CloudTrail, while PuppyGraph adds a native graph layer for exploring that data as connected paths and patterns.

In this post, we walked through how to import real-world audit logs, define a graph schema, and investigate access activity using graph queries. With just a few steps, you can transform a log collection into an interactive graph that reveals how activity flows across your cloud infrastructure.
