"If it disappears when you look at it, it’s probably a Heisenbug."
Embedded systems are notorious for their complexity, tight resource constraints, and real-time requirements. But among all the challenges, few are as maddening as the Heisenbug—a bug that vanishes or changes behavior when you try to debug it.
In this post, we’ll explore:
- What makes Heisenbugs so tricky in embedded environments
- Real-world examples
- Tools and techniques to catch them
- A case study from a real firmware project
🧠 What Is a Heisenbug?
A Heisenbug is a software bug that seems to disappear or alter its behavior when you attempt to study it. In embedded systems, this often happens due to:
- Timing sensitivity: Adding breakpoints or logging changes execution timing.
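- Memory corruption: Debug builds and added instrumentation change memory layout and stack usage, so an out-of-bounds write or uninitialized read can land somewhere harmless and hide the symptom.
- Interrupt interference: Halting the core at a breakpoint stops interrupts from being serviced while peripherals may keep running, and some debuggers mask interrupts while single-stepping, which hides race conditions.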
🔍 Real-World Example: UART Overrun That Disappeared
Imagine a microcontroller-based system where UART communication occasionally fails. You add logging to trace the issue—and suddenly, it works perfectly. Remove the logging, and the bug returns.
Root cause? The logging slowed down the main loop just enough to avoid a UART buffer overrun. Classic Heisenbug.
🛠️ Tools and Techniques to Catch Them
1. Trace-Based Debugging
Use tools like:
- SEGGER J-Trace
- ARM ITM/SWO
- ETM (Embedded Trace Macrocell)
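These allow low-intrusion capture of program execution and events. ETM instruction trace is fully non-intrusive, while ITM output over SWO costs only a few cycles per write.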
2. Hardware Watchpoints
Set hardware watchpoints (data breakpoints) on memory addresses to halt the core the moment an unexpected write or read occurs.
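3. Simulators and Emulators
Simulators like Renode or QEMU (with peripheral models) are not cycle-accurate, but they can offer deterministic or record-and-replay execution, which helps reproduce timing-sensitive bugs and inspect them without disturbing the timing.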
4. Snapshot Debugging
Tools like Percepio Tracealyzer or Lauterbach TRACE32 let you record system state over time and inspect it after the fact.
5. Redundant Logging
Write compact log entries to a ring buffer in RAM, which perturbs timing far less than printing over a serial port, and dump them only on crash or reboot.
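As a rough illustration of the ring-buffer idea, here is a minimal sketch in C. The names (`log_event`, `log_dump`, `log_entry_t`, `LOG_CAPACITY`) and the entry layout are hypothetical choices for this example, not taken from any particular project or library. Each log call is a handful of stores, so it disturbs timing much less than formatted output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; a power of two so the index can wrap with a mask. */
#define LOG_CAPACITY 8u

typedef struct {
    uint32_t timestamp;  /* tick count supplied by the caller */
    uint16_t event_id;   /* application-defined event code */
    uint16_t arg;        /* small payload */
} log_entry_t;

typedef struct {
    log_entry_t entries[LOG_CAPACITY];
    uint32_t    head;    /* total number of entries written so far */
} log_ring_t;

static log_ring_t g_log;

/* Record one event, overwriting the oldest entry once the buffer is full. */
void log_event(uint32_t timestamp, uint16_t event_id, uint16_t arg)
{
    log_entry_t *e = &g_log.entries[g_log.head & (LOG_CAPACITY - 1u)];
    e->timestamp = timestamp;
    e->event_id  = event_id;
    e->arg       = arg;
    g_log.head++;
}

/* Copy the surviving entries into out, oldest first. Returns the count copied.
 * Assumes head has not wrapped past UINT32_MAX. */
size_t log_dump(log_entry_t *out, size_t max_out)
{
    uint32_t count = (g_log.head < LOG_CAPACITY) ? g_log.head : LOG_CAPACITY;
    uint32_t start = g_log.head - count;
    size_t n = 0;

    for (uint32_t i = 0; i < count && n < max_out; i++) {
        out[n++] = g_log.entries[(start + i) & (LOG_CAPACITY - 1u)];
    }
    return n;
}
```

On a real target, `log_dump` would typically be called from a fault handler or early in startup, with the buffer placed in a RAM region the startup code does not clear (how to do that is toolchain-specific and not shown). If `log_event` is called from both interrupts and tasks, the update of `head` also needs a critical section or an atomic increment.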
🧪 Case Study: RTOS Task Starvation
In a recent project, a low-priority task was intermittently failing to run. Debugging with breakpoints showed no issue. The culprit? A high-priority task was hogging the CPU due to a missed semaphore release.
Diagnosis: We added a watchdog timer and used Tracealyzer to visualize task execution. The starvation became obvious.
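The post does not say how the watchdog was wired up. One common pattern for catching a starved task is a per-task heartbeat, sketched below in C under that assumption. All names here (`task_checkin`, `monitor_find_starved`, `NUM_TASKS`) are hypothetical and not tied to any specific RTOS.

```c
#include <stdint.h>

/* Hypothetical number of monitored tasks. */
#define NUM_TASKS 3u

/* Incremented by each task; volatile because tasks and the monitor run in
 * different contexts. */
static volatile uint32_t heartbeat[NUM_TASKS];

/* Heartbeat values observed at the previous monitor pass. */
static uint32_t last_seen[NUM_TASKS];

/* Each task calls this once per loop iteration to show it is still running. */
void task_checkin(uint32_t task_id)
{
    if (task_id < NUM_TASKS) {
        heartbeat[task_id]++;
    }
}

/* Called periodically by a monitor. Returns a bitmask with bit i set if
 * task i has not checked in since the previous call. */
uint32_t monitor_find_starved(void)
{
    uint32_t starved = 0u;

    for (uint32_t i = 0; i < NUM_TASKS; i++) {
        uint32_t now = heartbeat[i];
        if (now == last_seen[i]) {
            starved |= (1u << i);
        }
        last_seen[i] = now;
    }
    return starved;
}
```

In this kind of setup, the monitor refreshes the hardware watchdog only when `monitor_find_starved()` returns 0, and otherwise records the bitmask before the reset, so the starved task is identified without setting a breakpoint.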
🧭 Final Thoughts
Heisenbugs are a rite of passage for embedded developers. They test your patience, your tools, and your understanding of the system. But with the right approach, you can turn even the most elusive bug into a solved mystery.
Have you faced a Heisenbug in your embedded work? Share your story in the comments!
I know Heisenbug is the common term for this kind of bug, but it's unfortunate nevertheless. In physics, the Heisenberg Uncertainty Principle says that certain pairs of properties of a particle, such as its position and momentum, can't both be known to arbitrary precision at the same time.
An analog of that for debugging software might be that you couldn't know the bug's location (file and line number) and (assuming the bug in question involves an ordinary variable, say a pointer) the pointer's value at that line number at the same time.
The phenomenon where trying to debug something changes the system just enough that the bug no longer manifests is instead an analog of the Observer Effect.