More Than a Bug: How Root Cause Analysis Helped Payroll
Juliano

Juliano @julianomoreno


Publish Date: Jul 17

“The payroll is wrong. Many workers got double night bonus.”

I saw this message in the support channel on a Monday morning. I thought it was just one more bug in production. But when we looked closer, we saw that the problem was more than a single if in the code.

It was time to stop fixing only what we could see. We needed to find the real reason for the problem. It was time to use Root Cause Analysis (RCA).


Note: Examples in this post use Brazilian labor law for night shift calculations.


The Start: Just Another Bug?

The team that works on the payroll module in our ERP got a message.
Clients said the night bonus had been wrong in the last month's payroll.

One developer checked and saw that people who clocked in between 10pm and midnight got the bonus twice or on the wrong day.

The code read the clock-in time from the database, where it is stored in UTC:

const nightBonusStart = dayjs(clockIn).hour() >= 22;

Our server runs in UTC, but the data comes from Brazil (BRT, which is UTC-3). So “10pm” in Brazil became “1am” the next day in the database.
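To make the shift concrete, here is a minimal sketch in plain JavaScript. The clock-in date is made up for illustration, and the sketch uses the built-in Date object instead of day.js so it runs without any library.

```javascript
// Hypothetical clock-in: 22:00 in São Paulo (BRT, UTC-3) on 10 June 2024.
const clockIn = new Date('2024-06-10T22:00:00-03:00');

// What a server running in UTC sees for the same instant.
const utcHour = clockIn.getUTCHours();
const utcDay = clockIn.getUTCDate();

// The naive rule from the post, evaluated on the UTC hour.
const naiveNightBonus = utcHour >= 22;

console.log({ utcHour, utcDay, naiveNightBonus });
// { utcHour: 1, utcDay: 11, naiveNightBonus: false }
```

The same instant is 10pm on June 10 for the worker but 1am on June 11 for the server, so a rule that reads the UTC hour looks at the wrong hour and the wrong day.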

“I just need to fix the time, I will change and deploy.”

Done. Payroll fixed. HR happy.
But the next month, the bug came back in another part of the code.

Then we understood: we were fixing symptoms, not causes.


Why RCA Is Important

Root Cause Analysis (RCA) helps you find the real reason behind a problem, not only the part you can see.

People use RCA in software, DevOps, and quality work, and also in healthcare and industry.

Instead of asking:

“How can I fix this bug now?”

RCA asks:

“Why did this happen, and how can I stop it from happening again?”


How We Used RCA

1. Method: 5 Whys

We started with a simple method: ask “why?” again and again until you reach the root cause.

Example in our case:

  1. Why was the night bonus wrong? Because the calculation used the wrong time.
  2. Why was the time wrong? Because the time was saved in UTC.
  3. Why did we use the UTC time without converting it? Because the code called dayjs without a timezone.
  4. Why was no timezone set? Because the new library (day.js) does not handle timezones on its own.
  5. Why didn't we find it earlier? Because the tests did not cover timezones or the hours between 10pm and midnight.

Conclusion:
The problem was not only the code. Timezone handling was missing, and no test covered this case.


2. Method: Change Analysis

We checked what had changed before the problem appeared.

What we saw:

  • Two days before, a PR replaced moment.js with day.js.
  • We wanted better performance.
  • But day.js has no timezone support by default. It needs plugins.

A small change in the code caused a big problem because we did not check it against the business rules.


3. Method: Barrier Analysis

We checked what should have protected us but did not.

Problems:

  • Automated tests?
    They existed, but none covered timezones or night shifts. The test data was always 8am–5pm, never at night. So the tests “passed” while the real problem was still there.

  • Code review?
    It happened, but the reviewer was a junior dev who did not know payroll. There was no meeting between the dev and product teams. Rules (like night bonus and timezones) were not clear to everyone.

  • Acceptance test?
    The dev team used only simple data, with no night shift case. The product team did not check before production.

Many protections failed. The problem was in the process, not only in the code.


4. Complementary Technique: Ishikawa Diagram (Fishbone Diagram)

This technique helps visualize categories of causes around the problem.

| Category | Cause |
|---|---|
| People | Reviewer with no payroll experience |
| Tools | CI without timezone-related tests |
| Process | Functional staging not required |
| Environment | Server uses UTC, users are in BRT |
| Data | Seed does not cover night shift |
| Tests | No coverage for calculations between 10pm and 5am |

This reinforces that the problem is multifactorial, not just a simple isolated bug.

See below how we represented this in an Ishikawa diagram:

Fishbone diagram showing the multifactorial causes of incorrect night shift calculation across People, Tools, Process, Environment, Data, and Tests.

It’s worth noting that the diagram was also refined over time. Some causes were condensed to avoid redundancy (for example, “weak staging” and “no calculation review”), while others were highlighted in more detail, such as misalignment with the Product team. This exercise helps the team visualize causal relationships and make decisions on where to act first.

How we built the Ishikawa Diagram

The Ishikawa Diagram, also called the Fishbone Diagram, is a visual technique used to organize the causes of a problem into categories.

In this case, the problem analyzed was:

"Night shift allowance calculated incorrectly"

From there, we identified six major categories of causes:

  1. People
  2. Tools
  3. Process
  4. Environment
  5. Data
  6. Tests

For each category, we listed:

  • Detail: direct causes
  • Sub-detail: more specific causes

For example, in the People category:

  • Detail: Junior reviewer

    • Sub-detail: No experience in payroll
    • Sub-detail: Undocumented rule

This process was conducted collaboratively with the team, combining:

  • The 5 Whys technique
  • Change analysis
  • Code inspection
  • Identification of failed safeguards

By visualizing the causes this way, it became clear that it wasn’t just a bug in the code, but rather a chain of organizational, technical, and communication failures.


Problems and Solutions

| Original Problem | How We Fixed It |
|---|---|
| Product must review PRs | Product does not review code, but must check the rules. For sensitive code (payroll, finance), hold a meeting and check together before final approval. |
| “Seed” not explained | Now we explain: seed is fake data used for tests |
| “DST ignored” could confuse | Now we say: no test for Daylight Saving Time |
| No test for 10pm–5am | We keep it and explain more about the night bonus |

Simple Takeaway

The diagram showed that the bug did not come from a single place, but from many parts of the system not working together.


What We Changed

Quick fixes:

  • Fix the code to use an explicit timezone (day.js needs its utc and timezone plugins for this):
  dayjs.tz(clockIn, 'America/Sao_Paulo').hour() >= 22
  • Redo payroll for all affected clients.
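As a rough sketch of the idea behind the code fix, the check below converts the stored UTC instant to São Paulo wall-clock time before comparing hours. It uses the built-in Intl API instead of day.js so it runs without plugins. The names hourInZone and isNightHour are hypothetical, and treating the night window as 10pm (inclusive) to 5am (exclusive) is an assumption based on the test boundaries mentioned in this post, not the production rule.

```javascript
// Assumed night window, in local wall-clock hours.
const NIGHT_START_HOUR = 22;
const NIGHT_END_HOUR = 5;

// Returns the hour (0-23) of `date` as seen on a wall clock in `timeZone`.
function hourInZone(date, timeZone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: 'numeric',
    hourCycle: 'h23',
  }).formatToParts(date);
  const hourText = parts.find((part) => part.type === 'hour').value;
  return Number(hourText) % 24; // guard against engines that print midnight as 24
}

// `clockInUtc` is the instant as stored in the database (UTC).
function isNightHour(clockInUtc, timeZone = 'America/Sao_Paulo') {
  const hour = hourInZone(new Date(clockInUtc), timeZone);
  return hour >= NIGHT_START_HOUR || hour < NIGHT_END_HOUR;
}

console.log(isNightHour('2024-06-11T01:00:00Z')); // 22:00 in São Paulo
console.log(isNightHour('2024-06-10T20:00:00Z')); // 17:00 in São Paulo
```

Passing the timezone as a parameter keeps the rule testable with other zones instead of depending on where the server happens to run.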

Prevention:

  • Add new tests for:

    • 10pm, midnight, 5am
    • Different timezones (BRT, UTC)
    • Daylight Saving Time (DST)
  • Make a checklist for all PRs that change payroll, calculation, or finance rules.

  • Always set the timezone explicitly in the API, backend, and tests.

  • Have the product team check the business rules in sensitive PRs.
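The boundary tests from the list above could look something like the sketch below. It is not the real test suite: the dates are invented, isNightHour is a hypothetical helper, and the 10pm to 5am window (5am excluded) is an assumption. The January 2018 case is there because São Paulo was on daylight saving time (UTC-2) then, so a check that hardcodes UTC-3 would fail it.

```javascript
// Hypothetical timezone-aware check, included so the sketch runs on its own.
function isNightHour(instant, timeZone) {
  const hourText = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: 'numeric',
    hourCycle: 'h23',
  })
    .formatToParts(new Date(instant))
    .find((part) => part.type === 'hour').value;
  const hour = Number(hourText) % 24;
  return hour >= 22 || hour < 5; // assumed window: 10pm inclusive to 5am exclusive
}

const SAO_PAULO = 'America/Sao_Paulo';

// Each case stores the instant in UTC, the way the database would.
const cases = [
  { name: '21:59 BRT is day', instant: '2024-06-11T00:59:00Z', zone: SAO_PAULO, expected: false },
  { name: '22:00 BRT is night', instant: '2024-06-11T01:00:00Z', zone: SAO_PAULO, expected: true },
  { name: 'midnight BRT is night', instant: '2024-06-11T03:00:00Z', zone: SAO_PAULO, expected: true },
  { name: '04:59 BRT is night', instant: '2024-06-11T07:59:00Z', zone: SAO_PAULO, expected: true },
  { name: '05:00 BRT is day', instant: '2024-06-11T08:00:00Z', zone: SAO_PAULO, expected: false },
  // Same instant, two zones: 23:00 in UTC but 20:00 in São Paulo.
  { name: '23:00 UTC read as UTC', instant: '2024-06-10T23:00:00Z', zone: 'UTC', expected: true },
  { name: '23:00 UTC read as BRT', instant: '2024-06-10T23:00:00Z', zone: SAO_PAULO, expected: false },
  // DST: in January 2018 São Paulo was UTC-2, so 00:00 UTC was 22:00 local.
  { name: '22:00 local during DST', instant: '2018-01-16T00:00:00Z', zone: SAO_PAULO, expected: true },
  { name: '21:59 local during DST', instant: '2018-01-15T23:59:00Z', zone: SAO_PAULO, expected: false },
];

const failures = cases
  .filter((c) => isNightHour(c.instant, c.zone) !== c.expected)
  .map((c) => c.name);

console.log(failures.length === 0 ? 'all cases pass' : `failed: ${failures.join(', ')}`);
```

Keeping the cases in a table makes it cheap to add a new boundary whenever a payroll rule changes.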


Why Tell This Story?

We did not write a boring postmortem. We told a story.

This helps to:

  • Make the team pay attention
  • Help non-technical people understand
  • Make lessons easy to remember
  • Build a team that wants to improve

Final Lessons: RCA Is Necessary

Fixing bugs without knowing the real reason is like mopping up water without fixing the leak.

RCA helps to:

  • Stop fixing the same bug again and again
  • Learn every time
  • Make stronger systems
  • Improve your work step by step

RCA Checklist

  • [ ] Did we stop and think about the error?
  • [ ] Did we check what changed before?
  • [ ] Did we use the 5 Whys?
  • [ ] Did we check which protections failed?
  • [ ] Did we write a postmortem to share?
  • [ ] Did we do prevention, not only a fix?

Conclusion

This “timezone bug” showed us problems in code, process, and teamwork.

RCA helped us work better, not just fix a bug. This is the difference between teams that only fix and teams that grow.

Are you only fixing bugs, or fixing the real problem?
