DevOps isn’t broken, it just hurts more than we admit

The automation dream, meet reality

DevOps was supposed to be our escape from chaos.

We were promised clean pipelines, automated everything, zero-downtime deploys, and fewer meetings. It was marketed like a cheat code for infrastructure. One CLI to rule them all.

But the reality? It’s dashboards on top of dashboards. Alerts that ping you into oblivion. CI/CD pipelines that break when Mercury’s in retrograde. And secrets that still end up in Slack messages.

It’s not that DevOps is a lie; it’s just… messier than advertised.

You don’t see this in the glossy conference talks. But every DevOps engineer knows the truth: behind every “green build” is a trail of bash scripts, GitHub issues, and at least one broken promise.

This article isn’t a rant. It’s a debug log: a list of 10 silent frustrations every DevOps engineer deals with but rarely talks about. If you’ve ever stared at a failing pipeline and whispered “why,” this one’s for you.

Flaky CI pipelines that gaslight you

You push your code. Tests fail.
You change nothing. Rerun the job. It passes.

Welcome to the haunted house known as CI.

There’s no greater productivity killer than a flaky pipeline. It’s like trying to deploy with a slot machine: maybe you’ll get a green build today, maybe you’ll get exit code 137 and a mystery timeout.

The worst part? Everyone just accepts it.

You’ll hear:

  • “Yeah, that test fails sometimes. Just rerun.”
  • “Oh, GitHub Actions is weird with concurrency.”
  • “Jenkins has feelings, okay?”

And so we rerun. Again. And again. Until the green checkmark blesses our PR like it’s some kind of ritual.

This isn’t DevOps; this is DevOps by hope and retries. CI tools should be a source of confidence, not confusion. But until we treat flaky tests and janky runners as real bugs (not just vibes), we’ll keep pretending reruns are a fix instead of an error in disguise.
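
One way to start treating flakiness as a bug is to measure it instead of rerunning and hoping. Below is a minimal bash sketch (the script name, run count, and test command are placeholders, not anything your CI ships with) that reruns a suspect test and reports how often it actually fails:

```bash
#!/usr/bin/env bash
# flake-check.sh -- rerun a suspect test N times and report its failure rate.
# Usage: ./flake-check.sh 20 "npm test -- --grep 'checkout flow'"
set -euo pipefail

runs="${1:?number of runs required}"
test_cmd="${2:?test command required}"

failures=0
for i in $(seq 1 "$runs"); do
  echo "--- run $i/$runs ---"
  if ! bash -c "$test_cmd" > /dev/null 2>&1; then
    failures=$((failures + 1))
  fi
done

echo "Failed $failures out of $runs runs."
if [ "$failures" -gt 0 ]; then
  echo "This test is flaky: file a bug instead of hitting rerun."
  exit 1
fi
```

Run it nightly against your known offenders, and “yeah, that test fails sometimes” turns into a number you can put in a bug report.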

Secrets management is still dangerously human

We all say we take secrets seriously.
But let’s be honest: half the time, we’re just hoping no one notices that AWS_SECRET_KEY sitting in not-final.env in a Notion doc last updated 6 months ago.

Secrets management is the one part of DevOps where every solution still feels duct-taped:

  • Vault is powerful, but configuring it without summoning a cursed token is another story.
  • Doppler is sleek, until you hit usage limits.
  • AWS Secrets Manager exists, but the UX feels like it was built as punishment.

The real issue? Human behavior.

Secrets live in weird places:

  • Slack DMs
  • Git history (oops)
  • Shell history
  • Someone’s Downloads folder labeled do_not_upload.zip

Every team has “that one secret” that never got rotated and now holds production together like a forgotten Horcrux. Worse, secrets leak quietly, until one day you get an email from GitGuardian and suddenly you’re in incident response mode.

Until secrets are actually treated like code (versioned, reviewed, rotated, and locked down), we’re just playing hide-and-seek with our own infra.

And spoiler: the internet always finds what you hide poorly.
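
One small step toward treating secrets like code is refusing to commit them in the first place. Here’s a deliberately naive pre-commit sketch; the regex patterns are illustrative only, and a real scanner (gitleaks, trufflehog, GitGuardian, etc.) covers far more cases:

```bash
#!/usr/bin/env bash
# .git/hooks/pre-commit -- block commits whose staged diff looks like it contains a secret.
# Deliberately naive: a handful of illustrative patterns (AWS access key IDs, private key
# headers, SECRET/PASSWORD/API_KEY assignments), not a substitute for a real scanner.
set -euo pipefail

patterns='AKIA[0-9A-Z]{16}|-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----|(SECRET|PASSWORD|API_KEY)[A-Z_]*=[^ ]+'

if git diff --cached -U0 | grep -Eq "$patterns"; then
  echo "Possible secret in staged changes. Commit blocked." >&2
  echo "False positive? Adjust the hook. Real secret? Move it to your secrets manager and rotate it." >&2
  exit 1
fi
```

It won’t catch everything, but it catches the AWS_SECRET_KEY-in-a-diff case before it ever reaches git history.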

Observability ≠ clarity

Your dashboard looks amazing.
There’s a dark mode theme, graphs that spike like a DJ set, and a Prometheus panel with 47 metrics.

And yet… no one knows what any of it actually means.

Observability is supposed to help us understand our systems. But let’s be honest: half the time it’s just noise. You get alerts that say:

pod_memory_usage_bytes threshold exceeded

And your brain goes:

okay, but… does that matter?

The three horsemen of observability confusion:

  • Logs: Endless scroll with zero structure
  • Metrics: Dozens of panels, all red, none actionable
  • Tracing: Everyone says it’s amazing, no one has it set up right

Then there’s the cost of observability tools. Somehow your monitoring bill is now higher than your compute bill. You have Datadog, New Relic, Prometheus, and Loki all running at once, but you still had to grep logs manually last week.

The truth? Visibility doesn’t equal understanding.

Observability only works when it’s:

  • Purposeful
  • Owned by the team
  • Reviewed regularly
  • Tied to real incidents, not theoretical thresholds

Otherwise, it’s just expensive art, and your 2 AM alert will be the same five-line Slack message that says nothing helpful.
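
To make that pod_memory_usage_bytes alert mean something, pair it with the one number it’s missing: the limit. A hedged bash sketch, assuming a Prometheus reachable at $PROM_URL and the usual cAdvisor and kube-state-metrics metric names (adjust both for your setup), that reports each container’s working-set memory as a percentage of its configured limit:

```bash
#!/usr/bin/env bash
# memory-context.sh -- answer "okay, but does that matter?" for a memory alert by
# comparing working-set usage to the configured limit via the Prometheus HTTP API.
set -euo pipefail

PROM_URL="${PROM_URL:-http://localhost:9090}"
NAMESPACE="${1:-production}"

# Usage divided by limit, per container (metric names assume cAdvisor + kube-state-metrics).
query='
  max by (namespace, pod, container) (
    container_memory_working_set_bytes{namespace="'"$NAMESPACE"'", container!=""}
  )
  /
  max by (namespace, pod, container) (
    kube_pod_container_resource_limits{namespace="'"$NAMESPACE"'", resource="memory"}
  )
'

curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$query" \
  | jq -r '.data.result[]
           | "\(.metric.pod)/\(.metric.container): \((.value[1] | tonumber * 100 | floor))% of limit"' \
  | sort -t: -k2 -rn
```

A pod at 95% of its limit is a page; a pod at 40% crossing an arbitrary byte threshold is noise.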

On-call still ruins sleep (and weekends)

There’s a special kind of dread that comes from hearing your phone buzz at 2:43 AM.
You don’t even look at it anymore; you just know it’s not good.

Being on-call was supposed to be rotational, balanced, and backed by automation. But most of the time, it feels like passive punishment.

The most common on-call nightmares:

  • You wake up to a CPU spike that resolved itself before you even unlocked your phone
  • The alert is real, but the runbook hasn’t been updated since Kubernetes v1.18
  • Your backup never woke up… because no one updated the escalation policy

Pager fatigue is real, not just because of the interruptions but because so many alerts are pointless. No context. No urgency. Just noise in the dark.

And even when you’re not on-call, you carry the mental load.
You second-guess deploys on Friday. You babysit feature flags. You fear “the quiet.”

The solution isn’t to remove on-call; it’s to make it humane:

  • Triage with intent
  • Use smarter alerting (SLOs, not metrics; see the sketch below)
  • Rotate for real, not just in theory
  • And, for the love of uptime, no alerts without actionable context

Because no system is worth trading your sanity for.
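
For the “SLOs, not metrics” point above, one concrete shape is multiwindow burn-rate alerting: page only when the error budget is burning fast over both a long and a short window, instead of on any single spike. A minimal bash sketch, assuming a Prometheus at $PROM_URL and an http_requests_total metric with a code label (swap in your own service’s metrics):

```bash
#!/usr/bin/env bash
# burn-rate.sh -- page only if the error budget is burning fast on BOTH the 1h and 5m
# windows (the multiwindow burn-rate pattern popularized by the Google SRE Workbook).
set -euo pipefail

PROM_URL="${PROM_URL:-http://localhost:9090}"
SLO_TARGET=0.999        # 99.9% availability SLO
BURN_THRESHOLD=14.4     # commonly cited fast-burn multiplier for a 1h/5m page

error_ratio() {  # error ratio over a window such as "1h" or "5m"
  local window="$1"
  local q='sum(rate(http_requests_total{code=~"5.."}['"$window"']))
           / sum(rate(http_requests_total['"$window"']))'
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$q" \
    | jq -r '.data.result[0].value[1] // "0"'
}

budget=$(awk -v t="$SLO_TARGET" 'BEGIN { print 1 - t }')
long=$(error_ratio 1h)
short=$(error_ratio 5m)

page=$(awk -v l="$long" -v s="$short" -v b="$budget" -v m="$BURN_THRESHOLD" \
  'BEGIN { r = (l > m * b && s > m * b) ? "yes" : "no"; print r }')

echo "1h error ratio: $long, 5m error ratio: $short, error budget: $budget"
if [ "$page" = "yes" ]; then
  echo "PAGE: error budget is burning fast on both windows"
else
  echo "OK: no page needed"
fi
```

The point isn’t this exact script; it’s that the paging condition is expressed in terms of user-facing error budget, not a raw CPU or memory number.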

Terraform hates collaboration

Terraform is supposed to make infra-as-code clean, declarative, and collaborative.

But then two engineers run terraform apply in different branches, and suddenly you’re debugging a state file like it’s a corrupted save in Skyrim.

Real Terraform chaos, brought to you by:

  • State file locks that never release
  • Drift that no one notices until production looks different from staging
  • terraform plan shows changes no one asked for
  • terraform destroy accidentally nukes your entire dev environment

You’re told to “use workspaces” and “store state in remote backends,” but it still feels like handling radioactive material. One wrong move, and you’ve got an S3 bucket with public access and no logging… again.

Then there’s the collaboration tax:

  • Shared modules no one maintains
  • Secret variables sprinkled like seasoning across main.tf
  • PRs that are just diffs of JSON noise and ~ toggles

Terraform is powerful, no doubt. But when it’s in the hands of a whole team, the complexity multiplies. Fast.

Until we treat Terraform like software (with testing, ownership, and proper CI), it’ll keep biting the hands that code it.
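
Treating Terraform like software mostly means running it like software. A minimal sketch of a CI gate, assuming a remote backend with state locking is already configured: format and validate on every push, save the plan as an artifact, and apply only that reviewed plan in a separate, manually approved stage:

```bash
#!/usr/bin/env bash
# terraform-ci.sh -- format, validate, and plan in CI; apply only the saved, reviewed
# plan file, never straight from a laptop. Assumes a remote backend with state locking.
set -euo pipefail

terraform fmt -check -recursive      # fail the build on unformatted code
terraform init -input=false
terraform validate

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
set +e
terraform plan -input=false -lock-timeout=5m -out=tfplan -detailed-exitcode
plan_status=$?
set -e

case "$plan_status" in
  0) echo "No changes; nothing to apply." ;;
  2) echo "Changes detected; tfplan saved for review."
     # A later, manually approved stage would run:
     #   terraform apply -input=false tfplan
     ;;
  *) echo "terraform plan failed." >&2; exit 1 ;;
esac
```

Putting the apply step behind an approval is what turns “two engineers ran apply from different branches” into a non-event.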

No one knows who owns what

Welcome to microservices, where every service has:

  • its own repo
  • its own pipeline
  • and no clear owner

You deploy something. It breaks. You check the repo. The last commit was from an intern who left six months ago. The Slack channel is empty. The README links to a Confluence doc that 404s.

Now it’s your problem.

The blame ping-pong begins:

  • Dev says, “Ask ops, they run it.”
  • Ops says, “Ask dev, they built it.”
  • QA says, “Wasn’t flagged in staging.”
  • Security says, “We flagged it. No one fixed it.”

No one really owns the service, but everyone owns the consequences.

Ownership in DevOps isn’t about assigning blame. It’s about:

  • Defining maintainers
  • Setting clear escalation paths
  • Keeping docs current (yes, actually)
  • Labeling infra clearly in dashboards and alert configs

In a world of ephemeral containers and disposable clusters, ownership isn’t automatic.
It’s a decision. And when no one makes that decision, you get chaos in a nice CI badge wrapper.
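
Labels are one cheap way to make that decision visible. A small sketch, assuming your team standardizes on an owner label on Deployments (the label name is a team convention here, not anything Kubernetes enforces), that lists everything nobody has claimed:

```bash
#!/usr/bin/env bash
# who-owns-this.sh -- list Deployments that are missing an "owner" label.
set -euo pipefail

kubectl get deployments --all-namespaces -o json \
  | jq -r '.items[]
           | select(.metadata.labels.owner == null)
           | "\(.metadata.namespace)/\(.metadata.name) has no owner label"'
```

Run it on a schedule and treat a non-empty list as a failing check, the same way you’d treat a failing test.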

“Just shift left” makes everything the dev’s fault

Shift left started with good intentions:
Catch bugs early. Improve security. Shorten feedback loops.

Then it slowly morphed into:
“Hey devs, can you also own security, infra, monitoring, and cost optimization in this sprint?”

Now, every PR is a battlefield:

  • Infra team wants you to add Terraform
  • Security drops a dozen linter warnings
  • QA says you didn’t test across all 3 staging clusters
  • Your manager wants it merged by EOD

“Shift left” was supposed to be empowerment.
It turned into delegation without support.

Devs aren’t trained in ops. Ops aren’t trained in app logic. And yet the expectation is that everyone knows how to do everything, perfectly, with no mistakes and no extra time.

Spoiler: shifting left without shifting resources is just pushing blame left.

What it should be:

  • Shared tooling
  • Contextual feedback
  • Better DX (developer experience), not just more responsibility
  • Actual collaboration, not email handoffs

Because DevOps isn’t about shifting burdens. It’s about building bridges, ones that don’t collapse when someone adds a new pre-commit hook.

“Infra glue” breaks silently and randomly

The CI pipeline is green. The deployment succeeded. But nothing works.
Welcome to the magical world of infra glue: the undocumented scripts, custom wrappers, YAML black magic, and tiny bash one-liners holding your whole system together.

Nobody owns this glue. Nobody wrote tests for it. But everyone is afraid to delete it.

Signs you’re deep in glue hell:

  • There’s a scripts/ folder that hasn’t been touched since 2021
  • CI includes ./deploy.sh && ./fix.sh && ./nudge.sh
  • Your job depends on a cron job you didn’t know existed until it failed
  • Someone wrote a Python CLI tool with no readme but 20 dependencies

These things break silently. They don’t trigger alerts. They don’t log errors. They just quietly fail, leaving you wondering why the config map wasn’t updated or why traffic is routing to the wrong service.

Infra glue is where DevOps tech debt goes to hide.

How to fix it:

  • Add ownership and version control
  • Document even the small stuff (especially the small stuff)
  • Refactor glue into pipelines and repeatable modules
  • If it’s important, make it visible. If it’s not, kill it.

Because nothing is scarier than a post_deploy_fallback.sh file that no one remembers writing… but everyone’s afraid to remove.
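
You can’t always delete the glue, but you can make it fail loudly. A minimal header sketch for those scripts: strict mode, timestamped logging, and an error trap, so a failure becomes a logged, non-zero exit instead of silence (the Slack webhook line is a commented-out assumption, not something this setup defines for you):

```bash
#!/usr/bin/env bash
# glue-template.sh -- boilerplate header for the small scripts that hold deploys together,
# so they fail loudly instead of silently.
set -euo pipefail   # stop on errors, unset variables, and broken pipes

log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }

on_error() {
  local exit_code=$?
  log "FAILED at line $1 (exit $exit_code): $BASH_COMMAND"
  # Illustrative hook: forward the failure somewhere visible, e.g. a chat webhook.
  # curl -s -X POST -d "{\"text\":\"glue script failed: $0\"}" "$SLACK_WEBHOOK_URL" || true
  exit "$exit_code"
}
trap 'on_error $LINENO' ERR

log "starting $0"
# ... the actual glue goes here ...
log "finished $0"
```

It doesn’t make post_deploy_fallback.sh any less mysterious, but at least it can no longer fail without leaving a trace.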

Tickets pile up while fire drills never end

You open your task board.
It’s packed with infra cleanup tickets, automation upgrades, and “migrate to Terraform 1.6” cards from three quarters ago.

Then your phone buzzes: DNS is broken again.
You drop everything. Another fire. Another urgent deploy. Another config rollback that wasn’t versioned properly.

And so it goes:

  • You chase incidents
  • You patch what’s breaking
  • You promise to revisit that long-term fix “next sprint”

But the sprint ends.
Then the next one.
And that tech debt backlog? It’s basically a DevOps museum now.

The cycle:

  • No time to improve infrastructure
  • Because you’re always reacting to infrastructure
  • Which never improves… because no one has time

It’s not that teams don’t care. It’s that they’re stuck in perma-reactive mode, drowning in red alerts, half-migrated configs, and endless status check-ins.

How to stop the spiral:

  • Block dedicated “non-fire” time each sprint
  • Tie infra tickets to actual OKRs (not just “good to have”)
  • Assign real owners to lingering tasks or kill them
  • Celebrate the boring wins: the cleanup no one sees, but everyone benefits from

Because DevOps isn’t just about keeping things running; it’s about creating space to make them better.

Conclusion: DevOps isn’t broken, but it’s bruised

We love DevOps: the automation, the culture, the tooling.
But let’s not pretend it’s all perfect.

Behind every green CI badge is a bash script no one wants to touch.
Behind every alert is someone who hasn’t slept through the night in weeks.
And behind every “devs should own it” post is a team already stretched thin.

This isn’t a rage-quit moment. It’s a reality check.
These 10 frustrations don’t mean DevOps is failing; they mean we still have work to do:

  • Fix the glue
  • Share the ownership
  • Stop pretending reruns are good enough
  • And treat observability and on-call like human issues, not just tech ones

You’re not alone in this. If you’ve felt burnt out, buried in tickets, or betrayed by a pipeline at 1 AM, welcome. You’re in good company.

DevOps isn’t about having no problems. It’s about building systems that survive them.
