sre

Browse all articles, tutorials, and guides about SRE

10 posts

Posts

2026-06-19|10 min read

AI SRE Agents: What They Actually Fix, and What They Will Happily Break

AI SRE is now its own category, with every incident vendor shipping an agent that investigates and remediates on its own. Here is the honest split: where these agents genuinely earn their keep, where they are oversold, and the one risk nobody puts on the marketing page.

DevOps

2026-06-11|12 min read

Neon vs Supabase in Production: We Benchmarked the Operations That Page You at 3am

Two benchmark sessions against Neon and Supabase Pro measured what spec sheets never show: compute resizes cost 39 seconds of real downtime on one platform and zero on the other, read replicas differ by 23x, and branch creation has a tail you should know about.

DevOps

2026-05-25|11 min read

How to Build an Effective On-Call Rotation and Escalation Policy

Your phone buzzed at 3:14 AM for a disk warning that auto-resolved by 3:16. Nobody fixes the alert. The next person on rotation hates their life. Here is how to build on-call schedules, escalation policies, and alert rules that respect your engineers.

Kubernetes

2026-05-18|12 min read

Karpenter Spot Storm Fallback Gap: The Production Loop Nobody Talks About

When AWS spot capacity dries up in a region, Karpenter does not automatically fall back to on-demand. It retries the same dying offerings on a 3-minute loop. Here is why, and how to design around it.

Kubernetes

2026-05-18|11 min read

Running Your First Chaos Engineering Experiment with Litmus

A hands-on walkthrough of installing LitmusChaos on Kubernetes, killing pods on purpose, and watching whether your app actually recovers. Real YAML, real output, no theory.

DevOps

2026-05-05|9 min read

10 GitHub Repositories That Will Actually Teach You DevOps in 2026

Most "top DevOps repos" lists are recycled awesome-list links. This one is a curated set of repositories that will move the needle on your DevOps skills, with star counts, who each one is for, and how to actually use it.

DevOps

2026-04-13|10 min read

SLOs, SLIs, and Error Budgets: A Practical Implementation Guide

Your service went down at 2 AM and nobody could agree on whether it was "bad enough" to page someone. SLOs, SLIs, and error budgets fix that. Here is how to define, measure, and act on them with real Prometheus queries and alerting rules.

DevOps

2026-03-26|7 min read

5 DevOps Books Worth Reading in 2026

A curated list of DevOps books that are actually worth your time in 2026, from beginner Linux guides to production Kubernetes patterns and the SRE bible.

DevOps

2026-01-25|12 min read

DevOps vs SysAdmin vs SRE: What's the Difference?

Confused about DevOps, SysAdmin, and SRE roles? This beginner-friendly guide uses real-world analogies to explain what each role does, how they differ, and which path might be right for you.

DevOps

2025-07-14|12 min read

The Hidden Costs of Over-Automation in DevOps

Automation speeds things up, but too much of it can hide failures, slow incident response, and add fragile layers you have to maintain.