SRE WEEKLY
SRE Weekly is a newsletter devoted to everything related to keeping a site or service available as consistently as possible. It's about a holistic view of reliability that takes into account everything from servers to human factors to processes to automation and more.
1w ago
A message from our sponsor, FireHydrant:
FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates. https://firehydrant.com/blog/ai-for-incident-management-is-here/
1.0 Launch Retrospective
The game Last Epoch launched in February, and they had a rocky start. This huge retrospective post tells the story of what happened and how they fixed it.
EHG_Kain — Last Epoch
Autonomous hardware diagnostics and recovery at scale
Cloudflare’s Phoenix system can find and recover fa ...
2w ago
How Figma’s Databases Team Lived to Tell the Scale
Our nine-month journey to horizontally shard Figma’s Postgres stack, and the key to unlocking (nearly) infinite scalability.
Retrofitting sharding is a huge undertaking.
Sammy Steele — Figma
Moving fast breaks things: the importance of a staging environme ...
3w ago
Redefining Observability
The observability waters have been muddy for a while, and this article does a great job of taking a step back and building a definition and a roadmap.
Hazel Weakly
A Commentary on Defining Observability
Fred Hebert wrote this response/follow-on to Hazel’s article:
The main points ...
1M ago
A message from our sponsor, FireHydrant:
Join FireHydrant this Thursday for a conversation about on-call burnout and how to prevent it. Get a better understanding of what makes a fatigue-free on-call culture, including real-world examples from your incident management peers. No sales, just shop talk.
https://app.livestorm.co/firehydrant/better-incidents-spring-bonfire-secrets-to-fatigue-free-on-call-in-2024
Harnessing chaos in Cloudflare offices
Remember that cool lava lamp random number generator that Cloudflare uses? Now they have a couple of other sources of entropy, and they’re teaming ...
1M ago
A message from our sponsor, FireHydrant:
We need tools that help us show our value, enhance understanding of our systems, and free time for us to expand our skills. In this article, FireHydrant lays out three questions to ask vendors as you evaluate DevOps tools. https://firehydrant.com/blog/3-questions-to-ask-of-any-devops-tool-in-2024/
4 Instructive Postmortems on Data Downtime and Loss
What can we, in turn, learn from some of the most honest and blameless—and public—postmortems of the last few years?
They cover incidents from GitLab, Tarsnap, Roblox, and Cloudflare with great summarie ...
1M ago
A message from our sponsor, FireHydrant:
Join FireHydrant and talk shop with your DevOps peers on March 28! You’ll gain a better understanding of what makes a fatigue-free on-call culture and how to implement practices to improve yours at this free, virtual roundtable.
https://app.livestorm.co/firehydrant/better-incidents-spring-bonfire-secrets-to-fatigue-free-on-call-in-2024
The Wrong Way to Use DORA Metrics
[…] it must be said that the intent of these metrics was always to give an indicator of how well your team was delivering software, not a high-stakes metric that should be used, for ...
2M ago
Sorry about the automation fail and resend! That definitely wasn’t issue #1.
A message from our sponsor, FireHydrant:
Check out how global payments company Dock uses FireHydrant to streamline and consolidate their incident management stack and reduce what they call “mean time to combat.”
https://firehydrant.com/blog/the-revolution-in-critical-incident-response-at-dock-with-firehydrant/
The Domain of Failure
This article discusses building failure management directly into our systems, using Erlang as a case study.
Jamie Allen
Cinnamon: Using Century Old Tech to Build a Mean L ...
2M ago
A message from our sponsor, FireHydrant:
FireHydrant’s new and improved MTTX analytics dashboard is here! See which services are most affected by incidents, where they take the longest to detect (or acknowledge, mitigate, resolve … you name it), and how metrics and statistics change over time.
https://firehydrant.com/blog/mttx-incident-analytics-to-drive-your-reliability-roadmap/
The Single Pain of Glass
Can a single dashboard that covers your entire system really exist?
Jamie Allen
The importance of SEV-1 call leaders
This one makes the case for having a group of specially-tr ...
3M ago
A message from our sponsor, FireHydrant:
It’s time for a new world of alerting tools that prioritize engineer well-being and efficiency. The future lies in intelligent systems that are compatible with real life and use conditional rules to adapt and refine thresholds, reducing alert fatigue.
https://firehydrant.com/blog/the-alert-fatigue-dilemma-a-call-for-change-in-how-we-manage-on-call/
Executing Cron Scripts Reliably At Scale
I’ve occasionally wondered what’s behind Slack’s /remind or “clear my away status after my vacation ends”. Now I know!
Claire Adams
Consistency
Thi ...
3M ago
A message from our sponsor, FireHydrant:
Signals is now available in beta. Sign up to experience alerting for modern DevOps teams: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform — ring to retro — finally. https://firehydrant.com/blog/signals-beta-live/
On chains and complex systems
If you really want to understand how complex systems fail, you need to think in terms of webs rather than chains.
Lorin Hochstein
Practitioners Share How They Remove the Fear of On-Call
We asked members of the PagerDuty Community what t ...