SRE WEEKLY on Feedspot

SRE Weekly Issue #420

SRE WEEKLY

by lex

1w ago

A message from our sponsor, FireHydrant: FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates. https://firehydrant.com/blog/ai-for-incident-management-is-here/

1.0 Launch Retrospective
The game Last Epoch launched in February, and they had a rocky start. This huge retrospective post tells the story of what happened and how they fixed it.
EHG_Kain, Last Epoch

Autonomous hardware diagnostics and recovery at scale
Cloudflare’s Phoenix system can find and recover fa ...

SRE Weekly Issue #419

SRE WEEKLY

by lex

2w ago

A message from our sponsor, FireHydrant: FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates. https://firehydrant.com/blog/ai-for-incident-management-is-here/

How Figma’s Databases Team Lived to Tell the Scale
Our nine month journey to horizontally shard Figma’s Postgres stack, and the key to unlocking (nearly) infinite scalability. Retrofitting sharding is a huge undertaking.
Sammy Steele, Figma

Moving fast breaks things: the importance of a staging environme ...

SRE Weekly Issue #418

SRE WEEKLY

by lex

3w ago

A message from our sponsor, FireHydrant: FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates. https://firehydrant.com/blog/ai-for-incident-management-is-here/

Redefining Observability
The observability waters have been muddy for a while, and this article does a great job of taking a step back and building a definition, and a roadmap.
Hazel Weakly

A Commentary on Defining Observability
Fred Hebert wrote this response/follow-on to Hazel’s article: The main points ...

SRE Weekly Issue #417

SRE WEEKLY

by lex

1M ago

A message from our sponsor, FireHydrant: Join FireHydrant this Thursday for a conversation about on-call burnout and how to prevent it. Get a better understanding of what makes a fatigue-free on-call culture, including real-world examples from your incident management peers. No sales, just shop talk. https://app.livestorm.co/firehydrant/better-incidents-spring-bonfire-secrets-to-fatigue-free-on-call-in-2024

Harnessing chaos in Cloudflare offices
Remember that cool lava lamp random number generator that Cloudflare uses? Now they have a couple of other sources of entropy, and they’re teaming ...

SRE Weekly Issue #416

SRE WEEKLY

by lex

1M ago

A message from our sponsor, FireHydrant: We need tools that help us show our value, enhance understanding of our systems, and free time for us to expand our skills. In this article, FireHydrant lays out three questions to ask vendors as you evaluate DevOps tools. https://firehydrant.com/blog/3-questions-to-ask-of-any-devops-tool-in-2024/

4 Instructive Postmortems on Data Downtime and Loss
What can we, in turn, learn from some of the most honest and blameless (and public) postmortems of the last few years?
They cover incidents from GitLab, Tarsnap, Roblox, and Cloudflare with great summarie ...

SRE Weekly Issue #415

SRE WEEKLY

by lex

1M ago

A message from our sponsor, FireHydrant: Join FireHydrant and talk shop with your DevOps peers on March 28! You’ll gain a better understanding of what makes a fatigue-free on-call culture and how to implement practices to improve yours at this free, virtual roundtable. https://app.livestorm.co/firehydrant/better-incidents-spring-bonfire-secrets-to-fatigue-free-on-call-in-2024

The Wrong Way to Use DORA Metrics
[…] it must be said that the intent of these metrics was always to give an indicator of how well your team was delivering software, not a high-stakes metric that should be used, for ...

SRE Weekly Issue #413

SRE WEEKLY

by lex

2M ago

Sorry about the automation fail and resend! That definitely wasn’t issue #1.

A message from our sponsor, FireHydrant: Check out how global payments company Dock uses FireHydrant to streamline and consolidate their incident management stack and reduce what they call “mean time to combat.” https://firehydrant.com/blog/the-revolution-in-critical-incident-response-at-dock-with-firehydrant/

The Domain of Failure
This article discusses building failure management directly into our systems, using Erlang as a case study.
Jamie Allen

Cinnamon: Using Century Old Tech to Build a Mean L ...

SRE Weekly Issue #412

SRE WEEKLY

by lex

2M ago

A message from our sponsor, FireHydrant: FireHydrant’s new and improved MTTX analytics dashboard is here! See which services are most affected by incidents, where they take the longest to detect (or acknowledge, mitigate, resolve … you name it), and how metrics and statistics change over time. https://firehydrant.com/blog/mttx-incident-analytics-to-drive-your-reliability-roadmap/

The Single Pain of Glass
Can a single dashboard to cover your entire system really exist?
Jamie Allen

The importance of SEV-1 call leaders
This one makes the case for having a group of specially-tr ...

SRE Weekly Issue #409

SRE WEEKLY

by lex

3M ago

A message from our sponsor, FireHydrant: It’s time for a new world of alerting tools that prioritize engineer well-being and efficiency. The future lies in intelligent systems that are compatible with real life and use conditional rules to adapt and refine thresholds, reducing alert fatigue. https://firehydrant.com/blog/the-alert-fatigue-dilemma-a-call-for-change-in-how-we-manage-on-call/

Executing Cron Scripts Reliably At Scale
I’ve occasionally wondered what’s behind Slack’s /remind or “clear my away status after my vacation ends”. Now I know!
Claire Adams

Consistency Thi ...

SRE Weekly Issue #407

SRE WEEKLY

by lex

3M ago

A message from our sponsor, FireHydrant: Signals is now available in beta. Sign up to experience alerting for modern DevOps teams: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform, ring to retro, finally. https://firehydrant.com/blog/signals-beta-live/

On chains and complex systems
If you really want to understand how complex systems fail, you need to think in terms of webs rather than chains.
Lorin Hochstein

Practitioners Share How They Remove the Fear of On-Call
We asked members of the PagerDuty Community what t ...
