PagerDuty Rotations posted to Slack
Reddit » Site Reliability Engineers
by /u/Blyd
21h ago
Looking for a way to simply post a PagerDuty team rotation into a Slack channel. I'm looking at a tool called Pagerly at the moment, but before I reach out to them, are there any other tools to consider?
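For anyone who would rather script this than adopt a tool, here is a minimal sketch, not a tested integration. It assumes the PagerDuty REST API `GET /oncalls` endpoint and a Slack incoming webhook. The function names, the message layout, and the sample users are made up for illustration, and the API key, schedule ID, and webhook URL are values you would supply yourself.

```python
import json
import urllib.parse
import urllib.request

PAGERDUTY_ONCALLS_URL = "https://api.pagerduty.com/oncalls"


def fetch_oncalls(api_key, schedule_id):
    """Fetch current on-call entries for one schedule (assumes a REST API key)."""
    query = urllib.parse.urlencode({"schedule_ids[]": schedule_id})
    req = urllib.request.Request(
        f"{PAGERDUTY_ONCALLS_URL}?{query}",
        headers={
            "Authorization": f"Token token={api_key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["oncalls"]


def format_rotation(oncalls):
    """Turn on-call entries into a plain-text Slack message (layout is illustrative)."""
    lines = ["Current on-call rotation:"]
    for oc in oncalls:
        # "schedule" and "end" can be null, so fall back to placeholder text.
        schedule = (oc.get("schedule") or {}).get("summary", "no schedule")
        user = oc["user"]["summary"]
        end = oc.get("end") or "no end date"
        lines.append(f"- {schedule}: {user} (until {end})")
    return "\n".join(lines)


def post_to_slack(webhook_url, text):
    """Send the message to a Slack incoming webhook as a JSON payload."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Run on a cron schedule, this would be `post_to_slack(webhook_url, format_rotation(fetch_oncalls(api_key, schedule_id)))`. Keeping the formatting separate from the two network calls makes it easy to check the message text without touching either service.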
Reliability metrics
by /u/Future-Papaya-1840
21h ago
Hey team, we have been calculating and monitoring latency, throughput, and errors for all our services. We also calculate availability using success rate, which is total successful requests / total requests for a given period of time. I am trying to improve our reliability metrics by adding an error budget and defining SLIs and SLOs. But I want to know if we can calculate mean time between failures, mean time to recover, and the reliability of a service using this. Has anyone done this? Can you help me with some details? When I say mean time between failures, can I consider the time between two 5xx responses? Or should we consider ...
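As a rough sketch of how these metrics relate, the snippet below computes availability from request counts, the share of error budget left for a given SLO target, and MTTR/MTBF from a list of incidents. One assumption to flag: an "incident" here is a contiguous period in which the service breached its SLO, not the gap between two individual 5xx responses. The `Incident` type and all function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record: one contiguous period of SLO-violating behaviour."""
    start: datetime
    end: datetime


def availability(successful: int, total: int) -> float:
    """Success rate: successful requests / total requests."""
    return successful / total if total else 1.0


def error_budget_remaining(successful: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total
    if allowed_failures == 0:
        return 0.0
    return 1 - (total - successful) / allowed_failures


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recover: average incident duration."""
    downtime = sum((i.end - i.start for i in incidents), timedelta())
    return downtime / len(incidents)


def mtbf(incidents: list[Incident], window_start: datetime, window_end: datetime) -> timedelta:
    """Mean time between failures: total uptime in the window / number of failures."""
    downtime = sum((i.end - i.start for i in incidents), timedelta())
    return (window_end - window_start - downtime) / len(incidents)
```

For example, with a 99.9% SLO and 5 failures out of 10,000 requests, about half of the error budget remains. Two incidents of 10 and 20 minutes within a 24-hour window give an MTTR of 15 minutes and an MTBF of 11 hours 45 minutes.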
Post Deployment monitoring
by /u/siddharthnibjiya
21h ago
I just came across this article by an SRE at Slack discussing how they put together a script for post-deployment monitoring of their critical metrics. tl;dr 3 things that I found insightful:
1. Slack previously had 2 Deployment Commanders (rotating devs from engineering) to manually deploy and monitor their monolith (which has about 200 PR merges a day and ~40 releases/day).
2. Instead of trying to do generic monitoring, they do a today vs yesterday vs LWSD (last week same day) evaluation of the same metrics.
3. They set up statistical analysis to monitor the z-scores of these metrics and get alerted in case they ...
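The today vs yesterday vs LWSD comparison can be sketched roughly like this. It is an illustration of z-score checking in general, not Slack's actual script: the function names, the threshold of 3, and the rule "flag if any baseline is exceeded" are all assumptions.

```python
from statistics import mean, stdev


def z_score(value: float, baseline: list[float]) -> float:
    """How many sample standard deviations `value` lies from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A flat baseline makes any deviation infinitely surprising.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma


def check_metric(current: float, baselines: dict[str, list[float]], threshold: float = 3.0):
    """Compare the current value against each baseline window.

    `baselines` maps a label such as "yesterday" or "lwsd" to samples of the
    same metric taken over the same time-of-day window.
    Returns (per-baseline z-scores, whether any of them exceeds the threshold).
    """
    scores = {name: z_score(current, samples) for name, samples in baselines.items()}
    anomalous = any(abs(z) > threshold for z in scores.values())
    return scores, anomalous
```

A post-deploy check would call `check_metric` for each critical metric shortly after a release and alert when `anomalous` is true. Whether a breach of one baseline or of both should be required is a tuning decision the excerpt does not spell out.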
Regulatory toil
by /u/syhlheti
21h ago
Our Compliance team says “check this report daily, escalate to us if the alerts it raises are not false”. It raises false alerts every day. They are unable to put in logic to alert reliably, because they are unsure whether an EU regulation might apply in non-EU markets. My boss doesn’t allow me to challenge that, and we are left with false alerts daily. It raises a valid alert perhaps once a year.
My self-hosted app for SREs/DevOps engineers to deal with all the tools and technologies
by /u/dev_user1091
3d ago
I created an app for developers and DevOps engineers called Snipman.io: https://snipman.io It is a self-hosted code snippet management app (currently free to download on Mac and Windows) that lets you store snippets by snippet type and tag. I primarily created it because I found myself creating a lot of text files for small code snippets for different programming languages, frameworks, and cloud and DevOps tools and technologies, e.g. AWS, GCP, Terraform, Kubernetes, Docker, etc. This not only resulted in a lot of clutter but was also a pain when it came to searching. My ...
How to update database or table stably
by /u/pangfaheng
3d ago
I am an SRE and have recently been learning MySQL. I noticed a problem: if I need to change the structure of a table or do other operations, doing so may lock the table, so I would have to schedule maintenance downtime and stop the application from writing data, but the maintenance might last an hour. Due to website stability requirements, I should adopt a better solution, such as one with only a few seconds of interruption. I guess I could migrate the old database to a new database; when the migration is complete, the application service would reconnect to the new database. I have never practiced it in a production envi ...
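The "copy to a new structure, then switch over" idea can be sketched at table level as a shadow-table migration. The snippet uses Python's built-in `sqlite3` purely as a runnable stand-in for MySQL. The `users` table, the added `email` column, and the function name are hypothetical, and the connection is assumed to be opened with `isolation_level=None` so that transactions are explicit. It deliberately omits the hard part: capturing writes that arrive while the copy is running, which tools such as gh-ost (via the binlog) and pt-online-schema-change (via triggers) handle for MySQL.

```python
import sqlite3


def migrate_with_shadow_table(conn: sqlite3.Connection, batch_size: int = 1000) -> None:
    """Add a column by copying into a shadow table, then swapping names."""
    # 1. Create the shadow table with the new schema.
    conn.execute(
        "CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )

    # 2. Copy rows in small primary-key-ordered batches so no single
    #    statement holds a lock on the original table for long.
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.execute("BEGIN")
        conn.executemany("INSERT INTO users_new (id, name) VALUES (?, ?)", rows)
        conn.execute("COMMIT")
        last_id = rows[-1][0]

    # 3. Swap the tables in one short transaction; this is the only step
    #    where the application is briefly blocked.
    conn.execute("BEGIN")
    conn.execute("ALTER TABLE users RENAME TO users_old")
    conn.execute("ALTER TABLE users_new RENAME TO users")
    conn.execute("COMMIT")
```

On MySQL the swap step would typically be a single `RENAME TABLE users TO users_old, users_new TO users`, which is atomic. Many schema changes on InnoDB can also run as online DDL without a table copy, so it is worth checking whether a given change needs this approach at all.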
Backup and Recovery in SRE Practice
by /u/Extreme-Opening7868
3d ago
Hello folks, I wanted to understand backup and recovery in terms of SRE: basically what we do, what we back up, and how we do it. Do we use any tools? This might be vague, but I really don't have much understanding of this subject.
Are Certifications helpful?
by /u/prithvim1993
3d ago
I have AWS Certified Developer and Azure AZ-900 certifications and I’m looking to get deeper into the SRE certification space. Does anyone know how helpful such certifications are (I see some offered by DevOps Institute, for example)? If there are any other good certifications, any guidance is appreciated!
Mastering Kubernetes: Dive into Workloads APIs
by /u/vfarcic
3d ago
A Guide to Unit Testing Prometheus Alerts
by /u/ankitdce
6d ago