Reddit » Site Reliability Engineers
Reddit gives you the best of the internet in one place. Get a constantly updating feed of everything about site reliability engineering. A subreddit for Site Reliability Engineers.
21h ago
I'm looking for a simple way to post a PagerDuty team rotation into a Slack channel.
I'm looking at a tool called Pagerly at the moment, but before I reach out to them, are there any other tools to consider?
submitted by /u/Blyd
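For anyone weighing a small script against a dedicated tool, here is a minimal sketch of the idea, not a recommendation of any product. It assumes PagerDuty's REST API `GET /oncalls` endpoint and a Slack incoming webhook. The `api_token`, `schedule_id`, and `webhook_url` values are placeholders, and the response fields read here (`schedule.summary`, `user.summary`, `end`) should be checked against the current PagerDuty API reference before relying on them.

```python
import json
import urllib.request

# Public PagerDuty REST API endpoint for on-call entries.
PAGERDUTY_ONCALLS_URL = "https://api.pagerduty.com/oncalls"


def fetch_oncalls(api_token, schedule_id):
    """Return on-call entries for one schedule (api_token, schedule_id are placeholders)."""
    request = urllib.request.Request(
        f"{PAGERDUTY_ONCALLS_URL}?schedule_ids%5B%5D={schedule_id}",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["oncalls"]


def format_rotation(oncalls):
    """Turn on-call entries into one line of text per person."""
    lines = []
    for entry in oncalls:
        # "schedule" can be null when someone is on call directly via a policy.
        schedule = (entry.get("schedule") or {}).get("summary", "no schedule")
        user = entry["user"]["summary"]
        end = entry.get("end") or "no end date"
        lines.append(f"{schedule}: {user} (until {end})")
    return "\n".join(lines) if lines else "Nobody is on call."


def post_to_slack(webhook_url, text):
    """Send the text to a Slack incoming webhook (webhook_url is a placeholder)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Run from a daily cron job, `post_to_slack(webhook_url, format_rotation(fetch_oncalls(api_token, schedule_id)))` would post the current rotation. Keeping the formatting separate from the two network calls makes it easy to test without credentials.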
21h ago
Hey team,
We have been calculating and monitoring latency, throughput and errors for all our services. We also calculate availability using the success rate, which is total successful requests / total requests for a given period of time.
I am trying to improve our reliability metrics by adding an error budget and defining SLIs and SLOs.
But I want to know if we can calculate mean time between failures, mean time to recover and reliability of a service using this. Has anyone done this? Can you help me with some details?
When I say mean time between failures, can I consider the time between two 5xx errors? Or we should consider ..read more
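The quantities asked about above can be sketched in a few lines. This is only an illustration under stated assumptions: availability is the success rate described in the post, the error budget is the share of requests an SLO target allows to fail, and a "failure" for MTBF/MTTR is an outage interval (for example a sustained burst of 5xx responses) and not a single failed request. The function names and the incident format are hypothetical.

```python
def availability(successful, total):
    """Success rate: successful requests divided by total requests."""
    return successful / total if total else 1.0


def error_budget_remaining(slo_target, successful, total):
    """Fraction of the error budget left; a negative value means the SLO is breached."""
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 0.0
    actual_failures = total - successful
    return 1.0 - actual_failures / allowed_failures


def mttr_and_mtbf(incidents, window_minutes):
    """incidents: list of (start_minute, end_minute) outages inside the window.

    MTTR = total downtime / number of incidents.
    MTBF = total uptime / number of incidents.
    """
    if not incidents:
        return None, None
    downtime = sum(end - start for start, end in incidents)
    mttr = downtime / len(incidents)
    mtbf = (window_minutes - downtime) / len(incidents)
    return mttr, mtbf
```

For example, with a 99% SLO, 10,000 requests and 50 failures, half of the error budget is left. Two outages of 30 and 90 minutes in a 30-day window (43,200 minutes) give an MTTR of 60 minutes and an MTBF of 21,540 minutes.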
21h ago
I just came across this article by an SRE at Slack discussing how they put together a script for post-deployment monitoring of their critical metrics.
tl;dr 3 things that I found insightful:
Slack previously had 2 Deployment Commanders (developers rotating in from engineering) to manually deploy and monitor their monolith (which has about 200 PR merges a day and ~40 releases/day)
Instead of trying to do generic monitoring, they do a today vs yesterday vs LWSD (last week same day) evaluation of the same metrics
Set up statistical analysis to monitor the z-scores of these metrics and get alerted in case they ..read more
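The z-score idea in the summary above can be illustrated with a short sketch. This is not Slack's script, which the post does not show. It assumes the baseline is a list of samples of the same metric taken from yesterday and from the same day last week, and that an alert fires when the current value is more than 3 standard deviations from the baseline mean. The threshold and function names are assumptions.

```python
import statistics


def z_score(current, baseline):
    """How many sample standard deviations `current` is from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # needs at least two baseline samples
    if stdev == 0:
        # A perfectly flat baseline: any deviation at all is infinitely unusual.
        return 0.0 if current == mean else float("inf")
    return (current - mean) / stdev


def is_anomalous(current, baseline, threshold=3.0):
    """True when the current value is further than `threshold` deviations away."""
    return abs(z_score(current, baseline)) > threshold
```

With a baseline of `[10, 10, 12, 8]` errors per minute, a post-deploy reading of 20 is flagged, while a reading of 10.5 is not. Comparing against the same time window on earlier days keeps normal daily and weekly traffic patterns from looking like anomalies.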
21h ago
Our Compliance team tells us to “check this report daily, escalate to us if the alerts it raises are not false”. It raises false alerts every day. They are unable to put in logic to alert reliably, because they are unsure whether an EU regulation might apply in non-EU markets. My boss doesn’t allow me to challenge that, so we are left with false alerts daily. It raises a valid alert perhaps once a year.
submitted by /u/syhlheti
3d ago
I created an app for developers and DevOps engineers called Snipman.io >>> https://snipman.io
It is a self-hosted code snippet management app (currently free to download on Mac and Windows) that lets you store snippets by snippet type and tag.
I primarily created it because I found myself creating a lot of text files for small code snippets for different programming languages, frameworks, and cloud and DevOps tools and technologies, e.g. AWS, GCP, Terraform, Kubernetes, Docker etc. This not only resulted in a lot of clutter but was also a pain when it came to searching.
My ..read more
3d ago
I am an SRE and I have recently been learning MySQL. I noticed a problem: if I need to update the structure of a table or do other operations, doing so may lock the table, so I would have to schedule outage maintenance and stop the application from writing data, but the maintenance might last an hour. Due to website stability requirements, I should adopt a better solution, such as one with only a few seconds of interruption. I guess I can migrate the old database to a new database. When the migration is completed, the application service will reconnect to the new database.
I have never practiced it in a production envi ..read more
3d ago
Hello folks, I wanted to understand backup and recovery in terms of SRE. Basically, what do we do, what do we back up, and how do we do it? Do we use any tools?
This might be vague, but I really don't have much understanding of this subject.
submitted by /u/Extreme-Opening7868
3d ago
I have AWS Certified Developer and Azure AZ-900 certifications and I’m looking to get deeper into the SRE certification space. Does anyone know how helpful such certifications are (I see some offered by DevOps Institute, for example)? If there are any other good certifications, any guidance is appreciated!
submitted by /u/prithvim1993
3d ago
submitted by /u/vfarcic
6d ago
submitted by /u/ankitdce