Awesome Site Reliability Engineering
A curated list of Site Reliability and Production Engineering resources.
Thank you dastergon & contributors
DevOps Enterprise Summit, London, June 25-26, 2018
A set of Site Reliability Engineering challenges
Sample chapter titled CPUs
A curated list of Chaos Engineering resources.
Monitoring & Observability & Alerting
Tips and tricks for getting through on-call
Run Book / Operations Manual template for modern software systems
A lifecycle model for describing incident management
Documents that describe parts of the PagerDuty Incident Response process. It provides information not only on preparing for an incident, but also what to do during and after. Source is available on GitHub.
A collection of postmortems. Sorry for the delay in merging PRs!
Compilation of public failure/horror stories related to Kubernetes
A collection of postmortem templates
Service Level Agreement
📙 Amazon Web Services — a practical guide
Discussion of Site Reliability Engineering generally.
Highly Technical Blog Posts About Systems Internals, Performance and SRE.
Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
Various blog posts about SRE, Software Engineering and Microservices.
One article for each day of December, ending on the 25th article.
A digital magazine about how teams build and operate software systems at scale.
Blog posts about distributed systems and their management.
Weekly analysis of Resilience Engineering and Human Factors research designed for software systems
Blog posts about SRE best practices, reliability, on-call and incident management.
A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.
The weekly newsletter for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas.
Weekly systems engineering and operations news and insights from industry insiders.
Conferences & Meetups
Prominent Conference About SysAdmin/DevOps/SRE.
A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
SRE Meetup in the greater area of Oktoberfest city.
A 24-hour conference that is completely online and free.
SRE Meetup in the city of light.
The Official Twitter Account of Site Reliability Engineering Book.
The Official Twitter Account of Site Reliability Workbook.
The Official Twitter Account of SRE Weekly Newsletter.