Awesome Site Reliability Engineering

A curated list of Site Reliability and Production Engineering resources.

Last Update: Dec. 2, 2020, 12:05 p.m.

Thank you dastergon & contributors
View Topic on GitHub:
dastergon/awesome-sre

Culture

DevOps Enterprise Summit, London, June 25-26, 2018

110
15
2y 5m
n/a

A set of Site Reliability Engineering challenges

7
0
10m
n/a

Video: https://www.youtube.com/watch?v=__wypEhdcrQ&t=0s

Part 2: https://gcppodcast.com/post/episode-59-sre-ii-with-paul-newson/

Part 2: https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0

Education

Books

Hiring

Reliability

A curated list of Chaos Engineering resources.

3.97K
419
8d
CC0-1.0

Part 2: https://medium.com/@jerub/the-production-environment-at-google-part-2-610884268aaa

Slides: https://speakerdeck.com/rakyll/are-you-ready-for-production

Monitoring & Observability & Alerting

On-Call

Tips and tricks for getting through on-call

335
44
1y 6m
n/a

Run Book / Operations Manual template for modern software systems

506
279
1y 104d
n/a

A lifecycle model for describing incident management

21
6
4y 10d
n/a

Documents that describe parts of the PagerDuty Incident Response process. It provides information not only on preparing for an incident, but also what to do during and after. Source is available on GitHub.

Post-Mortem

A collection of postmortems. Sorry for the delay in merging PRs!

7.73K
293
4d
n/a

Compilation of public failure/horror stories related to Kubernetes

6.23K
284
101d
n/a

A collection of postmortem templates

588
213
4m
CC0-1.0

Capacity Planning

Service Level Agreement

Performance

Programming

Misc Articles

📙 Amazon Web Services — a practical guide

27.62K
2.79K
29d
CC-BY-4.0

Real-time Messaging

Blogs

Highly Technical Blog Posts About Systems Internals, Performance and SRE.

Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.

Technical Blog Posts About Systems Architecture.

A mailing list about building resilient infrastructure and tools.

Various blog posts about SRE, Software Engineering and Microservices.

One article for each day of December, ending with the 25th.

A collection of resources for developers to strengthen their Ops skills.

A digital magazine about how teams build and operate software systems at scale.

Blog Posts about Go and SRE.

Blog posts about distributed systems and their management.

Blog posts about SRE culture and practices.

Weekly analysis of Resilience Engineering and Human Factors research designed for software systems.

Blog posts about SRE best practices, reliability, on-call and incident management.

Newsletters

A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.

The weekly newsletter for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas.

Weekly Site Reliability Newsletter.

Weekly systems engineering and operations news and insights from industry insiders.

Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!

Conferences & Meetups

Twitter

The Official Twitter Account of the Site Reliability Engineering Book.

SRECon's Official Twitter Account.

The Official Twitter Account of the Site Reliability Workbook.

The Official Twitter Account of Twitter's SRE team.

The Official Twitter Account of the SRE Weekly Newsletter.

The Official USENIX Twitter Account.

SRE Tools

A curated list of Site Reliability and Production Engineering Tools

124
22
13d
CC-BY-4.0

List of Continuous Integration services

2.33K
198
112d
MIT