DevOps Foundations: Site Reliability Engineering

English | MP4 | AVC 1280×720 | AAC 48 kHz 2ch | 1h 20m | 704 MB

Site reliability engineering (SRE) is an emerging paradigm in DevOps. The biggest names in tech, including Google, Netflix, Microsoft, and LinkedIn, all use SRE. In fact, industry-wide, “site reliability engineer” is replacing “DevOps engineer” in job posts. Simply put, SRE is software engineering applied to operations for the cloud-native era. This course introduces the basics of site reliability engineering, including how SRE fits into DevOps and how it can be integrated into your unique business environment. Instructors Ernest Mueller and James Wickett cover the major areas of expertise, including release engineering, change management, incident management and retrospectives, self-service automation, troubleshooting, performance, and deliberate adversity. Learn how to define reliability through SLAs and SLOs, handle crises, design distributed systems, and scale your systems and your team. Plus, explore time and project management strategies that bring humanity back to the SRE’s job.

Topics include:

  • Site reliability engineering basics
  • Release engineering
  • Change management
  • Incident management
  • Postmortems
  • Troubleshooting
  • Distributed design
  • Organization

Table of Contents

Introduction
1 Welcome
2 What you should know

SRE Basics
3 Your job as a DevOp
4 You aren’t Google or Netflix

SRE Practice Areas
5 Release engineering
6 Change management
7 Self-service automation
8 SLAs and SLOs
9 Incident management
10 Introducing postmortems
11 The postmortem process
12 Troubleshooting
13 Performance engineering
14 Capacity and scalability
15 Distributed design
16 Deliberate adversity

SRE Organization
17 Organizing SREs
18 The softer side of SRE

Conclusion
19 Next steps