How SRE relates to DevOps -- Foundations. Implementing SLOs -- SLO engineering case studies -- Alerting on SLOs -- Eliminating toil -- Simplicity -- Practices. On-call -- Incident response -- Postmortem culture: learning from failure -- Managing load -- Introducing non-abstract large system design -- Data processing pipelines -- Configuration design and best practices -- Configuration specifics -- Canarying releases -- Processes. Identifying and recovering from overload -- SRE engagement model -- SRE: reaching beyond your walls -- SRE team lifecycles -- Organizational change management in SRE -- A. Example SLO document -- B. Example error budget policy -- C. Results of postmortem analysis.
Text of Note
Intro; Copyright; Table of Contents; Foreword I; Foreword II; Preface; Conventions Used in This Book; Using Code Examples; O'Reilly Safari; How to Contact Us; Acknowledgments; Chapter 1. How SRE Relates to DevOps; Background on DevOps; No More Silos; Accidents Are Normal; Change Should Be Gradual; Tooling and Culture Are Interrelated; Measurement Is Crucial; Background on SRE; Operations Is a Software Problem; Manage by Service Level Objectives (SLOs); Work to Minimize Toil; Automate This Year's Job Away; Move Fast by Reducing the Cost of Failure; Share Ownership with Developers
Evernote's SLO Story; Why Did Evernote Adopt the SRE Model?; Introduction of SLOs: A Journey in Progress; Breaking Down the SLO Wall Between Customer and Cloud Provider; Current State; The Home Depot's SLO Story; The SLO Culture Project; Our First Set of SLOs; Evangelizing SLOs; Automating VALET Data Collection; The Proliferation of SLOs; Applying VALET to Batch Applications; Using VALET in Testing; Future Aspirations; Summary; Conclusion; Chapter 4. Monitoring; Desirable Features of a Monitoring Strategy; Speed; Calculations; Interfaces; Alerts; Sources of Monitoring Data; Examples
Managing Your Monitoring System; Treat Your Configuration as Code; Encourage Consistency; Prefer Loose Coupling; Metrics with Purpose; Intended Changes; Dependencies; Saturation; Status of Served Traffic; Implementing Purposeful Metrics; Testing Alerting Logic; Conclusion; Chapter 5. Alerting on SLOs; Alerting Considerations; Ways to Alert on Significant Events; 1: Target Error Rate ≥ SLO Threshold; 2: Increased Alert Window; 3: Incrementing Alert Duration; 4: Alert on Burn Rate; 5: Multiple Burn Rate Alerts; 6: Multiwindow, Multi-Burn-Rate Alerts; Low-Traffic Services and Error Budget Alerting
Moving from SLI Specification to SLI Implementation; Measuring the SLIs; Using the SLIs to Calculate Starter SLOs; Choosing an Appropriate Time Window; Getting Stakeholder Agreement; Establishing an Error Budget Policy; Documenting the SLO and Error Budget Policy; Dashboards and Reports; Continuous Improvement of SLO Targets; Improving the Quality of Your SLO; Decision Making Using SLOs and Error Budgets; Advanced Topics; Modeling User Journeys; Grading Interaction Importance; Modeling Dependencies; Experimenting with Relaxing Your SLOs; Conclusion; Chapter 3. SLO Engineering Case Studies
Use the Same Tooling, Regardless of Function or Job Title; Compare and Contrast; Organizational Context and Fostering Successful Adoption; Narrow, Rigid Incentives Narrow Your Success; It's Better to Fix It Yourself; Don't Blame Someone Else; Consider Reliability Work as a Specialized Role; When Can Substitute for Whether; Strive for Parity of Esteem: Career and Financial; Conclusion; Part I. Foundations; Chapter 2. Implementing SLOs; Why SREs Need SLOs; Getting Started; Reliability Targets and Error Budgets; What to Measure: Using SLIs; A Worked Example
SUMMARY OR ABSTRACT
Text of Note
Expands on Google's SRE practices, providing worked examples for each essential facet of the discipline, prepared in cooperation with Google Cloud customers and based on their experiences. Explains methodology for running services at scale and for starting SRE in either a greenfield or a brownfield setting.