ITSM Crowd 56 – Site Reliability Engineering

In our latest episode of the ITSM Crowd, Claire Agutter was joined by Jason Hand (Microsoft) and Doug Tedder to talk about site reliability engineering (SRE). 

Site reliability engineering can be described as an engineering principle applied to products and services to find an appropriate level of reliability. Rather than focusing on component monitoring and ‘five nines’ (99.999%) targets, SRE looks at how products and services are actually used and what good looks like from a business perspective.

Jason shared insights into service level indicators (SLIs), objectives (SLOs) and agreements (SLAs), and explained how error budgets can be used to make space for innovation.
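To make the error budget idea concrete, here is a minimal sketch of the arithmetic, not something taken from the episode. It assumes a request-based SLI (the share of requests that succeed), and the function names and figures are hypothetical, chosen only for illustration.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests the SLO tolerates over the measurement window.

    slo_target is a fraction, e.g. 0.999 for a 99.9% success objective.
    """
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget to spend at all.
        return 0.0
    return 1 - failed_requests / budget


# Hypothetical window: 1,000,000 requests against a 99.9% SLO.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"Allowed failures: {allowed}, budget remaining: {remaining:.0%}")
```

While some budget remains, a team has room to ship changes and take risks. Once it is spent, the usual response is to slow releases and prioritise reliability work.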

SRE sits within a wider organizational context that changes how we structure our product teams and how we think of ‘done’ – this episode will challenge you to look at your products and services in new ways.

More resources

You can learn more about SRE via these links:

Site Reliability Engineering: How Google Runs Production Systems (via O’Reilly)

Free SRE online training (via Microsoft)

SRE Weekly (email service)

Watch this episode

You can watch the episode below, and ask our panel follow-up questions using the hashtag #ITSMCrowd on Twitter.

Be our guest

Got a burning topic you want to discuss on the ITSM Crowd? Contact us and we’ll be in touch.

Coming up

Join us on July 8th at 9am UK time with special guests Rob England and Dr Cherry Vu, who will be talking about their new book, The agile Manager. That’s ‘agile’ with a small a!