How SRE Teams Use Incident Metrics to Improve Response

In this episode of The Site Reliability Podcast, Lucas and Luna dive into incident metrics: not just DORA or SLOs, but the specific numbers that help SRE teams get faster and better at incident response. They discuss mean time to acknowledge (MTTA), mean time to resolve (MTTR), and the controversial mean time between failures (MTBF), using real examples from a major cloud provider's 2023 outage. The hosts explore how tracking these metrics can reveal bottlenecks in incident response, improve runbooks, and even change team culture. They also touch on the delicate balance between using metrics for improvement and using them for blame, and share tips for SRE teams just starting their metrics journey. Whether you're a seasoned SRE or just curious about reliability engineering, this episode offers concrete insights into measuring what matters in incident response.

#SiteReliabilityEngineering #IncidentMetrics #MTTA #MTTR #MTBF #SRE #DevOps #IncidentResponse #ReliabilityEngineering #CloudComputing #Uptime #OnCall #MetricsDriven #BlamelessCulture #Runbooks #Technology #FexingoBusiness #BusinessPodcast

Keep every episode free: buymeacoffee.com/fexingo