The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

By: Fexingo

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept (error budgets, toil automation, postmortem culture, capacity planning) and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover.

Lucas brings the numbers (latency percentiles, MTTR trends, SLO burn rates), while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs.

This show is for engineers, ops leads, and platform teams who already know the basics and want to debate the hard edges: Is 99.999% uptime always worth the cost? When should you deliberately degrade service to improve reliability? How do you design for resilience when your system is already in production? Lucas and Luna don't pretend to have final answers; they build the conversation so you can draw your own. If you've ever argued about whether a page was necessary or whether an SLO should be tightened, this is your show.

#SiteReliabilityEngineering #SRE #Uptime #ProductionEngineering #IncidentResponse #ErrorBudgets #SLOs #Postmortem #ToilAutomation #CapacityPlanning #Observability #DevOps #PlatformEngineering #Resilience #OnCall #FexingoBusiness #BusinessPodcast #Technology

Keep every episode free: buymeacoffee.com/fexingo

© 2026 Fexingo. All rights reserved.

Economics
Episodes
  • How SRE Teams Use Game Days to Build Incident Muscle Memory
    Jul 4 2026
    In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use Game Days (simulated incidents) to build muscle memory and improve real-world response. They explain why Netflix's Chaos Monkey was just the beginning, and how modern teams rehearse everything from network partitions to database failovers in a controlled environment. The conversation covers the key elements of a successful Game Day: blameless culture, clear objectives, and a "no surprises" wrap-up. Lucas shares a concrete example: a 2025 study by Gremlin found that teams running quarterly Game Days reduced mean time to resolution by 34 percent. They also discuss common pitfalls, such as over-engineering scenarios and failing to include non-engineering stakeholders. Listeners walk away with a practical template for starting their own Game Day program, including the three questions every drill should answer: What did we learn? What broke? What do we fix next?
    #SiteReliabilityEngineering #SRE #GameDays #IncidentResponse #ChaosEngineering #Resilience #Uptime #ProductionEngineering #FexingoBusiness #BusinessPodcast #Technology #Podcast #Netflix #ChaosMonkey #Gremlin #MTTR #BlamelessCulture #Reliability
    11 mins
  • How SRE Teams Use Error Budgets to Balance Reliability and Velocity
    Jul 4 2026
    In this episode, Lucas and Luna dive into error budgets, a cornerstone of Site Reliability Engineering that defines how much unreliability a team can tolerate while still meeting its Service Level Objectives (SLOs). They explore how error budgets help SRE teams make data-driven trade-offs between shipping new features and maintaining system stability. Using examples from Google's original SRE model and real-world applications at companies like Netflix and Etsy, they unpack how tracking error budget burn rates can trigger automated rollbacks or throttle deployments. Lucas breaks down the math behind error budgets, explaining how they derive from SLOs and how teams calculate budget consumption over time. The conversation also covers common pitfalls, such as setting error budgets too tight or ignoring the budget entirely during crunch time. By the end, listeners will understand why error budgets are not just a monitoring tool but a cultural mechanism that aligns engineering incentives with business priorities. Tune in to learn how to use error budgets to ship faster with confidence on The Site Reliability Podcast with Fexingo.
    #ErrorBudget #SRE #SiteReliabilityEngineering #ServiceLevelObjective #SLO #Reliability #Velocity #IncidentResponse #GoogleSRE #Netflix #Etsy #DeploymentAutomation #ToilBudget #EngineeringCulture #TechPodcast #Technology #FexingoBusiness #BusinessPodcast
    12 mins
  • How SRE Teams Use Incident Metrics to Improve Response
    Jul 3 2026
    In this episode of The Site Reliability Podcast, Lucas and Luna dive into incident metrics: not just DORA or SLOs, but the specific numbers that help SRE teams get faster and better at incident response. They discuss mean time to acknowledge (MTTA), mean time to resolve (MTTR), and the controversial mean time between failures (MTBF), using real examples from a major cloud provider's 2023 outage. The hosts explore how tracking these metrics can reveal bottlenecks in incident response, improve runbooks, and even change team culture. They also touch on the delicate balance between using metrics for improvement and using them for blame, and share tips for SRE teams just starting to track metrics. Whether you're a seasoned SRE or just curious about reliability engineering, this episode offers concrete insights into measuring what matters in incident response.
    #SiteReliabilityEngineering #IncidentMetrics #MTTA #MTTR #MTBF #SRE #DevOps #IncidentResponse #ReliabilityEngineering #CloudComputing #Uptime #OnCall #MetricsDriven #BlamelessCulture #Runbooks #Technology #FexingoBusiness #BusinessPodcast
    10 mins
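
For readers who want the arithmetic behind the error budgets episode, the usual relationship between an SLO and its error budget can be sketched as follows. This is a minimal illustration and is not taken from the episode itself: the function names, the request-based SLO, and the sample numbers are all assumptions chosen for the example.

```python
# Sketch of common error budget arithmetic for a request-based SLO.
# All names and sample values here are hypothetical, not from the podcast.

def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO permits over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_consumed(failed: int, slo_target: float, total_requests: int) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    return failed / error_budget(slo_target, total_requests)


def burn_rate(failed: int, total_requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    values above 1.0 spend it faster.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (failed / total_requests) / (1.0 - slo_target)


if __name__ == "__main__":
    slo = 0.999  # assumed 99.9% availability target
    print(f"budget: {error_budget(slo, 1_000_000):.1f} failed requests")
    print(f"consumed: {budget_consumed(250, slo, 1_000_000):.1%}")
    print(f"burn rate: {burn_rate(50, 10_000, slo):.1f}")
```

Under these assumptions, a team might alert or pause deployments when the burn rate stays well above 1.0 for a sustained period; the exact thresholds are a policy choice rather than part of the formula.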