AWS re:Invent 2025 - Building reliable operations, feat. Fannie Mae (COP340)
Building Reliable Operations: Lessons from AWS and Fannie Mae
Importance of Reliability
Reliability is a critical feature of applications, especially for mission-critical systems
42% of system failures are caused by operational issues, not architectural ones
Reliability should be a key focus, not just an afterthought
AWS Cloud Operations Philosophy
Two-pizza teams: Small, autonomous teams with ownership and accountability
Correction of Errors (COE): Blameless post-mortems to learn from incidents
Service Level Objectives (SLOs): Measurable service quality targets to drive continuous improvement
Automated incident remediation: Using Systems Manager to automatically respond to alarms
Proactive incident preparation: Tabletop exercises, fault injection testing
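To make the SLO idea above concrete, here is a minimal sketch of an error-budget calculation for a request-based availability objective. The `Slo` class, the `checkout-availability` name, and the traffic numbers are all hypothetical and do not come from the talk; the point is only how a measurable target turns into a budget that teams can track.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A request-based availability objective (illustrative, not an AWS API)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def error_budget_remaining(slo: Slo, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = (1.0 - slo.target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


# Hypothetical numbers for one evaluation window.
checkout = Slo(name="checkout-availability", target=0.999)
remaining = error_budget_remaining(checkout, total=1_000_000, failed=250)
print(f"{checkout.name}: {remaining:.0%} of error budget remaining")
```

A team could use a value like `remaining` to decide when to slow down releases and prioritize reliability work.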
Fannie Mae's Reliability Journey
Integrated Observability and Incident Management
Used AWS services such as CloudWatch and EventBridge, together with OpenTelemetry, to provide end-to-end visibility
Built a custom incident management tool called Sentinus for real-time alerting and correlation
Implemented monitoring as code for consistent alerting across applications
Conducted chaos engineering experiments to identify and address weaknesses
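One way to picture "monitoring as code" is a single baseline of alarm definitions that every application inherits, with per-application overrides kept explicit. The sketch below is an assumption about how that could look, not Fannie Mae's implementation: `AlarmSpec`, `BASELINE`, the metric names, and the thresholds are all hypothetical, and the output is plain JSON rather than calls to any real AWS API.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AlarmSpec:
    """A declarative alarm definition (hypothetical schema)."""
    name: str
    metric: str
    comparison: str
    threshold: float
    period_seconds: int
    evaluation_periods: int


# Hypothetical organization-wide baseline: (metric, comparison, default threshold).
BASELINE = [
    ("ErrorRate", "GreaterThanThreshold", 0.01),
    ("LatencyP99Ms", "GreaterThanThreshold", 1500.0),
]


def alarms_for(app: str, overrides: dict[str, float] | None = None) -> list[AlarmSpec]:
    """Expand the baseline into alarm specs for one application."""
    overrides = overrides or {}
    unknown = set(overrides) - {metric for metric, _, _ in BASELINE}
    if unknown:
        # Reject typos so an override can never silently do nothing.
        raise ValueError(f"unknown metrics in overrides: {sorted(unknown)}")
    return [
        AlarmSpec(
            name=f"{app}-{metric}",
            metric=metric,
            comparison=comparison,
            threshold=overrides.get(metric, default),
            period_seconds=60,
            evaluation_periods=3,
        )
        for metric, comparison, default in BASELINE
    ]


specs = alarms_for("loan-api", overrides={"LatencyP99Ms": 800.0})
print(json.dumps([asdict(spec) for spec in specs], indent=2))
```

Because the specs are generated from one baseline and live in version control, a change to alerting can be reviewed like any other code change.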
Autonomous Failover and Recovery
Developed automated failover patterns independent of CI/CD pipelines
Used AWS native services like CloudWatch, Step Functions, and Lambda to enable fast, autonomous failovers
Regularly exercised failover procedures to build muscle memory
Achieved 80% reduction in mean-time-to-recover during major outages
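The core decision in an autonomous failover can be sketched without any AWS calls: count consecutive failed health checks and shift traffic once a threshold is crossed. The code below is a simplified illustration under stated assumptions; `FailoverController`, the region names, and the threshold of three are hypothetical, and in a real setup this logic would presumably live in a Lambda function or a Step Functions workflow driven by CloudWatch alarms.

```python
from typing import Callable


class FailoverController:
    """Tracks health checks and triggers a one-way failover (illustrative)."""

    def __init__(self, primary: str, secondary: str, failure_threshold: int = 3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self.active = primary
        self.consecutive_failures = 0

    def observe(self, healthy: bool, shift_traffic: Callable[[str], None]) -> str:
        """Record one health check result and return the active target."""
        if healthy:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        should_fail_over = (
            self.active == self.primary
            and self.consecutive_failures >= self.failure_threshold
        )
        if should_fail_over:
            # shift_traffic stands in for whatever actually moves traffic,
            # such as a DNS or routing update; it is injected so it can be tested.
            shift_traffic(self.secondary)
            self.active = self.secondary
            self.consecutive_failures = 0
        return self.active


# Simulated run: one healthy check, then three failures in a row.
shifted: list = []
controller = FailoverController("us-east-1", "us-west-2")
for check in [True, False, False, False, True]:
    controller.observe(check, shifted.append)
print(controller.active, shifted)
```

Keeping the traffic shift behind an injected callable means the same logic can be exercised regularly in drills without touching production routing.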
Blameless Incident Reviews
Adapted AWS's Correction of Errors (COE) framework with Fannie Mae-specific enhancements
Focused on the "5 Whys" to understand root causes, not assign blame
Held weekly review meetings with cross-functional stakeholders
Shared learnings and implemented action items across the organization
Key Takeaways
Resiliency should be the top feature, not an afterthought
Build for resiliency by limiting dependencies and having fallback mechanisms
Start small with reliability improvements and build on successes
Leverage automation, observability, and blameless incident reviews to drive reliability
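The takeaway about limiting dependencies and having fallback mechanisms can be illustrated with a small wrapper that serves the last known good value when a dependency fails. This is a sketch under assumptions: `CachedFallback`, `fetch_settings`, and the five-minute staleness limit are hypothetical and were not described in the talk.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class CachedFallback(Generic[T]):
    """Calls a dependency and falls back to a recent cached value on failure."""

    def __init__(
        self,
        fetch: Callable[[], T],
        max_stale_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._cached_at: Optional[float] = None

    def get(self) -> T:
        try:
            value = self._fetch()
        except Exception:
            fresh_enough = (
                self._cached_at is not None
                and self._clock() - self._cached_at <= self._max_stale
            )
            if fresh_enough:
                return self._value  # type: ignore[return-value]
            raise  # no usable fallback, so surface the failure
        self._value = value
        self._cached_at = self._clock()
        return value


# Hypothetical dependency that succeeds once and then becomes unavailable.
attempts = {"count": 0}


def fetch_settings() -> dict:
    attempts["count"] += 1
    if attempts["count"] > 1:
        raise ConnectionError("settings service unavailable")
    return {"max_batch_size": 100}


settings = CachedFallback(fetch_settings, max_stale_seconds=300)
first = settings.get()   # live value
second = settings.get()  # dependency is down, cached value is served
print(first == second)
```

The staleness limit is the design choice that matters: it bounds how long the application runs on old data before the failure is surfaced.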
Technical Details
AWS services used: CloudWatch, EventBridge, Systems Manager, Config, CloudTrail, Lambda, Step Functions
Tools used at Fannie Mae: Sentinus (custom incident management), Dynatrace (application monitoring), Catchpoint (synthetic monitoring)
Reliability metrics improved:
71% reduction in critical incidents
75% reduction in change failure rate
80% reduction in mean-time-to-recover
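The figures above are relative reductions. As a worked illustration of how such metrics are typically computed, the sketch below derives mean-time-to-recover, change failure rate, and percent reduction from raw records. The talk summary gives only the percentages, so every timestamp and count here is hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def mean_time_to_recover(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


def change_failure_rate(total_changes: int, failed_changes: int) -> float:
    """Fraction of deployed changes that caused a failure."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return failed_changes / total_changes


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Hypothetical incidents: one took 30 minutes to recover, one took 90 minutes.
start = datetime(2025, 1, 1, 12, 0)
incidents = [
    (start, start + timedelta(minutes=30)),
    (start, start + timedelta(minutes=90)),
]
mttr = mean_time_to_recover(incidents)

# Hypothetical before/after values: 100 minutes down to 20 minutes.
reduction = percent_reduction(before=100, after=20)
print(mttr, f"{reduction:.0f}% reduction")
```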
Business Impact
Fannie Mae is a critical player in the US mortgage industry, financing 1 in 4 homes
The reliability and availability of its systems are paramount to avoiding financial and reputational damage
Improved reliability enabled Fannie Mae to prevent millions of dollars in potential losses during outages
Empowered teams to be self-reliant and quickly recover from failures
Examples
Automated failover patterns used by Fannie Mae to survive major AWS outages
Blameless incident review process that focused on root causes, not individual mistakes
Leveraging feature flags and automated remediation to control logging verbosity during suspicious activity
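The last example, raising logging verbosity through a feature flag when suspicious activity is detected, could look roughly like this. It uses only Python's standard `logging` module; the `FLAGS` dictionary, the `verbose_logging` flag name, and the `app` logger name are hypothetical stand-ins for a real feature flag service and application.

```python
import logging

# Stand-in for a real feature flag service (hypothetical).
FLAGS = {"verbose_logging": False}

logger = logging.getLogger("app")


def apply_logging_flag(flags: dict, target: logging.Logger = logger) -> int:
    """Set the logger level from the flag and return the level applied."""
    level = logging.DEBUG if flags.get("verbose_logging") else logging.WARNING
    target.setLevel(level)
    return level


def on_suspicious_activity(flags: dict) -> None:
    """Automated remediation hook: turn on detailed logging."""
    flags["verbose_logging"] = True
    apply_logging_flag(flags)


def on_all_clear(flags: dict) -> None:
    """Return to quiet logging once the investigation is over."""
    flags["verbose_logging"] = False
    apply_logging_flag(flags)


apply_logging_flag(FLAGS)
quiet = logger.isEnabledFor(logging.DEBUG)    # normal operation
on_suspicious_activity(FLAGS)
verbose = logger.isEnabledFor(logging.DEBUG)  # after the remediation hook fires
print(quiet, verbose)
```

Driving the level from a flag means verbosity can change at runtime, without a deployment, and can be switched back just as easily.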