AWS re:Invent 2025 - Building reliable operations, feat. Fannie Mae (COP340)

Building Reliable Operations: Lessons from AWS and Fannie Mae

Importance of Reliability

  • Reliability is a critical feature of applications, especially for mission-critical systems
  • 42% of system failures are caused by operational issues, not architecture alone
  • Reliability should be a key focus, not just an afterthought

AWS Cloud Operations Philosophy

  • Two-pizza teams: Small, autonomous teams with ownership and accountability
  • Correction of Errors (COE): Blameless post-mortems to learn from incidents
  • Service Level Objectives (SLOs): Measurable service quality targets to drive continuous improvement
  • Automated incident remediation: Using Systems Manager to automatically respond to alarms
  • Proactive incident preparation: Tabletop exercises, fault injection testing
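
The SLO bullet above can be made concrete with an error-budget calculation. The following is a minimal sketch, assuming a request-based availability SLO; the function name and the example numbers are hypothetical and were not shown in the talk.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative when overspent).

    slo_target is the availability objective as a fraction, e.g. 0.999.
    All names here are illustrative assumptions, not an AWS API.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so none of the budget has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With a 99.9% target, 1,000,000 requests, and 250 failures, about 75% of the budget remains. A team could use a value like this to decide whether to keep shipping changes or to pause and invest in reliability work.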

Fannie Mae's Reliability Journey

Integrated Observability and Incident Management

  • Used AWS services such as CloudWatch and EventBridge, together with OpenTelemetry, to provide end-to-end visibility
  • Built a custom incident management tool called Sentinus for real-time alerting and correlation
  • Implemented monitoring as code for consistent alerting across applications
  • Conducted chaos engineering experiments to identify and address weaknesses
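
"Monitoring as code" usually means alarm definitions are generated from a shared template rather than hand-configured per application. The sketch below illustrates that idea only; it is not Fannie Mae's implementation. The dictionary keys mirror the parameter names of CloudWatch's PutMetricAlarm operation, but the metric names, thresholds, namespace, and the `render_alarms` helper are all assumptions, and nothing is sent to AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmTemplate:
    metric_name: str
    threshold: float
    statistic: str = "Average"
    period_seconds: int = 60
    evaluation_periods: int = 3


# Hypothetical organization-wide baseline applied to every application.
BASELINE = (
    AlarmTemplate("ErrorRate", threshold=1.0),
    AlarmTemplate("LatencyP99Ms", threshold=500.0, statistic="Maximum"),
)


def render_alarms(app_name: str, namespace: str = "Example/Apps") -> list[dict]:
    """Expand the shared baseline into per-application alarm definitions."""
    return [
        {
            "AlarmName": f"{app_name}-{t.metric_name}",
            "Namespace": namespace,
            "MetricName": t.metric_name,
            "Dimensions": [{"Name": "Application", "Value": app_name}],
            "Statistic": t.statistic,
            "Period": t.period_seconds,
            "EvaluationPeriods": t.evaluation_periods,
            "Threshold": t.threshold,
            "ComparisonOperator": "GreaterThanThreshold",
        }
        for t in BASELINE
    ]
```

Because every application is rendered from the same `BASELINE`, changing a threshold in one place changes it everywhere, which is what makes alerting consistent across applications.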

Autonomous Failover and Recovery

  • Developed automated failover patterns independent of CI/CD pipelines
  • Used AWS native services like CloudWatch, Step Functions, and Lambda to enable fast, autonomous failovers
  • Regularly exercised failover procedures to build muscle memory
  • Achieved 80% reduction in mean-time-to-recover during major outages
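
The decision logic behind an autonomous failover can be quite small. The sketch below assumes a simple rule: switch regions after a configurable number of consecutive failed health checks. The class name, region names, and threshold are hypothetical; the talk describes CloudWatch, Step Functions, and Lambda as the building blocks, and this code shows only the decision step that such a workflow might run.

```python
class FailoverController:
    """Tracks health checks for the active region and flips regions on repeated failure."""

    def __init__(self, primary: str, secondary: str, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self.active = primary
        self.consecutive_failures = 0

    def record_health_check(self, healthy: bool) -> str:
        """Record one health check result and return the region that should serve traffic."""
        if healthy:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Switch to the other region and start counting from zero again.
            self.active = self.secondary if self.active == self.primary else self.primary
            self.consecutive_failures = 0
        return self.active
```

Requiring several consecutive failures avoids flapping on a single bad probe. Keeping this logic outside the deployment pipeline matches the bullet above: the failover path should still work when CI/CD itself is unavailable.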

Blameless Incident Reviews

  • Adapted AWS's Correction of Errors (COE) framework with Fannie Mae-specific enhancements
  • Focused on the "5 Whys" to understand root causes, not assign blame
  • Held weekly review meetings with cross-functional stakeholders
  • Shared learnings and implemented action items across the organization

Key Takeaways

  1. Resiliency should be the top feature, not an afterthought
  2. Build for resiliency by limiting dependencies and having fallback mechanisms
  3. Start small with reliability improvements and build on successes
  4. Leverage automation, observability, and blameless incident reviews to drive reliability
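
Takeaway 2 mentions fallback mechanisms. As one possible illustration, the helper below tries a primary dependency a few times and then falls back to an alternative such as a cached value. The function name and retry count are assumptions made for this sketch.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T], attempts: int = 2) -> T:
    """Try the primary dependency up to `attempts` times, then use the fallback."""
    for _ in range(attempts):
        try:
            return primary()
        except Exception:
            # A real system would log the error and likely back off before retrying.
            continue
    return fallback()
```

A caller might pass a live service lookup as `primary` and a read from a local cache as `fallback`, so that an outage in the dependency degrades the feature instead of failing it.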

Technical Details

  • AWS services used: CloudWatch, EventBridge, Systems Manager, Config, CloudTrail, Lambda, Step Functions
  • Tools used at Fannie Mae: Sentinus (custom incident management), Dynatrace (application monitoring), Catchpoint (synthetic monitoring)
  • Reliability metrics improved:
    • 71% reduction in critical incidents
    • 75% reduction in change failure rate
    • 80% reduction in mean-time-to-recover
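
For reference, percentage reductions like those above are computed as (before - after) / before. The sketch below shows the arithmetic for mean-time-to-recover; the incident durations in the usage note are invented purely to illustrate the calculation and are not Fannie Mae data.

```python
from statistics import fmean


def mttr_minutes(recovery_minutes: list[float]) -> float:
    """Mean time to recover: the average of per-incident recovery durations."""
    if not recovery_minutes:
        raise ValueError("at least one incident is required")
    return fmean(recovery_minutes)


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, expressed as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0
```

For example, made-up recovery times of 100, 80, and 120 minutes give an MTTR of 100 minutes; if later incidents took 30, 20, and 25 minutes, the MTTR is 25 minutes and `percent_reduction(100, 25)` is 75.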

Business Impact

  • Fannie Mae is a critical player in the US mortgage industry, financing 1 in 4 homes
  • The reliability and availability of its systems are paramount to avoiding financial and reputational damage
  • Improved reliability enabled Fannie Mae to prevent millions in potential losses during outages
  • Empowered teams to be self-reliant and quickly recover from failures

Examples

  • Automated failover patterns used by Fannie Mae to survive major AWS outages
  • Blameless incident review process that focused on root causes, not individual mistakes
  • Leveraging feature flags and automated remediation to control logging verbosity during suspicious activity
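
The last example, using a feature flag to control logging verbosity, can be sketched with Python's standard `logging` module. The flag name `verbose_logging`, the in-memory flag dictionary, and both helper functions are assumptions for illustration; in practice the flag would live in a flag service and the remediation would be triggered by an alarm.

```python
import logging


def apply_verbosity_flag(logger: logging.Logger, flags: dict[str, bool]) -> int:
    """Set the logger to DEBUG when the hypothetical 'verbose_logging' flag is on, else INFO."""
    level = logging.DEBUG if flags.get("verbose_logging", False) else logging.INFO
    logger.setLevel(level)
    return level


def remediate_suspicious_activity(logger: logging.Logger, flags: dict[str, bool]) -> int:
    """Automated remediation step: turn the flag on and re-apply it to the logger."""
    flags["verbose_logging"] = True
    return apply_verbosity_flag(logger, flags)
```

Driving verbosity from a flag means detailed logs can be switched on during an investigation and off again afterward without a deployment.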
