AWS re:Invent 2025 - Binge-worthy: Netflix's journey to Amazon Aurora at scale (DAT322)
Migrating Dozens of Databases to Amazon Aurora Postgres at Netflix
Overview
Netflix's data platform team faced the challenge of migrating over 100 databases running on a third-party Postgres-compatible distributed database to Amazon Aurora Postgres.
The migration was driven by the need to reduce the operational overhead and cost of the self-managed distributed database, and to take advantage of the improved features, reliability, and cost-effectiveness of Aurora Postgres.
Pre-Migration Preparation
The data platform team conducted extensive pre-migration checks across the entire database fleet before involving application teams:
Used AWS Database Migration Service (DMS) Schema Conversion Tool to identify schema compatibility issues between the source and target databases.
Sampled application SQL queries against the target Aurora Postgres to validate query compatibility.
Provisioned target Aurora Postgres clusters based on observed traffic patterns in the source databases.
Copied over metadata, permissions, and other configuration from the source to the target databases.
Created temporary Aurora Postgres clusters to allow application teams to validate performance and functionality before cutover.
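The query-sampling step above can be sketched as a small compatibility harness. This is a minimal illustration, not Netflix's tooling: it assumes sampled application SQL is replayed via `EXPLAIN` against the target so the planner flags unsupported syntax without executing the queries, and the database call is injected as a callable (`explain`) so the logic stays backend-agnostic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CompatibilityReport:
    """Queries the target accepted, and those it rejected with errors."""
    passed: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)


def check_query_compatibility(
    queries: List[str],
    explain: Callable[[str], None],
) -> CompatibilityReport:
    """Run EXPLAIN for each sampled query against the target database.

    `explain` should raise on parse/planner errors, e.g. a wrapper around
    cursor.execute("EXPLAIN " + query) on an Aurora Postgres connection.
    """
    report = CompatibilityReport()
    for query in queries:
        try:
            explain(query)
            report.passed.append(query)
        except Exception as exc:  # collect, don't abort the fleet-wide scan
            report.failed[query] = str(exc)
    return report
```

Running the harness fleet-wide before involving application teams is what surfaces incompatible queries early, while the fix is still cheap.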
Schema and Data Migration
The team leveraged AWS DMS to handle the schema conversion and data migration process:
DMS was used to copy database objects (tables, views, sequences) from the source to the target, handling any necessary conversions.
The team built additional verification tooling to catch edge cases where the automated conversion process had issues, such as data type mismatches.
DMS was also used to perform a full data load from the source to the target, followed by ongoing change data capture (CDC) replication to keep the target in sync.
The team built monitoring tooling to proactively detect any issues with the replication process.
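A DMS full load plus CDC task is configured with a table-mapping document. The sketch below is an assumption about how such a document might be generated per database, not the team's actual code; it emits only selection rules (no transformations) and notes where the real `boto3` call would go.

```python
import json
from typing import List


def build_table_mappings(schema: str, tables: List[str]) -> str:
    """Build a DMS table-mapping JSON document that includes the
    given tables for replication (selection rules only)."""
    rules = []
    for i, table in enumerate(tables, start=1):
        rules.append({
            "rule-type": "selection",
            "rule-id": str(i),
            "rule-name": f"include-{table}",
            "object-locator": {"schema-name": schema, "table-name": table},
            "rule-action": "include",
        })
    return json.dumps({"rules": rules})


# The mapping document is then passed to DMS when creating the task, e.g.:
#
#   boto3.client("dms").create_replication_task(
#       ReplicationTaskIdentifier="...",          # elided
#       SourceEndpointArn="...",                  # elided
#       TargetEndpointArn="...",                  # elided
#       ReplicationInstanceArn="...",             # elided
#       MigrationType="full-load-and-cdc",        # full load, then CDC
#       TableMappings=build_table_mappings("public", ["orders", "users"]),
#   )
```

`MigrationType="full-load-and-cdc"` is what gives the pattern described above: an initial bulk copy followed by ongoing change-data-capture replication to keep the target in sync until cutover.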
Data Validation
To validate the accuracy of the migrated data, the team used a multi-pronged approach:
Batch validation: Dumped data from both source and target into a data warehouse and ran distributed SQL queries to identify any discrepancies.
Online validation: Built a Flink-based validator that continuously compared incoming data between the source and target in real time.
This allowed the team to have confidence in the data integrity before proceeding with the cutover.
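The batch-validation idea can be illustrated with per-row fingerprints: hash each row on both sides, keyed by primary key, then diff the two fingerprint maps. This is a hedged sketch of the general technique (the talk's tooling ran distributed SQL in a data warehouse, not this Python loop), with hypothetical column names.

```python
import hashlib
from typing import Dict, List


def row_fingerprints(rows: List[dict], key_col: str) -> Dict[object, str]:
    """Map each row's primary key to a hash of its full contents.

    Columns are sorted so the fingerprint is independent of column order,
    which can differ between source and target dumps.
    """
    fingerprints = {}
    for row in rows:
        payload = "|".join(f"{col}={row[col]}" for col in sorted(row))
        fingerprints[row[key_col]] = hashlib.md5(payload.encode()).hexdigest()
    return fingerprints


def diff_fingerprints(source: Dict, target: Dict) -> Dict[str, list]:
    """Classify discrepancies between source and target fingerprints."""
    return {
        "missing": [k for k in source if k not in target],      # not migrated
        "extra": [k for k in target if k not in source],        # unexpected
        "mismatched": [k for k in source
                       if k in target and source[k] != target[k]],
    }
```

An empty diff across every table is the signal that it is safe to move on to cutover; the same comparison, fed by a stream instead of a dump, is essentially what the Flink-based online validator does continuously.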
Cutover Process
To minimize downtime and disruption to the applications, the team leveraged a proxy layer in front of the databases:
Ensured applications were already talking to the proxy and could seamlessly switch between the source and target databases.
Measured the replication lag between source and target to determine the optimal cutover window.
Performed final validation checks on the data and replication status.
Switched the proxy to point to the target Aurora Postgres cluster, effectively cutting over the applications with minimal downtime.
Handled any stateful objects, like database sequences, to ensure a smooth transition.
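Sequences are the classic stateful object in a Postgres cutover: CDC replicates table rows but not sequence positions, so the target's sequences must be advanced past the source's last values before writes switch over. A minimal sketch, assuming the source's last sequence values have been captured into a dict and a fixed headroom absorbs writes that land during the cutover window:

```python
from typing import Dict, List


def sequence_sync_statements(
    source_values: Dict[str, int],
    headroom: int = 1000,
) -> List[str]:
    """Generate setval() statements that advance each target sequence
    past the source's last observed value.

    `headroom` over-advances each sequence so rows written on the source
    between the snapshot and the proxy switch cannot collide with new
    IDs issued by the target.
    """
    statements = []
    for seq_name, last_value in sorted(source_values.items()):
        statements.append(f"SELECT setval('{seq_name}', {last_value + headroom});")
    return statements
```

The generated statements would be executed on the target immediately before flipping the proxy, as the final step of the cutover checklist.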
Results and Learnings
After nearly a year of development, the team has completed over 90% of the database migrations to Aurora Postgres.
In many cases, they observed lower application latencies after the migration, since Aurora Postgres avoided the cross-node coordination overhead of the distributed source database.
The team identified and fixed several data corruption issues during the pre-migration validation process, preventing potential problems down the line.
The initial fleet-wide dry run was crucial, as it allowed the team to identify and resolve compatibility issues early on, saving significant time and effort.
The team focused on building a generic, reusable migration solution that could handle the diverse set of applications and technologies at Netflix, while also providing specialized tooling for specific use cases (e.g., Java Flyway support).
Key Takeaways
Thorough pre-migration preparation, including schema and data validation, is essential for a successful large-scale database migration.
Leveraging proxy layers and automating the cutover process can minimize downtime and disruption to applications.
Continuous data validation, both in batch and real-time, ensures data integrity throughout the migration process.
Building reusable, flexible migration tooling that can handle diverse application requirements is crucial for scaling the migration effort.
Proactive identification and resolution of edge cases and compatibility issues during the pre-migration phase can save significant time and effort.