Preparing for the new frontier: Accelerating AI with great data (AIM264)

Capital One's Tech Journey and Data Ecosystem

Capital One's Technology Transformation

  • Over the last decade, Capital One has rebuilt its technology stack from the ground up, adopting open source and building its own technology.
  • The company went all-in on the cloud, taking a serverless-first approach and becoming a highly cloud-fluent AWS customer.
  • This transformation has enabled Capital One to scale AI and ML to serve its 100 million customers.

The Importance of Data for AI

  • A nimble, flexible, and elastic tech stack, a well-managed and real-time data ecosystem, and talented personnel are key to enabling effective AI and ML.
  • There is a flywheel effect between data and AI: better data leads to better AI, and better AI leads to better data insights.

The Challenges of Data Complexity

  • The 3 V's of data complexity: Volume (147 zettabytes by 2025), Variety (80-90% of data is unstructured), and Velocity (real-time data access required in milliseconds).
  • Data quality and access issues are major impediments to effective AI: 64% of data professionals cite data quality as a top challenge, and 62% say real-time data access requires the most attention.

Principles for Producing and Consuming Good Data

  1. Self-service: Empowering the data community with tools, access, and discoverability.
  2. Automation: Baking data lineage, quality checks, SLAs, and governance into data processes.
  3. Scalable data: Avoiding point solutions and building for massive scale.

The Data Producer Experience

Onboarding Data

  1. Register metadata, privacy/security settings, and SLAs.
  2. Design and approve schema for structured data.
  3. Provision data into the right stores and formats.
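The three onboarding steps can be sketched in code. This is a minimal illustration, not Capital One's actual API: the class names, privacy tiers, and provisioning targets are all assumptions invented for the example.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three onboarding steps; all names and
# target stores are illustrative, not the platform's real API.

@dataclass
class DatasetRegistration:
    name: str
    owner: str
    privacy_tier: str           # e.g. "internal" or "restricted"
    sla_freshness_minutes: int  # maximum acceptable data staleness
    schema: dict = field(default_factory=dict)
    schema_approved: bool = False

def approve_schema(reg: DatasetRegistration, schema: dict) -> DatasetRegistration:
    """Step 2: attach and approve a schema for structured data."""
    reg.schema = schema
    reg.schema_approved = True
    return reg

def provision_targets(reg: DatasetRegistration) -> list[str]:
    """Step 3: derive stores and formats from the registered settings."""
    targets = ["s3://lake/raw"]         # everything lands in the lake
    if reg.sla_freshness_minutes <= 5:
        targets.append("stream:kafka")  # tight SLAs also get a stream
    if reg.privacy_tier == "restricted":
        targets = [t + "/encrypted" for t in targets]
    return targets

reg = DatasetRegistration("card_txns", "team-payments", "restricted", 5)
reg = approve_schema(reg, {"txn_id": "string", "amount": "decimal"})
print(provision_targets(reg))  # ['s3://lake/raw/encrypted', 'stream:kafka/encrypted']
```

The point of the sketch is that registration carries enough metadata (privacy tier, SLA) for provisioning to be derived automatically rather than decided by hand.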

The Self-Service Portal and Control Plane

  • The self-service portal abstracts complexity, automating provisioning, data quality, transformations, and observability.
  • The control plane is a collection of services that configures the data pipeline and enforces governance.

Automating Data Onboarding at Scale

  • Central platform approach: Publishing data through an API that enforces governance.
  • Federated model: Instrumenting Spark pipelines with a purpose-built SDK to enforce governance.
  • The key is maintaining consistency in data governance and management across approaches.
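One way to keep governance consistent across both approaches is to route the central API and the federated SDK through a single shared check. The sketch below assumes this pattern; every name in it is hypothetical, not the platform's real interface.

```python
# Illustrative sketch: both publishing paths funnel through one governance
# check, so the central and federated models stay consistent.

REQUIRED_TAGS = {"owner", "privacy_tier", "lineage_id"}

def enforce_governance(metadata: dict) -> None:
    """Shared check used by both paths."""
    missing = REQUIRED_TAGS - metadata.keys()
    if missing:
        raise ValueError(f"governance violation, missing tags: {sorted(missing)}")

def publish_via_api(records: list, metadata: dict) -> str:
    """Central platform path: the publish API enforces governance itself."""
    enforce_governance(metadata)
    return f"published {len(records)} records for {metadata['owner']}"

class GovernedPipeline:
    """Federated path: an SDK wrapper a team embeds in its own (e.g. Spark) job."""
    def __init__(self, metadata: dict):
        enforce_governance(metadata)  # same check, enforced at construction
        self.metadata = metadata

    def write(self, records: list) -> str:
        return f"wrote {len(records)} records for {self.metadata['owner']}"

meta = {"owner": "team-risk", "privacy_tier": "internal", "lineage_id": "abc123"}
print(publish_via_api([1, 2, 3], meta))   # published 3 records for team-risk
print(GovernedPipeline(meta).write([4, 5]))  # wrote 2 records for team-risk
```

Because both entry points call the same `enforce_governance`, a team cannot bypass governance by choosing the federated model over the central one.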

The Data Consumer Experience

Capital One's Lake Strategy

  1. Bring compute to the lake to minimize storage sprawl.
  2. Adopt open table formats like Delta and Iceberg to enable SQL-like operations.
  3. Implement a zone strategy for fit-for-purpose data access and management.
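A zone strategy can be modeled as a mapping from zones to fit-for-purpose storage locations and access rules. The zone names, paths, and roles below are assumptions for illustration, not the actual lake layout.

```python
# Hedged sketch of a zone strategy: each zone gets its own location and
# reader set. Zone names and roles are illustrative assumptions.

ZONES = {
    "raw":     {"path": "s3://lake/raw",     "readers": {"platform"}},
    "curated": {"path": "s3://lake/curated", "readers": {"platform", "analysts"}},
    "sandbox": {"path": "s3://lake/sandbox", "readers": {"platform", "analysts", "scientists"}},
}

def resolve_read(zone: str, role: str) -> str:
    """Return the storage path for a zone, if the role may read it."""
    cfg = ZONES[zone]
    if role not in cfg["readers"]:
        raise PermissionError(f"role {role!r} may not read zone {zone!r}")
    return cfg["path"]

print(resolve_read("curated", "analysts"))  # s3://lake/curated
```

Keeping the zone table declarative makes "fit-for-purpose access" an enforced property of the platform rather than a convention teams must remember.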

Lake Platform Capabilities

  1. Provisioning service to manage data set locations and metadata.
  2. Access management service for temporary, scoped data access.
  3. Lifecycle policies and intelligent tiering for data management.
  4. Cross-region replication for high availability.
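The access management capability (item 2) hinges on grants being temporary and scoped. A minimal sketch of that idea, with invented function names and fields:

```python
import time

# Sketch of a temporary, scoped access grant. The grant shape and TTL
# handling are assumptions for illustration only.

def grant_access(dataset: str, principal: str, ttl_seconds: int, now=None) -> dict:
    """Issue a grant on one dataset that expires after ttl_seconds."""
    issued = now if now is not None else time.time()
    return {"dataset": dataset, "principal": principal,
            "expires_at": issued + ttl_seconds}

def is_valid(grant: dict, now=None) -> bool:
    """A grant is honored only before its expiry."""
    current = now if now is not None else time.time()
    return current < grant["expires_at"]

g = grant_access("card_txns", "alice", ttl_seconds=900, now=1000.0)
print(is_valid(g, now=1500.0))  # True, inside the 15-minute window
print(is_valid(g, now=2000.0))  # False, grant expired
```

Expiry-by-default means access reviews shrink to auditing active grants instead of chasing down standing permissions.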

Data Scientist and ML Engineer Experiences

  • Data scientists can self-provision spaces for model development and collaboration.
  • ML engineers can self-provision low-latency data stores (e.g., DynamoDB) for production model deployments.
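Self-provisioning the right store can be driven by the workload's latency budget. The thresholds and store names below are assumptions (DynamoDB follows the talk's example; the cutoffs are invented):

```python
# Illustrative sketch: choose a serving store from a latency budget.
# Thresholds are assumptions, not platform policy.

def pick_store(p99_latency_ms: float) -> str:
    if p99_latency_ms <= 10:
        return "dynamodb"   # low-latency key-value serving for production models
    if p99_latency_ms <= 1000:
        return "warehouse"  # interactive SQL for analysis
    return "lake"           # batch access straight from the lake

print(pick_store(5))    # dynamodb
print(pick_store(200))  # warehouse
```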

Key Takeaways

  1. Streamline experiences for data producers and consumers.
  2. Build automation and scalable mechanisms for enforcement.
  3. Enable rapid experimentation for data-driven innovation.
  4. Ensure unwavering trustworthiness of the data ecosystem.
