Innovations in AWS analytics: Data processing (ANT346)

Unified Platform for Data and AI

Overview

  • AWS announced the new SageMaker, a unified platform for all data, AI, and ML services on AWS.
  • SageMaker is no longer only a machine learning service: it now brings together analytics and AI services such as EMR, Glue, Athena, Redshift, and Bedrock.
  • The new SageMaker platform has three key parts:
    1. Unified Studio: A new experience for data workers to use all data processing, AI, and ML features in one application.
    2. SageMaker Lakehouse: Combines the data warehousing capabilities of Redshift with data lake storage in Apache Iceberg tables on Amazon S3.
    3. Unified Governance: Integrates the data governance capabilities of SageMaker Catalog (built on Amazon DataZone) into the SageMaker platform.
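
As a rough illustration of what "Iceberg on Amazon S3" looks like from a Spark job, the sketch below builds the standard Apache Iceberg settings for a Glue-backed catalog. It is a generic open-source Iceberg configuration, not something shown in the session, and it does not claim to be how SageMaker Lakehouse is wired internally. The catalog name "lakehouse", the bucket "example-bucket", and both helper functions are hypothetical.

```python
from typing import Dict, List


def iceberg_glue_catalog_conf(catalog_name: str, warehouse_uri: str) -> Dict[str, str]:
    """Build Spark settings for an Iceberg catalog backed by the Glue Data Catalog.

    The keys and class names are the ones documented by Apache Iceberg;
    catalog_name and warehouse_uri are placeholders chosen by the caller.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions (MERGE INTO, CALL procedures, ...).
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the named catalog as an Iceberg catalog.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Stores table metadata pointers in the Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes data files through Iceberg's S3 file IO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Root location for table data and metadata files.
        f"{prefix}.warehouse": warehouse_uri,
    }


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a settings dict into repeated '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical catalog name and bucket, for illustration only.
conf = iceberg_glue_catalog_conf("lakehouse", "s3://example-bucket/warehouse")
submit_args = to_spark_submit_args(conf)
```

With settings like these, a Spark session could address tables as lakehouse.<database>.<table>; the Iceberg runtime and AWS bundle jars still have to be on the classpath, which this sketch does not cover.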

SageMaker Unified Studio

  • Provides a single experience for all data processing products like EMR, Glue, Athena, and Redshift.
  • Offers a unified notebook interface that supports languages such as SQL, Python, and Scala to access all the underlying services.
  • Provides a visual ETL tool similar to Glue's, with AI assistance from Amazon Q Developer (formerly CodeWhisperer).
  • Integrates the catalog's data governance capabilities for browsing and documenting data and for managing data quality and lineage.

Improvements in Data Processing Services

  • Performance and cost optimizations:
    • 3.9x faster managed Spark on EMR compared to open-source Spark.
    • 2.7x faster Athena Trino engine compared to open-source Trino.
    • 20% better performance with Graviton3 instances.
  • Operational excellence:
    • Improved auto-scaling in EMR and Amazon Managed Workflows for Apache Airflow (MWAA) for better cost optimization.
    • Usage profiles in Glue to enforce resource limits for different user personas.
  • Security:
    • Spark-native fine-grained access control to enforce policies at scale.
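
To put the performance figures above in more concrete terms, the sketch below converts a speedup factor into an expected runtime and a percentage reduction. It assumes "Nx faster" means the same work finishes in 1/N of the time and that "20% better performance" means a factor of 1.2; the session summary does not define either term, and the 60-minute baseline is a made-up example.

```python
def runtime_after_speedup(baseline_minutes: float, speedup: float) -> float:
    """Runtime of the same job after an N-times speedup (assumed: time / N)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_minutes / speedup


def runtime_reduction_pct(speedup: float) -> float:
    """Percentage of wall-clock time saved by an N-times speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


# Speedup factors quoted in the list above; the labels are shorthand.
claims = {"EMR Spark": 3.9, "Athena Trino": 2.7, "Graviton3": 1.2}
baseline = 60.0  # hypothetical 60-minute job

for name, factor in claims.items():
    minutes = runtime_after_speedup(baseline, factor)
    saved = runtime_reduction_pct(factor)
    print(f"{name}: {minutes:.1f} min ({saved:.1f}% less time)")
```

Under these assumptions a 3.9x speedup removes roughly 74% of the runtime, 2.7x roughly 63%, and a 20% improvement roughly 17%. Cost savings follow the same ratio only if the price per instance-hour is unchanged.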

AI-powered Capabilities

  • Code generation and refactoring using Amazon Q Developer (formerly CodeWhisperer):
    • Automatically generate Spark or SQL code from natural language prompts.
    • Automatically upgrade Spark code to a newer Spark version and handle breaking changes.
  • Root cause analysis for job failures:
    • Use generative AI to identify and explain the root cause of job failures.

Bridgewater's Use Case

  • Bridgewater is a systematic global macro asset manager that runs over 1,000 models, producing 300 million tables of data per year.
  • They use Trino as the core data processing engine, integrated with EMR, Glue, and Redshift to build a scalable and efficient data platform.
  • Key benefits:
    • Scalable and efficient infrastructure with Trino and EMR.
    • Ease of upgrades and maintenance with EMR.
    • Glue as a flexible data catalog to manage their large-scale data.
  • AWS has been an invaluable partner in helping Bridgewater architect and optimize their data platform.
