Scaling machine learning with containers on AWS: Lessons learned (DEV317)

Key Takeaways

1. Importance of Good Tooling

  • Good tooling is crucial for running quick iterations in the research stage.

2. Flexible Infrastructure

  • Having a flexible infrastructure provides several advantages:
    • Scalability
    • Speed
    • Cost optimization

3. Metrics and Alerts

  • Implementing the right metrics and alerts is important for:
    • Scaling machine learning pipelines further
    • Troubleshooting machine learning pipelines

Detailed Summary

Introduction

  • The session discusses scaling machine learning with containers, covering challenges such as speed of research, production management, and scalability.
  • The presenters are Victor, a Developer Advocate, and Ram, an ML specialist at Instrumental.

About Instrumental

  • Instrumental is a manufacturing optimization platform that uses AI to detect defects and optimize manufacturing processes at scale.
  • The key requirements for ML at Instrumental include:
    • Ability to train and run predictions at scale (thousands of training jobs, millions of predictions)
    • Flexible models that can be adjusted for each customer and project
    • Robust models that can handle edge cases on the production line

Standard ML Project Lifecycle

  1. Research: Gather and label training data, establish baseline solutions, and try various ideas for improvements.
  2. Productization: Productize the training pipeline, add data preparation and model validation steps, and follow software engineering practices for release.
  3. Production: Monitor and troubleshoot the model performance in production, as issues can arise due to various reasons (e.g., label quality, training code, discrepancies between training and prediction).
  4. Scaling: Scale the ML workflows to handle increasing demand, while optimizing for cost and observability.

Challenges in Scaling ML Workflows

  1. Speed of Research: The ability to quickly iterate on ideas and verify them is crucial for making progress.
  2. Production Management: Troubleshooting and managing a large number of running jobs becomes challenging as the scale increases.
  3. Scalability: Scaling both the training and prediction workloads, while optimizing for cost, is a key requirement.

Benefits of Containers

  • Portability and consistency: the same container image runs identically across research, development, and production.
  • Scalability: containers are easily scaled with orchestration platforms like Kubernetes.
  • Isolation: containers encapsulate dependencies and libraries, avoiding conflicts between projects.
  • Reproducibility: a pinned image version makes an experiment's environment exactly repeatable.

Container Services in AWS

  1. Managed Containers: Services like Amazon ECS and Amazon EKS, where the user is responsible for managing the underlying infrastructure.
  2. Serverless Containers: Services like AWS Fargate and SageMaker jobs, where AWS manages the underlying infrastructure.
  3. Serverless Functions: Services like AWS Lambda, where the user is responsible only for the container image or ZIP package.

Comparing SageMaker Training Jobs and Classical Workflows

| Aspect | SageMaker Training Jobs | Classical Workflows |
| --- | --- | --- |
| Operations | AWS manages the infrastructure | User manages the infrastructure |
| Scalability | Limited by AWS account limits | Limited by user's infrastructure |
| Observability | Out-of-the-box metrics and logs | Requires sidecar containers for metrics and logs |
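To make the "Operations" row concrete: with SageMaker, launching a training run is a single API call rather than provisioning infrastructure. A minimal sketch of building the `create_training_job` request with the low-level API (the image URI, role ARN, and bucket paths are placeholders, not values from the talk):

```python
# Sketch of launching a SageMaker training job via the low-level API.
# All names (image URI, role ARN, bucket) are placeholders.

def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.m5.xlarge", instance_count=1):
    """Build the request body for sagemaker.create_training_job()."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # custom container image from ECR
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

request = build_training_job_request(
    job_name="exp-042",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/research:latest",
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",
    output_s3_uri="s3://my-ml-bucket/models/",
)
# With real credentials this would be submitted as:
#   boto3.client("sagemaker").create_training_job(**request)
print(request["ResourceConfig"]["InstanceType"])  # → ml.m5.xlarge
```

Once submitted, the job's metrics and logs appear in CloudWatch automatically, which is the "out-of-the-box observability" advantage in the table.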

Serverless Functions and Cold Starts

  • Serverless functions like AWS Lambda can suffer from cold starts, which can impact the performance of ML workloads.
  • Techniques like Lambda SnapStart can reduce cold-start latency by restoring the function from a snapshot of a pre-initialized runtime environment.
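Independent of snapshotting, a common mitigation for ML workloads is to load the model once at module scope, so every warm invocation reuses it instead of paying the load cost again. A sketch of the pattern (the `load_model` stub and its counter stand in for real model deserialization):

```python
# Lambda cold-start pattern: expensive initialization happens once per
# execution environment, at module import time, not on every invocation.

LOAD_COUNT = 0  # instrumentation for this sketch only

def load_model():
    """Stand-in for loading model weights from disk or S3 (slow)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"weights": [0.1, 0.2, 0.3]}

# Module scope: runs during the cold start, then stays cached for all
# subsequent warm invocations of the same environment.
MODEL = load_model()

def handler(event, context=None):
    """Per-invocation work only touches the already-loaded model."""
    x = event["x"]
    return sum(w * x for w in MODEL["weights"])

# Two invocations against the same environment: the model loads once.
handler({"x": 1.0})
handler({"x": 2.0})
print(LOAD_COUNT)  # → 1
```

The first request still pays the full load time; the pattern only amortizes it across the environment's lifetime, which is why snapshot-based techniques remain useful for latency-sensitive predictions.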

Migrating to a Container-based Architecture

  1. Research Environment: Container images are built and published to a container registry. Researchers can run experiments by invoking SageMaker training jobs with the latest container image.
  2. Production Environment: Similar to the research environment, but with a separate container registry and invoked by a Lambda function that reads from an SQS queue.
  3. Prediction Environment: Predictions are served using Lambda functions that reference the latest container image.
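The production trigger in step 2 can be sketched as a handler that drains SQS records and submits one training job per message. The `submit_job` callable is injected so the sketch runs without AWS credentials, and the message fields (`project_id`, `run_id`, `image_uri`) are assumptions for illustration, not Instrumental's actual schema:

```python
import json

def handler(event, submit_job):
    """For each SQS record, parse the job spec and submit a training job.

    In production this would be a Lambda handler taking (event, context),
    and submit_job would wrap boto3's create_training_job; injecting it
    keeps the sketch testable offline.
    """
    submitted = []
    for record in event["Records"]:  # SQS batch delivery shape
        spec = json.loads(record["body"])
        job_name = f"train-{spec['project_id']}-{spec['run_id']}"
        submit_job(job_name=job_name, image_uri=spec["image_uri"])
        submitted.append(job_name)
    return {"submitted": submitted}

# Offline dry run with a fake SQS event and a recording stub.
calls = []
event = {"Records": [
    {"body": json.dumps({"project_id": "p1", "run_id": "7",
                         "image_uri": "ecr/research:latest"})},
]}
result = handler(event, submit_job=lambda **kw: calls.append(kw))
print(result["submitted"])  # → ['train-p1-7']
```

Decoupling via SQS also gives the pipeline natural retry semantics: a message that fails to submit stays on the queue and is redelivered.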

Benefits of the Container-based Architecture

  1. Speed of Research: Faster iteration cycles, easier onboarding for new hires, and reduced gap between research and production environments.
  2. Productization: Smaller gap between research and production, easier troubleshooting, and better ownership of the ML pipeline by the ML team.
  3. Scalability: Leveraging AWS services like SageMaker and Lambda makes it easy to scale both training and prediction workloads.

Lessons Learned

  1. Research Tooling: Importance of having a well-designed research framework with features like local execution, fast container image publishing, and parallel experiment handling.
  2. Productization: Ensuring model backward compatibility, leveraging inference frameworks like ONNX, and integrating AWS infrastructure into the CI/CD process.
  3. Scaling: Increasing service limits proactively, applying cost optimization techniques like instance type selection and spot instances, and developing a robust observability system.
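One way to act on the backward-compatibility lesson is a release gate that compares a candidate model's input signature against the one currently in production. This is a hypothetical sketch, not the mechanism from the talk; the metadata format is an assumption:

```python
def is_backward_compatible(current, candidate):
    """Reject candidate models whose input contract would break callers.

    A candidate is compatible if it accepts every input the current model
    accepts, with matching shapes; extra inputs are allowed only if the
    candidate declares them optional.
    """
    # Every existing input must still be accepted with the same shape.
    for name, shape in current["inputs"].items():
        if candidate["inputs"].get(name) != shape:
            return False
    # New inputs must be optional, or old callers would start failing.
    for name in candidate["inputs"]:
        if name not in current["inputs"] and name not in candidate.get("optional", ()):
            return False
    return True

current = {"inputs": {"image": [1, 3, 224, 224]}}
ok = {"inputs": {"image": [1, 3, 224, 224], "mask": [1, 1, 224, 224]},
      "optional": ["mask"]}
breaking = {"inputs": {"image": [1, 3, 512, 512]}}
print(is_backward_compatible(current, ok))        # → True
print(is_backward_compatible(current, breaking))  # → False
```

Running a gate like this in CI/CD, alongside model validation, catches contract breaks before a new container image reaches the prediction environment.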

Cost Optimization Techniques

  1. Instance Type Selection: Right-sizing instances based on the training job requirements.
  2. Spot Instances: Leveraging spot instances can provide significant cost savings, but requires handling interruptions.
  3. Service Substitution: Exploring alternative services, such as AWS Batch for training jobs and batch inference for predictions, to optimize costs.
  4. AWS Savings Plans: Utilizing AWS Savings Plans for predictable workloads to reduce costs.
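Technique 2 maps to a few fields on a SageMaker training job: enabling managed spot, giving the job a wait budget for capacity, and pointing it at a checkpoint location so interruptions resume instead of restarting. A minimal sketch of those request fields (the field names match the `create_training_job` API; the values are placeholders):

```python
def spot_training_fields(max_runtime_s, max_wait_s, checkpoint_s3_uri):
    """Fields to merge into a create_training_job request to use spot.

    MaxWaitTimeInSeconds must cover MaxRuntimeInSeconds: it bounds
    actual runtime plus time spent waiting for spot capacity.
    """
    if max_wait_s < max_runtime_s:
        raise ValueError("max_wait_s must be >= max_runtime_s")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
        # Checkpoints let an interrupted job resume rather than restart.
        "CheckpointConfig": {"S3Uri": checkpoint_s3_uri},
    }

fields = spot_training_fields(3600, 7200, "s3://my-ml-bucket/checkpoints/")
print(fields["EnableManagedSpotTraining"])  # → True
```

The trade-off is the one named above: savings come at the cost of handling interruptions, so training code must write and restore checkpoints.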

Demonstration

The demonstration showcases the following:

  1. Setting up the infrastructure using a CI/CD pipeline and container registry.
  2. Running a single training job in SageMaker and inspecting the job details, metrics, and logs.
  3. Running multiple training jobs in parallel, including handling failed jobs and rerunning them.
  4. Leveraging spot instances to achieve significant cost savings.
  5. Conducting experiments in the research environment and observing the results.
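The parallel-jobs-with-rerun part of the demo (item 3) can be sketched as a small driver: launch jobs concurrently, collect failures, and rerun only those. The `run_job` stub here simulates a transient failure; real code would submit SageMaker jobs and poll their status instead:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(job_names, run_job, max_workers=4):
    """Run jobs in parallel; return (succeeded, failed) name lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs names with results.
        for name, ok in zip(job_names, pool.map(run_job, job_names)):
            (succeeded if ok else failed).append(name)
    return succeeded, failed

# Stub runner: "exp-2" fails on its first attempt, then recovers.
attempts = {}
def run_job(name):
    attempts[name] = attempts.get(name, 0) + 1
    return not (name == "exp-2" and attempts[name] == 1)

jobs = ["exp-1", "exp-2", "exp-3"]
done, failed = run_batch(jobs, run_job)
# Rerun only the failures, mirroring the demo's workflow.
redone, still_failed = run_batch(failed, run_job)
print(done, failed, redone, still_failed)
# → ['exp-1', 'exp-3'] ['exp-2'] ['exp-2'] []
```

Tracking failures by job name, as here, is what makes selective reruns cheap: only the failed experiments consume additional compute.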

Additional Resources

  • SageMaker Workshop: Learning about SageMaker from idea to production
  • SageMaker Helpers: Open-source project providing utilities for working with SageMaker
  • Step Functions to Processing: Open-source project using different AWS compute services for machine learning
  • Lambda PX: Open-source project with optimized machine learning libraries for AWS Lambda
