Scaling Machine Learning with Containers
Key Takeaways
1. Importance of Good Tooling
- Good tooling is crucial for running quick iterations in the research stage.
2. Flexible Infrastructure
- Having a flexible infrastructure provides several advantages:
- Scalability
- Speed
- Cost optimization
3. Metrics and Alerts
- Implementing the right metrics and alerts is important for:
- Scaling machine learning pipelines further
- Troubleshooting machine learning pipelines
Detailed Summary
Introduction
- The session discusses scaling machine learning with containers, covering challenges such as speed of research, production management, and scalability.
- The presenters are Victor, a Developer Advocate, and Ram, an ML specialist at Instrumental.
About Instrumental
- Instrumental is a manufacturing optimization platform that uses AI to detect defects and optimize manufacturing processes at scale.
- The key requirements for ML at Instrumental include:
- Ability to train and run predictions at scale (thousands of training jobs, millions of predictions)
- Flexible models that can be adjusted for each customer and project
- Robust models that can handle edge cases on the production line
Standard ML Project Lifecycle
- Research: Gather and label training data, establish baseline solutions, and try various ideas for improvements.
- Productization: Productize the training pipeline, add data preparation and model validation steps, and follow software engineering practices for release.
- Production: Monitor and troubleshoot model performance in production, since issues can arise from many sources (e.g., label quality, training-code bugs, discrepancies between the training and prediction environments).
- Scaling: Scale the ML workflows to handle increasing demand while optimizing cost and maintaining observability.
Challenges in Scaling ML Workflows
- Speed of Research: The ability to quickly iterate on ideas and verify them is crucial for making progress.
- Production Management: Troubleshooting and managing a large number of running jobs becomes challenging as the scale increases.
- Scalability: Scaling both the training and prediction workloads, while optimizing for cost, is a key requirement.
Benefits of Containers
- Portability and consistency: Containers ensure the same environment across research, development, and production.
- Scalability: Containers can be easily scaled using orchestration platforms like Kubernetes.
- Isolation: Containers isolate dependencies and libraries, preventing conflicts between projects.
- Reproducibility: A pinned container image can be rerun later to reproduce an experiment exactly.
Container Services in AWS
- Managed Containers: Services like Amazon ECS and Amazon EKS, where AWS runs the orchestration control plane and the user manages the underlying compute instances.
- Serverless Containers: Services like AWS Fargate and SageMaker jobs, where AWS manages the underlying infrastructure.
- Serverless Functions: Services like AWS Lambda, where the user is responsible only for the container image or ZIP package.
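To make the serverless-functions option concrete, below is a minimal sketch (not from the session) of deploying a prediction function from a container image with boto3; the function name, image URIs, and role ARN are placeholders.

```python
import boto3

lam = boto3.client("lambda")

# Create a Lambda function from a container image in ECR.
# Account ID, image URI, and role ARN are placeholders.
lam.create_function(
    FunctionName="predict",
    PackageType="Image",
    Code={"ImageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/predict:latest"},
    Role="arn:aws:iam::123456789012:role/LambdaExecutionRole",
    MemorySize=2048,
    Timeout=30,
)

# Shipping a new model version later is just an image update.
lam.update_function_code(
    FunctionName="predict",
    ImageUri="123456789012.dkr.ecr.us-east-1.amazonaws.com/predict:v2",
)
```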
Comparing SageMaker Training Jobs and Classical Workflows
| Aspect | SageMaker Training Jobs | Classical Workflows |
| --- | --- | --- |
| Operations | AWS manages the infrastructure | User manages the infrastructure |
| Scalability | Bounded by AWS service quotas (which can be raised) | Bounded by the user's own infrastructure capacity |
| Observability | Out-of-the-box metrics and logs | Requires sidecar containers for metrics and logs |
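To illustrate the "AWS manages the infrastructure" column of the table above, here is a minimal sketch of launching a training job with the SageMaker Python SDK; the image URI, role, instance type, and S3 paths are assumptions, not values from the session.

```python
import sagemaker
from sagemaker.estimator import Estimator

# Point the estimator at a custom training image in ECR.
# Image URI, role ARN, and bucket are placeholders.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/training:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=sagemaker.Session(),
)

# SageMaker provisions the instance, runs the container, uploads the
# model artifact to S3, and tears the instance down afterwards.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```

Job metrics and logs then land in CloudWatch without extra setup, which is the out-of-the-box observability the table refers to.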
Serverless Functions and Cold Starts
- Serverless functions like AWS Lambda can suffer from cold starts, which add latency; the penalty is amplified for ML workloads that load large models at startup.
- Techniques like SnapStart can reduce the cold-start impact by resuming functions from a pre-initialized snapshot of the runtime environment (a related initialization pattern is sketched below).
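Independent of SnapStart, a common complementary pattern is to do expensive initialization at module load time, so it runs once per execution environment (and, with SnapStart, is captured in the snapshot). A minimal sketch using onnxruntime; the model path and event shape are assumptions.

```python
import json

import numpy as np
import onnxruntime as ort

# Build the inference session once, at init time, so warm invocations
# (or SnapStart-restored environments) skip the model load.
# The model path is a placeholder for a file baked into the image.
SESSION = ort.InferenceSession("/opt/ml/model.onnx")
INPUT_NAME = SESSION.get_inputs()[0].name

def handler(event, context):
    # Hypothetical request shape: {"body": "{\"features\": [[...]]}"}
    features = np.asarray(json.loads(event["body"])["features"], dtype=np.float32)
    outputs = SESSION.run(None, {INPUT_NAME: features})
    return {"statusCode": 200, "body": json.dumps({"prediction": outputs[0].tolist()})}
```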
Migrating to a Container-based Architecture
- Research Environment: Container images are built and published to a container registry. Researchers can run experiments by invoking SageMaker training jobs with the latest container image.
- Production Environment: Mirrors the research environment but uses a separate container registry; training jobs are started by a Lambda function that reads requests from an SQS queue (see the sketch after this list).
- Prediction Environment: Predictions are served using Lambda functions that reference the latest container image.
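As a sketch of the production trigger described above (not the presenters' actual code), a Lambda handler can read training requests from SQS and start one SageMaker training job per message via boto3; the job naming, ARNs, buckets, and message schema are assumptions.

```python
import json
import time

import boto3

sm = boto3.client("sagemaker")

def handler(event, context):
    # Each SQS record carries one hypothetical training request with
    # a project ID, an ECR image URI, and an S3 prefix of training data.
    for record in event["Records"]:
        request = json.loads(record["body"])
        sm.create_training_job(
            TrainingJobName=f"train-{request['project_id']}-{int(time.time())}",
            AlgorithmSpecification={
                "TrainingImage": request["image_uri"],
                "TrainingInputMode": "File",
            },
            RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
            InputDataConfig=[{
                "ChannelName": "train",
                "DataSource": {"S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": request["train_data_s3_uri"],
                    "S3DataDistributionType": "FullyReplicated",
                }},
            }],
            OutputDataConfig={"S3OutputPath": "s3://my-bucket/models/"},
            ResourceConfig={
                "InstanceType": "ml.m5.xlarge",
                "InstanceCount": 1,
                "VolumeSizeInGB": 50,
            },
            StoppingCondition={"MaxRuntimeInSeconds": 3600},
        )
```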
Benefits of the Container-based Architecture
- Speed of Research: Faster iteration cycles, easier onboarding for new hires, and reduced gap between research and production environments.
- Productization: Smaller gap between research and production, easier troubleshooting, and better ownership of the ML pipeline by the ML team.
- Scalability: Leveraging AWS services like SageMaker and Lambda makes it easy to scale both training and prediction workloads.
Lessons Learned
- Research Tooling: Importance of having a well-designed research framework with features like local execution, fast container image publishing, and parallel experiment handling.
- Productization: Ensuring model backward compatibility, leveraging portable inference formats like ONNX (see the export sketch after this list), and integrating the AWS infrastructure into the CI/CD process.
- Scaling: Increasing service limits proactively, applying cost optimization techniques like instance type selection and spot instances, and developing a robust observability system.
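On the ONNX point, here is a minimal export sketch, assuming a PyTorch training stack (which the session does not specify); the toy model and shapes are placeholders.

```python
import torch

# Stand-in for a trained model; any torch.nn.Module works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
model.eval()

# Trace the model with a dummy input and write a portable .onnx file
# that onnxruntime can serve from a Lambda container image.
torch.onnx.export(
    model,
    torch.randn(1, 4),
    "model.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch size
)
```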
Cost Optimization Techniques
- Instance Type Selection: Right-sizing instances based on the training job requirements.
- Spot Instances: Leveraging spot instances can provide significant cost savings but requires handling interruptions (see the sketch after this list).
- Service Substitution: Exploring alternative services, such as AWS Batch for training or batch inference in place of synchronous predictions, to optimize costs.
- AWS Savings Plans: Utilizing AWS Savings Plans for predictable workloads to reduce costs.
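For the spot-instance item, SageMaker's Managed Spot Training is enabled with a few estimator flags; this sketch reuses the placeholder image, role, and buckets from the earlier training example.

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/training:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,   # run on spare capacity at a discount
    max_run=3600,              # max training time, in seconds
    max_wait=7200,             # max total time, including waiting for capacity
    # Checkpoints let an interrupted job resume instead of restarting,
    # which is how spot interruptions are handled.
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",
)
estimator.fit({"train": "s3://my-bucket/training-data/"})
```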
Demonstration
The demonstration showcases the following:
- Setting up the infrastructure using a CI/CD pipeline and container registry.
- Running a single training job in SageMaker and inspecting the job details, metrics, and logs.
- Running multiple training jobs in parallel, including handling failed jobs and rerunning them (see the sketch after this list).
- Leveraging spot instances to achieve significant cost savings.
- Conducting experiments in the research environment and observing the results.
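The failed-job handling shown in the demo can be approximated with two boto3 calls: list recent failed training jobs, then inspect each failure reason before resubmitting. This is a sketch, not the demo's actual tooling.

```python
import boto3

sm = boto3.client("sagemaker")

# List recent training jobs that ended in failure.
failed = sm.list_training_jobs(StatusEquals="Failed", MaxResults=50)

for summary in failed["TrainingJobSummaries"]:
    name = summary["TrainingJobName"]
    detail = sm.describe_training_job(TrainingJobName=name)
    print(name, "->", detail.get("FailureReason", "unknown"))
    # Rerunning means calling create_training_job again with the same
    # configuration (see the production-environment sketch above).
```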
Additional Resources
- SageMaker Workshop: A hands-on workshop covering SageMaker from idea to production
- SageMaker Helpers: Open-source project providing utilities for working with SageMaker
- Step Functions to Processing: Open-source project using different AWS compute services for machine learning
- Lambda PX: Open-source project with optimized machine learning libraries for AWS Lambda