Question #1973
A company is deploying machine learning models on AWS as independent microservices. Each microservice downloads 1.5 GB of model data from Amazon S3 during initialization, storing it in memory. The models are accessed via an asynchronous API, allowing users to submit individual or batch requests. The company serves numerous users with unpredictable usage patterns, where some models remain inactive for weeks, while others experience sudden bursts of thousands of requests.
Which solution best meets these requirements?
A. Route API requests through a Network Load Balancer (NLB) to AWS Lambda functions. Configure auto-scaling for Lambda based on NLB traffic, utilizing provisioned concurrency to minimize cold starts.
B. Use an Application Load Balancer (ALB) to direct requests to Amazon ECS services. Auto-scale ECS instances based on ALB request metrics, ensuring sufficient capacity during traffic spikes.
C. Send API requests to an Amazon SQS queue. Trigger AWS Lambda functions from the queue, with auto-scaling adjusting memory allocation based on queue backlog.
D. Direct API requests to an Amazon SQS queue. Deploy the models as Amazon ECS services that process messages from the queue. Auto-scale ECS tasks and cluster capacity dynamically using SQS queue metrics.
Explanation
Answer D is correct because:
1. Asynchronous API Handling: Amazon SQS decouples the API requests, allowing asynchronous processing, which aligns with the requirement.
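2. Model Initialization Overhead: ECS tasks are long-running, so each task downloads the 1.5 GB model once at startup and reuses it from memory for every request it handles. Lambda (Options A/C) repeats that load on every cold start, which hurts most during bursts, when many new execution environments start at once.
3. Auto-Scaling: ECS tasks auto-scale on SQS queue metrics (e.g., message backlog), adding capacity during traffic spikes and scaling to zero for models that sit idle. Requests that arrive while tasks start and load the model wait in the queue instead of failing, which is acceptable because the API is asynchronous.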
4. Cost Efficiency: ECS scales dynamically, avoiding the cost of provisioned concurrency (Lambda in A) or idle ALB/ECS resources (Option B).
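The queue-based scaling in point 3 comes down to simple arithmetic: divide the backlog by the number of messages one task is expected to work through, then clamp the result. The sketch below is illustrative only. The function name, the `target_backlog_per_task` parameter, and the `max_tasks` cap are assumptions, not part of the question or of any AWS API. In a real deployment the backlog would typically come from the SQS `ApproximateNumberOfMessagesVisible` CloudWatch metric, and the scaling itself would be handled by a scaling policy rather than hand-written code.

```python
import math


def desired_task_count(visible_messages: int,
                       target_backlog_per_task: int,
                       max_tasks: int) -> int:
    """Return how many ECS tasks to run for the current queue backlog.

    All three parameters are hypothetical tuning inputs for this sketch.
    """
    if target_backlog_per_task <= 0:
        raise ValueError("target_backlog_per_task must be positive")
    if visible_messages <= 0:
        # Empty queue: an idle model can scale all the way to zero tasks.
        return 0
    # Round up so that even a single message gets one task.
    needed = math.ceil(visible_messages / target_backlog_per_task)
    # Never exceed the configured ceiling, however large the burst.
    return min(needed, max_tasks)


if __name__ == "__main__":
    for backlog in (0, 1, 250, 5000):
        print(backlog, desired_task_count(backlog, 100, 20))
```

With these assumed values, a model that has been idle for weeks holds at zero tasks, and a burst of thousands of messages drives the count up to the cap.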
Why others are incorrect:
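- A/C: Lambda must load the 1.5 GB model again on every cold start, and a burst of thousands of requests starts many new execution environments at once. In A, an NLB cannot use a Lambda function as a target (only an ALB can), and provisioned concurrency is billed continuously, even for models that sit idle for weeks. In C, Lambda memory is a fixed per-function setting; auto-scaling cannot adjust it based on queue backlog.
- B: An ALB forwards requests synchronously and does not buffer them. When a service has scaled in to few or zero tasks, requests that arrive during a burst fail or time out while new tasks start and load the model.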
Key Points:
- Use SQS for async APIs.
- Prefer ECS over Lambda for large models that must stay loaded in memory.
- Scale based on queue metrics to handle unpredictable traffic.
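The key points above can be pictured as a worker that loads the model once and then drains a queue. The following is a minimal sketch under stated assumptions, not an actual ECS or SQS implementation: an in-memory `queue.Queue` stands in for the SQS queue, and `load_model` is a hypothetical placeholder for the 1.5 GB download from S3. A real worker would long-poll SQS and delete each message after processing it.

```python
import queue
from typing import Callable, List

Model = Callable[[str], str]


def load_model() -> Model:
    """Hypothetical stand-in for downloading the model from S3 into memory."""
    weights = {"suffix": "-scored"}  # placeholder for real model data
    return lambda payload: payload + weights["suffix"]


def run_worker(messages: "queue.Queue[str]",
               load: Callable[[], Model]) -> List[str]:
    """Load the model once, then process every queued request with it."""
    model = load()  # expensive step, paid once per task start
    results: List[str] = []
    while True:
        try:
            body = messages.get_nowait()
        except queue.Empty:
            break  # queue drained; a real worker would keep polling
        results.append(model(body))  # reuses the in-memory model
    return results


if __name__ == "__main__":
    q: "queue.Queue[str]" = queue.Queue()
    for request in ("req-1", "req-2", "req-3"):
        q.put(request)
    print(run_worker(q, load_model))
```

The point of the design is that the load cost is tied to task startup rather than to each request, so a long-running task spreads it across everything it processes.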
Answer
The correct answer is: D