AWS Certified Solutions Architect - Associate / Question #1671 of 1019

Question #1671

A company runs a mission-critical nightly data pipeline that requires 8 hours to complete. The workload must not experience data loss during execution. A solutions architect needs to design an Amazon EMR cluster that balances cost efficiency with reliability. Which configuration BEST meets these requirements?

A

Use a long-running cluster with the primary node and core nodes on On-Demand Instances, and task nodes on Spot Instances.

B

Use a transient cluster with the primary node and core nodes on On-Demand Instances, and task nodes on Spot Instances.

C

Use a transient cluster with the primary node on an On-Demand Instance, and core nodes and task nodes on Spot Instances.

D

Use a long-running cluster with the primary node on an On-Demand Instance, and core nodes and task nodes on Spot Instances.

Explanation

Answer B is correct because:
1. Transient Cluster: The job runs for 8 hours each night, so a long-running cluster would sit idle for roughly 16 hours a day. A transient cluster terminates when the job completes, so the company pays only for the hours the pipeline runs.
2. Primary/Core Nodes on On-Demand: The primary node coordinates the cluster, and core nodes store data in HDFS. Losing either one mid-run can fail the job or lose HDFS data. On-Demand Instances are not subject to Spot interruptions, so they protect both roles.
3. Task Nodes on Spot: Task nodes only run compute tasks and do not store HDFS data. Spot Instances lower their cost, and an interruption costs only the work that must be rerun, not data.
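
The three points above can be sketched as a cluster request. This is a minimal illustration, not a reference configuration: the key names follow the boto3 EMR `run_job_flow` request shape, but the cluster name, instance type, node counts, release label, and role names are placeholder assumptions, and the pipeline steps are omitted.

```python
from typing import Any, Dict


def build_transient_cluster_request(core_count: int = 4, task_count: int = 6) -> Dict[str, Any]:
    """Build a request dict shaped like the boto3 EMR run_job_flow parameters.

    All concrete values (name, instance type, counts, release label, roles)
    are hypothetical placeholders chosen for illustration.
    """

    def group(name: str, role: str, market: str, count: int) -> Dict[str, Any]:
        return {
            "Name": name,
            "InstanceRole": role,          # MASTER is the API name for the primary node
            "Market": market,              # ON_DEMAND or SPOT
            "InstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": count,
        }

    return {
        "Name": "nightly-pipeline",        # assumed cluster name
        "ReleaseLabel": "emr-6.15.0",      # assumed release label
        "Instances": {
            "InstanceGroups": [
                group("primary", "MASTER", "ON_DEMAND", 1),
                group("core", "CORE", "ON_DEMAND", core_count),  # HDFS lives here
                group("task", "TASK", "SPOT", task_count),       # compute only
            ],
            # False makes the cluster transient: it terminates once its steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request()
for g in request["Instances"]["InstanceGroups"]:
    print(g["InstanceRole"], g["Market"], g["InstanceCount"])
```

In practice a dict like this would be extended with a `Steps` list for the pipeline and passed to `boto3.client("emr").run_job_flow(**request)`. The sketch stops short of that call so it runs without AWS credentials.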

Why other options are incorrect:
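- A/D: A long-running cluster keeps billing while it sits idle between nightly runs, which raises cost without any benefit. D also places core nodes on Spot, which risks HDFS data loss.
- C: Core nodes on Spot risk HDFS data loss if interrupted, violating the 'no data loss' requirement.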
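
A rough cost comparison makes the trade-off concrete. The sketch below is illustrative only: the hourly rates and node counts are made-up assumptions, not AWS prices, the EMR service fee is ignored, and the long-running cluster is assumed to run 24 hours a day.

```python
# Hypothetical per-node hourly rates in cents (NOT real AWS prices).
ON_DEMAND_CENTS = 20
SPOT_CENTS = 6

# Hypothetical cluster size.
PRIMARY, CORE, TASK = 1, 4, 6

JOB_HOURS = 8   # from the question: the pipeline takes 8 hours
DAY_HOURS = 24  # assumed billing window for a long-running cluster


def daily_cost_cents(hours: int, core_on_spot: bool) -> int:
    """Daily cost of one cluster; primary is On-Demand and task nodes are Spot in every option."""
    core_rate = SPOT_CENTS if core_on_spot else ON_DEMAND_CENTS
    hourly = PRIMARY * ON_DEMAND_CENTS + CORE * core_rate + TASK * SPOT_CENTS
    return hours * hourly


options = {
    "A": daily_cost_cents(DAY_HOURS, core_on_spot=False),  # long-running, core On-Demand
    "B": daily_cost_cents(JOB_HOURS, core_on_spot=False),  # transient, core On-Demand
    "C": daily_cost_cents(JOB_HOURS, core_on_spot=True),   # transient, core Spot
    "D": daily_cost_cents(DAY_HOURS, core_on_spot=True),   # long-running, core Spot
}

for name, cents in sorted(options.items()):
    print(f"Option {name}: {cents / 100:.2f} per day")
```

Under these assumptions C comes out cheapest, but it is ruled out by the data-loss requirement. B is the cheapest option that keeps HDFS on On-Demand Instances.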

Key Points:
- Use transient clusters for scheduled batch jobs that run to completion and then have no further work.
- Keep core nodes (HDFS) on On-Demand Instances when data loss is unacceptable.
- Task nodes can use Spot Instances for cost savings.

Answer

The correct answer is: B