AWS Certified Solutions Architect - Professional / Question #997 of 529

Question #997

A company stores application logs from multiple AWS regions in a centralized Amazon S3 bucket. The logs are in CSV format and compressed with gzip. The company uses Amazon Athena to query these logs for auditing purposes. Over time, query performance has degraded due to increasing data volume, and storage costs are rising. A solutions architect must optimize query performance and reduce storage costs without compromising data retention requirements.

Which solution will meet these requirements with the MOST significant improvement?

A

Create an AWS Lambda function to convert CSV files to Apache ORC format using Snappy compression. Configure the Lambda function to trigger automatically for new S3 uploads.

B

Enable S3 Transfer Acceleration for the bucket and apply an S3 Lifecycle policy to transition logs to S3 Glacier Deep Archive after 30 days.

C

Update the logging pipeline to store logs in Apache Parquet format partitioned by date. Configure daily partitioning for new log files.

D

Migrate to the latest Athena engine version and increase the Athena workgroup's query concurrency limit.

Explanation

Option C is correct because:
1. Apache Parquet Format: Parquet is a columnar storage format that reduces the amount of data scanned during queries (improving Athena performance) and provides better compression than CSV (lowering storage costs).
2. Partitioning by Date: Daily partitioning allows Athena to skip scanning irrelevant partitions when queries filter by date, drastically reducing query execution time and cost.
3. Storage Efficiency: Parquet's compression outperforms gzipped CSV, reducing storage costs.

Other options:
- A: Converting to ORC with Snappy helps, but lacks partitioning, missing significant optimization.
- B: S3 Glacier reduces costs but hinders query accessibility and doesn't improve query performance.
- D: Athena engine updates may offer minor improvements but don't address the root cause (data format/structure).

Key Points:
- Use columnar formats (Parquet/ORC) for analytical queries.
- Partitioning reduces data scanned, improving performance.
- Efficient compression lowers storage costs.

Answer

The correct answer is: C