When storing data in S3, it is important to consider the size of the files you store. Parquet files have an ideal file size of 512 MB - 1 GB, and storing data in many small files can degrade the performance of data processing tools such as Spark.
This repository provides a PySpark script, Aggregate_Small_Parquet_Files.py, that consolidates the small parquet files under an S3 prefix into larger parquet files.
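For orientation, here is a minimal sketch of the consolidation idea, not the repository script itself: sum the size of the objects under a prefix, derive an output file count from a ~512 MB target, and rewrite the data as that many files. The 512 MB target, the separate output prefix, and all variable names other than the documented <s3_bucket_name> and <path_to_prefix> placeholders are assumptions for illustration.

```python
# Minimal sketch of the consolidation idea (assumptions noted above); the
# repository's Aggregate_Small_Parquet_Files.py is the supported implementation.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aggregate_small_parquet_files").getOrCreate()

s3_bucket_name = "<s3_bucket_name>"        # bucket holding the small parquet files
path_to_prefix = "<path_to_prefix>"        # prefix of a single partition to aggregate
target_file_size = 512 * 1024 * 1024       # assumed ~512 MB target per output file

# Sum the size of the existing objects under the prefix to pick an output file count.
s3 = boto3.client("s3")
pages = s3.get_paginator("list_objects_v2").paginate(Bucket=s3_bucket_name, Prefix=path_to_prefix)
total_bytes = sum(obj["Size"] for page in pages for obj in page.get("Contents", []))
num_output_files = max(1, (total_bytes + target_file_size - 1) // target_file_size)

# Read all of the small files and rewrite them as a handful of larger files.
# Writing to a separate prefix keeps the sketch simple; the real script may
# handle the output location differently.
df = spark.read.parquet(f"s3://{s3_bucket_name}/{path_to_prefix}")
df.coalesce(int(num_output_files)).write.mode("overwrite").parquet(
    f"s3://{s3_bucket_name}/{path_to_prefix.rstrip('/')}_aggregated/"
)
```

coalesce is used here because it reduces the partition count without a full shuffle, which is all that is needed when merging small files.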
Note: if you are testing Aggregate_Small_Parquet_Files.py and need small parquet files as test data, you can follow the instructions in the Example folder to create them.
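The snippet below is only a hypothetical illustration of what such test data looks like (one small parquet file per partition); the Example folder is the supported way to generate it, and the bucket/prefix placeholders are the same ones used elsewhere in this README.

```python
# Hypothetical illustration only: write a small dataset as many partitions so
# that each partition becomes one tiny parquet file under the prefix.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create_small_parquet_files").getOrCreate()

# A simple one-column DataFrame is enough to produce test files.
df = spark.range(0, 1_000_000)

# Repartitioning into 200 partitions yields roughly 200 small parquet files.
df.repartition(200).write.mode("overwrite").parquet("s3://<s3_bucket_name>/<path_to_prefix>")
```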
- Upload the Aggregate_Small_Parquet_Files.py file to an S3 bucket (these console steps can also be scripted; see the boto3 sketch after this list)
- Run the CloudFormation stack below to create a Glue job that will aggregate the small parquet files
As you follow the prompts to deploy the CloudFormation stack, ensure that you fill out the S3GlueScriptLocation parameter with the S3 URI of the Aggregate_Small_Parquet_Files.py file that you uploaded to an S3 bucket in the first step
- Update and run the Glue job
The CloudFormation stack deployed a Glue job named Aggregate_Small_Parquet_Files. Navigate to the Glue console, select ETL jobs, and then select the Aggregate_Small_Parquet_Files job.
- Update <s3_bucket_name> with the name of the S3 bucket that contains the small files to aggregate
- Update <path_to_prefix> with the path to the prefix of a single partition containing the small files to aggregate
- Optional: update total_prefix_size to the desired target size of the aggregated parquet file(s)
After you update the S3 bucket name and the path to the prefix, save and run the Glue job. When the Glue job finishes, the small parquet files in the specified S3 location will have been aggregated.
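If you would rather script the console steps above, the sketch below does the same thing with boto3. The region, stack name, script key, and the locally saved template.yaml are assumptions (the repository's CloudFormation stack link and the Glue console remain the supported path), and the script placeholders still need to be edited before the job is started.

```python
# Scripted alternative to the console steps above (names marked "assumed" are
# not defined by this repository).
import boto3

region = "us-east-1"                                        # assumed region
bucket = "<s3_bucket_name>"                                 # bucket for the Glue script
script_key = "scripts/Aggregate_Small_Parquet_Files.py"     # assumed object key
stack_name = "aggregate-small-parquet-files"                # assumed stack name

# Step 1: upload the Glue script to S3.
boto3.client("s3", region_name=region).upload_file(
    "Aggregate_Small_Parquet_Files.py", bucket, script_key
)

# Step 2: deploy the CloudFormation stack, pointing S3GlueScriptLocation at the
# uploaded script. template.yaml stands in for the stack template referenced above.
cfn = boto3.client("cloudformation", region_name=region)
with open("template.yaml") as template:
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template.read(),
        Parameters=[
            {
                "ParameterKey": "S3GlueScriptLocation",
                "ParameterValue": f"s3://{bucket}/{script_key}",
            }
        ],
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)

# Step 3: after editing the placeholders in the deployed job script, start the job.
run = boto3.client("glue", region_name=region).start_job_run(
    JobName="Aggregate_Small_Parquet_Files"
)
print("Started Glue job run:", run["JobRunId"])
```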