When storing data in S3, it is important to consider the size of the files you store. Parquet files have an ideal file size of 512 MB - 1 GB, and storing data in many small files can degrade the performance of data processing tools such as Spark.
This repository provides a PySpark script, Aggregate_Small_Parquet_Files.py, that consolidates the small parquet files under an S3 prefix into larger parquet files.
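For orientation, here is a minimal sketch of the consolidation idea, not the repository script itself: sum the size of the objects under a prefix, derive an output file count from a ~512 MB target, and rewrite the data as that many files. The 512 MB target, the separate output prefix, and all variable names other than the documented <s3_bucket_name> and <path_to_prefix> placeholders are assumptions for illustration.

```python
# Minimal sketch of the consolidation idea (assumptions noted above); the
# repository's Aggregate_Small_Parquet_Files.py is the supported implementation.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aggregate_small_parquet_files").getOrCreate()

s3_bucket_name = "<s3_bucket_name>"        # bucket holding the small parquet files
path_to_prefix = "<path_to_prefix>"        # prefix of a single partition to aggregate
target_file_size = 512 * 1024 * 1024       # assumed ~512 MB target per output file

# Sum the size of the existing objects under the prefix to pick an output file count.
s3 = boto3.client("s3")
pages = s3.get_paginator("list_objects_v2").paginate(Bucket=s3_bucket_name, Prefix=path_to_prefix)
total_bytes = sum(obj["Size"] for page in pages for obj in page.get("Contents", []))
num_output_files = max(1, (total_bytes + target_file_size - 1) // target_file_size)

# Read all of the small files and rewrite them as a handful of larger files.
# Writing to a separate prefix keeps the sketch simple; the real script may
# handle the output location differently.
df = spark.read.parquet(f"s3://{s3_bucket_name}/{path_to_prefix}")
df.coalesce(int(num_output_files)).write.mode("overwrite").parquet(
    f"s3://{s3_bucket_name}/{path_to_prefix.rstrip('/')}_aggregated/"
)
```

coalesce is used here because it reduces the partition count without a full shuffle, which is all that is needed when merging small files.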
Note: if you are testing Aggregate_Small_Parquet_Files.py and need small parquet files as test data, you can follow the instructions in the Example folder to create them.
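The snippet below is only a hypothetical illustration of what such test data looks like (one small parquet file per partition); the Example folder is the supported way to generate it, and the bucket/prefix placeholders are the same ones used elsewhere in this README.

```python
# Hypothetical illustration only: write a small dataset as many partitions so
# that each partition becomes one tiny parquet file under the prefix.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create_small_parquet_files").getOrCreate()

# A simple one-column DataFrame is enough to produce test files.
df = spark.range(0, 1_000_000)

# Repartitioning into 200 partitions yields roughly 200 small parquet files.
df.repartition(200).write.mode("overwrite").parquet("s3://<s3_bucket_name>/<path_to_prefix>")
```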
- Upload the Aggregate_Small_Parquet_Files.py file to an S3 bucket (these console steps can also be scripted; see the boto3 sketch after this list)
- Run the CloudFormation stack below to create a Glue job that will aggregate the small parquet files
As you follow the prompts to deploy the CloudFormation stack, ensure that you fill out the S3GlueScriptLocation parameter with the S3 URI of the Aggregate_Small_Parquet_Files.py file that you uploaded to an S3 bucket in the first step
- Update and run the Glue job
The CloudFormation stack deployed a Glue job named Aggregate_Small_Parquet_Files. Navigate to the Glue console, select ETL jobs, and then select the Aggregate_Small_Parquet_Files job.
- Update <s3_bucket_name> with the name of the S3 bucket that contains the small files to aggregate
- Update <path_to_prefix> with the path to the prefix of a single partition containing the small files to aggregate
- Optional: update total_prefix_size to the desired target size of the aggregated parquet file(s)
After you update the S3 bucket name and the path to the prefix, save and run the Glue job. When the Glue job finishes, the small parquet files in the specified S3 location will have been aggregated.
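If you would rather script the console steps above, the sketch below does the same thing with boto3. The region, stack name, script key, and the locally saved template.yaml are assumptions (the repository's CloudFormation stack link and the Glue console remain the supported path), and the script placeholders still need to be edited before the job is started.

```python
# Scripted alternative to the console steps above (names marked "assumed" are
# not defined by this repository).
import boto3

region = "us-east-1"                                        # assumed region
bucket = "<s3_bucket_name>"                                 # bucket for the Glue script
script_key = "scripts/Aggregate_Small_Parquet_Files.py"     # assumed object key
stack_name = "aggregate-small-parquet-files"                # assumed stack name

# Step 1: upload the Glue script to S3.
boto3.client("s3", region_name=region).upload_file(
    "Aggregate_Small_Parquet_Files.py", bucket, script_key
)

# Step 2: deploy the CloudFormation stack, pointing S3GlueScriptLocation at the
# uploaded script. template.yaml stands in for the stack template referenced above.
cfn = boto3.client("cloudformation", region_name=region)
with open("template.yaml") as template:
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template.read(),
        Parameters=[
            {
                "ParameterKey": "S3GlueScriptLocation",
                "ParameterValue": f"s3://{bucket}/{script_key}",
            }
        ],
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)

# Step 3: after editing the placeholders in the deployed job script, start the job.
run = boto3.client("glue", region_name=region).start_job_run(
    JobName="Aggregate_Small_Parquet_Files"
)
print("Started Glue job run:", run["JobRunId"])
```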