Skip to content

Latest commit

 

History

History
29 lines (20 loc) · 1.62 KB

README.md

File metadata and controls

29 lines (20 loc) · 1.62 KB

Polars Extractor

A simple data extraction script that can be deployed as a container. This uses Python, with Polars, ConnectorX and Pyarrow. The first two are implemented in Rust, and the second in C++. These libs do the heavy lifting, while Python binds it all.

This can be a lot less resource intensive then running PySpark (uses JVM), and a lot faster than using Pandas (Python implementation).

Parameters:

NAME DESCRIPTION DEFAULT
DATABASE_URL Database connection URL with credentials
TABLE DB Table
SCHEMA DB Schema public
WRITE_PATH Destination Path or fsspec URL
QUERY_OVERWRITE Query string to overwrite the default select * from schema.table
WRITE_PARTITIONED If filled, will use column as hive like partition
READ_PARTITIONED Makes the script read the data in parts (in parallel)
PARTITION_NUMBER If READ_PARTITIONED, instructs the amount of them. 4

Running it locally:

DATABASE_URL=postgresql://user:password@localhost:5432/cast_concursos TABLE=table_name WRITE_PATH="." WRITE_PARTITIONED="partition_column" python main.py