This document is designed to be read in parallel with the code in the pyspark-template-project
repository. Together, these constitute what we consider to be a 'best practices' approach to writing ETL jobs using Apache Spark and its Python ('PySpark') APIs on Databricks. This project addresses the following topics:
- how to structure ETL code in such a way that it can be easily tested and debugged;
- how to pass configuration parameters to a PySpark job;
- how to handle dependencies on other modules and packages; and,
- what constitutes a 'meaningful' test for an ETL job.
The basic project structure is as follows:
adb_etl/
|-- configs/
| |-- etl_config.py
|-- dependencies/
| |-- spark_utility.py
|-- jobs/
| |-- etl_job.py
|-- tests/
| |-- test_data/
| | |-- employees/
| | |-- employees_report/
| |-- run.py
| |-- test_activate_token.py
| |-- test_etl_job.py
|-- requirements.txt
|-- setup.py
The main Python module containing the ETL job is jobs/etl_job.py. Any external configuration parameters required by etl_job.py are stored in a Config class in tests/run.py. Additional modules that support this job can be kept in the dependencies folder. Unit test modules are kept in the tests folder, and small chunks of representative input and output data, to be used with the tests, are kept in the tests/test_data folder.
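As a rough illustration of how these pieces fit together (not the exact contents of the repository), the job module can import its supporting code from dependencies and its settings from the Config class. In the sketch below, start_spark, transform_data and the column names are assumptions made for the example, not the repository's actual API:

# jobs/etl_job.py -- illustrative sketch only; start_spark and the column
# names used below are assumptions, not the repository's actual API.
from pyspark.sql import functions as F

from dependencies.spark_utility import start_spark  # hypothetical helper
from tests.run import Config


def transform_data(df, steps_per_floor):
    # Example transformation: derive a column from the configured constant.
    # The 'floor' column is assumed purely for illustration.
    return df.withColumn('steps_to_desk', F.col('floor') * steps_per_floor)


def main():
    # On Databricks a SparkSession already exists; a helper such as
    # start_spark would typically just return or configure it.
    spark = start_spark(app_name='adb_etl_job')

    # Extract -> transform -> load, using paths rooted at the data lake URL.
    df = spark.read.parquet(Config.DATA_LAKE + '/employees')
    df_out = transform_data(df, Config.STEPS_PER_FLOOR)
    df_out.write.mode('overwrite').parquet(Config.DATA_LAKE + '/employees_report')


if __name__ == '__main__':
    main()

Structuring the job as small, importable functions like transform_data is what makes it straightforward to test each step in isolation from tests/test_etl_job.py.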
There are different ways to pass configuration variables to etl_job.py, e.g. JSON, YAML, INI files or a Python class. In this example we have used a Config class defined in tests/run.py:
class Config:
    SCOPE_NAME = '<databricks scope name>'
    # Replace with the key names in Key Vault for the Azure Active Directory
    # tenant id, client id and client secret
    TENANT_ID = '<aad tenant_id>'
    CLIENT_ID = '<aad_client_id>'
    SECRET = '<aad_client_secret>'
    STEPS_PER_FLOOR = 21
    DATA_LAKE = '<data lake url>'
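TENANT_ID, CLIENT_ID and SECRET hold the names of secrets in the Key-Vault-backed Databricks secret scope, so credentials are resolved at runtime rather than hard-coded. A minimal sketch of how the job might use them to authenticate against the data lake is shown below; the OAuth client-credentials setup for ADLS Gen2 follows standard Azure Databricks practice and is an assumption, not a prescription from this repository (dbutils and spark are the globals provided by the Databricks runtime):

# Illustrative only: resolve secrets from the Key-Vault-backed scope and
# configure Spark for OAuth access to ADLS Gen2.
from tests.run import Config

tenant_id = dbutils.secrets.get(scope=Config.SCOPE_NAME, key=Config.TENANT_ID)
client_id = dbutils.secrets.get(scope=Config.SCOPE_NAME, key=Config.CLIENT_ID)
client_secret = dbutils.secrets.get(scope=Config.SCOPE_NAME, key=Config.SECRET)

spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret", client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# With authentication configured, data can be read from the lake:
df = spark.read.parquet(Config.DATA_LAKE + "/employees")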