The Database Analysis Toolkit is a Python-based tool designed to perform comprehensive data analysis on large datasets, currently focusing on geospatial analysis and fuzzy matching. The toolkit supports various data formats and provides configurable options to tailor the analysis to specific needs. It is particularly useful for data engineers, data scientists, and analysts who work with large datasets and need to perform advanced data-processing tasks.
- Geospatial Analysis: Calculate distances between geographical coordinates using the Haversine formula and identify clusters within a specified threshold.
- Fuzzy Matching: Identify and group similar records within a dataset based on configurable matching criteria.
- Support for Multiple File Formats: Easily load and process data from CSV, Excel, JSON, Parquet, and Feather files.
- Customizable: Configurable through a YAML file or command-line arguments, allowing users to adjust the analysis according to their needs.
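The Haversine distance check behind the geospatial clustering can be sketched in a few lines. The function names below are illustrative (not the toolkit's actual API), and treating the threshold as kilometres is an assumption, since the config does not state a unit:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def within_threshold(a, b, threshold_km=0.005):
    """True if the two points lie within threshold_km of each other (5 m by default)."""
    return haversine_km(a, b) <= threshold_km
```

Pairs that pass this check could then be merged into clusters; the `haversine` package used by the toolkit provides the same distance computation.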
.
├── config/
│   └── config.yaml              # Configuration file for the analysis
├── data/
│   └── input_file.csv           # Input data files (CSV, Excel, JSON, Parquet, Feather)
├── env/
│   ├── linux/environment.yml    # Conda environment file for Linux
│   └── win/environment.yml      # Conda environment file for Windows
├── logs/
│   └── logfile.log              # Log file storing all logging information
├── modules/
│   ├── data_loader.py           # Module for loading data from various formats
│   ├── fuzzy_matching.py        # Module for performing fuzzy matching
│   └── geospatial_analysis.py   # Module for performing geospatial analysis
├── results/
│   └── output_file.csv          # Output files generated by the analysis
├── util/
│   └── util.py                  # Utility functions for saving files and other tasks
├── database-analysis.py         # Main script to run the analysis
└── README.md                    # Project documentation
- Conda: Ensure you have Conda installed. You can install it from the official Conda (Miniconda or Anaconda) distribution.
- Python 3.11 or later: The project is compatible with Python 3.11 and above.
To create the Conda environment with all necessary dependencies, use the environment file for your platform:

conda env create -f env/linux/environment.yml   # Linux
conda env create -f env/win/environment.yml     # Windows
Activate the environment:
conda activate database-analysis-env
If you prefer to install the dependencies without Conda, you can install them using pip:

pip install pandas rapidfuzz haversine pyyaml
The toolkit uses a YAML configuration file (config/config.yaml) to define various parameters for the analysis, such as:
- Input and Output Files: Specify paths for input data and output results.
- Analysis Options: Enable or disable geospatial analysis and fuzzy matching.
- Sorting and Thresholds: Define columns for sorting and thresholds for matching.
Here’s a sample config.yaml file:
input_file: "input.csv"
output_file: "output.csv"
sort_by_columns:
  - "first_name"
  - "last_name"
geospatial_analysis: true
geospatial_columns:
  - "latitude"
  - "longitude"
geospatial_threshold: 0.005
fuzzy_matching: true
fuzzy_columns:
  - "address"
fuzzy_threshold: 0.8
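A config file like this can be parsed with PyYAML's `yaml.safe_load`. The helper below is a minimal sketch, not the toolkit's actual loader:

```python
import yaml  # pip install pyyaml

def load_config(path="config/config.yaml"):
    """Parse the YAML configuration into a plain dict (illustrative helper)."""
    with open(path) as fh:
        return yaml.safe_load(fh)

# With the sample file above, load_config()["fuzzy_threshold"] would be 0.8.
```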
To perform the analysis using the configuration file:
python database-analysis.py --config config/config.yaml
You can also override specific configurations using command-line arguments:
python database-analysis.py --input_file data/input.csv --output_file results/output.csv --geospatial_analysis True --fuzzy_matching True
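The fuzzy-grouping step can be illustrated with a short sketch. The toolkit uses rapidfuzz, but the stand-in below uses the standard library's difflib, whose `ratio` is a slower equivalent of `rapidfuzz.fuzz.ratio` on a 0–1 scale; the greedy grouping strategy is an assumption, not necessarily what fuzzy_matching.py does:

```python
import difflib

def similarity(a, b):
    """Normalised similarity in [0, 1]; rapidfuzz.fuzz.ratio / 100 is the faster equivalent."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def group_similar(values, threshold=0.8):
    """Greedily place each value into the first group whose representative it matches."""
    groups = []
    for value in values:
        for group in groups:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            groups.append([value])
    return groups

# Hypothetical address column: near-duplicate addresses end up in one group.
# group_similar(["123 Main St", "123 Main Street", "9 Elm Rd"])
```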
- CSV (.csv)
- Excel (.xlsx)
- JSON (.json)
- Parquet (.parquet)
- Feather (.feather)
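Extension-based dispatch to pandas readers is one way a loader like data_loader.py might support multiple formats; the names below are a sketch under that assumption, not the module's actual code:

```python
import pandas as pd
from pathlib import Path

# Hypothetical extension-to-reader dispatch table.
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
}

def load_data(path):
    """Pick a pandas reader based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported file format: {suffix}")
    return _READERS[suffix](path)
```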
All logging information is saved in the logs/logfile.log file. The log includes details about data loading, the execution of geospatial analysis and fuzzy matching, and any errors encountered during processing.
We welcome contributions to the Database Analysis Toolkit! If you would like to contribute:
- Fork the repository.
- Create a new branch (git checkout -b feature/YourFeature).
- Make your changes and commit them (git commit -m 'Add some feature').
- Push to the branch (git push origin feature/YourFeature).
- Open a Pull Request.
This project is licensed under the GPL-3 License. See the LICENSE file for more details.
This toolkit leverages Python libraries such as pandas, rapidfuzz, and haversine to perform data analysis. We thank the open-source community for their continuous support and contributions.