Database Analysis Toolkit

The Database Analysis Toolkit is a Python-based tool designed to perform comprehensive data analysis on large datasets, currently focusing on geospatial analysis and fuzzy matching. The toolkit supports various data formats and provides configurable options to tailor the analysis to specific needs. It is particularly useful for data engineers, data scientists, and analysts who work with large datasets and need to perform advanced data processing tasks.

Features

  • Geospatial Analysis: Calculate distances between geographical coordinates using the Haversine formula and identify clusters within a specified threshold (see the sketch after this list).
  • Fuzzy Matching: Identify and group similar records within a dataset based on configurable matching criteria.
  • Support for Multiple File Formats: Easily load and process data from CSV, Excel, JSON, Parquet, and Feather files.
  • Customizable: Configurable through a YAML file or command-line arguments, allowing users to adjust the analysis according to their needs.
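For orientation, the geospatial step can be pictured with the minimal sketch below. It uses the haversine package to compute pairwise distances and groups points that fall within a threshold; the column names and the grouping logic are illustrative assumptions, not the toolkit's actual geospatial_analysis.py implementation.

# Minimal sketch of Haversine-based proximity grouping (illustrative only).
# Assumes a DataFrame with "latitude" and "longitude" columns; haversine()
# returns kilometres by default.
import pandas as pd
from haversine import haversine

def group_nearby_points(df, threshold_km):
    cluster_ids = [-1] * len(df)
    next_id = 0
    coords = list(zip(df["latitude"], df["longitude"]))
    for i, point in enumerate(coords):
        if cluster_ids[i] != -1:
            continue
        cluster_ids[i] = next_id
        for j in range(i + 1, len(coords)):
            if cluster_ids[j] == -1 and haversine(point, coords[j]) <= threshold_km:
                cluster_ids[j] = next_id
        next_id += 1
    return pd.Series(cluster_ids, index=df.index, name="cluster")

df = pd.DataFrame({
    "latitude": [40.7128, 40.7130, 34.0522],
    "longitude": [-74.0060, -74.0058, -118.2437],
})
print(df.assign(cluster=group_nearby_points(df, 0.5)))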

Project Structure

.
├── config/
│   └── config.yaml               # Configuration file for the analysis
├── data/
│   └── input_file.csv            # Input data files (CSV, Excel, JSON, Parquet, Feather)
├── env/
│   ├── linux/environment.yml     # Conda environment file for Linux
│   └── win/environment.yml       # Conda environment file for Windows
├── logs/
│   └── logfile.log               # Log file storing all logging information
├── modules/
│   ├── data_loader.py            # Module for loading data from various formats
│   ├── fuzzy_matching.py         # Module for performing fuzzy matching
│   └── geospatial_analysis.py    # Module for performing geospatial analysis
├── results/
│   └── output_file.csv           # Output files generated by the analysis
├── util/
│   └── util.py                   # Utility functions for saving files and other tasks
├── database-analysis.py          # Main script to run the analysis
└── README.md                     # Project documentation

Installation

Prerequisites

  • Conda: Ensure you have Conda installed (available through the Miniconda or Anaconda distributions).
  • Python 3.11 or later: The project requires Python 3.11 or newer.

Setting Up the Environment

To create the Conda environment with all necessary dependencies, use the environment file for your platform (see the env/ directory in the project structure):

conda env create -f env/linux/environment.yml   # Linux
conda env create -f env/win/environment.yml     # Windows

Activate the environment:

conda activate database-analysis-env

Manual Installation

If you prefer to install the dependencies manually or without Conda, you can install them using pip:

pip install pandas rapidfuzz haversine pyyaml

Configuration

The toolkit uses a YAML configuration file (config/config.yaml) to define various parameters for the analysis, such as:

  • Input and Output Files: Specify paths for input data and output results.
  • Analysis Options: Enable or disable geospatial analysis and fuzzy matching.
  • Sorting and Thresholds: Define columns for sorting and thresholds for matching.

Example Configuration

Here’s a sample config.yaml file:

input_file: "input.csv"
output_file: "output.csv"
sort_by_columns:
  - "first_name"
  - "last_name"
geospatial_analysis: True
geospatial_columns:
  - "latitude"
  - "longitude"
geospatial_threshold: 0.005
fuzzy_matching: True
fuzzy_columns:
  - "address"
fuzzy_threshold: 0.8
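
As a rough illustration of how fuzzy_columns and fuzzy_threshold could drive the matching step, the sketch below groups rows whose values are at least 80% similar according to rapidfuzz. The grouping logic is an assumption for illustration, not the toolkit's fuzzy_matching.py module.

# Illustrative sketch: group rows whose "address" values are at least
# `threshold` similar (0-1 scale, matching fuzzy_threshold above).
import pandas as pd
from rapidfuzz import fuzz

def fuzzy_group(df, column, threshold):
    group_ids = [-1] * len(df)
    next_id = 0
    values = df[column].astype(str).tolist()
    for i, left in enumerate(values):
        if group_ids[i] != -1:
            continue
        group_ids[i] = next_id
        for j in range(i + 1, len(values)):
            # fuzz.ratio returns 0-100, so rescale to compare with a 0-1 threshold
            if group_ids[j] == -1 and fuzz.ratio(left, values[j]) / 100 >= threshold:
                group_ids[j] = next_id
        next_id += 1
    return pd.Series(group_ids, index=df.index, name="fuzzy_group")

df = pd.DataFrame({"address": ["12 Main St", "12 Main Street", "99 Oak Ave"]})
print(df.assign(fuzzy_group=fuzzy_group(df, "address", 0.8)))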

Usage

Running the Analysis

To perform the analysis using the configuration file:

python database-analysis.py --config config/config.yaml

You can also override specific configurations using command-line arguments:

python database-analysis.py --input_file data/input.csv --output_file results/output.csv --geospatial_analysis True --fuzzy_matching True
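
For context, combining a YAML file with command-line overrides typically looks something like the sketch below. The flag names mirror the examples above; the merge logic itself is an assumption and not necessarily how database-analysis.py resolves its options.

# Sketch of config loading with CLI overrides (assumed behaviour, illustrative only).
import argparse
import yaml

parser = argparse.ArgumentParser(description="Database Analysis Toolkit")
parser.add_argument("--config", default="config/config.yaml")
parser.add_argument("--input_file")
parser.add_argument("--output_file")
parser.add_argument("--geospatial_analysis", type=lambda v: v.lower() == "true")
parser.add_argument("--fuzzy_matching", type=lambda v: v.lower() == "true")
args = parser.parse_args()

with open(args.config, "r") as f:
    config = yaml.safe_load(f)

# Values supplied on the command line take precedence over the YAML file.
for key, value in vars(args).items():
    if key != "config" and value is not None:
        config[key] = value

print(config)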

Supported File Formats

  • CSV (.csv)
  • Excel (.xlsx)
  • JSON (.json)
  • Parquet (.parquet)
  • Feather (.feather)
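
A loader that dispatches on file extension with pandas might look like the sketch below; the toolkit's data_loader.py may differ. Note that reading Excel generally requires openpyxl, and Parquet/Feather require pyarrow, neither of which is included in the manual pip command above.

# Illustrative extension-based loader (not the actual data_loader.py).
from pathlib import Path
import pandas as pd

READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
}

def load_data(path):
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix}")
    return READERS[suffix](path)

# Example: df = load_data("data/input_file.csv")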

Logging

All logging information is saved in the logs/logfile.log file. The log file includes details about data loading, the execution of geospatial analysis, fuzzy matching, and any errors encountered during processing.
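
If you need to reproduce a similar setup in your own scripts, a minimal configuration along these lines writes to the same location; the exact format and level used by the toolkit are assumptions here.

# Minimal sketch of file logging to logs/logfile.log (format/level are assumed).
import logging
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
logging.basicConfig(
    filename="logs/logfile.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("database-analysis")
logger.info("Loaded input data")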

Contributing

We welcome contributions to the Database Analysis Toolkit! If you would like to contribute:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature/YourFeature).
  3. Make your changes and commit them (git commit -m 'Add some feature').
  4. Push to the branch (git push origin feature/YourFeature).
  5. Open a Pull Request.

License

This project is licensed under the GPL-3.0 license. See the LICENSE file for more details.

Acknowledgements

This toolkit leverages Python libraries such as pandas, rapidfuzz, and haversine to perform data analysis. We thank the open-source community for their continuous support and contributions.
