The Database Analysis Toolkit is a Python-based tool designed to perform comprehensive data analysis on large datasets, currently focusing on geospatial analysis and fuzzy matching. The toolkit supports various data formats and provides configurable options to tailor the analysis to specific needs. It is particularly useful for data engineers, data scientists, and analysts who work with large datasets and need to perform advanced data-processing tasks.
- Geospatial Analysis: Calculate distances between geographical coordinates using the Haversine formula and identify clusters within a specified threshold.
- Fuzzy Matching: Identify and group similar records within a dataset based on configurable matching criteria.
- Support for Multiple File Formats: Easily load and process data from CSV, Excel, JSON, Parquet, and Feather files.
- Customizable: Configurable through a YAML file or command-line arguments, allowing users to adjust the analysis according to their needs.
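The Haversine distance check behind the geospatial clustering can be sketched in a few lines. The function names below are illustrative (not the toolkit's actual API), and treating the threshold as kilometres is an assumption, since the config does not state a unit:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def within_threshold(a, b, threshold_km=0.005):
    """True if the two points lie within threshold_km of each other (5 m by default)."""
    return haversine_km(a, b) <= threshold_km
```

Pairs that pass this check could then be merged into clusters; the `haversine` package used by the toolkit provides the same distance computation.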
.
├── config/
│   └── config.yaml              # Configuration file for the analysis
├── data/
│   └── input_file.csv           # Input data files (CSV, Excel, JSON, Parquet, Feather)
├── env/
│   ├── linux/environment.yml    # Conda environment file for Linux
│   └── win/environment.yml      # Conda environment file for Windows
├── logs/
│   └── logfile.log              # Log file storing all logging information
├── modules/
│   ├── data_loader.py           # Module for loading data from various formats
│   ├── fuzzy_matching.py        # Module for performing fuzzy matching
│   └── geospatial_analysis.py   # Module for performing geospatial analysis
├── results/
│   └── output_file.csv          # Output files generated by the analysis
├── util/
│   └── util.py                  # Utility functions for saving files and other tasks
├── database-analysis.py         # Main script to run the analysis
└── README.md                    # Project documentation
- Conda: Ensure you have Conda installed. You can install it from the official Conda (Miniconda or Anaconda) distribution.
- Python 3.11 or later: The project is compatible with Python 3.11 and above.
To create the Conda environment with all necessary dependencies, use the environment file for your platform:

conda env create -f env/linux/environment.yml   # Linux
conda env create -f env/win/environment.yml     # Windows
Activate the environment:
conda activate database-analysis-env
If you prefer to install the dependencies without Conda, you can install them using pip:

pip install pandas rapidfuzz haversine pyyaml
The toolkit uses a YAML configuration file (config/config.yaml) to define various parameters for the analysis, such as:
- Input and Output Files: Specify paths for input data and output results.
- Analysis Options: Enable or disable geospatial analysis and fuzzy matching.
- Sorting and Thresholds: Define columns for sorting and thresholds for matching.
Here’s a sample config.yaml file:
input_file: "input.csv"
output_file: "output.csv"
sort_by_columns:
  - "first_name"
  - "last_name"
geospatial_analysis: true
geospatial_columns:
  - "latitude"
  - "longitude"
geospatial_threshold: 0.005
fuzzy_matching: true
fuzzy_columns:
  - "address"
fuzzy_threshold: 0.8
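A config file like this can be parsed with PyYAML's `yaml.safe_load`. The helper below is a minimal sketch, not the toolkit's actual loader:

```python
import yaml  # pip install pyyaml

def load_config(path="config/config.yaml"):
    """Parse the YAML configuration into a plain dict (illustrative helper)."""
    with open(path) as fh:
        return yaml.safe_load(fh)

# With the sample file above, load_config()["fuzzy_threshold"] would be 0.8.
```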
To perform the analysis using the configuration file:
python database-analysis.py --config config/config.yaml
You can also override specific configurations using command-line arguments:
python database-analysis.py --input_file data/input.csv --output_file results/output.csv --geospatial_analysis True --fuzzy_matching True
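The fuzzy-grouping step can be illustrated with a short sketch. The toolkit uses rapidfuzz, but the stand-in below uses the standard library's difflib, whose `ratio` is a slower equivalent of `rapidfuzz.fuzz.ratio` on a 0–1 scale; the greedy grouping strategy is an assumption, not necessarily what fuzzy_matching.py does:

```python
import difflib

def similarity(a, b):
    """Normalised similarity in [0, 1]; rapidfuzz.fuzz.ratio / 100 is the faster equivalent."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def group_similar(values, threshold=0.8):
    """Greedily place each value into the first group whose representative it matches."""
    groups = []
    for value in values:
        for group in groups:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            groups.append([value])
    return groups

# Hypothetical address column: near-duplicate addresses end up in one group.
# group_similar(["123 Main St", "123 Main Street", "9 Elm Rd"])
```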
- CSV (.csv)
- Excel (.xlsx)
- JSON (.json)
- Parquet (.parquet)
- Feather (.feather)
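Extension-based dispatch to pandas readers is one way a loader like data_loader.py might support multiple formats; the names below are a sketch under that assumption, not the module's actual code:

```python
import pandas as pd
from pathlib import Path

# Hypothetical extension-to-reader dispatch table.
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
}

def load_data(path):
    """Pick a pandas reader based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported file format: {suffix}")
    return _READERS[suffix](path)
```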
All logging information is saved in the logs/logfile.log file. The log includes details about data loading, the execution of geospatial analysis and fuzzy matching, and any errors encountered during processing.
We welcome contributions to the Database Analysis Toolkit! If you would like to contribute:
- Fork the repository.
- Create a new branch (git checkout -b feature/YourFeature).
- Make your changes and commit them (git commit -m 'Add some feature').
- Push to the branch (git push origin feature/YourFeature).
- Open a Pull Request.
This project is licensed under the GPL-3 License. See the LICENSE file for more details.
This toolkit leverages Python libraries such as pandas, rapidfuzz, and haversine to perform data analysis. We thank the open-source community for their continuous support and contributions.