- Great Expectations: data validation, documentation, and profiling
- Cerberus: lightweight data validation functionality
- PyJanitor: Pandas extension for data cleaning
- PyDQC: automatic data quality checking
- Feature-engine: transformer library for feature preparation and engineering
- pydantic: data parsing and validation using Python type hints
- Dora: exploratory data analysis toolkit for Python
- datacleaner: automatically cleans data sets and readies them for analysis
- whale: a lightweight data discovery, documentation, and quality engine for data warehouses
- bamboolib: a tool for fast and easy data exploration & transformation of pandas DataFrames
- pandas-summary: an extension to the pandas DataFrame describe function
- AugLy: a data augmentation library for audio, image, text, and video
- Numba: JIT compiler that translates Python and NumPy to fast machine code
- CuPy: NumPy-like API accelerated with CUDA
- Dask: parallel computing library
- Ray: framework for distributed applications
- Modin: parallelized Pandas with Dask or Ray
- Vaex: lazy memory-mapping dataframe for big data
- Joblib: disk-caching and parallelization
- RAPIDS: GPU acceleration for data science
- Polars: a blazingly fast DataFrames library implemented in Rust & Python
- DVC: data version control system
- Pachyderm: data pipelining (versioning, lineage/tracking, and parallelization)
- d6tflow: effective data workflow
- Metaflow: end-to-end independent workflow
- Dolt: relational database with version control
- Airflow: platform to programmatically author, schedule and monitor workflows
- Luigi: dependency resolution, workflow management, visualization, etc.
- Seaborn: data visualization based on Matplotlib
- HiPlot: interactive high-dimensional visualization for correlation and pattern discovery
- Plotly.py: interactive browser-based graphing library
- Altair: declarative visualization based on Vega and Vega-Lite
- TabPy: Tableau visualizations with Python
- Chartify: easy and flexible charts
- Pandas-Profiling: HTML profiling reports for Pandas DataFrames
- missingno: toolset of flexible and easy-to-use missing data visualizations and utilities
- Yellowbrick: Scikit-Learn visualization for model selection and hyperparameter tuning
- FlashTorch: visualization toolkit for neural networks in PyTorch
- Streamlit: turn data scripts into sharable web apps in minutes
- python-tabulate: pretty-print tabular data in Python, a library and a command-line utility
- Lux: Python API for intelligent visual data discovery
- bokeh: interactive data visualization in the browser, from Python
- NNI: automate ML/DL lifecycle (feature engineering, neural architecture search, model compression and hyperparameter tuning)
- Comet.ml: self-hosted and cloud-based meta machine learning platform for tracking, comparing, explaining and optimizing experiments and models
- MLflow: platform for the ML lifecycle, including experimentation, reproducibility and deployment
- Optuna: automatic hyperparameter optimization framework
- Hyperopt: serial and parallel optimization
- Tune: scalable experiment execution and hyperparameter tuning
- Determined: deep learning training platform
- Aim: a super-easy way to record, search and compare 1000s of ML training runs
- TPOT: a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming
- torchgpipe: a scalable pipeline parallelism library, which allows efficient training of large, memory-consuming models
- PipeDream: generalized pipeline parallelism for deep neural network training
- DeepSpeed: a deep learning optimization library that makes distributed training easy, efficient, and effective
- Horovod: a distributed deep learning training framework
- RaySGD: lightweight wrappers for distributed deep learning
- AdaptDL: a resource-adaptive deep learning training and scheduling framework
- Ignite: high-level library based on PyTorch
- PyTorch Lightning: lightweight wrapper for less boilerplate
- fastai: out-of-the-box tools and models for vision, text, and other data
- Skorch: Scikit-Learn interface for PyTorch models
- Pyro: deep universal probabilistic programming with PyTorch
- Kornia: differentiable computer vision library
- DGL: package for deep learning on graphs
- PyTorch Geometric: geometric deep learning extension library for PyTorch
- PyTorch-BigGraph: a distributed system for learning graph embeddings for large graphs
- Torchmeta: datasets and models for few-shot learning/meta-learning
- PyTorch3D: library for deep learning with 3D data
- learn2learn: meta-learning model implementations
- higher: higher-order optimization (e.g. through unrolled first-order optimization loops)
- Captum: model interpretability and understanding
- PyTorch summary: Keras-style summary for PyTorch models
- Catalyst: PyTorch framework for Deep Learning research and development
- Poutyne: a simplified framework for PyTorch that handles much of the boilerplate code needed to train neural networks
- Awesome-Pytorch-list: a comprehensive list of PyTorch-related content on GitHub, such as models, implementations, helper libraries, and tutorials
- DoWhy: causal inference combining causal graphical models and potential outcomes
- CausalML: a suite of uplift modeling and causal inference methods using machine learning algorithms based on recent research
- NetworkX: creation, manipulation, and study of complex networks/graphs
- Gym: toolkit for developing and comparing reinforcement learning algorithms
- Polygames: a platform of zero learning with a library of games
- Mlxtend: extensions and helper modules for data analysis and machine learning
- NLTK: a leading platform for building Python programs to work with human language data
- PyCaret: low-code machine learning library
- dabl: baseline library for data analysis
- OGB: benchmark datasets, data loaders and evaluators for graph machine learning
- AI Explainability 360: a toolkit for interpretability and explainability of datasets and machine learning models
- SDV: synthetic data generation for tabular, relational, time series data
- SHAP: game-theoretic approach to explain the output of any machine learning model
- TextBlob: a Python (2 and 3) library for processing textual data
- Google Datasets: high-demand public datasets
- Google Dataset Search: a search engine for freely-available online data
- OpenML: online platform for sharing data, ML algorithms and experiments
- DoltHub: data collaboration with Dolt
- OpenBlender: live-streamed open data sources
- Data Portal: a comprehensive list of open data portals from around the world
- Activeloop: unstructured dataset management for TensorFlow/PyTorch
- Best-of Machine Learning with Python: a ranked list of awesome machine learning Python libraries
- Machine Learning Systems Design by Chip Huyen
- Rules of Machine Learning: Best Practices for ML Engineering by Martin Zinkevich
- Awesome Data Science: an awesome data science repository to learn from and apply to real-world problems
- Cookiecutter Data Science: a logical, reasonably standardized, but flexible project structure
- PyTorch Template Project: PyTorch deep learning project template