Table of Contents

Packages

Data Analysis, Augmentation, Validation and Cleaning

  • Great Expectations: data validation, documenting, and profiling
  • Cerberus: lightweight data validation functionality
  • PyJanitor: Pandas extension for data cleaning
  • PyDQC: automatic data quality checking
  • Feature-engine: transformer library for feature preparation and engineering
  • pydantic: data parsing and validation using Python type hints (minimal sketch below)
  • Dora: exploratory data analysis toolkit for Python
  • datacleaner: automatically cleans data sets and readies them for analysis
  • whale: a lightweight data discovery, documentation, and quality engine for data warehouses
  • bamboolib: a tool for fast and easy data exploration & transformation of pandas DataFrames
  • pandas-summary: an extension to the pandas DataFrame describe() function
  • AugLy: a data augmentation library for audio, image, text, and video
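
As a quick taste of the type-hint-driven validation style pydantic offers, a minimal sketch (the `Record` model and its fields are invented purely for illustration):

```python
from pydantic import BaseModel, ValidationError

class Record(BaseModel):
    # hypothetical schema, purely for illustration
    name: str
    age: int

try:
    Record(name="Ada", age="not a number")  # coercion fails and raises ValidationError
except ValidationError as exc:
    print(exc)
```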

Performance and Caching

  • Numba: JIT compiler that translates Python and NumPy to fast machine code (minimal sketch below)
  • CuPy: NumPy-like API accelerated with CUDA
  • Dask: parallel computing library
  • Ray: framework for distributed applications
  • Modin: parallelized Pandas with Dask or Ray
  • Vaex: lazy memory-mapping dataframe for big data
  • Joblib: disk-caching and parallelization
  • RAPIDS: GPU acceleration for data science
  • Polars: a blazingly fast DataFrames library implemented in Rust & Python
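
To give a flavor of the JIT-style speedups these tools target, a minimal Numba sketch (the `mean_abs` function is a toy example):

```python
import numpy as np
from numba import njit

@njit  # compile this loop to machine code on first call
def mean_abs(x):
    total = 0.0
    for v in x:
        total += abs(v)
    return total / x.size

print(mean_abs(np.random.rand(1_000_000)))
```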

Data Version Control and Workflow

  • DVC: data version control system
  • Pachyderm: data pipelining (versioning, lineage/tracking, and parallelization)
  • d6tflow: effective data workflow
  • Metaflow: framework for building and managing end-to-end data science workflows
  • Dolt: relational database with version control
  • Airflow: platform to programmatically author, schedule and monitor workflows (minimal sketch below)
  • Luigi: dependency resolution, workflow management, visualization, etc.
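
As one example of defining a workflow programmatically, a minimal Airflow DAG sketch (assumes Airflow 2.x; the DAG id and task bodies are placeholders):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract step")  # placeholder task body

def transform():
    print("transform step")  # placeholder task body

with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule=None,                   # `schedule` assumes Airflow >= 2.4; older 2.x uses schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task   # run transform after extract
```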

Visualization and Presentation

  • Seaborn: data visualization based on Matplotlib (minimal sketch below)
  • HiPlot: interactive high-dimensional visualization for correlation and pattern discovery
  • Plotly.py: interactive browser-based graphing library
  • Altair: declarative visualization based on Vega and Vega-Lite
  • TabPy: Tableau visualizations with Python
  • Chartify: easy and flexible charts
  • Pandas-Profiling: HTML profiling reports for Pandas DataFrames
  • missingno: toolset of flexible and easy-to-use missing data visualizations and utilities
  • Yellowbrick: Scikit-Learn visualization for model selection and hyperparameter tuning
  • FlashTorch: visualization toolkit for neural networks in PyTorch
  • Streamlit: turn data scripts into sharable web apps in minutes
  • python-tabulate: pretty-print tabular data in Python, a library and a command-line utility
  • Lux: Python API for intelligent visual data discovery
  • bokeh: interactive data visualization in the browser, from Python
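
Most of these libraries follow a similar dataframe-in, chart-out pattern; a minimal Seaborn sketch with synthetic data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# synthetic data purely for illustration
df = pd.DataFrame({"x": range(50), "y": [0.5 * i + (i % 7) for i in range(50)]})
sns.scatterplot(data=df, x="x", y="y")
plt.savefig("scatter.png")
```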

Project Lifecycles and Hyperparameter Optimization

  • NNI: automate ML/DL lifecycle (feature engineering, neural architecture search, model compression and hyperparameter tuning)
  • Comet.ml: self-hosted and cloud-based meta machine learning platform for tracking, comparing, explaining and optimizing experiments and models
  • MLflow: platform for the ML lifecycle, including experimentation, reproducibility and deployment
  • Optuna: automatic hyperparameter optimization framework (minimal sketch below)
  • Hyperopt: serial and parallel optimization
  • Tune: scalable experiment execution and hyperparameter tuning
  • Determined: deep learning training platform
  • Aim: a super-easy way to record, search and compare 1000s of ML training runs
  • TPOT: a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming
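
To illustrate the define-by-run search style Optuna uses, a minimal sketch with a toy objective (a real study would train and validate a model inside `objective`):

```python
import optuna

def objective(trial):
    # toy quadratic objective, purely for illustration
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```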

Distribution, Pipelining, and Sharding

  • torchgpipe: a scalable pipeline parallelism library, which allows efficient training of large, memory-consuming models
  • PipeDream: generalized pipeline parallelism for deep neural network training
  • DeepSpeed: a deep learning optimization library that makes distributed training easy, efficient, and effective
  • Horovod: a distributed deep learning training framework (minimal sketch below)
  • RaySGD: lightweight wrappers for distributed deep learning
  • AdaptDL: a resource-adaptive deep learning training and scheduling framework
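
As a rough sketch of how these wrappers hook into an existing training loop, the typical Horovod setup (assumes one GPU per process, launched with `horovodrun`; the model is a toy placeholder):

```python
import torch
import horovod.torch as hvd

hvd.init()  # one process per GPU, e.g. `horovodrun -np 4 python train.py`
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()                                  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale lr by worker count

# start every worker from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# average gradients across workers on each optimizer step
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
```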

PyTorch Extensions

  • Ignite: high-level library based on PyTorch
  • PyTorch Lightning: lightweight wrapper for less boilerplate (minimal sketch below)
  • fastai: out-of-the-box tools and models for vision, text, and other data
  • Skorch: Scikit-Learn interface for PyTorch models
  • Pyro: deep universal probabilistic programming with PyTorch
  • Kornia: differentiable computer vision library
  • DGL: package for deep learning on graphs
  • PyTorch Geometric: geometric deep learning extension library for PyTorch
  • PyTorch-BigGraph: a distributed system for learning graph embeddings for large graphs
  • Torchmeta: datasets and models for few-shot-learning/meta-learning
  • PyTorch3D: library for deep learning with 3D data
  • learn2learn: meta-learning model implementations
  • higher: higher-order (unrolled first-order) optimization
  • Captum: model interpretability and understanding
  • PyTorch summary: Keras style summary for PyTorch models
  • Catalyst: PyTorch framework for Deep Learning research and development
  • Poutyne: a simplified framework for PyTorch that handles much of the boilerplate code needed to train neural networks
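
To show how much boilerplate these wrappers remove, a minimal PyTorch Lightning sketch (assumes a recent Lightning release; the model and data are synthetic toys):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(8, 1)  # toy model

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# synthetic data just to make the sketch self-contained
x, y = torch.randn(256, 8), torch.randn(256, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=32)

pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False).fit(LitRegressor(), loader)
```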

Miscellaneous

  • Awesome-Pytorch-list: a comprehensive list of PyTorch-related content on GitHub, such as models, implementations, helper libraries, tutorials, etc.
  • DoWhy: causal inference combining causal graphical models and potential outcomes
  • CausalML: a suite of uplift modeling and causal inference methods using machine learning algorithms based on recent research
  • NetworkX: creation, manipulation, and study of complex networks/graphs (minimal sketch below)
  • Gym: toolkit for developing and comparing reinforcement learning algorithms
  • Polygames: a platform of zero learning with a library of games
  • Mlxtend: extensions and helper modules for data analysis and machine learning
  • NLTK: a leading platform for building Python programs to work with human language data
  • PyCaret: low-code machine learning library
  • dabl: baseline library for data analysis
  • OGB: benchmark datasets, data loaders and evaluators for graph machine learning
  • AI Explainability 360: a toolkit for interpretability and explainability of datasets and machine learning models
  • SDV: synthetic data generation for tabular, relational, time series data
  • SHAP: game-theoretic approach to explain the output of any machine learning model
  • TextBlob: a Python (2 and 3) library for processing textual data
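
As a small NetworkX example of the graph tooling above (using the karate-club graph bundled with the library):

```python
import networkx as nx

G = nx.karate_club_graph()  # classic 34-node example graph shipped with NetworkX
print(G.number_of_nodes(), G.number_of_edges())
print(nx.shortest_path(G, source=0, target=33))
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:3])  # most central nodes
```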

Datasets:

  • Google Datasets: high-demand public datasets
  • Google Dataset Search: a search engine for freely-available online data
  • OpenML: online platform for sharing data, ML algorithms and experiments (fetch example below)
  • DoltHub: data collaboration with Dolt
  • OpenBlender: live-streamed open data sources
  • Data Portal: a comprehensive list of open data portals from around the world
  • Activeloop: unstructured dataset management for TensorFlow/PyTorch
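
OpenML datasets can also be pulled straight into Python; a minimal sketch using scikit-learn's `fetch_openml` (requires network access; the dataset name is just an example):

```python
from sklearn.datasets import fetch_openml

# downloads the named dataset from OpenML on first call
X, y = fetch_openml("iris", version=1, return_X_y=True, as_frame=True)
print(X.shape, y.value_counts())
```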

Libraries:

Readings:

Other ML/DL Templates: