Table of Contents

Packages

Data Analysis, Augmentation, Validation and Cleaning

  • Great Expectations: data validation, documenting, and profiling
  • Cerberus: lightweight data validation functionality
  • PyJanitor: Pandas extension for data cleaning
  • PyDQC: automatic data quality checking
  • Feature-engine: transformer library for feature preparation and engineering
  • pydantic: data parsing and validation using Python type hints (minimal sketch below)
  • Dora: exploratory data analysis toolkit for Python
  • datacleaner: automatically cleans data sets and readies them for analysis
  • whale: a lightweight data discovery, documentation, and quality engine for data warehouses
  • bamboolib: a tool for fast and easy data exploration & transformation of pandas DataFrames
  • pandas-summary: an extension to the pandas DataFrame describe() function
  • AugLy: a data augmentation library for audio, image, text, and video
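
As a quick taste of the type-hint-driven validation style pydantic offers, a minimal sketch (the `Record` model and its fields are invented purely for illustration):

```python
from pydantic import BaseModel, ValidationError

class Record(BaseModel):
    # hypothetical schema, purely for illustration
    name: str
    age: int

try:
    Record(name="Ada", age="not a number")  # coercion fails and raises ValidationError
except ValidationError as exc:
    print(exc)
```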

Performance and Caching

  • Numba: JIT compiler that translates Python and NumPy to fast machine code (minimal sketch below)
  • CuPy: NumPy-like API accelerated with CUDA
  • Dask: parallel computing library
  • Ray: framework for distributed applications
  • Modin: parallelized Pandas with Dask or Ray
  • Vaex: lazy memory-mapping dataframe for big data
  • Joblib: disk-caching and parallelization
  • RAPIDS: GPU acceleration for data science
  • Polars: a blazingly fast DataFrames library implemented in Rust & Python
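
To give a flavor of the JIT-style speedups these tools target, a minimal Numba sketch (the `mean_abs` function is a toy example):

```python
import numpy as np
from numba import njit

@njit  # compile this loop to machine code on first call
def mean_abs(x):
    total = 0.0
    for v in x:
        total += abs(v)
    return total / x.size

print(mean_abs(np.random.rand(1_000_000)))
```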

Data Version Control and Workflow

  • DVC: data version control system
  • Pachyderm: data pipelining (versioning, lineage/tracking, and parallelization)
  • d6tflow: effective data workflow
  • Metaflow: framework for building and managing end-to-end data science workflows
  • Dolt: relational database with version control
  • Airflow: platform to programmatically author, schedule and monitor workflows (minimal sketch below)
  • Luigi: dependency resolution, workflow management, visualization, etc.
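
As one example of defining a workflow programmatically, a minimal Airflow DAG sketch (assumes Airflow 2.x; the DAG id and task bodies are placeholders):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract step")  # placeholder task body

def transform():
    print("transform step")  # placeholder task body

with DAG(
    dag_id="example_etl",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule=None,                   # `schedule` assumes Airflow >= 2.4; older 2.x uses schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task   # run transform after extract
```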

Visualization and Presentation

  • Seaborn: data visualization based on Matplotlib (minimal sketch below)
  • HiPlot: interactive high-dimensional visualization for correlation and pattern discovery
  • Plotly.py: interactive browser-based graphing library
  • Altair: declarative visualization based on Vega and Vega-Lite
  • TabPy: Tableau visualizations with Python
  • Chartify: easy and flexible charts
  • Pandas-Profiling: HTML profiling reports for Pandas DataFrames
  • missingno: toolset of flexible and easy-to-use missing data visualizations and utilities
  • Yellowbrick: Scikit-Learn visualization for model selection and hyperparameter tuning
  • FlashTorch: visualization toolkit for neural networks in PyTorch
  • Streamlit: turn data scripts into sharable web apps in minutes
  • python-tabulate: pretty-print tabular data in Python, a library and a command-line utility
  • Lux: Python API for intelligent visual data discovery
  • bokeh: interactive data visualization in the browser, from Python
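
Most of these libraries follow a similar dataframe-in, chart-out pattern; a minimal Seaborn sketch with synthetic data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# synthetic data purely for illustration
df = pd.DataFrame({"x": range(50), "y": [0.5 * i + (i % 7) for i in range(50)]})
sns.scatterplot(data=df, x="x", y="y")
plt.savefig("scatter.png")
```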

Project Lifecycles and Hyperparameter Optimization

  • NNI: automate ML/DL lifecycle (feature engineering, neural architecture search, model compression and hyperparameter tuning)
  • Comet.ml: self-hosted and cloud-based meta machine learning platform for tracking, comparing, explaining and optimizing experiments and models
  • MLflow: platform for the ML lifecycle, including experimentation, reproducibility and deployment
  • Optuna: automatic hyperparameter optimization framework (minimal sketch below)
  • Hyperopt: serial and parallel optimization
  • Tune: scalable experiment execution and hyperparameter tuning
  • Determined: deep learning training platform
  • Aim: a super-easy way to record, search and compare 1000s of ML training runs
  • TPOT: a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming
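
To illustrate the define-by-run search style Optuna uses, a minimal sketch with a toy objective (a real study would train and validate a model inside `objective`):

```python
import optuna

def objective(trial):
    # toy quadratic objective, purely for illustration
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```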

Distribution, Pipelining, and Sharding

  • torchgpipe: a scalable pipeline parallelism library, which allows efficient training of large, memory-consuming models
  • PipeDream: generalized pipeline parallelism for deep neural network training
  • DeepSpeed: a deep learning optimization library that makes distributed training easy, efficient, and effective
  • Horovod: a distributed deep learning training framework (minimal sketch below)
  • RaySGD: lightweight wrappers for distributed deep learning
  • AdaptDL: a resource-adaptive deep learning training and scheduling framework
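
As a rough sketch of how these wrappers hook into an existing training loop, the typical Horovod setup (assumes one GPU per process, launched with `horovodrun`; the model is a toy placeholder):

```python
import torch
import horovod.torch as hvd

hvd.init()  # one process per GPU, e.g. `horovodrun -np 4 python train.py`
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()                                  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale lr by worker count

# start every worker from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# average gradients across workers on each optimizer step
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
```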

PyTorch Extensions

  • Ignite: high-level library based on PyTorch
  • PyTorch Lightning: lightweight wrapper for less boilerplate (minimal sketch below)
  • fastai: out-of-the-box tools and models for vision, text, and other data
  • Skorch: Scikit-Learn interface for PyTorch models
  • Pyro: deep universal probabilistic programming with PyTorch
  • Kornia: differentiable computer vision library
  • DGL: package for deep learning on graphs
  • PyTorch Geometric: geometric deep learning extension library for PyTorch
  • PyTorch-BigGraph: a distributed system for learning graph embeddings for large graphs
  • Torchmeta: datasets and models for few-shot-learning/meta-learning
  • PyTorch3D: library for deep learning with 3D data
  • learn2learn: meta-learning model implementations
  • higher: higher-order (unrolled first-order) optimization
  • Captum: model interpretability and understanding
  • PyTorch summary: Keras style summary for PyTorch models
  • Catalyst: PyTorch framework for Deep Learning research and development
  • Poutyne: a simplified framework for PyTorch that handles much of the boilerplate code needed to train neural networks
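
To show how much boilerplate these wrappers remove, a minimal PyTorch Lightning sketch (assumes a recent Lightning release; the model and data are synthetic toys):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(8, 1)  # toy model

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# synthetic data just to make the sketch self-contained
x, y = torch.randn(256, 8), torch.randn(256, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=32)

pl.Trainer(max_epochs=1, logger=False, enable_checkpointing=False).fit(LitRegressor(), loader)
```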

Miscellaneous

  • Awesome-Pytorch-list: a comprehensive list of PyTorch-related content on GitHub, such as models, implementations, helper libraries, tutorials, etc.
  • DoWhy: causal inference combining causal graphical models and potential outcomes
  • CausalML: a suite of uplift modeling and causal inference methods using machine learning algorithms based on recent research
  • NetworkX: creation, manipulation, and study of complex networks/graphs (minimal sketch below)
  • Gym: toolkit for developing and comparing reinforcement learning algorithms
  • Polygames: a platform of zero learning with a library of games
  • Mlxtend: extensions and helper modules for data analysis and machine learning
  • NLTK: a leading platform for building Python programs to work with human language data
  • PyCaret: low-code machine learning library
  • dabl: baseline library for data analysis
  • OGB: benchmark datasets, data loaders and evaluators for graph machine learning
  • AI Explainability 360: a toolkit for interpretability and explainability of datasets and machine learning models
  • SDV: synthetic data generation for tabular, relational, time series data
  • SHAP: game-theoretic approach to explain the output of any machine learning model
  • TextBlob: a Python (2 and 3) library for processing textual data
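
As a small NetworkX example of the graph tooling above (using the karate-club graph bundled with the library):

```python
import networkx as nx

G = nx.karate_club_graph()  # classic 34-node example graph shipped with NetworkX
print(G.number_of_nodes(), G.number_of_edges())
print(nx.shortest_path(G, source=0, target=33))
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:3])  # most central nodes
```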

Datasets:

  • Google Datasets: high-demand public datasets
  • Google Dataset Search: a search engine for freely-available online data
  • OpenML: online platform for sharing data, ML algorithms and experiments (fetch example below)
  • DoltHub: data collaboration with Dolt
  • OpenBlender: live-streamed open data sources
  • Data Portal: a comprehensive list of open data portals from around the world
  • Activeloop: unstructured dataset management for TensorFlow/PyTorch
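
OpenML datasets can also be pulled straight into Python; a minimal sketch using scikit-learn's `fetch_openml` (requires network access; the dataset name is just an example):

```python
from sklearn.datasets import fetch_openml

# downloads the named dataset from OpenML on first call
X, y = fetch_openml("iris", version=1, return_X_y=True, as_frame=True)
print(X.shape, y.value_counts())
```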

Libraries:

Readings:

Other ML/DL Templates: