Initial work toward PyTorch data loaders #1

Closed
wants to merge 71 commits into from
71 commits
acf584f
initial commit of pytorch datapipe/loader
bkmartinjr Jul 31, 2024
12237b0
update comments
bkmartinjr Jul 31, 2024
2fc9beb
more lint
bkmartinjr Jul 31, 2024
220a11b
fix typos
bkmartinjr Aug 1, 2024
2c870ea
rework for performance
bkmartinjr Aug 20, 2024
1aedf3d
tuning
bkmartinjr Aug 21, 2024
e577ecd
tweaks, checkpoint
bkmartinjr Aug 21, 2024
ee2929d
lint
bkmartinjr Aug 21, 2024
013cea6
py 3.8 lint
bkmartinjr Aug 22, 2024
39dbab6
rework io and shuffle buffer size params
bkmartinjr Aug 22, 2024
98c4510
lint
bkmartinjr Aug 24, 2024
933787d
remove encoders; more perf work
bkmartinjr Aug 24, 2024
d44fcae
reorganize into separate python package
bkmartinjr Aug 25, 2024
8840983
fix name
bkmartinjr Aug 25, 2024
bb70d6d
add more paths to CI
bkmartinjr Aug 25, 2024
8e51344
fix typo in ci
bkmartinjr Aug 25, 2024
6714a89
fix a second typo in ci
bkmartinjr Aug 25, 2024
5505c28
set working dir in CI
bkmartinjr Aug 25, 2024
6090af2
make batched 3.12 compat
bkmartinjr Aug 25, 2024
c9789f9
debugging pre-commit failure
bkmartinjr Aug 25, 2024
33f3c2d
lint, lint, lint
bkmartinjr Aug 25, 2024
467bb15
more CI debugging
bkmartinjr Aug 25, 2024
52efb62
add build test to CI
bkmartinjr Aug 25, 2024
6ab8334
add code coverage
bkmartinjr Aug 25, 2024
4df5049
update GHA
bkmartinjr Aug 25, 2024
1b31d32
test TypeAlias
bkmartinjr Aug 25, 2024
71be802
add missing dependencies
bkmartinjr Aug 25, 2024
a0a8344
extend tests
bkmartinjr Aug 25, 2024
13da9c7
remove coverage reporting from CI for now
bkmartinjr Aug 25, 2024
7074e99
docstrings
bkmartinjr Aug 25, 2024
47d7052
more file organization
bkmartinjr Aug 25, 2024
c16b68d
add missing test
bkmartinjr Aug 25, 2024
2d06476
re-run notebook
bkmartinjr Aug 25, 2024
778022d
update changelog
bkmartinjr Aug 25, 2024
05e7fc5
add collate unit test
bkmartinjr Aug 26, 2024
1e442a7
clean up experiment_dataloader function
bkmartinjr Aug 26, 2024
b01f609
docstrings
bkmartinjr Aug 27, 2024
7b56809
fix typo in notebook name (thanks Ryan!)
bkmartinjr Aug 28, 2024
330bf43
checkpoint updates
bkmartinjr Aug 30, 2024
745d600
update tests to include _CSR tests
bkmartinjr Aug 30, 2024
065f78c
fix typo in method name
bkmartinjr Aug 30, 2024
0fcdd11
tuning
bkmartinjr Aug 30, 2024
0b8f786
update demo notebook
bkmartinjr Aug 30, 2024
5ab985b
add to README
bkmartinjr Aug 30, 2024
075b1ab
concurrency tweak
bkmartinjr Aug 31, 2024
34d4952
additional memory reductions
bkmartinjr Aug 31, 2024
c2e7fac
DDP/multi-GPU support
bkmartinjr Sep 4, 2024
723fa21
add further concurrency to CSR construction
bkmartinjr Sep 6, 2024
ea38c5c
cleanup
bkmartinjr Sep 10, 2024
f704c83
fix multi-gpu hang due to incorrect __len__ return value
bkmartinjr Sep 13, 2024
8e47320
compat with Lightning
bkmartinjr Sep 14, 2024
70cc170
PR review edits
bkmartinjr Sep 14, 2024
37bc9b1
formatting
bkmartinjr Sep 14, 2024
b0c4547
add py.typed to package
bkmartinjr Sep 16, 2024
8ae3992
add sparse support
bkmartinjr Sep 16, 2024
9809c5c
start draft of Ligtning notebook
bkmartinjr Sep 16, 2024
656e2e8
lint
bkmartinjr Sep 16, 2024
f9e13b0
update notebook for lightning
bkmartinjr Sep 17, 2024
eaeaab4
run notebooks
bkmartinjr Sep 17, 2024
61face3
fix RNG state bug in shuffle; add multi-worker notebook
bkmartinjr Sep 18, 2024
6ccad8e
add `rehome-census.sh`, used to construct this repo's history
ryan-williams Sep 18, 2024
9804ee7
update GHA, repo info
ryan-williams Sep 18, 2024
5347805
add `.pre-commit-config.yaml`, run lint
ryan-williams Sep 18, 2024
da1389a
add .gitignore
bkmartinjr Sep 19, 2024
672d306
autoupdate pre-commit
bkmartinjr Sep 19, 2024
975dd12
remove tiledbsoma-specific ruff/isort rules
bkmartinjr Sep 19, 2024
96c1516
Package dependency pins & test (#2)
bkmartinjr Sep 19, 2024
2a61940
additional tests of implementation base class
bkmartinjr Sep 23, 2024
9ee335b
Merge branch 'initial-pr' of https://github.com/TileDB-Inc/TileDB-SOM…
bkmartinjr Sep 23, 2024
ccdfccd
refactoring test params
bkmartinjr Sep 23, 2024
aef4c15
fix py 3.9 incompatibility
bkmartinjr Sep 23, 2024
68 changes: 65 additions & 3 deletions .github/workflows/python-tiledbsoma-ml.yml
@@ -1,10 +1,72 @@
-name: python-tiledbsoma-ml
 name: python-tiledbsoma-ml CI

 on:
   pull_request:
     branches: ["*"]
     paths-ignore:
       - 'scripts/**'
   push:
     branches: [main]
     paths-ignore:
       - 'scripts/**'

   workflow_dispatch:

 jobs:
-  job:
   lint:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4

       - uses: actions/setup-python@v5
         with:
           python-version: "3.11"

       - name: Restore pre-commit cache
         uses: actions/cache@v4
         with:
           path: ~/.cache/pre-commit
           key: pre-commit-${{ hashFiles('.pre-commit-config.yaml') }}

       - name: Install pre-commit
         run: pip -v install pre-commit

       - name: Run pre-commit hooks on all files
         run: pre-commit run -v -a

   tests:
     runs-on: ubuntu-latest
     strategy:
       fail-fast: false
       matrix:
         python-version: ["3.9", "3.10", "3.11", "3.12"]
     steps:
-      # Empty job; placeholder GHA
       - uses: actions/checkout@v4

       - uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
           cache: pip

       - name: Install prereqs
         run: |
           pip install --upgrade pip wheel pytest pytest-cov setuptools
           pip install .

       - name: Run tests
         run: pytest -v --cov=src --cov-report=xml tests

   build:
     # for now, just do a test build to ensure that it works
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4

       - uses: actions/setup-python@v5
         with:
           python-version: "3.11"

       - name: Do build
         run: |
           pip install --upgrade build pip wheel setuptools setuptools-scm
           python -m build .
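
Reviewer note: the `lint` job keys its pre-commit cache on a hash of `.pre-commit-config.yaml`, so the cache is invalidated whenever the hook configuration changes. The sketch below illustrates that idea in Python. It is an approximation only: the `cache_key` helper is hypothetical, and the resulting string is not claimed to be byte-identical to what GitHub's `hashFiles` expression produces.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(prefix: str, *files: Path) -> str:
    """Hypothetical helper: build a cache key from a prefix and file contents."""
    combined = hashlib.sha256()
    for path in sorted(files):
        # Hash each file's bytes, then fold that hash into the combined digest.
        combined.update(hashlib.sha256(path.read_bytes()).digest())
    return f"{prefix}-{combined.hexdigest()}"


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / ".pre-commit-config.yaml"

    config.write_text("repos: []\n")
    key_before = cache_key("pre-commit", config)

    # Any edit to the config yields a different key, i.e. a cache miss.
    config.write_text("repos: []\n# changed\n")
    key_after = cache_key("pre-commit", config)

print(key_before != key_after)
```

Because the key depends only on file contents, unrelated commits keep reusing the same cached hook environments.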
46 changes: 46 additions & 0 deletions .github/workflows/python-tilledbsoma-ml-compat.yml
@@ -0,0 +1,46 @@
name: python-tiledbsoma-ml past tiledbsoma compat # Latest tiledbsoma version covered by another workflow

on:
  pull_request:
    branches: ["*"]
    paths-ignore:
      - "scripts/**"
      - "notebooks/**"
  push:
    branches: [main]
    paths-ignore:
      - "scripts/**"
      - "notebooks/**"

jobs:
  unit_tests:
    strategy:
      fail-fast: false
      matrix:
        os: ["ubuntu-latest"] # could add 'macos-latest', but the matrix is already huge...
        python-version: ["3.9", "3.10", "3.11"] # TODO: add 3.12 when tiledbsoma releases wheels for it.
        pkg-version:
          - "tiledbsoma~=1.9.0 'numpy<2.0.0'"
          - "tiledbsoma~=1.10.0 'numpy<2.0.0'"
          - "tiledbsoma~=1.11.0"
          - "tiledbsoma~=1.12.0"
          - "tiledbsoma~=1.13.0"
          - "tiledbsoma~=1.14.0"

    runs-on: ${{ matrix.os }}

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: pip

      - name: Install prereqs
        run: |
          pip install --upgrade pip pytest setuptools
          pip install ${{ matrix.pkg-version }} .

      - name: Run tests
        run: pytest -v tests
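
Reviewer note: the compat matrix above is the cross product of `os`, `python-version`, and `pkg-version`, and each `~=` pin is a PEP 440 "compatible release" specifier (`~=1.9.0` means `>=1.9.0, ==1.9.*`). The following sketch enumerates the jobs and checks a few versions against a pin. The helper names (`parse`, `compatible_release`) are illustrative assumptions, and the version parsing handles only plain dotted numeric versions, not the full PEP 440 grammar.

```python
from itertools import product

os_list = ["ubuntu-latest"]
python_versions = ["3.9", "3.10", "3.11"]
pkg_versions = [
    "tiledbsoma~=1.9.0 'numpy<2.0.0'",
    "tiledbsoma~=1.10.0 'numpy<2.0.0'",
    "tiledbsoma~=1.11.0",
    "tiledbsoma~=1.12.0",
    "tiledbsoma~=1.13.0",
    "tiledbsoma~=1.14.0",
]

# One CI job per combination of the three matrix axes.
jobs = list(product(os_list, python_versions, pkg_versions))
print(len(jobs))  # 1 * 3 * 6 = 18


def parse(version: str) -> tuple:
    """Parse a plain dotted numeric version such as '1.9.0'."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(spec: str, candidate: str) -> bool:
    """Return True if `candidate` satisfies `~=spec` (numeric versions only)."""
    base = parse(spec)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    cand = parse(candidate)
    # Pad the candidate with zeros so '1.9' compares like '1.9.0'.
    cand = cand + (0,) * (len(base) - len(cand))
    # Must be >= the pinned version and share every component but the last.
    return cand >= base and cand[: len(base) - 1] == base[:-1]


print(compatible_release("1.9.0", "1.9.3"))   # True
print(compatible_release("1.9.0", "1.10.0"))  # False
```

Adding `macos-latest` to `os` would double the job count to 36, which is the growth the inline comment in the workflow warns about.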
162 changes: 162 additions & 0 deletions .gitignore
@@ -0,0 +1,162 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
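
Reviewer note: several of the patterns above use shell-style globs, e.g. `*.py[cod]` matches `.pyc`, `.pyo`, and `.pyd` files but not `.py` sources. The sketch below approximates that matching with Python's `fnmatch` module. This is an assumption for illustration only: real gitignore semantics (directory-only patterns with a trailing `/`, anchoring with a leading `/`, `**`) are richer than `fnmatch`, and the `ignored` helper is hypothetical.

```python
from fnmatch import fnmatchcase

# A few simple file-name patterns taken from the .gitignore above.
patterns = ["*.py[cod]", "*$py.class", "*.so", ".coverage.*"]


def ignored(name: str) -> bool:
    """Hypothetical helper: True if a bare file name matches any pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for name in ["module.pyc", "module.pyd", "module.py", "Foo$py.class", ".coverage.host.123"]:
    print(name, ignored(name))
```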
23 changes: 23 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,23 @@
repos:
  - repo: https://github.com/psf/black
    rev: "24.8.0"
    hooks:
      - id: black

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.5
    hooks:
      - id: ruff
        name: "ruff for tiledbsoma_ml"
        args: ["--config=pyproject.toml"]

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.11.2
    hooks:
      - id: mypy
        pass_filenames: false
        args: ["--config-file=pyproject.toml", "src"]
        additional_dependencies:
          - attrs
          - numpy
          - pandas-stubs>=2
24 changes: 24 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,24 @@

# Change Log

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/)
and this project adheres to [Semantic Versioning](http://semver.org/).

## [Unreleased] - yyyy-mm-dd

Port and enhance contribution from the Chan Zuckerberg Initiative Foundation
[CELLxGENE](https://cellxgene.cziscience.com/) project.

This is not a one-for-one migration of the contributed code. Substantial changes have
been made to extend the package's utility (e.g., multi-GPU support), improve API usability, etc.

### Added

- Initial commits via [PR #1](https://github.com/single-cell-data/TileDB-SOMA-ML/pull/1)
- Refine package dependency pins and compatibility tests via [PR #2](https://github.com/single-cell-data/TileDB-SOMA-ML/pull/2)

### Changed

### Fixed
45 changes: 45 additions & 0 deletions README.md
@@ -0,0 +1,45 @@

# tiledbsoma_ml

A Python package containing ML tools for use with `tiledbsoma`.

## Description

The package currently contains a prototype PyTorch `IterableDataset` for use with the
[`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)
API.

## Getting Started

### Installing

Install using your favorite package installer. For example, with pip:

> pip install tiledbsoma-ml

Developers may install editable, from source, in the usual manner:

> pip install -e .

### Documentation

TBD

## Builds

This is a pure Python package. To build a wheel, ensure you have the `build` package installed, and then:

> python -m build .

## Version History

See the [CHANGELOG.md](CHANGELOG.md) file.

## License

This project is licensed under the MIT License.

## Acknowledgements

The SOMA team is grateful to the Chan Zuckerberg Initiative Foundation [CELLxGENE Census](https://cellxgene.cziscience.com)
team for their initial contribution.
13 changes: 0 additions & 13 deletions apis/python/src/tiledbsoma/ml/__init__.py

This file was deleted.
