Prerequisites:
- Docker or Podman (note: unit tests run with Podman by default)
  - If using `docker` - make sure it's usable without `sudo` (guidelines)
  - If using `podman` - make sure it's set up to run rootless containers (guidelines)
- Rust toolset
  - Install `rustup`
  - The correct toolchain version will be automatically installed based on the `rust-toolchain` file in the repository
- Tools used by tests
- Code generation tools (optional - needed if you will be updating schemas)
  - Install `flatc`
  - Install `protoc`, followed by:
    - `cargo install protoc-gen-prost` - to install the prost protobuf plugin
    - `cargo install protoc-gen-tonic` - to install the tonic protobuf plugin
- Cargo toolbelt
  - Prerequisites:
    - `cargo install cargo-update` - to easily keep your tools up-to-date
    - `cargo install cargo-binstall` - to install binaries without compiling
    - `cargo binstall cargo-binstall --force -y` - make future updates of `binstall` use a precompiled version
  - Recommended:
    - `cargo binstall cargo-nextest -y` - advanced test runner
    - `cargo binstall bunyan -y` - for pretty-printing the JSON logs
    - `cargo binstall cargo-llvm-cov -y` - for test coverage
  - Optional - if you will be doing releases:
    - `cargo binstall cargo-edit -y` - for setting crate versions during release
    - `cargo binstall cargo-update -y` - for keeping up with major dependency updates
    - `cargo binstall cargo-deny -y` - for linting dependencies
    - `cargo binstall cargo-udeps -y` - for detecting unused dependencies
  - To keep all these cargo tools up-to-date use `cargo install-update -a`
- Database tools (optional, unless modifying repositories is necessary):
  - Install the Postgres command line client `psql`:
    - deb: `sudo apt install -y postgresql-client`
    - rpm: `sudo dnf install -y postgresql`
  - Install the MariaDB command line client `mariadb`:
    - deb: `sudo apt install -y mariadb-client`
    - rpm: `sudo dnf install -y mariadb`
  - Install `sqlx-cli`: `cargo install sqlx-cli`
Clone the repository:

```shell
git clone git@github.com:kamu-data/kamu-cli.git
```

Build the project:

```shell
cd kamu-cli
cargo build
```

To use your locally-built `kamu` executable, link it like so:

```shell
ln -s $PWD/target/debug/kamu-cli ~/.local/bin/kamu
```

When you need to test against a specific official release, you can install it under a different alias:

```shell
curl -s "https://get.kamu.dev" | KAMU_ALIAS=kamu-release sh
```
New to Rust? Check out these IDE configuration tips.
Set Podman as the preferred runtime for your user:

```shell
cargo run -- config set --user engine.runtime podman
```

When you run tests or use `kamu` anywhere in your user directory, it will now use the `podman` runtime.

If you need to run some tests under Docker, use:

```shell
KAMU_CONTAINER_RUNTIME_TYPE=docker cargo test <some_test>
```
By default, we define the `SQLX_OFFLINE=true` environment variable to ensure that compilation succeeds without access to a live database.
This default mode is fine in most cases, assuming your assignment is not directly related to databases/repositories.

When databases have to be touched, set up local database containers using the following script:

```shell
make sqlx-local-setup
```

This mode:
- creates Docker containers with empty databases
- applies all database migrations from scratch
- generates `.env` files in specific crates that point to the databases running in Docker containers by setting the `DATABASE_URL` variable, and disables the `SQLX_OFFLINE` variable in those crates

This setup ensures that any SQL queries are automatically checked against the live database schema at compile time, which is highly useful when queries have to be written or modified.
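For illustration only, a generated `.env` file might look roughly like this - the exact file locations, ports, and credentials are determined by the setup script and will differ:

```shell
# Hypothetical contents of a generated .env file (actual values come from `make sqlx-local-setup`)
DATABASE_URL=postgres://kamu:kamu@localhost:5432/kamu
SQLX_OFFLINE=false
```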
After the database-specific assignment is over, it makes sense to re-enable the default mode by running two more scripts:

```shell
make sqlx-prepare
make sqlx-local-clean
```

The first step, `make sqlx-prepare`, analyzes the SQL queries in the code and generates up-to-date data for offline query checking (the `.sqlx` directories).
These must be committed into version control to share the latest updates with other developers, as well as to pass the GitHub pipeline actions.
Note that running `make lint` will detect if re-generation is necessary before you push changes.
Otherwise, GitHub CI flows will likely fail to build the project due to database schema differences.

The second step, `make sqlx-local-clean`, reverses `make sqlx-local-setup` by:
- stopping and removing the Docker containers with the databases
- removing the `.env` files in database-specific crates, which re-enables `SQLX_OFFLINE=true` for the entire repository
Any change to the database structure requires writing SQL migration scripts.
The scripts are stored in the `./migrations/<db-engine>/` folders and are unique per database type.

The migration commands should be launched from within the database-specific crate folders, such as `./src/database/sqlx-postgres`. Alternatively, you will need to define the `DATABASE_URL` variable manually.

Typical commands to work with migrations include (see the example after this list):
- `sqlx migrate add --source <migrations_dir_path> <description>` - to add a new migration
- `sqlx migrate run --source <migrations_dir_path>` - to apply migrations to the database
- `sqlx migrate info --source <migrations_dir_path>` - to print information about the migrations currently applied to the database
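For example, adding and applying a new Postgres migration might look like the following sketch. The migration name and relative paths are made up for illustration; adjust them to your actual layout:

```shell
cd src/database/sqlx-postgres                      # so the crate's .env with DATABASE_URL is picked up

# Hypothetical migration name and path
sqlx migrate add  --source ../../../migrations/postgres add_example_table
sqlx migrate run  --source ../../../migrations/postgres   # apply it to the local database
sqlx migrate info --source ../../../migrations/postgres   # verify what is applied
```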
Use the following command:

```shell
make lint
```

This will do a number of highly useful checks (a rough manual equivalent is sketched after this list):
- Rust formatting check
- License headers check
- Dependencies check: detecting issues with existing dependencies, detecting unused dependencies
- Rust coding practices checks (`clippy`)
- SQLX offline data check (`sqlx` data for offline compilation must be up-to-date with the database schema)
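If you want to run a subset of these checks manually, the underlying tools can also be invoked directly. A rough equivalent, assuming the standard cargo commands; `make lint` remains the authoritative entry point:

```shell
cargo fmt --check                        # Rust formatting check
cargo clippy --workspace --all-targets   # coding practices checks
cargo deny check                         # dependency linting
cargo +nightly udeps                     # detect unused dependencies (requires a nightly toolchain)
```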
Before you run tests for the first time, you need to run:

```shell
make test-setup
```

This will download all the necessary images for containerized tests.

You can run all tests except the database-specific ones with:

```shell
make test
```

In most cases, you can skip tests involving the very heavy Spark and Flink engines and the databases by running:

```shell
make test-fast
```

If testing with databases is required (including E2E tests), use:

```shell
make sqlx-local-setup  # Start database-related containers
make test-full         # or `make test-e2e` for E2E only
make sqlx-local-clean
```

These are just wrappers on top of Nextest that control test concurrency and retries.

To run tests for a specific crate, e.g. `opendatafabric`, use:

```shell
cargo nextest run -p opendatafabric
```
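Nextest also lets you narrow things down further, e.g. to a single test by name. A sketch, where the test name is hypothetical:

```shell
# Run only tests whose names match the substring filter
cargo nextest run -p opendatafabric test_dataset_id

# Or use a Nextest filter expression
cargo nextest run -E 'test(test_dataset_id)'
```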
Given the native nature of Rust, we often have to rebuild very similar source code revisions (e.g. when switching between git branches).
This is where sccache can help us reuse the compilation cache and save a dramatic amount of time.
After installing it in whatever way is convenient for you, configure it in `$CARGO_HOME/config.toml` as follows:

```toml
[build]
rustc-wrapper = "/path/to/sccache"
```

Alternatively, you can use the `RUSTC_WRAPPER` environment variable:

```shell
export RUSTC_WRAPPER=/path/to/sccache  # for your convenience, save it in your $SHELL configuration file (e.g. `.bashrc`, `.zshrc`, etc.)
cargo build
```
Consider configuring Rust to use the `lld` linker, which is much faster than the default `ld` (it may improve link times by ~10-20x).
To do so, install `lld`, then update the `$CARGO_HOME/config.toml` file with the following contents:

```toml
[build]
rustflags = ["-C", "link-arg=-fuse-ld=lld"]
```

Another alternative is the `mold` linker, which is also much faster than the default `ld`.
To use it, install `mold` or build it from the mold sources with the `clang++` compiler, then update the `$CARGO_HOME/config.toml` file with the following contents:

```toml
[build]
linker = "clang"
rustflags = ["-C", "link-arg=-fuse-ld=mold"]
```
To build the tool with the embedded Web UI you will need to clone and build the kamu-web-ui repo or use a pre-built release. Then build the tool while enabling the optional feature and passing the location of the web root directory:

```shell
KAMU_WEB_UI_DIR=../../../../kamu-web-ui/dist/kamu-platform/ cargo build --features kamu-cli/web-ui
```

Note: we assume that the `kamu-web-ui` repository directory is at the same level as `kamu-cli`, for example:

```
.
└── kamu-data
    ├── kamu-cli
    └── kamu-web-ui
```

Note: in debug mode, the directory content is not actually embedded into the executable but is accessed from the specified directory.
Many core types in `kamu` are generated from schemas and IDLs in the open-data-fabric repository. If your work involves making changes to those, you will need to re-run the code generation tasks using:

```shell
make codegen
```

Make sure you have all the related dependencies installed (see above) and that the ODF repo is checked out in the same directory as the `kamu-cli` repo.
This repository is built around our interpretation of the Onion / Hexagonal / Clean Architecture patterns [1] [2].

In the `/src` directory you will find:
- `domain`
  - Crates here contain implementation-agnostic domain model entities and interfaces for services and repositories
  - Crate directories are named after the domain they represent, e.g. `task-system`, while crate names will typically have the `kamu-<domain>` prefix
- `adapter`
  - Crates here expose domain data and operations under different protocols
  - Crate directories are named after the protocol they use, e.g. `graphql`, while crate names will typically have the `kamu-adapter-<protocol>` prefix
  - Adapters only operate on entities and interfaces defined in the `domain` layer, independent of specific implementations
- `infra`
  - Crates here contain specific implementations of services and repositories (e.g. a repository that stores data in S3)
  - Crate directories are named as `<domain>-<technology>`, e.g. `object-repository-s3`, while crate names will typically have the `kamu-<domain>-<technology>` prefix
  - The infrastructure layer only operates on entities and interfaces defined in the `domain` layer
- `app`
  - Crates here combine all the layers above into functional applications

This architecture relies heavily on the separation of interfaces from implementations and the dependency inversion principle, so we are using a homegrown dependency injection library, dill, to simplify gluing these pieces together.
The system is built to be highly concurrent and, for better or worse, the explicit `async/await` style of concurrency is the most prevalent in Rust libraries now. Therefore:
- Our domain interfaces (traits) use `async fn` for any non-trivial functions to allow concurrent implementations
- Our domain traits are all `Send + Sync` (see the sketch after this list):
  - so that implementations can be used as `Arc<dyn Svc>`
  - so that implementations use interior mutability
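To make the layering and the concurrency rules concrete, here is a minimal sketch. The trait, struct, and method names are invented for illustration and are not actual kamu code:

```rust
use std::sync::Arc;

// domain layer: an implementation-agnostic interface.
// `Send + Sync` so it can be shared across tasks as `Arc<dyn DatasetLister>`.
#[async_trait::async_trait]
pub trait DatasetLister: Send + Sync {
    async fn list_dataset_names(&self) -> Vec<String>;
}

// infra layer: a concrete implementation living in its own crate.
// If mutable state were needed, it would sit behind a Mutex/RwLock (interior mutability).
pub struct DatasetListerInMem {
    names: Vec<String>,
}

#[async_trait::async_trait]
impl DatasetLister for DatasetListerInMem {
    async fn list_dataset_names(&self) -> Vec<String> {
        self.names.clone()
    }
}

// adapter/app layers depend only on the trait object, never on the concrete type
pub async fn print_datasets(lister: Arc<dyn DatasetLister>) {
    for name in lister.list_dataset_names().await {
        println!("{name}");
    }
}
```

In the real codebase the concrete implementation would be registered in the dill catalog and injected wherever the trait is needed.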
Our error handling approach is still evolving, but here are some basic design rules we settled on:
- We don't return `Box<dyn Error>` or any fancier alternatives (like `anyhow` or `error_stack`) - we want users to be able to handle our errors precisely
- We don't put all errors into a giant enum - this is as hard for users to handle as `Box<dyn Error>`
- We are explicit about what can go wrong in every function - i.e. we define error types per function
- Errors in domain interfaces typically carry an `Internal(_)` enum variant for propagating errors that are not part of the normal flow
- We never want the `?` operator to implicitly convert something into an `InternalError` - the decision that some error is not expected should be explicit
- We want `Backtrace`s everywhere, as close to the source as possible

With these ideas in mind:
- We heavily use the `thiserror` library to define errors per function and generate error type conversions
- We use our own `internal-error` crate to concisely box unexpected errors into the `InternalError` type (see the sketch after this list)
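A hedged sketch of what this looks like in practice - the names are invented, and the `InternalError` type is only approximated here; the real one (with its conversion helpers and backtrace capture) lives in kamu's `internal-error` crate:

```rust
use thiserror::Error;

// Stand-in for kamu's InternalError: boxes an arbitrary unexpected error.
#[derive(Error, Debug)]
#[error("internal error: {0}")]
pub struct InternalError(#[source] Box<dyn std::error::Error + Send + Sync>);

// Error type defined specifically for one function - callers can match on it precisely
#[derive(Error, Debug)]
pub enum GetDatasetError {
    #[error("dataset {name} not found")]
    NotFound { name: String },

    // Escape hatch for failures outside the normal flow
    #[error(transparent)]
    Internal(#[from] InternalError),
}

pub fn get_dataset(name: &str) -> Result<String, GetDatasetError> {
    if name.is_empty() {
        return Err(GetDatasetError::NotFound { name: name.to_string() });
    }
    // An unexpected I/O error is boxed into InternalError *explicitly* -
    // the `?` operator alone never performs that conversion for us.
    let content = std::fs::read_to_string(name)
        .map_err(|e| InternalError(Box::new(e)))?;
    Ok(content)
}
```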
We use the homegrown `test-group` crate to organize tests into groups (a usage sketch follows this list). The complete set of groups is:
- `containerized` - for tests that spawn Docker/Podman containers
- `engine` - for tests that involve any data engine or data framework (query, ingest, or transform paths), subsequently grouped by:
  - `datafusion` - tests that use Apache DataFusion
  - `spark` - tests that use Apache Spark
  - `flink` - tests that use Apache Flink
- `database` - for tests that involve any database interaction, subsequently grouped by:
  - `mysql` - tests that use MySQL/MariaDB
  - `postgres` - tests that use PostgreSQL
- `ingest` - tests that exercise the data ingestion path
- `transform` - tests that exercise the data transformation path
- `query` - tests that exercise the data query path
- `flaky` - a special group for tests that sometimes fail and need to be retried (use very sparingly and create tickets)
- `setup` - a special group for tests that initialize the environment (e.g. pull container images) - this group is run by CI before executing the rest of the tests
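In code this typically takes the form of an attribute on the test function. A sketch, assuming the `#[test_group::group(...)]` attribute form - check existing tests in the repo for the exact invocation:

```rust
// A containerized test exercising the transform path via the Spark engine.
// Group names mirror the list above; the test body is omitted.
#[test_group::group(containerized, engine, transform, spark)]
#[tokio::test]
async fn test_transform_with_spark() {
    // ...
}
```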
- Our policy is to keep the `master` branch always stable and ready to be released at any point in time, thus all changes are developed on feature branches and merged to `master` only when they pass all the checks
- Continuous upkeep of our repo is every developer's responsibility, so before starting a feature branch check whether a major dependency update is due and perform it on a separate branch
- Please follow this convention for branch names:
  - `bug/invalid-url-path-in-s3-store`
  - `feature/recursive-pull-flag`
  - `refactor/core-crate-reorg`
  - `ci/linter-improvements`
  - `docs/expand-developer-docs`
  - `chore/bump-dependencies`
  - `release/v1.2.3` - for hot fix releases
- Include a brief description of your changes under the `## Unreleased` section of the `CHANGELOG.md` in your PR
- (Recommended) Please configure git to sign your commits
- Branches should have coarse-grained commits with good descriptions - otherwise commits should be squashed
- Follow the minor dependency update procedure - do it right before merging to avoid merge conflicts in `Cargo.lock` while you're maintaining your branch
- (Optional) We usually create a new release for every feature merged into `master`, so follow the release procedure
- Maintainers who merge branches should do so via `git merge --ff-only` and NOT by rebasing, so as not to lose commit signatures (see the example below)
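For maintainers, the fast-forward-only merge looks roughly like this (the branch name is just an example):

```shell
git checkout master
git pull
git merge --ff-only feature/recursive-pull-flag   # refuses the merge unless master can simply fast-forward
git push origin master
```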
- Start by either creating a release branch or using an existing feature branch
- We try to stay up-to-date with all dependencies, so before every release we:
  - Run `cargo update` to pull in any minor releases
  - Run `cargo upgrade --dry-run --incompatible` and see which packages have major upgrades - either perform them or ticket them up
  - Run `cargo deny check` to audit updated dependencies for licenses, security advisories, etc.
- Bump the version using: `make release-patch` / `make release-minor` / `make release-major`
- Create a dated `CHANGELOG` entry for the new version
- Create a PR, wait for tests, then merge it as a normal feature branch
- On `master`, tag the latest commit with the new version: `git tag vX.Y.Z`
- Push the tag to the repo: `git push origin tag vX.Y.Z`
- GitHub Actions will pick up the new tag and create a new GitHub release from it
Our Jupyter demo at https://demo.kamu.dev includes a special Jupyter notebook image that embeds `kamu-cli`, multiple examples, and some other tools. The tutorials also guide users to interact with the `kamu-node` deployed in the demo environment. Because of this, it's important to update Jupyter whenever we break any protocol compatibility.

1. Increment the `DEMO_VERSION` version in the Makefile
2. Set the same version for the `jupyter` and `minio` images in `docker-compose.yml` (the `minio` image that we build is used to run the demo environment locally)
3. Run `make clean`
4. Run `make minio-data` - this will prepare the example datasets to be included into the `minio` image
5. Prepare your `docker buildx` to build multi-platform images (see instructions below)
6. Run `make minio-multi-arch` to build and push the multi-arch `minio` image
7. Set up a GitHub access token:
   - 7.1. Go to the GitHub access token page and generate an access token with `write:packages` permissions
   - 7.2. Run `export CR_PAT=<your_token>`. To check that everything is fine, run `echo $CR_PAT | docker login ghcr.io -u <your_username> --password-stdin`
8. Run `make jupyter-multi-arch` to build and push the multi-arch `jupyter` image
9. You can now proceed to deploy the new image to the Kubernetes environment
- Run `cargo update` to pull in any minor updates
- Run `cargo deny check` to audit new dependencies for duplicates and security advisories
- (Optional) Periodically run `cargo clean` to prevent your `target` dir from growing too big
- (Optional) Start by upgrading your local tools: `cargo install-update -a`
- Run `cargo update` to pull in any minor releases first
- Run `cargo upgrade --dry-run` and see which packages have major upgrades
- To perform major upgrades you can go crate by crate or all at once - it's up to you
- The tricky part is usually the `arrow` and `datafusion` family of crates; to upgrade them you need to:
  - See what the latest version of `datafusion` is
  - Go to the datafusion repo, switch to the corresponding tag, and check its `Cargo.toml` to see which version of `arrow` it depends on
  - Update to those major versions. For example, `datafusion v32` depends on `arrow v47`, so the command is: `cargo upgrade -p arrow@47 -p arrow-digest@47 -p arrow-flight@47 -p datafusion@32`
  - Note that arrow-digest is our repo versioned in lockstep with `arrow`, so if the right version of it is missing you should update it as well
- If some updates prove to be difficult - ticket them up and leave a `# TODO:` comment in `Cargo.toml`
- Run `cargo update` again to pull in any minor releases that were affected by your upgrades
- Run `cargo deny check` to audit new dependencies for duplicates and security advisories
- (Optional) Periodically run `cargo clean` to prevent your `target` dir from growing too big
We release multi-platform images to provide our users with native performance without emulation. Most of our images are built automatically by CI pipelines, so you may not have to worry about building them. Some images, however, are still built manually.

To build a multi-platform image on a local machine we use `docker buildx`. It has the ability to create virtual builders that run QEMU for emulation.

This command is usually enough to get started:

```shell
docker buildx create --use --name multi-arch-builder
```

If in some situation you want to run an image from a different architecture on Linux under emulation, use this command to bootstrap QEMU (source):

```shell
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```
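Once the builder exists, a multi-platform build-and-push generally follows this shape. The image name and platform list below are placeholders; the actual invocations are wrapped in the `make ...-multi-arch` targets:

```shell
docker buildx build \
    --platform linux/amd64,linux/arm64 \
    --tag ghcr.io/example-org/example-image:latest \
    --push \
    .
```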
When using VSCode we recommend the following extensions:
- `rust-analyzer` - Rust language server
  - Setting up `clippy`:

    ```json
    // settings.json
    {
      // other settings
      "rust-analyzer.check.overrideCommand": "cargo clippy --workspace --all-targets"
    }
    ```

- `Error Lens` - to display errors inline with code
- `Even Better TOML` - for editing TOML files
- `Dependi` - displays dependency version status in `Cargo.toml`
  - Note: it's better to use `cargo upgrade --dry-run` when upgrading to bump deps across the entire workspace
When running `kamu` it automatically logs to `.kamu/run/kamu.log`. Note that the `run` directory is cleaned up between every command.

You can control the log level using the standard `RUST_LOG` environment variable, e.g.:

```shell
RUST_LOG=debug kamu ...
RUST_LOG="trace,mio::poll=info" kamu ...
```

The log file is in Bunyan format with one JSON object per line. It is intended to be machine-readable. When analyzing logs yourself you can pipe them through the `bunyan` tool (see installation instructions above):

```shell
cat .kamu/run/kamu.log | bunyan
```

You can also run kamu with verbosity flags, as in `kamu -vv ...`, for it to log straight to STDERR in a human-readable format.

Using the `kamu --trace` flag allows you to record the execution of the program and open Perfetto UI in a browser, allowing you to easily analyze async code execution and task performance.

Note: If you are using Brave or a similar high-security browser and get an error from Perfetto when loading the trace, try disabling the security features to allow the UI app to fetch data from http://localhost:9001.