
Integrate DVC (or MLFlow) to track model experiments #20

Open
aazuspan opened this issue Nov 29, 2023 · 0 comments

@aazuspan (Contributor)

We're training a lot of different models with different datasets, architectures, hyperparameters, etc., and it's tough to track results across so many permutations. #19 attempts to improve reproducibility by storing all relevant parameters in dataset file names, but that approach won't scale to models and predicted outputs with dozens of parameters.

Tools like DVC and MLFlow track experiments by recording inputs (datasets, scripts, parameters, etc.) with their associated outputs (metrics, models, images, etc.). With DVC in particular (I'm not as familiar with MLFlow), you can set up workflows to run entirely through the tool, so that everything from the parameters used to create the training dataset to the final model is automatically linked. However, in order to connect input parameters to output files, DVC requires scripts to produce outputs synchronously, rather than submitting tasks that run asynchronously and are downloaded later, which is how we currently collect our sampling and inference data. We could adapt our workflow to force synchronous execution by waiting for Earth Engine tasks to complete and downloading the outputs programmatically, but that would require a substantial redesign and mean that multiple datasets couldn't easily be collected concurrently.
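For illustration, here's a rough sketch of what forcing synchronous execution might look like (the export parameters and helper name are hypothetical, and this is not something we do today):

```python
import time

import ee


def export_and_wait(image, description, poll_interval=60):
    """Hypothetical helper: block until an Earth Engine export finishes,
    so a DVC stage could treat the result as a synchronous output."""
    task = ee.batch.Export.image.toDrive(image=image, description=description)
    task.start()
    # Poll until Earth Engine reports the task is no longer running.
    while task.active():
        time.sleep(poll_interval)
    if task.status()["state"] != "COMPLETED":
        raise RuntimeError(f"Export {description} failed: {task.status()}")
    # The exported file would still need to be downloaded from Drive before
    # the DVC stage completes, which is the redesign cost described above.
```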

To avoid that limitation and allow asynchronous data creation, my tentative plan is to collect data outside of DVC, using the API developed in #19 to link dataset parameters with the output files. During training, we can manually tell DVC the dataset parameters, and it will log them alongside the model and metrics. This is slightly less reliable, since we're responsible for making sure the training data is consistent with its creation parameters, but it should be more flexible and avoid slowing things down with synchronous execution. Because DVC will track all data and model parameters with the associated model, we should be able to remove the ModelRun class added in #19, which achieves the same goal using model filenames.
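As a rough sketch of what the training side could look like, assuming we use DVCLive for logging (the dataset parameters, training function, and file names below are placeholders, not the real #19 API):

```python
from dvclive import Live

# Placeholder stand-ins for the dataset API from #19 and the training step.
dataset_params = {"n_samples": 10_000, "bands": ["R", "G", "B", "N"], "patch_size": 256}


def train_model(params):
    """Placeholder for the real training loop; returns validation metrics."""
    return {"val_loss": 0.42, "val_accuracy": 0.87}


with Live() as live:
    # Manually record the parameters that produced the (asynchronously
    # collected) training dataset, so DVC links them to this experiment.
    live.log_params(dataset_params)

    metrics = train_model(dataset_params)
    for name, value in metrics.items():
        live.log_metric(name, value)

    # Track the saved model file as an experiment artifact, e.g.:
    # live.log_artifact("model.h5", type="model")
```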

Currently, the training notebook is followed by two other notebooks that download NAIP imagery as TFRecords from a test region and generate a map for qualitative comparison between model runs. To fit this into the DVC workflow, I think we should:

1. Move the 03_export_naip notebook into a Python script, with the understanding that it will be run once to produce a test region that can be used to evaluate every model run.
2. Run inference on that test region automatically as part of the training process, logging the resulting map as an artifact with DVC (see the sketch below), so that each model run includes both quantitative metrics and a qualitative map.
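For example, the training stage could end with something like this (again a sketch; the prediction helper is a placeholder, and logging the map via DVCLive's image logging is an assumption about how we'd wire it up):

```python
import numpy as np
from dvclive import Live


def predict_test_region(model):
    """Placeholder: run inference over the pre-exported NAIP test region
    and return an RGB map as a NumPy array."""
    return np.zeros((256, 256, 3), dtype=np.uint8)


with Live() as live:
    map_array = predict_test_region(model=None)
    # Log the qualitative map alongside the quantitative metrics so every
    # model run carries both.
    live.log_image("test_region_map.png", map_array)
```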

@aazuspan aazuspan added the enhancement New feature or request label Nov 29, 2023
@aazuspan aazuspan self-assigned this Nov 29, 2023
@aazuspan aazuspan mentioned this issue Nov 29, 2023