Make uri_file output limitation explicit #72

Merged
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -3,7 +3,8 @@
 ## [Unreleased]

 - Added support for pydantic v2 and bumped minimal required pydantic version to `2.0.0` by [@froessler](https://github.com/fdroessler)
-- Added adbility to mark a node as deterministic (enables caching on AzureML) by [@tomasvanpottelbergh](https://github.com/tomasvanpottelbergh)
+- Added ability to mark a node as deterministic (enables caching on Azure ML) by [@tomasvanpottelbergh](https://github.com/tomasvanpottelbergh)
+- Explicitly disabled support for `AzureMLAssetDataSet` outputs of `uri_file` type by [@tomasvanpottelbergh](https://github.com/tomasvanpottelbergh)

 ## [0.5.0] - 2023-08-11

4 changes: 3 additions & 1 deletion docs/source/05_data_assets.rst
@@ -3,7 +3,9 @@ Azure Data Assets

 ``kedro-azureml`` adds support for two new datasets that can be used in the Kedro catalog. Right now we support both Azure ML v1 SDK (direct Python) and Azure ML v2 SDK (fsspec-based) APIs.

-**For v2 API (fspec-based)** - use ``AzureMLAssetDataSet`` that enables to use Azure ML v2-sdk Folder/File datasets for remote and local runs.
+**For v2 API (fspec-based)** - use ``AzureMLAssetDataSet`` that enables to use Azure ML v2 SDK Folder/File datasets for remote and local runs.
+Currently only the `uri_file` and `uri_folder` types are supported. Because of limitations of the Azure ML SDK, the `uri_file` type can only be used for pipeline inputs,
+not for outputs. The `uri_folder` type can be used for both inputs and outputs.

 **For v1 API** (deprecated ⚠️) use the ``AzureMLFileDataSet`` and the ``AzureMLPandasDataSet`` which translate to `File/Folder dataset`_ and `Tabular dataset`_ respectively in
 Azure Machine Learning. Both fully support the Azure versioning mechanism and can be used in the same way as any
35 changes: 19 additions & 16 deletions kedro_azureml/generator.py
@@ -154,19 +154,32 @@ def _get_versioned_azureml_dataset_name(
suffix = ":" + version
return azureml_dataset_name + suffix

-    def _get_input_type(self, dataset_name: str, pipeline: Pipeline) -> Input:
+    def _get_input(self, dataset_name: str, pipeline: Pipeline) -> Input:
         if self._is_param_or_root_non_azureml_asset_dataset(dataset_name, pipeline):
-            return "string"
+            return Input(type="string")
         elif dataset_name in self.catalog.list() and isinstance(
             ds := self.catalog._get_dataset(dataset_name), AzureMLAssetDataSet
         ):
             if ds._azureml_type == "uri_file" and dataset_name not in pipeline.inputs():
                 raise ValueError(
                     "AzureMLAssetDataSets with azureml_type 'uri_file' can only be used as pipeline inputs"
                 )
-            return ds._azureml_type
+            return Input(type=ds._azureml_type)
         else:
-            return "uri_folder"
+            return Input(type="uri_folder")

+    def _get_output(self, name):
+        if name in self.catalog.list() and isinstance(
+            ds := self.catalog._get_dataset(name), AzureMLAssetDataSet
+        ):
+            if ds._azureml_type == "uri_file":
+                raise ValueError(
+                    "AzureMLAssetDataSets with azureml_type 'uri_file' cannot be used as outputs"
+                )
+            # TODO: add versioning
+            return Output(type=ds._azureml_type, name=ds._azureml_dataset)
+        else:
+            return Output(type="uri_folder")
+
     def _from_params_or_value(
         self,
@@ -231,21 +244,11 @@ def _construct_azure_command(
             },
             environment=self._resolve_azure_environment(),  # TODO: check whether Environment exists
             inputs={
-                self._sanitize_param_name(name): Input(
-                    type=self._get_input_type(name, pipeline)
-                )
+                self._sanitize_param_name(name): self._get_input(name, pipeline)
                 for name in node.inputs
             },
             outputs={
-                self._sanitize_param_name(name): (
-                    # TODO: add versioning
-                    Output(name=ds._azureml_dataset)
-                    if name in self.catalog.list()
-                    and isinstance(
-                        ds := self.catalog._get_dataset(name), AzureMLAssetDataSet
-                    )
-                    else Output()
-                )
+                self._sanitize_param_name(name): self._get_output(name)
                 for name in node.outputs
             },
             code=self.config.azure.code_directory,
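
The rule this change enforces can be tried in isolation. The following is a minimal sketch, not the plugin's code: `Input`, `Output`, and `AssetDataSet` are simplified stand-ins for `azure.ai.ml.Input`, `azure.ai.ml.Output`, and `AzureMLAssetDataSet`, the catalog is assumed to be a plain dict, the dataset names are hypothetical, and the parameter branch that returns `Input(type="string")` is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Input:
    """Stand-in for azure.ai.ml.Input; only `type` is modeled."""
    type: str


@dataclass
class Output:
    """Stand-in for azure.ai.ml.Output; only `type` and `name` are modeled."""
    type: str
    name: Optional[str] = None


@dataclass
class AssetDataSet:
    """Stand-in for AzureMLAssetDataSet (assumption: two fields are enough here)."""
    azureml_dataset: str
    azureml_type: str = "uri_folder"


def get_input(name: str, catalog: Dict[str, AssetDataSet], pipeline_inputs: Set[str]) -> Input:
    ds = catalog.get(name)
    if ds is None:
        # Datasets that are not Azure ML assets are passed between nodes as folders.
        return Input(type="uri_folder")
    if ds.azureml_type == "uri_file" and name not in pipeline_inputs:
        # A uri_file asset consumed mid-pipeline would have to be produced
        # by another node, which is not supported.
        raise ValueError("'uri_file' assets can only be used as pipeline inputs")
    return Input(type=ds.azureml_type)


def get_output(name: str, catalog: Dict[str, AssetDataSet]) -> Output:
    ds = catalog.get(name)
    if ds is None:
        return Output(type="uri_folder")
    if ds.azureml_type == "uri_file":
        raise ValueError("'uri_file' assets cannot be used as outputs")
    return Output(type=ds.azureml_type, name=ds.azureml_dataset)


# Hypothetical catalog used only for illustration.
catalog = {
    "raw_csv": AssetDataSet("raw-csv", "uri_file"),
    "features": AssetDataSet("features-asset", "uri_folder"),
}
```

With this catalog, `get_output("raw_csv", catalog)` raises `ValueError`, while `get_output("features", catalog)` returns a named `uri_folder` output. Failing at generation time means the error appears before anything is submitted to Azure ML.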