How to correctly create uri_file data assets ? #70

Open
Gabriel2409 opened this issue Aug 24, 2023 · 22 comments
Labels
help wanted Extra attention is needed

Comments

@Gabriel2409
Contributor

I am trying to save a data asset as a uri_file and the dataset is incorrectly saved as a uri_folder when I launch kedro azureml run

I have the following catalog:

projects_train_raw_local:
  type: pandas.CSVDataSet
  filepath: data/01_raw/dataset.csv

projects_train_raw:
    type: kedro_azureml.datasets.AzureMLAssetDataSet
    azureml_dataset: projects_train_raw
    root_dir: data/00_azurelocals/ 
    versioned: True
    azureml_type: uri_file
    dataset:
        type: pandas.CSVDataSet
        filepath: "dataset.csv"

and the following pipeline which just opens the local file and saves it

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        nodes=[
            node(
                func=lambda x: x,
                inputs="projects_train_raw_local",
                outputs="projects_train_raw",
                name="create_train_dataasset",
            )
        ]
    )

I expected a new data asset to be created on Azure as a uri_file. However, I get the following info on Azure:
image

image

It seems my file is not saved correctly, which seems to correspond to this part in cli.py if I am not mistaken

    # 2. Save dummy outputs
    # In distributed computing, it will only happen on nodes with rank 0
    if not pipeline_data_passing and is_distributed_master_node():
        for data_path in azure_outputs.values():
            (Path(data_path) / "output.txt").write_text("#getindata")
    else:
        logger.info("Skipping saving Azure outputs on non-master distributed nodes")

How can I correctly create a uri_file data asset ?

@fdroessler
Contributor

I think this is related to the code below in generator.py. Output has uri_folder as its default type, and since the type is currently not taken from the dataset definition, one can't write uri_file datasets as outputs of a node.

            outputs={
                self._sanitize_param_name(name): (
                    # TODO: add versioning
                    Output(name=ds._azureml_dataset)
                    if name in self.catalog.list()
                    and isinstance(
                        ds := self.catalog._get_dataset(name), AzureMLAssetDataSet
                    )
                    else Output()
                )
                for name in node.outputs
            },

In fact, I probably forgot something because I'm not sure azureml_type atm has any impact.... for the loading we are taking the type directly from the azure dataset object. So in order to write other dataset types some changes are necessary. I'll have a look at this in the coming week or maybe you have an idea :)

@Gabriel2409
Contributor Author

Hi @fdroessler,
Based on your input, I modified the code from
Output(name=ds._azureml_dataset) to Output(name=ds._azureml_dataset, type=ds._azureml_type).
However I now have another error

Execution failed. User process 'kedro' exited with status code 2. Please check log file 'user_logs/std_log.txt' for error details. Error: /bin/bash: /azureml-envs/azureml_xxxxxxxxxx/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Usage: kedro azureml execute [OPTIONS]
Try 'kedro azureml execute -h' for help.

Error: Invalid value for '--az-output': Path '/mnt/azureml/cr/j/xxxxxxxxxx/cap/data-capability/wd/projects_train_raw/projects_train_raw' does not exist.

It seems the path passed to az-output is incorrect. I find it strange that there are two folders projects_train_raw.

When I looked at the output of the _prepare_command function, I got the following string: kedro azureml -e local execute --pipeline=__default__ --node=create_train_dataasset --az-output=projects_train_raw ${{outputs.projects_train_raw}}

However I am not exactly sure where in the code ${{outputs.projects_train_raw}} is replaced by the actual path.
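For readers following along, here is a minimal sketch of how a command string like the one above could be assembled. It is not the plugin's actual _prepare_command; the function name and parameters are hypothetical. The point it illustrates is that the ${{outputs.<name>}} placeholder is emitted verbatim by the Python side and is only resolved to a concrete path by Azure ML when the job runs.

```python
def build_execute_command(env: str, pipeline: str, node: str, outputs: list[str]) -> str:
    """Build a 'kedro azureml execute' command with one placeholder per output.

    Hypothetical helper for illustration only, not the plugin's implementation.
    """
    parts = [f"kedro azureml -e {env} execute --pipeline={pipeline} --node={node}"]
    for name in outputs:
        # The doubled braces emit a literal "${{outputs.<name>}}".
        # Nothing here knows the real path: Azure ML substitutes it at job runtime.
        parts.append(f"--az-output={name} ${{{{outputs.{name}}}}}")
    return " ".join(parts)


print(
    build_execute_command(
        "local", "__default__", "create_train_dataasset", ["projects_train_raw"]
    )
)
```

So searching the plugin code for the place where the placeholder is replaced will not find anything: the substitution happens on the Azure ML side.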

@marrrcin
Contributor

Observing this discussion, but waiting for input from @fdroessler 👀

@marrrcin marrrcin added the help wanted Extra attention is needed label Aug 25, 2023
@fdroessler
Contributor

I am not sure what is going on. I remember @tomasvanpottelbergh mentioned something on using uri_file as output dataset. I think we have it even excluded as subsequent inputs through

if ds._azureml_type == "uri_file" and dataset_name not in pipeline.inputs():
    raise ValueError(
        "AzureMLAssetDataSets with azureml_type 'uri_file' can only be used as pipeline inputs"
    )

but can't remember exactly why. I think I remember some limitations as the reason but that part Tomas is more familiar with.

@tomasvanpottelbergh
Contributor

As @fdroessler said: you shouldn't use the uri_file type for output datasets. I forgot to document this and make this clear in the generator, but the problem is that if you use uri_file in the Azure ML pipeline, the "filename" will be the dataset name (azureml_dataset). This name has no extension, making it impossible to know what file type it is or preview it.
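To make the extension problem concrete, here is a small sketch using only the Python standard library. The helper name is hypothetical; it just shows how little can be inferred from a file that is named after the data asset, compared with a name that keeps its extension.

```python
import mimetypes
from pathlib import PurePosixPath


def describe_output_file(name: str) -> dict:
    """Report what can be inferred about a file from its name alone."""
    suffix = PurePosixPath(name).suffix
    guessed_type, _encoding = mimetypes.guess_type(name)
    return {"suffix": suffix, "guessed_type": guessed_type}


# A uri_file output named after the data asset: no suffix, no guessable type.
print(describe_output_file("projects_train_raw"))

# The same data saved under a name that keeps its extension.
print(describe_output_file("dataset.csv"))
```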

If there is really a demand for this, we could technically set the dataset name to be the filename, but I feel that will just make things inconsistent and create more problems than it solves.

Is there a particular reason why you need a uri_file dataset @Gabriel2409? As far as I know, the uri_file type is just a restricted version of the uri_folder type allowing only a single file.

@Gabriel2409
Contributor Author

Hi @tomasvanpottelbergh,
You make fair points and in my case, I can go with a uri_folder dataset.

Nevertheless, I feel like other people might want to use a uri_file data asset, mostly because it is the simplest data asset and it can help get started with kedro-azureml. In my opinion, not supporting uri_file as output is a bit of a shame and might drive people away from the plugin.

However, and that is the main problem from what I understand based on your comments, Azure does not permit creating uri_file data assets from within a pipeline. Is this correct? Is this specific to pipelines?

I was able to create a uri_file dataset by running the following job but it is not within a pipeline so I don't know if there is an issue associated with the pipeline itself.

from azure.ai.ml import command, Input, Output, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

subscription_id = "xxxxxxxxxxxxxxxxxxxxxxx"
resource_group = "myrg"
workspace = "myws"

ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)


input_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"
output_path = (
    "azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv"
)
data_type = AssetTypes.URI_FILE
# Set the input and output for the job:
inputs = {"input_data": Input(type=data_type, path=input_path)}

outputs = {
    "output_data": Output(
        type=data_type,
        path=output_path,
        name="myurifiledataset",
    )
}

job = command(
    command="cp ${{inputs.input_data}} ${{outputs.output_data}}",
    inputs=inputs,
    outputs=outputs,
    environment="myenv:1",
    compute="mycluster",
)

ml_client.jobs.create_or_update(job)

@tomasvanpottelbergh
Contributor

Just to clarify: uri_file is supported in the plugin, but only as a pipeline input. I understand that this is a bit annoying, but as said I don't yet see a good way to support it.

Thanks for the example. Unfortunately I didn't save the pipeline I was using to test this, but I think the problem is that the path argument of the Output needs to be set to a full datastore path for your example to work. As far as I remember, just giving a filename there doesn't work. Can you confirm this?

If that is the case, the only way I see that we can support uri_file outputs is somehow adding the datastore path to the dataset specification. This then raises the question of how to handle versioning so as not to overwrite this file on subsequent runs of the pipeline. For reference, the uri_folder output currently avoids this by letting Azure ML generate a path on the default datastore.

Do you think it is still useful to add the feature given these limitations, or do you see another way to implement it?

@Gabriel2409
Contributor Author

Yes I can confirm that adding the path to Output does not solve the issue. I even tried to add the full path to the datastore and it failed as well (might need further testing).

Regarding your comments:

uri_file is supported in the plugin, but only as a pipeline input. I understand that this is a bit annoying, but as said I don't yet see a good way to support it.:

Yes I was able to make it work as an input (except the small issue with local runs, see issue kedro-org/kedro-plugins#68)

This then raises the question of how to handle versioning so as not to overwrite this file on subsequent runs of the pipeline:

In my opinion, if a data asset is the output of a node in the pipeline, it should create a new version. Indeed running the node means you are trying to recreate the data asset. What do you think?

For reference, the uri_folder output currently avoids this by letting Azure ML generate a path on the default datastore

Where in the code is this? I did notice that you don't give the path to the output, but I don't know how Azure knows it should use the default datastore. And does that mean that non-default datastores are not currently supported? I do believe that the datastore should be an optional parameter. Maybe it is possible to use the AzureMLDataStoreMixin that was created for file and pandas datasets?

To answer your question I don't see a good way to implement it for now but if you can point me to where in the code Azure takes over and generates the path, I am happy to investigate.

@tomasvanpottelbergh
Contributor

In my opinion, if a data asset is the output of a node in the pipeline, it should create a new version. Indeed running the node means you are trying to recreate the data asset. What do you think?

I agree, but that requires separating the path into the containing folder and the filename, and probably using the Kedro versioning approach.

For reference, the uri_folder output currently avoids this by letting Azure ML generate a path on the default datastore

Where in the code is this? I did notice that you don't give the path to the output, but I don't know how Azure knows it should use the default datastore. And does that mean that non-default datastores are not currently supported? I do believe that the datastore should be an optional parameter. Maybe it is possible to use the AzureMLDataStoreMixin that was created for file and pandas datasets?

This is done here:

def _get_output(self, name):
    if name in self.catalog.list() and isinstance(
        ds := self.catalog._get_dataset(name), AzureMLAssetDataSet
    ):
        if ds._azureml_type == "uri_file":
            raise ValueError(
                "AzureMLAssetDataSets with azureml_type 'uri_file' cannot be used as outputs"
            )
        # TODO: add versioning
        return Output(type=ds._azureml_type, name=ds._azureml_dataset)
    else:
        return Output(type="uri_folder")

As you can see, the path argument of Output is just omitted, which makes Azure ML generate a path on the default datastore. Some way of setting the datastore would indeed be nice, but as said before this means we should probably add some versioning to the path ourselves. If you can investigate the options for setting the datastore/path for different output types that would be appreciated, since they are not well documented in the Azure ML docs.

@Gabriel2409
Contributor Author

Hi @tomasvanpottelbergh,
So I have been playing around with Output and in the code I wrote above where I send the job directly to azureml, I am still able to save to a uri_file even without specifying an extension (though preview on azureml does not work):

output_path =   "azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic" # no .csv extension

Which is why I don't understand the error when using kedro-azureml and trying to save as a uri_file

Error: Invalid value for '--az-output': Path '/mnt/azureml/cr/j/xxxxxxxxxx/cap/data-capability/wd/projects_train_raw/projects_train_raw' does not exist.

I would understand if the uri_file were saved with an incorrect extension, but I don't understand why there is a path error.
What I find even more confusing is that this error disappears if you try to save to a uri_folder.

Is there something I am missing on how node outputs are handled when using an AzureMLAssetDataSet, maybe a subtlety on the save function?
Also do you happen to know why the dataset name is repeated twice in the path?

@tomasvanpottelbergh
Contributor

@Gabriel2409 as you say saving the file without an extension is possible, but since this behaviour is not consistent with how the uri_folder dataset works, we didn't want to support the uri_file dataset for outputs.

The error you see is because the path Azure ML injects as a CLI argument is a folder for the uri_folder dataset and a file for the uri_file dataset, whereas our code was only designed for the uri_folder case.
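The folder-versus-file distinction can be sketched as follows. This is not the plugin's code: the function name and its arguments are hypothetical, and it only illustrates what handling both cases would involve, assuming the injected path is a directory for uri_folder and the file itself for uri_file.

```python
from pathlib import Path


def resolve_dataset_file(injected_path: str, azureml_type: str, filepath: str) -> Path:
    """Return the file the underlying dataset should read from or write to.

    injected_path: the path Azure ML passes on the command line (assumed).
    filepath: the relative filepath of the underlying Kedro dataset.
    """
    base = Path(injected_path)
    if azureml_type == "uri_folder":
        # Folder case: the dataset's relative filepath lives inside the folder.
        return base / filepath
    if azureml_type == "uri_file":
        # File case: the injected path already is the file, so appending
        # anything to it would point at a location that does not exist.
        return base
    raise ValueError(f"Unsupported azureml_type: {azureml_type!r}")


print(resolve_dataset_file("/mnt/out", "uri_folder", "dataset.csv"))
print(resolve_dataset_file("/mnt/out/projects_train_raw", "uri_file", "dataset.csv"))
```

Code written only for the first branch treats every injected path as a directory, which is consistent with the failure reported earlier in the thread when the output type was switched to uri_file.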

I do see the use of being able to specify the output path / datastore for a dataset and will have a look at adding this. Once this is working for uri_folder, we could probably have it for uri_file as well. Maybe it's even possible to find a workaround to detect the default datastore and generate a random path so the uri_file dataset can also work without specifying the full path.

Let us know if you have any ideas about this or if you want to (help) implement this.

@tomasvanpottelbergh
Contributor

@Gabriel2409 I did some more investigation and what I've written above definitely seems feasible (the Azure ML SDK v2 can get the default datastore). There are some design questions though on which it would be good to also have the opinion of @marrrcin and @fdroessler:

  1. Assuming we add something like an azureml_path argument to AzureMLAssetDataSet, do we support the format azureml://datastores/<data_store_name>/paths/<path> or make this the path relative to the root of the datastore and also add an azureml_datastore argument? The latter choice might be useful to be able to use the default datastore without specifying it.
  2. How do we handle versioning? I actually have a use case myself to have an output path for pipelines that should be constant across different runs (doing the versioning in the code using partitions). Therefore I think it would be nice if the plugin respected the versioned: false flag. If so, is there a case for just following the default Kedro versioning behaviour when versioned: true? Or should this behaviour be controlled by the versioned flag on the underlying dataset (which is possibly too confusing)?
  3. As mentioned above, the uri_file output can be supported if the azureml_path is set, or we could generate a random path on the default datastore. Currently this seems to be under azureml/<UUID_of_the_pipeline_node>/, but this UUID is not known before runtime AFAIK. Therefore we could
    1. generate a UUID ourselves, although this will make the behaviour inconsistent with that of uri_folder and could lead to confusion because the UUID does not correspond to any node ID
    2. use a different path such as kedro-azureml/<UUID>/ for all outputs
    3. only support the uri_file case with an explicit specification of the path
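The two options in question 1 differ only in who assembles the URI. A minimal sketch, assuming the azureml://datastores/<data_store_name>/paths/<path> format quoted above: the helper names are hypothetical, and the default datastore name is simply the one used in the examples in this thread, not something the plugin or the SDK is known to hard-code.

```python
import re
from typing import NamedTuple


class DatastorePath(NamedTuple):
    datastore: str
    path: str


_AZUREML_URI = re.compile(
    r"^azureml://datastores/(?P<datastore>[^/]+)/paths/(?P<path>.*)$"
)


def parse_azureml_path(uri: str) -> DatastorePath:
    """Split a full datastore URI into datastore name and relative path."""
    match = _AZUREML_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a datastore URI: {uri!r}")
    return DatastorePath(match["datastore"], match["path"])


def build_azureml_path(path: str, datastore: str = "workspaceblobstore") -> str:
    """Assemble the full URI from a relative path and an optional datastore.

    The default datastore name is an assumption taken from this thread.
    """
    return f"azureml://datastores/{datastore}/paths/{path.lstrip('/')}"


print(parse_azureml_path(
    "azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv"
))
print(build_azureml_path("mypath/mydata.csv"))
```

With the first option the user writes the full URI and the plugin would only need something like parse_azureml_path; with the second the user writes a relative path and the plugin does the assembly.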

What do you all think?

@Gabriel2409
Contributor Author

Hi @tomasvanpottelbergh,
thank you for your inputs. I am happy to help implement this feature.

For item 1, personally I would go with the full path. It is quite easy to retrieve the datastore path in azure. Note that the path is only needed when saving. When you load, you can just use the data asset name.

For item 2, I think versioning should be done by default because of the way data assets work. It does not make sense to overwrite the data asset location with new data in my opinion (it may even pose problems, I don't know if Azure allows you to write on the location of an existing data asset). My main point is that when a data asset is an output of a node, it means the user wants to create a new version of this data asset and as such, it should be saved in a new location (though I would be interested in hearing counter arguments).

As you say in item 3, right now, each time you save a data asset, <UUID_of_the_pipeline_node> is added to the output path, which effectively saves new versions of the datasets in different folders. So in a sense, versioning is already implemented. Maybe another solution would be to be able to specify the path to use kedro versioning system but I feel like it is extra work for no benefit as azure already handles the versioning for you.

For the uri_file, once again the name of the file is not really important but the extension is, so I would add another argument for uri_file: filename. I would just use azureml_path/<node_uuid_or_other_versioning_system>/<filename>. Note that nothing prevents us from using the filename arg with uri_folder (for ex: filename: '.', but it should be renamed to filepath if we do that).

So to summarize, I suggest something like this:

  • new argument: azureml_path
  • new argument for uri_file : filename
  • When using data asset as input, load using the data asset name
  • When using data asset as output, save to azureml_path/<node_uuid_or_other_versioning_system> for uri_folder and azureml_path/<node_uuid_or_other_versioning_system>/<filename> for uri_file
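A minimal sketch of the save-path rule in the summary above. It only captures the proposal as worded here, not anything implemented in the plugin; the function name is hypothetical, and the version component is passed in as an opaque string because the thread has not settled where it comes from.

```python
import posixpath
from typing import Optional


def build_output_path(
    azureml_path: str,
    version: str,
    azureml_type: str,
    filename: Optional[str] = None,
) -> str:
    """Return the save location for an output data asset (proposal sketch)."""
    versioned_dir = posixpath.join(azureml_path.rstrip("/"), version)
    if azureml_type == "uri_folder":
        # Folder assets point at the versioned directory itself.
        return versioned_dir + "/"
    if azureml_type == "uri_file":
        if not filename:
            raise ValueError("uri_file outputs need a filename with an extension")
        # File assets point at a named file inside the versioned directory.
        return posixpath.join(versioned_dir, filename)
    raise ValueError(f"Unsupported azureml_type: {azureml_type!r}")


base = "azureml://datastores/workspaceblobstore/paths/myfolder/"
print(build_output_path(base, "v1", "uri_folder"))
print(build_output_path(base, "v1", "uri_file", filename="dataset.csv"))
```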

@fdroessler
Contributor

My first thought on this is more of a question. Is the aim of the plugin to provide full flexibility and access to as many AzureML capabilities as possible, or only to as many as is sensible?

I am still not 100% convinced that not providing the uri_file dataset as an output option will drive people away from the plugin. If I'm already using the plugin, then accessing the data from a uri_file or a uri_folder is the same, as I will be going through the catalog anyway. In fact, having a lot of different configuration options, and thus the potential to make mistakes when defining a dataset, might actually be more frustrating than helpful for a starting user. So for me there is still a decision to be made about whether it makes sense to have uri_file as an output dataset.

However IF we choose to do that (and I am not against it just want to challenge it a bit) here are my answers to the questions posed by @tomasvanpottelbergh.

Assuming we add something like an azureml_path argument to AzureMLAssetDataSet, do we support the format azureml://datastores/<data_store_name>/paths/<path> or make this the path relative to the root of the datastore and also add an azureml_datastore argument? The latter choice might be useful to be able to use the default datastore without specifying it.

Intuitively I would say we should abstract complexity away from the user, so I would go for the relative path plus azureml_datastore option. I think this is more in line with what we have done before, moving the path handling complexity away from the user and into the plugin, and I would keep doing that. We shouldn't expect people to have a deep understanding of the paths as a precondition for using the plugin.

How do we handle versioning? I actually have a use case myself to have an output path for pipelines that should be constant across different runs (doing the versioning in the code using partitions). Therefore I think it would be nice if the plugin respected the versioned: false flag. If so, is there a case for just following the default Kedro versioning behaviour when versioned: true? Or should this behaviour be controlled by the versioned flag on the underlying dataset (which is possibly too confusing)?

Not sure I understand the question/implications completely or more specifically what is the difference of this question with 3)? But on the versioned flag I think it should be the one on the AssetDataset and not the underlying dataset. I wonder how this behaves if we have multiple versioned datasets that point to files in a folder dataset O_o maybe then it has to be on the underlying dataset? I think at the moment it needs to default to true but I guess this can be refactored as part of making the uri_file work. However, is this tightly linked to uri_file or more general?
@Gabriel2409 To my knowledge it is possible to "overwrite" versions because in the end a version is just a pointer to a location in the underlying storage so technically all versions can point to the same file if done wrong.

As mentioned above, the uri_file output can be supported if the azureml_path is set, or we could generate a random path on the default datastore. Currently this seems to be under azureml/<UUID_of_the_pipeline_node>/, but this UUID is not known before runtime AFAIK. Therefore we could
    1. generate a UUID ourselves, although this will make the behaviour inconsistent with that of uri_folder and could lead to confusion because the UUID does not correspond to any node ID
    2. use a different path such as kedro-azureml/<UUID>/ for all outputs
    3. only support the uri_file case with an explicit specification of the path

I can see the appeal of being able to determine the folder structure in the storage account/datastore ourselves. I need to explore the uri structure a bit more to see which option makes the most sense to me.

@Gabriel2409
Contributor Author

Hi @fdroessler,
on that specific note:

To my knowledge it is possible to "overwrite" versions because in the end a version is just a pointer to a location in the underlying storage so technically all versions can point to the same file if done wrong.

I ran some tests. Consider the following code:

from azure.ai.ml import command, Input, Output, MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential
from azureml.core import Workspace

# Set your subscription, resource group and workspace name:
subscription_id = "mysubscriptionid"
resource_group = "myrg"
workspace = "myws"

ws = Workspace.get(
    subscription_id=subscription_id, resource_group=resource_group, name=workspace
)

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)


input_path = "mydata.csv" #local file 
output_path = "azureml://datastores/workspaceblobstore/paths/mypath/mydata.csv"

data_type = AssetTypes.URI_FILE # can be both uri_file or folder here, it does not matter


# Set the input and output for the job:
inputs = {
    "input_data": Input(
        type=data_type,
        path=input_path,
    )
}

outputs = {
    "output_data": Output(
        type=data_type,
        path=output_path,
        name="mydata",
    )
}

# This command job copies the data to your default Datastore
job = command(
    command="cp ${{inputs.input_data}} ${{outputs.output_data}}",
    inputs=inputs,
    outputs=outputs,
    environment="myenv:1",
    compute="mycluster",
)

# Submit the command
ml_client.jobs.create_or_update(job)

Here mydata.csv looks like this

col1,col2
1,2

The first time I run the code, a new data asset is created on azureml. If I go to the underlying blob storage, I can find it here:

https://myblobstorage.blob.core.windows.net/azureml-blobstore-xxxxxxxxxxxxxxxx/mypath/mydata.csv
In azureml, it also creates the first version of the data asset

Now I modify mydata.csv to look like this

col1,col2
1,2
3,4

and I rerun the same code.

The file is overwritten in the blob storage and there is no problem as you suspected.

However, a new version of the data asset is created on azureml which points to the same path.

So now, in azure, I have two versions of the dataset which both point to the same file.
image

So that means that even though azure tells you you have 2 versions, you only have the latest.
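The behaviour described in this experiment can be reproduced with a toy in-memory model. This is not the Azure ML API; the class and its methods are made up purely to show why registering a version as a pointer to a fixed path loses the older content.

```python
class ToyWorkspace:
    """Toy model: storage maps paths to content, versions map to paths."""

    def __init__(self) -> None:
        self.storage: dict[str, str] = {}
        self.versions: dict[str, list[str]] = {}

    def run_job(self, asset_name: str, output_path: str, content: str) -> int:
        """Write the file (overwriting) and register a new asset version."""
        self.storage[output_path] = content
        self.versions.setdefault(asset_name, []).append(output_path)
        return len(self.versions[asset_name])

    def load(self, asset_name: str, version: int) -> str:
        """Follow the version's pointer and read whatever is stored there now."""
        return self.storage[self.versions[asset_name][version - 1]]


ws = ToyWorkspace()
path = "mypath/mydata.csv"
v1 = ws.run_job("mydata", path, "col1,col2\n1,2\n")
v2 = ws.run_job("mydata", path, "col1,col2\n1,2\n3,4\n")

# Two versions are registered, but both resolve to the second file's content.
print(v1, v2)
print(ws.load("mydata", 1) == ws.load("mydata", 2))
```

Giving each run its own path (for example by adding a version component to output_path) is what keeps version 1 readable in this model.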

Now the question if we specify the exact output path is:

  • Do you think we should prevent this behavior?
  • or do you think it is the responsibility of the user to correctly specify the path and we should only make sure there is no problem if the versioned flag is set to true ?

I personally think we should prevent this behavior, and that's why I liked the idea of using the pipeline node UUID. That way versioning is handled only by Azure and there is no risk of overwriting existing data. From what I understand, though, this is where we disagree, and I also see your point about having paths that are consistent with the rest of the Kedro ecosystem.

@marrrcin
Contributor

Just a thought - I haven't tested it but maybe it's worth trying - since Azure ML claims to support fsspec, maybe using paths like azureml://datastores/workspaceblobstore/paths/mypath/mydata.csv directly in Kedro's datasets like CSVDataSet will actually work, making the modification of AzureMLAssetDataSet unnecessary?

@tomasvanpottelbergh
Contributor

Just a thought - I haven't tested it but maybe it's worth trying - since Azure ML claims to support fsspec, maybe using paths like azureml://datastores/workspaceblobstore/paths/mypath/mydata.csv directly in Kedro's datasets like CSVDataSet will actually work, making the modification of AzureMLAssetDataSet unnecessary?

As far as I remember @fdroessler found that it doesn't adhere completely to the fsspec interface in Kedro, so I don't think that works unfortunately.

I completely agree with not making the plugin overly complex @fdroessler. If supporting uri_file is possible by adding a small workaround, I think it's worth it, but it seems to raise many implementation questions which we'll have to agree on.

Regarding the "output path" feature: I'm surprised by @Gabriel2409's results, because I did the same experiment for uri_folder and there the pipeline fails if the specified folder is not empty. I can see the reason for this choice, but at the same time it makes my use case (writing to a partitioned dataset folder) impossible... Given this and the uri_file overwrite behaviour, I think always enabling (Kedro-style or other) versioning makes the most sense.

@Gabriel2409
Contributor Author

@marrrcin I think you make a great point. In generator.py, in _get_output, replacing the return value with
Output(type=ds._azureml_type, name=ds._azureml_dataset, path="azureml://datastores/workspaceblobstore/paths/myfolder/") actually saved the data to azureml://datastores/workspaceblobstore/paths/myfolder/projects_train_dataset.csv in my case. So we could probably keep the class as is and use the underlying dataset filepath in Output (but then we would have problems for local runs and it comes with its own set of problems as not all underlying datasets have a filepath arg).

@tomasvanpottelbergh Running the pipeline a second time did not cause an error but overwrote the previous file so I am not sure why your pipeline fails. I think the current version could completely support your use case of partitioned dataset provided we force the path in the Output.

I used the following catalog

projects_train_dataset#urifolder:
  type: kedro_azureml.datasets.AzureMLAssetDataSet
  versioned: True
  azureml_type: uri_folder
  azureml_dataset: projects_train_dataset
  root_dir: data/00_azurelocals/ # for local runs only
  dataset:
    type: pandas.CSVDataSet
    filepath: "projects_train_dataset.csv"

I think that means that we could add another argument, for ex aml_save_path, append the date if versioned is set to True (the default), and use Output(type=ds._azureml_type, name=ds._azureml_dataset, path=ds._aml_save_path).
That would keep all the versions of a uri_folder close to each other and even allow for @tomasvanpottelbergh's use case by setting versioned to False (but you would still see the version incremented on Azure, which might be a problem).
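A minimal sketch of the versioned flag idea, assuming the proposed aml_save_path argument. The function name is hypothetical and the timestamp format is an assumption chosen to resemble Kedro-style version folders; nothing here is implemented in the plugin.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_save_path(
    aml_save_path: str,
    versioned: bool = True,
    now: Optional[datetime] = None,
) -> str:
    """Append a timestamp folder to the save path unless versioning is off."""
    base = aml_save_path.rstrip("/")
    if not versioned:
        # Constant path across runs: later runs write to the same location.
        return base + "/"
    moment = now or datetime.now(timezone.utc)
    # Assumed format; dots instead of colons keep the folder name path-safe.
    stamp = moment.strftime("%Y-%m-%dT%H.%M.%S")
    return f"{base}/{stamp}/"


base = "azureml://datastores/workspaceblobstore/paths/myfolder/"
print(resolve_save_path(base))                   # new folder per run
print(resolve_save_path(base, versioned=False))  # same folder every run
```

The resulting string is what would be passed as the path argument of Output in this proposal.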

Note that in the current implementation the filepath of the underlying dataset is appended to the path in azure, though to be honest I still don't understand why.

@tomasvanpottelbergh
Contributor

@marrrcin I think you make a great point. In generator.py, in _get_output, replacing the return value with Output(type=ds._azureml_type, name=ds._azureml_dataset, path="azureml://datastores/workspaceblobstore/paths/myfolder/") actually saved the data to azureml://datastores/workspaceblobstore/paths/myfolder/projects_train_dataset.csv in my case. So we could probably keep the class as is and use the underlying dataset filepath in Output (but then we would have problems for local runs and it comes with its own set of problems as not all underlying datasets have a filepath arg).

I was indeed talking about local runs, which would be broken when using the azureml:// path. We also deliberately made it impossible to create data(sets) on Azure ML from local runs, since they are not tracked in Azure ML.

@tomasvanpottelbergh Running the pipeline a second time did not cause an error but overwrote the previous file so I am not sure why your pipeline fails. I think the current version could completely support your use case of partitioned dataset provided we force the path in the Output.

Are you talking about uri_file here @Gabriel2409? I didn't test that, but I got an error for using a uri_folder path which is not empty.

I think that means that we could add another argument, for ex aml_save_path, append the date if versioned is set to True (the default), and use Output(type=ds._azureml_type, name=ds._azureml_dataset, path=ds._aml_save_path). That would keep all the versions of a uri_folder close to each other and even allow for @tomasvanpottelbergh's use case by setting versioned to False (but you would still see the version incremented on Azure, which might be a problem).

Sure, although I would force-enable the versioning unless it is somehow possible to use a non-empty path for uri_folder. I'll let you open a PR so we can make the discussion a bit more concrete.

Note that in the current implementation the filepath of the underlying dataset is appended to the path in azure, though to be honest I still don't understand why.

This is because we are always using uri_folder datasets, so we need to add the relative path to the actual file(s). The idea is also to support files that may be nested somewhere in a uri_folder dataset, without having to create a separate dataset for them.

@fdroessler
Contributor

As far as I remember @fdroessler found that it doesn't adhere completely to the fsspec interface in Kedro, so I don't think that works unfortunately.

This is related to how fsspec paths are handled in kedro.io: so far, the azureml fsspec implementation needed a different way to instantiate the filesystem than what is implemented in Kedro. Details can be found here: kedro-org/kedro#4314. This is not an issue on remote runs that use Output(); however, it will probably not work for local runs (did you test this @Gabriel2409?). This might not be too much of a blocker, as so far we don't want to write back to AzureML on local runs, but it would mean some complexity for ensuring proper paths in local runs, similar to what we do atm.

Reading through this thread I can see the following points (let me know if I miss something):

Pros:

  • Support of uri_file
  • Support versioned = False option
  • Allowing users to decide on nicer co-location of files that belong together in the underlying storage account

Cons:

  • Potentially breaking lineage for local runs when allowing fsspec writing
  • Increasing configuration complexity of datasets
  • fsspec is potentially dependent on kedro/kedro-dataset changes (but maybe not a blocker see above)
  • Changing traceability on the UUID in the path

Outstanding questions:

  • Does the overwrite work for both uri_file and uri_folder?

For me if we can avoid the first two cons I don't see why we should not go ahead. But I agree with @tomasvanpottelbergh that if we think we can avoid them we should continue this discussion on a draft PR with the initial proposed solution and iterate together. WDYT @Gabriel2409 @marrrcin

@Gabriel2409
Contributor Author

@fdroessler

Regarding additional configuration complexity, I think this can be solved by providing sane defaults and by making the documentation more comprehensive with multiple examples.

For the local run issue, the way I see it, we have 2 options.

Option 1:

  • add the datastore arg
  • for runs on azure ml, when saving the dataset, use the following path: azureml:pathtodatastore/root_dir/version/filepath_arg/
  • for local runs use the following path workingdirectory/root_dir/version/filepath_arg
  • that way, structure is identical for local and Azure runs

Option 2:

  • have an aml_rootdir and a local_rootdir argument. Local runs use the local_rootdir and Azure runs use the aml_rootdir
  • that way you can put your local files in a folder such as data/00_local and the actual dataset in azureml://datastores/workspaceblobstore/different/path
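Option 2 can be sketched as a single switch on where the pipeline is running. The function name and the running_on_azure flag are hypothetical (how the plugin would detect a remote run is left out), and the example roots are the ones given in the bullet above.

```python
import posixpath


def resolve_dataset_path(
    filepath: str,
    local_rootdir: str,
    aml_rootdir: str,
    running_on_azure: bool,
) -> str:
    """Pick the root for the current run and append the dataset's filepath."""
    root = aml_rootdir if running_on_azure else local_rootdir
    return posixpath.join(root.rstrip("/"), filepath)


local_root = "data/00_local"
aml_root = "azureml://datastores/workspaceblobstore/different/path"

print(resolve_dataset_path("dataset.csv", local_root, aml_root, running_on_azure=False))
print(resolve_dataset_path("dataset.csv", local_root, aml_root, running_on_azure=True))
```

Option 1 would use the same join with a single root_dir and an extra version component, so the layout under the root stays identical for local and Azure runs.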

Note that if you really want to keep the pipeline UUID, it can be used in the version part of the path.

@tomasvanpottelbergh I will open a PR when I have the time so that we can explore the different possibilities. I will also rerun my tests for both uri_files and uri_folders to see if I can reproduce the errors you have with the overwrite.
From my initial explorations, I think the difficulty will be to correctly set the path attribute of the dataset in conjunction with the correct path argument in Output so that it works for uri_files and folders.

@marrrcin
Contributor

Let's see it in PR then and continue from there. Thank you guys for a comprehensive discussion on the topic :)

Given the complexity, I agree that this feature should only be enabled explicitly, when the user really wants it. If they want it, they will most likely end up searching for it in the docs.


4 participants