-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to correctly create uri_file data assets ? #70
Comments
I think this is related to the code below in
In fact, I probably forgot something because I'm not sure |
Hi @fdroessler,
It seems the path passed to When I looked at the output of the However I am not exactly sure where in the code |
Observing this dicsussion, but waiting for input from @fdroessler 👀 |
I am not sure what is going on. I remember @tomasvanpottelbergh mentioned something on using uri_file as output dataset. I think we have it even excluded as subsequent inputs through kedro-azureml/kedro_azureml/generator.py Lines 163 to 166 in 58a26ad
but can't remember exactly why. I think I remember some limitations as the reason but that part Tomas is more familiar with. |
As @fdroessler said: you shouldn't use the If there is really a demand for this, we could technically set the dataset name to be the filename, but I feel that will just make things inconsistent and create more problems than it solves. Is there a particular reason why you need a |
Hi @tomasvanpottelbergh, Nevertheless, I feel like other people might want to use a uri_file data asset, mostly because it is the simplest data asset and it can help get started with kedro-azureml. In my opinion, not supporting uri_file as output is a bit of a shame and might drive people away from the plugin. However, and that is the main problem from what I understand based on your comments, azure does not permit to create uri_file data assets from within a pipeline. Is this correct ? Is this specific to pipelines ? I was able to create a uri_file dataset by running the following job but it is not within a pipeline so I don't know if there is an issue associated with the pipeline itself. from azure.ai.ml import command, Input, Output, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential
subscription_id = "xxxxxxxxxxxxxxxxxxxxxxx"
resource_group = "myrg"
workspace = "myws"
ml_client = MLClient(
DefaultAzureCredential(), subscription_id, resource_group, workspace
)
input_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/titanic.csv"
output_path = (
"azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic.csv"
)
data_type = AssetTypes.URI_FILE
# Set the input and output for the job:
inputs = {"input_data": Input(type=data_type, path=input_path)}
outputs = {
"output_data": Output(
type=data_type,
path=output_path,
name="myurifiledataset",
)
}
job = command(
command="cp ${{inputs.input_data}} ${{outputs.output_data}}",
inputs=inputs,
outputs=outputs,
environment="myenv:1",
compute="mycluster",
)
ml_client.jobs.create_or_update(job) |
Just to clarify: Thanks for the example. Unfortunately I didn't save the pipeline I was using to test this, but I think the problem is that the If that is the case, the only way I see that we can support Do you think it is still useful to add the feature given these limitations, or do you see another way to implement it? |
Yes I can confirm that adding the path to Regarding your comments:
Yes I was able to make it work as an input (except the small issue with local runs, see issue kedro-org/kedro-plugins#68)
In my opinion, if a data asset is the output of a node in the pipeline, it should create a new version. Indeed running the node means you are trying to recreate the data asset. What do you think?
where in the code is this ? I did notice that you don't give the path to the output but I don't know how Azure knows it should use the default datastore. And does that mean that non default datastores are not currently supported? I do believe that the datastore should be an optional parameter. Maybe it is possible to usethe To answer your question I don't see a good way to implement it for now but if you can point me to where in the code Azure takes over and generates the path, I am happy to investigate. |
I agree, but requires separating the path into the containing folder and the filename, and probably using the Kedro versioning approach.
This is done here: kedro-azureml/kedro_azureml/generator.py Lines 171 to 182 in 80b483e
As you can see, the |
Hi @tomasvanpottelbergh, output_path = "azureml://datastores/workspaceblobstore/paths/quickstart-output/titanic" # no .csv extension Which is why I don't understand the error when using kedro-azureml and trying to save as a uri_file
I would understand the uri_file being saved was saved with an incorrect extension but I don't understand why there is a path error. Is there something I am missing on how node outputs are handled when using an AzureMLAssetDataSet, maybe a subtlety on the save function? |
@Gabriel2409 as you say saving the file without an extension is possible, but since this behaviour is not consistent with how the The error you see is because the path Azure ML injects as a CLI argument is a folder for the I do see the use of being able to specify the output path / datastore for a dataset and will have a look at adding this. Once this is working for Let us know if you have any ideas about this or if you want to (help) implement this. |
@Gabriel2409 I did some more investigation and what I've written above definitely seems feasible (the Azure ML SDK v2 can get the default datastore). There are some design questions though on which it would be good to also have the opinion of @marrrcin and @fdroessler:
What do you all think? |
Hi @tomasvanpottelbergh, For item 1, personally I would go with the full path. It is quite easy to retrieve the datastore path in azure. Note that the path is only needed when saving. When you load, you can just use the data asset name. For item 2, I think versioning should be done by default because of the way data assets work. It does not make sense to overwrite the data asset location with new data in my opinion (it may even pose problems, I don't know if Azure allows you to write on the location of an existing data asset). My main point is that when a data asset is an output of a node, it means the user wants to create a new version of this data asset and as such, it should be saved in a new location (though I would be interested in hearing counter arguments). As you say in item 3, right now, each time you save a data asset, For the uri_file, once again the name of the file is not really important but the extension is, so I would add another argument for uri_file: So to summarize, I suggest something like this:
|
My first thought on this is more of a question. Is the aim of the plugin to provide full flexibility and accessibility to as many of AzureML capabilities as possible or should be to as many as sensible? I am still not 100% convinced that not providing the uri_file dataset as output option will drive people away from the plugin. If I'm already using the plugin than accessing the data from a However IF we choose to do that (and I am not against it just want to challenge it a bit) here are my answers to the questions posed by @tomasvanpottelbergh.
Intuitively I would say we should abstract complexity away from the user so I would go for the relative path plus
Not sure I understand the question/implications completely or more specifically what is the difference of this question with 3)? But on the versioned flag I think it should be the one on the AssetDataset and not the underlying dataset. I wonder how this behaves if we have multiple versioned datasets that point to files in a folder dataset O_o maybe then it has to be on the underlying dataset? I think at the moment it needs to default to true but I guess this can be refactored as part of making the uri_file work. However, is this tightly linked to uri_file or more general?
I can see the appeal of being able to determine the folder structure in the storage account/datastore ourselves. I need to explore the uri structure a bit more to see which option makes the most sense to me. |
Hi @fdroessler,
I ran some tests. Consider the following code: from azure.ai.ml import command, Input, Output, MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential
from azureml.core import Workspace
# Set your subscription, resource group and workspace name:
subscription_id = "mysubscriptionid"
resource_group = "myrg"
workspace = "myws"
ws = Workspace.get(
subscription_id=subscription_id, resource_group=resource_group, name=workspace
)
# connect to the AzureML workspace
ml_client = MLClient(
DefaultAzureCredential(), subscription_id, resource_group, workspace
)
input_path = "mydata.csv" #local file
output_path = "azureml://datastores/workspaceblobstore/paths/mypath/mydata.csv"
data_type = AssetTypes.URI_FILE # can be both uri_file or folder here, it does not matter
# Set the input and output for the job:
inputs = {
"input_data": Input(
type=data_type,
path=input_path,
)
}
outputs = {
"output_data": Output(
type=data_type,
path=output_path,
name="mydata",
)
}
# This command job copies the data to your default Datastore
job = command(
command="cp ${{inputs.input_data}} ${{outputs.output_data}}",
inputs=inputs,
outputs=outputs,
environment="myenv:1",
compute="mycluster",
)
# Submit the command
ml_client.jobs.create_or_update(job) Here mydata.csv looks like this
The first time I run the code, a new data asset is created on azureml. If I go to the underlying blob storage, i can find it here
Now I modify mydata.csv to look like this
and I rerun the same code. The file is overwritten in the blob storage and there is no problem as you suspected. However, a new version of the data asset is created on azureml which points to the same path. So now, in azure, I have two versions of the dataset which both point to the same file. So that means that even though azure tells you you have 2 versions, you only have the latest. Now the question if we specify the exact output path is:
I personally think we should prevent this behavior and that's why i liked the idea of using the pipeline node uuid. That way versioning is handled only by azure and there is no risk to overwrite existing but from what I understand, this is where we disagree. And I also see your point of having paths that are consistent with the rest of kedro ecosystem. |
Just a thought - I'm haven't tested it but maybe it's worth trying - since Azure ML claims to supports |
As far as I remember @fdroessler found that it doesn't adhere completely to the I completely agree with not making the plugin overly complex @fdroessler. If supporting Regarding the "output path" feature: I'm surprised by @Gabriel2409's results, because I did the same experiment for |
@marrrcin I think you make a great point. In @tomasvanpottelbergh Running the pipeline a second time did not cause an error but overwrote the previous file so I am not sure why your pipeline fails. I think the current version could completely support your use case of partitioned dataset provided we force the path in the Output. I used the following catalog projects_train_dataset#urifolder:
type: kedro_azureml.datasets.AzureMLAssetDataSet
versioned: True
azureml_type: uri_folder
azureml_dataset: projects_train_dataset
root_dir: data/00_azurelocals/ # for local runs only
dataset:
type: pandas.CSVDataSet
filepath: "projects_train_dataset.csv" I think that means that we could add another argument, for ex Note that in the current implementation the filepath of the underlying dataset is appended to the path in azure, though to be honest I still don't understand why. |
I was indeed talking about local runs, which would be broken when using the
Are you talking about
Sure, although I would force-enable the versioning unless it is somehow possible to use a non-empty path for
This is because we are always using |
This is related to how fsspec paths are handled in Reading through this thread I can see the following points (let me know if I miss something): Pros:
Cons:
Outstanding questions:
For me if we can avoid the first two cons I don't see why we should not go ahead. But I agree with @tomasvanpottelbergh that if we think we can avoid them we should continue this discussion on a draft PR with the initial proposed solution and iterate together. WDYT @Gabriel2409 @marrrcin |
Regarding additional configuration complexity, I think this can be solved by providing sane defaults and by making the documentation more comprehensive with multiple examples. For the local run issue, the way I see it, we have 2 options. Option 1:
Option 2:
Note that if you really want to keep the pipeline uiid it can be used in the @tomasvanpottelbergh I will open a PR when I have the time so that we can explore the different possibilities. I will also rerun my tests for both uri_files and uri_folders to see if I can reproduce the errors you have with the overwrite. |
Let's see it in PR then and continue from there. Thank you guys for a comprehensive discussion on the topic :) Given the complexity, I agree that this feature should be only explicitly enabled when the user really wants to - if they want to , they will most likely end up searching for it in the docs. |
I am trying to save a data asset as a uri_file and the dataset is incorrectly saved as a uri_folder when I launch
kedro azureml run
I have the following catalog:
and the following pipeline which just opens the local file and saves it
I expected a new data asset to be created on azure as an uri_file. However, i get the following info on azure
It seems my file is not saved correctly, which seems to correspond to this part in cli.py if I am not mistaken
How can I correctly create a uri_file data asset ?
The text was updated successfully, but these errors were encountered: