Skip to content
This repository has been archived by the owner on Jan 2, 2024. It is now read-only.

Added aws s3 data node #828

Closed
wants to merge 42 commits into from

Conversation

Forchapeatl
Copy link

@Forchapeatl Forchapeatl commented Nov 17, 2023

Added AWS S3 DataNode

image

Fixes Avaiga/taipy#536

  • Implemented a new predefined storage type "s3" in DataNodeConfig, to model data stored on an S3 bucket.
  • List and design the configuration attributes required.
    - Create S3 datanode.
    - Read stream object from s3 bucket.
    - Write stream object to s3 bucket.
  • Implemented the corresponding S3DataNode inheriting from the parent DataNode class.

Todo

  • Add reference manual documentation.
  • Write file to s3 bucket

@Forchapeatl
Copy link
Author

Hello @toan-quach and @jrobinAV , please permit me to tag you directly. For some reason, I can't seem to request for a review.
Please have a look at my PR.

@toan-quach
Copy link
Member

@Forchapeatl awesome!! Thanks for tagging us! We will review as soon as we can :D Thanks again for your help!

Copy link
Member

@trgiangdo trgiangdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for a great first PR. I believe this will be a good addition to taipy-core.

I have a few comments about naming and DataNode abstract methods that need to be implemented.
Other than that, there are minor linting errors that should be fixed.

src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
Copy link
Member

@jrobinAV jrobinAV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is going in the right direction. Thanks a lot.

The configure method in the DataNodeConfig is missing though.

src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
@jrobinAV
Copy link
Member

Hello @Forchapeatl.

Thank you very much for your contribution. For information, we are currently working on a significant repository refactoring. We are merging sub-repositories into a single repository. For instance, the taipy-core repository is about to be merged into the taipy repository.

This should not impact your work.

But just so you are not surprised, instead of merging your PR, we will move your code from this branch (in taipy-core) into a new branch of another repository (taipy). We will let you know and let you review the code.

Thank you for your understanding.

@Forchapeatl
Copy link
Author

Forchapeatl commented Nov 21, 2023

Hello @jrobinAV thank you for your kind review. The requested changes have been made. The S3ObjectDataNode has been implemented and tested with the _read and _write fuctions in thier expected signatures.

The issue now is the _append method. I had a chat with @trgiangdo concerning this method. I saw how this was possible when we were adding files to a single bucket. Now our DataNode is making reference to a single object not bucket. Handling this method should be possible by copying all the data to a new object, appendng the new content and writing back to S3 , but I don't think it's worth the effort in cases where the object data (file) is very large ( I may be wrong ) . Please for your opinion.

Reference: Append Data to S3 Object

@trgiangdo
Copy link
Member

Thank you for your information. _append() is an optional method, so if S3 does not support appending, then it's totally fine.

@trgiangdo
Copy link
Member

About the DataNodeConfig, you probably want to take a look at this documentation page.

Then, you can take a look at the src/taipy/core/config/data_node_config.py file, where we define different configuring methods for different data node types, and implement a new configure_s3_data_node() method.

For now, the todo list:

  • Implement configure_s3_data_node() method to the DataNodeConfig class
  • Add _STORAGE_TYPE_VALUE_S3 attribute and add that attribute to the _ALL_STORAGE_TYPES list.
  • Update DataNodeConfig docstring.
  • Add configuration tests for S3DataNode that created from an s3 DataNodeConfig.

@jrobinAV
Copy link
Member

Hello @jrobinAV thank you for your kind review. The requested changes have been made. The S3ObjectDataNode has been implemented and tested with the _read and _write fuctions in thier expected signatures.

The issue now is the _append method. I had a chat with @trgiangdo concerning this method. I saw how this was possible when we were adding files to a single bucket. Now our DataNode is making reference to a single object not bucket. Handling this method should be possible by copying all the data to a new object, appendng the new content and writing back to S3 , but I don't think it's worth the effort in cases where the object data (file) is very large ( I may be wrong ) . Please for your opinion.

Reference: Append Data to S3 Object

Hello,

Yes, I agree with you. Let's forget abt the append method. It is an optional method anyway.

Thank you so much.

@Forchapeatl
Copy link
Author

Hello everyone, the requested changes have been made. This includes.

  • Implement configure_s3_object_data_node() method to the DataNodeConfig class
  • Add _STORAGE_TYPE_VALUE_S3 attribute and add that attribute to the _ALL_STORAGE_TYPES list.
  • Add configuration tests for S3ObjectDataNode that created from an s3_object DataNodeConfig.
  • Update DataNodeConfig docstring.( I don't really know what this means but i believe it's been covered ).

image

Please let me know if more changes are required.

src/taipy/core/config/data_node_config.py Show resolved Hide resolved
src/taipy/core/config/data_node_config.py Show resolved Hide resolved
src/taipy/core/config/data_node_config.py Show resolved Hide resolved
tests/core/data/test_aws_s3_data_node.py Outdated Show resolved Hide resolved
tests/core/data/test_aws_s3_data_node.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Outdated Show resolved Hide resolved
src/taipy/core/data/aws_s3.py Show resolved Hide resolved
@trgiangdo
Copy link
Member

I believe it has the same problem. A small update on the expected_error_message should fix it.

@Forchapeatl
Copy link
Author

Forchapeatl commented Dec 4, 2023

image
Passed all 1108 tests. I belive s3ObjectDataNode has been succesfully added and tested.

@trgiangdo
Copy link
Member

trgiangdo commented Dec 5, 2023

The tests are failing:

File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/taipy/core/data/aws_s3.py", line 13, in <module>
    import boto3
ModuleNotFoundError: No module named 'boto3'
>       assert expected_error_message in caplog.text
E       assert '`storage_type` field of DataNodeConfig `data_nodes` must be either csv, sql_table, sql, mongo_collection, pickle, excel, generic, json, parquet, s3_object, or in_memory. Current value of property `storage_type` is "bar".' in 'ERROR    Taipy:config.py:237 `storage_type` field of DataNodeConfig `data_nodes` must be either csv, sql_table, sql, ..., pickle, excel, generic, json, parquet, in_memory, or s3_object. Current value of property `storage_type` is "bar".\n'
E        +  where 'ERROR    Taipy:config.py:237 `storage_type` field of DataNodeConfig `data_nodes` must be either csv, sql_table, sql, ..., pickle, excel, generic, json, parquet, in_memory, or s3_object. Current value of property `storage_type` is "bar".\n' = <_pytest.logging.LogCaptureFixture object at 0x118d6c970>.text

The parquet, s3_object, or in_memory is different from parquet, in_memory, or s3_object

Can you check again?

tests/conftest.py Outdated Show resolved Hide resolved
remove blank line
@trgiangdo
Copy link
Member

The tests are up and running. That's great 😁

Can you take a look at the linter error in https://github.com/Avaiga/taipy-core/actions/runs/7102562575/job/19333215991?pr=828?
This is so that we have consistent code style in our project.

@Forchapeatl
Copy link
Author

The tests are up and running. That's great 😁

Thank you. The code should be working fine now.

@Forchapeatl
Copy link
Author

Well :) this is unexpected , @trgiangdo , I did run the pylint test as you said . I am abit reluctant to fix this error because , I am unable to reproduce the error on my environment , let me know what you think.
image

@trgiangdo
Copy link
Member

About the error

Running: pycodestyle --max-line-length=120 --ignore=E121,E123,E126,E226,E24,E704,W503,W504,E203 .
./tests/core/data/test_aws_s3_data_node.py:56:62: E231 missing whitespace after ','
./tests/core/data/test_aws_s3_data_node.py:76:61: E231 missing whitespace after ','

You just need to add a space after the comma in line 56 and line 76 and it should works

@trgiangdo
Copy link
Member

Hello @Forchapeatl.

FYI, this taipy-core repository is merged into the taipy repository and is going to be archived soon.

So, instead of merging your PR, can you create a new PR in the main taipy repository, so you can be counted as an official Taipy contributor by Github?
The content of the PR should be very similar.

@trgiangdo
Copy link
Member

Migrated to Avaiga/taipy#585

@trgiangdo trgiangdo closed this Dec 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

S3DataNode
4 participants