Add support for direct S3 access on SageMaker tasks #1081
Comments
After some investigation, it seems that mountpoint-s3 might not be a viable solution, because it requires containers to be launched in a specific way that SageMaker does not support. I will look instead into other SageMaker file modes, although for GraphBolt we require access to files that are created by the job and not pre-existing: https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html

EDIT: The file modes available on SageMaker do not allow reading files that are created on S3 during the training/processing job, which makes them hard to use for our purposes. In addition, streaming file modes create read-only file systems on SM containers, which does not allow e.g. DGL to convert DistDGL files to GraphBolt in-place.
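The read-only constraint described above can be detected before a conversion is attempted. The following is a minimal sketch, not part of GraphStorm or DGL: `can_convert_in_place` is a hypothetical helper name, and it assumes the partition data is visible as a local directory inside the container.

```python
import os
import tempfile


def can_convert_in_place(partition_dir: str) -> bool:
    """Return True if new files can be created inside partition_dir.

    Hypothetical helper: an in-place DistDGL-to-GraphBolt conversion
    writes its output next to the input files, so the directory must
    be writable.
    """
    if not os.path.isdir(partition_dir):
        return False
    try:
        # Attempt a real write instead of trusting permission bits,
        # which can be misleading on mounted or streamed file systems.
        with tempfile.NamedTemporaryFile(dir=partition_dir):
            pass
    except OSError:
        return False
    return True
```

On a directory backed by a streaming input mode, a check like this would be expected to return `False`, in which case the data has to be copied to writable local storage first.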
…ageMaker (#1083)

*Issue #, if available:*

*Description of changes:*

* Add a new SageMaker job to convert DistPart data to GraphBolt. This is currently our only option, as there's no way to directly use S3 as a writable, shared file system in SageMaker; see #1081 for details.
* The `sagemaker/launch_graphbolt_convert.py` script launches the SageMaker job, which downloads the entire partitioned graph to one instance and then runs the GB conversion, one partition at a time. Because DGL writes the new fused CSC graph representation in the same directory as the input data, we can't use one of SageMaker's FastFile modes to stream the data, as that creates read-only filesystems.
* [Optional] We also include an example of how one could use a SageMaker Pipeline to run the GSPartition and GBConvert jobs in sequence, but this can be removed (because SageMaker Pipelines are persistent once created).
* Added a unit test mechanism to test the SageMaker scripts, starting with our parsing logic. To make the scripts available to the runner's Python runtime, we add the `graphstorm/sagemaker/launch` directory to the runner's `PYTHONPATH`.

EDIT: One note about the PR: the changes to the partition launch that use a SageMaker Pipeline are for demonstration purposes. I think I'll remove them altogether and just have separate partition/gbconvert jobs. But we might want to have an example of how to programmatically build an SM pipeline, e.g. from gsprocessing to training (as SM jobs).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

---------

Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
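The "one partition at a time" flow described in the commit can be sketched roughly as follows. This is an illustration only, not the code from the PR: `convert_partitions_sequentially` and the `convert_one` callback are hypothetical names, and the sketch assumes that the partition metadata JSON carries a `num_parts` field and that partitions sit in `part<N>` directories next to it.

```python
import json
from pathlib import Path
from typing import Callable, List


def convert_partitions_sequentially(
    metadata_path: str, convert_one: Callable[[Path], None]
) -> List[Path]:
    """Call convert_one(partition_dir) for each partition, in order.

    Assumes (for illustration) a metadata JSON with a "num_parts" key
    and partition directories named part0, part1, ... beside it.
    """
    meta_file = Path(metadata_path)
    num_parts = int(json.loads(meta_file.read_text())["num_parts"])
    converted = []
    for part_id in range(num_parts):
        part_dir = meta_file.parent / f"part{part_id}"
        # The actual conversion routine is supplied by the caller;
        # it is expected to write its output inside part_dir.
        convert_one(part_dir)
        converted.append(part_dir)
    return converted
```

Processing partitions sequentially keeps peak memory close to the size of a single partition, at the cost of longer wall-clock time on the single conversion instance.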
Because DistDGL, and by extension GraphStorm, assumes a shared filesystem to function properly, our SageMaker implementations need to perform various downloads and uploads to "fake" the existence of a shared filesystem, by downloading data locally to specific locations per instance.
This introduces a maintenance burden, as we can't make the same environment assumptions for SageMaker execution as for EC2 with EFS, and it requires a lot of glue code to make the two systems compatible.
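To make the kind of glue code concrete, here is a minimal sketch of the bookkeeping a per-instance download involves. It is not GraphStorm's implementation: `plan_downloads` is a hypothetical name, the `part<N>/` directory layout is an assumption, and the actual S3 transfer is left out so that the example stays self-contained.

```python
from pathlib import Path
from typing import Dict, Iterable


def plan_downloads(
    keys: Iterable[str], s3_prefix: str, local_root: str, instance_rank: int
) -> Dict[str, Path]:
    """Return {s3_key: local_path} for the files one instance needs.

    Assumes (for illustration) that partition data lives under
    <prefix>/part<rank>/ and that any other file under the prefix,
    such as metadata JSON, is needed by every instance.
    """
    prefix = s3_prefix.strip("/") + "/"
    own_part = f"part{instance_rank}"
    plan = {}
    for key in keys:
        if not key.startswith(prefix):
            continue
        relative = key[len(prefix):]
        if not relative:
            continue  # skip the bare "directory" marker object
        parts = relative.split("/")
        first = parts[0]
        is_part_dir = first.startswith("part") and first[4:].isdigit()
        if is_part_dir and first != own_part:
            continue  # another instance's partition
        plan[key] = Path(local_root).joinpath(*parts)
    return plan
```

With a shared filesystem none of this mapping would be needed, since every instance would see the same paths.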
Mountpoint for S3 is an AWS project that allows entire S3 buckets to be mounted onto EC2 instances and treated as a (mostly) regular filesystem. If we are able to use S3 buckets as virtual shared filesystems for SageMaker, we should be able to simplify and align the codebase. We note that the use-cases suggested by the mountpoint-s3 project align with ours:
We propose starting with a POC that modifies our SageMaker images and entry points to use mountpoint-s3, but does not affect the user-facing launch scripts, providing a backwards-compatible solution for our users.
Our first target will be adding GraphBolt support to SageMaker DistPartition, which is currently not possible, because DistDGL to GraphBolt partition conversion assumes that the leader instance has access to the entire distributed graph on disk. Following that, we can migrate our other SageMaker tasks to mountpoint-s3, where shared filesystems are normally required: