Store blobs on S3 #4088

Open · bsuttor opened this issue Jan 9, 2025 · 11 comments

@bsuttor (Member) commented Jan 9, 2025

PLIP (Plone Improvement Proposal)

Responsible Persons

Proposer: Benoît Suttor

Seconder: Martin Peeters

Abstract

This PLIP proposes adding support for integrating Plone with S3 (Simple Storage Service) for storing content-related files, images, and other binary data. By leveraging the S3 protocol as a backend storage solution, Plone would allow websites to offload storage to a scalable, highly available cloud solution, providing cost savings, redundancy, and improved performance for large deployments.

Motivation

Currently, Plone relies on local disk storage for managing files, which can limit scalability, especially for high-traffic sites or sites with significant file storage needs. Integrating S3 into Plone will offer the following benefits:

  • Scalability: Automatically scales with your data needs without requiring manual intervention.
  • Global Availability: S3’s network of data centers provides fast access to content, irrespective of the user’s geographical location.
  • Simplified Maintenance: By outsourcing storage to S3, you can reduce the load on your web server and simplify infrastructure management.

Moreover, many modern web applications and content management systems already leverage S3 for storage, and providing native support in Plone will make it easier for users to integrate Plone into cloud-centric architectures.

Assumptions

Proposal & Implementation

Technical Details

The integration would be implemented using the boto3 library (the Python SDK for AWS), which allows interaction with S3.

This integration could be inspired by collective.s3blobs for downloading and uploading blobs to S3.
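
For context, here is a minimal boto3 sketch of the upload/download primitives such an integration would build on. The bucket name, endpoint, and key layout are illustrative assumptions; MinIO and other S3-compatible services are reached via `endpoint_url`:

```python
import boto3

# Endpoint and bucket are illustrative; omit endpoint_url for AWS S3 itself.
s3 = boto3.client("s3", endpoint_url="https://minio.example.org")


def upload_blob(blob_path: str, key: str, bucket: str = "plone-blobs") -> None:
    """Upload a committed blob file to the bucket."""
    s3.upload_file(blob_path, bucket, key)


def download_blob(key: str, target_path: str, bucket: str = "plone-blobs") -> None:
    """Fetch a blob from the bucket into a local file."""
    s3.download_file(bucket, key, target_path)
```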

The following key features would be implemented:

  • File Storage Management: Plone would be able to upload, retrieve, and manage files in an S3 bucket.
  • Transparent File Access: Files that are uploaded through Plone's interface would be stored in S3, while the file paths and metadata would be stored in the Plone database.
  • Configuration: Users would configure their S3 credentials, bucket name, and other options (such as the region) via the Plone registry or environment variables.
  • Size threshold: Administrators could define the minimum size for a blob to be uploaded to S3 (e.g., 1 MB); smaller blobs would be stored in classical blobstorage/RelStorage (sketched below).
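
A minimal sketch of what this configuration and routing could look like. The environment variable names and the 1 MB default are assumptions for illustration, not a settled naming scheme:

```python
import os

# Hypothetical environment variables; the actual names are not decided.
S3_BUCKET = os.environ.get("PLONE_S3_BUCKET")
S3_REGION = os.environ.get("PLONE_S3_REGION", "eu-west-1")
# Blobs smaller than this stay in classical blobstorage/RelStorage.
S3_MIN_SIZE = int(os.environ.get("PLONE_S3_MIN_SIZE", 1024 * 1024))


def should_store_on_s3(blob_size: int) -> bool:
    """Decide where a blob goes based on the configured threshold."""
    return S3_BUCKET is not None and blob_size >= S3_MIN_SIZE
```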

I thought RelStorage would be a good place to implement this because the goal, in my case, is to deploy Plone with data separated from the application: "Data.fs" could be stored on Postgres (for example) and blobs on S3.

But after talking about it with Maurits, maybe it's better to add an adapter on ZODB blobs or on plone.namedfile and use that?

This feature would be opt-in and would not break existing Plone setups. Plone installations without this integration would continue to function normally, using blobstorage as before. Admins would need to enable and configure the integration explicitly.

Deliverables

  • Documentation explaining how to use S3 as blob storage
  • To be defined

Risks

Potential Issues

  • Cost Control: While S3 offers cost savings, it is important to ensure that users understand the pricing model, as frequent file access or large volumes of data could incur significant charges.
  • Dependency: Users must be aware that they are relying on AWS or MinIO for file storage, which introduces an external dependency and potential single point of failure.
  • Security: Proper handling of credentials and access permissions is crucial to prevent unauthorized access to files.
  • Performance: for small files, we need to test whether the connection to S3 is fast enough to be effective in production.
  • Fallback to Local Storage: in case S3 is unavailable, a fallback mechanism to store files locally could be added.
  • Caching: good caching is always hard to get right, but small blobs should be cached.

Participants

To be defined, but I am interested

@ale-rt (Member) commented Jan 9, 2025

Might be relevant: https://www.youtube.com/watch?v=kYBBysLk80A, CC @davisagli

@gforcada (Member) commented Jan 9, 2025

We are very interested in this PLIP!

@stevepiercy (Contributor) commented:

Please add documentation to the Deliverables section.

I would also be very interested in the significantly lower cost B2 storage service from Backblaze. Perhaps this PLIP could design an interface that allows a choice of cloud storage providers, instead of being designed for only one. If that's possible, then one cloud storage service could be a fallback to another.

@mpeeters (Member) commented:

@stevepiercy B2 is compatible with the S3 API. The idea is to be compatible with the S3 API, which many providers support.

@davisagli (Member) commented:

@bsuttor @mpeeters Thanks for starting this PLIP. I had also been thinking about it a bit over the holidays. I'll add my notes below in case you want to fold some of my ideas into the PLIP, but I think you've already covered a lot of what I had in mind.

Motivation:

  • In a large site, blobs can take up a sizable portion of the database
  • This has a noticeable impact on the performance of whole-db operations like backup/restore, site-to-site copies and packing.
  • S3-compatible storage is widely available and is an obvious alternative to try to support.
  • Storing blobs this way can reduce the frequency of managing changes in disk capacity for the main database.
  • (On the other hand, it comes with the cost of some added complexity of managing interactions with an additional system.)
  • Serving files often has different characteristics and best practices compared to other requests (e.g. cache strategy)

Design goals:

  • Optionally store blobs (files, images) in S3-compatible storage including Amazon S3, other cloud service providers that offer an S3-compatible API, and (self-hosted) MinIO.
  • Configuration by environment variables for access key, secret key, and bucket
  • Ideally make it as compatible as possible with existing code that works with binary data. (So, implement it within the ZODB or within plone.namedfile rather than requiring other code to use something new.)
  • Don't add a hard requirement on any large libraries like boto3 (but an optional extra is okay).
  • Include support in the official plone-backend Docker image.
  • Provide some way to inspect and see what is using space.
  • Provide some way to garbage collect unused blobs (packing)
  • Provide some sort of local cache for frequently accessed blobs
  • Maybe support alternative download schemes (e.g. link to a CDN or to S3 instead of to /@@download or /@@images). But then how do we handle auth? This might be out of scope.
  • Maybe handle small and large files differently
  • Maybe support custom logic for choosing a bucket dynamically?

I think there are 2 pretty different directions we could go for the implementation:

  1. Try to support it at the ZODB storage level, probably with a wrapper that can be used with various underlying storages. collective.s3blobs uses this approach. I'm worried it might not perform well enough for writes (it's not great to have a transaction pending while we send a lot of data over the internet), and it would make the data in S3 pretty opaque, making it hard to see anything useful about where it came from.
  2. Add some storage abstraction to plone.namedfile so that it can write to either ZODB blobs or other storage backends. I'm currently leaning toward this way because it means a lot more context would be available, so we could do things like writing to different buckets based on the current path, or tagging files in S3 with what content they are part of.
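
As a very rough illustration of direction 2, the abstraction could be a small pluggable interface in plone.namedfile. The interface name and methods below are hypothetical, not an existing API:

```python
# Hypothetical sketch only; plone.namedfile has no such interface today.
from zope.interface import Interface, implementer


class IBlobBackend(Interface):
    """Pluggable backend for the binary payload of a named file."""

    def store(data, content_type, context):
        """Store the payload; return an opaque key for later retrieval."""

    def retrieve(key):
        """Return a readable file-like object for the stored payload."""


@implementer(IBlobBackend)
class S3BlobBackend:
    """Writes to an S3 bucket. Because the content object (context) is
    available here, blobs could be tagged with the path or UID of the
    content they belong to, or routed to different buckets."""

    def store(self, data, content_type, context):
        ...  # upload via boto3, return the object key

    def retrieve(self, key):
        ...  # download via boto3, return an open file
```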

Prior art to investigate:

@bsuttor (Member, Author) commented Jan 20, 2025

@jensens what do you think about this PLIP?

@jensens (Member) commented Jan 20, 2025

I think it is overall a good idea. But it is a complex topic.

I tend toward supporting it at the ZODB storage level. @davisagli has already summarized most of the problems with it. We may want to store the blobs first in the ZODB (as now), defer the upload to S3, and finally remove them from the ZODB afterwards.

@ericof (Member) commented Jan 22, 2025

I'm a bit late to the party, but I'm adding my two cents here.

Picking the solution

plone.namedfile

tl;dr: I would prefer not to go the plone.namedfile way.

During BeethovenSprint 2022, @jensens started working on a branch of plone.namedfile as a solution for storing blobs in an object storage (an S3-like solution).
After the sprint, I created a proof of concept of a site using this branch to see how it would behave, and even though it is a simple solution that works, the following problems appeared:

  • Not easy to implement versioning of the blob files as we have today
  • Garbage collection is also not easy (maybe a marker interface + subscriber could be used, but it seemed like I would be solving the problem in the wrong place)

ZODB blobs

I decided to explore the idea of implementing the object-storage integration on the ZODB level, and this is what (I think) needs to be done:

  • Code a solution that implements ZODB.interfaces.IBlobStorage and ZODB.interfaces.IBlob (roughly sketched after this list)
  • Use the original file extension (jpg, png, doc, pdf) instead of .blob to allow us to serve blobs directly from the object-storage
  • Expose the object-storage path to plone.namedfile and other packages, which could then decide whether to proxy the content or serve it directly from the object storage.
  • Study how to patch/adapt relstorage.blobhelper so the solution would also work with RelStorage
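
A very rough sketch of that idea: a wrapper storage that delegates to an underlying storage and redirects blob traffic to S3. The signatures are simplified; the real IBlobStorage contract has more arguments and transaction integration:

```python
# Sketch only; signatures are simplified compared to the real
# ZODB.interfaces.IBlobStorage contract.
from zope.interface import implementer
from ZODB.interfaces import IBlobStorage


@implementer(IBlobStorage)
class S3BlobStorageOverlay:
    """Overlay over a regular storage that keeps blob data in S3."""

    def __init__(self, base_storage, s3_client, bucket):
        self._base = base_storage
        self._s3 = s3_client
        self._bucket = bucket

    def loadBlob(self, oid, serial):
        # Download from S3 into a local cache file and return its path.
        ...

    def storeBlob(self, oid, serial, data, blobfilename, transaction):
        # Upload the blob file to S3 instead of the local blobstorage.
        ...

    def __getattr__(self, name):
        # Delegate every other storage method to the wrapped storage.
        return getattr(self._base, name)
```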

Additional Points

  • Security: It is possible to use presigned URLs to download the content from the object storage (with an expiration); see the sketch after this list.
  • Compatibility: Maybe use MinIO as the lowest common denominator.
  • Metadata: Currently plone.namedfile and plone.scale gather/process the metadata of a blob, so the idea of having a local cache seems interesting to avoid performance issues.
  • Performance: I would avoid writing to the ZODB (either FileStorage or RelStorage) and then moving the blob to the object storage, as this seems to add considerable overhead.
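
For the security point, boto3's `generate_presigned_url` covers this directly; the bucket and key layout here are illustrative:

```python
import boto3

s3 = boto3.client("s3")


def presigned_download_url(bucket: str, key: str, expires: int = 300) -> str:
    """Return a URL that grants read access for `expires` seconds."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )
```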

@yurj (Contributor) commented Jan 23, 2025

  • Use the original file extension (jpg, png, doc, pdf) instead of .blob to allow us to serve blobs directly from the object-storage

In my opinion, we only need to preserve the MIME type, which can be stored in S3 object metadata. Also, what happens if you rename the file? You would have to rename it in the cloud. Changing the MIME type, on the other hand, would only touch metadata.
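
A small sketch of that approach, passing the MIME type (and a download filename) as object metadata at upload time; the function name and key layout are illustrative:

```python
import boto3

s3 = boto3.client("s3")


def upload_with_metadata(path, bucket, key, mime_type, filename):
    # The MIME type lives in object metadata, not in the key's extension.
    s3.upload_file(
        path,
        bucket,
        key,
        ExtraArgs={
            "ContentType": mime_type,
            "ContentDisposition": f'attachment; filename="{filename}"',
        },
    )
```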

Another option is to run ZEO directly on S3 and let ZEO serve files directly. This option would also benefit RelStorage.

@sneridagh (Member) commented:

Since I know that @datakurre has lots of experience in this matter, especially with async operations in the backend related to blob in/out operations, I'd love to hear his 5 cents on the problem. So pinging him would be a good idea.

So, Asko, could you please add your take on this complex problem? Thanks in advance!

@datakurre (Member) commented:

Ok. My 2 cents. It might make sense to not try to achieve all goals at once, but split this into multiple PLIPs. For example, being able to scale blob storage with local MinIO cluster should be kept separate from achieving global availability with AWS S3 based "caching" for public assets.

Also, S3 is not a standard, and implementations vary. We had to support our local storage system, which had quite limited S3 support, and we had to check every feature and adapt our design (e.g. the bucket path had to end with a filename and extension, because its presigned URLs didn't support setting Content-Disposition). MinIO is good, but then we should be careful in advertising: an implementation working with both AWS and MinIO might still not work with everything advertised as S3-compatible.

Our S3 use case was to completely bypass the Plone backend for selected file fields (except for permissions). We implemented a Volto widget and middleware, and didn't touch the backend. The widget allowed direct upload to and download from the S3 service. All browser access (both read and write) to S3 was done with very short-lived presigned URLs, and the bucket had no public access without these. So, the widget always accessed the Volto middleware first; the middleware checked permissions with the backend and then generated presigned URLs when allowed. This solution had no relationship with ZODB transactions and required external scheduled garbage collection to go through the bucket and remove blobs that no longer had related Plone content (matching by UID as part of the object path in the bucket).
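
A sketch of that scheduled garbage collection, assuming a `uid/filename` key layout in the bucket and a set of existing content UIDs obtained elsewhere (e.g. from a catalog query):

```python
import boto3

s3 = boto3.client("s3")


def collect_garbage(bucket: str, existing_uids: set) -> None:
    """Delete objects whose leading UID no longer matches Plone content."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            uid = obj["Key"].split("/", 1)[0]
            if uid not in existing_uids:
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
```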

I like @ericof's proposal. Blob storage has been described as an "overlay over the regular storage". An overlay that writes to and reads from S3 should be possible without requiring changes to any other parts of Plone. It would already solve storing blobs locally with a scalable MinIO cluster. If presigned read URLs to object-storage paths could then be exposed through plone.namedfile, even better. This would not solve all the goals, but it should be the low-hanging fruit to start with.
