Content-Type for files uploaded via S3 automatically set to application/xml #1840

Open
westonpace opened this issue Jan 12, 2022 · 2 comments
Labels
documentation (This is a problem with documentation.), p3 (This is a minor priority issue)

Comments

@westonpace

Describe the bug

When I upload a file to S3 using a multipart upload request, the file's content-type is set to application/xml unless I specify otherwise. This seems incorrect: a content-type should be omitted if unknown or, at worst, default to application/octet-stream. Per RFC 7231 (Section 3.1.1.5):

A sender that generates a message containing a payload body SHOULD
generate a Content-Type header field in that message unless the
intended media type of the enclosed representation is unknown to the
sender. If a Content-Type header field is not present, the recipient
MAY either assume a media type of "application/octet-stream"
([RFC2046], Section 4.5.1) or examine the data to determine its type.

This caused some confusion in apache/arrow#11934. An S3 client was trying to be intelligent by inspecting the contents of any file marked as XML, and because of this default it ended up inspecting files it shouldn't have.
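To illustrate why the mislabeled type matters, here is a minimal sketch of how a recipient might decide whether to inspect an object as XML. It is not the code of any particular S3 client: `EffectiveMediaType` and `ShouldInspectAsXml` are hypothetical names, an empty string stands in for a missing Content-Type header, and falling back to application/octet-stream is only one of the two options the RFC allows.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not from any S3 client): returns the media type a
// recipient would assume. An empty string stands in for a missing
// Content-Type header, which is an assumption of this sketch.
std::string EffectiveMediaType(const std::string& content_type_header) {
  // Drop parameters such as "; charset=utf-8".
  std::string media_type =
      content_type_header.substr(0, content_type_header.find(';'));
  // Trim trailing whitespace left in front of the ';'.
  while (!media_type.empty() &&
         std::isspace(static_cast<unsigned char>(media_type.back()))) {
    media_type.pop_back();
  }
  // Media types are case-insensitive, so compare in lowercase.
  std::transform(media_type.begin(), media_type.end(), media_type.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  // RFC 7231 lets the recipient assume application/octet-stream when no
  // media type is given.
  if (media_type.empty()) return "application/octet-stream";
  return media_type;
}

// Hypothetical check: only objects explicitly labeled as XML get inspected.
bool ShouldInspectAsXml(const std::string& content_type_header) {
  const std::string media_type = EffectiveMediaType(content_type_header);
  return media_type == "application/xml" || media_type == "text/xml";
}
```

With a check like this, an object stored with no content-type is left alone, while any object the uploader labeled application/xml by default is inspected as XML regardless of what it actually contains.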

Expected behavior

If the content type of a file is not set, the file should either have no content-type or have its content-type set to application/octet-stream.

Current behavior

The file's content-type is set to application/xml.

Steps to Reproduce

Reproducible Gist: https://gist.github.com/westonpace/9c3a0baa48083f33aa4880c0cb6a602b

Possible Solution

When the user does not specify a content-type, either leave it unset or default to application/octet-stream.
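A minimal sketch of the second option, written as caller-side logic rather than as a change to the SDK. `ResolveContentType` is a hypothetical helper, not an SDK function. I would expect its result to be passed to `CreateMultipartUploadRequest::SetContentType` before starting the upload, but that wiring is left out so the snippet stays self-contained.

```cpp
#include <string>

// Hypothetical helper, not part of the AWS SDK: picks the content-type to
// send when the user may not have specified one.
std::string ResolveContentType(const std::string& user_content_type) {
  // Assumption of this sketch: an empty string means "not specified".
  static const char kDefaultContentType[] = "application/octet-stream";
  return user_content_type.empty() ? std::string(kDefaultContentType)
                                   : user_content_type;
}
```

An explicit value such as "text/csv" passes through unchanged, and an empty value becomes application/octet-stream, so the object never picks up the application/xml default.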

AWS CPP SDK version used

1.8.185

Compiler and Version used

GCC 9.3.0

Operating System and version

Ubuntu 20.04.3

@westonpace added the bug and needs-triage labels on Jan 12, 2022
@KaibaLopez
Contributor

Hi @westonpace,
A quick question before I dig too deep into this: have you tried using the TransferManager for multipart uploads, or is there a reason you can't? I just tried it and didn't see the same behavior, so it might be a good workaround to get you unblocked.

@KaibaLopez added the response-requested label and removed the needs-triage label on Jan 19, 2022
@westonpace
Author

@KaibaLopez Thanks for the suggestion. I was working on the Apache Arrow S3 filesystem adapter (https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc), which currently does not use the transfer manager. Although that may be an interesting experiment someday, it would add an extra dependency and be a bit more of a change.

I'm not really blocked by this; it was simple enough to ensure we always specify the content type. Perhaps the main issue is simply that this default isn't documented anywhere, so it came as a surprise and took a little while to isolate as the root cause.

@github-actions bot removed the response-requested label on Jan 21, 2022
@jmklix added the documentation and p3 labels and removed the bug label on Mar 8, 2023