Support .pages files #8

Open
mish15 opened this issue Apr 6, 2015 · 11 comments

Comments

@mish15
Member

mish15 commented Apr 6, 2015

Can we easily support the .pages extension?

@oprimus
Contributor

oprimus commented Apr 26, 2015

Not as straightforward as first thought. It is just a zip file, but inside there are Apple's IWA files rather than XML. IWA files are a protobuf stream compressed with Snappy, sort of.

http://stackoverflow.com/questions/27454317/decompressing-snappy-files-missing-stream-identifier-chunk-and-crc-32c-checksum

https://github.com/google/protobuf
https://code.google.com/p/snappy-go/

@oprimus
Contributor

oprimus commented Apr 26, 2015

The snappy-go implementation doesn't seem to be compatible with Apple's butchered implementation. I'm getting past the missing stream identifier by prepending it to the reader:
snappy.NewReader(io.MultiReader(strings.NewReader("\xff\x06\x00\x00sNaPpY"), file))

The problem now appears to be that Apple is using the old COPY_4 tag which the snappy golang library doesn't support (as in it detects it and says "unsupported COPY_4 tag"). All other golang snappy libraries appear to be based on this one so don't support it either.

I've implemented the COPY_4 tag by porting it from implementations in other languages, in particular https://github.com/gray/compress-snappy/blob/master/src/csnappy_decompress.c. However, it now reports that the input is corrupt, so there must be something else that I can't track down.
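As a rough illustration of what handling that tag involves, here is a standalone sketch, not the port mentioned above. It assumes the Snappy block format's layout for this element: a tag byte whose low two bits are 11, a copy length stored as length minus one in the upper six bits, and a 4-byte little-endian offset. The names `applyCopy4` and `errCorrupt` are hypothetical, and the sketch appends to a slice rather than writing into a preallocated output buffer.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// errCorrupt is a hypothetical sentinel error for this sketch.
var errCorrupt = errors.New("snappy sketch: corrupt input")

// applyCopy4 decodes one copy element with a 4-byte offset. src[s] must be
// the tag byte. It appends the copied bytes to dst and returns the grown
// dst together with the index in src just past the element.
func applyCopy4(dst, src []byte, s int) ([]byte, int, error) {
	if s < 0 || s+5 > len(src) {
		return dst, s, errCorrupt // tag plus 4 offset bytes must be present
	}
	tag := src[s]
	if tag&0x03 != 0x03 {
		return dst, s, errCorrupt // not a copy element with a 4-byte offset
	}
	length := int(tag>>2) + 1 // 1..64 bytes
	offset := uint64(binary.LittleEndian.Uint32(src[s+1 : s+5]))
	if offset == 0 || offset > uint64(len(dst)) {
		return dst, s, errCorrupt // offset points before the start of the output
	}
	// Copy byte by byte: when offset < length the source and destination
	// ranges overlap, which is how repeated runs are encoded.
	start := len(dst) - int(offset)
	for i := 0; i < length; i++ {
		dst = append(dst, dst[start+i])
	}
	return dst, s + 5, nil
}
```

For example, with `dst` holding "abcd" and the element bytes `0x0b 0x04 0x00 0x00 0x00` (length 3, offset 4), the output becomes "abcdabc".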

So far I haven't seen anybody successfully reading these files, so they pretty much need to be treated as a proprietary file format.

If we want to progress I think the next step is to use the C implementation of snappy to see if that reads it. If it doesn't then I'm not sure where to go next.

@mish15
Member Author

mish15 commented Apr 26, 2015

Does this help? The .iwa files described here seem to be the same format:
https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md

Looks like snappy is kind of followed, but not really.

"they do not include the required Stream Identifier chunk, and compressed
chunks do not include a CRC-32C checksum.
The stream is composed of contiguous chunks prefixed by a 4 byte header.
The first byte indicates the chunk type, which in practice is always 0 for
iWork, indicating a Snappy compressed chunk. The next three bytes are
interpreted as a 24-bit little-endian integer indicating the length of the
chunk. The 4 byte header is not included in the chunk length."
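The chunk layout quoted above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not code from this repository: `splitIWAChunks` and the two error values are hypothetical names, and each returned payload would still need to be passed to a raw Snappy block decoder, which is not shown.

```go
package main

import "errors"

// Hypothetical sentinel errors for this sketch.
var (
	errTruncated = errors.New("iwa sketch: truncated chunk")
	errChunkType = errors.New("iwa sketch: unexpected chunk type")
)

// splitIWAChunks walks a compressed IWA stream and returns the payload of
// each chunk. Every chunk starts with a 4-byte header: one type byte
// (expected to be 0) and a 24-bit little-endian payload length that does
// not include the header itself. There is no stream identifier and no
// CRC-32C checksum to skip.
func splitIWAChunks(data []byte) ([][]byte, error) {
	var chunks [][]byte
	for len(data) > 0 {
		if len(data) < 4 {
			return nil, errTruncated
		}
		typ := data[0]
		n := int(data[1]) | int(data[2])<<8 | int(data[3])<<16
		data = data[4:]
		if typ != 0 {
			return nil, errChunkType
		}
		if n > len(data) {
			return nil, errTruncated
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks, nil
}
```

The decompressed output of all chunks, concatenated in order, would then form the stream that contains the protobuf messages.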

@mish15
Member Author

mish15 commented Apr 26, 2015

Any trace of where the corrupt error comes from? E.g. is it in the header read
or the chunk processing loop? You're hardcoding the decoded length in the
stream identifier, which is the first check for corruption.

From what I can read it's definitely doable. Looks like it's in the
snappy "framing format", not pure snappy, so probably needs to be read
and decoded in chunks instead of a single block as per
https://code.google.com/p/snappy/source/browse/trunk/framing_format.txt

Can you upload the WIP branch?
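On the decoded-length check: a raw Snappy block begins with the uncompressed length encoded as a varint, and a decoder typically validates that value before doing anything else. The sketch below shows one way to read it with the standard library so that a failure in this step can be told apart from a failure in the decode loop. The names `blockDecodedLen` and `errBadPreamble` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// errBadPreamble is a hypothetical sentinel error for this sketch.
var errBadPreamble = errors.New("snappy sketch: bad length preamble")

// blockDecodedLen reads the varint preamble of a raw Snappy block. It
// returns the declared uncompressed length and the number of bytes the
// preamble occupies, so the caller knows where the elements start.
func blockDecodedLen(block []byte) (uint64, int, error) {
	v, n := binary.Uvarint(block)
	// n == 0 means the buffer was too short; n < 0 means varint overflow.
	// Snappy limits the uncompressed length to 32 bits.
	if n <= 0 || v > 0xffffffff {
		return 0, 0, errBadPreamble
	}
	return v, n, nil
}
```

Calling this on the first bytes of each chunk payload and logging the result would show whether the corruption is reported before or after the preamble is read.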

@oprimus
Contributor

oprimus commented Apr 27, 2015

Commit 7ed3c56
The snappy decoder needs to be altered to disable checksums for this to work (see below). With that change we can get the uncompressed stream and find the archive length of the first object. However, when trying to unmarshal the ArchiveInfo I get an "unexpected EOF".

vi ~/go/src/code.google.com/p/snappy-go/snappy/decode.go
		case chunkTypeCompressedData:
			// Section 4.2. Compressed data (chunk type 0x00).
			//if chunkLen < checksumSize {
			//	r.err = ErrCorrupt
			//	return 0, r.err
			//}
			buf := r.buf[:chunkLen]
			if !r.readFull(buf) {
				return 0, r.err
			}
			//checksum := uint32(buf[0]) | uint32(buf[1])<<8 | uint32(buf[2])<<16 | uint32(buf[3])<<24
			//buf = buf[checksumSize:]

			n, err := DecodedLen(buf)
			if err != nil {
				r.err = err
				return 0, r.err
			}
			if n > len(r.decoded) {
				r.err = ErrCorrupt
				return 0, r.err
			}
			if _, err := Decode(r.decoded, buf); err != nil {
				fmt.Println("decode error", err)
				r.err = err
				return 0, r.err
			}
			//if crc(r.decoded[:n]) != checksum {
			//	fmt.Println("checksum")
			//	r.err = ErrCorrupt
			//	return 0, r.err
			//}
			r.i, r.j = 0, n
			continue

mish15 added a commit that referenced this issue Apr 27, 2015
mish15 added a commit that referenced this issue Apr 27, 2015
These came from:
https://github.com/obriensp/iWorkFileFormat/tree/master/iWorkFileInspector/iWorkFileInspector/Messages/Proto

See Issue #8
mish15 added a commit that referenced this issue Apr 27, 2015
See Issue #8

This captures some of these.
oprimus pushed a commit that referenced this issue Apr 27, 2015
mish15 added a commit that referenced this issue May 1, 2015
@dhowden
Member

dhowden commented Sep 26, 2015

The snappy tests are failing (no doubt due to the changes you mention here not being compatible with the tests). I have marked the failing tests to be skipped for the moment, but we really need to fix this.

@gonedjur

gonedjur commented Jan 31, 2018

I see that you handle three cases: a QuickLook PDF if one is available, an XML file, or the protobuf IWA files.

Does any of this work for iWork '14 files?
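
To make the three cases concrete, a check over the entry names of the .pages zip archive might look like the sketch below. Everything specific in it is an assumption for illustration: the entry names (`QuickLook/*.pdf`, `index.xml`, `*.iwa`), the order of preference, and the function name `classifyPages`. It is not taken from this repository's code.

```go
package main

import "strings"

// classifyPages (hypothetical helper) inspects the entry names of a .pages
// zip archive and reports which extraction path looks usable. The entry
// names and the priority order are assumptions, not documented behavior.
func classifyPages(names []string) string {
	var hasPDF, hasXML, hasIWA bool
	for _, name := range names {
		switch {
		case strings.HasPrefix(name, "QuickLook/") && strings.HasSuffix(name, ".pdf"):
			hasPDF = true
		case name == "index.xml":
			hasXML = true
		case strings.HasSuffix(name, ".iwa"):
			hasIWA = true
		}
	}
	switch {
	case hasPDF:
		return "quicklook-pdf"
	case hasXML:
		return "xml"
	case hasIWA:
		return "iwa"
	}
	return "unknown"
}
```

The names can be collected from the `Name` field of each entry in a `zip.Reader` from the standard `archive/zip` package. An archive that comes back as "unknown" or "iwa" is the case where the Snappy and protobuf handling discussed above matters.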

@mish15
Member Author

mish15 commented Jan 31, 2018

The best thing to do is to test it and see. The pages format is pretty hacky.

@gonedjur

gonedjur commented Feb 1, 2018

Looks like a no.

2018/02/01 14:39:28 Received file: t.pages (application/vnd.apple.pages)
archiveInfo:
2018/02/01 14:39:28 {"body":"","meta":{},"msecs":2}

Edit:

I wonder how these guys do it: https://cloudconvert.com/formats/document/pages

They handle 5.5 somehow. They're the only ones I've seen do it.

@mish15
Member Author

mish15 commented Feb 1, 2018

We welcome pull requests! :)

@mish15
Member Author

mish15 commented Feb 1, 2018

It’s definitely possible, we just need to play with the encoding. From memory, it wasn’t well documented anywhere at the time, but it may be these days.
