Feature request: add WARC output option #412

Open
midaspt opened this issue Sep 23, 2024 · 3 comments

Comments

@midaspt

midaspt commented Sep 23, 2024

Hi all.

I have been using monolith more and more for webpage capture but couldn't find a way to make downloads in WARC format (as documented at https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/).

I believe such an option would greatly enhance the reach of monolith as a general purpose utility.

Anyways thanks for your great work as it is. 😎

@hugo-akaora

Hello, in the meantime you can maybe use https://github.com/steffenfritz/html2warc ?

@snshn
Member

snshn commented Dec 2, 2024

Hi @midaspt,

I'm very glad to learn that you're finding use for monolith! WARC should be simple to do, and I'll likely implement it around the same time as MHTML. Long story short, I'll make monolith first crawl the target document and download all assets into a store of sorts (a cache), and then build either a monolithic HTML, MHTML, or WARC file from it. This way it won't require much redundant code, and the process will be essentially the same for every output format. The first step right now is to revamp the caching mechanism; I'll work on it ASAP.

Hi there @hugo-akaora, thank you for the link! It's in Python, but I'll use it as a reference; it seems like a straightforward format.

Cheers,
Sunshine
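
The crawl-then-build approach described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not monolith's actual code: `Asset`, `AssetStore`, `OutputFormat`, `Warc`, and `Listing` are hypothetical names, the store is a plain in-memory map rather than a real cache, the record IDs are placeholder counters instead of real UUIDs, and `Listing` stands in for the HTML/MHTML builders. The WARC builder emits bare `resource` records in the WARC/1.0 layout (version line, named fields, blank line, content block, two CRLFs).

```rust
use std::collections::BTreeMap;

/// One downloaded asset (hypothetical type, not part of monolith).
struct Asset {
    media_type: String,
    body: Vec<u8>,
}

/// In-memory stand-in for the asset store ("cache"), keyed by URL.
#[derive(Default)]
struct AssetStore {
    assets: BTreeMap<String, Asset>,
}

impl AssetStore {
    fn insert(&mut self, url: &str, media_type: &str, body: &[u8]) {
        let asset = Asset {
            media_type: media_type.to_string(),
            body: body.to_vec(),
        };
        self.assets.insert(url.to_string(), asset);
    }
}

/// Every output format is just a builder over the same populated store.
trait OutputFormat {
    fn build(&self, store: &AssetStore) -> Vec<u8>;
}

/// Writes one WARC/1.0 `resource` record per stored asset.
struct Warc {
    /// Timestamp for WARC-Date, supplied by the caller (e.g. "2024-12-02T00:00:00Z").
    date: String,
}

impl OutputFormat for Warc {
    fn build(&self, store: &AssetStore) -> Vec<u8> {
        let mut out = Vec::new();
        for (i, (url, asset)) in store.assets.iter().enumerate() {
            // Placeholder record ID: a real writer would generate a unique UUID.
            let header = format!(
                "WARC/1.0\r\n\
                 WARC-Type: resource\r\n\
                 WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-{:012x}>\r\n\
                 WARC-Date: {}\r\n\
                 WARC-Target-URI: {}\r\n\
                 Content-Type: {}\r\n\
                 Content-Length: {}\r\n\r\n",
                i + 1,
                self.date,
                url,
                asset.media_type,
                asset.body.len()
            );
            out.extend_from_slice(header.as_bytes());
            out.extend_from_slice(&asset.body);
            // Each record ends with two CRLFs after the content block.
            out.extend_from_slice(b"\r\n\r\n");
        }
        out
    }
}

/// Trivial second format (tab-separated listing), standing in for HTML/MHTML.
struct Listing;

impl OutputFormat for Listing {
    fn build(&self, store: &AssetStore) -> Vec<u8> {
        let mut out = String::new();
        for (url, asset) in &store.assets {
            out.push_str(&format!(
                "{}\t{}\t{}\n",
                url,
                asset.media_type,
                asset.body.len()
            ));
        }
        out.into_bytes()
    }
}
```

With this shape, the crawl step only has to fill an `AssetStore` once, and producing several formats at the same time means calling `build` on each format with the same store.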

@hugo-akaora

Hello @snshn, nice if it can be implemented directly in monolith! <3

It would be really great to be able to output multiple formats at the same time :) I'll definitely use that feature!
