Releases: Own-Data-Privateer/hoardy-web
tool-v0.22.0+simple_server-v1.9.0
[tool-v0.22.0] - 2025-01-17: Incremental improvements
Changed
- `*`:
  - Formatted code using `black`.
  - Fixed minor issues found by `pylint`.
  - Simplified a bunch of code.
  - Moved a bunch of code to `kisstdlib`.
  - Changed the rest to work with the new version of `kisstdlib`.
  - Improved error handling and error messages.
- `serve`:
  - From now on, when archiving/dumping, by default, it will both return errors to the client and also print them to `stderr`.
    The latter can be disabled with `--quiet`.
- Improved documentation.
Fixed
- `*`:
  - Fixed extension detection for `.css` and `.mjs` files.
  - Fixed a potential crash when parsing `WRR` files.
- `serve` and, likely, `organize`:
  - Fixed it not working on Windows. (Thanks to @douglasg14b on GitHub!)
[simple_server-v1.9.0] - 2025-01-17: Incremental improvements
Changed
- Formatted code using `black`.
- Fixed minor issues found by `pylint`.
- Improved metadata.
- Improved documentation.
Fixed
- Fixed version detection, i.e. the `--version` option works again.
tool-v0.21.0
[tool-v0.21.0] - 2024-12-29: Bugfixes, incremental improvements
Fixed
- `serve`:
  - When remapping URLs, hash/fragment parts will be preserved now.
  - When replaying, `HTTP` status codes will be preserved now.
  - From now on, by default, `serve` will replay archived `HTTP` headers over `HTTP`, instead of inlining them into rendered `HTML` documents (see below).
  - When both archiving and replaying, newly dumped reqres that fail given input filters will no longer be indexed and made available for replay.
- `mirror`, `serve`:
  - Reqres containing redirects (e.g. `302 Found`) are handled properly now.
  - From now on, implicit favicons will be mirrored and replayed properly (see below).
- `*`:
  - Fixed `--*grep*` filtering of headers with multiple values.
Added
- `scrub`, `mirror`, `serve`:
  - Added `inline_headers` option, making inlining of headers as `meta http-equiv` tags optional.
  - Implemented `inline_fallback_icon` option.
    When enabled, this option adds a fallback `<link rel=icon href="/favicon.ico">` to the result when the input declares no icons and that URL remaps to something useful.
    This option is then enabled by default, thus fixing replay of implicit favicons.
- `serve`:
  - Implemented `--web` and `--mirror` options which control how headers should be replayed.
    With `--web` enabled, `serve` will invoke `scrub` with `-inline_headers` and will replay those headers over `HTTP` instead.
    With `--mirror` it will continue to use `scrub` with `+inline_headers`, like `mirror` does.
    From now on, `--web` is the default.
  - Implemented `--oldest` and `--nearest` options, similar to those of `mirror`.
  - Added more namespaces other than `/web/` and made `serve` use them for different kinds of targets when remapping.
    So that, e.g., links pointing to unavailable URLs get remapped to `/unavailable/<date>/<url>` and links pointing to redirects get remapped to `/redirect/<date>/<url>`.
    This makes links much more informative when hovering over them or when looking at the log output of `serve`.
    For replay, however, all those namespaces are equivalent and can be used interchangeably.
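The namespace scheme described for `serve` can be pictured with a small Python sketch. It assumes a replay path has the shape `/<namespace>/<date>/<url>`, as in the examples above, and treats the namespace purely as a label. The function names and the namespace set are hypothetical and are not taken from `hoardy-web` itself.

```python
# Hypothetical illustration, not hoardy-web's actual code.
# Only the namespaces named in this changelog entry are listed here.
KNOWN_NAMESPACES = {"web", "unavailable", "redirect"}


def split_replay_path(path: str) -> tuple[str, str, str]:
    """Split "/<namespace>/<date>/<url>" into its three parts."""
    namespace, date, url = path.lstrip("/").split("/", 2)
    if namespace not in KNOWN_NAMESPACES:
        raise ValueError(f"unknown namespace: {namespace!r}")
    return namespace, date, url


def replay_key(path: str) -> tuple[str, str]:
    """For replay the namespace is ignored: only the date and URL matter."""
    _, date, url = split_replay_path(path)
    return date, url
```

Under these assumptions, `/web/2024/https://example.org/a` and `/redirect/2024/https://example.org/a` produce the same replay key, which is what "equivalent and interchangeable" amounts to.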
Changed
- `serve`, `mirror`:
  - Renamed `--ignore-bad-inputs` -> `--ignore-some-inputs`.
  - Changed default input filters to allow reqres containing redirects.
- `mirror`:
  - Added a default value for the root filters, which is `--root-status-re ".[23]00C"`, to prevent redirects being added as roots.
  - Added `--queue-all-indexed` option to make the previous item optional.
  - Changed (simplified) semantics of the `--boring` option.
    From now on, making a path `--boring` simply disables queuing of its reqres as roots.
    This allows for more interesting uses.
- Improved documentation.
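To see what the default root filter pattern `.[23]00C` accepts, here is a minimal Python sketch. It assumes status strings have the shape of one character, the `HTTP` status code, and a trailing `C` (e.g. `C200C`); that shape and the helper name are assumptions made for illustration only.

```python
import re

# The default root filter pattern quoted in the changelog entry above.
ROOT_STATUS_RE = re.compile(r".[23]00C")


def may_be_root(status: str) -> bool:
    """Return True when the status string matches the default root pattern."""
    # Assumption: statuses look like "C200C", "C302C", and so on.
    return ROOT_STATUS_RE.match(status) is not None
```

With that assumption, `C200C` passes while a `302 Found` redirect (`C302C`) does not, so redirects are not queued as roots.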
extension-v1.19.0
[extension-v1.19.0] - 2024-12-21: Reworked popup UI, better replay integration
Changed (1)
- Popup UI:
  - Reorganized the whole layout by assigning tags to all elements and allowing switching between those tags as if they were tabs.
    The original idea was to unroll in steps a-la `uBlock Origin`, but this is superior.
  - Improved some help strings.
Added
- Core + Popup UI + Shortcuts:
  - Added `Replay from the archiving server` configuration option.
    It's a tristate of: disallow, enable if `Submit dumps via 'HTTP'` option is enabled and the server supports it, enable even if `Submit dumps via 'HTTP'` option is disabled.
  - Added `Include in global replays` per-tab option.
  - Added popup UI button and keyboard shortcut, both of which re-navigate all tabs for which `Include in global replays` is set to their replays.
  - Added popup UI button, keyboard shortcut, and context menu item, all of which re-navigate the currently active tab to its replay.
  - Added `Force 'Work offline' in replayed tabs` configuration option, which does the same thing the similar options for `file:` and `data:` URLs do, but for tabs that point to replay URLs.
    Enabled by default.
  - Added `🎄 Winter Days mode` seasonal theme.
  - Added `Escape notification messages` configuration option to help support more notification daemons.
    Disabled by default.
Changed (2)
- The `Help` page:
  - Merged "Handling of failures" section into "Archival".
  - Reworded some awkward places.
- Core + `manifest.json`:
  - Improved server checking logic and error messages.
  - Improved keyboard shortcut descriptions.
- Improved documentation.
Fixed
- Core:
  - `Snapshot` buttons and keyboard shortcuts will no longer take `DOM` snapshots of replay pages, unless `Capture snapshots of all URLs` option is set.
  - On Chromium, fixed `Hoardy-Web` trying to collect and archive replay pages.
extension-v1.18.0+tool-v0.20.0+simple_server-v1.8.0
[extension-v1.18.0] - 2024-12-16: Replay integration, incremental improvements
This release integrates the `extension` with `tool-v0.20.0`, which can now do both archival and replay over `HTTP`, see below.
Changed
- Core:
  - From now on, all requests to all URLs under `Server URL` will be ignored, allowing you to work with `tool-v0.20.0`-replayed pages without fiddling with any settings.
  - From now on, the `extension` will respect the archiving server's settings and features given by its `/hoardy-web/server-info` endpoint, if such a thing exists.
  - The default value of `Server URL` does not specify the `/pwebarc/dump` endpoint anymore, as this is now configurable server-side.
    For old configs, you can keep the old value, the archiving server handling code will silently elide that path away.
  - From now on, before the first archival, the `extension` will check that a working archiving server is available at the given `Server URL` and generate errors describing what exactly appears to be broken when not.
- Popup UI:
  - From now on, if you set the `Server URL` setting to an empty string, it will be reset to the default value.
  - Improved CPU usage when switching tabs really quickly.
[tool-v0.20.0] - 2024-12-16: Replay over `HTTP`, mirroring of non-`GET` reqres
Changed: Incompatible changes
- `export mirror`:
  - Renamed `export mirror` sub-command to just `mirror`.
- `*`:
  - Renamed all `--no-overwrites` options -> `--no-overwrite`.
- `*`, `--expr`:
  - Renamed `source` -> `agent`.
  - Renamed `raw_path_parts` -> `path_parts`.
  - Renamed `mq_raw_path` -> `mq_path`.
  - Renamed `qsl_urlencode` atom -> `unparse_query`.
Fixed: Incompatible changes
- Improved URL normalization:
  - From now on, it will preserve "=" symbols in query strings even when parameter values are empty, like browsers do.
  - URL path and query quoting and unquoting is now, hopefully, equivalent to what browsers do, too.

  This changes the file names generated by `organize` with the default `--output` format a bit.
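The point about "=" symbols with empty values can be illustrated with Python's standard library. This is only a sketch of the general issue, not `hoardy-web`'s normalizer; `roundtrip_query` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode


def roundtrip_query(query: str, keep_blank: bool = True) -> str:
    """Parse a query string and serialize it back."""
    # keep_blank_values=True keeps parameters like "a=" whose value is empty;
    # the standard library drops them by default.
    pairs = parse_qsl(query, keep_blank_values=keep_blank)
    return urlencode(pairs)
```

Here `roundtrip_query("a=&b=1")` gives back `a=&b=1`, while passing `keep_blank=False` silently loses the `a=` parameter. Note that this standard-library round trip cannot tell `a` from `a=`, which is one reason a dedicated normalizer is needed.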
Added
- `serve`:
  - Implemented the `serve` sub-command, which runs `hoardy-web` as a web server that can replay archived data over `HTTP`, a-la `heritrix` and `pywb`.
    After starting it with something like `hoardy-web serve path/to/your/archives`, you can then navigate to
    - http://127.0.0.1:3210/web/*/* to see the list of all available URLs and their versions (visits), or to
    - something like http://127.0.0.1:3210/web/2/https://archiveofourown.org/works/3733123 to view the latest archived version of that URL, or to
    - something like http://127.0.0.1:3210/web/*/https://archiveofourown.org/works/3733123 to view the list of all visits to this URL,
    - which also works with glob patterns http://127.0.0.1:3210/web/*/https://archiveofourown.org/works/[0-9]*.

    This is very reminiscent of the Wayback Machine by design, yes.
  - Added `/hoardy-web/server-info` endpoint support for future integration with the `extension`, similar to what `simple_server` (`hoardy-web-sas`) now does.
  - Implemented archiving server support when running `serve` with `--archive-to` option.
    This is similar to `simple_server`, except newly archived reqres will become available for replay immediately.
  - Implemented archiving-server-only mode when running `serve` with `--no-replay` option.
    In this mode, it is essentially equivalent to `simple_server`, except `hoardy-web serve` supports arbitrary `--output` formats.
  - Implemented `--latest` option, which only indexes and allows replay for the latest available visit to each available URL.
    Archiving new reqres updates the index accordingly, as expected.
  - Documented it all across the whole repository.
- `mirror`:
  - Implemented rendering of non-`GET` reqres.
    So, e.g., `DOM` snapshots and web search answer pages done via `POST` will be included in the outputs now.
    If you do not want some of those, you can filter them out with `--not-method POST` or some such.
- `scrub`, `mirror`, `serve`:
  - From now on, malformed URLs will be kept as-is instead of being voided out.
  - From now on, more types of IE-pragmas will be censored out by `-iepragmas` (which is the default).
  - From now on, `scrub` will use `+verbose` and `+whitespace` as defaults.
    This is a much nicer default, and after content-addressed outputs were implemented in `tool-v0.19.0`, the space savings `-verbose,-whitespace` produce are mostly inconsequential now.
  - Simplified semantics of `(+|-)pretty`, it does not set the `verbose` option anymore.
- `*`:
  - Added `--structure` and `--raw-qbody` options.
  - Added a bunch more parsed URL properties.
  - Added a bunch more similar reqres properties.
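The glob-style URL listing mentioned for `serve` above can be approximated with Python's `fnmatch` module. Whether `hoardy-web` uses exactly these glob semantics is an assumption, and the second URL in the sample list is made up for the demonstration.

```python
from fnmatch import fnmatchcase

# The first URL comes from the changelog example; the second is hypothetical.
ARCHIVED_URLS = [
    "https://archiveofourown.org/works/3733123",
    "https://archiveofourown.org/works/search",
]


def matching_urls(pattern: str, urls: list[str]) -> list[str]:
    """Return the URLs that match a shell-style glob pattern."""
    return [url for url in urls if fnmatchcase(url, pattern)]
```

With the pattern `https://archiveofourown.org/works/[0-9]*` only the numeric work URL is selected, since `[0-9]` requires a digit right after `works/`.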
Changed
- `mirror`:
  - Changed semantics of `--nearest` option a bit.
    From now on, it will parse its argument as a time interval and then take the middle of it as the target value.
    This is much nicer in practice since, from now on, giving `--nearest 2024` is much less likely to get you the stuff from 2023.
    It will try to give you stuff nearest to `2024-07-02 00:00:00` instead.
  - Improved performance.
- `*`:
  - Renamed `--no-remap` option -> `--raw-sbody`, the old name is kept as an alias.
- Improved documentation and help strings.
  Most notably, the input filtering options are shown only once now.
- `pyproject.toml` now explicitly specifies optional `mitmproxy` file format support.
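The `--nearest 2024` arithmetic above can be checked with a short Python sketch. It handles only a bare year and assumes the interval runs from the start of that year to the start of the next one; `year_midpoint` is a hypothetical helper, and the real argument parser presumably accepts other interval forms too.

```python
from datetime import datetime


def year_midpoint(year: int) -> datetime:
    """Return the middle of the interval [year-01-01, (year+1)-01-01)."""
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    return start + (end - start) / 2
```

Since 2024 is a leap year with 366 days, the midpoint lands 183 days after January 1, i.e. on `2024-07-02 00:00:00`, matching the value quoted above.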
Fixed
- `*` in theory, but only ever triggered by `mirror`:
  - Fixed a file descriptor semi-leak when lazily reloading reqres.
[simple_server-v1.8.0] - 2024-12-16
Added
- Added `-t`, `--to`, and `--archive-to` aliases for `--root`.
- Added `/hoardy-web/server-info` endpoint for future integration with the `extension`.
Changed
- From now on, "/" and most other non-word symbols (except "_", "-", and space) in bucket names are forbidden and will be removed.
  This will simplify some future things.
- From now on, when several buckets are specified via several `profile` query parameters, the last one will be used.
- Renamed `--uncompressed` -> `--no-compress`, the old name is kept as an alias.
- Slightly improved performance.
- Started typechecking with `mypy`.
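The two bucket rules above (stripping forbidden symbols, and the last `profile` parameter winning) can be sketched in Python as follows. The exact character set `simple_server` accepts is not spelled out beyond the description above, so the regular expression, the helper names, and the `"default"` fallback are assumptions.

```python
import re
from urllib.parse import parse_qs


def sanitize_bucket(name: str) -> str:
    """Drop everything except word characters (including "_"), "-", and space."""
    return re.sub(r"[^\w\- ]", "", name)


def bucket_from_query(query: str, default: str = "default") -> str:
    """Pick the last `profile` parameter and sanitize it."""
    profiles = parse_qs(query).get("profile", [])
    return sanitize_bucket(profiles[-1]) if profiles else default
```

For example, `bucket_from_query("profile=one&profile=two%2Fthree")` ignores `one` and returns `twothree`, because the "/" is removed from the last value.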
tool-v0.19.0
[tool-v0.19.0] - 2024-12-07: Powerful filtering, exporting of different URL visits, hybrid export modes
Changed: Semantics
- `*`:
  - In `--expr` expressions, `sha256` function changed semantics.
    From now on it returns the raw hash digest instead of the hexadecimal one.
    To get the old value, use `sha256|to_hex`.
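The difference between a raw digest and a hexadecimal one is the same as between `digest()` and `hexdigest()` in Python's `hashlib`. The sketch below only illustrates that distinction; it is not the `--expr` implementation, and the sample input is arbitrary.

```python
import hashlib

data = b"example"

# Raw digest: 32 bytes, analogous to what `sha256` returns from now on.
raw = hashlib.sha256(data).digest()

# Hexadecimal digest: 64 characters, analogous to `sha256|to_hex`.
hexed = hashlib.sha256(data).hexdigest()
```

Converting the raw bytes with `raw.hex()` yields the same string as `hexed`, which mirrors piping the new `sha256` through `to_hex`.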
Added
- `*` except `organize --move`, `organize --hardlink`, `organize --symlink`, `get`, and `run`:
  - From now on, all sub-commands except for the above can take inputs in all supported file formats.
    I.e., you can now do

        hoardy-web export mirror --to ~/hoardy-web/mirror1 mitmproxy.*.dump

    on `mitmproxy` dumps without even `import`ing them first.
  - By default, the above commands now also automatically dispatch between loaders of different file formats based on file extensions.
    So you can mix and match different file formats on the same command line.
  - Added a bunch of `--load-*` options that force a specific loader instead, e.g. `--load-wrrb`, `--load-mitmproxy`.
- `*`:
  - Added a ton of new filtering options.
    For example, you can now do:

        hoardy-web find --method GET --method DOM --status-re .200C --response-mime text/html \
          --response-body-grep-re "\bPotter\b" ~/hoardy-web/raw

    As before, these filters can still be used with other commands, like `stream`, or `export mirror`, etc.
    `--root-*` options of `export mirror` now use the same syntax and machinery as the normal input filters.

    Also, the overall filtering semantics changed a bit.
    The top-level logical expression the filters compute is now a large conjunction.
    I.e. the above example now compiles to, a bit simplified, `(response.method == "GET" or response.method == "DOM") and re.match(".200C", status) and (response_mime == "text/html") and re.match("\\bPotter\\b", response.body)`.
  - Added a bunch of new `--output` formats.
    Mostly, this adds a bunch of output formats that refer to `stime`s.
    Mainly, to simplify `export mirror --all` usage, described below.
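The "large conjunction" semantics described above (repeated options are OR-ed, different options are AND-ed) can be sketched in Python. The `Reqres` dictionary, its field names, and the `conjunction` helper are hypothetical stand-ins, and the body filter uses `re.search` here as a simplification of a grep-like match.

```python
import re
from typing import Callable

Reqres = dict  # hypothetical stand-in for a parsed request+response
Predicate = Callable[[Reqres], bool]


def conjunction(groups: list[list[Predicate]]) -> Predicate:
    """AND across groups, OR within each group."""
    return lambda r: all(any(p(r) for p in group) for group in groups)


# Roughly the `find` example from the changelog entry above.
potter_filter = conjunction([
    [lambda r: r["method"] == "GET", lambda r: r["method"] == "DOM"],
    [lambda r: re.match(".200C", r["status"]) is not None],
    [lambda r: r["response_mime"] == "text/html"],
    [lambda r: re.search(r"\bPotter\b", r["response_body"]) is not None],
])
```

A reqres has to satisfy every group to pass, so changing the method to `POST` or the status to a non-200 value is enough to filter it out.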
- `export mirror`:
  - Implemented exporting of different URL visits.
    I.e., you can now export not just the `--latest` visit to each URL, but an `--oldest` one, or one `--nearest` to a given date, or `--all` of them.
  - Implemented `--latest-hybrid`, `--oldest-hybrid`, and `--nearest-hybrid` options.
    These allow you to export each page with resource requisites that are date-wise closest to the `stime` of the page itself, instead of taking globally `--latest`, `--oldest`, or `--nearest` versions of all requisite URLs.
    At the moment, this takes a lot more memory, but makes the results much more consistent for websites that do not use versioned resource requisites.
  - Implemented `--hardlink` and `--symlink` options, which allow exporting into content-addressed destinations.
    I.e. `export mirror --hardlink` will render and write each exported file to `<--to>/_content/<hash/based/path>.<ext>` and only then hardlink the result to `<--to>/<output/format/based/path>.<ext>` target destination.
    And similarly for `--symlink`.
    Typically, doing this saves quite a bit of space, e.g., when pages refer to the same resource requisites by slightly different URLs, same images and fonts get distributed via different CDN hosts, when you export `--all` visits to some URLs and many of those are absolutely identical, etc.
    So, from now on, `--hardlink` is the default.
    The old behavior can be achieved by running it with `--copy` instead.
  - Implemented `--relative` and `--absolute` options, which control if URLs should be remapped to relative or absolute `file:` URLs, respectively.
- Documented all the new things.
- Added a bunch of new `test-cli.sh` tests.
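The content-addressed `--hardlink` export described above can be sketched in Python as "write once under `_content/`, then hardlink to the target". The hash function, the directory layout under `_content/`, and the `export_hardlinked` helper are assumptions for illustration; the changelog only says the path is hash-based.

```python
import hashlib
import os
from pathlib import Path


def export_hardlinked(root: Path, rel_target: str, data: bytes, ext: str) -> Path:
    """Store `data` once by content hash, then hardlink it to `rel_target`."""
    digest = hashlib.sha256(data).hexdigest()
    # Hypothetical layout: _content/<first two hex chars>/<rest>.<ext>
    content = root / "_content" / digest[:2] / f"{digest[2:]}.{ext}"
    if not content.exists():
        content.parent.mkdir(parents=True, exist_ok=True)
        content.write_bytes(data)
    target = root / rel_target
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists() or target.is_symlink():
        target.unlink()
    # Both names now point at the same file on disk.
    os.link(content, target)
    return target
```

Exporting identical bytes to two different target paths then costs the disk space of one file, which is where the space savings mentioned above come from.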
Changed
- `export mirror`:
  - Switched default `--output` to `hupq_n` to prevent collisions when using `--*-hybrid` and `--all`.
  - Improved handling of `base` `HTML` tags, `_target`s are supported now.
  - Links that reference a page from itself will no longer refer to the page's filename, even when the link has no `fragment`.
    The results can be a bit confusing, but this makes the new content de-duplication options much more effective.
  - Made `export mirror` default filters explicit and changed them from `--method "GET" --status-re ".200C"` to `--method "GET" --method "DOM" --status-re ".200C"`.
  - Implemented `--ignore-bad-inputs` and `--index-all-inputs` options to allow you to change the above default.
  - Improved output log format.
- Improved file loading performance a bit.
- Improved documentation.
Fixed
- Added a bunch of new tests for `organize`, which cover the `organize --symlink --latest` bug of `tool-v0.18.0`.
  Won't happen again.
- Fixed a couple of silly filtering-related bugs.
tool-v0.18.1
[tool-v0.18.1] - 2024-11-30: Hotfixes
Fixed
`tool-v0.18.0` introduced a bunch of issues:
- `organize`:
  - Fixed `organize --symlink --latest` dereferencing output files, which led to it overwriting plain `WRR` source files containing updated URLs with symlinks to their newer versions.
    The good news is that this bug was only triggered when `organize --symlink --latest` was run with some newly archived data and, for each updated `URL`, it only overwrote the second to last `WRR` file with a symlink to the latest `WRR` file.
    Unfortunately, this error was self-propagating, so those files could then get overwritten again by the next invocation of `organize --symlink --latest` with some more new data.
    This could happen up to 7 times, at which point it would start crashing, because of the OS symlink dereferencing limit.
    You can check if you were affected by running:

        cd ~/web/raw ; find . -type l

    The paths it outputs will be the paths of lost `WRR` files.
    A reminder that it is good to do daily backups, I suppose.
    The next version will have a test for this, but I'm releasing this hotfix an hour after I discovered this.
  - Fixed it `assert`-crashing sometimes when running with `--symlink`.
  - Improved memory consumption a bit.
- `export mirror`:
  - Fixed overly large memory consumption.
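For systems without `find`, the symlink check suggested above for the `organize --symlink --latest` bug can be approximated in Python. This is a sketch that lists every symlink under a directory, like `find . -type l`; `find_symlinks` is a hypothetical helper name.

```python
import os


def find_symlinks(root: str) -> list[str]:
    """List all symlinks under `root`, without following them."""
    found = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append(path)
    return sorted(found)
```

Calling `find_symlinks(os.path.expanduser("~/web/raw"))` would then report the same paths as the shell one-liner, assuming your archive lives in that directory.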
tool-v0.18.0
[tool-v0.18.0] - 2024-11-20: Incremental improvements
Added
- `export mirror`:
  - Implemented the `--boring` option, which allows you to load some input `PATH`s without adding them as roots, even when no `--root-*` options are specified.
    This makes the CLI a bit more convenient to use.
    The `README.md` has a new example showcasing it.
- `export mirror`, `scrub`:
  - Implemented support for `@import` `CSS` rules using a string token in place of a URL.
    As far as I can see, this syntax is rarely used in practice.
    But the spec allows this, so.
  - Implemented `interpret_noscript` option, which enables inlining of `noscript` tags when `scrub` is running with `-scripts`.
    That is, `export mirror` will now use this feature by default.
    This is needed because some websites put `link` tags with `CSS` under `noscript`, thus making such pages look broken when `scrub`bed with `-scripts` (which is the default) and then opened in a browser with scripts enabled.
Changed
- `*`: Refactored/reworked a large chunk of internals, as a result:
  - `organize` can now take `WRR` bundles as inputs too,
  - `export mirror` became much faster at indexing inputs that contain archives of the same URLs, repeatedly.

  In general, these changes are aimed towards making `hoardy-web` completely input-agnostic.
  That is, wouldn't it be nice if you could feed `mitmproxy` files to `export mirror` directly, instead of going through `import mitmproxy` first?
- `export mirror`, `scrub`:
  - From now on, it will stop generating `link` tags with void URLs, it will simply censor them out instead.
  - `scrub` with `+verbose` set will now also show original `rel` attr values for censored out tags.
  - Also, in general, the outputs of `scrub` with `+verbose` set are much prettier now.
- Improved documentation.
tool-v0.17.0
See `CHANGELOG.md`.
extension-v1.17.2
[extension-v1.17.2] - 2024-11-09: Documentation fixes, mostly
Changed
- Rewrote "Conventions" and "'Work offline' mode" sections to be much more readable.
- `*`:
  - Improved contrast when running with a light `CSS` color scheme.
Fixed
- Documentation:
  - Fixed some typos.
- `*`:
  - Fixed some potential state display inconsistency bugs and improved UI pages' init performance when the core is very busy.
extension-v1.17.1
[extension-v1.17.1] - 2024-11-01: Annoyance fixes
Changed
- Popup UI:
  - Reverted most of the block reordering bit of popup UI rework of `extension-v1.17.0`.
    The "Globally" block is near the top again.
  - Edited the "Persistence" block a bit more.
    Mainly, to stop graying out always-useful stat lines, even when the associated features are disabled, to prevent possible confusion there.
  - Renamed some options and stat lines, mostly to make their names shorter to make popup UI on Fenix more readable.
- Toolbar button:
  - Edited its title format to be much shorter, especially on Fenix.
  - Reverted the ordering of parts there to how it was before `extension-v1.17.0`.
    The (much shorter now) "globally" part is at the front again because otherwise the badge being at the front there too without an explanation of its format is kind of confusing.
- Core + All internal pages:
  - Improved message handling infrastructure.
  - Used it to improve initialization functions of all internal pages, improving efficiency and making the resulting UI much less flaky.
- Documented what `webNavigation` permission is used for, improved the rest a bit.
- `*`:
  - Renamed `build.sh` `firefox` target to `firefox-mv2`, for consistency.
Fixed
- UI:
  - Fixed flaky rendering of `Help` and `Changelog` pages on Fenix.
    They render properly now the very first time you load them, no reloads needed.
  - Fixed duplication of history entries when navigating internal links.
  - Fixed source links sometimes failing to be highlighted when pressing the browser's "Back" button.
  - Fixed some small `CSS` nitpicks.
- Popup UI + Documentation:
  - Realigned some help strings with reality.
- Fixed some more mostly inconsequential things.