
Releases: Own-Data-Privateer/hoardy-web

tool-v0.22.0+simple_server-v1.9.0

17 Jan 16:48
tool-v0.22.0

[tool-v0.22.0] - 2025-01-17: Incremental improvements

Changed

  • *:

    • Formatted code using black.
    • Fixed minor issues found by pylint.
    • Simplified a bunch of code.
    • Moved a bunch of code to kisstdlib.
    • Changed the rest to work with the new version of kisstdlib.
    • Improved error handling and error messages.
  • serve:

    • From now on, when archiving/dumping, by default, it will both return errors to the client and also print them to stderr.
      The latter can be disabled by --quiet.
  • Improved documentation.

Fixed

  • *:

    • Fixed extension detection for .css and .mjs files.
    • Fixed a potential crash when parsing WRR files.
  • serve and, likely, organize:

    • Fixed it not working on Windows. (Thanks to @douglasg14b on GitHub!)

[simple_server-v1.9.0] - 2025-01-17: Incremental improvements

Changed

  • Formatted code using black.
  • Fixed minor issues found by pylint.
  • Improved metadata.
  • Improved documentation.

Fixed

  • Fixed version detection, i.e. --version option works again.

tool-v0.21.0

29 Dec 14:05
tool-v0.21.0

[tool-v0.21.0] - 2024-12-29: Bugfixes, incremental improvements

Fixed

  • serve:

    • When remapping URLs, hash/fragment parts will be preserved now.

    • When replaying, HTTP status codes will be preserved now.

    • From now on, by default, serve will replay archived HTTP headers over HTTP, instead of inlining them into rendered HTML documents (see below).

    • When both archiving and replaying, newly dumped reqres that fail given input filters will no longer be indexed and made available for replay.

  • mirror, serve:

    • Reqres containing redirects (e.g. 302 Found) are handled properly now.

    • From now on, implicit favicons will be mirrored and replayed properly (see below).

  • *:

    • Fixed --*grep* filtering of headers with multiple values.
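
The fragment-preserving URL remapping mentioned in the serve fixes above can be illustrated with a minimal sketch. This is not hoardy-web's actual code: `remap_url`, `to_local`, and the `/local/` prefix are hypothetical names invented for illustration, and only the standard `urllib.parse.urldefrag` call is real.

```python
from urllib.parse import urldefrag


def remap_url(url, remap):
    """Remap `url` with `remap`, keeping its #fragment intact."""
    base, fragment = urldefrag(url)  # split off the "#..." part, if any
    target = remap(base)             # remap only the fragment-less URL
    return f"{target}#{fragment}" if fragment else target


def to_local(url):
    # Hypothetical remapper: prefix everything with a made-up local path.
    return "/local/" + url


print(remap_url("https://example.org/page.html#section-2", to_local))
print(remap_url("https://example.org/page.html", to_local))
```

The point of splitting first is that the lookup of the remapped target never sees the fragment, while the link the browser follows still scrolls to the right place.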

Added

  • scrub, mirror, serve:

    • Added inline_headers option, making inlining of headers as meta http-equiv tags optional.

    • Implemented inline_fallback_icon option.

      When enabled, this option adds a fallback <link rel=icon href="/favicon.ico"> to the result when the input declares no icons and that URL remaps to something useful.

      This option is enabled by default, thus fixing replay of implicit favicons.

  • serve:

    • Implemented --web and --mirror options which control how headers should be replayed.

      With --web enabled, serve will invoke scrub with -inline_headers and will replay those headers over HTTP instead.

      With --mirror it will continue to use scrub with +inline_headers, like mirror does.

      From now on, --web is the default.

    • Implemented --oldest and --nearest options, similar to those of mirror.

    • Added more namespaces other than /web/ and made serve use them for different kinds of targets when remapping.

      So that, e.g., links pointing to unavailable URLs get remapped to /unavailable/<date>/<url> and links pointing to redirects get remapped to /redirect/<date>/<url>.

      This makes links much more informative when hovering over them or when looking at the log output of serve.

      For replay, however, all those namespaces are equivalent and can be used interchangeably.
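
The statement that all serve namespaces are equivalent for replay can be sketched as follows. This assumes paths of the form `/<namespace>/<date>/<url>`, as in the examples above; the function names and the date string format are hypothetical, and this is not how serve itself is implemented.

```python
def parse_replay_path(path):
    """Split "/<namespace>/<date>/<url>" into its three parts."""
    # maxsplit=2 keeps any "/" inside the archived URL untouched.
    namespace, date, url = path.lstrip("/").split("/", 2)
    return namespace, date, url


def replay_key(path):
    """Key used to look up a visit: the namespace is deliberately ignored."""
    _namespace, date, url = parse_replay_path(path)
    return date, url


a = replay_key("/unavailable/2024-12-29/https://example.org/a?b=1")
b = replay_key("/redirect/2024-12-29/https://example.org/a?b=1")
print(a == b)
```

Under this reading, the namespace only tells a human (or a log reader) what kind of target a link points to; the lookup depends on the date and URL alone.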

Changed

  • serve, mirror:

    • Renamed --ignore-bad-inputs -> --ignore-some-inputs.

    • Changed default input filters to allow reqres containing redirects.

  • mirror:

    • Added a default value for the root filters, which is --root-status-re ".[23]00C" to prevent redirects being added as roots.

    • Added --queue-all-indexed option to make the previous item optional.

    • Changed (simplified) semantics of the --boring option.

      From now on, making a path --boring simply disables queuing of its reqres as roots.

      This allows for more interesting uses.

  • Improved documentation.
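
To see what the new default root filter `--root-status-re ".[23]00C"` lets through, here is a small sketch. It assumes status strings shaped like `C200C` (one character, the HTTP status code, then `C`) and assumes the pattern must match the whole string; both are assumptions for illustration, and `can_be_root` is a hypothetical helper.

```python
import re

# The default root filter pattern quoted in the changelog entry above.
ROOT_STATUS_RE = re.compile(".[23]00C")


def can_be_root(status):
    # Assumption: the whole status string has to match the pattern.
    return ROOT_STATUS_RE.fullmatch(status) is not None


for status in ["C200C", "C300C", "C301C", "C302C", "C404C"]:
    print(status, can_be_root(status))
```

With these assumptions, `301` and `302` redirects are rejected as roots while plain `200` responses are accepted.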

extension-v1.19.0

21 Dec 13:35
extension-v1.19.0

[extension-v1.19.0] - 2024-12-21: Reworked popup UI, better replay integration

Changed (1)

  • Popup UI:

    • Reorganized the whole layout by assigning tags to all elements and allowing switching between those tags as if they were tabs.

      The original idea was to unroll in steps a-la uBlock Origin, but this is superior.

    • Improved some help strings.

Added

  • Core + Popup UI + Shortcuts:

    • Added Replay from the archiving server configuration option.

      It's a tristate of: disallow, enable if Submit dumps via 'HTTP' option is enabled and the server supports it, enable even if Submit dumps via 'HTTP' option is disabled.

    • Added Include in global replays per-tab options.

    • Added popup UI button and keyboard shortcut both of which re-navigate all tabs for which Include in global replays is set to their replays.

    • Added popup UI button, keyboard shortcut, and context menu item all of which re-navigate a currently active tab to its replay.

    • Added Force 'Work offline' in replayed tabs configuration option which does the same thing as the similar options for file: and data: URLs do, but for tabs that point to replay URLs.
      Enabled by default.

    • Added 🎄 Winter Days mode seasonal theme.

    • Added Escape notification messages configuration option to help support more notification daemons.
      Disabled by default.

Changed (2)

  • The Help page:

    • Merged "Handling of failures" section into "Archival".

    • Reworded some awkward places.

  • Core + manifest.json:

    • Improved server checking logic and error messages.

    • Improved keyboard shortcut descriptions.

  • Improved documentation.

Fixed

  • Core:

    • Snapshot buttons and keyboard shortcuts will no longer take DOM snapshots of replay pages, unless Capture snapshots of all URLs option is set.

    • On Chromium, fixed Hoardy-Web trying to collect and archive replay pages.

extension-v1.18.0+tool-v0.20.0+simple_server-v1.8.0

16 Dec 11:35
extension-v1.18.0

[extension-v1.18.0] - 2024-12-16: Replay integration, incremental improvements

This release integrates the extension with tool-v0.20.0, which can now do both archival and replay over HTTP, see below.

Changed

  • Core:

    • From now on, all requests to all URLs under Server URL will be ignored, allowing you to work with tool-v0.20.0-replayed pages without fiddling with any settings.

    • From now on, the extension will respect archiving server's settings and features given by its /hoardy-web/server-info endpoint, if such a thing exists.

    • The default value of Server URL does not specify /pwebarc/dump endpoint anymore, as this is now configurable server-side.

      For old configs, you can keep the old value, the archiving server handling code will silently elide that path away.

    • From now on, before the first archival, the extension will check that a working archiving server is available at the given Server URL and generate errors describing what exactly appears to be broken when not.

  • Popup UI:

    • From now on, if you set Server URL setting to an empty string, it will be reset to the default value.

    • Improved CPU usage when switching tabs really quickly.

[tool-v0.20.0] - 2024-12-16: Replay over HTTP, mirroring of non-GET reqres

Changed: Incompatible changes

  • export mirror:

    • Renamed export mirror sub-command to just mirror.
  • *:

    • Renamed all --no-overwrites options -> --no-overwrite.
  • *, --expr:

    • Renamed source -> agent.
    • Renamed raw_path_parts -> path_parts.
    • Renamed mq_raw_path -> mq_path.
    • Renamed qsl_urlencode atom -> unparse_query.

Fixed: Incompatible changes

  • Improved URL normalization:

    • From now on, it will preserve "=" symbols in query strings even when parameter values are empty, like browsers do.
    • URL path and query quoting and unquoting is now, hopefully, equivalent to what browsers do, too.

This changes the file names generated by organize with the default --output format a bit.
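
The "=" preservation rule above can be illustrated with a minimal sketch. It is not hoardy-web's normalizer: `split_query` and `join_query` are hypothetical helpers that ignore quoting entirely, and the comparison with `urllib.parse` is only there to show why a naive round trip loses the distinction.

```python
from urllib.parse import parse_qsl, urlencode


def split_query(query):
    """Split a query string into (name, value) pairs.

    The value is None when the parameter had no "=" at all,
    and "" when it had "=" followed by nothing.
    """
    pairs = []
    for part in query.split("&"):
        if not part:
            continue
        name, eq, value = part.partition("=")
        pairs.append((name, value if eq else None))
    return pairs


def join_query(pairs):
    """Inverse of split_query: emit "=" only where the input had one."""
    return "&".join(n if v is None else f"{n}={v}" for n, v in pairs)


query = "a=&b&c=1"
print(join_query(split_query(query)))                       # keeps "a=" and bare "b"
print(urlencode(parse_qsl(query, keep_blank_values=True)))  # turns bare "b" into "b="
```

Since such differences end up in normalized URLs, they also end up in file names derived from those URLs, which is consistent with the note about organize above.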

Added

  • serve:

    • Implemented the serve sub-command, which runs hoardy-web as a web server that can replay archived data over HTTP, a-la heritrix and pywb.

      After starting it with something like hoardy-web serve path/to/your/archives, you can then navigate to

      This is very reminiscent of the Wayback Machine by design, yes.

    • Added /hoardy-web/server-info endpoint support for future integration with the extension, similar to what simple_server (hoardy-web-sas) now does.

    • Implemented archiving server support when running serve with --archive-to option.

      This is similar to simple_server, except newly archived reqres will become available for replay immediately.

    • Implemented archiving-server-only mode when running serve with --no-replay option.

      In this mode, it is essentially equivalent to simple_server, except hoardy-web serve supports arbitrary --output formats.

    • Implemented --latest option, which only indexes and allows replay for the latest available visit to each available URL.

      Archiving new reqres updates the index accordingly, as expected.

    • Documented it all across the whole repository.

  • mirror:

    • Implemented rendering of non-GET reqres.

      So, e.g., DOM snapshots and web search answer pages done via POST will be included in the outputs now.

      If you do not want some of those, you can filter them out with --not-method POST or some such.

  • scrub, mirror, serve:

    • From now on, malformed URLs will be kept as-is instead of being voided out.

    • From now on, more types of IE-pragmas will be censored out by -iepragmas (which is the default).

    • From now on, scrub will use +verbose and +whitespace as defaults.

      This is a much nicer default, and after content-addressed outputs were implemented in tool-v0.19.0, the resulting space savings -verbose,-whitespace produce are mostly inconsequential now.

    • Simplified semantics of (+|-)pretty, it does not set the verbose option anymore.

  • *:

    • Added --structure and --raw-qbody options.
    • Added a bunch more parsed URL properties.
    • Added a bunch more similar reqres properties.

Changed

  • mirror:

    • Changed semantics of --nearest option a bit.
      From now on, it will parse its argument as a time interval and then take the middle of it as the target value.

      This is much nicer in practice since, from now on, giving --nearest 2024 is much less likely to get you the stuff from 2023.
      It will try to give you stuff nearest to 2024-07-02 00:00:00 instead.

    • Improved performance.

  • *:

    • Renamed --no-remap option -> --raw-sbody, the old name is kept as an alias.
  • Improved documentation and help strings.

    Most notably, the input filtering options are shown only once now.

  • pyproject.toml now explicitly specifies optional mitmproxy file format support.
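
The new --nearest semantics described in the mirror item above (parse the argument as a time interval, then target its middle) can be sketched like this. The helper names are hypothetical and the interval for `2024` is assumed to run from the start of 2024 to the start of 2025.

```python
from datetime import datetime


def interval_midpoint(start, end):
    """Middle of the half-open interval [start, end)."""
    return start + (end - start) / 2


def nearest(stimes, target):
    """Pick the visit time closest to the target."""
    return min(stimes, key=lambda t: abs(t - target))


# "--nearest 2024" read as the whole year 2024.
target = interval_midpoint(datetime(2024, 1, 1), datetime(2025, 1, 1))
print(target)  # 2024-07-02 00:00:00

visits = [datetime(2023, 12, 31, 23, 0), datetime(2024, 3, 1, 12, 0)]
print(nearest(visits, target))  # 2024-03-01 12:00:00
```

For comparison, a target placed at the very start of the year, `datetime(2024, 1, 1)`, would pick the 2023-12-31 visit from the same list, which is the kind of surprise the entry above refers to.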

Fixed

  • * in theory, but only ever triggered by mirror:

    • Fixed a file descriptor semi-leak when lazily reloading reqres.

[simple_server-v1.8.0] - 2024-12-16

Added

  • Added -t, --to, and --archive-to aliases for --root.

  • Added /hoardy-web/server-info endpoint for future integration with the extension.

Changed

  • From now on, "/" and most other non-word symbols (except "_", "-", and space) in bucket names are forbidden and will be removed.

    This will simplify some future things.

  • From now on, when several buckets are specified via several profile query parameters, the last one will be used.

  • Renamed --uncompressed -> --no-compress, the old name is kept as an alias.

  • Slightly improved performance.

  • Started typechecking with mypy.
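
The two bucket-related changes above (symbol removal and "last profile wins") can be sketched together. This is an illustration, not the server's code: the exact set of removed characters is only described as "most" non-word symbols, so the regular expression is an approximation, and `clean_bucket`, `pick_bucket`, and the `"default"` fallback are hypothetical.

```python
import re
from urllib.parse import parse_qs


def clean_bucket(name):
    # Drop everything except word characters (which include "_"), "-", and spaces.
    return re.sub(r"[^\w\- ]", "", name)


def pick_bucket(query, default="default"):
    # With several profile parameters, the last one is used.
    profiles = parse_qs(query).get("profile", [])
    return clean_bucket(profiles[-1]) if profiles else default


print(clean_bucket("my/bucket: one-two_3"))
print(pick_bucket("profile=first&profile=second%2Fsub"))
```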

tool-v0.19.0

07 Dec 13:35
tool-v0.19.0

[tool-v0.19.0] - 2024-12-07: Powerful filtering, exporting of different URL visits, hybrid export modes

Changed: Semantics

  • *:

    • In --expr expressions, sha256 function changed semantics.
      From now on it returns the raw hash digest instead of the hexadecimal one.
      To get the old value, use sha256|to_hex.
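
In plain Python terms, the difference between the raw digest and the hexadecimal one looks like this. The snippet uses only the standard `hashlib` module and says nothing about how --expr itself is implemented.

```python
import hashlib

data = b"example"

raw = hashlib.sha256(data).digest()       # 32 raw bytes
hexed = hashlib.sha256(data).hexdigest()  # 64 hexadecimal characters

print(len(raw), len(hexed))
print(raw.hex() == hexed)  # converting the raw digest to hex gives the old-style value
```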

Added

  • * except organize --move, organize --hardlink, organize --symlink, get, and run:

    • From now on, all sub-commands except for above can take inputs in all supported file formats.

      I.e., you can now do

      hoardy-web export mirror --to ~/hoardy-web/mirror1 mitmproxy.*.dump

      on mitmproxy dumps without even importing them first.

    • By default, the above commands now also automatically dispatch between loaders of different file formats based on file extensions.
      So you can mix and match different file formats on the same command line.

    • Added a bunch of --load-* options that force a specific loader instead, e.g. --load-wrrb, --load-mitmproxy.

  • *:

    • Added a ton of new filtering options.

      For example, you can now do:

      hoardy-web find --method GET --method DOM --status-re .200C --response-mime text/html \
        --response-body-grep-re "\bPotter\b" ~/hoardy-web/raw

      As before, these filters can still be used with other commands, like stream, or export mirror, etc.

      --root-* options of export mirror now use the same syntax and machinery as the normal input filters.

      Also, the overall filtering semantics changed a bit.
      The top-level logical expression the filters compute is now a large conjunction.
      I.e. the above example now compiles to, a bit simplified, (response.method == "GET" or response.method == "DOM") and re.match(".200C", status) and (response_mime == "text/html") and re.match("\\bPotter\\b", response.body).

    • Added a bunch of new --output formats.
      Mostly, this adds a bunch of output formats that refer to stimes.
      Mainly, to simplify export mirror --all usage, described below.

  • export mirror:

    • Implemented exporting of different URL visits.

      I.e., you can now export not just --latest visit to each URL, but an --oldest one, or one --nearest to a given date, or --all of them.

    • Implemented --latest-hybrid, --oldest-hybrid, and --nearest-hybrid options.

      These allow you to export each page with resource requisites that are date-wise closest to the stime of the page itself, instead of taking globally --latest, --oldest, or --nearest versions of all requisite URLs.

      At the moment, this takes a lot more memory, but makes the results much more consistent for websites that do not use versioned resource requisites.

    • Implemented --hardlink and --symlink options, which allow exporting into content-addressed destinations.

      I.e. export mirror --hardlink will render and write each exported file to <--to>/_content/<hash/based/path>.<ext> and only then hardlink the result to <--to>/<output/format/based/path>.<ext> target destination.
      And similarly for --symlink.

      Typically, doing this saves quite a bit of space, e.g., when pages refer to the same resource requisites by slightly different URLs, same images and fonts get distributed via different CDN hosts, when you export --all visits to some URLs and many of those are absolutely identical, etc.

      So, from now on, --hardlink is the default.
      The old behavior can be achieved by running it with --copy instead.

    • Implemented --relative and --absolute options, which control if URLs should be remapped to relative or absolute file: URLs, respectively.

  • Documented all the new things.

  • Added a bunch of new test-cli.sh tests.
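
The "large conjunction" filtering semantics described above can be sketched in Python. Repeated values of one option are OR-ed together, and different options are AND-ed. The reqres here is a plain dict with made-up field names, `matches` is a hypothetical helper, and the body check uses `re.search` as an assumption about what "grep" means here.

```python
import re


def matches(reqres, methods, status_re, response_mime, body_re):
    """Conjunction of filters; alternatives within one filter are a disjunction."""
    return (
        any(reqres["method"] == m for m in methods)
        and re.match(status_re, reqres["status"]) is not None
        and reqres["response_mime"] == response_mime
        and re.search(body_re, reqres["response_body"]) is not None
    )


page = {
    "method": "GET",
    "status": "C200C",
    "response_mime": "text/html",
    "response_body": "<p>Harry Potter and friends</p>",
}
image = dict(page, response_mime="image/png")

args = (["GET", "DOM"], ".200C", "text/html", r"\bPotter\b")
print(matches(page, *args), matches(image, *args))
```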

Changed

  • export mirror:

    • Switched default --output to hupq_n to prevent collisions when using --*-hybrid and --all.

    • Improved handling of base HTML tags, _targets are supported now.

    • Links that reference a page from itself will no longer refer to the page's filename, even when the link has no fragment.

      The results can be a bit confusing, but this makes the new content de-duplication options much more effective.

    • Made export mirror default filters explicit and changed them from --method "GET" --status-re ".200C" to --method "GET" --method "DOM" --status-re ".200C".

    • Implemented --ignore-bad-inputs and --index-all-inputs options to allow you to change the above default.

    • Improved output log format.

  • Improved file loading performance a bit.

  • Improved documentation.
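
The content de-duplication that --hardlink performs (write each rendered file under `_content/` at a hash-based path, then hardlink it to its output-format-based target) can be sketched as below. The hash function, the directory layout under `_content/`, and the name `export_hardlinked` are all assumptions made for illustration, not the layout hoardy-web actually uses.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def export_hardlinked(root, rel_path, data):
    """Store `data` once under _content/ and hardlink it to `rel_path`."""
    digest = hashlib.sha256(data).hexdigest()
    ext = Path(rel_path).suffix
    # Hypothetical hash-based path: two-character fan-out directory, then the rest.
    content = Path(root) / "_content" / digest[:2] / (digest[2:] + ext)
    if not content.exists():
        content.parent.mkdir(parents=True, exist_ok=True)
        content.write_bytes(data)
    target = Path(root) / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists() or target.is_symlink():
        target.unlink()
    os.link(content, target)  # same inode, so no extra space is used
    return target


with tempfile.TemporaryDirectory() as tmp:
    one = export_hardlinked(tmp, "cdn1.example.org/logo.png", b"same bytes")
    two = export_hardlinked(tmp, "cdn2.example.org/logo.png", b"same bytes")
    print(os.stat(one).st_ino == os.stat(two).st_ino)
```

Two targets with identical content end up as extra names for one stored file, which is where the space savings come from.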

Fixed

  • Added a bunch of new tests for organize, which cover the organize --symlink --latest bug of tool-v0.18.0.
    Won't happen again.

  • Fixed a couple of silly filtering-related bugs.

tool-v0.18.1

30 Nov 16:34
tool-v0.18.1

[tool-v0.18.1] - 2024-11-30: Hotfixes

Fixed

tool-v0.18.0 introduced a bunch of issues:

  • organize:

    • Fixed organize --symlink --latest dereferencing output files, which led to it overwriting plain WRR source files containing updated URLs with symlinks to their newer versions.

      The good news is that this bug was only triggered when organize --symlink --latest was run with some newly archived data and, for each updated URL, it only overwrote the second to last WRR file with a symlink to the latest WRR file.
      Unfortunately, this error was self-propagating, so those files could then get overwritten again by the next invocation of organize --symlink --latest with some more new data.
      This could happen up to 7 times, at which point it would start crashing, because of the OS symlink dereferencing limit.

      You can check if you were affected by running:

      cd ~/web/raw ; find . -type l

      The paths it outputs will be the paths of lost WRR files.

      A reminder that it is good to do daily backups, I suppose.

      The next version will have a test for this, but I'm releasing this hotfix an hour after I discovered this.

    • Fixed it assert-crashing sometimes when running with --symlink.

    • Improved memory consumption a bit.

  • export mirror:

    • Fixed overly large memory consumption.
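
For systems without a `find` command, the check for affected files given above (`find . -type l`) has a rough Python equivalent. This is a sketch: `find_symlinks` is a hypothetical helper, and the `~/web/raw` path is simply the one used in the example above.

```python
import os


def find_symlinks(root):
    """List all symlinks under `root`, relative to it, like `find . -type l`."""
    found = []
    # os.walk does not follow symlinked directories by default,
    # but it still reports them in `dirnames`.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append(os.path.relpath(path, root))
    return sorted(found)


if __name__ == "__main__":
    for path in find_symlinks(os.path.expanduser("~/web/raw")):
        print(path)
```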

tool-v0.18.0

20 Nov 14:36
tool-v0.18.0

[tool-v0.18.0] - 2024-11-20: Incremental improvements

Added

  • export mirror:

    • Implemented the --boring option, which allows you to load some input PATHs without adding them as roots, even when no --root-* options are specified.

      This makes the CLI a bit more convenient to use.
      The README.md has a new example showcasing it.

  • export mirror, scrub:

    • Implemented support for @import CSS rules using a string token in place of a URL.

      As far as I can see, this syntax is rarely used in practice.
      But the spec allows this, so.

    • Implemented interpret_noscript option, which enables inlining of noscript tags when scrub is running with -scripts.

      That is, export mirror will now use this feature by default.

      This is needed because some websites put link tags with CSS under noscript, thus making such pages look broken when scrubbed with -scripts (which is the default) and then opened in a browser with scripts enabled.

Changed

  • *: Refactored/reworked a large chunk of internals, as a result:

    • organize can now take WRR bundles as inputs too,
    • export mirror became much faster at indexing inputs that contain archives of the same URLs, repeatedly.

    In general, these changes are aimed towards making hoardy-web completely input-agnostic.
    That is, wouldn't it be nice if you could feed mitmproxy files to export mirror directly, instead of going through import mitmproxy first?

  • export mirror, scrub:

    • From now on, it will stop generating link tags with void URLs, it will simply censor them out instead.

    • scrub with +verbose set will now also show original rel attr values for censored out tags.

    • Also, in general, the outputs of scrub with +verbose set are much prettier now.

  • Improved documentation.

tool-v0.17.0

09 Nov 17:48
tool-v0.17.0

extension-v1.17.2

09 Nov 10:27
extension-v1.17.2

[extension-v1.17.2] - 2024-11-09: Documentation fixes, mostly

Changed

  • The Help page:

    • Rewrote "Conventions" and "'Work offline' mode" sections to be much more readable.
  • *:

    • Improved contrast when running with a light CSS color scheme.

Fixed

  • Documentation:

    • Fixed some typos.
  • *:

    • Fixed some potential state display inconsistency bugs and improved UI pages' init performance when the core is very busy.

extension-v1.17.1

01 Nov 10:36
extension-v1.17.1

[extension-v1.17.1] - 2024-11-01: Annoyance fixes

Changed

  • Popup UI:

    • Reverted most of the block reordering bit of popup UI rework of extension-v1.17.0.

      The "Globally" block is near the top again.

    • Edited the "Persistence" block a bit more.

      Mainly, to stop graying out always-useful stat lines, even when the associated features are disabled, to prevent possible confusion there.

    • Renamed some options and stat lines, mostly to make their names shorter to make popup UI on Fenix more readable.

  • Toolbar button:

    • Edited its title format to be much shorter, especially on Fenix.

    • Reverted the ordering of parts there to how it was before extension-v1.17.0.

      The (much shorter now) "globally" part is at the front again because otherwise the badge being at the front there too without an explanation of its format is kind of confusing.

  • Core + All internal pages:

    • Improved message handling infrastructure.

    • Used it to improve initialization functions of all internal pages, improving efficiency and making the resulting UI much less flaky.

  • The Help page:

    • Documented what webNavigation permission is used for, improved the rest a bit.
  • *:

    • Renamed build.sh firefox target to firefox-mv2, for consistency.

Fixed

  • UI:

    • Fixed flaky rendering of Help and Changelog pages on Fenix.

      They render properly now the very first time you load them, no reloads needed.

    • Fixed duplication of history entries when navigating internal links.

    • Fixed source links sometimes failing to be highlighted when pressing the browser's "Back" button.

    • Fixed some small CSS nitpicks.

  • Popup UI + Documentation:

    • Realigned some help strings with reality.
  • Fixed some more mostly inconsequential things.