Releases: openzim/warc2zim

2.2.0

10 Jan 09:52
6d7bf10

Changed

  • Upgrade dependencies: zimscraperlib 5.0.0, warcio 1.7.5, cdxj_index 1.4.6 and others
  • Use the rewriting logic provided by zimscraperlib
  • Remove most HTML / CSS / JS rewriting logic, which is now part of zimscraperlib 5
  • Fix wombat setup settings (especially isSW) (#293)

Fixed

  • Stop checking main entry processability when it is already found (#424)

2.1.3

01 Nov 13:17
a8b934f

Changed

  • Upgrade to wombat 3.8.3 (#414)

2.1.2

08 Oct 12:28
dc36cd8

Added

  • Enrich test website with img srcset situations (in preparation for #403)

Changed

  • Upgrade dependencies, including wombat 3.8.2 (#407)

Fixed

  • HTML document can be retrieved as fetch resource type (#405)

2.1.1

05 Sep 07:14
1737ab3

Changed

  • Upgrade dependencies, including wombat 3.8.0 (#386)

2.1.0

09 Aug 07:43
76c50ed

Added

  • New fuzzy rules for cheatography.com (#342), der-postillon.com (#330), iranwire.com (#363)
  • Properly rewrite redirect target url when present in HTML tag (#237)
  • New --encoding-aliases argument to pass encoding/charset aliases (#331)
  • Add support for SVG favicon (#148)
  • Automatically index PDF content and use PDF title (#289 and #290)

Changed

  • Upgrade to python-scraperlib 4.0.0
  • Generate fuzzy rules tests in Python and Javascript (#284)
  • Refactor HTML rewriter class to make it more open to change and expressive (#305)
  • Detect charset in document header only for HTML documents (#331)
  • Use software property from warcinfo record to set ZIM Scraper metadata (#357)
  • Store ContentDate as metadata, based on WARC-Date (#358)
  • Remove domain specific rules (#328)
  • Revisit retrieve_illustration logic to prefer best favicons (#352 and #369)
  • Upgrade dependencies (zimscraperlib 4.0.0, wombat.js 3.7.12 and others) (#376)

Fixed

  • Handle case where the redirect target is bad / unsupported (#332 and #356)
  • Process WARC files in creation order (#366)
  • Remove consecutive slashes in URLs, both in Python and JS (#365)
  • Ignore non HTTP(S) WARC records (#351)
  • Fix vimeo_cdn_fix fuzzy rule for proper operation in Javascript (#348)
  • Performance issue linked to new "extensible" HTML rewriting rules (#370)
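
The slash cleanup mentioned in #365 can be illustrated with a minimal sketch. This is not warc2zim's actual code: the function name `collapse_slashes` is hypothetical, and limiting the cleanup to the URL path (leaving the query string untouched) is an assumption made for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the path of a URL (hypothetical helper).

    Assumption: only the path is normalized; scheme, host, query and
    fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    # Replace any run of two or more slashes in the path with a single one.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(collapse_slashes("https://example.com//a///b/?q=x//y"))
```

Applying the same normalization on both the Python side (when entries are written) and the JS side (when links are resolved) is what keeps the two in agreement about an entry's path.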

2.0.3

24 Jul 05:28
72474e8

Changed

  • Moved rules definition from JSON to YAML and documented update process (#216)
  • Upgrade to wombat.js 3.7.11

Added

  • Exit with cleaner message when no entries are expected in the ZIM (#336) and when main entry is not processable (#337)
  • Add debug log for items whose content is empty (#344)

Fixed

  • The rewrite mode of some resources was still not correctly identified (#326)

2.0.2

18 Jun 13:26
1452298

Added

  • Add --ignore-content-header-charsets option to disable automatic retrieval of the content charset from the first bytes of the content (#318)
  • Add --content-header-bytes-length option to specify how many leading bytes to consider when searching for a charset in the content header (#320)
  • Add --ignore-http-header-charsets option to disable automatic retrieval of the content charset from the HTTP Content-Type header (#318)

Changed

  • Simplify logic deciding content charset, stop guessing with chardet (#312)

Fixed

  • Rewrite only content with mimetype text/html when WARC-Resource-Type is html (#313)

2.0.1

13 Jun 10:15
90e8bdf

Added

  • Add support for multiple languages in --lang CLI argument (#300)

Changed

  • Use the new WARC-Resource-Type header to decide rewrite mode (when present in WARC) (#296)
  • Upgrade Python dependencies + wombat.js 3.7.5

Fixed

  • Drop integrity attribute in HTML <script> and <link> tags (#298)
  • Use automatic detection of content encoding also for JS, JSON and CSS files (#301)
  • Set correct charset in HTML documents (#253)

2.0.0

04 Jun 07:18
d7cbe7e

Added

  • Allow specifying a scraper suffix for the ZIM scraper metadata at the CLI (#168)
  • New test website to test many known situations supposed to be handled (#166)

Changed

  • Replace the Service Worker approach with scraper-side rewriting of static content (kiwix/overview#95)
  • Adopted Python bootstrap conventions (#152)
  • Upgrade dependencies, especially move to Python 3.12 (only) and zimscraperlib 3.3.2
  • Change wording in logs about the return code 100 (which is not an error code)
  • Added checks in converter.py to verify that the output directory exists, log an appropriate error message, and exit cleanly if the checks fail (#106)
  • Added check for invalid ZIM file names (#232)
  • Changed default publisher metadata from 'Kiwix' to 'openZIM' (#150)

1.5.5

18 Jan 07:52
af831c0

Changed

  • Code restructuring in preparation for 2.x