Skip to content

Commit

Permalink
Updated CHANGELOG for v5
Browse files Browse the repository at this point in the history
  • Loading branch information
rgaudin committed Jan 7, 2025
1 parent c9e7a98 commit aee2657
Showing 1 changed file with 117 additions and 27 deletions.
144 changes: 117 additions & 27 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,38 +7,128 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Breaking Changes

- Renamed `filesystem.validate_zimfile_creatable` to `filesystem.file_creatable` to reflect general applicability to check file creation beyond ZIM files #200
- Remove any "ZIM" reference in exceptions while working with files #200
- Significantly enhance the safety of metadata manipulation (#205)
- add types for all metadata, one type per metadata name plus some generic ones for non-standard metadata
- all types are responsible to validate metadata value at initialization time
- validation checks for adherence to the ZIM specification and conventions are automated
- cleanup of unwanted control characters and stripping white characters are **automated in all text metadata**
- whenever possible, try to **automatically clean a "reasonably" bad metadata** (e.g. automaticall accept and remove duplicate tags - harmless - but not duplicate language codes - codes are supposed to be ordered, so it is a weird situation) ; this is an alignment of paradigm, because for some metadata the lib was permissive, while for other it was quite restrictive ; this PR tries to align this and **make the lib as permissive as possible**, avoiding to fail a scraper for something which could be automatically fixed
- it is now possible to disable ZIM conventions checks with `zim.metadata.check_metadata_conventions`
- simplify `zim.creator.Creator.config_metadata` by using these types and been more strict:
- add new `StandardMetadata` class for standard metadata, including list of mandatory one
- by default, all non-standard metadata must start with `X-` prefix
- this not yet an openZIM convention / specification, so it is possible to disable this check with `fail_on_missing_prefix` argument
- simplify `add_metadata`, use same metadata types
- simplify `zim.creator.Creator.start` with new types, and drop all metadata from memory after being passed to the libzim
- drop `zim.creator.convert_and_check_metadata` (not usefull anymore, simply use proper metadata type)
- move `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` from `constants` to `zim.metadata` to avoid circular dependencies
- new `inputs.unique_values` utility function to compute the list of uniques values from a given list, but preserving initial list order
- in `__init__` of `zim.creator.Creator`, rename `disable_metadata_checks` to `check_metadata_conventions` for clarity and brevity
- beware that this manipulate the global `zim.metadata.check_metadata_conventions`, so if you have many creator running in parallel, they can't have different settings, last one initialized will "win"
This is a major release with a lot of breaking changes but most changes are easy to fix.

It focuses on type safety with the introduction of runtime checks: any call to zimscraperlib API must match the type definition or an exception will be raised.

Documentation is available as docstrings and on https://python-scraperlib.readthedocs.io

Main changes includes:

- ZIM metadata handling has changes completely with new types for each kind of metadata.
- `i18n` module has been redesigned around a single main class `Language`
- New `rewriting` module for HTTML/CSS/JS (that one being done at runtime via Wombat)
- Now supporting only Python 3.12

### Added

- Add `filesystem.validate_folder_writable` to check if a folder can be written to #200
- Expose `constants.VERSION` to have access to zimscraperlib version from scrapers #224
- Added mkdocs based documentation site. #92
- Documentation using `mkdocs`, published on readthedocs.com (#92)
- `rewriting` module to rewrite URLs in content for generic scrapers
- `rewriting.css` to rewrite URLs in CSS files
- `rewriting.html` to rewrite URLs in HTML files
- `rewriting.js` to rewrite URLs in JS files (at runtime, using `wombat`)
- `wombat-setup` javascript module in `javascript/`
- `typing` module with custom types:
- `Callback` to use where we expect callbacks
- `SupportsWrite`, `SupportsRead`, `SupportsSeeking` `SupportsSeekableRead` and `SupportsSeekableWrite`: protocols for IO type annotations
- `zim.metadata` module with a type-based approach for each kind of metadata and helpers for custom ones
- [`zim.metadata`] `APPLY_RECOMMENDATIONS`: general flag to toggle openZIM-recommended constraints
- [`zim.metadata`] Type-based classes: `Metadata`, `TextBasedMetadata`, `TextListBasedMetadata`, `DateBasedMetadata`, `IllustrationBasedMetadata`
- [`zim.metadata`] Usage-based classes: `NameMetadata`, `LanguageMetadata`, `DefaultIllustrationMetadata`, etc.
- [`zim.metadata`] `StandardMetadataList` to package the standard metadata
- See details for additional API endpoints and variables
- [`constants`] `DEFAULT_WEB_REQUESTS_TIMEOUT` exposed for `download` module
- [`download`] `stream_file()` now accepts `timeout: int` param (defaults to constant timeout) (#222)
- [`filesystem`] `path_from` context manager to acquire a pathlib `Path` from `Path` or `TemporaryDirectory`
- [`i18n`] `Language`, `get_language()` and `get_language_or_none()`. See breaking changes
- [`image.optimization`] `OptimizePngOptions` dataclass to store PNG options
- [`image.optimization`] `OptimizeJpgOptions` dataclass to store JPEG options
- [`image.optimization`] `OptimizeGifOptions` dataclass to store WebP options
- [`image.optimization`] `OptimizeOptions` dataclass to store cross-formats options
- [`inputs`] `unique_values()` to deduplicate a list while preserving order
- [`logging`] `DEFAULT_FORMAT_WITH_THREADS` as many scrapers uses threads
- [`video.encoding`] `reencode()`'s `existing_tmp_path` param
- [`zim.filesystem`] `validate_folder_writable()` to ensure one can write into a folder (#200)
- [`zim.creator`] `Creator._get_first_language_metadata_value()` to retrieve first language from metadata
- [`zim.items`] `no_indexing_indexdata()` to get an IndexData that disables indexing
- [`zim.items`] `URLItem.get_mimetype()` now only returning `str`

## Changed (Breaking)

- Entire API is now type-protected using beartype. Any call to scraperlib that doesn't satisfy the annotated types will raise an exception
- [`constants`] `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` moved to `zim/metadata`
- [`download`] `YoutubeDownloader.download`'s `options` parameters now expect an `dict[str, Any]` instead of `dict`
- [`download`] `YoutubeConfig` options now limited to `str | bool | int | None`
- [`download`] `_get_retry_adapter()` now exposed as `get_retry_adapter()`
- [`download`] `stream_file`'s `byte_stream' param now more flexible, accepting `SupportsWrite[bytes] | SupportsSeekableWrite[bytes]`
- [`download`] `stream_file`'s `proxies` param now accepting `dict[str, str]` instead of `dict`
- [`filesystem`] `delete_callback()` is now a simple callback accepting an `fpath` and deleting it (doesnt chain other callback anymore).
- [`filesystem`] `delete_callback()` doesnt fail on missing file (#192)
- [`i18n`] Redesigned API around a single object:
- `Language` which is inited with any acceptable code. Raises `NotFoundError` on 639-3 matching failure
- `find_language_names()` is retained but only accepts a `query: str`
- added `get_language()` and `get_language_or_none()` as shortcuts around `Language`
- `is_valid_iso_639_3()` is retained
- [`image.conversion`] `convert_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.conversion`] `convert_svg2png()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.optimization`] `optimize_png()` now accepts `options: OptimizePngOptions` instead of individual params.
- [`image.optimization`] `optimize_jpeg()` now accepts `options: OptimizeJpgOptions` instead of individual params.
- [`image.optimization`] `optimize_webp()` now accepts `options: OptimizeWebpOptions` instead of individual params.
- [`image.optimization`] `optimize_gif()` now accepts `options: OptimizeGifOptions` instead of individual params.
- [`image.presets`] All presets now use the new options dataclass instead of ClassVar dict
- [`image.probing`] `format_for()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src`.
- [`image.probing`] `is_valid_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `image`.
- [`image.utils`] `save_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `dst`.
- [`video.config`] `Config` was mostly not using type annotations.
- [`video.config`] `Config` options only expecting `str | None`
- [`video.presets`] All options only expecting `str | None`
- [`video.encoding`] `reencode()` now always returning a `tuple[bool, CompletedProcess]`
- [`zim._libkiwix`] `MimetypeAndCounter` now expects specific types for `mimetype: str` and `value: int`
- [`zim.filesystem`] `make_zim_file()` publisher`param now properly expects an`str`
- [`zim.filesystem`] `IncorrectZIMPathError` renamed to `IncorrectPathError`
- [`zim.filesystem`] `MissingZIMFolderError` renamed to `MissingFolderError`
- [`zim.filesystem`] `NotADirectoryZIMFolderError` renamed to `NotADirectoryFolderError`
- [`zim.filesystem`] `NotWritableZIMFolderError` renamed to `NotWritableFolderError`
- [`zim.filesystem`] `IncorrectZIMFilenameError` renamed to `IncorrectFilenameError`
- [`zim.filesystem`] `validate_zimfile_creatable()` renamed to `validate_file_creatable()`
- [`zim.items`] `Item` and `StaticItem` now expecting `hints` as `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `Item.get_hints()` now returning `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `URLItem.download_for_size()` now specifying type annotations and reordered params
- [`zim.providers`] `FileLikeProvider.gen_blob()` and `URLProvider.gen_blob()` now properly annotates return type (`Generator[libzim.writer.Blob, None, None]`)
- [`zim.providers`] `URLProvider.get_size_of()` param `url` now explicitly expects an `str`
- [`zim.creator`] `Creator.config_metadata()` signature changed, now mainly accepting a `StandardMetadataList`
- [`zim.creator`] `Creator.config_dev_metadata()` signature changed to accept now metadata types
- [`zim.creator`] `Creator.add_item_for()`'s `callback` renamed to `callbacks` and accepting `Callback`
- [`zim.creator`] `Creator.add_item()`'s `callback` renamed to `callbacks` and accepting `Callback`

## Changed

- [deps] `iso639-lang` now requires at least v2.4.0
- [`download`] `stream_file()` now return `tuple[int, requests.structures.CaseInsensitiveDict[str]]` instead of `tuple[int, requests.structures.CaseInsensitiveDict]`
- [`download`] `stream_file()` now accepts both `fpath` and `byte_stream` params (writes to both)
- [`image.utils`] `save_image()` now accepts `Any` `**params`.
- [`zim.archive`] `Archive.counters` now returning `CounterMap` (compatible with previous `dict[str, int]`)

### Fixed
## Fixed

- Set default timeout in `download.stream_file` to 10 seconds, and allow to override value #222
- Direct dependencies now properly references: pillow, urllib3, piexif, idna (#226)
- [`download`] `YoutubeDownloader.download` now respects its return type (`bool | Future[Any]`)
- [`image.conversion`] `convert_image()` `**params` properly declared as accepting `None`.
- [`logging`] `getLogger()`'s' `console` now properly accepting `TextIO | io.StringIO | None`
- [`video.probing`] `get_media_info()` type annotation for `src_path`
- [`zim.archive`] `Archive.get_item()` return type (`libzim.reader.Item`)

## Removed

- Support for Python 3.8/3.9/3.10/3.11. Only Python 3.12 is supported now.
- [`i18n`] `Lang` (See breaking changes)
- [`i18n`] `get_iso_lang_data()` (See breaking changes)
- [`i18n`] `update_with_macro()` (See breaking changes)
- [`i18n`] `get_language_details()` (See breaking changes)
- [`uri`] `rebuild_uri` `failsafe` param (was only handling incorrect types)
- [`video.encoding`] `reencode()`'s `with_process` param
- [`zim.creator`] `Creator.validate_metadata()`
- [`zim.creator`] `Creator.convert_and_check_metadata()`

## [4.0.0] - 2024-08-05

Expand Down

0 comments on commit aee2657

Please sign in to comment.