first release!
cs127 committed May 17, 2024
0 parents commit 04ed311
Showing 4 changed files with 1,086 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,5 @@
{
    "files.associations": {
        "*.h": "c",
    }
}
62 changes: 62 additions & 0 deletions README.md
@@ -0,0 +1,62 @@
# unicorn

unicorn is a lightweight implementation of most of the standard C wide character functions, for platforms that don't support them, but still have a wide character type in the form of `wchar_t`, **as long as it is at least 16 bits (if unsigned) or 17 bits (if signed)**.

> [!NOTE]
> this is just a hobby project.
> as much as I try to fix issues, you should still probably not expect it to always work properly.
> also, the code isn't exactly the most optimized. you have my warning.

unlike the standard functions, which are locale-dependent, unicorn does not support locales and always uses the same text encodings:

* wide characters (`wchar_t`) are assumed to be encoded in UTF-32 if `WCHAR_MAX` is at least `0x10FFFF` (e.g. Linux), or UTF-16 otherwise (e.g. Windows).
    * surrogates (`U+D800`-`U+DFFF`) are considered invalid in UTF-32.
    * a new function (`mbstowc`) has been implemented as an alternative to `mbtowc` to allow converting individual non-BMP characters in UTF-16.
* multibyte strings (used in `mbstowcs` and the like) are assumed to be encoded in UTF-8.
    * surrogates (`U+D800`-`U+DFFF`) are considered invalid in multibyte strings.
    * characters of length 5-8 are considered invalid, and so are 4-byte characters that exceed `U+10FFFF`.
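
the multibyte rules above can be sketched as a small decoder. this is only an illustration under stated assumptions, not unicorn's actual code: `example_utf8_decode` is a hypothetical name, and its return convention (bytes consumed, or `0` on failure) was chosen for the example. like the current behavior described in this README, the sketch does not reject overlong encodings.

```c
#include <stddef.h>

/* hypothetical helper, not part of unicorn: decodes one UTF-8 sequence from
   s (at most n bytes) into *cp. returns the number of bytes consumed, or 0
   if the sequence is invalid or truncated. overlong forms are NOT rejected. */
size_t example_utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    unsigned long c;
    size_t len, i;

    if (n == 0) return 0;

    /* the lead byte determines the sequence length. */
    if (s[0] < 0x80)                { c = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else return 0; /* stray continuation byte, or a 5-8 byte lead byte */

    if (len > n) return 0; /* truncated sequence */

    /* every following byte must look like 10xxxxxx. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (unsigned long)(s[i] & 0x3F);
    }

    if (c >= 0xD800 && c <= 0xDFFF) return 0; /* surrogates are invalid */
    if (c > 0x10FFFF) return 0;               /* beyond the Unicode range */

    *cp = c;
    return len;
}
```

for instance, passing the three bytes `E2 82 AC` yields `U+20AC` with a length of 3, while `ED A0 80` (which would decode to the surrogate `U+D800`) is rejected.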

> [!WARNING]
> do not put overlong characters (characters encoded in a larger number of bytes than needed) in your multibyte strings!
> currently, unicorn does not consider them invalid, but **this will change**.

everything that unicorn implements uses the same name as its counterpart in standard C, except with a `UC_` prefix.
the only exception is the `wchar_t` type: unicorn uses the standard `wchar_t`.

## compatibility

unicorn is *almost* C89-compatible, except that it needs to know the maximum possible value of the `wchar_t` type.
if your build environment does not support C99 or newer, then unless your compiler itself predefines `WCHAR_MAX`, `__WCHAR_MAX`, or `__WCHAR_MAX__`, you need to manually define one of them at compile time (make sure to give it the correct value!).
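
a minimal sketch of how such a lookup could work is shown below. it is an assumption about the general technique, not a copy of unicorn's source: the `EXAMPLE_` names are hypothetical, and the order in which the three macros are tried is a guess. on compilers that accept a `-D` option (such as GCC and Clang), something like `-DWCHAR_MAX=0xFFFF` would supply the value manually.

```c
#include <stddef.h> /* wchar_t */

/* C99 and newer: <stdint.h> is required to define WCHAR_MAX. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdint.h>
#endif

/* EXAMPLE_* names are hypothetical and exist only in this sketch. */
#if defined(WCHAR_MAX)
#define EXAMPLE_WCHAR_MAX WCHAR_MAX
#elif defined(__WCHAR_MAX__)
#define EXAMPLE_WCHAR_MAX __WCHAR_MAX__
#elif defined(__WCHAR_MAX)
#define EXAMPLE_WCHAR_MAX __WCHAR_MAX
#else
#error "WCHAR_MAX is unknown; define it manually, e.g. -DWCHAR_MAX=0xFFFF"
#endif

/* wide characters are treated as UTF-32 only if wchar_t can hold U+10FFFF. */
#if EXAMPLE_WCHAR_MAX >= 0x10FFFF
#define EXAMPLE_WIDE_IS_UTF32 1
#else
#define EXAMPLE_WIDE_IS_UTF32 0
#endif

/* returns 1 if wide strings would be UTF-32, or 0 if they would be UTF-16. */
int example_wide_is_utf32(void)
{
    return EXAMPLE_WIDE_IS_UTF32;
}
```

doing the check in the preprocessor means the encoding choice is fixed at compile time, which matches the README's statement that the value must be known when compiling.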

## what's not implemented

* the following will be implemented in a later update:
    * `wcstok` function.

* the following do not need to be implemented, because UTF-8 is stateless:
    * `mbstate_t` type.
    * `mbsinit` function.
    * thread-safe versions of encoding conversion functions.

* the following are not planned to be implemented any time soon (or maybe ever):
    * `wctype_t` type.
    * character type functions (`towlower`, `towupper`, `wcscasecmp`, `wcscasecmp_l`, `wcsncasecmp`, `wcsncasecmp_l`, `wctype`, and the `isw` family, including `iswctype`).
    * string to number conversion functions (`wcstol`, `wcstoul`, `wcstoll`, `wcstoull`, `wcstof`, `wcstod`, and `wcstold`).
    * functions that interact with file streams (e.g. `fgetws`, `fputws`, `wprintf`).
    * `wcscoll` and `wcscoll_l` functions.
    * `wcsftime` function.
    * `wcsdup` function.
    * `wcwidth` and `wcswidth` functions.
    * `wcsxfrm` and `wcsxfrm_l` functions.

## what *is* implemented

> [!IMPORTANT]
> you need to prepend a `UC_` prefix to the names of these functions, types, and macros!

* every `wchar.h` function not mentioned above, including a few nonstandard POSIX-only functions, like `wcpcpy`.
* `wint_t` type (equivalent to `signed long int`), with range macros `WINT_MIN` and `WINT_MAX`.
* `WEOF` macro (evaluates to `-1`).
* `MB_LEN_MAX` and `MB_CUR_MAX` macros (both evaluate to `4`, because the multibyte encoding is always UTF-8).
* wide character related `stdlib.h` functions (e.g. `wcstombs`, `mbstowcs`, `mblen`).
* nonstandard `mbstowc` function, which is an alternative to `mbtowc`, but expects a `wchar_t*` instead of `wchar`, to be able to read surrogate pairs in UTF-16.
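
to show why a single `wchar_t` is not enough for non-BMP characters when wide strings are UTF-16, here is a rough sketch of surrogate pair encoding. `example_utf16_encode` is a hypothetical helper written for this illustration. it is not unicorn's `mbstowc`, whose exact signature is not shown in this README.

```c
/* hypothetical helper, not part of unicorn: encodes code point cp as UTF-16
   into out. returns the number of 16-bit units written (1 or 2), or 0 if cp
   is a surrogate or exceeds U+10FFFF. */
int example_utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    if (cp < 0x10000) {
        /* BMP characters fit in a single unit. */
        out[0] = (unsigned short)cp;
        return 1;
    }

    /* non-BMP characters are split into a high and a low surrogate,
       each carrying 10 bits of the offset from U+10000. */
    cp -= 0x10000;
    out[0] = (unsigned short)(0xD800 + (cp >> 10));
    out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

for example, `U+1F600` becomes the pair `0xD83D 0xDE00`, so converting it needs room for two wide characters rather than one.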