Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
commit 04ed311
Showing
4 changed files
with
1,086 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
{ | ||
"files.associations": { | ||
"*.h": "c", | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# unicorn | ||
|
||
unicorn is a lightweight implementation of most of the standard C wide character functions, for platforms that lack them but still provide a wide character type in the form of `wchar_t`, **as long as that type is at least 16 bits wide (if unsigned) or 17 bits wide (if signed)**. | ||
|
||
> [!NOTE] | ||
> this is just a hobby project. | ||
> as much as I try to fix issues, you should still probably not expect it to always work properly. | ||
> also, the code isn't exactly the most optimized. you have my warning. | ||
unlike the standard functions which are locale-dependent, unicorn does not support locales, and always uses the same text encodings: | ||
|
||
* wide characters (`wchar_t`) are assumed to be encoded in UTF-32 if `WCHAR_MAX` is at least `0x10FFFF` (e.g. Linux), or UTF-16 otherwise (e.g. Windows). | ||
* surrogates (`U+D800`-`U+DFFF`) are considered invalid in UTF-32. | ||
* a new function (`mbstowc`) has been implemented as an alternative to `mbtowc` to allow converting individual non-BMP characters in UTF-16. | ||
* multibyte strings (used in `mbstowcs` and the like) are assumed to be encoded in UTF-8. | ||
* surrogates (`U+D800`-`U+DFFF`) are considered invalid in multibyte strings. | ||
* sequences that are 5-8 bytes long are considered invalid, and so are 4-byte sequences that decode to a value above `U+10FFFF`. | ||
|
||
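the multibyte rules above can be illustrated with a small decoder. this is only a sketch under stated assumptions: `sketch_decode_utf8` is a made-up name, not part of unicorn's API, and unicorn's own implementation may be structured differently. like the current behavior described in this README, the sketch does not reject overlong forms.

```c
#include <stddef.h>

/* Hypothetical helper, not part of unicorn's API: decodes one UTF-8
 * sequence from s (at most n bytes available) into *out.
 * Returns the number of bytes consumed, or 0 if the sequence is invalid
 * or truncated. Overlong forms are deliberately NOT rejected here. */
static size_t sketch_decode_utf8(const unsigned char *s, size_t n,
                                 unsigned long *out)
{
    unsigned long cp;
    size_t len, i;

    if (n == 0) return 0;

    if (s[0] < 0x80)      { cp = s[0];        len = 1; }
    else if (s[0] < 0xC0) { return 0; }                /* stray continuation byte */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
    else if (s[0] < 0xF8) { cp = s[0] & 0x07; len = 4; }
    else                  { return 0; }                /* lead byte of a 5-8 byte form */

    if (n < len) return 0;                             /* truncated input */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;           /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;        /* surrogate */
    if (cp > 0x10FFFF) return 0;                       /* beyond the Unicode range */

    *out = cp;
    return len;
}
```

the range checks run after decoding, so a 4-byte sequence such as `F4 90 80 80` (which would decode to `0x110000`) is rejected by value rather than by inspecting individual bytes.
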
> [!WARNING] | ||
> do not put overlong characters (characters encoded in a larger number of bytes than needed) in your multibyte strings! | ||
> currently, unicorn does not consider them invalid, but **this will change**. | ||
everything that unicorn implements uses the same name as its counterpart in standard C, except with a `UC_` prefix. | ||
the only exception is the `wchar_t` type: unicorn uses the standard `wchar_t` as-is. | ||
|
||
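returning to the warning about overlong characters: one way to spot them is to compare the number of bytes a sequence used against the minimum needed for the decoded value. the helpers below are a hypothetical sketch (the names are invented for this example) and do not describe how unicorn will implement the check.

```c
#include <stddef.h>

/* Hypothetical helper: the smallest number of UTF-8 bytes that can
 * encode the code point cp (assumed to be at most 0x10FFFF). */
static size_t sketch_utf8_min_len(unsigned long cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* A decoded sequence is overlong when it used more bytes than needed. */
static int sketch_is_overlong(unsigned long cp, size_t bytes_used)
{
    return bytes_used > sketch_utf8_min_len(cp);
}
```

for example, the two bytes `C0 AF` decode to `U+002F` (`/`), which fits in a single byte, so `sketch_is_overlong(0x2F, 2)` is nonzero.
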
## compatibility | ||
|
||
unicorn is *almost* C89-compatible, except that it needs to know the maximum possible value of the `wchar_t` type. | ||
if your build environment does not support C99 or newer, then unless your compiler itself predefines `WCHAR_MAX`, `__WCHAR_MAX`, or `__WCHAR_MAX__`, you need to define one of them manually at compile time (make sure to give it the correct value!). | ||
|
||
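the macro lookup described above could look roughly like the following. this is a sketch, not unicorn's actual source: `SKETCH_WCHAR_MAX` and `sketch_wide_is_utf32` are names invented for the example, and the order in which the three macros are tried is an assumption.

```c
#include <stddef.h>  /* wchar_t */

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdint.h>  /* C99 and newer define WCHAR_MAX here */
#endif

/* SKETCH_WCHAR_MAX is a made-up name used only in this example. */
#if defined(WCHAR_MAX)
#define SKETCH_WCHAR_MAX WCHAR_MAX
#elif defined(__WCHAR_MAX)
#define SKETCH_WCHAR_MAX __WCHAR_MAX
#elif defined(__WCHAR_MAX__)
#define SKETCH_WCHAR_MAX __WCHAR_MAX__
#else
#error "define WCHAR_MAX on the command line to match your wchar_t"
#endif

/* Returns 1 when wchar_t can hold every code point (UTF-32 is assumed),
 * 0 otherwise (UTF-16 is assumed). */
static int sketch_wide_is_utf32(void)
{
    return SKETCH_WCHAR_MAX >= 0x10FFFF;
}
```

on a pre-C99 compiler that predefines none of these macros, the value would have to be passed in by hand, for instance with a `-D` style option if the compiler supports one.
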
## what's not implemented | ||
|
||
* the following will be implemented in a later update: | ||
* `wcstok` function. | ||
|
||
* the following do not need to be implemented, because UTF-8 is stateless: | ||
* `mbstate_t` type. | ||
* `mbsinit` function. | ||
* thread-safe (restartable) versions of encoding conversion functions, such as `mbrtowc` and `wcrtomb`. | ||
|
||
* the following are not planned to be implemented any time soon (or maybe ever): | ||
* `wctype_t` type. | ||
* character type functions (`towlower`, `towupper`, `wcscasecmp`, `wcscasecmp_l`, `wcsncasecmp`, `wcsncasecmp_l`, `wctype`, and the `isw` family, including `iswctype`). | ||
* string to number conversion functions (`wcstol`, `wcstoul`, `wcstoll`, `wcstoull`, `wcstof`, `wcstod`, and `wcstold`). | ||
* functions that interact with file streams (e.g. `fgetws`, `fputws`, `wprintf`). | ||
* `wcscoll` and `wcscoll_l` functions. | ||
* `wcsftime` function. | ||
* `wcsdup` function. | ||
* `wcwidth` and `wcswidth` functions. | ||
* `wcsxfrm` and `wcsxfrm_l` functions. | ||
|
||
## what *is* implemented | ||
|
||
> [!IMPORTANT] | ||
> you need to prepend a `UC_` prefix to the names of these functions, types, and macros! | ||
* every `wchar.h` function not mentioned above, including a few nonstandard POSIX-only functions, like `wcpcpy`. | ||
* `wint_t` type (equivalent to `signed long int`), with range macros `WINT_MIN` and `WINT_MAX`. | ||
* `WEOF` macro (evaluates to `-1`). | ||
* `MB_LEN_MAX` and `MB_CUR_MAX` macros (both evaluate to `4`, because the multibyte encoding is always UTF-8). | ||
* wide character related `stdlib.h` functions (e.g. `wcstombs`, `mbstowcs`, `mblen`). | ||
* nonstandard `mbstowc` function, which is an alternative to `mbtowc`, but takes a `wchar_t*` output buffer instead of a pointer to a single `wchar_t`, so that it can produce surrogate pairs in UTF-16. |
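
to show why a single 16-bit `wchar_t` is not enough for non-BMP characters, here is a sketch of splitting a code point into UTF-16 units. `sketch_to_utf16` is a hypothetical helper, not unicorn's `mbstowc`, and `unsigned short` stands in for a 16-bit `wchar_t` so that the example behaves the same on every platform.

```c
#include <stddef.h>

/* Hypothetical helper, not unicorn's mbstowc: writes the UTF-16 form of
 * code point cp into out (room for 2 units) and returns the number of
 * units written, or 0 if cp is a surrogate or exceeds U+10FFFF. */
static size_t sketch_to_utf16(unsigned long cp, unsigned short out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    if (cp < 0x10000) {                                /* BMP: one unit */
        out[0] = (unsigned short)cp;
        return 1;
    }

    cp -= 0x10000;                                     /* 20 bits remain */
    out[0] = (unsigned short)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

`U+1F600`, for instance, becomes the pair `0xD83D 0xDE00`, which is why the output has to be a buffer rather than a single wide character.
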