first release!

cs127 · May 17, 2024 · 04ed311
commit 04ed311
Showing 4 changed files with 1,086 additions and 0 deletions.
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1,5 @@
+{
+	"files.associations": {
+		"*.h": "c",
+	}
+}
diff --git a/README.md b/README.md
@@ -0,0 +1,62 @@
+# unicorn
+
+unicorn is a lightweight implementation of most of the standard C wide character functions, for platforms that don't support them, but still have a wide character type in the form of `wchar_t`, **as long as it is at least 16 bits (if unsigned) or 17 bits (if signed)**.
+
+> [!NOTE]
+> this is just a hobby project.
+> as much as I try to fix issues, you probably shouldn't expect it to always work properly.
+> also, the code isn't exactly the most optimized. you have my warning.
+
+unlike the standard functions which are locale-dependent, unicorn does not support locales, and always uses the same text encodings:
+
+* wide characters (`wchar_t`) are assumed to be encoded in UTF-32 if `WCHAR_MAX` is at least `0x10FFFF` (e.g. Linux), or UTF-16 otherwise (e.g. Windows).
+  * surrogates (`U+D800`-`U+DFFF`) are considered invalid in UTF-32.
+  * a new function (`mbstowc`) has been implemented as an alternative to `mbtowc` to allow converting individual non-BMP characters in UTF-16.
+* multibyte strings (used in `mbstowcs` and the like) are assumed to be encoded in UTF-8.
+  * surrogates (`U+D800`-`U+DFFF`) are considered invalid in multibyte strings.
+  * characters of length 5-8 are considered invalid, and so are 4-byte characters that exceed `U+10FFFF`.
+
+> [!WARNING]
+> do not put overlong characters (characters encoded in a larger number of bytes than needed) in your multibyte strings!
+> currently, unicorn does not consider them invalid, but **this will change**.
+
+everything that unicorn implements uses the same name as its counterpart in standard C, except with a `UC_` prefix.
+the only exception is the `wchar_t` type: unicorn uses the standard `wchar_t`.
+
+## compatibility
+
+unicorn is *almost* C89-compatible, except that it needs to know the maximum possible value of the `wchar_t` type.
+if your compilation environment does not support C99 or newer, then unless your compiler itself predefines `WCHAR_MAX`, `__WCHAR_MAX`, or `__WCHAR_MAX__`, you need to manually define one of them at compile time (make sure to give it the correct value!).
+
+## what's not implemented
+
+* the following will be implemented in a later update:
+  * `wcstok` function.
+
+* the following do not need to be implemented, because UTF-8 is stateless:
+  * `mbstate_t` type.
+  * `mbsinit` function.
+  * thread-safe versions of encoding conversion functions.
+
+* the following are not planned to be implemented any time soon (or maybe ever):
+  * `wctype_t` type.
+  * character type functions (`towlower`, `towupper`, `wcscasecmp`, `wcscasecmp_l`, `wcsncasecmp`, `wcsncasecmp_l`, `wctype`, and the `isw` family, including `iswctype`).
+  * string to number conversion functions (`wcstol`, `wcstoul`, `wcstoll`, `wcstoull`, `wcstof`, `wcstod`, and `wcstold`).
+  * functions that interact with file streams (e.g. `fgetws`, `fputws`, `wprintf`).
+  * `wcscoll` and `wcscoll_l` functions.
+  * `wcsftime` function.
+  * `wcsdup` function.
+  * `wcwidth` and `wcswidth` functions.
+  * `wcsxfrm` and `wcsxfrm_l` functions.
+
+## what *is* implemented
+
+> [!IMPORTANT]
+> you need to prepend a `UC_` prefix to the names of these functions, types, and macros!
+
+* every `wchar.h` function not mentioned above, including a few nonstandard POSIX-only functions, like `wcpcpy`.
+* `wint_t` type (equivalent to `signed long int`), with range macros `WINT_MIN` and `WINT_MAX`.
+* `WEOF` macro (evaluates to `-1`).
+* `MB_LEN_MAX` and `MB_CUR_MAX` macros (both evaluate to `4`, because the multibyte encoding is always UTF-8).
+* wide character related `stdlib.h` functions (e.g. `wcstombs`, `mbstowcs`, `mblen`).
+* nonstandard `mbstowc` function, which is an alternative to `mbtowc`, but expects a `wchar_t*` instead of `wchar_t`, to be able to read surrogate pairs in UTF-16.
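
The README's point about non-BMP characters in UTF-16 can be illustrated with a small sketch. This is not unicorn's code: `encode_utf16` is a hypothetical helper written only for illustration, and it uses `unsigned short` as a stand-in for a 16-bit `wchar_t`. It shows why a code point above `U+FFFF` needs room for two 16-bit units, and it applies the same validity rules the README lists (surrogates and values above `U+10FFFF` are rejected).

```c
/* Hypothetical helper, NOT part of unicorn.
 * Encodes one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if the code point is a surrogate or exceeds U+10FFFF. */
static int encode_utf16(unsigned long cp, unsigned short out[2])
{
    /* Reject values outside Unicode and the surrogate range. */
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;

    /* BMP code points fit in a single unit. */
    if (cp < 0x10000UL) {
        out[0] = (unsigned short)cp;
        return 1;
    }

    /* Non-BMP: split the 20-bit offset into two 10-bit halves. */
    cp -= 0x10000UL;
    out[0] = (unsigned short)(0xD800UL + (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

For example, passing `0x1F984UL` (U+1F984) fills the buffer with the pair `0xD83E`, `0xDD84` and returns 2, whereas a BMP code point such as `0x00E9UL` returns 1.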