-
-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize wordlist.h for Unishox3 #52
Comments
Hi, thank you for sharing this. I appreciate the amount of thought that went into this. If I understand your solution correctly, it seeks to save the space occupied by the pointer, which is 8 bytes on a 64-bit machine and also the null terminator. So assuming we only use a 4 byte integer array, the total savings per entry would be 8+1-4 = 5 bytes, which translates to 5*1.2m = around 6mb. Considering the size of data is 14mb, so the percentage of savings achieved is 6/(9+14)*100 = ~28%. This is of course quite significant for a 64-bit machine. I had given this a lot of thought too. I also wanted to cram the dictionary into flash memories on embedded devices which don't have fancy amount of RAM (at max 512kb). These devices don't even 14mb kind of flash memories. Most popular ones have 256kb or at max 4mb. The solution I came up initially was similar to yours, but I planned to use a block based approach where the pointer size is only 2 bytes and I can avoid the null terminator if I use an array in each block so length of string is obtained from the position of the next string. Then I can use binary search first on the blocks and then do another binary search within the block. So the savings per entry would be 8+1-2 = 7 bytes, not considering overheads for each block. The total savings would be 7*1.2m = 8.5mb roughly. However, this still means the dictionary size has to be at least (14+2.4+etc) = ~17mb, which is quite high, considering I plan to add more entries to the dictionary, if possible get the entire Wiktionary in, which is around 8 million entries. I was still happy with this and was even planning a separate lib, until I searched and came across Marisa developed by Susumu Yata, which could shrink 14mb to 3.4mb, including pointer and all. However this lib is not endian compatible and can work only on esp32 at best, so I am planning to make my own with improvements hopefully bringing size down to 3mb and also make it work on other micro controllers. If you see the present code, there is also some trouble locating the best match in the dictionary because of using binary search based design. The better approach is to use trie based design, where such feature has been already developed by the scientist in the form of "Common Prefix Search". I am also fine tuning the dictionary here: https://github.com/siara-cc/Text_frequency_research, so this is the status of this so far. It is nice to know that you have developed a better one. If you create a PR I will merge it, but otherwise I am attempting to save even more using the above plan. |
Regarding the idea of using In terms of binary search vs. trie-based design, binary search can be faster on modern architectures due to caching and spatial locality. On the other hand, a tree-based design may result in more cache misses due to the abundance of pointers. However, this may not apply to MCUs where the CPU is not as fast. |
Thank you for the PR !! I have merged it. About int16 (uint16), I meant to divide the storing of entries in blocks of say 64k bytes, each having the cumulative number of words in the header followed by an array of uint16 pointers to entries in the block followed by actual entries. Marisa is a succinct LOUDS based Patricia trie. I think the Patricia trie based approach will have lesser cache misses than the lg2(n) worst case probability of binary search, which in this case is 20. However Marisa trie is a succinct trie with no pointers so it will have to depend on lookup tables for range() and select() queries. On the contrary, a LOUDS based trie has the property of exiting early for non-existing entries as opposed to binary search it is late-exit. Dictionary lookup for Unishox would have more fail rate than hit rate. So which approach will be faster remains to be seen. I think this change will be helpful for comparing the performance! |
@eos175 FYI I integrated a Bloom filter and Marisa and got mixed results:
|
The current implementation of Unishox3 uses
wordlist.h
which is a header file containing arrays of strings that are used for encoding and decoding text. However, the current implementation has some inefficiencies that could be optimized.Firstly, the current implementation calls
strlen
to get the length of each string in the arrays. One potential solution to this is to usestd::string_view
, which gets the length of the string at compile time. However, this approach can increase memory usage since there are many strings in the array.To address this issue, the
wordlist.h
file can be flattened into a single string,wordlist.bin
, which can be compiled into the binary. This significantly reduces the amount of memory required since there are no longer any pointers to the individual strings, and the string length is already known at compile time. To know where each string is located in the flattened string, an array of indices is usedwordlist_index.h
.Additionally, this optimization can improve CPU performance since the strings are now located in a contiguous block of memory, which improves spatial locality and cache performance.
To implement this optimization, we propose the following changes:
1. Flatten the
wordlist.h
file into a single string,wordlist.bin
, which can be compiled into the binary.2. Create an array of indices,
wordlist_index
, to keep track of where each string starts and ends in the flattened string.3. Modify the
getDictWord()
function to use thewordlist_index
array to retrieve the correct string from the flattened wordlist.bin string.This optimization can also reduce the time required for compilation since the
wordlist.h
file is no longer required, and the compilation of the binary is faster.We have tested this implementation and found that it reduces both memory usage and execution time compared to the original implementation.
The text was updated successfully, but these errors were encountered: