-
-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Can Kiwix-JS deal with an online ZIM file available through HTTP? #595
Comments
Kiwix JS can load a ZIM file into memory over HTTP(S) and use it that way, which is not much good because you might as well be downloading it in that case. Sharun-s had developed a technique using XML Range Requests to access part of a ZIM file over http(s). It was slow. |
I might add that the slowest element here would be binary search of the title list and url list, as we would be opening potentially hundreds of XMLHttpRequests to the server for very small bits of data. There is a caching module (developed by peter-x) that helps in that situation, but given #581 this could still be pretty slow on large files. |
@Jaifroid I mean without downloading the whole file. Just dealing like with a local filesystem file, just downloading the cluster and unpacking them? |
@kelson42 Sorry, did not see your further query till now. Kiwix JS cannot currently download clusters over http, but sharun-s developed a prototype of that capability in his branch. He used XML range requests to read clusters from a file on localhost, but there was no reason why the exact same technique could not be used for a remote server. XML range requests ask the server only for a small part of a file, so it fits what you are asking about. The problem is not integrating sharun-s's technique (that is not particularly difficult). The problem is that it would inevitably be slow: binary search in particular is the main problem, because you would generate hundreds of requests to the server for each article. We do have block cache code that would greatly reduce the number of requests by caching dirEntries and clusters that are already requested, and that helps a lot. See #619. So, it may be feasible for smallish ZIM files, but probably excruciatingly slow for anything with a very large url pointer list that needs to be searched. What's your use case? |
PS the server would need to accept range requests: see https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests. I think most do (those that allow resumable download, or skipping through a streaming media file). |
@Jaifroid Thank you for your answer. My feelings about this approach are mitigated, I don't really believe in this for the future (but I might be wrong). But I wanted to get this documented properly for my own knowlege and for a few other reasons as well. One of them is that @ikreymer, the lead developer of https://github.com/webrecorder, has interest on this approach and has built a similar feature around the WARC format. This is a WIP, but he might give here a few pointers to a code/demo. We might work with him more closely in the near future for the Zimit project and he is interested in the ZIM format as an alternative storage format to WARC. An other one is that guys, in particular @lidel, at IPFS are interested as well in that approach, and are building a grant to hack Kiwix-JS to access ZIM files on IPFS that way. See ipfs/devgrants#49. |
Yes, it seems quite specialized, as most users would prefer to access the original resource from which the ZIM was made. It could work as a kind of replacement for Kiwix Serve in local area networks not connected to the Internet, or as a static mirror for Wikipedia where articles can change very fast or get deleted. But it will never be so fast as Kiwix Serve, which could do the same job. |
Apologies for delay! I've launched a project I've been working on, called: https://replayweb.page It's a fully browser based replay system for web archives. I've focused on the WARC format initially, as that remains a core source of web archive data, but it would be great to support ZIM format as well. The system is designed to replay archived web data in a service worker, loading everything on-demand. (Small files or where on-demand load is not possible are loaded fully). Here is an example of a 17GB web archive of a complex, interactive website, loaded entirely from static storage: This page shows how it can also be embedded within another site in an iframe: The whole archive is downloaded from a single file. To make this work, there is a small binary-searchable file that's downloaded first, and it serves as an index to other index data and offsets to WARC data. I think this could also work for ZIM, but may need to generate a similar index outside of the ZIM file, and package it up together. (The key is that binary search does not happen over http, but locally). I think it really amounts to have an efficient index that can be loaded client-side. I am working on a spec for such a ZIP based format here, which would bundle existing archive, indices and metadata, in one package, here: Happy to answer any questions about any of this! |
Fascinating @ikreymer. This looks excellent, congratulations! For this to work with ZIM files, there are some options I can think of. You are absolutely right that binary search will be the greatest challenge, and I agree that it is necessary to download and cache the index to be searched, because of its nature binary search involves large hops within an index that become progressively smaller. I think at the very least we would need to access the archive and download from it the Title Pointer List, and save it/cache it. For full Wikipedia this would be extremely large, probably unfeasible, but for smaller themed Wikipedia ZIMs it should be fine. Kiwix JS accesses the standard ZIM indices, and it wouldn't be too hard to extend the code to get the full index so long as the server supports XHR range requests. There is an alternative for newer ZIMs, which is the Xapian compressed full-text index that some ZIMs contain. Again, this would be impossibly large to download for a full Wikipedia ZIM, but doable for smaller ones I think (I haven't worked with the Xapian indices because Kiwix JS doesn't support them yet). |
I've added a quick prototype for replayweb.page to load ZIM files, packaged into the WACZ format I have been working on. Here is a small wiki from (https://download.kiwix.org/zim/wikipedia/wikipedia_en_100_2020-06.zim, 300MB) There is no full text search yet, and I didn't use the title index, but you actually can search ZIM entries by url, for example: And here is full German wikipedia. In this example, its just loading the index from DIgitalOcean, the .zim is not bundled, as I didn't want to download and re-upload 37GB file, so its loaded directly from https://download.kiwix.org/zim/wikipedia/wikipedia_de_all_maxi_2020-05.zim It's a little bit slower, especially on initial load, but I think generally usable: Since replayweb.page uses full urls, the urls are mapped to What I did was generate an index, similar to the style I have for WARCs, that allows going directly to the entry. The clustering in ZIM makes it slightly more complicated, as have to decompress from beginning of cluster, rather than from the response itself.. Not sure if there's any way to avoid that. The index adds a 162MB for this large 37GB archive, which is negligible anyway, but certainly could be reduced in size further -- so far I optimized for access only. I used the existing xzdec.js from kiwix-js and the copied the approach kiwix-js uses for decompressing the cluster, |
@ikreymer This is hugely impressive! Congratulations! I'm not sure how much of Kiwix JS you've used other than xzdec, but you should be aware of #630 which has recently been merged and squashes a Kiwix-wide bug that forced the reader to attempt to load the entire ZIM into memory on recent ZIMs which have moved the URL pointer list to the end of the file. You should also be aware of an upcoming compression format change from XZ to ZSTD. See #611 , where I have begun to make a little bit of progress towards supporting a wasm/asm version of ZSTD decompression. If it's something you can help with, any contribution would be gratefully received. I'll look into your implementation when I have a bit more time, but it looks very exciting. |
@ikreymer You say download.kiwix.org doesn't support CORS, but we had a specific issue on this and it was fixed some time ago. I just checked again with an online checker and I get the following: |
@Jaifroid @ikreymer Considering that this SW reads a ZIM file directly, it will probably be read from one of the mirrors... which might not have CORS configured properly. For this demo, I would recommend to use https://mirror.download.kiwix.org/ to secure we use a Web server we control. |
@Jaifroid Yes, I was referring to the direct download mirror links specifically. It does not appear that any of them support CORS for the ZIM downloads, including https://mirror.download.kiwix.org/ so was justing a proxy for now.. This xzdec is the only component of kiwix-js that this is using at the moment.
The By comparison, for WARCs, there is only a I wonder if the main perf difference is in the compression itself (would be interesting to have some benchmarks of decompression of gzip vs xz vs zstd, esp in JS), or the additional offset to the beginning of the response.. |
Since the new Wikipedia EN maxi ZIM just came out, wanted to make a client-side update for that as well. |
@ikreymer This is really great, it seems we are really not far from what ipfs/devgrants#49 wants. Does something stops you to use the ZIM article index in place of https://dh-preserve.sfo2.cdn.digitaloceanspaces.com/webarchives/zim/zim-wiki-en-all.wacz? It seems the is the only thing left to be able to complete ipfs/devgrants#49? |
Unfortunately, I don't think the ZIM format alone as is will work, because it would require a binary search over the entire index over HTTP, and too many seeks to get the content. As @Jaifroid points out, this won't really scale to this size. The approach I am using is 'clustering' the sorted index, similar to what ZIM is doing for data, but for the index. The index is grouped into sorted clusters, and a binary search is performed over clusters, of say ever 1000 entries. You could probably change the ZIM format to have a cluster for the index as well, but I assume that would be a fairly big change to the ZIM format? The goal would be to minimize the amount of random access seeking that is necessary to find the offset to the blob, given a URL. (With this index, there are 2 seeks only). The approach I'm using precomputes all of these lookups to minimize access time. I'm sure its not necessarily the best indexing approach, just one that I knew would work, as its been implemented before. There are probably other ways to do this also, and reduce the extra index to be even smaller. |
@ikreymer Thank you for this interesting explanation. My first though was like yours and the one of @Jaifroid. Thinking twice about it during this WE I don't believe anymore that the format is not proper for that. Let's take it from scratch, the search in the urlPtrList complexity is O(log n). This means that for 1M entries (kind of worse case scenario) you will need ~20 HTTP requests. This might sounds a lot but in fact:
I believe it could work properly only with a local indexedDb cache build dynamically. It will probably be a bit slower (in particular for the first pages) than what you have done. If we want maximal and constant speed, we would indeed have to prefill this cache, this could be done relatively easy by proposing the N first levels of the binary search tree in a form of a json in the ZIM file to populate it (but to me this is just an optimisation). |
@kelson42 We have code for a block cache which is particularly effective with binary search (see #619). It was written by peter-x, and has been running on kiwix-js-windows for a few months now. The speedup is most notable in title search which becomes almost instant. The speed of search you see demonstrated at this video time-point is down to the binary search cache. Whether it would be sufficient to make it usable over HTTP would be a matter of trial-and-error. There are parameters that can be tweaked, but the cache isn't currently permitted to grow to anything as large as ~400MB, as it is currently kept in memory rather than in indexDB. The latter is perfectly possible: although slower to access than memory, it should be "fast enough". |
@ikreymer I just re-iterate, in case you didn't see it, that the ZIM file compression is changing this summer to zstandard. I have compiled a |
To clarify, the binary search is over the initial index, which is about 1.2M here. The total size of all index clusters is ~400MB, but they would not all be loaded unless viewing thousands of articles across the entire range. The binary search over this 1.2M secondary index could be done in memory too, and probably would be faster still -- IndexedDB is used as service workers as storing global data in service workers is discouraged. The way to look at it is it adds about a ~0.4% increase in storage size (400MB for 89GB) in exchange for significantly improved access time.. And even the 0.4 could likely be made lower. In addition to the binary search for directory entries, there is still the cluster lookup, and the blob lookup within the cluster, and fetching the remainder of the cluster, which I think adds another 1-2 fetches (or can the size of the cluster known and can be prefetched? I couldn't find that) So probably you would end up with at least 4-5 lookups at the least, just a guess.. I am still skeptical that it would be performant enough, but there are many other factors. Maybe IPFS would be faster than HTTP fetch, but maybe not.. Luckily with Wikipedia, the page can be rendered right away and it is ok to wait for images to load as long as at least the text is there -- more generally, other pages may require CSS, JS files to load before the page can render, which can lead to an unreasonably long wait times before a user sees anything. But I agree, could use more testing experimentation! Re: ZStd, sure, I could take a look when I have a chance. I didn't see any ZIM files that use Zstd yet, are there some examples? Would be interesting to see how Zstd decompression compares to XZ.. Is decompression generally faster or slower compared to XZ? |
Thanks @ikreymer. We currently have a small test zim in zstd that just contains numbers, I believe, as test data. I have added it to this branch and directory: https://github.com/kiwix/kiwix-js/tree/Add-zstd-decompression-support/tests . @kelson42 Do we yet have a more complete demo ZIM file with real-world articles? E.g. Ray Charles or something small? What's the ETA on releasing the first zstd ZIMs? |
Hey everyone, I modified Kiwix-JS to access a ZIM file of the English Wikipedia (100 GB) over HTTP and wanted to report on my experience here. You can see the result here: https://siasky.net/_BmN2WhKvKAm0CzsfmHyiP81yUnYyQxh5To8JkVbj1RUxg/www/index.html It's not that fast, but I find it to be reasonably usable. In a first step I used the work done by sharun-s to simply turn all file access operations into XMLHttpRequestS. I then integrated the block-level cache developed by peter-x (from this pull request: #183) and tuned it a bit to cut down on the number of requests. This still resulted in a large number of requests, so I added yet another cache. Since a lot of binary searches are happening, I prepared a cache that includes the first few steps for all possible binary searches. In essence the cache allows the binary search to narrow down to around 100 items. To complete the search, it then needs to fetch those items and perform the rest of the binary search. The block-level cache actually helps here and will often fetch those 100 items in one go. The code can be found in this branch: https://github.com/javgh/kiwix-js/tree/skynet This is all done in a very hack-ish manner, so not ready to be merged or anything. It also needs quite a bit of memory (> 100 MB), so might not even be desirable in this form. But should hopefully be helpful as a data point on how well this can perform and be useful to others who are interested in hosting ZIM files on the web. |
@javgh This is really great, and it seems to work good for me. I would love to see a way to merge your effort cleanly in our Kiwix JS code base. |
@lidel Good to know, yes, makes it easier to serve archives from IPFS, and will be needed for the replayweb.page access as well. Nice to see this implementation! I only took a brief look so far, so apologies if this is incorrect, but it seems that this approach is not too different from what I was trying to do above as well (no longer working as the June zim file has been removed), namely that a secondary index cache needs to be computed and made available separate from the ZIM file. To compute this binary search index, it would be necessary to first read the entire ZIM file. To make this work then, the input to the system is not a ZIM file, but a ZIM file + precomputed index. I would say that this is no longer just reading from ZIM but from a new format, one that combines ZIM + a new index. This is exactly what I've been trying to do with the wacz format, which can bundle together a ZIM file + another index. I still not sure that it is not possible to simply take a ZIM file, as is, on its own and have this work, without doing an offline pre-computation and producing a new file format (ZIM + some type of extra index), as done here and in my experiments. If this is important use case, perhaps it would be possible to consider changes to the ZIM format to include such a precomputed index in the ZIM itself? |
PS @javgh I integrated peter-x 's block-level cache into Kiwix JS Windows. It really does make a difference with slow file access and/or access over network, precisely as you say because it cuts binary search right down. I experimented with this on the PWA vesrion used on Android, because the filesystem is presented to the browser engine in a very very slow way, basically rendering Kiwix JS unsuable. But with peter-x's code, it suddenly became vialbe to search a 90GB full English Wikipedia in the PWA app on Android (not that I would use it instead of the much faster Android app -- but it's a demonstration of how well that optimization works.). |
@ikreymer For a PoC having index in a separate file is perfectly fine approach. I suspect that if we create JS reader for "namespace X", we may have luck with leveraging full-text index that is already present in ZIM file. If that is not enough, we could augument ZIM format by adding additional index to benefit lazy-loading over HTTP/IPFS. Either way, the end goal should be that a Sidenote: an optimization similar to block caching mentioned by @Jaifroid could be provided by a local IPFS node for free (without any additional development): in IPFS data is chunked into blocks, and each block is cached in local repo after it is fetched for the first time. Then, all future reads of the block are local, so no need for remote range requests. |
@lidel @Jaifroid I was referring to a secondary index for the title/url ptr list, just the title/url mapping to block/blob offset.
Yes, that my point is just that to make loading over HTTP workable requires changing the ZIM format to include an extra index. This would mean changing the ZIM writer to create such an index, perhaps a binary search pre-cache, and build it into the ZIM. But without such a change to the ZIM format/zim writers, this extra index will need to be generated and side-loaded separately. For example, in that prototype, it looks like the ZIM is loaded from: https://siasky.net/fANHCNNOgRNk0uc16BMFb5KoEbWrovtG9efLbS-toIYXmQ?r=0.9875405273101383 And the compressed 'secondary index' (binary search precache?) is loaded from: https://siasky.net/AAAJfZwOeKiZ0uFQikbWoSD3eO6RmhSKWPHnKX8c2_L_ng, and is only 11MB compressed! In my experiment, the secondary index was much larger (~400MB), and it was storing every URL and a reusing a system designed for warc files), so this seems like a much better approach! But the fact remains that a secondary index of some sort in addition to the ZIM is needed unless the ZIM format is changed. |
@ikreymer You could discuss needed changes on openzim/libzim#424 where there is discussion of a new "fanout" search algorithm (would that algorithm help - if you have ideas for a change in format that would speed up access, it might be a good place to see what is possible. |
Congratulations for all this promising work! Regarding the block-level cache, there is one in the libzim C implementation. This library also implements the openZIM features that are currently missing in our Javascript code (like the full-text search in the xapian index). |
Both @ikreymer and @javgh have created workable POC to read ZIM files online in Javascript. To achieve that. they have replaced the online reading of the url/title index by a trivial solution (good for a POC but not for a viable product). The key point now, if we want to make the next step forward, is to create an online index/dirent reader which works efficiently. To me, this seems pretty doable. Changing the ZIM format or additional index will be considered for this use-case only if proven that there is not other viable path. |
@kelson42 I agree that this looks achievable. I also agree with @mossroy that we would hugely welcome help towards achieving #508 etc. (porting of libzim to JS with Emscripten). If any of the devs here could help with that, we'd be very happy! However, I don't believe the libzim port would help with this issue (reading ZIM files over HTTP) because libzim reads files, not network requests, whereas Kiwix JS just decodes binary data/compressed strings, which can be obtained from anywhere. Peter-x's block-level cache could be merged in Kiwix JS, if @mossroy agrees (it is trivial to merge and was actually finished). It resides in a separate (RequireJS) module and does not complicate the main codebase. At the time it was not judged to provide a sufficient performance increase to warrant merging. However, having used it in production in KJSW, I think that judgement was based on the then limited title search algorithm. Now that we search many many more case combinations, the block cache provides tangible benefit, and clearly would help with this issue. There are performance comments in #619. |
Mmmh, you're certainly right @Jaifroid It might be complicated to read ZIM files through HTTP with a libzim port. But this API https://emscripten.org/docs/api_reference/fetch.html might do what we need : if it were (optionally) used in libzim, it could replace the file slice reads by XHR range requests (done by the browser). In any case, you convinced me to consider #619 : it's true that it makes a real difference. If you can make a PR, we would discuss it. |
Well spotted @mossroy , yes that API supports byte ranges, which could be interesting depending on whether it directly patches C++ file reads. |
@kelson42 My conclusion, based on my experiments and looking at @javgh work, is that it is only possible by changing the ZIM format to include an extra cache to optimize XHR lookups over http. If you're not willing to change the ZIM format, it is not really feasible with ZIM alone as it is today. I think the first discussion should be about how/what type of index to add to ZIM, maybe its just what @javgh has generated or something similar. |
@kelson42 peter-x's block-cache feature is leveraged by @javgh, and what I have proposed in #658 fixes some issues with that code and incorporates it correctly into Kiwix JS's codebase. I think it is a prerequisite for a full implementation of @javgh's or @ikreymer's ideas for remote file reading. Personally, I think it could be a USP of Kiwix JS if it could read remote ZIM archives, with or without a generated external title pointer index, and I hope @javgh and/or @ikreymer will continue to work towards a solution that could be merged into Kiwix JS. |
@Jaifroid My question is if we should, not if someone will or could? If “yes” how this feature would integrate in term of UX with the current software? What is USP? |
@kelson42 It's very hard to answer directly. Whether we "should" depends on whether there is a use case and whether there is enough developer time. I personally use Kiwix JS probably like the majority of people, i.e. for offline access to ZIM archives I have downloaded. However, I can imagine a strong use case for accessing a large ZIM or library of ZIMs over an intranet in areas where access to the Internet is restricted. You already have a solution for this, but in this configuration Kiwix JS could provide an interesting non-hardware-based, client-side alternative. Another use case could be "sampling" ZIM archives before downloading one. In terms of UX and UI, @javgh's version uses the standard Kiwix JS interface, but dedicates the app to one remote ZIM for now. We would need a way to discover other remote ZIMs, but that wouldn't be hard. @ikreymer's approach is a very different, dedicated system that only uses the Kiwix JS back end. Currently in both cases there is a requirement for an external index, so some specialist approaches are required that only @javgh and @ikreymer could drive forward. USP (Unique Selling Point) is jargon, sorry - I just meant "unique feature of this client". |
@Jaifroid @mossroy I think been able to read online ZIM files directly via HTTP might open new interesting perspectives. At least the one allowing to read ZIM files directly from IPFS. It is a bit adventurous, but might be worth the effort. For the project, this is not a priority, but I feel ready to support the idea. I believe as well that this feature could be integrated relatively easily to the current Kiwix JS by allowing to choose an URL and then have something like a "Online Preview" checkbox. I would recommend to close this ticket as we have the answer to the original question and open a new ticket to request this new feature (I'm voluntering to phrase it). |
@kelson42 I agree! |
I wanted to add some more technical details regarding my POC, as it might be helpful for future design decisions. It's indeed similar to what @ikreymer did before in that it's a ZIM file + precomputed index. The Wikipedia ZIM file has around 20 million dirEntries (around 6 million actual articles). Just all the pointers for the dirEntries are already 160 MB and then a dirEntry itself is around 50 bytes in the file, so all dirEntries add up to around 1 GB. Binary searching through this will take around log_2(20 million) ~= 24 steps. My precomputed index contains the results for the first 18 steps for all possible searches. These are 2 ^ 18 = 262144 items and I'm doing it both for "search by url" and "search by title", so it's twice that. For the index I reused the
It's around 50 MB in size. I then compressed it with xz to 11 MB, as it was easy to reuse Kiwix's xz decompressor. I tuned the block-level cache for a fairly large block size, as I wanted to ensure that the last 6 or 7 steps of any binary search would trigger a single read by the block cache and then be able to complete with just that. So my settings there are: 4096 cache slots with a size of 16384 each (so around 67 MB in total). Perhaps the block size would not need to be quite as large. The stats for a first search look something like this:
Regarding Xapian: I hadn't looked at it before, but I just now checked it for the Wikipedia file. It has a "Xapian Fulltext Index" and "Xapian Title Index". The title index alone is already 2.5 GB in size, so I'm not sure if it would be helpful as an alternative to such a custom precomputed index. I found this paragraph on https://xapian.org/docs/scalability.html :
I wonder if it would be possible to have an Emscripten-compiled version of Xapian that does HTTP range requests into the Xapian database and how fast that would be. |
Pinging @cptolemy who is interested in the issues behind this thread. |
@Jaifroid why reopening this issue? |
@kelson42 Only because @cptolemy wrote to me on another Repo about wanting to develop a very similar-sounding solution to those already extensively discussed above (and partially implemented) by lidl and ikreymer and javgh. I promised to ping cptolemy in this issue, and thought it could be re-opened in case he has anything to add. I'll close again in a few days if there is no further discussion or query. |
OK. #659 is open and states clearly what should be done IMO, to the contrary to this ticket which is a question which as been answered already. Closed tickets can still welcome comments. |
I know we have talked about that in the past... but it the outcome of the discussion is unclear to me.
The text was updated successfully, but these errors were encountered: