Update en.wikipedia-on-ipfs.org #61

Closed
5 of 9 tasks
lidel opened this issue Sep 9, 2019 · 29 comments · Fixed by #92
Labels
Epic help wanted language language-specific issues P2 Medium: Good to have, but can wait until someone steps up status/blocked Unable to be worked further until needs are met

Comments

@lidel
Member

lidel commented Sep 9, 2019

This could be done manually or as a part of #58

@lidel lidel added the language language-specific issues label Sep 9, 2019
@lidel lidel added the P2 Medium: Good to have, but can wait until someone steps up label Oct 28, 2019
@RubenKelevra

I would love to be part of the collaborative cluster. Is there any ETA for that?

@lidel
Member Author

lidel commented Jan 20, 2020

We should have the collaborative cluster set up as soon as the TR version is updated (track progress in #60 and #67). The TR version is much smaller, so we use it as the test case.

Due to other priorities and the size of the EN version, I can't commit to any specific date, but it should happen sometime in Q1 or Q2.

@RubenKelevra

Cool news! Thanks for the work!

Maybe set up the cluster now, so people can join it in the meantime while you work on an updated version of the data? :)

@lidel
Member Author

lidel commented Jan 21, 2020

Let's track that in #68

@lidel lidel added the Epic label Mar 9, 2020
@ShadowJonathan

Is it possible for this recreation job to be done on a monthly basis?

Every month a cron job would run the scraping/downloading step, push the result to IPFS, provide a new pin for the collaborative clusters, and update the DNSLink to match the new version.

This also needs #71: if it's possible to de-duplicate unedited wiki pages between months, the new pin can stay up to date while requiring minimal extra data to be pulled.
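
To get a rough feel for how much data a new monthly snapshot would add, two unpacked snapshots can be compared by content hash. This is only a sketch under loud assumptions: the helper names `file_digests` and `snapshot_delta` are hypothetical, and whole-file SHA-256 is a stand-in for illustration. IPFS de-duplicates at the block level, so real savings depend on the chunker and DAG layout used when adding the files.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def snapshot_delta(old_root, new_root):
    """Estimate how many bytes of the new snapshot already exist in the old one.

    Content is matched by hash regardless of path, which mirrors the idea of
    content addressing (but NOT the exact block-level behaviour of IPFS).
    """
    old_hashes = set(file_digests(old_root).values())
    reusable_bytes = 0
    new_bytes = 0
    new_files = []
    for rel, digest in file_digests(new_root).items():
        size = (Path(new_root) / rel).stat().st_size
        if digest in old_hashes:
            reusable_bytes += size
        else:
            new_bytes += size
            new_files.append(rel)
    return {
        "reusable_bytes": reusable_bytes,
        "new_bytes": new_bytes,
        "new_files": new_files,
    }
```

Reading every file in full is slow at the scale of English Wikipedia, so this is better suited to a smaller language version or a sample of articles.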

@lidel
Member Author

lidel commented May 26, 2020

Yes, ideally. In the short term, we would update every time a new snapshot is published by Kiwix.

Unfortunately, the English version is blocked on the scraping/downloading step:

The upstream Kiwix project is having trouble generating a .zim archive that includes images; progress can be tracked in openzim/mwoffliner#1043

@lidel lidel added the status/blocked Unable to be worked further until needs are met label May 26, 2020
@ShadowJonathan

Alright, I'll be keeping an eye on that issue as well 👍

@lidel
Member Author

lidel commented Jul 3, 2020

https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2020-06.zim.torrent just landed (88GB)

If anyone has the bandwidth and storage to download it, make a test build using the instructions from the README, and report any issues in a comment here, that would help a lot.

(I may do it eventually, but I am super thin on free time, so no ETA.)

@ShadowJonathan

@lidel openzim/mwoffliner#1043 seems to be fixed. Does that unblock this issue?

@kelson42

kelson42 commented Jan 9, 2021

@lidel Same remark here, what is still missing?

@lidel
Member Author

lidel commented Jan 25, 2021

I dropped the ball here :(
The main factor was the complexity of the current solution vs. the lack of time on my end, unfortunately.

I suspect that by now we need to switch the scripts from the third-party ZIM decompressor extract_zim to the official zimdump from zim-tools, which supports the new compression scheme (see #66).

@kelson42 if you feel updating the old Turkish and English snapshots is worth the effort, we could try to do a manual one-off update. I'll try to allocate some time this week and see if it's feasible with updated tooling.

But that is not sustainable long-term; for that we need to use ZIMs directly (#42).

@lidel
Member Author

lidel commented Feb 15, 2021

Now that #77 has landed, I will attempt to build English in the next two weeks and see how it goes.

@lidel lidel pinned this issue Feb 15, 2021
@lidel
Member Author

lidel commented Feb 16, 2021

./mirrorzim.sh --languagecode=en --wikitype=wikipedia --hostingdnsdomain=en.wikipedia-on-ipfs.org failed after ~21h during redirect fixup stage with:

[..]
sed: couldn't open temporary file tmp/wikipedia_en_all_maxi_2020-12/wiki/sedNMHlNz: No space left on device
sed: couldn't open temporary file tmp/wikipedia_en_all_maxi_2020-12/wiki/sedhVhHaV: No space left on device
sed: couldn't open temporary file tmp/wikipedia_en_all_maxi_2020-12/wiki/sedWypHuG: No space left on device
sed: couldn't open temporary file tmp/wikipedia_en_all_maxi_2020-12/wiki/seda2aFbP: No space left on device
[..]

I suspected the filesystem was running out of inodes, but that's not the case 🤔

Filesystem        Inodes    IUsed     IFree IUse%
/dev/sda1      366215168 23700775 342514393    7%
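
"No space left on device" can be raised when either data blocks or inodes run out, so it helps to look at both numbers side by side. A minimal Python sketch using `os.statvfs`; the helper name `fs_usage` is hypothetical and not part of this repo's scripts.

```python
import os


def fs_usage(path="."):
    """Return block and inode usage for the filesystem that contains `path`."""
    st = os.statvfs(path)

    def pct_used(total, free):
        # Some filesystems report 0 total inodes; avoid dividing by zero.
        return None if total == 0 else round(100 * (total - free) / total, 1)

    return {
        "bytes_total": st.f_blocks * st.f_frsize,
        # f_bavail counts blocks available to unprivileged users.
        "bytes_free": st.f_bavail * st.f_frsize,
        "bytes_used_pct": pct_used(st.f_blocks, st.f_bavail),
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
        "inodes_used_pct": pct_used(st.f_files, st.f_ffree),
    }


if __name__ == "__main__":
    for key, value in fs_usage(".").items():
        print(f"{key}: {value}")
```

If inode usage is low, as in the table above, the byte figures are the next thing to check while the build is running.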

Most likely sed was generating some IO overhead, so to remove this brittle breaking point I've tweaked idempotency a bit and disabled debug logs for the redirect fix in 84a70b9

Restarted the build. 🤞

I also ordered a 1TB SSD, so if this fails again, I might be able to retry on a faster setup.

@lidel
Member Author

lidel commented Feb 18, 2021

Ok, it failed again after ~45h with the same error from sed 😿
The good news is that my 1TB PCI-e SSD arrived. I'll try to tweak fs parameters on it, analyze how we use sed, and resume.

@lidel
Member Author

lidel commented Feb 18, 2021

The sed issue seems to be gone when running on the SSD.
I am now in the process of running ipfs add, and it seems to be CPU-bound; IO is no longer an issue.
Will post updates when it's done.

@lidel lidel mentioned this issue Feb 22, 2021
5 tasks
@lidel
Member Author

lidel commented Feb 22, 2021

The Badger(?) datastore seems to have a problem with many small files; I filed #85 to investigate specifics.

Next:

  • Will look into flatfs, maybe it is enough for this specific build. (or maybe it is not badger at all)
  • If not, will have to debug, get some memory dumps / experiment with badger settings.

@lidel
Member Author

lidel commented Feb 23, 2021

Good news! I ended up switching to the flatfs datastore with sync: false, and that (with my new SSD) finished without a hiccup:

Quick notes:

  • This is a lot of files, so publishing them on the DHT takes time.
    • The first article of each category should work – I precached them on another node
    • If you are unable to load deeper articles via the public gateway, run your own node and connect to mine via ipfs swarm connect /p2p/12D3KooWFRcqpEhdCfAY6HYcrJaQjDBbPNp2ycxDY54LZNx6UgLM
  • We might be able to get the original /wiki/Main_Page with the fix from Add Myanmar Wikipedia #83 or when we fix Handle _exceptions/ directory #80, but for now the Kiwix one should do just fine.
  • 💔 Articles like /wiki/Africa, /wiki/Dance, /wiki/Design, /wiki/Music, /wiki/Poetry, /wiki/Asia seem to be broken due to Handle _exceptions/ directory #80, which is so far the only blocker I see for shipping this

@kelson42 any suggestions here for 💔?
Would your work related to "removing namespaces" help solve _exceptions, or is it unrelated?

I want to understand if it makes sense to add code for fixing exceptions as a post-unpack step, as it will be a very expensive process given the size of the English wiki, or if I should file an issue in https://github.com/openzim/zim-tools/issues for this.

FYSA the way we would like to handle _exceptions like this is to move wiki/Asia (article) to wiki/Asia/index.html (subdir) so it gets rendered correctly on the IPFS gateway (it takes care of the trailing /). This way the folder does not collide with the article, and you can have wiki/Asia/Tokyo (file) or even wiki/Africa/Tokyo/index.html (subdir)
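
The article-to-index.html move described above could look roughly like this. It is a sketch, not the fix that shipped: `move_article_to_index` is a hypothetical helper, and the caller has to know where the colliding article file actually lives, because the naming of files inside `_exceptions/` is not assumed here.

```python
import shutil
from pathlib import Path


def move_article_to_index(article_file, article_dir):
    """Move `article_file` to `article_dir/index.html` and return the new path.

    `article_dir` is the path the article should be served from, for example
    wiki/Asia. It may already be a directory (because wiki/Asia/Tokyo exists),
    or it may be the article file itself, in which case the file is stepped
    aside first so a directory can take over its name.
    """
    article_file, article_dir = Path(article_file), Path(article_dir)
    if not article_file.is_file():
        raise FileNotFoundError(article_file)
    if article_file == article_dir:
        aside = article_file.with_name(article_file.name + ".article-tmp")
        article_file.rename(aside)
        article_file = aside
    article_dir.mkdir(parents=True, exist_ok=True)
    index = article_dir / "index.html"
    if index.exists():
        raise FileExistsError(index)
    shutil.move(str(article_file), str(index))
    return index
```

Links that point at wiki/Asia keep working because the gateway serves index.html for the directory, while wiki/Asia/Tokyo stays a regular file next to it.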

@kelson42

@lidel I have no clue why the article "Africa dance" is an exception, so there is no way to say if this is legit or not.

@MatthewSteeples

/ipfs/bafybeicarbywfeinwuwxcivurnle2mzwue42xr3c3dutrf2mngfun6pdum/wiki/Operating_system (linked to from /ipfs/bafybeicarbywfeinwuwxcivurnle2mzwue42xr3c3dutrf2mngfun6pdum/wiki/Software) does something weird too (might be the same issue as Africa Dance).

It shows the standard "folder" view with a single file in it called kernel, which just consists of a meta refresh tag that takes you to Kernel_(operating_system), which doesn't exist
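
Redirect stubs like this one can be found in bulk by scanning the unpacked tree for meta refresh tags whose target does not exist on disk. A rough sketch, assuming the stubs put http-equiv before content inside the tag and use relative targets; the exact markup produced by the unpacking step is not confirmed here, and both function names are hypothetical.

```python
import re
from pathlib import Path
from urllib.parse import unquote

# Assumed stub shape: <meta http-equiv="refresh" content="0;url=Target">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*content=["\'][^"\']*?url=([^"\'>]+)',
    re.IGNORECASE,
)


def redirect_target(html):
    """Return the decoded meta refresh target found in `html`, or None."""
    match = META_REFRESH.search(html)
    if not match:
        return None
    target = unquote(match.group(1).strip()).split("#", 1)[0]
    return target or None


def dangling_redirects(root):
    """List (file, target) pairs whose relative redirect target is missing."""
    root = Path(root)
    broken = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            head = fh.read(4096).decode("utf-8", errors="replace")
        target = redirect_target(head)
        if target is None or "://" in target or target.startswith("/"):
            continue  # not a redirect stub, or not a relative target
        # Relative targets resolve against the directory holding the stub.
        if not (path.parent / target).exists():
            broken.append((path.relative_to(root).as_posix(), target))
    return broken
```

Only the first 4 KiB of each file is read, which keeps the scan cheap on a tree with millions of small files.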

@ShadowJonathan

Maybe _/ and uppercase-lowercase issues? Wikipedia corrects these automatically, but maybe this file doesn't?

@lidel
Member Author

lidel commented Feb 23, 2021

I am sorry, I was a bit vague in the problem description. What I meant are issues with pages that conflict with articles that have / in their name, for example:

@kelson42

@lidel I think all of this is a problem we identified 10 days ago around the management of articles whose names include a /. To me, this has to be handled properly here. There is not much more that can be done on the zimdump side IMO; otherwise, please let me know.

@lidel
Member Author

lidel commented Feb 23, 2021

@kelson42 I've filed openzim/zim-tools#226 with some ideas, let's discuss there if it is feasible on your end.

@lidel
Member Author

lidel commented Mar 3, 2021

Quick update on updating English to 2021-02: I retried with the fix for openzim/zim-tools#227 and the unpacking step went ok. 👍
Unfortunately flatfs seems to take more space than badgerds, and I've run out of space on the 1TB drive 🤦‍♂️
I'll see if I can work around this somehow. If not, I can always try to tweak badgerds settings to avoid the issues described in #85, but that is plan B.

@lidel
Member Author

lidel commented Mar 5, 2021

I cheated a bit and ran ipfs add --no-copy with filestore enabled overnight, just to see how this alternative backend performs, in the hope it would "just work" – and it did! Sadly I did not log times, but it seems to be faster than (flatfs with sync) and somewhere between (flatfs without sync) and (badger). It certainly does not utilize the full potential of the SSD, but it does the job as a slow workaround.

Anyway, generated a version with changes from #88:

Give it a try:

https://bafybeiehlicfvvqhauuxyj7ghspu63uv7vlq224aqytcfv5frewve5jxoq.ipfs.dweb.link/wiki/
This is not pinned anywhere yet. The first link of each category should work, but if you are unable to fetch deeper pages, it means this build is no longer available; check for a more up-to-date link in the comments below.

@lidel
Member Author

lidel commented Mar 7, 2021

FYI I'm taking /p2p/12D3KooWFRcqpEhdCfAY6HYcrJaQjDBbPNp2ycxDY54LZNx6UgLM offline for now to re-generate with fix from #89

@lidel
Member Author

lidel commented Mar 9, 2021

#89 is fixed.

Give a new build a try:

👉 https://bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze.ipfs.dweb.link/wiki/
This is not pinned anywhere yet. The first link of each category should work, but if you are unable to fetch deeper pages, manually connect to the provider via: ipfs swarm connect /p2p/12D3KooWFRcqpEhdCfAY6HYcrJaQjDBbPNp2ycxDY54LZNx6UgLM

Lmk if you find any broken articles or unexpected behaviors. 🙏
If nothing too severe is found, I'd like to ship this as-is, because it is still way better than the current 4-year-old version at en.wikipedia-on-ipfs.org

6 participants
@lidel @MatthewSteeples @RubenKelevra @kelson42 @ShadowJonathan and others